Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
# This code came from (included each time in the code chunk)
Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
.csv
file into your data
folder.You may use another dataset or your own data, but please make sure it is de-identified.
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Given your question, what is your expectation about the data?
My research question: In Oregon is there a significant difference between MMR vaccination rates in schools between urban and rural counties?
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## index = col_double(),
## state = col_character(),
## year = col_character(),
## name = col_character(),
## type = col_character(),
## city = col_character(),
## county = col_character(),
## district = col_logical(),
## enroll = col_double(),
## mmr = col_double(),
## overall = col_double(),
## xrel = col_logical(),
## xmed = col_double(),
## xper = col_double(),
## lat = col_double(),
## lng = col_double()
## )
wsj <- read_csv("https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## index = col_double(),
## state = col_character(),
## year = col_character(),
## name = col_character(),
## type = col_character(),
## city = col_character(),
## county = col_character(),
## district = col_logical(),
## enroll = col_double(),
## mmr = col_double(),
## overall = col_double(),
## xrel = col_logical(),
## xmed = col_double(),
## xper = col_double()
## )
## Warning: 19449 parsing failures.
## row col expected actual file
## 9891 district 1/0/T/F/TRUE/FALSE Harrison 2 'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9892 district 1/0/T/F/TRUE/FALSE Denver County 1 'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9893 district 1/0/T/F/TRUE/FALSE Academy 20 'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9893 xrel 1/0/T/F/TRUE/FALSE 1.52 'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9894 district 1/0/T/F/TRUE/FALSE Colorado Springs 11 'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## .... ........ .................. ................... .................................................................................
## See problems(...) for more details.
glimpse(wsj)
## Rows: 46,411
## Columns: 14
## $ index <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ state <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ariz…
## $ year <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del S…
## $ type <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Publi…
## $ city <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoeni…
## $ county <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Maric…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 35, 54, 54, 34, 57, …
## $ mmr <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ overall <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ xrel <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, 2.86, NA, 7.41, NA…
## $ xper <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, …
skim(wsj)
Name | wsj |
Number of rows | 46411 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 6 |
logical | 2 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
state | 0 | 1.00 | 4 | 14 | 0 | 32 | 0 |
year | 0 | 1.00 | 4 | 7 | 0 | 4 | 0 |
name | 0 | 1.00 | 1 | 91 | 0 | 36129 | 0 |
type | 27174 | 0.41 | 5 | 12 | 0 | 6 | 0 |
city | 17339 | 0.63 | 2 | 43 | 0 | 5665 | 0 |
county | 5157 | 0.89 | 3 | 21 | 0 | 1158 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
district | 46411 | 0 | NaN | : |
xrel | 46317 | 0 | 1 | TRU: 94 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1.00 | 1457.45 | 1686.44 | 1.00 | 371.0 | 881.00 | 1735.00 | 8066.00 | ▇▂▁▁▁ |
enroll | 12844 | 0.72 | 120.54 | 161.40 | 0.00 | 39.0 | 73.00 | 116.00 | 6222.00 | ▇▁▁▁▁ |
mmr | 0 | 1.00 | 57.18 | 47.05 | -1.00 | -1.0 | 91.67 | 98.00 | 100.00 | ▆▁▁▁▇ |
overall | 0 | 1.00 | 51.57 | 46.79 | -1.00 | -1.0 | 83.33 | 95.61 | 100.00 | ▇▁▁▁▇ |
xmed | 33439 | 0.28 | 3.10 | 4.32 | 0.04 | 1.0 | 2.00 | 4.00 | 100.00 | ▇▁▁▁▁ |
xper | 40000 | 0.14 | 6.93 | 8.16 | 0.17 | 2.7 | 5.00 | 7.76 | 169.23 | ▇▁▁▁▁ |
glimpse(measles)
## Rows: 66,113
## Columns: 16
## $ index <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 1…
## $ state <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ariz…
## $ year <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del S…
## $ type <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Publi…
## $ city <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoeni…
## $ county <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Maric…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 78, 35, 54, 54, 34, …
## $ mmr <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ overall <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ xrel <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, NA, 2.86, NA, 7.41…
## $ xper <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, …
## $ lat <dbl> 31.34782, 32.22192, 32.13049, 33.48545, 33.49562, 33.43532, …
## $ lng <dbl> -110.9380, -110.8961, -111.1170, -112.1306, -112.2247, -112.…
skim(measles)
Name | measles |
Number of rows | 66113 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 6 |
logical | 2 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
state | 0 | 1.00 | 4 | 14 | 0 | 32 | 0 |
year | 0 | 1.00 | 4 | 7 | 0 | 4 | 0 |
name | 0 | 1.00 | 1 | 91 | 0 | 36129 | 0 |
type | 36622 | 0.45 | 5 | 12 | 0 | 6 | 0 |
city | 20071 | 0.70 | 2 | 43 | 0 | 5665 | 0 |
county | 6254 | 0.91 | 3 | 21 | 0 | 1158 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
district | 66113 | 0 | NaN | : |
xrel | 66004 | 0 | 1 | TRU: 109 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1.00 | 1607.69 | 1706.77 | 1.00 | 429.00 | 997.00 | 2133.00 | 8066.00 | ▇▂▁▁▁ |
enroll | 16260 | 0.75 | 131.93 | 162.16 | 0.00 | 46.00 | 80.00 | 129.00 | 6222.00 | ▇▁▁▁▁ |
mmr | 0 | 1.00 | 63.17 | 45.72 | -1.00 | -1.00 | 95.00 | 98.00 | 100.00 | ▅▁▁▁▇ |
overall | 0 | 1.00 | 54.09 | 46.67 | -1.00 | -1.00 | 87.00 | 96.10 | 100.00 | ▆▁▁▁▇ |
xmed | 45122 | 0.32 | 2.91 | 3.85 | 0.04 | 1.00 | 2.00 | 3.53 | 100.00 | ▇▁▁▁▁ |
xper | 57560 | 0.13 | 6.78 | 7.62 | 0.17 | 2.84 | 5.00 | 7.55 | 169.23 | ▇▁▁▁▁ |
lat | 1549 | 0.98 | 39.15 | 4.58 | 24.55 | 35.69 | 40.21 | 42.18 | 49.00 | ▁▃▅▇▂ |
lng | 1549 | 0.98 | -96.28 | 18.44 | -124.50 | -117.63 | -89.97 | -81.75 | 80.21 | ▇▃▁▁▁ |
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
The data came as two datasets that they want you to join. The measles dataset and the wsj dataset. The datasets are pretty similar for Oregon but it appears that they vary more for the national datasets. Generally the essential data are very complete for some columns, but with many missing values in columns I dont care as much about. Will deal with below.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
Subset of the data and think of a question to answer the subset
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
ORmeasles<-measles%>% filter(state == "Oregon")
ORWSJ<-wsj %>% filter(state=="Oregon")
#I'm using skim to look at my datasets to identify how to combine them.
skim(ORmeasles)
Name | ORmeasles |
Number of rows | 817 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 6 |
logical | 2 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
state | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
year | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
name | 0 | 1 | 4 | 61 | 0 | 698 | 0 |
type | 817 | 0 | NA | NA | 0 | 0 | 0 |
city | 817 | 0 | NA | NA | 0 | 0 | 0 |
county | 0 | 1 | 4 | 10 | 0 | 36 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
district | 817 | 0 | NaN | : |
xrel | 817 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1.00 | 361.22 | 209.47 | 1.00 | 180.00 | 359.00 | 540.00 | 733.00 | ▇▇▇▇▇ |
enroll | 0 | 1.00 | 57.11 | 33.18 | 1.00 | 36.00 | 58.00 | 76.00 | 379.00 | ▇▂▁▁▁ |
mmr | 0 | 1.00 | 92.80 | 12.55 | -1.00 | 93.49 | 95.58 | 97.27 | 100.00 | ▁▁▁▁▇ |
overall | 0 | 1.00 | 90.26 | 12.98 | -1.00 | 90.59 | 93.44 | 95.44 | 100.00 | ▁▁▁▁▇ |
xmed | 609 | 0.25 | 0.46 | 0.60 | 0.11 | 0.22 | 0.31 | 0.51 | 7.14 | ▇▁▁▁▁ |
xper | 39 | 0.95 | 3.82 | 4.57 | 0.21 | 1.44 | 2.50 | 4.10 | 37.95 | ▇▁▁▁▁ |
lat | 0 | 1.00 | 44.60 | 1.09 | 42.01 | 44.05 | 44.96 | 45.48 | 46.19 | ▂▁▃▆▇ |
lng | 0 | 1.00 | -122.39 | 1.43 | -124.50 | -123.07 | -122.82 | -122.49 | -116.94 | ▅▇▁▁▁ |
skim(ORWSJ)
Name | ORWSJ |
Number of rows | 733 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 6 |
logical | 2 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
state | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
year | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
name | 0 | 1 | 4 | 61 | 0 | 698 | 0 |
type | 733 | 0 | NA | NA | 0 | 0 | 0 |
city | 733 | 0 | NA | NA | 0 | 0 | 0 |
county | 0 | 1 | 4 | 10 | 0 | 36 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
district | 733 | 0 | NaN | : |
xrel | 733 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1.00 | 367.00 | 211.74 | 1.00 | 184.00 | 367.00 | 550.00 | 733.00 | ▇▇▇▇▇ |
enroll | 0 | 1.00 | 56.24 | 33.70 | 1.00 | 31.00 | 57.00 | 76.00 | 379.00 | ▇▂▁▁▁ |
mmr | 0 | 1.00 | 92.62 | 12.69 | -1.00 | 93.33 | 95.49 | 97.24 | 100.00 | ▁▁▁▁▇ |
overall | 0 | 1.00 | 90.02 | 13.17 | -1.00 | 90.32 | 93.33 | 95.37 | 100.00 | ▁▁▁▁▇ |
xmed | 541 | 0.26 | 0.47 | 0.62 | 0.11 | 0.22 | 0.31 | 0.52 | 7.14 | ▇▁▁▁▁ |
xper | 37 | 0.95 | 3.92 | 4.71 | 0.21 | 1.41 | 2.51 | 4.22 | 37.95 | ▇▁▁▁▁ |
#I see that the ORmeasles dataset has more observations and and more columns that the ORWSJ dataset. The additional columns are latitude and longitude which I'm not using, so that doesn't really play into my decision. The years are the same for both datasets in oregon (there is variation in years in the larger not subseted data set), but there are more observations in the ORmeasles set so I'm going to use that dataset as the left dataset and leftjoin.
#I've decided that there are a lot of variables that are empty or that I'm not interested in, I'm going to further subset the data to exclude the data I dont care about to make the join less messy. I got this method from: https://www.statmethods.net/management/subset.html
getrid1<-names(ORmeasles) %in% c("type", "city", "district", "enroll", "xrel", "xmed", "xper", "lat", "lng")
getrid2<-names(ORWSJ) %in% c("type", "city", "district", "enroll", "xrel", "xmed", "xper")
clean_OR_measles<-ORmeasles[!getrid1]
clean_ORWSJ<-ORWSJ[!getrid2]
OR_MMR<-left_join(clean_OR_measles,clean_ORWSJ, by = c("index","state","year" ,"name","county","mmr","overall" ))
#In order to compare urban versus rural I need to assign counties the designation of urban or rural. In looking at doing so I landed on using the 2013 US office of budget and management definitions of Metropolitan, micropolitan, and non core, rather than a binary of rural and urban, which is a little tricker in Oregon. https://oregonexplorer.info/content/what-rural?topic=173&ptopic=140 See image below.
knitr::include_graphics("images/ORmap.png")
#I factored the counties into three different levels to classify them as "metropolitan", "micropolitan" or "noncore". Code sourced from: https://forcats.tidyverse.org/reference/fct_collapse.html
urbanstatus <- fct_collapse(OR_MMR$county,
metro = c("Multnomah", "Clackamas","Washington", "Yamhill", "Polk", "Columbia", "Benton", "Lane", "Linn", "Deschutes", "Jackson", "Josephine", "Marion"),
micro = c("Hood River", "Crook", "Umatilla", "Union", "Malheur", "Klamath", "Douglas", "Wasco", "Curry", "Coos", "Lincoln", "Clatsop", "Morrow"),
noncore = c("Wallowa", "Baker", "Grant", "Wheeler", "Gilliam", "Sherman", "Jefferson", "Tillamook", "Harney", "Lake")
)
levels(urbanstatus)
## [1] "noncore" "metro" "micro"
#checked my levels to see that they were not in a logical order. Reordered them to reflect a logical rural to urban gradient.
urbanstatus <-factor(urbanstatus, c("noncore", "micro","metro"))
levels(urbanstatus)
## [1] "noncore" "micro" "metro"
#used the cbind function to add the new categories to my OR_MMR dataframe for data analysis
OR_MMR<-cbind(OR_MMR, urbanstatus)
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
head(OR_MMR)
## index state year name county mmr overall
## 1 1 Oregon 2018-19 Big Muddy Elementary Wasco 100 88.23529
## 2 2 Oregon 2018-19 Cairo Elementary Malheur 100 100.00000
## 3 3 Oregon 2018-19 Jasper Mountain Center Lane 100 100.00000
## 4 4 Oregon 2018-19 Jordan Valley Elementary Malheur 100 100.00000
## 5 5 Oregon 2018-19 Lifeways Day Treatment Program Umatilla 100 100.00000
## 6 6 Oregon 2018-19 Morrow Education Center Morrow 100 96.15385
## urbanstatus
## 1 micro
## 2 micro
## 3 metro
## 4 micro
## 5 micro
## 6 micro
Are the values what you expected for the variables? Why or Why not?
Yes, everything looks as I would expected. Only the percentages of vaccine coverage for MMR and overall vaccinations are numeric and the rest are character
Use
group_by()/summarize()
to make a summary of the data here. The summary should be relevant to your research question
OR_MMR%>% group_by(urbanstatus)%>%
summarize(mean = mean(mmr))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## urbanstatus mean
## <fct> <dbl>
## 1 noncore 81.6
## 2 micro 92.9
## 3 metro 93.6
OR_MMR%>% group_by(urbanstatus)%>%
summarize(median = median(mmr))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## urbanstatus median
## <fct> <dbl>
## 1 noncore 94.5
## 2 micro 95.9
## 3 metro 95.6
OR_MMR%>% group_by(urbanstatus)%>%
summarize(mean = mean(overall))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## urbanstatus mean
## <fct> <dbl>
## 1 noncore 78.5
## 2 micro 90.7
## 3 metro 91.1
OR_MMR%>% group_by(urbanstatus)%>%
summarize(median = median(overall))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## urbanstatus median
## <fct> <dbl>
## 1 noncore 91.7
## 2 micro 93.8
## 3 metro 93.4
What are your findings about the summary? Are they what you expected?
The findings demonstrate that there is a positive relationship between level of urbanization and vaccine coverage for both MMR and overall vaccination rates. At first I just ran the mean and the relationship looked quite strong but when I looked at my boxplots below and saw the number of outliers I realized that my mean was probably skewed and decided to compare with the median as well. When looking at the median the immunization rates are fairly similar between urban and rural. I suspected that there would be a difference but I didn’t really think about how it would be fairly likely to have such a high proportion of major outliers. It would have been very interesting to have the data on public/charter/private. Unfortunately these data were not available for the Oregon data. I’m imagining that many of the outliers might be associated with religious/waldorf private or charter schools
Make at least two plots that help you answer your question on the transformed or summarized data.
ggplot(OR_MMR) + aes(x=urbanstatus, y = mmr, color = urbanstatus) + geom_boxplot()+
labs( x = "Urban status", y= "Percent MMR vaccine coverage",
title = "MMR Vaccine Coverage by Urban/Rural Status"
)