Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

# This code came from (included each time in the code chunk)

Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

  1. Please schedule a time with Eric or Me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Given your question, what is your expectation about the data?

My research question: In Oregon is there a significant difference between MMR vaccination rates in schools between urban and rural counties?

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   index = col_double(),
##   state = col_character(),
##   year = col_character(),
##   name = col_character(),
##   type = col_character(),
##   city = col_character(),
##   county = col_character(),
##   district = col_logical(),
##   enroll = col_double(),
##   mmr = col_double(),
##   overall = col_double(),
##   xrel = col_logical(),
##   xmed = col_double(),
##   xper = col_double(),
##   lat = col_double(),
##   lng = col_double()
## )
wsj <- read_csv("https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   index = col_double(),
##   state = col_character(),
##   year = col_character(),
##   name = col_character(),
##   type = col_character(),
##   city = col_character(),
##   county = col_character(),
##   district = col_logical(),
##   enroll = col_double(),
##   mmr = col_double(),
##   overall = col_double(),
##   xrel = col_logical(),
##   xmed = col_double(),
##   xper = col_double()
## )
## Warning: 19449 parsing failures.
##  row      col           expected              actual                                                                              file
## 9891 district 1/0/T/F/TRUE/FALSE Harrison 2          'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9892 district 1/0/T/F/TRUE/FALSE Denver County 1     'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9893 district 1/0/T/F/TRUE/FALSE Academy 20          'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9893 xrel     1/0/T/F/TRUE/FALSE 1.52                'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9894 district 1/0/T/F/TRUE/FALSE Colorado Springs 11 'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## .... ........ .................. ................... .................................................................................
## See problems(...) for more details.
glimpse(wsj)
## Rows: 46,411
## Columns: 14
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ state    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ariz…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name     <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del S…
## $ type     <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Publi…
## $ city     <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoeni…
## $ county   <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Maric…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll   <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 35, 54, 54, 34, 57, …
## $ mmr      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ overall  <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed     <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, 2.86, NA, 7.41, NA…
## $ xper     <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, …
skim(wsj)
Data summary
Name wsj
Number of rows 46411
Number of columns 14
_______________________
Column type frequency:
character 6
logical 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1.00 4 14 0 32 0
year 0 1.00 4 7 0 4 0
name 0 1.00 1 91 0 36129 0
type 27174 0.41 5 12 0 6 0
city 17339 0.63 2 43 0 5665 0
county 5157 0.89 3 21 0 1158 0

Variable type: logical

skim_variable n_missing complete_rate mean count
district 46411 0 NaN :
xrel 46317 0 1 TRU: 94

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1.00 1457.45 1686.44 1.00 371.0 881.00 1735.00 8066.00 ▇▂▁▁▁
enroll 12844 0.72 120.54 161.40 0.00 39.0 73.00 116.00 6222.00 ▇▁▁▁▁
mmr 0 1.00 57.18 47.05 -1.00 -1.0 91.67 98.00 100.00 ▆▁▁▁▇
overall 0 1.00 51.57 46.79 -1.00 -1.0 83.33 95.61 100.00 ▇▁▁▁▇
xmed 33439 0.28 3.10 4.32 0.04 1.0 2.00 4.00 100.00 ▇▁▁▁▁
xper 40000 0.14 6.93 8.16 0.17 2.7 5.00 7.76 169.23 ▇▁▁▁▁
glimpse(measles)
## Rows: 66,113
## Columns: 16
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 1…
## $ state    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ariz…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name     <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del S…
## $ type     <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Publi…
## $ city     <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoeni…
## $ county   <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Maric…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll   <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 78, 35, 54, 54, 34, …
## $ mmr      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ overall  <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed     <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, NA, 2.86, NA, 7.41…
## $ xper     <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, …
## $ lat      <dbl> 31.34782, 32.22192, 32.13049, 33.48545, 33.49562, 33.43532, …
## $ lng      <dbl> -110.9380, -110.8961, -111.1170, -112.1306, -112.2247, -112.…
skim(measles)
Data summary
Name measles
Number of rows 66113
Number of columns 16
_______________________
Column type frequency:
character 6
logical 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1.00 4 14 0 32 0
year 0 1.00 4 7 0 4 0
name 0 1.00 1 91 0 36129 0
type 36622 0.45 5 12 0 6 0
city 20071 0.70 2 43 0 5665 0
county 6254 0.91 3 21 0 1158 0

Variable type: logical

skim_variable n_missing complete_rate mean count
district 66113 0 NaN :
xrel 66004 0 1 TRU: 109

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1.00 1607.69 1706.77 1.00 429.00 997.00 2133.00 8066.00 ▇▂▁▁▁
enroll 16260 0.75 131.93 162.16 0.00 46.00 80.00 129.00 6222.00 ▇▁▁▁▁
mmr 0 1.00 63.17 45.72 -1.00 -1.00 95.00 98.00 100.00 ▅▁▁▁▇
overall 0 1.00 54.09 46.67 -1.00 -1.00 87.00 96.10 100.00 ▆▁▁▁▇
xmed 45122 0.32 2.91 3.85 0.04 1.00 2.00 3.53 100.00 ▇▁▁▁▁
xper 57560 0.13 6.78 7.62 0.17 2.84 5.00 7.55 169.23 ▇▁▁▁▁
lat 1549 0.98 39.15 4.58 24.55 35.69 40.21 42.18 49.00 ▁▃▅▇▂
lng 1549 0.98 -96.28 18.44 -124.50 -117.63 -89.97 -81.75 80.21 ▇▃▁▁▁

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

The data came as two datasets that they want you to join. The measles dataset and the wsj dataset. The datasets are pretty similar for Oregon but it appears that they vary more for the national datasets. Generally the essential data are very complete for some columns, but with many missing values in columns I dont care as much about. Will deal with below.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Subset of the data and think of a question to answer the subset

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

ORmeasles<-measles%>% filter(state == "Oregon")
ORWSJ<-wsj %>% filter(state=="Oregon")

#I'm using skim to look at my datasets to identify how to combine them. 
skim(ORmeasles)
Data summary
Name ORmeasles
Number of rows 817
Number of columns 16
_______________________
Column type frequency:
character 6
logical 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 6 6 0 1 0
year 0 1 7 7 0 1 0
name 0 1 4 61 0 698 0
type 817 0 NA NA 0 0 0
city 817 0 NA NA 0 0 0
county 0 1 4 10 0 36 0

Variable type: logical

skim_variable n_missing complete_rate mean count
district 817 0 NaN :
xrel 817 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1.00 361.22 209.47 1.00 180.00 359.00 540.00 733.00 ▇▇▇▇▇
enroll 0 1.00 57.11 33.18 1.00 36.00 58.00 76.00 379.00 ▇▂▁▁▁
mmr 0 1.00 92.80 12.55 -1.00 93.49 95.58 97.27 100.00 ▁▁▁▁▇
overall 0 1.00 90.26 12.98 -1.00 90.59 93.44 95.44 100.00 ▁▁▁▁▇
xmed 609 0.25 0.46 0.60 0.11 0.22 0.31 0.51 7.14 ▇▁▁▁▁
xper 39 0.95 3.82 4.57 0.21 1.44 2.50 4.10 37.95 ▇▁▁▁▁
lat 0 1.00 44.60 1.09 42.01 44.05 44.96 45.48 46.19 ▂▁▃▆▇
lng 0 1.00 -122.39 1.43 -124.50 -123.07 -122.82 -122.49 -116.94 ▅▇▁▁▁
skim(ORWSJ)
Data summary
Name ORWSJ
Number of rows 733
Number of columns 14
_______________________
Column type frequency:
character 6
logical 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 6 6 0 1 0
year 0 1 7 7 0 1 0
name 0 1 4 61 0 698 0
type 733 0 NA NA 0 0 0
city 733 0 NA NA 0 0 0
county 0 1 4 10 0 36 0

Variable type: logical

skim_variable n_missing complete_rate mean count
district 733 0 NaN :
xrel 733 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1.00 367.00 211.74 1.00 184.00 367.00 550.00 733.00 ▇▇▇▇▇
enroll 0 1.00 56.24 33.70 1.00 31.00 57.00 76.00 379.00 ▇▂▁▁▁
mmr 0 1.00 92.62 12.69 -1.00 93.33 95.49 97.24 100.00 ▁▁▁▁▇
overall 0 1.00 90.02 13.17 -1.00 90.32 93.33 95.37 100.00 ▁▁▁▁▇
xmed 541 0.26 0.47 0.62 0.11 0.22 0.31 0.52 7.14 ▇▁▁▁▁
xper 37 0.95 3.92 4.71 0.21 1.41 2.51 4.22 37.95 ▇▁▁▁▁
#I see that the ORmeasles dataset has more observations and and more columns that the ORWSJ dataset. The additional columns are latitude and longitude which I'm not using, so that doesn't really play into my decision. The years are the same for both datasets in oregon (there is variation in years in the larger not subseted data set), but there are more observations in the ORmeasles set so I'm going to use that dataset as the left dataset and leftjoin. 

#I've decided that there are a lot of variables that are empty or that I'm not interested in, I'm going to further subset the data to exclude the data I dont care about to make the join less messy. I got this method from: https://www.statmethods.net/management/subset.html

getrid1<-names(ORmeasles) %in% c("type", "city", "district", "enroll", "xrel", "xmed", "xper", "lat", "lng")
getrid2<-names(ORWSJ) %in% c("type", "city", "district", "enroll", "xrel", "xmed", "xper")
clean_OR_measles<-ORmeasles[!getrid1]
clean_ORWSJ<-ORWSJ[!getrid2]

OR_MMR<-left_join(clean_OR_measles,clean_ORWSJ, by = c("index","state","year" ,"name","county","mmr","overall" ))

#In order to compare urban versus rural I need to  assign counties the designation of urban or rural. In looking at doing so I landed on using the 2013 US office of budget and management definitions of Metropolitan, micropolitan, and non core, rather than a binary of rural and urban, which is a little tricker in Oregon. https://oregonexplorer.info/content/what-rural?topic=173&ptopic=140 See image below.

knitr::include_graphics("images/ORmap.png")

#I factored the counties into three different levels to classify them as "metropolitan", "micropolitan" or "noncore". Code sourced from: https://forcats.tidyverse.org/reference/fct_collapse.html

urbanstatus <- fct_collapse(OR_MMR$county,
  metro = c("Multnomah", "Clackamas","Washington", "Yamhill", "Polk", "Columbia", "Benton", "Lane", "Linn", "Deschutes", "Jackson", "Josephine", "Marion"),
  micro = c("Hood River", "Crook", "Umatilla", "Union", "Malheur", "Klamath", "Douglas", "Wasco", "Curry", "Coos", "Lincoln", "Clatsop", "Morrow"),
  noncore = c("Wallowa", "Baker", "Grant", "Wheeler", "Gilliam", "Sherman", "Jefferson", "Tillamook", "Harney", "Lake")
)
levels(urbanstatus)
## [1] "noncore" "metro"   "micro"
#checked my levels to see that they were not in a logical order. Reordered them to reflect a logical rural to urban gradient. 

urbanstatus <-factor(urbanstatus, c("noncore", "micro","metro"))
levels(urbanstatus)
## [1] "noncore" "micro"   "metro"
#used the cbind function to add the new categories to my OR_MMR dataframe for data analysis
OR_MMR<-cbind(OR_MMR, urbanstatus)

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

head(OR_MMR)
##   index  state    year                           name   county mmr   overall
## 1     1 Oregon 2018-19           Big Muddy Elementary    Wasco 100  88.23529
## 2     2 Oregon 2018-19               Cairo Elementary  Malheur 100 100.00000
## 3     3 Oregon 2018-19         Jasper Mountain Center     Lane 100 100.00000
## 4     4 Oregon 2018-19       Jordan Valley Elementary  Malheur 100 100.00000
## 5     5 Oregon 2018-19 Lifeways Day Treatment Program Umatilla 100 100.00000
## 6     6 Oregon 2018-19        Morrow Education Center   Morrow 100  96.15385
##   urbanstatus
## 1       micro
## 2       micro
## 3       metro
## 4       micro
## 5       micro
## 6       micro

Are the values what you expected for the variables? Why or Why not?

Yes, everything looks as I would expected. Only the percentages of vaccine coverage for MMR and overall vaccinations are numeric and the rest are character

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

OR_MMR%>% group_by(urbanstatus)%>%
  summarize(mean = mean(mmr))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   urbanstatus  mean
##   <fct>       <dbl>
## 1 noncore      81.6
## 2 micro        92.9
## 3 metro        93.6
OR_MMR%>% group_by(urbanstatus)%>%
  summarize(median = median(mmr))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   urbanstatus median
##   <fct>        <dbl>
## 1 noncore       94.5
## 2 micro         95.9
## 3 metro         95.6
OR_MMR%>% group_by(urbanstatus)%>%
  summarize(mean = mean(overall))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   urbanstatus  mean
##   <fct>       <dbl>
## 1 noncore      78.5
## 2 micro        90.7
## 3 metro        91.1
OR_MMR%>% group_by(urbanstatus)%>%
  summarize(median = median(overall))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   urbanstatus median
##   <fct>        <dbl>
## 1 noncore       91.7
## 2 micro         93.8
## 3 metro         93.4

What are your findings about the summary? Are they what you expected?

The findings demonstrate that there is a positive relationship between level of urbanization and vaccine coverage for both MMR and overall vaccination rates. At first I just ran the mean and the relationship looked quite strong but when I looked at my boxplots below and saw the number of outliers I realized that my mean was probably skewed and decided to compare with the median as well. When looking at the median the immunization rates are fairly similar between urban and rural. I suspected that there would be a difference but I didn’t really think about how it would be fairly likely to have such a high proportion of major outliers. It would have been very interesting to have the data on public/charter/private. Unfortunately these data were not available for the Oregon data. I’m imagining that many of the outliers might be associated with religious/waldorf private or charter schools

Make at least two plots that help you answer your question on the transformed or summarized data.

ggplot(OR_MMR) + aes(x=urbanstatus, y = mmr, color = urbanstatus) + geom_boxplot()+
  labs( x = "Urban status", y= "Percent MMR vaccine coverage",  
    title = "MMR Vaccine Coverage by Urban/Rural Status"
    ) 

ggplot(OR_MMR) + aes(x=county, y = mmr, color = urbanstatus) + geom_boxplot()+ facet_wrap(vars(urbanstatus))+
   labs( x = "County", y= "Percent MMR vaccine coverage",  
    title = "MMR Vaccine Coverage by County"
    ) 

ggplot(OR_MMR) + aes(x=overall, y = mmr, color = urbanstatus) + geom_point()+
   labs( x = "Percent overall vaccine coverage", y= "Percent MMR vaccine coverage",  
    title = "Overall vaccine & MMR coverage"
    ) 

These plots were really helpful for me to be able to see spread and get a much more accurate understanding of the nuances of what was happening compared to looking simply at the mean and median values.

Final Summary (10 points)

Summarize your research question and findings below.

I wanted to identify whether vaccination rates differed between urban and rural counties in Oregon. Rather than to simply use a binary approach to urban and rural I used a 3 stage gradient that included “noncore” for the most rural, “micropolitan” for smaller semi-urban regions, and “metropolitan” for most urban settings. When looking just at mean, it seems that there is a strong relationship between being more rural and having lower vaccination rates, however a closer look at the data indicates that a large proportion of the differences in vaccination rates are accounted for by significant outliers in the data. These are schools with very low vaccination rates. There are significant outliers in both the urban and rural regions, but due to the relatively lower number of schools in the rural regions, the outliers skew the mean vaccination rates more than they do in the urban areas. There also appears to be a high degree of variability in the spread of vaccination rates in schools by county, as evidenced by the second box plot entitled ’MMR vaccination coverage by county. Not surprisingly, overall vaccination coverage and MMR vaccination coverage are strongly correlated (see graph 3).

Are your findings what you expected? Why or Why not?

Given that my original research hope was to look at vaccination rate by school type rather than by urban/rural (those data were not available), I did have a sense that there might be really extreme variability between types of schools, and I am guessing that charter/private schools account for a large proportion of the major outliers in these numbers. I dont think I expected the degree to which outliers accounted for the variation between urban and rural vaccination rates. It would be very interesting to see how these numbers compare to percapita vaccination rates rather than by school, and it would also be interesting to see to what extent school type accounted for vaccination rates in each of the three types on the urban-rural gradient. Overall this ended up being a more interesting analysis than I anticipated