Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

  1. Please schedule a time with Eric or Me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Given your question, what is your expectation about the data?

The dataset I will be using for the midterm is the measles.csv data that comes from The Wallstreet Journal, specifically an article that looking at 46,412 across 32 US States. The vaccination school year differes for some states between 2017-18 and 2018-19 school year. I was particularly interested in the Oregon data since it was one of the states included of the 32 and I recently did some reading for another class on the MMR vaccination, specifically about its opposition. That led me to the research question below:

Does the school vaccination rate for the MMR vaccine in Oregon schools in the 2018-2019 differ by county?

I do anticipate for there to be difference by county, specifically those in Salem such as Marion and Polk Counties as there have been anti-vaccination protests and groups forming in this area. Additionally, I expect Yamhill to be lower as it is also a known measles hot spot.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(here)
## here() starts at /cloud/project
library(ggplot2)
library(forcats)
library(skimr)

#The code to load the data comes directly from the TidyTuesday page where it was found: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-25/readme.md

measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv', 
                           na = "NA")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   index = col_double(),
##   state = col_character(),
##   year = col_character(),
##   name = col_character(),
##   type = col_character(),
##   city = col_character(),
##   county = col_character(),
##   district = col_logical(),
##   enroll = col_double(),
##   mmr = col_double(),
##   overall = col_double(),
##   xrel = col_logical(),
##   xmed = col_double(),
##   xper = col_double(),
##   lat = col_double(),
##   lng = col_double()
## )
glimpse(measles)
## Rows: 66,113
## Columns: 16
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 1…
## $ state    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ariz…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name     <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del S…
## $ type     <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Publi…
## $ city     <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoeni…
## $ county   <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Maric…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll   <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 78, 35, 54, 54, 34, …
## $ mmr      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ overall  <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed     <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, NA, 2.86, NA, 7.41…
## $ xper     <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, …
## $ lat      <dbl> 31.34782, 32.22192, 32.13049, 33.48545, 33.49562, 33.43532, …
## $ lng      <dbl> -110.9380, -110.8961, -111.1170, -112.1306, -112.2247, -112.…

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Since we only want to look at Oregon data, I am going to create a subset of the original data set which will allow us to look at measles in Oregon. The data set will be called measles_OR

#developed this code from part 4 of BSTA 504, credit to Ted Laderas 


measles_OR <- measles %>%
  filter(state %in% "Oregon", na.rm = TRUE)

measles_OR$mmr[measles_OR$mmr==-1] <- NA

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(measles_OR)
## Rows: 817
## Columns: 16
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ state    <chr> "Oregon", "Oregon", "Oregon", "Oregon", "Oregon", "Oregon", …
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name     <chr> "Big Muddy Elementary", "Cairo Elementary", "Jasper Mountain…
## $ type     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ city     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ county   <chr> "Wasco", "Malheur", "Lane", "Malheur", "Umatilla", "Morrow",…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll   <dbl> 1, 17, 1, 1, 2, 2, 102, 1, 75, 45, 60, 67, 48, 46, 9, 64, 67…
## $ mmr      <dbl> 100.00000, 100.00000, 100.00000, 100.00000, 100.00000, 100.0…
## $ overall  <dbl> 88.23529, 100.00000, 100.00000, 100.00000, 100.00000, 96.153…
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.2801120, NA, 0.2096436…
## $ xper     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.3115265, 0.2096436…
## $ lat      <dbl> 44.83375, 43.98788, 43.97460, 42.97615, 45.91627, 45.89574, …
## $ lng      <dbl> -120.4839, -117.0125, -122.8622, -117.0544, -119.2905, -119.…
skim(measles_OR)
Data summary
Name measles_OR
Number of rows 817
Number of columns 16
_______________________
Column type frequency:
character 6
logical 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 6 6 0 1 0
year 0 1 7 7 0 1 0
name 0 1 4 61 0 698 0
type 817 0 NA NA 0 0 0
city 817 0 NA NA 0 0 0
county 0 1 4 10 0 36 0

Variable type: logical

skim_variable n_missing complete_rate mean count
district 817 0 NaN :
xrel 817 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1.00 361.22 209.47 1.00 180.00 359.00 540.00 733.00 ▇▇▇▇▇
enroll 0 1.00 57.11 33.18 1.00 36.00 58.00 76.00 379.00 ▇▂▁▁▁
mmr 11 0.99 94.08 6.14 43.11 93.56 95.68 97.30 100.00 ▁▁▁▁▇
overall 0 1.00 90.26 12.98 -1.00 90.59 93.44 95.44 100.00 ▁▁▁▁▇
xmed 609 0.25 0.46 0.60 0.11 0.22 0.31 0.51 7.14 ▇▁▁▁▁
xper 39 0.95 3.82 4.57 0.21 1.44 2.50 4.10 37.95 ▇▁▁▁▁
lat 0 1.00 44.60 1.09 42.01 44.05 44.96 45.48 46.19 ▂▁▃▆▇
lng 0 1.00 -122.39 1.43 -124.50 -123.07 -122.82 -122.49 -116.94 ▅▇▁▁▁
summary(measles_OR)
##      index          state               year               name          
##  Min.   :  1.0   Length:817         Length:817         Length:817        
##  1st Qu.:180.0   Class :character   Class :character   Class :character  
##  Median :359.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :361.2                                                           
##  3rd Qu.:540.0                                                           
##  Max.   :733.0                                                           
##                                                                          
##      type               city              county          district      
##  Length:817         Length:817         Length:817         Mode:logical  
##  Class :character   Class :character   Class :character   NA's:817      
##  Mode  :character   Mode  :character   Mode  :character                 
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##      enroll            mmr            overall         xrel        
##  Min.   :  1.00   Min.   : 43.11   Min.   : -1.00   Mode:logical  
##  1st Qu.: 36.00   1st Qu.: 93.56   1st Qu.: 90.59   NA's:817      
##  Median : 58.00   Median : 95.68   Median : 93.44                 
##  Mean   : 57.11   Mean   : 94.08   Mean   : 90.26                 
##  3rd Qu.: 76.00   3rd Qu.: 97.30   3rd Qu.: 95.44                 
##  Max.   :379.00   Max.   :100.00   Max.   :100.00                 
##                   NA's   :11                                      
##       xmed             xper              lat             lng        
##  Min.   :0.1127   Min.   : 0.2096   Min.   :42.01   Min.   :-124.5  
##  1st Qu.:0.2191   1st Qu.: 1.4404   1st Qu.:44.05   1st Qu.:-123.1  
##  Median :0.3053   Median : 2.4961   Median :44.96   Median :-122.8  
##  Mean   :0.4638   Mean   : 3.8172   Mean   :44.60   Mean   :-122.4  
##  3rd Qu.:0.5060   3rd Qu.: 4.0972   3rd Qu.:45.48   3rd Qu.:-122.5  
##  Max.   :7.1429   Max.   :37.9487   Max.   :46.19   Max.   :-116.9  
##  NA's   :609      NA's   :39

Are the values what you expected for the variables? Why or Why not?

The glimpse() helped verify that my variables were correctly categorize as well as visualize easily the total # schools included, which is equivalent to the rows. I think mean vaccination rates for schools for the MMR vaccine was slightly higher than I expected.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

mean_county_mmr <- measles_OR %>%
  group_by(county) %>%
  summarize(mean_mmr = mean(mmr, na.rm = TRUE)) %>%
  arrange(desc(mean_mmr))
## `summarise()` ungrouping output (override with `.groups` argument)
mean_county_mmr
## # A tibble: 36 x 2
##    county     mean_mmr
##    <chr>         <dbl>
##  1 Morrow         98.4
##  2 Malheur        98.1
##  3 Umatilla       97.4
##  4 Gilliam        96.9
##  5 Lake           96.7
##  6 Washington     96.5
##  7 Tillamook      96.4
##  8 Lincoln        96.3
##  9 Wasco          96.2
## 10 Linn           96.2
## # … with 26 more rows

What are your findings about the summary? Are they what you expected?

My findings are not completely consistent with what I expected. Yamhill and Marion county are above 95% for the mean overall vaccination rate for the MMR vaccine, which is above Multnomah county even which is residing at about 93.8% for the 2018-2019 school year. However, Polk county is amoung the bottom 8 overall vaccinations rates for the MMR vaccine in the 2018-2019 school year at 91.8%

Make at least two plots that help you answer your question on the transformed or summarized data.

To visualize the data the first graph we are going to create is a ridgeline density plot of of Mean MMR vaccination rate by County for the Oregon Counties.

#Code adapted from Jessica Minnier and Meike Neiderhausen's Data Visualization with R and ggplot2 slides;  https://sph-r-programming.netlify.app/ggplot_flipbookr_berd/04_ggplot_slides.html#44

library(ggridges)
library(ggthemes)

ggplot(data = measles_OR,
       aes(x = mmr,
           y = county,
           fill = county)
       ) + 
  geom_density_ridges(alpha = 0.8) +
  ggthemes::theme_clean() + 
  theme(legend.position="none") + 
  labs(
    x = "School's MMR vaccine rate",   
    y = "County",
    title = "Ridgeline Density Plot of MMR vaccination rate by County (OR)")
## Picking joint bandwidth of 1.41
## Warning: Removed 11 rows containing non-finite values (stat_density_ridges).

The plot above is useful for visualizing the changes in distribution of the MMR vaccination rate per county. Above we can see that Grant county Oregon has a wide range of MMR vaccination rates [~78 - 100]. Additionally we can see that Harney County has some outliers in their county which may be lowering the mean MMR rate. Additionally, Clackamas and Marian county appear to have the most narrow range, demonstrating a pretty normal distribution on the ridgeline density plot. The plot above also helps answer our research question, that while there is some difference in average school MMR vaccination rate by county, overall the range of most counties in Oregone do not differ significantly.

The next plot we are going to create to visualize the data is Box plot. Box plots divide the sata into sections that contain approximatelt 25% of the data in the set. Its useful for visualizing summary of the data to quickly identify mean values, dispersion of the data, and signs of skewness. Since it is hard to visualize all 36 counties on one plot, we will be looking at the ten counties with the lowest mean school MMR vaccination rate.

#Code adapted from Jessica Minnier and Meike Neiderhausen's Data Visualization with R and ggplot2 slides;  https://sph-r-programming.netlify.app/ggplot_flipbookr_berd/04_ggplot_slides.html#44

bottom_10_mean_mmr <- c("Lane", "Columbia", "Polk", "Harney", "Curry", "Jackson",   "Baker",    "Josephine", "Grant", "Deschutes")

OR_bottom_10 <- measles_OR %>%
  filter(county %in% bottom_10_mean_mmr)

ggplot(data = OR_bottom_10,
       aes(x = mmr, 
           y = county,
           fill = county)
       ) + 
  geom_boxplot(alpha = 0.5) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) + 
  scale_fill_fivethirtyeight() + 
  theme(legend.position = "none") + 
  geom_jitter(width = .1, alpha = 0.3) +
  geom_violin(colour = "black", alpha = .2) + 
  labs(
    x = "School's MMR Rate",    
    y = "Counties with bottom 10 mean school MMR rates in OR",   
    title = "MMR vaccination rate of the bottom 10 mean MMR rates Counties (OR)")
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing non-finite values (stat_ydensity).
## Warning: This manual palette can handle a maximum of 3 values. You have supplied
## 10.
## Warning: Removed 4 rows containing missing values (geom_point).

The boxplots above show us that the counties in the bottom 10 for average school MMR vaccination rate have extreme outliers which may be biasing the average school MMR vaccinations rate and therefore skewing our interpretation of which schools counties are actually of concern for a measles outbreak.

Final Summary (10 points)

Summarize your research question and findings below.

The research question introduced was Does the school vaccination rate for the MMR vaccine in Oregon schools in the 2018-2019 differ by county?, from the data transformations and visualizations we can see that overall the range of school MMR vaccination rates across counties is relatively similar. The mean school vaccination rates only range from 98.4 to 88.9, however we can see from both the ridgeline density plot and box plots that outlier schools exist with lower vaccination rates, they just do not alter the difference in mean school vaccination rate by county super significantly.

Are your findings what you expected? Why or Why not?

My findings are not exactly what I expected. I expected the mean school vaccination rate to dip below 85% into the low 80s for the state of Oregon which is not what we saw. Additionally, I did expect counties in Salem, OR as well as Yamhill to have a significantly different mean school vaccination rate as they are known areas of measles hot spots, however Marion county one of the Salem counties appeared to be among the top counties for vaccination rates and with a narrower distribution between its schools.