Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

# This code came from (included each time in the code chunk)

Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

Please schedule a time with Eric or Me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Given your question, what is your expectation about the data?

My research question: In Oregon is there a significant difference between MMR vaccination rates in schools between urban and rural counties?

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   index = col_double(),
##   state = col_character(),
##   year = col_character(),
##   name = col_character(),
##   type = col_character(),
##   city = col_character(),
##   county = col_character(),
##   district = col_logical(),
##   enroll = col_double(),
##   mmr = col_double(),
##   overall = col_double(),
##   xrel = col_logical(),
##   xmed = col_double(),
##   xper = col_double(),
##   lat = col_double(),
##   lng = col_double()
## )

wsj <- read_csv("https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   index = col_double(),
##   state = col_character(),
##   year = col_character(),
##   name = col_character(),
##   type = col_character(),
##   city = col_character(),
##   county = col_character(),
##   district = col_logical(),
##   enroll = col_double(),
##   mmr = col_double(),
##   overall = col_double(),
##   xrel = col_logical(),
##   xmed = col_double(),
##   xper = col_double()
## )

## Warning: 19449 parsing failures.
##  row      col           expected              actual                                                                              file
## 9891 district 1/0/T/F/TRUE/FALSE Harrison 2          'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9892 district 1/0/T/F/TRUE/FALSE Denver County 1     'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9893 district 1/0/T/F/TRUE/FALSE Academy 20          'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9893 xrel     1/0/T/F/TRUE/FALSE 1.52                'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## 9894 district 1/0/T/F/TRUE/FALSE Colorado Springs 11 'https://raw.githubusercontent.com/WSJ/measles-data/master/all-measles-rates.csv'
## .... ........ .................. ................... .................................................................................
## See problems(...) for more details.

glimpse(wsj)

## Rows: 46,411
## Columns: 14
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ state    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ariz…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name     <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del S…
## $ type     <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Publi…
## $ city     <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoeni…
## $ county   <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Maric…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll   <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 35, 54, 54, 34, 57, …
## $ mmr      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ overall  <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed     <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, 2.86, NA, 7.41, NA…
## $ xper     <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, …

skim(wsj)

Data summary
Name	wsj
Number of rows	46411
Number of columns	14
_______________________
Column type frequency:
character	6
logical	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
state	0	1.00	4	14	32
year	0	1.00	4	7	4
name	0	1.00	1	91	36129
type	27174	0.41	5	12	6
city	17339	0.63	2	43	5665
county	5157	0.89	3	21	1158

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
district	46411	0	NaN	:
xrel	46317	0	1	TRU: 94

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
index	0	1.00	1457.45	1686.44	1.00	371.0	881.00	1735.00	8066.00	▇▂▁▁▁
enroll	12844	0.72	120.54	161.40	0.00	39.0	73.00	116.00	6222.00	▇▁▁▁▁
mmr	0	1.00	57.18	47.05	-1.00	-1.0	91.67	98.00	100.00	▆▁▁▁▇
overall	0	1.00	51.57	46.79	-1.00	-1.0	83.33	95.61	100.00	▇▁▁▁▇
xmed	33439	0.28	3.10	4.32	0.04	1.0	2.00	4.00	100.00	▇▁▁▁▁
xper	40000	0.14	6.93	8.16	0.17	2.7	5.00	7.76	169.23	▇▁▁▁▁

glimpse(measles)

## Rows: 66,113
## Columns: 16
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 1…
## $ state    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Ariz…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018…
## $ name     <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del S…
## $ type     <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Publi…
## $ city     <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoeni…
## $ county   <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Maric…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ enroll   <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 78, 35, 54, 54, 34, …
## $ mmr      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ overall  <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xmed     <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, NA, 2.86, NA, 7.41…
## $ xper     <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, …
## $ lat      <dbl> 31.34782, 32.22192, 32.13049, 33.48545, 33.49562, 33.43532, …
## $ lng      <dbl> -110.9380, -110.8961, -111.1170, -112.1306, -112.2247, -112.…

skim(measles)

Data summary
Name	measles
Number of rows	66113
Number of columns	16
_______________________
Column type frequency:
character	6
logical	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
state	0	1.00	4	14	32
year	0	1.00	4	7	4
name	0	1.00	1	91	36129
type	36622	0.45	5	12	6
city	20071	0.70	2	43	5665
county	6254	0.91	3	21	1158

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
district	66113	0	NaN	:
xrel	66004	0	1	TRU: 109

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
index	0	1.00	1607.69	1706.77	1.00	429.00	997.00	2133.00	8066.00	▇▂▁▁▁
enroll	16260	0.75	131.93	162.16	0.00	46.00	80.00	129.00	6222.00	▇▁▁▁▁
mmr	0	1.00	63.17	45.72	-1.00	-1.00	95.00	98.00	100.00	▅▁▁▁▇
overall	0	1.00	54.09	46.67	-1.00	-1.00	87.00	96.10	100.00	▆▁▁▁▇
xmed	45122	0.32	2.91	3.85	0.04	1.00	2.00	3.53	100.00	▇▁▁▁▁
xper	57560	0.13	6.78	7.62	0.17	2.84	5.00	7.55	169.23	▇▁▁▁▁
lat	1549	0.98	39.15	4.58	24.55	35.69	40.21	42.18	49.00	▁▃▅▇▂
lng	1549	0.98	-96.28	18.44	-124.50	-117.63	-89.97	-81.75	80.21	▇▃▁▁▁

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

The data came as two datasets that they want you to join. The measles dataset and the wsj dataset. The datasets are pretty similar for Oregon but it appears that they vary more for the national datasets. Generally the essential data are very complete for some columns, but with many missing values in columns I dont care as much about. Will deal with below.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Subset of the data and think of a question to answer the subset

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

ORmeasles<-measles%>% filter(state == "Oregon")
ORWSJ<-wsj %>% filter(state=="Oregon")

#I'm using skim to look at my datasets to identify how to combine them. 
skim(ORmeasles)

Data summary
Name	ORmeasles
Number of rows	817
Number of columns	16
_______________________
Column type frequency:
character	6
logical	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
state	0	1	6	6	1
year	0	1	7	7	1
name	0	1	4	61	698
type	817	0	NA	NA	0
city	817	0	NA	NA	0
county	0	1	4	10	36

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
district	817	0	NaN	:
xrel	817	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
index	0	1.00	361.22	209.47	1.00	180.00	359.00	540.00	733.00	▇▇▇▇▇
enroll	0	1.00	57.11	33.18	1.00	36.00	58.00	76.00	379.00	▇▂▁▁▁
mmr	0	1.00	92.80	12.55	-1.00	93.49	95.58	97.27	100.00	▁▁▁▁▇
overall	0	1.00	90.26	12.98	-1.00	90.59	93.44	95.44	100.00	▁▁▁▁▇
xmed	609	0.25	0.46	0.60	0.11	0.22	0.31	0.51	7.14	▇▁▁▁▁
xper	39	0.95	3.82	4.57	0.21	1.44	2.50	4.10	37.95	▇▁▁▁▁
lat	0	1.00	44.60	1.09	42.01	44.05	44.96	45.48	46.19	▂▁▃▆▇
lng	0	1.00	-122.39	1.43	-124.50	-123.07	-122.82	-122.49	-116.94	▅▇▁▁▁

skim(ORWSJ)

Data summary
Name	ORWSJ
Number of rows	733
Number of columns	14
_______________________
Column type frequency:
character	6
logical	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
state	0	1	6	6	1
year	0	1	7	7	1
name	0	1	4	61	698
type	733	0	NA	NA	0
city	733	0	NA	NA	0
county	0	1	4	10	36

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
district	733	0	NaN	:
xrel	733	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
index	0	1.00	367.00	211.74	1.00	184.00	367.00	550.00	733.00	▇▇▇▇▇
enroll	0	1.00	56.24	33.70	1.00	31.00	57.00	76.00	379.00	▇▂▁▁▁
mmr	0	1.00	92.62	12.69	-1.00	93.33	95.49	97.24	100.00	▁▁▁▁▇
overall	0	1.00	90.02	13.17	-1.00	90.32	93.33	95.37	100.00	▁▁▁▁▇
xmed	541	0.26	0.47	0.62	0.11	0.22	0.31	0.52	7.14	▇▁▁▁▁
xper	37	0.95	3.92	4.71	0.21	1.41	2.51	4.22	37.95	▇▁▁▁▁

#I see that the ORmeasles dataset has more observations and and more columns that the ORWSJ dataset. The additional columns are latitude and longitude which I'm not using, so that doesn't really play into my decision. The years are the same for both datasets in oregon (there is variation in years in the larger not subseted data set), but there are more observations in the ORmeasles set so I'm going to use that dataset as the left dataset and leftjoin. 

#I've decided that there are a lot of variables that are empty or that I'm not interested in, I'm going to further subset the data to exclude the data I dont care about to make the join less messy. I got this method from: https://www.statmethods.net/management/subset.html

getrid1<-names(ORmeasles) %in% c("type", "city", "district", "enroll", "xrel", "xmed", "xper", "lat", "lng")
getrid2<-names(ORWSJ) %in% c("type", "city", "district", "enroll", "xrel", "xmed", "xper")
clean_OR_measles<-ORmeasles[!getrid1]
clean_ORWSJ<-ORWSJ[!getrid2]

OR_MMR<-left_join(clean_OR_measles,clean_ORWSJ, by = c("index","state","year" ,"name","county","mmr","overall" ))

#In order to compare urban versus rural I need to  assign counties the designation of urban or rural. In looking at doing so I landed on using the 2013 US office of budget and management definitions of Metropolitan, micropolitan, and non core, rather than a binary of rural and urban, which is a little tricker in Oregon. https://oregonexplorer.info/content/what-rural?topic=173&ptopic=140 See image below.

knitr::include_graphics("images/ORmap.png")

#I factored the counties into three different levels to classify them as "metropolitan", "micropolitan" or "noncore". Code sourced from: https://forcats.tidyverse.org/reference/fct_collapse.html

urbanstatus <- fct_collapse(OR_MMR$county,
  metro = c("Multnomah", "Clackamas","Washington", "Yamhill", "Polk", "Columbia", "Benton", "Lane", "Linn", "Deschutes", "Jackson", "Josephine", "Marion"),
  micro = c("Hood River", "Crook", "Umatilla", "Union", "Malheur", "Klamath", "Douglas", "Wasco", "Curry", "Coos", "Lincoln", "Clatsop", "Morrow"),
  noncore = c("Wallowa", "Baker", "Grant", "Wheeler", "Gilliam", "Sherman", "Jefferson", "Tillamook", "Harney", "Lake")
)
levels(urbanstatus)

## [1] "noncore" "metro"   "micro"

#checked my levels to see that they were not in a logical order. Reordered them to reflect a logical rural to urban gradient. 

urbanstatus <-factor(urbanstatus, c("noncore", "micro","metro"))
levels(urbanstatus)

## [1] "noncore" "micro"   "metro"

#used the cbind function to add the new categories to my OR_MMR dataframe for data analysis
OR_MMR<-cbind(OR_MMR, urbanstatus)

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

head(OR_MMR)

##   index  state    year                           name   county mmr   overall
## 1     1 Oregon 2018-19           Big Muddy Elementary    Wasco 100  88.23529
## 2     2 Oregon 2018-19               Cairo Elementary  Malheur 100 100.00000
## 3     3 Oregon 2018-19         Jasper Mountain Center     Lane 100 100.00000
## 4     4 Oregon 2018-19       Jordan Valley Elementary  Malheur 100 100.00000
## 5     5 Oregon 2018-19 Lifeways Day Treatment Program Umatilla 100 100.00000
## 6     6 Oregon 2018-19        Morrow Education Center   Morrow 100  96.15385
##   urbanstatus
## 1       micro
## 2       micro
## 3       metro
## 4       micro
## 5       micro
## 6       micro

Are the values what you expected for the variables? Why or Why not?

Yes, everything looks as I would expected. Only the percentages of vaccine coverage for MMR and overall vaccinations are numeric and the rest are character

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

OR_MMR%>% group_by(urbanstatus)%>%
  summarize(mean = mean(mmr))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   urbanstatus  mean
##   <fct>       <dbl>
## 1 noncore      81.6
## 2 micro        92.9
## 3 metro        93.6

OR_MMR%>% group_by(urbanstatus)%>%
  summarize(median = median(mmr))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   urbanstatus median
##   <fct>        <dbl>
## 1 noncore       94.5
## 2 micro         95.9
## 3 metro         95.6

OR_MMR%>% group_by(urbanstatus)%>%
  summarize(mean = mean(overall))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   urbanstatus  mean
##   <fct>       <dbl>
## 1 noncore      78.5
## 2 micro        90.7
## 3 metro        91.1

OR_MMR%>% group_by(urbanstatus)%>%
  summarize(median = median(overall))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   urbanstatus median
##   <fct>        <dbl>
## 1 noncore       91.7
## 2 micro         93.8
## 3 metro         93.4

What are your findings about the summary? Are they what you expected?

The findings demonstrate that there is a positive relationship between level of urbanization and vaccine coverage for both MMR and overall vaccination rates. At first I just ran the mean and the relationship looked quite strong but when I looked at my boxplots below and saw the number of outliers I realized that my mean was probably skewed and decided to compare with the median as well. When looking at the median the immunization rates are fairly similar between urban and rural. I suspected that there would be a difference but I didn’t really think about how it would be fairly likely to have such a high proportion of major outliers. It would have been very interesting to have the data on public/charter/private. Unfortunately these data were not available for the Oregon data. I’m imagining that many of the outliers might be associated with religious/waldorf private or charter schools

Make at least two plots that help you answer your question on the transformed or summarized data.

ggplot(OR_MMR) + aes(x=urbanstatus, y = mmr, color = urbanstatus) + geom_boxplot()+
  labs( x = "Urban status", y= "Percent MMR vaccine coverage",  
    title = "MMR Vaccine Coverage by Urban/Rural Status"
    )

ggplot(OR_MMR) + aes(x=county, y = mmr, color = urbanstatus) + geom_boxplot()+ facet_wrap(vars(urbanstatus))+
   labs( x = "County", y= "Percent MMR vaccine coverage",  
    title = "MMR Vaccine Coverage by County"
    )

ggplot(OR_MMR) + aes(x=overall, y = mmr, color = urbanstatus) + geom_point()+
   labs( x = "Percent overall vaccine coverage", y= "Percent MMR vaccine coverage",  
    title = "Overall vaccine & MMR coverage"
    )

These plots were really helpful for me to be able to see spread and get a much more accurate understanding of the nuances of what was happening compared to looking simply at the mean and median values.

Final Summary (10 points)

Summarize your research question and findings below.

I wanted to identify whether vaccination rates differed between urban and rural counties in Oregon. Rather than to simply use a binary approach to urban and rural I used a 3 stage gradient that included “noncore” for the most rural, “micropolitan” for smaller semi-urban regions, and “metropolitan” for most urban settings. When looking just at mean, it seems that there is a strong relationship between being more rural and having lower vaccination rates, however a closer look at the data indicates that a large proportion of the differences in vaccination rates are accounted for by significant outliers in the data. These are schools with very low vaccination rates. There are significant outliers in both the urban and rural regions, but due to the relatively lower number of schools in the rural regions, the outliers skew the mean vaccination rates more than they do in the urban areas. There also appears to be a high degree of variability in the spread of vaccination rates in schools by county, as evidenced by the second box plot entitled ’MMR vaccination coverage by county. Not surprisingly, overall vaccination coverage and MMR vaccination coverage are strongly correlated (see graph 3).

Are your findings what you expected? Why or Why not?

Given that my original research hope was to look at vaccination rate by school type rather than by urban/rural (those data were not available), I did have a sense that there might be really extreme variability between types of schools, and I am guessing that charter/private schools account for a large proportion of the major outliers in these numbers. I dont think I expected the degree to which outliers accounted for the variation between urban and rural vaccination rates. It would be very interesting to see how these numbers compare to percapita vaccination rates rather than by school, and it would also be interesting to see to what extent school type accounted for vaccination rates in each of the three types on the urban-rural gradient. Overall this ended up being a more interesting analysis than I anticipated

Midterm

Stephanie Elliott

2021-02-12

Midterm (Due 2/12/2021 at 11:55 pm)

60 / 60 points total

Before you get Started

Working Together

Please Note

Define Your Research Question (10 points)

Loading the Data (10 points)

Transforming the data (15 points)

Visualizing and Summarizing the Data (15 points)

Final Summary (10 points)