Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(readxl)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(forcats)
library(skimr)
Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
.csv
file into your data
folder.You may use another dataset or your own data, but please make sure it is de-identified.
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Given your question, what is your expectation about the data?
The US Department of Agriculture (USDA) publishes monthly reports for the US dairy industry.[1] Annual statistical analyses and at-a-glance reports are also provided. These reports highlight the production, consumption and commercial import/export of US dairy. According to the NPR article “Nobody is Moving our Cheese” and a similar article by The Washington Post,[2,3] milk production has been exceeding the rates of consumption for years. As a more sustainable and less perishable storage option, over the past 10 years, excess dairy is being turned into cheese.
However, the US population has also gradually steered away from eating processed cheese over the years, opting for European style cheeses instead. This shift in consumption has resulted in a nearly 1.4 billion stockpile of cheese.
I’m interested in understanding how this trend has impacted the cows and the amount of milk produced over time and by region. My hypothesis is that there will be a correlation in the number of cows and the average milk produced per cow. I’m curious to see if the supply sources and production, i.e., the cows and the milk they produce, has changed over time and by region of the US. Additionally, the articles focused specifically on milk and cheese consumption. I am interested to learn if the same trends can be seen in soft milk products, particularly ice cream and yogurt.
To illustrate this point, I will be utilizing two different datasets provided by the USDA. The first dataset is the number of cows and the average milk produced per cow, between 1970 and 2019, by state and region. The second dataset is the consumption, in millions of pounds, of soft dairy products between 1975 and 2019.
There are two separate datasets comprising the number of cows and average milk produced per cow by year and region. I will combine these two datasets by year and region to look at the trends in number of cows and dairy produced over time and by region.
[1] https://www.ers.usda.gov/data-products/dairy-data/documentation/#Loc3
[2,3] https://www.npr.org/2019/01/09/683339929/nobody-is-moving-our-cheese-american-surplus-reaches-record-high, https://www.washingtonpost.com/news/wonk/wp/2018/06/28/americas-cheese-stockpile-just-hit-an-all-time-high/?noredirect=on
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
#The first two data files correspond to the number of cows and the average milk produced per cow. The third data file corresponds to the soft milk product consumption over time.
#The first spreadsheet contains the number of milk cows by region and state, 1970-2019
milk_cows <- read_excel("data/milkcowsandprod.xlsx",
sheet=1,
na="NA")
#The second spreadsheet contains the average milk produced (in lbs) per cow, by region and state, 1970-2019
milk_per_cow <- read_excel("data/milkcowsandprod.xlsx",
sheet=2,
na="NA")
#The third data file contains the consumption of soft dairy products over time, 1970-2019
milk_soft <- read_excel("data/softprod.xlsx",
sheet=1,
na="NA")
## New names:
## * `` -> ...1
#Visualizing the data to get a sense of what (if any) data management is required in each data set.
head(milk_cows)
## # A tibble: 6 x 52
## Region State `1970` `1971` `1972` `1973` `1974` `1975` `1976` `1977` `1978`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 North… <NA> 2306. 2267. 2234. 2181. 2195. 2218. 2206. 2194. 2170.
## 2 <NA> Maine 62 61 61 60 60 61 59 58 58
## 3 <NA> New … 36 35 33 33 32 33 31 31 31
## 4 <NA> Verm… 194 195 195 191 193 193 194 191 186
## 5 <NA> Mass… 60 59 57 55 54 55 54 51 49
## 6 <NA> Rhod… 6.9 6.3 6.2 5.9 5.9 6 5.5 5 4.7
## # … with 41 more variables: `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## # `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
## # `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
## # `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
## # `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
## # `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
## # `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## # `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
## # `2017` <dbl>, `2018` <dbl>, `2019` <dbl>
head(milk_per_cow)
## # A tibble: 6 x 52
## Region State `1970` `1971` `1972` `1973` `1974` `1975` `1976` `1977` `1978`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> <NA> NA NA NA NA NA NA NA NA NA
## 2 North… <NA> 10503. 10749. 10755. 10448. 10542. 10621. 11000. 11226. 11450.
## 3 <NA> Maine 9984 10311 10459 10233 10183 10311 10644 11000 11052
## 4 <NA> New … 9889 10257 10697 10212 10281 10182 10806 10935 11000
## 5 <NA> Verm… 10155 10385 10456 10199 10078 10409 10789 10995 11484
## 6 <NA> Mass… 10967 11153 11035 10818 10981 10927 11074 11706 11673
## # … with 41 more variables: `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## # `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
## # `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
## # `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
## # `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
## # `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
## # `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## # `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
## # `2017` <dbl>, `2018` <dbl>, `2019` <dbl>
head(milk_soft)
## # A tibble: 6 x 16
## ...1 `Ice Cream Regu… `Ice Cream Low-… `Ice Cream Non-… `Frozen yogurt`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1975 3932. 1404. NA NA
## 2 1976 3846. 1345. NA NA
## 3 1977 3808. 1448. NA NA
## 4 1978 3834. 1458 NA NA
## 5 1979 3813. 1395. NA NA
## 6 1980 3898. 1378. NA NA
## # … with 11 more variables: Sherbet <dbl>, `Other frozen dairy` <dbl>, `Water
## # and juice ices` <dbl>, `Yogurt other than frozen` <dbl>, `Cottage
## # cheese` <dbl>, `Sour cream` <dbl>, `Light cream` <dbl>, `Heavy
## # cream` <dbl>, `Light cream \r\n+ heavy cream` <dbl>, `Half and half` <dbl>,
## # `Total fluid cream` <dbl>
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
In the milk_cows and milk_per_cow data files, there are empty rows between each region type. Additionally, there are several empty cells beneath each region, where the data is being described by state.
I want to first get rid of the empty rows. Then, I want to transpose the data from wide to long, so that I have a column for region, state, year, and number of cows. Similarly, in the milk_per_cow data file, I want a column for region, state, year, and average production per cow. Then, I will merge these two data files by year and region.
The milk_soft data file has some interesting variable names, which I will use janitor::clean_names() to tidy up.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
Subset of the data and think of a question to answer the subset
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
#Removing rows containing all NAs from the milk_cows and milk_per_cow data frames
#The following code to remove rows with all NA values is courtesy of: https://stackoverflow.com/questions/51214357/how-to-remove-only-rows-that-have-all-na-in-r
milk_cows[milk_cows == ""] <- NA # define NA pattern
milk_cows_clean <- milk_cows[rowSums(is.na(milk_cows)) != ncol(milk_cows), ] # result
milk_per_cow[milk_per_cow == ""] <- NA # define NA pattern
milk_per_cow_clean <- milk_per_cow[rowSums(is.na(milk_per_cow)) != ncol(milk_per_cow), ] # result
#Using tidyr::fill() (function of the week shout out!) to fill the regions.
milk_cows_clean <- milk_cows_clean %>%
fill(Region)
milk_per_cow_clean <- milk_per_cow_clean %>%
fill(Region)
#Subsetting to remove State variable since we are only interested in trends by region for now
milk_cows_clean2 <- milk_cows_clean[c(-2)]
milk_per_cow_clean2 <- milk_per_cow_clean[c(-2)]
#Pivoting the milk_cows_clean and milk_per_cow_clean data frames from wide to long format.
milk_cows_long <-
milk_cows_clean2 %>%
pivot_longer (cols = "1970":"2019",
names_to = "year",
values_to = "no_cows"
)
milk_per_cow_long <-
milk_per_cow_clean2 %>%
pivot_longer (cols = "1970":"2019",
names_to = "year",
values_to = "avg_prod"
)
#I want to look at the relationship between the number of cows and the average production over time and region. In order to do this, I need to perform an inner_join between milk_cows_long and milk_per_cow_long.
milk_cows_all <- milk_cows_long %>%
inner_join(milk_per_cow_long,
by=c("Region","year")
)
#For the final data frame, milk_soft, using janitor::clean_names() to tidy up column names
milk_soft <- clean_names(milk_soft)
#Renaming x1 to year in the milk_soft data frame
milk_soft <- rename(milk_soft, year = x1)
#For both data frames, milk_cows_all and milk_soft, creating categories for year by decade
milk_cows_all <-
milk_cows_all %>%
mutate(year_category =
case_when(year < 1980 ~ "70s",
year >= 1980 & year < 1990 ~ "80s",
year >= 1990 & year < 2000 ~ "90s",
year >= 2000 & year < 2010 ~ "2000s",
year >= 2010 ~ "2010+"
)
) %>%
mutate(year_category = factor(year_category,
levels = c("70s", "80s", "90s", "2000s", "2010+" )
)
)
milk_soft <-
milk_soft %>%
mutate(year_category =
case_when(year < 1980 ~ "70s",
year >= 1980 & year < 1990 ~ "80s",
year >= 1990 & year < 2000 ~ "90s",
year >= 2000 & year < 2010 ~ "2000s",
year >= 2010 ~ "2010+"
)
) %>%
mutate(year_category = factor(year_category,
levels = c("70s", "80s", "90s", "2000s", "2010+" )
)
)
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
#Visualizing the final data sets I have created
head(milk_cows_all)
## # A tibble: 6 x 5
## Region year no_cows avg_prod year_category
## <chr> <chr> <dbl> <dbl> <fct>
## 1 Northeast 1970 2306. 10503. 70s
## 2 Northeast 1970 2306. 9984 70s
## 3 Northeast 1970 2306. 9889 70s
## 4 Northeast 1970 2306. 10155 70s
## 5 Northeast 1970 2306. 10967 70s
## 6 Northeast 1970 2306. 10870 70s
head(milk_soft)
## # A tibble: 6 x 17
## year ice_cream_regul… ice_cream_low_f… ice_cream_non_f… frozen_yogurt sherbet
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1975 3932. 1404. NA NA 291.
## 2 1976 3846. 1345. NA NA 298.
## 3 1977 3808. 1448. NA NA 299.
## 4 1978 3834. 1458 NA NA 288.
## 5 1979 3813. 1395. NA NA 271.
## 6 1980 3898. 1378. NA NA 271.
## # … with 11 more variables: other_frozen_dairy <dbl>,
## # water_and_juice_ices <dbl>, yogurt_other_than_frozen <dbl>,
## # cottage_cheese <dbl>, sour_cream <dbl>, light_cream <dbl>,
## # heavy_cream <dbl>, light_cream_heavy_cream <dbl>, half_and_half <dbl>,
## # total_fluid_cream <dbl>, year_category <fct>
glimpse(milk_cows_all)
## Rows: 20,700
## Columns: 5
## $ Region <chr> "Northeast", "Northeast", "Northeast", "Northeast", "No…
## $ year <chr> "1970", "1970", "1970", "1970", "1970", "1970", "1970",…
## $ no_cows <dbl> 2306.4, 2306.4, 2306.4, 2306.4, 2306.4, 2306.4, 2306.4,…
## $ avg_prod <dbl> 10502.95, 9984.00, 9889.00, 10155.00, 10967.00, 10870.0…
## $ year_category <fct> 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, …
glimpse(milk_soft)
## Rows: 45
## Columns: 17
## $ year <dbl> 1975, 1976, 1977, 1978, 1979, 1980, 1981, 19…
## $ ice_cream_regular <dbl> 3931.8, 3845.8, 3807.7, 3834.1, 3813.2, 3898…
## $ ice_cream_low_fat <dbl> 1404.3, 1345.1, 1447.9, 1458.0, 1394.7, 1378…
## $ ice_cream_non_fat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ frozen_yogurt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ sherbet <dbl> 291.3, 297.5, 298.7, 288.2, 270.7, 271.1, 27…
## $ other_frozen_dairy <dbl> 163.0, 137.9, 134.9, 132.7, 125.6, 106.9, 18…
## $ water_and_juice_ices <dbl> 229.4, 228.0, 223.6, 203.8, 192.5, 200.3, 20…
## $ yogurt_other_than_frozen <dbl> 425.0000, 465.0000, 515.0000, 545.0000, 550.…
## $ cottage_cheese <dbl> 991, 1010, 1017, 1024, 998, 1004, 982, 966, …
## $ sour_cream <dbl> 350, 350, 364, 374, 395, 408, 424, 451, 484,…
## $ light_cream <dbl> 87, 76, 68, 70, 66, 55, 56, 62, 67, 74, 85, …
## $ heavy_cream <dbl> 119, 129, 126, 123, 139, 159, 166, 172, 196,…
## $ light_cream_heavy_cream <dbl> 206, 205, 194, 193, 205, 214, 222, 234, 263,…
## $ half_and_half <dbl> 514, 530, 536, 537, 543, 551, 568, 569, 599,…
## $ total_fluid_cream <dbl> 720, 735, 730, 730, 748, 765, 790, 803, 862,…
## $ year_category <fct> 70s, 70s, 70s, 70s, 70s, 80s, 80s, 80s, 80s,…
skim(milk_cows_all)
Name | milk_cows_all |
Number of rows | 20700 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 2 |
factor | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Region | 0 | 1 | 8 | 15 | 0 | 12 | 0 |
year | 0 | 1 | 4 | 4 | 0 | 50 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
year_category | 0 | 1 | FALSE | 5 | 70s: 4140, 80s: 4140, 90s: 4140, 200: 4140 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
no_cows | 6 | 1 | 325.45 | 685.12 | 0.3 | 30 | 102 | 319 | 12000 | ▇▁▁▁▁ |
avg_prod | 6 | 1 | 15400.75 | 4133.36 | 4636.0 | 12033 | 14903 | 18417 | 26725 | ▁▇▇▅▂ |
skim(milk_soft)
Name | milk_soft |
Number of rows | 45 |
Number of columns | 17 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 16 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
year_category | 0 | 1 | FALSE | 5 | 80s: 10, 90s: 10, 200: 10, 201: 10 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 1997.00 | 13.13 | 1975.0 | 1986.00 | 1997.00 | 2008.00 | 2019.00 | ▇▇▇▇▇ |
ice_cream_regular | 0 | 1.00 | 4157.82 | 239.39 | 3807.7 | 3976.10 | 4130.60 | 4334.60 | 4666.60 | ▇▇▃▃▃ |
ice_cream_low_fat | 0 | 1.00 | 1693.46 | 235.85 | 1317.1 | 1476.90 | 1695.90 | 1855.00 | 2135.60 | ▆▂▇▃▃ |
ice_cream_non_fat | 20 | 0.56 | 102.11 | 51.08 | 42.5 | 67.10 | 79.60 | 105.70 | 203.20 | ▇▇▁▁▃ |
frozen_yogurt | 14 | 0.69 | 514.63 | 188.55 | 276.2 | 391.45 | 441.10 | 573.45 | 910.80 | ▇▅▂▁▃ |
sherbet | 0 | 1.00 | 300.03 | 29.05 | 230.1 | 278.30 | 298.10 | 319.80 | 376.60 | ▂▆▇▅▁ |
other_frozen_dairy | 0 | 1.00 | 149.35 | 102.62 | 43.3 | 74.40 | 118.20 | 182.90 | 462.20 | ▇▃▂▁▁ |
water_and_juice_ices | 0 | 1.00 | 354.13 | 87.96 | 192.5 | 294.60 | 392.70 | 427.00 | 475.60 | ▅▃▂▆▇ |
yogurt_other_than_frozen | 0 | 1.00 | 2180.04 | 1514.91 | 425.0 | 940.00 | 1636.07 | 3573.11 | 4735.27 | ▇▃▁▂▅ |
cottage_cheese | 0 | 1.00 | 812.02 | 122.51 | 668.0 | 711.00 | 770.00 | 958.00 | 1024.00 | ▇▃▁▁▅ |
sour_cream | 0 | 1.00 | 848.14 | 345.88 | 350.0 | 565.00 | 794.00 | 1151.00 | 1425.66 | ▇▇▅▃▇ |
light_cream | 20 | 0.56 | 88.96 | 25.37 | 55.0 | 70.00 | 87.00 | 101.00 | 168.00 | ▇▇▅▁▁ |
heavy_cream | 20 | 0.56 | 282.32 | 131.39 | 119.0 | 166.00 | 272.00 | 349.00 | 555.00 | ▇▅▅▂▂ |
light_cream_heavy_cream | 18 | 0.60 | 400.81 | 181.90 | 193.0 | 228.00 | 393.00 | 476.00 | 797.00 | ▇▆▃▂▂ |
half_and_half | 18 | 0.60 | 736.89 | 165.64 | 514.0 | 568.50 | 755.00 | 826.00 | 1146.00 | ▇▅▇▂▁ |
total_fluid_cream | 13 | 0.71 | 1307.91 | 518.22 | 720.0 | 847.25 | 1176.00 | 1578.75 | 2459.30 | ▇▇▃▂▃ |
Are the values what you expected for the variables? Why or Why not?
I am actually really surprised by what initially looks like an inverse relationship between the number of cows and average milk produced per cow, based on the histograms for each.
Looking at the mean soft milk product consumed, I am not surprised, as I anticipated consumption of ice cream and yogurt would be greatest.
Otherwise, my transformations and data management all appear to have manipulated the data into the formats I was hoping for.
Use
group_by()/summarize()
to make a summary of the data here. The summary should be relevant to your research question
#First grouping milk_cows_all data by region and summarizing the median number of cows, and median milk production
milk_cows_all %>%
group_by(Region) %>%
summarize(median_no_cows = median(no_cows, na.rm=TRUE),
min=min(no_cows),
max=max(no_cows)) %>%
arrange(median_no_cows)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 4
## Region median_no_cows min max
## <chr> <dbl> <dbl> <dbl>
## 1 Other States 3.9 NA NA
## 2 Northeast 47 0.7 2306.
## 3 Delta States 68 5 440
## 4 Mountain 81 3.8 1483
## 5 Southeast 103 5 523
## 6 Northern Plains 115 15 682
## 7 Appalachian 126. 6 1044
## 8 Corn Belt 236 78 1814
## 9 Southern Plains 356. 37 606
## 10 West Coast 518. 89 2202
## 11 Lake States 1092. 299 3196
## 12 United States 9480 9010 12000
milk_cows_all %>%
group_by(Region) %>%
summarize(median_production = median(avg_prod, na.rm=TRUE),
min=min(avg_prod),
max=max(avg_prod)) %>%
arrange(median_production)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 4
## Region median_production min max
## <chr> <dbl> <dbl> <dbl>
## 1 Delta States 12032 5860 15750
## 2 Other States 13056. NA NA
## 3 Northern Plains 13695. 7774 24293
## 4 Appalachian 13753. 7192 21476
## 5 Southeast 13966. 6689 21905
## 6 Southern Plains 14395 8634 24513
## 7 Corn Belt 14938. 8938 24271
## 8 Northeast 15590. 9630 24118
## 9 Lake States 15791. 10120 26725
## 10 United States 16292 9751 23391
## 11 Mountain 17450 8359 25892
## 12 West Coast 18715 10000 24318
#Next grouping by year category and summarizing the median number of cows and median milk production
milk_cows_all %>%
group_by(year_category) %>%
summarize(median_no_cows = median(no_cows, na.rm=TRUE),
min=min(no_cows),
max=max(no_cows)) %>%
arrange(median_no_cows)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## year_category median_no_cows min max
## <fct> <dbl> <dbl> <dbl>
## 1 2010+ 90 NA NA
## 2 2000s 93 0.6 9314
## 3 90s 98 0.7 9993
## 4 80s 110. 1 11059
## 5 70s 140. 1.2 12000
milk_cows_all %>%
group_by(year_category) %>%
summarize(median_production = median(avg_prod, na.rm=TRUE),
min=min(avg_prod),
max=max(avg_prod)) %>%
arrange(median_production)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## year_category median_production min max
## <fct> <dbl> <dbl> <dbl>
## 1 70s 10568 5860 14472
## 2 80s 12690 8337 18552
## 3 90s 15440. 10947 22409
## 4 2000s 17889 10000 24320
## 5 2010+ 20548. NA NA
#Summarizing the median ice cream and yogurt consumption by year_category
milk_soft %>%
group_by(year_category) %>%
summarize(median_ice_cream = median(ice_cream_regular, na.rm=TRUE),
min=min(ice_cream_regular),
max=max(ice_cream_regular)) %>%
arrange(median_ice_cream)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## year_category median_ice_cream min max
## <fct> <dbl> <dbl> <dbl>
## 1 70s 3834. 3808. 3932.
## 2 90s 4041. 3851. 4519.
## 3 2010+ 4126. 3914. 4322.
## 4 80s 4137. 3896. 4354.
## 5 2000s 4488. 4280. 4667.
milk_soft %>%
group_by(year_category) %>%
summarize(median_yogurt = median(yogurt_other_than_frozen, na.rm=TRUE),
min=min(yogurt_other_than_frozen),
max=max(yogurt_other_than_frozen)) %>%
arrange(median_yogurt)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## year_category median_yogurt min max
## <fct> <dbl> <dbl> <dbl>
## 1 70s 515 425 550
## 2 80s 874. 560 1150
## 3 90s 1471. 974. 1712.
## 4 2000s 2886. 1835. 3830.
## 5 2010+ 4446. 4163. 4735.
What are your findings about the summary? Are they what you expected?
The median number of cows is greatest in the Lake States with 1,092.5 (299.0, 3196.0) cows. While I expected this region to be high, I am surprised the next region with greatest median number of cows is the West Coast with a median of 517.5 (89.0, 2202.0). I actually thought the regions of the plains states (Corn Belt, Northern Plains, and Southern Plains) would be higher than they are. Additionally, I did not expect the median number of cows in the West Coast to essentially be half that of the Lake States. Interestingly, the average milk produced per cow is greatest in the West Coast and Mountain states.
My observation from earlier was confirmed, in that the median number of cows has declined, from a median of 140.5 in the 1970s to a median of 90.0 between 2010 and 2019. This is in contrast to the average milk produced per cow, which has nearly doubled since the 1970s.
Ice cream consumption appears to have increased steadily since the 1970s while yogurt consumption is approximately 8x what it was in the 1970s.
Make at least two plots that help you answer your question on the transformed or summarized data.
#Loading Wes Anderson color palette
library (wesanderson)
#Scatter plot of number of cows by average milk produced per cow, colored by region
ggplot(milk_cows_all) +
aes(x = no_cows,
y = avg_prod,
color = Region) +
geom_point()+
labs (title = "Number of Cows by Average Milk Produced per Cow",
x = "Number of Cows",
y = "Average Milk Produced per Cow (lbs)") +
scale_fill_manual(values = wes_palette("Darjeeling2"))
## Warning: Removed 8 rows containing missing values (geom_point).
#Scatter plot of number of cows by year, colored by region
ggplot(milk_cows_all) +
aes(x = year,
y = no_cows,
color = Region) +
geom_point()+
labs (title = "Number of Cows over Time in the US, 1970-2019",
x = "Year",
y = "Number of Cows") +
scale_fill_manual(values = wes_palette("IsleofDogs1")) +
theme(axis.text.x=element_text(angle = -90, hjust = 0))
## Warning: Removed 6 rows containing missing values (geom_point).
#Scatter plot of number of cows by average milk produced per cow, colored by year category
ggplot(milk_cows_all) +
aes(x = no_cows,
y = avg_prod,
color = year_category) +
geom_point()+
labs (title = "Number of Cows by Average Milk Produced per Cow",
x = "Number of Cows",
y = "Average Milk Produced per Cow (lbs)") +
scale_fill_manual(values = wes_palette("Moonrise3"))
## Warning: Removed 8 rows containing missing values (geom_point).
#Box plot of yogurt consumption by year category
ggplot(milk_soft) +
aes(x = year_category,
y = yogurt_other_than_frozen) +
geom_boxplot(color="navy")+
theme(panel.background = element_rect(fill = 'grey75')) +
labs(title = "Yogurt Consumption by Year Category",
x = "Year Category",
y = "Yogurt Consumption")
#Scatter plot of yogurt consumption over time
ggplot(milk_soft)+
aes(x = year,
y = yogurt_other_than_frozen) +
geom_point(color="purple")+
labs (title = "Yogurt Consumption, 1975-2019",
x = "Year",
y = "Yogurt consumption (millions of lbs")
Summarize your research question and findings below.
Dairy consumption in the US has been declining, and in an effort to preserve the dairy, milk is being turned into a less-perishable form, cheese. However, processed cheese consumption has also been declining in the US, resulting in a nearly 1.4 billion stockpile of cheese in the last 10 years. Given this information, I sought to understand how production has changed over time, specifically by looking at the number of cows and average milk produced per cow over time and by region of the US. Additionally, given consumption of both milk and processed cheese has decreased in the US, I wanted to explore how consumption of other dairy products have changed over the same time period.
Are your findings what you expected? Why or Why not?
In accordance with the background research, my findings supported the increased production of milk over the last 30+ years, with production nearly doubling since the 1970s. To keep up with the supply, I hypothesized the number of cows in the US would have also increased over time, however, there is actually an inverse relationship between the number of cows in the US and the average amount of milk being produced per cow. Looking at other possible sources of dairy consumption, I observed ice cream consumption has steadily increased since 1970, however, I would hypothesize any increase seen here is proportional to the increase in population over the same time period. Yogurt consumption increased approximately 8x since the 1970s, and I would be interested to see if this finding is still significant after accounting for the population increase between 1975 and 2019. It is interesting to see that production continues, despite there being fewer cows and less demand for dairy.