Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Loading libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(readxl)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(forcats)
library(skimr)

Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

  1. Please schedule a time with Eric or Me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Given your question, what is your expectation about the data?

The US Department of Agriculture (USDA) publishes monthly reports for the US dairy industry.[1] Annual statistical analyses and at-a-glance reports are also provided. These reports highlight the production, consumption and commercial import/export of US dairy. According to the NPR article “Nobody is Moving our Cheese” and a similar article by The Washington Post,[2,3] milk production has been exceeding the rates of consumption for years. As a more sustainable and less perishable storage option, over the past 10 years, excess dairy is being turned into cheese.

However, the US population has also gradually steered away from eating processed cheese over the years, opting for European style cheeses instead. This shift in consumption has resulted in a nearly 1.4 billion stockpile of cheese.

I’m interested in understanding how this trend has impacted the cows and the amount of milk produced over time and by region. My hypothesis is that there will be a correlation in the number of cows and the average milk produced per cow. I’m curious to see if the supply sources and production, i.e., the cows and the milk they produce, has changed over time and by region of the US. Additionally, the articles focused specifically on milk and cheese consumption. I am interested to learn if the same trends can be seen in soft milk products, particularly ice cream and yogurt.

To illustrate this point, I will be utilizing two different datasets provided by the USDA. The first dataset is the number of cows and the average milk produced per cow, between 1970 and 2019, by state and region. The second dataset is the consumption, in millions of pounds, of soft dairy products between 1975 and 2019.

There are two separate datasets comprising the number of cows and average milk produced per cow by year and region. I will combine these two datasets by year and region to look at the trends in number of cows and dairy produced over time and by region.

[1] https://www.ers.usda.gov/data-products/dairy-data/documentation/#Loc3

[2,3] https://www.npr.org/2019/01/09/683339929/nobody-is-moving-our-cheese-american-surplus-reaches-record-high, https://www.washingtonpost.com/news/wonk/wp/2018/06/28/americas-cheese-stockpile-just-hit-an-all-time-high/?noredirect=on

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

#The first two data files correspond to the number of cows and the average milk produced per cow. The third data file corresponds to the soft milk product consumption over time. 

#The first spreadsheet contains the number of milk cows by region and state, 1970-2019
milk_cows <- read_excel("data/milkcowsandprod.xlsx", 
                             sheet=1, 
                             na="NA")

#The second spreadsheet contains the average milk produced (in lbs) per cow, by region and state, 1970-2019
milk_per_cow <- read_excel("data/milkcowsandprod.xlsx", 
                             sheet=2, 
                             na="NA")

#The third data file contains the consumption of soft dairy products over time, 1970-2019
milk_soft <- read_excel("data/softprod.xlsx", 
                             sheet=1, 
                             na="NA")
## New names:
## * `` -> ...1
#Visualizing the data to get a sense of what (if any) data management is required in each data set.
head(milk_cows)
## # A tibble: 6 x 52
##   Region State `1970` `1971` `1972` `1973` `1974` `1975` `1976` `1977` `1978`
##   <chr>  <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 North… <NA>  2306.  2267.  2234.  2181.  2195.   2218. 2206.   2194. 2170. 
## 2 <NA>   Maine   62     61     61     60     60      61    59      58    58  
## 3 <NA>   New …   36     35     33     33     32      33    31      31    31  
## 4 <NA>   Verm…  194    195    195    191    193     193   194     191   186  
## 5 <NA>   Mass…   60     59     57     55     54      55    54      51    49  
## 6 <NA>   Rhod…    6.9    6.3    6.2    5.9    5.9     6     5.5     5     4.7
## # … with 41 more variables: `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## #   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
## #   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
## #   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
## #   `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
## #   `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
## #   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
## #   `2017` <dbl>, `2018` <dbl>, `2019` <dbl>
head(milk_per_cow)
## # A tibble: 6 x 52
##   Region State `1970` `1971` `1972` `1973` `1974` `1975` `1976` `1977` `1978`
##   <chr>  <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 <NA>   <NA>     NA     NA     NA     NA     NA     NA     NA     NA     NA 
## 2 North… <NA>  10503. 10749. 10755. 10448. 10542. 10621. 11000. 11226. 11450.
## 3 <NA>   Maine  9984  10311  10459  10233  10183  10311  10644  11000  11052 
## 4 <NA>   New …  9889  10257  10697  10212  10281  10182  10806  10935  11000 
## 5 <NA>   Verm… 10155  10385  10456  10199  10078  10409  10789  10995  11484 
## 6 <NA>   Mass… 10967  11153  11035  10818  10981  10927  11074  11706  11673 
## # … with 41 more variables: `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## #   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
## #   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
## #   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
## #   `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
## #   `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
## #   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
## #   `2017` <dbl>, `2018` <dbl>, `2019` <dbl>
head(milk_soft)
## # A tibble: 6 x 16
##    ...1 `Ice Cream Regu… `Ice Cream Low-… `Ice Cream Non-… `Frozen yogurt`
##   <dbl>            <dbl>            <dbl>            <dbl>           <dbl>
## 1  1975            3932.            1404.               NA              NA
## 2  1976            3846.            1345.               NA              NA
## 3  1977            3808.            1448.               NA              NA
## 4  1978            3834.            1458                NA              NA
## 5  1979            3813.            1395.               NA              NA
## 6  1980            3898.            1378.               NA              NA
## # … with 11 more variables: Sherbet <dbl>, `Other frozen dairy` <dbl>, `Water
## #   and juice ices` <dbl>, `Yogurt other than frozen` <dbl>, `Cottage
## #   cheese` <dbl>, `Sour cream` <dbl>, `Light cream` <dbl>, `Heavy
## #   cream` <dbl>, `Light cream \r\n+ heavy cream` <dbl>, `Half and half` <dbl>,
## #   `Total fluid cream` <dbl>

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

In the milk_cows and milk_per_cow data files, there are empty rows between each region type. Additionally, there are several empty cells beneath each region, where the data is being described by state.

I want to first get rid of the empty rows. Then, I want to transpose the data from wide to long, so that I have a column for region, state, year, and number of cows. Similarly, in the milk_per_cow data file, I want a column for region, state, year, and average production per cow. Then, I will merge these two data files by year and region.

The milk_soft data file has some interesting variable names, which I will use janitor::clean_names() to tidy up.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Subset of the data and think of a question to answer the subset

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

#Removing rows containing all NAs from the milk_cows and milk_per_cow data frames

#The following code to remove rows with all NA values is courtesy of: https://stackoverflow.com/questions/51214357/how-to-remove-only-rows-that-have-all-na-in-r

milk_cows[milk_cows == ""] <- NA # define NA pattern
milk_cows_clean <- milk_cows[rowSums(is.na(milk_cows)) != ncol(milk_cows), ] # result

milk_per_cow[milk_per_cow == ""] <- NA # define NA pattern
milk_per_cow_clean <- milk_per_cow[rowSums(is.na(milk_per_cow)) != ncol(milk_per_cow), ] # result

#Using tidyr::fill() (function of the week shout out!) to fill the regions.
milk_cows_clean <- milk_cows_clean %>%
  fill(Region)

milk_per_cow_clean <- milk_per_cow_clean %>%
  fill(Region)

#Subsetting to remove State variable since we are only interested in trends by region for now
milk_cows_clean2 <- milk_cows_clean[c(-2)]
milk_per_cow_clean2 <- milk_per_cow_clean[c(-2)]

#Pivoting the milk_cows_clean and milk_per_cow_clean data frames from wide to long format.

milk_cows_long <-
  milk_cows_clean2 %>%
    pivot_longer (cols = "1970":"2019",
                  names_to = "year",
                  values_to = "no_cows"
 )

milk_per_cow_long <-
  milk_per_cow_clean2 %>%
    pivot_longer (cols = "1970":"2019",
                  names_to = "year",
                  values_to = "avg_prod"
 )

#I want to look at the relationship between the number of cows and the average production over time and region. In order to do this, I need to perform an inner_join between milk_cows_long and milk_per_cow_long.

milk_cows_all <- milk_cows_long %>%
  inner_join(milk_per_cow_long, 
             by=c("Region","year")
             )

#For the final data frame, milk_soft, using janitor::clean_names() to tidy up column names
milk_soft <- clean_names(milk_soft)

#Renaming x1 to year in the milk_soft data frame
milk_soft <- rename(milk_soft, year = x1)

#For both data frames, milk_cows_all and milk_soft, creating categories for year by decade
milk_cows_all <-
milk_cows_all %>% 
    mutate(year_category = 
               case_when(year < 1980 ~ "70s",
                         year >= 1980 & year < 1990 ~ "80s",
                         year >= 1990 & year < 2000 ~ "90s",
                         year >= 2000 & year < 2010 ~ "2000s",
                         year >= 2010 ~ "2010+"
                         )
           ) %>%
    mutate(year_category = factor(year_category,
                                 levels = c("70s", "80s", "90s", "2000s", "2010+"                                           )
                                 )
           )

milk_soft <-
milk_soft %>% 
    mutate(year_category = 
               case_when(year < 1980 ~ "70s",
                         year >= 1980 & year < 1990 ~ "80s",
                         year >= 1990 & year < 2000 ~ "90s",
                         year >= 2000 & year < 2010 ~ "2000s",
                         year >= 2010 ~ "2010+"
                         )
           ) %>%
    mutate(year_category = factor(year_category,
                                 levels = c("70s", "80s", "90s", "2000s", "2010+"                                           )
                                 )
           )

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

#Visualizing the final data sets I have created
head(milk_cows_all)
## # A tibble: 6 x 5
##   Region    year  no_cows avg_prod year_category
##   <chr>     <chr>   <dbl>    <dbl> <fct>        
## 1 Northeast 1970    2306.   10503. 70s          
## 2 Northeast 1970    2306.    9984  70s          
## 3 Northeast 1970    2306.    9889  70s          
## 4 Northeast 1970    2306.   10155  70s          
## 5 Northeast 1970    2306.   10967  70s          
## 6 Northeast 1970    2306.   10870  70s
head(milk_soft)
## # A tibble: 6 x 17
##    year ice_cream_regul… ice_cream_low_f… ice_cream_non_f… frozen_yogurt sherbet
##   <dbl>            <dbl>            <dbl>            <dbl>         <dbl>   <dbl>
## 1  1975            3932.            1404.               NA            NA    291.
## 2  1976            3846.            1345.               NA            NA    298.
## 3  1977            3808.            1448.               NA            NA    299.
## 4  1978            3834.            1458                NA            NA    288.
## 5  1979            3813.            1395.               NA            NA    271.
## 6  1980            3898.            1378.               NA            NA    271.
## # … with 11 more variables: other_frozen_dairy <dbl>,
## #   water_and_juice_ices <dbl>, yogurt_other_than_frozen <dbl>,
## #   cottage_cheese <dbl>, sour_cream <dbl>, light_cream <dbl>,
## #   heavy_cream <dbl>, light_cream_heavy_cream <dbl>, half_and_half <dbl>,
## #   total_fluid_cream <dbl>, year_category <fct>
glimpse(milk_cows_all)
## Rows: 20,700
## Columns: 5
## $ Region        <chr> "Northeast", "Northeast", "Northeast", "Northeast", "No…
## $ year          <chr> "1970", "1970", "1970", "1970", "1970", "1970", "1970",…
## $ no_cows       <dbl> 2306.4, 2306.4, 2306.4, 2306.4, 2306.4, 2306.4, 2306.4,…
## $ avg_prod      <dbl> 10502.95, 9984.00, 9889.00, 10155.00, 10967.00, 10870.0…
## $ year_category <fct> 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, 70s, …
glimpse(milk_soft)
## Rows: 45
## Columns: 17
## $ year                     <dbl> 1975, 1976, 1977, 1978, 1979, 1980, 1981, 19…
## $ ice_cream_regular        <dbl> 3931.8, 3845.8, 3807.7, 3834.1, 3813.2, 3898…
## $ ice_cream_low_fat        <dbl> 1404.3, 1345.1, 1447.9, 1458.0, 1394.7, 1378…
## $ ice_cream_non_fat        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ frozen_yogurt            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ sherbet                  <dbl> 291.3, 297.5, 298.7, 288.2, 270.7, 271.1, 27…
## $ other_frozen_dairy       <dbl> 163.0, 137.9, 134.9, 132.7, 125.6, 106.9, 18…
## $ water_and_juice_ices     <dbl> 229.4, 228.0, 223.6, 203.8, 192.5, 200.3, 20…
## $ yogurt_other_than_frozen <dbl> 425.0000, 465.0000, 515.0000, 545.0000, 550.…
## $ cottage_cheese           <dbl> 991, 1010, 1017, 1024, 998, 1004, 982, 966, …
## $ sour_cream               <dbl> 350, 350, 364, 374, 395, 408, 424, 451, 484,…
## $ light_cream              <dbl> 87, 76, 68, 70, 66, 55, 56, 62, 67, 74, 85, …
## $ heavy_cream              <dbl> 119, 129, 126, 123, 139, 159, 166, 172, 196,…
## $ light_cream_heavy_cream  <dbl> 206, 205, 194, 193, 205, 214, 222, 234, 263,…
## $ half_and_half            <dbl> 514, 530, 536, 537, 543, 551, 568, 569, 599,…
## $ total_fluid_cream        <dbl> 720, 735, 730, 730, 748, 765, 790, 803, 862,…
## $ year_category            <fct> 70s, 70s, 70s, 70s, 70s, 80s, 80s, 80s, 80s,…
skim(milk_cows_all)
Data summary
Name milk_cows_all
Number of rows 20700
Number of columns 5
_______________________
Column type frequency:
character 2
factor 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Region 0 1 8 15 0 12 0
year 0 1 4 4 0 50 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
year_category 0 1 FALSE 5 70s: 4140, 80s: 4140, 90s: 4140, 200: 4140

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
no_cows 6 1 325.45 685.12 0.3 30 102 319 12000 ▇▁▁▁▁
avg_prod 6 1 15400.75 4133.36 4636.0 12033 14903 18417 26725 ▁▇▇▅▂
skim(milk_soft)
Data summary
Name milk_soft
Number of rows 45
Number of columns 17
_______________________
Column type frequency:
factor 1
numeric 16
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
year_category 0 1 FALSE 5 80s: 10, 90s: 10, 200: 10, 201: 10

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 1997.00 13.13 1975.0 1986.00 1997.00 2008.00 2019.00 ▇▇▇▇▇
ice_cream_regular 0 1.00 4157.82 239.39 3807.7 3976.10 4130.60 4334.60 4666.60 ▇▇▃▃▃
ice_cream_low_fat 0 1.00 1693.46 235.85 1317.1 1476.90 1695.90 1855.00 2135.60 ▆▂▇▃▃
ice_cream_non_fat 20 0.56 102.11 51.08 42.5 67.10 79.60 105.70 203.20 ▇▇▁▁▃
frozen_yogurt 14 0.69 514.63 188.55 276.2 391.45 441.10 573.45 910.80 ▇▅▂▁▃
sherbet 0 1.00 300.03 29.05 230.1 278.30 298.10 319.80 376.60 ▂▆▇▅▁
other_frozen_dairy 0 1.00 149.35 102.62 43.3 74.40 118.20 182.90 462.20 ▇▃▂▁▁
water_and_juice_ices 0 1.00 354.13 87.96 192.5 294.60 392.70 427.00 475.60 ▅▃▂▆▇
yogurt_other_than_frozen 0 1.00 2180.04 1514.91 425.0 940.00 1636.07 3573.11 4735.27 ▇▃▁▂▅
cottage_cheese 0 1.00 812.02 122.51 668.0 711.00 770.00 958.00 1024.00 ▇▃▁▁▅
sour_cream 0 1.00 848.14 345.88 350.0 565.00 794.00 1151.00 1425.66 ▇▇▅▃▇
light_cream 20 0.56 88.96 25.37 55.0 70.00 87.00 101.00 168.00 ▇▇▅▁▁
heavy_cream 20 0.56 282.32 131.39 119.0 166.00 272.00 349.00 555.00 ▇▅▅▂▂
light_cream_heavy_cream 18 0.60 400.81 181.90 193.0 228.00 393.00 476.00 797.00 ▇▆▃▂▂
half_and_half 18 0.60 736.89 165.64 514.0 568.50 755.00 826.00 1146.00 ▇▅▇▂▁
total_fluid_cream 13 0.71 1307.91 518.22 720.0 847.25 1176.00 1578.75 2459.30 ▇▇▃▂▃

Are the values what you expected for the variables? Why or Why not?

I am actually really surprised by what initially looks like an inverse relationship between the number of cows and average milk produced per cow, based on the histograms for each.

Looking at the mean soft milk product consumed, I am not surprised, as I anticipated consumption of ice cream and yogurt would be greatest.

Otherwise, my transformations and data management all appear to have manipulated the data into the formats I was hoping for.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

#First grouping milk_cows_all data by region and summarizing the median number of cows, and median milk production

milk_cows_all %>%
  group_by(Region) %>%
  summarize(median_no_cows = median(no_cows, na.rm=TRUE), 
            min=min(no_cows), 
            max=max(no_cows)) %>%
  arrange(median_no_cows)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 4
##    Region          median_no_cows    min    max
##    <chr>                    <dbl>  <dbl>  <dbl>
##  1 Other States               3.9   NA      NA 
##  2 Northeast                 47      0.7  2306.
##  3 Delta States              68      5     440 
##  4 Mountain                  81      3.8  1483 
##  5 Southeast                103      5     523 
##  6 Northern Plains          115     15     682 
##  7 Appalachian              126.     6    1044 
##  8 Corn Belt                236     78    1814 
##  9 Southern Plains          356.    37     606 
## 10 West Coast               518.    89    2202 
## 11 Lake States             1092.   299    3196 
## 12 United States           9480   9010   12000
milk_cows_all %>%
  group_by(Region) %>%
  summarize(median_production = median(avg_prod, na.rm=TRUE), 
            min=min(avg_prod), 
            max=max(avg_prod)) %>%
  arrange(median_production)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 4
##    Region          median_production   min   max
##    <chr>                       <dbl> <dbl> <dbl>
##  1 Delta States               12032   5860 15750
##  2 Other States               13056.    NA    NA
##  3 Northern Plains            13695.  7774 24293
##  4 Appalachian                13753.  7192 21476
##  5 Southeast                  13966.  6689 21905
##  6 Southern Plains            14395   8634 24513
##  7 Corn Belt                  14938.  8938 24271
##  8 Northeast                  15590.  9630 24118
##  9 Lake States                15791. 10120 26725
## 10 United States              16292   9751 23391
## 11 Mountain                   17450   8359 25892
## 12 West Coast                 18715  10000 24318
#Next grouping by year category and summarizing the median number of cows and median milk production

milk_cows_all %>%
  group_by(year_category) %>%
  summarize(median_no_cows = median(no_cows, na.rm=TRUE), 
            min=min(no_cows), 
            max=max(no_cows)) %>%
  arrange(median_no_cows)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   year_category median_no_cows   min   max
##   <fct>                  <dbl> <dbl> <dbl>
## 1 2010+                    90   NA      NA
## 2 2000s                    93    0.6  9314
## 3 90s                      98    0.7  9993
## 4 80s                     110.   1   11059
## 5 70s                     140.   1.2 12000
milk_cows_all %>%
  group_by(year_category) %>%
  summarize(median_production = median(avg_prod, na.rm=TRUE), 
            min=min(avg_prod), 
            max=max(avg_prod)) %>%
  arrange(median_production)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   year_category median_production   min   max
##   <fct>                     <dbl> <dbl> <dbl>
## 1 70s                      10568   5860 14472
## 2 80s                      12690   8337 18552
## 3 90s                      15440. 10947 22409
## 4 2000s                    17889  10000 24320
## 5 2010+                    20548.    NA    NA
#Summarizing the median ice cream and yogurt consumption by year_category

milk_soft %>%
  group_by(year_category) %>%
  summarize(median_ice_cream = median(ice_cream_regular, na.rm=TRUE), 
            min=min(ice_cream_regular), 
            max=max(ice_cream_regular)) %>%
  arrange(median_ice_cream)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   year_category median_ice_cream   min   max
##   <fct>                    <dbl> <dbl> <dbl>
## 1 70s                      3834. 3808. 3932.
## 2 90s                      4041. 3851. 4519.
## 3 2010+                    4126. 3914. 4322.
## 4 80s                      4137. 3896. 4354.
## 5 2000s                    4488. 4280. 4667.
milk_soft %>%
  group_by(year_category) %>%
  summarize(median_yogurt = median(yogurt_other_than_frozen, na.rm=TRUE),
            min=min(yogurt_other_than_frozen), 
            max=max(yogurt_other_than_frozen)) %>%
  arrange(median_yogurt)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   year_category median_yogurt   min   max
##   <fct>                 <dbl> <dbl> <dbl>
## 1 70s                    515   425   550 
## 2 80s                    874.  560  1150 
## 3 90s                   1471.  974. 1712.
## 4 2000s                 2886. 1835. 3830.
## 5 2010+                 4446. 4163. 4735.

What are your findings about the summary? Are they what you expected?

The median number of cows is greatest in the Lake States with 1,092.5 (299.0, 3196.0) cows. While I expected this region to be high, I am surprised the next region with greatest median number of cows is the West Coast with a median of 517.5 (89.0, 2202.0). I actually thought the regions of the plains states (Corn Belt, Northern Plains, and Southern Plains) would be higher than they are. Additionally, I did not expect the median number of cows in the West Coast to essentially be half that of the Lake States. Interestingly, the average milk produced per cow is greatest in the West Coast and Mountain states.

My observation from earlier was confirmed, in that the median number of cows has declined, from a median of 140.5 in the 1970s to a median of 90.0 between 2010 and 2019. This is in contrast to the average milk produced per cow, which has nearly doubled since the 1970s.

Ice cream consumption appears to have increased steadily since the 1970s while yogurt consumption is approximately 8x what it was in the 1970s.

Make at least two plots that help you answer your question on the transformed or summarized data.

#Loading Wes Anderson color palette
library (wesanderson)

#Scatter plot of number of cows by average milk produced per cow, colored by region
ggplot(milk_cows_all) + 
       aes(x = no_cows, 
           y = avg_prod, 
           color =  Region) + 
    geom_point()+
    
    labs (title = "Number of Cows by Average Milk Produced per Cow",
          x = "Number of Cows",
          y = "Average Milk Produced per Cow (lbs)") +
    
    scale_fill_manual(values = wes_palette("Darjeeling2"))
## Warning: Removed 8 rows containing missing values (geom_point).

#Scatter plot of number of cows by year, colored by region
ggplot(milk_cows_all) + 
       aes(x = year, 
           y = no_cows, 
           color =  Region) + 
    geom_point()+
    
    labs (title = "Number of Cows over Time in the US, 1970-2019",
          x = "Year",
          y = "Number of Cows") +
    
    scale_fill_manual(values = wes_palette("IsleofDogs1")) +
  
    theme(axis.text.x=element_text(angle = -90, hjust = 0))
## Warning: Removed 6 rows containing missing values (geom_point).

#Scatter plot of number of cows by average milk produced per cow, colored by year category
ggplot(milk_cows_all) + 
       aes(x = no_cows, 
           y = avg_prod, 
           color =  year_category) + 
    geom_point()+
    
    labs (title = "Number of Cows by Average Milk Produced per Cow",
          x = "Number of Cows",
          y = "Average Milk Produced per Cow (lbs)") +
    
    scale_fill_manual(values = wes_palette("Moonrise3"))
## Warning: Removed 8 rows containing missing values (geom_point).

#Box plot of yogurt consumption by year category
ggplot(milk_soft) +
  aes(x = year_category, 
      y = yogurt_other_than_frozen) +
  
  geom_boxplot(color="navy")+
  theme(panel.background = element_rect(fill = 'grey75')) +

 labs(title = "Yogurt Consumption by Year Category",
       x = "Year Category",
       y = "Yogurt Consumption")

#Scatter plot of yogurt consumption over time
ggplot(milk_soft)+ 
       aes(x = year, 
           y = yogurt_other_than_frozen) +
    geom_point(color="purple")+
    
    labs (title = "Yogurt Consumption, 1975-2019",
          x = "Year",
          y = "Yogurt consumption (millions of lbs") 

Final Summary (10 points)

Summarize your research question and findings below.

Dairy consumption in the US has been declining, and in an effort to preserve the dairy, milk is being turned into a less-perishable form, cheese. However, processed cheese consumption has also been declining in the US, resulting in a nearly 1.4 billion stockpile of cheese in the last 10 years. Given this information, I sought to understand how production has changed over time, specifically by looking at the number of cows and average milk produced per cow over time and by region of the US. Additionally, given consumption of both milk and processed cheese has decreased in the US, I wanted to explore how consumption of other dairy products have changed over the same time period.

Are your findings what you expected? Why or Why not?

In accordance with the background research, my findings supported the increased production of milk over the last 30+ years, with production nearly doubling since the 1970s. To keep up with the supply, I hypothesized the number of cows in the US would have also increased over time, however, there is actually an inverse relationship between the number of cows in the US and the average amount of milk being produced per cow. Looking at other possible sources of dairy consumption, I observed ice cream consumption has steadily increased since 1970, however, I would hypothesize any increase seen here is proportional to the increase in population over the same time period. Yogurt consumption increased approximately 8x since the 1970s, and I would be interested to see if this finding is still significant after accounting for the population increase between 1975 and 2019. It is interesting to see that production continues, despite there being fewer cows and less demand for dairy.