Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

# This code came from 

Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

  1. Please schedule a time with Eric or Me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Sam nicely worded our research question and expectations below.

With CO2 emissions being so closely linked with climate change, it’s useful to better understand its sources. For example, food production, especially beef, is a large source of CO2 emission. The data we found include information about food consumption in 11 categories for 129 countries around the world. Specifically the kg/person/year of each food category is recorded as well as the kg/person/year of CO2 emission associated with these categories. Since beef is known to be a significant contributor to CO2 emission, we hypothesize that countries with the greatest consumption of beef will have the greatest relative CO2 emissions.

Given your question, what is your expectation about the data?

To get a better idea of per person CO2 emissions, we will combine the country population table (built into R) to be able to come up with per person CO2 information. To narrow the scope, we will subset the data to specifically Asian countries. We expect that Asian countries with the greatest percentage of beef consumption will have the highest CO2 output.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# load in datasets and the tidyverse; readr::read_csv part copy and pasted from data source

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
food_consumption <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   country = col_character(),
##   food_category = col_character(),
##   consumption = col_double(),
##   co2_emmission = col_double()
## )
continents <- read.csv("data/continents.csv")
pop_fixed <- read.csv("data/pop_fixed.csv")

Our main dataset came from: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-18/readme.md For filtering purposes, we obtained a dataset of countries with the continent they belong to from: https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv Lastly, we thought population data would also be helpful for our research question. R has a built-in population dataset that we used.

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

A few of the country names differed slightly between datasets which created NAs, so we needed to fix the names of the countries so they all matched each other.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Subset of the data and think of a question to answer the subset

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

#code for mutate_if came from: https://dplyr.tidyverse.org/reference/mutate_all.html
#Switching all character vectors to factor vectors
food_consumption <- food_consumption %>% 
  mutate_if(is.character, as.factor)
continents <- continents %>%
  mutate_if(is.character,as.factor)

#Making the column names match
continents <- continents %>%
  rename(country=Country)

#Joining the table
food <- food_consumption %>%
  left_join(y=continents,by=("country"="country")) %>%
  left_join(y=pop_fixed,by=("country"="country")) %>%
  filter(year=="2013") %>%
  select(-c("year"))

Lisa did the data cleaning and transformation. We decided to go with population data from the year 2013 because it was most recent.

To start fixing the NAs, we first have to find which countries produce NAs for continent.

food %>%
  filter(is.na(Continent)) 
## # A tibble: 99 x 6
##    country food_category          consumption co2_emmission Continent population
##    <chr>   <fct>                        <dbl>         <dbl> <fct>          <int>
##  1 USA     Pork                         27.6          97.8  <NA>       320050716
##  2 USA     Poultry                      50.0          53.7  <NA>       320050716
##  3 USA     Beef                         36.2        1118.   <NA>       320050716
##  4 USA     Lamb & Goat                   0.43         15.1  <NA>       320050716
##  5 USA     Fish                         12.4          19.7  <NA>       320050716
##  6 USA     Eggs                         14.6          13.4  <NA>       320050716
##  7 USA     Milk - inc. cheese          255.          363.   <NA>       320050716
##  8 USA     Wheat and Wheat Produ…       80.4          15.3  <NA>       320050716
##  9 USA     Rice                          6.88          8.8  <NA>       320050716
## 10 USA     Soybeans                      0.04          0.02 <NA>       320050716
## # … with 89 more rows

USA, Bermuda, Hong Kong SAR. China, French Polynesia, Russia, New Calcedonia, Czech Republic, South Korea, Taiwan. ROC, and Myanmar did not have matches with the continents dataset. So using our own research, we then used case_when() to match them with their proper countries.

#Turning into NAs into the proper continents
food <- food %>%
  mutate(Continent = case_when(country=="USA" | country=="Bermuda" ~ "North America", country=="Hong Kong SAR. China" | country=="South Korea" | country=="Taiwan. ROC" | country=="Myanmar" ~ "Asia", country=="Russia" | country=="New Caledonia" | country=="Czech Republic" ~ "Europe", country=="French Polynesia" ~ "Oceania", TRUE ~ as.character(Continent)))

#Checking to see if we have any NAs left
food %>%
  filter(is.na(Continent)) 
## # A tibble: 0 x 6
## # … with 6 variables: country <chr>, food_category <fct>, consumption <dbl>,
## #   co2_emmission <dbl>, Continent <chr>, population <int>

Now it shows no NAs!

food <- food %>%
  mutate(totalCO2 = co2_emmission*population)

We wanted to show total CO2 emmission from each food, so we created a new variable that multiplies per person emmissions by the population of the country.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

#Subsetting to the data to Asian countries only
food_Asia <- food %>%
  filter(Continent=="Asia") 

food_Asia <- food_Asia %>% 
  mutate_if(is.character, as.factor)

food_Asia %>%
  glimpse()
## Rows: 286
## Columns: 7
## $ country       <fct> Kazakhstan, Kazakhstan, Kazakhstan, Kazakhstan, Kazakhs…
## $ food_category <fct> Pork, Poultry, Beef, Lamb & Goat, Fish, Eggs, Milk - in…
## $ consumption   <dbl> 10.36, 18.38, 23.38, 9.56, 5.21, 8.29, 288.12, 92.31, 7…
## $ co2_emmission <dbl> 36.67, 19.74, 721.46, 334.79, 8.32, 7.62, 410.40, 17.60…
## $ Continent     <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, A…
## $ population    <int> 16440586, 16440586, 16440586, 16440586, 16440586, 16440…
## $ totalCO2      <dbl> 6.028763e+08, 3.245372e+08, 1.186123e+10, 5.504144e+09,…

Are the values what you expected for the variables? Why or Why not?

Yes, though it took some modification to get country names to match between the continent and pouplation tables that we joined. After adjusting the names, the datasets combined nicely and we now have a solid base to dive into our research question.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

Lisa made simple tables initially, and Sam used the kableExtra package to format them nicely and make everything tidyer all over.

everything <- sum(food_Asia$totalCO2)
everything2 <- sum(food_Asia$consumption[food_Asia$food_category=="Beef"])
table1_df <- food_Asia %>% 
  group_by(country) %>%
  summarise(totalConsumption=sum(consumption*population/1e9),
            pctbeef=consumption[food_category=="Beef"]/everything2*100,
            totCO2=sum(totalCO2)/1e9,
            pctall = (totCO2*1e9)/everything*100,
            adjCO2 = sum(totalCO2)/(population[food_category == "Beef"])
            ) %>%
  mutate(across(where(is.numeric), round, 2)) %>%
  arrange(desc(pctbeef))
## `summarise()` ungrouping output (override with `.groups` argument)

What are your findings about the summary? Are they what you expected?

Through this step in analysis we are able to see what countries have the highest relative beef consumption in Asia (Israel, Hong Kong, Kazakhstan) and their CO2 output. Since these countries are relatively low-population compared to other countries in Asia, so the overall CO2 emission for the countries is low. We need to factor in population to show if the per person CO2 emission is higher in these places, to align with our research question. Table 2 shows that the countries with the greatest overall consumption of all food categories is in a different order than overall consumption of beef.

Make at least two plots that help you answer your question on the transformed or summarized data.

Lisa made the first two plots, and Sam made the third.

library(viridisLite)

#Consumption by food category by country
food_Asia %>%
  ggplot(aes(x=food_category,y=consumption,fill=food_category)) +
  geom_bar(stat='identity') +
  facet_wrap(~country) +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  scale_fill_viridis_d()

#Total consumption from food by country
food_Asia %>%
  ggplot(aes(x=country,y=totalCO2,fill=country)) +
  geom_bar(stat="identity") +
   theme(axis.text.x = element_text(angle = 90)) +
  scale_fill_viridis_d()

# scatter plot total beef vs total consumption
consumption_scatter <- ggplot(table1_df)+
  aes(x=pctbeef,y=totCO2, color = "red", label=country) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "% Total Beef Consumption in Asia", y = "Total CO2") +
  geom_text(aes(label=country), hjust=0, vjust=-1.5, size = 2, check_overlap = TRUE) 
consumption_scatter

Final Summary (10 points)

both Lisa and Sam

Summarize your research question and findings below.

Our hypothesis after transforming and subsetting the data is that Asian countries with the greatest per-person consumption of beef will have the greatest overall per person emission of CO2. After plotting the percentage of beef consumption by country vs overall total CO2 emission per country, we see that CO2 emission is mostly correlated with population. While China is on the lower end of beef consumption per person, its total CO2 emissions are highest.

Are your findings what you expected? Why or Why not?

Our findings are not quite what we expected. It may seem obvious given that beef production does produce a much greater magnitude of CO2, but we were unsure whether a greater total consumption of all other food categories would outweigh the effects of eating more beef. It looks like that was the case with population and therefore total consumption having the strongest connection to CO2 emissions. Doing more rigorous statistical analysis would help to tease out more complex relationships, and comparing continents would be good next steps for establishing this relationship.