Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
# This code came from
Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
.csv
file into your data
folder.You may use another dataset or your own data, but please make sure it is de-identified.
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Sam nicely worded our research question and expectations below.
With CO2 emissions being so closely linked with climate change, it’s useful to better understand its sources. For example, food production, especially beef, is a large source of CO2 emission. The data we found include information about food consumption in 11 categories for 129 countries around the world. Specifically the kg/person/year of each food category is recorded as well as the kg/person/year of CO2 emission associated with these categories. Since beef is known to be a significant contributor to CO2 emission, we hypothesize that countries with the greatest consumption of beef will have the greatest relative CO2 emissions.
Given your question, what is your expectation about the data?
To get a better idea of per person CO2 emissions, we will combine the country population table (built into R) to be able to come up with per person CO2 information. To narrow the scope, we will subset the data to specifically Asian countries. We expect that Asian countries with the greatest percentage of beef consumption will have the highest CO2 output.
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
# load in datasets and the tidyverse; readr::read_csv part copy and pasted from data source
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
food_consumption <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## country = col_character(),
## food_category = col_character(),
## consumption = col_double(),
## co2_emmission = col_double()
## )
continents <- read.csv("data/continents.csv")
pop_fixed <- read.csv("data/pop_fixed.csv")
Our main dataset came from: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-18/readme.md For filtering purposes, we obtained a dataset of countries with the continent they belong to from: https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv Lastly, we thought population data would also be helpful for our research question. R has a built-in population dataset that we used.
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
A few of the country names differed slightly between datasets which created NAs, so we needed to fix the names of the countries so they all matched each other.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
Subset of the data and think of a question to answer the subset
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
#code for mutate_if came from: https://dplyr.tidyverse.org/reference/mutate_all.html
#Switching all character vectors to factor vectors
food_consumption <- food_consumption %>%
mutate_if(is.character, as.factor)
continents <- continents %>%
mutate_if(is.character,as.factor)
#Making the column names match
continents <- continents %>%
rename(country=Country)
#Joining the table
food <- food_consumption %>%
left_join(y=continents,by=("country"="country")) %>%
left_join(y=pop_fixed,by=("country"="country")) %>%
filter(year=="2013") %>%
select(-c("year"))
Lisa did the data cleaning and transformation. We decided to go with population data from the year 2013 because it was most recent.
To start fixing the NAs, we first have to find which countries produce NAs for continent.
food %>%
filter(is.na(Continent))
## # A tibble: 99 x 6
## country food_category consumption co2_emmission Continent population
## <chr> <fct> <dbl> <dbl> <fct> <int>
## 1 USA Pork 27.6 97.8 <NA> 320050716
## 2 USA Poultry 50.0 53.7 <NA> 320050716
## 3 USA Beef 36.2 1118. <NA> 320050716
## 4 USA Lamb & Goat 0.43 15.1 <NA> 320050716
## 5 USA Fish 12.4 19.7 <NA> 320050716
## 6 USA Eggs 14.6 13.4 <NA> 320050716
## 7 USA Milk - inc. cheese 255. 363. <NA> 320050716
## 8 USA Wheat and Wheat Produ… 80.4 15.3 <NA> 320050716
## 9 USA Rice 6.88 8.8 <NA> 320050716
## 10 USA Soybeans 0.04 0.02 <NA> 320050716
## # … with 89 more rows
USA, Bermuda, Hong Kong SAR. China, French Polynesia, Russia, New Calcedonia, Czech Republic, South Korea, Taiwan. ROC, and Myanmar did not have matches with the continents dataset. So using our own research, we then used case_when() to match them with their proper countries.
#Turning into NAs into the proper continents
food <- food %>%
mutate(Continent = case_when(country=="USA" | country=="Bermuda" ~ "North America", country=="Hong Kong SAR. China" | country=="South Korea" | country=="Taiwan. ROC" | country=="Myanmar" ~ "Asia", country=="Russia" | country=="New Caledonia" | country=="Czech Republic" ~ "Europe", country=="French Polynesia" ~ "Oceania", TRUE ~ as.character(Continent)))
#Checking to see if we have any NAs left
food %>%
filter(is.na(Continent))
## # A tibble: 0 x 6
## # … with 6 variables: country <chr>, food_category <fct>, consumption <dbl>,
## # co2_emmission <dbl>, Continent <chr>, population <int>
Now it shows no NAs!
food <- food %>%
mutate(totalCO2 = co2_emmission*population)
We wanted to show total CO2 emmission from each food, so we created a new variable that multiplies per person emmissions by the population of the country.
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
#Subsetting to the data to Asian countries only
food_Asia <- food %>%
filter(Continent=="Asia")
food_Asia <- food_Asia %>%
mutate_if(is.character, as.factor)
food_Asia %>%
glimpse()
## Rows: 286
## Columns: 7
## $ country <fct> Kazakhstan, Kazakhstan, Kazakhstan, Kazakhstan, Kazakhs…
## $ food_category <fct> Pork, Poultry, Beef, Lamb & Goat, Fish, Eggs, Milk - in…
## $ consumption <dbl> 10.36, 18.38, 23.38, 9.56, 5.21, 8.29, 288.12, 92.31, 7…
## $ co2_emmission <dbl> 36.67, 19.74, 721.46, 334.79, 8.32, 7.62, 410.40, 17.60…
## $ Continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, A…
## $ population <int> 16440586, 16440586, 16440586, 16440586, 16440586, 16440…
## $ totalCO2 <dbl> 6.028763e+08, 3.245372e+08, 1.186123e+10, 5.504144e+09,…
Are the values what you expected for the variables? Why or Why not?
Yes, though it took some modification to get country names to match between the continent and pouplation tables that we joined. After adjusting the names, the datasets combined nicely and we now have a solid base to dive into our research question.
Use
group_by()/summarize()
to make a summary of the data here. The summary should be relevant to your research question
Lisa made simple tables initially, and Sam used the kableExtra package to format them nicely and make everything tidyer all over.
everything <- sum(food_Asia$totalCO2)
everything2 <- sum(food_Asia$consumption[food_Asia$food_category=="Beef"])
table1_df <- food_Asia %>%
group_by(country) %>%
summarise(totalConsumption=sum(consumption*population/1e9),
pctbeef=consumption[food_category=="Beef"]/everything2*100,
totCO2=sum(totalCO2)/1e9,
pctall = (totCO2*1e9)/everything*100,
adjCO2 = sum(totalCO2)/(population[food_category == "Beef"])
) %>%
mutate(across(where(is.numeric), round, 2)) %>%
arrange(desc(pctbeef))
## `summarise()` ungrouping output (override with `.groups` argument)
What are your findings about the summary? Are they what you expected?
Through this step in analysis we are able to see what countries have the highest relative beef consumption in Asia (Israel, Hong Kong, Kazakhstan) and their CO2 output. Since these countries are relatively low-population compared to other countries in Asia, so the overall CO2 emission for the countries is low. We need to factor in population to show if the per person CO2 emission is higher in these places, to align with our research question. Table 2 shows that the countries with the greatest overall consumption of all food categories is in a different order than overall consumption of beef.
Make at least two plots that help you answer your question on the transformed or summarized data.
Lisa made the first two plots, and Sam made the third.
library(viridisLite)
#Consumption by food category by country
food_Asia %>%
ggplot(aes(x=food_category,y=consumption,fill=food_category)) +
geom_bar(stat='identity') +
facet_wrap(~country) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
scale_fill_viridis_d()
#Total consumption from food by country
food_Asia %>%
ggplot(aes(x=country,y=totalCO2,fill=country)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90)) +
scale_fill_viridis_d()
# scatter plot total beef vs total consumption
consumption_scatter <- ggplot(table1_df)+
aes(x=pctbeef,y=totCO2, color = "red", label=country) +
geom_point() +
theme(legend.position = "none") +
labs(x = "% Total Beef Consumption in Asia", y = "Total CO2") +
geom_text(aes(label=country), hjust=0, vjust=-1.5, size = 2, check_overlap = TRUE)
consumption_scatter