# Midterm (Due 2/12/2021 at 11:55 pm) 

Please submit your `.Rmd` and `.html` files in Sakai. If you are working together, both people should submit the files. 

60 / 60 points total


The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else's code, you **must attribute them**. 


```r
# This code came from

Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

Please schedule a time with Eric or me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

“How does GDP per capita of a country affect the amount of mismanaged plastic waste that each country has?”

Given that the entire world has a serious plastic pollution problem, I thought this dataset might be interesting to explore. I am looking forward to seeing the patterns that emerge as I create the ggplot graphs and can visualize any trends that may exist.

Given your question, what is your expectation about the data?

My expectation is that a clear trend will appear between GDP per capita and the amount of mismanaged plastic waste in each country. I hypothesize that the lower a country's GDP per capita, the more mismanaged plastic waste it will have. I base this hypothesis on the knowledge that most lower-income countries lack effective waste management infrastructure.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Packages used throughout this document
library(tidyverse)
library(janitor)

mismanaged_vs_gdp <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-21/per-capita-mismanaged-plastic-waste-vs-gdp-per-capita.csv", na = "NA")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Entity = col_character(),
##   Code = col_character(),
##   Year = col_double(),
##   `Per capita mismanaged plastic waste (kilograms per person per day)` = col_double(),
##   `GDP per capita, PPP (constant 2011 international $) (Rate)` = col_double(),
##   `Total population (Gapminder)` = col_double()
## )

# Save a copy as a CSV file in the data folder, then read it back in

write_excel_csv(x = mismanaged_vs_gdp,
                file = "data/mismanaged_vs_gdp.csv")
mismanaged_vs_gdp <- read_csv("data/mismanaged_vs_gdp.csv", na = "NA")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Entity = col_character(),
##   Code = col_character(),
##   Year = col_double(),
##   `Per capita mismanaged plastic waste (kilograms per person per day)` = col_double(),
##   `GDP per capita, PPP (constant 2011 international $) (Rate)` = col_double(),
##   `Total population (Gapminder)` = col_double()
## )

glimpse(mismanaged_vs_gdp)

## Rows: 22,204
## Columns: 6
## $ Entity                                                               <chr> …
## $ Code                                                                 <chr> …
## $ Year                                                                 <dbl> …
## $ `Per capita mismanaged plastic waste (kilograms per person per day)` <dbl> …
## $ `GDP per capita, PPP (constant 2011 international $) (Rate)`         <dbl> …
## $ `Total population (Gapminder)`                                       <dbl> …

# I first wanted to list all the countries so I could pick a smaller subset to work with
# Code from Ted Laderas (help from the function assignment)
mismanaged_vs_gdp %>% pull(Entity) %>% unique()

##   [1] "Afghanistan"                                       
##   [2] "Albania"                                           
##   [3] "Algeria"                                           
##   [4] "American Samoa"                                    
##   [5] "Andorra"                                           
##   [6] "Angola"                                            
##   [7] "Anguilla"                                          
##   [8] "Antigua and Barbuda"                               
##   [9] "Arab World"                                        
##  [10] "Argentina"                                         
##  [11] "Armenia"                                           
##  [12] "Aruba"                                             
##  [13] "Australia"                                         
##  [14] "Austria"                                           
##  [15] "Azerbaijan"                                        
##  [16] "Bahamas"                                           
##  [17] "Bahrain"                                           
##  [18] "Bangladesh"                                        
##  [19] "Barbados"                                          
##  [20] "Belarus"                                           
##  [21] "Belgium"                                           
##  [22] "Belize"                                            
##  [23] "Benin"                                             
##  [24] "Bermuda"                                           
##  [25] "Bhutan"                                            
##  [26] "Bolivia"                                           
##  [27] "Bosnia and Herzegovina"                            
##  [28] "Botswana"                                          
##  [29] "Brazil"                                            
##  [30] "British Virgin Islands"                            
##  [31] "Brunei"                                            
##  [32] "Bulgaria"                                          
##  [33] "Burkina Faso"                                      
##  [34] "Burundi"                                           
##  [35] "Cambodia"                                          
##  [36] "Cameroon"                                          
##  [37] "Canada"                                            
##  [38] "Cape Verde"                                        
##  [39] "Caribbean small states"                            
##  [40] "Cayman Islands"                                    
##  [41] "Central African Republic"                          
##  [42] "Central Europe and the Baltics"                    
##  [43] "Chad"                                              
##  [44] "Channel Islands"                                   
##  [45] "Chile"                                             
##  [46] "China"                                             
##  [47] "Christmas Island"                                  
##  [48] "Cocos Islands"                                     
##  [49] "Colombia"                                          
##  [50] "Comoros"                                           
##  [51] "Congo"                                             
##  [52] "Cook Islands"                                      
##  [53] "Costa Rica"                                        
##  [54] "Cote d'Ivoire"                                     
##  [55] "Croatia"                                           
##  [56] "Cuba"                                              
##  [57] "Curacao"                                           
##  [58] "Cyprus"                                            
##  [59] "Czech Republic"                                    
##  [60] "Democratic Republic of Congo"                      
##  [61] "Denmark"                                           
##  [62] "Djibouti"                                          
##  [63] "Dominica"                                          
##  [64] "Dominican Republic"                                
##  [65] "Early-demographic dividend"                        
##  [66] "East Asia & Pacific"                               
##  [67] "East Asia & Pacific (IDA & IBRD)"                  
##  [68] "East Asia & Pacific (excluding high income)"       
##  [69] "Ecuador"                                           
##  [70] "Egypt"                                             
##  [71] "El Salvador"                                       
##  [72] "Equatorial Guinea"                                 
##  [73] "Eritrea"                                           
##  [74] "Estonia"                                           
##  [75] "Ethiopia"                                          
##  [76] "Euro area"                                         
##  [77] "Europe & Central Asia"                             
##  [78] "Europe & Central Asia (IDA & IBRD)"                
##  [79] "Europe & Central Asia (excluding high income)"     
##  [80] "European Union"                                    
##  [81] "Faeroe Islands"                                    
##  [82] "Falkland Islands"                                  
##  [83] "Fiji"                                              
##  [84] "Finland"                                           
##  [85] "Fragile and conflict affected situations"          
##  [86] "France"                                            
##  [87] "French Guiana"                                     
##  [88] "French Polynesia"                                  
##  [89] "Gabon"                                             
##  [90] "Gambia"                                            
##  [91] "Georgia"                                           
##  [92] "Germany"                                           
##  [93] "Ghana"                                             
##  [94] "Gibraltar"                                         
##  [95] "Greece"                                            
##  [96] "Greenland"                                         
##  [97] "Grenada"                                           
##  [98] "Guadeloupe"                                        
##  [99] "Guam"                                              
## [100] "Guatemala"                                         
## [101] "Guernsey"                                          
## [102] "Guinea"                                            
## [103] "Guinea-Bissau"                                     
## [104] "Guyana"                                            
## [105] "Haiti"                                             
## [106] "Heavily indebted poor countries (HIPC)"            
## [107] "High income"                                       
## [108] "Honduras"                                          
## [109] "Hong Kong"                                         
## [110] "Hungary"                                           
## [111] "IBRD only"                                         
## [112] "IDA & IBRD total"                                  
## [113] "IDA blend"                                         
## [114] "IDA only"                                          
## [115] "IDA total"                                         
## [116] "Iceland"                                           
## [117] "India"                                             
## [118] "Indonesia"                                         
## [119] "Iran"                                              
## [120] "Iraq"                                              
## [121] "Ireland"                                           
## [122] "Isle of Man"                                       
## [123] "Israel"                                            
## [124] "Italy"                                             
## [125] "Jamaica"                                           
## [126] "Japan"                                             
## [127] "Jersey"                                            
## [128] "Jordan"                                            
## [129] "Kazakhstan"                                        
## [130] "Kenya"                                             
## [131] "Kiribati"                                          
## [132] "Kosovo"                                            
## [133] "Kuwait"                                            
## [134] "Kyrgyzstan"                                        
## [135] "Laos"                                              
## [136] "Late-demographic dividend"                         
## [137] "Latin America & Caribbean"                         
## [138] "Latin America & Caribbean (IDA & IBRD)"            
## [139] "Latin America & Caribbean (excluding high income)" 
## [140] "Latvia"                                            
## [141] "Least developed countries: UN classification"      
## [142] "Lebanon"                                           
## [143] "Lesotho"                                           
## [144] "Liberia"                                           
## [145] "Libya"                                             
## [146] "Liechtenstein"                                     
## [147] "Lithuania"                                         
## [148] "Low & middle income"                               
## [149] "Low income"                                        
## [150] "Lower middle income"                               
## [151] "Luxembourg"                                        
## [152] "Macao"                                             
## [153] "Macedonia"                                         
## [154] "Madagascar"                                        
## [155] "Malawi"                                            
## [156] "Malaysia"                                          
## [157] "Maldives"                                          
## [158] "Mali"                                              
## [159] "Malta"                                             
## [160] "Marshall Islands"                                  
## [161] "Martinique"                                        
## [162] "Mauritania"                                        
## [163] "Mauritius"                                         
## [164] "Mayotte"                                           
## [165] "Mexico"                                            
## [166] "Micronesia (country)"                              
## [167] "Middle East & North Africa"                        
## [168] "Middle East & North Africa (IDA & IBRD)"           
## [169] "Middle East & North Africa (excluding high income)"
## [170] "Middle income"                                     
## [171] "Moldova"                                           
## [172] "Monaco"                                            
## [173] "Mongolia"                                          
## [174] "Montenegro"                                        
## [175] "Montserrat"                                        
## [176] "Morocco"                                           
## [177] "Mozambique"                                        
## [178] "Myanmar"                                           
## [179] "Namibia"                                           
## [180] "Nauru"                                             
## [181] "Nepal"                                             
## [182] "Netherlands"                                       
## [183] "Netherlands Antilles"                              
## [184] "New Caledonia"                                     
## [185] "New Zealand"                                       
## [186] "Nicaragua"                                         
## [187] "Niger"                                             
## [188] "Nigeria"                                           
## [189] "Niue"                                              
## [190] "Norfolk Island"                                    
## [191] "North America"                                     
## [192] "North Korea"                                       
## [193] "Northern Mariana Islands"                          
## [194] "Norway"                                            
## [195] "OECD members"                                      
## [196] "Oman"                                              
## [197] "Other small states"                                
## [198] "Pacific island small states"                       
## [199] "Pakistan"                                          
## [200] "Palau"                                             
## [201] "Palestine"                                         
## [202] "Panama"                                            
## [203] "Papua New Guinea"                                  
## [204] "Paraguay"                                          
## [205] "Peru"                                              
## [206] "Philippines"                                       
## [207] "Pitcairn"                                          
## [208] "Poland"                                            
## [209] "Portugal"                                          
## [210] "Post-demographic dividend"                         
## [211] "Pre-demographic dividend"                          
## [212] "Puerto Rico"                                       
## [213] "Qatar"                                             
## [214] "Reunion"                                           
## [215] "Romania"                                           
## [216] "Russia"                                            
## [217] "Rwanda"                                            
## [218] "Saint Helena"                                      
## [219] "Saint Kitts and Nevis"                             
## [220] "Saint Lucia"                                       
## [221] "Saint Pierre and Miquelon"                         
## [222] "Saint Vincent and the Grenadines"                  
## [223] "Samoa"                                             
## [224] "San Marino"                                        
## [225] "Sao Tome and Principe"                             
## [226] "Saudi Arabia"                                      
## [227] "Senegal"                                           
## [228] "Serbia"                                            
## [229] "Seychelles"                                        
## [230] "Sierra Leone"                                      
## [231] "Singapore"                                         
## [232] "Sint Maarten (Dutch part)"                         
## [233] "Slovakia"                                          
## [234] "Slovenia"                                          
## [235] "Small states"                                      
## [236] "Solomon Islands"                                   
## [237] "Somalia"                                           
## [238] "South Africa"                                      
## [239] "South Asia"                                        
## [240] "South Asia (IDA & IBRD)"                           
## [241] "South Korea"                                       
## [242] "South Sudan"                                       
## [243] "Spain"                                             
## [244] "Sri Lanka"                                         
## [245] "Sub-Saharan Africa"                                
## [246] "Sub-Saharan Africa (IDA & IBRD)"                   
## [247] "Sub-Saharan Africa (excluding high income)"        
## [248] "Sudan"                                             
## [249] "Suriname"                                          
## [250] "Swaziland"                                         
## [251] "Sweden"                                            
## [252] "Switzerland"                                       
## [253] "Syria"                                             
## [254] "Taiwan"                                            
## [255] "Tajikistan"                                        
## [256] "Tanzania"                                          
## [257] "Thailand"                                          
## [258] "Timor"                                             
## [259] "Togo"                                              
## [260] "Tokelau"                                           
## [261] "Tonga"                                             
## [262] "Trinidad and Tobago"                               
## [263] "Tunisia"                                           
## [264] "Turkey"                                            
## [265] "Turkmenistan"                                      
## [266] "Turks and Caicos Islands"                          
## [267] "Tuvalu"                                            
## [268] "Uganda"                                            
## [269] "Ukraine"                                           
## [270] "United Arab Emirates"                              
## [271] "United Kingdom"                                    
## [272] "United States"                                     
## [273] "Upper middle income"                               
## [274] "Uruguay"                                           
## [275] "Uzbekistan"                                        
## [276] "Vanuatu"                                           
## [277] "Venezuela"                                         
## [278] "Vietnam"                                           
## [279] "Western Sahara"                                    
## [280] "World"                                             
## [281] "Yemen"                                             
## [282] "Zambia"                                            
## [283] "Zimbabwe"

If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!
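
# A minimal sketch of one way to check for missing values that are coded as
# something other than NA, here empty strings in a character column. The toy
# tibble is made up for illustration and only mimics the shape of the real
# data, where the skim() summary reports 1,240 empty strings in the code column.
library(dplyr)

toy <- tibble(
  Entity = c("Country A", "Region B", "Country C"),
  Code   = c("AAA", "", "CCC")
)

# Convert empty strings to NA so they are counted as missing
toy_fixed <- toy %>%
  mutate(Code = na_if(Code, ""))

# Count how many codes are now missing
toy_fixed %>%
  summarize(n_missing_code = sum(is.na(Code)))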

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
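
# A minimal sketch of recoding a continuous variable into categories with
# case_when(). The toy values and the cutoffs (5,000 and 20,000) are
# hypothetical choices for illustration, not official income thresholds.
library(dplyr)

toy_gdp <- tibble(
  country = c("A", "B", "C", "D"),
  gdp     = c(1500, 8000, 45000, NA)
)

# Conditions are checked in order, so the NA check has to come first
toy_gdp <- toy_gdp %>%
  mutate(
    gdp_group = case_when(
      is.na(gdp)  ~ NA_character_,  # keep missing values missing
      gdp < 5000  ~ "low",
      gdp < 20000 ~ "middle",
      TRUE        ~ "high"
    )
  )

toy_gdp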

Subset the data and think of a question that the subset can answer.

I wanted to start by cleaning the names in my data and renaming some of the variables so they are easier to use when coding the ggplots later.

# Clean names / rename columns
mis_vs_gdp_clean <- clean_names(mismanaged_vs_gdp)
mis_vs_gdp_clean <- mis_vs_gdp_clean %>%
  rename(
    country = entity,
    population = total_population_gapminder,
    gdp = gdp_per_capita_ppp_constant_2011_international_rate,
    mpw_kg_person_day = per_capita_mismanaged_plastic_waste_kilograms_per_person_per_day
  )

# Now that I have cleaned my data, I want to save it as the file to use

write_excel_csv(x = mis_vs_gdp_clean,
                file = "data/mis_vs_gdp_clean.csv")

skimr::skim(mis_vs_gdp_clean)

Data summary
Name	mis_vs_gdp_clean
Number of rows	22204
Number of columns	6
_______________________
Column type frequency:
character	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	4	50	0	283	0
code	0	1	0	8	1240	239	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	1959.83	51.10	1700.00	1945.00	1973.00	1997.00	2017.0	▁▁▁▂▇
mpw_kg_person_day	22018	0.01	0.05	0.05	0.00	0.01	0.03	0.07	0.3	▇▂▁▁▁
gdp	15797	0.29	14926.10	17739.75	247.44	3021.07	8447.26	19607.54	135318.8	▇▂▁▁▁
population	2123	0.90	20772028.22	84002122.43	0.00	511000.00	3518000.00	11253665.00	1359368470.0	▇▁▁▁▁

I chose to look at a variety of countries with different income levels. The countries in my subset are the United States, Japan, Australia, India, China, Ghana, Indonesia, Tanzania, Brazil, France, Ireland, Iran, Finland, and Egypt.

# Create my subset of countries to work with
subset_countries <- c("United States", "Japan", "Australia", "India", "China",
                      "Ghana", "Indonesia", "Tanzania", "Brazil", "France",
                      "Ireland", "Iran", "Finland", "Egypt")

mis_vs_gdp_clean <- mis_vs_gdp_clean %>%
  filter(country %in% subset_countries)


# Remove the "code" variable, as I will not be using it
mis_vs_gdp_clean <- mis_vs_gdp_clean %>%
  select(-code)


> *Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use `left_join`, `inner_join`, or `right_join` on these tables. No credit will be provided if you don't.*
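
# A minimal sketch of how the choice of join changes the result, using two toy
# tables made up for illustration (they are not the midterm data).
library(dplyr)

waste_tbl <- tibble(country = c("A", "B", "C"), mpw = c(0.01, 0.05, 0.03))
gdp_tbl   <- tibble(country = c("B", "C", "D"), gdp = c(4000, 30000, 12000))

# left_join keeps every row of waste_tbl; gdp is NA where there is no match
left_join(waste_tbl, gdp_tbl, by = "country")

# inner_join keeps only the countries present in both tables
inner_join(waste_tbl, gdp_tbl, by = "country")

# right_join keeps every row of gdp_tbl; mpw is NA where there is no match
right_join(waste_tbl, gdp_tbl, by = "country")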

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(mis_vs_gdp_clean)

## Rows: 2,122
## Columns: 5
## $ country           <chr> "Australia", "Australia", "Australia", "Australia",…
## $ year              <dbl> 1700, 1800, 1820, 1821, 1822, 1823, 1824, 1825, 182…
## $ mpw_kg_person_day <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ gdp               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ population        <dbl> 450000, 351014, 334000, 331000, 329000, 329000, 332…

skimr::skim(mis_vs_gdp_clean)

Data summary
Name	mis_vs_gdp_clean
Number of rows	2122
Number of columns	5
_______________________
Column type frequency:
character	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	4	13	0	14	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	1928.32	60.44	1700.00	1880.00	1938.00	1980.00	2.017000e+03	▁▁▅▆▇
mpw_kg_person_day	2108	0.01	0.03	0.04	0.00	0.01	0.01	0.04	1.200000e-01	▇▂▁▁▁
gdp	1730	0.18	20382.41	16542.10	1361.41	5289.46	13609.88	36233.27	6.733529e+04	▇▁▅▂▁
population	56	0.97	121571641.14	233286511.20	326000.00	5951746.75	35008000.00	96209849.50	1.359368e+09	▇▁▁▁▁

head(mis_vs_gdp_clean)

Are the values what you expected for the variables? Why or why not?

glimpse() helped me quickly see my column names, but skim() was the most helpful: it showed that mpw_kg_person_day is missing in 2,108 of the 2,122 rows, which I was not expecting. This led me to believe that this variable was collected in a single year for each country.
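
# A minimal sketch of how to check which years have non-missing values for a
# variable. The toy tibble is made up for illustration; on the real data the
# same pipeline would start from mis_vs_gdp_clean.
library(dplyr)

toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2009, 2010, 2009, 2010),
  mpw_kg_person_day = c(NA, 0.02, NA, 0.10)
)

# Count the non-missing observations in each year
years_with_data <- toy %>%
  filter(!is.na(mpw_kg_person_day)) %>%
  count(year)

years_with_data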

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.



```r
mis_vs_gdp_clean %>%
  group_by(country) %>%
  summarize(mean_gdp = mean(gdp, na.rm = TRUE)) %>%
  arrange(desc(mean_gdp))

## `summarise()` ungrouping output (override with `.groups` argument)

What are your findings about the summary? Are they what you expected?

I chose to look at the mean GDP for each country and arranged the results in descending order. Since there was only one year of data for mismanaged plastic waste per person (kg/day), I thought this would be the most helpful summary. I expected the United States to have the highest GDP but was unsure where the other countries would fall. Overall, I found the summary interesting.

Make at least two plots that help you answer your question on the transformed or summarized data.

The first plot is a box plot that explores the country and GDP per capita data summarized above. Because GDP was available for many years, I thought a box plot would show any skew in the data as well as the median and spread of GDP for each country.

ggplot(mis_vs_gdp_clean) +
  aes(x = country,
      y = gdp,
      fill = country) +
  geom_boxplot() +
  labs(title = "GDP Per Capita by Country",
       x = "Country",
       y = "GDP Per Capita")

## Warning: Removed 1730 rows containing non-finite values (stat_boxplot).
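
# The 1,730 rows in the warning above match the number of missing gdp values
# reported by skim(). A minimal sketch of dropping such rows before plotting,
# shown on a made-up toy tibble so the row counts can be checked by hand:
library(dplyr)

toy <- tibble(
  country = c("A", "A", "B", "B"),
  gdp     = c(1000, NA, 25000, NA)
)

toy_complete <- toy %>%
  filter(!is.na(gdp))

# Number of rows removed because gdp was missing
nrow(toy) - nrow(toy_complete)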

# Kilograms of mismanaged plastic waste per person per day was only recorded
# for 2010, so I made a smaller dataset limited to that year
small_set <- mis_vs_gdp_clean %>%
  filter(year == 2010)

ggplot(small_set, aes(x = gdp, y = mpw_kg_person_day)) +
  geom_point(aes(colour = country)) +
  labs(title = "Mismanaged Plastic Waste vs. GDP per Capita by Country in 2010",
       x = "GDP per Capita",
       y = "Mismanaged Plastic Waste Per Person (kg/day)")
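
# A minimal sketch of one optional variation on the scatter plot above: a log
# scale on the x axis, which can spread out the points when GDP per capita is
# right-skewed. The toy values are made up for illustration.
library(ggplot2)

toy_2010 <- data.frame(
  country = c("A", "B", "C", "D"),
  gdp = c(1500, 6000, 30000, 50000),
  mpw_kg_person_day = c(0.02, 0.10, 0.01, 0.005)
)

p <- ggplot(toy_2010, aes(x = gdp, y = mpw_kg_person_day)) +
  geom_point(aes(colour = country)) +
  scale_x_log10() +
  labs(x = "GDP per Capita (log scale)",
       y = "Mismanaged Plastic Waste Per Person (kg/day)")

p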

Final Summary (10 points)

Summarize your research question and findings below.

I was disappointed to find that my main variable, mismanaged plastic waste per person (kg/day), was only available for one year (2010). This was unexpected, as the dataset seemed full when I initially looked at it. However, I made the best of it and found which ggplots could summarize the data well. The graphs were not quite as exciting as I had hoped when I first thought about this dataset.

To answer my question, “How does GDP per capita of a country affect the amount of mismanaged plastic waste that each country has?”:
Looking at the second plot, countries with higher GDP per capita appear to have less mismanaged plastic waste per person. For example, the United States, Ireland, Australia, and Finland had the highest mean GDP per capita, and on the graph they are near the lowest for mismanaged plastic waste.
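
# A minimal sketch of one way to put a number on the pattern described above:
# a Spearman rank correlation between GDP per capita and mismanaged waste. The
# toy values are made up for illustration. With only one observation per
# country, any such number is descriptive and says nothing about causation.
toy_2010 <- data.frame(
  gdp = c(1500, 6000, 12000, 30000, 50000),
  mpw_kg_person_day = c(0.02, 0.10, 0.06, 0.01, 0.005)
)

# use = "complete.obs" drops rows with missing values before computing
rho <- cor(toy_2010$gdp, toy_2010$mpw_kg_person_day,
           method = "spearman", use = "complete.obs")

rho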

Are your findings what you expected? Why or why not?

These results were largely consistent with my hypothesis about waste management infrastructure. I was expecting countries with high GDP to have waste management programs in place that would prevent mismanaged plastic waste. However, I am curious about countries like Ghana and Tanzania, which have the lowest GDP and also have low amounts of mismanaged plastic waste.

```