# Midterm (Due 2/12/2021 at 11:55 pm)
Please submit your `.Rmd` and `.html` files in Sakai. If you are working together, both people should submit the files.
60 / 60 points total
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else's code, you **must attribute them**.
```r
# This code came from
```
Before you get Started
Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
Put your `.csv` file into your `data` folder. You may use another dataset or your own data, but please make sure it is de-identified.
Working Together
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or me know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
Please Note
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Given that the entire world has a serious plastic pollution problem, I thought this dataset might be interesting to explore. I am looking forward to seeing the patterns that emerge as I create the ggplot graphs and can visualize any trends that may exist.
Given your question, what is your expectation about the data?
My expectation for this data is that clear trends will appear with the amount of mismanaged plastic waste based upon GDP per capita of each country. I am hypothesizing that the lower the GDP per capita of the country, the more mismanaged plastic waste that country might have. I am basing this hypothesis upon the knowledge that most lower income countries lack effective waste management infrastructure.
Load the data below and use `dplyr::glimpse()` or `skimr::skim()` on the data. You should upload the data file into the `data` directory.
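# Packages used below (assumed to be loaded in a setup chunk that is not shown)
library(tidyverse) # readr, dplyr, ggplot2
library(janitor)   # clean_names()
mismanaged_vs_gdp <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-21/per-capita-mismanaged-plastic-waste-vs-gdp-per-capita.csv", na = "NA")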
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Entity = col_character(),
## Code = col_character(),
## Year = col_double(),
## `Per capita mismanaged plastic waste (kilograms per person per day)` = col_double(),
## `GDP per capita, PPP (constant 2011 international $) (Rate)` = col_double(),
## `Total population (Gapminder)` = col_double()
## )
# Save as a csv file in the data folder
write_excel_csv(x = mismanaged_vs_gdp,
                file = "data/mismanaged_vs_gdp.csv")
read_csv("data/mismanaged_vs_gdp.csv", na = "NA")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Entity = col_character(),
## Code = col_character(),
## Year = col_double(),
## `Per capita mismanaged plastic waste (kilograms per person per day)` = col_double(),
## `GDP per capita, PPP (constant 2011 international $) (Rate)` = col_double(),
## `Total population (Gapminder)` = col_double()
## )
glimpse(mismanaged_vs_gdp)
## Rows: 22,204
## Columns: 6
## $ Entity <chr> …
## $ Code <chr> …
## $ Year <dbl> …
## $ `Per capita mismanaged plastic waste (kilograms per person per day)` <dbl> …
## $ `GDP per capita, PPP (constant 2011 international $) (Rate)` <dbl> …
## $ `Total population (Gapminder)` <dbl> …
# I first wanted to list all the countries in order to pick a smaller subset of countries to work with
#code from Ted Laderas - help from function assignment
mismanaged_vs_gdp %>% pull(Entity) %>% unique()
## [1] "Afghanistan"
## [2] "Albania"
## [3] "Algeria"
## [4] "American Samoa"
## [5] "Andorra"
## [6] "Angola"
## [7] "Anguilla"
## [8] "Antigua and Barbuda"
## [9] "Arab World"
## [10] "Argentina"
## [11] "Armenia"
## [12] "Aruba"
## [13] "Australia"
## [14] "Austria"
## [15] "Azerbaijan"
## [16] "Bahamas"
## [17] "Bahrain"
## [18] "Bangladesh"
## [19] "Barbados"
## [20] "Belarus"
## [21] "Belgium"
## [22] "Belize"
## [23] "Benin"
## [24] "Bermuda"
## [25] "Bhutan"
## [26] "Bolivia"
## [27] "Bosnia and Herzegovina"
## [28] "Botswana"
## [29] "Brazil"
## [30] "British Virgin Islands"
## [31] "Brunei"
## [32] "Bulgaria"
## [33] "Burkina Faso"
## [34] "Burundi"
## [35] "Cambodia"
## [36] "Cameroon"
## [37] "Canada"
## [38] "Cape Verde"
## [39] "Caribbean small states"
## [40] "Cayman Islands"
## [41] "Central African Republic"
## [42] "Central Europe and the Baltics"
## [43] "Chad"
## [44] "Channel Islands"
## [45] "Chile"
## [46] "China"
## [47] "Christmas Island"
## [48] "Cocos Islands"
## [49] "Colombia"
## [50] "Comoros"
## [51] "Congo"
## [52] "Cook Islands"
## [53] "Costa Rica"
## [54] "Cote d'Ivoire"
## [55] "Croatia"
## [56] "Cuba"
## [57] "Curacao"
## [58] "Cyprus"
## [59] "Czech Republic"
## [60] "Democratic Republic of Congo"
## [61] "Denmark"
## [62] "Djibouti"
## [63] "Dominica"
## [64] "Dominican Republic"
## [65] "Early-demographic dividend"
## [66] "East Asia & Pacific"
## [67] "East Asia & Pacific (IDA & IBRD)"
## [68] "East Asia & Pacific (excluding high income)"
## [69] "Ecuador"
## [70] "Egypt"
## [71] "El Salvador"
## [72] "Equatorial Guinea"
## [73] "Eritrea"
## [74] "Estonia"
## [75] "Ethiopia"
## [76] "Euro area"
## [77] "Europe & Central Asia"
## [78] "Europe & Central Asia (IDA & IBRD)"
## [79] "Europe & Central Asia (excluding high income)"
## [80] "European Union"
## [81] "Faeroe Islands"
## [82] "Falkland Islands"
## [83] "Fiji"
## [84] "Finland"
## [85] "Fragile and conflict affected situations"
## [86] "France"
## [87] "French Guiana"
## [88] "French Polynesia"
## [89] "Gabon"
## [90] "Gambia"
## [91] "Georgia"
## [92] "Germany"
## [93] "Ghana"
## [94] "Gibraltar"
## [95] "Greece"
## [96] "Greenland"
## [97] "Grenada"
## [98] "Guadeloupe"
## [99] "Guam"
## [100] "Guatemala"
## [101] "Guernsey"
## [102] "Guinea"
## [103] "Guinea-Bissau"
## [104] "Guyana"
## [105] "Haiti"
## [106] "Heavily indebted poor countries (HIPC)"
## [107] "High income"
## [108] "Honduras"
## [109] "Hong Kong"
## [110] "Hungary"
## [111] "IBRD only"
## [112] "IDA & IBRD total"
## [113] "IDA blend"
## [114] "IDA only"
## [115] "IDA total"
## [116] "Iceland"
## [117] "India"
## [118] "Indonesia"
## [119] "Iran"
## [120] "Iraq"
## [121] "Ireland"
## [122] "Isle of Man"
## [123] "Israel"
## [124] "Italy"
## [125] "Jamaica"
## [126] "Japan"
## [127] "Jersey"
## [128] "Jordan"
## [129] "Kazakhstan"
## [130] "Kenya"
## [131] "Kiribati"
## [132] "Kosovo"
## [133] "Kuwait"
## [134] "Kyrgyzstan"
## [135] "Laos"
## [136] "Late-demographic dividend"
## [137] "Latin America & Caribbean"
## [138] "Latin America & Caribbean (IDA & IBRD)"
## [139] "Latin America & Caribbean (excluding high income)"
## [140] "Latvia"
## [141] "Least developed countries: UN classification"
## [142] "Lebanon"
## [143] "Lesotho"
## [144] "Liberia"
## [145] "Libya"
## [146] "Liechtenstein"
## [147] "Lithuania"
## [148] "Low & middle income"
## [149] "Low income"
## [150] "Lower middle income"
## [151] "Luxembourg"
## [152] "Macao"
## [153] "Macedonia"
## [154] "Madagascar"
## [155] "Malawi"
## [156] "Malaysia"
## [157] "Maldives"
## [158] "Mali"
## [159] "Malta"
## [160] "Marshall Islands"
## [161] "Martinique"
## [162] "Mauritania"
## [163] "Mauritius"
## [164] "Mayotte"
## [165] "Mexico"
## [166] "Micronesia (country)"
## [167] "Middle East & North Africa"
## [168] "Middle East & North Africa (IDA & IBRD)"
## [169] "Middle East & North Africa (excluding high income)"
## [170] "Middle income"
## [171] "Moldova"
## [172] "Monaco"
## [173] "Mongolia"
## [174] "Montenegro"
## [175] "Montserrat"
## [176] "Morocco"
## [177] "Mozambique"
## [178] "Myanmar"
## [179] "Namibia"
## [180] "Nauru"
## [181] "Nepal"
## [182] "Netherlands"
## [183] "Netherlands Antilles"
## [184] "New Caledonia"
## [185] "New Zealand"
## [186] "Nicaragua"
## [187] "Niger"
## [188] "Nigeria"
## [189] "Niue"
## [190] "Norfolk Island"
## [191] "North America"
## [192] "North Korea"
## [193] "Northern Mariana Islands"
## [194] "Norway"
## [195] "OECD members"
## [196] "Oman"
## [197] "Other small states"
## [198] "Pacific island small states"
## [199] "Pakistan"
## [200] "Palau"
## [201] "Palestine"
## [202] "Panama"
## [203] "Papua New Guinea"
## [204] "Paraguay"
## [205] "Peru"
## [206] "Philippines"
## [207] "Pitcairn"
## [208] "Poland"
## [209] "Portugal"
## [210] "Post-demographic dividend"
## [211] "Pre-demographic dividend"
## [212] "Puerto Rico"
## [213] "Qatar"
## [214] "Reunion"
## [215] "Romania"
## [216] "Russia"
## [217] "Rwanda"
## [218] "Saint Helena"
## [219] "Saint Kitts and Nevis"
## [220] "Saint Lucia"
## [221] "Saint Pierre and Miquelon"
## [222] "Saint Vincent and the Grenadines"
## [223] "Samoa"
## [224] "San Marino"
## [225] "Sao Tome and Principe"
## [226] "Saudi Arabia"
## [227] "Senegal"
## [228] "Serbia"
## [229] "Seychelles"
## [230] "Sierra Leone"
## [231] "Singapore"
## [232] "Sint Maarten (Dutch part)"
## [233] "Slovakia"
## [234] "Slovenia"
## [235] "Small states"
## [236] "Solomon Islands"
## [237] "Somalia"
## [238] "South Africa"
## [239] "South Asia"
## [240] "South Asia (IDA & IBRD)"
## [241] "South Korea"
## [242] "South Sudan"
## [243] "Spain"
## [244] "Sri Lanka"
## [245] "Sub-Saharan Africa"
## [246] "Sub-Saharan Africa (IDA & IBRD)"
## [247] "Sub-Saharan Africa (excluding high income)"
## [248] "Sudan"
## [249] "Suriname"
## [250] "Swaziland"
## [251] "Sweden"
## [252] "Switzerland"
## [253] "Syria"
## [254] "Taiwan"
## [255] "Tajikistan"
## [256] "Tanzania"
## [257] "Thailand"
## [258] "Timor"
## [259] "Togo"
## [260] "Tokelau"
## [261] "Tonga"
## [262] "Trinidad and Tobago"
## [263] "Tunisia"
## [264] "Turkey"
## [265] "Turkmenistan"
## [266] "Turks and Caicos Islands"
## [267] "Tuvalu"
## [268] "Uganda"
## [269] "Ukraine"
## [270] "United Arab Emirates"
## [271] "United Kingdom"
## [272] "United States"
## [273] "Upper middle income"
## [274] "Uruguay"
## [275] "Uzbekistan"
## [276] "Vanuatu"
## [277] "Venezuela"
## [278] "Vietnam"
## [279] "Western Sahara"
## [280] "World"
## [281] "Yemen"
## [282] "Zambia"
## [283] "Zimbabwe"
If there are any quirks that you have to deal with (`NA` coded as something else, or it is multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section.
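One quirk of this kind is visible in this dataset: the `skim()` summary further down reports 1240 empty strings and no missing values in the `code` column. That is what `na = "NA"` would produce, because it replaces readr's default of `na = c("", "NA")`. The following is a minimal sketch of recoding empty strings as `NA`. It uses a made-up three-row table (the object names `toy` and `toy_fixed` are hypothetical), not the real data.

```r
library(dplyr)

# Toy stand-in for the real table; these rows are made up for illustration.
toy <- tibble::tibble(
  Entity = c("Ghana", "Arab World", "Euro area"),
  Code   = c("GHA", "", "")
)

# Recode empty strings as NA so they are counted as missing values.
toy_fixed <- toy %>%
  mutate(Code = na_if(Code, ""))

# Number of rows that are now flagged as missing
sum(is.na(toy_fixed$Code))
```

The same `mutate()` call could be applied to the full table before any filtering, or the `na` argument could simply be left at its default when reading the file.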
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using `case_when()`, etc.
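As an illustration of the `case_when()` idea, the sketch below bins a continuous GDP column into three groups. The table, the `income_group` column, and the cut-offs of 4000 and 13000 are all hypothetical choices made for this example; they are not official income thresholds and are not part of the analysis that follows.

```r
library(dplyr)

# Made-up GDP per capita values, for illustration only.
toy <- tibble::tibble(
  country = c("A", "B", "C"),
  gdp     = c(1500, 9000, 45000)
)

# Hypothetical cut-offs; conditions are checked in order, top to bottom.
toy_binned <- toy %>%
  mutate(income_group = case_when(
    is.na(gdp)  ~ NA_character_, # keep missing GDP as missing
    gdp < 4000  ~ "low",
    gdp < 13000 ~ "middle",
    TRUE        ~ "high"
  ))

toy_binned
```

Putting the `is.na()` condition first keeps missing values from silently falling into the final catch-all group.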
Subset the data and think of a question to answer with the subset
I wanted to start off by cleaning the names of my data, as well as renaming some of the variables for easier use in coding the ggplots later.
# Clean names / rename columns
mis_vs_gdp_clean <- clean_names(mismanaged_vs_gdp)
mis_vs_gdp_clean <- mis_vs_gdp_clean %>%
  rename(
    country = entity,
    population = total_population_gapminder,
    gdp = gdp_per_capita_ppp_constant_2011_international_rate,
    mpw_kg_person_day = per_capita_mismanaged_plastic_waste_kilograms_per_person_per_day
  )
# Now that I cleaned my data, I want to save this as my file to use
write_excel_csv(x = mis_vs_gdp_clean,
                file = "data/mis_vs_gdp_clean.csv")
skimr::skim(mis_vs_gdp_clean)
Name | mis_vs_gdp_clean |
Number of rows | 22204 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 4 | 50 | 0 | 283 | 0 |
code | 0 | 1 | 0 | 8 | 1240 | 239 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 1959.83 | 51.10 | 1700.00 | 1945.00 | 1973.00 | 1997.00 | 2017.0 | ▁▁▁▂▇ |
mpw_kg_person_day | 22018 | 0.01 | 0.05 | 0.05 | 0.00 | 0.01 | 0.03 | 0.07 | 0.3 | ▇▂▁▁▁ |
gdp | 15797 | 0.29 | 14926.10 | 17739.75 | 247.44 | 3021.07 | 8447.26 | 19607.54 | 135318.8 | ▇▂▁▁▁ |
population | 2123 | 0.90 | 20772028.22 | 84002122.43 | 0.00 | 511000.00 | 3518000.00 | 11253665.00 | 1359368470.0 | ▇▁▁▁▁ |
I chose to look at a variety of countries with different income levels. The countries in my subset will be: the United States, Japan, Australia, India, China, Ghana, Indonesia, Tanzania, Brazil, France, Ireland, Iran, Finland and Egypt.
# Creating my subset of countries to work with.
subset_countries <- c("United States", "Japan", "Australia", "India", "China", "Ghana", "Indonesia", "Tanzania", "Brazil", "France", "Ireland", "Iran", "Finland", "Egypt")
mis_vs_gdp_clean <- mis_vs_gdp_clean %>%
  filter(country %in% subset_countries)
# I removed the "code" variable, as I will not be using it
mis_vs_gdp_clean <- mis_vs_gdp_clean %>%
  select(-code)
> *Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use `left_join`, `inner_join`, or `right_join` on these tables. No credit will be provided if you don't.*
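This analysis uses a single table, so no merge is needed here. For reference, the difference between the three joins named above can be sketched with two made-up tables; `waste` and `gdp_tbl` are hypothetical names and the values are invented for illustration.

```r
library(dplyr)

# Two made-up tables that only partly overlap on `country`.
waste   <- tibble::tibble(country = c("A", "B"), mpw = c(0.01, 0.05))
gdp_tbl <- tibble::tibble(country = c("B", "C"), gdp = c(9000, 45000))

# left_join: keep every row of `waste`; A has no match, so its gdp is NA.
left <- left_join(waste, gdp_tbl, by = "country")

# inner_join: keep only countries present in both tables (B).
inner <- inner_join(waste, gdp_tbl, by = "country")

# right_join: keep every row of `gdp_tbl`; C has no match, so its mpw is NA.
right <- right_join(waste, gdp_tbl, by = "country")
```

Which join is appropriate depends on which table defines the rows you want to keep: `left_join` when the first table is the reference, `inner_join` when only complete pairs are useful.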
Show your transformed table here. Use tools such as `glimpse()`, `skim()` or `head()` to illustrate your point.
glimpse(mis_vs_gdp_clean)
## Rows: 2,122
## Columns: 5
## $ country <chr> "Australia", "Australia", "Australia", "Australia",…
## $ year <dbl> 1700, 1800, 1820, 1821, 1822, 1823, 1824, 1825, 182…
## $ mpw_kg_person_day <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ population <dbl> 450000, 351014, 334000, 331000, 329000, 329000, 332…
skimr::skim(mis_vs_gdp_clean)
Name | mis_vs_gdp_clean |
Number of rows | 2122 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 4 | 13 | 0 | 14 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 1928.32 | 60.44 | 1700.00 | 1880.00 | 1938.00 | 1980.00 | 2.017000e+03 | ▁▁▅▆▇ |
mpw_kg_person_day | 2108 | 0.01 | 0.03 | 0.04 | 0.00 | 0.01 | 0.01 | 0.04 | 1.200000e-01 | ▇▂▁▁▁ |
gdp | 1730 | 0.18 | 20382.41 | 16542.10 | 1361.41 | 5289.46 | 13609.88 | 36233.27 | 6.733529e+04 | ▇▁▅▂▁ |
population | 56 | 0.97 | 121571641.14 | 233286511.20 | 326000.00 | 5951746.75 | 35008000.00 | 96209849.50 | 1.359368e+09 | ▇▁▁▁▁ |
head(mis_vs_gdp_clean)
Are the values what you expected for the variables? Why or Why not?
`glimpse()` helped me quickly see my column names, but `skim()` was probably the most helpful, as it showed me the missing values for my variable `mpw_kg_person_day`. I was not expecting these, and they led me to believe that this data was collected for only a single year for each country.
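The single-year hunch can be checked directly by counting, per year, how many rows have a non-missing waste value. This is a minimal sketch on a made-up table that reuses the column names of `mis_vs_gdp_clean`; the rows and the `years_with_waste` name are invented for illustration.

```r
library(dplyr)

# Made-up rows standing in for mis_vs_gdp_clean; values are illustrative only.
toy <- tibble::tibble(
  country           = c("A", "A", "A", "B", "B"),
  year              = c(2009, 2010, 2011, 2010, 2011),
  mpw_kg_person_day = c(NA, 0.02, NA, 0.05, NA)
)

# Keep rows where the waste value is present, then count rows per year.
years_with_waste <- toy %>%
  filter(!is.na(mpw_kg_person_day)) %>%
  count(year, name = "n_countries")

years_with_waste
```

If the result has a single row, the waste variable was recorded for only one year.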
Use `group_by()/summarize()` to make a summary of the data here. The summary should be relevant to your research question.
```r
mis_vs_gdp_clean %>%
  group_by(country) %>%
  summarize(mean_gdp = mean(gdp, na.rm = TRUE)) %>%
  arrange(desc(mean_gdp))
## `summarise()` ungrouping output (override with `.groups` argument)
```
What are your findings about the summary? Are they what you expected?
I chose to look at the mean GDP for each country and arranged the countries in descending order. Since there was only one year of data for mismanaged plastic waste per person (kg/day), I thought this would be the most helpful summary. I expected the United States to have the highest GDP, but was unsure where the other countries would fall. Overall, I found the summary interesting.
Make at least two plots that help you answer your question on the transformed or summarized data.
The first plot is a box plot to explore our previously summarized data of country and GDP per capita. Given the number of years for which GDP was available, I thought this would be a good way to show any skew in the data as well as the median GDP for each country.
ggplot(mis_vs_gdp_clean) +
  aes(x = country,
      y = gdp,
      fill = country) +
  geom_boxplot() +
  labs(title = "GDP Per Capita by Country",
       x = "Country",
       y = "GDP Per Capita")
## Warning: Removed 1730 rows containing non-finite values (stat_boxplot).
# The only year that kilograms of mismanaged plastic waste per person per day was recorded was 2010, therefore I made a smaller dataset limited to the year 2010
small_set <- mis_vs_gdp_clean %>%
  filter(year == 2010)
ggplot(small_set, aes(gdp, mpw_kg_person_day)) +
  geom_point(aes(colour = country)) +
  labs(title = "Mismanaged Plastic Waste vs. GDP per Country in 2010",
       x = "GDP per Capita",
       y = "Mismanaged Plastic Waste Per Person (kg/day)")
Summarize your research question and findings below.
I was disappointed to find that my main variable of mismanaged plastic waste per person (kg/day) was only available for one year (2010). This was unexpected, as the dataset seemed full when initially looking at it. However, I made the best of it and found which ggplots could summarize the data well. The graphs were not quite as exciting as I was hoping for when originally thinking about this dataset.
To answer my question: “How does GDP per capita of a country affect the amount of mismanaged plastic waste that each country has?”
It appears, when looking at our second plot, that countries with higher GDP per capita had less mismanaged plastic waste. For example, the United States, Ireland, Australia and Finland had the highest GDP per capita when we looked at mean GDP, and on the graph they are near the lowest for mismanaged plastic waste.
Are your findings what you expected? Why or Why not?
These results were consistent with my hypothesis surrounding waste management infrastructure. I was expecting countries with high GDP to have waste management programs in place that would prevent mismanaged plastic waste. However, I am curious about countries like Ghana and Tanzania, which have the lowest GDP and also have low amounts of mismanaged plastic waste.
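One way to put a number on the pattern in the second plot is a rank correlation between GDP per capita and mismanaged waste for 2010. The sketch below is only an illustration: `toy_2010` is a made-up table that borrows the column names of `small_set`, and its values are invented, so the resulting coefficient says nothing about the real data.

```r
# Made-up 2010-style values for illustration; not taken from the dataset.
toy_2010 <- data.frame(
  gdp               = c(2000, 5000, 12000, 40000, 50000),
  mpw_kg_person_day = c(0.03, 0.09, 0.06, 0.01, 0.005)
)

# Spearman's rho uses ranks, so it does not assume a straight-line relationship.
# use = "complete.obs" drops rows where either value is missing.
rho <- cor(toy_2010$gdp, toy_2010$mpw_kg_person_day,
           method = "spearman", use = "complete.obs")

rho
```

A negative value would be in line with the hypothesis. Low-GDP countries with low waste, such as the Ghana and Tanzania case noted above, would pull the coefficient toward zero.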