Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

# This code came from 

Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

I am using the ‘Global Mortality’ dataset from the data available on the Tidy Tuesday website from 2018.
https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-04-16

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

  1. Please schedule a time with Eric or Me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

My research question is: which country had the highest mean mortality attributed to Tuberculosis from 1990 to 2016 (over 26 years)?

The dataset comes from a website called ourworldindata.org, an organization dedicated to making data about big world problems accessible and understandable in order to make progress against world issues such as poverty, disease, inequality, etc.

I was interested in this dataset because I am interested in global health and it seems similar to datasets that I would want to work with in the future. The data are also more robust that what I have worked with in class, so I think it will be good practice with getting me out of my comfort zone. Because the data are relative large, I chose to narrow down my question to a specific disease to make it more manageable.

Given your question, what is your expectation about the data?

According to the WHO, in 2019 the largest number of new TB cases occurred in the Southeast Asia region. The countries that accounted for two thirds of the new TB cases were: India, Indonesia, China, Philippines, Pakistan, Nigeria, Bangaldesh and South Africa. I hypothesize that the country with the highest mortality secondary to Tuberculosis will be India because they have the world’s second largest population in a very condensed space, which could increase the transmission of this disease (China has the largest population but it is a larger country than India).

Source: https://www.who.int/news-room/fact-sheets/detail/tuberculosis

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

Getting started, I will first load all the packages I might need.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readxl)
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(here)
## here() starts at /cloud/project

I have uploaded my file to the data directory and now I will load my dataset and examine it using the glimpse() function:

global_mortality <- read_excel(here('data/global_mortality.xlsx'),
                               na="NA") %>% 
  glimpse()
## Rows: 6,156
## Columns: 35
## $ country                                    <chr> "Afghanistan", "Afghanista…
## $ country_code                               <chr> "AFG", "AFG", "AFG", "AFG"…
## $ year                                       <dbl> 1990, 1991, 1992, 1993, 19…
## $ `Cardiovascular diseases (%)`              <dbl> 17.61040, 17.80181, 18.386…
## $ `Cancers (%)`                              <dbl> 4.025975, 4.054145, 4.1739…
## $ `Respiratory diseases (%)`                 <dbl> 2.106626, 2.134176, 2.2082…
## $ `Diabetes (%)`                             <dbl> 3.832555, 3.822228, 3.9001…
## $ `Dementia (%)`                             <dbl> 0.5314287, 0.5324973, 0.54…
## $ `Lower respiratory infections (%)`         <dbl> 10.886362, 10.356968, 10.0…
## $ `Neonatal deaths (%)`                      <dbl> 9.184653, 8.938897, 8.8413…
## $ `Diarrheal diseases (%)`                   <dbl> 2.497141, 2.572228, 2.7077…
## $ `Road accidents (%)`                       <dbl> 3.715944, 3.729142, 3.8163…
## $ `Liver disease (%)`                        <dbl> 0.8369093, 0.8455159, 0.87…
## $ `Tuberculosis (%)`                         <dbl> 5.877075, 5.891704, 6.0346…
## $ `Kidney disease (%)`                       <dbl> 1.680611, 1.671115, 1.7009…
## $ `Digestive diseases (%)`                   <dbl> 1.058771, 1.049322, 1.0628…
## $ `HIV/AIDS (%)`                             <dbl> 0.01301948, 0.01451458, 0.…
## $ `Suicide (%)`                              <dbl> 0.4366105, 0.4422802, 0.45…
## $ `Malaria (%)`                              <dbl> 0.4488863, 0.4550191, 0.46…
## $ `Homicide (%)`                             <dbl> 1.287020, 1.290991, 1.3261…
## $ `Nutritional deficiencies (%)`             <dbl> 0.3505045, 0.3432123, 0.34…
## $ `Meningitis (%)`                           <dbl> 3.037603, 2.903202, 2.8406…
## $ `Protein-energy malnutrition (%)`          <dbl> 0.3297599, 0.3221711, 0.32…
## $ `Drowning (%)`                             <dbl> 0.9838624, 0.9545860, 0.95…
## $ `Maternal deaths (%)`                      <dbl> 1.769213, 1.749264, 1.7642…
## $ `Parkinson disease (%)`                    <dbl> 0.02515859, 0.02545063, 0.…
## $ `Alcohol disorders (%)`                    <dbl> 0.02899828, 0.02917152, 0.…
## $ `Intestinal infectious diseases (%)`       <dbl> 0.1833303, 0.1781074, 0.17…
## $ `Drug disorders (%)`                       <dbl> 0.04120540, 0.04203340, 0.…
## $ `Hepatitis (%)`                            <dbl> 0.1387378, 0.1350081, 0.13…
## $ `Fire (%)`                                 <dbl> 0.1741567, 0.1706712, 0.17…
## $ `Heat-related (hot and cold exposure) (%)` <dbl> 0.1378229, 0.1348266, 0.13…
## $ `Natural disasters (%)`                    <dbl> 0.00000000, 0.79760256, 0.…
## $ `Conflict (%)`                             <dbl> 0.932, 2.044, 2.408, NA, 4…
## $ `Terrorism (%)`                            <dbl> 0.007, 0.040, 0.027, NA, 0…

And now I’ll look at it using skim():

skim(global_mortality)
Data summary
Name global_mortality
Number of rows 6156
Number of columns 35
_______________________
Column type frequency:
character 2
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1.00 4 32 0 228 0
country_code 864 0.86 3 8 0 196 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2003.00 7.79 1990.00 1996.00 2003.00 2010.00 2016.00 ▇▇▇▇▇
Cardiovascular diseases (%) 0 1.00 29.93 14.02 1.43 18.74 30.65 38.45 67.39 ▃▅▇▂▁
Cancers (%) 0 1.00 14.39 8.15 0.58 6.93 13.31 21.36 33.62 ▇▇▆▆▃
Respiratory diseases (%) 0 1.00 4.10 2.35 0.30 2.26 3.63 5.38 16.29 ▇▇▂▁▁
Diabetes (%) 0 1.00 6.29 4.44 0.33 3.20 4.99 7.93 35.82 ▇▂▁▁▁
Dementia (%) 0 1.00 3.22 2.75 0.04 1.01 2.53 4.33 16.67 ▇▃▁▁▁
Lower respiratory infections (%) 0 1.00 5.84 3.42 0.68 3.21 5.14 8.16 20.04 ▇▆▃▁▁
Neonatal deaths (%) 0 1.00 4.57 3.85 0.04 0.69 3.89 7.74 17.81 ▇▃▃▁▁
Diarrheal diseases (%) 0 1.00 3.20 4.36 0.01 0.18 0.77 5.29 25.18 ▇▂▁▁▁
Road accidents (%) 0 1.00 2.53 2.19 0.28 1.36 1.93 2.90 20.90 ▇▁▁▁▁
Liver disease (%) 0 1.00 2.12 1.24 0.19 1.34 1.83 2.54 11.65 ▇▂▁▁▁
Tuberculosis (%) 0 1.00 2.13 2.65 0.01 0.24 0.91 3.33 16.47 ▇▂▁▁▁
Kidney disease (%) 0 1.00 2.09 1.49 0.06 0.90 1.73 2.96 9.95 ▇▅▂▁▁
Digestive diseases (%) 0 1.00 1.97 0.68 0.31 1.51 1.93 2.29 5.16 ▂▇▃▁▁
HIV/AIDS (%) 0 1.00 3.35 7.81 0.00 0.08 0.44 2.33 62.19 ▇▁▁▁▁
Suicide (%) 0 1.00 1.39 1.11 0.10 0.69 1.18 1.80 15.41 ▇▁▁▁▁
Malaria (%) 0 1.00 1.80 4.21 0.00 0.00 0.00 0.48 24.43 ▇▁▁▁▁
Homicide (%) 0 1.00 0.98 1.45 0.05 0.24 0.52 1.00 14.23 ▇▁▁▁▁
Nutritional deficiencies (%) 0 1.00 1.10 1.87 0.00 0.09 0.40 1.34 35.55 ▇▁▁▁▁
Meningitis (%) 0 1.00 0.78 0.97 0.03 0.11 0.35 1.02 6.98 ▇▂▁▁▁
Protein-energy malnutrition (%) 0 1.00 1.00 1.81 0.00 0.06 0.32 1.22 35.52 ▇▁▁▁▁
Drowning (%) 0 1.00 0.71 0.51 0.05 0.35 0.61 0.96 4.51 ▇▂▁▁▁
Maternal deaths (%) 0 1.00 0.59 0.70 0.00 0.03 0.24 1.00 3.41 ▇▂▂▁▁
Parkinson disease (%) 0 1.00 0.29 0.26 0.00 0.07 0.21 0.43 1.59 ▇▃▁▁▁
Alcohol disorders (%) 0 1.00 0.32 0.41 0.01 0.08 0.16 0.39 3.08 ▇▁▁▁▁
Intestinal infectious diseases (%) 0 1.00 0.18 0.30 0.00 0.00 0.01 0.27 2.28 ▇▁▁▁▁
Drug disorders (%) 0 1.00 0.18 0.19 0.00 0.06 0.12 0.23 1.31 ▇▂▁▁▁
Hepatitis (%) 0 1.00 0.16 0.17 0.00 0.04 0.11 0.24 1.58 ▇▁▁▁▁
Fire (%) 0 1.00 0.33 0.18 0.06 0.19 0.32 0.43 1.34 ▇▇▁▁▁
Heat-related (hot and cold exposure) (%) 0 1.00 0.10 0.13 0.01 0.04 0.07 0.11 1.17 ▇▁▁▁▁
Natural disasters (%) 0 1.00 0.09 1.28 0.00 0.00 0.00 0.02 65.29 ▇▁▁▁▁
Conflict (%) 1398 0.77 0.29 2.40 0.00 0.00 0.00 0.02 82.32 ▇▁▁▁▁
Terrorism (%) 1398 0.77 0.04 0.23 0.00 0.00 0.00 0.00 5.88 ▇▁▁▁▁

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

The global_morality dataset needs to be cleaned before I can start analyzing it to address my research question. Here are the steps I plan to take:

  1. I will only focus on Tuberculosis, so I need to remove the other listed causes of mortality. I will keep the country and year columns as well.

  2. Keeping all the countries makes it difficult to visualize the data because there are so many values. Therefore, I will only keep eight countries: India, Indonesia, China, Philippines, Pakistan, Nigeria, Bangladesh, South Africa.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

The values don’t need to be recoded at this time because the data are already in the form that I am interested in analyzing (countries are a categorical variable and mortality rate are a continuous variable).

Subset of the data and think of a question to answer the subset

As outlined above, first I need to remove the columns of mortality causes that I am not interested in within the context of my research question. I will also want to keep the following columns: country and year.

tuberculosis_mortality <- global_mortality %>% 
  select(country, year, `Tuberculosis (%)`)

# This code came from lecture 2 of BSTA 504

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

skim(tuberculosis_mortality)
Data summary
Name tuberculosis_mortality
Number of rows 6156
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 4 32 0 228 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2003.00 7.79 1990.00 1996.00 2003.00 2010.00 2016.00 ▇▇▇▇▇
Tuberculosis (%) 0 1 2.13 2.65 0.01 0.24 0.91 3.33 16.47 ▇▂▁▁▁
glimpse(tuberculosis_mortality)
## Rows: 6,156
## Columns: 3
## $ country            <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
## $ year               <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 19…
## $ `Tuberculosis (%)` <dbl> 5.877075, 5.891704, 6.034669, 6.160951, 6.142278, …

Step 1 of my data management looks complete. I have all the columns in my dataset that are relevant to my research question: mortality (%) of Tuberculosis, year, and country (name).

Now I can move on to step 2: narrowing my dataset down to eight countries of interest:

tuberculosis_mortality_8 <- c("India", "Indonesia", "China", "Philippines", "Pakistan", "Nigeria", "Bangladesh", "South Africa")

Now to complete my dataset:

Tb_mortality_final <- tuberculosis_mortality %>%
  filter(country %in% tuberculosis_mortality_8)

Now the dataset is ready for analysis so I can start answering my research question.

skim(Tb_mortality_final)
Data summary
Name Tb_mortality_final
Number of rows 216
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 5 12 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2003.00 7.81 1990.00 1996.00 2003.00 2010.00 2016.00 ▇▇▇▇▇
Tuberculosis (%) 0 1 4.84 2.46 0.42 3.16 4.89 6.25 11.22 ▅▆▇▃▂

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

My dataset were already merged before it was uploaded onto TidyTuesday so I will not do any additional merging at this time.

Are the values what you expected for the variables? Why or Why not?

I was surprised that the mean Tuberculosis mortality rate is only 4.84% over the span of 26 years across the countries surveyed. This rate is a lot lower than I had expected because about 1/4 of the world is infected with it and the countries. Here are two thoughts that could account for this discrepancy from my expectation: 1) There are two types of Tuberculosis: latent and active. Therefore, just because someone has Tuberculosis, it does not mean that they die. 2) There is treatment for Tuberculosis, which could improve mortality rates.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

Since I am looking at data across 26 years, I will examine the mean mortality rate:

Tb_mortality_final %>%
  group_by(country) %>%
  summarize(mean_mortality = mean(`Tuberculosis (%)`)) %>%
  arrange(desc(mean_mortality))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
##   country      mean_mortality
##   <chr>                 <dbl>
## 1 Indonesia              8.77
## 2 India                  6.33
## 3 Philippines            6.31
## 4 Pakistan               4.87
## 5 South Africa           4.86
## 6 Nigeria                3.90
## 7 Bangladesh             2.61
## 8 China                  1.08

What are your findings about the summary? Are they what you expected?

The findings of my summary show that at 8.78% (rounded to three significant figures) Indonesia has the highest mortality rate from Tuberculosis. These findings were not what I had expected because I thought India would have the highest mortality rate. In hindsight, the information I looked at from the WHO was based on incidence, not mortality. And although India has a high population within a concentrated area, the incidence of Tuberculosis in India could be latent Tuberculosis and not active Tuberculosis, which would impact the mortality rate. However, it is worth noting that India has the second highest mortality rate (6.33%).

Make at least two plots that help you answer your question on the transformed or summarized data.

ggplot(Tb_mortality_final) +
  aes(x=year, y=country, fill=`Tuberculosis (%)`)+
  geom_tile() + 
  labs(x="Year",
       y="Country",
       Title="Tuberculosis Mortality (%) 1990 - 2016")

# This code adapted from BSTA 503 Lecture from part 5

The above heatmap shows that Indonesia has the highest mortality rate due to Tuberculosis. And the highest rate of mortality (~9%) is observed at the beginning of the specified time period (1990) and appears to improve as time progresses. The Philippines and India appear to have the next highest rates of Tuberculosis mortality. Of the eight countries studied, China appears to have the lowest Tuberculosis mortality rate at ~3%.

ggplot(Tb_mortality_final) +

  aes(x = year, 
      y = `Tuberculosis (%)`, 
      color = country) +
  
  geom_point() +

  theme_bw() +
  
   labs(title = "Tuberculosis Mortality (%) 1990 - 2016 By Country",
       x = "Year",
       y = "Tuberculosis Mortality (%)")

  # This code adapted from BSTA 503 Lecture from part 3

The above scatterplot depicts the mortality rate of Tuberculosis between 1990-2016. Overall, the highest rates of Tuberculosis mortality per year occur in 1990 with the highest mortality rate occurring in Indonesia (9%+). As time passes, the mortality rate decreases but overall trends stay consistent when comparing different countries (Indonesia has the highest mortality rates and China has the lowest).

Final Summary (10 points)

Summarize your research question and findings below. My research question was: which country had the highest mean mortality attributed to Tuberculosis from 1990 to 2016 (over 26 years)?

I found that the country with the highest mean Tuberculosis mortality rate between 1990 and 2016 was Indonesia (8.78%).

The country with the second highest mortality mean was India (6.33%). Of the eight countries studied, China had the lowest mean mortality (1.08%).

Are your findings what you expected? Why or Why not?

My findings were not what I had expected because I had hypothesized that India would have the greatest mortality rate because the WHO had listed India as one of the top eight nations with the highest Tb cases and I had believed their dense population would increase transmission. However, there are two types of Tuberculosis: latent vs. active. Therefore, having a high rate of Tuberculosis is not correlated with death secondary to Tuberculosis.

Overall, the Tuberculosis mortality rates were less than I had anticipated. I had expected larger mortality rates because at least 1/4 of the world population is diagnosed with Tuberculosis and it is the world’s leading cause of death by an infectious agent. The lower than anticipated mortality rate could be because there are two types of Tuberculosis: latent and active. Therefore, diagnosis of Tuberculosis is not necessarily a death sentence. Surveillance of Tuberculosis and aggressive treatment can also improve outcomes and reduce mortality. Because the prevalence of Tuberculosis is greater in India than Indonesia, the interventions available for treatment might be more robust which could explain why Indonesia has worse mortality outcomes.