Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows or less, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

Please schedule a time with Eric or Me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I am interested in the differences in gestation time and birthweight among smoking and non-smoking expectant mothers. This dataset is from a 1960’s-1970’s cross-sectional study (Yerushalmy, Johns Hopkins University, 1970) aiming to find an association between smoking and low birthweight. At the time, smoking research was in its infancy and the deleterious effects of smoking were not widely known. My research question is: Does length of gestation differ among smoking versus non-smoking women?

Given your question, what is your expectation about the data?

I expect that women that who smoke have shorter periods of gestation. I also expect that shorter gestation duration may lead to infants that weighed less.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

chds <- read_csv("data/CHDS.csv")

glimpse(chds)

Rows: 680
Columns: 6
$ bwt     <dbl> 7.3, 8.0, 7.5, 7.0, 5.3, 8.6, 9.1, 6.5, 3.3, 8.1, 7…
$ gestwks <dbl> 37, 41, 39, 39, 37, 43, 40, 37, 29, 41, 40, 39, 41,…
$ age     <dbl> 33, 28, 32, 27, 32, 30, 23, 27, 32, 28, 26, 19, 37,…
$ mnocig  <dbl> 25, 0, 0, 2, 17, 0, 0, 17, 0, 0, 25, 0, 25, 17, 0, …
$ mheight <dbl> 66, 63, 61, 68, 67, 63, 65, 64, 64, 66, 61, 65, 63,…
$ mppwt   <dbl> 140, 130, 126, 150, 112, 131, 134, 125, 142, 113, 1…

chds

# A tibble: 680 x 6
     bwt gestwks   age mnocig mheight mppwt
   <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl>
 1  7.30      37    33     25      66   140
 2  8         41    28      0      63   130
 3  7.5       39    32      0      61   126
 4  7         39    27      2      68   150
 5  5.30      37    32     17      67   112
 6  8.60      43    30      0      63   131
 7  9.10      40    23      0      65   134
 8  6.5       37    27     17      64   125
 9  3.3       29    32      0      64   142
10  8.10      41    28      0      66   113
# … with 670 more rows

skim(chds)

Table 1: Data summary
Name	chds
Number of rows	680
Number of columns	6
_______________________
Column type frequency:
numeric	6
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bwt	1	7.52	1.09	3.3	6.8	7.6	8.2	11.4	▁▂▇▃▁
gestwks	1	39.77	1.88	29.0	39.0	40.0	41.0	48.0	▁▁▇▃▁
age	1	25.86	5.46	15.0	21.0	25.0	29.0	42.0	▅▇▇▂▁
mnocig	1	7.43	11.27	0.0	0.0	0.0	12.0	50.0	▇▁▂▁▁
mheight	1	64.43	2.48	57.0	63.0	64.0	66.0	71.0	▁▃▇▅▁
mppwt	1	126.90	17.88	85.0	115.0	125.0	135.0	246.0	▅▇▁▁▁

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

# categorization of mnocig to smoke_status
chds <- chds %>% mutate(smoking_status = case_when(
  mnocig >= 1 ~ "smoker",
  mnocig == 0 ~ 'nonsmoker'
))

# creation of new bmi variable
chds$bmi <- 703*chds$mppwt/chds$mheight^2

head(chds)

# A tibble: 6 x 8
    bwt gestwks   age mnocig mheight mppwt smoking_status   bmi
  <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl> <chr>          <dbl>
1  7.30      37    33     25      66   140 smoker          22.6
2  8         41    28      0      63   130 nonsmoker       23.0
3  7.5       39    32      0      61   126 nonsmoker       23.8
4  7         39    27      2      68   150 smoker          22.8
5  5.30      37    32     17      67   112 smoker          17.5
6  8.60      43    30      0      63   131 nonsmoker       23.2

glimpse(chds)

Rows: 680
Columns: 8
$ bwt            <dbl> 7.3, 8.0, 7.5, 7.0, 5.3, 8.6, 9.1, 6.5, 3.3,…
$ gestwks        <dbl> 37, 41, 39, 39, 37, 43, 40, 37, 29, 41, 40, …
$ age            <dbl> 33, 28, 32, 27, 32, 30, 23, 27, 32, 28, 26, …
$ mnocig         <dbl> 25, 0, 0, 2, 17, 0, 0, 17, 0, 0, 25, 0, 25, …
$ mheight        <dbl> 66, 63, 61, 68, 67, 63, 65, 64, 64, 66, 61, …
$ mppwt          <dbl> 140, 130, 126, 150, 112, 131, 134, 125, 142,…
$ smoking_status <chr> "smoker", "nonsmoker", "nonsmoker", "smoker"…
$ bmi            <dbl> 22.59412, 23.02595, 23.80489, 22.80493, 17.5…

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(chds)

Rows: 680
Columns: 8
$ bwt            <dbl> 7.3, 8.0, 7.5, 7.0, 5.3, 8.6, 9.1, 6.5, 3.3,…
$ gestwks        <dbl> 37, 41, 39, 39, 37, 43, 40, 37, 29, 41, 40, …
$ age            <dbl> 33, 28, 32, 27, 32, 30, 23, 27, 32, 28, 26, …
$ mnocig         <dbl> 25, 0, 0, 2, 17, 0, 0, 17, 0, 0, 25, 0, 25, …
$ mheight        <dbl> 66, 63, 61, 68, 67, 63, 65, 64, 64, 66, 61, …
$ mppwt          <dbl> 140, 130, 126, 150, 112, 131, 134, 125, 142,…
$ smoking_status <chr> "smoker", "nonsmoker", "nonsmoker", "smoker"…
$ bmi            <dbl> 22.59412, 23.02595, 23.80489, 22.80493, 17.5…

skim(chds)

Table 2: Data summary
Name	chds
Number of rows	680
Number of columns	8
_______________________
Column type frequency:
character	1
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
smoking_status	0	1	6	9	0	2	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bwt	1	7.52	1.09	3.30	6.80	7.60	8.20	11.4	▁▂▇▃▁
gestwks	1	39.77	1.88	29.00	39.00	40.00	41.00	48.0	▁▁▇▃▁
age	1	25.86	5.46	15.00	21.00	25.00	29.00	42.0	▅▇▇▂▁
mnocig	1	7.43	11.27	0.00	0.00	0.00	12.00	50.0	▇▁▂▁▁
mheight	1	64.43	2.48	57.00	63.00	64.00	66.00	71.0	▁▃▇▅▁
mppwt	1	126.90	17.88	85.00	115.00	125.00	135.00	246.0	▅▇▁▁▁
bmi	1	21.47	2.63	15.55	19.74	21.12	22.59	39.7	▅▇▁▁▁

head(chds)

# A tibble: 6 x 8
    bwt gestwks   age mnocig mheight mppwt smoking_status   bmi
  <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl> <chr>          <dbl>
1  7.30      37    33     25      66   140 smoker          22.6
2  8         41    28      0      63   130 nonsmoker       23.0
3  7.5       39    32      0      61   126 nonsmoker       23.8
4  7         39    27      2      68   150 smoker          22.8
5  5.30      37    32     17      67   112 smoker          17.5
6  8.60      43    30      0      63   131 nonsmoker       23.2

Are the values what you expected for the variables? Why or Why not?

The variable for birthweight (bwt) looks like follows a normal distribution, which makes sense. The age ranges for the women are 15 to 42, which also make sense, as these ages are generally when women have chidren. The data on cigarettes per day, age, pre-pregnancy weight, height, and BMI were right-skewed, indicating that smaller values dominated the data for these variables.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

chds <- chds %>% group_by(smoking_status)

chds %>% summarize(mean_gest = mean(gestwks), mean_bwt = mean(bwt), mean_bmi = mean(bmi))

# A tibble: 2 x 4
  smoking_status mean_gest mean_bwt mean_bmi
  <chr>              <dbl>    <dbl>    <dbl>
1 nonsmoker           39.9     7.73     21.8
2 smoker              39.6     7.24     21.0

# mean gestwks in nonsmokers is 39.93438 and 39.56187 in smokers 
# mean bwt in nonsmokers 7.732808 and 7.240803 in nonsmokers 
# mean bmi in nonsmokers is 21.81487 and 21.02475 in smokers 

# creating a Q-Q plot to see if gestwks is normally distributed enough to use a two-sample t test comparing the means between the groups (otherwise non-parametric test would be needed)
qqnorm(chds$gestwks)

# looks close enough to normal distribution 

qqnorm(chds$bmi)

# bmi looks less normal, will use Wilcoxon test to compare means

nonsmoker_gestwks <- filter(chds, smoking_status == "nonsmoker") %>% pull(gestwks)

smoker_gestwks <- filter(chds, smoking_status == "smoker") %>% pull(gestwks)

t.test(nonsmoker_gestwks, smoker_gestwks)


    Welch Two Sample t-test

data:  nonsmoker_gestwks and smoker_gestwks
t = 2.581, df = 639.57, p-value = 0.01007
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.08909663 0.65592396
sample estimates:
mean of x mean of y 
 39.93438  39.56187

# at 0.05 significance level, p-value 0.01, means different 

nonsmoker_bwt <- filter(chds, smoking_status == "nonsmoker") %>% pull(bwt)

smoker_bwt <- filter(chds, smoking_status == "smoker") %>% pull(bwt)

nonsmoker_bmi <- filter(chds, smoking_status == "nonsmoker") %>% pull(bmi)

smoker_bmi <- filter(chds, smoking_status == "smoker") %>% pull(bmi)

t.test(nonsmoker_bwt, smoker_bwt)


    Welch Two Sample t-test

data:  nonsmoker_bwt and smoker_bwt
t = 5.957, df = 631.7, p-value = 4.266e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3298171 0.6541943
sample estimates:
mean of x mean of y 
 7.732808  7.240803

# p-value <0.001, means different between groups

wilcox.test(nonsmoker_bmi, smoker_bmi)


    Wilcoxon rank sum test with continuity correction

data:  nonsmoker_bmi and smoker_bmi
W = 68448, p-value = 6.24e-06
alternative hypothesis: true location shift is not equal to 0

# p-value <0.001, this may either be a confounder or a mediator between smoking and lower gestwks or birthweight

What are your findings about the summary? Are they what you expected?

Yes, I expected that mothers who smoke while they are pregnant would have shorter gestation periods and smaller infants. This is based on knowledge from extant literature that was unavailable at the time that this study took place.

Make at least two plots that help you answer your question on the transformed or summarized data.

ggplot(chds, aes(x=smoking_status, y=gestwks, fill=smoking_status)) + geom_boxplot() + ggtitle("Length of Gestation \n by Smoking Status") + ylab("Gestation (weeks)") + xlab(NULL) + labs(fill=NULL) + theme_bw()

ggplot(chds, aes(x=smoking_status, y=bwt, fill=smoking_status)) + geom_boxplot() + ggtitle("Infant Birthweight \n by Smoking Status") + ylab("Birthweight (lbs)") + xlab(NULL) + labs(fill=NULL) + theme_bw()

ggplot(chds, aes(x=smoking_status, y=bmi, fill=smoking_status)) + geom_boxplot() + ggtitle("BMI by Smoking Status") + ylab("BMI") + xlab(NULL) + labs(fill=NULL) + theme_bw()

Final Summary (10 points)

Summarize your research question and findings below.

Smokers had shorter gestation duration and had lower birthweight infants than nonsmokers, but they also had lower BMI’s, which could have had an effect on the true association between smoking status and gestation and infant birthweight. However, these differences are subtle and would warrant further research.

Are your findings what you expected? Why or Why not?

Yes, they are what I expected. From prior literature review, I have learned that there is an association between smoking during pregnancy and lower birthweight infants. However, I am not entirely sure what the real assocation between smoking and the length of gestation is. BMI seemed appropriate to include because of the likelihood of there being a relationship between BMI and gestation, birthweight, and possibly smoking status.