Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
.csv
file into your data
folder.You may use another dataset or your own data, but please make sure it is de-identified.
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I am interested in the differences in gestation time and birthweight among smoking and non-smoking expectant mothers. This dataset is from a 1960’s-1970’s cross-sectional study (Yerushalmy, Johns Hopkins University, 1970) aiming to find an association between smoking and low birthweight. At the time, smoking research was in its infancy and the deleterious effects of smoking were not widely known. My research question is: Does length of gestation differ among smoking versus non-smoking women?
Given your question, what is your expectation about the data?
I expect that women that who smoke have shorter periods of gestation. I also expect that shorter gestation duration may lead to infants that weighed less.
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
Rows: 680
Columns: 6
$ bwt <dbl> 7.3, 8.0, 7.5, 7.0, 5.3, 8.6, 9.1, 6.5, 3.3, 8.1, 7…
$ gestwks <dbl> 37, 41, 39, 39, 37, 43, 40, 37, 29, 41, 40, 39, 41,…
$ age <dbl> 33, 28, 32, 27, 32, 30, 23, 27, 32, 28, 26, 19, 37,…
$ mnocig <dbl> 25, 0, 0, 2, 17, 0, 0, 17, 0, 0, 25, 0, 25, 17, 0, …
$ mheight <dbl> 66, 63, 61, 68, 67, 63, 65, 64, 64, 66, 61, 65, 63,…
$ mppwt <dbl> 140, 130, 126, 150, 112, 131, 134, 125, 142, 113, 1…
chds
# A tibble: 680 x 6
bwt gestwks age mnocig mheight mppwt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7.30 37 33 25 66 140
2 8 41 28 0 63 130
3 7.5 39 32 0 61 126
4 7 39 27 2 68 150
5 5.30 37 32 17 67 112
6 8.60 43 30 0 63 131
7 9.10 40 23 0 65 134
8 6.5 37 27 17 64 125
9 3.3 29 32 0 64 142
10 8.10 41 28 0 66 113
# … with 670 more rows
skim(chds)
Name | chds |
Number of rows | 680 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bwt | 0 | 1 | 7.52 | 1.09 | 3.3 | 6.8 | 7.6 | 8.2 | 11.4 | ▁▂▇▃▁ |
gestwks | 0 | 1 | 39.77 | 1.88 | 29.0 | 39.0 | 40.0 | 41.0 | 48.0 | ▁▁▇▃▁ |
age | 0 | 1 | 25.86 | 5.46 | 15.0 | 21.0 | 25.0 | 29.0 | 42.0 | ▅▇▇▂▁ |
mnocig | 0 | 1 | 7.43 | 11.27 | 0.0 | 0.0 | 0.0 | 12.0 | 50.0 | ▇▁▂▁▁ |
mheight | 0 | 1 | 64.43 | 2.48 | 57.0 | 63.0 | 64.0 | 66.0 | 71.0 | ▁▃▇▅▁ |
mppwt | 0 | 1 | 126.90 | 17.88 | 85.0 | 115.0 | 125.0 | 135.0 | 246.0 | ▅▇▁▁▁ |
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
# categorization of mnocig to smoke_status
chds <- chds %>% mutate(smoking_status = case_when(
mnocig >= 1 ~ "smoker",
mnocig == 0 ~ 'nonsmoker'
))
# creation of new bmi variable
chds$bmi <- 703*chds$mppwt/chds$mheight^2
head(chds)
# A tibble: 6 x 8
bwt gestwks age mnocig mheight mppwt smoking_status bmi
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 7.30 37 33 25 66 140 smoker 22.6
2 8 41 28 0 63 130 nonsmoker 23.0
3 7.5 39 32 0 61 126 nonsmoker 23.8
4 7 39 27 2 68 150 smoker 22.8
5 5.30 37 32 17 67 112 smoker 17.5
6 8.60 43 30 0 63 131 nonsmoker 23.2
glimpse(chds)
Rows: 680
Columns: 8
$ bwt <dbl> 7.3, 8.0, 7.5, 7.0, 5.3, 8.6, 9.1, 6.5, 3.3,…
$ gestwks <dbl> 37, 41, 39, 39, 37, 43, 40, 37, 29, 41, 40, …
$ age <dbl> 33, 28, 32, 27, 32, 30, 23, 27, 32, 28, 26, …
$ mnocig <dbl> 25, 0, 0, 2, 17, 0, 0, 17, 0, 0, 25, 0, 25, …
$ mheight <dbl> 66, 63, 61, 68, 67, 63, 65, 64, 64, 66, 61, …
$ mppwt <dbl> 140, 130, 126, 150, 112, 131, 134, 125, 142,…
$ smoking_status <chr> "smoker", "nonsmoker", "nonsmoker", "smoker"…
$ bmi <dbl> 22.59412, 23.02595, 23.80489, 22.80493, 17.5…
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
glimpse(chds)
Rows: 680
Columns: 8
$ bwt <dbl> 7.3, 8.0, 7.5, 7.0, 5.3, 8.6, 9.1, 6.5, 3.3,…
$ gestwks <dbl> 37, 41, 39, 39, 37, 43, 40, 37, 29, 41, 40, …
$ age <dbl> 33, 28, 32, 27, 32, 30, 23, 27, 32, 28, 26, …
$ mnocig <dbl> 25, 0, 0, 2, 17, 0, 0, 17, 0, 0, 25, 0, 25, …
$ mheight <dbl> 66, 63, 61, 68, 67, 63, 65, 64, 64, 66, 61, …
$ mppwt <dbl> 140, 130, 126, 150, 112, 131, 134, 125, 142,…
$ smoking_status <chr> "smoker", "nonsmoker", "nonsmoker", "smoker"…
$ bmi <dbl> 22.59412, 23.02595, 23.80489, 22.80493, 17.5…
skim(chds)
Name | chds |
Number of rows | 680 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
smoking_status | 0 | 1 | 6 | 9 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bwt | 0 | 1 | 7.52 | 1.09 | 3.30 | 6.80 | 7.60 | 8.20 | 11.4 | ▁▂▇▃▁ |
gestwks | 0 | 1 | 39.77 | 1.88 | 29.00 | 39.00 | 40.00 | 41.00 | 48.0 | ▁▁▇▃▁ |
age | 0 | 1 | 25.86 | 5.46 | 15.00 | 21.00 | 25.00 | 29.00 | 42.0 | ▅▇▇▂▁ |
mnocig | 0 | 1 | 7.43 | 11.27 | 0.00 | 0.00 | 0.00 | 12.00 | 50.0 | ▇▁▂▁▁ |
mheight | 0 | 1 | 64.43 | 2.48 | 57.00 | 63.00 | 64.00 | 66.00 | 71.0 | ▁▃▇▅▁ |
mppwt | 0 | 1 | 126.90 | 17.88 | 85.00 | 115.00 | 125.00 | 135.00 | 246.0 | ▅▇▁▁▁ |
bmi | 0 | 1 | 21.47 | 2.63 | 15.55 | 19.74 | 21.12 | 22.59 | 39.7 | ▅▇▁▁▁ |
head(chds)
# A tibble: 6 x 8
bwt gestwks age mnocig mheight mppwt smoking_status bmi
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 7.30 37 33 25 66 140 smoker 22.6
2 8 41 28 0 63 130 nonsmoker 23.0
3 7.5 39 32 0 61 126 nonsmoker 23.8
4 7 39 27 2 68 150 smoker 22.8
5 5.30 37 32 17 67 112 smoker 17.5
6 8.60 43 30 0 63 131 nonsmoker 23.2
Are the values what you expected for the variables? Why or Why not?
The variable for birthweight (bwt) looks like follows a normal distribution, which makes sense. The age ranges for the women are 15 to 42, which also make sense, as these ages are generally when women have chidren. The data on cigarettes per day, age, pre-pregnancy weight, height, and BMI were right-skewed, indicating that smaller values dominated the data for these variables.
Use
group_by()/summarize()
to make a summary of the data here. The summary should be relevant to your research question
chds <- chds %>% group_by(smoking_status)
chds %>% summarize(mean_gest = mean(gestwks), mean_bwt = mean(bwt), mean_bmi = mean(bmi))
# A tibble: 2 x 4
smoking_status mean_gest mean_bwt mean_bmi
<chr> <dbl> <dbl> <dbl>
1 nonsmoker 39.9 7.73 21.8
2 smoker 39.6 7.24 21.0
# mean gestwks in nonsmokers is 39.93438 and 39.56187 in smokers
# mean bwt in nonsmokers 7.732808 and 7.240803 in nonsmokers
# mean bmi in nonsmokers is 21.81487 and 21.02475 in smokers
# creating a Q-Q plot to see if gestwks is normally distributed enough to use a two-sample t test comparing the means between the groups (otherwise non-parametric test would be needed)
qqnorm(chds$gestwks)
# looks close enough to normal distribution
qqnorm(chds$bmi)
# bmi looks less normal, will use Wilcoxon test to compare means
nonsmoker_gestwks <- filter(chds, smoking_status == "nonsmoker") %>% pull(gestwks)
smoker_gestwks <- filter(chds, smoking_status == "smoker") %>% pull(gestwks)
t.test(nonsmoker_gestwks, smoker_gestwks)
Welch Two Sample t-test
data: nonsmoker_gestwks and smoker_gestwks
t = 2.581, df = 639.57, p-value = 0.01007
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.08909663 0.65592396
sample estimates:
mean of x mean of y
39.93438 39.56187
# at 0.05 significance level, p-value 0.01, means different
nonsmoker_bwt <- filter(chds, smoking_status == "nonsmoker") %>% pull(bwt)
smoker_bwt <- filter(chds, smoking_status == "smoker") %>% pull(bwt)
nonsmoker_bmi <- filter(chds, smoking_status == "nonsmoker") %>% pull(bmi)
smoker_bmi <- filter(chds, smoking_status == "smoker") %>% pull(bmi)
t.test(nonsmoker_bwt, smoker_bwt)
Welch Two Sample t-test
data: nonsmoker_bwt and smoker_bwt
t = 5.957, df = 631.7, p-value = 4.266e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3298171 0.6541943
sample estimates:
mean of x mean of y
7.732808 7.240803
# p-value <0.001, means different between groups
wilcox.test(nonsmoker_bmi, smoker_bmi)
Wilcoxon rank sum test with continuity correction
data: nonsmoker_bmi and smoker_bmi
W = 68448, p-value = 6.24e-06
alternative hypothesis: true location shift is not equal to 0
# p-value <0.001, this may either be a confounder or a mediator between smoking and lower gestwks or birthweight
What are your findings about the summary? Are they what you expected?
Yes, I expected that mothers who smoke while they are pregnant would have shorter gestation periods and smaller infants. This is based on knowledge from extant literature that was unavailable at the time that this study took place.
Make at least two plots that help you answer your question on the transformed or summarized data.
ggplot(chds, aes(x=smoking_status, y=gestwks, fill=smoking_status)) + geom_boxplot() + ggtitle("Length of Gestation \n by Smoking Status") + ylab("Gestation (weeks)") + xlab(NULL) + labs(fill=NULL) + theme_bw()
ggplot(chds, aes(x=smoking_status, y=bwt, fill=smoking_status)) + geom_boxplot() + ggtitle("Infant Birthweight \n by Smoking Status") + ylab("Birthweight (lbs)") + xlab(NULL) + labs(fill=NULL) + theme_bw()
ggplot(chds, aes(x=smoking_status, y=bmi, fill=smoking_status)) + geom_boxplot() + ggtitle("BMI by Smoking Status") + ylab("BMI") + xlab(NULL) + labs(fill=NULL) + theme_bw()
Summarize your research question and findings below.
Smokers had shorter gestation duration and had lower birthweight infants than nonsmokers, but they also had lower BMI’s, which could have had an effect on the true association between smoking status and gestation and infant birthweight. However, these differences are subtle and would warrant further research.
Are your findings what you expected? Why or Why not?
Yes, they are what I expected. From prior literature review, I have learned that there is an association between smoking during pregnancy and lower birthweight infants. However, I am not entirely sure what the real assocation between smoking and the length of gestation is. BMI seemed appropriate to include because of the likelihood of there being a relationship between BMI and gestation, birthweight, and possibly smoking status.