Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
# This code came from-- this work is my own, except for the code suggested to me by Dr. Laderas from our meeting, namely the use of tidytext to count words in a single cell--but I guess if we're counting that then everything in this came from Dr. Laderas in that I learned it through this class.
Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
.csv
file into your data
folder.tuesdata <- tidytuesdayR::tt_load(2020, week = 33)
## --- Compiling #TidyTuesday Information for 2020-08-11 ----
## --- There are 2 files available ---
## --- Starting Download ---
##
## Downloading file 1 of 2: `avatar.csv`
## Downloading file 2 of 2: `scene_description.csv`
## --- Download complete ---
avatar <- tuesdata$avatar
You may use another dataset or your own data, but please make sure it is de-identified.
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or Me know that you’ll be working together.
No acknowledgments of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
My research question is: how does the number of words Katara says in an episode affect the IMDB rating of the episode, and how does this compare to other characters?
There are a few reasons why this question is of interest. One: this is a popular show that has recently had a bit of resurgence in popularity from netflix adding it back to streaming and two: Katara is a particularly polarizing character in the show and as a result, is likely to have an effect on the reception of episodes. > Given your question, what is your expectation about the data?
My expectation is that because of Katara’s polarizing-ness, IMDB ratings will be lower for episodes where she says more words. What I’m unsure about is how this will compare to other characters, that is, is this a trend for how people feel about dialogue heavy episodes in general, or just Katara heavy episodes?
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
Because of the nature of the dataset, only certain values in the skim() output are of much interest. notably, n_unique is helpful as the episodes are broken up by scene rather than by episode. There are 3 books (seasons), 61 episodes, 374 characters, 23 writers, and 7 directors. As far as descriptive statistics goes, The show has a mean IMDB rating of 8.62, with a range of 7.1 to 9.8, though its possible the mean and standard deviation (0.57) is skewed by the dataset having different number of rows per episode
skim(avatar)
Name | avatar |
Number of rows | 13385 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
book | 0 | 1.00 | 4 | 5 | 0 | 3 | 0 |
chapter | 0 | 1.00 | 3 | 42 | 0 | 61 | 0 |
character | 0 | 1.00 | 2 | 26 | 0 | 374 | 0 |
full_text | 0 | 1.00 | 3 | 6032 | 0 | 13205 | 0 |
character_words | 3393 | 0.75 | 3 | 1008 | 0 | 9575 | 0 |
writer | 0 | 1.00 | 8 | 192 | 0 | 23 | 0 |
director | 0 | 1.00 | 11 | 23 | 0 | 7 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 6693.00 | 3864.06 | 1.0 | 3347.0 | 6693.0 | 10039.0 | 13385.0 | ▇▇▇▇▇ |
book_num | 0 | 1.00 | 1.98 | 0.82 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 | ▇▁▇▁▇ |
chapter_num | 0 | 1.00 | 10.43 | 5.70 | 1.0 | 5.0 | 10.0 | 15.0 | 21.0 | ▇▆▆▇▅ |
imdb_rating | 161 | 0.99 | 8.62 | 0.57 | 7.1 | 8.2 | 8.7 | 9.1 | 9.8 | ▁▅▇▇▂ |
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
I see two potential difficulties regarding the research question at hand; the first is that each row is not a single episode but a single scene, whereas our outcome of interest varies by episode, not by scene. The second is that the actual words are listed rather than the number of words each character says per scene. And lastly is that there are some rows for scene description where the character_words variable is NA that need accounting for.
A thought for dealing with these issues is creating a variable that sums the words by episode, and then from that variable a new data frame that includes only the first value of every episode so we have one value per episode. In this process we can code NA=0 for character_words so the sum doesn’t get messed up.
Make sure your data types are correct!
character_words is going to be treated as a numeric value, but that transformation will be in the form of creating a new variable so for now it is okay for it to be a categorical variable.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
# at this point we should make a decision, what characters are we going to compare?
# I'd like to compare Katara to Sokka and Aang (seperately), since those are less controversial, more beloved characters
# first split into 3 datasets
Katara <- avatar %>%
filter(character=="Katara")
Sokka <- avatar %>%
filter(character=="Sokka")
Aang <- avatar %>%
filter(character=="Aang")
# then create a num_words variable
Katara_words <- Katara %>%
unnest_tokens(input=character_words,
output=words,
token="words",
drop=T)
Katara_count <- Katara_words %>%
group_by(chapter) %>%
count("words")
Katara_chap <- Katara %>%
group_by(chapter)%>%
summarize(imdb_rating)
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)
Katara_chap_dist <- distinct(Katara_chap)
katara_final <- inner_join(x=Katara_count, y=Katara_chap_dist, by = c("chapter"="chapter")) # here we have some flexibility around whether to use inner join, left join, or right join since there are no rows in one dataframe that do not have a match in our key column in the other dataframe; I just chose to use inner_join because I'm more comfortable with that one.
# I wanna try to make that a function, so I can do it for any character.
character_final <- function(df, char){
character <- df %>% filter(character==char)
character_words <- character %>%
unnest_tokens(input=character_words,
output=words,
token="words",
drop=T)
character_count <- character_words %>%
group_by(chapter)%>%
count("words")
character_chap <- character %>%
group_by(chapter) %>%
summarize(imdb_rating)
character_chap_dist <- distinct(character_chap)
character_final <- inner_join(x=character_count, y=character_chap_dist, by = c("chapter"="chapter"))
return(character_final)
}
sokka_final<-character_final(avatar, "Sokka")
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)
Aang_final<-character_final(avatar, "Aang")
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)
# honestly pretty stoked that worked, now that its pretty easy to do it with multiple characters I want to include Iroh. Theoretically I could do a for loop to get this for every character, but I think that might be a bit much.
Iroh_final<-character_final(avatar,"Iroh")
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
# there are 4 transformed tables now:
head(katara_final)
## # A tibble: 6 x 4
## # Groups: chapter [6]
## chapter `"words"` n imdb_rating
## <chr> <chr> <int> <dbl>
## 1 Avatar Day words 280 7.5
## 2 Bato of the Water Tribe words 206 7.9
## 3 Bitter Work words 217 8.6
## 4 City of Walls and Secrets words 228 8.7
## 5 Imprisoned words 673 8
## 6 Jet words 421 7.8
head(sokka_final)
## # A tibble: 6 x 4
## # Groups: chapter [6]
## chapter `"words"` n imdb_rating
## <chr> <chr> <int> <dbl>
## 1 Avatar Day words 427 7.5
## 2 Bato of the Water Tribe words 421 7.9
## 3 Bitter Work words 375 8.6
## 4 City of Walls and Secrets words 386 8.7
## 5 Imprisoned words 415 8
## 6 Jet words 525 7.8
head(Aang_final)
## # A tibble: 6 x 4
## # Groups: chapter [6]
## chapter `"words"` n imdb_rating
## <chr> <chr> <int> <dbl>
## 1 Appa's Lost Days words 14 8.8
## 2 Avatar Day words 236 7.5
## 3 Bato of the Water Tribe words 262 7.9
## 4 Bitter Work words 440 8.6
## 5 City of Walls and Secrets words 188 8.7
## 6 Imprisoned words 157 8
head(Iroh_final)
## # A tibble: 6 x 4
## # Groups: chapter [6]
## chapter `"words"` n imdb_rating
## <chr> <chr> <int> <dbl>
## 1 Appa's Lost Days words 7 8.8
## 2 Avatar Day words 136 7.5
## 3 Bato of the Water Tribe words 61 7.9
## 4 Bitter Work words 638 8.6
## 5 City of Walls and Secrets words 145 8.7
## 6 Lake Laogai words 289 9.1
Are the values what you expected for the variables? Why or Why not?
To the extent that I expected it to work, the values are what I expected, which is pretty cool.
Use
group_by()/summarize()
to make a summary of the data here. The summary should be relevant to your research question
katara_final %>%
group_by(imdb_rating)%>%
summary()
## chapter "words" n imdb_rating
## Length:59 Length:59 Min. : 22.0 Min. :7.10
## Class :character Class :character 1st Qu.:119.0 1st Qu.:8.20
## Mode :character Mode :character Median :213.0 Median :8.70
## Mean :253.6 Mean :8.65
## 3rd Qu.:370.5 3rd Qu.:9.10
## Max. :690.0 Max. :9.80
## NA's :1
sokka_final %>%
group_by(imdb_rating)%>%
summary()
## chapter "words" n imdb_rating
## Length:59 Length:59 Min. : 16.0 Min. :7.10
## Class :character Class :character 1st Qu.:171.5 1st Qu.:8.20
## Mode :character Mode :character Median :299.0 Median :8.70
## Mean :310.1 Mean :8.65
## 3rd Qu.:424.0 3rd Qu.:9.10
## Max. :788.0 Max. :9.80
## NA's :1
Aang_final %>%
group_by(imdb_rating)%>%
summary()
## chapter "words" n imdb_rating
## Length:60 Length:60 Min. : 6.0 Min. :7.100
## Class :character Class :character 1st Qu.:168.8 1st Qu.:8.200
## Mode :character Mode :character Median :275.5 Median :8.700
## Mean :297.0 Mean :8.653
## 3rd Qu.:388.5 3rd Qu.:9.100
## Max. :769.0 Max. :9.800
## NA's :1
Iroh_final %>%
group_by(imdb_rating)%>%
summary()
## chapter "words" n imdb_rating
## Length:36 Length:36 Min. : 7.00 Min. :7.500
## Class :character Class :character 1st Qu.: 69.75 1st Qu.:8.350
## Mode :character Mode :character Median :129.00 Median :8.800
## Mean :145.89 Mean :8.783
## 3rd Qu.:176.75 3rd Qu.:9.100
## Max. :638.00 Max. :9.800
## NA's :1
# a statistic that may be useful is the slope coefficient of words vs imdb rating. To make it easier to interpret, we will multiple by 100
character_slope <- function(df){
model <- lm(imdb_rating~n,df)
slope <- 100*model$coefficients[2]
return(slope)
}
character_slope(katara_final)
## n
## -0.1471868
character_slope(sokka_final)
## n
## -0.1300191
character_slope(Aang_final)
## n
## -0.1517895
character_slope(Iroh_final)
## n
## 0.004841303
What are your findings about the summary? Are they what you expected?
Episodes in which Katara had lines had a mean of 253 words spoken by her, and a mean imdb rating of 8.65.
Episodes in which Sokka had lines had a mean of 310 words spoken by him, and a mean imdb rating of 8.65.
Episodes in which Aang had lines had a mean of 297 words spoken by him, and a mean imdb rating of 8.65.
Episodes in which Iroh had lines had a mean of 145 words spoken by him, and a mean imdb rating of 8.78.
It is worth noting that episodes in which Sokka, Aang, or Katara had lines all had the same mean imdb rating, which is likely a result of the fact that most episodes in which one had lines, all had lines.
for every 100 words Katara speaks, the imdb rating decreases by 0.147. For every 100 words sokka speaks, the imdb rating decreases by 0.130, and for every 100 words Aang speaks, the imdb rating decreases by 0.125. On the other hand, for every 100 words Iroh speaks, the imdb rating increases by 0.005
The finding about Iroh is what I expected, however I am surprised to learn that Aang is apparently less popular than katara, and that Sokka and Aang’s lines do not have positive effects on the imdb rating. It may be that this suggests that generally, episodes with more dialogue are received more poorly.
Make at least two plots that help you answer your question on the transformed or summarized data.
katara_final%>%
ggplot(aes(x=n, y=imdb_rating))+
geom_point()+
geom_smooth(method=lm)+
labs(title="scatterplot of Katara's words vs IMDB rating", x="Katara words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
sokka_final %>%
ggplot(aes(x=n, y=imdb_rating))+
geom_point()+
geom_smooth(method=lm)+
labs(title="scatterplot of Sokka's words vs IMDB rating", x="Sokka words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Aang_final %>%
ggplot(aes(x=n, y=imdb_rating))+
geom_point()+
geom_smooth(method=lm)+
labs(title="scatterplot of Aang's words vs IMDB rating", x="Aang words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Iroh_final%>%
ggplot(aes(x=n, y=imdb_rating))+
geom_point()+
geom_smooth(method=lm)+
labs(title="scatterplot of Iroh's words vs IMDB rating", x="Iroh words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Summarize your research question and findings below.
In our crude model, Katara’s lines were associated with a decrease in the imdb rating of episodes.
Without getting too heavily into the weeds of linear regression, characters’ popularity based on the relationship of their total word count in an episode to the imdb rating of that episode is, in order from most to least popular, Iroh, Sokka, Katara, Aang.
Are your findings what you expected? Why or Why not?
While the answer to our main question of interest, “do episodes in which Katara has more lines have lower imdb ratings?”, is roughly what I expected, The rest of the findings are not what I expected; I certainly didn’t expect the crude relationship of Aang and Sokka’s words to imdb rating to be so negative. There are two possibilities here; one is that my finger is not so clearly on the pulse of the ATLA fandom as to be able to assess the popularity of various characters (likely true). The other, and this is probably a stronger effect, is that I am only comparing crude linear models, and that transformations to fit the assumptions of linear regression and adjusting for confounding, interaction effects, and other potential predictors may create a better understanding of each character’s actual popularity.