Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

# Attribution: this work is my own, except for the code suggested to me by Dr. Laderas in our meeting, namely the use of tidytext to count words in a single cell. (Though if we're counting that, then everything here came from Dr. Laderas, in that I learned it through this class.)

Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
tuesdata <- tidytuesdayR::tt_load(2020, week = 33)
## --- Compiling #TidyTuesday Information for 2020-08-11 ----
## --- There are 2 files available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 2: `avatar.csv`
##  Downloading file 2 of 2: `scene_description.csv`
## --- Download complete ---
avatar <- tuesdata$avatar

You may use another dataset or your own data, but please make sure it is de-identified.

  2. Please schedule a time with Eric or me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or me know that you’ll be working together.

No acknowledgments of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if your thoughts are preliminary.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

My research question is: how does the number of words Katara says in an episode affect the IMDB rating of the episode, and how does this compare to other characters?

There are a few reasons why this question is of interest. First, this is a popular show that has recently had a bit of a resurgence in popularity after Netflix added it back to streaming. Second, Katara is a particularly polarizing character in the show and, as a result, is likely to have an effect on the reception of episodes.

Given your question, what is your expectation about the data?

My expectation is that because of Katara’s polarizing nature, IMDB ratings will be lower for episodes where she says more words. What I’m unsure about is how this will compare to other characters; that is, is this a trend for how people feel about dialogue-heavy episodes in general, or just Katara-heavy episodes?

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

Because of the nature of the dataset, only certain values in the skim() output are of much interest. Notably, n_unique is helpful, as the rows are broken up by scene rather than by episode. There are 3 books (seasons), 61 episodes, 374 characters, 23 writers, and 7 directors. As far as descriptive statistics go, the show has a mean IMDB rating of 8.62, with a range of 7.1 to 9.8, though it's possible the mean and standard deviation (0.57) are skewed by the dataset having a different number of rows per episode.

skim(avatar)
Data summary
Name avatar
Number of rows 13385
Number of columns 11
_______________________
Column type frequency:
character 7
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
book 0 1.00 4 5 0 3 0
chapter 0 1.00 3 42 0 61 0
character 0 1.00 2 26 0 374 0
full_text 0 1.00 3 6032 0 13205 0
character_words 3393 0.75 3 1008 0 9575 0
writer 0 1.00 8 192 0 23 0
director 0 1.00 11 23 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 6693.00 3864.06 1.0 3347.0 6693.0 10039.0 13385.0 ▇▇▇▇▇
book_num 0 1.00 1.98 0.82 1.0 1.0 2.0 3.0 3.0 ▇▁▇▁▇
chapter_num 0 1.00 10.43 5.70 1.0 5.0 10.0 15.0 21.0 ▇▆▆▇▅
imdb_rating 161 0.99 8.62 0.57 7.1 8.2 8.7 9.1 9.8 ▁▅▇▇▂

If there are any quirks that you have to deal with, such as NA coded as something else, or the data being split across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

I see three potential difficulties regarding the research question at hand. The first is that each row is not a single episode but a single scene, whereas our outcome of interest varies by episode, not by scene. The second is that the actual words are listed rather than the number of words each character says per scene. And the third is that there are some rows for scene descriptions where the character_words variable is NA, which need accounting for.

A thought for dealing with these issues is creating a variable that sums the words by episode, and then from that building a new data frame with only one value per episode. In this process we can recode NA as 0 for character_words so the sum doesn’t get thrown off.
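To make that plan concrete, here is a minimal sketch of the NA-to-zero recoding and the per-episode word sum. The data frame here is a tiny hypothetical stand-in for the avatar data (the column names match the real dataset, but the rows are made up):

```r
library(dplyr)

# Hypothetical stand-in for avatar: one row per scene, NA = scene description
toy_scenes <- tibble(
  chapter         = c("Ep1", "Ep1", "Ep2"),
  character_words = c("Hello there", NA, "Water is life"),
  imdb_rating     = c(8.1, 8.1, 9.0)
)

per_episode <- toy_scenes %>%
  # treat NA (scene descriptions) as zero words,
  # otherwise count whitespace-separated tokens
  mutate(n_words = ifelse(is.na(character_words), 0L,
                          lengths(strsplit(character_words, "\\s+")))) %>%
  group_by(chapter) %>%
  summarize(total_words = sum(n_words),
            imdb_rating = first(imdb_rating))
```

This yields one row per episode with the total word count and the episode's rating, which is the shape the analysis needs.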

Make sure your data types are correct!

character_words will eventually be turned into a numeric word count, but that transformation will take the form of creating a new variable, so for now it is okay for it to stay a character variable.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

# at this point we should make a decision: which characters are we going to compare?
# I'd like to compare Katara to Sokka and Aang (separately), since those are less controversial, more beloved characters

# first split into 3 datasets

Katara <- avatar %>%
  filter(character=="Katara")

Sokka <- avatar %>%
  filter(character=="Sokka")

Aang <- avatar %>%
  filter(character=="Aang")

# then create a num_words variable

Katara_words <- Katara %>%
  unnest_tokens(input=character_words,
                output=words,
                token="words",
                drop=T)
Katara_count <- Katara_words %>%
  group_by(chapter) %>%
  count("words")

Katara_chap <- Katara %>%
  group_by(chapter)%>%
  summarize(imdb_rating)
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)
Katara_chap_dist <- distinct(Katara_chap)

katara_final <- inner_join(x=Katara_count, y=Katara_chap_dist, by = c("chapter"="chapter"))
# here we have some flexibility around whether to use inner_join, left_join,
# or right_join, since there are no rows in one data frame that lack a match
# in our key column in the other data frame; I just chose inner_join because
# I'm more comfortable with that one.

# I wanna try to make that a function, so I can do it for any character. 

character_final <- function(df, char){
  character <- df %>% filter(character==char)
  character_words <- character %>%
    unnest_tokens(input=character_words,
                output=words,
                token="words",
                drop=T)
  character_count <- character_words %>%
    group_by(chapter)%>%
    count("words")
  
  character_chap <- character %>%
    group_by(chapter) %>%
    summarize(imdb_rating)
  
  character_chap_dist <- distinct(character_chap)
  
  character_final <- inner_join(x=character_count, y=character_chap_dist, by = c("chapter"="chapter"))
  
  return(character_final)
    
}

sokka_final<-character_final(avatar, "Sokka")
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)
Aang_final<-character_final(avatar, "Aang")
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)
# honestly pretty stoked that worked. Now that it's pretty easy to do this for multiple characters, I want to include Iroh. Theoretically I could do a for loop to get this for every character, but I think that might be a bit much.

Iroh_final<-character_final(avatar,"Iroh")
## `summarise()` regrouping output by 'chapter' (override with `.groups` argument)

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

# there are 4 transformed tables now:
head(katara_final)
## # A tibble: 6 x 4
## # Groups:   chapter [6]
##   chapter                   `"words"`     n imdb_rating
##   <chr>                     <chr>     <int>       <dbl>
## 1 Avatar Day                words       280         7.5
## 2 Bato of the Water Tribe   words       206         7.9
## 3 Bitter Work               words       217         8.6
## 4 City of Walls and Secrets words       228         8.7
## 5 Imprisoned                words       673         8  
## 6 Jet                       words       421         7.8
head(sokka_final)
## # A tibble: 6 x 4
## # Groups:   chapter [6]
##   chapter                   `"words"`     n imdb_rating
##   <chr>                     <chr>     <int>       <dbl>
## 1 Avatar Day                words       427         7.5
## 2 Bato of the Water Tribe   words       421         7.9
## 3 Bitter Work               words       375         8.6
## 4 City of Walls and Secrets words       386         8.7
## 5 Imprisoned                words       415         8  
## 6 Jet                       words       525         7.8
head(Aang_final)
## # A tibble: 6 x 4
## # Groups:   chapter [6]
##   chapter                   `"words"`     n imdb_rating
##   <chr>                     <chr>     <int>       <dbl>
## 1 Appa's Lost Days          words        14         8.8
## 2 Avatar Day                words       236         7.5
## 3 Bato of the Water Tribe   words       262         7.9
## 4 Bitter Work               words       440         8.6
## 5 City of Walls and Secrets words       188         8.7
## 6 Imprisoned                words       157         8
head(Iroh_final)
## # A tibble: 6 x 4
## # Groups:   chapter [6]
##   chapter                   `"words"`     n imdb_rating
##   <chr>                     <chr>     <int>       <dbl>
## 1 Appa's Lost Days          words         7         8.8
## 2 Avatar Day                words       136         7.5
## 3 Bato of the Water Tribe   words        61         7.9
## 4 Bitter Work               words       638         8.6
## 5 City of Walls and Secrets words       145         8.7
## 6 Lake Laogai               words       289         9.1
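A side note on the `"words"` column in the heads above: `count("words")` tallies the literal string "words" within each chapter group, which is why every row contains that constant value. The counts themselves are still correct (one token per row), but a cleaner version of the function could fold the counting and the rating lookup into a single summarize() and skip the distinct()/inner_join() step. A sketch (the name character_final2 is just a hypothetical label):

```r
library(dplyr)
library(tidytext)

# Sketch of a cleaner character_final(): one summarize() per chapter gives
# both the token count and the (constant within an episode) IMDB rating.
character_final2 <- function(df, char) {
  df %>%
    filter(character == char) %>%
    unnest_tokens(output = words, input = character_words, token = "words") %>%
    group_by(chapter) %>%
    summarize(n = n(),
              imdb_rating = first(imdb_rating))
}
```

This works because unnest_tokens() carries the other columns (including imdb_rating) along with each token, so first(imdb_rating) within a chapter recovers the episode's rating.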

Are the values what you expected for the variables? Why or Why not?

To the extent that I expected it to work, the values are what I expected, which is pretty cool.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.

katara_final %>%
  group_by(imdb_rating)%>%
  summary()
##    chapter            "words"                n          imdb_rating  
##  Length:59          Length:59          Min.   : 22.0   Min.   :7.10  
##  Class :character   Class :character   1st Qu.:119.0   1st Qu.:8.20  
##  Mode  :character   Mode  :character   Median :213.0   Median :8.70  
##                                        Mean   :253.6   Mean   :8.65  
##                                        3rd Qu.:370.5   3rd Qu.:9.10  
##                                        Max.   :690.0   Max.   :9.80  
##                                                        NA's   :1
sokka_final %>%
  group_by(imdb_rating)%>%
  summary()
##    chapter            "words"                n          imdb_rating  
##  Length:59          Length:59          Min.   : 16.0   Min.   :7.10  
##  Class :character   Class :character   1st Qu.:171.5   1st Qu.:8.20  
##  Mode  :character   Mode  :character   Median :299.0   Median :8.70  
##                                        Mean   :310.1   Mean   :8.65  
##                                        3rd Qu.:424.0   3rd Qu.:9.10  
##                                        Max.   :788.0   Max.   :9.80  
##                                                        NA's   :1
Aang_final %>%
  group_by(imdb_rating)%>%
  summary()
##    chapter            "words"                n          imdb_rating   
##  Length:60          Length:60          Min.   :  6.0   Min.   :7.100  
##  Class :character   Class :character   1st Qu.:168.8   1st Qu.:8.200  
##  Mode  :character   Mode  :character   Median :275.5   Median :8.700  
##                                        Mean   :297.0   Mean   :8.653  
##                                        3rd Qu.:388.5   3rd Qu.:9.100  
##                                        Max.   :769.0   Max.   :9.800  
##                                                        NA's   :1
Iroh_final %>%
  group_by(imdb_rating)%>%
  summary()
##    chapter            "words"                n           imdb_rating   
##  Length:36          Length:36          Min.   :  7.00   Min.   :7.500  
##  Class :character   Class :character   1st Qu.: 69.75   1st Qu.:8.350  
##  Mode  :character   Mode  :character   Median :129.00   Median :8.800  
##                                        Mean   :145.89   Mean   :8.783  
##                                        3rd Qu.:176.75   3rd Qu.:9.100  
##                                        Max.   :638.00   Max.   :9.800  
##                                                         NA's   :1
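One caveat on the summaries above: base summary() ignores dplyr groupings, so the group_by(imdb_rating) calls have no effect on the output. A grouped dplyr summary would look like the sketch below, applied here to a small hypothetical table with a character label column (the shape that bind_rows() of the four *_final tables would produce; the values are made up):

```r
library(dplyr)

# Hypothetical combined table: one row per (character, episode)
toy_final <- tibble(
  character   = c("Katara", "Katara", "Iroh", "Iroh"),
  n           = c(280, 206, 7, 136),
  imdb_rating = c(7.5, 7.9, 8.8, 7.5)
)

# summarize() respects the grouping: one row of statistics per character
char_summary <- toy_final %>%
  group_by(character) %>%
  summarize(mean_words  = mean(n),
            mean_rating = mean(imdb_rating, na.rm = TRUE))
```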
# a statistic that may be useful is the slope coefficient of words vs imdb rating. To make it easier to interpret, we will multiply it by 100

character_slope <- function(df){
  model <- lm(imdb_rating ~ n, data = df)
  slope <- 100 * model$coefficients[2]
  return(slope)
}

character_slope(katara_final)
##          n 
## -0.1471868
character_slope(sokka_final)
##          n 
## -0.1300191
character_slope(Aang_final)
##          n 
## -0.1517895
character_slope(Iroh_final)
##           n 
## 0.004841303
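The slopes above come with no measure of uncertainty attached. summary() on the same lm() fit reports the standard error and p-value for each coefficient, which would help judge whether the differences between characters are more than noise. A minimal sketch on made-up data standing in for one of the *_final tables:

```r
# Made-up (n, imdb_rating) pairs standing in for a *_final table
toy_final <- data.frame(n           = c(100, 200, 300, 400),
                        imdb_rating = c(8.9, 8.7, 8.6, 8.4))

fit <- lm(imdb_rating ~ n, data = toy_final)
coef_table <- summary(fit)$coefficients

coef_table["n", ]                  # Estimate, Std. Error, t value, Pr(>|t|)
100 * coef_table["n", "Estimate"]  # slope per 100 words, as in character_slope()
```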

What are your findings about the summary? Are they what you expected?

Episodes in which Katara had lines had a mean of 253 words spoken by her, and a mean IMDB rating of 8.65.

Episodes in which Sokka had lines had a mean of 310 words spoken by him, and a mean IMDB rating of 8.65.

Episodes in which Aang had lines had a mean of 297 words spoken by him, and a mean IMDB rating of 8.65.

Episodes in which Iroh had lines had a mean of 145 words spoken by him, and a mean IMDB rating of 8.78.

It is worth noting that episodes in which Sokka, Aang, or Katara had lines all had the same mean IMDB rating, which is likely because in most episodes where one of them had lines, all three did.

For every 100 words Katara speaks, the IMDB rating decreases by 0.147. For every 100 words Sokka speaks, the IMDB rating decreases by 0.130, and for every 100 words Aang speaks, it decreases by 0.152. On the other hand, for every 100 words Iroh speaks, the IMDB rating increases by 0.005.

The finding about Iroh is what I expected; however, I am surprised to learn that Aang is apparently less popular than Katara, and that Sokka’s and Aang’s lines do not have positive effects on the IMDB rating. This may suggest that, generally, episodes with more dialogue are received more poorly.

Make at least two plots that help you answer your question on the transformed or summarized data.

katara_final%>%
  ggplot(aes(x=n, y=imdb_rating))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(title="scatterplot of Katara's words vs IMDB rating", x="Katara words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

sokka_final %>%
  ggplot(aes(x=n, y=imdb_rating))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(title="scatterplot of Sokka's words vs IMDB rating", x="Sokka words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

Aang_final %>%
  ggplot(aes(x=n, y=imdb_rating))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(title="scatterplot of Aang's words vs IMDB rating", x="Aang words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

Iroh_final%>%
  ggplot(aes(x=n, y=imdb_rating))+
  geom_point()+
  geom_smooth(method=lm)+
  labs(title="scatterplot of Iroh's words vs IMDB rating", x="Iroh words", y="IMDB rating")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).
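The four plots above could also be drawn as a single faceted figure, which puts all the trend lines on shared axes for easier comparison. A sketch, assuming the four *_final tables from the transformation step are in the environment; bind_rows() stacks them and records which table each row came from via the .id argument:

```r
library(dplyr)
library(ggplot2)

# Stack the per-character tables, labeling each row with its source table
all_final <- bind_rows(Katara = katara_final, Sokka = sokka_final,
                       Aang = Aang_final, Iroh = Iroh_final,
                       .id = "character")

all_final %>%
  ggplot(aes(x = n, y = imdb_rating)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~character) +
  labs(title = "Words spoken per episode vs IMDB rating, by character",
       x = "Words in episode", y = "IMDB rating")
```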

Final Summary (10 points)

Summarize your research question and findings below.

In our crude model, Katara’s lines were associated with a decrease in the IMDB rating of episodes.

Without getting too heavily into the weeds of linear regression, ranking characters’ popularity by the relationship of their total word count in an episode to the IMDB rating of that episode gives, from most to least popular: Iroh, Sokka, Katara, Aang.

Are your findings what you expected? Why or Why not?

While the answer to our main question of interest, “do episodes in which Katara has more lines have lower IMDB ratings?”, is roughly what I expected, the rest of the findings are not; I certainly didn’t expect the crude relationship of Aang’s and Sokka’s words to IMDB rating to be so negative. There are two possibilities here. One is that my finger is not so clearly on the pulse of the ATLA fandom as to be able to assess the popularity of various characters (likely true). The other, and this is probably the stronger effect, is that I am only comparing crude linear models; transformations to meet the assumptions of linear regression, along with adjusting for confounding, interaction effects, and other potential predictors, may create a better understanding of each character’s actual popularity.