Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Before You Get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

  2. Please schedule a time with Eric or me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I am interested in how the popularity of different TV genres has changed over time. I have heard, for example, that Sci-Fi, which was very popular in the early days of TV, may be having something of a resurgence in the past couple of decades due to improvements in special effects and storytelling that have made it more broadly appealing. My research question is whether the data show any increase in the popularity of Sci-Fi as a genre in the past few decades.

Given your question, what is your expectation about the data?

I would expect that there would be an upward trend in either ratings (av_rating), audience share (share), or total number of Sci-Fi shows produced over time (the data will have to be manipulated to show this).

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# This data set came from the Tidy Tuesday release "TV's Golden Age is Real".

# I just wanted to emphasize that all of this code is my own code. I looked up
# documentation online and also referred to past homework assignments to get
# ideas, but I have copied nobody else's work. 

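library(tidyverse)   # dplyr, tidyr, ggplot2, forcats
library(lubridate)   # year(), month(), day()
library(wesanderson) # wes_palette()
library(gridExtra)   # grid.arrange()

tv <- read.csv("data/tv.csv")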

glimpse(tv)
## Rows: 2,266
## Columns: 7
## $ titleId      <chr> "tt2879552", "tt3148266", "tt3148266", "tt3148266", "tt3…
## $ seasonNumber <int> 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 1, 1,…
## $ title        <chr> "11.22.63", "12 Monkeys", "12 Monkeys", "12 Monkeys", "1…
## $ date         <chr> "2016-03-10", "2015-02-27", "2016-05-30", "2017-05-19", …
## $ av_rating    <dbl> 8.4890, 8.3407, 8.8196, 9.0369, 9.1363, 8.4370, 7.5089, …
## $ share        <dbl> 0.51, 0.46, 0.25, 0.19, 0.38, 2.38, 2.19, 6.67, 7.13, 5.…
## $ genres       <chr> "Drama,Mystery,Sci-Fi", "Adventure,Drama,Mystery", "Adve…
sum(is.na(tv))
## [1] 0

If there are any quirks that you have to deal with, such as NA coded as something else or data split across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

To get my data into the format I want, I will need to do the following:

  • Convert date into a numerical format and also create a categorical variable for decades to use for graphics and tables
  • Split up genres so that each genre can be visually represented separately and create an “other” category to catch all of the niche genres and sub-genres, such as “war” or “musical”
  • I will also make a simple categorical “yes”/“no” indicator variable to specify whether a genre is Sci-Fi or not

Otherwise, it appears that my data has all loaded correctly and is in the format I want. The two variables I am mainly interested in, av_rating and share, are both numerical vectors. I also checked for NA values but did not find any.
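As a minimal sketch of the date conversion listed above, assuming the lubridate package is available: ymd() parses the character date, year() extracts the year, and decimal_date() returns the fractional year. The variable names here are made up for illustration, and the example date is the first value shown in the glimpse() output.

```r
library(lubridate)

# Example value taken from the glimpse() output above
date_chr <- "2016-03-10"

# Parse the character string into a Date object
date_parsed <- ymd(date_chr)

# Calendar year and fractional (decimal) year
air_year    <- year(date_parsed)
air_decimal <- decimal_date(date_parsed)

class(date_parsed)
air_year
air_decimal
```

Parsing to a Date first makes the type explicit, which is one way to confirm that the data types are correct before transforming.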

Transforming the Data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
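This dataset is a single table, so no join was needed. For reference, the difference between the three joins mentioned above can be sketched on two hypothetical toy tables (shows and networks are invented here and are not part of the midterm data):

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
shows <- tibble(
  titleId = c("tt1", "tt2", "tt3"),
  title   = c("Show A", "Show B", "Show C"))
networks <- tibble(
  titleId = c("tt1", "tt2", "tt4"),
  network = c("Net X", "Net Y", "Net Z"))

# left_join keeps every row of `shows`; unmatched rows get NA for `network`
left <- left_join(shows, networks, by = "titleId")

# inner_join keeps only the titleIds present in both tables
inner <- inner_join(shows, networks, by = "titleId")

# right_join keeps every row of `networks`; unmatched rows get NA for `title`
right <- right_join(shows, networks, by = "titleId")

nrow(left)
nrow(inner)
nrow(right)
```

The choice depends on which table defines the rows that must survive: left_join when every show should be kept, inner_join when only complete matches are useful.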

tv_by_genre <- tv %>%
  separate_rows(genres, sep = ",") %>%
  mutate(
    air_date_year = year(date),
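    # Approximate decimal year: month(date) / 12 counts the current month as
    # fully elapsed, so values run about a month ahead of the true date;
    # lubridate::decimal_date(ymd(date)) would give the exact value.
    air_date_decimal = year(date) + month(date) / 12 + day(date) / 365,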
    air_date_decade = factor(
      case_when(
        air_date_year < 2000 ~ "90's",
        air_date_year < 2010 ~ "00's",
        air_date_year < 2020 ~ "10's"),
      levels = c("90's", "00's", "10's")),
    genres = factor(
      case_when(
        genres == "Sci-Fi" ~ "Sci-Fi",
        genres == "Action" ~ "Action",
        genres == "Comedy" ~ "Comedy",
        genres == "Drama" ~ "Drama",
        TRUE ~ "Other"),
      levels = c("Sci-Fi", "Action", "Comedy", "Drama", "Other")), 
    is_scifi = factor(
      ifelse(genres == "Sci-Fi", "Yes", "No"),
      levels = c("Yes", "No")))

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(tv_by_genre)
## Rows: 5,944
## Columns: 11
## $ titleId          <chr> "tt2879552", "tt2879552", "tt2879552", "tt3148266", …
## $ seasonNumber     <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 2…
## $ title            <chr> "11.22.63", "11.22.63", "11.22.63", "12 Monkeys", "1…
## $ date             <chr> "2016-03-10", "2016-03-10", "2016-03-10", "2015-02-2…
## $ av_rating        <dbl> 8.4890, 8.4890, 8.4890, 8.3407, 8.3407, 8.3407, 8.81…
## $ share            <dbl> 0.51, 0.51, 0.51, 0.46, 0.46, 0.46, 0.25, 0.25, 0.25…
## $ genres           <fct> Drama, Other, Sci-Fi, Other, Drama, Other, Other, Dr…
## $ air_date_year    <dbl> 2016, 2016, 2016, 2015, 2015, 2015, 2016, 2016, 2016…
## $ air_date_decimal <dbl> 2016.277, 2016.277, 2016.277, 2015.241, 2015.241, 20…
## $ air_date_decade  <fct> 10's, 10's, 10's, 10's, 10's, 10's, 10's, 10's, 10's…
## $ is_scifi         <fct> No, No, Yes, No, No, No, No, No, No, No, No, No, No,…

Are the values what you expected for the variables? Why or why not?

The transformations I performed appear to have worked correctly, but I cannot yet tell what this means in the context of my research question. I converted the date column into the numerical vectors air_date_year and air_date_decimal, which give the year of each date in integer and decimal form. I then created the categorical variable air_date_decade, which classifies each year by decade. I also split the genres column by rows, so that each show that listed multiple genres now has one row per genre. Finally, I created an “Other” category for all but 4 of the genres in the data, and the variable is_scifi as a binary Yes/No indicator of whether the genre is Sci-Fi.
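To double-check how the genre split and the “Other” recoding interact, here is a small sketch on two rows copied from the glimpse() output. The toy tibble and the names toy_long and toy_dedup are made up for illustration, and if_else() stands in for the case_when() call above.

```r
library(dplyr)
library(tidyr)

# Two title/genre pairs copied from the glimpse() output
toy <- tibble(
  title  = c("11.22.63", "12 Monkeys"),
  genres = c("Drama,Mystery,Sci-Fi", "Adventure,Drama,Mystery"))

# One row per genre, then collapse the niche genres into "Other"
toy_long <- toy %>%
  separate_rows(genres, sep = ",") %>%
  mutate(
    genres = if_else(
      genres %in% c("Sci-Fi", "Action", "Comedy", "Drama"),
      genres,
      "Other"))

# "12 Monkeys" now has two "Other" rows (from Adventure and Mystery)
count(toy_long, title, genres)

# Optional: keep a single row per title and collapsed genre
toy_dedup <- distinct(toy_long)
```

Because several niche genres collapse into the same label, a show can contribute more than one “Other” row, which matches the repeated Other values in the transformed table. Whether to deduplicate with distinct() depends on whether “Other” should count genres or shows.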

I will next create some displays to find out how the ratings, audience share, and number of shows produced for Sci-Fi has changed over time.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.

tv_by_genre %>%
  filter(genres == "Sci-Fi") %>%
  group_by(air_date_decade) %>%
  summarize(
    n = n(),
    mean_ratings = mean(av_rating),
    mean_share = mean(share))
## # A tibble: 3 x 4
##   air_date_decade     n mean_ratings mean_share
##   <fct>           <int>        <dbl>      <dbl>
## 1 90's               10         8.11      23.4 
## 2 00's               37         7.79       2.28
## 3 10's              107         7.95       1.31

What are your findings about the summary? Are they what you expected?

I got some interesting results when I grouped the data by decade. The total number of shows in the Sci-Fi genre has increased dramatically with each decade, but the mean share has gone down (which seems reasonable, in the sense that a higher total number of shows would fragment the total viewing audience). The mean ratings have remained about the same, with a slight dip in the 00’s compared to the 90’s and a small net decrease overall. This could reflect the fact that there were far fewer Sci-Fi shows in the 90’s, and those that were running at the time, such as The X-Files, had generally higher ratings.

Overall, I did not expect that ratings would remain roughly the same, but it is also possible that a change in popularity is reflected in other ways. The increase in the total number of Sci-Fi shows, for example, may not signify that Sci-Fi is becoming more popular, but rather that more shows are being produced overall. One way to verify this would be to show how Sci-Fi compares with other genres in the total number of shows produced.
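The comparison suggested above could also be expressed as the proportion of rows that are Sci-Fi in each decade. The following is only a sketch on a hypothetical toy tibble that mimics two columns of tv_by_genre; the values are invented and are not results from the data.

```r
library(dplyr)

# Hypothetical toy data standing in for tv_by_genre (values invented)
toy <- tibble(
  air_date_decade = c("90's", "90's", "00's", "00's", "00's", "00's"),
  is_scifi        = c("Yes", "No", "Yes", "No", "No", "No"))

# Share of rows in each decade that are Sci-Fi
scifi_prop <- toy %>%
  group_by(air_date_decade) %>%
  summarize(
    n_total    = n(),
    n_scifi    = sum(is_scifi == "Yes"),
    prop_scifi = mean(is_scifi == "Yes"))

scifi_prop
```

A proportion that stays flat while n_scifi grows would point to overall growth in television rather than growth specific to Sci-Fi.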

Make at least two plots that help you answer your question on the transformed or summarized data.

tv_by_genre %>%
  filter(genres %in% c("Sci-Fi", "Action", "Comedy", "Drama")) %>%
  group_by(air_date_year, genres) %>%
  summarize(n = n()) %>%
  ggplot(
    aes(
      x = air_date_year,
      y = n,
      fill = fct_rev(genres))) +
  geom_area(alpha = 0.5, color = "black") +
  scale_fill_manual(values = wes_palette("GrandBudapest2")) +
  labs(
    title = "Number of TV shows aired per year, by genre",
    x = "Year",
    y = "Number of TV shows",
    fill = "Genre") +
  theme_minimal()

From this plot, we can see that the increase in the number of Sci-Fi shows can be explained by every genre exploding since the early 2000’s. Additionally, the growth of Sci-Fi as a genre seems to pale in comparison to other genres, such as Drama.

Another way to look at it is to compare how av_rating and share have changed over time between genres:

default_gg <- ggplot(tv_by_genre,
  aes(
    x = air_date_decimal,
    color = is_scifi,
    alpha = is_scifi)) +
  scale_color_manual(values = c("Steel Blue", "Grey")) +
  scale_alpha_ordinal(range = c(1, 0.1)) +
  theme_minimal()

grid.arrange(
  default_gg +
    geom_point(aes(y = av_rating)) +
    labs(y = "AV rating") +
    theme(
      legend.position = "none",
      axis.title.x = element_blank(),
      axis.text.x = element_blank()),
  default_gg +
    geom_point(aes(y = share)) +
    labs(
      x = "Year",
      y = "Audience Share") +
    theme(legend.position = "none"),
  top = "AV rating and audience share of Sci-Fi (shown in blue)
  compared to other genres")

From the plot above, we can see what is happening more clearly, although this does not change what we found previously. Sci-Fi shows, in blue on the chart, have increased in number over time but not out of proportion with other genres. There also does not appear to be an appreciable general change in av_rating, as any increase in shows with higher ratings seems to be balanced by new shows with lower ratings. Similarly, there does not appear to be any appreciable change in share over time, with the notable exception of a small run of 6-8 data points in the late 90’s that do not follow the trend seen in any other genre or decade. These data points more than likely represent The X-Files, as this was one of the few Sci-Fi shows running in the 90’s. It is also interesting to note that the stark contrast between these 6-8 data points and the rest of the data in the Audience Share plot is hardly apparent in the AV rating plot.

Perhaps one of the most striking features of the AV rating plot shown above is that the cone of data points seems to widen as the number of shows increases, suggesting that, while there is more interest in developing new shows, the quality varies more.
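The widening cone could be checked numerically by summarizing the spread of av_rating per year. This is a sketch on a hypothetical toy tibble with invented values; the column names match those in the data, but rating_spread and the numbers are for illustration only.

```r
library(dplyr)

# Hypothetical toy data (values invented, not taken from tv.csv)
toy <- tibble(
  air_date_year = c(2000, 2000, 2000, 2015, 2015, 2015),
  av_rating     = c(7.9, 8.0, 8.1, 6.0, 8.0, 10.0))

# Standard deviation and interquartile range of ratings in each year
rating_spread <- toy %>%
  group_by(air_date_year) %>%
  summarize(
    n          = n(),
    sd_rating  = sd(av_rating),
    iqr_rating = IQR(av_rating))

rating_spread
```

On the real data, running this on tv rather than tv_by_genre would avoid counting the same season once per genre. A spread that grows with year would support the visual impression of a widening cone.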

Final Summary (10 points)

Summarize your research question and findings below.

My research question was whether Sci-Fi is increasing in popularity over time. I attempted to answer this question by comparing the ratings and audience share of Sci-Fi over time with those of other genres. I also compared the total number of new Sci-Fi shows over time with other genres. I was unable to find any convincing evidence that Sci-Fi is increasing in popularity by any of these metrics, at least not out of proportion with other genres. While the number of Sci-Fi shows being made seems to have increased in the past decade compared to previous decades, this is consistent with how television shows in general have increased in number. Similarly, the AV ratings and audience share of Sci-Fi shows have remained fairly consistent over time, which matches the trend for television shows in general.

Are your findings what you expected? Why or why not?

I did not expect that Sci-Fi would follow the general trends of the rest of television so consistently. I expected that, even if I was wrong, there would be at least some noticeable difference in the data for Sci-Fi compared to other genres. What I think I did not account for is that, while Sci-Fi may have become more popular over time, TV in general has become more popular and there is more interest in creating new shows than there used to be. It may also be that the metrics I chose are not the best fit for my question, and that looking at the data in a different way would give more conclusive answers.