Midterm

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Given your question, what is your expectation about the data?

It is a passion of mine to track the weekly box office for movies. I think it is fascinating to see which movies become blockbusters and which become box office bombs. Certain characteristics may contribute to a movie's box office success, such as genre, director, marketing campaign, and awards success. However, whether a movie succeeds or fails is very hard to predict and often defies expectations. For this project, I used a dataset from the “Tidy Tuesday” website called “movie_profit” to analyze box office data. I am curious which movie studio made the most profit. Additionally, I want to know which genre and MPAA rating were the most profitable for that studio.

I expect Walt Disney Studios to be the most profitable studio. This is because they make widely accessible films for a general audience (in terms of MPAA rating and subject matter) and they are very aggressive in their marketing campaigns. Smaller studios like Miramax and New Line make specific types of movies for adult audiences that do not have as much potential for great profit. Additionally, more than any other studio, Disney has a built-in audience that will automatically watch everything Disney. I also expect the PG rating to be the most profitable rating for Disney. This is because most of their highly profitable animated films (including Pixar) are rated PG. Lastly, I expect that adventure will be their most profitable genre. This is because Disney does not make pure action, comedy, or horror films. Most of the content that they produce is in the family adventure realm.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

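# Load the packages used in this report
library(tidyverse)  # dplyr, forcats, ggplot2
library(skimr)      # skim()
library(janitor)    # tabyl()

movie_profit_2 <- read.csv("data/movie_profit.csv")
# This code came from https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table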

# Convert 'mpaa_rating', 'genre', and 'distributor' into factors
movie_profit_2 <- movie_profit_2 %>% mutate(mpaa_rating = as_factor(mpaa_rating)) %>%
  mutate(genre = as_factor(genre)) %>%
  mutate(distributor = as_factor(distributor))

# Remove the first column because it contains redundant information
movie_profit_2 <- movie_profit_2 %>%
  select(-X)

skim(movie_profit_2)
Data summary
Name movie_profit_2
Number of rows 3401
Number of columns 8
_______________________
Column type frequency:
character 2
factor 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
release_date 0 1 8 10 0 1768 0
movie 0 1 1 35 0 3400 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
distributor 48 0.99 FALSE 201 War: 374, Son: 339, Uni: 307, 20t: 282
mpaa_rating 137 0.96 FALSE 4 R: 1514, PG-: 1092, PG: 573, G: 85
genre 0 1.00 FALSE 5 Dra: 1236, Com: 813, Act: 573, Adv: 481

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
production_budget 0 1 33284743 34892391 250000 9000000 20000000 45000000 175000000 ▇▂▁▁▁
domestic_gross 0 1 45421793 58825661 0 6118683 25533818 60323786 474544677 ▇▁▁▁▁
worldwide_gross 0 1 94115117 140918242 0 10618813 40159017 117615211 1304866322 ▇▁▁▁▁

This step shows that the data were loaded correctly. There are two character variables (‘release_date’ and ‘movie’), three numeric variables (‘production_budget’, ‘domestic_gross’, and ‘worldwide_gross’), and three factors (‘distributor’, ‘mpaa_rating’, and ‘genre’). Missing data were already coded as “NA” in the original dataset. There are 48 missing values in ‘distributor’ and 137 missing values in ‘mpaa_rating’. No additional data cleaning is needed.
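The missing-value counts reported by skim() can also be cross-checked with base R. This is a minimal sketch on a small made-up data frame; `toy_movies` is a hypothetical stand-in for `movie_profit_2`, not the real data.

```r
# Toy data frame standing in for movie_profit_2 (all values are made up)
toy_movies <- data.frame(
  movie = c("Film A", "Film B", "Film C", "Film D"),
  distributor = c("Universal", NA, "MGM", "Universal"),
  mpaa_rating = c("PG", "R", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() marks each missing cell; colSums() counts them per column
missing_counts <- colSums(is.na(toy_movies))
print(missing_counts)
```

Running the same two function calls on the real data frame should reproduce the n_missing column of the skim() output.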

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Subset the data and think of a question to answer with the subset

# Make a new variable called 'profit' by subtracting 'production_budget' from 'worldwide_gross'
movie_profit_2 <- movie_profit_2 %>%
  mutate(profit = worldwide_gross - production_budget)

# Find the number of films for each distributor (in descending order)
movie_profit_dist <- movie_profit_2 %>%
  tabyl(distributor) %>%
  arrange(desc(n))

head(movie_profit_dist, n=10)
##         distributor   n    percent valid_percent
##        Warner Bros. 374 0.10996766    0.11154190
##       Sony Pictures 339 0.09967657    0.10110349
##           Universal 307 0.09026757    0.09155980
##    20th Century Fox 282 0.08291679    0.08410379
##  Paramount Pictures 267 0.07850632    0.07963018
##         Walt Disney 240 0.07056748    0.07157769
##           Lionsgate 147 0.04322258    0.04384134
##                 MGM 121 0.03557777    0.03608709
##             Miramax 103 0.03028521    0.03071876
##            New Line 100 0.02940312    0.02982404

There are 201 film distributors, and many of them have only one film in the data. Therefore, I decided to include only distributors that have at least 100 films in the data. These distributors are listed in the table above.

# Subset the data to only include distributors that had at least 100 films selected
top_distributors <- c("Universal", "MGM", "20th Century Fox", "Paramount Pictures",
                      "Miramax", "Sony Pictures", "Lionsgate", "Walt Disney",
                      "Warner Bros.", "New Line")

movie_profit_studio <- movie_profit_2 %>%
  filter(distributor %in% top_distributors)
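The ten distributor names above are typed by hand from the frequency table. An alternative is to derive the list from the film counts, so the cutoff can be changed in one place. This is only a sketch on made-up data: `toy_movies`, `min_films`, and `large_distributors` are hypothetical names, and the threshold is 2 here rather than the 100 used in the report.

```r
library(dplyr)

# Toy data standing in for movie_profit_2 (all values are made up)
toy_movies <- tibble(
  movie = paste("Film", 1:7),
  distributor = c("Studio A", "Studio A", "Studio A",
                  "Studio B", "Studio B", "Studio C", NA)
)

min_films <- 2  # the report uses a cutoff of 100 films

# Count films per distributor and keep the names that meet the cutoff
large_distributors <- toy_movies %>%
  filter(!is.na(distributor)) %>%
  count(distributor, name = "n_films") %>%
  filter(n_films >= min_films) %>%
  pull(distributor)

# Keep only the films from those distributors
toy_subset <- toy_movies %>%
  filter(distributor %in% large_distributors)

print(large_distributors)
print(toy_subset)
```

Films with a missing distributor drop out of the subset because `NA %in% large_distributors` evaluates to FALSE.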

# Find the total profit for each distributor (in descending order)
movie_profit_studio %>%
  group_by(distributor) %>%
  summarize(total_profit = sum(profit)) %>%
  arrange(desc(total_profit))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 2
##    distributor        total_profit
##    <fct>                     <dbl>
##  1 20th Century Fox    32968248156
##  2 Universal           32292335713
##  3 Warner Bros.        29818525883
##  4 Paramount Pictures  24774713946
##  5 Sony Pictures       23350589619
##  6 Walt Disney         23194357038
##  7 Lionsgate            5394905305
##  8 MGM                  4804570425
##  9 New Line             3436634735
## 10 Miramax              3163003512
# Make a subset of the data that only includes the distributor with the highest profit
movie_profit_Fox <- movie_profit_studio %>%
  filter(distributor == "20th Century Fox")

20th Century Fox made the highest total profit. Therefore, I made a subset that only includes films with 20th Century Fox as the distributor. I can now use this subset to calculate which genre and MPAA rating made the most profit for 20th Century Fox.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

# The transformed table is a subset of the original dataset that only includes movies with 20th Century Fox as distributor
# Additionally, the transformed table includes my new profit variable
head(movie_profit_Fox)
##   release_date                          movie production_budget domestic_gross
## 1    7/11/2014 Dawn of the Planet of the Apes          1.70e+08      208545589
## 2    6/24/2016   Independence Day: Resurgence          1.65e+08      103144286
## 3     6/3/2011             X-Men: First Class          1.60e+08      146408305
## 4    7/14/2017 War for the Planet of the Apes          1.52e+08      146880162
## 5     5/1/2009       X-Men Origins: Wolverine          1.50e+08      179883157
## 6    6/13/2014     How to Train Your Dragon 2          1.45e+08      177002924
##   worldwide_gross      distributor mpaa_rating     genre    profit
## 1       710644566 20th Century Fox       PG-13 Adventure 540644566
## 2       384413934 20th Century Fox       PG-13    Action 219413934
## 3       355408305 20th Century Fox       PG-13    Action 195408305
## 4       489592267 20th Century Fox       PG-13    Action 337592267
## 5       374825760 20th Century Fox       PG-13    Action 224825760
## 6       614586270 20th Century Fox          PG Adventure 469586270

Are the values what you expected for the variables? Why or Why not?

I was surprised that 20th Century Fox made the most profit, given that Walt Disney Studios owns Marvel and Star Wars. One reason could be that 20th Century Fox has 282 movies in the dataset, while Walt Disney has only 240. Because total profit is summed over every film, a distributor with more films in the sample can rank higher even if its typical film earns less. This makes the totals a less accurate comparison, especially if the number of profitable Fox movies selected in the sample exceeded that of Disney.
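One way to check whether the number of films is driving the ranking is to report per-film averages next to the totals. The sketch below is not part of the original analysis: it uses made-up numbers, and `toy_movies` and `profit_by_studio` are hypothetical names.

```r
library(dplyr)

# Toy data: Studio A has more films, Studio B has fewer but stronger ones
toy_movies <- tibble(
  distributor = c("Studio A", "Studio A", "Studio A", "Studio B", "Studio B"),
  profit = c(100, 50, 30, 120, 40)
)

# Summarize each distributor by film count, total, mean, and median profit
profit_by_studio <- toy_movies %>%
  group_by(distributor) %>%
  summarize(
    n_films = n(),
    total_profit = sum(profit),
    mean_profit = mean(profit),
    median_profit = median(profit),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_profit))

print(profit_by_studio)
```

In this toy example Studio A has the larger total but Studio B has the larger mean, which is the kind of reversal that a difference in film counts can produce.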

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

# Calculate the total profit for each genre for 20th Century Fox (in descending order)
movie_profit_Fox %>%
  group_by(genre) %>%
  summarize(total_profit = sum(profit)) %>%
  arrange(desc(total_profit))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   genre     total_profit
##   <fct>            <dbl>
## 1 Adventure  15673162822
## 2 Action      9276012513
## 3 Comedy      3956244515
## 4 Drama       3309580303
## 5 Horror       753248003
# Calculate the total profit for each mpaa rating for 20th Century Fox (in descending order)
movie_profit_Fox %>%
  group_by(mpaa_rating) %>%
  summarize(total_profit = sum(profit)) %>%
  arrange(desc(total_profit))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   mpaa_rating total_profit
##   <fct>              <dbl>
## 1 PG           12524769511
## 2 PG-13        11494852190
## 3 R             7618981448
## 4 G             1237535173
## 5 <NA>            92109834

What are your findings about the summary? Are they what you expected?

For 20th Century Fox, the adventure genre and the PG rating were the most profitable. This is what I expected because action and adventure movies (like Star Wars) are the kind of movies that people see in theaters. Comedy, drama, and especially horror attract more niche audiences, while action and adventure are more universal in their appeal. Similarly, people feel like they need to see an adventure or action movie on the big screen, while a drama or horror film can be watched at home. I expected that R-rated movies would make the least money because they are age restricted, so it is surprising that G-rated movies were the least profitable of the four ratings. Perhaps this is because fewer G-rated movies are made than movies with any other rating. Many family movies are rated PG.

Make at least two plots that help you answer your question on the transformed or summarized data.

# Reorder the distributor factor levels by profit so that the bar graph displays in descending order
movie_profit_studio_2 <- movie_profit_studio %>%
  mutate(distributor_reorder = fct_reorder(distributor, profit, .fun = sum, .desc = TRUE))

# Create bar graph showing the ten distributors and total profit
ggplot(data=movie_profit_studio_2,
       aes(x=distributor_reorder, y= profit, fill=mpaa_rating)) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Total Profit by Distributor",
       x = "Distributor",
       y = "Profit",
       fill="MPAA Rating") +
  geom_col()

# Reorder the genre factor levels by profit so that the bar graph displays in descending order
movie_profit_Fox_2 <- movie_profit_Fox %>%
  mutate(genre_reorder = fct_reorder(genre, profit, .fun = sum, .desc = TRUE))

# Create bar graph of 20th Century Fox subset showing genre and total profit 
ggplot(data=movie_profit_Fox_2,
       aes(x=genre_reorder, y=profit, fill = mpaa_rating)) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Total Profit of 20th Century Fox by Genre",
       x = "Genre",
       y = "Profit",
       fill = "MPAA Rating") +
  geom_col()
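Both plots pass one row per film to geom_col(), which stacks every film as its own segment. Films that lost money are stacked below zero, so the top of a bar is the sum of the profitable films rather than the net total. A possible alternative, sketched here on made-up data with the hypothetical names `toy_fox` and `genre_rating_profit`, is to summarize first and plot the totals.

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for movie_profit_Fox (values are made up;
# negative profit means the film lost money)
toy_fox <- tibble(
  genre = c("Adventure", "Adventure", "Action", "Action", "Comedy"),
  mpaa_rating = c("PG", "PG", "PG-13", "PG-13", "R"),
  profit = c(200, -50, 100, 30, -10)
)

# One row per genre/rating pair, holding the net profit
genre_rating_profit <- toy_fox %>%
  group_by(genre, mpaa_rating) %>%
  summarize(total_profit = sum(profit), .groups = "drop")

# Bars ordered by net profit per genre, largest first
profit_plot <- ggplot(genre_rating_profit,
                      aes(x = reorder(genre, -total_profit, FUN = sum),
                          y = total_profit,
                          fill = mpaa_rating)) +
  geom_col() +
  labs(title = "Net Profit by Genre (toy data)",
       x = "Genre",
       y = "Profit",
       fill = "MPAA Rating")

print(genre_rating_profit)
```

Printing `profit_plot` draws the chart. With the summarized data, each colored segment is the net profit for one rating within a genre.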

Final Summary (10 points)

Summarize your research question and findings below. Are your findings what you expected? Why or Why not?

Among the ten distributors with at least 100 films in the sample, 20th Century Fox made the most total profit. For 20th Century Fox, adventure was the most profitable genre and PG was the most profitable MPAA rating. This is not surprising given the kind of movies that general audiences enjoy watching in theaters. Drama and horror movies are more niche and have narrower appeal. Similarly, comedy movies do not necessitate the theatrical viewing experience and are often enjoyed at home. On the other hand, adventure and action movies draw big audiences because people feel that movies that focus on spectacle must be enjoyed on the big screen. I expected R-rated movies to be the least profitable because they are age restricted. I was somewhat surprised that, for 20th Century Fox, G-rated movies were the least profitable. The graphs show that G-rated movies were not widely represented in the sample, which could account for this. The graph titled “Total Profit by Distributor” shows that 20th Century Fox is the most profitable of the ten distributors while Miramax is the least. This makes sense because 20th Century Fox distributes blockbuster movies intended for big profits, while Miramax distributed art house movies marketed for awards. The graph titled “Total Profit of 20th Century Fox by Genre” shows that the adventure genre was the most profitable and that many of those films were rated PG.

It is important to note that this dataset was made in 2018, before the Disney/20th Century Fox buyout. Additionally, this dataset does not include a number of recent Disney live-action remakes or Marvel, Star Wars, and Disney animated films, many of which made over a billion dollars at the worldwide box office. Finally, it is a bit strange that the original dataset categorized “adventure” and “action” as separate genres, especially since the difference between them is arbitrary and subjective. However, combining them would not have changed my results because, in the “Total Profit of 20th Century Fox by Genre” graph, “adventure” and “action” were the first and second most profitable genres.
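The point about merging the two genres can be checked by recoding them into one category and summarizing again. This is a minimal sketch on made-up numbers; `toy_fox`, `genre_combined`, and the label "Action/Adventure" are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy data standing in for movie_profit_Fox (all values are made up)
toy_fox <- tibble(
  genre = factor(c("Adventure", "Action", "Comedy", "Drama", "Action")),
  profit = c(150, 90, 40, 30, 10)
)

# Recode Action and Adventure into one combined genre, then total the profit
combined_profit <- toy_fox %>%
  mutate(genre_combined = if_else(genre %in% c("Action", "Adventure"),
                                  "Action/Adventure",
                                  as.character(genre))) %>%
  group_by(genre_combined) %>%
  summarize(total_profit = sum(profit), .groups = "drop") %>%
  arrange(desc(total_profit))

print(combined_profit)
```

If the two genres are already first and second, as in the report, the combined category stays at the top of the ranking.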