Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Given your question, what is your expectation about the data?
It is a passion of mine to track the weekly box office for movies. I think it is fascinating to see which movies become blockbusters and which become box office bombs. Certain characteristics may contribute to a movie's box office success, such as genre, director, marketing campaign, and awards success. However, whether a movie succeeds or fails is very hard to predict and often defies expectations. For this project, I used a dataset from the “Tidy Tuesday” website called “movie_profit” to analyze box office data. I am curious which movie studio made the most profit. Additionally, I want to know which genre and MPAA rating were the most profitable for that studio.
I expect that Walt Disney Studios is the most profitable studio. This is because they make widely accessible films for a general audience (in terms of MPAA rating and subject matter) and they are very aggressive in their marketing campaigns. Smaller studios like Miramax and New Line make specific types of movies for adult audiences that do not have as much potential for great profit. Additionally, more than any other studio, Disney has a built-in audience that will automatically watch everything Disney. I also expect the PG rating to be the most profitable rating for Disney. This is because most of their highly profitable animated films (including Pixar) are rated PG. Lastly, I expect that adventure will be their most profitable genre. This is because Disney does not make pure action, comedy, or horror films. Most of the content that they produce is in the family adventure realm.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
# Load the packages used throughout this report
library(tidyverse) # dplyr, forcats, ggplot2
library(skimr)     # skim()
library(janitor)   # tabyl()
movie_profit_2 <- read.csv("data/movie_profit.csv")
# This code came from https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table
# Convert 'mpaa_rating', 'genre', and 'distributor' into factors
movie_profit_2 <- movie_profit_2 %>%
  mutate(mpaa_rating = as_factor(mpaa_rating),
         genre = as_factor(genre),
         distributor = as_factor(distributor))
# Remove the first column because it contains redundant information
movie_profit_2 <- movie_profit_2 %>%
  select(-X)
skim(movie_profit_2)
Name | movie_profit_2 |
Number of rows | 3401 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 2 |
factor | 3 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
release_date | 0 | 1 | 8 | 10 | 0 | 1768 | 0 |
movie | 0 | 1 | 1 | 35 | 0 | 3400 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
distributor | 48 | 0.99 | FALSE | 201 | War: 374, Son: 339, Uni: 307, 20t: 282 |
mpaa_rating | 137 | 0.96 | FALSE | 4 | R: 1514, PG-: 1092, PG: 573, G: 85 |
genre | 0 | 1.00 | FALSE | 5 | Dra: 1236, Com: 813, Act: 573, Adv: 481 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
production_budget | 0 | 1 | 33284743 | 34892391 | 250000 | 9000000 | 20000000 | 45000000 | 175000000 | ▇▂▁▁▁ |
domestic_gross | 0 | 1 | 45421793 | 58825661 | 0 | 6118683 | 25533818 | 60323786 | 474544677 | ▇▁▁▁▁ |
worldwide_gross | 0 | 1 | 94115117 | 140918242 | 0 | 10618813 | 40159017 | 117615211 | 1304866322 | ▇▁▁▁▁ |
This step shows that the data were loaded correctly. There are two character variables (‘release_date’ and ‘movie’), three numeric variables (‘production_budget’, ‘domestic_gross’, and ‘worldwide_gross’), and three factors (‘distributor’, ‘mpaa_rating’, and ‘genre’). Missing data were already coded as “NA” in the original dataset. There are 48 missing values in ‘distributor’ and 137 missing values in ‘mpaa_rating’. No additional data cleaning is needed.
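The missing-value counts reported by skim() can also be cross-checked directly. The sketch below is only an illustration: `toy_movies` and its values are made up to stand in for `movie_profit_2`, and it uses base R only.

```r
# Toy stand-in for movie_profit_2; the values are invented for illustration
toy_movies <- data.frame(
  distributor = c("Universal", NA, "MGM", "Universal"),
  mpaa_rating = c("PG", "R", NA, NA),
  genre       = c("Drama", "Action", "Comedy", "Drama")
)

# is.na() marks each missing cell as TRUE; colSums() counts them per column
na_counts <- colSums(is.na(toy_movies))
na_counts
```

Running the same `colSums(is.na(...))` call on the real data frame should reproduce the n_missing column of the skim() output.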
If the data needs to be transformed in any way (values recoded, pivoted, etc.), do it here. Examples include transforming a continuous variable into a categorical one using case_when(), etc.
Subset the data and think of a question to answer with the subset.
# Make a new variable called 'profit' by subtracting 'production_budget' from 'worldwide_gross'
movie_profit_2 <- movie_profit_2 %>%
  mutate(profit = worldwide_gross - production_budget)
# Find the number of films for each distributor (in descending order)
movie_profit_dist <- movie_profit_2 %>%
  tabyl(distributor) %>%
  arrange(desc(n))
head(movie_profit_dist, n = 10)
## distributor n percent valid_percent
## Warner Bros. 374 0.10996766 0.11154190
## Sony Pictures 339 0.09967657 0.10110349
## Universal 307 0.09026757 0.09155980
## 20th Century Fox 282 0.08291679 0.08410379
## Paramount Pictures 267 0.07850632 0.07963018
## Walt Disney 240 0.07056748 0.07157769
## Lionsgate 147 0.04322258 0.04384134
## MGM 121 0.03557777 0.03608709
## Miramax 103 0.03028521 0.03071876
## New Line 100 0.02940312 0.02982404
There are 201 film distributors, and many of them have only one film in the data. Therefore, I decided to include only distributors that had at least 100 films in the data. These ten distributors are listed in the table above.
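The 100-film cutoff described above could also be applied programmatically rather than by typing out each distributor name. The following is a minimal sketch on made-up data: `toy_movies`, `min_films`, and `toy_studio` are hypothetical names, and the threshold is lowered to 2 so the toy example stays small.

```r
library(dplyr)

# Toy data with invented distributors and profits (not from movie_profit)
toy_movies <- data.frame(
  distributor = c("A", "A", "A", "B", "C", "C", NA),
  profit      = c(10, 20, 30, 5, 7, 8, 1)
)
min_films <- 2 # the report itself uses a cutoff of 100

toy_studio <- toy_movies %>%
  filter(!is.na(distributor)) %>% # drop films with no distributor listed
  group_by(distributor) %>%
  filter(n() >= min_films) %>%    # keep distributors with enough films
  ungroup()

# Number of films kept for each remaining distributor
toy_studio %>% count(distributor)
```

One advantage of this approach is that the subset updates automatically if the cutoff changes or the data are refreshed.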
# Subset the data to only include distributors that had at least 100 films selected
top_distributors <- c("Warner Bros.", "Sony Pictures", "Universal", "20th Century Fox", "Paramount Pictures", "Walt Disney", "Lionsgate", "MGM", "Miramax", "New Line")
movie_profit_studio <- movie_profit_2 %>%
  filter(distributor %in% top_distributors)
# Find the total profit for each distributor (in descending order)
movie_profit_studio %>%
  group_by(distributor) %>%
  summarize(total_profit = sum(profit)) %>%
  arrange(desc(total_profit))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 2
## distributor total_profit
## <fct> <dbl>
## 1 20th Century Fox 32968248156
## 2 Universal 32292335713
## 3 Warner Bros. 29818525883
## 4 Paramount Pictures 24774713946
## 5 Sony Pictures 23350589619
## 6 Walt Disney 23194357038
## 7 Lionsgate 5394905305
## 8 MGM 4804570425
## 9 New Line 3436634735
## 10 Miramax 3163003512
# Make a subset of the data that only includes the distributor with the highest profit
movie_profit_Fox <- movie_profit_studio %>%
  filter(distributor == "20th Century Fox")
20th Century Fox made the highest total profit. Therefore, I made a subset that only includes films with 20th Century Fox as the distributor. I can now use this subset to calculate which genre and MPAA rating made the most profit for 20th Century Fox.
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
# The transformed table is a subset of the original dataset that only includes movies with 20th Century Fox as distributor
# Additionally, the transformed table includes my new profit variable
head(movie_profit_Fox)
## release_date movie production_budget domestic_gross
## 1 7/11/2014 Dawn of the Planet of the Apes 1.70e+08 208545589
## 2 6/24/2016 Independence Day: Resurgence 1.65e+08 103144286
## 3 6/3/2011 X-Men: First Class 1.60e+08 146408305
## 4 7/14/2017 War for the Planet of the Apes 1.52e+08 146880162
## 5 5/1/2009 X-Men Origins: Wolverine 1.50e+08 179883157
## 6 6/13/2014 How to Train Your Dragon 2 1.45e+08 177002924
## worldwide_gross distributor mpaa_rating genre profit
## 1 710644566 20th Century Fox PG-13 Adventure 540644566
## 2 384413934 20th Century Fox PG-13 Action 219413934
## 3 355408305 20th Century Fox PG-13 Action 195408305
## 4 489592267 20th Century Fox PG-13 Action 337592267
## 5 374825760 20th Century Fox PG-13 Action 224825760
## 6 614586270 20th Century Fox PG Adventure 469586270
Are the values what you expected for the variables? Why or Why not?
I was surprised that 20th Century Fox made the most profit, given that Walt Disney Studios owns Marvel and Star Wars. One reason could be that 20th Century Fox has 282 movies in the dataset, while Walt Disney has only 240. Because total profit is a sum over films, a distributor with more films in the sample can rank higher even if its typical film is no more profitable, so the comparison may not be entirely fair.
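One way to check whether the number of films drives the ranking is to compare profit per film alongside total profit. The sketch below uses invented numbers purely to show the idea: `toy_movies`, "Studio A", and "Studio B" are hypothetical and do not come from the dataset.

```r
library(dplyr)

# Invented example: Studio A has more films, Studio B earns more per film
toy_movies <- data.frame(
  distributor = c("Studio A", "Studio A", "Studio A", "Studio B", "Studio B"),
  profit      = c(10, 10, 10, 14, 14)
)

per_film <- toy_movies %>%
  group_by(distributor) %>%
  summarize(
    n_films       = n(),            # how many films each distributor has
    total_profit  = sum(profit),    # the measure used in this report
    mean_profit   = mean(profit),   # average profit per film
    median_profit = median(profit), # less sensitive to a few huge hits
    .groups = "drop"
  ) %>%
  arrange(desc(mean_profit))

per_film
```

In this toy case Studio A leads on total profit (30 versus 28) while Studio B leads on mean profit per film (14 versus 10), which is the kind of reversal that could in principle occur between Fox and Disney.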
Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.
# Calculate the total profit for each genre for 20th Century Fox (in descending order)
movie_profit_Fox %>%
  group_by(genre) %>%
  summarize(total_profit = sum(profit)) %>%
  arrange(desc(total_profit))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## genre total_profit
## <fct> <dbl>
## 1 Adventure 15673162822
## 2 Action 9276012513
## 3 Comedy 3956244515
## 4 Drama 3309580303
## 5 Horror 753248003
# Calculate the total profit for each mpaa rating for 20th Century Fox (in descending order)
movie_profit_Fox %>%
  group_by(mpaa_rating) %>%
  summarize(total_profit = sum(profit)) %>%
  arrange(desc(total_profit))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## mpaa_rating total_profit
## <fct> <dbl>
## 1 PG 12524769511
## 2 PG-13 11494852190
## 3 R 7618981448
## 4 G 1237535173
## 5 <NA> 92109834
What are your findings about the summary? Are they what you expected?
For 20th Century Fox, the adventure genre and the PG MPAA rating were the most profitable. This is what I expected because action and adventure movies (like Star Wars) are the kind of movies that people see in theaters. Comedy, drama, and (especially) horror attract more niche audiences, while action and adventure are more universal in their appeal. Similarly, people feel that they need to see an adventure or action movie on the big screen, while a drama or horror film can be watched at home. I expected R-rated movies to make the least money because they are age restricted, so it is surprising that G-rated movies were the least profitable rating. Perhaps this is because fewer G-rated movies are made than movies with any other rating; many family movies are rated PG.
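The idea that G-rated films rank last simply because there are few of them can be examined by reporting the film count next to the total. This is a hedged sketch on made-up values: `toy_fox` is a hypothetical stand-in for `movie_profit_Fox`, and the numbers are chosen only to illustrate the pattern.

```r
library(dplyr)

# Invented ratings and profits; one G film, three PG films, three R films
toy_fox <- data.frame(
  mpaa_rating = c("G", "PG", "PG", "PG", "R", "R", "R"),
  profit      = c(50, 40, 40, 40, 20, 20, 20)
)

by_rating <- toy_fox %>%
  group_by(mpaa_rating) %>%
  summarize(n_films = n(), total_profit = sum(profit), .groups = "drop") %>%
  mutate(
    share_of_films  = n_films / sum(n_films), # how common each rating is
    profit_per_film = total_profit / n_films  # total adjusted for film count
  ) %>%
  arrange(desc(total_profit))

by_rating
```

Here the G rating has the lowest total profit but the highest profit per film, showing how a small film count alone can push a rating to the bottom of a total-profit ranking.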
Make at least two plots that help you answer your question on the transformed or summarized data.
# Reorder the distributor factor levels by profit so that the bar graph displays in descending order
movie_profit_studio_2 <- movie_profit_studio %>%
  mutate(distributor_reorder = fct_reorder(distributor, profit, .fun = sum, .desc = TRUE))
# Create bar graph showing the ten distributors and total profit
ggplot(data = movie_profit_studio_2,
       aes(x = distributor_reorder, y = profit, fill = mpaa_rating)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Total Profit by Distributor",
       x = "Distributor",
       y = "Profit",
       fill = "MPAA Rating")
# Reorder the genre factor levels by profit so that the bar graph displays in descending order
movie_profit_Fox_2 <- movie_profit_Fox %>%
  mutate(genre_reorder = fct_reorder(genre, profit, .fun = sum, .desc = TRUE))
# Create bar graph of 20th Century Fox subset showing genre and total profit
ggplot(data = movie_profit_Fox_2,
       aes(x = genre_reorder, y = profit, fill = mpaa_rating)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Total Profit of 20th Century Fox by Genre",
       x = "Genre",
       y = "Profit",
       fill = "MPAA Rating")
Summarize your research question and findings below. Are your findings what you expected? Why or Why not?
Among all of the distributors in the sample, 20th Century Fox made the most profit. For 20th Century Fox, adventure was the most profitable genre and PG was the most profitable MPAA rating. This is not surprising given the kind of movies that general audiences enjoy watching in theaters. Drama and horror movies are more niche and have narrower appeal. Similarly, comedy movies do not necessitate the theatrical viewing experience and are often enjoyed at home. On the other hand, adventure and action movies draw big audiences because people feel that movies that focus on spectacle must be enjoyed on the big screen. I expected R-rated movies to be the least profitable because they are age restricted. I was somewhat surprised that for 20th Century Fox, G-rated movies were the least profitable. You can see on the graphs that G-rated movies were not widely represented in the sample, which could account for this result. The graph titled “Total Profit by Distributor” shows that 20th Century Fox is the most profitable distributor while Miramax is the least. This makes sense because 20th Century Fox distributes blockbuster movies intended for big profits while Miramax distributed art house movies marketed for awards. The graph titled “Total Profit of 20th Century Fox by Genre” shows that the adventure genre was the most profitable and that many of those films were rated PG.
It is important to note that this dataset was made in 2018, before the Disney/20th Century Fox buyout. Additionally, this dataset does not include a number of recent Disney live-action remakes, Marvel, Star Wars, and Disney animated films. Many of these films made over a billion dollars at the worldwide box office. Finally, it is a bit strange that the original dataset categorized “adventure” and “action” as separate genres, especially since the difference between the two is arbitrary and subjective. However, this would not have changed my results because in the “Total Profit of 20th Century Fox by Genre” graph, “adventure” and “action” were the first and second most profitable genres.
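The point about “adventure” and “action” could be checked by merging the two levels and recomputing the totals. The sketch below is one possible approach, not part of the original analysis: `toy_fox`, `genre_combined`, and the “Action/Adventure” label are hypothetical, and the profit values are invented.

```r
library(dplyr)
library(forcats)

# Invented genre totals standing in for the 20th Century Fox subset
toy_fox <- data.frame(
  genre  = factor(c("Adventure", "Action", "Comedy", "Drama", "Action")),
  profit = c(50, 30, 20, 10, 5)
)

combined <- toy_fox %>%
  # fct_collapse() merges the listed old levels into one new level
  mutate(genre_combined = fct_collapse(
    genre,
    "Action/Adventure" = c("Action", "Adventure")
  )) %>%
  group_by(genre_combined) %>%
  summarize(total_profit = sum(profit), .groups = "drop") %>%
  arrange(desc(total_profit))

combined
```

Since the two genres already hold the top two positions in the report, merging them on the real data would leave the combined category in first place.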