Midterm
Define Your Research Question (10 points)
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I found this BoardGames dataset on the TidyTuesday GitHub page, and I thought it looked really interesting because I love playing board games but had no idea there were so many of them! With so many games that I had never heard of, I wanted to learn more about what makes a board game popular. Therefore, I came up with the following research question: How does minimum length of play affect the average rating of a board game on Board Game Geek? How does this change if you stratify by minimum age of player? I was interested in whether people prefer longer or shorter games, and whether this changes based on the age group that is the target audience of the game.
Given your question, what is your expectation about the data?
I would expect that overall, longer minimum playtime would lead to lower ratings because aside from committed board game enthusiasts, the casual player might either not want to try the game, or might get bored of it before they can learn to enjoy it. I would think that this effect might be more pronounced among games with a lower minimum player age, due to the typically lower attention span of children and tweens. Games aimed at an older teenage or adult audience might see less of an effect of minimum playtime on rating since those players might be more willing to devote longer to the process, and also may be more likely to already be enthusiastic board-game fans and therefore more used to playing for longer periods or in a social setting.
Loading the Data (10 points)
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
I got this data from the TidyTuesday GitHub page; the folder containing the data can be accessed using the following link: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12
First I will load the BoardGames dataset:
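library(dplyr)    # glimpse(), %>%, relocate(), mutate(), case_when()
library(ggplot2)  # plots in the visualization section
BoardGames <- read.csv("board_games.csv")
head(BoardGames)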
Then I will check to make sure that the average_rating, min_playtime, and min_age columns are all encoded as numbers and not characters, so that I can do my analysis properly.
glimpse(BoardGames)
## Rows: 10,532
## Columns: 22
## $ game_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ description <chr> "Die Macher is a game about seven sequential political…
## $ image <chr> "//cf.geekdo-images.com/images/pic159509.jpg", "//cf.g…
## $ max_players <int> 5, 4, 4, 4, 6, 6, 2, 5, 4, 6, 7, 5, 4, 4, 6, 4, 2, 8, …
## $ max_playtime <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, 60, 120…
## $ min_age <int> 14, 12, 10, 12, 12, 12, 8, 12, 13, 10, 13, 12, 10, 10,…
## $ min_players <int> 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2, …
## $ min_playtime <int> 240, 30, 30, 60, 90, 240, 20, 120, 90, 60, 45, 45, 60,…
## $ name <chr> "Die Macher", "Dragonmaster", "Samurai", "Tal der Köni…
## $ playing_time <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, 60, 120…
## $ thumbnail <chr> "//cf.geekdo-images.com/images/pic159509_t.jpg", "//cf…
## $ year_published <int> 1986, 1981, 1998, 1992, 1964, 1989, 1978, 1993, 1998, …
## $ artist <chr> "Marcus Gschwendtner", "Bob Pepper", "Franz Vohwinkel"…
## $ category <chr> "Economic,Negotiation,Political", "Card Game,Fantasy",…
## $ compilation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "CATAN…
## $ designer <chr> "Karl-Heinz Schmiel", "G. W. \"Jerry\" D'Arcey", "Rein…
## $ expansion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Elfengold,Elfenla…
## $ family <chr> "Country: Germany,Valley Games Classic Line", "Animals…
## $ mechanic <chr> "Area Control / Area Influence,Auction/Bidding,Dice Ro…
## $ publisher <chr> "Hans im Glück Verlags-GmbH,Moskito Spiele,Valley Game…
## $ average_rating <dbl> 7.66508, 6.60815, 7.44119, 6.60675, 7.35830, 6.52534, …
## $ users_rated <int> 4498, 478, 12019, 314, 15195, 73, 2751, 186, 1263, 672…
Lastly, I’ll check for NA values in my columns of interest (again, average_rating, min_playtime, and min_age):
#I just found this code/refreshed my memory using the help function in RStudio
anyNA(BoardGames$min_age)
## [1] FALSE
anyNA(BoardGames$min_playtime)
## [1] FALSE
anyNA(BoardGames$average_rating)
## [1] FALSE
There are no missing values encoded as NA for my three columns of interest.
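The type check and the NA check above can also be done in one pass over the three columns of interest. This is a minimal sketch on a small made-up data frame (toy_games is a hypothetical stand-in for BoardGames, and its values are invented for illustration):

```r
# Toy stand-in for BoardGames; the values are made up for illustration.
toy_games <- data.frame(
  min_age        = c(14L, 12L, 0L, 8L),
  min_playtime   = c(240L, 30L, 0L, 20L),
  average_rating = c(7.67, 6.61, 5.90, 7.44)
)

cols_of_interest <- c("min_age", "min_playtime", "average_rating")

# TRUE for each column that is stored as a number (integer or double)
is_numeric_col <- sapply(toy_games[cols_of_interest], is.numeric)

# Number of NA values in each column
na_counts <- colSums(is.na(toy_games[cols_of_interest]))

print(is_numeric_col)
print(na_counts)
```

Note that a check like this only finds values stored as NA; it would not flag missing data that has been coded as some other value.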
If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Note: I thought I did not have NA values, but once I started exploring the data further, it turned out that some minimum age values and some minimum playtime values were encoded as zero. I was unsure whether this meant that there really was no minimum age/playtime or just that the data was missing. I chose to create a separate category for these values when I categorized both of those variables. I looked at them briefly in my tables, and excluded them from my graphs using the categories I had created.
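One way to see how many values are affected is to count the zeros in each column. This is only a sketch on made-up data (toy_games is a hypothetical stand-in for BoardGames), not the check I originally ran:

```r
# Toy stand-in for BoardGames; the values are made up for illustration.
toy_games <- data.frame(
  min_age      = c(14L, 12L, 0L, 8L, 0L),
  min_playtime = c(240L, 30L, 0L, 20L, 45L)
)

# Count how many rows have a zero in each column
zero_counts <- colSums(toy_games[c("min_age", "min_playtime")] == 0)

# Express the counts as a share of all rows
zero_share <- zero_counts / nrow(toy_games)

print(zero_counts)
print(zero_share)
```

A count like this cannot say whether a zero means "no minimum" or "missing"; it only shows how much of the data the decision affects.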
Make sure your data types are correct!
Done! See above.
Transforming the data (15 points)
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
The three main things I’d like to do to my dataframe to make it more useful for this project are:
Move the columns I will be using to the beginning of the table, because the 2nd column, which contains the description of the board game, is very long; it makes the table unwieldy to look at quickly using functions like head, and it doesn’t show up well in the final html file.
Divide minimum playtime into groups, to ease analysis; this will involve making a new categorical variable for minimum playtime.
Divide minimum age into categories for my analysis; I’ll need to look at the range and distribution of ages first, and then choose appropriate category breaks and add a new column with the re-coded categorical variable. This will allow me to subset for the next question.
Subset of the data and think of a question to answer the subset
I will be subsetting the data into age groups in order to look not only at the relationship between minimum playtime and average rating overall, but also at that relationship within each of the separate age groups.
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
I don’t anticipate needing to merge tables for this project, as the BoardGames dataset came in a complete form.
#First, I am rearranging the columns in the table. This code came from Dr. Minnier and Dr. Niederhausen's "An Introduction to R and RStudio for Exploratory Data Analysis" Part 2 slides. https://jminnier-berd-r-courses.netlify.app/01-intro-r-eda/01_intro_r_eda_part2#1
# relocate() moves the named columns to the front, in the order listed
BoardGames_V1 <- BoardGames %>%
  relocate(min_age, min_playtime, average_rating)
head(BoardGames_V1)
#Now that my columns are in a good order, I want to look at my minimum playtime data, and see what the distribution is so that I can find good cut points for my categorization. First I used the count function to see the distribution of observations for each playtime.
BoardGames_V1 %>%
count(min_playtime)
#Then I made a boxplot of the minimum play times to visualize the distribution
boxplot(BoardGames_V1$min_playtime, horizontal = TRUE, xlab = "Minimum Playtime (Min)", main = "Boxplot of Min Playtime Distribution")
As we can see, there are a few extreme outliers, which make it hard to visualize the rest of the data. I’m going to remove those points temporarily.
# I'm filtering out the highest 3 observations which were much higher than any of the other min play-times
BoardGames_V1.1 <- BoardGames_V1 %>%
filter(min_playtime <= 6000)
boxplot(BoardGames_V1.1$min_playtime, horizontal = TRUE, xlab = "Minimum Playtime (Min)", main = "Boxplot of Min Playtime Distribution Without the Most Extreme Outliers")
As we can see, this graph still isn’t great: most of the data is cramped at the low end, with a few high values. At this point, however, I’m still going to try to divide the data into categories:
# Here I am using case_when to create categories for the minimum playtime for each game, with +10-30 meaning more than 10, up to 30 minutes, etc.
BoardGames_V2 <- BoardGames_V1 %>%
mutate(min_play_categories = case_when(
min_playtime == 0 ~ "0:No Min Playtime",
min_playtime > 0 & min_playtime <= 10 ~ "1:10 min or less",
min_playtime > 10 & min_playtime <= 30 ~ "2:+10-30 min",
min_playtime > 30 & min_playtime <= 60 ~ "3:+30-60 min",
min_playtime > 60 & min_playtime <= 180 ~ "4:+1-3 hrs",
min_playtime > 180 & min_playtime <= 720 ~ "5:+3-12 hrs",
min_playtime > 720 ~ "6:More than 12 hrs"
)
)
BoardGames_V2 <- BoardGames_V2 %>%
relocate(min_play_categories, .after = min_playtime)
head(BoardGames_V2)
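```r
# An alternative sketch of the same binning using base R's cut() instead of
# case_when(). toy_playtime is a made-up vector, not the real column, and the
# sketch assumes there are no negative playtimes: cut() would place a negative
# value in the "0:No Min Playtime" bin, whereas case_when() would return NA.
toy_playtime <- c(0, 5, 10, 30, 45, 60, 120, 180, 240, 720, 1000)

playtime_labels <- c(
  "0:No Min Playtime", "1:10 min or less", "2:+10-30 min", "3:+30-60 min",
  "4:+1-3 hrs", "5:+3-12 hrs", "6:More than 12 hrs"
)

# right = TRUE makes each bin include its upper break, e.g. (10, 30],
# which matches the "> lower & <= upper" conditions used above.
toy_categories <- cut(
  toy_playtime,
  breaks = c(-Inf, 0, 10, 30, 60, 180, 720, Inf),
  labels = playtime_labels,
  right  = TRUE
)

# Unlike case_when(), cut() returns a factor, so empty bins stay as levels
print(table(toy_categories))
```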
#Similar to above, I want to look at my age data, and see what the distribution is so that I can find good cut points for my categorization. First I used the count function to see the distribution of observations for each age.
BoardGames_V2 %>%
count(min_age)
#Then I made a rough histogram to visualize the distribution
hist(BoardGames_V2$min_age, xlab = "Minimum Age (yrs)", ylab = "Frequency", main = "Histogram of Min Age Distribution")
Based on the information from my exploratory analysis of the minimum age distribution, and my own knowledge of children, I have chosen to cut the age distribution into the following categories: No Min Age, Young Child (1-6 yrs), Older Child (7-12 yrs), Teen (13-17 yrs), and Adult (18+ yrs)
# Here I am using case_when to create categories for the minimum age of player for each game
BoardGames_V3 <- BoardGames_V2 %>%
mutate(min_age_categories = case_when(
min_age == 0 ~ "0:No Min Age",
min_age > 0 & min_age <= 6 ~ "1:Young Child (0-6 yrs)",
min_age > 6 & min_age <= 12 ~ "2:Older Child (7-12 yrs)",
min_age > 12 & min_age <= 17 ~ "3:Teen (13-17 yrs)",
min_age > 17 ~ "4:Adult (18+ yrs)"
)
)
BoardGames_V3 <- BoardGames_V3 %>%
relocate(min_age_categories, .after = min_age)
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
Now, we can see that our rearrangement and re-categorization worked:
head(BoardGames_V3)
I can also make separate tables for each of the different age groups, if needed for subsetting analysis later:
#Again, this code came from Dr. Minnier and Dr. Niederhausen's "An Introduction to R and RStudio for Exploratory Data Analysis" Part 2 slides. https://jminnier-berd-r-courses.netlify.app/01-intro-r-eda/01_intro_r_eda_part2#1, as mentioned above for the relocation data
NoMinAge_Games <- BoardGames_V3 %>%
filter(min_age_categories == "0:No Min Age")
head(NoMinAge_Games)
YoungChild_BoardGames <- BoardGames_V3 %>%
filter(min_age_categories == "1:Young Child (0-6 yrs)")
head(YoungChild_BoardGames)
OlderChild_BoardGames <- BoardGames_V3 %>%
filter(min_age_categories == "2:Older Child (7-12 yrs)")
head(OlderChild_BoardGames)
Teen_BoardGames <- BoardGames_V3 %>%
filter(min_age_categories == "3:Teen (13-17 yrs)")
head(Teen_BoardGames)
Adult_BoardGames <- BoardGames_V3 %>%
filter(min_age_categories == "4:Adult (18+ yrs)")
head(Adult_BoardGames)
Are the values what you expected for the variables? Why or Why not?
I was surprised that there were so many minimum playtime and minimum age values of zero; I honestly wasn’t sure if this was the equivalent of an “NA,” or if it was the game designer’s indication that there was no minimum playtime or no minimum age limit. In order to account for this, I made a separate category for these games, so that I could choose whether or not to use them in my final analysis. Otherwise, I was not terribly surprised by anything so far.
Visualizing and Summarizing the Data (15 points)
Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.
First I will be addressing my primary research question, which was: how does the minimum playtime affect the average rating in the overall dataset?
Main Question
#This summary is the average rating, grouped by minimum playtime
BoardGames_V3 %>%
group_by(min_play_categories) %>%
summarise(Number_Games = n(), Mean_Rating = mean(average_rating), Median_Rating = median(average_rating), Min_Rating = min(average_rating), Max_Rating = max(average_rating))
## `summarise()` ungrouping output (override with `.groups` argument)
Next I’ll look into my secondary question, which is: how does the min playtime affect mean rating when we stratify by age group?
No Min Age
#No Min Age (possibly just NA category)
NoMinAge_Games %>%
group_by(min_play_categories) %>%
summarise(Number_Games = n(), Mean_Rating = mean(average_rating), Median_Rating = median(average_rating), Min_Rating = min(average_rating), Max_Rating = max(average_rating))
## `summarise()` ungrouping output (override with `.groups` argument)
Young Children
#Young Children
YoungChild_BoardGames %>%
group_by(min_play_categories) %>%
summarise(Number_Games = n(), Mean_Rating = mean(average_rating), Median_Rating = median(average_rating), Min_Rating = min(average_rating), Max_Rating = max(average_rating))
## `summarise()` ungrouping output (override with `.groups` argument)
Older Children
#Older Children
OlderChild_BoardGames %>%
group_by(min_play_categories) %>%
summarise(Number_Games = n(), Mean_Rating = mean(average_rating), Median_Rating = median(average_rating), Min_Rating = min(average_rating), Max_Rating = max(average_rating))
## `summarise()` ungrouping output (override with `.groups` argument)
Teens
#Teens
Teen_BoardGames %>%
group_by(min_play_categories) %>%
summarise(Number_Games = n(), Mean_Rating = mean(average_rating), Median_Rating = median(average_rating), Min_Rating = min(average_rating), Max_Rating = max(average_rating))
## `summarise()` ungrouping output (override with `.groups` argument)
Adults
#Adults
Adult_BoardGames %>%
group_by(min_play_categories) %>%
summarise(Number_Games = n(), Mean_Rating = mean(average_rating), Median_Rating = median(average_rating), Min_Rating = min(average_rating), Max_Rating = max(average_rating))
## `summarise()` ungrouping output (override with `.groups` argument)
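The five per-age-group tables above could also be produced in a single call by grouping on both categorical variables at once. This is a minimal sketch on made-up data (toy_games and its shortened category labels are hypothetical); it assumes dplyr 1.0 or later, where summarise() accepts the .groups argument mentioned in the message above:

```r
library(dplyr)

# Toy stand-in for BoardGames_V3; values and labels are made up.
toy_games <- tibble(
  min_age_categories  = c("1:Young", "1:Young", "2:Older", "2:Older", "2:Older"),
  min_play_categories = c("1:short", "1:short", "1:short", "2:long", "2:long"),
  average_rating      = c(5, 6, 6, 7, 8)
)

# One row per (age group, playtime group) combination that occurs in the data
stratified_summary <- toy_games %>%
  group_by(min_age_categories, min_play_categories) %>%
  summarise(
    Number_Games  = n(),
    Mean_Rating   = mean(average_rating),
    Median_Rating = median(average_rating),
    .groups = "drop"  # return an ungrouped tibble and silence the message
  )

print(stratified_summary)
```

Keeping everything in one table makes it easier to compare age groups side by side, at the cost of a longer table than the separate per-group summaries.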
What are your findings about the summary? Are they what you expected?
Looking at the overall summary, I was surprised. Especially if we ignore the No Min Playtime category (which I think is reasonable, since we are not sure whether it is really just the NA category), mean rating and median rating both go up with each Minimum Playtime category. It seems from this summary that players actually prefer longer playtimes!
But what about if we subset by age group?
The pattern is less consistent when we break the data down into age groups.
No Min Age: this group is of unknown usefulness, since we don’t know if it is just a stand-in for NA. In this table, the longer games are the highest rated, but there is not a consistent pattern in the ratings.
Young Children: they seem to enjoy +10-30 min games the most, but ratings were fairly consistent across playtime categories. There were no games for young children with playtimes longer than 3 hours.
Older Children and Teens: both groups seemed to follow the overall pattern of preferring longer games, with mean rating rising steadily as playtimes rose.
Adults: they seem from the data to enjoy +1-3 hr games the most, but there were far fewer games with a minimum age of 18+, and only one adult game longer than 3 hours, so it is harder to trust the pattern in this group than in some of the other groups.
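Because some age-by-playtime combinations contain very few games, it can help to list the sparse cells before reading too much into their means. This is a sketch on made-up data; toy_games, its shortened age labels, and the min_cell_size threshold are all hypothetical choices for illustration:

```r
# Toy stand-in for BoardGames_V3; the rows are made up for illustration.
toy_games <- data.frame(
  min_age_categories  = c("3:Teen", "3:Teen", "3:Teen",
                          "4:Adult", "4:Adult", "4:Adult", "4:Adult"),
  min_play_categories = c("4:+1-3 hrs", "4:+1-3 hrs", "4:+1-3 hrs",
                          "4:+1-3 hrs", "4:+1-3 hrs", "4:+1-3 hrs", "5:+3-12 hrs")
)

# Hypothetical threshold: cells with fewer games than this are flagged
min_cell_size <- 3

# Cross-tabulate the two categorical columns and convert to long format
cell_counts <- as.data.frame(
  table(toy_games[c("min_age_categories", "min_play_categories")]),
  responseName = "Number_Games",
  stringsAsFactors = FALSE
)

# Keep only the non-empty cells that fall below the threshold
small_cells <- subset(cell_counts, Number_Games > 0 & Number_Games < min_cell_size)

print(small_cells)
```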
Make at least two plots that help you answer your question on the transformed or summarized data.
Plot of Minimum Playtime vs Average Game Rating
# First I created a new dataframe without the No Min Age and No Min Playtime games, since we aren't sure if those are useful or just differently coded NA values.
BoardGames_V4 <- BoardGames_V3 %>%
  filter(min_play_categories != "0:No Min Playtime",
         min_age_categories != "0:No Min Age")
#First I want to make a plot addressing the main research question, the relationship between minimum playtime and average rating. Again, I made use of Dr. Minnier and Dr. Niederhausen's slides from their "An Introduction to R and RStudio for Exploratory Data Analysis" pt 2 presentation, and also the "Data Visualization with ggplot2 : : CHEAT SHEET" available through the help function in RStudio. I also used the Part 4 lecture from this course to help with my code.
ggplot(data = BoardGames_V4,
aes(x = min_playtime, y = average_rating, color = average_rating)) +
geom_point() +
geom_smooth(method = lm) +
scale_x_log10() +
labs(title = "Minimum Playtime vs Average Game Rating", x = "Minimum Playtime (minutes, log10 scale)", y = "Average Rating")
## `geom_smooth()` using formula 'y ~ x'
I used a log scale for the x-axis to fit all the data in without having to remove outliers, and graphically added a linear model to visualize the data trend.
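Because the x-axis is on a log10 scale, the line drawn by geom_smooth() corresponds to a linear model of rating on log10(minimum playtime). A minimal sketch of fitting that model directly, on made-up numbers (toy_games is hypothetical and does not reflect the real data or its slope):

```r
# Toy stand-in for BoardGames_V4; the values are made up for illustration.
# Playtimes must be greater than zero, because log10(0) is -Inf.
toy_games <- data.frame(
  min_playtime   = c(10, 100, 1000, 10000),
  average_rating = c(5, 6.2, 6.8, 8)
)

# Regress rating on the log10 of minimum playtime
fit <- lm(average_rating ~ log10(min_playtime), data = toy_games)

# The slope is the change in rating for each tenfold increase in playtime
slope <- coef(fit)[["log10(min_playtime)"]]

print(coef(fit))
```

Fitting the model explicitly gives a numeric slope to report alongside the plot, rather than relying on the visual trend alone.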
Boxplots of Minimum Playtime vs Average Game Rating by Age Group
Next, I wanted to look at my age-category subsets:
# For this boxplot, I used the Part 3 lecture and Part 3 assignment for assistance with the code, as well as the "Data Visualization with ggplot2 : : CHEAT SHEET" available through the help function in RStudio
ggplot(BoardGames_V4) +
aes(x = min_play_categories,
y = average_rating,
fill = min_play_categories) +
geom_boxplot() +
facet_grid(rows = vars(min_age_categories)) +
labs(title ="Boxplot of Min Playtime and Avg Rating by Age Group", x = "Min Playtime", y = "Avg Rating") +
scale_fill_discrete(name = "Min Playtime")
Final Summary (10 points)
Summarize your research question and findings below.
My original questions were: How does minimum length of play affect the average rating of a board game on Board Game Geek? How does this change if you stratify by minimum age of player? What I found was that overall, the longer the minimum length of play, the higher the mean of the average rating on Board Game Geek (when excluding the games with no listed minimum playtime, since I did not know whether those were the equivalent of NA values). However, when I broke the data down by age group, I found that while this trend held true for Older Children and Teens, it held less well for Young Children and Adults. For adults the preferred length of gameplay was more than 1 and up to 3 hours, and for young children the preferred length of gameplay was more than 10 and up to 30 minutes. However, it is important to mention that there were very few observations in the adult group, and by far the most observations in the older child and teen groups, so that could certainly be part of the reason why the overall trend followed the pattern of those two groups.
Are your findings what you expected? Why or Why not?
My findings for the overall data were not what I expected; I had expected people to prefer shorter games, but the overall trend showed the opposite: the longer the game, the higher the average rating. However, I had also hypothesized that young children’s shorter attention spans would cause them to prefer shorter games, and this did hold fairly well; in the Young Child category, the games of 10 minutes or less and the games of more than 10 and up to 30 minutes both had higher ratings than longer games. I had hypothesized that attention span would keep going up as players got older, and for the older children and teens, longer games were generally preferred. There was little data for the adult minimum age group, and since there was only one adult game longer than 3 hours, it is hard to conclude whether adults dislike very long games; for adults the preferred game length was more than 1 and up to 3 hours.
