Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I found this BoardGames dataset on the TidyTuesday GitHub page, and I thought it looked really interesting because I love playing board games but had no idea there were so many of them! I wanted to learn more about what makes a board game popular when there are so many that I had never heard of. Therefore, I came up with the following research question: How does minimum length of play affect the average rating of a board game on Board Game Geek? How does this change if you stratify by minimum age of player? I was interested in whether people prefer longer or shorter games, and whether this changes based on the age group that is the target audience of the game.
Given your question, what is your expectation about the data?
I would expect that overall, longer minimum playtime would lead to lower ratings because aside from committed board game enthusiasts, the casual player might either not want to try the game, or might get bored of it before they can learn to enjoy it. I would think that this effect might be more pronounced among games with a lower minimum player age, due to the typically lower attention span of children and tweens. Games aimed at an older teenage or adult audience might see less of an effect of minimum playtime on rating since those players might be more willing to devote longer to the process, and also may be more likely to already be enthusiastic board-game fans and therefore more used to playing for longer periods or in a social setting.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
I got this data from the TidyTuesday GitHub page. The folder containing the data can be accessed using the following link: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12
First I will load the BoardGames dataset:
library(dplyr) # provides glimpse(), relocate(), count(), filter() and %>%
BoardGames <- read.csv("board_games.csv")
head(BoardGames)
Then I will check to make sure that the average_rating, min_playtime, and min_age columns are all encoded as numbers and not characters, so that I can do my analysis properly.
glimpse(BoardGames)
## Rows: 10,532
## Columns: 22
## $ game_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ description <chr> "Die Macher is a game about seven sequential political…
## $ image <chr> "//cf.geekdo-images.com/images/pic159509.jpg", "//cf.g…
## $ max_players <int> 5, 4, 4, 4, 6, 6, 2, 5, 4, 6, 7, 5, 4, 4, 6, 4, 2, 8, …
## $ max_playtime <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, 60, 120…
## $ min_age <int> 14, 12, 10, 12, 12, 12, 8, 12, 13, 10, 13, 12, 10, 10,…
## $ min_players <int> 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2, …
## $ min_playtime <int> 240, 30, 30, 60, 90, 240, 20, 120, 90, 60, 45, 45, 60,…
## $ name <chr> "Die Macher", "Dragonmaster", "Samurai", "Tal der Köni…
## $ playing_time <int> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, 60, 120…
## $ thumbnail <chr> "//cf.geekdo-images.com/images/pic159509_t.jpg", "//cf…
## $ year_published <int> 1986, 1981, 1998, 1992, 1964, 1989, 1978, 1993, 1998, …
## $ artist <chr> "Marcus Gschwendtner", "Bob Pepper", "Franz Vohwinkel"…
## $ category <chr> "Economic,Negotiation,Political", "Card Game,Fantasy",…
## $ compilation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "CATAN…
## $ designer <chr> "Karl-Heinz Schmiel", "G. W. \"Jerry\" D'Arcey", "Rein…
## $ expansion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Elfengold,Elfenla…
## $ family <chr> "Country: Germany,Valley Games Classic Line", "Animals…
## $ mechanic <chr> "Area Control / Area Influence,Auction/Bidding,Dice Ro…
## $ publisher <chr> "Hans im Glück Verlags-GmbH,Moskito Spiele,Valley Game…
## $ average_rating <dbl> 7.66508, 6.60815, 7.44119, 6.60675, 7.35830, 6.52534, …
## $ users_rated <int> 4498, 478, 12019, 314, 15195, 73, 2751, 186, 1263, 672…
Lastly, I’ll check for NA values in my columns of interest (again, average_rating, min_playtime, and min_age):
#I just found this code/refreshed my memory using the help function in RStudio
anyNA(BoardGames$min_age)
## [1] FALSE
anyNA(BoardGames$min_playtime)
## [1] FALSE
anyNA(BoardGames$average_rating)
## [1] FALSE
There are no missing values encoded as NA for my three columns of interest.
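The three checks above can also be run in a single call that reports a count per column. This is a minimal sketch on a small made-up data frame; the `toy` values are hypothetical and are not taken from the BoardGames data.

```r
# Hypothetical toy data standing in for BoardGames (values are made up)
toy <- data.frame(
  min_age        = c(8, 12, NA),
  min_playtime   = c(30, 60, 45),
  average_rating = c(6.5, 7.2, 5.9)
)

# Columns of interest
cols <- c("min_age", "min_playtime", "average_rating")

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(toy[cols]))
na_counts
```

A count per column shows how many values are missing, not only whether any are.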
If there are any quirks that you have to deal with, such as NA coded as something else or multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Note: I thought I did not have NA values, but once I started exploring the data further, it turned out that some minimum age values and some minimum playtime values were encoded as zero. It was unclear whether this meant that there really was no minimum age/playtime or just that the data was missing. I chose to create a separate category for these values when I categorized both of those variables. I looked at them briefly in my tables, and excluded them from my graphs, using the category I had created.
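One way to see how many values are encoded as zero is to count them per column. The sketch below uses a small made-up data frame; the `toy` values are hypothetical and do not reflect the actual counts in the BoardGames data.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  min_age      = c(0, 8, 12, 0),
  min_playtime = c(30, 0, 60, 45)
)

# Comparing a data frame to 0 gives a logical matrix;
# colSums() then counts the zeros in each column
zero_counts <- colSums(toy == 0)
zero_counts
```

anyNA() would not flag these rows, because a zero is a valid number rather than NA.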
Make sure your data types are correct!
Done! See above.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
The three main things I’d like to do to my dataframe to make it more useful for this project are:
Move the columns I will be using to the beginning of the table, because the 2nd column, which contains the description of the board game, is very long; it makes the table unwieldy to look at quickly using functions like head, and it doesn’t show up well in the final html file.
Divide minimum playtime into groups, to ease analysis; this will involve making a new categorical variable for minimum playtime.
Divide minimum age into categories for my analysis; I’ll need to look at the range and distribution of ages first, and then choose appropriate category breaks and add a new column with the re-coded categorical variable. This will allow me to subset for the next question.
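The two categorization steps above could be sketched with case_when() as follows. The toy data, the cut points, and the category labels are all illustrative assumptions; the real breaks should be chosen after looking at the range and distribution of each variable. The sketch also assumes the columns contain no NA values, as checked earlier.

```r
library(dplyr)

# Hypothetical toy data; values, cut points and labels are assumptions
toy <- data.frame(
  min_playtime = c(0, 15, 45, 120, 300),
  min_age      = c(0, 6, 10, 14, 18)
)

toy_cat <- toy %>%
  mutate(
    # Conditions are evaluated in order, so zero is caught first
    playtime_cat = case_when(
      min_playtime == 0   ~ "0 (none or missing)",
      min_playtime <= 30  ~ "1-30 min",
      min_playtime <= 60  ~ "31-60 min",
      min_playtime <= 120 ~ "61-120 min",
      TRUE                ~ "over 120 min"
    ),
    age_cat = case_when(
      min_age == 0  ~ "0 (none or missing)",
      min_age <= 7  ~ "young child (1-7)",
      min_age <= 12 ~ "child/tween (8-12)",
      min_age <= 17 ~ "teen (13-17)",
      TRUE          ~ "adult (18+)"
    )
  )

toy_cat
```

Keeping zero as its own category makes it easy to show those rows in tables and drop them from graphs with a single filter().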
Subset the data and think of a question to answer with the subset
I will be subsetting the data into age groups in order to look not only at the correlation between minimum playtime and average rating overall, but also at the correlation within each of the separate age groups.
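A per-group correlation like this could be computed with group_by() and summarise(). This is a minimal sketch; the `toy` data frame and the `age_cat` column are hypothetical stand-ins for the categorized BoardGames data.

```r
library(dplyr)

# Hypothetical toy data; age_cat stands in for the age category column
toy <- data.frame(
  age_cat        = c("child", "child", "child", "teen", "teen", "teen"),
  min_playtime   = c(10, 20, 30, 30, 60, 90),
  average_rating = c(6.0, 5.5, 5.0, 6.0, 6.5, 7.0)
)

# One row per age group: number of games and the Pearson correlation
# between minimum playtime and average rating within that group
cor_by_age <- toy %>%
  group_by(age_cat) %>%
  summarise(
    n_games = n(),
    r = cor(min_playtime, average_rating)
  )

cor_by_age
```

Reporting the group size next to each correlation helps flag groups that are too small for the estimate to be meaningful.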
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
I don’t anticipate needing to merge tables for this project, as the BoardGames dataset came in a complete form.
#First, I am rearranging the columns in the table. This code came from Dr. Minnier and Dr. Niederhausen's "An Introduction to R and RStudio for Exploratory Data Analysis" Part 2 slides. https://jminnier-berd-r-courses.netlify.app/01-intro-r-eda/01_intro_r_eda_part2#1
BoardGames_V1 <- BoardGames %>%
  relocate(min_age, min_playtime, average_rating)
head(BoardGames_V1)
#Now that my columns are in a good order, I want to look at my minimum playtime data, and see what the distribution is so that I can find good cut points for my categorization. First I used the count function to see the distribution of observations for each playtime.
BoardGames_V1 %>%
count(min_playtime)
#Then I made a boxplot of the minimum play times to visualize the distribution
boxplot(BoardGames_V1$min_playtime, horizontal = TRUE, xlab = "Minimum Playtime (Min)", main = "Boxplot of Min Playtime Distribution")
As we can see, there are a few extreme outliers, which make it hard to visualize the data, so I’m going to remove those points temporarily.
# I'm filtering out the highest 3 observations which were much higher than any of the other min play-times
BoardGames_V1.1 <- BoardGames_V1 %>%
filter(min_playtime <= 6000)
boxplot(BoardGames_V1.1$min_playtime, horizontal = TRUE, xlab = "Minimum Playtime (Min)", main = "Boxplot of Min Playtime Distribution Without Most Extreme Outliers")