Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute it. Provide a link to the source of any borrowed code as a comment next to the code.

Before You Get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows or less, and should have both categorical and numeric covariates.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday

  • Note that most of these are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.

You may use another dataset or your own data, but please make sure it is de-identified.

  2. Please schedule a time with Eric or me to discuss your dataset and research question. We just want to look at the data and make sure that it is appropriate for your question.

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or me know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

The code chunks and information below were obtained from the R for Data Science readings and in-class activities. If other acknowledgments are needed, they are noted right below my work for that section or code.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Which items have the highest and lowest buy values in the game Animal Crossing New Horizons (ACNH)?

I am interested in this dataset because I play ACNH and I am curious which of the available items are the most and least valuable to purchase in the game.

Given your question, what is your expectation about the data?

I expect to find out which items are the most and least valuable items to buy in the game.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

items <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/items.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   num_id = col_double(),
##   id = col_character(),
##   name = col_character(),
##   category = col_character(),
##   orderable = col_logical(),
##   sell_value = col_double(),
##   sell_currency = col_character(),
##   buy_value = col_double(),
##   buy_currency = col_character(),
##   sources = col_character(),
##   customizable = col_logical(),
##   recipe = col_double(),
##   recipe_id = col_character(),
##   games_id = col_character(),
##   id_full = col_character(),
##   image_url = col_character()
## )
## Warning: 2 parsing failures.
##  row          col           expected actual                                                                                                  file
## 4472 customizable 1/0/T/F/TRUE/FALSE    Yes 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/items.csv'
## 4473 customizable 1/0/T/F/TRUE/FALSE    Yes 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/items.csv'
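
The warning above says that rows 4472 and 4473 contain "Yes" in the customizable column, which readr guessed to be logical, so those two values could not be parsed. The analysis below never uses customizable, so the warning does not change the results. If the column were needed, one option is to read it as text and recode it by hand. This is a minimal sketch on a made-up three-row file (the file contents and the name items_demo are illustrative, not the real data):

```r
library(readr)

# Hypothetical stand-in for items.csv, written to a temp file so the sketch
# runs offline. Only the "Yes" quirk is taken from the warning above.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,customizable",
    "Bandage,FALSE",
    "Pyramid,TRUE",
    "Den Desk,Yes"),
  csv_path
)

# Read the column as text so "Yes" is kept as-is instead of failing to parse.
items_demo <- read_csv(
  csv_path,
  col_types = cols(
    name = col_character(),
    customizable = col_character()
  )
)

# Recode to logical by hand: "TRUE" and "Yes" both count as TRUE,
# and missing values stay missing.
items_demo$customizable <- ifelse(
  is.na(items_demo$customizable),
  NA,
  items_demo$customizable %in% c("TRUE", "Yes")
)

items_demo
```

Treating "Yes" as TRUE is an assumption about what the data authors meant. readr::problems() can be called on a data frame returned by read_csv() to list the rows that failed to parse.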

If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

Some rows have NA values for my buy_value variable or my sources variable, and I want to remove those rows entirely from my dataset because they are not relevant to my question. I also need to keep only the rows where buy_currency equals bells, since I am not interested in comparing items that require miles currency.

For the NA values and the buy currency I used the filter() function. First I created a new dataset called items_filter. Then I specified buy_currency == "bells" to keep the items that are purchased with bells. Within the same filter() call I specified !is.na(buy_value) and !is.na(sources) to remove the NA values from my buy_value and sources variables.

I was able to pull the data I needed.

To subset my data further I used the select() function to specify which variables I wanted to see in my new data frame. I created another dataset named items_select and kept only name, category, sources, and buy_value from my items_filter dataset.

I had to remove the rows whose source is DIY from this dataset to be able to look at my data, because they were making my graphs come out very wrong. This did not affect the highest-valued item, as its source is Nook’s Cranny.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Subset the data and think of a question that the subset can answer.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

library(dplyr)                               # provides %>%, arrange(), filter(), select(), glimpse()
items_filter <- items %>%
  arrange(sources, category) %>%             # From Part 3 in class and
  filter(buy_currency == "bells",            # From R for Data Science.
         !is.na(buy_value),
         !is.na(sources),
         sources != "DIY")                   # Intro to R and RStudio by Dr. Jessica Minnier
items_select <- items_filter %>%
  select(name, category, sources, buy_value)

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(items_select)
## Rows: 605
## Columns: 4
## $ name      <chr> "Bandage", "Bubblegum", "Butterfly Shades", "Cat Nose", "Cu…
## $ category  <chr> "Accessories", "Accessories", "Accessories", "Accessories",…
## $ sources   <chr> "Able Sisters", "Able Sisters", "Able Sisters", "Able Siste…
## $ buy_value <dbl> 140, 140, 1040, 560, 700, 770, 1100, 1100, 910, 1100, 490, …

Are the values what you expected for the variables? Why or Why not?

Yes these variables are exactly what I expected. I was able to specify the variables I wanted to see in my dataset and remove the ‘NA’ values that are not relevant to my question.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.

Because my question asks which items have the highest and lowest buy values, I used slice_max() and slice_min(), as suggested by Eric.
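
For comparison, a group_by()/summarize() version of this kind of summary could look like the sketch below. It runs on a small hand-typed tibble (items_toy, an illustrative name) whose rows copy a few names and prices from the output in this report, so the result says nothing about the full dataset:

```r
library(dplyr)

# A few rows in the same shape as items_select; illustrative subset only.
items_toy <- tibble(
  name      = c("Bandage", "Cat Nose", "Cute Bed", "Den Desk", "Red Roses"),
  category  = c("Accessories", "Accessories", "Furniture", "Furniture", "Flowers"),
  buy_value = c(140, 560, 12000, 10000, 160)
)

# One row per category with the number of items and the price range.
category_summary <- items_toy %>%
  group_by(category) %>%
  summarize(
    n_items = n(),
    min_buy = min(buy_value),
    max_buy = max(buy_value),
    .groups = "drop"
  ) %>%
  arrange(desc(max_buy))

category_summary
```

A per-category table like this shows where the expensive and cheap items sit, while slice_max() and slice_min() name the individual items.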

items_slice_max <- items_select %>%          # slice_min() and slice_max() suggested by Eric, our awesome TA.
  slice_max(buy_value, n = 10) %>%
  arrange(desc(buy_value))
items_slice_max
## # A tibble: 10 x 4
##    name                category  sources                               buy_value
##    <chr>               <chr>     <chr>                                     <dbl>
##  1 Open-frame Kitchen  Furniture Nook's Cranny                            140000
##  2 Lighthouse          Furniture Nook Shopping                            100000
##  3 Cotton-candy Stall  Furniture Nook Shopping                             60000
##  4 Acnh Nintendo Swit… Furniture Nook Shopping                             35960
##  5 Acnh Nintendo Swit… Furniture Receive in mail if playing on ACNH S…     35960
##  6 Nintendo Switch     Furniture Receive in mail on your second day        29980
##  7 Cute Bed            Furniture Nook's Cranny                             12000
##  8 Den Desk            Furniture Nook's Cranny                             10000
##  9 High-end Stereo     Furniture Nook's Cranny                             10000
## 10 Pyramid             Furniture Gulliver                                   9200
items_slice_min <- items_select %>%
  slice_min(buy_value, n = 10, with_ties = FALSE)  # with_ties = FALSE from the R help page.
items_slice_min
## # A tibble: 10 x 4
##    name            category    sources        buy_value
##    <chr>           <chr>       <chr>              <dbl>
##  1 Bandage         Accessories Able Sisters         140
##  2 Bubblegum       Accessories Able Sisters         140
##  3 Red Cosmos      Flowers     Find on ground       160
##  4 Red Hyacinths   Flowers     Find on ground       160
##  5 Red Lilies      Flowers     Find on ground       160
##  6 Red Mums        Flowers     Find on ground       160
##  7 Red Pansies     Flowers     Find on ground       160
##  8 Red Roses       Flowers     Find on ground       160
##  9 Red Tulips      Flowers     Find on ground       160
## 10 Red Windflowers Flowers     Find on ground       160
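
The table above stops at exactly 10 rows because of with_ties = FALSE. Eight of those rows share the price 160, so if more items in the data cost 160 they are cut off, and which ones are kept depends on row order. A minimal sketch of the difference, using a made-up five-row tibble (prices is an illustrative name):

```r
library(dplyr)

prices <- tibble(
  name      = c("Bandage", "Bubblegum", "Red Roses", "Red Tulips", "Red Mums"),
  buy_value = c(140, 140, 160, 160, 160)
)

# Default (with_ties = TRUE): every row tied with the 3rd-lowest value is kept,
# so more than n rows can come back.
with_ties_kept <- prices %>%
  slice_min(buy_value, n = 3)

# with_ties = FALSE: exactly n rows are returned; ties are cut by row order.
ties_dropped <- prices %>%
  slice_min(buy_value, n = 3, with_ties = FALSE)

nrow(with_ties_kept)
nrow(ties_dropped)
```

Leaving with_ties at its default is one way to check how many items actually share the lowest prices before deciding which ones to show.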

What are your findings about the summary? Are they what you expected?

I would have expected higher prices for some of these max values, because I have seen things that cost far more in the game. This may be because the dataset is from some time in 2020 and the updates that have been run in the game since then may affect the items and prices; that is, there may be new items that this list cannot account for. What also surprises me is that my highest-valued item is not shown in this list but appears in my graph below.

Make at least two plots that help you answer your question on the transformed or summarized data.

Most of these code chunks came from a combination of Part 3 and the Introduction to R and RStudio by Dr. Jessica Minnier. Eric helped me find a website on scale_y_continuous() to change the labels on my y-axis, and the rotation for the labels on the x-axis came from Stack Overflow.

library(ggplot2)                                                            # provides ggplot(), geom_col(), theme(), labs()
items_slice_max %>%
  ggplot(aes(x = name, y = buy_value, fill = sources)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) +                    # https://datavizpyr.com/dollar-format-for-axis-labels-with-ggplot2/
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +   # The code to rotate the labels in my x-axis came from Stack Overflow.
  labs(title = "Items with Highest Buy Values",
       x = "Item Names",
       y = "Buy Value in Bells")

items_slice_min %>%
  ggplot(aes(x = name, y = buy_value, fill = sources)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) +                    # https://datavizpyr.com/dollar-format-for-axis-labels-with-ggplot2/
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +   # The code to rotate the labels in my x-axis came from Stack Overflow.
  labs(title = "Items with Lowest Buy Values",
       x = "Item Names",
       y = "Buy Value in Bells")

The first graph above shows the 10 highest-valued items in ACNH, and the second graph shows the 10 lowest-valued items. The bars are color-coded by the source each item comes from. The y-axis labels use a dollar sign to stand for Bells.

Final Summary (10 points)

Summarize your research question and findings below.

The question I am trying to answer is: which items have the highest and lowest buy values in the game of ACNH? What I discovered is that the highest-valued item is the Open-frame Kitchen, costing 140,000 Bells, and the lowest-valued items are the Bandage and Bubblegum, which cost 140 Bells each. I was also able to show the top 10 and bottom 10 valued items in this game. This is exactly what I wanted to know.

Are your findings what you expected? Why or Why not?

This is exactly what I expected to find. I had to do some manipulation of the data because one of the items is listed three times in the dataset. After manipulating the data, I was able to extract what I needed, and my findings are what I expected them to be.
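
Repeated item names matter for the plots because geom_col() stacks bars that share the same x value, so an item listed several times can appear as one tall bar. The sketch below shows one way to spot repeated names and keep a single row per name. The tibble items_toy and its item and source labels are hypothetical placeholders, not rows from the dataset:

```r
library(dplyr)

# Hypothetical rows: "Item A" is listed three times under different sources.
items_toy <- tibble(
  name      = c("Item A", "Item A", "Item A", "Item B"),
  sources   = c("Source 1", "Source 2", "Source 3", "Source 1"),
  buy_value = c(35960, 35960, 29980, 12000)
)

# Names that appear more than once, with how often they appear.
repeated_names <- items_toy %>%
  count(name) %>%
  filter(n > 1)

# One row per name, keeping the highest-priced listing
# (ties are cut by row order).
one_per_name <- items_toy %>%
  group_by(name) %>%
  slice_max(buy_value, n = 1, with_ties = FALSE) %>%
  ungroup()

repeated_names
one_per_name
```

Keeping the highest-priced listing is only one possible rule. Which listing to keep is a judgment call that depends on the question being asked.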