library(ggplot2)
library(readxl)
library(vembedr)
library(visdat)
library(tidyverse)
library(skimr)
library(palmerpenguins)
library(dplyr)
library(ggthemes)
library(ggridges)
library(janitor)
library(wesanderson)
library(here)

Research Question (10 points)


Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?


I will be working on the friends datasets from the provided webside. I am a fan of this series and excited to work on a dataset about the characters’ emotions. The specific research question that I want to answer is “Which of the main six characters in friends series showed the most drastic change in few emotional responses from season one through four?” Since it is a comedy series (or I consider it so), I wanted to see the number of joyful and sad scenes especially and who play them the most over the seasons. Unfortunately, emotion data is only available till season 4 in my downloaded files, so I will be using only season 1, 2 3, and 4.


Given your question, what is your expectation about the data?


I think Chandler would play the most joyful scenes, Ross the sad ones, Rachel would play more mad scenes and Monica would play the powerful scenes.


Data Loading (10 points)


Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.


# Using read_excel here because this is what I am learning in this class, it is new to me. I always converted xlsx into csv and and loaded csv data. 

friends = read_excel(here('data/friends_excel.xlsx'), 
                      sheet=1, 
                      na="NA")
friendsEMO = read_excel(here('data/friends_emotions_excel.xlsx'), 
                      sheet=1, 
                      na="NA")

# Direct links from the provided website
#friends = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends.csv')
#friends_emotions = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends_emotions.csv')

#friendsCSV = read.csv("friends.csv")
#friendsEmoCSV = read.csv("friends_emotions.csv")

# text and speaker are character variables and season, episode, scene and utterance are numeric variable.
dplyr::glimpse(friends)
## Rows: 67,373
## Columns: 6
## $ text      <chr> "There's nothing to tell! He's just some guy I work with!",…
## $ speaker   <chr> "Monica Geller", "Joey Tribbiani", "Chandler Bing", "Phoebe…
## $ season    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ scene     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ utterance <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

skimr::skim(friendsEMO)
Data summary
Name friendsEMO
Number of rows 12606
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
emotion 0 1 3 8 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
season 0 1 2.57 1.11 1 2 3 4 4 ▆▆▁▇▇
episode 0 1 13.19 7.20 1 7 14 20 25 ▇▆▇▇▇
scene 0 1 7.52 4.43 1 4 7 11 29 ▇▇▂▁▁
utterance 0 1 9.80 5.94 1 5 9 14 30 ▇▇▅▂▁

slice(friendsEMO)
## # A tibble: 12,606 x 5
##    season episode scene utterance emotion
##     <dbl>   <dbl> <dbl>     <dbl> <chr>  
##  1      1       1     4         1 Mad    
##  2      1       1     4         3 Neutral
##  3      1       1     4         4 Joyful 
##  4      1       1     4         5 Neutral
##  5      1       1     4         6 Neutral
##  6      1       1     4         7 Neutral
##  7      1       1     4         8 Scared 
##  8      1       1     4        10 Joyful 
##  9      1       1     4        11 Joyful 
## 10      1       1     4        12 Sad    
## # … with 12,596 more rows

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.


I converted the CSV files to excel, and assigned missing values as “NA”. I want the emotions with character names in a single data frame; hence I will be merging the two datasets using left_join.


Make sure your data types are correct!


Data Transforming (15 points)


If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.


# I dont need to transform values but need to merge the two main dataset because one has the emotional responses and the other has the character names.
friendsMERGE = friendsEMO %>%
  left_join(y = friends, 
             by = c("season" = "season", "episode" = "episode", 
                    "scene" = "scene", "utterance" = "utterance"))
#nrow(friendsMERGE)
#friendsMERGEinner = friendsEMO %>%
#  inner_join(y = friends, 
#             by = c("season" = "season", "episode" = "episode", 
#                    "scene" = "scene", "utterance" = "utterance"))
#nrow(friendsMERGEinner)

friendsMERGE = friendsMERGE %>% 
  select(season, emotion, speaker) 

#!is.na(friendsMERGE) too many rows to check
vis_dat(friendsMERGE)


Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.


Left_join adds columns to the right even if they do not have values in few rows for the conditions I am joining them. Inner_join returns rows when both merging tables have maching values.I am using left_join here because to me it is more intuitive (I have a table I like I want to add columns in that table). However, I did test inner_join and it does not give a different row numbers, and also I want the emotional responses for the rows with character names and no emotional response value would have been no use for this project.


Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.


slice(friendsMERGE)
## # A tibble: 12,606 x 3
##    season emotion speaker       
##     <dbl> <chr>   <chr>         
##  1      1 Mad     Ross Geller   
##  2      1 Neutral Joey Tribbiani
##  3      1 Joyful  Chandler Bing 
##  4      1 Neutral Joey Tribbiani
##  5      1 Neutral Chandler Bing 
##  6      1 Neutral Joey Tribbiani
##  7      1 Scared  Chandler Bing 
##  8      1 Joyful  Joey Tribbiani
##  9      1 Joyful  Chandler Bing 
## 10      1 Sad     Ross Geller   
## # … with 12,596 more rows

Are the values what you expected for the variables? Why or Why not?


I believe I got the right values for the variables. I will only be looking at season one through four, emotional responses and the main six characters. I have more than the four seasons I need, more emotional responses than I need and more characters than I need which I will be filtering through in the next step.


Visualizing and Summarizing (15 points)


Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question


# Here I am filtering the the emotions I need and for the main six characters. We have emotions data only for four seasons. 
friendsFinal = friendsMERGE %>%
    filter(emotion %in% c("Joyful", "Sad", "Powerful", "Mad")) %>%
    filter(speaker %in% c("Chandler Bing", "Joey Tribbiani",
           "Ross Geller", "Phoebe Buffay", "Monica Geller",
           "Rachel Green")) %>%
    filter(season %in% c(1, 2, 3, 4)) %>%
    group_by(speaker, emotion, season) %>%
    summarize(EmoCount = n()) 
## `summarise()` has grouped output by 'speaker', 'emotion'. You can override using the `.groups` argument.
glimpse(friendsFinal)
## Rows: 96
## Columns: 4
## Groups: speaker, emotion [24]
## $ speaker  <chr> "Chandler Bing", "Chandler Bing", "Chandler Bing", "Chandler…
## $ emotion  <chr> "Joyful", "Joyful", "Joyful", "Joyful", "Mad", "Mad", "Mad",…
## $ season   <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, …
## $ EmoCount <int> 96, 92, 103, 96, 57, 57, 63, 41, 20, 17, 39, 65, 17, 18, 18,…
slice(friendsFinal)
## # A tibble: 96 x 4
## # Groups:   speaker, emotion [24]
##    speaker       emotion  season EmoCount
##    <chr>         <chr>     <dbl>    <int>
##  1 Chandler Bing Joyful        1       96
##  2 Chandler Bing Joyful        2       92
##  3 Chandler Bing Joyful        3      103
##  4 Chandler Bing Joyful        4       96
##  5 Chandler Bing Mad           1       57
##  6 Chandler Bing Mad           2       57
##  7 Chandler Bing Mad           3       63
##  8 Chandler Bing Mad           4       41
##  9 Chandler Bing Powerful      1       20
## 10 Chandler Bing Powerful      2       17
## # … with 86 more rows

What are your findings about the summary? Are they what you expected?


The summary successfully filtered out first four seasons and few emotions that I will be looking at. This table is exactly what I needed for further analysis.


Make at least two plots that help you answer your question on the transformed or summarized data.


# Bar plot by each four interested emotions 
ggplot(data = friendsFinal,
       aes(x = speaker,
           y = EmoCount, 
           fill = factor(season))) +
    geom_bar(stat = "identity", position = "dodge") +
    facet_wrap(vars(emotion)) +  
    scale_fill_discrete(name = "Seasons") +
    labs(x = "character Names",  
       y = "Emotional Response Count",  
       title = "Barplot of Emotions by Seasons") +  
    theme_bw() +  
    theme(axis.text.x = element_text(angle = -30, hjust = 0), 
          legend.position = "bottom") +  
    scale_fill_viridis_d(name = "Seasons") +
    scale_color_manual(values = wes_palette("BottleRocket2"))
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.