Midterm (Due 2/12/2021 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Research question by Amy:

1. What are the top four countries of origin in the data set, and which of these four has the highest coffee cup rating?

Research question by Sara:

2. Does altitude affect the taste of coffee and its ratings?

Given your question, what is your expectation about the data?

Amy: Through the literature review, I found that a total of 80 countries have the right climate conditions to produce coffee; however, only 50 of those countries have the proper infrastructure for coffee production. Some of the leading countries in coffee production are Colombia, Guatemala, Ethiopia and Costa Rica. Generally, Colombia produces the widest range of coffee flavors and accounts for one of the largest shares of coffee distribution, a total of 15%. In terms of flavor profile, Guatemala has been voted the highest-quality coffee globally, largely because of the climatic conditions in which it is grown. Therefore, I expect to find that Colombia is the country with the most coffee ratings and that Guatemala has the highest coffee ratings.

Reference: https://barefootcoffeeroasters.com/countries-with-the-best-coffee-beans/

Sara: Many factors affect a coffee's flavor profile and therefore its overall rating. Through my literature review, I found that the highest-rated coffee should come from regions of higher altitude, because higher altitudes (specifically 900 m to 1500 m) provide the most desirable conditions for growing coffee trees. What makes these altitudes so desirable is the temperature and the water drainage. Proper water drainage gives the coffee beans a more concentrated flavor, which tends to be the most desirable to coffee drinkers globally.

Reference: www.scribblerscoffee.com

image reference: Moldvaer, Anette. Coffee obsession. Dorling Kindersley, 2014.

knitr::include_graphics("image/Anette Moldvaer, Coffee Obsession.jpg")

image reference: www.scribblerscoffee.com

knitr::include_graphics("image/scribblerscoffee.com.jpg")

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

Amy and Sara: The data were found on github.com under tidytuesday. When loading the data, we set the argument na = "NA" so that R would treat the string NA as a missing value. A quick exploratory analysis showed that country_of_origin and total_cup_points have almost no missing data (1 and 0 missing values, respectively); however, altitude_mean_meters has 230 missing values, which will need to be accounted for later in the analysis.

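# Packages used throughout this document
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(skimr)      # skim()
library(janitor)    # clean_names(), tabyl(), adorn_*()
# Data taken from github.com and tidytuesday
coffee_ratings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv', na = "NA")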
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   total_cup_points = col_double(),
##   number_of_bags = col_double(),
##   aroma = col_double(),
##   flavor = col_double(),
##   aftertaste = col_double(),
##   acidity = col_double(),
##   body = col_double(),
##   balance = col_double(),
##   uniformity = col_double(),
##   clean_cup = col_double(),
##   sweetness = col_double(),
##   cupper_points = col_double(),
##   moisture = col_double(),
##   category_one_defects = col_double(),
##   quakers = col_double(),
##   category_two_defects = col_double(),
##   altitude_low_meters = col_double(),
##   altitude_high_meters = col_double(),
##   altitude_mean_meters = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
skim(coffee_ratings)
Data summary
Name coffee_ratings
Number of rows 1339
Number of columns 43
_______________________
Column type frequency:
character 24
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1.00 7 7 0 2 0
owner 7 0.99 3 50 0 315 0
country_of_origin 1 1.00 4 28 0 36 0
farm_name 359 0.73 1 73 0 571 0
lot_number 1063 0.21 1 71 0 227 0
mill 315 0.76 1 77 0 460 0
ico_number 151 0.89 1 40 0 847 0
company 209 0.84 3 73 0 281 0
altitude 226 0.83 1 41 0 396 0
region 59 0.96 2 76 0 356 0
producer 231 0.83 1 100 0 691 0
bag_weight 0 1.00 1 8 0 56 0
in_country_partner 0 1.00 7 85 0 27 0
harvest_year 47 0.96 3 24 0 46 0
grading_date 0 1.00 13 20 0 567 0
owner_1 7 0.99 3 50 0 319 0
variety 226 0.83 4 21 0 29 0
processing_method 170 0.87 5 25 0 5 0
color 218 0.84 4 12 0 4 0
expiration 0 1.00 13 20 0 566 0
certification_body 0 1.00 7 85 0 26 0
certification_address 0 1.00 40 40 0 32 0
certification_contact 0 1.00 40 40 0 29 0
unit_of_measurement 0 1.00 1 2 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1.00 82.09 3.50 0 81.08 82.50 83.67 90.58 ▁▁▁▁▇
number_of_bags 0 1.00 154.18 129.99 0 14.00 175.00 275.00 1062.00 ▇▇▁▁▁
aroma 0 1.00 7.57 0.38 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
flavor 0 1.00 7.52 0.40 0 7.33 7.58 7.75 8.83 ▁▁▁▁▇
aftertaste 0 1.00 7.40 0.40 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
acidity 0 1.00 7.54 0.38 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
body 0 1.00 7.52 0.37 0 7.33 7.50 7.67 8.58 ▁▁▁▁▇
balance 0 1.00 7.52 0.41 0 7.33 7.50 7.75 8.75 ▁▁▁▁▇
uniformity 0 1.00 9.83 0.55 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1.00 9.84 0.76 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1.00 9.86 0.62 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1.00 7.50 0.47 0 7.25 7.50 7.75 10.00 ▁▁▁▇▁
moisture 0 1.00 0.09 0.05 0 0.09 0.11 0.12 0.28 ▃▇▅▁▁
category_one_defects 0 1.00 0.48 2.55 0 0.00 0.00 0.00 63.00 ▇▁▁▁▁
quakers 1 1.00 0.17 0.83 0 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1.00 3.56 5.31 0 0.00 2.00 4.00 55.00 ▇▁▁▁▁
altitude_low_meters 230 0.83 1750.71 8669.44 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
altitude_high_meters 230 0.83 1799.35 8668.81 1 1100.00 1350.00 1650.00 190164.00 ▇▁▁▁▁
altitude_mean_meters 230 0.83 1775.03 8668.63 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
glimpse(coffee_ratings)
## Rows: 1,339
## Columns: 43
## $ total_cup_points      <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75…
## $ species               <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ar…
## $ owner                 <chr> "metad plc", "metad plc", "grounds for health a…
## $ country_of_origin     <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia"…
## $ farm_name             <chr> "metad plc", "metad plc", "san marcos barrancas…
## $ lot_number            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mill                  <chr> "metad plc", "metad plc", NA, "wolensu", "metad…
## $ ico_number            <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", …
## $ company               <chr> "metad agricultural developmet plc", "metad agr…
## $ altitude              <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "180…
## $ region                <chr> "guji-hambela", "guji-hambela", NA, "oromia", "…
## $ producer              <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabe…
## $ number_of_bags        <dbl> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, …
## $ bag_weight            <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg…
## $ in_country_partner    <chr> "METAD Agricultural Development plc", "METAD Ag…
## $ harvest_year          <chr> "2014", "2014", NA, "2014", "2014", "2013", "20…
## $ grading_date          <chr> "April 4th, 2015", "April 4th, 2015", "May 31st…
## $ owner_1               <chr> "metad plc", "metad plc", "Grounds for Health A…
## $ variety               <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other…
## $ processing_method     <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / …
## $ aroma                 <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25,…
## $ flavor                <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33,…
## $ aftertaste            <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50,…
## $ acidity               <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42,…
## $ body                  <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33,…
## $ balance               <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50,…
## $ uniformity            <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00…
## $ clean_cup             <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ sweetness             <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00…
## $ cupper_points         <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00,…
## $ moisture              <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03,…
## $ category_one_defects  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ quakers               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ color                 <chr> "Green", "Green", NA, "Green", "Green", "Bluish…
## $ category_two_defects  <dbl> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0,…
## $ expiration            <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st…
## $ certification_body    <chr> "METAD Agricultural Development plc", "METAD Ag…
## $ certification_address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "30…
## $ certification_contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19…
## $ unit_of_measurement   <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m…
## $ altitude_low_meters   <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA,…
## $ altitude_high_meters  <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA,…
## $ altitude_mean_meters  <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA,…
class(coffee_ratings)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
coffee_ratings<-coffee_ratings%>%clean_names()
table(coffee_ratings$species)
## 
## Arabica Robusta 
##    1311      28

If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

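Sara: Since I am working with altitude, I ran clean_names() to make sure the column names are in a consistent format (lower case, with underscores instead of spaces). Note that clean_names() only changes the column names; it does not change the values, so the 230 missing values in altitude_mean_meters are still present and have to be handled separately.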

coffee_ratings%>%count(altitude_mean_meters,sort=TRUE)
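
One quirk visible in the skim() output above is that altitude_mean_meters has a maximum of 190164 m, which cannot be a real growing altitude. The following is a minimal sketch of one way such values could be flagged and recoded to NA before binning. It uses base R and a small made-up data frame, and the cutoff `max_plausible_m` is an assumption chosen for illustration (roughly the height of the highest point on Earth), not something taken from the original analysis.

```r
# Toy data standing in for coffee_ratings; the values are made up for illustration.
toy <- data.frame(
  country_of_origin    = c("Ethiopia", "Guatemala", "Brazil", "Mexico"),
  altitude_mean_meters = c(2075, 190164, NA, 1100)
)

# Assumed cutoff: the highest point on Earth is roughly 8,849 m,
# so anything above it cannot be a real growing altitude.
max_plausible_m <- 8849

# Flag implausible values, treating existing NAs as "not flagged".
implausible <- !is.na(toy$altitude_mean_meters) &
  toy$altitude_mean_meters > max_plausible_m

# Recode flagged values to NA so they drop out with the other missing altitudes.
toy$altitude_mean_meters[implausible] <- NA

summary(toy$altitude_mean_meters)
```

Whether to recode, drop, or try to correct such rows (for example, values that may have been recorded in the wrong unit) is a judgment call that depends on the question being asked.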

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Amy and Sara: The variables altitude_mean_meters and total_cup_points need to be transformed into categorical data using the case_when() function inside mutate(), which lets us assign each observation to a category. For the mean altitude, we created 4 groups based on the image provided above: low altitude (< 900 m), medium altitude (900 m to < 1200 m), high altitude (1200 m to < 1500 m), and very high altitude (>= 1500 m). Additionally, we created a grading scale for the total_cup_points variable to better illustrate the rating of the coffee. The total_cup_points variable was categorized into 4 groups based on the summary of the data: A (>= 84), B (83 to < 84), C (81 to < 83), and D (< 81).

coffee_ratings%>%count(altitude_mean_meters,sort=TRUE)
#categorizing altitude
# Listing the bare column inside summarize() returns one row per observation,
# so only the summary statistics are requested here
coffee_ratings %>%
  summarize(
    min = min(altitude_mean_meters, na.rm = TRUE),
    `1stquartile` = quantile(altitude_mean_meters, 0.25, na.rm = TRUE),
    mean = mean(altitude_mean_meters, na.rm = TRUE),
    median = median(altitude_mean_meters, na.rm = TRUE),
    `3rdquartile` = quantile(altitude_mean_meters, 0.75, na.rm = TRUE),
    max = max(altitude_mean_meters, na.rm = TRUE)
  )
coffee_ratings <-coffee_ratings %>% 
    mutate(altitude_category = 
               case_when(altitude_mean_meters < 900 ~ "low_altitude",
                         altitude_mean_meters>=900& altitude_mean_meters<1200~ "medium_altitude",
                         altitude_mean_meters>=1200&altitude_mean_meters<1500~"high_altitude",
                         altitude_mean_meters>=1500~"very_high_altitude")
           ) %>%
    mutate(altitude_category = factor(altitude_category,
                                 levels = c("low_altitude","medium_altitude","high_altitude","very_high_altitude") ))


coffee_ratings <- 
    coffee_ratings %>% 
    mutate(cup_points_category = 
               case_when(total_cup_points >=81&total_cup_points<83~ "C",
                         total_cup_points>=83&total_cup_points<84~"B",
                         total_cup_points>=84~ "A",
                         total_cup_points<81~ "D"
             )
           ) %>%
    mutate(cup_points_category = factor(cup_points_category,
                                 levels = c("A","B","C","D" ) ))
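
As a cross-check on the altitude boundaries used in case_when() above, the same binning can be sketched with base R's cut(). This is an alternative shown for illustration on made-up values, not the method used in this midterm. The argument right = FALSE makes each interval closed on the left, which corresponds to the >= and < comparisons in the code above.

```r
# Made-up altitudes chosen to sit on and around the category boundaries.
altitude <- c(899, 900, 1199, 1200, 1499, 1500, NA)

# right = FALSE gives intervals of the form [900, 1200),
# matching the >= / < comparisons used in case_when().
altitude_category <- cut(
  altitude,
  breaks = c(-Inf, 900, 1200, 1500, Inf),
  labels = c("low_altitude", "medium_altitude",
             "high_altitude", "very_high_altitude"),
  right  = FALSE
)

# Show each altitude next to the category it was assigned.
data.frame(altitude, altitude_category)
```

Testing values exactly on a boundary (900, 1200, 1500) is a quick way to confirm that no observation falls between two categories, and that missing altitudes stay NA rather than being assigned to a group.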

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

coffee_ratings %>% select(total_cup_points,cup_points_category,altitude_mean_meters, altitude_category,species:altitude_high_meters )->coffee_ratings
names(coffee_ratings)
##  [1] "total_cup_points"      "cup_points_category"   "altitude_mean_meters" 
##  [4] "altitude_category"     "species"               "owner"                
##  [7] "country_of_origin"     "farm_name"             "lot_number"           
## [10] "mill"                  "ico_number"            "company"              
## [13] "altitude"              "region"                "producer"             
## [16] "number_of_bags"        "bag_weight"            "in_country_partner"   
## [19] "harvest_year"          "grading_date"          "owner_1"              
## [22] "variety"               "processing_method"     "aroma"                
## [25] "flavor"                "aftertaste"            "acidity"              
## [28] "body"                  "balance"               "uniformity"           
## [31] "clean_cup"             "sweetness"             "cupper_points"        
## [34] "moisture"              "category_one_defects"  "quakers"              
## [37] "color"                 "category_two_defects"  "expiration"           
## [40] "certification_body"    "certification_address" "certification_contact"
## [43] "unit_of_measurement"   "altitude_low_meters"   "altitude_high_meters"
glimpse(coffee_ratings)
## Rows: 1,339
## Columns: 45
## $ total_cup_points      <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75…
## $ cup_points_category   <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,…
## $ altitude_mean_meters  <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA,…
## $ altitude_category     <fct> very_high_altitude, very_high_altitude, very_hi…
## $ species               <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ar…
## $ owner                 <chr> "metad plc", "metad plc", "grounds for health a…
## $ country_of_origin     <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia"…
## $ farm_name             <chr> "metad plc", "metad plc", "san marcos barrancas…
## $ lot_number            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mill                  <chr> "metad plc", "metad plc", NA, "wolensu", "metad…
## $ ico_number            <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", …
## $ company               <chr> "metad agricultural developmet plc", "metad agr…
## $ altitude              <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "180…
## $ region                <chr> "guji-hambela", "guji-hambela", NA, "oromia", "…
## $ producer              <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabe…
## $ number_of_bags        <dbl> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, …
## $ bag_weight            <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg…
## $ in_country_partner    <chr> "METAD Agricultural Development plc", "METAD Ag…
## $ harvest_year          <chr> "2014", "2014", NA, "2014", "2014", "2013", "20…
## $ grading_date          <chr> "April 4th, 2015", "April 4th, 2015", "May 31st…
## $ owner_1               <chr> "metad plc", "metad plc", "Grounds for Health A…
## $ variety               <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other…
## $ processing_method     <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / …
## $ aroma                 <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25,…
## $ flavor                <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33,…
## $ aftertaste            <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50,…
## $ acidity               <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42,…
## $ body                  <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33,…
## $ balance               <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50,…
## $ uniformity            <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00…
## $ clean_cup             <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ sweetness             <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00…
## $ cupper_points         <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00,…
## $ moisture              <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03,…
## $ category_one_defects  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ quakers               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ color                 <chr> "Green", "Green", NA, "Green", "Green", "Bluish…
## $ category_two_defects  <dbl> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0,…
## $ expiration            <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st…
## $ certification_body    <chr> "METAD Agricultural Development plc", "METAD Ag…
## $ certification_address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "30…
## $ certification_contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19…
## $ unit_of_measurement   <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m…
## $ altitude_low_meters   <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA,…
## $ altitude_high_meters  <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA,…
skim(coffee_ratings)
Data summary
Name coffee_ratings
Number of rows 1339
Number of columns 45
_______________________
Column type frequency:
character 24
factor 2
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1.00 7 7 0 2 0
owner 7 0.99 3 50 0 315 0
country_of_origin 1 1.00 4 28 0 36 0
farm_name 359 0.73 1 73 0 571 0
lot_number 1063 0.21 1 71 0 227 0
mill 315 0.76 1 77 0 460 0
ico_number 151 0.89 1 40 0 847 0
company 209 0.84 3 73 0 281 0
altitude 226 0.83 1 41 0 396 0
region 59 0.96 2 76 0 356 0
producer 231 0.83 1 100 0 691 0
bag_weight 0 1.00 1 8 0 56 0
in_country_partner 0 1.00 7 85 0 27 0
harvest_year 47 0.96 3 24 0 46 0
grading_date 0 1.00 13 20 0 567 0
owner_1 7 0.99 3 50 0 319 0
variety 226 0.83 4 21 0 29 0
processing_method 170 0.87 5 25 0 5 0
color 218 0.84 4 12 0 4 0
expiration 0 1.00 13 20 0 566 0
certification_body 0 1.00 7 85 0 26 0
certification_address 0 1.00 40 40 0 32 0
certification_contact 0 1.00 40 40 0 29 0
unit_of_measurement 0 1.00 1 2 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
cup_points_category 0 1.00 FALSE 4 C: 492, D: 309, B: 280, A: 258
altitude_category 230 0.83 FALSE 4 ver: 402, hig: 385, med: 174, low: 148

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1.00 82.09 3.50 0 81.08 82.50 83.67 90.58 ▁▁▁▁▇
altitude_mean_meters 230 0.83 1775.03 8668.63 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
number_of_bags 0 1.00 154.18 129.99 0 14.00 175.00 275.00 1062.00 ▇▇▁▁▁
aroma 0 1.00 7.57 0.38 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
flavor 0 1.00 7.52 0.40 0 7.33 7.58 7.75 8.83 ▁▁▁▁▇
aftertaste 0 1.00 7.40 0.40 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
acidity 0 1.00 7.54 0.38 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
body 0 1.00 7.52 0.37 0 7.33 7.50 7.67 8.58 ▁▁▁▁▇
balance 0 1.00 7.52 0.41 0 7.33 7.50 7.75 8.75 ▁▁▁▁▇
uniformity 0 1.00 9.83 0.55 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1.00 9.84 0.76 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1.00 9.86 0.62 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1.00 7.50 0.47 0 7.25 7.50 7.75 10.00 ▁▁▁▇▁
moisture 0 1.00 0.09 0.05 0 0.09 0.11 0.12 0.28 ▃▇▅▁▁
category_one_defects 0 1.00 0.48 2.55 0 0.00 0.00 0.00 63.00 ▇▁▁▁▁
quakers 1 1.00 0.17 0.83 0 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1.00 3.56 5.31 0 0.00 2.00 4.00 55.00 ▇▁▁▁▁
altitude_low_meters 230 0.83 1750.71 8669.44 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
altitude_high_meters 230 0.83 1799.35 8668.81 1 1100.00 1350.00 1650.00 190164.00 ▇▁▁▁▁
head(coffee_ratings)

Are the values what you expected for the variables? Why or Why not?

Amy and Sara: After the transformation, two new columns (altitude_category and cup_points_category) were added to the data. The values are not exactly what we expected. We thought that most of the coffees in high_altitude would be rated A or B; however, high_altitude contains a large number of C and D ratings. Additionally, we thought that Guatemala would have the most A’s, which was not the case.

Visualizing and Summarizing the Data (15 points)

Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question

#Sara - Altitude and Coffee Rating
coffee_ratings %>% count(cup_points_category)
coffee_ratings %>%
  count(altitude_category, cup_points_category)
coffee_ratings%>%group_by(altitude_category)%>%summarize(mean=mean(total_cup_points),median=median(total_cup_points),sd=sd(total_cup_points),range=max(total_cup_points)-min(total_cup_points),max=max(total_cup_points),min=min(total_cup_points))
coffee_ratings %>%
  count(altitude_category, cup_points_category) %>%
  arrange(cup_points_category)
coffee_ratings %>% filter(!is.na(altitude_mean_meters))%>%
  tabyl(cup_points_category,altitude_category) %>%
  adorn_totals("col") %>%
  adorn_percentages(na.rm=TRUE)%>%adorn_pct_formatting()
# Amy - Top Countries and coffee rating 
coffee_ratings %>% 
  count(country_of_origin, sort = TRUE)
coffee_ratings %>%
  group_by(country_of_origin) %>%
  summarise(best_score = max(total_cup_points)) %>% 
  arrange(-best_score)
coffee_ratings %>%
  filter(country_of_origin %in% c("Mexico","Colombia","Guatemala","Brazil")) %>%
  count(country_of_origin, cup_points_category) %>%
  arrange(country_of_origin)
# Stacked counts of each grade per country, with countries ordered by mean total_cup_points
dat <- coffee_ratings %>%
  filter(!is.na(country_of_origin)) %>%
  mutate(country = reorder(country_of_origin, total_cup_points))
dat %>% ggplot(aes(country, fill = cup_points_category)) +
  geom_bar() +
  coord_flip()

What are your findings about the summary? Are they what you expected?

Sara: Through this exploratory study, I found that the very high altitude category contained the highest-ranked coffee, with a score of 90.58, and the highest mean and median value of 83. To my surprise, the high altitude category contained the highest percentage of grade C and D coffees. From my background research, I expected the grade of the coffee to increase with altitude, so I expected to see the highest percentage of grade C and D coffees in the low or medium altitude categories instead. A possible explanation is that the observations are unevenly distributed across the categories. Furthermore, I found that the low altitude category contained the lowest percentage of grade A coffees, with only 5.8% receiving this grade, which is what I expected. I suspect this has to do with improper water drainage.
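
When reading percentages from a two-way table, it matters which total they are divided by. janitor's adorn_percentages() uses denominator = "row" by default, so in a table with grades in the rows, each percentage is a share of that grade, not a share of that altitude category. The sketch below illustrates the difference with base R's prop.table() on made-up counts; the numbers are for illustration only and are not the midterm's results.

```r
# Made-up observations for illustration only; these are not the midterm's results.
toy <- data.frame(
  cup_points_category = c("A", "A", "A", "D", "D", "D", "D", "D"),
  altitude_category   = c("low", "high", "high", "low", "low", "low", "high", "high")
)

# Two-way table of counts: grades in rows, altitude categories in columns.
tab <- table(toy$cup_points_category, toy$altitude_category)

# Row percentages: of all grade-A coffees, what share comes from each altitude?
row_pct <- prop.table(tab, margin = 1)

# Column percentages: of all low-altitude coffees, what share earned each grade?
col_pct <- prop.table(tab, margin = 2)

round(100 * row_pct, 1)
round(100 * col_pct, 1)
```

In this toy table, one third of the grade A coffees are low altitude, while one quarter of the low-altitude coffees are grade A. A statement such as "X% of low-altitude coffees received an A" needs column percentages, which in janitor would be requested with adorn_percentages(denominator = "col").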

Amy: Through this exploratory study, I found that the most frequently observed countries within the data set were Mexico, Colombia, Guatemala and Brazil. I was surprised that Mexico ranked so high on the list, because my background research did not mention Mexico as one of the highest coffee-producing countries, and I had expected Colombia to rank as the top country. When comparing all of the countries within the data set to see which cup of coffee ranked the highest, I found that Ethiopia had the highest rating of 90.58, with Guatemala following closely with a rating of 89.75. I expected that Guatemala would rank high based on my background review. When I categorized the data by the top four countries in the data set, I found that Colombia received the most A grades. I had expected that Guatemala would have received more A grades. The country with the lowest grades was Mexico, with 193 cups of coffee graded either C or D. This information is visualized well in the plots provided below.

Make at least two plots that help you answer your question on the transformed or summarized data.

#Sara - Plotting Altitude and Coffee Rating 

min(coffee_ratings$total_cup_points,na.rm=TRUE)
## [1] 0
coffee_ratings%>%ggplot(aes(total_cup_points,altitude_mean_meters))+
  geom_point()+geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 230 rows containing non-finite values (stat_smooth).
## Warning: Removed 230 rows containing missing values (geom_point).

coffee_ratings%>%ggplot(aes(total_cup_points,log(altitude_mean_meters)))+
  geom_point()+xlim(80,90)
## Warning: Removed 375 rows containing missing values (geom_point).

coffee_ratings%>%ggplot(aes(log(total_cup_points),log(altitude_mean_meters)))+
  geom_point()+geom_smooth(method=lm,se=FALSE)+xlim(log(60),log(100))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 232 rows containing non-finite values (stat_smooth).
## Warning: Removed 231 rows containing missing values (geom_point).

coffee_ratings%>%ggplot(aes(total_cup_points,fill=altitude_category))+
  geom_boxplot()+xlim(70,90)
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).

max(coffee_ratings$total_cup_points)
## [1] 90.58
coffee_ratings %>%filter(!is.na(altitude_mean_meters))%>%ggplot(aes(total_cup_points, altitude_category,fill=altitude_category)) +
geom_bar(stat="identity") 

# ylim() replaces any earlier y scale, so the log2 transform had no effect and is dropped
coffee_ratings%>%filter(!is.na(altitude_mean_meters))%>%ggplot(aes(altitude_category,total_cup_points, group=altitude_category,fill=altitude_category))+
geom_boxplot()+
geom_jitter(width=0.1,alpha=0.2)+ylim(67,92)
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing missing values (geom_point).

coffee_ratings%>%filter(!is.na(altitude_mean_meters))%>%
ggplot(aes(altitude_category,group=cup_points_category,fill=cup_points_category))+
geom_bar(position=position_dodge())

# Amy - Plotting the Country of Origin and Coffee Rating 

coffee_ratings%>%
  filter(country_of_origin %in% c("Mexico","Colombia","Guatemala","Brazil")) %>% 
  ggplot(aes(country_of_origin,group=cup_points_category,fill=cup_points_category))+
  geom_bar(position=position_dodge())

coffee_ratings%>%
  filter(country_of_origin %in% c("Mexico","Colombia","Guatemala","Brazil")) %>%
  ggplot(aes(country_of_origin,fill=altitude_category))+
  geom_bar(position=position_dodge())

Final Summary (10 points)

Summarize your research question and findings below.

In summary, our research questions were how country of origin and altitude affect the overall rating of coffee. From our data, we found that the highest-ranked coffee was grown at very high altitude. Additionally, we found that the best cup of coffee was from Ethiopia, but among the top four countries the most grade A coffee came from Colombia. The figure that combines country and altitude shows that Colombia has the most coffees in the very high altitude category, which may help explain why Colombia had the most grade A coffee. Additionally, Mexico has the most coffees in the medium altitude category, which may help explain why it had the lowest coffee ratings.

Are your findings what you expected? Why or Why not?

Based on our background research and analysis of the data, we found some results that we expected and others that we did not. For instance, we expected that Guatemala would have the highest-scoring cup of coffee out of the top four countries, and this was found to be true. However, when all the countries were examined, Ethiopia had the highest-rated cup of coffee. Additionally, we found that coffees grown at higher altitudes received more A grades than those grown at lower altitudes, which is what we expected.