get_regression_equation <- function(model) {
  # Takes a *univariate* lm() object and returns the regression equation and R^2 as a string
  intercept <- round(model$coefficients[1], digits = 1)
  slope <- round(model$coefficients[2], digits = 2)
  r_square <- round(summary(model)$r.squared, digits = 3)
  equation <- paste("y = ", intercept, "+", slope, "x; R^2:", r_square)
  return(equation)
}
Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday
.csv file into your data folder. You may use another dataset or your own data, but please make sure it is de-identified.
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Eric or me know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Moore’s law is a projection by the American engineer Gordon Moore, first made in 1965 and revised in 1975, that the number of transistors on a microchip doubles about every two years. That means the growth is exponential; Moore’s law suggests something like
$$Y = 2^{k/2}$$
where Y is the number of transistors on a microchip and k is the number of years from 1965 (the date of Moore’s projection) or the beginning of this data set (1970).
I want to know if manufacturing trends in recent history have diverged from Moore’s law; as a general rule, exponential growth can’t go on forever. Maybe I can find some suggestion of an inflection point.
Given your question, what is your expectation about the data?
I expect Moore’s law to be upheld until the mid-2000s, at which point the rate at which the number of transistors on the chips doubles will slow. I only have what I’ve managed to pick up from people I know who work in semiconductor research, but it seems that the methodological leaps that allow chips to get more tightly packed are becoming harder, pushing up against physical limits, and exhibiting diminishing returns. Altogether, I expect to see doubling every two years for the first several decades in this dataset, and slower doubling thereafter.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
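# packages used throughout (presumably loaded in a setup chunk not shown here)
library(tidyverse) # dplyr, tidyr, ggplot2, readr
library(skimr)     # skim()
cpu <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-03/cpu.csv")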
## Parsed with column specification:
## cols(
## processor = col_character(),
## transistor_count = col_double(),
## date_of_introduction = col_double(),
## designer = col_character(),
## process = col_double(),
## area = col_double()
## )
range(cpu$date_of_introduction) # from 1970 to 2019
## [1] 1970 2019
min(cpu$process, na.rm = TRUE) # the smallest process in the data is 7 nm
## [1] 7
skim(cpu)
Name | cpu |
Number of rows | 176 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
processor | 0 | 1 | 4 | 64 | 0 | 175 | 0 |
designer | 0 | 1 | 3 | 18 | 0 | 36 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
transistor_count | 6 | 0.97 | 2.20563e+09 | 4.218881e+09 | 2250 | 600000 | 2.5000e+08 | 2.93000e+09 | 3.200e+10 | ▇▁▁▁▁ |
date_of_introduction | 0 | 1.00 | 2.00167e+03 | 1.452000e+01 | 1970 | 1989 | 2.0065e+03 | 2.01425e+03 | 2.019e+03 | ▂▂▂▃▇ |
process | 9 | 0.95 | 9.21200e+02 | 2.000660e+03 | 7 | 20 | 6.5000e+01 | 6.75000e+02 | 1.000e+04 | ▇▁▁▁▁ |
area | 27 | 0.85 | 2.38130e+02 | 2.155000e+02 | 4 | 83 | 1.5200e+02 | 3.55000e+02 | 8.250e+02 | ▇▃▂▁▁ |
glimpse(cpu)
## Rows: 176
## Columns: 6
## $ processor <chr> "MP944 (20-bit, 6-chip)", "Intel 4004 (4-bit, 16…
## $ transistor_count <dbl> NA, 2250, 3500, 2500, 2800, 3000, 4100, 6000, 80…
## $ date_of_introduction <dbl> 1970, 1971, 1972, 1973, 1973, 1974, 1974, 1974, …
## $ designer <chr> "Garrett AiResearch", "Intel", "Intel", "NEC", "…
## $ process <dbl> NA, 10000, 10000, 7500, 6000, 10000, 6000, 6000,…
## $ area <dbl> NA, 12, 14, NA, 32, 12, 16, 20, 11, 21, NA, NA, …
If there are any quirks that you have to deal with, such as NA coded as something else or the data being spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
I don’t need to join any data frames. There are missing data in the cpu data frame, and for simplicity we’re just going to remove all rows with missing values.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
This dataset is pretty clean. The first change I’ll make is converting process to a factor. The process variable refers to the manufacturing process, which governs how tightly gated transistors (the physical manifestations of binary 1s and 0s) can be packed together on a chip. Starting at 10 microns (10,000 nm) in 1970, we’ve gotten down to a 5 nm process for mass-manufactured computer chips in 2019 (though this dataset only contains chips made with processes down to 7 nm).
It’s critical that we understand that this process is literally quantized. The physical size of the process is determined not only by the precision of the machines used in manufacture, but also by quantum mechanics. Depending on the orientation of any transistors relative to one another, one’s state (as a 1 or 0) could affect the other’s state, or not. From what I understand, the limit is packing transistors close together in such a way that they do not interfere with each other’s activity, because such interference leads to unpredictable effects, which is bad for computing.
All this to say that the values of process represent more than the size of a metal-oxide-semiconductor field-effect transistor node. These values represent discrete steps (or rather, leaps) in technological performance. Therefore process should be a categorical variable.
We don’t need to use case_when() because each unique value in process represents a meaningfully distinct category.
We’ll also make a categorical variable for the decade the processors were produced in, for use later.
# process is a factor
cpu <- cpu %>%
  mutate(process = as.factor(process))
# label each chip with its decade of introduction (stored as a character column)
cpu <- cpu %>% mutate(decade = case_when(
  date_of_introduction < 1980 ~ "70s",
  date_of_introduction >= 1980 & date_of_introduction < 1990 ~ "80s",
  date_of_introduction >= 1990 & date_of_introduction < 2000 ~ "90s",
  date_of_introduction >= 2000 & date_of_introduction < 2010 ~ "00s",
  date_of_introduction >= 2010 ~ "10s"))
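As an aside, the same decade labels can be derived arithmetically instead of writing one case_when() branch per decade. This is a minimal sketch rather than the code used in this analysis; decade_label is a hypothetical helper name, and it assumes the years are whole numbers. Unlike the case_when() version, it would label years before 1970 or after 2019 with their own decades, which makes no difference for this dataset.

```r
# Hypothetical helper: map a year to a two-digit decade label such as "70s".
decade_label <- function(year) {
  decade_start <- (year %/% 10) * 10       # e.g. 1974 -> 1970
  sprintf("%02ds", decade_start %% 100)    # e.g. 1970 -> "70s", 2000 -> "00s"
}

decade_label(c(1971, 1989, 1990, 2005, 2018))
```

Inside the pipeline this could be used as mutate(decade = decade_label(date_of_introduction)).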
It’ll be good to see a linear trend for transistor doubling times. Since the projection is a doubling every two years, we’ll need to make a new variable.
$$Y = 2^{k/2} \implies \log_2(Y) = \frac{k}{2}$$
cpu$log2_transistor_count <- log2(cpu$transistor_count)
Through using the log transform, we can assess Moore’s prediction through linear regression; if the slope is 0.5, then we know that the number of transistors in a chip doubles every two years.
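To make the link between the slope and the doubling time concrete, here is a small sketch on synthetic data (not the cpu data) that follows the two-year doubling rule exactly. The names k, y, and fit are illustrative only. Regressing on the calendar year instead of k would change the intercept but not the slope.

```r
# Synthetic data that follows the two-year doubling rule exactly (not the cpu data).
k <- 0:48                      # years since the start of the series
y <- 2^(k / 2)                 # transistor count under Y = 2^(k/2)

fit <- lm(log2(y) ~ k)         # linear fit on the log2 scale
slope <- unname(coef(fit)[2])  # change in log2(Y) per year

slope                          # approximately 0.5 for this synthetic series
1 / slope                      # implied doubling time in years (approximately 2)
```

In general, a fitted slope of s on the log2 scale corresponds to a doubling time of 1/s years, so a slope below 0.5 means doubling takes longer than two years.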
Since we’re interested not only in the number of transistors but also the size of the chip, we’ll make a density variable as well. Because the density also rises exponentially (as we’ll see), we’ll produce a log2-transformed variable as well.
cpu$density <- cpu$transistor_count/cpu$area
cpu$log2_density <- log2(cpu$density)
We’ll remove rows with missing values.
cpu <- drop_na(cpu)
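Dropping every row that has any missing value is the simplest option, but it also discards chips that are only missing a column a given model does not use (the skim output goes from 176 rows to 146). The sketch below, on a made-up toy data frame, contrasts the blanket drop with keeping every row that has the one column a model needs. It uses base R so that it runs on its own; tidyr::drop_na() also accepts column names, e.g. drop_na(cpu, transistor_count).

```r
# Toy data frame standing in for cpu (values are made up for illustration).
toy <- data.frame(
  transistor_count = c(2250, 3500, NA, 6000),
  area             = c(12, NA, 32, 20),
  process          = c(10000, 10000, 6000, NA)
)

# Rows with no missing value in any column (what a blanket drop keeps).
all_complete <- toy[complete.cases(toy), ]

# Rows that only need transistor_count to be present.
count_complete <- toy[!is.na(toy$transistor_count), ]

nrow(all_complete)    # 1
nrow(count_complete)  # 3
```

Whether the extra rows are worth keeping depends on the question; the density analysis needs area, so the blanket drop is a reasonable simplification there.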
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
skim(cpu)
Name | cpu |
Number of rows | 146 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 3 |
factor | 1 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
processor | 0 | 1 | 4 | 64 | 0 | 145 | 0 |
designer | 0 | 1 | 3 | 17 | 0 | 28 | 0 |
decade | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
process | 0 | 1 | FALSE | 33 | 14: 13, 45: 12, 22: 10, 32: 10 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
transistor_count | 0 | 1 | 2.125012e+09 | 3.600975e+09 | 2250.00 | 3400000.00 | 447500000.00 | 2.997500e+09 | 2.360000e+10 | ▇▁▁▁▁ |
date_of_introduction | 0 | 1 | 2.003140e+03 | 1.346000e+01 | 1971.00 | 1995.25 | 2007.00 | 2.014000e+03 | 2.018000e+03 | ▂▂▂▃▇ |
area | 0 | 1 | 2.408000e+02 | 2.168700e+02 | 4.00 | 83.00 | 161.00 | 3.587500e+02 | 8.250000e+02 | ▇▃▂▁▂ |
log2_transistor_count | 0 | 1 | 2.614000e+01 | 6.700000e+00 | 11.14 | 21.68 | 28.74 | 3.148000e+01 | 3.446000e+01 | ▂▂▂▃▇ |
density | 0 | 1 | 9.378601e+06 | 1.747591e+07 | 87.50 | 30427.88 | 2080423.28 | 8.665374e+06 | 9.324324e+07 | ▇▁▁▁▁ |
log2_density | 0 | 1 | 1.898000e+01 | 5.530000e+00 | 6.45 | 14.88 | 20.99 | 2.305000e+01 | 2.647000e+01 | ▃▃▂▇▇ |
glimpse(cpu)
## Rows: 146
## Columns: 10
## $ processor <chr> "Intel 4004 (4-bit, 16-pin)", "Intel 8008 (8-bi…
## $ transistor_count <dbl> 2250, 3500, 2800, 3000, 4100, 6000, 8000, 4528,…
## $ date_of_introduction <dbl> 1971, 1972, 1973, 1974, 1974, 1974, 1974, 1975,…
## $ designer <chr> "Intel", "Intel", "Toshiba", "Intel", "Motorola…
## $ process <fct> 10000, 10000, 6000, 10000, 6000, 6000, 8000, 80…
## $ area <dbl> 12, 14, 32, 12, 16, 20, 11, 21, 27, 18, 20, 21,…
## $ decade <chr> "70s", "70s", "70s", "70s", "70s", "70s", "70s"…
## $ log2_transistor_count <dbl> 11.13571, 11.77314, 11.45121, 11.55075, 12.0014…
## $ density <dbl> 187.5000, 250.0000, 87.5000, 250.0000, 256.2500…
## $ log2_density <dbl> 7.550747, 7.965784, 6.451211, 7.965784, 8.00140…
Are the values what you expected for the variables? Why or Why not?
Everything seems to be as expected. Later on, I also make and append some new variables to cpu for the purpose of visualizing models and projections.
Use group_by()/summarize() to make a summary of the data here. The summary should be relevant to your research question.
cpu %>%
group_by(designer) %>%
summarize(number_chips = n(),
median_density = round(median(density), digits=2),
median_transistor_count = round(median(transistor_count), digits = 2),
year_introduced = min(date_of_introduction),
last_year = max(date_of_introduction)) %>%
arrange(median_density)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 28 x 6
## designer number_chips median_density median_transist… year_introduced
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Toshiba 1 87.5 2800 1973
## 2 RCA 1 185. 5000 1976
## 3 MOS Tec… 1 216. 4528 1975
## 4 Zilog 1 472. 8500 1976
## 5 Texas I… 1 727. 8000 1974
## 6 WDC 2 2181. 16750 1981
## 7 Acorn 4 2232. 69000 1985
## 8 Motorola 7 2235. 190000 1974
## 9 DEC WRL 1 2951. 180000 1988
## 10 MIPS 1 6338. 1350000 1991
## # … with 18 more rows, and 1 more variable: last_year <dbl>
What are your findings about the summary? Are they what you expected?
This table is made to assess individual design firms, since they’re hard to differentiate on a plot. The only thing that really shocked me was the number of one-hit wonders in the CPU-design game. The median density and transistor counts seem consistent enough with the year and the following visualizations under Moore’s 1965 projection.
Make at least two plots that help you answer your question on the transformed or summarized data.
# make five subsets, one for each decade
before_1980 <- cpu[cpu$date_of_introduction < 1980, ]
from_1980_to_1990 <- cpu[cpu$date_of_introduction >= 1980 & cpu$date_of_introduction < 1990, ]
ninties_to_millenium <- cpu[cpu$date_of_introduction >= 1990 & cpu$date_of_introduction < 2000, ]
millenium_to_2010 <- cpu[cpu$date_of_introduction >= 2000 & cpu$date_of_introduction < 2010, ]
from_2010 <- cpu[cpu$date_of_introduction >= 2010, ]
# # Can't wait to find a better way to do *this* kinda stuff
# model for each decade, transistor counts...
seventies_transistorcount <- lm(log2_transistor_count ~ date_of_introduction, data = before_1980)
eighties_transistorcount <- lm(log2_transistor_count ~ date_of_introduction, data = from_1980_to_1990)
ninties_transistorcount <- lm(log2_transistor_count ~ date_of_introduction, data = ninties_to_millenium)
oughts_transistorcount <- lm(log2_transistor_count ~ date_of_introduction, data = millenium_to_2010)
teens_transistorcount <- lm(log2_transistor_count ~ date_of_introduction, data = from_2010)
# ...and densities
seventies_density <- lm(log2_density ~ date_of_introduction, data = before_1980)
eighties_density <- lm(log2_density ~ date_of_introduction, data = from_1980_to_1990)
ninties_density <- lm(log2_density ~ date_of_introduction, data = ninties_to_millenium)
oughts_density <- lm(log2_density ~ date_of_introduction, data = millenium_to_2010)
teens_density <- lm(log2_density ~ date_of_introduction, data = from_2010)
# make the regression equation
model1 <- lm(log2_transistor_count ~ date_of_introduction, data=cpu)
model2 <- lm(log2_density ~ date_of_introduction, data = cpu)
equation1 <- get_regression_equation(seventies_transistorcount)
equation2 <- get_regression_equation(eighties_transistorcount)
equation3 <- get_regression_equation(ninties_transistorcount)
equation4 <- get_regression_equation(oughts_transistorcount)
equation5 <- get_regression_equation(teens_transistorcount)
equation6 <- get_regression_equation(seventies_density)
equation7 <- get_regression_equation(eighties_density)
equation8 <- get_regression_equation(ninties_density)
equation9 <- get_regression_equation(oughts_density)
equation10 <- get_regression_equation(teens_density)
equation11 <- get_regression_equation(model1)
equation12 <- get_regression_equation(model2)
slope1 <- as.numeric(seventies_transistorcount$coefficients[2])
intercept1 <- as.numeric(seventies_transistorcount$coefficients[1])
slope2 <- as.numeric(eighties_transistorcount$coefficients[2])
intercept2 <- as.numeric(eighties_transistorcount$coefficients[1])
slope3 <- as.numeric(ninties_transistorcount$coefficients[2])
intercept3 <- as.numeric(ninties_transistorcount$coefficients[1])
slope4 <- as.numeric(oughts_transistorcount$coefficients[2])
intercept4 <- as.numeric(oughts_transistorcount$coefficients[1])
slope5 <- as.numeric(teens_transistorcount$coefficients[2])
intercept5 <- as.numeric(teens_transistorcount$coefficients[1])
# evenly spaced years spanning the data, one value per row of cpu
x <- seq(
min(cpu$date_of_introduction),
max(cpu$date_of_introduction),
length.out = nrow(cpu))
y1 = intercept1 + slope1 * x
y2 = intercept2 + slope2 * x
y3 = intercept3 + slope3 * x
y4 = intercept4 + slope4 * x
y5 = intercept5 + slope5 * x
y1 <- 2^y1
y2 <- 2^y2
y3 <- 2^y3
y4 <- 2^y4
y5 <- 2^y5
df_projection <- tibble(
year = x,
transistor_count = y1)
cpu$year <- x
cpu$projection1 <- y1
cpu$projection2 <- y2
cpu$projection3 <- y3
cpu$projection4 <- y4
cpu$projection5 <- y5
# year introduced, transistor count, process
trans_ct_linear_process <-
ggplot(cpu, aes(y=log2_transistor_count, x=date_of_introduction)) +
geom_point(aes(color=process)) +
geom_smooth(method=lm, formula = y~x, color = 'gray', alpha=0.225) +
xlab("Year of Introduction") +
ylab("log2(Transistor Count)") +
labs(title = "Figure 1: Transistor doubling times and manufacturing process") +
annotate("text", x = 1985, y = 31, label = equation11) + # equation11 is the transistor-count model (model1)
theme_minimal()
# year introduced, transistor count, designer
trans_ct_linear_manufacturer <-
ggplot(cpu, aes(y=log2_transistor_count, x=date_of_introduction)) +
geom_point(aes(color=designer)) +
geom_smooth(method=glm, formula = y~x) +
xlab("Year of Introduction") +
ylab("log2(Transistor Count)") +
labs(title = "Figure 2: Transistor doubling times by design firm") +
annotate("text", x = 1985, y = 31, label = equation11) + # equation11 is the transistor-count model (model1)
theme_minimal()
# linear regressions for each decade on one graph. their respective regression equations are listed in ascending order (top is 1970s).
trans_ct_linear_decade <-
ggplot(cpu, aes(y=log2_transistor_count, x=date_of_introduction)) +
geom_point() +
geom_smooth(method=glm, formula = y~x, aes(color=decade)) +
xlab("Year of Introduction") +
ylab("log2(Transistor Count)") +
labs(title = "Figure 4: Regression per decade") +
annotate("text", x = 1984.25, y = 32+1.15, label = equation1) +
annotate("text", x = 1985, y = 32, label = equation2) +
annotate("text", x = 1984.65, y = 32-1.15, label = equation3) +
annotate("text", x = 1985-0.35, y = 32-2.3, label = equation4) +
annotate("text", x = 1985-1.3, y = 32-3.45, label = equation5) +
theme_minimal()
# loess smooth across all decades
spline <-
ggplot(cpu, aes(y=log2_transistor_count, x=date_of_introduction)) +
geom_point() +
geom_smooth(method = loess, span=0.92, formula = y~x, color='gray', se=TRUE, alpha=0.225) +
xlab("Year of Introduction") +
ylab("log2(Transistor Count)") +
labs(title = "Figure 3: The suggestion of an inflection point") +
theme_minimal()
two_projections <-
ggplot(cpu, aes(y = transistor_count, x=date_of_introduction)) +
geom_point() +
geom_line(mapping=aes(x=year,y=projection1), color="red") +
geom_line(mapping=aes(x=year,y=projection5), color="blue") +
xlab("Year of Introduction") +
ylab("Transistor Count") +
labs(title="Figure 5: Growth at 1970s rate (red) and 2010s rate (blue)")
trans_ct_linear_process
trans_ct_linear_manufacturer
spline
trans_ct_linear_decade
two_projections
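On the comment above about finding a better way to fit one model per decade: one option is to split the data by decade and apply lm() to each piece, which avoids the hand-written subsets and the numbered slope and intercept variables. This is a sketch on synthetic data, not a rerun of the analysis; toy, models, and coefs are illustrative names, and the decade labels here are "1970s" style rather than "70s".

```r
# Synthetic stand-in for cpu: years and log2 counts that follow a 0.5/year slope.
set.seed(1)
toy <- data.frame(date_of_introduction = rep(1970:2019, each = 2))
toy$log2_transistor_count <-
  11 + 0.5 * (toy$date_of_introduction - 1970) + rnorm(nrow(toy), sd = 0.5)
toy$decade <- paste0((toy$date_of_introduction %/% 10) * 10, "s")

# One lm() per decade: split the rows by decade, then fit the same formula to each piece.
models <- lapply(
  split(toy, toy$decade),
  function(d) lm(log2_transistor_count ~ date_of_introduction, data = d)
)

# Collect the intercept and slope for every decade in one table.
coefs <- t(sapply(models, coef))
colnames(coefs) <- c("intercept", "slope")
coefs
```

The same list can feed the equation helper, e.g. sapply(models, get_regression_equation), assuming that function is defined as at the top of this document.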
Summarize your research question and findings below.
The question at hand was whether Gordon Moore’s 1965 projection might need an update for modern trends. While Figure 1 shows a doubling rate nearly matching Moore’s projection (Figure 2 being the same data, colored for designer rather than manufacturing process), later figures begin to suggest that the doubling rate has slowed. Figure 3 shows a curve fitted with the Loess method that resembles a logarithmic curve, with an inflection point near the year 2000. Figure 4 shows that the growth rate in the earlier decades in this dataset is greater than in the final two. Here, the influence of outliers is clearly dragging down the slope for later decades (1990 onward), but I feel that even if these points were removed we would see a decrease in slope, albeit a more subtle one. Finally, Figure 5 shows transistor count data overlaid on projection lines generated from the 1970s growth rate and the 2010s growth rate; while no statistical tests have been performed here, it appears that the curve generated from the earlier decade overestimates growth, while the rate from the 2010s slightly underestimates growth, likely due to the influence of outliers. Still, the blue curve better approximates existing data.
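Since no statistical tests were performed, one simple follow-up would be to check whether the slope of 0.5 implied by two-year doubling falls inside a confidence interval for the fitted slope. The sketch below shows the mechanics on a synthetic series (not the cpu data), so its numbers say nothing about the real chips; year, log2_count, and moore_slope are illustrative names, and the usual linear-model assumptions are taken for granted.

```r
# Synthetic series (not the cpu data): log2 counts rising 0.45 per year plus noise.
set.seed(42)
year <- rep(1971:2018, each = 3)
log2_count <- 11 + 0.45 * (year - 1971) + rnorm(length(year), sd = 1)

fit <- lm(log2_count ~ year)

# 95% confidence interval for the slope (log2 units per year).
slope_ci <- confint(fit, "year", level = 0.95)
slope_ci

# Does the interval contain the two-year-doubling slope of 0.5?
moore_slope <- 0.5
contains_moore <- slope_ci[1, 1] <= moore_slope && moore_slope <= slope_ci[1, 2]
contains_moore
```

Applied to the real data, the same call on model1 or on the per-decade models would show whether the apparent slowdown is distinguishable from noise.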
Are your findings what you expected? Why or Why not?
My findings are mostly in line with what I anticipated from the outset. The rate of transistor doubling seems to have been higher in Gordon Moore’s time than it is at present. That said, I was doubly surprised. First, I anticipated a single designer having taken the lead in recent decades, since global semiconductor manufacturing is so thoroughly dominated by Taiwan Semiconductor Manufacturing Company (TSMC). I figured that a single designer would dominate the CPU-design market based on a close working relationship with that firm. However, it’s hard to discern any single leader in the modern market (Figure 2). The second surprise came with Figure 1, where I realized that chips of different manufacturing processes had similar transistor counts. I had expected a more stair-stepped trend, with older, larger processes having fewer transistors than their newer counterparts by a large enough margin to be visible in log-transformed data. I repeated this analysis for transistor density (data not shown) and still saw no stair-stepping. I’m almost certainly missing something critical here, but this isn’t my field, after all.
I got interested in this dataset not because I’m particularly interested in computer chips, but because I’m interested in trends. Specifically, trends that appear exponential in an early generation and reveal themselves as unequivocally logistic (S-shaped) to a future generation. Whether these trends have to do with globally available resources or the performance of computers, there are decisions to be made at the inflection point that determine what the world looks like when the slope approaches zero. The giant Jonas Salk writes about this extensively in Survival of the Wisest, a book that makes some very unsettling projections. CPU chips are a bit more mundane; I’m excited to see how semiconductor engineers innovate as the nodes get closer together.
I really enjoyed having the opportunity to play around with lm() objects with ggplot2. It’s obvious that there’s a lot going on under the hood, so to speak, with this package and I feel like doing this project has set me up to explore more efficiently from here on out.