Load all the packages you need, plus the _common.R source file.
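The library() calls themselves are not shown in this handout, so here is a minimal setup sketch. The exact package list and the source of the pie_crab data are assumptions based on the functions and datasets used later in the lab.
# Setup sketch: the packages below are inferred from the code in this lab,
# not copied from the original source. In particular, lterdatasampler is a
# guess at where the pie_crab fiddler crab data comes from.
library(gt)               # formatted tables
library(tidyverse)        # ggplot2, dplyr, readr, etc. (also provides mpg)
library(openintro)        # course datasets
library(broom)            # augment(), for fitted values
library(knitr)            # kable()
library(lterdatasampler)  # pie_crab fiddler crab data (assumption)

source("_common.R")       # shared course settings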
In the last class, we estimated a linear relationship between the size of fiddler crabs and latitude, and we recovered OLS estimates of \(\hat\beta_0\) (the intercept) and \(\hat\beta_1\) (the slope coefficient). From our correlation calculations, we also have a sense of the strength of these linear relationships. Here, we will use a concept that is very closely related to correlation to quantify the overall fit of our linear regression model.
Recall that the coefficient of determination, or \(R^2\), is the share of the variance in \(y\) that is explained by your regression model. Defining SSR as the sum of squared residuals (sum of the square of all our prediction errors) and SST as the total sum of squares (proportional to variance of \(y\)), we have:
\[R^{2} = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2}\]
This is the most commonly cited measure of the fit of a regression model. \(R^2\) ranges from 0 (my regression model explains none of the variation in \(y\)) to 1 (my regression model perfectly explains all variation in \(y\)).
Recall our regressions from last week:
\[\text{crab size} = \beta_0 + \beta_1 \text{latitude} + \epsilon\]
and:
\[\text{water temperature} = \beta_0 + \beta_1 \text{latitude} + \epsilon\]
Recall the scatter plots of these relationships that we made last week.
Exercise
Use the summary() function after the regression to calculate the \(R^2\) for both the crab size and water temperature regressions. Recall that in the pie_crab dataset, size indicates crab carapace size, latitude is the latitude of the sampling location, and water_temp is the water temperature at the sampling location.
Which regression model fits the data better? Is this what you expected based on the scatter plots? Why or why not?
Answers
# (i) size regressed on latitude
mod_size <- lm(size ~ latitude, data = pie_crab)
# (ii) water temp regressed on latitude
mod_water <- lm(water_temp ~ latitude, data = pie_crab)
# Recovering R2 from the regressions
R2_size <- summary(mod_size)$r.squared
R2_water <- summary(mod_water)$r.squared
print(paste0("R2 of size on latitude is: ", round(R2_size,2)))
#> [1] "R2 of size on latitude is: 0.35"
print(paste0("R2 of water temp on latitude is: ", round(R2_water,2)))
#> [1] "R2 of water temp on latitude is: 0.92"
In this section we will consider a situation where leaving a variable in its numerical form might not be useful for our needs, and we instead want to treat it as a categorical (binary) variable.
We will use an example from the automobile industry, with data on fuel efficiency and vehicle characteristics for cars of two vintages: 1999 cars and 2008 cars. We want to understand how highway fuel economy differs across these two vintages. This is a policy-relevant question: there has been increasing regulatory pressure to improve fuel efficiency in the US vehicle fleet over this time period. Did it work?
The dataset is called mpg and is pre-loaded in R.
Step 1: Get your variables ready.
Note that we want to treat the year of the car as a categorical variable: we have just two years, and we want to treat the 1999 cars as one "group" and the 2008 cars as another "group." Take a moment to identify the class of the year variable, and then use the as.factor() command to turn it into a factor so we can trust that R will treat it as a categorical variable:
head(mpg)
#> # A tibble: 6 × 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
#> 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
#> 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
#> 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
#> 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
#> 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
class(mpg$year)
#> [1] "integer"
mpg <- mpg %>% mutate(year = as.factor(year))
class(mpg$year) # confirm your class change worked
#> [1] "factor"
Step 2: Visualize your data.
As we showed in class, scatter plots are not all that helpful when we have a categorical variable. Use geom_boxplot() to plot highway miles per gallon (the variable is called hwy) on the \(y\)-axis and vintage year on the \(x\)-axis. Do the distributions of fuel efficiency look different across the two groups?
ggplot(data = mpg, aes(x = year, y = hwy)) +
geom_boxplot() +
labs(x = "Year",
y = "Fuel Efficiency")
Step 3: Run a regression.
A linear regression will allow us to quantify the difference in average miles per gallon across the two car vintages. Note that in this case with a simple linear regression and one categorical variable with just two values (1999 and 2008), these regression estimates are equivalent to computing means for each group and differencing them.
Here is our regression: \[hwy_i=\beta_{0}+\beta_{1} \cdot vintage_i +\varepsilon_i\] Complete the following:
Using the model specified above, use lm() to estimate \(\hat\beta_0\) and \(\hat\beta_1\) from this sample of data. Make sure you are treating year as a categorical variable!
Use summary(lm()), gt(), or kable() to visualize the regression results.
What does the intercept tell you, in words?
What does the coefficient estimate on the 2008 vintage indicator variable tell you, in words?
Answers:
lm(hwy ~ year, data = mpg) %>%
summary()
#>
#> Call:
#> lm(formula = hwy ~ year, data = mpg)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11.453 -5.453 0.573 3.573 20.573
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.4274 0.5517 42.46 <0.0000000000000002 ***
#> year2008 0.0256 0.7802 0.03 0.97
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.97 on 232 degrees of freedom
#> Multiple R-squared: 4.66e-06, Adjusted R-squared: -0.00431
#> F-statistic: 0.00108 on 1 and 232 DF, p-value: 0.974
Intercept: The predicted highway miles per gallon for a car with a vintage of 1999 is 23.4 mpg.
Coefficient on the 2008 vintage indicator variable: Highway fuel efficiency is, on average, 0.03 miles per gallon higher for the 2008 car vintage than for the 1999 vintage.
Answer: The model is suggestive of some improvements in fuel efficiency over time, but the effect is very small (and imprecise).
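As noted above, with a single binary regressor the OLS estimates are just group means in disguise: the intercept is the 1999 group mean, and the slope is the difference between the 2008 and 1999 group means. A short check along these lines (a sketch) should reproduce the coefficients:
# Mean hwy by vintage: the 1999 mean should match the intercept (23.43), and
# the 2008-minus-1999 difference should match the year2008 coefficient (0.03).
mpg %>%
  group_by(year) %>%
  summarise(mean_hwy = mean(hwy)) %>%
  mutate(diff_from_1999 = mean_hwy - first(mean_hwy))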
In many situations, a simple linear regression with a single explanatory variable is not enough to suit our needs. In this case, we have some evidence that fuel efficiency may have increased over time. However, we can't be sure whether this is because of technological advances in fuel efficiency, or because consumer preferences changed over the two periods and people came to prefer cars with smaller engines, which have higher fuel efficiency than those with larger engines.
We can help “control” for this “omitted-variable bias” by adding additional variables to our regression.
Step 1: Adding in engine size
As we showed in class, scatter plots are useful with numeric variables. The engine size for vehicles is stored as the numeric variable displ. Use geom_point() to plot engine displacement, in litres (displ), on the \(x\)-axis and highway miles per gallon (hwy) on the \(y\)-axis. Does fuel efficiency look different as engine size increases?
Answers:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
labs(x = "Engine Displacement, in litres",
y = "Highway Miles per gallon")
We know that fuel economy evolved over time, but how did that really happen? Can the credit for better efficiency over time be given to developments in engineering? Or did consumer tastes change? How do we know that the increase in fuel economy was not just due to cars in 2008 generally having different sized engines than cars in 1999? This problem might seem confusing, but a multiple variable regression can help us unpack it.
Step 2: Run additional regressions
To resolve these questions we need to assess the effects of engine size and vintage on fuel efficiency simultaneously. In simple terms, we want to understand the effect of vintage on fuel economy after controlling for (or isolating the effect of) engine size.
To do this, we will modify our model from the previous section as follows:
\[hwy_i = \beta_{0} + \beta_{1} \cdot displ_i + \beta_{2} \cdot vintage_i + \varepsilon_i\]
Complete the following:
Use lm() to estimate \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\) using this sample of data.
Use summary(lm()) to print the regression results.
Answer:
lm(hwy ~ displ + year, data = mpg) %>%
summary()
#>
#> Call:
#> lm(formula = hwy ~ displ + year, data = mpg)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.76 -2.52 -0.29 1.87 15.59
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 35.276 0.726 48.61 <0.0000000000000002 ***
#> displ -3.611 0.194 -18.63 <0.0000000000000002 ***
#> year2008 1.402 0.500 2.81 0.0054 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.78 on 231 degrees of freedom
#> Multiple R-squared: 0.6, Adjusted R-squared: 0.597
#> F-statistic: 174 on 2 and 231 DF, p-value: <0.0000000000000002
Answer:
Intercept: The predicted highway miles per gallon for a car with an engine size of zero and a vintage of 1999 is 35.28 mpg.
Coefficient on engine size: Predicted highway miles per gallon decrease by 3.61 for every litre increase in engine displacement, holding fixed the year the car was made.
Coefficient on year: Highway fuel efficiency is, on average, 1.40 miles per gallon higher in cars built in 2008, as compared to 1999, holding fixed the engine size of the car.
Answer:
This coefficient is larger than in the simple linear regression, suggesting that engine size is correlated with both fuel efficiency and vintage. Here, it appears that engine sizes were larger in the later year, which lowers fuel efficiency. Once we control for that effect, the role of vintage itself in fuel efficiency is much more substantial.
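You can check this claim directly by comparing average engine size across the two vintages; a one-line summary (sketch) along these lines should show a larger mean displacement for 2008:
# Average engine displacement by vintage; a larger 2008 mean is what makes the
# simple regression understate the improvement in fuel efficiency over time.
mpg %>%
  group_by(year) %>%
  summarise(mean_displ = mean(displ))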
Step 3: Visualize your regression (parallel slopes)
This regression includes one "slope" coefficient (the coefficient estimate on a numeric variable) and one "indicator" coefficient (the coefficient estimate on a categorical variable). This kind of model is often called "parallel slopes" because the indicator variable's coefficient shifts predictions up and down, while the numeric variable's slope is the same for both groups. We will see that visually here.
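One way to see the "parallel slopes" structure before plotting is to read the two implied lines directly off the fitted coefficients. In this sketch, both vintages share the slope on displ, and the year2008 coefficient shifts the 2008 intercept up:
# Implied parallel lines from the coefficients of hwy ~ displ + year
mod <- lm(hwy ~ displ + year, data = mpg)
b <- coef(mod)
b["(Intercept)"]                 # 1999 intercept
b["(Intercept)"] + b["year2008"] # 2008 intercept, shifted up by about 1.4
b["displ"]                       # common slope on engine size for both groups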
Use the geom_point() function with the color argument set to year to make a scatter plot of miles per gallon (\(y\)-axis) against engine size (\(x\)-axis), in which the points are colored differently for each vintage.
Answer:
mpg %>%
ggplot(aes(x = displ, y = hwy, color = year)) +
geom_point() +
labs(x = "Engine Displacement, in litres",
y = "Highway Miles per gallon")
Start by storing your regression results in a new way, using augment() (from the broom package), which stores the fitted values for every observation in your data as a column in a dataframe: the .fitted column gives you the predicted (or "fitted") values for all observations in the data.
Answer:
mod <- lm(hwy ~ displ + year, data = mpg)
augment(mod)
#> # A tibble: 234 × 9
#> hwy displ year .fitted .resid .hat .sigma .cooksd .std.resid
#> <int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 29 1.8 1999 28.8 0.224 0.0143 3.79 0.0000173 0.0597
#> 2 29 1.8 1999 28.8 0.224 0.0143 3.79 0.0000173 0.0597
#> 3 31 2 2008 29.5 1.54 0.0158 3.79 0.000908 0.412
#> 4 30 2 2008 29.5 0.544 0.0158 3.79 0.000113 0.145
#> 5 26 2.8 1999 25.2 0.835 0.00916 3.79 0.000152 0.222
#> 6 26 2.8 1999 25.2 0.835 0.00916 3.79 0.000152 0.222
#> # … with 228 more rows
Then, use geom_line() and the predictions stored in augment(mod) to add the best-fit OLS lines to your scatter plot, again using color = year to color your regression lines by vintage. What can you say about the evolution of fuel efficiency over time after controlling for engine size?
Answer:
mpg %>%
ggplot(aes(x = displ, y = hwy, color = year)) +
geom_point() +
geom_line(data = augment(mod), aes(y = .fitted, color = year)) +
labs(x = "Engine Displacement, in litres",
y = "Highway Miles per gallon") +
scale_colour_discrete("Year")
This last graph shows us that, on average, fuel efficiency is higher for the 2008 vintage, since the blue (2008) line lies above the pink (1999) line. Larger engine size lowers efficiency (both lines slope downward), but once we control for engine size, we see a positive effect of the 2008 vintage (the blue line is parallel to, but above, the pink line).
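To put the parallel-slopes picture in numbers, you can also predict highway mpg for a car with the same engine size in each vintage. Here is a sketch for a hypothetical 2.0-litre car; the two predictions should differ by exactly the year2008 coefficient (about 1.4 mpg):
# Predicted hwy for a hypothetical 2.0-litre car of each vintage
predict(mod, newdata = tibble(displ = 2, year = factor(c("1999", "2008"))))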