EDS 222: Week 4: In-class Lab

ANSWER KEY

2023-10-25

Section 0: Getting set up

Load the following packages:

library(tidyverse)
library(modelr)
library(broom)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'broom'
## 
## 
## The following object is masked from 'package:modelr':
## 
##     bootstrap

Section 1: Interactions in OLS in R

In this section we will focus on estimating an interaction model in R. Recall from lecture that an interaction model is appropriate when you think the effect of some variable \(x\) on your dependent variable \(y\) depends on a third variable \(z\). For example, Graff-Zivin and Neidell (2012) find that the productivity of female agricultural workers is less sensitive to air pollution than that of male workers. This suggests that gender influences the relationship between productivity and pollution.

To learn how to estimate interaction models, we will use an example from the automobile industry (from last week), with data on fuel efficiency and automobile characteristics for cars of two vintages: 1999 and 2008. Last week we found that vintage influenced fuel efficiency: more recent cars get more miles per gallon. We also found that our estimates were biased if we did not also control for engine size, since larger engines lead to lower fuel efficiency and engine size is correlated with vintage.

Parallel slopes model:

The model we estimated last week was:

\[hwy_i = \beta_{0} + \beta_{1} \cdot displ_i + \beta_{2} \cdot vintage_i + \varepsilon_i\]

The results of this “parallel slopes” model from last week looked like this:

mpg <- mpg %>% mutate(year = as.factor(year)) # Recall, we do this to ensure our year variable is treated as a categorical variable

mod <- lm(hwy ~ displ + year, data = mpg)

mpg %>% 
  ggplot(aes(x = displ, y = hwy, color = year)) +
  geom_point() +
  geom_line(data = augment(mod), aes(y = .fitted, color = year)) + 
  labs(x = "Engine Displacement, in litres",
       y = "Highway Miles per gallon") +
  scale_colour_discrete("Year")

The slopes are parallel because our regression equation has just one coefficient on engine size: regardless of vintage, we estimate the same impact of engine size on fuel economy.

This may not be accurate. If technology is improving over time, the effect of engine size on fuel economy could decline: a larger engine still lowers fuel economy, but perhaps we are getting better at not sacrificing miles per gallon as engine size increases. In this subsection, we want to see whether this is true. We will use the mpg dataset, which comes pre-loaded with ggplot2.

Interaction model

We hypothesize that the relationship between miles per gallon and engine size may not actually be the same across the two vintages. That is, we want a model in which those “parallel slopes” above are allowed to differ by vintage.

We use an interaction model to achieve this: \[hwy_i=\beta_{0}+\beta_{1} \cdot displ_i + \beta_{2} \cdot vintage_i + \beta_{3} \cdot displ_i \cdot vintage_i + \varepsilon_i\] The lm() function makes estimating interactions easy. Just use a colon (:) between the two variables you want to interact; the colon tells lm() to multiply the two variables by one another.
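As an aside, R's formula syntax offers two equivalent ways to write this model: `displ:year` adds only the interaction term (so the main effects must be listed explicitly), while the shorthand `displ * year` expands to both main effects plus their interaction. A quick sketch (variable names `m_colon` and `m_star` are just illustrative):

```r
# ':' adds only the interaction; main effects are listed explicitly:
m_colon <- lm(hwy ~ displ + year + displ:year, data = mpg)

# '*' is shorthand for main effects plus their interaction:
m_star <- lm(hwy ~ displ * year, data = mpg)

# The two formulas fit identical models:
all.equal(coef(m_colon), coef(m_star))
```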

Complete the following:

  1. Using the model specified above, use lm() to estimate all three coefficients using the mpg data. Make sure you are treating year as a categorical variable! Use summary(lm()) to print the regression results.
mod_int <- lm(hwy ~ displ + year + displ:year, data = mpg) 
summary(mod_int)
## 
## Call:
## lm(formula = hwy ~ displ + year + displ:year, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8595 -2.4360 -0.2103  1.6037 15.3677 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     35.7922     0.9794  36.546 <0.0000000000000002 ***
## displ           -3.7684     0.2788 -13.517 <0.0000000000000002 ***
## year2008         0.3445     1.4353   0.240               0.811    
## displ:year2008   0.3052     0.3882   0.786               0.433    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.784 on 230 degrees of freedom
## Multiple R-squared:  0.6015, Adjusted R-squared:  0.5963 
## F-statistic: 115.7 on 3 and 230 DF,  p-value: < 0.00000000000000022
  2. What does the intercept tell you, in words?

Answer: We predict that average highway miles per gallon for a 1999 vehicle with an engine size of zero is 35.8 mpg (nonsensical).

  3. What does the coefficient on displ tell you, in words?

Answer: Highway miles per gallon fall by 3.8 when engine size increases by one litre for a 1999 vintage vehicle. Note: This slope does not apply to a 2008 vintage vehicle, since the interaction term would get “turned on” for that vintage!

  4. What does the coefficient estimate on the 2008 vintage indicator variable (year2008) tell you, in words?

Answer: On average, highway miles per gallon are 0.3 mpg higher for a 2008 vehicle relative to a 1999 vehicle when engine size is zero (also nonsensical). This is important: with an interaction term, the effect of vintage now varies with engine size, so the coefficient on the vintage dummy tells us the effect of the 2008 vintage only when displ = 0.
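To make this concrete, the estimated 2008-minus-1999 difference in highway mpg at a given engine size is \(\beta_2 + \beta_3 \cdot displ\). A quick sketch evaluating it at a few displacements, reusing the mod_int object fitted above:

```r
# 2008-vs-1999 difference in predicted hwy as a function of engine size:
b <- coef(mod_int)
displ_grid <- c(0, 2, 4, 6)
b["year2008"] + b["displ:year2008"] * displ_grid
# at displ = 0 this is just the year2008 coefficient (~0.34);
# the difference grows with engine size because beta_3 > 0
```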

  5. What does the coefficient on displ:year2008 tell you, in words? Hint: This is tricky. Start by writing down the relationship between miles per gallon and engine size when the car vintage is 2008. Then write the same thing for car vintage 1999. Can you see what role this coefficient plays?

Answer: The impact of engine size on highway miles per gallon is 0.3 mpg higher in 2008 cars than it is in 1999 cars. This coefficient is telling us how the relationship between miles per gallon and engine size varies with vintage. Because the reference group is the 1999 vintage, this coefficient tells us the slope of the relationship between miles per gallon and engine size for the 2008 vintage minus the same slope for the 1999 vintage.

You can see this by looking at the 2008 and 1999 vintages one at a time. In 2008, the terms involving engine size are \(\beta_1 \cdot displ + \beta_3 \cdot displ \cdot 1 = (\beta_1 + \beta_3)\cdot displ\), so the slope is \(\beta_1 + \beta_3\). In 1999, they are \(\beta_1 \cdot displ + \beta_3 \cdot displ \cdot 0 = \beta_1 \cdot displ\), so the slope is \(\beta_1\), and the difference in slopes is \(\beta_3\).
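You can verify this algebra directly from the fitted coefficients (a quick check, reusing mod_int from above):

```r
# Vintage-specific slopes implied by the interaction model:
b <- coef(mod_int)
slope_1999 <- unname(b["displ"])                        # beta_1
slope_2008 <- unname(b["displ"] + b["displ:year2008"])  # beta_1 + beta_3

slope_2008 - slope_1999  # equals the interaction coefficient, ~0.305
```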

Section 2: Plotting an interaction model

Interaction models are difficult to interpret. Writing out the algebra of what each independent variable and its associated coefficient mean is really valuable, but visualizing the data and resulting model can also help. Here, let’s remake our scatter plot above, but overlay our regression results from the interaction model.

Intuition check before you dive in: What do you expect to see in terms of intercepts and slopes of your vintage-specific regression lines?

We will use ggplot2 with augment() to obtain predictions, in combination with geom_line().

mod_int <- lm(hwy ~ displ + year + displ:year, data = mpg)
mpg %>% 
  ggplot(aes(x = displ, y = hwy, color = year)) +
  geom_point() +
  geom_line(data = augment(mod_int), aes(y = .fitted, color = year)) 

Section 3: Adjusted \(R^2\)

We saw in the previous exercise that an interaction model resulted in different slopes for each group. However, is this more complex model actually “better”? We have explored using \(R^2\) as a measure of model fit. It is a common one and you should be comfortable computing and interpreting it. Maybe it can help us answer the question of which model to use in this case.

However, as we discussed in lecture, \(R^2\) mechanically increases as you add more independent variables to your regression. Adjusted \(R^2\) is designed to “penalize” your measure of model fit for those additional independent variables.

These two measures are: \[R^{2}=1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar y)^2}\] \[ \overline{R}^2 = 1 - \dfrac{\sum_i e_i^2/(n-k-1)}{\sum_i \left( y_i - \overline{y} \right)^2/(n-1)} \] Did the added interaction term in our regression improve model fit?
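Before looking at the built-in output, it can help to compute both measures by hand from the formulas above and check them against lm's summary (a sketch, using the mod_int interaction model fitted earlier):

```r
# Compute R^2 and adjusted R^2 by hand for the interaction model:
n <- nrow(mpg)
k <- 3  # slope coefficients: displ, year2008, displ:year2008
e <- residuals(mod_int)
tss <- sum((mpg$hwy - mean(mpg$hwy))^2)  # total sum of squares

r2     <- 1 - sum(e^2) / tss
adj_r2 <- 1 - (sum(e^2) / (n - k - 1)) / (tss / (n - 1))

# Each pair should match:
c(r2, summary(mod_int)$r.squared)
c(adj_r2, summary(mod_int)$adj.r.squared)
```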

In this case, the interaction model added a new variable to our regression, so we should be comparing adjusted \(R^2\) results from a model without the interaction to a model with it. You can see the adjusted \(R^2\) in summary(lm()), but you can also access it by saving your lm object and calling summary(mod)$adj.r.squared.

# no interaction
mod <- lm(hwy ~ displ + year, data=mpg)
AdjR2 <- summary(mod)$adj.r.squared
print(AdjR2)
## [1] 0.5969436
# with the interaction
AdjR2_int <- summary(mod_int)$adj.r.squared
print(AdjR2_int)
## [1] 0.5962761

Since the adjusted \(R^2\) value declined slightly, we conclude that the interaction model did not meaningfully improve model fit.

Note that if you compare \(R^2\) across models, without accounting for the additional independent variables, you get a slightly different picture:

# no interaction
mod <- lm(hwy ~ displ + year, data=mpg)
R2 <- summary(mod)$r.squared
print(R2)
## [1] 0.6004034
# with the interaction
R2_int <- summary(mod_int)$r.squared
print(R2_int)
## [1] 0.6014743

For multiple regression, you should generally rely on adjusted \(R^2\) and not \(R^2\).