Load the following packages:
library(tidyverse)
library(modelr)
library(broom)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'broom'
##
##
## The following object is masked from 'package:modelr':
##
## bootstrap
In this section we will focus on estimating an interaction model in R. Recall from lecture that an interaction model is appropriate when you think the effect of some variable \(x\) on your dependent variable \(y\) depends on a third variable \(z\). For example, Graff-Zivin and Neidell (2012) find that the productivity of female agricultural workers is less sensitive to air pollution than that of men. This suggests that gender influences the relationship between productivity and pollution.
To learn how to estimate interaction models, we will use an example from the automobile industry (from last week), which has data on fuel efficiency and automobile characteristics for cars of two vintages: 1999 cars and 2008 cars. Last week we found that vintage influenced fuel efficiency: more recent cars have higher miles per gallon. We also found that our estimates were biased if we did not also control for engine size, since larger engines lead to lower fuel efficiency and engine size is also correlated with vintage.
The model we estimated last week was:
\[hwy_i = \beta_{0} + \beta_{1} \cdot displ_i + \beta_{2} \cdot vintage_i + \varepsilon_i\]
The results of this “parallel slopes” model from last week looked like this:
mpg <- mpg %>% mutate(year = as.factor(year)) # Recall, we do this to ensure our year variable is treated as a categorical variable
mod <- lm(hwy ~ displ + year, data = mpg)
mpg %>%
  ggplot(aes(x = displ, y = hwy, color = year)) +
  geom_point() +
  geom_line(data = augment(mod), aes(y = .fitted, color = year)) +
  labs(x = "Engine Displacement, in litres",
       y = "Highway Miles per gallon") +
  scale_colour_discrete("Year")
The slopes are parallel because our regression equation has just one coefficient on engine size: regardless of vintage, we estimate the same impact of engine size on fuel economy.
This may not be accurate. If technology is improving over time, the effect of engine size on fuel economy could decline over time: having a larger engine lowers fuel economy, but maybe we are getting better at not sacrificing miles per gallon as we increase engine size. In this subsection, we want to see whether this is true or not. We will use the mpg dataset, which is pre-loaded in R.
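If you want a quick refresher on these variables, here is a sketch that peeks at the data (assuming ggplot2's built-in mpg data, as above):

```r
library(ggplot2)  # the mpg data ships with ggplot2

# Peek at the variables we will use: engine size, vintage, and fuel economy
head(mpg[, c("displ", "year", "hwy")])
table(mpg$year)  # two vintages: 1999 and 2008
```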
We hypothesize that the relationship between miles per gallon and engine size may not actually be the same across the two vintages. That is, we want a model in which those “parallel slopes” above are allowed to differ by vintage.
We use an interaction model to achieve this:
\[hwy_i=\beta_{0}+\beta_{1} \cdot displ_i + \beta_{2} \cdot vintage_i + \beta_{3} \cdot displ_i \cdot vintage_i + \varepsilon_i\]
The lm() function makes estimating interactions easy. Just use a colon (:) between the two variables you want to interact; the colon tells lm() to multiply the two variables by one another.
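As a side note, R's formula syntax also offers * as a shorthand: a * b expands to a + b + a:b, so the two spellings below fit the same model. A quick sketch:

```r
library(ggplot2)  # for the mpg data
mpg$year <- as.factor(mpg$year)

# displ * year expands to displ + year + displ:year
mod_colon <- lm(hwy ~ displ + year + displ:year, data = mpg)
mod_star  <- lm(hwy ~ displ * year, data = mpg)
all.equal(coef(mod_colon), coef(mod_star))  # TRUE
```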
Complete the following: use lm() to estimate all three coefficients using the mpg data. Make sure you are treating year as a categorical variable! Use summary(lm()) to print the regression results.
mod_int <- lm(hwy ~ displ + year + displ:year, data = mpg)
summary(mod_int)
##
## Call:
## lm(formula = hwy ~ displ + year + displ:year, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8595 -2.4360 -0.2103 1.6037 15.3677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.7922 0.9794 36.546 <0.0000000000000002 ***
## displ -3.7684 0.2788 -13.517 <0.0000000000000002 ***
## year2008 0.3445 1.4353 0.240 0.811
## displ:year2008 0.3052 0.3882 0.786 0.433
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.784 on 230 degrees of freedom
## Multiple R-squared: 0.6015, Adjusted R-squared: 0.5963
## F-statistic: 115.7 on 3 and 230 DF, p-value: < 0.00000000000000022
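Before interpreting the output coefficient by coefficient, it can help to connect it to model predictions. This sketch evaluates the fitted model at an (admittedly nonsensical) engine size of zero for each vintage, which recovers the intercept and the intercept plus the vintage dummy:

```r
library(ggplot2)
mpg$year <- as.factor(mpg$year)
mod_int <- lm(hwy ~ displ + year + displ:year, data = mpg)

# Predicted hwy at displ = 0: the intercept (1999 vintage), and the
# intercept plus the year2008 dummy (2008 vintage)
newdata <- data.frame(displ = 0, year = factor(c("1999", "2008")))
predict(mod_int, newdata)  # about 35.79 and 36.14
```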
What does the intercept tell you, in words? Answer: We predict that average highway miles per gallon for a 1999 vehicle with an engine size of zero is 35.8 mpg (nonsensical).
What does the coefficient on displ tell you, in words? Answer: Highway miles per gallon fall by 3.8 when engine size increases by one litre for a 1999 vintage vehicle. Note: This slope does not apply to a 2008 vintage vehicle, since the interaction term would get “turned on” for that vintage!
What does the coefficient on the vintage dummy (year2008) tell you, in words? Answer: On average, highway miles per gallon are 0.3 mpg higher for a 2008 vehicle relative to a 1999 vehicle when engine size is zero (also nonsensical). This is important: with an interaction term, the effect of vintage now varies with engine size, so the coefficient on the vintage dummy tells us the effect of 2008 vintage only when displ = 0.
What does the coefficient on displ:year2008 tell you, in words? Hint: This is tricky. Start by writing down the relationship between miles per gallon and engine size when the car vintage is 2008. Then write the same thing for car vintage 1999. Can you see what role this coefficient plays? Answer: The impact of engine size on highway miles per gallon is 0.3 mpg higher in 2008 cars than it is in 1999 cars. This coefficient tells us how the relationship between miles per gallon and engine size varies with vintage. Because the reference group is the 1999 vintage, this coefficient equals the slope of the relationship between miles per gallon and engine size for the 2008 vintage minus the same slope for the 1999 vintage.
You can see this by looking at the 2008 and 1999 vintages one at a time. The part of predicted hwy that varies with displ in 2008 is \(\beta_1 \cdot displ + \beta_3 \cdot displ \cdot 1 = (\beta_1 + \beta_3)\cdot displ\), so the slope is \(\beta_1 + \beta_3\). In contrast, in 1999 it is \(\beta_1 \cdot displ + \beta_3 \cdot displ \cdot 0 = \beta_1 \cdot displ\), so the slope is \(\beta_1\) and the difference in slopes is \(\beta_3\).
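The algebra above can be verified directly from the fitted coefficients; a quick sketch:

```r
library(ggplot2)
mpg$year <- as.factor(mpg$year)
mod_int <- lm(hwy ~ displ + year + displ:year, data = mpg)
b <- coef(mod_int)

slope_1999 <- unname(b["displ"])                        # beta_1, about -3.77
slope_2008 <- unname(b["displ"] + b["displ:year2008"])  # beta_1 + beta_3, about -3.46
slope_2008 - slope_1999  # the difference is beta_3, the interaction coefficient (about 0.31)
```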
Interaction models are difficult to interpret. Writing out the algebra of what each independent variable and its associated coefficient mean is really valuable, but visualizing the data and resulting model can also help. Here, let’s remake our scatter plot above, but overlay our regression results from the interaction model.
Intuition check before you dive in: What do you expect to see in terms of intercepts and slopes of your vintage-specific regression lines?
We will use ggplot2 with augment() to obtain predictions, in combination with geom_line().
mod_int <- lm(hwy ~ displ + year + displ:year, data = mpg)
mpg %>%
  ggplot(aes(x = displ, y = hwy, color = year)) +
  geom_point() +
  geom_line(data = augment(mod_int), aes(y = .fitted, color = year))
We saw in the previous exercise that an interaction model resulted in different slopes for each group. However, is this more complex model actually “better”? We have explored using \(R^2\) as a measure of model fit. It is a common one and you should be comfortable computing and interpreting it. Maybe it can help us answer the question of which model to use in this case.
However, as we discussed in lecture, \(R^2\) mechanically increases as you add more independent variables to your regression. Adjusted \(R^2\) is designed to “penalize” your measure of model fit for those additional independent variables.
These two measures are:
\[R^{2} = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar y)^2}\]
\[\overline{R}^2 = 1 - \frac{\sum_i e_i^2/(n-k-1)}{\sum_i \left( y_i - \overline{y} \right)^2/(n-1)}\]
Did the added interaction term in our regression improve model fit?
In this case, the interaction model added a new variable to our regression, so we should be comparing adjusted \(R^2\) results from a model without the interaction to a model with it. You can see the adjusted \(R^2\) in summary(lm()), but you can also access it directly by saving your lm object and calling summary(mod)$adj.r.squared.
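To connect the formula to R's output, the adjusted \(R^2\) can also be computed by hand from the residuals and compared against what summary() reports; a sketch using the parallel-slopes model:

```r
library(ggplot2)
mpg$year <- as.factor(mpg$year)
mod <- lm(hwy ~ displ + year, data = mpg)

e   <- residuals(mod)
n   <- nobs(mod)              # number of observations
k   <- length(coef(mod)) - 1  # independent variables, excluding the intercept
tss <- sum((mpg$hwy - mean(mpg$hwy))^2)

# Adjusted R^2 = 1 - [RSS/(n-k-1)] / [TSS/(n-1)]
adj_r2_manual <- 1 - (sum(e^2) / (n - k - 1)) / (tss / (n - 1))
all.equal(adj_r2_manual, summary(mod)$adj.r.squared)  # TRUE
```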
# no interaction
mod <- lm(hwy ~ displ + year, data=mpg)
AdjR2 <- summary(mod)$adj.r.squared
print(AdjR2)
## [1] 0.5969436
# with the interaction
AdjR2_int <- summary(mod_int)$adj.r.squared
print(AdjR2_int)
## [1] 0.5962761
Since the adjusted \(R^2\) value declined slightly, we conclude that the interaction model did not meaningfully improve model fit.
Note that if you compare \(R^2\) across models, without accounting for the additional independent variables, you get a slightly different picture:
# no interaction
mod <- lm(hwy ~ displ + year, data=mpg)
R2 <- summary(mod)$r.squared
print(R2)
## [1] 0.6004034
# with the interaction
R2_int <- summary(mod_int)$r.squared
print(R2_int)
## [1] 0.6014743
For multiple regression, you should generally rely on adjusted \(R^2\) and not \(R^2\).