Load the following packages:
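The library() calls themselves are not shown here; a plausible set, inferred from the startup messages below (treat the exact list as an assumption), is:
# packages inferred from the startup messages below (assumed, not shown in the original source)
library(tidyverse)   # ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(modelr)      # loaded before broom, so broom masks modelr::bootstrap
library(broom)       # augment() for model predictions
library(openintro)   # heart_transplant data; also loads airports, cherryblossom, usdata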
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.2.1 ✔ dplyr 1.1.3
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
Thus far, we have only built models for a numeric response variable. Here, we will study a problem where the response variable is binary; that is, it takes on only the values 0 and 1. This example follows Tutorial 3, Lesson 9 from Introduction to Modern Statistics Chapter 10 quite closely.
A well-known Stanford University study on heart transplants tracked the five-year survival rate of patients with dire heart ailments. The purpose of the study was to assess the efficacy of heart transplants, but for right now we will simply focus on modeling how the age of a heart transplant patient influences their probability of survival. Let’s create a plot that illustrates the relationship between age when the study began and survival when the study ended, five years later.
We will use the geom_jitter() function to create the illusion of separation in our data. Because the y value is categorical, all of the points would otherwise lie exactly on “dead” or “alive”, making the individual points hard to see. To counteract this, geom_jitter() will move each point up or down by a small random amount. Note that this is for visualization purposes only – this is a binary outcome variable, after all!
If you fit a simple linear regression line to these data, what would it look like?
ggplot(data = heart_transplant, aes(x = age, y = survived)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.8) +
labs(x = "Age at time of study", y = "Survived 5 years after study")
First, we have a technical problem – it’s difficult to interpret a binary variable when it’s stored in factor format, since it’s not clear which outcome R is treating as a 1 and which outcome R is treating as a 0. We can get around this by creating a new variable that is binary (either 0 or 1), based on whether or not the patient survived to the end of the study. We call this new variable is_alive.
heart_transplant <- heart_transplant %>%
mutate(is_alive = ifelse(survived == "alive", 1, 0))
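One quick way to check that the recoding worked (an illustrative sketch using dplyr::count(), not part of the original) is to cross-tabulate the old and new variables; every “alive” row should line up with a 1 and every “dead” row with a 0.
# cross-tabulate the original factor and the new 0/1 variable
heart_transplant %>%
  count(survived, is_alive)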
Let’s also check the transformation visually, and think about how we interpret our new variable. The vertical axis can now be thought of as the probability of being alive at the end of the study, given one’s age at the beginning. We’ll store this ggplot object so that we can add new features to it later.
gg <- ggplot(data = heart_transplant, aes(x = age, y = is_alive)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.8) +
labs(x = "Age at time of study", y = "Is alive")
gg
Now there is nothing preventing us from fitting a simple linear regression model to these data, and in fact, in certain cases this may be an appropriate thing to do.
But it’s not hard to see that the line doesn’t fit very well. There are other problems as well…
gg +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
What would this model predict as the probability of a 70-year-old patient being alive? It would be a number less than zero, which makes no sense as a probability. Because the regression line always extends to infinity in either direction, it will make predictions that are not between 0 and 1, sometimes even for reasonable values of the explanatory variable.
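If you want to check that claim yourself, here is a minimal sketch (not part of the original) that fits the simple linear model and asks for its prediction at age 70; per the text above, the fitted value comes out below zero.
# fit the simple linear probability model and predict for a 70-year-old
mod_lm <- lm(is_alive ~ age, data = heart_transplant)
predict(mod_lm, newdata = data.frame(age = 70))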
Second, the variability in a binary response may violate a number of other assumptions that we make when we do inference in multiple regression. You’ll learn about those assumptions in the later class on inference for regression. Hint: it’s called heteroskedasticity!
Thankfully, a modeling framework exists that generalizes regression to include response variables that are non-normally distributed. This family is called generalized linear models or GLMs for short. One member of the family of GLMs is called logistic regression, and this is the one that models a binary response variable.
A full treatment of GLMs is beyond the scope of this class, but the basic idea is that you apply a so-called link function to appropriately transform the scale of the response variable to match the output of a linear model. The link function used by logistic regression is the logit function. This constrains the fitted values of the model to always lie between 0 and 1, as a valid probability must.
In this lesson, we will model the probability of a binary event using the logit link function:
\[\operatorname{logit}(p)=\log \left(\frac{p}{1-p}\right)=\beta_0+\beta_1 x \]
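In R, the logit and its inverse are available as the base functions qlogis() and plogis(), so you can sanity-check the link function on a few values (a small illustrative sketch, not part of the original):
# the logit maps probabilities in (0, 1) to the whole real line (log-odds)
qlogis(c(0.1, 0.5, 0.9))
# the inverse logit maps any real number back into (0, 1)
plogis(c(-2, 0, 2))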
Fitting a GLM in R requires only two small changes from fitting a regression model using lm(). First, the function is called glm() instead of lm(). Second, we have to specify which kind of GLM we want using the family argument. For logistic regression, we specify the binomial family, which uses the logit link function. (Why binomial? Binomial means that the response follows a binomial distribution, where our outcome is a random sample of “trials” and each “trial” has a success probability of \(p\). glm() can also handle things like Poisson models, where instead of a binary outcome you have an outcome that counts the number of successes for a given “trial”.)
# fit the logistic regression model
mod_heart <- glm(is_alive ~ age, data = heart_transplant, family = "binomial")
summary(mod_heart)
##
## Call:
## glm(formula = is_alive ~ age, family = "binomial", data = heart_transplant)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6642 -0.7691 -0.6417 1.1587 1.8857
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.56438 1.01521 1.541 0.1233
## age -0.05847 0.02306 -2.535 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 120.53 on 102 degrees of freedom
## Residual deviance: 113.67 on 101 degrees of freedom
## AIC: 117.67
##
## Number of Fisher Scoring iterations: 4
The model that we just fit tells us that:
\[ \operatorname{logit}(\hat p)=\log \left(\frac{\hat p}{1-\hat p}\right)=1.56-0.05847 x \]
What does our model predict for the probability that a 64-year-old patient is alive at the end of the study? Answer: solve for \(\hat p\) by plugging in \(x=64\).
\[ \begin{align} \frac{\hat p}{1-\hat p} &= e^{1.56-0.05847 \times 64}\\ \hat p &= (1-\hat p)\times 0.113 \\ \hat p &= 0.113 - \hat p\times 0.113 \\ (1.113)\times \hat p &= 0.113 \\ \hat p &= \frac{0.113}{1.113} \\ \hat p &= 0.1 = 10\% \end{align} \]
Note that this is the same solution you’d get if you plugged into the equation for \(p\) from the lecture notes:
\[\hat p = \frac{e^{\beta_0 + \beta_1 x}}{1+ e^{\beta_0 + \beta_1 x}}\]
Alternatively, you can solve for \(p\) using R. The uniroot() function searches over the provided interval for a zero of the function it is given: we pass it an expression that should equal zero, and it finds the \(p\) that makes the expression zero.
toSolve <- function(p) {(1-p)*exp(1.56-0.05847*64) - p}
uniroot(toSolve, interval = c(0, 1))$root
## [1] 0.1013713
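A more direct route (an illustrative sketch, not part of the original) is to ask the fitted model for its prediction on the probability scale, which should agree with the hand calculation above:
# predicted survival probability for a 64-year-old, on the response (probability) scale
predict(mod_heart, newdata = data.frame(age = 64), type = "response")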
Unfortunately, since our model is now non-linear, it’s harder to succinctly characterize how those probabilities decline. We can no longer say that “each additional year of age is associated with a particular change in the probability of surviving,” because that change in probability is not constant across ages. Thus, plotting the probabilities over age is the easiest way to visualize them. We can do this using geom_smooth(), specifying the model arguments as we did when estimating the logistic regression.
gg +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = "glm", se = FALSE, color = "red",
method.args = list(family = "binomial"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Now, notice how the logistic regression line is curved—most noticeably at the ends. The red logistic regression line will never reach 0 or 1, eliminating those invalid predicted probabilities. In this case, for most ages, the simple linear regression line and the logistic regression line don’t differ by very much, and you might not lose much by using the simpler regression model. But for older people, the logistic model performs much better.
How can we visually assess how well the logistic model fits the data? Since the actual observations are all at 0 or 1, but our predictions are always between 0 and 1, we will always be off by some amount. In particular, our model predicts a 50% chance of survival for patients that are about 27 years old. Is that accurate?
To address the problem of the scale of the y variable, we can change the scale of the variable on the y-axis. Instead of thinking about the probability of survival, we can think about the odds. While these two concepts are often conflated, they are not the same; they are, however, related by the simple formula below. The odds of a binary event are the ratio of how often it happens to how often it doesn’t happen.
\[ \operatorname{odds}(\hat{p})=\frac{\hat{p}}{1-\hat{p}}=\exp \left(\hat{\beta}_0+\hat{\beta}_1 \cdot x\right) \]
Thus, if the probability of survival is 75%, then the odds of survival are 3:1, since you are three times more likely to survive than you are to die. Odds are commonly used to express uncertainty in a variety of contexts, most notably gambling.
Let’s create the odds_hat variable for the predicted odds.
heart_transplant_plus <- mod_heart %>%
augment(type.predict = "response") %>%
mutate(y_hat = .fitted) %>%
mutate(odds_hat = y_hat / (1 - y_hat))
We can interpret coefficients through the change in the odds associated with a change in an explanatory variable. It turns out mathematically that the change in the odds due to a one-unit increase in \(x\) is the exponential of \(\beta_1\), or \(e^{\beta_1}\), where \(\beta_1\) is the “slope” coefficient. (When \(x\) increases by 1 unit from \(x\) to \(x+1\), the odds go from \(odds_{x}=e^{\beta_0+\beta_1 x}\) to \(odds_{x+1}=e^{\beta_0+\beta_1 (x+1)}\). Their ratio tells us whether the odds have increased or decreased: \(\frac{odds_{x+1}}{odds_x}=\frac{e^{\beta_0+\beta_1 (x+1)}}{e^{\beta_0+\beta_1 x}}=e^{\beta_0+\beta_1 x + \beta_1-(\beta_0+\beta_1 x)} = e^{\beta_1}\). If this ratio is 1, the odds have not changed; if this ratio is 9, the odds have increased 9-fold!) That is, the ratio of the odds after a one-unit increase in \(x\) to the odds before that change is equal to \(e^{\beta_1}\). Notice that this no longer depends on \(x\), which makes it a useful way to interpret the coefficients.
This is an “odds ratio”, meaning we care about how this number differs from 1. If it’s greater than 1, then the odds increase when \(x\) increases. Conversely, if it’s less than 1, then the odds decrease.
By how much does our model predict that the odds of survival will change with each additional year of age?
Answer: Our model estimates that each additional year of age is associated with an odds ratio of \(e^{-0.05847}=0.94\), or roughly a 6% decrease in the odds of survival.
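You can read this off directly by exponentiating the fitted slope (a quick illustrative check, not part of the original):
# odds ratio associated with a one-year increase in age
exp(coef(mod_heart)["age"])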
One important reason to build a model is to learn from the coefficients about the underlying random process. For example, in the Stanford heart transplant study, we were able to estimate the effect of age on the five-year survival rate. This simple model shed no light on the obvious purpose of the study, which was to determine whether those patients who received heart transplants were likely to live longer than the control group that received no transplant.
Let’s include the transplant variable in our model.
mod <- glm(is_alive ~ age + transplant,
data = heart_transplant, family = binomial)
exp(coef(mod))
## (Intercept) age transplanttreatment
## 2.6461676 0.9265153 6.1914009
By including the transplant variable in our model and exponentiating the coefficients, we see a huge effect. Patients who received a heart transplant saw their odds of survival improve by a factor of 6.2, even after controlling for age. Note that, as expected, age still has a deleterious effect on survival – the odds of survival fall by about 7% for each additional year of age in this model.
By setting the type.predict argument to “response”, we retrieve the fitted values on the familiar probability scale.
Making predictions about the probability of survival for those patients who took part in the study is of somewhat limited value. We already know whether they survived! Aside from learning about the efficacy of the treatment, another common purpose for modeling is to make predictions for observations that are not part of our data set. These are called out-of-sample predictions.
# probability scale
augment(mod, type.predict = "response")
## # A tibble: 103 × 9
## is_alive age transplant .fitted .resid .std.resid .hat .sigma .cooksd
## <dbl> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 53 control 0.0443 -0.301 -0.304 0.0219 1.02 0.000354
## 2 0 43 control 0.0904 -0.435 -0.442 0.0295 1.02 0.00104
## 3 0 52 control 0.0476 -0.312 -0.316 0.0225 1.02 0.000392
## 4 0 52 control 0.0476 -0.312 -0.316 0.0225 1.02 0.000392
## 5 0 54 control 0.0412 -0.290 -0.293 0.0213 1.02 0.000319
## 6 0 36 control 0.145 -0.560 -0.571 0.0403 1.02 0.00248
## 7 0 47 control 0.0682 -0.376 -0.381 0.0259 1.02 0.000666
## 8 0 41 treatment 0.418 -1.04 -1.05 0.0192 1.02 0.00477
## 9 0 47 control 0.0682 -0.376 -0.381 0.0259 1.02 0.000666
## 10 0 51 control 0.0512 -0.324 -0.328 0.0231 1.02 0.000436
## # ℹ 93 more rows
For example, former Vice President Dick Cheney famously received a heart transplant in March of 2012 at the age of 71. More than five years later, Cheney is still alive, but what does our model predict for his five-year survival rate?
Answer: To compute this, we build a data frame with Cheney’s data and run it through our model using the newdata argument to augment().
cheney <- data.frame(age = 71, transplant = "treatment")
augment(mod, newdata = cheney, type.predict = "response")
## # A tibble: 1 × 3
## age transplant .fitted
## <dbl> <chr> <dbl>
## 1 71 treatment 0.0677
The results suggest that Cheney had only a 6.8% chance of survival. Either Cheney is quite lucky to be alive, or—more likely—the survival rates of all heart transplant patients have improved considerably since the Stanford study was completed in 1973.
If our response variable is binary, then why are we making probabilistic predictions? Shouldn’t we be able to make binary predictions? That is, instead of predicting the probability that a person survives for five years, shouldn’t we be able to predict definitively whether they will live or die?
There are a number of different ways in which we could reasonably convert our probabilistic fitted values into a binary decision. The simplest is to round the probabilities: a fitted probability above 0.5 becomes a prediction that the patient is alive (1), and one below 0.5 becomes a prediction that the patient is dead (0).
mod_plus <- augment(mod, type.predict = "response") %>%
mutate(alive_hat = round(.fitted))
mod_plus %>%
select(is_alive, age, transplant, .fitted, alive_hat)
## # A tibble: 103 × 5
## is_alive age transplant .fitted alive_hat
## <dbl> <int> <fct> <dbl> <dbl>
## 1 0 53 control 0.0443 0
## 2 0 43 control 0.0904 0
## 3 0 52 control 0.0476 0
## 4 0 52 control 0.0476 0
## 5 0 54 control 0.0412 0
## 6 0 36 control 0.145 0
## 7 0 47 control 0.0682 0
## 8 0 41 treatment 0.418 0
## 9 0 47 control 0.0682 0
## 10 0 51 control 0.0512 0
## # ℹ 93 more rows
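To see how these rounded predictions line up with what actually happened, one option (an illustrative sketch, not part of the original) is to cross-tabulate the predicted and observed outcomes:
# compare binary predictions with observed outcomes
mod_plus %>%
  count(is_alive, alive_hat)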