EDS 222: Week 6: In-class Lab

2023-11-08

Section 0: Getting set up

Load the following packages:
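Judging from the startup messages below, those packages are tidyverse, modelr, broom, and openintro (which supplies the heart_transplant data used throughout this lab):

library(tidyverse)
library(modelr)
library(broom)
library(openintro)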

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.5
## ✔ tibble  3.2.1     ✔ dplyr   1.1.3
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
## 
##     bootstrap
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

Section 1: Categorical response variable

Thus far, we have only built models for a numeric response variable. Here, we will study a problem where the response variable is binary; that is, it takes on only the values 1 and 0. This example follows Tutorial 3, Lesson 9 of Introduction to Modern Statistics, Chapter 10, quite closely.

A well-known Stanford University study on heart transplants tracked the five-year survival rate of patients with dire heart ailments. The purpose of the study was to assess the efficacy of heart transplants, but for right now we will simply focus on modeling how the age of a heart transplant patient influences their survival probability. Let’s create a plot that illustrates the relationship between age when the study began and survival when the study ended, five years later.

We will use the geom_jitter() function to create the illusion of separation in our data. Because the y value is categorical, all of the points would either lie exactly on “dead” or “alive”, making the individual points hard to see. To counteract this, geom_jitter() will move the points a small random amount up or down. Note that this is for visualization purposes only – this is a binary outcome variable, after all!

If you fit a simple linear regression line to these data, what would it look like?

ggplot(data = heart_transplant, aes(x = age, y = survived)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.8) +
   labs(x = "Age at time of study", y = "Survived 5 years after study") 

Making a binary variable

First, we have a technical problem – it’s difficult to interpret a binary variable when it’s stored in factor format, since it’s not clear which outcome R is treating as a 1 and which outcome R is treating as a 0. We can get around this by creating a new variable that is binary (either 0 or 1), based on whether the patient survived to the end of the study, or not. We call this new variable is_alive.

heart_transplant <- heart_transplant %>%
  mutate(is_alive = ifelse(survived == "alive", 1, 0))

Visualizing a binary response

Let’s check that this transformation worked, and think about how we interpret our new variable. The vertical axis can now be thought of as the probability of being alive at the end of the study, given one’s age at the beginning. We’ll store this ggplot object to add on some new features later.

gg <- ggplot(data = heart_transplant, aes(x = age, y = is_alive)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.8) +
   labs(x = "Age at time of study", y = "Is alive") 

gg

Linear regression with a binary response

Now there is nothing preventing us from fitting a simple linear regression model to these data, and in fact, in certain cases this may be an appropriate thing to do.

But it’s not hard to see that the line doesn’t fit very well. There are other problems as well…

gg +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Limitations of regression

First, what would this model predict as the probability of a 70-year-old patient being alive? A number less than zero, which makes no sense as a probability. Because the regression line always extends to infinity in either direction, it will make predictions that are not between 0 and 1, sometimes even for reasonable values of the explanatory variable.
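To see this concretely (a quick check, not part of the original lab), fit the linear model and extrapolate it to age 70:

# linear probability model: a straight line fit to the 0/1 outcome
mod_lm <- lm(is_alive ~ age, data = heart_transplant)

# extrapolating to age 70 returns a value below zero, per the discussion above
predict(mod_lm, newdata = data.frame(age = 70))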

Second, the variability in a binary response may violate a number of other assumptions that we make when we do inference in multiple regression. You’ll learn about those assumptions in the later class on inference for regression. Hint: it’s called heteroskedasticity!

Section 2: Logistic regression

Thankfully, a modeling framework exists that generalizes regression to include response variables that are non-normally distributed. This family is called generalized linear models or GLMs for short. One member of the family of GLMs is called logistic regression, and this is the one that models a binary response variable.

A full treatment of GLMs is beyond the scope of this class, but the basic idea is that you apply a so-called link function to transform the scale of the response variable so that it matches the output of a linear model. The link function used by logistic regression is the logit function, which constrains the fitted values of the model to always lie between 0 and 1, as a valid probability must.

In this lesson, we will model the probability of a binary event using the logit link function:

\[\operatorname{logit}(p)=\log \left(\frac{p}{1-p}\right)=\beta_0+\beta_1 x \]
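As an aside (not part of the original lab), base R implements the logit and its inverse as qlogis() and plogis(), which makes the transformation easy to explore:

# the logit maps probabilities in (0, 1) onto the entire real line
qlogis(c(0.1, 0.5, 0.9))
## [1] -2.197225  0.000000  2.197225

# its inverse maps any real number back into (0, 1)
plogis(c(-5, 0, 5))
## [1] 0.006692851 0.500000000 0.993307149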

Fitting a GLM in R requires only two small changes from fitting a regression model using lm(). First, the function is called glm() instead of lm(). Second, we have to specify which kind of GLM we want using the family argument.

For logistic regression, we specify the binomial family, which uses the logit link function. Why binomial? Because the response follows a binomial distribution: our outcome is a random sample of “trials”, each of which succeeds with probability \(p\). glm() can also handle other members of the family, such as Poisson models, where instead of a binary outcome you have an outcome that counts the number of successes for a given “trial”.

# fit a logistic regression (binomial GLM) of survival on age
mod_heart <- glm(is_alive ~ age, data = heart_transplant, family = "binomial")

summary(mod_heart)
## 
## Call:
## glm(formula = is_alive ~ age, family = "binomial", data = heart_transplant)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6642  -0.7691  -0.6417   1.1587   1.8857  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.56438    1.01521   1.541   0.1233  
## age         -0.05847    0.02306  -2.535   0.0112 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 120.53  on 102  degrees of freedom
## Residual deviance: 113.67  on 101  degrees of freedom
## AIC: 117.67
## 
## Number of Fisher Scoring iterations: 4

The model that we just fit tells us that:

\[ \operatorname{logit}(\hat p)=\log \left(\frac{\hat p}{1-\hat p}\right)=1.56-0.05847 x \]

  1. What is the predicted probability of surviving for someone aged 64 at the time of study?

Answer: Solve for \(\hat p\), plugging in \(x = 64\).

\[ \begin{align} \frac{\hat p}{1-\hat p} &= e^{1.56-0.05847 \times 64} = e^{-2.182} \approx 0.113\\ \hat p &= (1-\hat p)\times 0.113 \\ \hat p &= 0.113 - \hat p\times 0.113 \\ (1.113)\times \hat p &= 0.113 \\ \hat p &= \frac{0.113}{1.113} \\ \hat p &\approx 0.10 = 10\% \end{align} \]

Note that this is the same solution you’d get if you plugged into the equation for \(p\) from the lecture notes: \[\hat p = \frac{e^{\beta_0 + \beta_1 x}}{1+ e^{\beta_0 + \beta_1 x}}\] Alternatively, you can solve for \(p\) numerically in R: the uniroot() function searches over the provided interval to find the zero of the function provided. We pass it an expression that should equal zero, and it finds the \(p\) that makes it so.

toSolve <- function(p) {(1-p)*exp(1.56-0.05847*64) - p}

uniroot(toSolve, interval = c(0, 1))$root
## [1] 0.1013713
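Simpler still, base R’s plogis() computes this inverse logit directly (a shortcut, not used in the original lab):

# plogis(z) = exp(z) / (1 + exp(z)), the inverse of the logit
plogis(1.56 - 0.05847 * 64)
## [1] 0.1013713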

Visualizing logistic regression

Unfortunately, since our model is now non-linear, it’s harder to succinctly characterize how the probability of survival declines with age. We can no longer say that “each additional year of age is associated with a particular change in the probability of surviving,” because that change in probability is not constant across ages. Thus, plotting the fitted probabilities against age is the easiest way to visualize them. We can do this using geom_smooth(), specifying the same model arguments we used when estimating the logistic regression.

gg + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_smooth(method = "glm", se = FALSE, color = "red", 
              method.args = list(family = "binomial"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Now, notice how the logistic regression line is curved—most noticeably at the ends. The red logistic regression line will never reach 0 or 1, eliminating those invalid predicted probabilities. In this case, for most ages, the simple linear regression line and the logistic regression line don’t differ by very much, and you might not lose much by using the simpler regression model. But for older people, the logistic model performs much better.

How can we visually assess how well the logistic model fits the data? Since the actual observations are all at 0 or 1, but our predictions are always between 0 and 1, we will always be off by some amount. In particular, our model predicts a 50% chance of survival for patients that are about 27 years old. Is that accurate?
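As a quick check (not part of the original lab), the fitted probability equals 0.5 exactly where the linear predictor is zero, so we can solve for that age directly:

# age at which the fitted survival probability crosses 0.5,
# i.e., where beta0 + beta1 * age = 0
-coef(mod_heart)[["(Intercept)"]] / coef(mod_heart)[["age"]]
## about 27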

Section 3: Interpreting coefficients

Odds ratio

To get around these interpretation problems on the probability scale, we can change the scale of the variable on the y-axis. Instead of thinking about the probability of survival, we can think about the odds. While these two concepts are often conflated, they are not the same. They are, however, related by the simple formula below: the odds of a binary event are the ratio of how often it happens to how often it doesn’t happen.

\[ \operatorname{odds}(\hat{p})=\frac{\hat{p}}{1-\hat{p}}=\exp \left(\hat{\beta}_0+\hat{\beta}_1 \cdot x\right) \]

Thus, if the probability of survival is 75%, then the odds of survival are 3:1, since you are three times more likely to survive than you are to die. Odds are commonly used to express uncertainty in a variety of contexts, most notably gambling.
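You can verify the arithmetic directly:

# a probability of 0.75 corresponds to odds of 3 (i.e., 3:1)
p <- 0.75
p / (1 - p)
## [1] 3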

Let’s create the odds_hat variable for predicted odds.

heart_transplant_plus <- mod_heart %>%
  augment(type.predict = "response") %>%
  mutate(y_hat = .fitted,
         odds_hat = y_hat / (1 - y_hat))

We can use changes in the odds due to a change in an independent variable to interpret coefficients. It turns out mathematically that the change in the relative odds due to a one-unit increase in \(x\) is the exponential of the “slope” coefficient, \(e^{\beta_1}\). To see why: when \(x\) increases by one unit, the odds go from \(\operatorname{odds}_{x}=e^{\beta_0+\beta_1 x}\) to \(\operatorname{odds}_{x+1}=e^{\beta_0+\beta_1 (x+1)}\), and their ratio tells us whether the odds have increased or decreased: \[\frac{\operatorname{odds}_{x+1}}{\operatorname{odds}_x}=\frac{e^{\beta_0+\beta_1 (x+1)}}{e^{\beta_0+\beta_1 x}} = e^{\beta_1}\] If this ratio is 1, the odds have not changed; if it is 9, the odds have increased nine-fold. Notice that the ratio no longer depends on \(x\), which is what makes it such a useful interpretation of the coefficient.

This is an “odds ratio”, meaning we care about how this number differs from 1. If it’s greater than 1, then the odds increase when \(x\) increases. Conversely, if it’s less than 1, then the odds decrease.

By how much does our model predict that the odds of survival will change with each additional year of age?

Answer: Our model estimates that one additional year of age is associated with an odds ratio of \(e^{-0.05847} \approx 0.94\), i.e., about a 6% decrease in the odds of survival.
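We can pull that odds ratio straight from the fitted model (a quick check, not shown in the original lab):

# exponentiate the age coefficient to obtain the odds ratio
exp(coef(mod_heart)[["age"]])
## about 0.94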

Section 4: Using a logistic model

One important reason to build a model is to learn from the coefficients about the underlying random process. For example, in the Stanford heart transplant study, we were able to estimate the effect of age on the five-year survival rate. This simple model shed no light on the obvious purpose of the study, which was to determine whether those patients who received heart transplants were likely to live longer than the control group that received no transplant.

Let’s include the transplant variable in our model.

mod <- glm(is_alive ~ age + transplant,
           data = heart_transplant, family = "binomial")

exp(coef(mod))
##         (Intercept)                 age transplanttreatment 
##           2.6461676           0.9265153           6.1914009

By including the transplant variable in our model and exponentiating the coefficients, we see a huge effect. Patients who received a heart transplant saw their odds of survival improve by a factor of 6.2, even after controlling for age. And as expected, age still has a deleterious effect on survival: the odds of surviving fall by about 7% for each additional year of age in this model.

Making probabilistic predictions

By setting the type.predict argument to “response”, we retrieve the fitted values on the familiar probability scale.

Making predictions about the probability of survival for those patients who took part in the study is of somewhat limited value. We already know whether they survived! Aside from learning about the efficacy of the treatment, another common purpose for modeling is to make predictions for observations that are not part of our data set. These are called out-of-sample predictions.

# probability scale
augment(mod, type.predict = "response")
## # A tibble: 103 × 9
##    is_alive   age transplant .fitted .resid .std.resid   .hat .sigma  .cooksd
##       <dbl> <int> <fct>        <dbl>  <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
##  1        0    53 control     0.0443 -0.301     -0.304 0.0219   1.02 0.000354
##  2        0    43 control     0.0904 -0.435     -0.442 0.0295   1.02 0.00104 
##  3        0    52 control     0.0476 -0.312     -0.316 0.0225   1.02 0.000392
##  4        0    52 control     0.0476 -0.312     -0.316 0.0225   1.02 0.000392
##  5        0    54 control     0.0412 -0.290     -0.293 0.0213   1.02 0.000319
##  6        0    36 control     0.145  -0.560     -0.571 0.0403   1.02 0.00248 
##  7        0    47 control     0.0682 -0.376     -0.381 0.0259   1.02 0.000666
##  8        0    41 treatment   0.418  -1.04      -1.05  0.0192   1.02 0.00477 
##  9        0    47 control     0.0682 -0.376     -0.381 0.0259   1.02 0.000666
## 10        0    51 control     0.0512 -0.324     -0.328 0.0231   1.02 0.000436
## # ℹ 93 more rows

For example, former Vice President Dick Cheney famously received a heart transplant in March of 2012 at the age of 71. More than five years later, Cheney is still alive, but what does our model predict for his five-year survival rate?

Answer:

To compute this, we build a data frame with Cheney’s data, and run it through our model using the newdata argument to augment().

cheney <- data.frame(age = 71, transplant = "treatment")

augment(mod, newdata = cheney, type.predict = "response")
## # A tibble: 1 × 3
##     age transplant .fitted
##   <dbl> <chr>        <dbl>
## 1    71 treatment   0.0677

The results suggest that Cheney had only a 6.8% chance of survival. Either Cheney is quite lucky to be alive, or—more likely—the survival rates of all heart transplant patients have improved considerably since the Stanford study was completed in 1973.

Making binary predictions

If our response variable is binary, then why are we making probabilistic predictions? Shouldn’t we be able to make binary predictions? That is, instead of predicting the probability that a person survives for five years, shouldn’t we be able to predict definitively whether they will live or die?

There are a number of different ways in which we could reasonably convert our probabilistic fitted values into a binary decision. The simplest is to round the probabilities, which amounts to predicting survival whenever the fitted probability exceeds 0.5.

mod_plus <- augment(mod, type.predict = "response") %>%
  mutate(alive_hat = round(.fitted))
  
mod_plus %>%
  select(is_alive, age, transplant, .fitted, alive_hat)
## # A tibble: 103 × 5
##    is_alive   age transplant .fitted alive_hat
##       <dbl> <int> <fct>        <dbl>     <dbl>
##  1        0    53 control     0.0443         0
##  2        0    43 control     0.0904         0
##  3        0    52 control     0.0476         0
##  4        0    52 control     0.0476         0
##  5        0    54 control     0.0412         0
##  6        0    36 control     0.145          0
##  7        0    47 control     0.0682         0
##  8        0    41 treatment   0.418          0
##  9        0    47 control     0.0682         0
## 10        0    51 control     0.0512         0
## # ℹ 93 more rows
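To see how these binary predictions line up with what actually happened (an extension beyond the original lab), we can cross-tabulate predictions against outcomes:

# cross-tabulate observed outcomes against binary predictions
mod_plus %>%
  count(is_alive, alive_hat)

The cells where is_alive and alive_hat agree are correct classifications; the off-diagonal combinations are the model’s misclassifications.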