class: center, middle, inverse, title-slide .title[ # Logistic Regression (and other nonlinear models) ] .subtitle[ ## EDS 222 ] .author[ ### Tamma Carleton ] .date[ ### Fall 2023 ] --- <style type="text/css"> @media print { .has-continuation { display: block; } } </style> # Announcements/check-in - Assignment 03 pass/fail, due **today** (5pm) -- - Assignment 04 after we cover inference/uncertainty (likely assigned next week) -- - Final project proposals, due 11/10 (5pm) + More details in a few slides --- # Final project ### Goal: Apply **some of** the statistical concepts you have learned in this course to **answer an environmental data science question**.<sup>*</sup> -- ### Two parts: Deliverable 1: Technical blog post. Some examples: + [G-FEED](http://www.g-feed.com/2020/09/indirect-mortality-from-recent.html) + [emLab](https://emlab.ucsb.edu/blog/summertime-blues) + [MEDS '22, ex. 1](https://cullen-molitor.github.io/posts/2021-12-05-species-density-sst-lagsst/) + [MEDS '22, ex. 2](https://jake-eisaguirre.github.io/posts/2021-11-29-mpasandkelp/) + [MEDS '22, ex. 3](https://hdolinh.github.io/posts/2021-11-14-stats-final/) --- # Final project ### Goal: Apply **some of** the statistical concepts you have learned in this course to **answer an environmental data science question**.<sup>*</sup> -- ### Two parts: Deliverable 2: Three-minute in-class presentation during final exam slot (4-7pm, 12/12) .footnote[ [*]: Your project _must_ include concepts from the second half of the course. ] --- # Final project ### Proposal: Short paragraph (4-5 sentences) describing your proposed project. Motivate the question, describe possible data sources, and suggest possible analyses. **Email Sandy your proposal** at sandysum@ucsb.edu by 5pm on November 10th. --- # Final project Full guidelines on our [Resources Page](https://tcarleton.github.io/EDS-222-stats/resources.html) ### Some example topics: - Are political views on climate change associated with recent natural disaster exposure? -- - Detecting trends in water quality for indigenous communities in Chile -- - Spatial patterns of deforestation during COVID-19 -- - Are there gendered health effects of wildfire smoke? --- name: Overview # Today #### More on nonlinear relationships with linear regression models Log-linear, log-log regressions -- #### Logistic regression How do we model binary outcomes? --- layout: false class: clear, middle, inverse # Nonlinear relationships in linear regression models --- # Nonlinear transformations - Our linearity assumption requires that **parameters enter linearly** (_i.e._, each `\(\beta_k\)` enters multiplied by a variable, nothing more) - We can still allow nonlinear relationships between `\(y\)` and the explanatory variables `\(x\)`. **Example: Polynomials** `$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + u_i$$` `$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + u_i$$` `$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4 + u_i$$` ...
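All of these fit with ordinary `lm()` once the powers of `\(x\)` are included as regressors. A minimal sketch (assuming a hypothetical data frame `df` with columns `y` and `x`):

```r
# Quadratic: wrap the squared term in I() so that ^ is
# interpreted as arithmetic, not formula syntax
lm(y ~ x + I(x^2), data = df)

# Or use poly() for higher orders (raw = TRUE gives x, x^2, x^3)
lm(y ~ poly(x, 3, raw = TRUE), data = df)
```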
--- # Polynomials - Recall the relationship between **temperature** and **harmful algal blooms**: $$ area_i = \beta_0 + \beta_1 temperature_i + \beta_2 temperature_i^2 + u_i$$ <img src="06-nonlinearmodels_files/figure-html/polys-1.svg" width="70%" style="display: block; margin: auto;" /> --- # Polynomials Estimating polynomial regressions in `R`: ```r blooms_df = blooms_df %>% mutate(temp2 = temp^2) summary(lm(area~temp+temp2, data=blooms_df)) #> #> Call: #> lm(formula = area ~ temp + temp2, data = blooms_df) #> #> Residuals: #> Min 1Q Median 3Q Max #> -12.597 -2.092 -0.142 1.995 9.487 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 0.0636 0.2925 0.22 0.83 #> temp 0.6254 0.4401 1.42 0.16 #> temp2 1.9212 0.1416 13.57 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 3.02 on 997 degrees of freedom #> Multiple R-squared: 0.777, Adjusted R-squared: 0.777 #> F-statistic: 1.74e+03 on 2 and 997 DF, p-value: <2e-16 ``` --- # Other nonlinear-in-X regressions - **Polynomials** and **interactions:** `\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \beta_3 x_{2i} + \beta_4 x_{2i}^2 + \beta_5 \left( x_{1i} x_{2i} \right) + u_i\)` (more on this today) - **Exponentials:** `\(\log(y_i) = \beta_0 + \beta_2 e^{x_{2i}} + u_i\)` - **Logs:** `\(\log(y_i) = \beta_0 + \beta_1 x_{1i} + u_i\)` (Today!) - **Indicators** and **thresholds:** `\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 \, \mathbb{I}(x_{1i} \geq 100) + u_i\)` -- In all cases, the effect of a change in `\(x\)` on `\(y\)` will vary depending on your baseline level of `\(x\)`. This is not true with linear relationships! --- # Log-linear specification You will frequently see logged<sup>*</sup> outcome variables with linear (non-logged) explanatory variables, _e.g._, $$ \log(\text{area}_i) = \beta_0 + \beta_1 \, \text{temperature}_i + u_i $$ This specification changes our interpretation of the slope coefficients. -- **Interpretation** - A one-unit increase in our explanatory variable changes the outcome variable by approximately `\(\beta_1\times 100\)` percent. - *Example:* If `\(\beta_1 = 0.03\)`, an additional degree of warming increases algal bloom area by approximately 3 percent. .footnote[ [*]: When I say "log", I mean "natural log", i.e. `\(ln(x) = log_e(x)\)`. ] --- # Review: Percent changes - What is a percent change again, anyway? -- - Local gasoline prices were $5/gallon, but last month increased by 12%. How much are they now? -- $$ 5(1+0.12) = 5\times1.12 = 5.6$$ -- Can also write this as `$$0.12 = \frac{5.6-5}{5}$$` -- Generally, when `\(y\)` increases by `\(r \times 100\)` percent (i.e., by a fraction `\(r\)`), our new value is `\(y(1+r)\)`, where $$ r = \frac{y_2 - y_1}{y_1}$$ --- # Log differences as percent changes? Near `\(y=1\)`, `\(log(y)\)` has a slope of approximately 1, i.e. `\(log(y) \approx y-1\)` <img src="06-nonlinearmodels_files/figure-html/logs-1.svg" width="80%" style="display: block; margin: auto;" /> --- # Log differences as percent changes? Near `\(y=1\)`, `\(log(y)\)` has a slope of approximately 1, i.e. `\(log(y) \approx y-1\)` Therefore, `\(log(1+r) \approx r\)` **when `\(r\)` is small!** (so that you're still close to 1 on the x-axis) -- This lets us show that: `$$log(y(1+r)) = log(y) + log(1+r) \approx log(y) + r$$` So when we see `\(log(y)\)` go up by `\(r\)`, we can say that represents approximately an `\(r \times 100\)` percent change in `\(y\)`! -- For example: increasing `\(y\)` by 5% means `\(y\)` increases to `\(y(1.05)\)`. The log of `\(y\)` changes from `\(log(y)\)` to approximately `\(log(y) + 0.05\)`. Increasing `\(y\)` by 5% is therefore (almost) equivalent to adding 0.05 to `\(log(y)\)`.
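You can check this approximation directly in `R` (a quick sketch, base `R` only):

```r
# log(1 + r) is close to r when r is small...
log(1 + 0.05)  # 0.0488, close to r = 0.05

# ...but the approximation drifts as r grows (more on this soon)
log(1 + 0.80)  # 0.5878, far from r = 0.80
```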
--- # Log-linear specification Back to our log-linear model $$ \log(y_i) = \beta_0 + \beta_1 \, x_i + u_i $$ A one-unit change in `\(x\)` causes a `\(\beta_1\)` unit change in `\(log(y)\)`. This is (approximately) equivalent to a `\(\beta_1 \times 100\)` **percent change** in `\(y\)`. --- # Log-linear specification Because the log-linear specification comes with a different interpretation, you need to make sure it fits your data-generating process/model. Does `\(x\)` change `\(y\)` in levels (_e.g._, a 3-unit increase) or percentages (_e.g._, a 10-percent increase)? -- _I.e._, you need to be sure an exponential relationship makes sense: $$ \log(y_i) = \beta_0 + \beta_1 \, x_i + u_i \iff y_i = e^{\beta_0 + \beta_1 x_i + u_i} $$ Note: You are using linear regression to estimate a nonlinear-in-parameters relationship. This is the power of taking logs! --- # Log-linear specification <img src="06-nonlinearmodels_files/figure-html/log linear plot-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Log-log specification Similarly, log-log models are those where the outcome variable is logged *and* at least one explanatory variable is logged $$ \log(\text{area}_i) = \beta_0 + \beta_1 \, \log(\text{temperature}_i) + u_i $$ **Interpretation:** - A one-percent increase in `\(x\)` will lead to a `\(\beta_1\)` percent change in `\(y\)`. - Often interpreted as an "elasticity" in economics. --- # Log-log specification <img src="06-nonlinearmodels_files/figure-html/log log plot-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Log-linear with a binary variable **Note:** If you have a log-linear model with a binary indicator variable, the interpretation for the coefficient on that variable changes. Consider: `$$\log(y_i) = \beta_0 + \beta_1 x_{1i} + u_i$$` for binary variable `\(x_1\)`. The interpretation of `\(\beta_1\)` is now - When `\(x_1\)` changes from 0 to 1, `\(y\)` will change by `\(100 \times \left( e^{\beta_1} -1 \right)\)` percent. - When `\(x_1\)` changes from 1 to 0, `\(y\)` will change by `\(100 \times \left( e^{-\beta_1} -1 \right)\)` percent. --- # When the approximation fails The nice interpretation so far relies on the fact that near 1, `\(log(y) \approx y-1\)` - So, for example, `\(log(y(1+r)) = log(y) + log(1+r) \approx log(y) + r\)` -- What if `\(r\)` is large? E.g., `\(r = 0.8\)`: - `\(log(1 \times 1.8) = log(1) + log(1.8) = 0.59 \neq log(1) + 0.8 = 0.8\)` -- Exact percentage change (use for large predicted changes): If `\(log(y) = \beta_0 + \beta_1 x + \varepsilon\)`, then the percentage change in `\(y\)` for a one-unit change in `\(x\)` is: $$\text{% change in y} = (e^{\beta_1}-1)\times 100 $$ -- Note that `\(e^x\)` in `R` is `exp(x)` --- # When the approximation fails Example: Suppose in `\(log(y) = \beta_0 + \beta_1 x + \varepsilon\)`, we estimate that `\(\hat\beta_1 = 0.6\)` -- This looks like a one-unit change in `\(x\)` causes a 60% change in `\(y\)`. But the exact percentage change in `\(y\)` is: + `\((e^{0.6}-1)\times 100 = 0.82 \times 100 \implies 82\)` percent change in `\(y\)` + Note that the approximation is always biased _downwards_ for large changes, since `\(\beta_1 < e^{\beta_1}-1\)` whenever `\(\beta_1 \neq 0\)` -- Can you just change units of `\(x\)`? + Yes, mechanically you can do this and avoid the issues with approximation + But think hard about your problem! You probably care about understanding the impacts of a meaningful increase in `\(x\)`, not a tiny increase in `\(x\)`
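A quick sketch of the calculation from the example above in `R`:

```r
beta1 <- 0.6
beta1 * 100             # approximate percent change: 60
(exp(beta1) - 1) * 100  # exact percent change: ~82.2
```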
--- layout: false class: clear, middle, inverse # Logistic regression --- # Modeling binary outcomes What do you do when your dependent variable takes on just two values? <img src="06-nonlinearmodels_files/figure-html/binary-1.svg" width="90%" style="display: block; margin: auto;" /> --- # Modeling binary outcomes What's wrong with running our standard linear regression? $$\text{species present}_i = \beta_0 + \beta_1 \text{forest cover}_i + \varepsilon_i $$ <img src="06-nonlinearmodels_files/figure-html/lpm-1.svg" width="90%" style="display: block; margin: auto;" /> --- # Modeling probabilities - Our data take on the form `\(y_i = 1\)` or `\(y_i = 0\)` -- - For each individual `\(i\)`, there is some probability `\(p_i\)` that `\(y_i=1\)`, and therefore probability `\(1-p_i\)` that `\(y_i=0\)` -- - We are interested in how a change in variable `\(x\)` changes the probability that `\(y_i=1\)` + That is, **we model `\(p_i\)`** as a function of independent variables -- - Basic idea: we need some transformation of the _probability_ that lets us write: `$$transformation(p_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ...$$` --- # Modeling probabilities Basic idea: we need some transformation of the _probability_ that lets us write: `$$transformation(p_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ...$$` - We want this transformation to ensure that: + it takes a value between 0 and 1 as input and returns a continuous, unbounded variable (so the transformed probability can serve as the outcome in a linear model) + our predicted probabilities `\(\hat p_i\)` (from the inverse of the transformation) will fall between 0 and 1 --- # Logistic regression The **logit function** is the most commonly used nonlinear transformation that ensures predicted probabilities between 0 and 1: <img src="06-nonlinearmodels_files/figure-html/logit-1.svg" width="90%" style="display: block; margin: auto;" /> --- # Logistic regression The **logit function** is the most commonly used nonlinear transformation that ensures predicted probabilities between 0 and 1: `$$logit(p) = log\left(\frac{p}{1-p}\right)$$` -- We can then write: `$$log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ...$$` The logit function is also called the "log odds" because the "odds" of success are the probability of success, `\(p_i\)`, divided by the probability of failure, `\(1-p_i\)` -- Because of the properties of the logit function (see last graph), this ensures we will generate predicted probabilities `\(\hat{p}_i\)` that fall between 0 and 1. --- # Logistic regression How do we estimate this regression? `$$log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ...$$` -- - Can't use linear regression -- we don't have data on `\(p_i\)`! We only see `\(y_i = 1\)` or `\(y_i = 0\)` - We use what's called "maximum likelihood estimation" (in practice, the likelihood is maximized numerically, _e.g._, via gradient descent) + Essentially, this asks: what combination of parameters `\(\beta_0, \beta_1, ...\)` maximizes the likelihood that we would observe the data we have? + E.g., if high `\(x_1\)` values coincide with many `\(y_i = 1\)` values, it is likely that `\(\beta_1\)` is positive and that `\(p_i\)` is high for observations with large `\(x_1\)` --- # Logistic regression How do we estimate this regression? `$$log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ...$$` All you really need to know on estimation is... + That we use `glm()` instead of `lm()` -- GLM for "generalized linear model" + Interpreting coefficients is a lot more complicated! (next slide)
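For example, a minimal sketch (`species_df`, `present`, and `forest_cover` are hypothetical names standing in for the species presence data plotted earlier):

```r
# family = binomial fits a logistic regression
# (the logit link is the default)
mod <- glm(present ~ forest_cover,
           data = species_df,
           family = binomial)

summary(mod)  # coefficients are on the log-odds scale

# Predicted probabilities (the inverse transformation)
head(predict(mod, type = "response"))
```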
--- # Interpreting logistic regression output `$$log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ...$$` - `\(\beta_k\)`: effect of a one-unit change in `\(x_k\)` on the log-odds of `\(y = 1\)` 🤔 -- We need to transform our output to get predicted probabilities back! $$ `\begin{aligned} \log\left( \frac{p_i}{1-p_i} \right) &= b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i} \\ \frac{p_i}{1-p_i} &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i &= \left( 1 - p_i \right) e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} - p_i \times e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i + p_i \text{ } e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i(1 + e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}) &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i &= \frac{e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}}{1 + e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}} \end{aligned}` $$ --- # Interpreting logistic regression output This means that if you run a regression with many independent variables, you need to plug your estimated `\(\hat\beta\)`'s _and_ the values of all your `\(x\)` variables into this equation to get back a predicted probability for any individual: `$$p_i = \frac{e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}}{1 + e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}}$$` -- If you want to know the _effect_ of changing just one variable `\(x_j\)` on the probability `\(p_i\)`, you need to compute: $$ `\begin{aligned} p_i(x_j+1) - p_i(x_j) &= \frac{e^{b_0 + \cdots + b_j (x_{j,i}+1) + \cdots + b_k x_{k,i}}}{1 + e^{b_0 + \cdots + b_j (x_{j,i}+1) + \cdots + b_k x_{k,i}}} - \frac{e^{b_0 + \cdots + b_j x_{j,i} + \cdots + b_k x_{k,i}}}{1 + e^{b_0 + \cdots + b_j x_{j,i} + \cdots + b_k x_{k,i}}} \end{aligned}` $$ -- **Note** that this calculation depends on all the other `\(x\)`'s! And it will vary with the baseline level of `\(x_j\)` --- # Logistic regression: Example - Bertrand and Mullainathan (2003) study discrimination in hiring decisions - Authors created many fake resumes, randomly assigning different characteristics (name, sex, race, experience, honors, etc.) -- - **Outcome variable is binary:** Did the resume get a callback from a (real) potential employer?
+ Yes: `\(y_i=1\)` + No: `\(y_i=0\)` - Manipulated first names to be those that are commonly associated with White or Black individuals - Random study design allows estimation of the causal effect of race on callback probability --- # Logistic regression: Example <table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>List of all 36 unique names along with the commonly inferred race and sex associated with these names.</caption> <thead> <tr> <th style="text-align:left;"> first_name </th> <th style="text-align:left;"> race </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> first_name </th> <th style="text-align:left;"> race </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> first_name </th> <th style="text-align:left;"> race </th> <th style="text-align:left;"> sex </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aisha </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Hakim </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Laurie </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> </tr> <tr> <td style="text-align:left;"> Allison </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Jamal </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Leroy </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> </tr> <tr> <td style="text-align:left;"> Anne </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Jay </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Matthew </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> </tr> <tr> <td style="text-align:left;"> Brad </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Jermaine </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Meredith </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> </tr> <tr> <td style="text-align:left;"> Brendan </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Jill </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Neil </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> </tr> <tr> <td style="text-align:left;"> Brett </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Kareem </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Rasheed </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> </tr> <tr> <td 
style="text-align:left;"> Carrie </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Keisha </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Sarah </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> </tr> <tr> <td style="text-align:left;"> Darnell </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Kenya </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Tamika </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> </tr> <tr> <td style="text-align:left;"> Ebony </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Kristen </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Tanisha </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> </tr> <tr> <td style="text-align:left;"> Emily </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Lakisha </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Todd </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> </tr> <tr> <td style="text-align:left;"> Geoffrey </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Latonya </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Tremayne </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> </tr> <tr> <td style="text-align:left;"> Greg </td> <td style="text-align:left;"> White </td> <td style="text-align:left;"> male </td> <td style="text-align:left;border-left:1px solid;"> Latoya </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> female </td> <td style="text-align:left;border-left:1px solid;"> Tyrone </td> <td style="text-align:left;"> Black </td> <td style="text-align:left;"> male </td> </tr> </tbody> </table> --- # Logistic regression: Example Variables included in the data (all randomly assigned): <table class="table table-striped table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> variable </th> <th style="text-align:left;"> description </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-family: monospace;"> received_callback </td> <td style="text-align:left;width: 30em; "> Specifies whether the employer called the applicant following submission of the application for the job. </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> job_city </td> <td style="text-align:left;width: 30em; "> City where the job was located: Boston or Chicago. </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> college_degree </td> <td style="text-align:left;width: 30em; "> An indicator for whether the resume listed a college degree. 
</td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> years_experience </td> <td style="text-align:left;width: 30em; "> Number of years of experience listed on the resume. </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> honors </td> <td style="text-align:left;width: 30em; "> Indicator for the resume listing some sort of honors, e.g. employee of the month. </td> </tr> </tbody> </table> --- # Logistic regression: Example Variables included in the data (all randomly assigned): <table class="table table-striped table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> variable </th> <th style="text-align:left;"> description </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-family: monospace;"> military </td> <td style="text-align:left;width: 30em; "> Indicator for whether the resume listed any military experience. </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> has_email_address </td> <td style="text-align:left;width: 30em; "> Indicator for whether the resume listed an email address for the applicant. </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> race </td> <td style="text-align:left;width: 30em; "> Race of the applicant, implied by their first name listed on the resume. </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> sex </td> <td style="text-align:left;width: 30em; "> Sex of the applicant (limited to only male and female in this study), implied by the first name listed on the resume. </td> </tr> </tbody> </table> --- # Logistic regression: Example - First, we estimate a single predictor: `race` - `race` indicates whether the applicant is White or not (**Note:** `race` is also binary in this case!) - We find: `$$\log \left( \frac{\widehat{p}_i}{1-\widehat{p}_i} \right) = -2.67 + 0.44 \times {\texttt{race_white}}$$` a. If a resume is randomly selected from the study and it has a Black-associated name, what is the probability it resulted in a callback? b. What would the probability be if the resume name was associated with White individuals? --- # Logistic regression: Example `$$\log \left( \frac{\widehat{p}_i}{1-\widehat{p}_i} \right) = -2.67 + 0.44 \times {\texttt{race_white}}$$` a. If a resume is randomly selected from the study and it has a Black-associated name, what is the probability it resulted in a callback? -- **Answer:** If a randomly chosen resume is associated with a Black name, then `race_white` takes the value of 0 and the right side of the model equation equals `\(-2.67\)`. Solving for `\(p_i\)` gives `\(log(\frac{\hat p_i}{1-\hat p_i}) = -2.67 \implies \hat p_i = \frac{e^{-2.67}}{1+e^{-2.67}} = 0.065\)`. --- # Logistic regression: Example `$$\log \left( \frac{\widehat{p}_i}{1-\widehat{p}_i} \right) = -2.67 + 0.44 \times {\texttt{race_white}}$$` b. What would the probability be if the resume name was associated with White individuals? **Answer:** If the resume had a name associated with White individuals, then the right side of the model equation is `\(-2.67+0.44\times 1 = -2.23\)`. This translates into `\(\hat p_i = 0.097\)`. -- **Conclude:** Having a White-associated name increases the likelihood of a callback by 3.2 percentage points. --- # Logistic regression: Example **Use the same process** to compute predicted probabilities with multiple independent variables; you just have more calculations!
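First, a quick check of the single-predictor answers in `R` (a sketch; `plogis()` is base `R`'s inverse logit, i.e. `\(e^z/(1+e^z)\)`):

```r
plogis(-2.67)                         # Black-associated name: ~0.065
plogis(-2.67 + 0.44)                  # White-associated name: ~0.097
plogis(-2.67 + 0.44) - plogis(-2.67)  # difference: ~0.032
```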
-- For example, you might estimate: $$ `\begin{aligned} &\log \left(\frac{p}{1 - p}\right) \\ &= - 2.7162 - 0.4364 \times \texttt{job_city}_{\texttt{Chicago}} \\ & \quad \quad + 0.0206 \times \texttt{years_experience} \\ & \quad \quad + 0.7634 \times \texttt{honors} - 0.3443 \times \texttt{military} + 0.2221 \times \texttt{email} \\ & \quad \quad + 0.4429 \times \texttt{race}_{\texttt{White}} - 0.1959 \times \texttt{sex}_{\texttt{man}} \end{aligned}` $$ To predict the callback probability for a White individual, you also need to know their job location, experience, honors, military experience, whether they have an email address, and sex! --- # Logistic regression: Example For example, you might estimate: $$ `\begin{aligned} &\log \left(\frac{p}{1 - p}\right) \\ &= - 2.7162 - 0.4364 \times \texttt{job_city}_{\texttt{Chicago}} \\ & \quad \quad + 0.0206 \times \texttt{years_experience} \\ & \quad \quad + 0.7634 \times \texttt{honors} - 0.3443 \times \texttt{military} + 0.2221 \times \texttt{email} \\ & \quad \quad + 0.4429 \times \texttt{race}_{\texttt{White}} - 0.1959 \times \texttt{sex}_{\texttt{man}} \end{aligned}` $$ Note: the effect of race on callback probability now varies based on all the other covariates! + Try it: What is the effect of being White for a Chicago male with 10 years of experience, an email address, no honors, and no military experience, _versus_ for a female with the same characteristics? --- # Multinomial logistic regression **What if** your outcome variable is categorical, not binary? -- For example: - Species - Socioeconomic status - Survey responses - ... -- **Multinomial logistic regression** generalizes the binary logistic regression you've seen here to work for multiple outcome categories - Model predicts the probability an individual will fall into each category - Beyond the scope of this class, but not a far leap from what you've seen here (lots of online resources -- ask me if you're interested!) --- class: center, middle Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). Some slide components were borrowed from [Ed Rubin's](https://github.com/edrubin/EC421S20) awesome course materials. --- exclude: true