class: center, middle, inverse, title-slide

.title[
# Ordinary Least Squares, continued
]
.subtitle[
## EDS 222
]
.author[
### Tamma Carleton
]
.date[
### Fall 2023
]

---

# Announcements/check-in

- Assignment #1: Grades posted
  + Please ensure your `.html` file is compiled and pushed to GitHub
  + Please do not push data to GitHub (generally a good rule to follow)
  + Sandy to go over some areas of confusion

- Assignment #2: Due 10/20, 5pm

--

- Final project guidelines posted (under "Important Links" on homepage and on Resources tab)

--

- Practice midterm questions posted sometime this week

--

- Reiteration of COVID/illness policy

---
name: Overview

# Today

#### Notes on OLS
- Outliers, missing data

--

#### Measures of model fit
- Coefficient of determination `\(R^2\)`

--

#### Categorical variables
- In .mono[R], interpretation

--

#### Multiple linear regression
- Adding independent variables, interpretation of results

---
layout: false
class: clear, middle, inverse

# Notes on OLS

---

# Outliers

Because OLS minimizes the sum of the **squared** errors, outliers can play a large role in our estimates.

**Common responses**

- Remove the outliers from the dataset
- Replace outliers with the 99<sup>th</sup> percentile of their variable (*winsorize*)
- Take the log of the variable (This lowers the leverage of large values -- why?)
- Do nothing. Outliers are not always bad. Some people are "far" from the average. It may not make sense to try to change this variation.

---

# Missing data

Similarly, missing data can affect your results. .mono[R] doesn't know how to deal with a missing observation.

```r
1 + 2 + 3 + NA + 5
```

```
#> [1] NA
```

If you run a regression<sup>†</sup> with missing values, .mono[R] drops the observations missing those values. If the observations are missing in a nonrandom way, a random sample may end up nonrandom.

.footnote[
[†]: Or perform almost any operation/function
]

---
layout: false
class: clear, middle, inverse

# Measures of model fit

---

# Measures of model fit

### Goal: quantify how "well" your regression model fits the data

#### General idea: Larger variance in residuals suggests our model isn't very predictive

.pull-left[
<img src="04-ols-contd_files/figure-html/highvar-1.svg" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="04-ols-contd_files/figure-html/lowvar-1.svg" style="display: block; margin: auto;" />
]

---

# Coefficient of determination

- We already learned one measure of the strength of a linear relationship: correlation, `\(r\)`

--

- In OLS, we often rely on `\(R^2\)`, the **coefficient of determination**. In simple linear regression, this is simply the square of the correlation.

- Interpretation of `\(R^2\)`: **share of the variance in `\(y\)` that is explained by your regression model**

--

$$ SSR = \text{sum of squared residuals} = \sum_i (y_i - \hat y_i)^2 = \sum_i e_i^2 $$

$$ SST = \text{total sum of squares} = \sum_i (y_i - \bar y)^2 $$

$$ R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar y)^2} $$

---

# Coefficient of determination

$$ R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar y)^2} $$

--

- `\(R^2\)` varies between 0 and 1: A perfect model with `\(e_i=0\)` for all `\(i\)` has `\(R^2=1\)`; `\(R^2=0\)` if we just guess the mean `\(\bar y\)`.

--

- In more complex models, `\(R^2\)` is not the same as the square of the correlation coefficient. You should think of them as related but distinct concepts.
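---

# Coefficient of determination: by hand in .mono[R]

As a quick check of the formula above, here is a minimal sketch (using simulated data, not the lecture's examples) that computes `\(R^2\)` from SSR and SST and compares it to the value reported by `lm()`:

```r
set.seed(222)                    # hypothetical simulated data
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

fit <- lm(y ~ x)
ssr <- sum(residuals(fit)^2)     # sum of squared residuals
sst <- sum((y - mean(y))^2)      # total sum of squares

1 - ssr / sst                    # R^2 = 1 - SSR/SST
summary(fit)$r.squared           # same value, as reported by lm()
cor(x, y)^2                      # in simple regression, also the squared correlation
```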
---

# Coefficient of determination

About 49% of the variation in ozone can be explained with temperature alone!

<img src="ozone_temp_r2.png" width="70%" style="display: block; margin: auto;" />

---

# Coefficient of determination

#### Definition: % of variance in `\(y\)` that is explained by `\(x\)` (and any other independent variables)

--

- Describes a _linear_ relationship between `\(y\)` and `\(\hat y\)`

--

- Higher `\(R^2\)` does not mean a model is "better" or more appropriate
  + Predictive power is not often the goal of regression analysis (e.g., you may just care about getting `\(\beta_1\)` right)
  + If you are focused on predictive power, many other measures of fit can be appropriate (to discuss in machine learning)
  + Always look at your data and residuals!

--

- Like OLS in general, `\(R^2\)` is very sensitive to outliers. Again...always look at your data!

---

# Coefficient of determination

Here, `\(R^2=0.94\)` for a model of `\(y = \beta_0 + \beta_1 x + \epsilon\)`. Does that mean a linear relationship with `\(x\)` is appropriate?

<img src="04-ols-contd_files/figure-html/rsq-1.svg" style="display: block; margin: auto;" />

---

# Coefficient of determination

Here, `\(R^2=0\)` for a model of `\(y = \beta_0 + \beta_1 x + \epsilon\)`. Does that mean there is no relationship between these variables?

<img src="04-ols-contd_files/figure-html/rsq2-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: clear, middle, inverse

# Indicator/categorical variables

---

# Categorical variables

We have been talking a lot about **numerical** variables in linear regression...

+ Ozone levels
+ Crab size
+ Temperature and precipitation amounts
+ etc.

--

...but a lot of variables of interest are **categorical**:

+ Male/female
+ Presence/absence of a species
+ In/out of compliance with a pollution standard
+ etc.

--

#### How do we execute and interpret linear regression with categorical data?

---

# Categorical variables

We use **dummy** or **indicator** variables in linear regression to capture the influence of a categorical independent variable (_x_) on a continuous dependent variable (_y_).

--

For example, let _x_ be a categorical variable indicating the gender of an individual. Suppose we are interested in the "gender wage gap", so _y_ is income.

We estimate:

$$y_i = \beta_0 + \beta_1 MALE_i + \varepsilon_i $$

--

### Interpretation [draw it]:

- `\(MALE_i\)` is an **indicator** variable that = 1 when `\(i\)` is male (0 otherwise)
- `\(\beta_0=\)` average wages if `\(i\)` is **not** male
- `\(\beta_0+\beta_1=\)` average wages if `\(i\)` is male
- `\(\beta_1=\)` average _difference_ in wages between males and females

---

# Categorical variables

#### For a categorical variable with two "levels", the OLS slope coefficient is the _difference_ in means across the two groups

<img src="04-ols-contd_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

# Categorical variables

#### What if I have many categories?

- E.g., species, education level, age group, ...

For example, let _x_ be a categorical variable indicating the species of penguin, and _y_ is body mass.
We estimate:

$$y_i = \beta_0 + \beta_1 SPECIES_i + \varepsilon_i $$

Where **species** can be one of:

- Adelie
- Chinstrap
- Gentoo

---

# Categorical variables

```r
library(palmerpenguins)
head(penguins)
```

```
#> # A tibble: 6 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex   year
#>   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
#> 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
#> 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
#> 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
#> 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
#> 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
#> 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
#> # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
```

```r
class(penguins$species)
```

```
#> [1] "factor"
```

---

# Categorical variables

```r
summary(lm(body_mass_g ~ species, data = penguins))
```

```
#> 
#> Call:
#> lm(formula = body_mass_g ~ species, data = penguins)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1126.02  -333.09   -33.09   316.91  1223.98 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       3700.66      37.62   98.37   <2e-16 ***
#> speciesChinstrap    32.43      67.51    0.48    0.631    
#> speciesGentoo     1375.35      56.15   24.50   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 462.3 on 339 degrees of freedom
#>   (2 observations deleted due to missingness)
#> Multiple R-squared:  0.6697, Adjusted R-squared:  0.6677
#> F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16
```

---

# Categorical variables

What is going on here?? One _x_ variable turned into multiple slope coefficients? 🤔

--

.mono[R] is turning our regression

$$y_i = \beta_0 + \beta_1 SPECIES_i + \varepsilon_i $$

where _SPECIES_ is a categorical variable indicating one of three species, into:

$$y_i = \beta_0 + \beta_1 CHINSTRAP_i +\beta_2 GENTOO_i + \varepsilon_i $$

where _CHINSTRAP_ and _GENTOO_ are dummy variables for the Chinstrap and Gentoo species, respectively.

---

# Categorical variables

When your categorical variable takes on `\(k\)` values, .mono[R] will create dummy variables for `\(k-1\)` values, leaving one as the **reference** group:

<img src="penguins.png" width="1087" style="display: block; margin: auto;" />

--

To evaluate the outcome for the reference group, **set the dummy variables equal to zero for all other groups**.

> Q: What is the average body mass of an Adelie species?

> Q: What is the difference in body mass between Chinstrap and Adelie?

---
layout: false
class: clear, middle, inverse

# Multiple linear regression

---

# More explanatory variables

We're moving from **simple linear regression** (one .pink[outcome variable] and one .purple[explanatory variable])

$$ \color{#e64173}{y_i} = \beta_0 + \beta_1 \color{#6A5ACD}{x_i} + u_i $$

--

to the land of **multiple linear regression** (one .pink[outcome variable] and multiple .purple[explanatory variables])

$$ \color{#e64173}{y\_i} = \beta\_0 + \beta\_1 \color{#6A5ACD}{x\_{1i}} + \beta\_2 \color{#6A5ACD}{x\_{2i}} + \cdots + \beta\_k \color{#6A5ACD}{x\_{ki}} + u\_i $$

--

**Why?**

--

We can better explain the variation in `\(y\)`, improve predictions, avoid omitted-variable bias (i.e., second assumption needed for unbiased OLS estimates), ...

---

# More explanatory variables

Multiple linear regression...

$$ \color{#e64173}{y\_i} = \beta\_0 + \beta\_1 \color{#6A5ACD}{x\_{1i}} + \beta\_2 \color{#6A5ACD}{x\_{2i}} + \cdots + \beta\_k \color{#6A5ACD}{x\_{ki}} + u\_i $$

--
... raises many questions:

--

- Which `\(x\)`'s should I include? This is the problem of "model selection".

--

- How does my interpretation of `\(\beta_1\)` change?

--

- What if my `\(x\)`'s interact with each other? E.g., race and gender, temperature and rainfall.

--

- How do I measure model fit now?

--

**We will dig into each of these here,** and you will see these questions in other MEDS courses

---

# Multiple regression

`\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \quad\)` `\(x_1\)` is continuous `\(\quad x_2\)` is categorical

--

<img src="04-ols-contd_files/figure-html/multregplot1-1.svg" style="display: block; margin: auto;" />

---

# Multiple regression

The intercept and categorical variable `\(x_2\)` control for the groups' means.

<img src="04-ols-contd_files/figure-html/multregplot2-1.svg" style="display: block; margin: auto;" />

---

# Multiple regression

`\(\hat{\beta}_1\)` estimates the relationship between `\(y\)` and `\(x_1\)` after controlling for `\(x_2\)`. This is often called the "parallel slopes" model (a single slope `\(\beta_1\)` shared across the groups in `\(x_2\)`)

<img src="04-ols-contd_files/figure-html/multregplot5-1.svg" style="display: block; margin: auto;" />

---

# Multiple regression

More generally, how do we think about multiple explanatory variables?

--

Suppose `\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i\)`

.pull-left[
<img src="04-ols-contd_files/figure-html/yx1-1.svg" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="04-ols-contd_files/figure-html/yx2-1.svg" style="display: block; margin: auto;" />
]

---

# Multiple regression

More generally, how do we think about multiple explanatory variables?
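---

# Multiple regression

The "parallel slopes" model is easy to fit in .mono[R]. Here is a minimal sketch using the `palmerpenguins` data from earlier; the choice of `flipper_length_mm` as the continuous predictor is illustrative (it is not necessarily the variable used in the plots above):

```r
library(palmerpenguins)

# One continuous predictor plus one categorical predictor:
# a single slope on flipper length, with species shifting the intercept
parallel_fit <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins)
coef(parallel_fit)

# Compare to a simple regression that omits species entirely
simple_fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
coef(simple_fit)     # the slope changes once species is controlled for
```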
---

# Multiple regression

### With **many** explanatory variables, visualizing relationships means thinking about **hyperplanes** 🤯

`$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_k x_{ki} + u_i$$`

#### Math notation looks very similar to simple linear regression, but _conceptually_ and _visually_ multiple regression is **very different**

---

# Multiple regression

### Interpretation of coefficients

`$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_k x_{ki} + u_i$$`

--

- `\(\beta_k\)` tells us the change in `\(y\)` due to a one-unit change in `\(x_k\)` when **all other variables are held constant**

--

- This is an "all else equal" interpretation

--

- E.g., how much do wages increase with one more year of education, _holding gender fixed_?

--

- E.g., how much does ozone increase when temperature rises, _holding NOx emissions fixed_?

---

# Tradeoffs

There are tradeoffs to consider as we add/remove variables:

**Fewer variables**

- Generally explain less variation in `\(y\)`
- Provide simple interpretations and visualizations (*parsimonious*)
- May need to worry about omitted-variable bias

**More variables**

- More likely to find *spurious* relationships (statistically significant due to chance -- does not reflect a true, population-level relationship)
- More difficult to interpret the model
- You may still miss important variables -- omitted-variable bias is still possible

---

# Omitted-variable bias

You will study this in much more depth in EDS 241, but here's a primer.

**Omitted-variable bias** (OVB) arises when we omit a variable that

1. affects our outcome variable `\(y\)`
2. correlates with an explanatory variable `\(x_j\)`

As its name suggests, this situation leads to bias in our estimate of `\(\beta_j\)`. In particular, it violates Assumption 2 of OLS from last week.

--

**Note:** OVB is not exclusive to multiple linear regression, but it does require that multiple variables affect `\(y\)`.

---

# Omitted-variable bias

**Example**

Let's imagine a simple model for the cancer rate in census tract `\(i\)`:

$$ \text{Cancer rate}_i = \beta_0 + \beta_1 \text{UV radiation}_i + \beta_2 \text{TRI}_i + u_i $$

where

- `\(\text{UV radiation}_i\)` gives the average UV radiation in tract `\(i\)` (mW/cm$^2$)
- `\(\text{TRI}_i\)` denotes an indicator variable for whether tract `\(i\)` has a Toxics Release Inventory facility

thus

- `\(\beta_1\)`: the change in cancer rate associated with a 1 mW/cm$^2$ increase in UV radiation (*ceteris paribus*)
- `\(\beta_2\)`: the difference in avg. cancer rates between TRI and non-TRI census tracts (*ceteris paribus*)
<br>If `\(\beta_2 > 0\)`, then TRI tracts have higher cancer rates

---

# Omitted-variable bias

**"True" relationship:** `\(\text{Cancer rate}_i = 20 + 0.5 \times \text{UV radiation}_i + 10 \times \text{TRI}_i + u_i\)`

The relationship between cancer rates and UV radiation:

<img src="04-ols-contd_files/figure-html/plotovb1-1.svg" style="display: block; margin: auto;" />

---

# Omitted-variable bias

Biased regression estimate: `\(\widehat{\text{Cancer rate}}_i = 31.3 - 0.9 \times \text{UV radiation}_i\)`

<img src="04-ols-contd_files/figure-html/plotovb2-1.svg" style="display: block; margin: auto;" />

---

# Omitted-variable bias

Recalling the omitted variable: TRI (**<font color="#e64173">non-TRI</font>** and **<font color="#314f4f">TRI</font>**)

<img src="04-ols-contd_files/figure-html/plotovb3-1.svg" style="display: block; margin: auto;" />

---

# Omitted-variable bias

Recalling the omitted variable: TRI (**<font color="#e64173">non-TRI</font>** and **<font color="#314f4f">TRI</font>**)

<img src="04-ols-contd_files/figure-html/plotovb4-1.svg" style="display: block; margin: auto;" />

---

# Omitted-variable bias

Unbiased regression estimate: `\(\widehat{\text{Cancer rate}}_i = 20.9 + 0.4 \times \text{UV radiation}_i + 9.1 \times \text{TRI}_i\)`

<img src="04-ols-contd_files/figure-html/plotovb5-1.svg" style="display: block; margin: auto;" />

---
class: center, middle

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).

Some slide components come from [Ed Rubin's](https://github.com/edrubin/EC421S20) awesome course materials.

<style type="text/css">
@media print {
  .has-continuation {
    display: block;
  }
}
</style>

---
exclude: true
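---

# Appendix: OVB in a quick simulation

A minimal sketch of the example above, using simulated data with the same "true" coefficients; the way TRI is made to correlate with UV radiation here is invented for illustration, not taken from the slides' data:

```r
set.seed(1)                                    # hypothetical simulated data
n   <- 1000
uv  <- runif(n, 2, 8)                          # UV radiation (mW/cm^2)
tri <- rbinom(n, 1, prob = plogis(4 - uv))     # TRI facilities more common where UV is low
cancer <- 20 + 0.5 * uv + 10 * tri + rnorm(n)  # "true" relationship from the earlier slide

coef(lm(cancer ~ uv))        # omitting TRI: the UV slope is biased (it can even flip sign)
coef(lm(cancer ~ uv + tri))  # including TRI: the UV slope is close to the true 0.5
```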