EDS 222: Assignment 03 (due: Nov 7, 5pm)

{ANSWER KEY}

2023-10-24

Question 1: Some math with Ordinary Least Squares

We will rely on R to implement all the heavy lifting of OLS, but it’s essential that you understand what is happening beneath the hood of canned linear regression commands so that you can accurately design, execute, and interpret regressions. These questions ask you to probe the algebra of least squares so that you can see some of the mechanics behind lm() and other linear regression packages in R and beyond.

Consider a simple linear regression model:

\[y_i = \beta_0 + \beta_1 x_i + u_i\] Recall the definitions of the OLS estimate of the intercept and slope coefficients:

\[\hat{\beta}_1 = \dfrac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sum_i (x_i - \overline{x})^2} = \frac{cov(x,y)}{var(x)}\]

\[ \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x} \] Using these definitions, show mathematically how \(\hat\beta_0\) and \(\hat\beta_1\) change under the following scenarios.1 Note that these kinds of scenarios occur in practice all the time when we change units of measurement. Give some intuition for your answers. If your answers differ across scenarios, why do they? If not, why not?

Answer:

Let our estimate of the original \(\beta_1\) and \(\beta_0\) in the original data be denoted by \(\hat\beta_1\) and \(\hat\beta_0\), respectively.

Now, we multiply all observations of \(x\) by 3.

Our estimate of the slope now becomes:2 Note: we are using \(\tilde\beta\) notation to indicate a new coefficient estimate we recover in these new data

\[\begin{align*} \tilde\beta_1 &= \frac{cov(3x,y)}{var(3x)} \\ &= \frac{3cov(x,y)}{3^2var(x)} \\ &= \frac{1}{3}\hat\beta_1 \end{align*}\]

Our estimate of the intercept now becomes:

\[\begin{align*} \tilde\beta_0 &= \bar{y} - \tilde\beta_1 \frac{1}{n}\sum_i 3x_i \\ &= \bar{y} - \frac{1}{3} \hat\beta_1 3\bar{x} \\ &= \bar{y} - \hat\beta_1 \bar{x} \\ &= \hat\beta_0 \end{align*}\]

Therefore, when we multiply our independent variable by 3, our intercept does not change, while our slope coefficient is multiplied by 1/3. This is intuitive – nothing about the predicted value of \(y\) at \(x=0\) has changed, since the \(y\) data were unaffected by this rescaling. However, the rescaling of \(x\) means that we are stretching our data out along the \(x\)-axis: changing this new \(x\) variable by 1 is equivalent to changing the original \(x\) variable by 3, so our slope coefficient falls by 1/3.

Here, we multiply all observations of \(y\) by 3.

Our estimate of the slope now becomes:3 Note: using the same \(\tilde\beta\) notation to indicate the new coefficients in the transformed data.

\[\begin{align*} \tilde\beta_1 &= \frac{cov(x,3y)}{var(x)} \\ &= \frac{3cov(x,y)}{var(x)} \\ &= 3\hat\beta_1 \end{align*}\]

Our estimate of the intercept now becomes:

\[\begin{align*} \tilde\beta_0 &= \frac{1}{n}\sum_i 3y_i - \tilde\beta_1 \bar{x} \\ &= 3\bar{y} - 3 \hat\beta_1 \bar{x} \\ &= 3(\bar{y} - \hat\beta_1 \bar{x}) \\ &= 3\hat\beta_0 \end{align*}\]

Therefore, both our intercept and slope coefficients are multiplied by 3 when the independent variable is multiplied by 3. The intuition here is that when we evaluate our regression at \(x=0\), the \(y\) variable must be 3 times its original value since all values are scaled by 3. Similarly, a one unit change in \(x\) will lead to a three times larger change in \(y\) since all \(y\) values have been scaled by 3.

Here, we multiply all observations of \(y\) and \(x\) by 3.

Our estimate of the slope now becomes:4 Note: using the same \(\tilde\beta\) notation to indicate the new coefficients in the transformed data.

\[\begin{align*} \tilde\beta_1 &= \frac{cov(3x,3y)}{var(3x)} \\ &= \frac{3^2 cov(x,y)}{3^2 var(x)} \\ &= \hat\beta_1 \end{align*}\]

Our estimate of the intercept now becomes:

\[\begin{align*} \tilde\beta_0 &= \frac{1}{n}\sum_i 3y_i - \tilde\beta_1 \frac{1}{n}\sum_i 3x \\ &= 3\bar{y} - \hat\beta_1 3 \bar{x} \\ &= 3(\bar{y} - \hat\beta_1 \bar{x}) \\ &= 3\hat\beta_0 \end{align*}\]

Therefore, in this case, our slope coefficient is unchanged while our intercept is multiplied by 3. Here, since both variables are scaled, a one unit change in our new \(x\) leads to the same change in \(y\) as in the original regression. However, the intercept still gets multiplied by 3 because the predicted level of \(y\) when \(x=0\) must be 3 times larger than it was before the rescaling.

Question 2: Forest fires and temperature

The data provided for this assignment, called forestfires.rds, is a dataset of daily forest fire area burned in the northeast region of Portugal and meteorological conditions on the recorded day (data was constructed from here). The goal of assembling this dataset was to evaluate if weather conditions, such as temperature and air humidity, can predict area burned and inform fire management decisions. Each observation is a daily observation of forest fire are burned and set of associated daily weather variables, all of which are detailed in the provided README.txt file.

The original documentation of the dataset can be found in Cortez and Morais, 2007. For more information on the other variables from the Fire Weather Index (FWI), please see here.

Question 2.1

In this homework, we are interested in the relationship between temperature and area burned.

Answer:

df = readRDS(file.path(datadir,"forestfires.rds"))

ggplot(data=df, aes(x=temp, y = area)) + 
  geom_point(size=3) + theme_bw() + geom_hline(yintercept=0, color="red")

This scatter plot shows a generally positive relationship between hectares (ha) burned and daily temperature. It appears to have a few outliers above 1000 ha burned. This general pattern makes intuitive sense, as the hotter the temperature, the drier, and more flammable the forest is likely to be. This appears to be a fairly strong correlation, although the outliers may reduce the magnitude of the estimated correlation.

Question 2.2

Use the lm() command to estimate the following simple linear regression:

\[ \text{area_burned}_i = \beta_0 + \beta_1 \text{temp}_i + \varepsilon_i \] Display your estimated intercept and slope coefficients using summary(), gt(), or kable(). Interpret each coefficient in words, and then answer the following:

Answer:

summary(lm(area ~ temp, data =df))
## 
## Call:
## lm(formula = area ~ temp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -296.68  -71.16   -6.75   69.36 1049.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7769    17.1650   0.861     0.39    
## temp         12.2568     0.8687  14.110   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.6 on 515 degrees of freedom
## Multiple R-squared:  0.2788, Adjusted R-squared:  0.2774 
## F-statistic: 199.1 on 1 and 515 DF,  p-value: < 2.2e-16

Interpretation:

Our regression results imply that we predict area burned to be 14.78 ha on a 0 degree Celsius day. We also estimate that a one degree Celsius increase in temperature (e.g., from 15 to 16 Celsius, or 30 to 31 Celsius) will lead to an increase in daily area burned of 12.26.

Our model predicts that a fire at 10 degrees Celsius will burn \(14.78+ 12.26 \times 10 = 137.38\) ha.

Our model predicts that a fire at 28 degrees Celsius will burn \(14.78+ 12.26 \times 28 = 358.06\) ha.

There are two ways you could answer this. First, this is a \(30-12=18\) unit change in \(x\), so we predict the difference to be \(12.26\times 18= 220.68\) more ha of daily burned area when we move from 12 to 30 Celsius. We could also calculate this by taking the difference in predicted values for \(x=30\) minus \(x=12\): \(14.78+12.26\times 30 - (14.78 + 12.26\times 12) = 12.26 \times 18 = 220.68\) ha.

Question 2.3

The area variable covers a range of zero to 1371.923 (units: hectares). Based on the math you showed in Question 1, how do you expect your coefficients to change if you rescale this variable to acres (one hectare is about 2.47 acres)?

Answer:

Our math above implies that if we multiply our dependent variable by \({2.47}\), our slope coefficient should change by \({2.47}\) times. That is, our slope coefficient should be \(2.47 \times \hat\beta_1\), where \(\hat\beta_1\) is the estimated slope coefficient on \(temp\) from the original regression (in units of ha per degree C). Our intercept should also be rescaled by a factor of 2.47.

Implement this rescaling and show your new coefficients. Does your math align with your new coefficients?

Answer:

df <- df %>% mutate(area_acres = area*2.47)
summary(lm(area_acres ~ temp, data =df))
## 
## Call:
## lm(formula = area_acres ~ temp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -732.81 -175.76  -16.67  171.33 2592.27 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.499     42.398   0.861     0.39    
## temp          30.274      2.146  14.110   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283 on 515 degrees of freedom
## Multiple R-squared:  0.2788, Adjusted R-squared:  0.2774 
## F-statistic: 199.1 on 1 and 515 DF,  p-value: < 2.2e-16

Answer:

Yes, our results align with our math from Question 1. The intercept and slope coefficient were each multiplied by 2.47. This makes intuitive sense – rescaling the area burned to acres from ha should also translate when temperature is 0! Therefore when \(x=0\), we get the same amount of area burned, now multiplied by 2.47 to be expressed in acres.

Question 2.4

Using your original regression model with the original temp variable, use geom_smooth() in ggplot() to visualize your regression line, overlaid on your scatter plot. Use se=FALSE to suppress standard errors; we will dig into those soon!

How well do you think your model is fitting the data?

Answer:

ggplot(data=df, aes(x=temp, y = area)) + 
  geom_point(alpha=0.1, size=3) + 
  geom_smooth(method='lm', formula= y~x, color="lightcoral", se=F, size=1.5) +
  theme_bw() + geom_hline(yintercept=0, color="seagreen")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

This looks like a decent model fit, throughout the support of temperature. However, are some really low amounts of area burned (close to 0) and some a couple of instances when area burned were extremely high that our simple model cannot predict.

Compute the coefficient of determination (\(R^2\)), or report it based on the regression results you saved above. What percent of variation in area burned are explained by temperature? Does this align with your intuition based on the scatter plot?

Answer:

mod <- lm(area ~ temp, data =df)
summary(mod)$r.squared
## [1] 0.2787951

This tells us that about 28% of the variation in area burned is explained by temperature. This makes sense based on our scatter plot, as we can see that there is generally a pretty strong upward relationship but that there are also large residuals everywhere – those will dramatically increase the numerator of the second term in the \(R^2\) expression, amounting to a large sum of squared errors and diminish the \(R^2\).

Question 2.5

Due to complex climatological phenomena, days with high temperatures tend to coincide with days that are also different in other dimensions. For example, hot days tend to be less rainy, with lower wind, and of higher or lower humidity, depending on the geographic location. This raises the concern of omitted variables bias, as these variables may also be correlated with area burned.

To address this concern, add relative humidity (RH) as an independent variable to your linear regression model, in addition to temperature. Display your estimated intercept and slope coefficients using summary(), gt(), or kable(). Interpret your slope coefficient on temperature in words.

Answer:

summary(lm(area ~ temp + RH, data =df))
## 
## Call:
## lm(formula = area ~ temp + RH, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -295.76  -71.47   -4.91   67.56 1052.89 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.5334    31.3950  -0.495    0.621    
## temp         12.8782     1.0221  12.600   <2e-16 ***
## RH            0.4193     0.3637   1.153    0.249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.5 on 514 degrees of freedom
## Multiple R-squared:  0.2807, Adjusted R-squared:  0.2779 
## F-statistic: 100.3 on 2 and 514 DF,  p-value: < 2.2e-16

These results show that increasing daily temperature by 1 degree C while holding relative humidity fixed is estimated to increase area burned by 12.88 ha. Thus, the effect of temperature on area burned is largely unaffected by the addition of relative humidity to the model. The slope coefficient on temperature increases slightly – from 12.26 to 12.88 – but these values are very close to one another. Relative humidity appears to increase area burned but not to be causing substantial omitted variables bias when omitted from the regression.