EDS 222: Week 2: In-class Lab

ANSWER KEY

2023-10-11

Section 0: Setup

Start by loading packages you need. This week we will be using the Plum Island Ecosystem (PIE) LTER fiddler crabs dataset from the lterdatasampler package. Please install the package if you have not already done so.

## Loading required package: lterdatasampler
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:gt':
## 
##     sp500
## 
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
## 
##     bootstrap

Section 1: Visualizing Scatterplots

Using similar concepts from a histogram or a density plot, we can use a scatter plot to visualize the relationship between two variables. It is always a good idea to look at a scatter plot when trying to establish the relationship between two variables. A scatter plot is a simple two-dimensional plot in which the two coordinates of each dot represent the values of two variables measured on a single observation. The dependent or response variable is typically on the vertical axis and the independent or explanatory variable on the horizontal axis.

Here we will use a data frame on Atlantic marsh fiddler crabs that contains records from summer 2016 of fiddler crab body size and other variables obtained across three Long Term Ecological Research (LTER) sites across the Atlantic coast (more information available here). Specifically, we are going to explore what is called Bergmann’s rule:

“One of the best-known patterns in biogeography is Bergmann’s rule. It predicts that organisms at higher latitudes are larger than ones at lower latitudes. Many organisms follow Bergmann’s rule, including insects, birds, snakes, marine invertebrates, and terrestrial and marine mammals.” (Johnson et al., 2019)

We will try to see if (i) latitude associated with the size of PIE crabs, as Bergmann’s rule would suggest; and (ii) latitude is associated with water temperature, as this may help us understand why Bergmann’s rule holds.

The pie_crab data sample can be called directly once you’ve loaded the lterdatasampler package. Take a look at your variable names and make a scatter plot showing (i) crab body size as a function of latitude and (ii) water temperature as a function of latitude. That is, treat latitude as your independent variable.

  1. Make the two scatter plots described above.
  2. Discuss whether these two variables appear correlated, what sign you anticipate the correlation to be, and the strength of the correlation you expect.
summary(pie_crab) #note: ?pie_crab gives you helpful information like units for each variable
##       date               latitude         site                size      
##  Min.   :2016-07-24   Min.   :30.00   Length:392         Min.   : 6.64  
##  1st Qu.:2016-07-28   1st Qu.:34.00   Class :character   1st Qu.:12.02  
##  Median :2016-08-01   Median :39.10   Mode  :character   Median :14.44  
##  Mean   :2016-08-02   Mean   :37.69                      Mean   :14.66  
##  3rd Qu.:2016-08-09   3rd Qu.:41.60                      3rd Qu.:17.34  
##  Max.   :2016-08-13   Max.   :42.70                      Max.   :23.43  
##     air_temp      air_temp_sd      water_temp    water_temp_sd  
##  Min.   :10.29   Min.   :6.391   Min.   :13.98   Min.   :4.838  
##  1st Qu.:12.05   1st Qu.:8.110   1st Qu.:14.33   1st Qu.:6.567  
##  Median :13.93   Median :8.410   Median :17.50   Median :6.998  
##  Mean   :15.20   Mean   :8.654   Mean   :17.65   Mean   :7.252  
##  3rd Qu.:18.63   3rd Qu.:9.483   3rd Qu.:20.54   3rd Qu.:7.865  
##  Max.   :21.79   Max.   :9.965   Max.   :24.50   Max.   :9.121  
##      name          
##  Length:392        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
ggplot(data = pie_crab, aes(y = size, x = latitude)) +
  geom_point() +
  labs(x = "Latitude",
       y = "Size of crab carapace (mm)")+ theme_classic()

ggplot(data = pie_crab, aes(y = water_temp, x = latitude)) +
  geom_point() +
  labs(x = "Latitude",
       y = "Water temperature (Celsius)")+ theme_classic()

Since many crab samples were taken at each site, we see many data points for each latitude. It looks like crab size and latitude are somewhat positively correlated, with a lot of noise. Water temperature and latitude are negatively correlated and the relationship appears stronger. Generally, crab size increases as latitude increases, following Bergmann’s rule.

Section 2: Calculating Correlations

Visualizing values of the correlation between two quantities from their scatter plot can be very difficult. Research has shown that people’s perception of the strength of these relationships can be influenced by design choices like the size of the graph and the aspect ratio,1 The aspect ratio is the ratio of the width to the height of a figure or image. among other features. Hence, to more objectively report the relationship between two variables, we use correlation, computing the “correlation coefficient” \(r\).2 Note: we just called this “correlation” in lecture, but people use “correlation” and “correlation coefficient” interchangeably. In most cases, this refers to the Pearson’s correlation coefficient, as defined in lecture. There are other forms of correlation that are less common, such as rank correlation, but we won’t use them in this class. Here, we will compute correlation coefficients for the two relationships plotted above.

Exercise:

  1. Use the plots from section 1 to write your best estimate of the correlation coefficient between each of the two examples in section 1.
  1. the size of crab and latitude:
  2. water temperature and latitude:
  1. Use the cor() function in R to calculate the correlation coefficient between the two variables in (i) and the two in (ii). How close was your guess to the true answer?
# (i) size & latitude:
pie_crab %>% summarize(crab_lat = cor(size, latitude))
## # A tibble: 1 × 1
##   crab_lat
##      <dbl>
## 1    0.590
# (ii) water temperature & latitude:
pie_crab %>% 
  summarize(temp_lat = cor(water_temp, latitude)) 
## # A tibble: 1 × 1
##   temp_lat
##      <dbl>
## 1   -0.957

Section 3: Simple Linear Regression

Perform Simple Linear Regression

To estimate the linear relationship between the variables explored above, we’ll use lm() to fit a regression model using Ordinary Least Squares (OLS). The lm() function has two required arguments:

  1. a formula that indicates which variable is the dependent variable and which is the independent variable;
  2. a data argument that identifies the data frame for the variables in (i).

First let’s consider the relationship between crab size and latitude. We assume there is a linear population relationship, specified as:

\[\text{crab size} = \beta_0 + \beta_1 \text{latitude} + \epsilon\]

Exercise:

  1. Use lm() to estimate \(\hat\beta_0\) and \(\hat\beta_1\) using this sample of data. Use summary(lm()) or gt() to visualize the regression results.

  2. Interpret your two coefficients, paying careful attention to units.

  3. Is your estimate of \(\beta_0\) directly interpretable? What might you be concerned about when interpreting \(\beta_0\)?

Answers:

# (i) latitude is associated with the size of crabs:
lm(size ~ latitude, data = pie_crab) %>% 
 tidy %>% 
  gt()
term estimate std.error statistic p.value
(Intercept) -3.6244214 1.27405099 -2.844801 0.00467819238835985844876530492797428451012820
latitude 0.4851191 0.03359292 14.441111 0.00000000000000000000000000000000000003602897
#alternatively, can write: summary(lm(size ~ latitude, data = pie_crab))

The coefficient on latitude tells us that for each degree increase in latitude, we expect crab size to increase by 0.49mm on average.

The intercept estimate tells us that on average, crabs living on the equator, i.e. at latitude=0, have a size of -3.62mm. Note that this doesn’t make sense! Do these crabs even live at the equator? This is a case where you want to be careful about extrapolating. \(\hat\beta_0\) is helping us fit the data well in the range of our sample, but you shouldn’t go making predictions about crabs in the equatorial region using this model estimated from the Atlantic coast of the US.

Exercises, continued

  1. Does your model suggest anything about the relationship between crab size and water temperature? Is this evidence that water temperature is driving the Bergmann’s rule (i.e. the colder climate is causing the selection of larger organisms)?

Because crab size is increasing in latitude and water temperature is decreasing in latitude in the sample region, we can infer that as water temperature decreases, crab size increases. This is a statistical correlation we could estimate, but it is not causal evidence. You might think that colder water temperature is selecting for larger organisms by favoring larger bodies as they have lower surface-area-to-volume ratio (the heat conservation hypothesis), suggesting a causal link. However, for this hypothesis to be true, we should not see this relationship in “cold-blooded” animals, but many of them also follow this relationship.

“Bergmann originally hypothesized that the organisms he studied, birds, were larger in the colder, higher latitudes due to heat-conservation. But the heat-conservation hypothesis relies on internal regulation of body temperature and therefore does not apply to ectotherms,3 Commonly referred to as “cold-blooded animals”, ectotherms are animal that depends on external sources of body heat; such as fishes, amphibians, reptiles, and invertebrates. some of which also follow Bergmann’s rule. There is likely no universal mechanism underpinning Bergmann’s rule, regardless of ecto- or endothermy. As a result, other mechanisms have been proposed to explain Bergmann’s rule, including the starvation-resistant hypothesis, the diet-quality hypothesis, the enemy hypothesis, the resource rule, seasonality hypothesis, and the temperature–size rule (Johnson et al., 2019).”

  1. [If we have time] Repeat the above steps for the relationship between latitude and water temperature.

Here are analogous results for the relationship between latitude and water temperature:

# (ii) latitude is associated with water temperature: 
lm(water_temp ~ latitude, data = pie_crab) %>% 
  tidy %>% 
  gt()
term estimate std.error statistic p.value
(Intercept) 46.7782261 0.44896940 104.19023 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008505623
latitude -0.7729139 0.01183798 -65.29102 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004744672936810910431529715069308409588967560449162833510838002532630119373404108699

For each additional degree increase in latitude, we expect water temperature to fall by 0.77 degrees Celsius on average.

Visualize Regression [if we have time]

Here, we visualize our simple linear regression models above using the scatter plots from Section 1.

Exercise:

Use the geom_smooth() function with method argument set to lm, which stands for “linear model”, to add the best fit OLS line to your scatter plots from Section 1. Use the option se = FALSE to turn off estimates of uncertainty. Don’t worry, we’ll dig into those soon.

Answers:

# (i) latitude is associated with the size of crabs:
ggplot(data = pie_crab, aes(x = latitude, y = size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Latitude", 
       y = "Size of crab carapace (mm)") + theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

# (ii) latitude is associated with water temperature:
ggplot(data = pie_crab, aes(y = water_temp, x = latitude)) +
  geom_point() + 
  geom_smooth(method = "lm", formula= y~x, se=FALSE) + 
  labs(x = "Latitude",
       y = "Water temperature (Celsius)") + theme_classic()

Residuals, Sum of Squared Errors

The difference between what was observed in the data and what was predicted from the regression line is what we call an “error” or a “residual.” Observations that lie above the regression line exceeded their predicted value and have a positive residual. Observations that lie below the regression line are less than their predicted value and have a negative residual.

Recall that one of our assumptions needed in order to prove that OLS has the lowest variance of all unbiased estimators was that the population error \(u\) is: normally distributed, has a mean of zero, and has constant variance. While these assumptions are not directly testable because we can never observe \(u\), we can use our regression residuals \(e\) to assess the plausibility of these assumptions in the population.

Exercise:

Using the regression of crab size on latitude:

  1. Add to the pie_crab dataframe a column of predicted total lengths \(\hat y_i\) and a column of residuals \(e_i = y_i-\hat y_i\).4 Hint: In dplyr you can use add_predictions(mod) where mod is a regression model to generate a new variable containing your model’s predicted values.
# regression
model_1 <- lm(size ~ latitude, data = pie_crab)

# create predictions and residuals
predictions <- pie_crab %>% add_predictions(model_1) %>%
  mutate(residuals = size-pred)

# take a look to understand what just happened
head(predictions)
## # A tibble: 6 × 11
##   date       latitude site   size air_temp air_tem…¹ water…² water…³ name   pred
##   <date>        <dbl> <chr> <dbl>    <dbl>     <dbl>   <dbl>   <dbl> <chr> <dbl>
## 1 2016-07-24       30 GTM    12.4     21.8      6.39    24.5    6.12 Guan…  10.9
## 2 2016-07-24       30 GTM    14.2     21.8      6.39    24.5    6.12 Guan…  10.9
## 3 2016-07-24       30 GTM    14.5     21.8      6.39    24.5    6.12 Guan…  10.9
## 4 2016-07-24       30 GTM    12.9     21.8      6.39    24.5    6.12 Guan…  10.9
## 5 2016-07-24       30 GTM    12.4     21.8      6.39    24.5    6.12 Guan…  10.9
## 6 2016-07-24       30 GTM    13.0     21.8      6.39    24.5    6.12 Guan…  10.9
## # … with 1 more variable: residuals <dbl>, and abbreviated variable names
## #   ¹​air_temp_sd, ²​water_temp, ³​water_temp_sd
  1. Make a histogram of your residuals. Do they look approximately normally distributed?
# histogram
ggplot(data=predictions) + geom_histogram(aes(residuals), bins=25)+ theme_classic()

Pretty okay! Probably would merit a Q-Q plot to confirm, given the small sample size.

  1. Compute the mean of your residuals. Is it basically zero?5 Note: OLS gives you this one for free. The OLS regression line will always go through the middle of your data, as it aims to minimize squared errors. So, you’ll always end up with sample errors that are mean zero. This doesn’t necessarily mean that in your population \(u\) is mean zero!
# mean
mean(predictions$residuals)
## [1] -0.00000000000003905975
  1. Plot residuals against your independent variable, latitude. Does the variance look approximately constant across all values of latitude?
# variance in residuals against latitude
ggplot(predictions) + geom_point(aes(x=latitude, y=residuals)) + theme_classic()

Roughly, yes. Doesn’t look like there’s a systematic increase or decrease in variance, except maybe at the higher end of the latitude distribution. Overall, this looks like pretty strong evidence to support the assumption that the population error is normally distributed, mean zero, and constant variance. Since we are using our sample to justify assumptions we need to make about the population, we can never be fully confident. But this is the best we can do without a census!