Start by loading packages you need. This week we will be using the
Plum Island Ecosystem (PIE) LTER fiddler crabs dataset from the
lterdatasampler
package. Please install the package if you
have not already done so.
## Loading required package: lterdatasampler
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
##
## Attaching package: 'openintro'
## The following object is masked from 'package:gt':
##
## sp500
##
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
##
## bootstrap
Using similar concepts from a histogram or a density plot, we can use a scatter plot to visualize the relationship between two variables. It is always a good idea to look at a scatter plot when trying to establish the relationship between two variables. A scatter plot is a simple two-dimensional plot in which the two coordinates of each dot represent the values of two variables measured on a single observation. The dependent or response variable is typically on the vertical axis and the independent or explanatory variable on the horizontal axis.
Here we will use a data frame on Atlantic marsh fiddler crabs that contains records from summer 2016 of fiddler crab body size and other variables obtained across three Long Term Ecological Research (LTER) sites across the Atlantic coast (more information available here). Specifically, we are going to explore what is called Bergmann’s rule:
“One of the best-known patterns in biogeography is Bergmann’s rule. It predicts that organisms at higher latitudes are larger than ones at lower latitudes. Many organisms follow Bergmann’s rule, including insects, birds, snakes, marine invertebrates, and terrestrial and marine mammals.” (Johnson et al., 2019)
We will try to see if (i) latitude associated with the size of PIE crabs, as Bergmann’s rule would suggest; and (ii) latitude is associated with water temperature, as this may help us understand why Bergmann’s rule holds.
The pie_crab
data sample can be called directly once
you’ve loaded the lterdatasampler
package. Take a look at
your variable names and make a scatter plot showing (i) crab body size
as a function of latitude and (ii) water temperature as a function of
latitude. That is, treat latitude as your independent variable.
summary(pie_crab) #note: ?pie_crab gives you helpful information like units for each variable
## date latitude site size
## Min. :2016-07-24 Min. :30.00 Length:392 Min. : 6.64
## 1st Qu.:2016-07-28 1st Qu.:34.00 Class :character 1st Qu.:12.02
## Median :2016-08-01 Median :39.10 Mode :character Median :14.44
## Mean :2016-08-02 Mean :37.69 Mean :14.66
## 3rd Qu.:2016-08-09 3rd Qu.:41.60 3rd Qu.:17.34
## Max. :2016-08-13 Max. :42.70 Max. :23.43
## air_temp air_temp_sd water_temp water_temp_sd
## Min. :10.29 Min. :6.391 Min. :13.98 Min. :4.838
## 1st Qu.:12.05 1st Qu.:8.110 1st Qu.:14.33 1st Qu.:6.567
## Median :13.93 Median :8.410 Median :17.50 Median :6.998
## Mean :15.20 Mean :8.654 Mean :17.65 Mean :7.252
## 3rd Qu.:18.63 3rd Qu.:9.483 3rd Qu.:20.54 3rd Qu.:7.865
## Max. :21.79 Max. :9.965 Max. :24.50 Max. :9.121
## name
## Length:392
## Class :character
## Mode :character
##
##
##
ggplot(data = pie_crab, aes(y = size, x = latitude)) +
geom_point() +
labs(x = "Latitude",
y = "Size of crab carapace (mm)")+ theme_classic()
ggplot(data = pie_crab, aes(y = water_temp, x = latitude)) +
geom_point() +
labs(x = "Latitude",
y = "Water temperature (Celsius)")+ theme_classic()
Since many crab samples were taken at each site, we see many data points for each latitude. It looks like crab size and latitude are somewhat positively correlated, with a lot of noise. Water temperature and latitude are negatively correlated and the relationship appears stronger. Generally, crab size increases as latitude increases, following Bergmann’s rule.
Visualizing values of the correlation between two quantities from their scatter plot can be very difficult. Research has shown that people’s perception of the strength of these relationships can be influenced by design choices like the size of the graph and the aspect ratio,1 The aspect ratio is the ratio of the width to the height of a figure or image. among other features. Hence, to more objectively report the relationship between two variables, we use correlation, computing the “correlation coefficient” \(r\).2 Note: we just called this “correlation” in lecture, but people use “correlation” and “correlation coefficient” interchangeably. In most cases, this refers to the Pearson’s correlation coefficient, as defined in lecture. There are other forms of correlation that are less common, such as rank correlation, but we won’t use them in this class. Here, we will compute correlation coefficients for the two relationships plotted above.
Exercise:
cor()
function in R
to calculate
the correlation coefficient between the two variables in (i) and the two
in (ii). How close was your guess to the true answer?# (i) size & latitude:
pie_crab %>% summarize(crab_lat = cor(size, latitude))
## # A tibble: 1 × 1
## crab_lat
## <dbl>
## 1 0.590
# (ii) water temperature & latitude:
pie_crab %>%
summarize(temp_lat = cor(water_temp, latitude))
## # A tibble: 1 × 1
## temp_lat
## <dbl>
## 1 -0.957
To estimate the linear relationship between the variables explored
above, we’ll use lm()
to fit a regression model using
Ordinary Least Squares (OLS). The lm()
function has two
required arguments:
First let’s consider the relationship between crab size and latitude. We assume there is a linear population relationship, specified as:
\[\text{crab size} = \beta_0 + \beta_1 \text{latitude} + \epsilon\]
Exercise:
Use lm()
to estimate \(\hat\beta_0\) and \(\hat\beta_1\) using this sample of data.
Use summary(lm())
or gt()
to visualize the
regression results.
Interpret your two coefficients, paying careful attention to units.
Is your estimate of \(\beta_0\) directly interpretable? What might you be concerned about when interpreting \(\beta_0\)?
Answers:
# (i) latitude is associated with the size of crabs:
lm(size ~ latitude, data = pie_crab) %>%
tidy %>%
gt()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.6244214 | 1.27405099 | -2.844801 | 0.00467819238835985844876530492797428451012820 |
latitude | 0.4851191 | 0.03359292 | 14.441111 | 0.00000000000000000000000000000000000003602897 |
#alternatively, can write: summary(lm(size ~ latitude, data = pie_crab))
The coefficient on latitude tells us that for each degree increase in latitude, we expect crab size to increase by 0.49mm on average.
The intercept estimate tells us that on average, crabs living on the equator, i.e. at latitude=0, have a size of -3.62mm. Note that this doesn’t make sense! Do these crabs even live at the equator? This is a case where you want to be careful about extrapolating. \(\hat\beta_0\) is helping us fit the data well in the range of our sample, but you shouldn’t go making predictions about crabs in the equatorial region using this model estimated from the Atlantic coast of the US.
Exercises, continued
Because crab size is increasing in latitude and water temperature is decreasing in latitude in the sample region, we can infer that as water temperature decreases, crab size increases. This is a statistical correlation we could estimate, but it is not causal evidence. You might think that colder water temperature is selecting for larger organisms by favoring larger bodies as they have lower surface-area-to-volume ratio (the heat conservation hypothesis), suggesting a causal link. However, for this hypothesis to be true, we should not see this relationship in “cold-blooded” animals, but many of them also follow this relationship.
“Bergmann originally hypothesized that the organisms he studied, birds, were larger in the colder, higher latitudes due to heat-conservation. But the heat-conservation hypothesis relies on internal regulation of body temperature and therefore does not apply to ectotherms,3 Commonly referred to as “cold-blooded animals”, ectotherms are animal that depends on external sources of body heat; such as fishes, amphibians, reptiles, and invertebrates. some of which also follow Bergmann’s rule. There is likely no universal mechanism underpinning Bergmann’s rule, regardless of ecto- or endothermy. As a result, other mechanisms have been proposed to explain Bergmann’s rule, including the starvation-resistant hypothesis, the diet-quality hypothesis, the enemy hypothesis, the resource rule, seasonality hypothesis, and the temperature–size rule (Johnson et al., 2019).”
Here are analogous results for the relationship between latitude and water temperature:
# (ii) latitude is associated with water temperature:
lm(water_temp ~ latitude, data = pie_crab) %>%
tidy %>%
gt()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 46.7782261 | 0.44896940 | 104.19023 | 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008505623 |
latitude | -0.7729139 | 0.01183798 | -65.29102 | 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004744672936810910431529715069308409588967560449162833510838002532630119373404108699 |
For each additional degree increase in latitude, we expect water temperature to fall by 0.77 degrees Celsius on average.
Here, we visualize our simple linear regression models above using the scatter plots from Section 1.
Exercise:
Use the geom_smooth()
function with method argument set
to lm
, which stands for “linear model”, to add the best fit
OLS line to your scatter plots from Section 1. Use the option
se = FALSE
to turn off estimates of uncertainty. Don’t
worry, we’ll dig into those soon.
Answers:
# (i) latitude is associated with the size of crabs:
ggplot(data = pie_crab, aes(x = latitude, y = size)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Latitude",
y = "Size of crab carapace (mm)") + theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
# (ii) latitude is associated with water temperature:
ggplot(data = pie_crab, aes(y = water_temp, x = latitude)) +
geom_point() +
geom_smooth(method = "lm", formula= y~x, se=FALSE) +
labs(x = "Latitude",
y = "Water temperature (Celsius)") + theme_classic()
The difference between what was observed in the data and what was predicted from the regression line is what we call an “error” or a “residual.” Observations that lie above the regression line exceeded their predicted value and have a positive residual. Observations that lie below the regression line are less than their predicted value and have a negative residual.
Recall that one of our assumptions needed in order to prove that OLS has the lowest variance of all unbiased estimators was that the population error \(u\) is: normally distributed, has a mean of zero, and has constant variance. While these assumptions are not directly testable because we can never observe \(u\), we can use our regression residuals \(e\) to assess the plausibility of these assumptions in the population.
Exercise:
Using the regression of crab size on latitude:
pie_crab
dataframe a column of predicted
total lengths \(\hat y_i\) and a column
of residuals \(e_i = y_i-\hat y_i\).4 Hint: In dplyr
you can use
add_predictions(mod)
where mod
is a regression
model to generate a new variable containing your model’s predicted
values.# regression
model_1 <- lm(size ~ latitude, data = pie_crab)
# create predictions and residuals
predictions <- pie_crab %>% add_predictions(model_1) %>%
mutate(residuals = size-pred)
# take a look to understand what just happened
head(predictions)
## # A tibble: 6 × 11
## date latitude site size air_temp air_tem…¹ water…² water…³ name pred
## <date> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2016-07-24 30 GTM 12.4 21.8 6.39 24.5 6.12 Guan… 10.9
## 2 2016-07-24 30 GTM 14.2 21.8 6.39 24.5 6.12 Guan… 10.9
## 3 2016-07-24 30 GTM 14.5 21.8 6.39 24.5 6.12 Guan… 10.9
## 4 2016-07-24 30 GTM 12.9 21.8 6.39 24.5 6.12 Guan… 10.9
## 5 2016-07-24 30 GTM 12.4 21.8 6.39 24.5 6.12 Guan… 10.9
## 6 2016-07-24 30 GTM 13.0 21.8 6.39 24.5 6.12 Guan… 10.9
## # … with 1 more variable: residuals <dbl>, and abbreviated variable names
## # ¹air_temp_sd, ²water_temp, ³water_temp_sd
# histogram
ggplot(data=predictions) + geom_histogram(aes(residuals), bins=25)+ theme_classic()
Pretty okay! Probably would merit a Q-Q plot to confirm, given the small sample size.
# mean
mean(predictions$residuals)
## [1] -0.00000000000003905975
# variance in residuals against latitude
ggplot(predictions) + geom_point(aes(x=latitude, y=residuals)) + theme_classic()
Roughly, yes. Doesn’t look like there’s a systematic increase or decrease in variance, except maybe at the higher end of the latitude distribution. Overall, this looks like pretty strong evidence to support the assumption that the population error is normally distributed, mean zero, and constant variance. Since we are using our sample to justify assumptions we need to make about the population, we can never be fully confident. But this is the best we can do without a census!