EDS 222: Assignment 02 (due: Oct 20, 5pm)

{ANSWER KEY}

2023-10-21

Question 1: Probability density functions in R

R has many built-in functions that let you describe, analyze, and sample from common probability density functions. For example, if you type ?stats::Normal in your console, you’ll see documentation on all the functions relevant to the normal distribution.1 Recall that the normal distribution is a family of distributions that are symmetric and do not have long tails. They each have different means \(\mu\) and standard deviations \(\sigma\). These functions include:

Question 1.1

x = seq(-4, 4, 0.01)

Use dnorm() to compute the density of the normal pdf for all values in the x vector generated above, using \(\mu = 0\) and \(\sigma = 1\). Use geom_polygon(), geom_line(), or geom_point() (take your pick) to plot this pdf over the support given in x.

Answer:

# Step 1: Define function for generating dnorm() for our sequence
combine_dfs <- 
  function(i, mean = 0, sd = 1) {
    data_frame(dnorm = dnorm(i, mean = mean, sd = sd)) %>%
    mutate(x = i)
  }

# Step 2: Run function
df <- 
  map_df(x, ~combine_dfs(.))

# Step 3: Generate Graph
df %>% 
  ggplot() +
  geom_line(aes(x = x, y = dnorm)) +
  theme_bw() +
  labs(x = "Values of `x`", 
       y = "Density of the normal PDF") 

Question 1.2

Use the densities you generated in 1.1 to calculate the probability that a random variable distributed normally with mean 0 and standard deviation 1 falls between -2 and 2.2 Hint: Remember that \[ Pr(A\leq x \leq B) = \int_A^B f(x)dx \] where the integral is a fancy way to tell you to sum up \(f(x)\) over values of \(x\) from \(A\) to \(B\).

Answer:

# using the densities you already computed -- cumsum is a cumulative sum function
subset <- df %>% filter(x<=2 & x>=-2)
tail(cumsum(subset$dnorm), n=1)
## [1] 95.50378
# or, can use the cumulative probability function pnorm
(pnorm(2, mean = 0, sd = 1) - pnorm(-2, mean = 0, sd = 1)) * 100
## [1] 95.44997

Note that you could have answered this question by summing up the densities you computed above, or by using the pnorm() function directly. Either works – the only reason these are not exactly the same answer (they are close!) is because our densities are only evaluated at discrete intervals of 0.01. Therefore, pnorm() is more accurate in this case.

Question 1.3

Suppose \(\sigma=2\) instead. Qualitatively, how would your answer to Question 1.2 change? Why?

Answer (code not necessary, just shown for completeness):

# Step 2: Run function, change standard deviation argument
df2 <- 
  map_df(x, ~combine_dfs(., mean = 0, sd = 2))

# Step 3: Generate Graph
df2 %>% 
  ggplot() +
  geom_line(aes(x = x, y = dnorm)) +
  theme_bw() +
  labs(x = "Values of `x`", 
       y = "Density of the normal PDF") 

# using the densities you already computed -- cumsum is a cumulative sum function
subset2 <- df2 %>% filter(x<=2 & x>=-2)
tail(cumsum(subset2$dnorm), n=1)
## [1] 68.38983
# or, can use the cumulative probability function pnorm
(pnorm(2, mean = 0, sd = 2) - pnorm(-2, mean = 0, sd = 2)) * 100
## [1] 68.26895

The probability that this random variable falls into the range -2 to 2 is lower with \(\sigma=2\) than it was with \(\sigma=1\) because a higher standard deviation means the distribution has wider spread. We can see this wider spread in our graph, as the mass of the distribution has shifted toward the tails. Therefore, with higher standard deviation there is more mass in the probability density function at lower and at higher values, and less mass in the center, between -2 and 2. This means it is less likely that this random variable falls into this range: 99% probability with \(\sigma=1\) and just 68% probability with \(\sigma=2\).

Question 1.4

An analogous set of functions computes densities and probabilities for the log normal distribution. These functions are dlnorm() and plnorm() and operate as above for the normal distribution functions.

Use plnorm() under default parameters to compute the probability that a random variable distributed log normal takes on a value above 2. Use pnorm() to compute the corresponding probability for the normal distribution under default parameters. Why are these values so different?

Answer:

# log normal
(1-plnorm(2, mean = 0, sd = 1)) * 100
## [1] 24.41086
# normal
(1-pnorm(2, mean = 0, sd = 1)) * 100
## [1] 2.275013

These results tell us that there is a 24.4% probability that a random variable distributed log normal takes on a value above 2, but just a 2.3% probability that a random variable distributed standard normal takes on a value above 2. This makes sense because the log normal distribution has a long right tail, with a lot of probability mass above 2, while the normal distribution is symmetric and does not have a right tail.

Question 2: Climate summary statistics

In the following questions, you’ll be working with climate data from Colombia. These data were obtained from the ERA5 database, a product made available by the European Centre for Medium-Range Weather Forecast. The high-resolution hourly gridded data were aggregated to the municipality by month level – that is, each observation in these data report a monthly average temperature value and a monthly cumulative precipitation value for one of the 1,123 municipalities across the country.3 Note: The computational techniques we use to go from raw, spatial, gridded data to a tabular dataset at an administrative level are really valuable for environmental data science. Between Ruth and I, we’re hoping to cover some of these topics later in the quarter!

These data – stored in colombia_climate.csv – cover all municipalities for the period 1996 to 2015. Climate scientists tend to describe the “climate” of a location as the probability density function of a large set of climate variables over about a 30 year period. We only have 20 years, but we will consider our sample as randomly drawn temperature and precipitation realizations from the “climate” p.d.f. over this period. We are aiming to draw conclusions about the Colombian climate using this sample of temperature and precipitation observations.

Question 2.1

Read these data into R using the read.csv() function.4 See the README.rtf file for details on the variables in colombia_climate.csv.

For each of the temperature and rainfall variables, create a histogram that shows the distribution of the variable across the entire sample. For each variable, answer the following questions:

Answer:

df = read.csv(file.path(datadir,"colombia_climate.csv"))

t <- ggplot(data = df) + geom_histogram(aes(x=temperature), fill="lightcoral",bins = 50) +
  theme_bw() + theme(
  line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 14),
  axis.text.y = element_blank(),
  legend.position = "none") +
  labs(x = "Temperature (C)") 

p <- ggplot(data = df) + geom_histogram(aes(x=precip), fill="seagreen", bins = 50) +
  theme_bw() + theme(
  line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 14),
  axis.text.y = element_blank(),
  legend.position = "none") +
  labs(x = "Precipitation (mm)") 

plot_grid(t,p)

Both the distributions are skewed, since neither is symmetric. Precipitation has a long right tail, while temperature does not have a tail. Neither distribution looks normally distributed. Temperature is bimodal whereas precipitation is unimodal.

Question 2.2

Given your answers to 2.1 above, do you expect the mean of temperature to differ from the median? Is it likely to be about the same, smaller, or larger? What about precipitation?

Answer (code and results not necessary, shown for completeness only):

df = read.csv(file.path(datadir,"colombia_climate.csv"))

mean_temperature <- df %>% 
  pull(temperature) %>% 
  mean() %>%
  signif(6)

mean_precip <- df %>% 
  pull(precip) %>% 
  mean() %>%
  signif(6)

median_temperature <- df %>% 
  pull(temperature) %>% 
  median() %>%
  signif(6)

median_precip <- df %>% 
  pull(precip) %>% 
  median() %>%
  signif(6)

t <- ggplot(data = df) + geom_histogram(aes(x=temperature), fill="lightcoral",bins = 50) +
  theme_bw() + theme(
  line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 14),
  axis.text.y = element_blank(),
  legend.position = "none") +
  labs(x = "Temperature (C)") + 
  geom_vline(xintercept = mean_temperature, size = 0.5, linetype="solid") +
  geom_vline(xintercept = median_temperature, size = 0.5, linetype="dotted") + 
  ggtitle("dotted = median, solid = mean")

p <- ggplot(data = df) + geom_histogram(aes(x=precip), fill="seagreen", bins = 50) +
  theme_bw() + theme(
  line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 14),
  axis.text.y = element_blank(),
  legend.position = "none") +
  labs(x = "Precipitation (mm)") + 
  geom_vline(xintercept = mean_precip, size = 0.5, linetype="solid") +
  geom_vline(xintercept = median_precip, size = 0.5, linetype="dotted") + 
  ggtitle("dotted = median, solid = mean")

plot_grid(t,p)

In both distributions, the mean is likely to exceed the median. This is because there is asymmetry with right skew, and larger values are pulling the mean up, while the median is invariant to the length of the right tails. Recall that medians and means are only identical in perfectly symmetric distributions.

Question 2.3

Anthropogenic climate change is expected to raise temperatures across Colombia, increase total precipitation, and increase variability in precipitation. Compute the mean, the median, and the standard deviation of each climate variable in:

Put your summary statistics into a table (or two tables, whatever is easiest). Are the changes you see between the pre-2005 and post-2005 periods consistent with climate change? Explain why.

Answer:

df %>% 
  filter(year <= 2005) %>% 
  select(temperature, precip) %>% 
  summarize_all(funs(mean, median, sd)) %>% 
  gather(key, value) %>% 
  separate(key, c("variable", "stat")) %>%
  spread(stat, value) %>% 
  gt() %>%
  tab_header(
    title = md("Climate in Colombia"),
    subtitle = md("Pre 2005 summary statistics")
  )
Climate in Colombia
Pre 2005 summary statistics
variable mean median sd
precip 254.87686 204.95326 215.679753
temperature 20.19685 19.04034 3.453797
df %>% 
  filter(year > 2005) %>% 
  select(temperature, precip) %>% 
  summarize_all(funs(mean, median, sd)) %>% 
  gather(key, value) %>% 
  separate(key, c("variable", "stat")) %>%
  spread(stat, value) %>% 
  gt() %>%
  tab_header(
    title = md("Climate in Colombia"),
    subtitle = md("Post 2005 summary statistics")
  )
Climate in Colombia
Post 2005 summary statistics
variable mean median sd
precip 313.35358 246.22357 252.652797
temperature 20.18779 18.97515 3.500039

These summary statistics show that temperature has not risen between these two periods, which is inconsistent with the general warming trend under climate change. However, precipitation has increased, as seen by the rising mean and median across the two tables. Precipitation has also become more variable, as seen by the larger standard deviation in the post-2005 period.

Question 2.4

The histograms and summary statistics should make you concerned that these data are not normally distributed. As we will show later in the course, it’s often very helpful to have normally distributed data before we do things like linear regressions or hypothesis testing. Here, let’s use a Q-Q plot to assess the normality of our sample data.

Answer:

tqq <- ggplot(data = df) +
geom_qq(aes(sample=temperature), color = "lightcoral", size=3) +
  geom_qq_line(aes(sample=temperature), color="grey") +
  xlab("Normal distribution quantiles") +
  ylab("Sample quantiles") + theme_bw() + 
  theme(
    line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 10),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size=10,face="bold"),
  plot.title = element_blank(),
  legend.position = "none")

pqq <- ggplot(data = df) +
geom_qq(aes(sample=precip), color = "seagreen", size=3) +
  geom_qq_line(aes(sample=precip), color="grey") +
  xlab("Normal distribution quantiles") +
  ylab("Sample quantiles") + theme_bw() + 
  theme(
    line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 10),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size=10,face="bold"),
  plot.title = element_blank(),
  legend.position = "none")

plot_grid(tqq, pqq)

Neither looks particularly well-approximated by a normal distribution, given the divergence between scatter points and the 45 degree line. However, precipitation looks particularly non-normal with large divergence from the expected quantiles.

Question 2.5

When our sample observations are not normally distributed, we often rely on nonlinear transformations6 Any mathematical operation that is a nonlinear function of the underlying variable can be considered a “nonlinear transformation”. For example, \(x^2\) and \(log(x)\) are both nonlinear transformations. to reshape our data. If we compute a nonlinear transformation on our underlying data and they then look closer to normal, we can use this transformed version of our variable in later statistical analysis.

Because we tend to see a lot of variables in the world that follow the lognormal distribution, a very common nonlinear transformation is the natural logarithm. Transform the precipitation data by taking the natural logarithm. Then remake your Q-Q plot – does your variable (defined as log(precip)) now look closer to normally distributed? What can you learn about where the data diverge from the normal distribution?

Answer:

df$lnprcp = log(df$precip)
pqq2 <- ggplot(data = df) +
geom_qq(aes(sample=lnprcp), color = "seagreen", size=3) +
  geom_qq_line(aes(sample=lnprcp), color="grey") +
  xlab("Normal distribution quantiles") +
  ylab("Sample quantiles") + theme_bw() + 
  theme(
    line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 10),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size=10,face="bold"),
  plot.title = element_blank(),
  legend.position = "none")

pqq2

Taking the logarithm of precipitation improves the match between the resulting distribution and the normal distribution. However, log(precip) still diverges strongly from the normal distribution quantiles at the low end of the distribution. That is, log(precip) is too low compared to what it would be if it were normally distributed.