EDS 222: Week 7: In-class Lab

{ANSWER KEY}

2023-11-15

Section 0: Getting set up

Load the following packages:

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.5
## ✔ tibble  3.2.1     ✔ dplyr   1.1.3
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
## 
##     bootstrap
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

Section 1: Hypothesis testing in R

The state of North Carolina released to the public a large data set containing information on births recorded in this state. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. ncbirths is a random sample of 1000 cases from this data set, and is provided in the openintro package. We want to evaluate whether there is a difference between weights of babies born to smoker and non-smoker mothers.

First, filter ncbirths for rows where the variable habit is non-missing (this variable indicates whether a mother smokes or not).

Use the is.na() function to specify that you want values of habit that are not missing (where is.na() is FALSE).

# take a look at the documentation to understand variable units, etc.
?ncbirths

# remove missings
head(ncbirths)
## # A tibble: 6 × 13
##    fage  mage mature    weeks premie visits marital gained weight lowbirthweight
##   <int> <int> <fct>     <int> <fct>   <int> <fct>    <int>  <dbl> <fct>         
## 1    NA    13 younger …    39 full …     10 not ma…     38   7.63 not low       
## 2    NA    14 younger …    42 full …     15 not ma…     20   7.88 not low       
## 3    19    15 younger …    37 full …     11 not ma…     38   6.63 not low       
## 4    21    15 younger …    41 full …      6 not ma…     34   8    not low       
## 5    NA    15 younger …    39 full …      9 not ma…     27   6.38 not low       
## 6    NA    15 younger …    38 full …     19 not ma…     22   5.38 low           
## # ℹ 3 more variables: gender <fct>, habit <fct>, whitemom <fct>
ncbirths_complete <- ncbirths %>%
  filter(is.na(habit) == FALSE)

Step 1: Define the null and alternative hypotheses:

Construct a null and an alternative hypothesis that will allow you to evaluate if the birth weight of babies born to smoking mothers is different from the birth weight of babies born to non-smoking mothers.

Answer:

We can write the two hypotheses as:

\[H_{0}: \mu_{nonsmoker} - \mu_{smoker} = 0\] \[H_{A}: \mu_{nonsmoker} - \mu_{smoker} \neq 0\]

Step 2: Collect data and compute the point estimate.

Use the complete version of the ncbirths data to estimate the “point estimate”, which in this case is a difference in means between the two groups of mothers.

mu_nonsmoker = ncbirths_complete %>% 
  filter(habit == "nonsmoker") %>% 
  summarize(mean(weight))

mu_smoker = ncbirths_complete %>% 
  filter(habit == "smoker") %>% 
  summarize(mean(weight))

point_est = as.numeric(mu_nonsmoker - mu_smoker)
print(point_est)
## [1] 0.3155425

This doesn’t look like “no difference”, and it’s consistent with our prior that non-smoking mothers should have higher birth weight babies. But remember that to conduct a hypothesis test we need a measure of variability, not just a measure of the mean.

Step 3: Model the variability of the statistic

Use the definition of the z-score as defined in class to construct a test statistic for this hypothesis by hand. Don’t worry, we’ll let R do it all for us in a minute.

Recall the definition of the z-score for hypothesis testing:

\[z_{score}=\frac{\text { point estimate }-\text { null value }}{S E}\] Recall how we compute a standard error of a difference in means:

\[SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s^2_2}{n_2}}\] First let’s compute the SE by hand:

n1 = ncbirths_complete %>% filter(habit == "nonsmoker") %>% count()
n2 = ncbirths_complete %>% filter(habit == "smoker") %>% count()
s1 = ncbirths_complete %>% filter(habit == "nonsmoker") %>% summarize(sd(weight, na.rm = TRUE))
s2 = ncbirths_complete %>% filter(habit == "smoker") %>% summarize(sd(weight, na.rm = TRUE))
SE = as.numeric(sqrt(s1^2/n1 + s2^2/n2))
print(SE)
## [1] 0.1337605

And now the test statistic:

zscore = (point_est - 0)/SE
print(zscore)
## [1] 2.359011

Can you explain the z-score in words? What does this suggest to you about the evidence against the null?

Answer: the observed difference in birth weights is 2.36 standard deviations above the null of zero difference.

Step 4: Quantify the probability that your sample statistic differs from the null

Use the z-score calculated in step 3 to compute the p-value which is is the probability of getting a point estimate at least as extreme as ours if the null hypothesis were true: \[p \text { - value }=\operatorname{Pr}(Z<-|z| \text { or } Z>|z|)=2 * \operatorname{Pr}(Z>|z|)\] Make use of the function pnorm() to access the normal distribution.1 Recall: pnorm() gives you the probability mass below a certain cutoff in a probability distribution with a mean and standard deviation you can control. You can use lower.tail=FALSE to get the mass above a given cutoff.

p_val = 2 * pnorm(zscore, lower.tail=FALSE)
print(p_val)
## [1] 0.01832372
# alternative method: use the point estimate on a normal distribution with standard deviation SE
p_val_alt = 2 * pnorm(point_est, mean = 0, sd = SE, lower.tail=FALSE)
print(p_val_alt)
## [1] 0.01832372

Step 5: Evaluate whether you can reject or fail to reject your null hypothesis

Use the p_value determined in step 4 to evaluate whether you can reject or fail to reject the null hypothesis at a 95% confidence level (i.e., \(\alpha = .05\)). State precisely the conclusion of your hypothesis test.

Answer: Since \(p-value = 0.018 < 0.05\) we reject the null that there is no difference in the birth weight of babies born to smoking versus non-smoking mothers. We can say there is a statistically significant difference (at the 5% significance level) in baby birth weight across smoking and non-smoking mothers.

Using t.test() to implement a hypothesis test in R

Let’s repeat the above steps using the canned function in R that allows you to perform hypothesis tests for comparing one or two means.

Again, we want to know if there is a statistically significant difference in baby birth weight across smoking and non-smoking mothers.

First, take a look at t.test() documentation to see how the function works.

?t.test

We can implement our t-test in multiple ways using t.test(). First, let’s use the formula object, as I think this is the nicest approach. To use this method, we first enter the variable we are interested in evaluating means of (i.e., weight), then use a tilde ~ like in regression analysis, followed by the variable indicating the groups for which we are testing for differences in means (habit).

t.test(weight ~ habit, data = ncbirths_complete)
## 
##  Welch Two Sample t-test
## 
## data:  weight by habit
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
##  0.05151165 0.57957328
## sample estimates:
## mean in group nonsmoker    mean in group smoker 
##                7.144273                6.828730

Note that we obtain nearly the same p-value and test-statistic (shown as t) as we did above, although it’s slightly different as this is using the t-distribution and we used the normal.2 By the time \(n=1000\), the t-distribution and normal are very very similar.

Another way to execute this test is to pass the two vectors of data into t.test() as the first two arguments. This works the same way, it’s just uglier.

t.test(ncbirths_complete$weight[ncbirths_complete$habit=="nonsmoker"], ncbirths_complete$weight[ncbirths_complete$habit=="smoker"] )
## 
##  Welch Two Sample t-test
## 
## data:  ncbirths_complete$weight[ncbirths_complete$habit == "nonsmoker"] and ncbirths_complete$weight[ncbirths_complete$habit == "smoker"]
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05151165 0.57957328
## sample estimates:
## mean of x mean of y 
##  7.144273  6.828730

Finally, note that you do not want to enter the difference in the means directly into t.test(), since it will treat your second mean as a constant and not a random variable, giving you far too much confidence in your point estimate.

To see this, try entering \(\mu_{nonsmoker} - \mu_{smoker}\) directly into the function:

t.test(ncbirths_complete$weight[ncbirths_complete$habit=="nonsmoker"] - ncbirths_complete$weight[ncbirths_complete$habit=="smoker"] )
## Warning in ncbirths_complete$weight[ncbirths_complete$habit == "nonsmoker"] - :
## longer object length is not a multiple of shorter object length
## 
##  One Sample t-test
## 
## data:  ncbirths_complete$weight[ncbirths_complete$habit == "nonsmoker"] - ncbirths_complete$weight[ncbirths_complete$habit == "smoker"]
## t = 4.3656, df = 872, p-value = 0.0000142
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.1699302 0.4475269
## sample estimates:
## mean of x 
## 0.3087285

Look how low your p-value is! And also note that R is telling you it’s running a one-sample test, which we know we don’t want because both variables are random (birth weight for nonsmoking mothers and birth weight for smoking moms).

Section 2: Confidence Intervals

The above analysis revealed a statistically significant difference in birth weight between the two groups of mothers. Is this a meaningful difference? P-values tell us nothing about whether it’s an effect size we should be concerned about or not!

Here, we’ll build a confidence interval around our point estimate. Recall that a confidence interval gives us a range of values that is likely to contain the true population parameter.

Construct a 95% confidence interval by hand

First, we’ll do this by hand. Recall the formula for a confidence interval:

\[ \text{point_estimate} \pm z_{\alpha/2}*SE\] What is \(z_{\alpha/2}\)? It is the z-score for which there is an \(\alpha\) probability of observing a point estimate as extreme as yours.

We would like to construct a 95% confidence interval. This means we want a 5% probability of observing a point estimate as extreme as ours. What z-score gives us this 5%?

It must be close to 2, since we know the 68-95-99.7 rule is a close rule of thumb…to get the exact number, we can use the qnorm() function (or look it up in a z-table).

qnorm() tells us the quantile of the normal distribution, which is what we’re after. Note that we will pass qnorm() the 2.5% quantile, since we have a symmetric distribution and we want a 5% probability the z-score is at least as far from the null (in either direction) as our point estimate is.

crit_val = qnorm(0.025, lower.tail=FALSE)
print(crit_val)
## [1] 1.959964

Now we can construct the 95% confidence interval:

ci_lower = round(point_est - crit_val*SE, 2)
ci_upper = round(point_est + crit_val*SE, 2)

print(paste0("95% probability that [", ci_lower, " ,", ci_upper, "] contains the difference in birth weights across smoking and non smoking mothers."))
## [1] "95% probability that [0.05 ,0.58] contains the difference in birth weights across smoking and non smoking mothers."

Note that you got this output for free from t.test() earlier. You can change the confidence level reported by t.test() by using the conf.level argument.

Use t.test() to compute the 90% confidence interval around your point estimate. Does the interval get wider or tighter? Why?

t.test(weight ~ habit, data = ncbirths_complete, conf.level = 0.90)
## 
##  Welch Two Sample t-test
## 
## data:  weight by habit
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 90 percent confidence interval:
##  0.09432986 0.53675506
## sample estimates:
## mean in group nonsmoker    mean in group smoker 
##                7.144273                6.828730