Load the following packages:
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.2.1 ✔ dplyr 1.1.3
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
R
The state of North Carolina released to the public a large data set
containing information on births recorded in this state. This data set
has been of interest to medical researchers who are studying the
relation between habits and practices of expectant mothers and the birth
of their children. ncbirths
is a random sample of 1000
cases from this data set, and is provided in the openintro
package. We want to evaluate whether there is a difference between
weights of babies born to smoker and non-smoker mothers.
First, filter ncbirths
for rows where the variable
habit
is non-missing (this variable indicates whether a
mother smokes or not).
Use the is.na() function to specify that you want values of habit that are not missing (where is.na() is FALSE).
# take a look at the documentation to understand variable units, etc.
?ncbirths
# remove missings
head(ncbirths)
## # A tibble: 6 × 13
## fage mage mature weeks premie visits marital gained weight lowbirthweight
## <int> <int> <fct> <int> <fct> <int> <fct> <int> <dbl> <fct>
## 1 NA 13 younger … 39 full … 10 not ma… 38 7.63 not low
## 2 NA 14 younger … 42 full … 15 not ma… 20 7.88 not low
## 3 19 15 younger … 37 full … 11 not ma… 38 6.63 not low
## 4 21 15 younger … 41 full … 6 not ma… 34 8 not low
## 5 NA 15 younger … 39 full … 9 not ma… 27 6.38 not low
## 6 NA 15 younger … 38 full … 19 not ma… 22 5.38 low
## # ℹ 3 more variables: gender <fct>, habit <fct>, whitemom <fct>
ncbirths_complete <- ncbirths %>%
filter(is.na(habit) == FALSE)
Construct a null and an alternative hypothesis that will allow you to evaluate if the birth weight of babies born to smoking mothers is different from the birth weight of babies born to non-smoking mothers.
Answer:
We can write the two hypotheses as:
\[H_{0}: \mu_{nonsmoker} - \mu_{smoker} = 0\] \[H_{A}: \mu_{nonsmoker} - \mu_{smoker} \neq 0\]
Use the complete version of the ncbirths
data to
estimate the “point estimate”, which in this case is a difference in
means between the two groups of mothers.
mu_nonsmoker = ncbirths_complete %>%
filter(habit == "nonsmoker") %>%
summarize(mean(weight))
mu_smoker = ncbirths_complete %>%
filter(habit == "smoker") %>%
summarize(mean(weight))
point_est = as.numeric(mu_nonsmoker - mu_smoker)
print(point_est)
## [1] 0.3155425
This doesn’t look like “no difference”, and it’s consistent with our prior that non-smoking mothers should have higher birth weight babies. But remember that to conduct a hypothesis test we need a measure of variability, not just a measure of the mean.
Use the definition of the z-score as defined in class to construct a
test statistic for this hypothesis by hand. Don’t
worry, we’ll let R
do it all for us in a minute.
Recall the definition of the z-score for hypothesis testing:
\[z_{score}=\frac{\text { point estimate }-\text { null value }}{S E}\] Recall how we compute a standard error of a difference in means:
\[SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s^2_2}{n_2}}\] First let’s compute the SE by hand:
n1 = ncbirths_complete %>% filter(habit == "nonsmoker") %>% count()
n2 = ncbirths_complete %>% filter(habit == "smoker") %>% count()
s1 = ncbirths_complete %>% filter(habit == "nonsmoker") %>% summarize(sd(weight, na.rm = TRUE))
s2 = ncbirths_complete %>% filter(habit == "smoker") %>% summarize(sd(weight, na.rm = TRUE))
SE = as.numeric(sqrt(s1^2/n1 + s2^2/n2))
print(SE)
## [1] 0.1337605
And now the test statistic:
zscore = (point_est - 0)/SE
print(zscore)
## [1] 2.359011
Can you explain the z-score in words? What does this suggest to you about the evidence against the null?
Answer: the observed difference in birth weights is 2.36 standard deviations above the null of zero difference.
Use the z-score calculated in step 3 to compute the p-value which is
is the probability of getting a point estimate at least as extreme as
ours if the null hypothesis were true: \[p
\text { - value }=\operatorname{Pr}(Z<-|z| \text { or } Z>|z|)=2 *
\operatorname{Pr}(Z>|z|)\] Make use of the function
pnorm()
to access the normal distribution.1 Recall: pnorm()
gives you the probability
mass below a certain cutoff in a probability distribution with a mean
and standard deviation you can control. You can use
lower.tail=FALSE
to get the mass above a given cutoff.
p_val = 2 * pnorm(zscore, lower.tail=FALSE)
print(p_val)
## [1] 0.01832372
# alternative method: use the point estimate on a normal distribution with standard deviation SE
p_val_alt = 2 * pnorm(point_est, mean = 0, sd = SE, lower.tail=FALSE)
print(p_val_alt)
## [1] 0.01832372
Use the p_value determined in step 4 to evaluate whether you can reject or fail to reject the null hypothesis at a 95% confidence level (i.e., \(\alpha = .05\)). State precisely the conclusion of your hypothesis test.
Answer: Since \(p-value = 0.018 < 0.05\) we reject the null that there is no difference in the birth weight of babies born to smoking versus non-smoking mothers. We can say there is a statistically significant difference (at the 5% significance level) in baby birth weight across smoking and non-smoking mothers.
t.test()
to implement a hypothesis test in
R
Let’s repeat the above steps using the canned function in
R
that allows you to perform hypothesis tests for comparing
one or two means.
Again, we want to know if there is a statistically significant difference in baby birth weight across smoking and non-smoking mothers.
First, take a look at t.test()
documentation to see how
the function works.
?t.test
We can implement our t-test in multiple ways using
t.test()
. First, let’s use the formula
object,
as I think this is the nicest approach. To use this method, we first
enter the variable we are interested in evaluating means of (i.e.,
weight
), then use a tilde ~
like in regression
analysis, followed by the variable indicating the groups for which we
are testing for differences in means (habit
).
t.test(weight ~ habit, data = ncbirths_complete)
##
## Welch Two Sample t-test
##
## data: weight by habit
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
## 0.05151165 0.57957328
## sample estimates:
## mean in group nonsmoker mean in group smoker
## 7.144273 6.828730
Note that we obtain nearly the same p-value and test-statistic (shown
as t
) as we did above, although it’s slightly different as
this is using the t-distribution and we used the normal.2 By the time \(n=1000\),
the t-distribution and normal are very very similar.
Another way to execute this test is to pass the two vectors of data
into t.test()
as the first two arguments. This works the
same way, it’s just uglier.
t.test(ncbirths_complete$weight[ncbirths_complete$habit=="nonsmoker"], ncbirths_complete$weight[ncbirths_complete$habit=="smoker"] )
##
## Welch Two Sample t-test
##
## data: ncbirths_complete$weight[ncbirths_complete$habit == "nonsmoker"] and ncbirths_complete$weight[ncbirths_complete$habit == "smoker"]
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.05151165 0.57957328
## sample estimates:
## mean of x mean of y
## 7.144273 6.828730
Finally, note that you do not want to enter the
difference in the means directly into t.test()
, since it
will treat your second mean as a constant and not a random variable,
giving you far too much confidence in your point estimate.
To see this, try entering \(\mu_{nonsmoker} - \mu_{smoker}\) directly into the function:
t.test(ncbirths_complete$weight[ncbirths_complete$habit=="nonsmoker"] - ncbirths_complete$weight[ncbirths_complete$habit=="smoker"] )
## Warning in ncbirths_complete$weight[ncbirths_complete$habit == "nonsmoker"] - :
## longer object length is not a multiple of shorter object length
##
## One Sample t-test
##
## data: ncbirths_complete$weight[ncbirths_complete$habit == "nonsmoker"] - ncbirths_complete$weight[ncbirths_complete$habit == "smoker"]
## t = 4.3656, df = 872, p-value = 0.0000142
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.1699302 0.4475269
## sample estimates:
## mean of x
## 0.3087285
Look how low your p-value is! And also note that R
is
telling you it’s running a one-sample test, which we know we don’t want
because both variables are random (birth weight for nonsmoking mothers
and birth weight for smoking moms).
The above analysis revealed a statistically significant difference in birth weight between the two groups of mothers. Is this a meaningful difference? P-values tell us nothing about whether it’s an effect size we should be concerned about or not!
Here, we’ll build a confidence interval around our point estimate. Recall that a confidence interval gives us a range of values that is likely to contain the true population parameter.
First, we’ll do this by hand. Recall the formula for a confidence interval:
\[ \text{point_estimate} \pm z_{\alpha/2}*SE\] What is \(z_{\alpha/2}\)? It is the z-score for which there is an \(\alpha\) probability of observing a point estimate as extreme as yours.
We would like to construct a 95% confidence interval. This means we want a 5% probability of observing a point estimate as extreme as ours. What z-score gives us this 5%?
It must be close to 2, since we know the 68-95-99.7 rule is a close
rule of thumb…to get the exact number, we can use the
qnorm()
function (or look it up in a z-table).
qnorm()
tells us the quantile of the normal
distribution, which is what we’re after. Note that we will pass
qnorm()
the 2.5% quantile, since we have a symmetric
distribution and we want a 5% probability the z-score is at least as far
from the null (in either direction) as our point estimate is.
crit_val = qnorm(0.025, lower.tail=FALSE)
print(crit_val)
## [1] 1.959964
Now we can construct the 95% confidence interval:
ci_lower = round(point_est - crit_val*SE, 2)
ci_upper = round(point_est + crit_val*SE, 2)
print(paste0("95% probability that [", ci_lower, " ,", ci_upper, "] contains the difference in birth weights across smoking and non smoking mothers."))
## [1] "95% probability that [0.05 ,0.58] contains the difference in birth weights across smoking and non smoking mothers."
Note that you got this output for free from t.test()
earlier. You can change the confidence level reported by
t.test()
by using the conf.level
argument.
Use t.test()
to compute the 90% confidence interval
around your point estimate. Does the interval get wider or tighter?
Why?
t.test(weight ~ habit, data = ncbirths_complete, conf.level = 0.90)
##
## Welch Two Sample t-test
##
## data: weight by habit
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 90 percent confidence interval:
## 0.09432986 0.53675506
## sample estimates:
## mean in group nonsmoker mean in group smoker
## 7.144273 6.828730