Unit 4 - Inference For Numerical Variables: Suggested Reading: Openintro Statistics, Suggested Exercises
Suggested exercises:
Part 1 - Comparing two means: 5.1, 5.5, 5.7, 5.9, 5.11
Part 2 - Bootstrapping
Part 3 - Inference with the t-distribution: 5.15, 5.19, 5.21, 5.29, 5.31, 5.33, 5.35
Part 4 - Comparing three or more means (ANOVA): 5.39, 5.41, 5.43, 5.45
LO 1. Describe how bootstrap distributions are constructed, and recognize how they are different
from sampling distributions.
LO 2. Construct bootstrap confidence intervals using one of the following methods:
- Percentile method: The XX% confidence interval is the middle XX% of the bootstrap distribution.
- Standard error method: If the standard error of the bootstrap distribution is known,
and the distribution is nearly normal, the bootstrap interval can also be calculated as
$\bar{x}_{boot} \pm z^\star SE_{boot}$.
LO 3. Recognize that when the bootstrap distribution is extremely skewed and sparse, the bootstrap
confidence interval may not be reliable.
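The two interval methods in LO 2 can be sketched in Python as follows. This is a minimal illustration, not the course's own code: the function names, the example data, the seed, and the choice of the mean as the statistic are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_means(sample, n_sims=1000, seed=0):
    """Resample WITH replacement from the one observed sample; record each mean."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_sims)]

def percentile_interval(boot_stats, level=0.95):
    """Percentile method: keep the middle `level` share of the bootstrap distribution."""
    ordered = sorted(boot_stats)
    cut = int(round((1 - level) / 2 * len(ordered)))  # values trimmed from each tail
    return ordered[cut], ordered[len(ordered) - cut - 1]

def se_interval(boot_stats, z_star=1.96):
    """Standard error method: center +/- z* x SE, for a nearly normal distribution."""
    center = statistics.mean(boot_stats)
    se_boot = statistics.stdev(boot_stats)
    return center - z_star * se_boot, center + z_star * se_boot

# Made-up data, for illustration only.
sample = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]
boot = bootstrap_means(sample)
print(percentile_interval(boot))
print(se_interval(boot))
```

Every bootstrap sample is drawn from the single observed sample, whereas a sampling distribution would require repeated samples from the population. That is the distinction LO 1 asks about.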
Test yourself:
1. How is a bootstrap distribution different from a sampling distribution?
2. If a bootstrap distribution is constructed using 200 simulations, how would we find the
95% bootstrap confidence interval? Hint: Draw a sketch.
3. When is a bootstrap confidence interval not appropriate?
Dr. Cetinkaya-Rundel Data Analysis and Statistical Inference Duke University
Test yourself:
1. The blood pressure of 20 cardiac patients is measured before and after taking a medication. For
a given patient, are the before and after blood pressure measurements dependent (paired)
or independent?
2. A random sample of 100 students was obtained and then randomly assigned to two
equal-sized groups. One group went on a roller coaster while the other went in a simulator at
an amusement park. Afterwards their blood pressure measurements were taken. Are the
measurements dependent (paired) or independent?
and use this standard error in hypothesis testing and confidence intervals comparing means
of independent groups.
LO 7. Recognize that a good interpretation of a confidence interval for the difference between two
parameters includes a comparative statement (mentioning which group has the larger param-
eter).
LO 8. Recognize that a confidence interval for the difference between two parameters that doesn't
include 0 is in agreement with a hypothesis test where the null hypothesis that sets the two
parameters equal to each other is rejected.
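LO 7 and LO 8 can be read together as a small decision rule, sketched below. The function name, the group labels, and the example intervals are hypothetical. The agreement with the hypothesis test assumes the confidence level matches the significance level of a two-sided test (for example, a 95% interval and a significance level of 0.05).

```python
def interpret_difference_interval(lower, upper, first, second):
    """Interpret a confidence interval for (first parameter - second parameter).

    Returns (larger_group, rejects_null): the group with the larger parameter
    (None if the interval includes 0) and whether the matching two-sided test
    of "no difference" would reject the null hypothesis.
    """
    if lower > 0:       # whole interval above 0: first parameter is larger
        return first, True
    if upper < 0:       # whole interval below 0: second parameter is larger
        return second, True
    return None, False  # 0 is a plausible difference: fail to reject

# Made-up intervals, for illustration only.
print(interpret_difference_interval(-3.2, -0.4, "group A", "group B"))
print(interpret_difference_interval(-1.0, 2.5, "group A", "group B"))
```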
Test yourself:
1. Describe how the two sample means test is different from the paired means test, both
conceptually and in terms of the calculation of the standard error.
2. A 95% confidence interval for the difference between the number of calories consumed
by mature and juvenile cats ($\mu_{mat} - \mu_{juv}$) is (80 calories, 100 calories). Interpret this
interval, and determine if it suggests a significant difference between the two means.
LO 12. Use a t-statistic, with degrees of freedom $df = n - 1$, for inference for a population mean using
data from a small sample:
CI: $\bar{x} \pm t^\star_{df} SE$    HT: $T_{df} = \frac{\bar{x} - \mu_0}{SE}$
where $SE = \frac{s}{\sqrt{n}}$ and $\mu_0$ is the null value.
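The one-sample calculation can be sketched in Python as follows. The function name and the data are made up for illustration, and the critical value is passed in as an argument on the assumption that it has been looked up in a t-table (for df = 9 and 95% confidence it is approximately 2.262).

```python
import math
import statistics

def one_sample_t(sample, null_value, t_star):
    """Return (df, SE, confidence interval, T statistic) for a single mean."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # SE = s / sqrt(n)
    df = n - 1
    ci = (x_bar - t_star * se, x_bar + t_star * se)
    t_stat = (x_bar - null_value) / se            # distance from the null value, in SEs
    return df, se, ci, t_stat

# Made-up data: 10 observations, testing a null value of 100.
sample = [98, 102, 101, 97, 105, 99, 100, 103, 96, 104]
df, se, ci, t_stat = one_sample_t(sample, null_value=100, t_star=2.262)
print(df, round(se, 3), ci, round(t_stat, 3))
```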
LO 13. Use a t-statistic, with degrees of freedom $df = \min(n_1 - 1, n_2 - 1)$, for inference for the difference
between two population means using data from two small samples, where
$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.
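A minimal Python sketch of the two-sample calculation follows. The function name and the data are hypothetical, and the null difference defaults to 0, which is an assumption that fits the usual "no difference" null hypothesis.

```python
import math
import statistics

def two_sample_t(sample1, sample2, null_diff=0.0):
    """Return (df, SE, T statistic) for the difference of two independent means."""
    n1, n2 = len(sample1), len(sample2)
    # SE = sqrt(s1^2 / n1 + s2^2 / n2), using the sample variances
    se = math.sqrt(statistics.variance(sample1) / n1
                   + statistics.variance(sample2) / n2)
    df = min(n1 - 1, n2 - 1)  # conservative degrees of freedom
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    return df, se, (diff - null_diff) / se

# Made-up data, for illustration only.
group_a = [10, 12, 14, 16, 18]
group_b = [9, 10, 11]
print(two_sample_t(group_a, group_b))
```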
LO 14. Make note of the pooled standard deviation, but use it only in the rare circumstances where the
standard deviations of the populations being compared are known to be very similar.
$s^2_{pooled} = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}$
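The pooled variance is a weighted average of the two sample variances, with weights given by each sample's degrees of freedom. A short sketch, with a hypothetical function name and made-up inputs:

```python
def pooled_variance(s1, n1, s2, n2):
    """Pooled variance from two sample standard deviations and sample sizes."""
    return (s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2)

# When both samples have the same standard deviation, pooling returns s^2.
print(pooled_variance(3, 10, 3, 25))
# Otherwise the result lies between the two variances, closer to the larger sample's.
print(pooled_variance(2, 11, 4, 21))
```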
LO 15. Describe how to obtain a p-value for a t-test and a critical t-score ($t^\star_{df}$) for a confidence
interval.
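In practice both quantities come from a t-table or statistical software. The sketch below shows what is being computed, using only the Python standard library: the p-value is a tail area under the t density, and the critical t-score is the cutoff that leaves a given area in the tails. The function names, the Simpson's rule integration, and the bisection search are choices made for this illustration, not a method prescribed by the course.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with `df` degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus the area on [0, t] (Simpson's rule)."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def two_sided_p_value(t_stat, df):
    """Area in both tails beyond |T|."""
    return 2 * upper_tail(abs(t_stat), df)

def critical_t(df, level=0.95):
    """Find t* such that the middle `level` of the distribution lies in (-t*, t*)."""
    target = (1 - level) / 2  # area left in each tail
    lo, hi = 0.0, 100.0
    for _ in range(60):       # bisection: tail area shrinks as t grows
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(critical_t(9), 3))                # about 2.262, matching a t-table
print(round(two_sided_p_value(1.0, 1), 3))    # 0.5 for df = 1, T = 1
```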
Test yourself:
1. What is the $t^\star$ for a 95% confidence interval for a mean, where the sample size is 13?
2. What is the p-value for a hypothesis test where the alternative hypothesis is two-sided,
the sample size is 20, and the test statistic, T, is calculated to be 1.75?
with two different measures of degrees of freedom: one for the numerator ($df_G = k - 1$, where
$k$ is the number of groups) and one for the denominator ($df_E = n - k$, where $n$ is the total
sample size).
- Note that you won't be expected to calculate MSG or MSE from the raw data, but
you should have a conceptual understanding of how they're calculated and what they
measure.
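As an aid to that conceptual understanding, the sketch below computes MSG, MSE, and their ratio for a tiny made-up data set, using the standard one-way ANOVA definitions: MSG measures variability between the group means, and MSE measures variability within the groups. The function name and the data are assumptions for illustration.

```python
import statistics

def anova_f(groups):
    """Return (df_G, df_E, MSG, MSE, F) for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: group means around the grand mean
    ssg = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares: observations around their own group mean
    sse = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_g, df_e = k - 1, n - k
    msg, mse = ssg / df_g, sse / df_e
    return df_g, df_e, msg, mse, msg / mse

# Made-up data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(anova_f(groups))
```

A large ratio of MSG to MSE means the group means differ by more than the within-group variability alone would suggest.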
LO 20. Describe why the calculation of the p-value for ANOVA is always one-sided.
LO 21. Describe why conducting many t-tests for differences between each pair of means leads to an
increased Type 1 Error rate, and why we use a corrected significance level (Bonferroni correction,
$\alpha^\star = \alpha / K$, where $K$ is the number of comparisons being considered) to combat inflating this
error rate.
- Note that $K = \frac{k(k-1)}{2}$, where $k$ is the number of groups.
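The correction is simple arithmetic, sketched here with a hypothetical function name and an example of five groups:

```python
def bonferroni(alpha, k):
    """Return (K, corrected alpha) for all pairwise comparisons among k groups."""
    num_comparisons = k * (k - 1) // 2  # K = k(k-1)/2 pairs
    return num_comparisons, alpha / num_comparisons

# Five groups give 10 pairwise comparisons, so each test uses alpha / 10.
print(bonferroni(0.05, 5))
```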
LO 22. Describe why it is possible to reject the null hypothesis in ANOVA but not find significant
differences between groups when doing pairwise comparisons.
Test yourself:
1. We would like to compare the average income of Americans who live in the Northeast,
Midwest, South, and West. What are the appropriate hypotheses?
2. If the sample in the question above has 1000 observations, what are the degrees of
freedom associated with the F-statistic?
3. Suppose the null hypothesis is rejected. Describe how we would discover which regions'
averages are different from each other. Make sure to discuss how many pairwise comparisons
we would need to make, and what the corrected significance level would be.
4. What visualizations are useful for checking each of the conditions required for performing
ANOVA?