STAT 22000 Lecture Slides
Variability in Estimates &
Central Limit Theorem
Yibi Huang
Department of Statistics
University of Chicago
Outline
This set of slides covers Sections 4.1 and 4.4 of the text, which
include
• Central Limit Theorem (CLT)
• Sampling distribution
Example — Rating of a Movie
Suppose a certain movie has a polarized distribution of ratings: on
a 1 to 10 scale, of those who have watched the movie, 1/3 gave 9
points, 1/3 gave 2 points, and the remaining 1/3 gave 1 point.
So the population distribution is
X      1    2    9
P(X)   1/3  1/3  1/3
[Figure: Population Distribution (bar chart of P(X) over ratings 1 to 10)]
Histogram of the Sample
In practice, since the population is difficult (or impossible) to
examine completely, we take a sample to learn about the
population. Will the makeup of the sample mimic the makeup of
the population?
First, the sampling method must be appropriate. A biased sample
won’t give us the correct information about the population.
Suppose we take a simple random sample of size n (say
n = 400) from the population. What will the histogram of the
ratings of the movie given by subjects in the sample look like?
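# Population values; each is drawn with probability 1/3
popratings = c(1,2,9)
s400 = sample(popratings, size = 400, replace = TRUE, prob = c(1/3,1/3,1/3))
# Breaks at 0.5, 1.5, ..., 10.5 center one bar on each integer rating
hist(s400, breaks = 0:10+.5, xlab = "Ratings", main = "Sample Size = 400")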
[Figure: four histograms of sample ratings (x-axis: Ratings, y-axis: Frequency)
  sample size = 10,  sample mean = 3.6
  sample size = 25,  sample mean = 4.72
  sample size = 100, sample mean = 4.01
  sample size = 400, sample mean = 3.86]
The histogram of the sample looks somewhat like the histogram of
the population. The larger the sample size, the closer the
resemblance.
Estimation of the Population Mean
In practice, the population distribution is usually unknown. We are
often interested in population parameters, like the population
mean.
• As all we know about the population is the sample, we can
only use the sample to estimate the population parameter of
interest. A quantity computed from the sample is called a statistic.
• A commonly used estimate of the population mean is the
sample mean. Thus the sample mean is one such statistic.
• Sample statistics vary from sample to sample.
• How close is the sample mean to the population mean?
Variability of the Sample Means
To know the variability of the sample mean of a sample of size
n = 25, we pretend that we know the population
X 1 2 9
P (X ) 1/3 1/3 1/3
and then do the following simulation.
1. We take a random sample of size n = 25 from the population,
compute and record the sample mean, and then put the
sample back.
2. We repeat the previous step 10000 times, and then obtain
10000 sample means.
What will the histogram of the 10000 sample means look like?
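# Simulate 10000 sample means, each from a random sample of size 25
samplemean25 = numeric(10000)
for(i in 1:10000){
  samplemean25[i] = mean(sample(popratings, size = 25, replace = TRUE,
                                prob = c(1/3,1/3,1/3)))
}
# Breaks span every possible mean (1 to 9) so hist() cannot fail on a rare
# extreme sample; xlim keeps the view on the bulk of the distribution
hist(samplemean25, breaks = seq(0.98, 9.02, by = 0.04), xlim = c(1.5, 7),
     xlab = "sample mean",
     main = "Histogram of the Means of 10000 Samples of Size 25")
abline(v = 4, col = 2)  # population mean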
[Figure: Histogram of the Means of 10000 Samples of Size 25 (x-axis: sample mean, y-axis: Frequency)]
The red vertical line marks the position of the population mean = 4.
When we take samples of size 25, the distribution of the sample
means is not very close to normal, with a number of hills and valleys.
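# Simulate 10000 sample means, each from a random sample of size 100
samplemean100 = numeric(10000)
for(i in 1:10000){
  samplemean100[i] = mean(sample(popratings, size = 100, replace = TRUE,
                                 prob = c(1/3,1/3,1/3)))
}
# Breaks span every possible mean (1 to 9) so hist() cannot fail on a rare
# extreme sample; xlim keeps the view on the bulk of the distribution
hist(samplemean100, breaks = seq(0.99, 9.01, by = 0.02), xlim = c(2.5, 5.5),
     xlab = "sample mean",
     main = "Histogram of the Means of 10000 Samples of Size 100")
abline(v = 4, col = 2)  # population mean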
[Figure: Histogram of the Means of 10000 Samples of Size 100 (x-axis: sample mean, y-axis: Frequency)]
The red vertical line marks the position of the population mean = 4
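# Simulate 10000 sample means, each from a random sample of size 400
samplemean400 = numeric(10000)
for(i in 1:10000){
  samplemean400[i] = mean(sample(popratings, size = 400, replace = TRUE,
                                 prob = c(1/3,1/3,1/3)))
}
# Breaks span every possible mean (1 to 9) so hist() cannot fail on a rare
# extreme sample; xlim keeps the view on the bulk of the distribution
hist(samplemean400, breaks = seq(1, 9, by = 0.01), xlim = c(3.3, 4.7),
     xlab = "sample mean",
     main = "Histogram of the Means of 10000 Samples of Size 400")
abline(v = 4, col = 2)  # population mean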
[Figure: Histogram of the Means of 10000 Samples of Size 400 (x-axis: sample mean, y-axis: Frequency)]
The red vertical line marks the position of the population mean = 4.
When the sample size increases to 400, the distribution of the
sample means looks very close to normal.
Sampling Distribution
• The probability distribution of a statistic is called the sampling
distribution of the statistic.
• What we just constructed is the sampling distribution of the
sample mean.
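For a very small sample, the sampling distribution can be worked out exactly rather than simulated. The sketch below assumes n = 2 draws with replacement from the movie rating population; the names x, pairs, means and exact are illustrative and not from the slides.

```r
# Exact sampling distribution of the sample mean for n = 2
# (assumption: two independent draws, each value with probability 1/3)
x = c(1, 2, 9)
pairs = expand.grid(x1 = x, x2 = x)   # the 9 equally likely (x1, x2) outcomes
means = rowMeans(pairs)               # sample mean of each outcome
exact = table(means) / nrow(pairs)    # probability of each possible mean
print(exact)
# Expected value of the sample mean under this exact distribution
sum(as.numeric(names(exact)) * as.numeric(exact))
```

Enumeration quickly becomes impractical as n grows (there are 3^n outcomes), which is one reason the slides rely on simulation for n = 25, 100 and 400.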
Observations for the Simulations Above
• The sampling distribution of the sample mean may not be
normal when the sample size is small, but it gets more normal
when the sample size gets larger.
• The sample mean may not be equal to the population mean,
but its distribution centers at the population mean.
• With a larger sample, the variability of the sample mean around the
population mean gets smaller.
• What are the SDs of the sample means?
> mean(samplemean25)
[1] 3.99808
> sd(samplemean25)
[1] 0.7073244
> mean(samplemean100)
[1] 4.001438
> sd(samplemean100)
[1] 0.3577802
> mean(samplemean400)
[1] 3.99929
> sd(samplemean400)
[1] 0.1770972
Expected Value and SD of the Sample Mean
For i.i.d. random variables $X_1, X_2, \ldots, X_n$ from a population with
mean $\mu$ and SD $\sigma$, the expected value and SD of the sample mean
$\bar{X}_n = (X_1 + X_2 + \cdots + X_n)/n$ are respectively
\[ E(\bar{X}_n) = \mu, \qquad SD(\bar{X}_n) = \sigma/\sqrt{n}. \]
• Here, “i.i.d.” = “independent and identically distributed”,
which means X1 , . . . , Xn are independent and have identical
probability distributions.
• Observations in a simple random sample are nearly i.i.d. if the
sample size is less than 10% of the population size.
• The SD of the sample mean is specifically called the standard error.
For the movie rating example, recall the population distribution is
X      1    2    9
P(X)   1/3  1/3  1/3

The mean, variance and SD of the population distribution are
respectively
\[ \mu = 1\cdot\frac{1}{3} + 2\cdot\frac{1}{3} + 9\cdot\frac{1}{3} = 4, \]
\[ \sigma = \sqrt{(1-4)^2\cdot\frac{1}{3} + (2-4)^2\cdot\frac{1}{3} + (9-4)^2\cdot\frac{1}{3}} = \sqrt{\frac{38}{3}} \approx 3.56. \]

sample size n   expected value    SD of the sample mean   SD of the simulated means
                of sample mean    (theory)                (R output)
25              4                 3.56/√25  ≈ 0.712       sd(samplemean25)  = 0.7073244
100             4                 3.56/√100 ≈ 0.356       sd(samplemean100) = 0.3577802
400             4                 3.56/√400 ≈ 0.178       sd(samplemean400) = 0.1770972
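The population mean, population SD and the three theoretical standard errors can be reproduced in a few lines of R. This is a small sketch of the arithmetic above; the names x, p, mu, sigma, n and se are illustrative.

```r
# Population distribution from the movie rating example
x = c(1, 2, 9)
p = c(1/3, 1/3, 1/3)

mu = sum(x * p)                     # population mean
sigma = sqrt(sum((x - mu)^2 * p))   # population SD, sqrt(38/3)

n = c(25, 100, 400)
se = sigma / sqrt(n)                # SD of the sample mean for each n
round(se, 3)                        # 0.712 0.356 0.178
```

These values can be compared directly with sd(samplemean25), sd(samplemean100) and sd(samplemean400) from the simulations.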
Central Limit Theorem (CLT)
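Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables (discrete or
continuous) with mean $\mu$ and variance $\sigma^2$. Then, when $n$ is large,
• the distribution of the sample mean
\[ \bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \]
is approximately
\[ N\left(\mu, \frac{\sigma}{\sqrt{n}}\right). \]
• the distribution of the sum $S_n = X_1 + X_2 + \cdots + X_n$ is
approximately
\[ N(n\mu, \sqrt{n}\,\sigma). \]
Here the second argument of $N(\cdot, \cdot)$ is the standard deviation, not the variance.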
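The statement about the sum can be checked by simulation in the same way the slides check the sample mean. The sketch below assumes the movie rating population with n = 100; the seed and the names n, sums, theory and simulated are arbitrary choices for illustration.

```r
# Simulate the sum S_n of n = 100 ratings, 10000 times
set.seed(1)                          # arbitrary seed, for reproducibility only
popratings = c(1, 2, 9)              # each value has probability 1/3
n = 100
sums = replicate(10000, sum(sample(popratings, size = n, replace = TRUE)))

# CLT prediction: mean n*mu and SD sqrt(n)*sigma, with mu = 4, sigma = sqrt(38/3)
theory = c(mean = n * 4, sd = sqrt(n) * sqrt(38/3))
simulated = c(mean = mean(sums), sd = sd(sums))
print(rbind(theory, simulated))
```

A histogram of sums (for example hist(sums)) should look roughly bell shaped and centered near 400, up to simulation noise.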
Example
The $X_i$'s are i.i.d., with the distribution

X_i      1    2    9
P(X_i)   1/3  1/3  1/3

Recall that $\mu = 4$, $\sigma \approx 3.56$. So the sampling distribution of $\bar{X}_{100}$
is approximately
\[ N(\mu, \sigma/\sqrt{100}) = N(4, 0.356). \]
So
\[ P(\bar{X}_{100} > 4.5) = P\left(Z > \frac{4.5 - 4}{0.356}\right) \approx P(Z > 1.40) \approx 0.08. \]
In the simulation, 804 of the 10000 simulated values of $\bar{X}_{100}$ exceed 4.5,
which agrees with the CLT approximation that $\bar{X}_{100}$ exceeds 4.5
about 8% of the time.
> sum(samplemean100 > 4.5)
[1] 804
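The normal tail probability in this example can also be computed with pnorm() instead of a Z table. A minimal sketch, assuming the population values µ = 4 and σ = √(38/3) from the slides; the names se100, z and p_clt are illustrative.

```r
# CLT approximation of P(sample mean of 100 ratings > 4.5)
mu = 4
sigma = sqrt(38/3)                       # about 3.56
se100 = sigma / sqrt(100)                # standard error, about 0.356
z = (4.5 - mu) / se100                   # z-score, about 1.40
p_clt = pnorm(z, lower.tail = FALSE)     # upper tail probability, about 0.08
print(c(z = z, p = p_clt))
```

The result can be compared with the simulated proportion mean(samplemean100 > 4.5), which is 804/10000 in the run shown above.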
Sample Size Required to Use CLT?
• Provided the sample size is large enough, the sampling
distribution of the sample mean will be approximately
normal, even when the population distribution is not normal.
• If the population distribution is normal, then so is the
sampling distribution of the sample mean, regardless of the
sample size.
• If the population distribution is symmetric, then n should be at
least 30 or so.
• If the population distribution is skewed or has outliers, then
the sample size n should be moderately large (at least 100 or so), or
even larger depending on how skewed or irregular the
population distribution is.
Exercise 4.35 – Housing Prices (p.214)
A housing survey was conducted to determine the price of a typical
home in Topanga, CA. The mean price of a house was roughly $1.3
million with an SD of $0.3 million. There were no houses listed
below $0.3 million but a few houses above $3 million.
Can we find an approximate probability that a randomly chosen
house in Topanga costs more than $1.4 million using the normal
distribution?
No, because the population does not follow a normal distribution (it is
right skewed), and a sample of size 1 is too small to use the CLT.
Exercise 4.35 – Housing Prices (p.214)
Can we find an approximate probability that the mean of 60 randomly
chosen houses in Topanga is more than $1.4 million using
the normal distribution? If yes, compute the approximate probability.
Yes. If the population distribution is not too skewed, the sampling
distribution of the mean of a sample of size 60 is approximately
normal by the CLT.
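\[ \bar{X}_{60} \sim N\left(\mu = 1.3,\ SE = \frac{\sigma}{\sqrt{60}} = \frac{0.3}{\sqrt{60}}\right) = N(1.3, 0.0387). \]
So,
\[ P(\bar{X}_{60} > 1.4) = P\left(Z > \frac{1.4 - 1.3}{0.0387}\right) \approx P(Z > 2.58) \approx 0.0049. \]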
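The same calculation can be done in R. This is a sketch under the exercise's stated values (mean 1.3, SD 0.3, in millions of dollars, n = 60); the names se60, z60 and p60 are illustrative.

```r
# Normal approximation for the mean price of 60 randomly chosen houses
mu = 1.3                       # mean price, in millions of dollars
sigma = 0.3                    # SD of prices, in millions of dollars
se60 = sigma / sqrt(60)        # standard error, about 0.0387
z60 = (1.4 - mu) / se60        # z-score, about 2.58
p60 = pnorm(1.4, mean = mu, sd = se60, lower.tail = FALSE)  # about 0.0049
print(c(se = se60, z = z60, p = p60))
```

Passing mean and sd to pnorm() avoids standardizing by hand; pnorm(z60, lower.tail = FALSE) gives the same probability.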
What Does the CLT Say?
True or False, and explain: The central limit theorem says that as you
take larger and larger samples from a population, the histogram of
the sample values looks more and more normal.
False. As you take larger and larger samples, the histogram of the
sample values looks more and more like the histogram of the
population.
What is the thing that becomes more and more normal as the sample
size gets larger and larger?
It is the distribution of the sample mean that gets more and more
normal.
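The first answer can be illustrated with one large sample from the movie rating population: its values still pile up on 1, 2 and 9 in roughly equal proportions, rather than forming a bell shape. A small sketch; the seed, the sample size of 10000 and the names big_sample and props are arbitrary choices for illustration.

```r
# One large sample of individual ratings (not sample means)
set.seed(2)                                   # arbitrary seed
big_sample = sample(c(1, 2, 9), size = 10000, replace = TRUE)

# Proportion of each rating in the sample; the population has 1/3 each
props = table(big_sample) / length(big_sample)
print(props)
```

Plotting hist(big_sample, breaks = 0:10 + .5) shows three separate bars, mirroring the population histogram, while the histograms of sample means shown earlier are the ones that approach the normal curve.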