TOAE201-LecturerNotes-Chapter 4. Sample Theoretical Basis
Textbook: Paul Newbold, William Carlson, Betty Thorne, 2010, Statistics for Business
and Economics, 7th edition, Pearson.
* Time
It may take too much time to contact the whole population. Even if you could
contact the whole population, the results may be meaningless as they would be out
of date, e.g. if it took me 2 years to poll the voting population in the one-year run-up
to an election.
* Cost
The cost may be too high and hence prohibitive, e.g. if I were to poll the 90 million
people who vote in a general election, the cost would be astronomical.
Sampling Methods
When sampling we must ensure that we choose a sample which is representative of the
whole population.
The sampling methods that follow are just a few ways in which sampling can be carried
out. Other methods also exist which are not discussed here.
A sample is selected so that each item or person in the population has the same chance of
being selected.
Example
For example, market-research groups may use random numbers to select telephone
numbers to call and ask about preferences for a product. Various statistical computer
packages and spreadsheets have routines for obtaining random numbers, and these are
used for sampling studies.
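The idea of using random numbers to give every item the same chance of selection can be sketched in Python. This is only an illustration: the sampling frame of 1,000 customer IDs and the function name simple_random_sample are hypothetical and do not come from the notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every item has the same chance of selection."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    return rng.sample(population, n)

# Hypothetical sampling frame: 1,000 customer IDs (an assumption, not from the notes).
frame = list(range(1, 1001))
chosen = simple_random_sample(frame, 10, seed=1)
print(chosen)
```

Sampling is done without replacement here, so no ID can be picked twice.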
The starting point would be a random number between 1 and k. Then we would pick
every k-th number after that.
For example, with k = 20, if we choose 18 as our random starting point, then starting
with the 18th observation, every 20th observation (18, 38, 58, ...) would be chosen.
We would end up with a sample of 100 observations.
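A minimal sketch of this selection rule follows. It assumes a population of 2,000 numbered observations, which is the size implied by k = 20 and a sample of 100; the function name systematic_sample is hypothetical.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start between 1 and k, then every k-th item after it."""
    rng = random.Random(seed)
    start = rng.randint(1, k)  # inclusive on both ends
    # Positions are 1-based: start, start + k, start + 2k, ...
    return start, population[start - 1::k]

# Assumed population: observations numbered 1 to 2,000.
observations = list(range(1, 2001))
start, sample = systematic_sample(observations, k=20, seed=0)

# The case described in the text: a starting point of 18.
fixed = observations[18 - 1::20]
print(fixed[:3], len(fixed))
```

Whatever the random start, the sample size is the same because 2,000 is an exact multiple of 20.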
Example
Consider the advertising expenditure for the largest 352 companies in the United States.
Suppose we wanted to study whether firms with high returns on equity spent more of
each sales dollar on advertising than firms with a low return or a deficit.
So here, we will not have all clusters (groups) represented in our sample.
Sampling “Error”
Sampling error is the difference between a sample statistic and its corresponding
population parameter. In the case of the mean, it is
$\bar{X} - \mu$
Where:
$\bar{X}$ = the mean of the sample
$\mu$ = the population mean
Samples are used to estimate population characteristics. For example the mean of a
sample is used to estimate the mean of the population. However, since the sample is only
part of the population, it is unlikely that the sample mean will be exactly equal to the
population mean.
Likewise, the sample standard deviation is unlikely to be exactly the same as the
population standard deviation.
We would not be surprised then if the sample statistic is different from the corresponding
population parameter.
Each of these differences, 1.0 and -0.5, is the sampling error made in estimating the
population mean based on the sample mean.
Each of the possible samples of size 2 has an equal chance of selection. Each sample may
have a different sample mean and hence, sampling error. The value of the sampling error
is based on the random selection of the sample. Therefore, sampling errors are random
and occur by chance.
Here our random variable will be a mean. Each observation represents the average of a
sample of size n.
Organizing the means of all possible samples of size n, into a probability distribution
would result in us obtaining the sampling distribution of the sample mean.
Joe $7
Sam $7
Sue $8
Bob $8
Jan $7
Art $8
Ted $9
What is the sampling distribution of the sample mean of the samples of size 2?
To arrive at the sampling distribution of the sample mean, all possible samples of size 2
need to be selected without replacement from the population, and their means computed.
There are 21 possible samples: $_7C_2 = \frac{7!}{2!\,5!} = 21$.
Listed below are all the 21 sample means from all samples of size 2.
Total 21 1.00
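The enumeration can be reproduced with a short script. This is a sketch that uses the seven dollar amounts listed above; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Population values as listed in the notes (dollars).
values = {"Joe": 7, "Sam": 7, "Sue": 8, "Bob": 8, "Jan": 7, "Art": 8, "Ted": 9}

# All samples of size 2 drawn without replacement (order does not matter).
samples = list(combinations(values, 2))
sample_means = [Fraction(values[a] + values[b], 2) for a, b in samples]

# Sampling distribution: each distinct sample mean with its probability.
counts = Counter(sample_means)
distribution = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

population_mean = Fraction(sum(values.values()), len(values))
mean_of_sample_means = sum(m * p for m, p in distribution.items())

for m, p in distribution.items():
    print(float(m), counts[m], round(float(p), 4))
```

The mean of the 21 sample means equals the population mean (54/7, about $7.71), which is the property the later sections build on.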
Chapter 4 – TOAE201 Lecture Notes by Vuong Thi Thao Binh
(Some contents of this slide are based on Lecture Notes by Panayiotis Skordi,
Fullerton University)
Sampling Distribution of the Mean of Two Dice
Let us consider rolling a fair die an infinite number of times. We know that the possible
outcomes are 1, 2, 3, 4, 5, 6. The probability distribution of the random variable X is:
X 1 2 3 4 5 6
P(X) 1/6 1/6 1/6 1/6 1/6 1/6
$\mu = \sum x P(x) = 3.5$
$\sigma^2 = \sum (x - \mu)^2 P(x)$
$= (1 - 3.5)^2 \tfrac{1}{6} + (2 - 3.5)^2 \tfrac{1}{6} + (3 - 3.5)^2 \tfrac{1}{6} + (4 - 3.5)^2 \tfrac{1}{6} + (5 - 3.5)^2 \tfrac{1}{6} + (6 - 3.5)^2 \tfrac{1}{6} = 2.92$
The Distribution of $\bar{x}$ is
Sample # Sample $\bar{x}$
1 1, 1 1.0
2 1, 2 1.5
3 1, 3 2.0
4 1, 4 2.5
5 1, 5 3.0
6 1, 6 3.5
7 2, 1 1.5
8 2, 2 2.0
9 2, 3 2.5
10 2, 4 3.0
11 2, 5 3.5
12 2, 6 4.0
13 3, 1 2.0
14 3, 2 2.5
15 3, 3 3.0
16 3, 4 3.5
17 3, 5 4.0
18 3, 6 4.5
19 4, 1 2.5
20 4, 2 3.0
21 4, 3 3.5
22 4, 4 4.0
23 4, 5 4.5
24 4, 6 5.0
25 5, 1 3.0
26 5, 2 3.5
27 5, 3 4.0
28 5, 4 4.5
29 5, 5 5.0
30 5, 6 5.5
31 6, 1 3.5
32 6, 2 4.0
33 6, 3 4.5
34 6, 4 5.0
35 6, 5 5.5
36 6, 6 6.0
There are 36 possible samples of size 2. Each sample outcome is equally likely and has a
probability of 1/36 of occurring. $\bar{x}$ can assume only 11 different possible values: 1.0,
1.5, ..., 6.0, with certain values of $\bar{x}$ occurring more frequently than others.
$\bar{x}$ $P(\bar{x})$
1.0 1/36
1.5 2/36
2.0 3/36
2.5 4/36
3.0 5/36
3.5 6/36
4.0 5/36
4.5 4/36
5.0 3/36
5.5 2/36
6.0 1/36
The value $\bar{x}$ = 1.0 occurs only once, so its probability is 1/36. The value $\bar{x}$ = 3.5 occurs
6 times, so its probability is 6/36.
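$\mu_{\bar{x}} = \sum \bar{x} P(\bar{x})$
$= 1.0 \cdot \tfrac{1}{36} + 1.5 \cdot \tfrac{2}{36} + 2.0 \cdot \tfrac{3}{36} + \cdots + 5.0 \cdot \tfrac{3}{36} + 5.5 \cdot \tfrac{2}{36} + 6.0 \cdot \tfrac{1}{36} = 3.5$
$\sigma^2_{\bar{x}} = \sum (\bar{x} - \mu_{\bar{x}})^2 P(\bar{x})$
$= (1.0 - 3.5)^2 \tfrac{1}{36} + (1.5 - 3.5)^2 \tfrac{2}{36} + \cdots + (6.0 - 3.5)^2 \tfrac{1}{36} = 1.46$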
Note that the mean of 3.5 is the same as the mean of the population of tossing a die.
Further, note that the variance of the sampling distribution of $\bar{x}$, where n = 2, is 1.46,
which is exactly half the variance of the population of the toss of a die (2.92).
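The two-dice calculation can be checked by brute-force enumeration. The sketch below uses exact fractions so that the "exactly half" relationship is not blurred by rounding; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Population (one die): mean and variance.
pop_mean = Fraction(sum(faces), 6)
pop_var = sum((x - pop_mean) ** 2 for x in faces) / 6

# All 36 ordered samples of size 2 and the distribution of their means.
samples = list(product(faces, repeat=2))
means = Counter(Fraction(a + b, 2) for a, b in samples)
dist = {m: Fraction(c, len(samples)) for m, c in sorted(means.items())}

# Mean and variance of the sampling distribution of the sample mean.
mean_xbar = sum(m * p for m, p in dist.items())
var_xbar = sum((m - mean_xbar) ** 2 * p for m, p in dist.items())

print(float(pop_mean), round(float(pop_var), 2))
print(float(mean_xbar), round(float(var_xbar), 2))
```

The exact values are 35/12 for the population variance and 35/24 for the variance of the sample mean, which round to 2.92 and 1.46.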
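The Distribution of $\bar{x}$
Repeating the experiment with larger sample sizes n, the sampling distribution tends to
resemble a normal probability distribution.
That is, the mean of the sampling distribution is $\mu_{\bar{x}} = \mu$
and the standard deviation is $\sigma_{\bar{x}} = \sigma / \sqrt{n}$. This is known as the standard error of the mean.
Where:
$\sigma$ = the population standard deviation
$n$ = the sample size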
Also as the sample size n, increases, the sample means tend to cluster around the true
population mean.
We now develop important properties of the sampling distribution of the sample means.
Our analysis begins with a random sample of n observations from a very large population with
mean $\mu$ and variance $\sigma^2$; the sample observations are random variables $X_1, X_2, \ldots, X_n$. Before the
sample is observed, there is uncertainty about the outcomes.
This uncertainty is modeled by viewing the individual observations as random variables from a
population with mean $\mu$ and variance $\sigma^2$. Our primary interest is in making
inferences about the population mean $\mu$. An obvious starting point is the sample mean.
At this point we cannot determine the shape of the sampling distribution, but we can determine
the mean and variance of the sampling distribution from basic definitions we learned in Chapters
2.
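In Chapters 2 and 3 we saw that the expectation of a linear combination of random variables is
the linear combination of the expectations:
$$E[\bar{X}] = E\left[\frac{1}{n}(X_1 + X_2 + \cdots + X_n)\right] = \frac{n\mu}{n} = \mu$$
In Chapters 2 and 3 we saw that the variance of a linear combination of independent random
variables is the sum of the linear coefficients squared times the variance of the random variables.
It follows that
$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n\right) = \sum_{i=1}^{n} \left(\frac{1}{n}\right)^2 \sigma^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$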
This means that as the sample size, n, gets larger, the sample means tend to follow a
normal probability distribution and tend to cluster around the true population mean. This
holds regardless of the distribution of the population from which the sample was drawn.
In summary, regardless of the type of distribution from which one draws a random sample,
the sampling distribution of the sample mean will be normal, or approximately normal, under certain conditions:
1. if the population distribution is normal $N(\mu, \sigma^2)$, the sampling distribution will be
normal $N(\mu, \sigma^2/n)$ regardless of sample size.
2. if the population distribution is approximately normal, the sampling distribution will
be approximately normal.
3. if the population is not normal, the sampling distribution will be approximately
normal if the sample is large enough, typically taken to be at least 30.
Example
Here we have an underlying normal distribution x, with mean = $\mu$ and
standard deviation = $\sigma$.
[Figure: normal distribution of x, with mean = $\mu$ and standard deviation = $\sigma$]
We will generate the $\bar{x}$ distribution (sample size n), with mean = $\mu$ and standard
deviation = $\sigma/\sqrt{n}$. This will also be a normal distribution, according to the central limit
theorem.
[Figure: normal distribution of $\bar{x}$, with mean = $\mu$ and standard deviation = $\sigma/\sqrt{n}$]
This rule holds for any underlying distribution of x. Even if the underlying
distribution, x, were not normal, the distribution of $\bar{x}$ would still be approximately normal with mean = $\mu$
and standard deviation = $\sigma/\sqrt{n}$. This is the result of the central limit theorem, and we
must bear in mind that, when the population is not normal, the sample size n should be at least 30.
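A small simulation can make the central limit theorem concrete. The sketch below is an illustration under stated assumptions, not part of the original notes: the population is taken to be exponential with mean 1 and standard deviation 1 (a clearly skewed, non-normal choice), the sample size is 30, and 20,000 samples are drawn.

```python
import random
import statistics

def sample_means(draw, n, repetitions, seed=0):
    """Return the means of `repetitions` random samples, each of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(repetitions)]

# Assumed population: exponential with rate 1, so mean = 1 and SD = 1.
means = sample_means(lambda rng: rng.expovariate(1.0), n=30, repetitions=20000)

mean_of_means = statistics.fmean(means)   # should be close to the population mean
sd_of_means = statistics.pstdev(means)    # should be close to sigma / sqrt(n)
expected_se = 1.0 / 30 ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
```

A histogram of the simulated means would look roughly bell-shaped even though the population itself is strongly skewed.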
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ when the population standard deviation, $\sigma$, is KNOWN
$s_{\bar{x}} = \frac{s}{\sqrt{n}}$ when the population standard deviation, $\sigma$, is UNKNOWN
$\mu_{\bar{x}} = \mu$
$\mu_{\bar{x}}$ = the mean of the sample means
$\mu$ = the mean of the population
Here we take the average of the sample means and use that as an approximation to the
population mean. It is denoted by $\mu_{\bar{x}}$.
When the population standard deviation and population mean are both KNOWN:
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
When the population standard deviation is UNKNOWN and the population mean is
KNOWN:
$$T = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Example
The foreman of a bottling plant has observed that the amount of soda in each 32-ounce
bottle is actually a normally distributed random variable, with mean of 32.2 ounces and a
standard deviation of 0.3 ounces.
a) If a customer buys one bottle, what is the probability that the bottle will contain more
than 32 ounces?
b) If a customer buys a carton of four bottles, what is the probability that the mean
amount of the four bottles will be greater than 32 ounces?
Solution a.
(The solution uses table 1. You may also use tables 1 or Excel to solve. The methods to
achieve this were covered at great length in the previous chapter.)
Let X be the random variable representing the amount of soda in one bottle.
It is normally distributed with mean = 32.2 and SD = 0.3
$$P(X > 32) = P\left(\frac{X - \text{mean}}{\text{SD}} > \frac{32 - 32.2}{0.3}\right) = P(Z > -0.67)$$
[Figure: standard normal distribution (mean = 0, SD = 1); the required area is where z is greater than -0.67]
$$= P(-0.67 < Z < 0) + 0.5 = 0.2486 + 0.5 = 0.7486$$
Solution b.
(The solution uses table B1. You may also use tables B1 or Excel to solve. The methods
to achieve this were covered at great length in the previous chapter.)
Let $\bar{X}$ be the random variable representing the average amount of soda in four bottles.
It is normally distributed with mean = 32.2 and SD = $\frac{0.3}{\sqrt{4}}$ = 0.15
$$P(\bar{X} > 32) = P\left(\frac{\bar{X} - \text{mean}}{\text{SD}} > \frac{32 - 32.2}{0.15}\right) = P(Z > -1.33)$$
[Figure: standard normal distribution (mean = 0, SD = 1); the required area is where z is greater than -1.33]
$$= P(-1.33 < Z < 0) + 0.5 = 0.4082 + 0.5 = 0.9082$$
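Both parts of the bottling example can be checked in Python with the standard library's NormalDist class. This is a sketch rather than the method used in the notes, which rely on tables; because the code does not round z to two decimals, its answers differ slightly from the table-based ones.

```python
from statistics import NormalDist

mu, sigma = 32.2, 0.3  # mean and SD of the amount of soda in one bottle

# a) One bottle: P(X > 32).
single = NormalDist(mu, sigma)
p_one = 1 - single.cdf(32)

# b) Mean of a carton of n = 4 bottles: the SD becomes sigma / sqrt(n).
n = 4
carton = NormalDist(mu, sigma / n ** 0.5)
p_mean = 1 - carton.cdf(32)

print(round(p_one, 4), round(p_mean, 4))
```

The probability is higher for the carton mean because averaging four bottles shrinks the spread around 32.2 ounces.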
Scores on a real estate exam are normally distributed with mean 430 and standard deviation 20.
If we randomly selected 50 exams, what is the probability that the sample mean of these 50
exams would exceed a score of 458?
(The solution uses table 1. You may also use tables 1 and Excel to solve. The methods to
achieve this were covered at great length in the previous chapter.)
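$$P(\bar{X} > 458) = P\left(Z > \frac{458 - 430}{20/\sqrt{50}}\right) = P(Z > 9.899) \approx 0$$
[Figure: standard normal distribution (mean = 0, SD = 1); the required area is where z is greater than 9.899]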
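The size of this tail probability can be seen with a short calculation. The sketch below uses math.erfc rather than 1 minus a cumulative probability, because the tail is so small that the subtraction would round to exactly zero in floating point.

```python
import math

mu, sigma, n = 430, 20, 50
standard_error = sigma / math.sqrt(n)

# z-score of a sample mean of 458.
z = (458 - mu) / standard_error

# Upper-tail probability P(Z > z) for a standard normal variable.
p_tail = 0.5 * math.erfc(z / math.sqrt(2))

print(round(z, 3), p_tail)
```

A z-score near 9.9 is far beyond the range of any printed normal table, so the probability is effectively zero.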
We may be interested in testing measures other than the sample mean. We may be
interested in measuring the percentage of people in the work force that would opt for
early retirement. Each person has two choices of either agreeing with early retirement or
not. This experiment follows a binomial probability distribution.
As we do not know the proportion of people in the population of the workforce that
would opt for early retirement we can take samples and calculate the approximate
population proportion.
If the samples are large enough we may use the normal distribution as an approximation
to the binomial.
The conditions that must apply for this to be the case are:
Suppose we take 10 sample groups of 150 people in each group, and record the number
of people in each group that agree with early retirement. The following are the results:
Averaging these all out gives an approximation for the population proportion:
$$\frac{0.173 + 0.12 + 0.14 + 0.2 + 0.16 + 0.14 + 0.107 + 0.187 + 0.233 + 0.18}{10} = 0.164$$
Now we can answer such questions as, “what is the probability that 20% or less of the
workforce will agree with early retirement?” We already have the mean and standard
error and we know we can use the normal distribution to approximate the binomial.
Standardizing gives
$$P\left(\frac{\bar{p} - p_{mean}}{\sigma_p} < \frac{0.20 - 0.164}{0.030}\right) = P(z < 1.20) = 0.8849$$
[Figure: sampling distribution of the proportion; the area to the left of z = 1.2 is 0.8849]
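The proportion calculation can be sketched as follows. The ten sample proportions and the group size of 150 come from the example; the standard error formula sqrt(p(1 - p)/n) is an assumption here, chosen because it reproduces the value 0.030 used above.

```python
from statistics import NormalDist, fmean

sample_proportions = [0.173, 0.12, 0.14, 0.2, 0.16, 0.14, 0.107, 0.187, 0.233, 0.18]
n = 150  # people in each sample group

# Estimate of the population proportion: the average of the sample proportions.
p_hat = fmean(sample_proportions)

# Assumed standard error of a sample proportion: sqrt(p(1 - p) / n).
std_error = (p_hat * (1 - p_hat) / n) ** 0.5

# P(sample proportion <= 0.20) under the normal approximation.
z = (0.20 - p_hat) / std_error
prob = NormalDist().cdf(z)

print(round(p_hat, 3), round(std_error, 3), round(z, 2), round(prob, 4))
```

Because the code keeps the unrounded standard error, z comes out slightly below 1.20 and the probability slightly below the table value of 0.8849.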
4. According to an IRS study, it takes an average of 330 minutes for taxpayers to prepare,
copy, and electronically file a 1040 tax form, and the standard deviation of this time
is 80 minutes. A consumer watchdog
agency selects a random sample of 40 taxpayers.
a. What assumption or assumptions do you need to make about the shape of the
population?
b. What is the standard error of the mean?
c. What is the likelihood the sample mean is greater than 320 minutes?
d. What is the likelihood the sample mean is between 320 and 350 minutes?
e. What is the likelihood the sample mean is greater than 350 minutes?
Paul Newbold: Exercises 6.5 - 6.12 pp.262-263
6.26 - 6.32 pp. 268-269
The conclusion that the expected value of the sample variance is the population variance is quite
general. But for statistical inference we would like to know more about the sampling distribution.
If we can assume that the underlying population distribution is normal, then it can be shown that
the sample variance and the population variance are related through a probability distribution
known as the chi-square distribution.
Exercise
A random sample of size n = 16 is obtained from a normally distributed population
with a population mean of $\mu$ = 100 and a variance of $\sigma^2$ = 25.
a. What is the probability that $\bar{x} > 101$?
b. What is the probability that the sample variance is greater than 45?
c. What is the probability that the sample variance is greater than 60?
Solution
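a) Let X be the random variable representing the population, $X \sim N(\mu, \sigma^2)$. The sample mean has standard error $\sigma/\sqrt{n} = 5/\sqrt{16} = 1.25$, so
$$P(\bar{x} > 101) = P\left(Z > \frac{101 - 100}{5/\sqrt{16}}\right) = P(Z > 0.8) = 0.2119$$
b)
$$P(S^2 > 45) = P\left(\frac{(n-1)S^2}{\sigma^2} > \frac{(n-1) \cdot 45}{\sigma^2}\right) = P\left(\chi^2_{n-1} > \frac{(16-1) \cdot 45}{25}\right) = P(\chi^2_{15} > 27) = 0.029$$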
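The exercise can be checked numerically with the standard library alone. This is a sketch under stated assumptions: part (a) uses the standard error of the mean, sigma/sqrt(n) = 1.25, and the chi-square tail is obtained by integrating the density with Simpson's rule in a hypothetical helper, chi2_sf, instead of reading a table.

```python
import math
from statistics import NormalDist

n, mu, var = 16, 100, 25

# a) P(sample mean > 101), using the standard error sqrt(var / n).
se = math.sqrt(var / n)
p_a = 1 - NormalDist(mu, se).cdf(101)

def chi2_sf(x, df, steps=20000):
    """P(chi-square with df degrees of freedom > x), via Simpson's rule.

    Assumes df > 2 (so the density is 0 at the origin) and an even `steps`.
    """
    half = df / 2
    norm = 2 ** half * math.gamma(half)

    def pdf(t):
        return t ** (half - 1) * math.exp(-t / 2) / norm

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - total * h / 3

# b) and c): (n - 1) S^2 / sigma^2 follows a chi-square distribution with n - 1 df.
p_b = chi2_sf((n - 1) * 45 / var, n - 1)
p_c = chi2_sf((n - 1) * 60 / var, n - 1)

print(round(p_a, 4), round(p_b, 3), p_c)
```

Part (c) follows the same pattern as part (b) with a cutoff of 15 x 60 / 25 = 36, so its probability is smaller still.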