Notes On Estimation
C11: STATISTICS
Contents
Aims of this course
Schedules
Recommended books
Keywords
Notation
1 Parameter estimation
1.1 What is Statistics?
1.2 RVs with values in Rⁿ or Zⁿ
1.3 Some important random variables
1.4 Independent and IID RVs
1.5 Indicating dependence on parameters
1.6 The notion of a statistic
1.7 Unbiased estimators
1.8 Sums of independent RVs
1.9 More important random variables
1.10 Laws of large numbers
1.11 The Central Limit Theorem
1.12 Poisson process of rate λ
4 Confidence intervals
4.1 Interval estimation
4.2 Opinion polls
4.3 Constructing confidence intervals
4.4 A shortcoming of confidence intervals*
5 Bayesian estimation
5.1 Prior and posterior distributions
5.2 Conditional pdfs
5.3 Estimation within Bayesian statistics
6 Hypothesis testing
6.1 The Neyman–Pearson framework
6.2 Terminology
6.3 Likelihood ratio tests
6.4 Single sample: testing a given mean, simple alternative, known variance (z-test)
11 The t-test
11.1 Confidence interval for the mean, unknown variance
11.2 Single sample: testing a given mean, unknown variance (t-test)
11.3 Two samples: testing equality of means, unknown common variance (t-test)
11.4 Single sample: testing a given variance, unknown mean (χ²-test)
15 Computational methods
15.1 Analysis of residuals from a regression
15.2 Discriminant analysis
15.3 Principal components / factor analysis
15.4 Bootstrap estimators
16 Decision theory
16.1 The ideas of decision theory
16.2 Posterior analysis
16.3 Hypothesis testing as decision making
16.4 The classical and subjective points of view
Aims of the course
The aim of this course is to acquaint you with the basics of mathematical statistics:
the ideas of estimation, hypothesis testing and statistical modelling.
After studying this material you should be familiar with
1. the notation and keywords listed on the following pages;
2. the definitions, theorems, lemmas and proofs in these notes;
3. examples in notes and examples sheets that illustrate important issues concerned
with topics mentioned in the schedules.
Schedules
Estimation
Review of distribution and density functions, parametric families, sufficiency, Rao-
Blackwell theorem, factorization criterion, and examples; binomial, Poisson, gamma.
Maximum likelihood estimation. Confidence intervals. Use of prior distributions and
Bayesian inference.
Hypothesis Testing
Simple examples of hypothesis testing, null and alternative hypothesis, critical re-
gion, size, power, type I and type II errors, Neyman-Pearson lemma. Significance
level of outcome. Uniformly most powerful tests. Likelihood ratio, and the use of
likelihood ratio to construct test statistics for composite hypotheses. Generalized
likelihood-ratio test. Goodness-of-fit and contingency tables.
Linear normal models
The χ2 , t and F distribution, joint distribution of sample mean and variance, Stu-
dent’s t-test, F -test for equality of two variances. One-way analysis of variance.
Linear regression and least squares
Simple examples, *Use of software*.
Recommended books
M. H. De Groot, Probability and Statistics, 2nd edition, Addison-Wesley, 1986.
J. A. Rice, Mathematical Statistics and Data Analysis, 2nd edition, Duxbury Press,
1994.
G. Casella and J. O. Berger, Statistical Inference, Brooks Cole, 1990.
D. A. Berry and B. W. Lindgren, Statistics, Theory and Methods, Brooks Cole, 1990
(out of print).
Keywords
posterior
posterior mean
posterior median
power function
predictive confidence interval
principal components
prior
probability density function
probability mass function
quadratic error loss
Rao–Blackwell theorem
Rao–Blackwellization
regression through the origin
residual sum of squares
RV
sample correlation coefficient
scale parameter
significance level of a test
significance level of an observation
simple hypothesis
simple linear regression model
Simpson’s paradox
size
standard error
standard normal
standardized
standardized residuals
statistic
strong law of large numbers
sufficient statistic
t-distribution
t-test
two-tailed test
type I error
type II error
unbiased estimator
uniform distribution
uniformly most powerful
variance
weak law of large numbers
within samples sum of squares
Notation
X a scalar or vector random variable, X = (X1, . . . , Xn)
X∼ X has the distribution . . .
E X, var(X) mean and variance of X
µ, σ 2 mean and variance as typically used for N(µ, σ 2)
RV, IID ‘random variable’, ‘independent and identically distributed’
beta(m, n) beta distribution
B(n, p) binomial distribution
χ2n chi-squared distribution with n d.f.
E(λ) exponential distribution
Fm,n F distribution with m and n d.f.
gamma(n, λ) gamma distribution
N(µ, σ 2) normal (Gaussian) distribution
P (λ) Poisson distribution
U [a, b] uniform distribution
tn Student’s t distribution with n d.f.
Φ distribution function of N(0, 1)
φ density function of N(0, 1)
zα , tα^(n) , Fα^(m,n)   upper α points of N(0, 1), tn and Fm,n distributions
θ a parameter of a distribution
θ̂(X), θ̂(x)   an estimator of θ, an estimate of θ.
MLE ‘maximum likelihood estimator’
FX (x | θ) distribution function of X depending on a parameter θ
fX (x | θ) density function of X depending on a parameter θ
fθ (x) density function depending on a parameter θ
fX|Y conditional density of X given Y
p(θ | x) posterior density of θ given data x
x1 , . . . , xn   n observed data values
xi· , x·j , x··   Σ_j xij , Σ_i xij and Σ_ij xij
T (x) a statistic computed from x1, . . . , xn
H0 , H1 null and alternative hypotheses
f0 , f1 null and alternative density functions
Lx (H0), Lx(H1) likelihoods of H0 and H1 given data x
Lx (H0, H1) likelihood ratio Lx(H1)/Lx(H0)
tα^(n) , Fα^(m,n)   points to the right of which lie 100α% of tn and Fm,n
C critical region: reject H0 if T (x) ∈ C.
W (θ) power function, W (θ) = P(X ∈ C | θ)
α, β   probabilities of Type I and Type II error
α, β   intercept and gradient of a regression line, Yi = α + βwi + εi
oi , ei , δi observed and expected counts; δi = oi − ei
X̄   mean of X1 , . . . , Xn
SXX , SY Y , SXY   Σ(Xi − X̄)² , Σ(Yi − Ȳ)² , Σ(Xi − X̄)(Yi − Ȳ)
WWW site
There is a web page for this course, with copies of the lecture notes, examples sheets,
corrections, past tripos questions, statistical tables and other additional material. It
can be accessed as http://www.statslab.cam.ac.uk/~rrw1/stats/
1 Parameter estimation
Statisticians do it when it counts.
Example 1.2 A famous study investigated the effects upon heart attacks of taking
an aspirin every other day. The results after 5 years were
Condition Heart attack No heart attack Attacks per 1000
Aspirin 104 10,933 9.42
Placebo 189 10,845 17.13
What can we make of this data? Is it evidence for the hypothesis that aspirin prevents
heart attacks?
The aspirin study is an example of a controlled experiment. The subjects were
doctors aged 40 to 84 and none knew whether they were taking the aspirin or the
placebo. Statistics is also concerned with analysing data from observational stud-
ies. For example, most of us make an intuitive statistical analysis when we use our
previous experience to help us choose the shortest checkout line at a supermarket.
The data analysis of observational studies and experiments is a central component
of decision-making, in science, medicine, business and government.
X = (X1, X2 , . . . , Xn )
X : Ω → Z.
RVs can also take values in R rather than in Z and the sample space Ω can be
uncountable.
X : Ω → R.
In both cases the distribution function FX of X is defined as:
FX (x) := P(X ≤ x) = Σ_{ω : X(ω) ≤ x} P(ω).
So
P(X ∈ A) = Σ_{x ∈ A} fX (x),   A ⊆ Z,
the first formula being the real definition. In the continuous case the calculation
E(X) = ∫_Ω X(ω) P(dω)
var(X) = E (X − µ)² = E (X²) − µ² .
1.3 Some important random variables
(a) We say that X has the binomial distribution B(n, p), and write X ∼ B(n, p),
if
P(X = k) = (n choose k) p^k (1 − p)^(n−k)  for k ∈ {0, . . . , n}, and 0 otherwise.
Then E (X) = np, var(X) = np(1 − p). This is the distribution of the number of
successes in n independent trials, each of which has probability of success p.
(b) We say that X has the Poisson distribution with parameter λ, and write
X ∼ P (λ), if
P(X = k) = e^(−λ) λ^k / k!  for k ∈ {0, 1, 2, . . . }, and 0 otherwise.
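As a quick numerical check, the pmf, mean and variance formulas above can be verified with scipy (a sketch; the values n = 10, p = 0.3 and λ = 2.5 are arbitrary illustrative choices).

# Check the binomial and Poisson facts stated above.
from scipy import stats

n, p, lam = 10, 0.3, 2.5

X = stats.binom(n, p)
print(X.pmf(3))                   # (10 choose 3) 0.3^3 0.7^7
print(X.mean(), n * p)            # both 3.0
print(X.var(), n * p * (1 - p))   # both 2.1

Y = stats.poisson(lam)
print(Y.pmf(2))                   # e^(-2.5) 2.5^2 / 2!
print(Y.mean(), Y.var())          # both 2.5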
1.4 Independent and IID RVs
Random variables X1 , . . . , Xn are called independent if for all x1, . . . , xn
P(X1 ≤ x1; . . . ; Xn ≤ xn) = P(X1 ≤ x1) · · · P(Xn ≤ xn).
IID stands for independent identically distributed. Thus if X1 , X2, . . . , Xn are
IID RVs, then they all have the same distribution function and hence the same mean
and same variance.
We work with the probability mass function (pmf) of X in Zn or probability
density function (pdf) of X in R n : In most cases, X1 , . . . , Xn are independent, so
that if x = (x1, . . . , xn) ∈ R n , then
fX (x) = fX1 (x1) · · · fXn (xn).
Clearly, some statistics are more natural and useful than others. The first of these
would be useful for estimating µ if the data are samples from a N(µ, 1) distribution.
The second would be useful for estimating θ if the data are samples from U [0, θ].
1.7 Unbiased estimators
An estimator of a parameter θ is a function T = T (X) which we use to estimate θ
from an observation of X. T is said to be unbiased if
E (T ) = θ.
The expectation above is taken over X. Once the actual data x is observed, t = T (x)
is the estimate of θ obtained via the estimator T .
Another possible unbiased estimator for p is p̃ = (1/3)(X1 + 2X2) (i.e., we ignore most
of the data). It is also unbiased since
E p̃(X) = (1/3) E (X1 + 2X2) = (1/3)(E X1 + 2E X2) = (1/3)(p + 2p) = p .
Intuitively, the first estimator seems preferable.
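A small simulation makes the comparison concrete. This is only a sketch: it assumes the setting of Example 1.3 referred to above (X1 , . . . , Xn IID Bernoulli(p) with p̂ = X̄), uses numpy, and the values of n, p and the number of repetitions are arbitrary.

# Compare the two unbiased estimators p-hat and p-tilde by simulation.
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 10, 0.3, 100_000
X = rng.binomial(1, p, size=(reps, n))

p_hat = X.mean(axis=1)                     # (X1 + ... + Xn)/n
p_tilde = (X[:, 0] + 2 * X[:, 1]) / 3      # ignores most of the data

print(p_hat.mean(), p_tilde.mean())        # both approximately 0.3 (unbiased)
print(p_hat.var(), p * (1 - p) / n)        # approximately 0.021
print(p_tilde.var(), 5 * p * (1 - p) / 9)  # approximately 0.117 (much larger)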
1.9 More important random variables
(a) We say that X is geometric with parameter p, if
P(X = k) = p(1 − p)^(k−1)  for k ∈ {1, 2, . . . }, and 0 otherwise.
Then E(X) = 1/p and var(X) = (1 − p)/p2. X is the number of the toss on which
we first observe a head if we toss a coin which shows heads with probability p.
(b) We say that X is exponential with rate λ, and write X ∼ E(λ), if
fX (x) = λ e^(−λx)  for x > 0, and 0 otherwise.
Then E(X) = 1/λ and var(X) = 1/λ².
The geometric and exponential distributions are discrete and continuous ana-
logues. They are the unique ‘memoryless’ distributions, in the sense that P(X ≥
t + s | X ≥ t) = P(X ≥ s). The exponential is the distribution of the time between
successive events of a Poisson process.
(c) We say that X is gamma(n, λ) if
fX (x) = λ^n x^(n−1) e^(−λx) / (n − 1)!  for x > 0, and 0 otherwise.
X has the distribution of the sum of n IID RVs that have distribution E(λ). So
E(λ) = gamma(1, λ). E(X) = n/λ and var(X) = n/λ².
This also makes sense for real n > 0 (and λ > 0), if we interpret (n − 1)! as Γ(n),
where Γ(n) = ∫_0^∞ x^(n−1) e^(−x) dx.
(d) We say that X is beta(a, b) if
fX (x) = x^(a−1) (1 − x)^(b−1) / B(a, b)  for 0 < x < 1, and 0 otherwise.
Here B(a, b) = Γ(a)Γ(b)/Γ(a + b). Then
E(X) = a/(a + b),   var(X) = ab / [ (a + b)²(a + b + 1) ].
The weak law of large numbers is that for any ε > 0,
P(|Sn /n − µ| > ε) → 0, as n → ∞ .
The strong law of large numbers is that
P(Sn /n → µ) = 1 .
A Poisson process of rate λ has arrival times
T1 , T1 + T2 , T1 + T2 + T3 , . . .
where T1 , T2, . . . are independent and each exponentially distributed with parameter
λ. Numbers of arrivals in disjoint intervals are independent RVs, and the number of
arrivals in any interval of length t has the P (λt) distribution. The time
Sn = T1 + T2 + · · · + Tn
of the nth arrival has the gamma(n, λ) distribution, and 2λSn ∼ χ²_{2n} .
2 Maximum likelihood estimation
When it is not in our power to follow what is true, we ought
to follow what is most probable. (Descartes)
lik(θ) = f (x | θ) .
Thus we are considering the density as a function of θ, for a fixed x. In the case
of multiple observations, i.e., when x = (x1, . . . , xn) is a vector of observed values
of X1 , . . . , Xn, we assume, unless otherwise stated, that X1 , . . . , Xn are IID; in this
case f (x1, . . . , xn | θ) is the product of the marginals,
lik(θ) = f (x1, . . . , xn | θ) = ∏_{i=1}^n f (xi | θ) .
Examples 2.1
(a) Smarties are sweets which come in k equally frequent colours. Suppose we do
not know k. We sequentially examine 3 Smarties and they are red, green, red. The
likelihood of this data, x = the second Smartie differs in colour from the first but the
third Smartie matches the colour of the first, is
lik(k) = p(x | k) = P(2nd differs from 1st) P(3rd matches 1st) = ((k − 1)/k)(1/k) = (k − 1)/k² ,
which equals 1/4, 2/9, 3/16 for k = 2, 3, 4, and continues to decrease for greater k.
Hence the maximum likelihood estimate is k̂ = 2.
Suppose a fourth Smartie is drawn and it is orange. Now
lik(k) = p(x | k) = ((k − 1)/k)(1/k)((k − 2)/k) = (k − 1)(k − 2)/k³ ,
which equals 2/27, 3/32, 12/125, 5/54 for k = 3, 4, 5, 6, and decreases thereafter.
Hence the maximum likelihood estimate is k̂ = 5. Note that although we have seen only 3
colours the maximum likelihood estimate is that there are 2 colours we have not yet
seen.
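The likelihood calculations above are easily reproduced; a minimal Python sketch (the upper limit of 50 on k is an arbitrary cut-off):

# Likelihoods for the Smarties data: (k-1)/k^2 for 'red, green, red',
# and (k-1)(k-2)/k^3 after the fourth (orange) Smartie.
from fractions import Fraction

def lik3(k):
    return Fraction(k - 1, k ** 2)

def lik4(k):
    return Fraction((k - 1) * (k - 2), k ** 3)

print(max(range(2, 50), key=lik3))    # 2
print(max(range(2, 50), key=lik4))    # 5
print(lik3(2), lik3(3), lik3(4))      # 1/4, 2/9, 3/16
print(lik4(5), float(lik4(5)))        # 12/125 = 0.096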
(b) X ∼ B(n, p), n known, p to be estimated.
Here
log p(x | n, p) = log [ (n choose x) p^x (1 − p)^(n−x) ] = · · · + x log p + (n − x) log(1 − p) .
This is maximized where
x/p̂ − (n − x)/(1 − p̂) = 0 ,
so the MLE of p is p̂ = X/n. Since E [X/n] = p the MLE is unbiased.
(c) X ∼ B(n, p), p known, n to be estimated.
Now we want to maximize
p(x | n, p) = (n choose x) p^x (1 − p)^(n−x)
with respect to n, n ∈ {x, x + 1, . . . }. To do this we look at the ratio
p(x | n + 1, p) / p(x | n, p) = [ (n+1 choose x) p^x (1 − p)^(n+1−x) ] / [ (n choose x) p^x (1 − p)^(n−x) ] = (1 − p)(n + 1) / (n + 1 − x) .
This is monotone decreasing in n. Thus p(x | n, p) is maximized by the least n for
which the above expression is ≤ 1, i.e., the least n such that
(1 − p)(n + 1) ≤ n + 1 − x ⇐⇒ n + 1 ≥ x/p ,
giving a MLE of n̂ = [X/p]. Note that if x/p happens to be an integer then both
n = x/p − 1 and n = x/p maximize p(x | n, p). Thus the MLE need not be unique.
(d) X1 , . . . , Xn ∼ geometric(p), p to be estimated.
Because the Xi are IID their joint density is the product of the marginals, so
log f (x1, . . . , xn | p) = log ∏_{i=1}^n (1 − p)^(xi − 1) p = ( Σ_i xi − n ) log(1 − p) + n log p ,
with a maximum where
− ( Σ_i xi − n ) / (1 − p̂) + n/p̂ = 0 .
So the MLE is p̂ = X̄ −1 . This MLE is biased. For example, in the case n = 1,
E [1/X1] = Σ_{x=1}^∞ (1/x) (1 − p)^(x−1) p = − ( p/(1 − p) ) log p > p .
Note that E [1/X1 ] does not equal 1/E X1 .
2.2 Sufficient statistics
The MLE, if it exists, is always a function of a sufficient statistic. The informal no-
tion of a sufficient statistic T = T (X1, . . . , Xn ) is that it summarises all information
in {X1, . . . , Xn } which is relevant to inference about θ.
Formally, the statistic T = T (X) is said to be sufficient for θ ∈ Θ if, for each
t, Pθ( X ∈ · | T (X) = t ) does not depend on θ. I.e., the conditional distribution of
X1 , . . . , Xn given T (X) = t does not involve θ. Thus to know more about x than
that T (x) = t is of no additional help in making any inference about θ.
Theorem 2.2 The statistic T is sufficient for θ if and only if f (x | θ) can be ex-
pressed as
f (x | θ) = g( T (x), θ ) h(x).
where by sufficiency the second factor does not depend on θ. So we identify the first
and second terms on the r.h.s. as g(t, θ) and h(x) respectively.
Examples 2.3
(a) X1 , . . . , Xn ∼ P (λ), λ to be estimated.
f (x | λ) = ∏_{i=1}^n { λ^(xi) e^(−λ) / xi! } = λ^(Σ_i xi) e^(−nλ) / ∏_i xi! .
So g( T (x), λ ) = λ^(Σ_i xi) e^(−nλ) and h(x) = 1 / ∏_i xi!. A sufficient statistic is t = Σ_i xi .
Note that the sufficient statistic is not unique. If T (X) is a sufficient statistic,
then so are statistics like T (X)/n and log T (X).
The MLE is found by maximizing f (x | λ), and so
(d/dλ) log f (x | λ) |_{λ=λ̂} = Σ_i xi / λ̂ − n = 0 .
3 The Rao-Blackwell theorem
Variance is what any two statisticians are at.
Example 3.1 Consider the estimators in Example 1.3. Each is unbiased, so its MSE
is just its variance.
var(p̂) = var( (1/n)(X1 + · · · + Xn) ) = ( var(X1) + · · · + var(Xn) ) / n² = np(1 − p)/n² = p(1 − p)/n
var(p̃) = var( (1/3)(X1 + 2X2) ) = ( var(X1) + 4 var(X2) ) / 9 = 5p(1 − p)/9
Not surprisingly, var(p̂) < var(p̃). In fact, var(p̂)/ var(p̃) → 0, as n → ∞.
Note that p̂ is the MLE of p. Another possible unbiased estimator would be
p∗ = (X1 + 2X2 + · · · + nXn) / ( ½ n(n + 1) )
with variance
var(p∗) = (1² + 2² + · · · + n²) p(1 − p) / ( ½ n(n + 1) )² = [ 2(2n + 1) / 3n(n + 1) ] p(1 − p) .
Example 3.2 Suppose X1, . . . , Xn ∼ N(µ, σ 2), µ and σ 2 unknown and to be esti-
mated. To find the MLEs we consider
log f (x | µ, σ²) = log ∏_{i=1}^n (2πσ²)^(−1/2) e^(−(xi − µ)²/2σ²) = − (n/2) log(2πσ²) − (1/2σ²) Σ_{i=1}^n (xi − µ)² .
and the MLEs are
µ̂ = X̄ = (1/n) Σ_{i=1}^n Xi ,   σ̂² = (1/n) SXX := (1/n) Σ_{i=1}^n (Xi − X̄)² .
This is minimized by λ = 1/(n + 1). Thus the estimator which minimizes the mean
squared error is SXX /(n + 1) and this is neither the MLE nor unbiased. Of course
there is little difference between any of these estimators when n is large.
Note that E [σ̂ 2 ] → σ 2 as n → ∞. So again the MLE is asymptotically unbiased.
E [θ∗ − θ]² = E [ E(θ̂ | T) − θ ]² = E [ E(θ̂ − θ | T) ]² ≤ E [ E( (θ̂ − θ)² | T ) ] = E (θ̂ − θ)² .
The outer expectation is being taken with respect to T . The inequality follows from
the fact that for any RV, W , var(W ) = E W 2 − (E W )2 ≥ 0. We put W = (θ̂ − θ | T )
and note that there is equality only if var(W ) = 0, i.e., θ̂ − θ can take just one value
for each value of T , or in other words, θ̂ is a function of T .
Note that if θ̂ is unbiased then θ∗ is also unbiased, since
E θ∗ = E [ E(θ̂ | T) ] = E θ̂ = θ .
We now have a quantitative rationale for basing estimators on sufficient statistics:
if an estimator is not a function of a sufficient statistic, then there is another estimator
which is a function of the sufficient statistic and which is at least as good, in the
sense of mean squared error of estimation.
Examples 3.4
(a) X1 , . . . , Xn ∼ P (λ), λ to be estimated.
In Example 2.3 (a) we saw that a sufficient statistic is Σ_i xi. Suppose we start
with the unbiased estimator λ̃ = X1 . Then ‘Rao–Blackwellization’ gives
λ∗ = E [ X1 | Σ_i Xi = t ] .
But
Σ_i E [ Xi | Σ_i Xi = t ] = E [ Σ_i Xi | Σ_i Xi = t ] = t .
By the fact that X1, . . . , Xn are IID, every term within the sum on the l.h.s. must
be the same, and hence equal to t/n. Thus we recover the estimator λ∗ = λ̂ = X̄.
(b) X1 , . . . , Xn ∼ P (λ), θ = e−λ to be estimated.
Now θ = P(X1 = 0). So a simple unbiased estimator is θ̂ = 1{X1 = 0}. Then
θ∗ = E [ 1{X1 = 0} | Σ_{i=1}^n Xi = t ] = P ( X1 = 0 | Σ_{i=1}^n Xi = t )
   = P ( X1 = 0 ; Σ_{i=2}^n Xi = t ) / P ( Σ_{i=1}^n Xi = t )
   = e^(−λ) [ ((n − 1)λ)^t e^(−(n−1)λ) / t! ] / [ (nλ)^t e^(−nλ) / t! ] = ( (n − 1)/n )^t .
Since θ̂ is unbiased, so is θ∗ . As it should be, θ∗ is only a function of t. If you do
Rao-Blackwellization and you do not get just a function of t then you have made a
mistake.
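A short simulation illustrates the variance reduction that Rao–Blackwellization achieves in this example. This is a sketch only; numpy is assumed, and n, λ and the number of repetitions are arbitrary choices.

# theta-hat = 1{X1 = 0} versus theta* = ((n-1)/n)^t, both estimating e^(-lambda).
import numpy as np

rng = np.random.default_rng(1)
n, lam, reps = 5, 2.0, 200_000
X = rng.poisson(lam, size=(reps, n))

theta_hat = (X[:, 0] == 0).astype(float)       # crude unbiased estimator
theta_star = ((n - 1) / n) ** X.sum(axis=1)    # Rao-Blackwellized estimator

print(np.exp(-lam))                            # true value, about 0.135
print(theta_hat.mean(), theta_star.mean())     # both close to 0.135 (unbiased)
print(theta_hat.var(), theta_star.var())       # theta* has much the smaller variance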
(c) X1 , . . . , Xn ∼ U [0, θ], θ to be estimated.
In Example 2.3 (c) we saw that a sufficient statistic is maxi xi . Suppose we start
with the unbiased estimator θ̃ = 2X1. Rao–Blackwellization gives
θ∗ = E [ 2X1 | maxi Xi = t ] = 2 [ (1/n) t + ((n − 1)/n)(t/2) ] = ((n + 1)/n) t .
This is an unbiased estimator of θ. In the above calculation we use the idea that
X1 = maxi Xi with probability 1/n, and if X1 is not the maximum then its expected
value is half the maximum. Note that the MLE θ̂ = maxi Xi is biased.
3.3 Consistency and asymptotic efficiency∗
Two further properties of maximum likelihood estimators are consistency and asymp-
totic efficiency. Suppose θ̂ is the MLE of θ.
To say that θ̂ is consistent means that for any ε > 0,
P(|θ̂ − θ| > ε) → 0 as n → ∞ .
It can be shown that var(θ̃) ≥ 1/nI(θ) for any unbiased estimate θ̃, where 1/nI(θ)
is called the Cramer-Rao lower bound. To say that θ̂ is asymptotically efficient
means that
lim_{n→∞} var(θ̂) / [ 1/nI(θ) ] = 1 .
Example 3.5 You and a friend have agreed to meet sometime just after 12 noon.
You have arrived at noon, have waited 5 minutes and your friend has not shown
up. You believe that either your friend will arrive at X minutes past 12, where you
believe X is exponentially distributed with an unknown parameter λ, λ > 0, or that
she has completely forgotten and will not show up at all. We can associate the latter
event with the parameter value λ = 0. Then
P(data | λ) = P(you wait at least 5 minutes | λ) = ∫_5^∞ λ e^(−λt) dt = e^(−5λ) .
Thus the maximum likelihood estimator for λ is λ̂ = 0. If you base your decision as
to whether or not you should wait a bit longer only upon the maximum likelihood
estimator of λ, then you will estimate that your friend will never arrive and decide
not to wait. This argument holds even if you have only waited 1 second.
The above analysis is unsatisfactory because we have not modelled the costs of
either waiting in vain, or deciding not to wait but then having the friend turn up.
4 Confidence intervals
Statisticians do it with 95% confidence.
Examples 4.1
Hence for a 95% confidence interval we would take −ξ = η = 1.96, as Φ(1.96) =
0.975. The 95% confidence interval is
[ X̄ − 1.96σ/√n , X̄ + 1.96σ/√n ]
For a 99% confidence interval, 1.96 would be replaced by 2.58, as Φ(2.58) = 0.995.
(b) If X1 , . . . , Xn ∼ N(µ, σ 2) independently, with µ and σ 2 both unknown, then
√n (X̄ − µ) / √( SXX /(n − 1) ) ∼ tn−1 ,
where tn−1 denotes the ‘Student’s t-distribution on n−1 degrees of freedom’ which
will be studied later. So if ξ and η are such that P(ξ ≤ tn−1 ≤ η) = γ, we have
P(µ,σ²) ( ξ ≤ √n (X̄ − µ) / √( SXX /(n − 1) ) ≤ η ) = γ ,
Again the choice of ξ and η is not unique, but it is natural to try to make the length
of the confidence interval as small as possible. The symmetry of the t-distribution
implies that we should choose ξ and η symmetrically about 0.
So a γ-level confidence interval for µ is [ X̄ − η √( SXX /(n(n − 1)) ) , X̄ − ξ √( SXX /(n(n − 1)) ) ].
Example 4.2 U.S. News and World Report (Dec 19, 1994) reported on a telephone
survey of 1,000 Americans, in which 59% said they believed the world would come
to an end, and of these 33% believed it would happen within a few years or decades.
Let us find a confidence interval for the proportion of Americans who believe the
end of the world is imminent. Firstly, p̂ = 0.59(0.33) = 0.195. The variance of
p̂ is p(1 − p)/590, which we estimate by (0.195)(0.805)/590 = 0.000266. Thus an
approximate 95% confidence interval is 0.195 ± √0.000266 (1.96), or [0.163, 0.226].
Note that this is only approximately a 95% confidence interval. We have used the
normal approximation, and we have approximated p(1 − p) by p̂(1 − p̂). These are
both good approximations and this is therefore a very commonly used analysis.
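The same calculation in Python (a sketch; up to rounding of p̂ it reproduces the interval quoted above):

# Approximate 95% confidence interval for the poll of Example 4.2.
import math

n = 590                                   # the 59% of 1,000 who answered 'yes'
p_hat = 0.59 * 0.33                       # about 0.195
se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
z = 1.96                                  # upper 2.5% point of N(0,1)
print(p_hat - z * se, p_hat + z * se)     # approximately (0.163, 0.227)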
For small populations the formula for the variance of p̂ depends on the total popula-
tion size N. E.g., if we are trying to estimate the proportion p of N = 200 students
in a lecture who support the Labour party and we take n = 200, so we sample them
all, then clearly var(p̂) = 0. If n = 190 the variance will be close to 0. In fact,
var(p̂) = [ (N − n)/(N − 1) ] · p(1 − p)/n .
Example 4.3 Suppose that X1 , . . . , Xn are IID E(θ). Then
f (x | θ) = ∏_{i=1}^n θ e^(−θxi) = θ^n e^(−θ Σ_i xi) ,
so T (X) = Σ_i Xi is sufficient for θ. Also, T ∼ gamma(n, θ) with pdf
fT (t) = θ^n t^(n−1) e^(−θt) /(n − 1)!,   t > 0.
Consider S = 2θT . Now P(S ≤ s) = P(T ≤ s/2θ), so by differentiation with respect
to s, we find the density of S to be
fS (s) = (1/2θ) fT (s/2θ) = (1/2θ) θ^n (s/2θ)^(n−1) e^(−θ(s/2θ)) / (n − 1)! = s^(n−1) (1/2)^n e^(−s/2) / (n − 1)! ,   s > 0.
So S = 2θT ∼ gamma(n, ½) ≡ χ²_{2n} .
Suppose we want a 95% confidence interval for the mean, 1/θ. We can write
P(η ≤ 2T θ ≤ ξ) = P (2T /ξ ≤ 1/θ ≤ 2T /η) = F2n(ξ) − F2n(η) ,
where F2n is the cdf of a χ22n RV.
For example, if n = 10 we refer to tables for the χ220 distribution and pick ξ = 34.17
and η = 9.59, so that F20(ξ) = 0.975, F20(η) = 0.025 and F20(ξ) − F20(η) = 0.95.
Then a 95% confidence interval for 1/θ is
[2t/34.17 , 2t/9.59 ] .
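The χ² quantiles 9.59 and 34.17 are easily recovered with scipy. In the sketch below the observed value t = 12.3 is a made-up number, used only to show how the interval would be formed.

# 95% confidence interval for 1/theta based on S = 2*theta*T ~ chi-squared(2n).
from scipy import stats

n, t = 10, 12.3                       # t = observed sum of the x_i (hypothetical)
lo = stats.chi2.ppf(0.025, 2 * n)     # about 9.59
hi = stats.chi2.ppf(0.975, 2 * n)     # about 34.17
print(lo, hi)
print(2 * t / hi, 2 * t / lo)         # the interval [2t/34.17, 2t/9.59]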
Along the same lines, a confidence interval for σ can be constructed in the cir-
cumstances of Example 4.1 (b) by using the fact that SXX /σ² ∼ χ²_{n−1} . E.g., if n = 21 a
95% confidence interval would be
[ √( SXX /34.17 ) , √( SXX /9.59 ) ] .
5 Bayesian estimation
Bayesians probably do it.
Bayesian statistics, (named after the Rev. Thomas Bayes, an amateur 18th century
mathematician), represents a different approach to statistical inference. Data are still
assumed to come from a distribution belonging to a known parametric family. How-
ever, whereas classical statistics considers the parameters to be fixed but unknown,
the Bayesian approach treats them as random variables in their own right. Prior
beliefs about θ are represented by the prior distribution, with a prior probability
density (or mass) function, p(θ). The posterior distribution has posterior density
(or mass) function, p(θ | x1 , . . . , xn), and captures our beliefs about θ after they have
been modified in the light of the observed data.
By Bayes’ celebrated formula,
p(θ | x1, . . . , xn) = f (x1, . . . , xn | θ) p(θ) / ∫ f (x1, . . . , xn | φ) p(φ) dφ .
The denominator of the above equation does not involve θ and so in practice is
usually not calculated. Bayes’ rule is often just written
p(θ | x1, . . . , xn) ∝ p(θ) f (x1, . . . , xn | θ) .
Example 5.1 Consider the Smarties example addressed in Example 2.1 (a) and
suppose our prior belief is that the number of colours is either 5, 6, 7 or 8, with prior
probabilities 1/10, 3/10, 3/10 and 3/10 respectively. On seeing the data x =‘red,
green, red’ we have f (x | k) = (k − 1)/k 2. Similarly, if the fourth Smartie is orange,
f (x | k) = (k − 1)(k − 2)/k 3. Then
There is very little modification of the prior. This analysis reveals, in a way that
the maximum likelihood approach did not, that the data obtained from looking at
just 4 Smarties is not very informative. However, as we sample more Smarties the
posterior distribution will come to concentrate on the true value of k.
5.2 Conditional pdfs
The discrete case
If X and Y are discrete RVs with joint pmf fX,Y , then we define
fX|Y (x | y) := P(X = x | Y = y) = P(X = x; Y = y) / P(Y = y) = fX,Y (x, y) / fY (y)
if fY (y) ≠ 0. We can safely define fX|Y (x | y) := 0 if fY (y) = 0. Of course,
fY (y) = P(Y = y) = Σ_x P(X = x; Y = y) = Σ_x fX,Y (x, y).
Example 5.2 Suppose that X and R are independent RVs, where X is Poisson with
parameter λ and R is Poisson with parameter µ. Let Y = X + R.
Then
fX|Y (x | y) = [ (λ^x e^(−λ)/x!) (µ^(y−x) e^(−µ)/(y − x)!) ] / [ Σ_{x,r : x+r=y} (λ^x e^(−λ)/x!) (µ^r e^(−µ)/r!) ]
   = [ (y!/(x!(y − x)!)) λ^x µ^(y−x) ] / [ Σ_{x,r : x+r=y} (y!/(x!r!)) λ^x µ^r ]
   = (y choose x) ( λ/(λ + µ) )^x ( µ/(λ + µ) )^(y−x) .
Hence (X | Y = y) ∼ B(y, p), where p = λ/(λ + µ).
The intuitive idea is: P(X ∈ dx | Y ∈ dy) = P(X ∈ dx; Y ∈ dy)/P(Y ∈ dy).
Examples 5.3
(a) A biased coin is tossed n times. Let xi be 1 or 0 as the ith toss is or is not a
head. Suppose we have no idea how biased the coin is, so we place a uniform prior
distribution on θ, to give a so-called ‘noninformative prior’ of
p(θ) = 1, 0 ≤ θ ≤ 1.
We would usually not bother with the denominator and just write
p(θ | x) ∝ θ^t (1 − θ)^(n−t) .
5.3 Estimation within Bayesian statistics
The Bayesian approach to the parameter estimation problem is to use a loss func-
tion L(θ, a) to measure the loss incurred by estimating the value of a parameter to
be a when its true value is θ. Then θ̂ is chosen to minimize E [L(θ, θ̂)], where this
expectation is taken over θ with respect to the posterior distribution p(θ | x).
The minimum is achieved when both integrals are equal to ½, i.e., by taking θ̂ to be
the posterior median.
Example 5.4 Let X1 , . . . , Xn ∼ P (λ), λ ∼ E(1) so that p(λ) = e−λ, λ ≥ 0.
The posterior distribution is
p(λ | x1, . . . , xn) = e^(−λ) ∏_{i=1}^n e^(−λ) λ^(xi) / xi! ∝ e^(−λ(n+1)) λ^(Σ xi) ,
i.e., gamma( Σ xi + 1, n + 1 ). So under quadratic error loss,
θ̂ = posterior mean = ( Σ_{i=1}^n xi + 1 ) / (n + 1) .
Under absolute error loss, θ̂ solves
∫_0^θ̂ e^(−λ(n+1)) λ^(Σ xi) (n + 1)^(Σ xi + 1) / (Σ xi)! dλ = ½ .
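A sketch of how these posterior summaries might be computed with scipy; the data values below are made up purely for illustration.

# Posterior gamma(sum x_i + 1, n + 1) and its mean and median.
from scipy import stats

x = [2, 0, 3, 1, 2]                   # hypothetical Poisson observations
n, s = len(x), sum(x)

posterior = stats.gamma(a=s + 1, scale=1 / (n + 1))
print(posterior.mean())               # (s + 1)/(n + 1): estimate under quadratic error loss
print(posterior.median())             # estimate under absolute error loss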
6 Hypothesis testing
Statistics is the only profession which demands the right to make mistakes 5 per
cent of the time – Thomas Huxley.
Example 6.1 It has been suggested that dying people may be able to postpone
their death until after an important occasion. In a study of 1919 people with Jewish
surnames it was found that 922 occurred in the week before Passover and 997 in the
week after. Is there any evidence in this data to reject the hypothesis that a person
is as likely to die in the week before as in the week after Passover?
Example 6.2 In one of his experiments, Mendel crossed 556 smooth, yellow male
peas with wrinkled, green female peas. Here is what he obtained and its comparison
with predictions based on genetic theory.
type observed predicted expected
count frequency count
smooth yellow 315 9/16 312.75
smooth green 108 3/16 104.25
wrinkled yellow 102 3/16 104.25
wrinkled green 31 1/16 34.75
Is there any evidence in this data to reject the hypothesis that theory is correct?
We follow here an approach developed by Neyman and Pearson. Suppose we have
data x = (x1, x2, . . . , xn) from a density f . We have two hypotheses about f . On
the basis of the data one is accepted, the other rejected. The two hypotheses have
different philosophical status. The first, called the null hypothesis, and denoted
by H0, is a conservative hypothesis, not to be rejected unless evidence is clear. The
second, the alternative hypothesis, denoted by H1 , specifies the kind of departure
from the null hypothesis of interest to us.
It is often assumed that f belongs to a specified parametric family f (· | θ) indexed
by a parameter θ ∈ Θ (e.g. N(θ, 1), B(n, θ)). We might then want to test a parametric
hypothesis
H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1
with Θ0 ∩ Θ1 = ∅. We may, or may not, have Θ0 ∪ Θ1 = Θ.
We will usually be concerned with testing a parametric hypothesis of this kind,
but alternatively, we may wish to test
H0 : f = f0 against H1 : f ≠ f0 , or
H0 : f = f0 against H1 : f = f1
where f0 and f1 are specified, but do not necessarily belong to the same family.
6.2 Terminology
A hypothesis which specifies f completely is called simple, e.g., θ = θ0. Otherwise,
a hypothesis is composite, e.g., θ > θ0.
Suppose we wish to test H0 against H1 . A test is defined by a critical region C.
We write C̄ for the complement of C.
If x = (x1, x2, . . . , xn) ∈ C then H0 is rejected, and if x ∈ C̄ then H0 is accepted (not rejected).
Note that when x ∈ C̄ we might sometimes prefer to say ‘not rejected’, rather
than ‘accepted’. This is a minor point which need not worry us, except to note
that sometimes ‘not rejected’ does more accurately express what we are doing: i.e.,
looking to see if the data provides any evidence to reject the null hypothesis. If it
does not, then we might want to consider other things before finally ‘accepting H0’.
There are two possible types of error we might make:
H0 might be rejected when it is true (a type I error), or
H0 might be accepted when it is false (a type II error).
Since H0 is conservative, a type I error is generally considered to be ‘more serious’
than a type II error. For example, the jury in a murder trial should take as its null
hypothesis that the accused is innocent, since the type I error (that an innocent
person is convicted and the true murderer is never caught) is more serious than the
type II error (that a murderer is acquitted).
Hence, we fix (an upper bound on) the probability of type I error, e.g., 0.05 or
0.01, and define the critical region C by minimizing the type II error subject to this.
If H0 is simple, Θ0 = {θ0}, the probability of a type I error is called the size,
or significance level, of the test. If H0 is composite, the size of the test is α =
sup_{θ∈Θ0} P(X ∈ C | θ).
The likelihood of a simple hypothesis H : θ = θ∗ given data x is
Lx (H) = fX (x | θ = θ∗ ).
If H is composite, H : θ ∈ Θ, we define
Lx(H) = sup_{θ∈Θ} fX (x | θ).
The likelihood ratio for two hypotheses H0 , H1 is
Lx(H0, H1) = Lx(H1) / Lx(H0) .
Notice that if T (x) is a sufficient statistic for θ then by the factorization criterion
Lx (H0, H1) is simply a function of T (x).
6.4 Single sample: testing a given mean, simple alternative, known vari-
ance (z-test)
Let X1 , . . . , Xn be IID N(µ, σ 2 ), where σ 2 is known. We wish to test H0 : µ = µ0
against H1 : µ = µ1 , where µ1 > µ0 .
The Neyman–Pearson test is to reject H0 if the likelihood ratio is large (i.e., greater
than some k). The likelihood ratio is
f (x | µ1 , σ²) / f (x | µ0 , σ²) = (2πσ²)^(−n/2) exp[ −Σ_{i=1}^n (xi − µ1)²/2σ² ] / ( (2πσ²)^(−n/2) exp[ −Σ_{i=1}^n (xi − µ0)²/2σ² ] )
   = exp [ Σ_{i=1}^n { (xi − µ0)² − (xi − µ1)² } / 2σ² ]
   = exp [ n { 2x̄(µ1 − µ0) + (µ0² − µ1²) } / 2σ² ] .
It often turns out, as here, that the likelihood ratio is a monotone function of a
sufficient statistic and we can immediately rephrase the critical region in more con-
venient terms. We notice that the likelihood ratio above is increasing in the sufficient
statistic x̄ (since µ1 − µ0 > 0). So the Neyman–Pearson test is equivalent to ‘reject
H0 if x̄ > c’, where we choose c so that P(X̄ > c | H0) = α. There is no need to try
to write c in terms of k.
However, under H0 the distribution of X̄ is N(µ0 , σ 2/n). This means that
Z = √n (X̄ − µ0)/σ ∼ N(0, 1) .
It is now convenient to rephrase the test in terms of Z, so that a test of size α is to
reject H0 if z > zα , where zα = Φ−1(1 − α) is the ‘upper α point of N(0, 1)’ i.e., the
point such that P(N(0, 1) > zα ) = α. E.g., for α = 0.05 we would take zα = 1.645,
since 5% of the standard normal distribution lies to the right of 1.645.
Because we reject H0 only if z is in the upper tail of the normal distribution we
call this a one-tailed test. We shall see other tests in which H0 is rejected if the
test statistic lies in either of two tails. Such a test is called a two-tailed test.
Example 6.4 Suppose X1 , . . . , Xn are IID N(µ, σ 2) as above, and we want a test
of size 0.05 of H0 : µ = 5 against H1 : µ = 6, with σ 2 = 1. Suppose the data is
x = (5.1, 5.5, 4.9, 5.3). Then x̄ = 5.2 and z = 2(5.2 − 5)/1 = 0.4. Since this is less
than 1.645 we do not reject µ = 5.
Suppose the hypotheses are reversed, so that we test H0 : µ = 6 against H1 : µ = 5.
The test statistic is now z = 2(5.2 − 6)/1 = −1.6 and we should reject H0 for values
of Z less than −1.645. Since z is more than −1.645, it is not significant and we do
not reject µ = 6.
This example demonstrates the preferential position given to H0 and therefore
that it is important to choose H0 in a way that makes sense in the context of the
decision problem with which the statistical analysis is concerned.
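The arithmetic of Example 6.4 is easily checked (a sketch; scipy is used only to recover the critical value 1.645).

# z-test of H0: mu = 5 against H1: mu = 6, sigma = 1 known.
import math
from scipy import stats

x = [5.1, 5.5, 4.9, 5.3]
n, mu0, sigma = len(x), 5, 1

xbar = sum(x) / n
z = math.sqrt(n) * (xbar - mu0) / sigma
z_alpha = stats.norm.ppf(0.95)             # 1.645
print(xbar, z, z > z_alpha)                # 5.2, 0.4, False: do not reject H0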
7 Further aspects of hypothesis testing
The significance level of a test is another name for α, the size of the test. For a
composite hypothesis H0 : θ ∈ Θ0 and rejection region C this is
α = sup_{θ∈Θ0} P(X ∈ C | θ) .
For a parametric hypothesis about θ ∈ Θ, we define the power function of the test
specified by the critical region C as
W (θ) = P(X ∈ C | θ) .
7.3 Uniformly most powerful tests
For the test of H0 : µ = µ0 against H1 : µ = µ1 , µ1 > µ0 described in section 6.4 the
critical region turned out to be
C(µ0) = { x : √n (x̄ − µ0)/σ > zα } .
This depends on µ0 but not on the specific value of µ1 . The test with this critical
region would be optimal for any alternative H1 : µ = µ1 , provided µ1 > µ0 . This is
the idea of a uniformly most powerful (UMP) test.
We can find the power function of this test.
W (µ) = P(Z > zα | µ)
   = P( √n (X̄ − µ0)/σ > zα | µ )
   = P( √n (X̄ − µ)/σ + √n (µ − µ0)/σ > zα | µ )
   = 1 − Φ( zα − √n (µ − µ0)/σ ) .
Note that W (µ) increases from 0 to 1 as µ goes from −∞ to ∞ and W (µ0) = α.
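A sketch evaluating this power function numerically (scipy is assumed; α = 0.05, and the values of n, µ0 and σ are arbitrary illustrative choices).

# W(mu) = 1 - Phi(z_alpha - sqrt(n)(mu - mu0)/sigma).
import math
from scipy import stats

n, mu0, sigma, alpha = 16, 0.0, 1.0, 0.05
z_alpha = stats.norm.ppf(1 - alpha)

def W(mu):
    return 1 - stats.norm.cdf(z_alpha - math.sqrt(n) * (mu - mu0) / sigma)

print(W(mu0))                      # equals alpha = 0.05
print(W(0.25), W(0.5), W(1.0))     # increases towards 1 as mu moves above mu0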
More generally, suppose H0 : θ ∈ Θ0 is to be tested against H1 : θ ∈ Θ1, where
Θ0 ∩ Θ1 = ∅. Suppose H1 is composite. H0 can be simple or composite. We want a
test of size α. So we require W (θ) ≤ α for all θ ∈ Θ0 , W (θ0) = α for some θ0 ∈ Θ0 .
A uniformly most powerful (UMP) test of size α satisfies (i) it is of size α, (ii)
W (θ) is as large as possible for every θ ∈ Θ1 .
UMP tests may not exist. However, likelihood ratio tests are often UMP.
Example 7.1 Let X1, . . . , Xn be IID N(µ, σ 2 ), where µ is known. Suppose H0 :
σ 2 ≤ 1 is to be tested against H1 : σ 2 > 1.
We begin by finding the most powerful test for testing H0′ : σ² = σ0² against
H1′ : σ² = σ1², where σ0² ≤ 1 < σ1². The Neyman–Pearson test rejects H0′ for large
values of the likelihood ratio:
f (x | µ, σ1²) / f (x | µ, σ0²) = (2πσ1²)^(−n/2) exp[ −Σ_{i=1}^n (xi − µ)²/2σ1² ] / ( (2πσ0²)^(−n/2) exp[ −Σ_{i=1}^n (xi − µ)²/2σ0² ] )
   = (σ0/σ1)^n exp[ ( 1/2σ0² − 1/2σ1² ) Σ_{i=1}^n (xi − µ)² ] ,
which is large when Σ_i (xi − µ)² is large. If σ² = 1 then
Σ_{i=1}^n (xi − µ)² ∼ χ²_n .
So a test of the form ‘reject H0 if T := Σ_i (xi − µ)² > F_α^(n)’, has size α, where F_α^(n)
is the upper α point of χ²_n. That is, P( T > F_α^(n) | σ² ) ≤ α for all σ² ≤ 1, with
equality for σ² = 1. But this test doesn’t depend on the value of σ1² , and hence is
the UMP test of H0 against H1.
Example 7.2 Consider Example 6.1. Let p be the probability that if a death occurs
in one of the two weeks either side of Passover it actually occurs in the week after
the Passover. Let us test H0 : p = 0.5 vs. H1 : p > 0.5.
The distribution of the number of deaths in the week after Passover, say X,
is B(n, p), which we can approximate by N( np, np(1 − p) ); under H0 this is
N(0.5n, 0.25n). So a size 0.05 test is to reject H0 if z = √n ( x̄ − µ0 )/σ > 1.645,
where here z = √1919 ( 997/1919 − 0.5 )/0.5 = 1.712. So the data is just significant
at the 5% level. We reject the hypothesis that p = 1/2.
It is important to realise that this does not say anything about why this might
be. It might be because people really are able to postpone their deaths to enjoy
the holiday. But it might also be that deaths increase after Passover because of
over-eating or stress during the holiday.
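The z statistic quoted above is easily checked:

# Normal approximation to B(1919, 0.5) for the Passover data.
import math

n, x = 1919, 997
z = math.sqrt(n) * (x / n - 0.5) / 0.5
print(z)          # about 1.71, just beyond the 5% critical value 1.645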
Theorem 7.3
(i) Suppose that for every θ0 there is a size α test of H0 : θ = θ0 against some
alternative. Denote the acceptance region by A(θ0). Then I(X) = {θ : X ∈
A(θ)} is a 100(1 − α)% confidence interval for θ.
(ii) Conversely, if I(X) is a 100(1−α)% confidence interval for θ then an acceptance
region for a size α test of H0 : θ = θ0 is A(θ0) = {X : θ0 ∈ I(X)}.
By assumption the l.h.s. is 1 − α in case (i) and the r.h.s. is 1 − α in case (ii).
This duality can be useful. In some circumstances it can be easier to see what
is the form of a hypothesis test and then work out a confidence interval. In other
circumstances it may be easier to see the form of the confidence interval.
In Example 4.1 we saw that a 95% confidence interval for µ based upon X1 , . . . , Xn
being IID samples from N(µ, σ 2), σ 2 known, is
[ X̄ − 1.96σ/√n , X̄ + 1.96σ/√n ] .
Thus H0 : µ = µ0 is rejected in a 5% level test against H1 : µ ≠ µ0 if and only if µ0
is not in this interval; i.e., if and only if
√n | X̄ − µ0 | / σ > 1.96 .
8 Generalized likelihood ratio tests
An approximate answer to the right problem is worth a good deal more than an
exact answer to an approximate problem.
Θ1 has k free parameters and Θ0 has k − p free parameters. We write |Θ1 | = k and
|Θ0 | = k − p. Then we have the following theorem (not to be proved.)
Theorem 8.1 Suppose Θ0 ⊂ Θ1 and |Θ1 |−|Θ0 | = p. Then under certain conditions,
as n → ∞ with X = (X1, . . . , Xn ) and Xi IID,
2 log LX (H0, H1) ∼ χ²_p ,
if H0 is true. If H0 is not true, 2 log LX tends to be larger. We reject H0 if 2 log Lx > c,
where α = P( χ²_p > c ), to give a test of size approximately α.
We say that 2 log LX (H0, H1) is asymptotically distributed as χ2p . The conditions
required by the theorem hold in all the circumstances we shall meet in this course.
Lemma 8.2 Suppose X1 , . . . , Xn are IID N(µ, σ 2). Then
(i) maxµ f (x | µ, σ²) = (2πσ²)^(−n/2) exp[ −Σ_i (xi − x̄)²/2σ² ].
(ii) maxσ² f (x | µ, σ²) = [ 2π Σ_i (xi − µ)²/n ]^(−n/2) exp[−n/2].
(iii) maxµ,σ² f (x | µ, σ²) = [ 2π Σ_i (xi − x̄)²/n ]^(−n/2) exp[−n/2].
Note that 2 log LX (H0, H1) = Z 2 ∼ χ21 . In this example H0 imposes p = 1
constraint on the parameter space and the approximation in Theorem 8.1 is exact.
8.4 Single sample: testing a given variance, known mean (χ2 -test)
As above, let X1, . . . , Xn be IID N(µ, σ 2), where µ is known. We wish to test
H0 : σ 2 = σ02 against H1 : σ 2 6= σ02 . The generalized likelihood ratio test suggests
that we should reject H0 if Lx (H0, H1) is large, where
Lx (H0, H1) = supσ² f (x | µ, σ²) / f (x | µ, σ0²)
   = [ 2π Σ_i (xi − µ)²/n ]^(−n/2) exp[−n/2] / ( (2πσ0²)^(−n/2) exp[ −Σ_i (xi − µ)²/2σ0² ] ) .
If we let t = Σ_i (xi − µ)² / nσ0² we find
2 log Lx (H0, H1) = n(t − 1 − log t) ,
which increases both as t increases from 1 and as t decreases from 1. Thus we should
reject H0 when the difference between t and 1 is large.
Again, this is not surprising, for under H0,
T = Σ_{i=1}^n (Xi − µ)² / σ0² ∼ χ²_n .
So a test of size α is the two-tailed test which rejects H0 if t > F^(n)_{α/2} or t < F^(n)_{1−α/2} ,
where F^(n)_{1−α/2} and F^(n)_{α/2} are the lower and upper α/2 points of χ²_n , i.e., the points such
that P( χ²_n < F^(n)_{1−α/2} ) = P( χ²_n > F^(n)_{α/2} ) = α/2.
So we should reject H0 if |x̄ − ȳ| is large. Now, X̄ ∼ N(µ1 , σ 2/m) and Ȳ ∼
N(µ2 , σ 2/n), and the samples are independent, so that, on H0 ,
X̄ − Ȳ ∼ N( 0, σ² (1/m + 1/n) )
or
Z = (X̄ − Ȳ) (1/m + 1/n)^(−1/2) (1/σ) ∼ N(0, 1).
A size α test is the two-tailed test which rejects H0 if z > zα/2 or if z < −zα/2 , where
zα/2 is, as in 8.3 the upper α/2 point of N(0, 1). Note that 2 log LX (H0 , H1) = Z 2 ∼
χ21 , so that for this case the approximation in Theorem 8.1 is again exact.
This is called a goodness-of-fit test, because we are testing whether our data
fit a particular distribution (in the above example the binomial distribution).
The distribution of (x1, . . . , xk ) is the multinomial distribution
P(x1 , . . . , xk | p) = ( n! / (x1! · · · xk!) ) p1^(x1) · · · pk^(xk) ,
for (x1, . . . , xk ) s.t. xi ∈ {0, . . . , n} and Σ_{i=1}^k xi = n. Then we have
sup_{H1} log f (x) = const + sup { Σ_{i=1}^k xi log pi : 0 ≤ pi ≤ 1, Σ_{i=1}^k pi = 1 } .
Now, Σ_i xi log pi may be maximised subject to Σ_i pi = 1 by a Lagrangian technique
and we get p̂i = xi /n. Likewise,
sup_{H0} log f (x) = const + sup_θ { Σ_{i=1}^k xi log pi(θ) } .
9 Chi-squared tests of categorical data
A statistician is someone who refuses to play the national lottery,
but who does eat British beef. (anonymous)
Recall
2 log Lx(H0 , H1) = 2 Σ_{i=1}^k xi log p̂i − 2 Σ_{i=1}^k xi log pi(θ̂) = 2 Σ_{i=1}^k xi log( p̂i / pi(θ̂) ) ,
i=1 i=1 i=1
where p̂i = xi/n and θ̂ is the MLE of θ under H0. Let oi = xi denote the number
of time that outcome i occurred and let ei = npi (θ̂) denote the expected number of
times it would occur under H0 . It is usual to display the data in k cells, writing oi
in cell i. Let δi = oi − ei . Then
X
k
2 log Lx (H0, H1) = 2 xi log (xi/n)/pi(θ̂)
i=1
Xk
=2 oi log(oi /ei)
i=1
Xk
=2 (δi + ei ) log(1 + δi /ei)
i=1
Xk
=2 (δi + ei )(δi/ei − δi2 /2e2i + · · ·)
i=1
X
k
+ δi2 /ei
i=1
Xk
(oi − ei )2
= (1)
i=1
ei
degrees of freedom is k − p − 1. Thus, if H0 is true the statistic (1) ∼ χ²_{k−p−1}
approximately. A mnemonic for the d.f. is
d.f. = #(cells) − #(parameters estimated) −1. (2)
Note that
Σ_{i=1}^k (oi − ei)²/ei = Σ_{i=1}^k ( oi²/ei − 2oi + ei ) = Σ_{i=1}^k oi²/ei − 2n + n = Σ_{i=1}^k oi²/ei − n .      (3)
Example 9.1 For the data from Mendel’s experiment, the test statistic has the value
0.618. This is to be compared to χ²_3 , for which the 10% and 95% points are 0.584
and 7.81. Thus we certainly do not reject the theoretical model. Indeed, we would
expect the observed counts to show even greater disparity from the theoretical model
about 90% of the time.
Similar analysis has been made of many of Mendel’s other experiments. The data
and theory turn out to be too close for comfort. Current thinking is that Mendel’s
theory is right but that his data were massaged by somebody (Fisher thought it was
Mendel’s gardening assistant) to improve its agreement with the theory.
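A numerical check of Example 9.1 (scipy assumed). The quoted 0.618 matches the likelihood-ratio form 2 Σ oi log(oi/ei); the Pearson form Σ (oi − ei)²/ei gives about 0.60, illustrating how close the two versions of the statistic are (cf. equation (1)).

# Goodness-of-fit test for Mendel's pea data.
import math
from scipy import stats

o = [315, 108, 102, 31]
n = sum(o)                                      # 556
e = [n * p for p in (9/16, 3/16, 3/16, 1/16)]   # 312.75, 104.25, 104.25, 34.75

lr = 2 * sum(oi * math.log(oi / ei) for oi, ei in zip(o, e))
pearson = sum((oi - ei) ** 2 / ei for oi, ei in zip(o, e))
print(lr, pearson)                    # about 0.618 and 0.604
print(1 - stats.chi2.cdf(lr, df=3))   # about 0.89: no evidence against the theory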
Suppose the row sums are fixed and the distribution of (Xi1, . . . , Xin ) in row i is
multinomial with probabilities (pi1, . . . , pin), independently of the other rows. We
want to test the hypothesis that the distribution in each row is the same, i.e., H0 : pij
is the same for all i, (= pj ) say, for each j = 1, . . . , n. The alternative hypothesis is
H1 : pij are unrestricted. We have
log f (x) = const + Σ_i Σ_j xij log pij ,  so that
sup_{H1} log f (x) = const + sup { Σ_{i=1}^m Σ_{j=1}^n xij log pij : 0 ≤ pij ≤ 1, Σ_{j=1}^n pij = 1 ∀i }
Now, Σ_j xij log pij may be maximized subject to Σ_j pij = 1 by a Lagrangian tech-
nique. The maximum of Σ_j xij log pij + λ( 1 − Σ_j pij ) occurs when xij /pij = λ,
∀j. Then the constraints give λ = Σ_j xij and the corresponding maximizing pij is
p̂ij = xij / Σ_j xij = xij /xi· . Hence,
sup_{H1} log f (x) = const + Σ_{i=1}^m Σ_{j=1}^n xij log(xij /xi·).
Likewise,
sup_{H0} log f (x) = const + sup { Σ_i Σ_j xij log pj : 0 ≤ pj ≤ 1, Σ_j pj = 1 }
   = const + Σ_i Σ_j xij log(x·j /x··).
Here p̂j = x·j /x··. Let oij = xij and write eij = p̂j xi· = (x·j /x··)xi· for the expected
number of items in position (i, j) under H0 . As before, let δij = oij − eij . Then,
2 log Lx(H0, H1) = 2 Σ_i Σ_j xij log( xij x·· / xi· x·j )
   = 2 Σ_i Σ_j oij log(oij /eij)
   = 2 Σ_i Σ_j (δij + eij) log(1 + δij /eij)
   ≈ Σ_i Σ_j δij²/eij
   = Σ_i Σ_j (oij − eij)²/eij .      (4)
For H0, we have (n−1) parameters to choose, for H1 we have m(n−1) parameters
to choose, so the degrees of freedom is (n − 1)(m − 1). Thus, if H0 is true the
statistic (4) ∼ χ²_{(n−1)(m−1)} approximately.
Example 9.2 The observed (and expected) counts for the study about aspirin and
heart attacks described in Example 1.2 are
Heart attack No heart attack Total
Aspirin 104 (146.52) 10,933 (10890.5) 11,037
Placebo 189 (146.48) 10,845 (10887.5) 11,034
Total 293 21,778 22,071
E.g., e11 = (293/22071) × 11037 = 146.52. The χ² statistic is
(104 − 146.52)²/146.52 + (189 − 146.48)²/146.48 + (10933 − 10890.5)²/10890.5 + (10845 − 10887.5)²/10887.5 = 25.01 .
The 95% point of χ²_1 is 3.84. Since 25.01 > 3.84, we reject the hypothesis that heart
attack rate is independent of whether the subject did or did not take aspirin.
Note that if there had been only a tenth as many subjects, but the same percent-
ages in each cell, the statistic would have been 2.501 and not significant.
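The same test can be reproduced with scipy's contingency-table routine (a sketch; correction=False gives the plain Pearson statistic computed above).

# 2x2 contingency-table test for the aspirin data.
from scipy import stats

table = [[104, 10933],
         [189, 10845]]
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2, dof)   # about 25.01 on 1 degree of freedom
print(p)           # far below 0.05, so independence is rejected
print(expected)    # [[146.52, 10890.48], [146.48, 10887.52]]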
We have now seen Pearson χ2 tests in three different settings. Such a test is
appropriate whenever the data can be viewed as numbers of times that certain out-
comes have occurred and we wish to test a hypothesis H0 about the probabilities
with which they occur. Any unknown parameter is estimated by maximizing the
likelihood function that pertains under H0 and ei is computed as the expected num-
ber of times outcome i occurs if that parameter is replaced by this MLE value. The
statistic is (1), where the sum is computed over all cells. The d.f. is given by (2).
10 Distributions of the sample mean and variance
Example 10.1 These are some Cambridge admissions statistics for 1996.
Women Men
applied accepted % applied accepted %
Computer Science 26 7 27 228 58 25
Economics 240 63 26 512 112 22
Engineering 164 52 32 972 252 26
Medicine 416 99 24 578 140 24
Veterinary medicine 338 53 16 180 22 12
Total 1184 274 23 2470 584 24
In all five subjects women have an equal or better success rate in applications than
do men. However, taken overall, 24% of men are successful but only 23% of women
are successful! This is called Simpson’s paradox (though it was actually discovered
by Yule 50 years earlier). It can often be found in real data. Of course it is not a
paradox. The explanation here is that women are more successful in each subject,
but tend to apply more for subjects that are hardest to get into (e.g., Veterinary
medicine). This example should be taken as a warning that pooling contingency
tables can produce spurious associations. The correct interpretation of this data is
that, for these five subjects, women are significantly more successful in gaining entry
than are men.
In order to produce an example of Simpson’s paradox I carefully selected five
subjects from tables of 1996 admissions statistics. Such ‘data snooping’ is cheating; a
similar table that reversed the roles of men and women could probably be constructed
by picking different subjects.
The rest of this lecture is aimed at proving some important facts about the distribution
of the statistics X̄ and SXX = Σ_i (Xi − X̄)², when X1 , . . . , Xn are IID N(µ, σ²). We
begin by reviewing some ideas about transforming random variables.
Suppose the joint density of X1, . . . , Xn is fX , and there is a 1–1 mapping between
X1 , . . . , Xn and Y1, . . . , Yn such that Xi = xi (Y1, . . . , Yn). Then the joint density of
Y1 , . . . , Yn is
fY (y1 , . . . , yn ) = fX (x1(y), . . . , xn(y)) J(y1, . . . , yn) ,
where the Jacobian J(y1, . . . , yn) is the absolute value of the determinant of the
matrix (∂xi(y)/∂yj ).
The following example is an important one, which also tells us more about the
beta distribution.
Example 10.2 Let X1 ∼ gamma(n1, λ) and X2 ∼ gamma(n2, λ), independently.
Let Y1 = X1 /(X1 + X2 ), Y2 = X1 + X2 . Since X1 and X2 are independent we
multiply their pdfs to get
fX (x) = [ λ^(n1) x1^(n1−1) e^(−λx1) / (n1 − 1)! ] × [ λ^(n2) x2^(n2−1) e^(−λx2) / (n2 − 1)! ] .
Then x1 = y1y2 , x2 = y2 − y1 y2 , so
J(y1, y2) = | y2 (1 − y1) + y1 y2 | = y2 .
Hence making the appropriate substitutions and arranging terms we get
fY (y) = [ (n1 + n2 − 1)! / ( (n1 − 1)!(n2 − 1)! ) ] y1^(n1−1) (1 − y1)^(n2−1) × [ λ^(n1+n2) y2^(n1+n2−1) e^(−λy2) / (n1 + n2 − 1)! ]
from which it follows that Y1 and Y2 are independent RVs (since their joint den-
sity function factors into marginal density functions) and Y1 ∼ beta(n1, n2), Y2 ∼
gamma(n1 + n2 , λ).
Since x = Aᵀy, we have ∂xi/∂yj = aji and hence J(y1, . . . , yn) = |det(Aᵀ)| = 1.
Thus
fY (y1, . . . , yn | µ, σ²) = (2πσ²)^(−n/2) exp[ −(Aᵀy − µ)ᵀ(Aᵀy − µ)/2σ² ]
   = (2πσ²)^(−n/2) exp[ −(Aᵀy − AᵀAµ)ᵀ(Aᵀy − AᵀAµ)/2σ² ]
   = (2πσ²)^(−n/2) exp[ −(y − Aµ)ᵀAAᵀ(y − Aµ)/2σ² ]
   = (2πσ²)^(−n/2) exp[ −(y − Aµ)ᵀ(y − Aµ)/2σ² ]
Proof.
(i) and (ii) are immediate from the fact that linear combinations of normal RVs
are normally distributed and the definition of χ2n . To prove (iii) and (iv) we note
that
Σ_{i=1}^n (Xi − µ)² = Σ_{i=1}^n ( [Xi − X̄] + [X̄ − µ] )²
   = Σ_{i=1}^n ( [Xi − X̄]² + 2[Xi − X̄][X̄ − µ] + [X̄ − µ]² )
   = SXX + n[X̄ − µ]² .
Let A be an orthogonal matrix such that
Y = A(X − µ1) = ( √n (X̄ − µ), Y2, . . . , Yn ) .
I.e., we take A to have first row ( 1/√n, 1/√n, . . . , 1/√n ),
where the rows below the first are chosen to make the matrix orthogonal. Then
Y1 = √n (X̄ − µ) ∼ N(0, σ²) and Y1 is independent of Y2, . . . , Yn. Since Σ_{i=1}^n Yi² = Σ_i (Xi − µ)², we must have
Σ_{i=2}^n Yi² = Σ_{i=1}^n (Xi − µ)² − n(X̄ − µ)² = SXX .
11 The t-test
Statisticians do it with two-tail T tests.
Example 11.1 In ‘Sexual activity and the lifespan of male fruitflies’, Nature, 1981,
Partridge and Farquhar report experiments which examined the cost of increased re-
production in terms of reduced longevity for male fruitflies. They kept numbers of
male flies under different conditions. 25 males in one group were each kept with 1
receptive virgin female. 25 males in another group were each kept with 1 female who
had recently mated. Such females will refuse to remate for several days. These served
as a control for any effect of competition with the male for food or space. The groups
were treated identically in number of anaesthetizations (using CO2) and provision of
fresh food.
To verify ‘compliance’ two days per week throughout the life of each experimental
male, the females that had been supplied as virgins to that male were kept and exam-
ined for fertile eggs. The insemination rate declined from approximately 1 per day at
age one week to about 0.6 per day at age eight weeks.
The data was as follows
the longevities of these 50 flies in order to compute confidence intervals or test
statistics.
From these summary statistics we can compute 95% confidence intervals for the
mean lives of the control and test groups to be
[ 64.80 − 2.06(15.6526)/√25 , 64.80 + 2.06(15.6526)/√25 ] = [58.35, 71.25]
[ 56.76 − 2.06(14.9284)/√25 , 56.76 + 2.06(14.9284)/√25 ] = [50.61, 62.91]
It is interesting to look at the data, and doing so helps us check that lifespan is
normally distributed about a mean. The longevities for control and test groups were
42 42 46 46 46 48 50 56 58 58 63 65 65 70 70 70 70 72 72 76 76 80 90 92 97
21 36 40 40 44 48 48 48 48 53 54 56 56 60 60 60 60 65 68 68 68 75 81 81 81
(Dot plots of the two samples of longevities, on a scale of 0 to 100 days.)
11.2 Single sample: testing a given mean, unknown variance (t-test)
Let X1 , . . . , Xn be IID N(µ, σ²), with µ and σ² both unknown, and consider testing
H0 : µ = µ0 against H1 : µ ≠ µ0 . The generalized likelihood ratio is
Lx (H0, H1) = [ ( Σ_i (xi − x̄)² + n(x̄ − µ0)² ) / Σ_i (xi − x̄)² ]^(n/2) = [ 1 + n(x̄ − µ0)² / Σ_i (xi − x̄)² ]^(n/2) .
This is large when T² := n(n − 1)(x̄ − µ0)² / Σ_i (xi − x̄)² is large, equivalently when
|T | is large. Under H0 we have T ∼ tn−1. So a size α test is the two-tailed test which
rejects H0 if t > t^(n−1)_{α/2} or if t < −t^(n−1)_{α/2} .
Example 11.2 Does jogging lead to a reduction in pulse rate? Eight non-jogging
volunteers engaged in a one-month jogging programme. Their pulses were taken before
and after the programme.
pulse rate before 74 86 98 102 78 84 79 70
pulse rate after 70 85 90 110 71 80 69 74
decrease 4 1 8 -8 7 4 10 -4
Although there are two sets of data it is really just the changes that matter. Let the
decreases in pulse rates be x1, . . . , x8 and assume these are samples from N(µ, σ 2)
for some unknown σ 2 . To test H0 : µ = 0 against H1 : µ 6= 0 we compute
Σ xi = 22,   x̄ = 2.75,   Σ xi² = 326,   Sxx = Σ xi² − 8x̄² = 265.5.
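From these quantities the test statistic is t = x̄ / √( Sxx /(n(n − 1)) ), referred to the t7 distribution; a minimal sketch of the computation (scipy assumed):

# Paired t-test for the jogging data, H0: mu = 0.
import math
from scipy import stats

d = [4, 1, 8, -8, 7, 4, 10, -4]
n = len(d)
xbar = sum(d) / n                             # 2.75
Sxx = sum(x ** 2 for x in d) - n * xbar ** 2  # 265.5
t = xbar / math.sqrt(Sxx / (n * (n - 1)))     # about 1.26
print(t, stats.t.ppf(0.975, n - 1))           # 1.26 < 2.36, so H0 is not rejected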
Example 11.3 For the fruitfly data we might test H0 : that mean longevity is the
same for males living with 1 interested female as with 1 uninterested female. The
test statistic is
t = (64.80 − 56.76) / √{ (1/25 + 1/25) [ 24(15.6525²) + 24(14.9284²) ] / (25 + 25 − 2) } = 1.859 ,
which can be compared to t^(48)_{0.025} = 2.01, and therefore is not significant at the 5%
level. H0 is not rejected. (It is, however, significant at the 10% level, since t^(48)_{0.05} = 1.68).
Similarly, we can give a 95% confidence interval for the difference of the means.
This has endpoints
(64.80 − 56.76) ± 2.01 √{ (1/25 + 1/25) [ 24(15.6525²) + 24(14.9284²) ] / (25 + 25 − 2) } = 8.04 ± 8.695.
I.e., a 95% confidence interval for the extra longevity of celibate males is
[−0.655, 16.735] days. Notice again that finding we cannot reject µ1 − µ2 = 0 at
the 5% level is equivalent to finding that the 95% confidence interval for the differ-
ence of the means contains 0.
In making the above test we have assumed that the variances for the two popula-
tions are the same. In the next lecture we will see how we might test that hypothesis.
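The two-sample t statistic of Example 11.3 can be recomputed from the summary statistics alone; a sketch (scipy assumed; the values 15.6525 and 14.9284 listed as s.e. in the table of Example 12.2 are used as the sample standard deviations):

# Two-sample t-test with a pooled variance estimate.
import math
from scipy import stats

m = n = 25
xbar, ybar = 64.80, 56.76
sx, sy = 15.6525, 14.9284

pooled_var = ((m - 1) * sx ** 2 + (n - 1) * sy ** 2) / (m + n - 2)
t = (xbar - ybar) / math.sqrt(pooled_var * (1 / m + 1 / n))
print(t)                               # about 1.859
print(stats.t.ppf(0.975, m + n - 2))   # about 2.01: not significant at the 5% level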
11.4 Single sample: testing a given variance, unknown mean (χ2 -test)
Let X1 , . . . , Xn be IID N(µ, σ 2), and suppose we wish to test H0 : σ 2 = σ02 against
H1 : σ² ≠ σ0², where µ is unknown, and therefore a ‘nuisance parameter’.
Following Theorem 8.1, the likelihood ratio is
Lx(H0, H1) = maxµ,σ² f (x | µ, σ²) / maxµ f (x | µ, σ0²)
   = [ 2π Σ_i (xi − x̄)²/n ]^(−n/2) exp[−n/2] / ( (2πσ0²)^(−n/2) exp[ −(1/2σ0²) Σ_i (xi − x̄)² ] ) .
As in Section 8.4 this is large when Σ_i (xi − x̄)² / nσ0² (= Sxx /nσ0²) differs substantially
from 1.
Under H0, SXX /σ02 ∼ χ2n−1. Given the required size of test α, let a1 , a2 be such
that
P(SXX /σ02 < a1 ) + P(SXX /σ02 > a2 ) = α
under H0 . Then a size α test is to reject H0 if Sxx /σ02 < a1 or if Sxx /σ02 > a2 .
Usually we would take a1 = F⁻¹_{n−1}(α/2), a2 = F⁻¹_{n−1}(1 − α/2), where Fn−1 is the
distribution function of a χ²_{n−1} random variable.
12 The F -test and analysis of variance
The statistician’s attitude to variation is like that of the evangelist to sin;
he sees it everywhere to a greater or lesser extent.
12.1 F -distribution
If X ∼ χ²_m and Y ∼ χ²_n, independently of X, then Z = (X/m)/(Y/n) has the
F -distribution with m and n degrees of freedom, written Z ∼ F_{m,n}.
Example 12.1 Suppose we wish to test the hypothesis that the variance of longevity
is the same for male fruitflies kept with 1 interested or 1 uninterested female, i.e.,
H0 : σ1² = σ2² against H1 : σ1² ≠ σ2². The test statistic is
f = (15.6525)²/(14.9284)² = 1.099,
which, as F^{(24,24)}_{0.05} = 1.98, is not significant at the 10% level (the test is two-tailed).
Notice that in order to use F tables we put the larger of σ̂1² and σ̂2² in the numerator.
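A sketch (not from the notes) reproducing the arithmetic of Example 12.1, with scipy
supplying the F quantile.

from scipy.stats import f

s1, s2, n1, n2 = 15.6525, 14.9284, 25, 25
fstat = max(s1, s2)**2 / min(s1, s2)**2          # larger estimated variance on top: 1.099
fcrit = f.ppf(0.95, dfn=n1 - 1, dfd=n2 - 1)      # F^(24,24)_{0.05}, about 1.98
print(fstat, fcrit, fstat > fcrit)               # not significant at the 10% level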
12.3 Non-central χ2
If X1, . . . , Xk are independent N(µi, 1) then Z = Σ_{i=1}^k X_i² has the non-central
chi-squared distribution, χ²_k(λ), with non-centrality parameter λ = Σ_{i=1}^k µ_i². Note
that E Z = k + λ; thus a non-central χ²_k tends to be larger than a central χ²_k.
To see that only the value of λ matters, let A be an orthogonal matrix such that
Aµ = (λ^{1/2}, 0, . . . , 0)^⊤, so (Aµ)^⊤(Aµ) = µ^⊤µ = λ. Let Y = AX; then
Σ_{i=1}^k X_i² = Σ_{i=1}^k Y_i², with Y_1² ∼ χ²_1(λ) and Σ_{i=2}^k Y_i² ∼ χ²_{k−1}.
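A short simulation (not from the notes, with made-up means) illustrating that E Z = k + λ:

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, 2.0])                   # k = 3, lambda = 1 + 4 + 4 = 9
z = (rng.normal(mu, 1.0, size=(100_000, 3))**2).sum(axis=1)
print(z.mean())                                  # close to k + lambda = 12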
where s2 := Σ_i n_i(x̄_{i·} − x̄_{··})², and thus s0/s1 is large when s2/s1 is large.
s1 is called the within samples sum of squares and s2 is called the between
samples sum of squares.
Now, whether or not H0 is true, Σ_j (X_{ij} − X̄_{i·})² ∼ σ²χ²_{n_i−1}, since E(X_{ij}) depends
only on i. Hence, S1 ∼ σ²χ²_{N−k}, since samples for different i are independent.
Also, Σ_j (X_{ij} − X̄_{i·})² is independent of X̄_{i·}, so that S1 is independent of S2. If H0
is true S2 ∼ σ²χ²_{k−1}, and if H0 is not true, S2 ∼ σ²χ²_{k−1}(λ), where
E(S2) = (k − 1)σ² + λ,   λ = Σ_{i=1}^k n_i(µ_i − µ̄)²,   µ̄ = Σ_i n_i µ_i /N.
Intuitively, if H0 is not true S2 tends to be inflated.
So, if H0 is true then Q = {S2/(k − 1)}/{S1/(N − k)} ∼ F_{k−1,N−k}, while if H0 is
not true, Q tends to be larger. So for a size α test we reject H0 if q > F^{(k−1,N−k)}_α.
An interpretation of this is that the variability in the total data set is
s0 = Σ_{ij} (x_{ij} − x̄_{··})². Under H1 we expect x_{ij} to be about x̄_{i·} and so a variability of
s2 = Σ_{ij} (x̄_{i·} − x̄_{··})² is ‘explained’ by H1. Statisticians say that H1 ‘explains (s2/s0)100%
of the variation in the data’, (where since s0 = s1 + s2, we must have s2/s0 ≤ 1.) If
s2/s0 is near 1, or equivalently if s2/s1 is large, then H1 does much better than H0
in explaining why the data has the variability it does.
Example 12.2 Partridge and Farquhar did experiments with five different groups of
25 male fruitflies. In addition to the groups kept with 1 interested or 1 uninterested
female, 25 males were each kept with no companions, and groups of 25 were each
kept with 8 uninterested or 8 interested females. The ‘compliance’ of the males who
were supplied with 8 virgin females per day varied from 7 inseminations per day at
age one week to just under 2 per day at age eight weeks.
Groups of 25 mean life s.e.
males kept with (days)
no companions 63.56 16.4522
1 uninterested female 64.80 15.6525
1 interested female 56.76 14.9284
8 uninterested females 63.36 14.5398
8 interested females 38.72 12.1021
Suppose we wish to test equality of means in the three control groups, i.e., those
kept with either no companions, or 1 or 8 uninterested females (rows 1, 2 and 4).
First we reconstruct the sums of squares,
Σ_{j=1}^{25} (x_{1j} − x̄_1)² = 24(16.4522²) = 6496.16
Σ_{j=1}^{25} (x_{2j} − x̄_2)² = 24(15.6525²) = 5880.00
Σ_{j=1}^{25} (x_{4j} − x̄_4)² = 24(14.5398²) = 5073.76
then we calculate the within and between sums of squares,
x̄ = (63.56 + 64.80 + 63.36)/3 = 63.91
s1 = 6496.16 + 5880.00 + 5073.76 = 17449.92
s2 = Σ_{i=1,2,4} 25(x̄_{i·} − x̄_{··})² = Σ_{i=1,2,4} 25 x̄_{i·}² − 75 x̄_{··}² = 30.427
These can be set out in an ANOVA table:
Source of variation    degrees of freedom    sum of squares      mean square    F statistic
Between groups         k − 1 =  2            s2       30.43          15.21      q = 0.0628
Within groups          N − k = 72            s1    17449.92         242.36
Total                  N − 1 = 74            s0    17480.35
The value of 0.0628 is not significant compared to F^{(2,72)}_{0.05} = 3.12 and hence we do
not reject the hypothesis of equal means.
A similar test for equality of all five group means gives a statistic with value 507.5,
to be compared to F^{(4,120)}_{0.05} = 2.45. Clearly we reject the hypothesis of equal means.
It does seem that sexual activity is associated with reduced longevity.
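A sketch (not from the notes) of the same one-way ANOVA computation, working only from
the group means, standard deviations and sample sizes in the table; the function name is
illustrative.

import numpy as np
from scipy.stats import f

def anova_from_summaries(means, sds, sizes, alpha=0.05):
    means, sds, sizes = map(np.asarray, (means, sds, sizes))
    k, N = len(means), sizes.sum()
    grand_mean = np.sum(sizes * means) / N
    s1 = np.sum((sizes - 1) * sds**2)                 # within-samples sum of squares
    s2 = np.sum(sizes * (means - grand_mean)**2)      # between-samples sum of squares
    q = (s2 / (k - 1)) / (s1 / (N - k))
    return q, f.ppf(1 - alpha, dfn=k - 1, dfd=N - k)

# The three 'control' groups (rows 1, 2 and 4 of the table):
print(anova_from_summaries([63.56, 64.80, 63.36],
                           [16.4522, 15.6525, 14.5398],
                           [25, 25, 25]))             # q ≈ 0.063 against F^(2,72)_{0.05} ≈ 3.12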
ANOVA can be carried out for many other experimental designs. We might want
to investigate more than one treatment possibility, or combinations of treatments.
(E.g., in the fruitfly experiments each male fly was kept separate from other males;
we might want to do experiments in which males are kept with different numbers
of interested females and/or competing males.) If there are k possible treatments
which can be applied or not applied, then 2k different combinations are possible and
this may be more than is realistic. The subject of ‘experimental design’ has to do
with deciding how to arrange the treatments so as to gather as much information
as possible from the fewest observations. The data is to be analysed to compare
treatment effects and this typically involves some sort of ANOVA. The methodology
is the same as for the one-way ANOVA considered above; we consider a normalised
quotient, such as q above, between the reduction in the residual sums of squares that
is obtained when moving from H0 to H1 (e.g., s0 − s1 ) and the value of the residual
sum of squares under H1 (e.g., s1 ). In subsequent lectures we will see further examples
of this idea in the context of regression models.
13 Linear regression and least squares
Numbers are like people; torture them enough and they’ll tell you anything.
with respect to α and β. These are called the least squares estimators and are
given by α̂ = Ȳ and β̂ = Σ_i w_i Y_i / Σ_i w_i².
Proof. Since Yi ∼ N(α + βwi, σ²) the likelihood of y1, . . . , yn is
fY (y | α, β, σ²) = (2πσ²)^{−n/2} exp[ −(1/2σ²) Σ_{i=1}^n (y_i − α − βw_i)² ] = (2πσ²)^{−n/2} e^{−S/2σ²}.
The maximum likelihood estimator minimizes S, and so at a minimum,
∂S/∂α |_{α=α̂, β=β̂} = −2 Σ_{i=1}^n (y_i − α̂ − β̂w_i) = 0,
∂S/∂β |_{α=α̂, β=β̂} = −2 Σ_{i=1}^n w_i (y_i − α̂ − β̂w_i) = 0.
Hence
Σ_{i=1}^n Y_i − nα̂ = 0   and   Σ_{i=1}^n w_i Y_i − β̂ Σ_{i=1}^n w_i² = 0,
from which the answers follow.
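A sketch (not from the notes) of the computation, assuming the w_i sum to zero (as the
proof uses); the function name and the tiny data set are illustrative only.

import numpy as np

def least_squares(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = x - x.mean()                        # centred regressor, so sum(w) = 0
    beta_hat = np.sum(w * y) / np.sum(w**2)
    alpha_hat = y.mean()
    residual_ss = np.sum((y - alpha_hat - beta_hat * w)**2)   # the quantity S at its minimum
    return alpha_hat, beta_hat, residual_ss

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up data
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
print(least_squares(x, y))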
country            mean life         people per        people per
                   expectancy, y     television, u     doctor, v
Argentina          70.5              4.0               370
Bangladesh         53.5              315.0             6166
Brazil             65.0              4.0               684
...
United Kingdom     76.0              3.0               611
United States      75.5              1.3               404
Venezuela          74.5              5.6               576
Vietnam            65.0              29.0              3096
Zaire              54.0              *                 23193
(Two scatter plots of the data, with life expectancy on the vertical axes.)
Let xi = log10 ui and consider fitting a regression of y against x. There is data for
38 countries (as television data for Zaire and Tanzania is missing). We compute the
following summary statistics
These give
β̂ = Sxy/Sxx = −9.808,   α̂ = ȳ − β̂x̄ = 77.887,   r = Sxy/(Sxx Syy)^{1/2} = −0.855.
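A sketch (not from the notes) of how β̂, α̂ and r are obtained from raw (x_i, y_i) data.
The full 38-country data set is not reproduced above, so the example below uses only the
countries listed in the table (omitting Zaire, whose television figure is missing); it
will not reproduce the numbers exactly, but it shows the mechanics.

import numpy as np

def fit_line(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.sum((x - x.mean())**2)
    syy = np.sum((y - y.mean())**2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    beta = sxy / sxx
    alpha = y.mean() - beta * x.mean()
    r = sxy / np.sqrt(sxx * syy)
    return alpha, beta, r

u = [4.0, 315.0, 4.0, 3.0, 1.3, 5.6, 29.0]         # people per television
y = [70.5, 53.5, 65.0, 76.0, 75.5, 74.5, 65.0]     # life expectancy
print(fit_line(np.log10(u), y))                    # negative slope and r, as for the full data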
appropriate. The third case is affected by the presence of an outlier and the fourth
case is really no more than a straight line fit through 2 points. The lesson is: plot
the data!
(Four scatter plots, one for each of the data sets below; the x-axis runs from 0 to 20
in each panel.)
The four data sets, as (x, y) pairs, are
(1) (10, 8.04) (8, 6.95) (13, 7.58) (9, 8.81) (11, 8.33) (14, 9.96) (6, 7.24) (4, 4.26) (12, 10.84) (7, 4.82) (5, 5.68)
(2) (10, 9.14) (8, 8.14) (13, 8.74) (9, 8.77) (11, 9.26) (14, 8.10) (6, 6.13) (4, 3.10) (12, 9.13) (7, 7.26) (5, 4.74)
(3) (10, 7.46) (8, 6.77) (13, 12.74) (9, 7.11) (11, 7.81) (14, 8.84) (6, 6.08) (4, 5.39) (12, 8.15) (7, 6.42) (5, 5.73)
(4) (8, 6.58) (8, 5.76) (8, 7.71) (8, 8.84) (8, 8.47) (8, 7.04) (8, 5.25) (19, 12.50) (8, 5.56) (8, 7.91) (8, 6.89)
14 Hypothesis tests in regression models
Statisticians do it with a little deviance.
Theorem 14.1
(i) α̂ = Ȳ is distributed as N(α, σ²/n);
(ii) β̂ is distributed as N(β, (w⊤w)⁻¹σ²), independently of α̂;
(iii) the residual sum of squares R, the minimised value of S, is distributed as
σ²χ²_{n−2}, independently of α̂ and β̂, and is equal to
R = Σ_j Y_j² − nȲ² − (w⊤w)β̂²;
Z1 = √n α̂          ∼ N(√n α, σ²)
Z2 = (w⊤w)^{1/2} β̂  ∼ N((w⊤w)^{1/2} β, σ²)
Z3 = ·              ∼ N(0, σ²)
  ...
Zn = ·              ∼ N(0, σ²)
and
Σ_{i=1}^n Z_i² = Σ_{i=1}^n Y_i²
              = ‖(Y − α̂1 − β̂w) + α̂1 + β̂w‖²
              = ‖Y − α̂1 − β̂w‖² + nα̂² + β̂²‖w‖²   (since all cross-product terms vanish)
              = R + nȲ² + (w⊤w)β̂².
So R = Σ_{i=3}^n Z_i² ∼ σ²χ²_{n−2} and is independent of Z1 and Z2, i.e., of α̂ and β̂.
and so H0 should be rejected if r² is near 1.
Note that the variation in the data is Syy = Σ_j (y_j − ȳ)². The regression model
‘explains’ variation of Σ_j (ŷ_j − ȳ)², where ŷ_i = α̂ + β̂x_i. One can check that
Σ_j (ŷ_j − ȳ)² = Sxy²/Sxx, and so the ratio of these is r². We say that ‘the regression
explains 100r²% of the variation in the data’.
if the means are indeed linearly related. Thus to test linearity we consider
F = [ m Σ_{i=1}^n (Ȳ_i − α̂ − β̂(x_i − x̄))² / (n − 2) ] / [ Σ_{i=1}^n Σ_{j=1}^m (Y_{ij} − Ȳ_i)² / n(m − 1) ] ∼ F_{n−2, n(m−1)},
if the model of linearity holds. We reject the hypothesis if f > F^{(n−2, n(m−1))}_α.
‘Length’ is the length of the fruitfly’s thorax. It turns out that longevity (y) is
positively correlated to thorax size (x) (as plots of the data show).
Suppose we consider only the data for rows 2 and 3 and adopt a model that for
i = 2, 3,
yij = ai + βxij,   j = 1, . . . , 25.
Let ā = ½(a2 + a3). Our model ‘explains’ the observed variation in longevity within
group i in terms of the sum of two effects: firstly, an effect due to thorax size, ā+βxij ;
secondly, an effect specific to group i, ai − ā. We would like to test
H0 : a2 = a3 against H1 : a2 ≠ a3.
To do this we need to fit the appropriate regression models under the two hy-
potheses by minimizing the residual sum of squares
S = Σ_{j=1}^{25} (y_{2j} − a2 − βx_{2j})² + Σ_{j=1}^{25} (y_{3j} − a3 − βx_{3j})².
Under H1 we minimize freely over a2 , a3 , β and get â2 = −46.04, â3 = −55.69,
β̂ = 134.25, with residual sum of squares R1 = 6962.90.
Under H0 we minimize subject to a2 = a3 and get â2 = â3 = −45.82, β̂ = 128.18,
with residual sum of squares R0 = 8118.39. We can write
R0 = (R0 − R1) + R1 .
The degrees of freedom of H0 and H1 are 2 and 3 respectively. It can be shown that
R1 ∼ σ 2 χ250−3, whether or not H0 is true. Also R1 and R0 − R1 are independent. If
H0 is true, then R0 − R1 ∼ σ 2χ23−2 . If H0 is not true then R0 − R1 is inflated.
As we have done previously for ANOVA in Section 12.4, we compute an F statistic
f = [ (R0 − R1)/(3 − 2) ] / [ R1/(50 − 3) ] = 7.80,
which upon comparison to F^{(1,47)}_{0.05} = 4.21 leads us to reject H0; there is indeed a
significant difference between the longevities of the males in the two groups. This is
the opposite to what we found with a t-test for equality of means in Example 11.3.
The explanation is that the mean thorax size happens to be greater within the group
of the males exposed to interested females. This is usually associated with greater
longevity. When we take into account the fact that this group did not show the
greater longevity that would be appropriate to its greater mean thorax size then we
do find a difference in longevities between males in this group and those in the group
that were kept with a nonreceptive female.
Thus we see that the analysis in Example 11.3 was deficient. There is a lesson in
this example, which might be compared to that in Simpson’s paradox.
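A sketch (not from the notes) of the nested-model F test used here. The raw longevity
and thorax measurements are not reproduced above, so the data below are simulated purely
to illustrate the mechanics; the design matrices encode a common slope with either
separate intercepts (H1) or a common intercept (H0).

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(2)
x2, x3 = rng.normal(0.80, 0.08, 25), rng.normal(0.84, 0.08, 25)   # made-up thorax lengths
y2 = -46 + 134 * x2 + rng.normal(0, 12, 25)                       # made-up longevities
y3 = -56 + 134 * x3 + rng.normal(0, 12, 25)

x = np.concatenate([x2, x3])
y = np.concatenate([y2, y3])
g = np.concatenate([np.zeros(25), np.ones(25)])     # group indicator

X1 = np.column_stack([1 - g, g, x])                 # H1: intercepts a2, a3, common slope
X0 = np.column_stack([np.ones(50), x])              # H0: common intercept, common slope

R1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0])**2)
R0 = np.sum((y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0])**2)

fstat = ((R0 - R1) / 1) / (R1 / (50 - 3))
print(fstat, f.ppf(0.95, dfn=1, dfd=47))            # compare with F^(1,47)_{0.05}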
15 Computational methods
Computers have freed statisticians from the grip of mathematical tractability.
(Two plots of standardised residuals: (a) against log people per television, and
(b) against thorax length (mm).)
We draw lines at ±1.96, the values between which samples from a N(0, 1) will
lie 95% of the time. In (a) the pattern of residuals is consistent with samples from
N(0, 1). In (b) it looks as though the magnitude of the errors might be increasing
with thorax length. This is known as ‘heteroscedasticity’. Perhaps a better model
would be εi ∼ N(0, σ²xi). This would suggest we try fitting, with ηi ∼ N(0, σ²):
yi/√xi = a/√xi + β√xi + ηi.
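A sketch (not from the notes) of this transformed fit on simulated data: dividing
through by √xi gives errors ηi of constant variance, and the model is then linear in the
two ‘regressors’ 1/√xi and √xi with no intercept.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.6, 1.0, 50)                        # e.g. thorax lengths (made up)
y = -50 + 130 * x + rng.normal(0, 3 * np.sqrt(x))    # errors with variance proportional to x

Z = np.column_stack([1 / np.sqrt(x), np.sqrt(x)])    # regressors for the transformed model
a_hat, beta_hat = np.linalg.lstsq(Z, y / np.sqrt(x), rcond=None)[0]
print(a_hat, beta_hat)                               # roughly recovers the true -50 and 130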
15.2 Discriminant analysis
A technique which would be impossible in practice without computer assistance is
the credit-scoring used by banks and others to screen potential customers.
Suppose a set of individuals {1, 2, . . . , n} can be divided into two disjoint sets, A
and B, of sizes nA and nB respectively. Those in set A are known good credit risks
and those in set B are known bad credit risks. For each individual we have measured
p variables which we believe to be related to credit risk. These might be years at
present address, annual income, age, etc. For the ith individual these are xi1, . . . , xip.
The question is: given measurements for a new individual, say x01, . . . , x0p, is that
individual more likely to be a good or bad credit risk? Is he more similar to the
people in group A or to those in group B?
One approach to this problem is to use least squares to fit a model
yi = β0 + β1 xi1 + · · · + βp xip + εi,
in which yi codes the group membership of individual i. The fitted value ŷ0 for the new
individual is then used to classify him as being in group A or group B according as ŷ0
is closer to (1/nA) Σ_{i∈A} ŷi or to (1/nB) Σ_{i∈B} ŷi. We do not go any further with the theory
here. The point is that this is a practically important application of statistics, but
a lot of calculation is required to find the discriminant function. Of course a mail
order company will experiment with building its discriminant function upon different
variables and doing this research is also computer-intensive.
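A sketch (not from the notes) of such a least-squares discriminant. The 0/1 coding of y,
the function names and the made-up two-variable measurements are all assumptions made
here purely for illustration.

import numpy as np

def fit_discriminant(XA, XB):
    X = np.vstack([XA, XB])
    y = np.concatenate([np.ones(len(XA)), np.zeros(len(XB))])   # assumed coding: 1 for A, 0 for B
    Xd = np.column_stack([np.ones(len(X)), X])                  # intercept column
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    yhat = Xd @ beta
    centres = (yhat[:len(XA)].mean(), yhat[len(XA):].mean())    # group means of fitted values
    return beta, centres

def classify(x0, beta, centres):
    y0 = beta[0] + x0 @ beta[1:]
    return "A" if abs(y0 - centres[0]) < abs(y0 - centres[1]) else "B"

rng = np.random.default_rng(4)
XA = rng.normal([5, 40], [1, 5], size=(30, 2))      # made-up measurements for good risks
XB = rng.normal([2, 25], [1, 5], size=(30, 2))      # made-up measurements for bad risks
beta, centres = fit_discriminant(XA, XB)
print(classify(np.array([4.5, 38.0]), beta, centres))   # should come out as "A"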
Other uses of discriminant analysis, (and related ideas of ‘cluster analysis’ when
there are more than two groups), include algorithms used in speech recognition and
in finance to pick investments for a portfolio.
finding that linear function of the variables with the greatest variance, i.e.,
maximize Σ_{i=1}^n [ (β1 xi1 + · · · + βp xip) − (β1 x̄1 + · · · + βp x̄p) ]²  subject to Σ_{i=1}^p βi² = 1
where x̄i is the mean of the ith variable within the population. Equivalently,
maximize β⊤Gβ subject to β⊤β = 1,
where G is the p × p matrix with Gjk = Σ_{i=1}^n (xij − x̄j)(xik − x̄k). By Lagrangian
methods we find that the maximum equals the largest eigenvalue of G, say λ1 , and is
achieved when β is the corresponding right hand eigenvector, say β^1 = (β_1^1, . . . , β_p^1)⊤.
We call β^1 the ‘first principal component’. Similarly, we can find the eigenvector
β^2 of G corresponding to the second largest eigenvalue, λ2. Continuing, we find
an orthogonal set of eigenvectors β^1, . . . , β^m, m < p, such that the proportion of
variance explained, i.e.,
Σ_{j=1}^m Σ_{i=1}^n [ (β_1^j x_{i1} + · · · + β_p^j x_{ip}) − (β_1^j x̄_1 + · · · + β_p^j x̄_p) ]²  /  Σ_{j=1}^p Σ_{i=1}^n (x_{ij} − x̄_j)²,
is near 1. This amounts to the same thing as Σ_{j=1}^m λ_j / Σ_{j=1}^p λ_j; indeed the
denominator above is trace(G) = Σ_{j=1}^p λ_j. The above ratio is also the proportion of
variation explained by using least squares to fit
x_{ij} = α_1^j z_{i1} + · · · + α_m^j z_{im} + ε_{ij},
when we take z_{ij} = β_1^j x_{i1} + · · · + β_p^j x_{ip}. Here z_{ij} is the ‘score of individual i on factor j’.
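A sketch (not from the notes) of these computations on made-up data, taking the
principal components from an eigendecomposition of G:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])   # made-up data, n = 100, p = 3

Xc = X - X.mean(axis=0)
G = Xc.T @ Xc                               # G_jk = sum_i (x_ij - xbar_j)(x_ik - xbar_k)
eigvals, eigvecs = np.linalg.eigh(G)        # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

beta1 = eigvecs[:, 0]                       # first principal component
scores = X @ beta1                          # scores of the individuals on factor 1
explained = eigvals[0] / eigvals.sum()      # lambda_1 / trace(G)
print(beta1, round(float(explained), 3))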
The final step is to try to give some natural interpretation to the factors,
z1 , . . . , zm . For example, if we observe that the components of β 1 which are large in
magnitude seem to match up with components of x which have something to do with
whether or not an individual is extroverted, and other components of β 1 are near 0,
then we might interpret factor 1 as an ‘extroversion factor’. Then if zi1, the score of
individual i on this factor, is large and positive we could say that i is extroverted,
and if large and negative that i is introverted.
To be fair, we should say that things are rarely so simple in practice and that
many statisticians are dubious about the value of factor analysis. For one thing, the
factors depend on the relative units in which the variables are measured.
Nevertheless, here is a simple illustration for p = 2, m = 1, n = 8. Suppose 8
students are scored on two tests, one consisting of verbal puzzles and the other of
maths puzzles; the ith student scores (xi1, xi2). The first principal component is a
line through the data which minimizes the sum of squared differences between the
data points and their orthogonal projections onto this line. A reasonable name for
this component might be ‘IQ’. The ‘IQ’ of student i is z_{i1} = β_1^1 x_{i1} + β_2^1 x_{i2}.
(Scatter plot of verbal score against maths score, axes from 50 to 90, with the eight
students labelled 1–8.)
student    maths score    verbal score    IQ factor    mathmo factor
   1           85              80           116.1          12.1
   2           77              62            97.2          17.8
   3           75              75           105.8           7.8
   4           70              65            94.9          10.5
   5           67              50            81.6          18.1
   6           63              69            93.4           2.6
   7           60              62            86.1           4.9
   8           55              49            73.0           9.6
For the data above, z1 = 0.653x1 + 0.757x2. The proportion of variation explained
is θ̂ = t(x) = 0.86. A bootstrap estimate with B = 240 gives σ̂_{θ̂} = 0.094.
Formalisation of the bootstrap method dates from 1979; the study of its use for
constructing estimators, tests and confidence intervals is an active area of research.
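A sketch (not from the notes) of how such a bootstrap estimate is formed: the statistic
t is recomputed on B resamples drawn with replacement from the data, and the standard
deviation of these replicates estimates the standard error of θ̂. The data below are
made up; only the idea of resampling is taken from the text.

import numpy as np

def bootstrap_se(data, statistic, B=240, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)]
    return np.std(reps, ddof=1)

def prop_explained(X):
    """Proportion of variance explained by the first principal component."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc)
    return eigvals[-1] / eigvals.sum()

rng = np.random.default_rng(6)
marks = rng.multivariate_normal([70, 65], [[80, 60], [60, 90]], size=8)   # made-up scores
print(prop_explained(marks), bootstrap_se(marks, prop_explained))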
16 Decision theory
To guess is cheap, to guess wrongly is expensive. (Old Chinese proverb)
Example 16.1 In Nature (29 August, 1996, p. 766) Matthews gives the following
table for various outcomes of Meteorological Office forecasts of weather covering 1000
one-hour walks in London.
Rain No rain Sum
Forecast of rain 66 156 222
Forecast of no rain 14 764 778
Sum 80 920 1000
Should one pay any attention to weather forecasts when deciding whether or not
to carry an umbrella?
To analyse this question in a decision-theoretic way, let W , F and U be respec-
tively the events that it is going to rain (be wet), that rain has been forecast, and
that we carry an umbrella. The possible states of nature are W and W c . The data is
X = F or X = F c . Possible actions are chosen from the set A = {U, U c }. We might
present the loss function as
        W^c     W
U^c     L00     L01
U       L10     L11
For example, we might take L01 = 4, L11 = 2, L10 = 1, L00 = 0. Of course these
are subjective choices, but most people would probably rank the four outcomes this
way.
One possible decision function is given by d1(X) = U^c, i.e., never carry an umbrella.
Its risk function is
R(W^c, d1) = L00;   R(W, d1) = L01.
Another possible decision function is given by d2 (F ) = U and d2(F c ) = U c , i.e., carry
an umbrella if and only if rain is forecast. The risk function is
R(W c , d2) = (764/920)L00 + (156/920)L10; R(W, d2) = (66/80)L11 + (14/80)L01 .
We see that if θ = W c then d1 is better, but if θ = W then d2 is better. Thus neither
rule is uniformly better for both states of nature. Both d1 and d2 are admissible. By
averaging over the states of nature we have the so-called Bayes risk, defined as
B(d) = E [R(θ, d)],
where the expected value is now taken over θ. For example, in our problem, P(W ) =
0.08 and P(W c ) = 0.92, so B(d) = 0.08R(W, d) + 0.92R(W c, d).
The Bayes rule is defined as the rule d which minimizes the Bayes risk. Thus to
find the Bayes rule for our problem, we must compare
B(d1) = .08 L01 + .92 L00
to
B(d2) = .08 [(66/80)L11 + (14/80)L01] + .92 [(764/920)L00 + (156/920)L10]
      = .066 L11 + .014 L01 + .764 L00 + .156 L10.
It follows that it is better to ignore weather forecasts and simply go for walks without
an umbrella, if B(d1) < B(d2), i.e., if
∆ := (L01 − L11)/(L10 − L00) < 156/66 ≈ 2.36,
which can hold for reasonable values of the loss function, such as those given above,
for which ∆ = 2. It all depends how you feel about getting wet versus the inconve-
nience of carrying an umbrella. Similar analysis shows that the commonly followed
rule of always carrying an umbrella is better than carrying one only when rain is
forecast only if one is very averse to getting wet, i.e., if ∆ > 764/14 ≈ 54.6.
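A quick check of this arithmetic (a sketch, not from the notes), including for
comparison the Bayes risk of always carrying an umbrella:

L00, L01, L10, L11 = 0.0, 4.0, 1.0, 2.0     # the loss values suggested above

B_never = 0.08 * L01 + 0.92 * L00                               # d1: never carry
B_follow = (0.08 * ((66/80) * L11 + (14/80) * L01)
            + 0.92 * ((764/920) * L00 + (156/920) * L10))       # d2: carry iff rain forecast
B_always = 0.08 * L11 + 0.92 * L10                              # always carry

print(B_never, B_follow, B_always)          # 0.32, about 0.34, 1.08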
Not surprisingly, this leads to exactly the same criterion for choosing between d1 and
d2 as we have already found above.
This is a general principle: the Bayes rule, d, can be determined as the action a
which minimizes E θ|X [R(θ, a)], this expectation being taken over θ with respect to
the posterior distribution p(θ | x).
16.3 Hypothesis testing as decision making
We conclude by elucidating a decision theoretic approach to hypothesis testing. Con-
sider the problem of testing a simple null hypothesis H0 : θ = θ0 against a simple
alternative hypothesis H1 : θ = θ1 . On the basis of an observation X we must decide
in favour of H0 (i.e., take action a0 ) or decide in favour of H1 (i.e., take action a1 ).
For the case of so-called 0–1 loss we take L(θ0, a0 ) = L(θ1, a1 ) = 0 and L(θ0, a1) =
L(θ1, a0) = 1. I.e., there is unit loss if and only if we make the wrong decision.
The risk function is then simply the probability of making the wrong decision, so
R(θ0, d) = P(d(X) = a1 | H0) and R(θ1, d) = P(d(X) = a0 | H1).
Suppose we have prior probabilities on H0 and H1 of p0 and p1 respectively. This
gives Bayes risk of
B(d) = p0 R(θ0, d) + p1 R(θ1, d) = p0 P(d(X) = a1 | H0) + p1 P(d(X) = a0 | H1).
As we have seen in the previous section the Bayes rule minimizes the posterior
losses, so we should choose d(X) to be a1 or a0 as
B(a0 | x) / B(a1 | x) = P(H1 | x) / P(H0 | x) = [p1 P(x | H1)] / [p0 P(x | H0)] = [p1 f(x | θ1)] / [p0 f(x | θ0)]
is greater or less than 1.
This is of course simply a likelihood ratio test. Observe, however, that we have
reached this form of test by a rather different route than in Lecture 6.