EC501 Econometric Methods
2. Linear Regression: Statistical Properties
Marcus Chambers
Department of Economics
University of Essex
19 October 2023
Outline
Review
The linear regression model: assumptions
Statistical properties of OLS: small N
Statistical properties of OLS: large N
Reference: Verbeek, chapter 2.
Review
We motivated the ordinary least squares (OLS) estimator by
choosing a linear combination of the regressors that provides a
‘good’ approximation of the dependent variable.
Our measure of ‘good’ was in terms of the sum of squared
residuals, where the residual for observation i is
ei = yi − β̃1 − β̃2 xi2 − . . . − β̃K xiK , i = 1, . . . , N.
The OLS estimator is obtained as b = arg minβ̃ S(β̃) where
S(β̃) = ∑_{i=1}^{N} e_i² = e′e = (y − Xβ̃)′(y − Xβ̃).
The result is
b = (X ′ X)−1 X ′ y.
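As a quick numerical check (not part of the original slides), the formula can be computed directly in R and compared with lm(); the data below are simulated and all names are illustrative.

set.seed(123)
N <- 200
x2 <- rnorm(N); x3 <- runif(N)
y  <- 1 + 0.5 * x2 - 2 * x3 + rnorm(N)        # simulated dependent variable
X  <- cbind(1, x2, x3)                        # N x K regressor matrix (first column = intercept)
b  <- solve(t(X) %*% X, t(X) %*% y)           # b = (X'X)^{-1} X'y
cbind(matrix_formula = b, lm = coef(lm(y ~ x2 + x3)))   # the two sets of estimates coincide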
Properties of b?
But: what are the (statistical) properties of b?
To answer this question we need to move beyond thinking of
OLS in a purely algebraic sense.
Instead of describing the properties of a given sample we shall
think in terms of a statistical model relating y to x2 , . . . , xK .
We try to learn something about this relationship from our
observed sample.
The statistical properties (assumptions) of the model then
determine the statistical properties of b.
The linear regression model
The linear regression model takes the form
yi = β1 + β2 xi2 + . . . + βK xiK + ϵi or yi = xi′ β + ϵi ,
where ϵi is an error term or disturbance.
This is a population relationship between y and x and is
assumed to hold for any possible observation.
Our goal is to estimate the population parameters, β1 , . . . , βK ,
based on our sample, (yi , xi ; i = 1, . . . , N).
We regard yi and ϵi (and usually xi ) as random variables that
are part of a sample derived from the population.
Recall that we can write the model in matrix form as
y = Xβ + ϵ,    (1)
where y is N × 1, X is N × K, β is K × 1 and ϵ is N × 1.
Random sampling
The origins of the linear regression model lie in the sciences
where the xi variables are determined in a laboratory setting.
The xi variables are fixed in repeated samples so that the only
source of randomness is ϵi leading to different values for yi
across samples.
This can be hard to justify in Economics where it is more
common to regard both xi and ϵi as changing across samples.
This leads to different observed values of yi and xi each time a
new sample is drawn.
A random sample implies that each observation, (yi , xi ), is an
independent drawing from the population.
We will use this idea as a basis for a set of statistical
assumptions.
Assumptions
Our assumptions concern the linear model
yi = xi′ β + ϵi , i = 1, . . . , N.
The Gauss-Markov conditions are:
E{ϵi } = 0, i = 1, . . . , N; (A1)
{ϵ1 , . . . , ϵN } and {x1 , . . . , xN } are independent; (A2)
V{ϵi } = σ 2 , i = 1, . . . , N; (A3)
cov{ϵi , ϵj } = 0, i, j = 1, . . . , N, i ̸= j. (A4)
Note that we also need N > K and X ′ X to be invertible – here
we need X to have rank K i.e. the columns of X are linearly
independent (M4).
What do these conditions imply?
Assumptions A1, A3 and A4
Assumption (A1) suggests that the regression line holds on
average (more on this shortly).
Assumption (A3) states that all disturbances have the same
variance; this is known as homoskedasticity and rules out
heteroskedasticity (non-constant variances), which we shall
deal with later.
Assumption (A4) tells us that all pairs, ϵi and ϵj , are
uncorrelated (this is essentially just random sampling), thereby
ruling out autocorrelation.
In terms of the N × 1 vector ϵ, these assumptions imply (see
S12) that
E{ϵ} = 0 (N × 1) and V{ϵ} = σ 2 IN (N × N),
where IN is the N × N identity matrix.
Assumption A2
Under assumption (A2) the matrix X and vector ϵ are
independent.
This means that knowledge of X tells us nothing about the
distribution of ϵ (and vice versa).
It implies that
E{ϵ|X} = E{ϵ} = 0 and V{ϵ|X} = V{ϵ} = σ 2 IN .
Under (A1) and (A2) the linear regression model is a model for
the conditional mean of yi , because
E{yi |xi } = E{xi′ β + ϵi |xi } = xi′ β + E{ϵi |xi } = xi′ β
in view of E{ϵi |xi } = 0.
Assumptions (A1)–(A4) jointly determine the properties of b.
Small N
We shall begin by taking the sample size, N, to be a finite
number (but recall N > K).
First, note that the OLS vector b is a linear function of y:
b = (X ′ X)−1 X ′ y
= (X ′ X)−1 X ′ (Xβ + ϵ) (using y = Xβ + ϵ)
= (X ′ X)−1 X ′ Xβ + (X ′ X)−1 X ′ ϵ
= β + (X ′ X)−1 X ′ ϵ (because (X ′ X)−1 X ′ X = IK ).
It is, therefore, also a linear function of the unobservable
random vector ϵ.
E{b}
The expected value of b is
E{b} = E{β + (X ′ X)−1 X ′ ϵ} = β + E{(X ′ X)−1 X ′ ϵ}.
But
E{(X ′ X)−1 X ′ ϵ} = E{(X ′ X)−1 X ′ }E{ϵ} = 0,
since the expectation factorises by (A2) and E{ϵ} = 0 by (A1).
Hence E{b} = β and the OLS estimator b is said to be an
unbiased estimator of β.
In repeated sampling the OLS estimator will be equal to β ‘on
average.’
Note that unbiasedness does not require (A3) and (A4).
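A small simulation sketch (not from the slides; all values are illustrative) makes the ‘on average’ statement concrete: holding X fixed and redrawing ϵ many times, the average of b across replications is close to β.

set.seed(1)
N <- 100; beta <- c(2, -1)
X <- cbind(1, rnorm(N))                        # regressors held fixed across replications
b_reps <- replicate(5000, {
  eps <- rnorm(N, sd = 2)                      # disturbances satisfying (A1)-(A4)
  y   <- X %*% beta + eps
  drop(solve(t(X) %*% X, t(X) %*% y))          # OLS estimate for this replication
})
rowMeans(b_reps)                               # close to beta = c(2, -1)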
V{b|X}
The conditional covariance matrix of b is:
V{b|X} = E{(b − β)(b − β)′ |X}
= E{(X ′ X)−1 X ′ ϵϵ′ X(X ′ X)−1 |X}
= (X ′ X)−1 X ′ E{ϵϵ′ |X}X(X ′ X)−1
= σ 2 (X ′ X)−1 X ′ X(X ′ X)−1 as E{ϵϵ′ |X} = σ 2 IN
= σ 2 (X ′ X)−1 .
We will denote this as V{b} = σ 2 (X ′ X)−1 for convenience.
The unconditional variance matrix is actually
V{b} = σ²E{(X′X)⁻¹},
which is rather more complicated!
Gauss-Markov Theorem
Clearly OLS is a Linear Unbiased Estimator (LUE).
But how does OLS compare to other LUEs?
Gauss-Markov Theorem
Under Assumptions (A1)–(A4), the OLS estimator b of β is the
Best Linear Unbiased Estimator (BLUE) in the sense that it has
minimum variance within the class of LUEs.
What does this mean?
Take any other LUE, call it b̃; then
V{b̃|X} ≥ V{b|X}
in the sense that the matrix V{b̃|X} − V{b|X} is positive
semi-definite; see (M10).
Normality of ϵ
Sometimes it is appropriate to actually specify the distribution
of the random disturbance vector ϵ.
A common assumption, that incorporates (A1), (A3) and (A4),
is:
ϵ ∼ N(0, σ 2 IN ). (A5)
This is equivalent to
ϵi ∼ NID(0, σ 2 ), (A5′ )
where NID denotes ‘normally and independently distributed.’
This also implies that yi ∼ NID(xi′ β, σ²) (conditional on X), which
is not always appropriate.
Normality of b
Under (A2) and (A5) it follows that
b ∼ N(β, σ 2 (X ′ X)−1 )
because b is linear in ϵ.
Each element of b is also normally distributed:
bk ∼ N(βk , σ 2 ckk ), k = 1, . . . , K,
where ckk denotes the (k, k) (diagonal) element of (X ′ X)−1 .
These results motivate statistical tests based on b but, in
practice, we don’t know σ 2 .
We therefore estimate σ 2 using the data – how do we do this?
Estimation of σ 2 = V{ϵi }
We usually estimate variances by sample averages but ϵi is
unobserved.
Instead we can base an estimator on the residuals:
s² = (1/(N − K)) ∑_{i=1}^{N} e_i².
This estimator is unbiased (i.e. E{s2 } = σ 2 ).
Note the degrees of freedom adjustment – the denominator is
N − K rather than N − 1.
This is because we have estimated K parameters in order to
obtain the residuals (ei = yi − xi′ b).
The estimated variance matrix of b is then
V̂{b} = s2 (X ′ X)−1 .
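As an illustrative sketch (simulated data, names are arbitrary), s² and V̂{b} can be computed by hand and compared with the standard errors reported by lm():

set.seed(42)
N <- 200; K <- 2
X <- cbind(1, rnorm(N))
y <- X %*% c(1, 0.5) + rnorm(N)
b <- solve(t(X) %*% X, t(X) %*% y)
e <- y - X %*% b                      # OLS residuals
s2 <- sum(e^2) / (N - K)              # unbiased estimator of sigma^2 (note the N - K divisor)
Vb <- s2 * solve(t(X) %*% X)          # estimated variance matrix s^2 (X'X)^{-1}
sqrt(diag(Vb))                        # standard errors; agree with summary(lm(y ~ X[, 2]))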
Returning to the R output for a regression of individuals’ wages
on years of education from last week:
> fit1 <- lm(lwage~educ, data=wage1)
> summary(fit1)
Call:
lm(formula = lwage ~ educ, data = wage1)
Residuals:
Min 1Q Median 3Q Max
-2.21158 -0.36393 -0.07263 0.29712 1.52339
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.583773 0.097336 5.998 3.74e-09 ***
educ 0.082744 0.007567 10.935 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4801 on 524 degrees of freedom
Multiple R-squared: 0.1858,  Adjusted R-squared: 0.1843
F-statistic: 119.6 on 1 and 524 DF, p-value: < 2.2e-16
Here, s = 0.4801 (implying s² = 0.2305), while the standard
errors (√(s²ckk)) are 0.0973 and 0.0076 for b1 and b2 , respectively.
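These numbers can be recovered directly from the fitted object; a minimal sketch is below, assuming the wage1 data are available (for example from the wooldridge R package, which is an assumption rather than something stated on the slide).

# library(wooldridge); data(wage1); fit1 <- lm(lwage ~ educ, data = wage1)
s <- sqrt(sum(resid(fit1)^2) / df.residual(fit1))    # residual standard error, approx 0.4801
s^2                                                  # approx 0.2305
X <- model.matrix(fit1)
sqrt(diag(s^2 * solve(t(X) %*% X)))                  # std. errors, approx 0.0973 and 0.0076
# equivalently: sqrt(diag(vcov(fit1)))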
Large N
The Gauss-Markov assumptions ensure that exact finite
sample results hold for b (e.g. unbiasedness, normality).
If we wish to relax some of these assumptions then exact finite
sample results are typically not available.
For example, if (A2) doesn’t hold, then b will be biased.
We therefore use results for large N to find out the asymptotic
properties as N → ∞.
For large enough N we treat the asymptotic results as holding
approximately.
Convergence
Consider a sequence of numbers indexed by N e.g.
{xN = e⁻ᴺ} = 1/e, 1/e², 1/e³, . . . , 1/eᴺ, . . . .
We can define the limit of this sequence as N → ∞:
lim_{N→∞} xN = lim_{N→∞} e⁻ᴺ = 0.
The sequence {xN } is said to converge to zero.
But what happens if the elements of the sequence are random
variables?
Convergence of random variables
The sequence of random variables {xN } is said to converge in
probability to a constant c if
lim_{N→∞} P{|xN − c| > δ} = 0 for all δ > 0;
(see, for example, (2.69) on p.34 of Verbeek).
This is written
xN →p c or plim xN = c.
In words: for any positive number δ, no matter how small, the
probability that xN lies more than δ away from c converges to
zero as N gets larger and larger.
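A simulation sketch (illustrative only) shows the definition at work for the sample mean of standard normal draws, which converges in probability to 0: for a fixed δ = 0.1 the probability of being further than δ from 0 shrinks as N grows.

set.seed(7)
delta <- 0.1
for (N in c(10, 100, 1000, 10000)) {
  xbar <- replicate(2000, mean(rnorm(N)))            # sample means over repeated samples
  cat("N =", N, " P(|xbar - 0| > delta) approx", mean(abs(xbar) > delta), "\n")
}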
Slutsky’s Theorem
If plim b = β then b is a consistent estimator of β.
Consistency can be thought of as a large sample version of
unbiasedness and is a minimum requirement for an estimator.
A useful property of the plim operator is:
Slutsky’s Theorem
If g(·) is a continuous function and plim xN = c, then
plim g(xN ) = g(plim xN ) = g(c);
(see, for example, (2.71) on p.34 of Verbeek).
This is not a property shared by the expectations operator; in
general,
E{g(x)} ≠ g{E(x)}
for a random variable x.
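A one-line illustration (not from the slides): for a standard normal x and g(x) = x², E{g(x)} = 1 while g{E(x)} = 0, which a quick simulation confirms.

set.seed(99)
x <- rnorm(1e6)
mean(x^2)       # approx 1, an estimate of E{g(x)} with g(x) = x^2
mean(x)^2       # approx 0, an estimate of g{E(x)}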
Convergence to a constant
[Figure: sampling densities of an estimator for N = 10, 100 and 1000 (horizontal axis: estimator; vertical axis: density).]
Convergence to a constant c (here, c = 0) is illustrated above
by the variance of the distribution becoming smaller as N
increases.
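The figure can be reproduced along the following lines (a sketch only; the estimator used here is simply a sample mean, which may differ from the one behind the original figure):

set.seed(3)
for (N in c(10, 100, 1000)) {
  draws <- replicate(5000, mean(rnorm(N)))   # a simple consistent estimator of c = 0
  if (N == 10) {
    plot(density(draws), xlim = c(-1, 1), ylim = c(0, 15),
         main = "Convergence to a constant", xlab = "Estimator", ylab = "Density")
  } else {
    lines(density(draws))                    # densities concentrate around 0 as N grows
  }
}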
Large N assumptions
What can we say about b in large samples? Is it consistent?
It is convenient to make the following assumptions:
(1/N) X′X = (1/N) ∑_{i=1}^{N} xi xi′ →p Σxx (finite, nonsingular); (A6)
E{xi ϵi} = 0, i = 1, . . . , N. (A7)
In (A6) the matrix Σxx can be regarded as E(xi xi′ ).
Assumption (A7) states that xi and ϵi are uncorrelated.
What do these conditions imply for b?
Properties of b
We begin by writing
b = β + ((1/N) X′X)⁻¹ (1/N) X′ϵ
  = β + ((1/N) ∑_{i=1}^{N} xi xi′)⁻¹ (1/N) ∑_{i=1}^{N} xi ϵi.
Applying the plim operator and using Slutsky we find
plim(b − β) = (plim (1/N) ∑_{i=1}^{N} xi xi′)⁻¹ · (plim (1/N) ∑_{i=1}^{N} xi ϵi).
The first term converges to Σxx⁻¹ using (A6).
Large sample results
It is reasonable to assume that sample averages converge to
their population values (a law of large numbers) and so
plim (1/N) ∑_{i=1}^{N} xi ϵi = E{xi ϵi}.
But E{xi ϵi } = 0 under (A7) and so
plim(b − β) = Σxx⁻¹ E{xi ϵi} = 0.
Hence b is a consistent estimator of β.
It is also possible to show that, as N → ∞,
√N (b − β) → N(0, σ²Σxx⁻¹),
where → means ‘is asymptotically distributed as’.
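A simulation sketch (illustrative; not from the slides) of this result: even with markedly non-normal errors, the distribution of √N (b2 − β2) across replications looks normal.

set.seed(11)
N <- 500
z <- replicate(5000, {
  x   <- rnorm(N)
  eps <- rchisq(N, df = 1) - 1               # non-normal errors with mean zero
  y   <- 1 + 2 * x + eps
  sqrt(N) * (coef(lm(y ~ x))[2] - 2)         # sqrt(N) (b2 - beta2) for this replication
})
hist(z, breaks = 50, freq = FALSE)           # roughly bell-shaped and centred at zero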
Large sample approximation
For a large but finite sample size we can use this result to
approximate the distribution of b as
b ∼ᵃ N(β, σ²Σxx⁻¹/N),
where ∼ᵃ means ‘is approximately distributed as.’
Our best estimate of Σxx is X ′ X/N and we estimate σ 2 using s2 .
Hence we have the familiar result
b ∼ᵃ N(β, s²(X′X)⁻¹).
But note that this result is only an approximation, since it is
based on weaker assumptions than Gauss-Markov.
Summary
• Gauss-Markov assumptions
• statistical properties of OLS: small N and large N
• Next week:
• goodness-of-fit
• hypothesis testing (t and F tests)