Econometrics Lecture Notes
A random variable (RV) is a variable whose value is determined by the outcome of a random experiment.$^1$ Consider the random experiment of rolling a single die. In this experiment, there are six possible outcomes: if the die lands with 1, 2, 3, 4, 5, or 6 facing up, we assign that value to the RV $X$. This example illustrates what is known as a discrete RV: an RV which can only take on a finite (or countably infinite) number of values. Here, the RV $X$ can take on the value 1, 2, 3, 4, 5, or 6. A discrete RV differs from a continuous RV: an RV which can take on any value in some real interval (and therefore an uncountably infinite number of values). An example of a continuous RV is a person's height. For an adult, this could take on any value between (say) 4 and 8 feet, depending on the precision of measurement.

$^1$ In what follows, we will denote an RV by an uppercase letter (such as $X$ or $Y$), and an outcome by a lowercase letter (such as $x$ or $y$).
Consider first the discrete case. The probability mass function (or PDF), $f(x)$, tells us the probability that the discrete RV $X$ takes on the value $x$. More formally,

$f(x) = \mathrm{Prob}(X = x).$

It should be noted that a probability mass function for a discrete RV requires that

$0 \le f(x) \le 1$, and $\sum_x f(x) = 1.$$^2$

In our die-rolling example, it should be clear that, since each outcome (1, 2, 3, 4, 5, or 6) is equally likely, the probability mass function for $X$, $f(x)$, is equal to $\tfrac{1}{6}$ for each $x$, i.e.

$f(x) = \tfrac{1}{6}, \quad \text{for } x = 1, \ldots, 6.$

This is an example of a (discrete) uniform distribution: each value that $X$ can take on has the same probability.
For a continuous RV, the probability associated with any particular value is zero, so we can only assign positive probabilities to ranges of values. For example, if we are interested in the probability that the continuous RV $X$ will take on a value in the range between $a$ and $b$, we write this as

$\mathrm{Prob}(a \le X \le b) = \int_a^b f(x)\,dx.$

As in the discrete case, a PDF for a continuous RV has certain (roughly analogous) requirements that must be satisfied:

$f(x) \ge 0$, and $\int_x f(x)\,dx = 1.$$^3$

Closely related to the PDF is the cumulative distribution function (or CDF), $F(x)$, which gives the probability that the RV takes on a value no greater than $x$. For a discrete RV $X$,

$F(x) = \sum_{X \le x} f(x).$

$^2$ The notation $\sum_x$ is used to denote the sum over the entire range of values that the discrete RV $X$ can take on.
$^3$ The notation $\int_x$ is used to denote the integral over the entire range of values that the continuous RV $X$ can take on.
In our example of rolling a single die, it should be easy to see that the CDF of $X$ is

$F(x) = \frac{x}{6}, \quad \text{for } x = 1, \ldots, 6.$

For a continuous RV $X$,

$F(x) = \int_{-\infty}^x f(t)\,dt, \quad \text{and} \quad f(x) = \frac{dF(x)}{dx}.$

For any RV $X$ (whether discrete or continuous), the CDF $F(x)$ must satisfy

$0 \le F(x) \le 1, \quad F(+\infty) = 1, \quad \text{and} \quad F(-\infty) = 0.$
Moving beyond a single RV, we can consider the joint distribution of two RVs. For two discrete RVs $X$ and $Y$, the joint probability mass function $f(x, y)$ tells us the probability that the RV $X$ takes on the value $x$ and the RV $Y$ takes on the value $y$. The requirements are analogous to those for a single discrete RV:

$0 \le f(x, y) \le 1$, and $\sum_x \sum_y f(x, y) = 1.$

For example, consider two dice, one red and one blue. Denote by $X$ the RV whose value is determined by the random experiment of rolling the red die, and by $Y$ the RV whose value is determined by the random experiment of rolling the blue die. It should be emphasized that these are two separate random experiments (i.e. do not confuse this with the single random experiment of rolling a pair of dice). Let $(x, y)$ denote the combined outcome that $X$ takes on the value $x$ and $Y$ takes on the value $y$. For example, the combined outcome $(1, 5)$ means that we roll a 1 with the red die and a 5 with the blue die. In total, there are 36 such combined outcomes, each with equal probability, so the joint probability mass function for $X$ and $Y$ is

$f(x, y) = \tfrac{1}{36}, \quad \text{for } x = 1, \ldots, 6 \text{ and } y = 1, \ldots, 6.$

For two continuous RVs $X$ and $Y$, the joint PDF $f(x, y)$ tells us the probability that $X$ and $Y$ will take on values in some ranges:

$\mathrm{Prob}(a \le X \le b,\; c \le Y \le d) = \int_a^b \int_c^d f(x, y)\,dy\,dx.$

The requirements here are similar to those of a PDF for a single continuous RV:

$f(x, y) \ge 0$, and $\int_x \int_y f(x, y)\,dy\,dx = 1.$
For two discrete RVs $X$ and $Y$, the conditional probability mass function, $f(y|x)$, tells us the probability that $Y$ takes on the value $y$, given that $X$ takes on the value $x$, i.e.

$f(y|x) = \frac{f(x, y)}{f(x)}.$

For continuous RVs, we can define the conditional PDF analogously, but here we are interested in the probability that the continuous RV $Y$ takes on a value in some range, given that another continuous RV $X$ takes on some specific value. Of course, as noted above, the probability that a continuous RV takes on a specific outcome is zero, so we need to make some approximation to get around this.

Suppose that we want to know the probability that the continuous RV $Y$ takes on a value between $a$ and $b$, given that the continuous RV $X$ takes on some value extremely close to $x$. The probability that the continuous RV $X$ takes on some value extremely close to $x$, say in the range between $x$ and $x + h$, where $h$ is some extremely small number, is

$\Pr(x \le X \le x + h) = \int_x^{x+h} f(v)\,dv \approx h f(x).$

Now, the probability that $Y$ lies between $a$ and $b$ and $X$ lies between $x$ and $x + h$ is

$\Pr(a \le Y \le b,\; x \le X \le x + h) = \int_a^b \int_x^{x+h} f(y, v)\,dv\,dy \approx h \int_a^b f(y, x)\,dy.$

So, similar to the discrete case, the probability that $Y$ lies in the range between $a$ and $b$, given that $X$ lies between $x$ and $x + h$, is

$\Pr(a \le Y \le b \mid x \le X \le x + h) = \frac{\Pr(a \le Y \le b,\; x \le X \le x + h)}{\Pr(x \le X \le x + h)} \approx \int_a^b \frac{f(y, x)}{f(x)}\,dy.$

Defining the conditional PDF as $f(y|x) = f(x, y)/f(x)$, we therefore have

$\Pr(a \le Y \le b \mid X \approx x) = \int_a^b f(y|x)\,dy.$
Roughly speaking, the mathematical expectation (or expected value) of an RV is the value we would expect it to take on if we were to repeat the random experiment which determines its value many times over. It is what is known as a measure of central tendency.$^4$ The expected value of the RV $X$ is denoted $E(X)$, or $\mu_X$, and is often simply called the mean. Since these terms are interchangeable, so is the notation. The mathematical expectation of a discrete RV $X$ is

$E(X) = \sum_x x f(x),$

$^4$ Other measures of central tendency are the median and mode. The median of the RV $X$ (whether discrete or continuous) is the value $m$ such that $\mathrm{Prob}(X \le m) \ge 0.5$ and $\mathrm{Prob}(X \ge m) \ge 0.5$, while the mode of $X$ is the value of $x$ at which $f(x)$ takes its maximum.
while for a continuous RV $X$, we have

$E(X) = \int_x x f(x)\,dx.$

It should be clear that these definitions are quite similar, in that the different possible values of the RV are weighted by the probability attached to them. In the discrete case, these weighted values are summed, while in the continuous case, integration is used. It should go without saying that the expected value of a constant $c$ is the constant itself:

$E(c) = c.$

On the other hand, it is not so obvious that

$E[E(X)] = E(X),$

which might be a little less confusing if we use the notation $E(X) = \mu_X$: since $\mu_X$ is a constant, $E(\mu_X) = \mu_X$.

In our example of rolling a single die, we can calculate the expected value of $X$ as

$E(X) = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right) = 3.5.$
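To make this concrete, here is a minimal simulation sketch (assuming Python with numpy, which these notes do not prescribe) that approximates $E(X) = 3.5$ via the law of large numbers:

```python
# A minimal sketch: approximate E(X) = 3.5 for a fair die by simulation.
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=1_000_000)  # outcomes 1..6, equally likely
print(rolls.mean())                         # approximately 3.5
```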
In a more general way, we can also calculate the expected value of a function of an RV. Letting $g(X)$ denote a function of the discrete RV $X$, the expected value of $g(X)$ is

$E[g(X)] = \sum_x g(x) f(x).$

Similarly, letting $g(X)$ denote a function of the continuous RV $X$, the expected value of $g(X)$ is

$E[g(X)] = \int_x g(x) f(x)\,dx.$

Note that a function of an RV is an RV itself. To see this, it might help to define a new RV, say $Z$ (i.e. $Z = g(X)$).

As an example, let's again consider the RV $X$ whose value is determined by rolling a single die, and the function $g(X) = 4X$ (this is what is known as a linear function). Defining the RV $Z$, where $Z = g(X) = 4X$, the expected value of $Z$ can be calculated as

$E(Z) = E(4X) = 4\left(\tfrac{1}{6}\right) + 8\left(\tfrac{1}{6}\right) + 12\left(\tfrac{1}{6}\right) + 16\left(\tfrac{1}{6}\right) + 20\left(\tfrac{1}{6}\right) + 24\left(\tfrac{1}{6}\right) = 14.$

Notice that this is just four times the expected value of $X$ (which, as calculated above, is 3.5).

This example illustrates an important rule: the expected value of an RV multiplied by a constant is the constant multiplied by the expected value of that RV. That is, if $X$ is an RV and $c$ is a constant, we have

$E(cX) = cE(X).$

The proof of this follows directly from the definition of the mathematical expectation of a discrete RV:$^5$

$E(cX) = \sum_x c x f(x) = c \sum_x x f(x) = cE(X).$
Continuing with our random experiment of tossing a single die, consider the function $g(X) = X^2$ (this is what is known as a nonlinear function). Once again, define the RV $Z$, where $Z = g(X) = X^2$. The expected value of $Z$ is

$E(Z) = E(X^2) = 1\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 9\left(\tfrac{1}{6}\right) + 16\left(\tfrac{1}{6}\right) + 25\left(\tfrac{1}{6}\right) + 36\left(\tfrac{1}{6}\right) = \tfrac{91}{6} \approx 15.17.$

Note that the expected value of $g(X)$ is not, in general, equal to $g$ evaluated at the expected value of $X$: here, $E(X^2)$ is not equal to $[E(X)]^2 = 3.5^2 = 12.25$.
Quite often, we are interested in functions of more than one RV. Consider a function $g(X, Y)$ of the two discrete RVs $X$ and $Y$. Its expected value is

$E[g(X, Y)] = \sum_x \sum_y g(x, y) f(x, y),$

where $f(x, y)$ is the joint probability mass function of $X$ and $Y$. Similarly, for two continuous RVs $X$ and $Y$, the expected value of $g(X, Y)$ is

$E[g(X, Y)] = \int_x \int_y g(x, y) f(x, y)\,dy\,dx,$

where $f(x, y)$ is the joint PDF of $X$ and $Y$. As above, it might help to once again think of the function as a new RV $Z$.
Consider now our example of rolling two separate dice, one red and one blue, and the function $g(X, Y) = X + Y$. Define the RV $Z$, where $Z = g(X, Y) = X + Y$.$^6$ The expected value of $Z$ can be calculated as

$E(Z) = E(X + Y) = 2\left(\tfrac{1}{36}\right) + 3\left(\tfrac{2}{36}\right) + 4\left(\tfrac{3}{36}\right) + 5\left(\tfrac{4}{36}\right) + 6\left(\tfrac{5}{36}\right) + 7\left(\tfrac{6}{36}\right) + 8\left(\tfrac{5}{36}\right) + 9\left(\tfrac{4}{36}\right) + 10\left(\tfrac{3}{36}\right) + 11\left(\tfrac{2}{36}\right) + 12\left(\tfrac{1}{36}\right) = 7.$

Notice that this is just the expected value of $X$ (3.5) plus the expected value of $Y$ (3.5). This example illustrates another important rule: the expected value of the sum of two (or more) RVs is the sum of their expected values. In general terms, for RVs $X$ and $Y$, we have

$E(X + Y) = E(X) + E(Y).$$^7$

$^5$ This rule also holds for continuous RVs, and a similar proof could be made.
$^6$ Note that, while $Z$ can be seen as the RV whose value is determined by the outcome of the single random experiment of rolling a pair of dice, we are defining it here as the RV whose value is determined by a function of two separate RVs, determined by the outcomes of two separate random experiments.
$^7$ As above, this rule also holds for continuous RVs. We omit the proof for both cases here.
we have
statistical independence
this is true
), it is not true
conditional expectation
X
of
given that
X and Y . x is
E(Y |X = x) =
y
For two continuous RVs, that and
f (y|x). Y
given
Y,
(see above) is
E(Y |X x) =
y
f (y|x)dy.
1.8 Variance
Roughly speaking, the variance of an RV is a measure of the dispersion of the values it would take on around its expected value, if we were to repeat the random experiment which determines its value many times over. More formally, the variance of the RV $X$, denoted $\sigma^2_X$, is

$\sigma^2_X = E[(X - \mu_X)^2].$$^8$

From this definition, it should be clear that the variance of a constant $c$ is zero: $\sigma^2_c = 0$. Another measure of dispersion is the standard deviation of $X$: the (positive) square root of the variance, which is usually denoted $\sigma_X$.

Notice that the variance of an RV is just the expectation of a function of that RV (i.e. the variance of the RV $X$ is the expected value of the function $(X - \mu_X)^2$, as defined above).$^9$ Keeping this in mind, we can write the variance of the discrete RV $X$ as

$\sigma^2_X = \sum_x (x - \mu_X)^2 f(x),$

while the variance of the continuous RV $X$ can be written as

$\sigma^2_X = \int_x (x - \mu_X)^2 f(x)\,dx.$

In our example of rolling a single die (remembering that $\mu_X = 3.5$), the variance of $X$ is calculated as

$\sigma^2_X = (1 - 3.5)^2\left(\tfrac{1}{6}\right) + (2 - 3.5)^2\left(\tfrac{1}{6}\right) + \ldots + (6 - 3.5)^2\left(\tfrac{1}{6}\right) = \tfrac{35}{12} \approx 2.92,$

which implies $\sigma_X \approx 1.71$.
Before moving on, we can also consider the variance of a function of an RV. Consider the linear function $g(X) = cX$, where $X$ is an RV and $c$ is a constant.$^{10}$ The variance of $cX$ is

$\mathrm{Var}(cX) = E\{[cX - E(cX)]^2\} = E[(cX - c\mu_X)^2] = E[c^2(X - \mu_X)^2] = c^2 E[(X - \mu_X)^2] = c^2 \mathrm{Var}(X).$

$^8$ At times, it may be convenient to define the variance of the RV $X$ as $\sigma^2_X = E(X^2) - \mu_X^2$. It is a useful exercise to show that these two definitions are equivalent.
$^9$ While this function is also an RV, its expected value (i.e. the variance) is actually a constant (just like the expected value itself, as we pointed out above).
$^{10}$ Here, we will focus on linear functions, but the analysis could be extended to more general functions. We will consider the variance of a function of more than one RV in the next section.
1.9 Covariance
Closely related to the concept of variance is covariance: a measure of association between two RVs. The covariance of the RVs $X$ and $Y$, denoted $\sigma_{X,Y}$, is defined as

$\sigma_{X,Y} = E[(X - \mu_X)(Y - \mu_Y)].$$^{11}$

Note that the covariance of the RV $X$ with itself is just its variance:

$\sigma_{X,X} = E[(X - \mu_X)^2] = \sigma^2_X.$

Just as we viewed variance as the expectation of a function of a (single) RV, we can view covariance as the expectation of a function of two RVs.$^{12}$ For two discrete RVs $X$ and $Y$, this means

$\sigma_{X,Y} = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) f(x, y),$

while for two continuous RVs $X$ and $Y$, this means

$\sigma_{X,Y} = \int_x \int_y (x - \mu_X)(y - \mu_Y) f(x, y)\,dy\,dx.$

As mentioned above, two RVs $X$ and $Y$ satisfy $E(XY) = E(X)E(Y)$ if they are statistically independent. Therefore, using the alternate definition of covariance, it can be seen that if two RVs $X$ and $Y$ are statistically independent,

$\sigma_{X,Y} = E(XY) - \mu_X \mu_Y = E(X)E(Y) - \mu_X \mu_Y = \mu_X\mu_Y - \mu_X\mu_Y = 0.$

$^{11}$ We can also define the covariance of the RVs $X$ and $Y$ as $\sigma_{X,Y} = E(XY) - \mu_X \mu_Y$. As was the case with the definitions of variance, it is a useful exercise to show that these two are equivalent.
$^{12}$ Here, the expectation is taken of the function $g(X, Y) = (X - \mu_X)(Y - \mu_Y)$.
Using the definition of covariance, we can now consider the variance of a function of more than one RV.$^{13}$ Consider the linear function $g(X, Y) = X + Y$, where $X$ and $Y$ are RVs. The variance is

$\mathrm{Var}(X + Y) = E[(X + Y - \mu_{X+Y})^2] = E[(X + Y - \mu_X - \mu_Y)^2]$

$= E[(X - \mu_X)^2] + E[(Y - \mu_Y)^2] + 2E[(X - \mu_X)(Y - \mu_Y)]$

$= \sigma^2_X + \sigma^2_Y + 2\sigma_{X,Y}.$

Of course, if $X$ and $Y$ are statistically independent,

$\mathrm{Var}(X + Y) = \sigma^2_X + \sigma^2_Y,$

since $\sigma_{X,Y}$ is equal to zero. A similar analysis (omitted here) could be used to show that the variance of the function $g(X, Y) = X - Y$ is equal to

$\mathrm{Var}(X - Y) = \sigma^2_X + \sigma^2_Y - 2\sigma_{X,Y},$

unless $X$ and $Y$ are statistically independent, in which case

$\mathrm{Var}(X - Y) = \sigma^2_X + \sigma^2_Y.$
Closely related to covariance is the coefficient of correlation between the RVs $X$ and $Y$, denoted $\rho_{X,Y}$, which is defined as

$\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}.$

Note that if $X$ and $Y$ are statistically independent, the coefficient of correlation is zero ($\rho_{X,Y} = 0$), in which case we would say they are uncorrelated.

$^{13}$ We again focus on only linear functions, but more general functions could be considered.
1.10 Estimators
In order to calculate the measures considered above (expected value, variance, and covariance), it is necessary to know something about the probability function associated with the RV (or, in the case of covariance, the joint probability function associated with the RVs) of interest. In the die-rolling example, the probability function for the RV $X$ was known a priori. However, the probability function for an RV determined by anything but the most trivial random experiment is usually unknown. In such cases, it is common to assume that the RV follows some theoretical distribution. For example, if we assume that the RV $X$ follows a normal distribution, we will use the PDF

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2_X}} \exp\left(-\frac{(x - \mu_X)^2}{2\sigma^2_X}\right).$$^{14}$

Notice that this PDF depends on the parameters $\mu_X$ and $\sigma^2_X$; we can't derive these parameters unless we first know the PDF (which requires knowing the parameters). In some cases we may choose to assume values for them, but in others, we need to estimate them.$^{15}$

A random sample is a set of RVs, $x_1, \ldots, x_n$, where $x_1, \ldots, x_n$ are independent, identically distributed (IID) observations on the RV $X$. An estimator is a function of a random sample used to estimate some parameter. First, consider an estimator of the mean of an RV, called the sample mean. For the RV $X$, the sample mean, denoted $\bar{X}$, is

$\bar{X} = \frac{1}{n} \sum_{i=1}^n x_i,$

where $x_1, \ldots, x_n$ is a random sample of observations on $X$. Next, consider an estimator of the variance of an RV, called the sample variance, denoted $s^2_X$. For the RV $X$, the sample variance is

$s^2_X = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})^2.$

It should be clear right away that both the sample mean and sample variance of an RV are functions of RVs (the random sample), and are therefore RVs themselves.

$^{14}$ If the RV $X$ follows a normal distribution with mean $\mu_X$ and variance $\sigma^2_X$, it is common to write $X \sim N(\mu_X, \sigma^2_X)$.
$^{15}$ In fact, if we are not comfortable in assuming some theoretical distribution, it is even possible that we can estimate the probability function itself using a sample of observations. This approach is considered nonparametric, since it does not involve estimating the parameters of some theoretical distribution (which is a parametric approach).
As RVs, estimators have their own probability function, distribution function, expected value, variance, etc. For example, the expected value of the sample mean of the RV $X$ is

$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n} E\left(\sum_{i=1}^n x_i\right) = \frac{1}{n}[E(x_1) + \ldots + E(x_n)] = \frac{1}{n}(\mu_X + \ldots + \mu_X) = \mu_X.$

This demonstrates that the sample mean is what is called an unbiased estimator: an estimator of a parameter whose expected value is the true value of that parameter. The variance of the sample mean of the RV $X$ is

$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2} \mathrm{Var}\left(\sum_{i=1}^n x_i\right) = \frac{1}{n^2}[\mathrm{Var}(x_1) + \ldots + \mathrm{Var}(x_n)] = \frac{1}{n^2}(\sigma^2_X + \ldots + \sigma^2_X) = \frac{\sigma^2_X}{n},$

where taking the variance through the sum uses the fact that the $x_i$ are independent (so all covariance terms are zero).
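Both facts can be checked numerically. A small sketch (assuming Python/numpy), using the die example, where $\mu_X = 3.5$ and $\sigma^2_X = 35/12$:

```python
# Sampling distribution of the sample mean: unbiased, with variance sigma^2/n.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 10_000
samples = rng.integers(1, 7, size=(reps, n))  # 10,000 random samples of die rolls
xbar = samples.mean(axis=1)                   # one sample mean per sample

print(xbar.mean())   # close to mu_X = 3.5 (unbiasedness)
print(xbar.var())    # close to sigma^2_X / n = (35/12) / 30
```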
Turning to the sample variance, it turns out that this estimator is also unbiased. The proof is a little more complicated, but it will help to begin by rewriting the sample variance of the RV $X$ as

$s^2_X = \frac{1}{n-1} \sum_{i=1}^n [(x_i - \mu_X) - (\bar{X} - \mu_X)]^2$

(note that this is equivalent to our original expression since the $\mu_X$ terms cancel). The expected value of $s^2_X$ is thus

$E(s^2_X) = E\left(\frac{1}{n-1} \sum_{i=1}^n [(x_i - \mu_X) - (\bar{X} - \mu_X)]^2\right)$

$= \frac{1}{n-1} E\left(\sum_{i=1}^n (x_i - \mu_X)^2 - 2(\bar{X} - \mu_X)\sum_{i=1}^n (x_i - \mu_X) + n(\bar{X} - \mu_X)^2\right)$

$= \frac{1}{n-1} E\left(\sum_{i=1}^n (x_i - \mu_X)^2 - n(\bar{X} - \mu_X)^2\right)$

$= \frac{1}{n-1}\left(n\sigma^2_X - n\,E[(\bar{X} - \mu_X)^2]\right) = \frac{1}{n-1}(n\sigma^2_X - \sigma^2_X) = \sigma^2_X,$

where the third line uses $\sum_{i=1}^n (x_i - \mu_X) = n(\bar{X} - \mu_X)$, and the last line uses $E[(\bar{X} - \mu_X)^2] = \mathrm{Var}(\bar{X}) = \sigma^2_X/n$.
We won't do so here, but it can be shown that the variance of $s^2_X$, if $X$ is a normally distributed RV, is

$\mathrm{Var}(s^2_X) = \frac{2\sigma^4_X}{n-1}.$

It is useful to compare the estimator $s^2_X$ to an alternative estimator of the variance of $X$, denoted $\hat{\sigma}^2_X$, which is defined as

$\hat{\sigma}^2_X = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{X})^2 = \frac{n-1}{n} s^2_X.$

The expected value of $\hat{\sigma}^2_X$ is

$E(\hat{\sigma}^2_X) = \frac{n-1}{n} E(s^2_X) = \frac{n-1}{n} \sigma^2_X,$
so $\hat{\sigma}^2_X$ is a biased estimator of $\sigma^2_X$. The bias of an estimator is its expected value minus the true value of the parameter being estimated. For $\hat{\sigma}^2_X$, this is

$\mathrm{Bias}(\hat{\sigma}^2_X) = E(\hat{\sigma}^2_X) - \sigma^2_X = \frac{n-1}{n}\sigma^2_X - \sigma^2_X = -\frac{1}{n}\sigma^2_X.$

If $\hat{\sigma}^2_X$ is biased, why consider it at all? The answer lies in its variance, which we can find using the variance of $s^2_X$:

$\mathrm{Var}(\hat{\sigma}^2_X) = \mathrm{Var}\left(\frac{n-1}{n} s^2_X\right) = \left(\frac{n-1}{n}\right)^2 \mathrm{Var}(s^2_X) = \left(\frac{n-1}{n}\right)^2 \frac{2\sigma^4_X}{n-1} = \frac{2(n-1)\sigma^4_X}{n^2}.$
Notice that $\mathrm{Var}(\hat{\sigma}^2_X) < \mathrm{Var}(s^2_X)$, i.e. $\hat{\sigma}^2_X$ is a more efficient estimator of variance than $s^2_X$. Note that efficiency is a relative measure (i.e. we can only speak of the efficiency of an estimator in comparison to the efficiency of other estimators). For this reason, we often refer to the relative efficiency of two estimators, which is just equal to the ratio of their variances. For example, the relative efficiency of $s^2_X$ and $\hat{\sigma}^2_X$ is

$\frac{\mathrm{Var}(s^2_X)}{\mathrm{Var}(\hat{\sigma}^2_X)} = \frac{n^2}{(n-1)^2}.$

An estimator is said to be efficient if it is more efficient than any other estimator. Although $\hat{\sigma}^2_X$ is more efficient than $s^2_X$, its bias leaves open the important question of which estimator is better. Typically (as in this case), we see a trade-off between bias and efficiency. A commonly used measure that takes both of these factors into consideration is the mean squared error (or MSE): the estimator's variance plus its squared bias.$^{16}$ For $s^2_X$,

$\mathrm{MSE}(s^2_X) = \mathrm{Var}(s^2_X) + [\mathrm{Bias}(s^2_X)]^2 = \frac{2\sigma^4_X}{n-1} + 0 = \frac{2\sigma^4_X}{n-1},$

$^{16}$ Actually, the MSE of an estimator is defined as its squared expected deviation from its true value. For example, the MSE of $s^2_X$ is defined as $E[(s^2_X - \sigma^2_X)^2]$. However, since this is equivalent to the estimator's variance plus its squared bias, we usually write it that way. It would be a useful exercise to show this equivalence.
while for $\hat{\sigma}^2_X$, we have

$\mathrm{MSE}(\hat{\sigma}^2_X) = \mathrm{Var}(\hat{\sigma}^2_X) + [\mathrm{Bias}(\hat{\sigma}^2_X)]^2 = \frac{2(n-1)\sigma^4_X}{n^2} + \frac{\sigma^4_X}{n^2} = \frac{(2n-1)\sigma^4_X}{n^2}.$

To compare the two estimators, we can use the difference in their MSEs:

$\mathrm{MSE}(s^2_X) - \mathrm{MSE}(\hat{\sigma}^2_X) = \frac{2\sigma^4_X}{n-1} - \frac{(2n-1)\sigma^4_X}{n^2} = \frac{(3n-1)\sigma^4_X}{n^2(n-1)} > 0,$

which implies that $\hat{\sigma}^2_X$ has the smaller MSE.
Of course, MSE is not the only basis on which to compare two (or more) estimators.

Before moving on, let's consider an estimator of the covariance of two RVs, called the sample covariance. For the RVs $X$ and $Y$, the sample covariance, denoted $s_{X,Y}$, is defined as

$s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})(y_i - \bar{Y}).$

From the sample covariance, it is straightforward to calculate the sample coefficient of correlation:

$r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y},$

where $s_X$, the sample standard deviation of $X$, is the square root of the sample variance of $X$, and $s_Y$ is defined similarly.
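A short sketch of these two estimators (assuming Python/numpy; the data values are illustrative only):

```python
# Sample covariance and sample correlation, computed from their definitions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)  # sample covariance
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))             # sample correlation

print(s_xy, r_xy)
print(np.corrcoef(x, y)[0, 1])  # agrees with the built-in estimate
```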
The properties considered so far (bias, efficiency, and MSE) are small-sample (or finite-sample) properties. We now turn to asymptotic properties, which describe the behaviour of estimators as the sample size approaches infinity. While working with random samples of infinite size may seem unrealistic (and it is), asymptotic analysis provides us with an approximation of how estimators perform in large random samples.$^{17}$

$^{17}$ For this reason, asymptotic properties are sometimes called large sample properties.
In order to examine the asymptotic properties of estimators, it is necessary to understand the concept of convergence in probability. We begin by introducing some new notation. Let $X_n$ denote an RV which depends on the sample size $n$. The RV $X_n$ converges in probability to the constant $c$ if, for every $\varepsilon > 0$,

$\lim_{n \to \infty} \mathrm{Prob}(|X_n - c| > \varepsilon) = 0.$

If $X_n$ converges in probability to $c$, we write $X_n \xrightarrow{p} c$. Alternatively, we may say that $c$ is the probability limit (or plim) of $X_n$, and write $\mathrm{plim}(X_n) = c$.
Using the concept of convergence in probability, we can now consider some useful asymptotic properties of estimators. An estimator of a parameter is said to be consistent if it converges in probability to the true value of the parameter as $n$ approaches infinity. A sufficient condition for an estimator to be consistent is that its MSE goes to zero as $n$ approaches infinity. Consider the sample mean of the RV $X$, $\bar{X}$. The MSE of $\bar{X}$ is

$\mathrm{MSE}(\bar{X}) = \mathrm{Var}(\bar{X}) = \frac{\sigma^2_X}{n},$

since $\bar{X}$ is unbiased. So

$\lim_{n \to \infty} [\mathrm{MSE}(\bar{X})] = \lim_{n \to \infty} \frac{\sigma^2_X}{n} = 0,$

which implies that $\bar{X} \xrightarrow{p} \mu_X$, i.e. $\bar{X}$ is a consistent estimator of the mean of $X$. Similarly, for the two estimators of the variance of $X$, $s^2_X$ and $\hat{\sigma}^2_X$, we have
$\lim_{n \to \infty} [\mathrm{MSE}(s^2_X)] = \lim_{n \to \infty} \frac{2\sigma^4_X}{n-1} = 0,$

and

$\lim_{n \to \infty} [\mathrm{MSE}(\hat{\sigma}^2_X)] = \lim_{n \to \infty} \frac{(2n-1)\sigma^4_X}{n^2} = 0,$

implying that $s^2_X \xrightarrow{p} \sigma^2_X$ and $\hat{\sigma}^2_X \xrightarrow{p} \sigma^2_X$. So, both $s^2_X$ and $\hat{\sigma}^2_X$ are consistent estimators of the variance of $X$. Note that this is true even though $\hat{\sigma}^2_X$ is a biased estimator of the variance of $X$. If the bias of an estimator goes to zero as $n$ approaches infinity, it is said to be asymptotically unbiased. Recall that the bias of the estimator $\hat{\sigma}^2_X$ was

$\mathrm{Bias}(\hat{\sigma}^2_X) = -\frac{1}{n}\sigma^2_X.$

As $n$ approaches infinity, this bias goes to zero, so $\hat{\sigma}^2_X$ is asymptotically unbiased. Similarly, an estimator is said to be asymptotically efficient if, as $n$ approaches infinity, it is more efficient than any other estimator.

Finally, we are often interested in the asymptotic distribution of estimators. Specifically, we need the concept of convergence in distribution. The RV $X_n$, with CDF $F_n(x)$, is said to converge in distribution to the RV $X$, with CDF $F(x)$, if

$\lim_{n \to \infty} F_n(x) = F(x)$

at all continuity points of $F(x)$. In this case, we write $X_n \xrightarrow{d} X$.
The asymptotic distribution of the sample mean can be found by using a central limit theorem (CLT). The Lindeberg-Levy CLT states that if $x_1, \ldots, x_n$ is a random sample of $n$ observations on the RV $X$, where $X$ has mean $\mu_X$ and variance $\sigma^2_X$, then

$n^{-1/2} \sum_{i=1}^n (x_i - \mu_X) \xrightarrow{d} N(0, \sigma^2_X).$

Since

$n^{-1/2} \sum_{i=1}^n (x_i - \mu_X) = n^{-1/2} \left(\sum_{i=1}^n x_i - n\mu_X\right) = \sqrt{n}(\bar{X} - \mu_X),$

this implies

$\sqrt{n}(\bar{X} - \mu_X) \xrightarrow{d} N(0, \sigma^2_X),$

where $\bar{X}$ is as defined earlier. In this case, $\bar{X}$ is said to be asymptotically normal. Note that this holds even if the RV $X$ is not itself normally distributed (e.g. it may be uniformly distributed).
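The following sketch (assuming Python/numpy) illustrates this with uniformly distributed data, for which $\mu = 0.5$ and $\sigma^2 = 1/12$:

```python
# CLT sketch: sqrt(n)(Xbar - mu) is approximately N(0, sigma^2), even for
# non-normal (here uniform) data.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 20_000
mu, sigma2 = 0.5, 1 / 12            # mean and variance of a Uniform(0, 1) RV

xbar = rng.random((reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu)        # sqrt(n)(Xbar - mu)

print(z.mean(), z.var())            # approximately 0 and sigma^2 = 1/12
```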
Regression Models
2.1 Introduction
Natural and social scientists use abstract models to describe the behaviour of the systems that they study. Typically, these models are formulated in mathematical terms, using one or more equations. For example, in an introductory macroeconomics course, the aggregate expenditures model is used to describe the behaviour of an economy. In this model, the equation

$C = cD$

is used to describe how the level of consumption, $C$, is determined by the level of disposable income, $D$. Here, $C$ is a linear function of $D$, governed by the parameter $c$, and the relationship is deterministic: a given level of $D$ implies a unique level of $C$.
Deterministic relationships, however, rarely (if ever) hold in observed data. This is for two reasons: First, there will almost always be some error in measuring the variables involved; and second, there will very often be some other, non-systematic factors at work which have not been identified (i.e. which are not part of the model).

To illustrate the difference, consider a model which posits that the variable $Y$ is some function of the variable $X$, i.e.

$Y = m(X). \quad (1)$

Now, suppose that we have $n$ observations on each of the variables $Y$ and $X$, which we denote $y_1, \ldots, y_n$ and $x_1, \ldots, x_n$, respectively. For any given pair of observations, $(y_i, x_i)$, we would like to allow the possibility that the relationship in (1) does not hold exactly. To do this, we add in what is called a disturbance term, denoted $u_i$, giving the regression model

$y_i = m(x_i) + u_i, \quad i = 1, \ldots, n. \quad (2)$

By adding this disturbance term, we are able to capture any measurement errors or other non-systematic factors not identified in (1). On average, we would hope that each disturbance term is zero (otherwise, the model would contain some systematic error). However, for each pair of observations, $(y_i, x_i)$, $i = 1, \ldots, n$, the actual size of $u_i$ will differ.
Since the disturbance term is, by nature, unpredictable, it is treated as an RV. Like any other RV, it has its own distribution. The most important characteristic of this distribution is that its expected value is zero, i.e. for each $u_i$,

$E(u_i) = 0.$

Typically, we assume that each $u_i$, $i = 1, \ldots, n$, is independent and identically distributed (IID). This means, among other things, that each has the same variance, i.e. for each $u_i$,

$\mathrm{Var}(u_i) = \sigma^2_u.$

In this case, the disturbance terms are said to be homoskedastic (which literally means same variance). Given the IID assumption, it is common to write

$u_i \sim \mathrm{IID}(0, \sigma^2_u).$

Note that, since $y_i$ is a function of the RV $u_i$, $y_i$ is itself an RV. Of course, $y_i$ is also dependent on $x_i$, which may or may not be an RV, depending on how the data are generated. In a controlled experiment, the values $x_1, \ldots, x_n$ are set by the experimenter. As an example, consider an experiment where we give patients different doses of a blood pressure medication, and measure the reduction in their blood pressures over some period of time. Here, $x_1, \ldots, x_n$ are the doses of the medication given to patients $1, \ldots, n$, and $y_1, \ldots, y_n$ are the reductions in the blood pressures of patients $1, \ldots, n$. If we were to repeat this experiment several times over (with a new group of patients each time), holding $x_1, \ldots, x_n$ constant (i.e. keeping the doses the same), $y_1, \ldots, y_n$ would be different each time (due to the disturbances $u_1, \ldots, u_n$). In this case, $y_1, \ldots, y_n$ are RVs, while $x_1, \ldots, x_n$ are not.
In contrast, consider a field experiment, where $x_1, \ldots, x_n$ are not controlled: for example, we might collect data on different workers' education and income levels (perhaps we want to see if higher levels of education result in higher levels of income). Here, $x_1, \ldots, x_n$ are the education levels of workers $1, \ldots, n$, and $y_1, \ldots, y_n$ are the income levels of workers $1, \ldots, n$. If we repeat this experiment several times over (with a new group of workers each time), $x_1, \ldots, x_n$ will be different each time, and therefore can be considered RVs (since we don't know their values until we complete the experiment).
Up to this point, we have been dealing with what is called a bivariate regression model, since it involves just two variables: $y_i$ and $x_i$. Of course, most regression models used in the natural and social sciences involve more than just two variables. A model of this type, referred to as a multivariate regression model, has one dependent variable and more than one independent variable:

$y_i = m(x_{1,i}, \ldots, x_{k,i}) + u_i, \quad i = 1, \ldots, n,$

where $x_{1,1}, \ldots, x_{1,n}$ are the $n$ observations on the first independent variable $X_1$, $x_{2,1}, \ldots, x_{2,n}$ are the $n$ observations on the second independent variable $X_2$, and so on.

It is also possible to have a univariate regression model, in which only one variable is involved. Such models are usually only relevant when the variable involved is in the form of a time-series. An example of such a model is what is known as an autoregressive (or AR($p$)) process, in which the variable is regressed on lagged values of itself:

$y_t = m(y_{t-1}, \ldots, y_{t-p}) + u_t, \quad t = 1, \ldots, n.$

For ease of exposition, the rest of this lecture will focus on bivariate regression models, but all of the results are perfectly applicable to multivariate and univariate regression models.
The function $m(\cdot)$ is generally unknown, so it must be estimated. Of course, since any estimate of $m(\cdot)$ will be based on the observations $y_1, \ldots, y_n$ (and $x_1, \ldots, x_n$, which also may be RVs), the estimate $\hat{m}(\cdot)$, like any other estimator, will also be an RV. Exactly how $m(\cdot)$ is estimated, and how $\hat{m}(\cdot)$ is distributed (which will depend on the method of estimation), will occupy us for the remainder of the course.

Assuming we have an estimate of $m(\cdot)$ in hand, it can be used in the following way. Given a specific observation $x_i$, the estimate $\hat{m}(x_i)$ will give us an estimate of $y_i$, denoted $\hat{y}_i$, i.e. $\hat{y}_i = \hat{m}(x_i)$. The difference between the observed value $y_i$ and the estimate $\hat{y}_i$ is known as the error term (or residual), denoted $\hat{u}_i$, so that

$y_i = \hat{m}(x_i) + \hat{u}_i.$

Rearranging, we have

$\hat{u}_i = y_i - \hat{m}(x_i) = y_i - \hat{y}_i.$

That is, the error term is equal to the difference between the observed and estimated values of $y_i$. A good estimate $\hat{m}(\cdot)$ should have the property that it minimizes such errors. This will be a guiding principle in developing the estimators we are interested in.
In practice, estimating $m(\cdot)$ requires the statistician (or even the theorist) to assume some functional form. If $m(\cdot)$ is given a specific form, (2) is referred to as a parametric regression model; we will consider nonparametric regression models later in the course. The most common form assumed for regression models involves specifying $m(\cdot)$ as an affine function,$^{18}$ i.e.

$m(x_i) = \alpha + \beta x_i,$

which gives us the linear regression model

$y_i = \alpha + \beta x_i + u_i.$

Here, $\alpha$ and $\beta$ are parameters to be estimated. Denoting estimates of $\alpha$ and $\beta$ by $\hat{\alpha}$ and $\hat{\beta}$, we have

$\hat{m}(x_i) = \hat{\alpha} + \hat{\beta} x_i.$

Arriving at these estimates will be the focus of the next topic.

Of course, not all parametric regression models are linear. A nonlinear regression model is one in which $m(\cdot)$ is not an affine function. For example, if

$m(x_i) = x_i^\beta,$

we have the nonlinear regression model

$y_i = x_i^\beta + u_i.$

As above, denoting an estimate of $\beta$ by $\hat{\beta}$, we have $\hat{m}(x_i) = x_i^{\hat{\beta}}$. Estimating nonlinear regression models is beyond the scope of this course.

$^{18}$ An affine function, $g(x)$, takes the form $g(x) = a + bx$, where $a$ and $b$ are any real numbers.
Consider again the linear regression model

$y_i = \alpha + \beta x_i + u_i, \quad i = 1, \ldots, n. \quad (3)$

Given estimates $\hat{\alpha}$ and $\hat{\beta}$, the $i$th residual is

$\hat{u}_i = y_i - \hat{\alpha} - \hat{\beta} x_i.$

Summing the square of each of these terms (from $i = 1, \ldots, n$), we have

$\sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2.$

From the right-hand side of this equation, it is evident that the sum of squared residuals (SSR) is a function of both $\hat{\alpha}$ and $\hat{\beta}$:

$\mathrm{SSR}(\hat{\alpha}, \hat{\beta}) = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2.$

Since the method of ordinary least squares (OLS) is based on minimizing this function, we are faced with a very simple optimization problem: minimize $\mathrm{SSR}(\hat{\alpha}, \hat{\beta})$ by choosing $\hat{\alpha}$ and $\hat{\beta}$, i.e.

$\min_{\hat{\alpha}, \hat{\beta}} \mathrm{SSR}(\hat{\alpha}, \hat{\beta}) = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2. \quad (4)$

The first-order conditions are

$\frac{\partial\,\mathrm{SSR}(\hat{\alpha}, \hat{\beta})}{\partial \hat{\beta}} = 2\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)(-x_i) = 0,$

and

$\frac{\partial\,\mathrm{SSR}(\hat{\alpha}, \hat{\beta})}{\partial \hat{\alpha}} = -2\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0. \quad (5)$

Carrying through the summation operator in (5), we have

$\sum_{i=1}^n y_i - n\hat{\alpha} - \hat{\beta}\sum_{i=1}^n x_i = 0.$
Solving for $\hat{\alpha}$ gives

$\hat{\alpha} = \frac{1}{n}\sum_{i=1}^n y_i - \hat{\beta}\,\frac{1}{n}\sum_{i=1}^n x_i = \bar{y} - \hat{\beta}\bar{x}, \quad (6)$

where $\bar{y}$ is the sample mean of $y_1, \ldots, y_n$, i.e.

$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i,$

and $\bar{x}$ is the sample mean of $x_1, \ldots, x_n$, i.e.

$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i.$
Next, return to the other first-order condition,

$\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)(-x_i) = 0.$

Multiplying through by $-x_i$, we have

$\sum_{i=1}^n (y_i x_i - \hat{\alpha} x_i - \hat{\beta} x_i^2) = 0.$

Next, carrying through the summation operator, we have

$\sum_{i=1}^n y_i x_i - \hat{\alpha}\sum_{i=1}^n x_i - \hat{\beta}\sum_{i=1}^n x_i^2 = 0.$

Now, substituting for $\hat{\alpha}$, we have

$\sum_{i=1}^n y_i x_i - (\bar{y} - \hat{\beta}\bar{x})\sum_{i=1}^n x_i - \hat{\beta}\sum_{i=1}^n x_i^2 = 0.$

Expanding the term in brackets, we have

$\sum_{i=1}^n y_i x_i - \bar{y}\sum_{i=1}^n x_i + \hat{\beta}\bar{x}\sum_{i=1}^n x_i - \hat{\beta}\sum_{i=1}^n x_i^2 = 0.$

Finally, solving for $\hat{\beta}$, we have

$\hat{\beta} = \frac{\sum_{i=1}^n y_i x_i - \bar{y}\sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}. \quad (7)$
At times, it will be convenient to rewrite $\hat{\beta}$ as

$\hat{\beta} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2}. \quad (8)$

To show that Equations (7) and (8) are equivalent, we show that the numerator and denominator of each are equivalent. First, expand the numerator in Equation (8) and rearrange:

$\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \sum_{i=1}^n (y_i x_i - y_i\bar{x} - \bar{y}x_i + \bar{y}\bar{x})$

$= \sum_{i=1}^n y_i x_i - \bar{x}\sum_{i=1}^n y_i - \bar{y}\sum_{i=1}^n x_i + n\bar{x}\bar{y}$

$= \sum_{i=1}^n y_i x_i - n\bar{x}\bar{y} - \bar{y}\sum_{i=1}^n x_i + n\bar{x}\bar{y}$

$= \sum_{i=1}^n y_i x_i - \bar{y}\sum_{i=1}^n x_i,$

which is exactly the numerator in Equation (7). Second, expand the denominator in Equation (8) and rearrange:

$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2)$

$= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2$

$= \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2$

$= \sum_{i=1}^n x_i^2 - n\bar{x}^2 = \sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i,$
which is exactly the denominator in Equation (7). At other times, it will be convenient to rewrite $\hat{\beta}$ as

$\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})x_i}. \quad (9)$

To show that this is equivalent to Equation (8) (and therefore Equation (7)), we again show that the numerator and denominator of each are equivalent. To begin, note that

$\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0,$

and similarly,

$\sum_{i=1}^n (y_i - \bar{y}) = 0.$

Now, the numerator in Equation (8) can be expanded to give

$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})y_i - \bar{y}\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})y_i,$

which is the numerator in Equation (9). Similarly, the denominator in Equation (8) can be expanded to give

$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x})x_i - \bar{x}\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})x_i,$

which is the denominator in Equation (9). So, we have three different (but equivalent) expressions for $\hat{\beta}$: Equations (7), (8), and (9). We most often use Equation (8), but, as we will see in the next section, Equation (9) will often come in handy.
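As an illustration, here is a minimal Python/numpy sketch that computes $\hat{\alpha}$ and $\hat{\beta}$ from Equations (6) and (8) on simulated data (the true parameter values and sample size are assumptions of the example):

```python
# OLS for the bivariate model y_i = alpha + beta*x_i + u_i, via (6) and (8).
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u                       # true alpha = 1, beta = 2

beta_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha_hat = y.mean() - beta_hat * x.mean()  # Equation (6)

print(alpha_hat, beta_hat)
```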
To analyze the properties of the OLS estimators, we will assume that each disturbance term, $u_i$, is IID:

$u_i \sim \mathrm{IID}(0, \sigma^2_u).$

However, as we progress, we will relax this assumption and see how the properties of OLS estimators change. In addition, we will also assume that $x_i$, $i = 1, \ldots, n$, are not RVs. This assumption does not have nearly as large an impact on the properties of OLS estimators as the IID assumption, but will make the following analysis a little easier.

The first question we want to ask is whether or not OLS estimators are unbiased. Let's start with $\hat{\beta}$. Substituting for $y_i$ in Equation (9), we have

$\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(\alpha + \beta x_i + u_i)}{\sum_{i=1}^n (x_i - \bar{x})x_i}.$

Multiplying through, we have

$\hat{\beta} = \frac{\alpha\sum_{i=1}^n (x_i - \bar{x}) + \beta\sum_{i=1}^n (x_i - \bar{x})x_i + \sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i} = 0 + \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i},$

i.e.

$\hat{\beta} = \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i}. \quad (10)$

Taking expectations (and recalling that the $x_i$ are not RVs), we have

$E(\hat{\beta}) = \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})E(u_i)}{\sum_{i=1}^n (x_i - \bar{x})x_i} = \beta,$

since $E(u_i) = 0$. So, $\hat{\beta}$ is unbiased. Next, consider

$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = \frac{1}{n}\sum_{i=1}^n y_i - \hat{\beta}\bar{x}.$
Substituting for $y_i$, we have

$\hat{\alpha} = \frac{1}{n}\sum_{i=1}^n (\alpha + \beta x_i + u_i) - \hat{\beta}\bar{x} = \alpha + \beta\bar{x} + \frac{1}{n}\sum_{i=1}^n u_i - \hat{\beta}\bar{x} = \alpha + \bar{x}(\beta - \hat{\beta}) + \frac{1}{n}\sum_{i=1}^n u_i. \quad (11)$

Taking the expected value of $\hat{\alpha}$, we have

$E(\hat{\alpha}) = E\left(\alpha + \bar{x}(\beta - \hat{\beta}) + \frac{1}{n}\sum_{i=1}^n u_i\right) = \alpha + \bar{x}[\beta - E(\hat{\beta})] + \frac{1}{n}\sum_{i=1}^n E(u_i) = \alpha + \bar{x}(\beta - \beta) = \alpha.$

So, $\hat{\alpha}$ is also unbiased.
Another important property of OLS estimators is their variance. Since $\hat{\beta}$ is unbiased, we have

$\mathrm{Var}(\hat{\beta}) = E[(\hat{\beta} - \beta)^2].$

From Equation (10), this means

$\mathrm{Var}(\hat{\beta}) = E\left[\left(\frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i}\right)^2\right].$

To make handling the term in brackets a little easier, let's introduce some new notation. Let

$k_i = \frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})x_i} = \frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2},$

so that

$\mathrm{Var}(\hat{\beta}) = E\left[\left(\sum_{i=1}^n k_i u_i\right)^2\right].$

Expanding this, we have

$\mathrm{Var}(\hat{\beta}) = E\left(\sum_{i=1}^n k_i^2 u_i^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^n k_i k_j u_i u_j\right) = \sum_{i=1}^n k_i^2 E(u_i^2) + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^n k_i k_j E(u_i u_j).$

Since $E(u_i^2) = \sigma^2_u$ and, for $i \ne j$,

$E(u_i u_j) = E(u_i)E(u_j) = 0$

(since we have assumed that each $u_i$ is IID, and therefore statistically independent from one another), we have

$\mathrm{Var}(\hat{\beta}) = \sigma^2_u \sum_{i=1}^n k_i^2 = \sigma^2_u \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\left[\sum_{i=1}^n (x_i - \bar{x})^2\right]^2} = \frac{\sigma^2_u}{\sum_{i=1}^n (x_i - \bar{x})^2}.$
We now move on to $\hat{\alpha}$. Since $\hat{\alpha}$ is unbiased, we have

$\mathrm{Var}(\hat{\alpha}) = E[(\hat{\alpha} - \alpha)^2].$

From Equation (11), this means

$\mathrm{Var}(\hat{\alpha}) = E\left[\left(\bar{x}(\beta - \hat{\beta}) + \frac{1}{n}\sum_{i=1}^n u_i\right)^2\right]$

$= E\left[\bar{x}^2(\hat{\beta} - \beta)^2 + 2\bar{x}(\beta - \hat{\beta})\frac{1}{n}\sum_{i=1}^n u_i + \frac{1}{n^2}\left(\sum_{i=1}^n u_i\right)^2\right]$

$= \bar{x}^2\mathrm{Var}(\hat{\beta}) + \frac{1}{n^2}\sum_{i=1}^n E(u_i^2) + \frac{2}{n^2}\sum_{i=1}^{n-1}\sum_{j=i+1}^n E(u_i u_j)$

$= \sigma^2_u\left(\frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} + \frac{1}{n}\right),$

where the cross term vanishes because $\hat{\beta} - \beta = \sum_i k_i u_i$ and $\sum_i k_i = 0$, so its expectation is zero.
Given the variance of OLS estimators, we might want to ask how efficient they are. It turns out that, among linear unbiased estimators, the OLS estimators are in fact the most efficient. This property, known as the Gauss-Markov theorem, is often stated as: OLS is BLUE (Best Linear Unbiased Estimator), where by best, we mean most efficient. Here, we will prove this only for $\hat{\beta}$; the proof for $\hat{\alpha}$ is very similar.

Using the $k_i$ notation introduced above, the OLS estimator can be written as

$\hat{\beta} = \sum_{i=1}^n k_i y_i.$

This should make it clear that $\hat{\beta}$ is linear in $y_i$. Now consider some other estimator of $\beta$ which is linear in $y_i$; call it $\tilde{\beta}$:

$\tilde{\beta} = \sum_{i=1}^n g_i y_i,$

where

$g_i = k_i + w_i,$

and $w_i$ depends only on $x_1, \ldots, x_n$. Then

$\tilde{\beta} = \sum_{i=1}^n (k_i + w_i)y_i = \sum_{i=1}^n k_i y_i + \sum_{i=1}^n w_i y_i = \hat{\beta} + \sum_{i=1}^n w_i y_i. \quad (12)$

Taking expectations, we have

$E(\tilde{\beta}) = E\left(\hat{\beta} + \sum_{i=1}^n w_i y_i\right) = \beta + E\left(\sum_{i=1}^n w_i y_i\right).$
Substituting for $y_i$, we have

$E(\tilde{\beta}) = \beta + E\left(\sum_{i=1}^n w_i(\alpha + \beta x_i + u_i)\right) = \beta + \alpha\sum_{i=1}^n w_i + \beta\sum_{i=1}^n w_i x_i + \sum_{i=1}^n w_i E(u_i) = \beta + \alpha\sum_{i=1}^n w_i + \beta\sum_{i=1}^n w_i x_i,$

since $E(u_i) = 0$. In order for $\tilde{\beta}$ to be unbiased, we therefore require

$\sum_{i=1}^n w_i = 0, \quad \text{and} \quad \sum_{i=1}^n w_i x_i = 0,$

since, in general, $\alpha$ and $\beta$ are not zero. Imposing these conditions, we have

$\sum_{i=1}^n w_i y_i = \sum_{i=1}^n w_i(\alpha + \beta x_i + u_i) = \alpha\sum_{i=1}^n w_i + \beta\sum_{i=1}^n w_i x_i + \sum_{i=1}^n w_i u_i = \sum_{i=1}^n w_i u_i.$

Recall also, from Equation (10), that

$\hat{\beta} = \beta + \sum_{i=1}^n k_i u_i.$

Substituting these into Equation (12), we have

$\tilde{\beta} = \beta + \sum_{i=1}^n k_i u_i + \sum_{i=1}^n w_i u_i = \beta + \sum_{i=1}^n (k_i + w_i)u_i.$
Finally, the variance of $\tilde{\beta}$ is

$\mathrm{Var}(\tilde{\beta}) = E\left\{\left[\tilde{\beta} - E(\tilde{\beta})\right]^2\right\} = E\left[\left(\sum_{i=1}^n (k_i + w_i)u_i\right)^2\right]$

$= \sigma^2_u\sum_{i=1}^n (k_i + w_i)^2 = \sigma^2_u\sum_{i=1}^n (k_i^2 + 2k_i w_i + w_i^2)$

$= \sigma^2_u\sum_{i=1}^n k_i^2 + 2\sigma^2_u\sum_{i=1}^n k_i w_i + \sigma^2_u\sum_{i=1}^n w_i^2$

$= \mathrm{Var}(\hat{\beta}) + 2\sigma^2_u\sum_{i=1}^n k_i w_i + \sigma^2_u\sum_{i=1}^n w_i^2.$

Now,

$\sum_{i=1}^n k_i w_i = \frac{\sum_{i=1}^n (x_i - \bar{x})w_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n x_i w_i - \bar{x}\sum_{i=1}^n w_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = 0,$

since $\sum_{i=1}^n x_i w_i = 0$ and $\sum_{i=1}^n w_i = 0$. Therefore, we have

$\mathrm{Var}(\tilde{\beta}) = \mathrm{Var}(\hat{\beta}) + \sigma^2_u\sum_{i=1}^n w_i^2 \ge \mathrm{Var}(\hat{\beta}),$

with equality only if each $w_i = 0$ (in which case $\tilde{\beta} = \hat{\beta}$). So $\hat{\beta}$ is BLUE. We won't show it here, but the proof for $\hat{\alpha}$ is quite similar.
Consider now the multivariate linear regression model

$y_i = \beta_1 + \beta_2 x_{2,i} + \ldots + \beta_k x_{k,i} + u_i, \quad i = 1, \ldots, n.$$^{19}$

To make the analysis of such regression models much easier, we typically express this in matrix form:$^{20}$

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \quad (13)$

where $\mathbf{y}$ and $\mathbf{u}$ are $n$-vectors, $\mathbf{X}$ is an $n \times k$ matrix, and $\boldsymbol{\beta}$ is a $k$-vector, defined as follows:

$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{2,1} & \ldots & x_{k,1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{2,n} & \ldots & x_{k,n} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \mathbf{u} = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}.$

Letting $\hat{\mathbf{u}}$ denote the residuals from the estimate of this model, we have

$\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\mathbf{b},$

where $\mathbf{b}$ is an estimator of $\boldsymbol{\beta}$. The sum of squared residuals is then

$\mathrm{SSR}(\mathbf{b}) = \hat{\mathbf{u}}'\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}.$

As we saw in the bivariate case, the method of OLS is based on minimizing this function.

$^{19}$ Note that the bivariate linear regression model considered above is just a special case of the multivariate linear regression model considered here. For this reason, what follows is applicable to both bivariate and multivariate linear regression models.
$^{20}$ In these notes, we assume the reader is familiar with basic matrix algebra.
Differentiating with respect to $\mathbf{b}$ and setting the result to zero gives the first-order condition

$\frac{\partial\,\mathrm{SSR}(\mathbf{b})}{\partial \mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{0},$

which implies

$\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{X}\mathbf{b}.$

Finally, solving for $\mathbf{b}$, we have

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \quad (14)$
As in the bivariate case, we assume that the elements of the $n$-vector $\mathbf{u}$ are IID.$^{21}$ The expected value of $\mathbf{u}$ is

$E(\mathbf{u}) = E\begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix} = \mathbf{0},$

where $\mathbf{0}$ is an $n$-vector of zeros. The variance of $\mathbf{u}$ is

$\mathrm{Var}(\mathbf{u}) = E(\mathbf{u}\mathbf{u}').$

Since each $u_i$ is IID, $E(u_i^2) = \sigma^2_u$ and, for $i \ne j$, $E(u_i u_j) = 0$. Therefore,

$\mathrm{Var}(\mathbf{u}) = \begin{pmatrix} \sigma^2_u & 0 & \ldots & 0 \\ 0 & \sigma^2_u & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & \sigma^2_u \end{pmatrix} = \sigma^2_u \mathbf{I}_n,$

where $\mathbf{I}_n$ is an $n \times n$ identity matrix. Given this, it is common to write

$\mathbf{u} \sim \mathrm{IID}(\mathbf{0}, \sigma^2_u \mathbf{I}_n).$

$^{21}$ As noted above, what follows is also applicable to bivariate linear regression models (since the bivariate linear regression model is just a special case of the multivariate linear regression model).
Also, to make things easier, we will continue to assume that $\mathbf{X}$ is not an RV. Later in the course, we will see what the consequences are if we relax this assumption.

Let's start the analysis by asking whether $\mathbf{b}$ is unbiased or not. Substituting Equation (13) into Equation (14), we have

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}. \quad (15)$

Taking expectations,

$E(\mathbf{b}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}) = \boldsymbol{\beta},$

since $E(\mathbf{u}) = \mathbf{0}$. So $\mathbf{b}$ is unbiased.

Next, consider the variance of $\mathbf{b}$. Given that $\mathbf{b}$ is unbiased, we have

$\mathrm{Var}(\mathbf{b}) = E[(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'].$

From Equation (15), this means

$\mathrm{Var}(\mathbf{b}) = E\{[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}][(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}]'\}$

$= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}\mathbf{u}')\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}. \quad (16)$
Given the variance of $\mathbf{b}$, we can ask whether the Gauss-Markov theorem holds here. Consider some other linear, unbiased estimator of $\boldsymbol{\beta}$, call it $\mathbf{B}$:

$\mathbf{B} = \mathbf{G}\mathbf{y},$

where

$\mathbf{G} = \mathbf{W} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}',$

and $\mathbf{W}$ depends only on $\mathbf{X}$. The expected value of $\mathbf{B}$ is thus

$E(\mathbf{B}) = E(\mathbf{G}\mathbf{y}) = \mathbf{G}E(\mathbf{y}).$

The expected value of $\mathbf{y}$ is

$E(\mathbf{y}) = E(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{X}\boldsymbol{\beta} + E(\mathbf{u}) = \mathbf{X}\boldsymbol{\beta},$

so

$E(\mathbf{B}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta}.$

This means that, in order for $\mathbf{B}$ to be unbiased, we require

$\mathbf{G}\mathbf{X} = \mathbf{I}_k,$

where $\mathbf{I}_k$ is a $k \times k$ identity matrix. Using $\mathbf{G}\mathbf{X} = \mathbf{I}_k$, it is easy to confirm that

$\mathbf{B} = \mathbf{G}\mathbf{y} = \mathbf{G}(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta} + \mathbf{G}\mathbf{u} = \boldsymbol{\beta} + \mathbf{G}\mathbf{u},$

and that

$\mathbf{W}\mathbf{X} = [\mathbf{G} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X} = \mathbf{G}\mathbf{X} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{I}_k - \mathbf{I}_k = \mathbf{0}.$

The variance of $\mathbf{B}$ is therefore

$\mathrm{Var}(\mathbf{B}) = E[(\mathbf{G}\mathbf{u})(\mathbf{G}\mathbf{u})'] = E(\mathbf{G}\mathbf{u}\mathbf{u}'\mathbf{G}') = \mathbf{G}E(\mathbf{u}\mathbf{u}')\mathbf{G}' = \sigma^2_u\mathbf{G}\mathbf{G}'$

$= \sigma^2_u[\mathbf{W} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{W} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']'$

$= \sigma^2_u[\mathbf{W}\mathbf{W}' + \mathbf{W}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}' + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}]$

$= \sigma^2_u\mathbf{W}\mathbf{W}' + \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1},$

since $\mathbf{W}\mathbf{X} = \mathbf{0}$. Finally, since $\mathbf{W}\mathbf{W}'$ is positive semidefinite for any non-zero matrix $\mathbf{W}$, we have $\mathrm{Var}(\mathbf{B}) \ge \mathrm{Var}(\mathbf{b})$, i.e. $\mathbf{b}$ is BLUE.
Recall that a sufficient condition for an estimator to be consistent is that its MSE goes to zero as $n$ approaches infinity. Since $\mathbf{b}$ is unbiased, its MSE is

$\mathrm{MSE}(\mathbf{b}) = \mathrm{Var}(\mathbf{b}) = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}.$$^{22}$

Now, consider the matrix $\mathbf{X}'\mathbf{X}$. As $n$ approaches infinity, each of its elements (being sums over $n$ numbers) would also approach infinity. However, it would be safe to assume that $\frac{1}{n}$ times each of these elements would approach some constant. We can write this assumption as

$\lim_{n \to \infty} \frac{1}{n}\mathbf{X}'\mathbf{X} = \mathbf{Q}, \quad (17)$

where $\mathbf{Q}$ is some finite, invertible $k \times k$ matrix. Rewriting the MSE of $\mathbf{b}$,

$\mathrm{MSE}(\mathbf{b}) = \frac{1}{n}\sigma^2_u\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1}.$

Notice that this is equivalent to our previous expression for the MSE of $\mathbf{b}$, since the $\frac{1}{n}$ terms cancel. Now, as $n$ approaches infinity,

$\lim_{n \to \infty}[\mathrm{MSE}(\mathbf{b})] = \lim_{n \to \infty}\frac{1}{n}\sigma^2_u \cdot \lim_{n \to \infty}\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1} = (0)\,\mathbf{Q}^{-1} = \mathbf{0},$

which implies that $\mathbf{b} \xrightarrow{p} \boldsymbol{\beta}$. So, $\mathbf{b}$ is consistent.

$^{22}$ We didn't do this when we focused on the bivariate linear regression model, but, once again, all of what follows is applicable to that special case.
Finally, we consider the asymptotic distribution of $\mathbf{b}$.$^{23}$ To do so, begin by rewriting Equation (15) as

$\mathbf{b} = \boldsymbol{\beta} + \left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1}\frac{1}{n}\mathbf{X}'\mathbf{u}.$

Notice that, as above, the $\frac{1}{n}$ terms cancel each other out, leaving us with Equation (15). Rearranging, we have

$\sqrt{n}(\mathbf{b} - \boldsymbol{\beta}) = \left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1} n^{-1/2}\mathbf{X}'\mathbf{u}. \quad (18)$

Consider the vector $n^{-1/2}\mathbf{X}'\mathbf{u} = n^{-1/2}\sum_{i=1}^n \mathbf{X}_i'u_i$, where $\mathbf{X}_i$ is the $i$th row of $\mathbf{X}$. A multivariate CLT tells us that

$n^{-1/2}\sum_{i=1}^n \mathbf{X}_i'u_i \xrightarrow{d} N\left(\mathbf{0},\; \lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(\mathbf{X}_i'u_i)\right).$

The expected value of each $\mathbf{X}_i'u_i$ is

$E(\mathbf{X}_i'u_i) = \mathbf{X}_i'E(u_i) = \mathbf{0},$

so its variance is

$\mathrm{Var}(\mathbf{X}_i'u_i) = E(\mathbf{X}_i'u_i u_i \mathbf{X}_i) = \sigma^2_u\mathbf{X}_i'\mathbf{X}_i.$

Therefore,

$\lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(\mathbf{X}_i'u_i) = \sigma^2_u \lim_{n \to \infty}\frac{1}{n}\mathbf{X}'\mathbf{X} = \sigma^2_u\mathbf{Q},$

where $\mathbf{Q}$ is as defined earlier. Combining this with Equation (18) and the fact that $(\frac{1}{n}\mathbf{X}'\mathbf{X})^{-1} \to \mathbf{Q}^{-1}$, we have

$\sqrt{n}(\mathbf{b} - \boldsymbol{\beta}) \xrightarrow{d} N(\mathbf{0}, \sigma^2_u\mathbf{Q}^{-1}).$

That is, $\mathbf{b}$ is asymptotically normal.

$^{23}$ In small samples, we can't generally say anything about the distribution of $\mathbf{b}$ unless we know the exact distribution of $\mathbf{X}$ and $\mathbf{u}$. For example, if $\mathbf{X}$ is not an RV, and $\mathbf{u}$ is normally distributed, then $\mathbf{b}$ is also normally distributed since it is a linear function of $\mathbf{u}$ (see Equation (15)).
3.5 Estimating the variance of the error terms of linear regression models
Even if we are willing to make the assumption that each of the error terms is IID with mean zero and variance $\sigma^2_u$, the value of $\sigma^2_u$ is unknown, so we typically would like to get some estimate of it. We start here by proposing the following estimator of $\sigma^2_u$, which we denote $s^2_u$:

$s^2_u = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n - k}.$

While it may not be immediately clear where this comes from, we can show that it is an unbiased estimator of $\sigma^2_u$. Begin by noting that the residuals can be written as $\hat{\mathbf{u}} = \mathbf{M}\mathbf{y}$, where

$\mathbf{M} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$

is symmetric and idempotent, i.e.

$\mathbf{M}' = \mathbf{M}$, and $\mathbf{M}\mathbf{M} = \mathbf{M}.$$^{24}$

Substituting for $\mathbf{y}$, we have

$\hat{\mathbf{u}} = \mathbf{M}\mathbf{y} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{M}\mathbf{X}\boldsymbol{\beta} + \mathbf{M}\mathbf{u} = \mathbf{M}\mathbf{u},$

since

$\mathbf{M}\mathbf{X} = [\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X} = \mathbf{X} - \mathbf{X} = \mathbf{0}.$

Taking expectations of $s^2_u$, and noticing that $\mathbf{u}'\mathbf{M}\mathbf{u}$ is a scalar (so it equals its own trace, and the trace is invariant under cyclic permutations), we can write

$E(s^2_u) = \frac{E(\mathbf{u}'\mathbf{M}\mathbf{u})}{n - k} = \frac{E[\mathrm{Tr}(\mathbf{u}'\mathbf{M}\mathbf{u})]}{n - k} = \frac{E[\mathrm{Tr}(\mathbf{M}\mathbf{u}\mathbf{u}')]}{n - k} = \frac{\mathrm{Tr}[\mathbf{M}E(\mathbf{u}\mathbf{u}')]}{n - k} = \frac{\sigma^2_u\mathrm{Tr}(\mathbf{M})}{n - k} = \sigma^2_u,$

since

$\mathrm{Tr}(\mathbf{M}) = \mathrm{Tr}(\mathbf{I}_n) - \mathrm{Tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = n - \mathrm{Tr}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}] = n - \mathrm{Tr}(\mathbf{I}_k) = n - k.$

So, we conclude that $s^2_u$ is an unbiased estimator of $\sigma^2_u$.

Note that since the variance of $\mathbf{b}$ depends on $\sigma^2_u$ (see Equation (16)), we can use $s^2_u$ to get an estimate of the variance of $\mathbf{b}$. This estimate, which we denote by $\widehat{\mathrm{Var}}(\mathbf{b})$, is found by replacing $\sigma^2_u$ by $s^2_u$ in Equation (16), i.e.

$\widehat{\mathrm{Var}}(\mathbf{b}) = s^2_u(\mathbf{X}'\mathbf{X})^{-1}.$

Since $s^2_u$ is an unbiased estimator of $\sigma^2_u$, $\widehat{\mathrm{Var}}(\mathbf{b})$ is an unbiased estimate of $\mathrm{Var}(\mathbf{b})$. The square root of the $j$th diagonal element of $\widehat{\mathrm{Var}}(\mathbf{b})$ is often called the standard error of $b_j$.

$^{24}$ This can be verified directly from the definition of $\mathbf{M}$.
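These quantities are straightforward to compute. A sketch (assuming Python/numpy; the data-generating values are illustrative):

```python
# Unbiased error-variance estimate s^2_u and the standard errors of b.
import numpy as np

rng = np.random.default_rng(4)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
s2_u = (u_hat @ u_hat) / (n - k)        # s^2_u = u'u / (n - k)
var_b = s2_u * np.linalg.inv(X.T @ X)   # estimated Var(b)
se_b = np.sqrt(np.diag(var_b))          # standard errors of b

print(b, se_b)
```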
Hypothesis testing
Consider the multivariate linear regression model

$y_i = \beta_1 + \beta_2 x_{2,i} + \beta_3 x_{3,i} + u_i, \quad i = 1, \ldots, n. \quad (19)$

Two examples of null hypotheses we might want to test are:

1. $H_0\!: \beta_2 = \beta_{2,0}$, where $\beta_{2,0}$ is some specified value; and

2. $H_0\!: \beta_2 = \beta_3 = 0$.

Null hypotheses which impose linear restrictions like these can be written in the general form

$H_0\!: \mathbf{R}\boldsymbol{\beta} = \mathbf{r},$

where, here, $\boldsymbol{\beta} = (\beta_1\;\; \beta_2\;\; \beta_3)'$. For example 1,

$\mathbf{R} = (0\;\; 1\;\; 0), \quad \mathbf{r} = \beta_{2,0};$

and for example 2,

$\mathbf{R} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \mathbf{r} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$

Written this way, it can be seen that each of these null hypotheses imposes some linear restriction(s) on the original regression model in (19), which we will call the unrestricted model. The model obtained by imposing the restriction(s) is called the restricted model; under example 2, for instance, the model can be rewritten as

$H_0\!: y_i = \beta_1 + u_i.$

In general, for the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\mathbf{X}$ is an $n \times k$ matrix and $\boldsymbol{\beta}$ is a $k$-vector, $\mathbf{R}$ will be a $q \times k$ matrix, and $\mathbf{r}$ will be a $q$-vector, where $q$ is the number of restrictions. Note that the first example above imposes a single restriction (i.e. $q = 1$), while the second imposes two restrictions (i.e. $q = 2$).
In order to test any such linear restriction(s), we use the OLS estimator of $\boldsymbol{\beta}$,

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$

Recall that, if we assume that $\mathbf{u} \sim \mathrm{IID}(\mathbf{0}, \sigma^2_u\mathbf{I}_n)$, then $E(\mathbf{b}) = \boldsymbol{\beta}$ and $\mathrm{Var}(\mathbf{b}) = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}$. Therefore, we have

$E(\mathbf{R}\mathbf{b}) = \mathbf{R}E(\mathbf{b}) = \mathbf{R}\boldsymbol{\beta},$

and

$\mathrm{Var}(\mathbf{R}\mathbf{b}) = E([\mathbf{R}\mathbf{b} - E(\mathbf{R}\mathbf{b})][\mathbf{R}\mathbf{b} - E(\mathbf{R}\mathbf{b})]')$

$= E[(\mathbf{R}\mathbf{b} - \mathbf{R}\boldsymbol{\beta})(\mathbf{R}\mathbf{b} - \mathbf{R}\boldsymbol{\beta})'] = E[\mathbf{R}(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'\mathbf{R}']$

$= \mathbf{R}E[(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})']\mathbf{R}' = \mathbf{R}\mathrm{Var}(\mathbf{b})\mathbf{R}' = \sigma^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'.$

If we further assume that $\mathbf{u} \sim N(\mathbf{0}, \sigma^2_u\mathbf{I}_n)$, then

$\mathbf{b} \sim N[\boldsymbol{\beta}, \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}],$

which implies that $\mathbf{R}\mathbf{b} \sim N[\mathbf{R}\boldsymbol{\beta}, \sigma^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'].$
Under the null hypothesis $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, it follows that

$(\mathbf{R}\mathbf{b} - \mathbf{r})'[\sigma^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r}) \sim \chi^2(q). \quad (20)$

However, $\sigma^2_u$ is unknown, so this statistic cannot be computed. Recall its unbiased estimator,

$s^2_u = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n - k}.$

Replacing $\sigma^2_u$ in (20) by $s^2_u$ gives

$(\mathbf{R}\mathbf{b} - \mathbf{r})'[s^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r}). \quad (21)$

However, having replaced $\sigma^2_u$ by the estimate $s^2_u$, this quantity no longer follows the $\chi^2(q)$ distribution. Dividing (21) by $q$ instead yields a statistic which follows the $F(q, n-k)$ distribution. That is,

$F = \frac{(\mathbf{R}\mathbf{b} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r})}{q\,s^2_u} \sim F(q, n-k). \quad (22)$

In the special case of a single restriction ($q = 1$) of the form $H_0\!: \beta_j = \beta_{j,0}$, this test statistic simplifies considerably. First, note that

$\mathbf{R}\mathbf{b} - \mathbf{r} = b_j - \beta_{j,0},$

which is a scalar. Second, note that $s^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ picks out the $j$th diagonal element of the $k \times k$ matrix

$\widehat{\mathrm{Var}}(\mathbf{b}) = s^2_u(\mathbf{X}'\mathbf{X})^{-1},$

i.e. the square of the standard error of $b_j$. Taking the square root of the resulting $F$-statistic gives the familiar $t$-statistic:

$t = \frac{b_j - \beta_{j,0}}{\mathrm{s.e.}(b_j)}.$
An alternative approach to testing linear restrictions is to estimate the model

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u},$

subject to the restriction $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$. This yields the restricted least squares estimator of $\boldsymbol{\beta}$, which we denote $\mathbf{b}^*$. Letting $\hat{\mathbf{u}}^*$ denote the associated residuals, the sum of squared residuals is

$\mathrm{SSR}(\mathbf{b}^*) = \hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^* = (\mathbf{y} - \mathbf{X}\mathbf{b}^*)'(\mathbf{y} - \mathbf{X}\mathbf{b}^*) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}^{*\prime}\mathbf{X}'\mathbf{y} + \mathbf{b}^{*\prime}\mathbf{X}'\mathbf{X}\mathbf{b}^*.$

As with OLS, the restricted least squares estimator is based on minimizing this function. Our (constrained) optimization problem is thus

$\min_{\mathbf{b}^*} \mathrm{SSR}(\mathbf{b}^*) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}^{*\prime}\mathbf{X}'\mathbf{y} + \mathbf{b}^{*\prime}\mathbf{X}'\mathbf{X}\mathbf{b}^*, \quad \text{subject to} \quad \mathbf{R}\mathbf{b}^* = \mathbf{r}.$

The Lagrangian for this problem is

$\mathcal{L}(\mathbf{b}^*, \boldsymbol{\lambda}) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}^{*\prime}\mathbf{X}'\mathbf{y} + \mathbf{b}^{*\prime}\mathbf{X}'\mathbf{X}\mathbf{b}^* - 2\boldsymbol{\lambda}'(\mathbf{R}\mathbf{b}^* - \mathbf{r}), \quad (23)$

with first-order conditions

$\frac{\partial\mathcal{L}(\mathbf{b}^*, \boldsymbol{\lambda})}{\partial\mathbf{b}^*} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b}^* - 2\mathbf{R}'\boldsymbol{\lambda} = \mathbf{0}, \quad (24)$

and

$\frac{\partial\mathcal{L}(\mathbf{b}^*, \boldsymbol{\lambda})}{\partial\boldsymbol{\lambda}} = -2(\mathbf{R}\mathbf{b}^* - \mathbf{r}) = \mathbf{0}. \quad (25)$

From (24),

$\mathbf{X}'\mathbf{X}\mathbf{b}^* = \mathbf{X}'\mathbf{y} + \mathbf{R}'\boldsymbol{\lambda}.$

Solving for $\mathbf{b}^*$, we have

$\mathbf{b}^* = \mathbf{b} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\boldsymbol{\lambda}, \quad (26)$

where $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ is the unrestricted OLS estimator. Premultiplying (26) by $\mathbf{R}$,

$\mathbf{R}\mathbf{b}^* = \mathbf{R}\mathbf{b} + \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\boldsymbol{\lambda},$

which implies

$\mathbf{R}\mathbf{b}^* - \mathbf{R}\mathbf{b} = \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\boldsymbol{\lambda},$

and therefore,

$\boldsymbol{\lambda} = [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b}^* - \mathbf{R}\mathbf{b}). \quad (27)$

From (25), $\mathbf{R}\mathbf{b}^* = \mathbf{r}$. Substituting this into Equation (27), we have

$\boldsymbol{\lambda} = [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{b}),$

and so, from (26),

$\mathbf{b}^* = \mathbf{b} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{b}). \quad (28)$
Now, the restricted residuals can be written as

$\hat{\mathbf{u}}^* = \mathbf{y} - \mathbf{X}\mathbf{b}^* = \mathbf{y} - \mathbf{X}\mathbf{b} - \mathbf{X}\mathbf{b}^* + \mathbf{X}\mathbf{b} = \hat{\mathbf{u}} - \mathbf{X}(\mathbf{b}^* - \mathbf{b}).$

Therefore,

$\hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^* = [\hat{\mathbf{u}} - \mathbf{X}(\mathbf{b}^* - \mathbf{b})]'[\hat{\mathbf{u}} - \mathbf{X}(\mathbf{b}^* - \mathbf{b})]$

$= \hat{\mathbf{u}}'\hat{\mathbf{u}} - \hat{\mathbf{u}}'\mathbf{X}(\mathbf{b}^* - \mathbf{b}) - (\mathbf{b}^* - \mathbf{b})'\mathbf{X}'\hat{\mathbf{u}} + (\mathbf{b}^* - \mathbf{b})'\mathbf{X}'\mathbf{X}(\mathbf{b}^* - \mathbf{b})$

$= \hat{\mathbf{u}}'\hat{\mathbf{u}} + (\mathbf{b}^* - \mathbf{b})'\mathbf{X}'\mathbf{X}(\mathbf{b}^* - \mathbf{b}),$

since $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$.$^{25}$ Substituting for $\mathbf{b}^* - \mathbf{b}$ from Equation (28), and cancelling the $\mathbf{X}'\mathbf{X}$ terms, we have

$\hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^* = \hat{\mathbf{u}}'\hat{\mathbf{u}} + (\mathbf{r} - \mathbf{R}\mathbf{b})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{b}).$

Denoting $\hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^*$ (the sum of squared residuals from the restricted model) by $\mathrm{SSR}_R$, and $\hat{\mathbf{u}}'\hat{\mathbf{u}}$ (the sum of squared residuals from the unrestricted model) by $\mathrm{SSR}_U$, we have the following convenient form of the $F$-statistic:

$F = \frac{(\mathrm{SSR}_R - \mathrm{SSR}_U)/q}{\mathrm{SSR}_U/(n - k)} \sim F(q, n-k). \quad (29)$
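A sketch of the $F$-test in Equation (29) for example 2 above ($H_0\!: \beta_2 = \beta_3 = 0$), assuming Python with numpy and SciPy (SciPy supplies the $F$ distribution; neither library is prescribed by the notes):

```python
# F-test via restricted vs. unrestricted sums of squared residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)  # H0 is true here

b_u = np.linalg.solve(X.T @ X, X.T @ y)   # unrestricted OLS
ssr_u = ((y - X @ b_u) ** 2).sum()
ssr_r = ((y - y.mean()) ** 2).sum()       # restricted fit: constant only

q, k = 2, 3
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))  # Equation (29)
print(F, 1 - stats.f.cdf(F, q, n - k))         # statistic and P-value
```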
4.4 Bootstrapping
The test statistics considered above were built on the assumption that $\mathbf{u}$ is normally distributed. However, quite often, we may be unwilling to make this assumption. In such cases, we therefore do not know the distribution of our test statistics. This problem leads to a procedure known as the bootstrap.$^{26}$ The basic idea is to use the observed data to try to estimate the distribution of the relevant test statistic. In general, this procedure requires the following steps:

1. Using the observed data, estimate both the unrestricted and restricted models. Save the fitted values and residuals from the restricted model. Call these $\hat{\mathbf{y}}_R$ and $\hat{\mathbf{u}}_R$, respectively.

2. Using the estimates from the previous step, calculate the test statistic of interest (e.g. the $F$-statistic). Call this $T$.

3. Generate a bootstrap sample by resampling with replacement from the restricted residuals $\hat{\mathbf{u}}_R$ (call the resampled disturbances $\mathbf{u}^*$), and construct the bootstrap dependent variable as

$\mathbf{y}^* = \hat{\mathbf{y}}_R + \mathbf{u}^*.$

4. Using the bootstrap sample, re-estimate both the unrestricted and restricted models.

5. Using the estimates from the previous step, calculate the test statistic of interest (this is known as the bootstrap test statistic, since it is based on the bootstrap sample). Call this $T^*_b$.

Finally, repeat Steps 3-5 $B$ times, obtaining $T^*_b$, $b = 1, \ldots, B$.$^{27}$ It turns out that these bootstrap test statistics provide a fairly good estimate of the distribution of the test statistic of interest. For inference purposes, we can calculate the simulated P-value of the original test statistic, $T$, as

$\hat{p}^*(T) = \frac{1}{B}\sum_{b=1}^B I(T^*_b > T),$

where $I(\cdot)$ is the indicator function, equal to 1 if its argument is true and 0 otherwise.

$^{25}$ Here, we make use of the fact that $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$. Proving this would be a useful exercise.
$^{26}$ Here, we present a very brief introduction to the bootstrap. A more complete introduction can be found in Davidson and MacKinnon (2004, Section 4.6).
$^{27}$ For a chosen level of significance, $\alpha$, for the test, $B$ should be chosen so that $\alpha(B + 1)$ is an integer (see Davidson and MacKinnon (2004, Section 4.6)). For $\alpha = 0.01$, appropriate values of $B$ would be 99, 199, 299, and so on. Since computing costs are now so low, $B$ is often chosen to be 999 or even 9999.
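A compact sketch of Steps 1-5 (assuming Python/numpy; the helper f_stat and the data-generating process are illustrative, not part of the notes):

```python
# Bootstrap P-value for an F-test of H0: all slope coefficients are zero.
import numpy as np

def f_stat(y, X, q, k):
    """F-statistic of Equation (29) for H0: slopes = 0 (restricted fit: mean)."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    ssr_u = ((y - X @ b) ** 2).sum()
    ssr_r = ((y - y.mean()) ** 2).sum()
    return ((ssr_r - ssr_u) / q) / (ssr_u / (len(y) - k))

rng = np.random.default_rng(6)
n, B = 100, 999
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = 1.0 + rng.standard_t(df=3, size=n)        # non-normal errors, H0 true

T = f_stat(y, X, q=2, k=3)                    # Steps 1-2
y_fit_r, u_r = np.full(n, y.mean()), y - y.mean()

T_star = np.empty(B)
for b_ in range(B):                           # Steps 3-5, repeated B times
    u_star = rng.choice(u_r, size=n, replace=True)
    T_star[b_] = f_stat(y_fit_r + u_star, X, q=2, k=3)

print((T_star > T).mean())                    # simulated P-value
```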
5.1 Introduction
In estimating the linear regression model

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \quad (30)$

by OLS, we made the (very strong) assumption that

$\mathbf{u} \sim \mathrm{IID}(\mathbf{0}, \sigma^2_u\mathbf{I}_n).$

We would now like to relax this assumption, and consider what happens when the error terms are not IID. While we will continue to assume that $E(\mathbf{u}) = \mathbf{0}$, we will now let

$\mathrm{Var}(\mathbf{u}) = E(\mathbf{u}\mathbf{u}') = E\begin{pmatrix} u_1^2 & \ldots & u_1 u_n \\ \vdots & & \vdots \\ u_n u_1 & \ldots & u_n^2 \end{pmatrix} = \boldsymbol{\Omega},$

where $\boldsymbol{\Omega}$ is an $n \times n$ covariance matrix. That is, we allow the possibility that there is some $E(u_i u_j) \ne 0$ for $i \ne j$ (and that therefore $u_i$ and $u_j$ are not independent), and that there is some $E(u_i^2) \ne E(u_j^2)$ for $i \ne j$ (and that therefore $u_i$ and $u_j$ are not identically distributed). In other words, we allow the possibility that each $u_i$ is not IID.

Notice that the OLS estimator is still unbiased, since

$E(\mathbf{b}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}) = \boldsymbol{\beta}.$

However, its variance is now

$\mathrm{Var}(\mathbf{b}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}\mathbf{u}')\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}. \quad (31)$
It turns out (as we will see shortly) that, in this more general setting, OLS is no longer the best linear unbiased estimator (BLUE). That is, there is some other linear unbiased estimator which has a variance smaller than (31). In deriving this better estimator, the basic strategy is to transform the linear regression model in (30) so that the error terms become IID. To do so, start by letting

$\boldsymbol{\Psi}'\boldsymbol{\Psi} = \boldsymbol{\Omega}^{-1},$

where $\boldsymbol{\Psi}$ is some invertible $n \times n$ matrix. Premultiplying the regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$ by $\boldsymbol{\Psi}$ gives the transformed model

$\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}^*, \quad (32)$

where

$\mathbf{y}^* = \boldsymbol{\Psi}\mathbf{y}, \quad \mathbf{X}^* = \boldsymbol{\Psi}\mathbf{X}, \quad \text{and} \quad \mathbf{u}^* = \boldsymbol{\Psi}\mathbf{u}.$

Note that

$E(\mathbf{u}^*) = E(\boldsymbol{\Psi}\mathbf{u}) = \boldsymbol{\Psi}E(\mathbf{u}) = \mathbf{0},$

and that

$\mathrm{Var}(\mathbf{u}^*) = E(\mathbf{u}^*\mathbf{u}^{*\prime}) = E[(\boldsymbol{\Psi}\mathbf{u})(\boldsymbol{\Psi}\mathbf{u})'] = \boldsymbol{\Psi}E(\mathbf{u}\mathbf{u}')\boldsymbol{\Psi}' = \boldsymbol{\Psi}\boldsymbol{\Omega}\boldsymbol{\Psi}' = \boldsymbol{\Psi}(\boldsymbol{\Psi}'\boldsymbol{\Psi})^{-1}\boldsymbol{\Psi}' = \boldsymbol{\Psi}\boldsymbol{\Psi}^{-1}(\boldsymbol{\Psi}')^{-1}\boldsymbol{\Psi}' = \mathbf{I}_n.$

That is, $\mathbf{u}^* \sim \mathrm{IID}(\mathbf{0}, \mathbf{I}_n)$.
Applying OLS to the transformed model (32) yields the generalized least squares (or GLS) estimator, $\mathbf{b}_{GLS}$:

$\mathbf{b}_{GLS} = (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{y}^* = (\mathbf{X}'\boldsymbol{\Psi}'\boldsymbol{\Psi}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Psi}'\boldsymbol{\Psi}\mathbf{y} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}. \quad (33)$

Substituting $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, we have

$\mathbf{b}_{GLS} = \boldsymbol{\beta} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{u}. \quad (34)$

Taking expectations,

$E(\mathbf{b}_{GLS}) = \boldsymbol{\beta} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}E(\mathbf{u}) = \boldsymbol{\beta},$

since $E(\mathbf{u}) = \mathbf{0}$. So $\mathbf{b}_{GLS}$ is unbiased. Next, consider the variance of $\mathbf{b}_{GLS}$:

$\mathrm{Var}(\mathbf{b}_{GLS}) = E\{[(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{u}][(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{u}]'\}$

$= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}E(\mathbf{u}\mathbf{u}')\boldsymbol{\Omega}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}$

$= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\boldsymbol{\Omega}\boldsymbol{\Omega}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}$

$= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}. \quad (35)$
We now want to consider a generalization of the Gauss-Markov theorem, which states that, if $\mathrm{Var}(\mathbf{u}) = \boldsymbol{\Omega}$, the GLS estimator of $\boldsymbol{\beta}$ is BLUE. The proof of this will be very similar to the proof of the Gauss-Markov theorem for OLS (where it was assumed that $\mathrm{Var}(\mathbf{u}) = \sigma^2_u\mathbf{I}_n$). Consider some other linear, unbiased estimator of $\boldsymbol{\beta}$, call it $\mathbf{B}$:

$\mathbf{B} = \mathbf{G}\mathbf{y},$

where, here,

$\mathbf{G} = \mathbf{W} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1},$

and $\mathbf{W}$ depends only on $\mathbf{X}$. The expected value of $\mathbf{B}$ is thus

$E(\mathbf{B}) = E(\mathbf{G}\mathbf{y}) = \mathbf{G}E(\mathbf{y}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta},$

so, in order for $\mathbf{B}$ to be unbiased, we require

$\mathbf{G}\mathbf{X} = \mathbf{I}_k,$

where $\mathbf{I}_k$ is a $k \times k$ identity matrix. Using $\mathbf{G}\mathbf{X} = \mathbf{I}_k$, it is easy to confirm that

$\mathbf{B} = \mathbf{G}\mathbf{y} = \mathbf{G}(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta} + \mathbf{G}\mathbf{u} = \boldsymbol{\beta} + \mathbf{G}\mathbf{u},$

and that

$\mathbf{W}\mathbf{X} = [\mathbf{G} - (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}]\mathbf{X} = \mathbf{G}\mathbf{X} - (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X} = \mathbf{I}_k - \mathbf{I}_k = \mathbf{0}.$

The variance of $\mathbf{B}$ is therefore

$\mathrm{Var}(\mathbf{B}) = E([\mathbf{B} - E(\mathbf{B})][\mathbf{B} - E(\mathbf{B})]') = E[(\mathbf{G}\mathbf{u})(\mathbf{G}\mathbf{u})'] = \mathbf{G}E(\mathbf{u}\mathbf{u}')\mathbf{G}' = \mathbf{G}\boldsymbol{\Omega}\mathbf{G}'$

$= [\mathbf{W} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}]\boldsymbol{\Omega}[\mathbf{W} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}]'$

$= \mathbf{W}\boldsymbol{\Omega}\mathbf{W}' + \mathbf{W}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}' + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}$

$= \mathbf{W}\boldsymbol{\Omega}\mathbf{W}' + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1},$

since $\mathbf{W}\mathbf{X} = \mathbf{0}$. Finally, since $\mathbf{W}\boldsymbol{\Omega}\mathbf{W}'$ is positive semidefinite, $\mathrm{Var}(\mathbf{B}) \ge \mathrm{Var}(\mathbf{b}_{GLS})$, i.e. $\mathbf{b}_{GLS}$ is BLUE. Note that if $\mathrm{Var}(\mathbf{u}) = \sigma^2_u\mathbf{I}_n$, then GLS includes OLS as a special case,$^{28}$ and we are back to the standard Gauss-Markov theorem.

$^{28}$ Setting $\boldsymbol{\Omega} = \sigma^2_u\mathbf{I}_n$ in (33) gives $\mathbf{b}_{GLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{b}$.
Of course, in practice, $\boldsymbol{\Omega}$ is almost never known. That is, GLS is not usually feasible. However, we can use some estimate of $\boldsymbol{\Omega}$, which we denote $\hat{\boldsymbol{\Omega}}$. Replacing $\boldsymbol{\Omega}$ by $\hat{\boldsymbol{\Omega}}$ in (33) gives the feasible GLS (or FGLS) estimator:

$\mathbf{b}_{FGLS} = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}.$

How we actually go about estimating $\boldsymbol{\Omega}$ depends on the problem at hand, as the next two sections illustrate.
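A minimal Python/numpy sketch of Equation (33); the diagonal $\boldsymbol{\Omega}$ used here is an assumption of the example, not part of the derivation above:

```python
# GLS: bGLS = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y   (Equation (33))
import numpy as np

def gls(y, X, Omega):
    Oinv = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# Toy heteroskedastic example with a known (assumed) Omega.
rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sd = np.linspace(0.5, 3.0, n)               # error standard deviations
y = X @ np.array([1.0, 2.0]) + sd * rng.normal(size=n)

print(gls(y, X, np.diag(sd ** 2)))          # close to (1.0, 2.0)
```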
5.3 Heteroskedasticity
The problem of heteroskedasticity occurs when the diagonal elements of $\boldsymbol{\Omega}$ are not all the same (meaning each $u_i$ does not have the same variance), while the off-diagonal elements are all zero (meaning each $u_i$ is still independent). In this case, we have

$\boldsymbol{\Omega} = \begin{pmatrix} \sigma^2_{u_1} & 0 & \ldots & 0 \\ 0 & \sigma^2_{u_2} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & \sigma^2_{u_n} \end{pmatrix}.$

Since $\boldsymbol{\Omega}$ is diagonal, we have

$\boldsymbol{\Omega}^{-1} = \begin{pmatrix} 1/\sigma^2_{u_1} & 0 & \ldots & 0 \\ 0 & 1/\sigma^2_{u_2} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 1/\sigma^2_{u_n} \end{pmatrix}.$

So, if we let

$\boldsymbol{\Psi} = \begin{pmatrix} 1/\sigma_{u_1} & 0 & \ldots & 0 \\ 0 & 1/\sigma_{u_2} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 1/\sigma_{u_n} \end{pmatrix},$

then $\boldsymbol{\Psi}'\boldsymbol{\Psi} = \boldsymbol{\Omega}^{-1}$, as desired. The transformed variables are

$\mathbf{y}^* = \boldsymbol{\Psi}\mathbf{y} = \begin{pmatrix} y_1/\sigma_{u_1} \\ y_2/\sigma_{u_2} \\ \vdots \\ y_n/\sigma_{u_n} \end{pmatrix}, \quad \mathbf{X}^* = \boldsymbol{\Psi}\mathbf{X} = \begin{pmatrix} x_{1,1}/\sigma_{u_1} & \ldots & x_{k,1}/\sigma_{u_1} \\ x_{1,2}/\sigma_{u_2} & \ldots & x_{k,2}/\sigma_{u_2} \\ \vdots & & \vdots \\ x_{1,n}/\sigma_{u_n} & \ldots & x_{k,n}/\sigma_{u_n} \end{pmatrix},$

and

$\mathbf{u}^* = \boldsymbol{\Psi}\mathbf{u} = \begin{pmatrix} u_1/\sigma_{u_1} \\ u_2/\sigma_{u_2} \\ \vdots \\ u_n/\sigma_{u_n} \end{pmatrix}.$

So, in this case, the transformed model $\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}^*$ can be written, for the $i$th observation, as

$\frac{y_i}{\sigma_{u_i}} = \beta_1\frac{x_{1,i}}{\sigma_{u_i}} + \ldots + \beta_k\frac{x_{k,i}}{\sigma_{u_i}} + \frac{u_i}{\sigma_{u_i}}, \quad i = 1, \ldots, n.$

GLS estimation in this case is known as weighted least squares (WLS), since each observation is weighted by the inverse of the standard deviation of its error term.
Of course, in practice, we don't know the $\sigma_{u_i}$ (i.e. we don't know $\boldsymbol{\Omega}$), so they must be estimated before this transformed model can be used (making this, in this case, what might be referred to as feasible WLS). Typically, this is done as follows. Suppose we think that the variance of $u_i$ can be explained by the variables $z_{1,i}, \ldots, z_{r,i}$ (which may include columns of $\mathbf{X}$). That is, suppose

$E(u_i^2) = \sigma^2_{u_i} = \gamma_1 z_{1,i}^{\delta_1} + \ldots + \gamma_r z_{r,i}^{\delta_r}, \quad i = 1, \ldots, n.$

Adding a disturbance term $v_i$, we have

$u_i^2 = \gamma_1 z_{1,i}^{\delta_1} + \ldots + \gamma_r z_{r,i}^{\delta_r} + v_i, \quad i = 1, \ldots, n,$

which is in the form of a non-linear regression model. Unfortunately, we don't have time to cover non-linear regression models in this course, but estimating such a model is certainly possible. However, since we can't actually observe $u_i^2$, we would have to use the squared residuals from the OLS estimation of model (30), $\hat{u}_i^2$, in their place. That is, we could estimate the non-linear regression model

$\hat{u}_i^2 = \gamma_1 z_{1,i}^{\delta_1} + \ldots + \gamma_r z_{r,i}^{\delta_r} + w_i, \quad i = 1, \ldots, n,$

where $w_i$ is a disturbance term.$^{29}$ Using the square roots of the fitted values from this regression, $\hat{\sigma}_{u_i}$, we can then proceed with WLS, estimating

$\frac{y_i}{\hat{\sigma}_{u_i}} = \beta_1\frac{x_{1,i}}{\hat{\sigma}_{u_i}} + \ldots + \beta_k\frac{x_{k,i}}{\hat{\sigma}_{u_i}} + \frac{u_i}{\hat{\sigma}_{u_i}}, \quad i = 1, \ldots, n,$

by OLS.
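A sketch of this feasible WLS procedure in Python/numpy. For simplicity it fits a linear skedastic function for $\hat{u}_i^2$ rather than the power form above, so it illustrates the idea rather than the exact procedure:

```python
# Feasible WLS: estimate variances from OLS residuals, then reweight.
import numpy as np

rng = np.random.default_rng(8)
n = 500
x = np.abs(rng.normal(size=n)) + 0.1
X = np.column_stack([np.ones(n), x])
sd = 0.5 * x                                   # true skedastic pattern (assumed)
y = X @ np.array([1.0, 2.0]) + sd * rng.normal(size=n)

# Step 1: OLS and squared residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
u2 = (y - X @ b_ols) ** 2

# Step 2: regress u2 on (1, x) to get fitted variances
g = np.linalg.solve(X.T @ X, X.T @ u2)
sd_hat = np.sqrt(np.clip(X @ g, 1e-6, None))   # guard against negative fits

# Step 3: weighted regression (divide every variable by sd_hat)
Xw, yw = X / sd_hat[:, None], y / sd_hat
print(np.linalg.solve(Xw.T @ Xw, Xw.T @ yw))
```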
5.4 Autocorrelation
The problem of autocorrelation (or serial correlation) occurs when the off-diagonal elements of $\boldsymbol{\Omega}$ are not all zero (meaning each $u_i$ is no longer independent). Here, we will assume that the errors are homoskedastic (meaning each diagonal element of $\boldsymbol{\Omega}$ is identical), but it is possible that autocorrelation and heteroskedasticity are simultaneously present. Autocorrelation is usually encountered in time-series data, where the error terms may be generated by some autoregressive process. For example, suppose the error terms follow a (linear) first-order autoregressive process (or AR(1) process):

$u_t = \rho u_{t-1} + v_t, \quad t = 1, \ldots, n, \quad (36)$

$^{29}$ This model could also be used to test for the presence of heteroskedasticity. Here, the null hypothesis would be $H_0\!: \gamma_1 = \ldots = \gamma_r = 0$ (implying homoskedasticity). An F-test could be used to test this restriction.
where $|\rho| < 1$, and $v_t \sim \mathrm{IID}(0, \sigma^2_v)$. The condition that $|\rho| < 1$ is imposed so that the AR(1) process in (36) is what is known as covariance stationary. A covariance stationary process for $u_t$ is one in which $E(u_t)$, $\mathrm{Var}(u_t)$, and $\mathrm{Cov}(u_t, u_{t-j})$, for any given $j$, are independent of $t$.

One way to see this is to imagine that, although we only start observing it at $t = 1$, $u_t$ has been in existence for an infinite time. First, note that, by repeated substitution,

$u_t = \rho(\rho u_{t-2} + v_{t-1}) + v_t = \rho^2 u_{t-2} + \rho v_{t-1} + v_t = \rho^3 u_{t-3} + \rho^2 v_{t-2} + \rho v_{t-1} + v_t = \ldots,$

so that

$u_t = \sum_{i=0}^{\infty} \rho^i v_{t-i}.$

The expected value of $u_t$ is then

$E(u_t) = E\left(\sum_{i=0}^{\infty} \rho^i v_{t-i}\right) = \sum_{i=0}^{\infty} \rho^i E(v_{t-i}) = 0,$

since each $E(v_{t-i}) = 0$; this is independent of $t$. Next, since the $v_t$ are independent,

$\mathrm{Var}(u_t) = \mathrm{Var}\left(\sum_{i=0}^{\infty} \rho^i v_{t-i}\right) = \sum_{i=0}^{\infty} \rho^{2i}\mathrm{Var}(v_{t-i}) = \sigma^2_v\sum_{i=0}^{\infty} \rho^{2i},$

and, since the geometric series converges (because $|\rho| < 1$),

$\mathrm{Var}(u_t) = \frac{\sigma^2_v}{1 - \rho^2}.$

So, $\mathrm{Var}(u_t)$ is also independent of $t$. Note that if $|\rho| \ge 1$, this infinite series would not converge (rather, it would explode), and the variance would therefore depend on $t$; in this case, the process would be non-stationary.
Finally, consider the covariance between $u_t$ and $u_{t-j}$, for any given $j$. For $j = 1$,

$\mathrm{Cov}(u_t, u_{t-1}) = E(u_t u_{t-1}) = E[(\rho u_{t-1} + v_t)u_{t-1}] = \rho E(u_{t-1}^2) = \frac{\rho\sigma^2_v}{1 - \rho^2}.$

Similarly,

$\mathrm{Cov}(u_t, u_{t-2}) = \frac{\rho^2\sigma^2_v}{1 - \rho^2},$

and, in general,

$\mathrm{Cov}(u_t, u_{t-j}) = \frac{\rho^j\sigma^2_v}{1 - \rho^2},$

which is also independent of $t$. Using the above results for $\mathrm{Var}(u_t)$ and $\mathrm{Cov}(u_t, u_{t-j})$, we have

$\boldsymbol{\Omega} = \frac{\sigma^2_v}{1 - \rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \ldots & \rho^{n-1} \\ \rho & 1 & \rho & \ldots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \ldots & \rho^{n-3} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \ldots & 1 \end{pmatrix}.$

It can be verified that

$\boldsymbol{\Omega}^{-1} = \frac{1}{\sigma^2_v}\begin{pmatrix} 1 & -\rho & 0 & \ldots & 0 & 0 \\ -\rho & 1+\rho^2 & -\rho & \ldots & 0 & 0 \\ 0 & -\rho & 1+\rho^2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1+\rho^2 & -\rho \\ 0 & 0 & 0 & \ldots & -\rho & 1 \end{pmatrix}.$
So, if we let

$\boldsymbol{\Psi} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \ldots & 0 & 0 \\ -\rho & 1 & 0 & \ldots & 0 & 0 \\ 0 & -\rho & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -\rho & 1 \end{pmatrix},$

then $\boldsymbol{\Psi}'\boldsymbol{\Psi} = \boldsymbol{\Omega}^{-1}$, as desired. The transformed variables are

$\mathbf{y}^* = \boldsymbol{\Psi}\mathbf{y} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2}\,y_1 \\ y_2 - \rho y_1 \\ y_3 - \rho y_2 \\ \vdots \\ y_n - \rho y_{n-1} \end{pmatrix}, \quad \mathbf{X}^* = \boldsymbol{\Psi}\mathbf{X} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2}\,x_{1,1} & \ldots & \sqrt{1-\rho^2}\,x_{k,1} \\ x_{1,2} - \rho x_{1,1} & \ldots & x_{k,2} - \rho x_{k,1} \\ \vdots & & \vdots \\ x_{1,n} - \rho x_{1,n-1} & \ldots & x_{k,n} - \rho x_{k,n-1} \end{pmatrix},$

and

$\mathbf{u}^* = \boldsymbol{\Psi}\mathbf{u} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2}\,u_1 \\ u_2 - \rho u_1 \\ \vdots \\ u_n - \rho u_{n-1} \end{pmatrix}.$

So, in this case, the transformed model $\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}^*$ can be written, for the first observation, as

$\frac{1}{\sigma_v}\sqrt{1-\rho^2}\,y_1 = \beta_1\frac{1}{\sigma_v}\sqrt{1-\rho^2}\,x_{1,1} + \ldots + \beta_k\frac{1}{\sigma_v}\sqrt{1-\rho^2}\,x_{k,1} + \frac{1}{\sigma_v}\sqrt{1-\rho^2}\,u_1,$

and, for the $t$th observation, $t = 2, \ldots, n$, as

$\frac{1}{\sigma_v}(y_t - \rho y_{t-1}) = \beta_1\frac{1}{\sigma_v}(x_{1,t} - \rho x_{1,t-1}) + \ldots + \beta_k\frac{1}{\sigma_v}(x_{k,t} - \rho x_{k,t-1}) + \frac{1}{\sigma_v}(u_t - \rho u_{t-1}).$

Of course, since the $\frac{1}{\sigma_v}$ factor appears in every term, it cancels, and we can simply estimate

$\sqrt{1-\rho^2}\,y_1 = \beta_1\sqrt{1-\rho^2}\,x_{1,1} + \ldots + \beta_k\sqrt{1-\rho^2}\,x_{k,1} + \sqrt{1-\rho^2}\,u_1$

for the first observation,
and

$(y_t - \rho y_{t-1}) = \beta_1(x_{1,t} - \rho x_{1,t-1}) + \ldots + \beta_k(x_{k,t} - \rho x_{k,t-1}) + (u_t - \rho u_{t-1}),$

for $t = 2, \ldots, n$.

Of course, in practice, $\rho$ (and hence $\boldsymbol{\Omega}$) is unknown, so it must be estimated, making this feasible GLS. A common approach is iterated FGLS:

1. Estimate (30) by OLS and obtain the residuals $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\mathbf{b}$.

2. Obtain the $j$th estimate of $\rho$, $\hat{\rho}^{(j)}$, by regressing $\hat{u}_t$ on $\hat{u}_{t-1}$, i.e.

$\hat{\rho}^{(j)} = \frac{\sum_{t=2}^n \hat{u}_t\hat{u}_{t-1}}{\sum_{t=2}^n \hat{u}_{t-1}^2}.$

3. Use $\hat{\rho}^{(j)}$ to construct $\hat{\boldsymbol{\Omega}}$ and get the FGLS estimate of (30), $\mathbf{b}_{FGLS}$; use this estimate to compute new residuals, and repeat Steps 2-3 until the estimates converge, i.e. until $|\hat{\rho}^{(j)} - \hat{\rho}^{(j-1)}|$ is smaller than some chosen tolerance.
(j)
(j1)
H0 : R = r,
we use use
bGLS (or bFGLS ) instead of b as an bGLS , but, at the end of this section,
estimate of
bFGLS .
Note rst, that
E(RbGLS )
= RE(bGLS ) = R,
59
and
Var(RbGLS )
= = = = = = =
E([RbGLS E(RbGLS )][RbGLS E(RbGLS )] ) E[(RbGLS R)(RbGLS R) ] E[(RbGLS R)(bGLS R R )] E[R(bGLS )(bGLS )R ] RE[(bGLS )(bGLS ) ]R RVar(bGLS )R R(X 1 X)1 R .
u N(0, ),
then
R = r,
bGLS ,
but
also to construct the the above test statistic. Of course, since we almost never
Substituting
for
in the above,
is normally distributed).
Appendix A
We showed these results in class. These notes will eventually be updated.
References
Davidson, R. and MacKinnon, J. G. (2004). Econometric Theory and Methods. New York: Oxford University Press.