Econometrics Lecture Notes
A random variable (RV) is a variable whose value is determined by the outcome of a random experiment.$^1$ Consider the random experiment of rolling a single die. In this experiment, there are six possible outcomes: if the die lands with 1, 2, 3, 4, 5, or 6 facing up, we assign that value to the RV $X$. This example illustrates what is known as a discrete RV: an RV which can only take on a finite (or countably infinite) number of values. Here, the RV $X$ can take on the value 1, 2, 3, 4, 5, or 6. A discrete RV differs from a continuous RV: an RV which can take on any value in some real interval (and therefore an uncountably infinite number of values). An example of a continuous RV is a person's height. For an adult, this could take on any value between (say) 4 and 8 feet, depending on the precision of measurement.

$^1$ In what follows, we will denote an RV by an uppercase letter (such as $X$ or $Y$), and an outcome by a lowercase letter (such as $x$ or $y$).
Consider first the discrete case. The probability mass function (or PDF), $f(x)$, tells us the probability that the discrete RV $X$ takes on the value $x$. More formally,

$f(x) = \mathrm{Prob}(X = x).$

It should be noted that a probability mass function for a discrete RV requires that

$0 \le f(x) \le 1$, and $\sum_x f(x) = 1.$$^2$

In our die-rolling example, it should be clear that, since each outcome (1, 2, 3, 4, 5, or 6) is equally likely, the probability mass function for $X$, $f(x)$, is equal to $\tfrac{1}{6}$ for each $x$, i.e.

$f(x) = \tfrac{1}{6}, \quad \text{for } x = 1, \ldots, 6.$

This is an example of a (discrete) uniform distribution: each value that $X$ can take on has the same probability.
For a continuous RV, the probability associated with any particular value is zero, so we can only assign positive probabilities to ranges of values. For example, if we are interested in the probability that the continuous RV $X$ will take on a value in the range between $a$ and $b$, we write this as

$\mathrm{Prob}(a \le X \le b) = \int_a^b f(x)\,dx.$

As in the discrete case, a PDF for a continuous RV has certain (roughly analogous) requirements that must be satisfied:

$f(x) \ge 0$, and $\int_x f(x)\,dx = 1.$$^3$

Closely related to the PDF is the cumulative distribution function (or CDF), $F(x)$, which gives the probability that the RV takes on a value no greater than $x$. For a discrete RV $X$,

$F(x) = \sum_{X \le x} f(x).$

$^2$ The notation $\sum_x$ is used to denote the sum over the entire range of values that the discrete RV $X$ can take on.
$^3$ The notation $\int_x$ is used to denote the integral over the entire range of values that the continuous RV $X$ can take on.
In our example of rolling a single die, it should be easy to see that the CDF of $X$ is

$F(x) = \frac{x}{6}, \quad \text{for } x = 1, \ldots, 6.$

For a continuous RV $X$,

$F(x) = \int_{-\infty}^x f(t)\,dt, \quad \text{and} \quad f(x) = \frac{dF(x)}{dx}.$

For any RV $X$ (whether discrete or continuous), the CDF $F(x)$ must satisfy

$0 \le F(x) \le 1, \quad F(+\infty) = 1, \quad \text{and} \quad F(-\infty) = 0.$
Moving beyond a single RV, we can consider the joint distribution of two RVs. For two discrete RVs $X$ and $Y$, the joint probability mass function $f(x, y)$ tells us the probability that the RV $X$ takes on the value $x$ and the RV $Y$ takes on the value $y$. The requirements are analogous to those for a single discrete RV:

$0 \le f(x, y) \le 1$, and $\sum_x \sum_y f(x, y) = 1.$

For example, consider two dice, one red and one blue. Denote by $X$ the RV whose value is determined by the random experiment of rolling the red die, and by $Y$ the RV whose value is determined by the random experiment of rolling the blue die. It should be emphasized that these are two separate random experiments (i.e. do not confuse this with the single random experiment of rolling a pair of dice). Let $(x, y)$ denote the combined outcome that $X$ takes on the value $x$ and $Y$ takes on the value $y$. For example, the combined outcome $(1, 5)$ means that we roll a 1 with the red die and a 5 with the blue die. In total, there are 36 such combined outcomes, each with equal probability, so the joint probability mass function for $X$ and $Y$ is

$f(x, y) = \tfrac{1}{36}, \quad \text{for } x = 1, \ldots, 6 \text{ and } y = 1, \ldots, 6.$

For two continuous RVs $X$ and $Y$, the joint PDF $f(x, y)$ tells us the probability that $X$ and $Y$ will take on values in some ranges:

$\mathrm{Prob}(a \le X \le b,\; c \le Y \le d) = \int_a^b \int_c^d f(x, y)\,dy\,dx.$

The requirements here are similar to those of a PDF for a single continuous RV:

$f(x, y) \ge 0$, and $\int_x \int_y f(x, y)\,dy\,dx = 1.$
For two discrete RVs $X$ and $Y$, the conditional probability mass function, $f(y|x)$, tells us the probability that $Y$ takes on the value $y$, given that $X$ takes on the value $x$, i.e.

$f(y|x) = \frac{f(x, y)}{f(x)}.$

For continuous RVs, we can define the conditional PDF analogously, but here we are interested in the probability that the continuous RV $Y$ takes on a value in some range, given that another continuous RV $X$ takes on some specific value. Of course, as noted above, the probability that a continuous RV takes on a specific outcome is zero, so we need to make some approximation to get around this.

Suppose that we want to know the probability that the continuous RV $Y$ takes on a value between $a$ and $b$, given that the continuous RV $X$ takes on some value extremely close to $x$. The probability that the continuous RV $X$ takes on some value extremely close to $x$, say in the range between $x$ and $x + h$, where $h$ is some extremely small number, is

$\Pr(x \le X \le x + h) = \int_x^{x+h} f(v)\,dv \approx h f(x).$

Now, the probability that $Y$ lies between $a$ and $b$ and $X$ lies between $x$ and $x + h$ is

$\Pr(a \le Y \le b,\; x \le X \le x + h) = \int_a^b \int_x^{x+h} f(y, v)\,dv\,dy \approx h \int_a^b f(y, x)\,dy.$

So, similar to the discrete case, the probability that $Y$ lies in the range between $a$ and $b$, given that $X$ lies between $x$ and $x + h$, is

$\Pr(a \le Y \le b \mid x \le X \le x + h) = \frac{\Pr(a \le Y \le b,\; x \le X \le x + h)}{\Pr(x \le X \le x + h)} \approx \int_a^b \frac{f(y, x)}{f(x)}\,dy.$

Defining the conditional PDF as $f(y|x) = f(x, y)/f(x)$, we therefore have

$\Pr(a \le Y \le b \mid X \approx x) = \int_a^b f(y|x)\,dy.$
Roughly speaking, the mathematical expectation (or expected value) of an RV is the value we would expect it to take on if we were to repeat the random experiment which determines its value many times over. It is what is known as a measure of central tendency.$^4$ The expected value of the RV $X$ is denoted $E(X)$, or $\mu_X$, and is often simply called the mean. Since these terms are interchangeable, so is the notation. The mathematical expectation of a discrete RV $X$ is

$E(X) = \sum_x x f(x),$

$^4$ Other measures of central tendency are the median and mode. The median of the RV $X$ (whether discrete or continuous) is the value $m$ such that $\mathrm{Prob}(X \le m) \ge 0.5$ and $\mathrm{Prob}(X \ge m) \ge 0.5$, while the mode of $X$ is the value of $x$ at which $f(x)$ takes its maximum.
while for a continuous RV $X$, we have

$E(X) = \int_x x f(x)\,dx.$

It should be clear that these definitions are quite similar, in that the different possible values of the RV are weighted by the probability attached to them. In the discrete case, these weighted values are summed, while in the continuous case, integration is used. It should go without saying that the expected value of a constant $c$ is the constant itself:

$E(c) = c.$

On the other hand, it is not so obvious that

$E[E(X)] = E(X),$

which might be a little less confusing if we use the notation $E(X) = \mu_X$: since $\mu_X$ is a constant, $E(\mu_X) = \mu_X$.

In our example of rolling a single die, we can calculate the expected value of $X$ as

$E(X) = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right) = 3.5.$
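To make this concrete, here is a minimal simulation sketch (assuming Python with numpy, which these notes do not prescribe) that approximates $E(X) = 3.5$ via the law of large numbers:

```python
# A minimal sketch: approximate E(X) = 3.5 for a fair die by simulation.
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=1_000_000)  # outcomes 1..6, equally likely
print(rolls.mean())                         # approximately 3.5
```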
In a more general way, we can also calculate the expected value of a function of an RV. Letting $g(X)$ denote a function of the discrete RV $X$, the expected value of $g(X)$ is

$E[g(X)] = \sum_x g(x) f(x).$

Similarly, letting $g(X)$ denote a function of the continuous RV $X$, the expected value of $g(X)$ is

$E[g(X)] = \int_x g(x) f(x)\,dx.$

Note that a function of an RV is an RV itself. To see this, it might help to define a new RV, say $Z$ (i.e. $Z = g(X)$).

As an example, let's again consider the RV $X$ whose value is determined by rolling a single die, and the function $g(X) = 4X$ (this is what is known as a linear function). Defining the RV $Z$, where $Z = g(X) = 4X$, the expected value of $Z$ can be calculated as

$E(Z) = E(4X) = 4\left(\tfrac{1}{6}\right) + 8\left(\tfrac{1}{6}\right) + 12\left(\tfrac{1}{6}\right) + 16\left(\tfrac{1}{6}\right) + 20\left(\tfrac{1}{6}\right) + 24\left(\tfrac{1}{6}\right) = 14.$

Notice that this is just four times the expected value of $X$ (which, as calculated above, is 3.5).

This example illustrates an important rule: the expected value of an RV multiplied by a constant is the constant multiplied by the expected value of that RV. That is, if $X$ is an RV and $c$ is a constant, we have

$E(cX) = cE(X).$

The proof of this follows directly from the definition of the mathematical expectation of a discrete RV:$^5$

$E(cX) = \sum_x c x f(x) = c \sum_x x f(x) = cE(X).$
Continuing with our random experiment of tossing a single die, consider the function $g(X) = X^2$ (this is what is known as a nonlinear function). Once again, define the RV $Z$, where $Z = g(X) = X^2$. The expected value of $Z$ is

$E(Z) = E(X^2) = 1\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 9\left(\tfrac{1}{6}\right) + 16\left(\tfrac{1}{6}\right) + 25\left(\tfrac{1}{6}\right) + 36\left(\tfrac{1}{6}\right) = \tfrac{91}{6} \approx 15.17.$

Note that the expected value of $g(X)$ is not, in general, equal to $g$ evaluated at the expected value of $X$: here, $E(X^2)$ is not equal to $[E(X)]^2 = 3.5^2 = 12.25$.
Quite often, we are interested in functions of more than one RV. Consider a function $g(X, Y)$ of the two discrete RVs $X$ and $Y$. Its expected value is

$E[g(X, Y)] = \sum_x \sum_y g(x, y) f(x, y),$

where $f(x, y)$ is the joint probability mass function of $X$ and $Y$. Similarly, for two continuous RVs $X$ and $Y$, the expected value of $g(X, Y)$ is

$E[g(X, Y)] = \int_x \int_y g(x, y) f(x, y)\,dy\,dx,$

where $f(x, y)$ is the joint PDF of $X$ and $Y$. As above, it might help to once again think of the function as a new RV $Z$.
Consider now our example of rolling two separate dice, one red and one blue, and the function $g(X, Y) = X + Y$. Define the RV $Z$, where $Z = g(X, Y) = X + Y$.$^6$ The expected value of $Z$ can be calculated as

$E(Z) = E(X + Y) = 2\left(\tfrac{1}{36}\right) + 3\left(\tfrac{2}{36}\right) + 4\left(\tfrac{3}{36}\right) + 5\left(\tfrac{4}{36}\right) + 6\left(\tfrac{5}{36}\right) + 7\left(\tfrac{6}{36}\right) + 8\left(\tfrac{5}{36}\right) + 9\left(\tfrac{4}{36}\right) + 10\left(\tfrac{3}{36}\right) + 11\left(\tfrac{2}{36}\right) + 12\left(\tfrac{1}{36}\right) = 7.$

Notice that this is just the expected value of $X$ (3.5) plus the expected value of $Y$ (3.5). This example illustrates another important rule: the expected value of the sum of two (or more) RVs is the sum of their expected values. In general terms, for RVs $X$ and $Y$, we have

$E(X + Y) = E(X) + E(Y).$$^7$

$^5$ This rule also holds for continuous RVs, and a similar proof could be made.
$^6$ Note that, while $Z$ can be seen as the RV whose value is determined by the outcome of the single random experiment of rolling a pair of dice, we are defining it here as the RV whose value is determined by a function of two separate RVs, determined by the outcomes of two separate random experiments.
$^7$ As above, this rule also holds for continuous RVs. We omit the proof for both cases here.
we have
statistical independence
this is true
), it is not true
conditional expectation
X
of
given that
X and Y . x is
E(Y |X = x) =
y
For two continuous RVs, that and
f (y|x). Y
given
Y,
(see above) is
E(Y |X x) =
y
f (y|x)dy.
1.8 Variance
Roughly speaking, the variance of an RV is a measure of the dispersion of the values it would take on around its expected value, if we were to repeat the random experiment which determines its value many times over. More formally, the variance of the RV $X$, denoted $\sigma^2_X$, is

$\sigma^2_X = E[(X - \mu_X)^2].$$^8$

From this definition, it should be clear that the variance of a constant $c$ is zero: $\sigma^2_c = 0$. Another measure of dispersion is the standard deviation of $X$: the (positive) square root of the variance, which is usually denoted $\sigma_X$.

Notice that the variance of an RV is just the expectation of a function of that RV (i.e. the variance of the RV $X$ is the expected value of the function $(X - \mu_X)^2$, as defined above).$^9$ Keeping this in mind, we can write the variance of the discrete RV $X$ as

$\sigma^2_X = \sum_x (x - \mu_X)^2 f(x),$

while the variance of the continuous RV $X$ can be written as

$\sigma^2_X = \int_x (x - \mu_X)^2 f(x)\,dx.$

In our example of rolling a single die (remembering that $\mu_X = 3.5$), the variance of $X$ is calculated as

$\sigma^2_X = (1 - 3.5)^2\left(\tfrac{1}{6}\right) + (2 - 3.5)^2\left(\tfrac{1}{6}\right) + \ldots + (6 - 3.5)^2\left(\tfrac{1}{6}\right) = \tfrac{35}{12} \approx 2.92,$

which implies $\sigma_X \approx 1.71$.
Before moving on, we can also consider the variance of a function of an RV. Consider the linear function $g(X) = cX$, where $X$ is an RV and $c$ is a constant.$^{10}$ The variance of $cX$ is

$\mathrm{Var}(cX) = E\{[cX - E(cX)]^2\} = E[(cX - c\mu_X)^2] = E[c^2(X - \mu_X)^2] = c^2 E[(X - \mu_X)^2] = c^2 \mathrm{Var}(X).$

$^8$ At times, it may be convenient to define the variance of the RV $X$ as $\sigma^2_X = E(X^2) - \mu_X^2$. It is a useful exercise to show that these two definitions are equivalent.
$^9$ While this function is also an RV, its expected value (i.e. the variance) is actually a constant (just like the expected value itself, as we pointed out above).
$^{10}$ Here, we will focus on linear functions, but the analysis could be extended to more general functions. We will consider the variance of a function of more than one RV in the next section.
1.9 Covariance
Closely related to the concept of variance is covariance: a measure of association between two RVs. The covariance of the RVs $X$ and $Y$, denoted $\sigma_{X,Y}$, is defined as

$\sigma_{X,Y} = E[(X - \mu_X)(Y - \mu_Y)].$$^{11}$

Note that the covariance of the RV $X$ with itself is just its variance:

$\sigma_{X,X} = E[(X - \mu_X)^2] = \sigma^2_X.$

Just as we viewed variance as the expectation of a function of a (single) RV, we can view covariance as the expectation of a function of two RVs.$^{12}$ For two discrete RVs $X$ and $Y$, this means

$\sigma_{X,Y} = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) f(x, y),$

while for two continuous RVs $X$ and $Y$, this means

$\sigma_{X,Y} = \int_x \int_y (x - \mu_X)(y - \mu_Y) f(x, y)\,dy\,dx.$

As mentioned above, two RVs $X$ and $Y$ satisfy $E(XY) = E(X)E(Y)$ if they are statistically independent. Therefore, using the alternate definition of covariance, it can be seen that if two RVs $X$ and $Y$ are statistically independent,

$\sigma_{X,Y} = E(XY) - \mu_X \mu_Y = E(X)E(Y) - \mu_X \mu_Y = \mu_X\mu_Y - \mu_X\mu_Y = 0.$

$^{11}$ We can also define the covariance of the RVs $X$ and $Y$ as $\sigma_{X,Y} = E(XY) - \mu_X \mu_Y$. As was the case with the definitions of variance, it is a useful exercise to show that these two are equivalent.
$^{12}$ Here, the expectation is taken of the function $g(X, Y) = (X - \mu_X)(Y - \mu_Y)$.
Using the definition of covariance, we can now consider the variance of a function of more than one RV.$^{13}$ Consider the linear function $g(X, Y) = X + Y$, where $X$ and $Y$ are RVs. The variance is

$\mathrm{Var}(X + Y) = E[(X + Y - \mu_{X+Y})^2] = E[(X + Y - \mu_X - \mu_Y)^2]$

$= E[(X - \mu_X)^2] + E[(Y - \mu_Y)^2] + 2E[(X - \mu_X)(Y - \mu_Y)]$

$= \sigma^2_X + \sigma^2_Y + 2\sigma_{X,Y}.$

Of course, if $X$ and $Y$ are statistically independent,

$\mathrm{Var}(X + Y) = \sigma^2_X + \sigma^2_Y,$

since $\sigma_{X,Y}$ is equal to zero. A similar analysis (omitted here) could be used to show that the variance of the function $g(X, Y) = X - Y$ is equal to

$\mathrm{Var}(X - Y) = \sigma^2_X + \sigma^2_Y - 2\sigma_{X,Y},$

unless $X$ and $Y$ are statistically independent, in which case

$\mathrm{Var}(X - Y) = \sigma^2_X + \sigma^2_Y.$
Closely related to covariance is the coefficient of correlation between the RVs $X$ and $Y$, denoted $\rho_{X,Y}$, which is defined as

$\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}.$

Note that if $X$ and $Y$ are statistically independent, the coefficient of correlation is zero ($\rho_{X,Y} = 0$), in which case we would say they are uncorrelated.

$^{13}$ We again focus on only linear functions, but more general functions could be considered.
1.10 Estimators
In order to calculate the measures considered above (expected value, variance, and covariance), it is necessary to know something about the probability function associated with the RV (or, in the case of covariance, the joint probability function associated with the RVs) of interest. In the die-rolling example, the probability function for the RV $X$ was known a priori. However, the probability function for an RV determined by anything but the most trivial random experiment is usually unknown. In such cases, it is common to assume that the RV follows some theoretical distribution. For example, if we assume that the RV $X$ follows a normal distribution, we will use the PDF

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2_X}} \exp\left(-\frac{(x - \mu_X)^2}{2\sigma^2_X}\right).$$^{14}$

Notice that this PDF depends on the parameters $\mu_X$ and $\sigma^2_X$; we can't derive these parameters unless we first know the PDF (which requires knowing the parameters). In some cases we may choose to assume values for them, but in others, we need to estimate them.$^{15}$

A random sample is a set of RVs, $x_1, \ldots, x_n$, where $x_1, \ldots, x_n$ are independent, identically distributed (IID) observations on the RV $X$. An estimator is a function of a random sample used to estimate some parameter. First, consider an estimator of the mean of an RV, called the sample mean. For the RV $X$, the sample mean, denoted $\bar{X}$, is

$\bar{X} = \frac{1}{n} \sum_{i=1}^n x_i,$

where $x_1, \ldots, x_n$ is a random sample of observations on $X$. Next, consider an estimator of the variance of an RV, called the sample variance, denoted $s^2_X$. For the RV $X$, the sample variance is

$s^2_X = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})^2.$

It should be clear right away that both the sample mean and sample variance of an RV are functions of RVs (the random sample), and are therefore RVs themselves.

$^{14}$ If the RV $X$ follows a normal distribution with mean $\mu_X$ and variance $\sigma^2_X$, it is common to write $X \sim N(\mu_X, \sigma^2_X)$.
$^{15}$ In fact, if we are not comfortable in assuming some theoretical distribution, it is even possible that we can estimate the probability function itself using a sample of observations. This approach is considered nonparametric, since it does not involve estimating the parameters of some theoretical distribution (which is a parametric approach).
As RVs, estimators have their own probability function, distribution function, expected value, variance, etc. For example, the expected value of the sample mean of the RV $X$ is

$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n} E\left(\sum_{i=1}^n x_i\right) = \frac{1}{n}[E(x_1) + \ldots + E(x_n)] = \frac{1}{n}(\mu_X + \ldots + \mu_X) = \mu_X.$

This demonstrates that the sample mean is what is called an unbiased estimator: an estimator of a parameter whose expected value is the true value of that parameter. The variance of the sample mean of the RV $X$ is

$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2} \mathrm{Var}\left(\sum_{i=1}^n x_i\right) = \frac{1}{n^2}[\mathrm{Var}(x_1) + \ldots + \mathrm{Var}(x_n)] = \frac{1}{n^2}(\sigma^2_X + \ldots + \sigma^2_X) = \frac{\sigma^2_X}{n},$

where taking the variance through the sum uses the fact that the $x_i$ are independent (so all covariance terms are zero).
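Both facts can be checked numerically. A small sketch (assuming Python/numpy), using the die example, where $\mu_X = 3.5$ and $\sigma^2_X = 35/12$:

```python
# Sampling distribution of the sample mean: unbiased, with variance sigma^2/n.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 10_000
samples = rng.integers(1, 7, size=(reps, n))  # 10,000 random samples of die rolls
xbar = samples.mean(axis=1)                   # one sample mean per sample

print(xbar.mean())   # close to mu_X = 3.5 (unbiasedness)
print(xbar.var())    # close to sigma^2_X / n = (35/12) / 30
```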
Turning to the sample variance, it turns out that this estimator is also unbiased. The proof is a little more complicated, but it will help to begin by rewriting the sample variance of the RV $X$ as

$s^2_X = \frac{1}{n-1} \sum_{i=1}^n [(x_i - \mu_X) - (\bar{X} - \mu_X)]^2$

(note that this is equivalent to our original expression since the $\mu_X$ terms cancel). The expected value of $s^2_X$ is thus

$E(s^2_X) = E\left(\frac{1}{n-1} \sum_{i=1}^n [(x_i - \mu_X) - (\bar{X} - \mu_X)]^2\right)$

$= \frac{1}{n-1} E\left(\sum_{i=1}^n (x_i - \mu_X)^2 - 2(\bar{X} - \mu_X)\sum_{i=1}^n (x_i - \mu_X) + n(\bar{X} - \mu_X)^2\right)$

$= \frac{1}{n-1} E\left(\sum_{i=1}^n (x_i - \mu_X)^2 - n(\bar{X} - \mu_X)^2\right)$

$= \frac{1}{n-1}\left(n\sigma^2_X - n\,E[(\bar{X} - \mu_X)^2]\right) = \frac{1}{n-1}(n\sigma^2_X - \sigma^2_X) = \sigma^2_X,$

where the third line uses $\sum_{i=1}^n (x_i - \mu_X) = n(\bar{X} - \mu_X)$, and the last line uses $E[(\bar{X} - \mu_X)^2] = \mathrm{Var}(\bar{X}) = \sigma^2_X/n$.
We won't do so here, but it can be shown that the variance of $s^2_X$, if $X$ is a normally distributed RV, is

$\mathrm{Var}(s^2_X) = \frac{2\sigma^4_X}{n-1}.$

It is useful to compare the estimator $s^2_X$ to an alternative estimator of the variance of $X$, denoted $\hat{\sigma}^2_X$, which is defined as

$\hat{\sigma}^2_X = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{X})^2 = \frac{n-1}{n} s^2_X.$

The expected value of $\hat{\sigma}^2_X$ is

$E(\hat{\sigma}^2_X) = \frac{n-1}{n} E(s^2_X) = \frac{n-1}{n} \sigma^2_X,$
so $\hat{\sigma}^2_X$ is a biased estimator of $\sigma^2_X$. The bias of an estimator is its expected value minus the true value of the parameter being estimated. For $\hat{\sigma}^2_X$, this is

$\mathrm{Bias}(\hat{\sigma}^2_X) = E(\hat{\sigma}^2_X) - \sigma^2_X = \frac{n-1}{n}\sigma^2_X - \sigma^2_X = -\frac{1}{n}\sigma^2_X.$

If $\hat{\sigma}^2_X$ is biased, why consider it at all? The answer lies in its variance, which we can find using the variance of $s^2_X$:

$\mathrm{Var}(\hat{\sigma}^2_X) = \mathrm{Var}\left(\frac{n-1}{n} s^2_X\right) = \left(\frac{n-1}{n}\right)^2 \mathrm{Var}(s^2_X) = \left(\frac{n-1}{n}\right)^2 \frac{2\sigma^4_X}{n-1} = \frac{2(n-1)\sigma^4_X}{n^2}.$
Notice that $\mathrm{Var}(\hat{\sigma}^2_X) < \mathrm{Var}(s^2_X)$, i.e. $\hat{\sigma}^2_X$ is a more efficient estimator of variance than $s^2_X$. Note that efficiency is a relative measure (i.e. we can only speak of the efficiency of an estimator in comparison to the efficiency of other estimators). For this reason, we often refer to the relative efficiency of two estimators, which is just equal to the ratio of their variances. For example, the relative efficiency of $s^2_X$ and $\hat{\sigma}^2_X$ is

$\frac{\mathrm{Var}(s^2_X)}{\mathrm{Var}(\hat{\sigma}^2_X)} = \frac{n^2}{(n-1)^2}.$

An estimator is said to be efficient if it is more efficient than any other estimator. Although $\hat{\sigma}^2_X$ is more efficient than $s^2_X$, its bias leaves open the important question of which estimator is better. Typically (as in this case), we see a trade-off between bias and efficiency. A commonly used measure that takes both of these factors into consideration is the mean squared error (or MSE): the estimator's variance plus its squared bias.$^{16}$ For $s^2_X$,

$\mathrm{MSE}(s^2_X) = \mathrm{Var}(s^2_X) + [\mathrm{Bias}(s^2_X)]^2 = \frac{2\sigma^4_X}{n-1} + 0 = \frac{2\sigma^4_X}{n-1},$

$^{16}$ Actually, the MSE of an estimator is defined as its squared expected deviation from its true value. For example, the MSE of $s^2_X$ is defined as $E[(s^2_X - \sigma^2_X)^2]$. However, since this is equivalent to the estimator's variance plus its squared bias, we usually write it that way. It would be a useful exercise to show this equivalence.
while for $\hat{\sigma}^2_X$, we have

$\mathrm{MSE}(\hat{\sigma}^2_X) = \mathrm{Var}(\hat{\sigma}^2_X) + [\mathrm{Bias}(\hat{\sigma}^2_X)]^2 = \frac{2(n-1)\sigma^4_X}{n^2} + \frac{\sigma^4_X}{n^2} = \frac{(2n-1)\sigma^4_X}{n^2}.$

To compare the two estimators, we can use the difference in their MSEs:

$\mathrm{MSE}(s^2_X) - \mathrm{MSE}(\hat{\sigma}^2_X) = \frac{2\sigma^4_X}{n-1} - \frac{(2n-1)\sigma^4_X}{n^2} = \frac{(3n-1)\sigma^4_X}{n^2(n-1)} > 0,$

which implies that $\hat{\sigma}^2_X$ has the smaller MSE.
Of course, MSE is not the only basis on which to compare two (or more) estimators.

Before moving on, let's consider an estimator of the covariance of two RVs, called the sample covariance. For the RVs $X$ and $Y$, the sample covariance, denoted $s_{X,Y}$, is defined as

$s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})(y_i - \bar{Y}).$

From the sample covariance, it is straightforward to calculate the sample coefficient of correlation:

$r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y},$

where $s_X$, the sample standard deviation of $X$, is the square root of the sample variance of $X$, and $s_Y$ is defined similarly.
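A short sketch of these two estimators (assuming Python/numpy; the data values are illustrative only):

```python
# Sample covariance and sample correlation, computed from their definitions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)  # sample covariance
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))             # sample correlation

print(s_xy, r_xy)
print(np.corrcoef(x, y)[0, 1])  # agrees with the built-in estimate
```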
The properties considered so far (bias, efficiency, and MSE) are small-sample (or finite-sample) properties. We now turn to asymptotic properties, which describe the behaviour of estimators as the sample size approaches infinity. While working with random samples of infinite size may seem unrealistic (and it is), asymptotic analysis provides us with an approximation of how estimators perform in large random samples.$^{17}$

$^{17}$ For this reason, asymptotic properties are sometimes called large sample properties.
In order to examine the asymptotic properties of estimators, it is necessary to understand the concept of convergence in probability. We begin by introducing some new notation. Let $X_n$ denote an RV which depends on the sample size $n$. The RV $X_n$ converges in probability to the constant $c$ if, for every $\varepsilon > 0$,

$\lim_{n \to \infty} \mathrm{Prob}(|X_n - c| > \varepsilon) = 0.$

If $X_n$ converges in probability to $c$, we write $X_n \xrightarrow{p} c$. Alternatively, we may say that $c$ is the probability limit (or plim) of $X_n$, and write $\mathrm{plim}(X_n) = c$.
Using the concept of convergence in probability, we can now consider some useful asymptotic properties of estimators. An estimator of a parameter is said to be consistent if it converges in probability to the true value of the parameter as $n$ approaches infinity. A sufficient condition for an estimator to be consistent is that its MSE goes to zero as $n$ approaches infinity. Consider the sample mean of the RV $X$, $\bar{X}$. The MSE of $\bar{X}$ is

$\mathrm{MSE}(\bar{X}) = \mathrm{Var}(\bar{X}) = \frac{\sigma^2_X}{n},$

since $\bar{X}$ is unbiased. So

$\lim_{n \to \infty} [\mathrm{MSE}(\bar{X})] = \lim_{n \to \infty} \frac{\sigma^2_X}{n} = 0,$

which implies that $\bar{X} \xrightarrow{p} \mu_X$, i.e. $\bar{X}$ is a consistent estimator of the mean of $X$. Similarly, for the two estimators of the variance of $X$, $s^2_X$ and $\hat{\sigma}^2_X$, we have
$\lim_{n \to \infty} [\mathrm{MSE}(s^2_X)] = \lim_{n \to \infty} \frac{2\sigma^4_X}{n-1} = 0,$

and

$\lim_{n \to \infty} [\mathrm{MSE}(\hat{\sigma}^2_X)] = \lim_{n \to \infty} \frac{(2n-1)\sigma^4_X}{n^2} = 0,$

implying that $s^2_X \xrightarrow{p} \sigma^2_X$ and $\hat{\sigma}^2_X \xrightarrow{p} \sigma^2_X$. So, both $s^2_X$ and $\hat{\sigma}^2_X$ are consistent estimators of the variance of $X$. Note that this is true even though $\hat{\sigma}^2_X$ is a biased estimator of the variance of $X$. If the bias of an estimator goes to zero as $n$ approaches infinity, it is said to be asymptotically unbiased. Recall that the bias of the estimator $\hat{\sigma}^2_X$ was

$\mathrm{Bias}(\hat{\sigma}^2_X) = -\frac{1}{n}\sigma^2_X.$

As $n$ approaches infinity, this bias goes to zero, so $\hat{\sigma}^2_X$ is asymptotically unbiased. Similarly, an estimator is said to be asymptotically efficient if, as $n$ approaches infinity, it is more efficient than any other estimator.

Finally, we are often interested in the asymptotic distribution of estimators. Specifically, we need the concept of convergence in distribution. The RV $X_n$, with CDF $F_n(x)$, is said to converge in distribution to the RV $X$, with CDF $F(x)$, if

$\lim_{n \to \infty} F_n(x) = F(x)$

at all continuity points of $F(x)$. In this case, we write $X_n \xrightarrow{d} X$.
The asymptotic distribution of the sample mean can be found by using a central limit theorem (CLT). The Lindeberg-Levy CLT states that if $x_1, \ldots, x_n$ is a random sample of $n$ observations on the RV $X$, where $X$ has mean $\mu_X$ and variance $\sigma^2_X$, then

$n^{-1/2} \sum_{i=1}^n (x_i - \mu_X) \xrightarrow{d} N(0, \sigma^2_X).$

Since

$n^{-1/2} \sum_{i=1}^n (x_i - \mu_X) = n^{-1/2} \left(\sum_{i=1}^n x_i - n\mu_X\right) = \sqrt{n}(\bar{X} - \mu_X),$

this implies

$\sqrt{n}(\bar{X} - \mu_X) \xrightarrow{d} N(0, \sigma^2_X),$

where $\bar{X}$ is as defined earlier. In this case, $\bar{X}$ is said to be asymptotically normal. Note that this holds even if the RV $X$ is not itself normally distributed (e.g. it may be uniformly distributed).
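The following sketch (assuming Python/numpy) illustrates this with uniformly distributed data, for which $\mu = 0.5$ and $\sigma^2 = 1/12$:

```python
# CLT sketch: sqrt(n)(Xbar - mu) is approximately N(0, sigma^2), even for
# non-normal (here uniform) data.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 20_000
mu, sigma2 = 0.5, 1 / 12            # mean and variance of a Uniform(0, 1) RV

xbar = rng.random((reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu)        # sqrt(n)(Xbar - mu)

print(z.mean(), z.var())            # approximately 0 and sigma^2 = 1/12
```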
Regression Models
2.1 Introduction
Natural and social scientists use abstract models to describe the behaviour of the systems that they study. Typically, these models are formulated in mathematical terms, using one or more equations. For example, in an introductory macroeconomics course, the aggregate expenditures model is used to describe the behaviour of an economy. In this model, the equation

$C = cD$

is used to describe how the level of consumption, $C$, is determined by the level of disposable income, $D$. Here, $C$ is a linear function of $D$, governed by the parameter $c$, and the relationship is deterministic: a given level of $D$ implies a unique level of $C$.
Deterministic relationships, however, rarely (if ever) hold in observed data. This is for two reasons: First, there will almost always be some error in measuring the variables involved; and second, there will very often be some other, non-systematic factors at work which have not been identified (i.e. which are not part of the model).

To illustrate the difference, consider a model which posits that the variable $Y$ is some function of the variable $X$, i.e.

$Y = m(X). \quad (1)$

Now, suppose that we have $n$ observations on each of the variables $Y$ and $X$, which we denote $y_1, \ldots, y_n$ and $x_1, \ldots, x_n$, respectively. For any given pair of observations, $(y_i, x_i)$, we would like to allow the possibility that the relationship in (1) does not hold exactly. To do this, we add in what is called a disturbance term, denoted $u_i$, giving the regression model

$y_i = m(x_i) + u_i, \quad i = 1, \ldots, n. \quad (2)$

By adding this disturbance term, we are able to capture any measurement errors or other non-systematic factors not identified in (1). On average, we would hope that each disturbance term is zero (otherwise, the model would contain some systematic error). However, for each pair of observations, $(y_i, x_i)$, $i = 1, \ldots, n$, the actual size of $u_i$ will differ.
Since the disturbance term is, by nature, unpredictable, it is treated as an RV. Like any other RV, it has its own distribution. The most important characteristic of this distribution is that its expected value is zero, i.e. for each $u_i$,

$E(u_i) = 0.$

Typically, we assume that each $u_i$, $i = 1, \ldots, n$, is independent and identically distributed (IID). This means, among other things, that each has the same variance, i.e. for each $u_i$,

$\mathrm{Var}(u_i) = \sigma^2_u.$

In this case, the disturbance terms are said to be homoskedastic (which literally means same variance). Given the IID assumption, it is common to write

$u_i \sim \mathrm{IID}(0, \sigma^2_u).$

Note that, since $y_i$ is a function of the RV $u_i$, $y_i$ is itself an RV. Of course, $y_i$ is also dependent on $x_i$, which may or may not be an RV, depending on how the data are generated. In a controlled experiment, the values $x_1, \ldots, x_n$ are set by the experimenter. As an example, consider an experiment where we give patients different doses of a blood pressure medication, and measure the reduction in their blood pressures over some period of time. Here, $x_1, \ldots, x_n$ are the doses of the medication given to patients $1, \ldots, n$, and $y_1, \ldots, y_n$ are the reductions in the blood pressures of patients $1, \ldots, n$. If we were to repeat this experiment several times over (with a new group of patients each time), holding $x_1, \ldots, x_n$ constant (i.e. keeping the doses the same), $y_1, \ldots, y_n$ would be different each time (due to the disturbances $u_1, \ldots, u_n$). In this case, $y_1, \ldots, y_n$ are RVs, while $x_1, \ldots, x_n$ are not.
In contrast, consider a field experiment, where $x_1, \ldots, x_n$ are not controlled: for example, we might collect data on different workers' education and income levels (perhaps we want to see if higher levels of education result in higher levels of income). Here, $x_1, \ldots, x_n$ are the education levels of workers $1, \ldots, n$, and $y_1, \ldots, y_n$ are the income levels of workers $1, \ldots, n$. If we repeat this experiment several times over (with a new group of workers each time), $x_1, \ldots, x_n$ will be different each time, and therefore can be considered RVs (since we don't know their values until we complete the experiment).
Up to this point, we have been dealing with what is called a bivariate regression model, since it involves just two variables: $y_i$ and $x_i$. Of course, most regression models used in the natural and social sciences involve more than just two variables. A model of this type, referred to as a multivariate regression model, has one dependent variable and more than one independent variable:

$y_i = m(x_{1,i}, \ldots, x_{k,i}) + u_i, \quad i = 1, \ldots, n,$

where $x_{1,1}, \ldots, x_{1,n}$ are the $n$ observations on the first independent variable $X_1$, $x_{2,1}, \ldots, x_{2,n}$ are the $n$ observations on the second independent variable $X_2$, and so on.

It is also possible to have a univariate regression model, in which only one variable is involved. Such models are usually only relevant when the variable involved is in the form of a time-series. An example of such a model is what is known as an autoregressive (or AR($p$)) process, in which the variable is regressed on lagged values of itself:

$y_t = m(y_{t-1}, \ldots, y_{t-p}) + u_t, \quad t = 1, \ldots, n.$

For ease of exposition, the rest of this lecture will focus on bivariate regression models, but all of the results are perfectly applicable to multivariate and univariate regression models.
The function $m(\cdot)$ is generally unknown, so it must be estimated. Of course, since any estimate of $m(\cdot)$ will be based on the observations $y_1, \ldots, y_n$ (and $x_1, \ldots, x_n$, which also may be RVs), the estimate $\hat{m}(\cdot)$, like any other estimator, will also be an RV. Exactly how $m(\cdot)$ is estimated, and how $\hat{m}(\cdot)$ is distributed (which will depend on the method of estimation), will occupy us for the remainder of the course.

Assuming we have an estimate of $m(\cdot)$ in hand, it can be used in the following way. Given a specific observation $x_i$, the estimate $\hat{m}(x_i)$ will give us an estimate of $y_i$, denoted $\hat{y}_i$, i.e. $\hat{y}_i = \hat{m}(x_i)$. The difference between the observed value $y_i$ and the estimate $\hat{y}_i$ is known as the error term (or residual), denoted $\hat{u}_i$, so that

$y_i = \hat{m}(x_i) + \hat{u}_i.$

Rearranging, we have

$\hat{u}_i = y_i - \hat{m}(x_i) = y_i - \hat{y}_i.$

That is, the error term is equal to the difference between the observed and estimated values of $y_i$. A good estimate $\hat{m}(\cdot)$ should have the property that it minimizes such errors. This will be a guiding principle in developing the estimators we are interested in.
In practice, estimating $m(\cdot)$ requires the statistician (or even the theorist) to assume some functional form. If $m(\cdot)$ is given a specific form, (2) is referred to as a parametric regression model; we will consider nonparametric regression models later in the course. The most common form assumed for regression models involves specifying $m(\cdot)$ as an affine function,$^{18}$ i.e.

$m(x_i) = \alpha + \beta x_i,$

which gives us the linear regression model

$y_i = \alpha + \beta x_i + u_i.$

Here, $\alpha$ and $\beta$ are parameters to be estimated. Denoting estimates of $\alpha$ and $\beta$ by $\hat{\alpha}$ and $\hat{\beta}$, we have

$\hat{m}(x_i) = \hat{\alpha} + \hat{\beta} x_i.$

Arriving at these estimates will be the focus of the next topic.

Of course, not all parametric regression models are linear. A nonlinear regression model is one in which $m(\cdot)$ is not an affine function. For example, if

$m(x_i) = x_i^\beta,$

we have the nonlinear regression model

$y_i = x_i^\beta + u_i.$

As above, denoting an estimate of $\beta$ by $\hat{\beta}$, we have $\hat{m}(x_i) = x_i^{\hat{\beta}}$. Estimating nonlinear regression models is beyond the scope of this course.

$^{18}$ An affine function, $g(x)$, takes the form $g(x) = a + bx$, where $a$ and $b$ are any real numbers.
Consider again the linear regression model

$y_i = \alpha + \beta x_i + u_i, \quad i = 1, \ldots, n. \quad (3)$

Given estimates $\hat{\alpha}$ and $\hat{\beta}$, the $i$th residual is

$\hat{u}_i = y_i - \hat{\alpha} - \hat{\beta} x_i.$

Summing the square of each of these terms (from $i = 1, \ldots, n$), we have

$\sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2.$

From the right-hand side of this equation, it is evident that the sum of squared residuals (SSR) is a function of both $\hat{\alpha}$ and $\hat{\beta}$:

$\mathrm{SSR}(\hat{\alpha}, \hat{\beta}) = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2.$

Since the method of ordinary least squares (OLS) is based on minimizing this function, we are faced with a very simple optimization problem: minimize $\mathrm{SSR}(\hat{\alpha}, \hat{\beta})$ by choosing $\hat{\alpha}$ and $\hat{\beta}$, i.e.

$\min_{\hat{\alpha}, \hat{\beta}} \mathrm{SSR}(\hat{\alpha}, \hat{\beta}) = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2. \quad (4)$

The first-order conditions are

$\frac{\partial\,\mathrm{SSR}(\hat{\alpha}, \hat{\beta})}{\partial \hat{\beta}} = 2\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)(-x_i) = 0,$

and

$\frac{\partial\,\mathrm{SSR}(\hat{\alpha}, \hat{\beta})}{\partial \hat{\alpha}} = -2\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0. \quad (5)$

Carrying through the summation operator in (5), we have

$\sum_{i=1}^n y_i - n\hat{\alpha} - \hat{\beta}\sum_{i=1}^n x_i = 0.$
Solving for $\hat{\alpha}$ gives

$\hat{\alpha} = \frac{1}{n}\sum_{i=1}^n y_i - \hat{\beta}\,\frac{1}{n}\sum_{i=1}^n x_i = \bar{y} - \hat{\beta}\bar{x}, \quad (6)$

where $\bar{y}$ is the sample mean of $y_1, \ldots, y_n$, i.e.

$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i,$

and $\bar{x}$ is the sample mean of $x_1, \ldots, x_n$, i.e.

$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i.$
Next, return to the other first-order condition,

$\sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)(-x_i) = 0.$

Multiplying through by $-x_i$, we have

$\sum_{i=1}^n (y_i x_i - \hat{\alpha} x_i - \hat{\beta} x_i^2) = 0.$

Next, carrying through the summation operator, we have

$\sum_{i=1}^n y_i x_i - \hat{\alpha}\sum_{i=1}^n x_i - \hat{\beta}\sum_{i=1}^n x_i^2 = 0.$

Now, substituting for $\hat{\alpha}$, we have

$\sum_{i=1}^n y_i x_i - (\bar{y} - \hat{\beta}\bar{x})\sum_{i=1}^n x_i - \hat{\beta}\sum_{i=1}^n x_i^2 = 0.$

Expanding the term in brackets, we have

$\sum_{i=1}^n y_i x_i - \bar{y}\sum_{i=1}^n x_i + \hat{\beta}\bar{x}\sum_{i=1}^n x_i - \hat{\beta}\sum_{i=1}^n x_i^2 = 0.$

Finally, solving for $\hat{\beta}$, we have

$\hat{\beta} = \frac{\sum_{i=1}^n y_i x_i - \bar{y}\sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}. \quad (7)$
At times, it will be convenient to rewrite $\hat{\beta}$ as

$\hat{\beta} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2}. \quad (8)$

To show that Equations (7) and (8) are equivalent, we show that the numerator and denominator of each are equivalent. First, expand the numerator in Equation (8) and rearrange:

$\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \sum_{i=1}^n (y_i x_i - y_i\bar{x} - \bar{y}x_i + \bar{y}\bar{x})$

$= \sum_{i=1}^n y_i x_i - \bar{x}\sum_{i=1}^n y_i - \bar{y}\sum_{i=1}^n x_i + n\bar{x}\bar{y}$

$= \sum_{i=1}^n y_i x_i - n\bar{x}\bar{y} - \bar{y}\sum_{i=1}^n x_i + n\bar{x}\bar{y}$

$= \sum_{i=1}^n y_i x_i - \bar{y}\sum_{i=1}^n x_i,$

which is exactly the numerator in Equation (7). Second, expand the denominator in Equation (8) and rearrange:

$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2)$

$= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2$

$= \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2$

$= \sum_{i=1}^n x_i^2 - n\bar{x}^2 = \sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i,$
which is exactly the denominator in Equation (7). At other times, it will be convenient to rewrite $\hat{\beta}$ as

$\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})x_i}. \quad (9)$

To show that this is equivalent to Equation (8) (and therefore Equation (7)), we again show that the numerator and denominator of each are equivalent. To begin, note that

$\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0,$

and similarly,

$\sum_{i=1}^n (y_i - \bar{y}) = 0.$

Now, the numerator in Equation (8) can be expanded to give

$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})y_i - \bar{y}\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})y_i,$

which is the numerator in Equation (9). Similarly, the denominator in Equation (8) can be expanded to give

$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x})x_i - \bar{x}\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})x_i,$

which is the denominator in Equation (9). So, we have three different (but equivalent) expressions for $\hat{\beta}$: Equations (7), (8), and (9). We most often use Equation (8), but, as we will see in the next section, Equation (9) will often come in handy.
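As an illustration, here is a minimal Python/numpy sketch that computes $\hat{\alpha}$ and $\hat{\beta}$ from Equations (6) and (8) on simulated data (the true parameter values and sample size are assumptions of the example):

```python
# OLS for the bivariate model y_i = alpha + beta*x_i + u_i, via (6) and (8).
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u                       # true alpha = 1, beta = 2

beta_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha_hat = y.mean() - beta_hat * x.mean()  # Equation (6)

print(alpha_hat, beta_hat)
```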
To analyze the properties of the OLS estimators, we will assume that each disturbance term, $u_i$, is IID:

$u_i \sim \mathrm{IID}(0, \sigma^2_u).$

However, as we progress, we will relax this assumption and see how the properties of OLS estimators change. In addition, we will also assume that $x_i$, $i = 1, \ldots, n$, are not RVs. This assumption does not have nearly as large an impact on the properties of OLS estimators as the IID assumption, but will make the following analysis a little easier.

The first question we want to ask is whether or not OLS estimators are unbiased. Let's start with $\hat{\beta}$. Substituting for $y_i$ in Equation (9), we have

$\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(\alpha + \beta x_i + u_i)}{\sum_{i=1}^n (x_i - \bar{x})x_i}.$

Multiplying through, we have

$\hat{\beta} = \frac{\alpha\sum_{i=1}^n (x_i - \bar{x}) + \beta\sum_{i=1}^n (x_i - \bar{x})x_i + \sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i} = 0 + \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i},$

i.e.

$\hat{\beta} = \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i}. \quad (10)$

Taking expectations (and recalling that the $x_i$ are not RVs), we have

$E(\hat{\beta}) = \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})E(u_i)}{\sum_{i=1}^n (x_i - \bar{x})x_i} = \beta,$

since $E(u_i) = 0$. So, $\hat{\beta}$ is unbiased. Next, consider

$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = \frac{1}{n}\sum_{i=1}^n y_i - \hat{\beta}\bar{x}.$
Substituting for $y_i$, we have

$\hat{\alpha} = \frac{1}{n}\sum_{i=1}^n (\alpha + \beta x_i + u_i) - \hat{\beta}\bar{x} = \alpha + \beta\bar{x} + \frac{1}{n}\sum_{i=1}^n u_i - \hat{\beta}\bar{x} = \alpha + \bar{x}(\beta - \hat{\beta}) + \frac{1}{n}\sum_{i=1}^n u_i. \quad (11)$

Taking the expected value of $\hat{\alpha}$, we have

$E(\hat{\alpha}) = E\left(\alpha + \bar{x}(\beta - \hat{\beta}) + \frac{1}{n}\sum_{i=1}^n u_i\right) = \alpha + \bar{x}[\beta - E(\hat{\beta})] + \frac{1}{n}\sum_{i=1}^n E(u_i) = \alpha + \bar{x}(\beta - \beta) = \alpha.$

So, $\hat{\alpha}$ is also unbiased.
Another important property of OLS estimators is their variance. Since $\hat{\beta}$ is unbiased, we have

$\mathrm{Var}(\hat{\beta}) = E[(\hat{\beta} - \beta)^2].$

From Equation (10), this means

$\mathrm{Var}(\hat{\beta}) = E\left[\left(\frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})x_i}\right)^2\right].$

To make handling the term in brackets a little easier, let's introduce some new notation. Let

$k_i = \frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})x_i} = \frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2},$

so that

$\mathrm{Var}(\hat{\beta}) = E\left[\left(\sum_{i=1}^n k_i u_i\right)^2\right].$

Expanding this, we have

$\mathrm{Var}(\hat{\beta}) = E\left(\sum_{i=1}^n k_i^2 u_i^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^n k_i k_j u_i u_j\right) = \sum_{i=1}^n k_i^2 E(u_i^2) + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^n k_i k_j E(u_i u_j).$

Since $E(u_i^2) = \sigma^2_u$ and, for $i \ne j$,

$E(u_i u_j) = E(u_i)E(u_j) = 0$

(since we have assumed that each $u_i$ is IID, and therefore statistically independent from one another), we have

$\mathrm{Var}(\hat{\beta}) = \sigma^2_u \sum_{i=1}^n k_i^2 = \sigma^2_u \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\left[\sum_{i=1}^n (x_i - \bar{x})^2\right]^2} = \frac{\sigma^2_u}{\sum_{i=1}^n (x_i - \bar{x})^2}.$
We now move on to $\hat{\alpha}$. Since $\hat{\alpha}$ is unbiased, we have

$\mathrm{Var}(\hat{\alpha}) = E[(\hat{\alpha} - \alpha)^2].$

From Equation (11), this means

$\mathrm{Var}(\hat{\alpha}) = E\left[\left(\bar{x}(\beta - \hat{\beta}) + \frac{1}{n}\sum_{i=1}^n u_i\right)^2\right]$

$= E\left[\bar{x}^2(\hat{\beta} - \beta)^2 + 2\bar{x}(\beta - \hat{\beta})\frac{1}{n}\sum_{i=1}^n u_i + \frac{1}{n^2}\left(\sum_{i=1}^n u_i\right)^2\right]$

$= \bar{x}^2\mathrm{Var}(\hat{\beta}) + \frac{1}{n^2}\sum_{i=1}^n E(u_i^2) + \frac{2}{n^2}\sum_{i=1}^{n-1}\sum_{j=i+1}^n E(u_i u_j)$

$= \sigma^2_u\left(\frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} + \frac{1}{n}\right),$

where the cross term vanishes because $\hat{\beta} - \beta = \sum_i k_i u_i$ and $\sum_i k_i = 0$, so its expectation is zero.
Given the variance of OLS estimators, we might want to ask how efficient they are. It turns out that, among linear unbiased estimators, the OLS estimators are in fact the most efficient. This property, known as the Gauss-Markov theorem, is often stated as: OLS is BLUE (Best Linear Unbiased Estimator), where by best, we mean most efficient. Here, we will prove this only for $\hat{\beta}$; the proof for $\hat{\alpha}$ is very similar.

Using the $k_i$ notation introduced above, the OLS estimator can be written as

$\hat{\beta} = \sum_{i=1}^n k_i y_i.$

This should make it clear that $\hat{\beta}$ is linear in $y_i$. Now consider some other estimator of $\beta$ which is linear in $y_i$; call it $\tilde{\beta}$:

$\tilde{\beta} = \sum_{i=1}^n g_i y_i,$

where

$g_i = k_i + w_i,$

and $w_i$ depends only on $x_1, \ldots, x_n$. Then

$\tilde{\beta} = \sum_{i=1}^n (k_i + w_i)y_i = \sum_{i=1}^n k_i y_i + \sum_{i=1}^n w_i y_i = \hat{\beta} + \sum_{i=1}^n w_i y_i. \quad (12)$

Taking expectations, we have

$E(\tilde{\beta}) = E\left(\hat{\beta} + \sum_{i=1}^n w_i y_i\right) = \beta + E\left(\sum_{i=1}^n w_i y_i\right).$
Substituting for $y_i$, we have

$E(\tilde{\beta}) = \beta + E\left(\sum_{i=1}^n w_i(\alpha + \beta x_i + u_i)\right) = \beta + \alpha\sum_{i=1}^n w_i + \beta\sum_{i=1}^n w_i x_i + \sum_{i=1}^n w_i E(u_i) = \beta + \alpha\sum_{i=1}^n w_i + \beta\sum_{i=1}^n w_i x_i,$

since $E(u_i) = 0$. In order for $\tilde{\beta}$ to be unbiased, we therefore require

$\sum_{i=1}^n w_i = 0, \quad \text{and} \quad \sum_{i=1}^n w_i x_i = 0,$

since, in general, $\alpha$ and $\beta$ are not zero. Imposing these conditions, we have

$\sum_{i=1}^n w_i y_i = \sum_{i=1}^n w_i(\alpha + \beta x_i + u_i) = \alpha\sum_{i=1}^n w_i + \beta\sum_{i=1}^n w_i x_i + \sum_{i=1}^n w_i u_i = \sum_{i=1}^n w_i u_i.$

Recall also, from Equation (10), that

$\hat{\beta} = \beta + \sum_{i=1}^n k_i u_i.$

Substituting these into Equation (12), we have

$\tilde{\beta} = \beta + \sum_{i=1}^n k_i u_i + \sum_{i=1}^n w_i u_i = \beta + \sum_{i=1}^n (k_i + w_i)u_i.$
Finally, the variance of $\tilde{\beta}$ is

$\mathrm{Var}(\tilde{\beta}) = E\left\{\left[\tilde{\beta} - E(\tilde{\beta})\right]^2\right\} = E\left[\left(\sum_{i=1}^n (k_i + w_i)u_i\right)^2\right]$

$= \sigma^2_u\sum_{i=1}^n (k_i + w_i)^2 = \sigma^2_u\sum_{i=1}^n (k_i^2 + 2k_i w_i + w_i^2)$

$= \sigma^2_u\sum_{i=1}^n k_i^2 + 2\sigma^2_u\sum_{i=1}^n k_i w_i + \sigma^2_u\sum_{i=1}^n w_i^2$

$= \mathrm{Var}(\hat{\beta}) + 2\sigma^2_u\sum_{i=1}^n k_i w_i + \sigma^2_u\sum_{i=1}^n w_i^2.$

Now,

$\sum_{i=1}^n k_i w_i = \frac{\sum_{i=1}^n (x_i - \bar{x})w_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n x_i w_i - \bar{x}\sum_{i=1}^n w_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = 0,$

since $\sum_{i=1}^n x_i w_i = 0$ and $\sum_{i=1}^n w_i = 0$. Therefore, we have

$\mathrm{Var}(\tilde{\beta}) = \mathrm{Var}(\hat{\beta}) + \sigma^2_u\sum_{i=1}^n w_i^2 \ge \mathrm{Var}(\hat{\beta}),$

with equality only if each $w_i = 0$ (in which case $\tilde{\beta} = \hat{\beta}$). So $\hat{\beta}$ is BLUE. We won't show it here, but the proof for $\hat{\alpha}$ is quite similar.
Consider now the multivariate linear regression model

$y_i = \beta_1 + \beta_2 x_{2,i} + \ldots + \beta_k x_{k,i} + u_i, \quad i = 1, \ldots, n.$$^{19}$

To make the analysis of such regression models much easier, we typically express this in matrix form:$^{20}$

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \quad (13)$

where $\mathbf{y}$ and $\mathbf{u}$ are $n$-vectors, $\mathbf{X}$ is an $n \times k$ matrix, and $\boldsymbol{\beta}$ is a $k$-vector, defined as follows:

$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{2,1} & \ldots & x_{k,1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{2,n} & \ldots & x_{k,n} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \mathbf{u} = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}.$

Letting $\hat{\mathbf{u}}$ denote the residuals from the estimate of this model, we have

$\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\mathbf{b},$

where $\mathbf{b}$ is an estimator of $\boldsymbol{\beta}$. The sum of squared residuals is then

$\mathrm{SSR}(\mathbf{b}) = \hat{\mathbf{u}}'\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}.$

As we saw in the bivariate case, the method of OLS is based on minimizing this function.

$^{19}$ Note that the bivariate linear regression model considered above is just a special case of the multivariate linear regression model considered here. For this reason, what follows is applicable to both bivariate and multivariate linear regression models.
$^{20}$ In these notes, we assume the reader is familiar with basic matrix algebra.
Differentiating with respect to $\mathbf{b}$ and setting the result to zero gives the first-order condition

$\frac{\partial\,\mathrm{SSR}(\mathbf{b})}{\partial \mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{0},$

which implies

$\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{X}\mathbf{b}.$

Finally, solving for $\mathbf{b}$, we have

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \quad (14)$
As in the bivariate case, we assume that the elements of the $n$-vector $\mathbf{u}$ are IID.$^{21}$ The expected value of $\mathbf{u}$ is

$E(\mathbf{u}) = E\begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix} = \mathbf{0},$

where $\mathbf{0}$ is an $n$-vector of zeros. The variance of $\mathbf{u}$ is

$\mathrm{Var}(\mathbf{u}) = E(\mathbf{u}\mathbf{u}').$

Since each $u_i$ is IID, $E(u_i^2) = \sigma^2_u$ and, for $i \ne j$, $E(u_i u_j) = 0$. Therefore,

$\mathrm{Var}(\mathbf{u}) = \begin{pmatrix} \sigma^2_u & 0 & \ldots & 0 \\ 0 & \sigma^2_u & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & \sigma^2_u \end{pmatrix} = \sigma^2_u \mathbf{I}_n,$

where $\mathbf{I}_n$ is an $n \times n$ identity matrix. Given this, it is common to write

$\mathbf{u} \sim \mathrm{IID}(\mathbf{0}, \sigma^2_u \mathbf{I}_n).$

$^{21}$ As noted above, what follows is also applicable to bivariate linear regression models (since the bivariate linear regression model is just a special case of the multivariate linear regression model).
Also, to make things easier, we will continue to assume that $\mathbf{X}$ is not an RV. Later in the course, we will see what the consequences are if we relax this assumption.

Let's start the analysis by asking whether $\mathbf{b}$ is unbiased or not. Substituting Equation (13) into Equation (14), we have

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}. \quad (15)$

Taking expectations,

$E(\mathbf{b}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}) = \boldsymbol{\beta},$

since $E(\mathbf{u}) = \mathbf{0}$. So $\mathbf{b}$ is unbiased.

Next, consider the variance of $\mathbf{b}$. Given that $\mathbf{b}$ is unbiased, we have

$\mathrm{Var}(\mathbf{b}) = E[(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'].$

From Equation (15), this means

$\mathrm{Var}(\mathbf{b}) = E\{[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}][(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}]'\}$

$= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}\mathbf{u}')\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}. \quad (16)$
Given the variance of $\mathbf{b}$, we can ask whether the Gauss-Markov theorem holds here. Consider some other linear, unbiased estimator of $\boldsymbol{\beta}$, call it $\mathbf{B}$:

$\mathbf{B} = \mathbf{G}\mathbf{y},$

where

$\mathbf{G} = \mathbf{W} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}',$

and $\mathbf{W}$ depends only on $\mathbf{X}$. The expected value of $\mathbf{B}$ is thus

$E(\mathbf{B}) = E(\mathbf{G}\mathbf{y}) = \mathbf{G}E(\mathbf{y}).$

The expected value of $\mathbf{y}$ is

$E(\mathbf{y}) = E(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{X}\boldsymbol{\beta} + E(\mathbf{u}) = \mathbf{X}\boldsymbol{\beta},$

so

$E(\mathbf{B}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta}.$

This means that, in order for $\mathbf{B}$ to be unbiased, we require

$\mathbf{G}\mathbf{X} = \mathbf{I}_k,$

where $\mathbf{I}_k$ is a $k \times k$ identity matrix. Using $\mathbf{G}\mathbf{X} = \mathbf{I}_k$, it is easy to confirm that

$\mathbf{B} = \mathbf{G}\mathbf{y} = \mathbf{G}(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta} + \mathbf{G}\mathbf{u} = \boldsymbol{\beta} + \mathbf{G}\mathbf{u},$

and that

$\mathbf{W}\mathbf{X} = [\mathbf{G} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X} = \mathbf{G}\mathbf{X} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{I}_k - \mathbf{I}_k = \mathbf{0}.$

The variance of $\mathbf{B}$ is therefore

$\mathrm{Var}(\mathbf{B}) = E[(\mathbf{G}\mathbf{u})(\mathbf{G}\mathbf{u})'] = E(\mathbf{G}\mathbf{u}\mathbf{u}'\mathbf{G}') = \mathbf{G}E(\mathbf{u}\mathbf{u}')\mathbf{G}' = \sigma^2_u\mathbf{G}\mathbf{G}'$

$= \sigma^2_u[\mathbf{W} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{W} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']'$

$= \sigma^2_u[\mathbf{W}\mathbf{W}' + \mathbf{W}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}' + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}]$

$= \sigma^2_u\mathbf{W}\mathbf{W}' + \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1},$

since $\mathbf{W}\mathbf{X} = \mathbf{0}$. Finally, since $\mathbf{W}\mathbf{W}'$ is positive semidefinite for any non-zero matrix $\mathbf{W}$, we have $\mathrm{Var}(\mathbf{B}) \ge \mathrm{Var}(\mathbf{b})$, i.e. $\mathbf{b}$ is BLUE.
Recall that a sufficient condition for an estimator to be consistent is that its MSE goes to zero as $n$ approaches infinity. Since $\mathbf{b}$ is unbiased, its MSE is

$\mathrm{MSE}(\mathbf{b}) = \mathrm{Var}(\mathbf{b}) = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}.$$^{22}$

Now, consider the matrix $\mathbf{X}'\mathbf{X}$. As $n$ approaches infinity, each of its elements (being sums over $n$ numbers) would also approach infinity. However, it would be safe to assume that $\frac{1}{n}$ times each of these elements would approach some constant. We can write this assumption as

$\lim_{n \to \infty} \frac{1}{n}\mathbf{X}'\mathbf{X} = \mathbf{Q}, \quad (17)$

where $\mathbf{Q}$ is some finite, invertible $k \times k$ matrix. Rewriting the MSE of $\mathbf{b}$,

$\mathrm{MSE}(\mathbf{b}) = \frac{1}{n}\sigma^2_u\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1}.$

Notice that this is equivalent to our previous expression for the MSE of $\mathbf{b}$, since the $\frac{1}{n}$ terms cancel. Now, as $n$ approaches infinity,

$\lim_{n \to \infty}[\mathrm{MSE}(\mathbf{b})] = \lim_{n \to \infty}\frac{1}{n}\sigma^2_u \cdot \lim_{n \to \infty}\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1} = (0)\,\mathbf{Q}^{-1} = \mathbf{0},$

which implies that $\mathbf{b} \xrightarrow{p} \boldsymbol{\beta}$. So, $\mathbf{b}$ is consistent.

$^{22}$ We didn't do this when we focused on the bivariate linear regression model, but, once again, all of what follows is applicable to that special case.
Finally, we consider the asymptotic distribution of $\mathbf{b}$.$^{23}$ To do so, begin by rewriting Equation (15) as

$\mathbf{b} = \boldsymbol{\beta} + \left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1}\frac{1}{n}\mathbf{X}'\mathbf{u}.$

Notice that, as above, the $\frac{1}{n}$ terms cancel each other out, leaving us with Equation (15). Rearranging, we have

$\sqrt{n}(\mathbf{b} - \boldsymbol{\beta}) = \left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1} n^{-1/2}\mathbf{X}'\mathbf{u}. \quad (18)$

Consider the vector $n^{-1/2}\mathbf{X}'\mathbf{u} = n^{-1/2}\sum_{i=1}^n \mathbf{X}_i'u_i$, where $\mathbf{X}_i$ is the $i$th row of $\mathbf{X}$. A multivariate CLT tells us that

$n^{-1/2}\sum_{i=1}^n \mathbf{X}_i'u_i \xrightarrow{d} N\left(\mathbf{0},\; \lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(\mathbf{X}_i'u_i)\right).$

The expected value of each $\mathbf{X}_i'u_i$ is

$E(\mathbf{X}_i'u_i) = \mathbf{X}_i'E(u_i) = \mathbf{0},$

so its variance is

$\mathrm{Var}(\mathbf{X}_i'u_i) = E(\mathbf{X}_i'u_i u_i \mathbf{X}_i) = \sigma^2_u\mathbf{X}_i'\mathbf{X}_i.$

Therefore,

$\lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(\mathbf{X}_i'u_i) = \sigma^2_u \lim_{n \to \infty}\frac{1}{n}\mathbf{X}'\mathbf{X} = \sigma^2_u\mathbf{Q},$

where $\mathbf{Q}$ is as defined earlier. Combining this with Equation (18) and the fact that $(\frac{1}{n}\mathbf{X}'\mathbf{X})^{-1} \to \mathbf{Q}^{-1}$, we have

$\sqrt{n}(\mathbf{b} - \boldsymbol{\beta}) \xrightarrow{d} N(\mathbf{0}, \sigma^2_u\mathbf{Q}^{-1}).$

That is, $\mathbf{b}$ is asymptotically normal.

$^{23}$ In small samples, we can't generally say anything about the distribution of $\mathbf{b}$ unless we know the exact distribution of $\mathbf{X}$ and $\mathbf{u}$. For example, if $\mathbf{X}$ is not an RV, and $\mathbf{u}$ is normally distributed, then $\mathbf{b}$ is also normally distributed since it is a linear function of $\mathbf{u}$ (see Equation (15)).
3.5 Estimating the variance of the error terms of linear regression models
Even if we are willing to make the assumption that each of the error terms is IID with mean zero and variance $\sigma^2_u$, the value of $\sigma^2_u$ is unknown, so we typically would like to get some estimate of it. We start here by proposing the following estimator of $\sigma^2_u$, which we denote $s^2_u$:

$s^2_u = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n - k}.$

While it may not be immediately clear where this comes from, we can show that it is an unbiased estimator of $\sigma^2_u$. Begin by noting that the residuals can be written as $\hat{\mathbf{u}} = \mathbf{M}\mathbf{y}$, where

$\mathbf{M} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$

is symmetric and idempotent, i.e.

$\mathbf{M}' = \mathbf{M}$, and $\mathbf{M}\mathbf{M} = \mathbf{M}.$$^{24}$

Substituting for $\mathbf{y}$, we have

$\hat{\mathbf{u}} = \mathbf{M}\mathbf{y} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{M}\mathbf{X}\boldsymbol{\beta} + \mathbf{M}\mathbf{u} = \mathbf{M}\mathbf{u},$

since

$\mathbf{M}\mathbf{X} = [\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X} = \mathbf{X} - \mathbf{X} = \mathbf{0}.$

Taking expectations of $s^2_u$, and noticing that $\mathbf{u}'\mathbf{M}\mathbf{u}$ is a scalar (so it equals its own trace, and the trace is invariant under cyclic permutations), we can write

$E(s^2_u) = \frac{E(\mathbf{u}'\mathbf{M}\mathbf{u})}{n - k} = \frac{E[\mathrm{Tr}(\mathbf{u}'\mathbf{M}\mathbf{u})]}{n - k} = \frac{E[\mathrm{Tr}(\mathbf{M}\mathbf{u}\mathbf{u}')]}{n - k} = \frac{\mathrm{Tr}[\mathbf{M}E(\mathbf{u}\mathbf{u}')]}{n - k} = \frac{\sigma^2_u\mathrm{Tr}(\mathbf{M})}{n - k} = \sigma^2_u,$

since

$\mathrm{Tr}(\mathbf{M}) = \mathrm{Tr}(\mathbf{I}_n) - \mathrm{Tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = n - \mathrm{Tr}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}] = n - \mathrm{Tr}(\mathbf{I}_k) = n - k.$

So, we conclude that $s^2_u$ is an unbiased estimator of $\sigma^2_u$.

Note that since the variance of $\mathbf{b}$ depends on $\sigma^2_u$ (see Equation (16)), we can use $s^2_u$ to get an estimate of the variance of $\mathbf{b}$. This estimate, which we denote by $\widehat{\mathrm{Var}}(\mathbf{b})$, is found by replacing $\sigma^2_u$ by $s^2_u$ in Equation (16), i.e.

$\widehat{\mathrm{Var}}(\mathbf{b}) = s^2_u(\mathbf{X}'\mathbf{X})^{-1}.$

Since $s^2_u$ is an unbiased estimator of $\sigma^2_u$, $\widehat{\mathrm{Var}}(\mathbf{b})$ is an unbiased estimate of $\mathrm{Var}(\mathbf{b})$. The square root of the $j$th diagonal element of $\widehat{\mathrm{Var}}(\mathbf{b})$ is often called the standard error of $b_j$.

$^{24}$ This can be verified directly from the definition of $\mathbf{M}$.
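These quantities are straightforward to compute. A sketch (assuming Python/numpy; the data-generating values are illustrative):

```python
# Unbiased error-variance estimate s^2_u and the standard errors of b.
import numpy as np

rng = np.random.default_rng(4)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
s2_u = (u_hat @ u_hat) / (n - k)        # s^2_u = u'u / (n - k)
var_b = s2_u * np.linalg.inv(X.T @ X)   # estimated Var(b)
se_b = np.sqrt(np.diag(var_b))          # standard errors of b

print(b, se_b)
```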
Hypothesis testing
Consider the multivariate linear regression model

$y_i = \beta_1 + \beta_2 x_{2,i} + \beta_3 x_{3,i} + u_i, \quad i = 1, \ldots, n. \quad (19)$

Two examples of null hypotheses we might want to test are:

1. $H_0\!: \beta_2 = \beta_{2,0}$, where $\beta_{2,0}$ is some specified value; and

2. $H_0\!: \beta_2 = \beta_3 = 0$.

Null hypotheses which impose linear restrictions like these can be written in the general form

$H_0\!: \mathbf{R}\boldsymbol{\beta} = \mathbf{r},$

where, here, $\boldsymbol{\beta} = (\beta_1\;\; \beta_2\;\; \beta_3)'$. For example 1,

$\mathbf{R} = (0\;\; 1\;\; 0), \quad \mathbf{r} = \beta_{2,0};$

and for example 2,

$\mathbf{R} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \mathbf{r} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$

Written this way, it can be seen that each of these null hypotheses imposes some linear restriction(s) on the original regression model in (19), which we will call the unrestricted model. The model obtained by imposing the restriction(s) is called the restricted model; under example 2, for instance, the model can be rewritten as

$H_0\!: y_i = \beta_1 + u_i.$

In general, for the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\mathbf{X}$ is an $n \times k$ matrix and $\boldsymbol{\beta}$ is a $k$-vector, $\mathbf{R}$ will be a $q \times k$ matrix, and $\mathbf{r}$ will be a $q$-vector, where $q$ is the number of restrictions. Note that the first example above imposes a single restriction (i.e. $q = 1$), while the second imposes two restrictions (i.e. $q = 2$).
In order to test any such linear restriction(s), we use the OLS estimator of $\boldsymbol{\beta}$,

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$

Recall that, if we assume that $\mathbf{u} \sim \mathrm{IID}(\mathbf{0}, \sigma^2_u\mathbf{I}_n)$, then $E(\mathbf{b}) = \boldsymbol{\beta}$ and $\mathrm{Var}(\mathbf{b}) = \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}$. Therefore, we have

$E(\mathbf{R}\mathbf{b}) = \mathbf{R}E(\mathbf{b}) = \mathbf{R}\boldsymbol{\beta},$

and

$\mathrm{Var}(\mathbf{R}\mathbf{b}) = E([\mathbf{R}\mathbf{b} - E(\mathbf{R}\mathbf{b})][\mathbf{R}\mathbf{b} - E(\mathbf{R}\mathbf{b})]')$

$= E[(\mathbf{R}\mathbf{b} - \mathbf{R}\boldsymbol{\beta})(\mathbf{R}\mathbf{b} - \mathbf{R}\boldsymbol{\beta})'] = E[\mathbf{R}(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'\mathbf{R}']$

$= \mathbf{R}E[(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})']\mathbf{R}' = \mathbf{R}\mathrm{Var}(\mathbf{b})\mathbf{R}' = \sigma^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'.$

If we further assume that $\mathbf{u} \sim N(\mathbf{0}, \sigma^2_u\mathbf{I}_n)$, then

$\mathbf{b} \sim N[\boldsymbol{\beta}, \sigma^2_u(\mathbf{X}'\mathbf{X})^{-1}],$

which implies that $\mathbf{R}\mathbf{b} \sim N[\mathbf{R}\boldsymbol{\beta}, \sigma^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'].$
Under the null hypothesis $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, it follows that

$(\mathbf{R}\mathbf{b} - \mathbf{r})'[\sigma^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r}) \sim \chi^2(q). \quad (20)$

However, $\sigma^2_u$ is unknown, so this statistic cannot be computed. Recall its unbiased estimator,

$s^2_u = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n - k}.$

Replacing $\sigma^2_u$ in (20) by $s^2_u$ gives

$(\mathbf{R}\mathbf{b} - \mathbf{r})'[s^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r}). \quad (21)$

However, having replaced $\sigma^2_u$ by the estimate $s^2_u$, this quantity no longer follows the $\chi^2(q)$ distribution. Dividing (21) by $q$ instead yields a statistic which follows the $F(q, n-k)$ distribution. That is,

$F = \frac{(\mathbf{R}\mathbf{b} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r})}{q\,s^2_u} \sim F(q, n-k). \quad (22)$

In the special case of a single restriction ($q = 1$) of the form $H_0\!: \beta_j = \beta_{j,0}$, this test statistic simplifies considerably. First, note that

$\mathbf{R}\mathbf{b} - \mathbf{r} = b_j - \beta_{j,0},$

which is a scalar. Second, note that $s^2_u\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ picks out the $j$th diagonal element of the $k \times k$ matrix

$\widehat{\mathrm{Var}}(\mathbf{b}) = s^2_u(\mathbf{X}'\mathbf{X})^{-1},$

i.e. the square of the standard error of $b_j$. Taking the square root of the resulting $F$-statistic gives the familiar $t$-statistic:

$t = \frac{b_j - \beta_{j,0}}{\mathrm{s.e.}(b_j)}.$
An alternative approach to testing linear restrictions is to estimate the model

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u},$

subject to the restriction $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$. This yields the restricted least squares estimator of $\boldsymbol{\beta}$, which we denote $\mathbf{b}^*$. Letting $\hat{\mathbf{u}}^*$ denote the associated residuals, the sum of squared residuals is

$\mathrm{SSR}(\mathbf{b}^*) = \hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^* = (\mathbf{y} - \mathbf{X}\mathbf{b}^*)'(\mathbf{y} - \mathbf{X}\mathbf{b}^*) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}^{*\prime}\mathbf{X}'\mathbf{y} + \mathbf{b}^{*\prime}\mathbf{X}'\mathbf{X}\mathbf{b}^*.$

As with OLS, the restricted least squares estimator is based on minimizing this function. Our (constrained) optimization problem is thus

$\min_{\mathbf{b}^*} \mathrm{SSR}(\mathbf{b}^*) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}^{*\prime}\mathbf{X}'\mathbf{y} + \mathbf{b}^{*\prime}\mathbf{X}'\mathbf{X}\mathbf{b}^*, \quad \text{subject to} \quad \mathbf{R}\mathbf{b}^* = \mathbf{r}.$

The Lagrangian for this problem is

$\mathcal{L}(\mathbf{b}^*, \boldsymbol{\lambda}) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}^{*\prime}\mathbf{X}'\mathbf{y} + \mathbf{b}^{*\prime}\mathbf{X}'\mathbf{X}\mathbf{b}^* - 2\boldsymbol{\lambda}'(\mathbf{R}\mathbf{b}^* - \mathbf{r}), \quad (23)$

with first-order conditions

$\frac{\partial\mathcal{L}(\mathbf{b}^*, \boldsymbol{\lambda})}{\partial\mathbf{b}^*} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b}^* - 2\mathbf{R}'\boldsymbol{\lambda} = \mathbf{0}, \quad (24)$

and

$\frac{\partial\mathcal{L}(\mathbf{b}^*, \boldsymbol{\lambda})}{\partial\boldsymbol{\lambda}} = -2(\mathbf{R}\mathbf{b}^* - \mathbf{r}) = \mathbf{0}. \quad (25)$

From (24),

$\mathbf{X}'\mathbf{X}\mathbf{b}^* = \mathbf{X}'\mathbf{y} + \mathbf{R}'\boldsymbol{\lambda}.$

Solving for $\mathbf{b}^*$, we have

$\mathbf{b}^* = \mathbf{b} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\boldsymbol{\lambda}, \quad (26)$

where $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ is the unrestricted OLS estimator. Premultiplying (26) by $\mathbf{R}$,

$\mathbf{R}\mathbf{b}^* = \mathbf{R}\mathbf{b} + \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\boldsymbol{\lambda},$

which implies

$\mathbf{R}\mathbf{b}^* - \mathbf{R}\mathbf{b} = \mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\boldsymbol{\lambda},$

and therefore,

$\boldsymbol{\lambda} = [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b}^* - \mathbf{R}\mathbf{b}). \quad (27)$

From (25), $\mathbf{R}\mathbf{b}^* = \mathbf{r}$. Substituting this into Equation (27), we have

$\boldsymbol{\lambda} = [\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{b}),$

and so, from (26),

$\mathbf{b}^* = \mathbf{b} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{b}). \quad (28)$
Now, the restricted residuals can be written as

$\hat{\mathbf{u}}^* = \mathbf{y} - \mathbf{X}\mathbf{b}^* = \mathbf{y} - \mathbf{X}\mathbf{b} - \mathbf{X}\mathbf{b}^* + \mathbf{X}\mathbf{b} = \hat{\mathbf{u}} - \mathbf{X}(\mathbf{b}^* - \mathbf{b}).$

Therefore,

$\hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^* = [\hat{\mathbf{u}} - \mathbf{X}(\mathbf{b}^* - \mathbf{b})]'[\hat{\mathbf{u}} - \mathbf{X}(\mathbf{b}^* - \mathbf{b})]$

$= \hat{\mathbf{u}}'\hat{\mathbf{u}} - \hat{\mathbf{u}}'\mathbf{X}(\mathbf{b}^* - \mathbf{b}) - (\mathbf{b}^* - \mathbf{b})'\mathbf{X}'\hat{\mathbf{u}} + (\mathbf{b}^* - \mathbf{b})'\mathbf{X}'\mathbf{X}(\mathbf{b}^* - \mathbf{b})$

$= \hat{\mathbf{u}}'\hat{\mathbf{u}} + (\mathbf{b}^* - \mathbf{b})'\mathbf{X}'\mathbf{X}(\mathbf{b}^* - \mathbf{b}),$

since $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$.$^{25}$ Substituting for $\mathbf{b}^* - \mathbf{b}$ from Equation (28), and cancelling the $\mathbf{X}'\mathbf{X}$ terms, we have

$\hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^* = \hat{\mathbf{u}}'\hat{\mathbf{u}} + (\mathbf{r} - \mathbf{R}\mathbf{b})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{b}).$

Denoting $\hat{\mathbf{u}}^{*\prime}\hat{\mathbf{u}}^*$ (the sum of squared residuals from the restricted model) by $\mathrm{SSR}_R$, and $\hat{\mathbf{u}}'\hat{\mathbf{u}}$ (the sum of squared residuals from the unrestricted model) by $\mathrm{SSR}_U$, we have the following convenient form of the $F$-statistic:

$F = \frac{(\mathrm{SSR}_R - \mathrm{SSR}_U)/q}{\mathrm{SSR}_U/(n - k)} \sim F(q, n-k). \quad (29)$
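A sketch of the $F$-test in Equation (29) for example 2 above ($H_0\!: \beta_2 = \beta_3 = 0$), assuming Python with numpy and SciPy (SciPy supplies the $F$ distribution; neither library is prescribed by the notes):

```python
# F-test via restricted vs. unrestricted sums of squared residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)  # H0 is true here

b_u = np.linalg.solve(X.T @ X, X.T @ y)   # unrestricted OLS
ssr_u = ((y - X @ b_u) ** 2).sum()
ssr_r = ((y - y.mean()) ** 2).sum()       # restricted fit: constant only

q, k = 2, 3
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))  # Equation (29)
print(F, 1 - stats.f.cdf(F, q, n - k))         # statistic and P-value
```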
4.4 Bootstrapping
The test statistics considered above were built on the assumption that $\mathbf{u}$ is normally distributed. However, quite often, we may be unwilling to make this assumption. In such cases, we therefore do not know the distribution of our test statistics. This problem leads to a procedure known as the bootstrap.$^{26}$ The basic idea is to use the observed data to try to estimate the distribution of the relevant test statistic. In general, this procedure requires the following steps:

1. Using the observed data, estimate both the unrestricted and restricted models. Save the fitted values and residuals from the restricted model. Call these $\hat{\mathbf{y}}_R$ and $\hat{\mathbf{u}}_R$, respectively.

2. Using the estimates from the previous step, calculate the test statistic of interest (e.g. the $F$-statistic). Call this $T$.

3. Generate a bootstrap sample by resampling with replacement from the restricted residuals $\hat{\mathbf{u}}_R$ (call the resampled disturbances $\mathbf{u}^*$), and construct the bootstrap dependent variable as

$\mathbf{y}^* = \hat{\mathbf{y}}_R + \mathbf{u}^*.$

4. Using the bootstrap sample, re-estimate both the unrestricted and restricted models.

5. Using the estimates from the previous step, calculate the test statistic of interest (this is known as the bootstrap test statistic, since it is based on the bootstrap sample). Call this $T^*_b$.

Finally, repeat Steps 3-5 $B$ times, obtaining $T^*_b$, $b = 1, \ldots, B$.$^{27}$ It turns out that these bootstrap test statistics provide a fairly good estimate of the distribution of the test statistic of interest. For inference purposes, we can calculate the simulated P-value of the original test statistic, $T$, as

$\hat{p}^*(T) = \frac{1}{B}\sum_{b=1}^B I(T^*_b > T),$

where $I(\cdot)$ is the indicator function, equal to 1 if its argument is true and 0 otherwise.

$^{25}$ Here, we make use of the fact that $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$. Proving this would be a useful exercise.
$^{26}$ Here, we present a very brief introduction to the bootstrap. A more complete introduction can be found in Davidson and MacKinnon (2004, Section 4.6).
$^{27}$ For a chosen level of significance, $\alpha$, for the test, $B$ should be chosen so that $\alpha(B + 1)$ is an integer (see Davidson and MacKinnon (2004, Section 4.6)). For $\alpha = 0.01$, appropriate values of $B$ would be 99, 199, 299, and so on. Since computing costs are now so low, $B$ is often chosen to be 999 or even 9999.
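A compact sketch of Steps 1-5 (assuming Python/numpy; the helper f_stat and the data-generating process are illustrative, not part of the notes):

```python
# Bootstrap P-value for an F-test of H0: all slope coefficients are zero.
import numpy as np

def f_stat(y, X, q, k):
    """F-statistic of Equation (29) for H0: slopes = 0 (restricted fit: mean)."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    ssr_u = ((y - X @ b) ** 2).sum()
    ssr_r = ((y - y.mean()) ** 2).sum()
    return ((ssr_r - ssr_u) / q) / (ssr_u / (len(y) - k))

rng = np.random.default_rng(6)
n, B = 100, 999
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = 1.0 + rng.standard_t(df=3, size=n)        # non-normal errors, H0 true

T = f_stat(y, X, q=2, k=3)                    # Steps 1-2
y_fit_r, u_r = np.full(n, y.mean()), y - y.mean()

T_star = np.empty(B)
for b_ in range(B):                           # Steps 3-5, repeated B times
    u_star = rng.choice(u_r, size=n, replace=True)
    T_star[b_] = f_stat(y_fit_r + u_star, X, q=2, k=3)

print((T_star > T).mean())                    # simulated P-value
```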
5.1 Introduction
In estimating the linear regression model

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \quad (30)$

by OLS, we made the (very strong) assumption that

$\mathbf{u} \sim \mathrm{IID}(\mathbf{0}, \sigma^2_u\mathbf{I}_n).$

We would now like to relax this assumption, and consider what happens when the error terms are not IID. While we will continue to assume that $E(\mathbf{u}) = \mathbf{0}$, we will now let

$\mathrm{Var}(\mathbf{u}) = E(\mathbf{u}\mathbf{u}') = E\begin{pmatrix} u_1^2 & \ldots & u_1 u_n \\ \vdots & & \vdots \\ u_n u_1 & \ldots & u_n^2 \end{pmatrix} = \boldsymbol{\Omega},$

where $\boldsymbol{\Omega}$ is an $n \times n$ covariance matrix. That is, we allow the possibility that there is some $E(u_i u_j) \ne 0$ for $i \ne j$ (and that therefore $u_i$ and $u_j$ are not independent), and that there is some $E(u_i^2) \ne E(u_j^2)$ for $i \ne j$ (and that therefore $u_i$ and $u_j$ are not identically distributed). In other words, we allow the possibility that each $u_i$ is not IID.

Notice that the OLS estimator is still unbiased, since

$E(\mathbf{b}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}) = \boldsymbol{\beta}.$

However, its variance is now

$\mathrm{Var}(\mathbf{b}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}\mathbf{u}')\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}. \quad (31)$
It turns out (as we will see shortly) that, in this more general setting, OLS is no longer the best linear unbiased estimator (BLUE). That is, there is some other linear unbiased estimator which has a variance smaller than (31). In deriving this better estimator, the basic strategy is to transform the linear regression model in (30) so that the error terms become IID. To do so, start by letting

$\boldsymbol{\Psi}'\boldsymbol{\Psi} = \boldsymbol{\Omega}^{-1},$

where $\boldsymbol{\Psi}$ is some invertible $n \times n$ matrix. Premultiplying the regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$ by $\boldsymbol{\Psi}$ gives the transformed model

$\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}^*, \quad (32)$

where

$\mathbf{y}^* = \boldsymbol{\Psi}\mathbf{y}, \quad \mathbf{X}^* = \boldsymbol{\Psi}\mathbf{X}, \quad \text{and} \quad \mathbf{u}^* = \boldsymbol{\Psi}\mathbf{u}.$

Note that

$E(\mathbf{u}^*) = E(\boldsymbol{\Psi}\mathbf{u}) = \boldsymbol{\Psi}E(\mathbf{u}) = \mathbf{0},$

and that

$\mathrm{Var}(\mathbf{u}^*) = E(\mathbf{u}^*\mathbf{u}^{*\prime}) = E[(\boldsymbol{\Psi}\mathbf{u})(\boldsymbol{\Psi}\mathbf{u})'] = \boldsymbol{\Psi}E(\mathbf{u}\mathbf{u}')\boldsymbol{\Psi}' = \boldsymbol{\Psi}\boldsymbol{\Omega}\boldsymbol{\Psi}' = \boldsymbol{\Psi}(\boldsymbol{\Psi}'\boldsymbol{\Psi})^{-1}\boldsymbol{\Psi}' = \boldsymbol{\Psi}\boldsymbol{\Psi}^{-1}(\boldsymbol{\Psi}')^{-1}\boldsymbol{\Psi}' = \mathbf{I}_n.$

That is, $\mathbf{u}^* \sim \mathrm{IID}(\mathbf{0}, \mathbf{I}_n)$.
Applying OLS to the transformed model (32) yields the generalized least squares (or GLS) estimator, $\mathbf{b}_{GLS}$:

$\mathbf{b}_{GLS} = (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{y}^* = (\mathbf{X}'\boldsymbol{\Psi}'\boldsymbol{\Psi}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Psi}'\boldsymbol{\Psi}\mathbf{y} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}. \quad (33)$

Substituting $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, we have

$\mathbf{b}_{GLS} = \boldsymbol{\beta} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{u}. \quad (34)$

Taking expectations,

$E(\mathbf{b}_{GLS}) = \boldsymbol{\beta} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}E(\mathbf{u}) = \boldsymbol{\beta},$

since $E(\mathbf{u}) = \mathbf{0}$. So $\mathbf{b}_{GLS}$ is unbiased. Next, consider the variance of $\mathbf{b}_{GLS}$:

$\mathrm{Var}(\mathbf{b}_{GLS}) = E\{[(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{u}][(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{u}]'\}$

$= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}E(\mathbf{u}\mathbf{u}')\boldsymbol{\Omega}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}$

$= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\boldsymbol{\Omega}\boldsymbol{\Omega}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}$

$= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}. \quad (35)$
We now want to consider a generalization of the Gauss-Markov theorem, which states that, if $\mathrm{Var}(\mathbf{u}) = \boldsymbol{\Omega}$, the GLS estimator of $\boldsymbol{\beta}$ is BLUE. The proof of this will be very similar to the proof of the Gauss-Markov theorem for OLS (where it was assumed that $\mathrm{Var}(\mathbf{u}) = \sigma^2_u\mathbf{I}_n$). Consider some other linear, unbiased estimator of $\boldsymbol{\beta}$, call it $\mathbf{B}$:

$\mathbf{B} = \mathbf{G}\mathbf{y},$

where, here,

$\mathbf{G} = \mathbf{W} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1},$

and $\mathbf{W}$ depends only on $\mathbf{X}$. The expected value of $\mathbf{B}$ is thus

$E(\mathbf{B}) = E(\mathbf{G}\mathbf{y}) = \mathbf{G}E(\mathbf{y}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta},$

so, in order for $\mathbf{B}$ to be unbiased, we require

$\mathbf{G}\mathbf{X} = \mathbf{I}_k,$

where $\mathbf{I}_k$ is a $k \times k$ identity matrix. Using $\mathbf{G}\mathbf{X} = \mathbf{I}_k$, it is easy to confirm that

$\mathbf{B} = \mathbf{G}\mathbf{y} = \mathbf{G}(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{G}\mathbf{X}\boldsymbol{\beta} + \mathbf{G}\mathbf{u} = \boldsymbol{\beta} + \mathbf{G}\mathbf{u},$

and that

$\mathbf{W}\mathbf{X} = [\mathbf{G} - (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}]\mathbf{X} = \mathbf{G}\mathbf{X} - (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X} = \mathbf{I}_k - \mathbf{I}_k = \mathbf{0}.$

The variance of $\mathbf{B}$ is therefore

$\mathrm{Var}(\mathbf{B}) = E([\mathbf{B} - E(\mathbf{B})][\mathbf{B} - E(\mathbf{B})]') = E[(\mathbf{G}\mathbf{u})(\mathbf{G}\mathbf{u})'] = \mathbf{G}E(\mathbf{u}\mathbf{u}')\mathbf{G}' = \mathbf{G}\boldsymbol{\Omega}\mathbf{G}'$

$= [\mathbf{W} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}]\boldsymbol{\Omega}[\mathbf{W} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}]'$

$= \mathbf{W}\boldsymbol{\Omega}\mathbf{W}' + \mathbf{W}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1} + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}' + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}$

$= \mathbf{W}\boldsymbol{\Omega}\mathbf{W}' + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1},$

since $\mathbf{W}\mathbf{X} = \mathbf{0}$. Finally, since $\mathbf{W}\boldsymbol{\Omega}\mathbf{W}'$ is positive semidefinite, $\mathrm{Var}(\mathbf{B}) \ge \mathrm{Var}(\mathbf{b}_{GLS})$, i.e. $\mathbf{b}_{GLS}$ is BLUE. Note that if $\mathrm{Var}(\mathbf{u}) = \sigma^2_u\mathbf{I}_n$, then GLS includes OLS as a special case,$^{28}$ and we are back to the standard Gauss-Markov theorem.

$^{28}$ Setting $\boldsymbol{\Omega} = \sigma^2_u\mathbf{I}_n$ in (33) gives $\mathbf{b}_{GLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{b}$.
Of course, in practice, $\boldsymbol{\Omega}$ is almost never known. That is, GLS is not usually feasible. However, we can use some estimate of $\boldsymbol{\Omega}$, which we denote $\hat{\boldsymbol{\Omega}}$. Replacing $\boldsymbol{\Omega}$ by $\hat{\boldsymbol{\Omega}}$ in (33) gives the feasible GLS (or FGLS) estimator:

$\mathbf{b}_{FGLS} = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}.$

How we actually go about estimating $\boldsymbol{\Omega}$ depends on the problem at hand, as the next two sections illustrate.
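A minimal Python/numpy sketch of Equation (33); the diagonal $\boldsymbol{\Omega}$ used here is an assumption of the example, not part of the derivation above:

```python
# GLS: bGLS = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y   (Equation (33))
import numpy as np

def gls(y, X, Omega):
    Oinv = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# Toy heteroskedastic example with a known (assumed) Omega.
rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sd = np.linspace(0.5, 3.0, n)               # error standard deviations
y = X @ np.array([1.0, 2.0]) + sd * rng.normal(size=n)

print(gls(y, X, np.diag(sd ** 2)))          # close to (1.0, 2.0)
```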
5.3 Heteroskedasticity
The problem of heteroskedasticity occurs when the diagonal elements of $\boldsymbol{\Omega}$ are not all the same (meaning each $u_i$ does not have the same variance), while the off-diagonal elements are all zero (meaning each $u_i$ is still independent). In this case, we have

$\boldsymbol{\Omega} = \begin{pmatrix} \sigma^2_{u_1} & 0 & \ldots & 0 \\ 0 & \sigma^2_{u_2} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & \sigma^2_{u_n} \end{pmatrix}.$

Since $\boldsymbol{\Omega}$ is diagonal, we have

$\boldsymbol{\Omega}^{-1} = \begin{pmatrix} 1/\sigma^2_{u_1} & 0 & \ldots & 0 \\ 0 & 1/\sigma^2_{u_2} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 1/\sigma^2_{u_n} \end{pmatrix}.$

So, if we let

$\boldsymbol{\Psi} = \begin{pmatrix} 1/\sigma_{u_1} & 0 & \ldots & 0 \\ 0 & 1/\sigma_{u_2} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 1/\sigma_{u_n} \end{pmatrix},$

then $\boldsymbol{\Psi}'\boldsymbol{\Psi} = \boldsymbol{\Omega}^{-1}$, as desired. The transformed variables are

$\mathbf{y}^* = \boldsymbol{\Psi}\mathbf{y} = \begin{pmatrix} y_1/\sigma_{u_1} \\ y_2/\sigma_{u_2} \\ \vdots \\ y_n/\sigma_{u_n} \end{pmatrix}, \quad \mathbf{X}^* = \boldsymbol{\Psi}\mathbf{X} = \begin{pmatrix} x_{1,1}/\sigma_{u_1} & \ldots & x_{k,1}/\sigma_{u_1} \\ x_{1,2}/\sigma_{u_2} & \ldots & x_{k,2}/\sigma_{u_2} \\ \vdots & & \vdots \\ x_{1,n}/\sigma_{u_n} & \ldots & x_{k,n}/\sigma_{u_n} \end{pmatrix},$

and

$\mathbf{u}^* = \boldsymbol{\Psi}\mathbf{u} = \begin{pmatrix} u_1/\sigma_{u_1} \\ u_2/\sigma_{u_2} \\ \vdots \\ u_n/\sigma_{u_n} \end{pmatrix}.$

So, in this case, the transformed model $\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}^*$ can be written, for the $i$th observation, as

$\frac{y_i}{\sigma_{u_i}} = \beta_1\frac{x_{1,i}}{\sigma_{u_i}} + \ldots + \beta_k\frac{x_{k,i}}{\sigma_{u_i}} + \frac{u_i}{\sigma_{u_i}}, \quad i = 1, \ldots, n.$

GLS estimation in this case is known as weighted least squares (WLS), since each observation is weighted by the inverse of the standard deviation of its error term.
Of course, in practice, we don't know the $\sigma_{u_i}$ (i.e. we don't know $\boldsymbol{\Omega}$), so they must be estimated before this transformed model can be used (making this, in this case, what might be referred to as feasible WLS). Typically, this is done as follows. Suppose we think that the variance of $u_i$ can be explained by the variables $z_{1,i}, \ldots, z_{r,i}$ (which may include columns of $\mathbf{X}$). That is, suppose

$E(u_i^2) = \sigma^2_{u_i} = \gamma_1 z_{1,i}^{\delta_1} + \ldots + \gamma_r z_{r,i}^{\delta_r}, \quad i = 1, \ldots, n.$

Adding a disturbance term $v_i$, we have

$u_i^2 = \gamma_1 z_{1,i}^{\delta_1} + \ldots + \gamma_r z_{r,i}^{\delta_r} + v_i, \quad i = 1, \ldots, n,$

which is in the form of a non-linear regression model. Unfortunately, we don't have time to cover non-linear regression models in this course, but estimating such a model is certainly possible. However, since we can't actually observe $u_i^2$, we would have to use the squared residuals from the OLS estimation of model (30), $\hat{u}_i^2$, in their place. That is, we could estimate the non-linear regression model

$\hat{u}_i^2 = \gamma_1 z_{1,i}^{\delta_1} + \ldots + \gamma_r z_{r,i}^{\delta_r} + w_i, \quad i = 1, \ldots, n,$

where $w_i$ is a disturbance term.$^{29}$ Using the square roots of the fitted values from this regression, $\hat{\sigma}_{u_i}$, we can then proceed with WLS, estimating

$\frac{y_i}{\hat{\sigma}_{u_i}} = \beta_1\frac{x_{1,i}}{\hat{\sigma}_{u_i}} + \ldots + \beta_k\frac{x_{k,i}}{\hat{\sigma}_{u_i}} + \frac{u_i}{\hat{\sigma}_{u_i}}, \quad i = 1, \ldots, n,$

by OLS.
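A sketch of this feasible WLS procedure in Python/numpy. For simplicity it fits a linear skedastic function for $\hat{u}_i^2$ rather than the power form above, so it illustrates the idea rather than the exact procedure:

```python
# Feasible WLS: estimate variances from OLS residuals, then reweight.
import numpy as np

rng = np.random.default_rng(8)
n = 500
x = np.abs(rng.normal(size=n)) + 0.1
X = np.column_stack([np.ones(n), x])
sd = 0.5 * x                                   # true skedastic pattern (assumed)
y = X @ np.array([1.0, 2.0]) + sd * rng.normal(size=n)

# Step 1: OLS and squared residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
u2 = (y - X @ b_ols) ** 2

# Step 2: regress u2 on (1, x) to get fitted variances
g = np.linalg.solve(X.T @ X, X.T @ u2)
sd_hat = np.sqrt(np.clip(X @ g, 1e-6, None))   # guard against negative fits

# Step 3: weighted regression (divide every variable by sd_hat)
Xw, yw = X / sd_hat[:, None], y / sd_hat
print(np.linalg.solve(Xw.T @ Xw, Xw.T @ yw))
```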
5.4 Autocorrelation
The problem of autocorrelation (or serial correlation) occurs when the off-diagonal elements of $\boldsymbol{\Omega}$ are not all zero (meaning each $u_i$ is no longer independent). Here, we will assume that the errors are homoskedastic (meaning each diagonal element of $\boldsymbol{\Omega}$ is identical), but it is possible that autocorrelation and heteroskedasticity are simultaneously present. Autocorrelation is usually encountered in time-series data, where the error terms may be generated by some autoregressive process. For example, suppose the error terms follow a (linear) first-order autoregressive process (or AR(1) process):

$u_t = \rho u_{t-1} + v_t, \quad t = 1, \ldots, n, \quad (36)$

$^{29}$ This model could also be used to test for the presence of heteroskedasticity. Here, the null hypothesis would be $H_0\!: \gamma_1 = \ldots = \gamma_r = 0$ (implying homoskedasticity). An F-test could be used to test this restriction.
where $|\rho| < 1$, and $v_t \sim \mathrm{IID}(0, \sigma^2_v)$. The condition that $|\rho| < 1$ is imposed so that the AR(1) process in (36) is what is known as covariance stationary. A covariance stationary process for $u_t$ is one in which $E(u_t)$, $\mathrm{Var}(u_t)$, and $\mathrm{Cov}(u_t, u_{t-j})$, for any given $j$, are independent of $t$.

One way to see this is to imagine that, although we only start observing it at $t = 1$, $u_t$ has been in existence for an infinite time. First, note that, by repeated substitution,

$u_t = \rho(\rho u_{t-2} + v_{t-1}) + v_t = \rho^2 u_{t-2} + \rho v_{t-1} + v_t = \rho^3 u_{t-3} + \rho^2 v_{t-2} + \rho v_{t-1} + v_t = \ldots,$

so that

$u_t = \sum_{i=0}^{\infty} \rho^i v_{t-i}.$

The expected value of $u_t$ is then

$E(u_t) = E\left(\sum_{i=0}^{\infty} \rho^i v_{t-i}\right) = \sum_{i=0}^{\infty} \rho^i E(v_{t-i}) = 0,$

since each $E(v_{t-i}) = 0$; this is independent of $t$. Next, since the $v_t$ are independent,

$\mathrm{Var}(u_t) = \mathrm{Var}\left(\sum_{i=0}^{\infty} \rho^i v_{t-i}\right) = \sum_{i=0}^{\infty} \rho^{2i}\mathrm{Var}(v_{t-i}) = \sigma^2_v\sum_{i=0}^{\infty} \rho^{2i},$

and, since the geometric series converges (because $|\rho| < 1$),

$\mathrm{Var}(u_t) = \frac{\sigma^2_v}{1 - \rho^2}.$

So, $\mathrm{Var}(u_t)$ is also independent of $t$. Note that if $|\rho| \ge 1$, this infinite series would not converge (rather, it would explode), and the variance would therefore depend on $t$; in this case, the process would be non-stationary.
Finally, consider the covariance between $u_t$ and $u_{t-j}$, for any given $j$. For $j = 1$,

$\mathrm{Cov}(u_t, u_{t-1}) = E(u_t u_{t-1}) = E[(\rho u_{t-1} + v_t)u_{t-1}] = \rho E(u_{t-1}^2) = \frac{\rho\sigma^2_v}{1 - \rho^2}.$

Similarly,

$\mathrm{Cov}(u_t, u_{t-2}) = \frac{\rho^2\sigma^2_v}{1 - \rho^2},$

and, in general,

$\mathrm{Cov}(u_t, u_{t-j}) = \frac{\rho^j\sigma^2_v}{1 - \rho^2},$

which is also independent of $t$. Using the above results for $\mathrm{Var}(u_t)$ and $\mathrm{Cov}(u_t, u_{t-j})$, we have

$\boldsymbol{\Omega} = \frac{\sigma^2_v}{1 - \rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \ldots & \rho^{n-1} \\ \rho & 1 & \rho & \ldots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \ldots & \rho^{n-3} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \ldots & 1 \end{pmatrix}.$

It can be verified that

$\boldsymbol{\Omega}^{-1} = \frac{1}{\sigma^2_v}\begin{pmatrix} 1 & -\rho & 0 & \ldots & 0 & 0 \\ -\rho & 1+\rho^2 & -\rho & \ldots & 0 & 0 \\ 0 & -\rho & 1+\rho^2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1+\rho^2 & -\rho \\ 0 & 0 & 0 & \ldots & -\rho & 1 \end{pmatrix}.$
So, if we let

$\boldsymbol{\Psi} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \ldots & 0 & 0 \\ -\rho & 1 & 0 & \ldots & 0 & 0 \\ 0 & -\rho & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -\rho & 1 \end{pmatrix},$

then $\boldsymbol{\Psi}'\boldsymbol{\Psi} = \boldsymbol{\Omega}^{-1}$, as desired. The transformed variables are

$\mathbf{y}^* = \boldsymbol{\Psi}\mathbf{y} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2}\,y_1 \\ y_2 - \rho y_1 \\ y_3 - \rho y_2 \\ \vdots \\ y_n - \rho y_{n-1} \end{pmatrix}, \quad \mathbf{X}^* = \boldsymbol{\Psi}\mathbf{X} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2}\,x_{1,1} & \ldots & \sqrt{1-\rho^2}\,x_{k,1} \\ x_{1,2} - \rho x_{1,1} & \ldots & x_{k,2} - \rho x_{k,1} \\ \vdots & & \vdots \\ x_{1,n} - \rho x_{1,n-1} & \ldots & x_{k,n} - \rho x_{k,n-1} \end{pmatrix},$

and

$\mathbf{u}^* = \boldsymbol{\Psi}\mathbf{u} = \frac{1}{\sigma_v}\begin{pmatrix} \sqrt{1-\rho^2}\,u_1 \\ u_2 - \rho u_1 \\ \vdots \\ u_n - \rho u_{n-1} \end{pmatrix}.$

So, in this case, the transformed model $\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}^*$ can be written, for the first observation, as

$\frac{1}{\sigma_v}\sqrt{1-\rho^2}\,y_1 = \beta_1\frac{1}{\sigma_v}\sqrt{1-\rho^2}\,x_{1,1} + \ldots + \beta_k\frac{1}{\sigma_v}\sqrt{1-\rho^2}\,x_{k,1} + \frac{1}{\sigma_v}\sqrt{1-\rho^2}\,u_1,$

and, for the $t$th observation, $t = 2, \ldots, n$, as

$\frac{1}{\sigma_v}(y_t - \rho y_{t-1}) = \beta_1\frac{1}{\sigma_v}(x_{1,t} - \rho x_{1,t-1}) + \ldots + \beta_k\frac{1}{\sigma_v}(x_{k,t} - \rho x_{k,t-1}) + \frac{1}{\sigma_v}(u_t - \rho u_{t-1}).$

Of course, since the $\frac{1}{\sigma_v}$ factor appears in every term, it cancels, and we can simply estimate

$\sqrt{1-\rho^2}\,y_1 = \beta_1\sqrt{1-\rho^2}\,x_{1,1} + \ldots + \beta_k\sqrt{1-\rho^2}\,x_{k,1} + \sqrt{1-\rho^2}\,u_1$

for the first observation,
and

$(y_t - \rho y_{t-1}) = \beta_1(x_{1,t} - \rho x_{1,t-1}) + \ldots + \beta_k(x_{k,t} - \rho x_{k,t-1}) + (u_t - \rho u_{t-1}),$

for $t = 2, \ldots, n$.

Of course, in practice, $\rho$ (and hence $\boldsymbol{\Omega}$) is unknown, so it must be estimated, making this feasible GLS. A common approach is iterated FGLS:

1. Estimate (30) by OLS and obtain the residuals $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\mathbf{b}$.

2. Obtain the $j$th estimate of $\rho$, $\hat{\rho}^{(j)}$, by regressing $\hat{u}_t$ on $\hat{u}_{t-1}$, i.e.

$\hat{\rho}^{(j)} = \frac{\sum_{t=2}^n \hat{u}_t\hat{u}_{t-1}}{\sum_{t=2}^n \hat{u}_{t-1}^2}.$

3. Use $\hat{\rho}^{(j)}$ to construct $\hat{\boldsymbol{\Omega}}$ and get the FGLS estimate of (30), $\mathbf{b}_{FGLS}$; use this estimate to compute new residuals, and repeat Steps 2-3 until the estimates converge, i.e. until $|\hat{\rho}^{(j)} - \hat{\rho}^{(j-1)}|$ is smaller than some chosen tolerance.
(j)
(j1)
H0 : R = r,
we use use
bGLS (or bFGLS ) instead of b as an bGLS , but, at the end of this section,
estimate of
bFGLS .
Note rst, that
E(RbGLS )
= RE(bGLS ) = R,
59
and
Var(RbGLS )
= = = = = = =
E([RbGLS E(RbGLS )][RbGLS E(RbGLS )] ) E[(RbGLS R)(RbGLS R) ] E[(RbGLS R)(bGLS R R )] E[R(bGLS )(bGLS )R ] RE[(bGLS )(bGLS ) ]R RVar(bGLS )R R(X 1 X)1 R .
u N(0, ),
then
R = r,
bGLS ,
but
also to construct the the above test statistic. Of course, since we almost never
Substituting
for
in the above,
is normally distributed).
Appendix A
We showed these results in class. These notes will eventually be updated.
References
Davidson, R. and MacKinnon, J. G. (2004). Econometric Theory and Methods. New York: Oxford University Press.