Econometrics
Notation
Theoretical Linear Regressions (level 1 - in population)
Definition
General Case
Particular Case: Simple Linear Regression (SLR)
Particular Case: SLR without constant
Particular Case: SLR with only constant
Conditional Expectations
Linear regressions as projections
Ordinary Least Squares (level 2 - in sample)
Definition
Empirical operators
Properties of the OLS estimator
Goodness-of-fit
Particular case of SLR
Links between simple and multiple linear regressions
Frisch-Waugh theorem
Algebraic link between “short” and “long” regressions
Notation
This is some core notation that the course uses:
$(\Omega, \mathcal{A}, P)$ is the underlying probability space, where $\Omega$ is the sample space, $\mathcal{A}$ is the event space and $P$ is the probability function.
E.g.: toss a coin; the sample space is $\Omega = \{H, T\}$ and the event space is $\mathcal{A} = \{\emptyset, \{H\}, \{T\}, \{H, T\}\}$.
Theoretical Linear Regressions (level 1 - in population)
(Section 2.2)
Linear regression defined at the level of the whole population.
Definition
General Case
Let $Y \in \mathbb{R}^{\Omega}$ and $X \in (\mathbb{R}^d)^{\Omega}$, for some $d \in \mathbb{N}^*$. If

(i) $E[|Y|^2] < +\infty$;

(ii) $E[\|X\|_e^2] < +\infty$, where $\|\cdot\|_e$ denotes the Euclidean norm, that is $\|X\|_e := \left(\sum_{j=1}^{d} |X_j|^2\right)^{1/2}$;

(iii) $E[XX']$ is invertible;

then

$$\exists!\, (\beta_0, \varepsilon) \in \mathbb{R}^d \times \mathbb{R}^{\Omega} : \quad Y = X'\beta_0 + \varepsilon, \quad E[X\varepsilon] = 0.$$

Besides, that unique $\beta_0$ is defined as

$$\beta_0 := E[XX']^{-1} E[XY].$$

We call $X'\beta_0 \in \mathbb{R}^{\Omega}$ the theoretical linear regression of $Y$ on $X$. We call $\beta_0 \in \mathbb{R}^d$ the coefficients of that theoretical linear regression, or the theoretical coefficients of the linear regression of $Y$ on $X$.
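As a minimal numerical sketch (assuming NumPy and a purely illustrative data-generating process, not taken from the course), one can approximate $\beta_0 = E[XX']^{-1}E[XY]$ by replacing the expectations with averages over a very large simulated sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                         # large sample to approximate population expectations

# Illustrative DGP: X = (1, D)' with D standard normal, Y nonlinear in D
D = rng.normal(size=n)
Y = 1.0 + 2.0 * D + 0.5 * D**2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), D])  # each row is X_i'

# beta_0 = E[XX']^{-1} E[XY], approximated by sample averages
EXX = X.T @ X / n
EXY = X.T @ Y / n
beta_0 = np.linalg.solve(EXX, EXY)
print(beta_0)   # close to (1.5, 2.0) here, since E[D] = 0, E[D^2] = 1, E[D^3] = 0
```

Even though the true relationship is nonlinear, the theoretical linear regression is well defined: it is the best linear approximation of $Y$ by $X$.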
Particular Case: Simple Linear Regression (SLR)
An SLR is any regression with $X = (1, D)'$, with $D \in \mathbb{R}^{\Omega}$.

When $X = (1, D)'$, we have:

$$\alpha_0 := E[Y] - \frac{\mathrm{Cov}(Y, D)}{V[D]}\, E[D]; \qquad \beta_D^S := \frac{\mathrm{Cov}(Y, D)}{V[D]},$$

with $\beta_0 = (\alpha_0, \beta_D^S)$.
Special case: SLR with a binary regressor, $\mathrm{Support}(D) = \{0, 1\}$

With $X = (1, D)'$ and $D \in \{0, 1\}$ (binary variable):

$$\alpha_0 = E[Y \mid D = 0]; \qquad \beta_D^S = E[Y \mid D = 1] - E[Y \mid D = 0]$$
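A small sketch (the simulated Bernoulli regressor and outcome below are purely illustrative, assuming NumPy) checking that, with a binary regressor, the slope is the difference of conditional means and the intercept is the mean of the untreated group:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
D = rng.integers(0, 2, size=n)               # binary regressor
Y = 1.0 + 0.7 * D + rng.normal(size=n)       # illustrative outcome

X = np.column_stack([np.ones(n), D])
beta = np.linalg.solve(X.T @ X, X.T @ Y)     # (alpha_0, beta_D^S)

diff_means = Y[D == 1].mean() - Y[D == 0].mean()
print(beta[1], diff_means)                   # both close to 0.7
print(beta[0], Y[D == 0].mean())             # intercept close to E[Y | D = 0]
```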
Particular Case: SLR without constant
SLR with $X = D$:

$$\beta_0 = \frac{E[DY]}{E[D^2]} \in \mathbb{R}$$

If $D$ is centred (zero expectation), then $\mathrm{Cov}(Y, D) = E[DY]$ and $V[D] = E[D^2]$, so

$$\frac{\mathrm{Cov}(Y, D)}{V[D]} = \frac{E[DY]}{E[D^2]}.$$

In words, in a simple linear regression, if the regressor has zero expectation, the theoretical slope coefficient does not change whether a constant is included or not.
Particular Case: SLR with only constant
SLR with $X = 1$:

$$\beta_0 = E[Y]$$
Conditional Expectations
Let $Y \in \mathbb{R}^{\Omega}$ and $X \in (\mathbb{R}^d)^{\Omega}$. If $E[Y^2] < +\infty$, then the conditional expectation of $Y$ given/knowing $X$ is (almost surely) uniquely defined.

$E[Y \mid X]$ is a random variable.

$E[Y \mid X = x]$ is a real number, the realized best prediction of $Y$ by arbitrary functions of $X$ knowing that "$X$ equals $x$".

Properties:

Conditional expectation is linear.

Conditioning on $X$, for any function $g(X)$:
$$E[g(X) \mid X] = g(X)$$

If $X \perp\!\!\!\perp Y$, that is, $X$ and $Y$ are independent, then:
$$E[Y \mid X] = E[Y]$$

It satisfies the (conditional) Jensen's inequality: for any convex function $f$,
$$E[f(X) \mid Y] \geq f(E[X \mid Y])$$

Law of Iterated Expectations (or tower property):
$$E[Y] = E\big(E[Y \mid X]\big)$$
If $g(X)$ is any function of $X$, we have
$$E[Y \mid g(X)] = E\big(E[Y \mid X] \mid g(X)\big)$$
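A quick numerical illustration of the tower property, with a made-up discrete distribution (assuming NumPy): averaging the conditional means $E[Y \mid X = x]$ over the distribution of $X$ recovers $E[Y]$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
X = rng.integers(0, 3, size=n)        # discrete X taking values 0, 1, 2
Y = X**2 + rng.normal(size=n)         # illustrative outcome

# E[E[Y|X]]: within-group means weighted by P(X = x)
cond_means = np.array([Y[X == x].mean() for x in range(3)])
probs = np.array([(X == x).mean() for x in range(3)])
print(cond_means @ probs, Y.mean())   # both approximately equal E[Y]
```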
Linear regressions as projections
Consider the simplest case (SLR without constant): let $Y \in \mathbb{R}^{\Omega}$ and $X \in (\mathbb{R}^d)^{\Omega}$ with $E[|Y|^2] < +\infty$ and $E[\|X\|_e^2] < +\infty$.

We define the best linear prediction of $Y$ by linear functions of $X$ as $X'\beta_P \in \mathbb{R}^{\Omega}$, denoted $E_{\mathrm{lin}}[Y \mid X]$ or $L[Y \mid X]$, with $\beta_P \in \mathbb{R}^d$ defined by:

$$\beta_P := \operatorname*{argmin}_{b \in \mathbb{R}^d} E[(Y - X'b)^2]$$

$L[Y \mid X] = X'\beta_P$ is the projection of $Y$ on the finite-dimensional subspace

$$L_2^{\mathrm{lin}(X)} := \{X'\beta : \beta \in \mathbb{R}^d\}$$

of $L_2$ made of all linear functions of $X$.

The theoretical linear regression of $Y$ on $X$ and the orthogonal projection of $Y$ on $L_2^{\mathrm{lin}(X)}$ coincide:

$$E[XX']^{-1}E[XY] =: \beta_0 = \beta_P := \operatorname*{argmin}_{b \in \mathbb{R}^d} E[(Y - X'b)^2]$$
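A sketch (assuming NumPy and SciPy, with an illustrative simulated sample) checking numerically that the minimizer of the mean squared error coincides with the closed form $E[XX']^{-1}E[XY]$, here computed on sample averages:

```python
import numpy as np
from scipy.optimize import minimize   # assumes SciPy is available

rng = np.random.default_rng(3)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)

# Closed form: the projection coefficients
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Direct numerical minimization of the (empirical) mean squared error
mse = lambda b: np.mean((Y - X @ b) ** 2)
beta_argmin = minimize(mse, x0=np.zeros(2)).x

print(beta_closed, beta_argmin)       # the two agree up to numerical tolerance
```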
Ordinary Least Squares (level 2 - in sample)
SETTING:

$Y \in \mathbb{R}^{\Omega}$ (real random variable) and $X = (X_1, X_2, \dots, X_d)' \in (\mathbb{R}^d)^{\Omega}$ (column vector).

$(X, Y)$ is drawn from some joint distribution $P_{(X,Y)}$.

Goal: predict/explain $Y$ (outcome) using $X$ (regressors/covariates).

$\Rightarrow$ To do so, we use $L[Y \mid X] =: X'\beta_0$, the theoretical linear regression/projection of $Y$ on $X$.

In practice, $\beta_0$ is unknown. We need to estimate it to form our predictions. We will use data for that. For a fixed $P_{(Y,X)}$, we assume to have access to

$$(Y_i, X_i)_{i=1,\dots,n} \overset{\text{i.i.d.}}{\sim} P_{(Y,X)},$$

a cross-sectional sample of $n \in \mathbb{N}^*$ observations assumed independent and identically distributed (i.i.d.) from the distribution of interest (and thus representative of the population of interest).

We consider the case of $d := \dim(X) \in \mathbb{N}^*$, and will thus see the case of a single scalar regressor and a constant, namely $X = (1, D)'$, $D \in \mathbb{R}^{\Omega}$, as a special case.
Definition
The OLS estimator in the (empirical) linear regression of $Y \in \mathbb{R}^{\Omega}$ on $X \in (\mathbb{R}^d)^{\Omega}$ (also known as the (empirical) coefficients in the empirical linear regression of $Y$ on $X$) is defined as:

$$\hat{\beta} := \operatorname*{argmin}_{b \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (Y_i - X_i'b)^2 \;\in (\mathbb{R}^d)^{\Omega}$$

Note: key idea (method of moments or plug-in) for estimation: replace the unknown theoretical expectation $E[(Y - X'b)^2]$ with the empirical mean $\frac{1}{n}\sum_{i=1}^{n} (Y_i - X_i'b)^2$.

Applied to the definition of $\beta_0$:

$$\hat{\beta} = \left(\frac{1}{n}\sum_{i=1}^{n} X_i X_i'\right)^{-1} \left(\frac{1}{n}\sum_{i=1}^{n} X_i Y_i\right) = \widehat{E}[XX']^{-1}\, \widehat{E}[XY]$$

Sample invertibility condition:

$$\frac{1}{n}\sum_{i=1}^{n} X_i X_i' \text{ is invertible}$$
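A minimal sketch of the plug-in formula (assuming NumPy, with an illustrative simulated sample), compared with a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 1_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])   # includes a constant
Y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(size=n)

# beta_hat = (1/n sum X_i X_i')^{-1} (1/n sum X_i Y_i)
XtX_n = X.T @ X / n          # must be invertible (sample invertibility condition)
XtY_n = X.T @ Y / n
beta_hat = np.linalg.solve(XtX_n, XtY_n)

# Same result from a generic least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat, beta_lstsq)
```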
Empirical operators

In the same plug-in spirit, theoretical operators are replaced by their empirical (sample) counterparts: for a random variable $Z$ observed in the sample, the empirical expectation is $\widehat{E}[Z] := \frac{1}{n}\sum_{i=1}^{n} Z_i$ and the empirical variance is $\widehat{V}[Z] := \widehat{E}\big[(Z - \widehat{E}[Z])^2\big]$. These are the operators appearing in $\hat{\beta} = \widehat{E}[XX']^{-1}\widehat{E}[XY]$ above and in the $\widehat{R}^2$ below.
Properties of the OLS estimator
If the sample invertibility condition holds, then $\hat{\beta}$ exists and is unique, and it satisfies the empirical counterpart of the population orthogonality condition: writing $\hat{\varepsilon}_i := Y_i - X_i'\hat{\beta}$ for the residuals, $\widehat{E}[X\hat{\varepsilon}] = \frac{1}{n}\sum_{i=1}^{n} X_i \hat{\varepsilon}_i = 0$.
Goodness-of-fit
Answer to the question "Does (a linear combination of) $X$ predict $Y$ accurately?" with the (empirical) $\widehat{R}^2$:

$$\widehat{R}^2 := \frac{\widehat{V}[\hat{Y}]}{\widehat{V}[Y]} \in \mathbb{R}^{\Omega}$$

It is thus the part of the empirical variance of the outcome/target $Y$ that is explained by the predicted/fitted values $\hat{Y}$.

If we add a new explanatory variable ($\dim(X) \to d + 1$), the $\widehat{R}^2$ cannot decrease; in other words, it always weakly increases.
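A short sketch (assuming NumPy, illustrative simulated data) computing $\widehat{R}^2$ as the ratio of the empirical variance of the fitted values to that of the outcome, and checking it against the usual $1 - \mathrm{RSS}/\mathrm{TSS}$ expression (the two agree when a constant is included):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
D = rng.normal(size=n)
Y = 2.0 + 1.0 * D + rng.normal(size=n)
X = np.column_stack([np.ones(n), D])

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_fit = X @ beta_hat

r2 = Y_fit.var() / Y.var()    # V-hat[Y_fit] / V-hat[Y]  (empirical variances, 1/n convention)
r2_alt = 1 - ((Y - Y_fit) ** 2).sum() / ((Y - Y.mean()) ** 2).sum()
print(r2, r2_alt)
```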
Particular case of SLR
We now consider the particular case when $X = (1, D)'$, with $D \in \mathbb{R}^{\Omega}$: an intercept and one scalar (1-by-1) regressor.
Property of the OLS estimator in SLR

In that case, the OLS estimator is the empirical counterpart of the theoretical SLR coefficients: $\hat{\beta}_D^S = \widehat{\mathrm{Cov}}(Y, D) / \widehat{V}[D]$ and $\hat{\alpha}_0 = \widehat{E}[Y] - \hat{\beta}_D^S\, \widehat{E}[D]$.
Links between simple and multiple linear regressions
Frisch-Waugh theorem
"Coefficient" is used here to refer either to a theoretical coefficient or to an OLS estimator.

The particular case of a simple linear regression provides a simple and quite intuitive expression of the OLS estimator $\hat{\beta}_D^S \in \mathbb{R}^{\Omega}$, a real random variable (a stochastic estimator), and of the theoretical coefficient $\beta_D^S \in \mathbb{R}$, a non-stochastic real number (a non-stochastic parameter); the same equality holds with the empirical counterparts (symbolically, simply adding the hats) for $\hat{\beta}_D^S$.
Theorem 3 (Frisch-Waugh (in population – level 1)):
Theorem 3 operates at level 1, in population. The same result holds with the empirical, in-sample counterparts. We write it directly with the $X = (1, D, G')'$ notation:
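In its standard (Frisch–Waugh–Lovell) form, the result says that the coefficient $\beta_D$ on $D$ in the regression of $Y$ on $(1, D, G')'$ equals the slope coefficient obtained by first regressing $D$ on $(1, G')'$, taking the residual, and then regressing $Y$ on that residual. A minimal in-sample sketch with illustrative simulated data (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
G = rng.normal(size=(n, 2))
D = 0.5 * G[:, 0] + rng.normal(size=n)               # D correlated with G
Y = 1.0 + 2.0 * D + G @ np.array([1.0, -1.0]) + rng.normal(size=n)

def ols(Xmat, y):
    return np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)

# Long regression: Y on (1, D, G); the coefficient on D is beta_D
X_long = np.column_stack([np.ones(n), D, G])
beta_D_long = ols(X_long, Y)[1]

# Frisch-Waugh: residualize D on (1, G), then regress Y on that residual
X_1G = np.column_stack([np.ones(n), G])
D_resid = D - X_1G @ ols(X_1G, D)
beta_D_fw = (D_resid @ Y) / (D_resid @ D_resid)

print(beta_D_long, beta_D_fw)   # identical up to floating-point error
```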
Algebraic link between “short” and “long” regressions
This is also called the "Omitted Variable Bias" (OVB) formula.

Let $Y \in \mathbb{R}^{\Omega}$, $X = (1, D, G')'$ with $D \in \mathbb{R}^{\Omega}$ and $G \in (\mathbb{R}^{\dim(X)-2})^{\Omega}$, where $\dim(X) > 2$.

The multiple linear regression of $Y$ on $X$ is well defined and we write it:

$$L[Y \mid X] = L[Y \mid 1, D, G] = \alpha_0 + \beta_D D + G'\beta_G.$$

The SLR of $Y$ on $(1, D)$ is well defined and we write it:

$$L[Y \mid 1, D] = \alpha_0^S + \beta_D^S D.$$

For each component $G_j$ of $G = (G_1, \dots, G_p)'$, with $p := \dim(X) - 2 = \dim(G) \geq 1$, the SLR of $G_j$ on $(1, D)$ is well defined and we write it:

$$L[G_j \mid 1, D] = \alpha_j + \lambda_j D. \quad \text{(omitted on included)}$$

For any $j \in \{1, \dots, p\}$, $\lambda_j$ is thus the slope coefficient in the linear regression of $G_j$ on $D$ (and a constant). We define $\lambda := (\lambda_1, \dots, \lambda_p)' \in \mathbb{R}^p$, the (column) vector collecting those slope coefficients. Then, we have the equality

$$\beta_D^S = \beta_D + \lambda'\beta_G.$$
The same holds in sample, at level 2, with OLS estimators instead of theoretical coefficients. In a nutshell, it says:

$$\text{short} = \text{long} + \text{coefficients of omitted (in the long regression)} \times \text{coefficients of omitted on included}.$$
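A numerical check of this decomposition in sample (assuming NumPy, with illustrative simulated data): the identity $\hat{\beta}_D^S = \hat{\beta}_D + \hat{\lambda}'\hat{\beta}_G$ holds exactly, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
G = rng.normal(size=(n, 2))                              # omitted regressors
D = 0.8 * G[:, 0] - 0.3 * G[:, 1] + rng.normal(size=n)   # included regressor
Y = 1.0 + 2.0 * D + G @ np.array([1.5, -0.5]) + rng.normal(size=n)

def ols(Xmat, y):
    return np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)

# Long regression: Y on (1, D, G)
coefs_long = ols(np.column_stack([np.ones(n), D, G]), Y)
beta_D, beta_G = coefs_long[1], coefs_long[2:]

# Short regression: Y on (1, D)
beta_D_short = ols(np.column_stack([np.ones(n), D]), Y)[1]

# lambda_j: slope of each omitted G_j on (1, D), i.e. "omitted on included"
lam = np.array([ols(np.column_stack([np.ones(n), D]), G[:, j])[1] for j in range(2)])

print(beta_D_short, beta_D + lam @ beta_G)   # short = long + lambda' beta_G
```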