
ECONOMETRICS

Multiple Linear Regression


Professor: Antonio Di Paolo

Degree in International Business


Academic Year 2019/2020
Universitat de Barcelona

1
Multiple Linear Regression (1)
 The Simple Linear Regression appears to be an intuitive and rather simple statistical tool, but its
limitation is quite evident: there are usually other factors that could affect "Y" and that may be
related to "X" as well.
A simple regression does not allow for a “ceteris paribus” interpretation of the effect of X on Y!!
- The regression framework can be easily generalized to the inclusion of a certain (fixed) number of
observed covariates.
 What is the effect of X on Y, keeping constant all the other (observed) conditioning factors of Y?
 This kind of question can be addressed with a Multiple Linear Regression.
- First, we will present the mechanics and interpretation of the OLS estimator for Multiple Linear Regression
(i.e. a regression that includes several explanatory variables at the same time).
- Second, we will define the underlying assumptions under which the OLS estimator represents a good
approximation of the statistical relationship under investigation and the properties that OLS has if such
conditions are satisfied.
- Third, we will derive the tools that can be employed to carry out statistical inference from the model's
estimates (which are valid only under some "additional" assumptions).
2
Multiple Linear Regression (2)
 Definitions and notation:
- yi = dependent (endogenous) variable
- Xi = (1, x1i, x2i, x3i, …, xki) = vector of explanatory (exogenous) variables ((k + 1) elements, including the constant for the intercept)
- β = (α, β1, β2, β3, ...., βk) = vector of parameters to be estimated (intercept and slopes for each X)
- εi = error term (random disturbance)

 Linear Regression Model with k explanatory variables:


$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \cdots + \beta_k x_{ki} + \varepsilon_i$

Matrix notation: $y_i = \beta' X_i + \varepsilon_i$, i.e. stacking all observations, $Y = X\beta + \varepsilon$:

$Y_{(n\times 1)} = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix};\quad
\beta_{((k+1)\times 1)} = \begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix};\quad
X_{(n\times (k+1))} = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & x_{23} & \cdots & x_{2k} \\ 1 & x_{31} & x_{32} & x_{33} & \cdots & x_{3k} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nk} \end{pmatrix} = \begin{pmatrix} x_1' \\ x_2' \\ x_3' \\ \vdots \\ x_n' \end{pmatrix};\quad
\varepsilon_{(n\times 1)} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}$
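To make the notation concrete, here is a minimal NumPy sketch (not part of the original slides; the variable names and values are illustrative) of how the stacked objects Y, X, β and ε fit together:

import numpy as np

n, k = 5, 2                                  # 5 observations, 2 explanatory variables
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])    # design matrix, shape n x (k+1)
beta = np.array([1.0, 2.0, 3.0])             # (alpha, beta1, beta2), shape (k+1,)
eps = rng.normal(size=n)                     # error vector, shape (n,)
y = X @ beta + eps                           # stacked model Y = X beta + eps
print(X.shape, beta.shape, y.shape)          # (5, 3) (3,) (5,)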
3
Multiple Linear Regression (3)
 The OLS optimization problem with “k” explanatory variables.
 Intuition: what is the linear combination of Xs that provides the best approximation to explain the
dependent variable (i.e. that makes the sum of squared residuals as small as possible)?

- Objective function: minimization of the Sum of Squared Residuals (SSR)


$SSR = \sum_{i=1}^{n}\big(Y_i - \hat{\alpha} - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i} - \cdots - \hat{\beta}_k x_{ki}\big)^2 = \sum_{i=1}^{n}\hat{\varepsilon}_i^{\,2} = (Y - X\hat{\beta})'(Y - X\hat{\beta})$

$\Rightarrow \min_{\hat{\beta}} SSR \;\Rightarrow\; \dfrac{\partial SSR}{\partial\hat{\beta}} = \dfrac{\partial\big(Y'Y - \hat{\beta}'X'Y - Y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta}\big)}{\partial\hat{\beta}} = \dfrac{\partial\big(Y'Y - 2Y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta}\big)}{\partial\hat{\beta}} = 0$

$\Rightarrow \dfrac{\partial SSR}{\partial\hat{\beta}} = 0:\;\; -2X'Y + 2X'X\hat{\beta} = 0 \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}(X'Y)$   ← the vector of the "k" estimated slope coefficients plus the intercept.
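As an illustration, a minimal NumPy sketch (not from the slides) that applies the closed-form formula β̂ = (X'X)⁻¹X'Y to data simulated from the DGP used on the next slides; the seed and sample size are arbitrary, and x1 ~ N(5, 2) is read as mean 5 and standard deviation 2 (an assumption about the notation):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(5, 2, n)            # x1 ~ N(5, 2); 2 taken as the std. dev. (assumption)
x2 = rng.uniform(1, 50, n)          # x2 ~ U(1, 50)
eps = rng.normal(0, 1, n)           # eps ~ N(0, 1)
y = 2 + 0.5 * x1 + 1.2 * x2 + eps

X = np.column_stack([np.ones(n), x1, x2])       # n x (k+1) design matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solves (X'X) beta = X'y
print(beta_hat)                                 # should be close to (2, 0.5, 1.2)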
4
Multiple Linear Regression (4)
 What is really done by OLS in a multiple regression? Graphical evidence with simulated data
 $y_i = 2 + 0.5\,x_{1i} + 1.2\,x_{2i} + \varepsilon_i;\;\; \varepsilon_i \sim N(0,1);\;\; x_{1i} \sim N(5,2);\;\; x_{2i} \sim U(1,50)$

5
Multiple Linear Regression (5)
 What is really done by OLS in a multiple regression? Graphical evidence with simulated data
 $y_i = 2 + 0.5\,x_{1i} + 1.2\,x_{2i} + \varepsilon_i;\;\; \varepsilon_i \sim N(0,1);\;\; x_{1i} \sim N(5,2);\;\; x_{2i} \sim U(1,50)$

6
Multiple Linear Regression (6)
 What is really done by OLS in a multiple regression? Graphical evidence with simulated data
 $y_i = 2 + 0.5\,x_{1i} + 1.2\,x_{2i} + \varepsilon_i;\;\; \varepsilon_i \sim N(0,1);\;\; x_{1i} \sim N(5,2);\;\; x_{2i} \sim U(1,50)$

7
Multiple Linear Regression (7)
 The solution of the OLS optimization problem provides, by construction, the best linear
approximation of "Y" based on x1 to xk plus a constant (i.e. $\hat{\beta} = (X'X)^{-1}X'Y$).

Without further assumptions, this algebraic tool has limited value:


- Is it a “good representation” of the phenomenon under investigation?
- Can we make inference about the “estimated” parameters?
- Can we use the model to make “predictions” about future behavior of the outcome of interest?
- More generally, can we use OLS to say something about things that are not observed (yet)?

- We start by considering that the (linear) relationship between Y and the Xs ($y_i = \beta' X_i + \varepsilon_i$) is valid for any
well-defined population of interest (i.e. all households of a country, all firms of a given industry, etc.).

- This linear equation is assumed to be valid for any possible observation, while we only observe a sample of
“n” observations that is a single possible realization of all possible samples of the same size that could have
been drawn from the same population:
In this case, Y, X and ε are random variables ⇒ $y_i = \beta' X_i + \varepsilon_i$ becomes a statistical model!
 β is a vector of unknown parameters that characterize the population of interest.
In this setting, the statistical model is tautological without further assumptions (i.e. for any value of β, we
can always define ε such that the linear statistical model that explains Y holds).
8
Multiple Linear Regression (8)
 The coefficients' vector obtained by OLS ($\hat{\beta} = (X'X)^{-1}X'Y$) represents the estimate of the
true population parameters' vector (β).
OLS is thus an estimator: it represents the rule that says how a given sample is translated into an
approximate value of β.

 The OLS estimator is itself a random variable (i.e. a new sample means a new estimate):
- Because the sample is randomly drawn from a larger population.
- Because the data are generated by some random process.
How well the OLS estimator represents the true value of the unknown "betas" depends on
the assumptions that we are willing to make.
Given the assumptions that will be made (see next), we can evaluate the quality of OLS as an
estimator based on the properties that it has (which ultimately depend on the validity of the underlying
assumptions).

9
Multiple Linear Regression (9)
 Classical OLS Assumptions (Gauss-Markov Assumptions):
1) Conditional Exogeneity (independence between the explanatory variables and the error term):
$E[\varepsilon_i \mid X_i] = 0 \;\Rightarrow\; E[\varepsilon_i] = 0;\;\; E[X_i\varepsilon_i] = 0$
This assumption states that the expected value of ε is 0 for any value of the Xs.
It implies that the expected value of ε is zero and that ε is not correlated with the Xs (notice that the converse
is not necessarily true).
When is this assumption not satisfied? Several general cases are listed at the end of these slides.*

2) Homoskedasticity (constant variance of the error term *):


$Var(\varepsilon_i \mid X_i) = E[\varepsilon_i^2 \mid X_i] = \sigma^2 \;\text{ for each } i = 1, \dots, n$
$\Rightarrow Var(Y_i \mid X_i) = Var(\varepsilon_i \mid X_i) = \sigma^2 \;\text{ for each } i = 1, \dots, n$

3) No Serial Correlation/Autocorrelation (error terms of different observations are uncorrelated):


$Cov(\varepsilon_i, \varepsilon_j \mid X) = E[\varepsilon_i\varepsilon_j \mid X] = 0 \;\text{ for each } i \neq j$

Intuitively, taken together, assumptions 1-3 mean that the matrix of regressor values X does not provide any
information about the first and second moments of the distribution of the unobservables (ε).
10
Multiple Linear Regression (10)
 Classical OLS Assumptions (Gauss-Markov Assumptions):
4) Linearity of the conditional expectation of Y (linear function of the parameters):

$E[Y_i \mid X_i] = \alpha + \sum_{j=1}^{k}\beta_j x_{ji} = \beta' X_i$

Assumption 4 together with assumption 1 indicate that the multiple linear regression has a “ceteris paribus”
interpretation.

5) Full Rank of X (no perfect multicollinearity, i.e. no exact linear relationship between the explanatory variables):

$Rank(X) = k + 1 < n$
This assumption implies that the matrix X'X is invertible, which is necessary to obtain the OLS coefficients' vector
(i.e. $\hat{\beta} = (X'X)^{-1}X'Y$).
 The term “n-(k+1)” represents the “degrees of freedom” with which the model is estimated.

- In general, we are able to test the validity of assumptions 2, 3 and 5. The failure of assumptions 2 and 3 can generally be
accommodated (e.g. using heteroskedasticity/autocorrelation-robust standard errors or Generalised Least Squares).
- Assumption 4 is generally valid, or at least can be taken as a valid approximation (although non-linear methods also exist).
- Assumption 1 is generally untestable (although an imperfect test can be done by comparing OLS with alternative
estimators, e.g. Instrumental Variables/Two-Stage Least Squares).
11
Multiple Linear Regression (11)
 What are the properties of the OLS estimator under the Gauss-Markov hypotheses?
 i.e. how well does OLS explain reality when assumptions 1-5 are satisfied?
 Let's first consider the properties of OLS in a small (finite) sample.

- Unbiasedness of the OLS Estimator:

$E[\hat{\beta}\mid X] = E\big[(X'X)^{-1}X'Y \mid X\big] = E\big[(X'X)^{-1}X'(X\beta + \varepsilon) \mid X\big] = \beta + (X'X)^{-1}X'\,E[\varepsilon \mid X]$

$\Rightarrow E[\hat{\beta}] = \beta \;\;\text{if } E[\varepsilon \mid X] = 0 \;(\text{under hypothesis 1})$

If E[ε|X] = 0 (which implies E[εX] = 0 and E[ε] = 0), the OLS estimator of the unknown parameters β is unbiased, which means that,
on average, in repeated sampling, the OLS estimator equals the population parameters.

 Notice that assumptions 2 and 3 play no role in the "correctness" (unbiasedness) of the OLS estimator. This means that OLS
is unbiased even when assumptions 2 and 3 fail.
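A minimal Monte Carlo sketch (not from the slides; the DGP and seed are illustrative) of what "unbiased in repeated sampling" means in practice: averaging the OLS estimates over many simulated samples from the same DGP recovers the true parameter vector.

import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
estimates = np.empty((reps, 3))
for r in range(reps):
    x1 = rng.normal(5, 2, n)          # assumption: 2 is the std. dev.
    x2 = rng.uniform(1, 50, n)
    y = 2 + 0.5 * x1 + 1.2 * x2 + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x1, x2])
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)
print(estimates.mean(axis=0))         # approximately (2, 0.5, 1.2)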

12
Multiple Linear Regression (12)
 In addition to knowing that, on average, the OLS estimator is correct, we would also like to
make statements about how (un)likely it is to be far off in a given sample.
 How can we characterize the distribution of the OLS estimator?
 What is the degree of precision of $\hat{\beta}$ in providing a "correct" estimation of the true relationship of interest?

- The (Conditional) Variance of the OLS Estimator:


$Var(\hat{\beta}\mid X) = E\big[(\hat{\beta}-E[\hat{\beta}])(\hat{\beta}-E[\hat{\beta}])' \mid X\big] = E\big[(\hat{\beta}-\beta)(\hat{\beta}-\beta)' \mid X\big] = E\big[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X\big]$

$= (X'X)^{-1}X'\,E[\varepsilon\varepsilon' \mid X]\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1} \;(\text{under hypotheses 2-3})$

...in fact, assumptions 2 and 3 imply that:

$E[\varepsilon\varepsilon' \mid X] = \Omega = \begin{pmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2\begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix} = \sigma^2 I_n$

13
Multiple Linear Regression (13)
 Notice that the previous result involves the variance of the error term (ε), which is by
definition unknown (since the errors are unobservable).
In order to obtain an (unbiased) estimate of the variance of the OLS coefficients (under assumptions 1-5),
we use the sample variance of the residuals ($\hat{\varepsilon}_i$):

$\widehat{Var}(\varepsilon_i \mid X_i) = \hat{\sigma}^2 = \dfrac{\hat{\varepsilon}'\hat{\varepsilon}}{n-(k+1)} = \dfrac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{n-(k+1)} = \dfrac{\sum_{i=1}^{n}\hat{\varepsilon}_i^{\,2}}{n-(k+1)}$
- Notice that the estimator of the error term’s variance has a “degrees of freedom” correction, since the
standard formula of the variance would provide a biased estimation of σ2.
Substituting the estimated variance of the errors into the formula of the coefficients’ variance yields:
$\widehat{Var}(\hat{\beta}\mid X) = \hat{\sigma}^2 (X'X)^{-1} = \dfrac{\sum_{i=1}^{n}\hat{\varepsilon}_i^{\,2}}{n-(k+1)}\,(X'X)^{-1} = \dfrac{SSR}{n-(k+1)}\,(X'X)^{-1}$
- This is the so-called Variance-Covariance Matrix of the OLS Coefficients.
- The square roots of the elements on the main diagonal of this matrix are the Coefficients' Standard Errors.
- The elements outside the main diagonal represent the covariances between different coefficients.
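A minimal sketch (continuing the illustrative simulated example, not from the slides) of the degrees-of-freedom-corrected σ̂², the variance-covariance matrix and the standard errors:

import numpy as np

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(5, 2, n)
x2 = rng.uniform(1, 50, n)
y = 2 + 0.5 * x1 + 1.2 * x2 + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x1, x2])
k_plus_1 = X.shape[1]

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k_plus_1)          # SSR / (n - (k+1))
vcov = sigma2_hat * np.linalg.inv(X.T @ X)           # variance-covariance matrix
std_errors = np.sqrt(np.diag(vcov))                  # square roots of the diagonal
print(beta_hat, std_errors)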
14
Multiple Linear Regression (14)
 The components of the OLS Coefficients’ Variance.
It is possible to analyze the factors that affect the variance of the estimated coefficients:
$Var(\hat{\beta}\mid X) = \hat{\sigma}^2 (X'X)^{-1} \;\Rightarrow\; Var(\hat{\beta}_j) = \dfrac{\hat{\sigma}^2}{SST_j\,(1-R_j^2)} = \dfrac{\hat{\sigma}^2}{\big(\sum_{i=1}^{N}(x_{ij}-\bar{x}_j)^2\big)(1-R_j^2)}$

- The variance of $\hat{\beta}_j$ increases with the estimated error variance ($\hat{\sigma}^2$).
 The more "noise" in the equation that explains y, the more imprecision in the estimation of the coefficients.
- The variance of $\hat{\beta}_j$ decreases with the amount of variation in $x_j$ ($SST_j$).
 More variability in the explanatory variable(s) and larger samples (i.e. more degrees of freedom, n-(k+1)) reduce the
variance of the coefficients.
- The variance of $\hat{\beta}_j$ increases with the strength of the relationship between $x_j$ and the other explanatory variables ($R_j^2$, the R-squared from regressing $x_j$ on the other regressors).
If the explanatory variables are excessively correlated (multicollinearity), their coefficients are imprecisely estimated.

Overall, an excessive variance in the estimated coefficients means less precise estimators, which
translates into larger confidence intervals and less accurate hypothesis testing (see later).
15
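A minimal numerical check (not from the slides; the correlated x2 is an assumption introduced to create some multicollinearity) that the formula above reproduces the corresponding diagonal element of σ̂²(X'X)⁻¹:

import numpy as np

rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(5, 2, n)
x2 = 0.5 * x1 + rng.uniform(1, 50, n)     # assumption: make x1 and x2 correlated
y = 2 + 0.5 * x1 + 1.2 * x2 + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x1, x2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])

# Auxiliary regression of x1 on the other regressors (constant and x2) to get R_j^2.
Z = np.column_stack([np.ones(n), x2])
gamma = np.linalg.solve(Z.T @ Z, Z.T @ x1)
r2_j = 1 - np.sum((x1 - Z @ gamma) ** 2) / np.sum((x1 - x1.mean()) ** 2)
sst_j = np.sum((x1 - x1.mean()) ** 2)

var_formula = sigma2_hat / (sst_j * (1 - r2_j))
var_direct = (sigma2_hat * np.linalg.inv(X.T @ X))[1, 1]
print(var_formula, var_direct)            # the two numbers coincide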
Multiple Linear Regression (15)
 Overall, what can be said about the OLS estimator under the Gauss-Markov assumptions?
- In finite samples (i.e. regardless of sample size), the Gauss-Markov Theorem states that the OLS estimator is
BLUE: the Best Linear Unbiased Estimator.
$\hat{\beta}$ is the most accurate estimator, since it has the lowest possible variance (i.e. it is the most precise),
among the estimators that are linear and unbiased.
This result is quite useful, especially because it can be used as a benchmark for a) cases in which any of the
underlying hypotheses fail or b) other estimators that can be alternatives to OLS.

- What about the interpretation of the estimated coefficients in a multiple regression?


In terms of the quantitative meaning, all the results (regarding linear/log models) that we obtained for the
simple linear regression still hold. The R-squared has also the same meaning and is computed in the same way.
 Moreover (if assumption 1 is valid), the multiple regression has a "ceteris paribus" interpretation:
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
$\dfrac{\partial y}{\partial x_1} = \beta_1 \;\Rightarrow\; \hat{\beta}_1$ is the (estimated) effect of changing $x_1$, keeping $x_2$ fixed (as well as any other element in ε, which is
assumed to be uncorrelated with both $x_1$ and $x_2$).
16
Multiple Linear Regression (16)
 Can we use the regression model to “predict” the values of the dependent variable under
different scenarios?
- It is indeed possible to use the estimated coefficients to understand what can be expected (in terms of the
value of the outcome of interest) when the explanatory variables take specific values (not necessarily observed
in the estimation sample).
- Moreover, when working with time series data, the regression model can be used to “forecast” future values
of the dependent variable (i.e. to say something about things that haven’t occurred yet!).
Under the Gauss-Markov assumptions (especially assumptions 1 and 4), the OLS coefficients enable
constructing an unbiased predictor of the dependent variable:

$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i \;\Rightarrow\; \hat{y}_i = \widehat{E}[y_i \mid X_i] = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + E[\varepsilon_i] = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}$

 Since $E[\hat{\beta}] = \beta \;\Rightarrow\; E[\hat{y}_i - y_i] = 0$, i.e. the predictor of the dependent variable is unbiased.


If our aim is to carry out a prediction for specific values of the Xs (i.e. $x_1^*, x_2^*$), the predictor will be:
$\hat{y}^* = \hat{\alpha} + \hat{\beta}_1 x_1^* + \hat{\beta}_2 x_2^* = \hat{\beta}' X^*$

17
Multiple Linear Regression (17)
 Can we use the regression model to “predict” the values of the dependent variable under
different scenarios?
- As for the estimated coefficients, it is possible to evaluate the precision with which the model predicts the
outcome for specific values of the regressors by deriving its variance.
$Var(\hat{y}^*) = Var(\hat{\beta}'X^*) = X^{*\prime}\,Var(\hat{\beta})\,X^* = \sigma^2\, X^{*\prime}(X'X)^{-1}X^*$
- However, the variance of the prediction is only an indication of the variation in the predictor if different
samples were drawn (i.e. the variation in the predictor owing to variation in $\hat{\beta}$).
In order to better appreciate how accurate the prediction is, we need to compute the variance of the
prediction error ($\hat{y}^* - y^*$).

$\hat{y}^* - y^* = \hat{\beta}'X^* - \beta'X^* - \varepsilon^* = X^{*\prime}(\hat{\beta} - \beta) - \varepsilon^* \;\Rightarrow\; Var(\hat{y}^* - y^*) = \sigma^2 + \sigma^2\, X^{*\prime}(X'X)^{-1}X^*$
Notice that in a simple regression model the last formula simplifies to:
$Var(\hat{y}^* - y^*) = \sigma^2 + \sigma^2\left(\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right)$
Intuition: the further the value of x* is from its sample mean, the larger will be the variance of the predictor.
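A minimal sketch (continuing the illustrative simulated example; the values x1* = 6 and x2* = 30 are hypothetical) of a point prediction and the standard error of the prediction error:

import numpy as np

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(5, 2, n)
x2 = rng.uniform(1, 50, n)
y = 2 + 0.5 * x1 + 1.2 * x2 + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x1, x2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])

x_star = np.array([1.0, 6.0, 30.0])              # hypothetical values: x1* = 6, x2* = 30
y_star_hat = x_star @ beta_hat                   # point prediction
var_pred_error = sigma2_hat * (1 + x_star @ np.linalg.inv(X.T @ X) @ x_star)
print(y_star_hat, np.sqrt(var_pred_error))       # prediction and its standard error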
18
Multiple Linear Regression (18)
 An additional hypothesis (useful but not trivial): Normality of the Error Term
- So far we have "only" assumed that the error terms (εi) are independent of the Xs, are mutually uncorrelated and
have constant variance. Therefore, we haven't established any assumption about the "shape" of the error
terms' distribution.
- In finite samples, an additional assumption is needed for the purpose of carrying out statistical inference from the
regression model: Joint Normality of the Error Terms.
$\varepsilon_i \sim NID(0, \sigma^2)$
 Notice that if the error terms are normally distributed, then the conditional distribution of y
(given the Xs) also follows a normal distribution.
- What is the implication of this hypothesis in terms of the distribution of the beta coefficients?

$\varepsilon_i \sim NID(0, \sigma^2) \;\Rightarrow\; \hat{\beta} \sim N\big(\beta,\, \sigma^2 (X'X)^{-1}\big)$

This result actually provides the basis for carrying out statistical inference (i.e. hypothesis testing) from the
Multiple Linear Regression Model (see later).

19
Multiple Linear Regression (19)
 Asymptotic properties of the OLS estimator
- We now know the properties of the OLS estimator in finite (i.e. small) samples, which essentially depend
on the assumptions we made about the error term (ε).
What happens to OLS when the sample size grows, hypothetically, infinitely large (i.e. $n \to \infty$)?
- Let’s consider the so-called Asymptotic Properties of the OLS estimator, which are derived using the standard results
from the Asymptotic Theory.
Specifically, it is interesting to analyze whether it is possible to “relax” some of the underlying assumptions (1-5)
when sample size goes to infinity.
- Consistency of the OLS estimator:
$\lim_{n\to\infty} \Pr\big(|\hat{\beta}_k - \beta_k| > \delta\big) = 0 \;\text{ for all } \delta > 0 \;\;\equiv\;\; \text{plim}_{n\to\infty}\, \hat{\beta}_k = \beta_k$
It is possible to show that the consistency property is satisfied if $E[\varepsilon X] = 0$, which is a weaker hypothesis than (1)
$E[\varepsilon \mid X] = 0$ (i.e. the latter implies the former, but the converse is not necessarily true).

- Asymptotic Normality of the OLS estimator:

$\hat{\beta} \overset{a}{\sim} N\big(\beta,\; \sigma^2\,\Sigma_{xx}^{-1}/N\big)$, where $\Sigma_{xx}$ denotes the probability limit of $X'X/N$.

When the sample size is rather large, the "approximate" normality of the beta coefficients is a direct result from
asymptotic theory.
20
Statistical Inference (1)
 It is possible to carry out Statistical Inference on the estimated OLS coefficients.
- In order to perform statistical inference from the model’s estimates, the hypothesis of Joint Normality of the
error terms is crucial.

$\varepsilon_i \sim NID(0, \sigma^2) \;\Rightarrow\; \hat{\beta} \sim N\big(\beta,\, \sigma^2 (X'X)^{-1}\big)$
 Although normality of the error terms might not be satisfied in finite samples, for large samples it is possible
to rely on Asymptotic Theory, which leads to the Asymptotic Normality of the OLS estimator.
This result actually provides the basis for carrying out statistical inference from the Multiple Linear Regression
Model, specifically we will be able to carry out:
- Hypothesis testing for single coefficients.
- Confidence Intervals of estimated coefficients.
- Confidence Intervals of predictions.
- Hypothesis testing for linear combinations of coefficients and multiple linear restrictions.

21
Statistical Inference (2)
 Hypothesis testing for single coefficients.
- Given that, if the error terms are normally distributed, the estimated coefficients are also normally
distributed ($\hat{\beta} \sim N(\beta, \sigma^2(X'X)^{-1})$), it is possible to construct a t-statistic for a single unknown population
parameter $\beta_k$:

$t\text{-statistic}(\hat{\beta}_k) = t_k = \dfrac{\hat{\beta}_k - \beta_k^0}{s.e.(\hat{\beta}_k)}$
- Two-sided test:
null hypothesis (assumed to be true)  $H_0: \beta_k = \beta_k^0$
alternative hypothesis  $H_1: \beta_k \neq \beta_k^0$
$\Rightarrow \Pr\big(|t_k| > t_{n-(k+1);\,\alpha/2}\big) = \alpha$ (significance level).
Reject H0 if $|t_k| > t_{n-(k+1);\,\alpha/2}$ or P-value < α (α usually 0.1, 0.05 or 0.01, but always chosen by the researcher).

One-sided test:
null hypothesis (assumed to be true)  $H_0: \beta_k \leq \beta_k^0$
alternative hypothesis  $H_1: \beta_k > \beta_k^0$
$\Rightarrow \Pr\big(t_k > t_{n-(k+1);\,\alpha}\big) = \alpha$ (significance level).
Reject H0 if $t_k > t_{n-(k+1);\,\alpha}$ or P-value < α (α usually 0.1, 0.05 or 0.01, but always chosen by the researcher).
22
Statistical Inference (3)
 Hypothesis testing for single coefficients.

- The most common (two-sided) test involves the statistical significance of the estimated βk

null hypothesis (assumed to be true)  $H_0: \beta_k = 0$
alternative hypothesis  $H_1: \beta_k \neq 0$
$\Rightarrow t\text{-statistic}(\hat{\beta}_k) = t_k = \dfrac{\hat{\beta}_k}{s.e.(\hat{\beta}_k)}$

Reject H0 (i.e. $\hat{\beta}_k$ is statistically different from 0, or is said to be significant) if $|t_k| > t_{n-(k+1);\,\alpha/2}$ or P-value < α.

- Example:
Estimated model (s.e. within parentheses), n = 100:  $y_i = 2 + \underset{(0.15)}{0.6}\,x_{1i} + \underset{(0.9)}{1.3}\,x_{2i} + \hat{\varepsilon}_i$

$H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0 \;\Rightarrow\; t\text{-statistic}(\hat{\beta}_1) = t_1 = \dfrac{0.6}{0.15} = 4 > 1.98472 = t_{100-3;\,0.05/2} \;\Rightarrow\; \text{Reject } H_0$
(we selected a significance level α of 5%)

Alternatively: P-value$(t_{100-3} = 4) = 0.000123762 < 0.05 \;\Rightarrow\; \text{Reject } H_0$

 What happens with $\hat{\beta}_2$?
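A minimal sketch (reproducing the example's numbers; SciPy is used only for the t distribution) of the two significance tests, including the one for β̂2 asked above:

from scipy import stats

n, k = 100, 2                      # 2 slopes plus an intercept
df = n - (k + 1)                   # degrees of freedom = 97

beta1_hat, se1 = 0.6, 0.15
t1 = beta1_hat / se1                          # 4.0
crit = stats.t.ppf(1 - 0.05 / 2, df)          # two-sided 5% critical value ~ 1.9847
p1 = 2 * stats.t.sf(abs(t1), df)              # two-sided p-value ~ 0.000124

beta2_hat, se2 = 1.3, 0.9
t2 = beta2_hat / se2               # ~ 1.44: below the critical value, so beta2 is
p2 = 2 * stats.t.sf(abs(t2), df)   # not statistically different from zero at 5%
print(t1, crit, p1, t2, p2)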
23
Statistical Inference (4)

24
Statistical Inference (5)

25
Statistical Inference (6)

26
Statistical Inference (3)
Confidence Intervals of estimated coefficients

 The Confidence Interval (CI) provides the interval of all values of $\beta_k^0$ for which the null hypothesis $H_0: \beta_k = \beta_k^0$
would not be rejected (i.e. the range of values of the true $\beta_k$ that are not unlikely given the data).

$-t_{n-(k+1);\,\alpha/2} < \dfrac{\hat{\beta}_k - \beta_k}{s.e.(\hat{\beta}_k)} < t_{n-(k+1);\,\alpha/2} \;\Rightarrow\; \underbrace{\hat{\beta}_k - t_{n-(k+1);\,\alpha/2}\cdot s.e.(\hat{\beta}_k)}_{\text{Lower Bound of the CI}} < \beta_k < \underbrace{\hat{\beta}_k + t_{n-(k+1);\,\alpha/2}\cdot s.e.(\hat{\beta}_k)}_{\text{Upper Bound of the CI}}$

$\Rightarrow CI(\hat{\beta}_k) = \hat{\beta}_k \pm t_{n-(k+1);\,\alpha/2}\cdot s.e.(\hat{\beta}_k)$

Similarly, the Confidence Interval of the Prediction is: $CI(\hat{y}^*) = \hat{\beta}'X^* \pm t_{n-(k+1);\,\alpha/2}\cdot \hat{\sigma}\sqrt{1 + X^{*\prime}(X'X)^{-1}X^*}$

- Example:
Estimated model (s.e. within parentheses), n = 100:  $y_i = 2 + \underset{(0.15)}{0.6}\,x_{1i} + \underset{(0.9)}{1.3}\,x_{2i} + \hat{\varepsilon}_i$

95% Confidence Interval for $\beta_1$:  $CI(\hat{\beta}_1) = 0.6 \pm 1.98\cdot 0.15 \approx [0.30,\; 0.90]$
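A minimal sketch (using the example's numbers; SciPy is used only for the t critical value) of the 95% confidence interval for β̂1:

from scipy import stats

n, k = 100, 2
df = n - (k + 1)                              # 97 degrees of freedom
crit = stats.t.ppf(1 - 0.05 / 2, df)          # ~ 1.98

beta1_hat, se1 = 0.6, 0.15
ci_lower = beta1_hat - crit * se1
ci_upper = beta1_hat + crit * se1
print(ci_lower, ci_upper)                     # roughly 0.30 and 0.90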

27
Statistical Inference (7)
 Hypothesis testing for linear combinations of coefficients.
- We are often interested in testing the statistical validity of restrictions based on linear combinations of
coefficients, for example:
null hypothesis (assumed to be true)  $H_0: \beta_1 = \beta_2 \;(\equiv \beta_1 - \beta_2 = 0)$
alternative hypothesis  $H_1: \beta_1 \neq \beta_2 \;(\equiv \beta_1 - \beta_2 \neq 0)$

 There are several ways to approach this kind of test:


1) Deriving the associated t-statistic:
$t\text{-statistic} = \dfrac{\hat{\beta}_1 - \hat{\beta}_2}{s.e.(\hat{\beta}_1 - \hat{\beta}_2)} = \dfrac{\hat{\beta}_1 - \hat{\beta}_2}{\sqrt{Var(\hat{\beta}_1) + Var(\hat{\beta}_2) - 2\,Cov(\hat{\beta}_1, \hat{\beta}_2)}}$

We would need $Cov(\hat{\beta}_1, \hat{\beta}_2)$, which can be retrieved from the variance-covariance matrix of the coefficients:

$Var(\hat{\beta}) = \hat{\sigma}^2(X'X)^{-1} = \dfrac{\hat{\varepsilon}'\hat{\varepsilon}}{n-(k+1)}\,(X'X)^{-1} = \begin{pmatrix} Var(\hat{\beta}_0) & \cdots & Cov(\hat{\beta}_0, \hat{\beta}_k) \\ \vdots & \ddots & \vdots \\ Cov(\hat{\beta}_k, \hat{\beta}_0) & \cdots & Var(\hat{\beta}_k) \end{pmatrix}$

Reject H0 if $|t| > t_{n-(k+1);\,\alpha/2}$ or P-value < α.


28
Statistical Inference (8)
 Hypothesis testing for linear combinations of coefficients.
- We are often interested in testing the statistical validity of restrictions based on linear combinations of
coefficients, for example:
null hypothesis (assumed to be true)  $H_0: \beta_1 = \beta_2 \;(\equiv \beta_1 - \beta_2 = 0)$
alternative hypothesis  $H_1: \beta_1 \neq \beta_2 \;(\equiv \beta_1 - \beta_2 \neq 0)$

There are several ways to approach this kind of test:


2) Reparametrizing the model in order to test $H_0: \beta_1 = \beta_2$ directly from a single coefficient (θ):
- The intuition is that it is possible to rearrange the model in order to test $H_0: \beta_1 = \beta_2$ as a t-test for a single
coefficient in an equation that contains the linear combination we want to test for.
- The expression $\beta_1 = \beta_2$ is equivalent to $\beta_1 - \beta_2 = 0$. Therefore, we can define $\theta = \beta_1 - \beta_2$, so $H_0: \beta_1 = \beta_2$ is
the same as $H_0: \theta = 0$.
- Therefore, we need to rewrite our equation so that it includes a coefficient that represents $\theta = \beta_1 - \beta_2$,
and test the null hypothesis $H_0: \theta = 0$, which is equivalent to $H_0: \beta_1 = \beta_2$.

29
Statistical Inference (9)
- Example: $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
$H_0: \beta_1 = \beta_2 \;(\equiv \beta_1 - \beta_2 = 0)$ vs $H_1: \beta_1 \neq \beta_2 \;(\equiv \beta_1 - \beta_2 \neq 0)$ ⇒ define $\theta = \beta_1 - \beta_2$ ⇒ $H_0: \theta = 0$ vs $H_1: \theta \neq 0$
- In order to reparametrize the model and obtain an expression that enables testing $H_0: \theta = 0$, we need to
create a new variable $z_i = x_{1i} + x_{2i}$, that is, z is the sum of x1 and x2.
- The new model to be estimated becomes:
$y_i = \alpha + \theta x_{1i} + \beta_2 z_i + \varepsilon_i = \alpha + \theta x_{1i} + \beta_2 (x_{1i} + x_{2i}) + \varepsilon_i$
From this new model, testing $H_0: \theta = 0$ is equivalent to testing the null hypothesis $H_0: \beta_1 = \beta_2$ in the original
model, because $\beta_1 = \theta + \beta_2$; indeed:
$y_i = \alpha + \theta x_{1i} + \beta_2 z_i + \varepsilon_i = \alpha + \theta x_{1i} + \beta_2 x_{1i} + \beta_2 x_{2i} + \varepsilon_i = \alpha + (\theta + \beta_2) x_{1i} + \beta_2 x_{2i} + \varepsilon_i \;\equiv\; \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$, with $\theta = \beta_1 - \beta_2$.
 Therefore, the t-statistic for $H_0: \beta_1 = \beta_2$ using method 1) is equivalent to the t-statistic for $H_0: \theta = 0$ in the
equation $y_i = \alpha + \theta x_{1i} + \beta_2 z_i + \varepsilon_i$, with $z_i = x_{1i} + x_{2i}$.
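A minimal numerical check (on illustrative simulated data, not from the slides) that method 1) and method 2) give the same t-statistic:

import numpy as np

def ols(X, y):
    """Return the coefficient vector and the variance-covariance matrix."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, sigma2 * np.linalg.inv(X.T @ X)

rng = np.random.default_rng(5)
n = 500
x1, x2 = rng.normal(5, 2, n), rng.uniform(1, 50, n)
y = 2 + 0.8 * x1 + 1.0 * x2 + rng.normal(0, 1, n)   # assumption: an arbitrary DGP

# Method 1: t = (b1 - b2) / sqrt(Var(b1) + Var(b2) - 2 Cov(b1, b2))
X = np.column_stack([np.ones(n), x1, x2])
b, V = ols(X, y)
t_method1 = (b[1] - b[2]) / np.sqrt(V[1, 1] + V[2, 2] - 2 * V[1, 2])

# Method 2: t-statistic on theta in the reparametrized model with z = x1 + x2
Z = np.column_stack([np.ones(n), x1, x1 + x2])
g, W = ols(Z, y)
t_method2 = g[1] / np.sqrt(W[1, 1])
print(t_method1, t_method2)    # the two t-statistics coincide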
30
Statistical Inference (10)
 Hypothesis testing for linear combinations of coefficients.
- We are often interested in testing the statistical validity of restrictions based on linear combinations of coefficients,
for example:
null hypothesis (assumed to be true)  $H_0: \beta_1 = \beta_2 \;(\equiv \beta_1 - \beta_2 = 0)$
alternative hypothesis  $H_1: \beta_1 \neq \beta_2 \;(\equiv \beta_1 - \beta_2 \neq 0)$
There are several ways to approach this kind of test:
3) Constructing an F-Statistic for (multiple) linear hypotheses:
- Obtain the Sum of Squared Residuals (or the R²) from the Unrestricted Model ($SSR_{UR}$ / $R^2_{UR}$):
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$  (UNrestricted model) $\Rightarrow SSR_{UR} / R^2_{UR}$
- Obtain the Sum of Squared Residuals (or the R²) from the Restricted Model ($SSR_R$ / $R^2_R$), which incorporates the
restriction that $\beta_1 = \beta_2 = \beta$:
$y_i = \alpha + \beta x_{1i} + \beta x_{2i} + \varepsilon_i \;\Rightarrow\; y_i = \alpha + \beta (x_{1i} + x_{2i}) + \varepsilon_i = \alpha + \beta z_i + \varepsilon_i$  (Restricted model) $\Rightarrow SSR_R / R^2_R$

$\Rightarrow F = \dfrac{(SSR_R - SSR_{UR})/q}{SSR_{UR}/(n-(k+1))} = \dfrac{(R^2_{UR} - R^2_R)/q}{(1 - R^2_{UR})/(n-(k+1))} \;\Rightarrow\; \text{Reject } H_0 \text{ if } F > F_{q,\,n-(k+1);\,\alpha} \text{ or } P\text{-value} < \alpha$

where q = number of restrictions, n = number of observations and k+1 = number of coefficients to be estimated.
This F-statistic is equivalent to the square of the t-statistic obtained from methods 1) or 2) (i.e. in general, F = t² when
the test involves a single restriction).
31
Statistical Inference (11)
 Joint Significance of the Estimated Coefficients.
- The F test can also be used to analyze the joint significance of all (or a subset of) the estimated coefficients:
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$
null hypothesis (assumed to be true)  $H_0: \beta_1 = \beta_2 = \beta_3 = 0$
alternative hypothesis  $H_1:$ at least one of $\beta_1, \beta_2, \beta_3 \neq 0$
Unrestricted model: $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i \;\Rightarrow\; SSR_{UR} / R^2_{UR}$
Restricted model: $y_i = \alpha + u_i \;\Rightarrow\; SSR_R$ (notice that $R^2_R = 0$)

$\Rightarrow F = \dfrac{(SSR_R - SSR_{UR})/q}{SSR_{UR}/(n-(k+1))} = \dfrac{R^2_{UR}/q}{(1 - R^2_{UR})/(n-(k+1))} \;\Rightarrow\; \text{Reject } H_0 \text{ if } F > F_{q,\,n-(k+1);\,\alpha} \text{ or } P\text{-value} < \alpha$

If the null hypothesis is rejected, the model is said to be jointly significant.
- Notice that the same approach can be used to test the joint significance of a subset of coefficients (but the
simplification for the R-squared is no longer valid).
- Moreover, it is possible to show that the F-Statistic for a single coefficient is equal to the square of the
corresponding t-Statistic.
32
Statistical Inference (12)
- Example 1:
 Estimated unrestricted model (n = 1000): $y_i = 15 - 0.7 x_{1i} + 1.2 x_{2i} + 2.2 x_{3i} + \hat{\varepsilon}_i \;\Rightarrow\; SSR_{UR} = 123$

null hypothesis (assumed to be true)  $H_0: \beta_2 = \beta_3 = 0$ (x2 and x3 are jointly irrelevant; q = 2 restrictions)
alternative hypothesis  $H_1:$ at least one of $\beta_2, \beta_3 \neq 0$

 Estimated restricted model (n = 1000): $y_i = 13 - 0.6 x_{1i} + \hat{u}_i \;\Rightarrow\; SSR_R = 154$

$\Rightarrow F = \dfrac{(SSR_R - SSR_{UR})/q}{SSR_{UR}/(n-(k+1))} = \dfrac{(154 - 123)/2}{123/(1000-(3+1))} = 125.51 > 3 \;(= F_{2,\,996;\,0.05}) \;\Rightarrow\; \text{Reject } H_0$

 Alternatively: P-value$(F_{2,\,996} = 125.51) = 3.10652\text{e-}49 < 0.05 \;\Rightarrow\; \text{Reject } H_0$
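A minimal sketch (using the example's numbers; SciPy is used only for the F distribution) that reproduces the F statistic, the critical value and the p-value:

from scipy import stats

ssr_r, ssr_ur = 154.0, 123.0
q, n, k = 2, 1000, 3
df2 = n - (k + 1)                         # 996

F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df2)
crit = stats.f.ppf(1 - 0.05, q, df2)      # ~ 3.0
p_value = stats.f.sf(F, q, df2)
print(F, crit, p_value)                   # F ~ 125.5, so reject H0 at the 5% level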

33
Statistical Inference (13)

34
35
36
Cases when OLS fails
When can we expect that $E[\varepsilon_i \mid X_i] \neq 0$ and/or $E[\varepsilon_i X_i] \neq 0$?

1) Simultaneity and Reverse Causality (X → Y and Y → X)

2) Omitted Variable Bias (elements that affect Y and are related with X are not controlled for)

3) Measurement Errors in explanatory variables (X* is an imperfect measure/proxy of X)

4) Lagged Dependent Variable as explanatory variable (dynamic model) with Autocorrelation

In all these situations we should apply an alternative estimator: the Instrumental Variables
Estimator/Two-Stage Least Squares (or other methods to achieve "identification").

However, as you will see in Econometrics II, this is not always possible.
