MODEL SPECIFICATION AND DATA PROBLEMS

Hüseyin Taştan
Yıldız Technical University, Department of Economics

These presentation notes are based on Introductory Econometrics: A Modern Approach (2nd ed.) by J. Wooldridge.

31 December 2012

Model Specification and Data Problems

- In the previous class we analyzed one failure of the Gauss-Markov assumptions: MLR.5 (constant variance). Heteroscedasticity does not cause bias or inconsistency in the OLS estimators, but it does make them inefficient. We learned that it is relatively easy to adjust standard errors and test statistics.
- Now we want to analyze a more serious problem, namely violation of the exogeneity assumption (MLR.3). We will examine the case where "the error term u is correlated with one or more of the explanatory variables" (i.e., endogeneity).
- Recall that if an x variable is correlated with the error term, it is called an endogenous variable.
- Recall that when a relevant variable is omitted from the model, the OLS estimators are biased and inconsistent.
- In the special case where the omitted variable is a function of an explanatory variable already in the model, the model suffers from functional form misspecification.

Model Specification and Data Problems

- In this chapter, we will first discuss functional form misspecification and how to test for it.
- Then, we will discuss how to use proxy variables to mitigate omitted variable bias.
- We will also discuss problems caused by measurement errors in the dependent and explanatory variables.
- We will discuss the problems caused by endogenous variables within the context of OLS estimators. In most cases, the endogeneity problem cannot be solved within the OLS framework. We will need consistent estimation methods such as Instrumental Variables and Two Stage Least Squares (2SLS).

Functional Form Misspecification

- A multiple regression model suffers from functional form misspecification when it does not properly account for the relationship between the dependent and the observed explanatory variables.
- For example, if we fit a level-level model instead of a log-log model (which is the true specification), or if we omit a quadratic term where one should have been added, then the model suffers from functional form misspecification. This, of course, leads to biased and inconsistent β̂j.
- Another example: suppose that the return to an additional year of education changes with gender, implying that the model should contain an interaction term. If we omit, for some reason, this interaction term, female × educ, then the functional form will be misspecified.
Functional Form Misspecification

- How can we detect a misspecified functional form? We can always use the F test for joint exclusion restrictions, such as the joint significance of quadratic terms, interaction terms, etc.
- We can use the usual statistical inference procedures to mitigate the functional form misspecification problem.
- Significant quadratic terms may be symptomatic of other functional form problems, such as using the level of a variable when the log is more appropriate.
- In fact, using a log transformation, where appropriate, may work well in practice.

A Test for Functional Form Misspecification

- Is there a general test that can detect functional form misspecification?
- Yes, there are many misspecification tests. We will only examine one of them.
- We will learn the Regression Specification Error Test, or RESET test, of Ramsey (1969).
- Ramsey's RESET test is designed to detect any neglected nonlinearities in the model.

Ramsey's RESET Test

- Suppose that in the multiple linear regression model

    y = β0 + β1 x1 + β2 x2 + . . . + βk xk + u

  the assumption MLR.3 (exogenous x's) is satisfied.
- This implies that no nonlinear functions of the independent variables (such as squares and cubes of the xj's) should be significant when added to the model.
- But, as in the White heteroscedasticity test, adding squares, cubes and cross-products uses up many degrees of freedom. This is a drawback.
- Instead, we can add squares and cubes of the fitted values, ŷ² and ŷ³, to the model and test for the joint significance of the added terms using an F or LM test.

RESET Test

- The auxiliary regression for the RESET test statistic can be written as follows:

    y = β0 + β1 x1 + β2 x2 + . . . + βk xk + δ1 ŷ² + δ2 ŷ³ + u

- The null hypothesis of the RESET test says that the model is correctly specified:

    H0: δ1 = 0, δ2 = 0

- In large samples and under the Gauss-Markov assumptions, the usual F test of these restrictions follows the F(2, n − k − 3) distribution.
- If the F statistic is greater than the critical value at a given significance level, we reject the null hypothesis of correct specification. This indicates a functional form misspecification.
- We can also use the LM test statistic, which follows the χ²₂ distribution.
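The mechanics of the RESET test can be sketched in pure Python (the slides use GRETL for the actual examples). This is a minimal illustration with simulated data: the true model is quadratic, we deliberately fit a linear one, and then test the joint significance of ŷ² and ŷ³ with the F statistic above. All data and parameter values here are invented for the illustration.

```python
import random

def ols(X, y):
    """Least squares via the normal equations (Gaussian elimination).
    X is a list of rows; each row already includes the intercept 1.0."""
    k, n = len(X[0]), len(y)
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]                      # X'X
    v = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]  # X'y
    for col in range(k):                         # forward elimination w/ pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    beta = [0.0] * k                             # back substitution
    for r in range(k - 1, -1, -1):
        beta[r] = (v[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return beta

def ssr(X, y, beta):
    """Sum of squared residuals."""
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(len(beta)))) ** 2
               for i in range(len(y)))

random.seed(1)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
# true model is quadratic; we deliberately fit a linear one
y = [1 + 2 * x[i] + 1.5 * x[i] ** 2 + random.gauss(0, 1) for i in range(n)]

X_r = [[1.0, x[i]] for i in range(n)]            # restricted (misspecified) model
b_r = ols(X_r, y)
yhat = [b_r[0] + b_r[1] * x[i] for i in range(n)]

X_ur = [X_r[i] + [yhat[i] ** 2, yhat[i] ** 3] for i in range(n)]
b_ur = ols(X_ur, y)

ssr_r, ssr_ur = ssr(X_r, y, b_r), ssr(X_ur, y, b_ur)
k = 1                                            # one original regressor
F = ((ssr_r - ssr_ur) / 2) / (ssr_ur / (n - k - 3))
print("RESET F statistic: %.1f" % F)             # large F -> reject correct form
```

Because the quadratic term was omitted, the F statistic comes out far above any conventional F(2, n − k − 3) critical value, so the test correctly flags the misspecification.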
RESET Test Example: House Prices, hprice1.gdt

- Level-level model:

    price = β0 + β1 lotsize + β2 sqrft + β3 bdrms + u

- Level-level estimation results:

    price-hat = −21.77 + 0.002 lotsize + 0.123 sqrft + 13.85 bdrms
               (29.475)  (0.0006)        (0.013)       (9.010)
    n = 88, R² = 0.672

- We form the test regression by adding the squares and cubes of ŷ to the model above. In GRETL, from the menu within the estimation results window: Tests > Ramsey's RESET > squares and cubes.

RESET Test Example: Level-Level Model

    Auxiliary regression for RESET specification test
    OLS, using observations 1-88
    Dependent variable: price

                coefficient    std. error    t-ratio   p-value
    ------------------------------------------------------------
    const       166.097        317.433        0.5233   0.6022
    lotsize       0.000153723    0.00520304   0.02954  0.9765
    sqrft         0.0175988      0.299251     0.05881  0.9532
    bdrms         2.17490       33.8881       0.06418  0.9490
    yhat^2        0.000353426    0.00709894   0.04979  0.9604
    yhat^3        1.54557e-06    6.55431e-06  0.2358   0.8142

    Test statistic: F = 4.668205,
    with p-value = P(F(2,82) > 4.66821) = 0.012

- RESET test result: at the 5% significance level we reject the null hypothesis that the functional form is correctly specified. Thus, there is functional form misspecification.

RESET Test Example: hprice1.gdt

- Alternative functional form: log-log model (except bdrms):

    lprice = β0 + β1 llotsize + β2 lsqrft + β3 bdrms + u

- Log-log estimation results:

    lprice-hat = −1.297 + 0.17 llotsize + 0.70 lsqrft + 0.037 bdrms
                (0.651)  (0.038)          (0.093)       (0.028)
    n = 88, R² = 0.643

- Now let us calculate the RESET test statistic.

RESET Test Example: hprice1.gdt

    Auxiliary regression for RESET specification test
    OLS, using observations 1-88
    Dependent variable: lprice

                coefficient   std. error   t-ratio   p-value
    -------------------------------------------------------
    const        87.8849      240.974       0.3647   0.7163
    llotsize     −4.18098      12.5952     −0.3319   0.7408
    lsqrft      −17.3491       52.4899     −0.3305   0.7418
    bdrms        −0.925329      2.76975    −0.3341   0.7392
    yhat^2        3.91024      13.0143      0.3005   0.7646
    yhat^3       −0.192763      0.752080   −0.2563   0.7984

    Test statistic: F = 2.565042,
    with p-value = P(F(2,82) > 2.56504) = 0.0831

- RESET test result: at the 5% significance level, we fail to reject the null hypothesis of correct specification. This indicates that the functional form is correct. We prefer the log-log specification.
RESET Test

- A drawback of the RESET test is that it provides no real direction on how to proceed if the model is rejected.
- Some have argued that RESET is a very general test for model misspecification, including unobserved omitted variables and heteroscedasticity.
- This conclusion is misguided. If the omitted variable is linearly related to the included variables, the RESET test has no power to detect it.
- Also, if the functional form is correct, the RESET test has no power for detecting heteroscedasticity.
- The RESET test is just a functional form test. It should not be used for other purposes.

Tests Against Nonnested Alternatives

- There are several tests for functional form misspecification. Consider the following two models:

    y = β0 + β1 x1 + β2 x2 + u
    y = β0 + β1 log(x1) + β2 log(x2) + u

- These are nonnested models: we cannot write one of them as a special case of the other.
- In this case we cannot use the F test.
- As long as the dependent variable is the same, two different approaches have been suggested.
- We can form a bigger model which includes both models as special cases and use an F test. This method was suggested by Mizon and Richard.

Tests Against Nonnested Alternatives

- The other method is known as the Davidson-MacKinnon test. This test is based on including the fitted values ŷ from one model in the other model as an additional regressor and conducting a t test.
- We will not examine these tests in detail.
- There are several drawbacks associated with nonnested tests.
- First, these tests may not choose a correct specification: both models could be rejected, or neither model could be rejected.
- If neither model can be rejected, we can use the adjusted R² to choose between them.
- Second, rejecting one model does not automatically mean that the alternative is correct. The true model may have a completely different specification.
- Third, if the dependent variables differ (for example, if one model has y and the other has log(y) as the dependent variable), these tests cannot be used. We would need more complex testing procedures, which we will not discuss here.

Using Proxy Variables for Unobserved Explanatory Variables

- Can we use a proxy variable for an omitted unobserved explanatory variable?
- We know that if the unobserved variable is an important, relevant variable, then the OLS estimators are biased and inconsistent.
- The question can be rephrased as follows: can we solve, or at least mitigate, the omitted variable bias using proxy variables?
- A proxy variable is a variable that is related to the unobserved variable that we would like to control for.
- Example: recall that in the wage equation we could not observe innate ability. Can we use the intelligence quotient (IQ) as a proxy for ability?
- IQ does not have to be the same thing as ability; we know they are not. What we need is for IQ to be correlated with ability.
Using Proxy Variables

- Consider the following model:

    y = β0 + β1 x1 + β2 x2 + β3 x3* + u

  y: log(wage), x1: educ, x2: exper, x3*: ability (unobserved)
- x3* is unobserved; x3 is a proxy for the unobserved variable.
- The proxy variable must be related to the unobserved variable, as represented by the following simple regression:

    x3* = δ0 + δ3 x3 + ν3

- We need the error term ν3 because these variables are not exactly related.
- Typically, these variables are positively correlated, so that δ3 > 0.
- If δ3 = 0 then x3 cannot be a suitable proxy.

Using Proxy Variables

- How can we use x3 to get unbiased, or at least consistent, estimators?
- We can just pretend that x3* and x3 are the same and run the regression of y on x1, x2, x3. This is called the plug-in solution to the omitted variables problem.
- How does this approach produce consistent estimators?
- To show this we need to make some assumptions about the error terms u and ν3.
- The error term u is uncorrelated with x1, x2 and x3*. This is the standard MLR.3 assumption.
- In addition, u must be uncorrelated with x3. Since x3 is the proxy variable, it is irrelevant in the population model: it is x3* that affects y, not x3.

    E(u|x1, x2, x3*, x3) = E(u|x1, x2, x3*) = 0

Using Proxy Variables

- The error term ν3 is uncorrelated with x1, x2 and x3.
- This can be stated as follows:

    E(x3*|x1, x2, x3) = E(x3*|x3) = δ0 + δ3 x3

- This says that once x3 is controlled for, the expected value of x3* does not depend on x1 and x2.
- For example, in the wage equation where IQ is the proxy variable for ability, this condition becomes

    E(ability|educ, exper, IQ) = E(ability|IQ) = δ0 + δ3 IQ

- This implies that the average level of ability changes only with IQ, not with educ and exper. Is this a reasonable assumption?

Using Proxy Variables

- Plugging x3* = δ0 + δ3 x3 + ν3 into the model and rearranging, we obtain

    y = (β0 + β3 δ0) + β1 x1 + β2 x2 + β3 δ3 x3 + u + β3 ν3

- Let the composite error term be e = u + β3 ν3:

    y = α0 + β1 x1 + β2 x2 + α3 x3 + e

  where α0 = β0 + β3 δ0 and α3 = β3 δ3.
- If the proxy-variable assumptions are all satisfied, then the composite error term e will be uncorrelated with the explanatory variables included in the model. Thus, the OLS estimators of α0, β1, β2, α3 will be consistent.
- The coefficient on IQ, α3, measures the impact of a one-point change in the IQ test score on wage.
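The plug-in solution can be checked with a small simulation. This is a sketch with invented numbers, not the wage2.gdt data: ability depends on IQ as in the proxy equation, education is correlated with ability only through IQ, and we compare the regression that omits ability entirely with the one that plugs in the IQ proxy. The partialling-out helper uses the standard Frisch-Waugh-Lovell shortcut to get the coefficient on educ while controlling for IQ.

```python
import random

def slope(x, y):
    """Simple OLS slope of y on x: Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

def resid(dep, reg):
    """Residuals from a simple regression of dep on (const, reg)."""
    b = slope(reg, dep)
    a = sum(dep) / len(dep) - b * sum(reg) / len(reg)
    return [d - a - b * r for d, r in zip(dep, reg)]

def partial_slope(y, x, z):
    """Coefficient on x in a regression of y on (const, x, z),
    via Frisch-Waugh-Lovell partialling-out of z."""
    return slope(resid(x, z), resid(y, z))

random.seed(2)
n = 20000
beta_educ, beta_abil = 0.06, 0.8      # hypothetical structural coefficients
delta0, delta3 = -5.0, 0.05           # proxy equation: abil = d0 + d3*IQ + v3

iq    = [random.gauss(100, 15) for _ in range(n)]
abil  = [delta0 + delta3 * q + random.gauss(0, 0.5) for q in iq]
educ  = [6 + 0.08 * q + random.gauss(0, 1) for q in iq]   # correlated with abil via IQ
lwage = [1 + beta_educ * e + beta_abil * a + random.gauss(0, 0.5)
         for e, a in zip(educ, abil)]

b_naive = slope(educ, lwage)               # ability omitted -> biased upward
b_proxy = partial_slope(lwage, educ, iq)   # plug-in solution: IQ as proxy
print("omitting ability: %.3f   with IQ proxy: %.3f   true: %.2f"
      % (b_naive, b_proxy, beta_educ))
```

With the proxy included, the educ coefficient lands near its true value, while the naive regression absorbs much of the ability effect into educ.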
Using Proxy Variables: wage2.gdt

- This data set contains information on monthly wages, education, experience, tenure, IQ scores, and several demographic characteristics for a sample of 935 working men in 1980.
- Adding IQ test scores, we obtain the following results:

    Model 1: OLS, using observations 1–935
    Dependent variable: lwage

               Coefficient    Std. Error     t-ratio   p-value
    const       5.17644       0.128001       40.4407   0.0000
    educ        0.0544106     0.00692849      7.8532   0.0000
    exper       0.0141458     0.00316510      4.4693   0.0000
    tenure      0.0113951     0.00243938      4.6713   0.0000
    married     0.199764      0.0388025       5.1482   0.0000
    south      −0.0801695     0.0262529      −3.0537   0.0023
    urban       0.181946      0.0267929       6.7908   0.0000
    black      −0.143125      0.0394925      −3.6241   0.0003
    IQ          0.00355910    0.000991808     3.5885   0.0004

    Mean dependent var  6.779004    S.D. dependent var  0.421144
    Sum squared resid   122.1203    S.E. of regression  0.363152
    R²                  0.262809    Adjusted R²         0.256441

Using Lagged Dependent Variables as Proxy Variables

- In some applications (e.g., the wage example) we have at least a vague idea about which unobserved factor we want to control for.
- In other applications, we suspect that one or more of the independent variables is correlated with an omitted variable, but we have no idea how to obtain a proxy for that omitted variable.
- In such cases, we can include the value of the dependent variable from an earlier time period, y−1.
- To do this we need the lagged value of the dependent variable. This provides a way of controlling for historical factors that cause current differences in the dependent variable.
- For example, some cities have had high crime rates in the past. Many of the unobserved factors contribute to both high current and past crime rates. Slowly moving components of the dependent variable (inertial effects) can be captured by the lagged value.

Using Lagged Dependent Variables as Proxy Variables

- Example: CRIME2.gdt, 1987 crime data for 46 cities; information from 1982 is also available.
- The model without the lagged crime rate:

    lcrmrte87-hat = 3.34 − 0.029 unem87 + 0.203 llawexpc87
                   (1.251) (0.032)        (0.173)
    n = 46, R² = 0.057

- The model with the lagged crime rate:

    lcrmrte87-hat = 0.076 + 0.009 unem87 − 0.140 llawexpc87 + 1.194 lcrmrte82
                   (0.821) (0.02)          (0.109)            (0.132)
    n = 46, R² = 0.680

- In the first model, the crime rate decreases as unemployment increases. This is counterintuitive.
- After controlling for the crime rate in 1982 (5 years earlier), the coefficient on unem is positive but insignificant.
- What is the elasticity of the current crime rate with respect to the crime rate in the previous period?
Measurement Errors

- In some applications, it may be difficult or impossible to collect data on the actual values of variables.
- If the true value is not observed (in other words, we have an imprecise measure of a variable), then the observed value will contain measurement error.
- For example, income and consumption reported by households may differ from the actual values. Households may tend to underreport their income.
- In this section, we are interested in the properties of OLS estimators under measurement error.
- We will examine measurement errors in two parts: (1) measurement errors in the dependent variable and (2) measurement errors in the explanatory variables.
- We will learn under what conditions measurement errors lead to inconsistency in OLS estimators.

Measurement Errors in the Dependent Variable

- Let y* be the actual value of the dependent variable that we attempt to explain. For concreteness, suppose that y* is the actual savings of households:

    y* = β0 + β1 x1 + β2 x2 + . . . + βk xk + u

- y is the observed (or reported) value. The difference between the observed value and the actual value is the measurement error in the population:

    e0 = y − y*

- From this we have y* = y − e0. Plugging this into the model we obtain:

    y = β0 + β1 x1 + β2 x2 + . . . + βk xk + u + e0

- Now the error term in the new model is u + e0: the measurement error has become part of the regression error term. Does OLS produce consistent estimators?

Measurement Errors in the Dependent Variable

- The model is:

    y = β0 + β1 x1 + β2 x2 + . . . + βk xk + u + e0

- If the measurement error e0 is uncorrelated with each xj, then consistent estimation is possible. If the measurement error is independent of the explanatory variables, then the OLS estimators are unbiased and consistent.
- If the error term u and the measurement error e0 are independent (as is usually assumed), then we have:

    Var(u + e0) = Var(u) + Var(e0) = σu² + σ0² > σu²

- This means that measurement error in the dependent variable results in a larger error variance than when no error occurs.
- As a result, the OLS estimators will have larger variances and standard errors. In this case, we may try to collect higher-quality data.

Measurement Errors in the Dependent Variable: Example

- Consider the following savings model:

    sav* = β0 + β1 inc + β2 size + β3 educ + β4 age + u

  sav*: actual household savings; sav: reported (observed) household savings; inc: annual household income; size: number of individuals in the household; educ: education level of the household head; age: age of the household head.
- When does the measurement error (sav − sav*) create a problem?
- We can assume that the measurement error is uncorrelated with income, size, education and age.
- On the other hand, we may think that families with higher incomes, or more education, report their savings more accurately.
- Since we cannot observe the measurement error, we may never be able to determine whether it is correlated with income or education.
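The two claims above (slopes stay consistent, error variance inflates from σu² to σu² + σ0²) can be verified in a few lines. This is a simulation sketch with invented parameters, not the savings data: here σu² = σ0² = 1, so the residual variance should roughly double while the slope stays near its true value of 2.

```python
import random

def ols_simple(x, y):
    """Intercept and slope of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def resid_var(x, y, a, b):
    """Estimated error variance, SSR / (n - 2)."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - 2)

random.seed(3)
n = 20000
x      = [random.gauss(0, 1) for _ in range(n)]
y_star = [1 + 2 * xi + random.gauss(0, 1) for xi in x]   # actual y, sigma_u = 1
y_obs  = [ys + random.gauss(0, 1) for ys in y_star]      # reported y, sigma_0 = 1

a1, b1 = ols_simple(x, y_star)
a2, b2 = ols_simple(x, y_obs)
v_star = resid_var(x, y_star, a1, b1)
v_obs  = resid_var(x, y_obs, a2, b2)
print("slope (actual y): %.3f   slope (mismeasured y): %.3f" % (b1, b2))
print("error variance: %.2f vs %.2f" % (v_star, v_obs))
```

The slope estimated from the mismeasured y is still close to 2; only its precision suffers, through the larger error variance.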
Measurement Error in an Explanatory Variable

- Measurement error in x can lead to more serious problems than measurement error in y.
- To determine the conditions under which OLS estimators become inconsistent, consider the simple regression model:

    y = β0 + β1 x1* + u

  Suppose that the first four Gauss-Markov assumptions hold.
- Here, x1* is the unobserved actual value and x1 is the observed value.
- Then the measurement error is

    e1 = x1 − x1*

- Assume that the expected value of the measurement error is zero: E(e1) = 0.

Measurement Error in an Explanatory Variable

- Assume that the error term u is uncorrelated with both x1* and x1, so that:

    E(y|x1*, x1) = E(y|x1*)

- This means that after controlling for x1*, we no longer need x1 in the model.
- If we use x1 instead of x1*, what are the properties of the OLS estimators? Are they still consistent?
- This depends on the assumption we make about the measurement error.
- There are two possible assumptions: (1) the measurement error is uncorrelated with the observed value x1; (2) the measurement error is uncorrelated with the unobserved actual value x1*.

(1) e1 and x1 Are Uncorrelated

- This assumption can be written as

    Cov(x1, e1) = 0

- Since e1 = x1 − x1*, it must then be the case that e1 and x1* are correlated.
- Under this assumption, substituting x1* = x1 − e1 into the model we obtain:

    y = β0 + β1 x1 + (u − β1 e1)

- Expected value and variance of the composite error term:

    E(u − β1 e1) = 0,   Var(u − β1 e1) = σu² + β1² σe1²

- The OLS estimators are consistent because the error term and x1 are uncorrelated. But the error variance will be higher.

(2) e1 and x1* Are Uncorrelated (CEV Assumption)

- This is known as the "Classical Errors-in-Variables (CEV)" assumption. In the econometrics literature, measurement error in an explanatory variable usually means CEV.
- The CEV assumption can be written as:

    Cov(x1*, e1) = 0

- The observed value can be written as the sum of the actual value and the measurement error:

    x1 = x1* + e1

- Obviously, if x1* and e1 are uncorrelated, then x1 and e1 must be correlated:

    Cov(x1, e1) = E(x1 e1) = E(x1* e1) + E(e1²) = 0 + σe1² = σe1²

- Under the CEV assumption, the covariance between x1 and e1 equals the variance of the measurement error.
(2) CEV Assumption: Cov(x1*, e1) = 0

- Recall that the model was written as:

    y = β0 + β1 x1 + (u − β1 e1)

- Since e1 is included in the composite error term, its covariance with x1 will create a problem.
- The covariance between the composite error term and x1 is

    Cov(x1, u − β1 e1) = −β1 Cov(x1, e1) = −β1 σe1²

- Because this covariance is not 0, the OLS estimators will be biased and inconsistent under the CEV assumption.
- We can calculate the amount of inconsistency in OLS.

(2) CEV Assumption: Cov(x1*, e1) = 0

- In the simple regression model, the probability limit of the OLS estimator of the slope parameter is:

    plim(β̂1) = β1 + Cov(x1, u − β1 e1) / Var(x1)
             = β1 − β1 σe1² / (σx1*² + σe1²)
             = β1 (1 − σe1² / (σx1*² + σe1²))
             = β1 (σx1*² / (σx1*² + σe1²))

(2) CEV Assumption: Cov(x1*, e1) = 0

- Probability limit of the OLS estimator:

    plim(β̂1) = β1 (σx1*² / (σx1*² + σe1²)) ≠ β1

- The term in parentheses is always less than or equal to 1; it equals 1 if and only if σe1² = 0.
- This means that β̂1 is always closer to 0 than the true value β1 is. This is called attenuation bias.
- If β1 > 0 then β̂1 will approach a value smaller than the true value in the limit (underestimation). Otherwise, it will approach a bigger value (overestimation).

(2) CEV Assumption: Cov(x1*, e1) = 0

- If the variance of x1* is large compared to the variance of e1, then the ratio Var(x1*)/Var(x1) will be close to 1. In this case the amount of inconsistency may not be large. But it is almost impossible to determine this.
- Things are more complicated when we add more explanatory variables.
- But we can say that measurement errors generally lead to inconsistency of all OLS estimators.
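The attenuation result is easy to confirm by simulation. In this sketch (all values hypothetical) the variances of x1* and e1 are both 1, so the attenuation factor σx1*²/(σx1*² + σe1²) is 1/2 and the estimated slope should converge to half the true β1 = 2:

```python
import random

def slope(x, y):
    """Simple OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

random.seed(4)
n = 20000
beta1 = 2.0
var_xstar, var_e1 = 1.0, 1.0                  # equal variances -> factor 1/2

x_star = [random.gauss(0, var_xstar ** 0.5) for _ in range(n)]
e1     = [random.gauss(0, var_e1 ** 0.5) for _ in range(n)]  # CEV: e1 indep. of x*
x_obs  = [xs + e for xs, e in zip(x_star, e1)]               # x1 = x1* + e1
y      = [1 + beta1 * xs + random.gauss(0, 1) for xs in x_star]

b_hat      = slope(x_obs, y)
atten_pred = beta1 * var_xstar / (var_xstar + var_e1)        # plim from the formula
print("estimated slope: %.3f   predicted plim: %.3f   true beta1: %.1f"
      % (b_hat, atten_pred, beta1))
```

The estimate lands near 1.0 rather than 2.0, matching the plim formula: the slope is pulled toward zero, exactly the attenuation bias described above.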
(2) CEV Assumption: Cov(x1*, e1) = 0

- Consider the following model for college success:

    colGPA = β0 + β1 faminc* + β2 hsGPA + β3 SAT + u

  faminc: family income; hsGPA: high school GPA; SAT: Scholastic Aptitude Test score.
- faminc* is the actual family income. If a questionnaire method is used to collect the data, the student will be asked to report family income.
- We can collect data on hsGPA and SAT scores from student records, but we cannot do this for family income levels.
- If the reported income differs from the actual income, and if the CEV assumption is valid (i.e., actual income and measurement error are uncorrelated), then the OLS estimator of β1 will be biased and inconsistent.
- As a result, the impact of family income on college success will be underestimated (downward bias).

Data Problems

- Measurement error can be viewed as a data problem because we cannot obtain data on the actual variables of interest.
- Another data problem that we saw before is multicollinearity among the explanatory variables. When two independent variables are highly correlated, it can be difficult to estimate the partial effect of each, which is reflected in high standard errors. Remember that no assumption is violated in this case.
- There may be several other data problems:
  - Missing data
  - Nonrandom samples
  - Outliers (extreme observations)

Missing Data

- The missing data problem can arise in a variety of forms. For example, in surveys, respondents may not answer some of the questions.
- If data are missing for an observation on either the dependent variable or one of the independent variables, then the observation cannot be used in estimation. Econometric software packages usually ignore observations with missing data. As a result, the sample size decreases.
- Are there any serious statistical consequences of missing data? The answer depends on why the data are missing. If the data are missing at random, then this does not cause any bias; the only consequence is that the sample is smaller and the OLS estimates will be less precise.
- If the data are missing in a systematic way, the OLS estimators may be biased. For example, in the birthweight example, if the probability that education is missing is higher for people with a lower than average level of education, then we have systematic missing data, and MLR.2 (random sampling) is violated.

Nonrandom Sampling

- Violation of MLR.2 (random sampling): if the missing data result in a nonrandom sample, then we have a more serious problem.
- For example, in the wage equation, suppose we want to include IQ scores as an explanatory variable.
- If obtaining an IQ score is easier for those with higher IQs, then the sample is not representative of the population. Workers with high IQs will be over-represented in the sample.
- In this case MLR.2 may not hold, and thus the OLS estimators may be biased.
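The missing-at-random claim above (no bias, only a loss of precision) can be illustrated with a small Monte Carlo sketch (all numbers invented): repeatedly estimate the slope on a full sample and on a sample where about half the observations are dropped completely at random, then compare the spread of the two sets of estimates.

```python
import random

def slope(x, y):
    """Simple OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

def stdev(v):
    """Sample standard deviation."""
    m = sum(v) / len(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

random.seed(5)
beta1, reps, n = 2.0, 300, 200
full_est, mar_est = [], []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [1 + beta1 * xi + random.gauss(0, 1) for xi in x]
    full_est.append(slope(x, y))
    # drop roughly half the observations completely at random
    keep = [(xi, yi) for xi, yi in zip(x, y) if random.random() < 0.5]
    mar_est.append(slope([p[0] for p in keep], [p[1] for p in keep]))

print("mean estimate, full: %.3f   missing-at-random: %.3f"
      % (sum(full_est) / reps, sum(mar_est) / reps))
print("sd of estimates, full: %.3f   missing-at-random: %.3f"
      % (stdev(full_est), stdev(mar_est)))
```

Both means sit near the true slope of 2, but the estimates from the reduced samples are visibly more variable: random missingness costs precision, not unbiasedness.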
Nonrandom Sampling

- Certain types of nonrandom sampling do not cause bias or inconsistency.
- The sample can be chosen on the basis of the independent variables without causing any statistical problems.
- This is called exogenous sample selection.
- For example, consider the following saving equation:

    saving = β0 + β1 income + β2 age + β3 size + u

- If our data set is based on a survey of people over 35 years of age, then we have exogenous sample selection, a type of nonrandom sampling.
- If the other assumptions are satisfied, then OLS is still unbiased and consistent. The reason is that the conditional expectation

    E(saving|income, age, size)

  is the same for any subset of the population defined by income, age, or size.

Nonrandom Sampling

- If the sample selection is based on the dependent variable y, MLR.2 will not be satisfied, which will cause bias in OLS.
- This is called endogenous sample selection.
- Consider the following wealth equation:

    wealth = β0 + β1 educ + β2 exper + β3 age + u

- Suppose that only people with wealth below $250,000 are included in the sample. This is a kind of endogenous sample selection and will result in biased and inconsistent estimators.
- This is because the population regression

    E(wealth|educ, exper, age)

  is not the same as the expected value conditional on wealth being less than $250,000.
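The contrast between exogenous and endogenous selection can be shown in a simulation sketch (all numbers hypothetical; a cutoff on simulated wealth plays the role of the $250,000 rule). Selecting on the explanatory variable leaves the slope intact; selecting on the dependent variable pulls it toward zero:

```python
import random

def slope(x, y):
    """Simple OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

random.seed(6)
n = 20000
educ   = [random.gauss(12, 2) for _ in range(n)]
wealth = [10 + 2 * e + random.gauss(0, 5) for e in educ]   # true slope = 2

b_full = slope(educ, wealth)

# exogenous selection: keep observations based on an explanatory variable
sel = [(e, w) for e, w in zip(educ, wealth) if e > 12]
b_exog = slope([p[0] for p in sel], [p[1] for p in sel])

# endogenous selection: keep observations based on the dependent variable
sel = [(e, w) for e, w in zip(educ, wealth) if w < 34]     # "wealth below cutoff"
b_endog = slope([p[0] for p in sel], [p[1] for p in sel])

print("full sample: %.2f   selected on educ: %.2f   selected on wealth: %.2f"
      % (b_full, b_exog, b_endog))
```

The slope survives selection on educ but is severely attenuated when the sample is truncated on wealth, which is exactly the bias from endogenous sample selection.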

Outliers - Influential Observations

- In some applications (usually, but not only, in small data sets) the OLS estimates are sensitive to the inclusion of one or several observations.
- An observation is an influential observation if dropping it from the analysis changes the key OLS estimates by a practically large amount.
- An outlier is an unusually large or small value for some observation.
- OLS can be sensitive to outliers because, in minimizing the SSR, large residuals receive a lot of weight.
- How can we determine whether an observation is an outlier or an influential observation?

Outliers - Influential Observations

- Outliers can occur for two reasons in practice: (1) a mistake has been made in collecting or entering the data (e.g., adding a zero by mistake or misplacing a decimal point), or (2) the outlier is a genuine feature of the distribution of the variable.
- In practical applications, it is a good idea to examine summary statistics of the variables: mean, median, mode, minimum, maximum, standard deviation, etc.
- It is not very clear what should be done if the outlier is a feature of the distribution.
- Outlying observations can provide important information by increasing the variation in the explanatory variables, resulting in reduced standard errors.
- The usual practice is to report OLS results both with and without the outlying observations.
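A tiny deterministic example (invented data, not the RDCHEM set) shows how heavily a single bad point can swing OLS: ten points lying exactly on the line y = 2x, plus one mis-entered observation.

```python
def ols_simple(x, y):
    """Intercept and slope of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# ten well-behaved points on y = 2x, plus one badly mis-entered point
x = [float(v) for v in range(1, 11)] + [100.0]
y = [2.0 * v for v in range(1, 11)] + [0.0]

_, b_with = ols_simple(x, y)
_, b_without = ols_simple(x[:-1], y[:-1])
print("slope with outlier: %.3f   without: %.3f" % (b_with, b_without))
```

Because squared residuals give the far-out point enormous leverage, one observation is enough to destroy (here, even flip the sign of) an otherwise exact relationship, which is why results are reported with and without such points.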
Outliers: Example

- Research and Development (R&D) intensity and firm performance:

    rdintens = β0 + β1 sales + β2 profmarg + u

  rdintens: R&D expenditures as a percentage of sales; sales: firm sales (in millions of $); profmarg: profits as a percentage of sales.
- Data set: RDCHEM.gdt. Estimation results:

    rdintens-hat = 2.62 + 0.00005 sales + 0.045 profmarg
                  (0.585) (0.00004)       (0.046)
    n = 32, R² = 0.076

- Neither sales nor profmarg is statistically significant at even the 10% level.
- Are there any outliers? Let us examine the scatter diagram.

Outliers: Example

- Of the 32 firms, 31 have annual sales of less than $20 billion. One firm has annual sales of nearly $40 billion.
- This may be an outlier. Estimation results without the outlier:

    rdintens-hat = 2.297 + 0.000186 sales + 0.0478 profmarg
                  (0.592)  (0.000084)       (0.0445)
    n = 31, R² = 0.1728

- When the largest firm is dropped from the regression, the coefficient on sales more than triples, and it now has a t statistic over 2.
- There is a statistically significant relationship between R&D intensity and sales.
- The profit margin is still insignificant, and its coefficient has not changed much.
Outliers

- Certain functional forms may be less sensitive to outlying observations. A logarithmic transformation significantly narrows the range of the data, which can mitigate the problems created by outliers. For example, consider the following model:

    log(rd) = β0 + β1 log(sales) + β2 profmarg + u

  rd: R&D expenditures, in millions of $
- n = 32, with the outlier:

    log(rd)-hat = −4.378 + 1.084 log(sales) + 0.023 profmarg
                 (0.468)  (0.060)             (0.013)
    n = 32, R² = 0.918

- n = 31, without the outlier:

    log(rd)-hat = −4.404 + 1.088 log(sales) + 0.0218 profmarg
                 (0.511)  (0.067)             (0.013)
    n = 31, R² = 0.9037

- The results are practically the same. Can we reject the null hypothesis of unit elasticity?
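The unit-elasticity question can be answered with a one-line calculation from the reported n = 32 estimates: a t test of H0: β1 = 1 against a two-sided alternative.

```python
# Test H0: beta1 = 1 (unit elasticity) using the reported log-log estimates
# with the outlier included (n = 32, k = 2 regressors).
b1, se1 = 1.084, 0.060
t = (b1 - 1.0) / se1
df = 32 - 2 - 1                     # n - k - 1 = 29
print("t statistic: %.2f (df = %d)" % (t, df))
# The 5% two-sided critical value for 29 df is about 2.045.
```

The t statistic of 1.40 is well below 2.045, so we fail to reject unit elasticity: R&D spending appears to rise roughly proportionally with sales.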
