9.
Binary Dependent Variables
• 9.1 Homogeneous models
– Logit, probit models
– Inference
– Tax preparers
• 9.2 Random effects models
• 9.3 Fixed effects models
• 9.4 Marginal models and GEE
• Appendix 9A - Likelihood calculations
9.1 Homogeneous models
• The response of interest, yit, now may be only a 0 or a 1, a binary
dependent variable.
– Typically indicates whether the ith subject possesses an
attribute at time t.
• Suppose that the probability that the response equals 1 is denoted
by Prob(yit = 1) = pit.
– Then, we may interpret the mean response to be the probability
that the response equals 1, that is, E yit = 0 · Prob(yit = 0)
+ 1 · Prob(yit = 1) = pit.
– Further, straightforward calculations show that the variance is
related to the mean through the expression Var yit = pit (1 - pit ) .
Inadequacy of linear models
• Homogeneous means that we will not incorporate subject-specific
terms that account for heterogeneity.
• Linear models of the form yit = xit′β + εit are inadequate because:
– The expected response is a probability and thus must vary
between 0 and 1, although the linear combination, xit′β, may
vary between negative and positive infinity.
– Linear models assume homoscedasticity (constant variance) yet
the variance of the response depends on the mean which varies
over observations.
– The response must be either a 0 or 1 although the distribution of
the error term is typically regarded as continuous.
Using nonlinear functions of
explanatory variables
• In lieu of linear, or additive, functions, we express the probability of
the response being 1 as a nonlinear function of explanatory variables:
pit = π(xit′β).
• Two special cases are:
– the logit case, π(z) = exp(z) / (1 + exp(z)), and
– the probit case, π(z) = Φ(z), where Φ is the standard normal
cumulative distribution function.
• These two functions are similar. I focus on the logit case because it
permits closed-form expressions unlike the cumulative normal
distribution function.
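The two link functions can be compared numerically; a minimal sketch (the function names `logit_cdf` and `probit_cdf` are mine, not from the text):

```python
import math

def logit_cdf(z):
    """Logit case: pi(z) = exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Probit case: pi(z) = Phi(z), the standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both map the real line into (0, 1); they are close after rescaling,
# since the logistic distribution has a larger standard deviation.
for z in (-2.0, 0.0, 2.0):
    print(z, round(logit_cdf(z), 4), round(probit_cdf(z), 4))
```

Both functions return 0.5 at z = 0, which is why the two models give similar fitted probabilities near the center.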
Threshold interpretation
• Suppose that there exists an underlying linear model,
yit* = xit′β + εit*.
– The response is interpreted to be the “propensity” to possess a
characteristic.
– We do not observe the propensity but we do observe when the
propensity crosses a threshold, say 0.
– We observe
yit = 0 if yit* ≤ 0,
yit = 1 if yit* > 0.
• Using the logit distribution function,
Prob(εit* ≤ a) = 1 / (1 + exp(−a)).
• Note that Prob(−εit* ≤ xit′β) = Prob(εit* ≤ xit′β), by the symmetry
of the logistic distribution. Thus,
Prob(yit = 1) = Prob(yit* > 0) = Prob(−εit* ≤ xit′β)
= π(xit′β) = exp(xit′β) / (1 + exp(xit′β)).
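The threshold interpretation can be checked by simulation; a sketch assuming a hypothetical value x′β = 0.7 and standard logistic errors:

```python
import math, random

random.seed(0)

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

# Latent model: y* = x'b + e, with e standard logistic; observe y = 1{y* > 0}.
xb = 0.7           # a hypothetical value of x'beta
n = 200_000
count = 0
for _ in range(n):
    u = random.random()
    eps = math.log(u / (1.0 - u))   # inverse-CDF draw from the logistic
    count += (xb + eps > 0)

print(count / n, pi(xb))  # empirical frequency vs. pi(x'beta)
```

The empirical frequency of y = 1 matches π(x′β), as the symmetry argument above predicts.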
Random utility interpretation
• In economics applications, we think of an individual choosing
among c categories.
– Preferences among categories are indexed by an
unobserved utility function.
– We model utility as a function of an underlying value plus
random noise, that is, Uitj = uit(Vitj + εitj), j = 0, 1.
– If Uit1 > Uit0 , then denote this choice as yit = 1.
– Assuming that uit is a strictly increasing function, we have
Prob(yit = 1) = Prob(Uit0 < Uit1)
= Prob( uit(Vit0 + εit0) < uit(Vit1 + εit1) )
= Prob( εit0 − εit1 < Vit1 − Vit0 ).
• Parameterize the problem by taking Vit0 = 0 and Vit1 = xit′β.
• We may take the difference in the errors, εit0 − εit1, to be normal
or logistic, corresponding to the probit and logit cases.
Logistic regression
• This is another phrase used to describe the logit case.
• Using p = π(z), the inverse of π can be calculated as z = π⁻¹(p)
= ln( p/(1−p) ).
– Define logit(p) = ln( p/(1−p) ) to be the logit function.
– Here, p/(1-p) is known as the odds ratio. It has a convenient
economic interpretation in terms of fair games.
• That is, suppose that p = 0.25. Then, the odds ratio is 0.333.
• The odds of winning are 0.333 to 1, or 1 to 3. If we bet $1, then in a
fair game we should win $3.
• The logistic regression models the linear combination of explanatory
variables as the logarithm of the odds ratio,
xit′β = ln( pit/(1−pit) ).
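A small numerical illustration of the logit function and the fair-game odds interpretation (function names are mine):

```python
import math

def logit(p):
    """logit(p) = ln(p / (1 - p)); the inverse of pi(z)."""
    return math.log(p / (1.0 - p))

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

p = 0.25
odds = p / (1.0 - p)      # odds of winning are 1 to 3; a fair $1 bet wins $3
print(odds)               # 0.333...
print(logit(p))           # ln(1/3), the log odds
print(pi(logit(p)))       # round trip recovers p = 0.25
```

The round trip pi(logit(p)) = p is exactly the inverse relationship stated above.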
Parameter interpretation
• To interpret β = (β1, β2, …, βK)′, we begin by assuming that the jth
explanatory variable, xitj, is either 0 or 1.
• Then, with this notation, we may interpret
βj = (xit′β evaluated at xitj = 1) − (xit′β evaluated at xitj = 0)
= ln [ Prob(yit = 1 | xitj = 1) / (1 − Prob(yit = 1 | xitj = 1)) ]
− ln [ Prob(yit = 1 | xitj = 0) / (1 − Prob(yit = 1 | xitj = 0)) ].
• Thus,
e^βj = [ Prob(yit = 1 | xitj = 1) / (1 − Prob(yit = 1 | xitj = 1)) ]
/ [ Prob(yit = 1 | xitj = 0) / (1 − Prob(yit = 1 | xitj = 0)) ].
• To illustrate, if βj = 0.693, then exp(βj) = 2.
– The odds (for y = 1) are twice as great for xj = 1 as for xj = 0.
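This odds-ratio interpretation is easy to verify numerically; a sketch with a hypothetical intercept and dummy-variable coefficient:

```python
import math

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

b0, bj = -0.4, 0.693       # hypothetical intercept and dummy coefficient
ratio = odds(pi(b0 + bj)) / odds(pi(b0))
print(ratio, math.exp(bj))  # the odds ratio equals exp(beta_j), about 2
```

Since odds(π(z)) = e^z, the ratio is e^(b0+bj) / e^b0 = e^bj regardless of the intercept.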
More parameter interpretation
• Similarly, assuming that the jth explanatory variable is
continuous, we have
βj = d/dxitj (xit′β)
= d/dxitj ln [ Prob(yit = 1 | xitj) / (1 − Prob(yit = 1 | xitj)) ]
= [ d/dxitj ( Prob(yit = 1 | xitj) / (1 − Prob(yit = 1 | xitj)) ) ]
/ [ Prob(yit = 1 | xitj) / (1 − Prob(yit = 1 | xitj)) ].
• Thus, we may interpret βj as the proportional change in the
odds ratio, known as an elasticity in economics.
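A numerical check of this derivative interpretation, using hypothetical coefficient values:

```python
import math

def log_odds(x, b0=-0.4, bj=0.8):   # hypothetical coefficients
    """ln of the odds ratio at covariate value x under a logit model."""
    z = b0 + bj * x
    p = 1.0 / (1.0 + math.exp(-z))
    return math.log(p / (1.0 - p))

# Central-difference derivative of the log odds with respect to x.
h = 1e-6
deriv = (log_odds(1.0 + h) - log_odds(1.0 - h)) / (2.0 * h)
print(deriv)   # numerically ~ 0.8 = beta_j, at any value of x
```

Because logit and π are inverses, the log odds is exactly linear in x, so the derivative equals βj everywhere.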
Parameter estimation
• The customary estimation method is maximum likelihood.
• The log likelihood of a single observation is
yit ln π(xit′β) + (1 − yit) ln(1 − π(xit′β)),
which equals ln π(xit′β) if yit = 1 and ln(1 − π(xit′β)) if yit = 0.
• The log likelihood of the data set is
Σit { yit ln π(xit′β) + (1 − yit) ln(1 − π(xit′β)) }.
• Taking partial derivatives with respect to β yields the score equations
Σit xit ( yit − π(xit′β) ) π′(xit′β) / [ π(xit′β)(1 − π(xit′β)) ] = 0.
– The solution of these equations, say bMLE, yields the maximum
likelihood estimate.
• The score equations can also be expressed as a generalized estimating
equation:
Σit ( ∂ E yit / ∂β ) ( Var yit )⁻¹ ( yit − E yit ) = 0,
• where
E yit = π(xit′β),  ∂ E yit / ∂β = xit π′(xit′β),
Var yit = π(xit′β)(1 − π(xit′β)).
For the logit function
• The normal equations are:
Σit xit ( yit − π(xit′β) ) = 0.
– The solution depends on the responses yit only through the vector of
statistics Σit xit yit.
• The solution of these equations yields the maximum likelihood
estimate bMLE.
• This method can be extended to provide standard errors for
the estimates.
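A sketch of maximum likelihood estimation via Newton–Raphson applied to these normal equations, on simulated data with a single covariate (all values are hypothetical):

```python
import math, random

random.seed(1)

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

# Simulated data from a hypothetical one-covariate logit model, beta = 1.5.
beta_true = 1.5
data = []
for _ in range(5000):
    x = random.gauss(0.0, 1.0)
    y = 1 if random.random() < pi(beta_true * x) else 0
    data.append((x, y))

# Newton-Raphson on the logit normal equations: sum_it x (y - pi(x*b)) = 0.
b = 0.0
for _ in range(25):
    score = sum(x * (y - pi(x * b)) for x, y in data)
    info = sum(x * x * pi(x * b) * (1.0 - pi(x * b)) for x, _ in data)
    b += score / info

# Model-based standard error from the inverse information at b_MLE.
se = 1.0 / math.sqrt(sum(x * x * pi(x * b) * (1.0 - pi(x * b)) for x, _ in data))
print(b, se)   # b_MLE near 1.5, with its standard error
```

The information matrix used in the Newton step also supplies the standard errors mentioned above.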
9.2 Random effects models
• We accommodate heterogeneity by incorporating subject-specific
variables of the form:
pit = π(αi + xit′β).
– We assume that the intercepts are realizations of random variables
from a common distribution.
• We estimate the parameters of the {αi} distribution and the K slope
parameters β.
• By using the random effects specification, we dramatically reduce the
number of parameters to be estimated compared to the Section 9.3 fixed
effects set-up.
– This is similar to the linear model case.
• This model is computationally difficult to evaluate.
Commonly used distributions
• We assume that subject-specific effects are independent and come from a
common distribution.
– It is customary to assume that the subject-specific effects are normally
distributed.
• We assume, conditional on subject-specific effects, that the responses are
independent. Thus, there is no serial correlation.
• There are two commonly used specifications of the conditional
distributions in the random effects panel data model:
– 1. A logistic model for the conditional distribution of a response, that is,
Prob(yit = 1 | αi) = π(αi + xit′β) = 1 / (1 + exp(−(αi + xit′β))).
– 2. A normal model for the conditional distribution of a response, that is,
Prob(yit = 1 | αi) = Φ(αi + xit′β),
where Φ is the standard normal distribution function.
Likelihood
• Let Prob(yit = 1 | αi) = π(αi + xit′β) denote the conditional
probability for both the logistic and normal models.
• Conditional on αi, the likelihood for the it-th observation is:
π(αi + xit′β)^yit (1 − π(αi + xit′β))^(1−yit),
which equals 1 − π(αi + xit′β) if yit = 0 and π(αi + xit′β) if yit = 1.
• Conditional on αi, the likelihood for the ith subject is:
Π(t=1 to Ti) π(αi + xit′β)^yit (1 − π(αi + xit′β))^(1−yit).
• Thus, the (unconditional) likelihood for the ith subject is:
li = ∫ Π(t=1 to Ti) π(σa + xit′β)^yit (1 − π(σa + xit′β))^(1−yit) φ(a) da,
writing αi = σa.
– Here, φ is the standard normal density function.
• Hence, the total log-likelihood is Σi ln li.
– Note: lots of evaluations of a numerical integral….
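The numerical integral in li is commonly approximated by Gauss–Hermite quadrature; a sketch for one subject under the logit specification, writing αi = σa (the values of y and x′β below are hypothetical):

```python
import math
import numpy as np

def pi(z):
    return 1.0 / (1.0 + np.exp(-z))

def subject_loglik(y, xb, sigma, n_nodes=30):
    """ln l_i = ln integral of prod_t pi(sigma*a + x'b)^y (1-pi)^(1-y) phi(a) da,
    approximated by Gauss-Hermite quadrature with the change of variable
    a = sqrt(2) * u, so that the phi(a) weight matches exp(-u^2)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    total = 0.0
    for w, u in zip(weights, nodes):
        a = math.sqrt(2.0) * u
        p = pi(sigma * a + xb)
        lik = np.prod(np.where(y == 1, p, 1.0 - p))
        total += w * lik
    return math.log(total / math.sqrt(math.pi))

y = np.array([1, 0, 1, 1])
xb = np.array([0.2, -0.1, 0.5, 0.0])    # hypothetical x'_it beta for T_i = 4
print(subject_loglik(y, xb, sigma=1.0))
```

With σ = 0 the integral collapses to the homogeneous-model likelihood, which gives a simple check of the quadrature.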
Comparing logit to probit specification
• There are no important advantages or disadvantages when
choosing the conditional probability to be:
– logit function (logit model)
– standard normal (probit model)
• The likelihood involves roughly the same amount of work to
evaluate and maximize, although the logit function is slightly
easier to evaluate than the standard normal distribution function.
• The probit model is slightly easier to interpret because
unconditional probabilities can be expressed in terms of the
standard normal distribution function.
• That is,
Prob(yit = 1) = E Φ(αi + xit′β) = Φ( xit′β / √(1 + σ²) ).
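This identity can be verified numerically by integrating Φ(αi + x′β) against the normal density; a sketch with hypothetical values of x′β and σ:

```python
import math
import numpy as np

def Phi(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

xb, sigma = 0.8, 1.3                      # hypothetical values
# Left side: E Phi(alpha + x'b) with alpha ~ N(0, sigma^2),
# computed by Gauss-Hermite quadrature (alpha = sigma * sqrt(2) * u).
nodes, weights = np.polynomial.hermite.hermgauss(60)
lhs = sum(w * Phi(sigma * math.sqrt(2.0) * u + xb)
          for u, w in zip(nodes, weights)) / math.sqrt(math.pi)
# Right side: the closed form.
rhs = Phi(xb / math.sqrt(1.0 + sigma ** 2))
print(lhs, rhs)   # the two sides agree
```

The closed form follows because Φ(α + c) = Prob(Z ≤ α + c) for an independent standard normal Z, and Z − α is normal with variance 1 + σ².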
9.3 Fixed effects models
• As with homogeneous models, we express the probability of the
response being 1 as a nonlinear function of linear combinations of
explanatory variables.
• To accommodate heterogeneity, we incorporate subject-specific
variables of the form:
pit = π(αi + xit′β).
– Here, the subject-specific effects account only for the intercepts and
do not include other variables.
– We assume that {αi} are fixed effects in this section.
• In this chapter, we assume that responses are serially uncorrelated.
• Important point: Panel data with dummy variables provide inconsistent
parameter estimates….
Maximum likelihood estimation
• Unlike random effects models, maximum likelihood estimators are inconsistent in
fixed effects models.
– The log likelihood of the data set is
Σit { yit ln π(αi + xit′β) + (1 − yit) ln(1 − π(αi + xit′β)) }.
– This log likelihood can still be maximized to yield maximum
likelihood estimators.
– However, as the subject size n tends to infinity, the number of parameters also tends to
infinity.
• Intuitively, our ability to estimate β is corrupted by our inability to estimate
consistently the subject-specific effects {αi}.
– In the linear case, we had that the maximum likelihood estimates are equivalent to the least
squares estimates.
• The least squares estimates of β were consistent.
• The least squares procedure “swept out” intercept estimators when producing
estimates of β.
Maximum likelihood estimation is
inconsistent
• Example 9.2 (Chamberlain, 1978; Hsiao, 1986).
– Let Ti = 2, K = 1, and xi1 = 0 and xi2 = 1.
– Take derivatives of the likelihood function to get the
score functions – these are in display (9.8).
– From (9.8), the score functions are
∂L/∂αi = yi1 + yi2 − ( e^(αi) / (1 + e^(αi)) + e^(αi+β) / (1 + e^(αi+β)) ) = 0
– and
∂L/∂β = Σi ( yi2 − e^(αi+β) / (1 + e^(αi+β)) ) = 0.
– Appendix 9A.1:
• Maximize this to get bMLE.
• Show that the probability limit of bMLE is 2β, and hence it is an
inconsistent estimator of β.
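The inconsistency can be seen in simulation. The sketch below takes as given the closed form bMLE = 2 ln(n01/n10) for this Ti = 2 design, where n01 and n10 count subjects with response patterns (0,1) and (1,0); this closed form is a known result for the example (cf. Hsiao, 1986), not derived here:

```python
import math, random

random.seed(2)

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

# Example 9.2 design: T_i = 2, x_i1 = 0, x_i2 = 1, alpha_i ~ N(0, 1).
beta = 1.0
n01 = n10 = 0                    # counts of response patterns (0,1) and (1,0)
for _ in range(200_000):
    a = random.gauss(0.0, 1.0)
    y1 = random.random() < pi(a)
    y2 = random.random() < pi(a + beta)
    if (not y1) and y2:
        n01 += 1
    if y1 and (not y2):
        n10 += 1

b_cmle = math.log(n01 / n10)     # conditional MLE, consistent for beta
b_mle = 2.0 * b_cmle             # fixed effects MLE for this design
print(b_cmle, b_mle)             # near beta = 1 and 2*beta = 2, respectively
```

The fixed effects MLE lands near 2β rather than β no matter how large n grows, which is exactly the probability-limit result cited above.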
Conditional maximum likelihood
estimation
• This estimation technique provides consistent estimates of the
beta coefficients.
– It is due to Chamberlain (1980) in the context of fixed
effects panel data models.
• Let’s consider the logit specification of π, so that
pit = π(αi + xit′β) = 1 / (1 + exp(−(αi + xit′β))).
• Big idea: With this specification, it turns out that Σt yit is a
sufficient statistic for αi.
– Thus, if we condition on Σt yit, then the distribution of the
responses will not depend on αi.
Example of the sufficiency
• To illustrate how to separate the intercept from the slope
effects, consider the case Ti = 2.
– Suppose that the sum, Σt yit = yi1 + yi2, equals either 0 or 2.
• If the sum equals 0, then Prob(yi1 = 0, yi2 = 0 | yi1 + yi2 = 0) = 1.
• If the sum equals 2, then Prob(yi1 = 1, yi2 = 1 | yi1 + yi2 = 2) = 1.
• Neither conditional probability depends on αi.
• Both conditional events are certain and will contribute nothing
to a conditional likelihood.
– If the sum equals 1,
Prob(yi1 + yi2 = 1) = Prob(yi1 = 0) Prob(yi2 = 1) + Prob(yi1 = 1) Prob(yi2 = 0)
= [ exp(αi + xi1′β) + exp(αi + xi2′β) ]
/ [ (1 + exp(αi + xi1′β)) (1 + exp(αi + xi2′β)) ].
Example of the sufficiency
• Thus,
Prob(yi1 = 0, yi2 = 1 | yi1 + yi2 = 1)
= Prob(yi1 = 0) Prob(yi2 = 1) / Prob(yi1 + yi2 = 1)
= exp(αi + xi2′β) / [ exp(αi + xi1′β) + exp(αi + xi2′β) ]
= exp(xi2′β) / [ exp(xi1′β) + exp(xi2′β) ].
• This does not depend on αi.
– Note that if an explanatory variable xij is time-constant (xi1,j = xi2,j),
then the corresponding parameter βj disappears
from the conditional likelihood.
Conditional likelihood estimation
• Let Si be the random variable representing Σt yit and let sumi be
the realization of Σt yit.
• The conditional likelihood of the data set is
Π(i=1 to n) [ Π(t=1 to Ti) pit^yit (1 − pit)^(1−yit) ] / Prob(Si = sumi).
– Note that the ratio equals one when sumi equals 0 or Ti.
– The distribution of Si is messy and is difficult to compute
for moderate size data sets with T more than 10.
• This provides a fix for the problem of “infinitely many
nuisance parameters.”
– Computationally difficult, hard to extend to more complex
models, hard to explain to consumers
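A sketch of one subject's conditional likelihood contribution under the fixed effects logit, computing Prob(Si = sumi) by brute-force enumeration of response sequences (which illustrates why T much beyond 10 becomes expensive):

```python
import math
from itertools import product

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditional_loglik(y, x, b):
    """ln Prob(y | sum_t y_t) for one subject under the fixed effects logit
    with a single covariate.  exp(alpha_i) cancels between numerator and
    denominator, so only exp(sum_t y_t * x_t * b) terms remain."""
    s = sum(y)
    if s == 0 or s == len(y):
        return 0.0   # certain events contribute nothing to the likelihood
    num = math.exp(sum(yt * xt * b for yt, xt in zip(y, x)))
    # Denominator: sum over all 2^T sequences z with the same total s.
    den = sum(math.exp(sum(zt * xt * b for zt, xt in zip(z, x)))
              for z in product((0, 1), repeat=len(y)) if sum(z) == s)
    return math.log(num / den)

# Example 9.2 design (T = 2, x = (0, 1)) at a hypothetical b = 0.8:
print(conditional_loglik((0, 1), (0.0, 1.0), 0.8))
```

For the T = 2 design, the contribution of a (0, 1) subject reduces to ln π(b), matching the conditional probability derived on the previous slides.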
9.4 Marginal models and GEE
• Marginal models, also known as “population-averaged” models,
only require specification of the first two moments
– Means, variances and covariances
– Not a true probability model
– Ideal for moment estimation (GEE, GMM)
• Begin in the context of the random effects binary dependent
variable model
– The mean is E yit = μit = μit(β, τ) = ∫ π(a + xit′β) dF(a),
where F is the distribution function of αi.
– The variance is Var yit = μit (1 − μit).
– The covariance is Cov(yir, yis)
= ∫ π(a + xir′β) π(a + xis′β) dF(a) − μir μis.
GEE – generalized estimating equations
• This is a method of moments procedure
– Essentially the same as generalized method of moments
– One matches theoretical moments to sample moments, with
appropriate weighting.
• Idea – find the values of the parameters that satisfy
0K = Σ(i=1 to n) Gμ(bEE, τEE) Vi(bEE, τEE)⁻¹ ( yi − μi(bEE, τEE) ).
– We have already specified the variance matrix.
– We also use a K × Ti matrix of derivatives
Gμ(β, τ) = ∂μi(β, τ)/∂β′ = ( ∂μi1/∂β, …, ∂μiTi/∂β ).
– For binary variables, we have
∂μit/∂β = xit ∫ π′(a + xit′β) dF(a).
Marginal Model
• Choose the mean function to be μit = Φ(xit′β).
– Motivated by the probit specification
Prob(yit = 1) = E Φ(αi + xit′β) = Φ( xit′β / √(1 + σ²) ).
• For the variance function, consider Var yit = it (1- it).
• Let Corr(yir, yis) denote the correlation between yir and yis.
– This is known as a working correlation.
• Use the exchangeable correlation structure specified as
Corr(yir, yis) = 1 for r = s, and ρ for r ≠ s.
• Here, the motivation is that the latent variable αi is common to
all observations within a subject, thus inducing a common
correlation.
• The parameters τ = (σ, ρ) constitute the variance components.
Robust Standard Errors
• Model-based standard errors are taken from the square root of
the diagonal elements of
( Σ(i=1 to n) Gμ(bEE, τEE) Vi(bEE, τEE)⁻¹ Gμ(bEE, τEE)′ )⁻¹.
• As an alternative, robust or empirical standard errors are
from
( Σi Gμ Vi⁻¹ Gμ′ )⁻¹ [ Σi Gμ Vi⁻¹ (yi − μi)(yi − μi)′ Vi⁻¹ Gμ′ ] ( Σi Gμ Vi⁻¹ Gμ′ )⁻¹.
• These are robust to misspecified heteroscedasticity as well as
time series correlation.
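A sketch of the model-based and robust (sandwich) covariance computations, assuming the per-subject gradient matrices Gi (K × Ti), working covariances Vi, and residuals yi − μi are already in hand; all arrays below are placeholder data, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder inputs: n subjects, K parameters, T observations per subject.
K, T, n = 2, 4, 500
G = rng.normal(size=(n, K, T))          # gradients G_i = d mu_i / d beta
V = np.stack([np.eye(T)] * n)           # identity working covariance
resid = rng.normal(size=(n, T))         # placeholder residuals y_i - mu_i

bread = np.zeros((K, K))
meat = np.zeros((K, K))
for i in range(n):
    Vinv = np.linalg.inv(V[i])
    bread += G[i] @ Vinv @ G[i].T       # sum_i G V^-1 G'
    u = G[i] @ Vinv @ resid[i]          # sum_i G V^-1 (y - mu)
    meat += np.outer(u, u)

model_cov = np.linalg.inv(bread)            # model-based covariance
robust_cov = model_cov @ meat @ model_cov   # sandwich: bread^-1 meat bread^-1
print(np.sqrt(np.diag(model_cov)))          # model-based standard errors
print(np.sqrt(np.diag(robust_cov)))         # robust standard errors
```

The "bread" is the model-based expression; the "meat" replaces the assumed covariance with the empirical outer product of residual scores, which is what delivers robustness to a misspecified working covariance.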