
MATH5806 Applied Regression Analysis

Lecture 4 - Logistic Regression

Boris Beranger
Term 2, 2021

1/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

2/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

3/80
Goals

This lecture is about obtaining estimates for

• Logistic regression

This is a case where numerical solutions are needed.

4/80
Introduction

For independent random variables Yi , linear models of the form

E(Y_i) = \mu_i = x_i^\top \beta, \qquad Y_i \sim N(\mu_i, \sigma^2),

are the basis of most analyses of continuous data.

Advances in theory and software allow us to extend the theory we have developed in
Chapter 3 to models where:

• Response variables have distributions other than the normal distribution


• association between the response and the explanatory variables is not linear

5/80
Introduction

Again, suppose we have a set of observations (y1 , x1 ), . . . , (yN , xN ).

We want to extend the estimation of the parameters β to the situation where

E(Y_i) = \mu_i \qquad (1)

is related to the covariates in a non-linear way,

g(\mu_i) = x_i^\top \beta. \qquad (2)

The function g is called the link function.

Most of the course will focus on situations where g is a simple mathematical function, but link functions may also be estimated numerically, as in generalised additive models, which we will introduce towards the end of the course.
6/80
Introduction

Generalised linear models are defined for independent random variables Y_1, Y_2, \ldots, Y_N distributed according to a distribution from the exponential family:

• The distribution of each Y_i has the canonical form

f(y_i \mid \theta_i) = \exp\left[ y_i\, b_i(\theta_i) + c_i(\theta_i) + d_i(y_i) \right] \qquad (3)

• The distributions of all the Y_i's are of the same form, i.e. b_i(\theta_i) = b(\theta_i), c_i(\theta_i) = c(\theta_i) and d_i(y_i) = d(y_i)
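
For example, the Bernoulli distribution with success probability \pi_i fits this canonical form. With \theta_i = \pi_i,

f(y_i \mid \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i} = \exp\left[ y_i \log\left( \frac{\pi_i}{1 - \pi_i} \right) + \log(1 - \pi_i) \right],

so that b(\theta_i) = \log\left( \frac{\pi_i}{1 - \pi_i} \right), c(\theta_i) = \log(1 - \pi_i) and d(y_i) = 0; this is the form used for logistic regression later in this lecture.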

7/80
Introduction

For model specification, we are interested in the effects of the covariates on the response variables, described by the parameters \beta_1, \ldots, \beta_p (with p < N).

Suppose E(Y_i) = \mu_i; then we are interested in a transformation of \mu_i,

g(\mu_i) = x_i^\top \beta.

• g is monotone and differentiable; it is called the link function
• The vector x_i is a p × 1 vector of explanatory variables, x_i^\top = [X_{i1}, \ldots, X_{ip}], and x_i^\top is the i-th row of the design matrix X
• \beta is a p × 1 vector of parameters, \beta^\top = [\beta_1, \ldots, \beta_p]
8/80
Introduction

Remark: The process of predicting qualitative responses, as in the case of logistic


regression, is often referred to as classification, because it involves assigning an
observation to a category or class.
Example
Problems of classification are extremely important in applied sciences, e.g.:

• A person arrives at the emergency room with a set of symptoms that could
possibly be attributed to one of three medical conditions; which of the three
conditions does the individual have?
• An online banking service must be able to determine whether or not a transaction
is fraudulent, on the basis of the IP address, past transactions, etc.
• On the basis of DNA sequence data, we would like to predict whether a strain of
tuberculosis is resistant to a particular drug.
9/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

10/80
Maximum likelihood estimation

Let’s recall the following results from the previous lecture.

The joint distribution is

f(Y_1, \ldots, Y_N \mid \theta_1, \ldots, \theta_N) = \prod_{i=1}^{N} \exp\left[ Y_i b(\theta_i) + c(\theta_i) + d(Y_i) \right]
= \exp\left[ \sum_{i=1}^{N} Y_i b(\theta_i) + \sum_{i=1}^{N} c(\theta_i) + \sum_{i=1}^{N} d(Y_i) \right]

For each Y_i, the log-likelihood is \ell_i = Y_i b(\theta_i) + c(\theta_i) + d(Y_i), which gives

E(Y_i) = \mu_i = -\frac{c'(\theta_i)}{b'(\theta_i)}, \qquad
Var(Y_i) = \frac{b''(\theta_i)\, c'(\theta_i) - c''(\theta_i)\, b'(\theta_i)}{[b'(\theta_i)]^3}, \qquad
g(\mu_i) = x_i^\top \beta = \eta_i.

11/80
Maximum likelihood estimation

The log-likelihood for all the Y_i's is then

\ell(\theta; Y_1, \ldots, Y_N) = \sum_{i=1}^{N} \ell_i = \sum_{i=1}^{N} Y_i b(\theta_i) + \sum_{i=1}^{N} c(\theta_i) + \sum_{i=1}^{N} d(Y_i).

To obtain the maximum likelihood estimator for the parameter \beta_j we derive the score function using the chain rule:

U_j = \frac{d\ell}{d\beta_j} = \sum_{i=1}^{N} \frac{d\ell_i}{d\beta_j} = \sum_{i=1}^{N} \left( \frac{d\ell_i}{d\theta_i} \cdot \frac{d\theta_i}{d\mu_i} \cdot \frac{d\mu_i}{d\beta_j} \right). \qquad (4)

12/80
Maximum likelihood estimation

Now, let's consider each term separately:

• \frac{d\ell_i}{d\theta_i} = Y_i b'(\theta_i) + c'(\theta_i) = b'(\theta_i)(Y_i - \mu_i)

• \frac{d\theta_i}{d\mu_i} = \left( \frac{d\mu_i}{d\theta_i} \right)^{-1} = \left( -\frac{c''(\theta_i)}{b'(\theta_i)} + \frac{c'(\theta_i)\, b''(\theta_i)}{[b'(\theta_i)]^2} \right)^{-1} = \left( b'(\theta_i)\, Var(Y_i) \right)^{-1}

• \frac{d\mu_i}{d\beta_j} = \frac{d\mu_i}{d\eta_i} \cdot \frac{d\eta_i}{d\beta_j} = \frac{d\mu_i}{d\eta_i}\, X_{ij}

Hence, the score function is

U_j = \sum_{i=1}^{N} \frac{(Y_i - \mu_i)}{Var(Y_i)}\, X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right). \qquad (5)

13/80
Maximum likelihood estimation

The variance-covariance matrix of the score is

I_{jk} = E[U_j U_k], \qquad (6)

which represents the information matrix:

I_{jk} = E\left\{ \left[ \sum_{i=1}^{N} \frac{(Y_i - \mu_i)}{Var(Y_i)} X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right) \right] \left[ \sum_{l=1}^{N} \frac{(Y_l - \mu_l)}{Var(Y_l)} X_{lk} \left( \frac{d\mu_l}{d\eta_l} \right) \right] \right\}
= \sum_{i=1}^{N} \frac{E[(Y_i - \mu_i)^2]\, X_{ij} X_{ik}}{[Var(Y_i)]^2} \left( \frac{d\mu_i}{d\eta_i} \right)^2,

because E[(Y_i - \mu_i)(Y_l - \mu_l)] = 0 for i \neq l.

14/80
Maximum likelihood estimation

Since E[(Y_i - \mu_i)^2] = Var(Y_i),

I_{jk} = \sum_{i=1}^{N} \frac{X_{ij} X_{ik}}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2. \qquad (7)

For the linear model, where E(Y_i) = x_i^\top \beta and hence \mu_i = \eta_i (i.e. g(\mu_i) = \mu_i), we have d\mu_i / d\eta_i = 1 and Var(Y_i) = \sigma^2, so the information matrix reduces to

I = \frac{1}{\sigma^2} X^\top X.

For a general link function we no longer have this simplification and cannot write I in this simple form.

15/80
Maximum likelihood estimation

If we want to apply the method of scoring to approximate the MLE, the estimating equation is

\hat{\beta}^{(m)} = \hat{\beta}^{(m-1)} + [I^{(m-1)}]^{-1} U^{(m-1)},

or equivalently

[I^{(m-1)}]\, \hat{\beta}^{(m)} = [I^{(m-1)}]\, \hat{\beta}^{(m-1)} + U^{(m-1)}. \qquad (8)

From Equation (7), the information matrix can be written as

I = X^\top W X, \qquad (9)

where W is diagonal with w_{ii} = \frac{1}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2, evaluated at \beta.

16/80
Maximum likelihood estimation

Finally, the expression on the right-hand side of (8) can be written as

\sum_{k=1}^{p} \sum_{i=1}^{N} \frac{X_{ij} X_{ik}}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 \hat{\beta}_k^{(m-1)} + \sum_{i=1}^{N} \frac{(Y_i - \mu_i)\, X_{ij}}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right),

which can be written in matrix terms as

I^{(m-1)}\, \hat{\beta}^{(m-1)} + U^{(m-1)} = X^\top W z, \qquad (10)

where

z_i = \sum_{k=1}^{p} X_{ik}\, \hat{\beta}_k^{(m-1)} + (Y_i - \mu_i) \left( \frac{d\eta_i}{d\mu_i} \right),

with \mu_i and d\eta_i / d\mu_i evaluated at \hat{\beta}^{(m-1)}.

17/80
Maximum likelihood estimation

Therefore (8) reads:

X^\top W^{(m-1)} X\, \hat{\beta}^{(m)} = X^\top W^{(m-1)} z^{(m-1)}

This has the same form as the weighted least squares equation. Note, however, this
needs to be solved iteratively, since W and z have to be recalculated at each
optimisation step.

Therefore, for generalised linear models, maximum likelihood estimators are obtained
by an iterative weighted least squares procedure.
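
As an illustration, the following is a minimal R sketch of this Fisher scoring / IRLS update, specialised to the logistic (binary) case treated later in this chapter. The function name irls_logistic and the simulated data are illustrative only, not part of the lecture material.

# Minimal sketch of Fisher scoring / IRLS for logistic regression.
irls_logistic <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                    # starting values
  for (m in seq_len(maxit)) {
    eta <- as.vector(X %*% beta)             # linear predictor eta_i
    mu  <- 1 / (1 + exp(-eta))               # fitted probabilities mu_i
    w   <- mu * (1 - mu)                     # w_ii = (1/Var)(dmu/deta)^2 = mu(1-mu) for the logit link
    z   <- eta + (y - mu) / w                # working response z_i
    beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))
X <- cbind(1, x)
irls_logistic(X, y)                          # should agree with glm()
coef(glm(y ~ x, family = binomial))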

18/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

19/80
General logistic regression

We now focus on models where the outcome variables are measured on a binary scale.

We define a binary random variable

Y = \begin{cases} 1 & \text{if the outcome is a “success” (probability } \pi) \\ 0 & \text{if the outcome is a “failure” (probability } 1 - \pi) \end{cases} \qquad (11)

i.e. Y has a Bernoulli distribution.

The goal of the analysis is to relate the probability of success of the Bernoulli distribution, \pi_i, to a set of explanatory variables x_i, i.e.

\Pr(Y_i = 1 \mid X_{i1}, \ldots, X_{ip}) = \pi_i \qquad \text{for } i = 1, \ldots, N. \qquad (12)


20/80
General logistic regression

The joint likelihood function is

f(y_1, \ldots, y_N \mid \pi) = \prod_{i=1}^{N} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}
= \exp\left[ \sum_{i=1}^{N} y_i \log\left( \frac{\pi_i}{1 - \pi_i} \right) + \sum_{i=1}^{N} \log(1 - \pi_i) \right]. \qquad (13)

21/80
General logistic regression

We want to describe the probability of success with respect to some predictors,

g(\pi_i) = x_i^\top \beta, \qquad (14)

so as to take into account that

• the response variable is binary and not continuous;
• the response variable is bounded (in [0, 1]);
• the variance is not constant: Var(Y_i) = \pi_i(1 - \pi_i).

Similar considerations apply to ordinal response variables.

22/80
General logistic regression

Consider the Default dataset from the ISLR R package. We want to estimate the
probability of default as a function of balance.

Figure 1: Estimated probability of default using linear (left) and logistic (right) regression.
23/80
General logistic regression

The following code can be used to replicate the previous plots.


1 library ( ISLR )
2 data ( " Default " )
3 attach ( Default )
4

5 default . bin <- rep (0 , length ( default )) # initialise binary vector


6 default . bin [ default == " Yes " ] <- 1
7

8 par ( mfrow = c (1 ,2)) # To put the plots next to each other


9

10 # Linear model
11 lm <- lm ( default . bin ~ balance )
12 plot ( balance , default . bin , pch =3 , col = " orange " ,
13 xlab = " Balance " , ylab = " Probability of Default " )
14 abline ( h = c (0 ,1) , lty =2)
15 abline ( a = lm $ coefficients [1] , b = lm $ coefficients [2] , col = " blue " , lwd =3)
24/80
General logistic regression

The following code can be used to replicate the previous plots.


16 # Logistic model
17 log <- glm ( default . bin ~ balance , family = " binomial " )
18 o <- order ( balance )
19

20 plot ( balance , default . bin , pch =3 , col = " orange " , xlab = " Balance " ,
21 ylab = " Probability of Default " )
22 abline ( h = c (0 ,1) , lty =2)
23 lines ( balance [ o ] , log $ fitted . values [ o ] , col = " blue " , lwd =3)

25/80
General logistic regression

Example
We want to predict the medical condition of a patient in the emergency room on the
basis of the symptoms. Let’s suppose we have three possible diagnoses:

Y = \begin{cases} 1 & \text{stroke} \\ 2 & \text{drug overdose} \\ 3 & \text{epileptic seizure} \end{cases}

Using a linear regression would assume that

• the ordering is meaningful (but the numbers 1, 2 and 3 are just labels!)
• the difference between “stroke” and “drug overdose” has the same meaning as that between “drug overdose” and “epileptic seizure”
26/80
General logistic regression

The general logistic regression model is

\text{logit}(\pi_i) = \log\left( \frac{\pi_i}{1 - \pi_i} \right) = x_i^\top \beta, \qquad (15)

where x_i is a vector of either continuous measurements or categorical variables and \beta is a parameter vector. Recall that \pi_i / (1 - \pi_i) is an odds, taking values between 0 and \infty, indicating very low to very high probabilities. This means that

\frac{\pi_i}{1 - \pi_i} = \exp[x_i^\top \beta]
\pi_i = \exp[x_i^\top \beta] - \pi_i \exp[x_i^\top \beta]
(1 + \exp[x_i^\top \beta])\, \pi_i = \exp[x_i^\top \beta]
\pi_i = \frac{\exp[x_i^\top \beta]}{1 + \exp[x_i^\top \beta]}
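
As a small sketch (not from the slides), the logit and its inverse are available in R as qlogis() and plogis(), which match the expressions above:

eta <- c(-2, 0, 1.5)               # some illustrative values of x_i' beta
pi  <- exp(eta) / (1 + exp(eta))   # inverse logit, as derived above
all.equal(pi, plogis(eta))         # TRUE: plogis() is the inverse logit
all.equal(qlogis(pi), eta)         # TRUE: qlogis() is the logit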
27/80
General logistic regression

The log-likelihood can then be rewritten with respect to \beta:

\ell(\beta; y, x) = \sum_{i=1}^{N} \left[ y_i \log\left( \frac{\exp[x_i^\top \beta]}{1 + \exp[x_i^\top \beta]} \right) + (1 - y_i) \log\left( \frac{1}{1 + \exp[x_i^\top \beta]} \right) \right]. \qquad (16)

The estimation process is the same if Yi is binomially distributed instead of Bernoulli


distributed, with the corresponding modification to consider the number of trials.
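
As a quick sketch (with simulated data and illustrative names, not from the lecture), the log-likelihood (16) can be evaluated directly and compared with logLik() applied to a fitted glm object:

set.seed(2)
x   <- rnorm(100)
y   <- rbinom(100, 1, plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial)

eta <- as.vector(cbind(1, x) %*% coef(fit))
ll  <- sum(y * log(exp(eta) / (1 + exp(eta))) + (1 - y) * log(1 / (1 + exp(eta))))
c(manual = ll, logLik = as.numeric(logLik(fit)))   # the two values should agree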

If the goal is prediction, one might predict

\hat{Y}_{N+1} = 1 \quad \text{if} \quad \hat{\pi}_{N+1} = \Pr(Y_{N+1} = 1 \mid x_{N+1}) > 0.5.

However, other thresholds could be used: e.g. if we want to be particularly conservative, we can set the threshold to 0.1.
28/80
Analysis of trade union dataset
Example
The trade union data were collected in 1985 and are available in the R package SemiPar. The variable union.member is binary while the variables age and wage are continuous. We illustrate model comparison with this data set.
The figure below shows logistic regression fitted to the two variables separately, with union membership as the response.

1 library ( SemiPar )
2

3 data ( " trade . union " )


4 attach ( trade . union )
5

6 plot ( wage , union . member , main = " Trade Union Dataset " , pch =3 ,
7 ylab = " Union Membership " , xlab = " Wage " )
8 abline ( h = c (0 ,1) , lty =2)
29/80
Analysis of trade union dataset

[Figure: Union Membership (0/1) plotted against Wage, Trade Union Dataset.]

30/80
Analysis of trade union dataset

The code below fits the logistic regression model, displays the fitted line and prints the output.
9 union . wage . glm <- glm ( union . member ~ wage , family = " binomial " )
10 o <- order ( wage )
11

12 lines ( wage [ o ] , union . wage . glm $ fitted . values [ o ] , col = " blue " , lwd =3)
13

14 summary ( union . wage . glm )

31/80
Analysis of trade union dataset

[Figure: Union Membership against Wage with the fitted logistic regression curve, Trade Union Dataset.]

32/80
Analysis of trade union dataset

15 # Call :
16 # glm ( formula = union . member ~ wage , family = " binomial ")
17 #
18 # Deviance Residuals :
19 # Min 1Q Median 3Q Max
20 # -1.6140 -0.6338 -0.5592 -0.5174 2.0590
21 #
22 # Coefficients :
23 # Estimate Std . Error z value Pr ( >| z |)
24 # ( Intercept ) -2.20696 0.23298 -9.473 < 2e -16 * * *
25 # wage 0.07174 0.02005 3.577 0.000347 * * *
26 # ---
27 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
28 #

33/80
Analysis of trade union dataset

29 # ( Dispersion parameter for binomial family taken to be 1)


30 #
31 # Null deviance : 503.08 on 533 degrees of freedom
32 # Residual deviance : 490.50 on 532 degrees of freedom
33 # AIC : 494.5
34 #
35 # Number of Fisher Scoring iterations : 4

34/80
Analysis of trade union dataset

Let's now model union.member as a function of age


36 plot ( jitter ( age ) , union . member , main = " Trade Union Dataset " ,
37 col = " orange " , pch =3 , ylab = " Union Membership " , xlab = " Age " )
38 abline ( h = c (0 ,1) , lty =2)

35/80
Analysis of trade union dataset

[Figure: Union Membership (0/1) plotted against Age, Trade Union Dataset.]

36/80
Analysis of trade union dataset

The code below fits the logistic regression model, displays the fitted line and prints the output.
39 union . age . glm <- glm ( union . member ~ age , family = " binomial " )
40 o <- order ( age )
41

42 lines ( age [ o ] , union . age . glm $ fitted . values [ o ] , col = " blue " , lwd =3)
43

44 summary ( union . age . glm )

37/80
Analysis of trade union dataset

[Figure: Union Membership against Age with the fitted logistic regression curve, Trade Union Dataset.]

38/80
Analysis of trade union dataset

45 # Call :
46 # glm ( formula = union . member ~ wage , family = " binomial ")
47 #
48 # Deviance Residuals :
49 # Min 1Q Median 3Q Max
50 # -1.6140 -0.6338 -0.5592 -0.5174 2.0590
51 #
52 # Coefficients :
53 # Estimate Std . Error z value Pr ( >| z |)
54 # ( Intercept ) -2.20696 0.23298 -9.473 < 2e -16 * * *
55 # wage 0.07174 0.02005 3.577 0.000347 * * *
56 # ---
57 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
58 #

39/80
Analysis of trade union dataset

59 # ( Dispersion parameter for binomial family taken to be 1)


60 #
61 # Null deviance : 503.08 on 533 degrees of freedom
62 # Residual deviance : 490.50 on 532 degrees of freedom
63 # AIC : 494.5#
64 # Number of Fisher Scoring iterations : 4

Both age and wage seem to be significant, so let’s fit a model with 3 parameters.

40/80
Analysis of trade union dataset

65 union . wage . age . glm <- glm ( union . member ~ wage + age , family = " binomial " )
66 summary ( union . wage . age . glm )
67

68 # Call :
69 # glm ( formula = union . member ~ wage + age , family = " binomial ")
70 #
71 # Deviance Residuals :
72 # Min 1Q Median 3Q Max
73 # -1.3438 -0.6506 -0.5565 -0.4620 2.1485
74 #
75 # Coefficients :
76 # Estimate Std . Error z value Pr ( >| z |)
77 # ( Intercept ) -2.976095 0.426685 -6.975 3.06 e -12 * * *
78 # wage 0.065169 0.020117 3.239 0.0012 * *
79 # age 0.021861 0.009722 2.249 0.0245 *
80 # ---
81 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
41/80
Analysis of trade union dataset

82 # ( Dispersion parameter for binomial family taken to be 1)


83 #
84 # Null deviance : 503.08 on 533 degrees of freedom
85 # Residual deviance : 485.52 on 531 degrees of freedom
86 # AIC : 491.52
87 #
88 # Number of Fisher Scoring iterations : 4

42/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

43/80
Prediction

Once the coefficients have been estimated, predictions are obtained by using those
estimates with the desired level of predictors.
Example
Analysis of trade union dataset. If we want to predict the probability of union membership for someone who is 56 years old and has a wage of 6.5, we compute

\pi_{new} = \frac{\exp(\hat{\beta}_0 + 6.5\,\hat{\beta}_1 + 56\,\hat{\beta}_2)}{1 + \exp(\hat{\beta}_0 + 6.5\,\hat{\beta}_1 + 56\,\hat{\beta}_2)}

1 p _ pred <- exp ( sum ( union . wage . age . glm $ coefficients * c (1 ,6.5 ,56))) /
2 (1+ exp ( sum ( union . wage . age . glm $ coefficients * c (1 ,6.5 ,56))))
3 p _ pred
4 # [1] 0.209446

Since \hat{\pi}_{new} < 0.5, we classify the new individual as not being a union member.
44/80
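
Equivalently (a sketch, assuming the model object union.wage.age.glm fitted earlier), the same prediction can be obtained with predict() on the response scale:

newdata <- data.frame(wage = 6.5, age = 56)
predict(union.wage.age.glm, newdata = newdata, type = "response")
# about 0.209, matching the manual calculation above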
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

45/80
Goodness of fit

In assessment of goodness-of-fit for a linear model, residual plots are useful in


exhibiting violations of model assumptions (e.g. independence, homoscedasticity).

In a GLM, we would like to assign a residual ei to each observation which measures the
discrepancy between Yi and the value predicted by the fitted model. There are two
main difficulties associated with generalised linear models:

• The model variances depend on the expectations;


• It is not obvious that data and fitted values should be compared on the original
scale of the responses.

46/80
Pearson chi-squared statistic

For calculation of Pearson residuals we take the difference between observed and fitted
values and divide by an estimate of the standard deviation of the observed values.
Residuals for the binomial model. For Y_i \sim Bin(n_i, \pi_i), the Pearson residuals are

P_i = \frac{y_i - n_i \hat{\pi}_i}{\sqrt{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}}, \qquad i = 1, \ldots, N.

47/80
Pearson chi-squared statistic

Instead of maximising the likelihood, we can estimate the parameters by minimising the weighted sum of squares

S_w = \sum_{i=1}^{N} \frac{(y_i - n_i \pi_i)^2}{n_i \pi_i (1 - \pi_i)},

where E(Y_i) = n_i \pi_i and Var(Y_i) = n_i \pi_i (1 - \pi_i). This is also called the Pearson chi-squared statistic,

P^2 = \sum_{i=1}^{N} \frac{(o_i - e_i)^2}{e_i},

where o_i represents the observed frequencies and e_i the expected frequencies.

48/80
Pearson chi-squared statistic

The reason for the equivalence is

P^2 = \sum_{i=1}^{N} \frac{(y_i - n_i \pi_i)^2}{n_i \pi_i} + \sum_{i=1}^{N} \frac{[(n_i - y_i) - n_i(1 - \pi_i)]^2}{n_i (1 - \pi_i)}
= \sum_{i=1}^{N} \frac{(y_i - n_i \pi_i)^2}{n_i \pi_i (1 - \pi_i)}\, (1 - \pi_i + \pi_i) = S_w.

When the Pearson chi-squared statistic is evaluated at the estimated expected frequencies, it becomes

P^2 = \sum_{i=1}^{N} \frac{(y_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}.

49/80
Deviance

The deviance for the logistic model is

D = 2 \sum_{i=1}^{N} \left[ y_i \log\left( \frac{y_i}{n_i \hat{\pi}_i} \right) + (n_i - y_i) \log\left( \frac{n_i - y_i}{n_i - n_i \hat{\pi}_i} \right) \right]. \qquad (17)

Check this assertion.
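
As a numerical (not analytical) check, the following sketch compares (17) with the deviance reported by glm() on simulated grouped binomial data; all names and data are illustrative:

set.seed(3)
x   <- runif(20)
n   <- rep(30, 20)
y   <- rbinom(20, size = n, prob = plogis(-1 + 2 * x))
fit <- glm(cbind(y, n - y) ~ x, family = binomial)
pihat <- fitted(fit)                                                 # estimated probabilities pi_hat_i

term1 <- ifelse(y > 0, y * log(y / (n * pihat)), 0)                  # y_i log(y_i / (n_i pi_hat_i))
term2 <- ifelse(y < n, (n - y) * log((n - y) / (n - n * pihat)), 0)  # (n_i - y_i) log((n_i - y_i) / (n_i - n_i pi_hat_i))
c(formula_17 = 2 * sum(term1 + term2), glm_deviance = deviance(fit)) # the two values should match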

50/80
Deviance

It is possible to prove that the deviance is asymptotically equivalent to the Pearson chi-squared statistic, through a Taylor approximation:

D = 2 \sum_{i=1}^{N} \left[ (y_i - n_i \hat{\pi}_i) + \frac{1}{2} \frac{(y_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i} + [(n_i - y_i) - (n_i - n_i \hat{\pi}_i)] + \frac{1}{2} \frac{[(n_i - y_i) - (n_i - n_i \hat{\pi}_i)]^2}{n_i - n_i \hat{\pi}_i} + \ldots \right]
\approx \sum_{i=1}^{N} \frac{(y_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i (1 - \hat{\pi}_i)} = P^2,

where the linear terms cancel since (y_i - n_i \hat{\pi}_i) + [(n_i - y_i) - (n_i - n_i \hat{\pi}_i)] = 0.

This comes from the fact that, for s \approx t, \; s \log\left( \frac{s}{t} \right) = (s - t) + \frac{1}{2} \frac{(s - t)^2}{t} + \ldots

51/80
Deviance

Under the null hypothesis H_0 (that the model is correctly specified), the asymptotic distribution of D is

D \sim \chi^2(N - p), \qquad (18)

and therefore P^2 \sim \chi^2(N - p) as well.

The adequacy of the approximation depends on how well D or P^2 are approximated by the \chi^2 distribution. There is some evidence that P^2 is better than D; however, both are affected by small frequencies, which are typical when there are continuous covariates.

52/80
Deviance
Example (Analysis of trade union dataset)
We fitted the logistic model (M_1)

\log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1\, wage + \beta_2\, age,

where \pi_i is the probability of union membership (3 parameters). The observed deviance is d_1 = 485.5239.

Compare with the nested model (M_0)

\log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0,

where the probability of trade union membership is constant (1 parameter). The observed deviance is d_0 = 503.0841.
53/80
Deviance

Example (Analysis of trade union dataset)

We wish to test

H_0: \beta_1 = \beta_2 = 0
H_1: \beta_1, \beta_2 \text{ not both zero}

If H_0 were true, then both models would describe the data well. We would have D_0 \sim \chi^2(N - 1) and D_1 \sim \chi^2(N - 3), so that D_0 - D_1 \sim \chi^2(2). However, we observe

d_0 - d_1 = 503.0841 - 485.5239 = 17.5602,

which is larger than the 0.95 quantile of the \chi^2(2) distribution, 5.9915. Hence we reject H_0 at the 5% significance level.

54/80
Deviance

This code reproduces the calculations of the previous example.


1 d0 <- union . wage . age . glm $ null . deviance
2 d1 <- union . wage . age . glm $ deviance
3

4 alpha <- 0.05


5

6 crit . val <- qchisq (1 - alpha , df =2)


7 if ( d0 - d1 > crit . val ){
8 cat ( " We reject H0 at alpha = " , alpha , " significance level . " )
9 } else {
10 cat ( " We cannot reject H0 at alpha = " , alpha , " significance level . " )
11 }

55/80
Hosmer-Lemeshow statistic

A possible solution is to group observations, with approximately equal numbers of


observations in each group. Then the Pearson chi-squared statistic is computed on the
contingency table obtained by grouping the observations. This statistic is called the
Hosmer-Lemeshow statistic.
1 library ( doBy )
2 uniongrp <- summaryBy ( union . member ~ wage , data = trade . union ,
3 FUN = c ( sum , length ))
4 names ( uniongrp ) = c ( " x " ," y " ," n " )
5 head ( uniongrp )
6 # x y n
7 # 1 1.00 0 1
8 # 2 1.75 0 1
9 # 3 2.01 0 1
10 # 4 2.85 0 1
11 # 5 3.00 1 2
12 # 6 3.35 0 12
56/80
Hosmer-Lemeshow statistic

16 union . grp . glm <- glm ( cbind ( uniongrp $y , uniongrp $n - uniongrp $ y ) ~ uniongrp $x ,
17 family = binomial )
18 summary ( union . grp . glm )
19 # Call :
20 # glm ( formula = cbind ( uniongrp $y , uniongrp $ n - uniongrp $ y ) ~ uniongrp $x ,
21 # family = binomial )
22 #
23 # Deviance Residuals :
24 # Min 1Q Median 3Q Max
25 # -2.4667 -0.7248 -0.5911 0.2845 2.2628
26 #
27 # Coefficients :
28 # Estimate Std . Error z value Pr ( >| z |)
29 # ( Intercept ) -2.20696 0.23298 -9.473 < 2e -16 * * *
30 # uniongrp $ x 0.07174 0.02005 3.577 0.000347 * * *
31 # ---
32 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
57/80
Hosmer-Lemeshow statistic

33 # ( Dispersion parameter for binomial family taken to be 1)


34 #
35 # Null deviance : 252.50 on 237 degrees of freedom
36 # Residual deviance : 239.92 on 236 degrees of freedom
37 # AIC : 317.24
38 #
39 # Number of Fisher Scoring iterations : 4

The estimates and standard errors are the same, but the goodness-of-fit statistics differ.

58/80
Likelihood ratio χ2 statistic

Sometimes the log-likelihood of the fitted model is compared with the log-likelihood of the minimal model, the model for which all \pi_i are equal; the corresponding estimate is \tilde{\pi} = \sum_{i=1}^{N} y_i / \sum_{i=1}^{N} n_i.

The statistic is defined as

C = 2[\ell(\hat{\pi}; y) - \ell(\tilde{\pi}; y)]
= 2 \sum_{i=1}^{N} \left[ y_i \log\left( \frac{\hat{y}_i}{n_i \tilde{\pi}} \right) + (n_i - y_i) \log\left( \frac{n_i - \hat{y}_i}{n_i - n_i \tilde{\pi}} \right) \right],

where \hat{y}_i = n_i \hat{\pi}_i. Therefore C \sim \chi^2(p - 1).

59/80
Likelihood ratio χ2 statistic

1 # Minimal model
2 union0 . glm <- glm ( union . member ~ 1 , family = binomial )
3

4 Cstat <- 2 * ( logLik ( union . wage . age . glm ) - logLik ( union0 . glm ))
5 alpha <- 0.05
6 p <- length ( union . wage . age . glm $ coefficients )
7

8 if ( Cstat [1] > qchisq (1 - alpha ,p -1)){


9 cat ( " We reject H0 at alpha = " , alpha , " significance level . " )
10 } else {
11 cat ( " We cannot reject H0 at alpha = " , alpha , " significance level . " )
12 }
13 # We reject the null hypothesis

In this example the fitted model is preferred compared to the minimal model.
60/80
Pseudo-R^2

Analogously to multiple linear regression, the likelihood ratio statistic can be normalised:

pseudo-R^2 = \frac{\ell(\tilde{\pi}; y) - \ell(\hat{\pi}; y)}{\ell(\tilde{\pi}; y)}, \qquad (19)
representing the proportional improvement in the log-likelihood function due to the
terms in the model of interest, compared with the minimal model.

As for the R^2, the distribution of the pseudo-R^2 cannot be determined, and it


increases as the number of predictors increases. Therefore, several adjustments have
been proposed.
1 R2 <- ( logLik ( union0 . glm ) - logLik ( union . wage . age . glm )) /
2   logLik ( union0 . glm )
3 R2
4 # ’ log Lik . ’ 0.03490527 ( df =1)
61/80
AIC and BIC

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC)
are very popular goodness of fit statistics based on the log-likelihood, with an
adjustment for the number of parameters and the sample size.

AIC = -2\,\ell(\hat{\pi}; y) + 2p \qquad (20)

where p is the number of parameters.

BIC = -2\,\ell(\hat{\pi}; y) + p \log N \qquad (21)

where N is the sample size.
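
As a sketch (assuming the model objects fitted earlier in the lecture), the AIC and BIC can be reproduced from logLik():

ll <- as.numeric(logLik(union.wage.age.glm))
p  <- length(coef(union.wage.age.glm))          # number of parameters (3)
N  <- nrow(trade.union)                         # sample size (534)
c(AIC_manual = -2 * ll + 2 * p,      AIC_R = AIC(union.wage.age.glm))
c(BIC_manual = -2 * ll + p * log(N), BIC_R = BIC(union.wage.age.glm))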

62/80
AIC and BIC

Remark: all these statistics (except the pseudo-R^2) summarise how well a particular model fits the data: a small value (or a large p-value) indicates that the model fits well.
1 BIC ( union0 . glm )
2 # [1] 509.3645
3 AIC ( union0 . glm )
4 # [1] 505.0841
5

6 BIC ( union . wage . glm )


7 # [1] 503.0606
8 AIC ( union . wage . glm )
9 # [1] 494.4998
10

11 BIC ( union . wage . age . glm )


12 # [1] 504.365
13 AIC ( union . wage . age . glm )
14 # [1] 491.5239
63/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

64/80
Residuals

The residuals correspond to some of the statistics we have already analysed.

For Y_i \sim Bin(n_i, \pi_i), the Pearson residuals are

P_i = \frac{Y_i - n_i \hat{\pi}_i}{\sqrt{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}}, \qquad i = 1, \ldots, N,

which can be standardised using the leverages h_{ii}:

e_i^P = \frac{P_i}{\sqrt{1 - h_{ii}}}.

Notice that \sum_{i=1}^{N} P_i^2 = P^2.
1 pr <- residuals ( union . wage . age . glm , type = " pearson " )
2 prss <- sum ( pr ^2)
3 prss
4 # [1] 517.8315
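
A sketch of the standardised Pearson residuals using the leverages, assuming the model union.wage.age.glm from earlier (rstandard() returns the same quantity):

pr  <- residuals(union.wage.age.glm, type = "pearson")
h   <- hatvalues(union.wage.age.glm)          # leverages h_ii
spr <- pr / sqrt(1 - h)                       # standardised Pearson residuals
all.equal(spr, rstandard(union.wage.age.glm, type = "pearson"))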
65/80
Residuals

The deviance residuals are defined as

d_i = \text{sign}(Y_i - n_i \hat{\pi}_i) \left\{ 2 \left[ Y_i \log\left( \frac{Y_i}{n_i \hat{\pi}_i} \right) + (n_i - Y_i) \log\left( \frac{n_i - Y_i}{n_i - n_i \hat{\pi}_i} \right) \right] \right\}^{1/2}

(the sign term makes sure that the signs of d_i and P_i match).

Note that \sum_{i=1}^{N} d_i^2 = D, the deviance.

1 union . wage . age . glm $ deviance
2 # [1] 485.5239

66/80
Residuals

The residuals can be used in the usual way. They should be:

• plotted against each continuous explanatory variable, to check whether the assumption of linearity is appropriate
• plotted against other possible explanatory variables not included in the model
• plotted in the order of the measurements, to check for serial correlation
• examined through normality plots

In general, residual plots for GLMs are less informative than for multiple linear regression; it is therefore important to also check the other goodness-of-fit statistics.

67/80
Residuals


Figure 2: Plot of the residuals against each continuous explanatory variable.
68/80
Residuals

Usually we plot residuals against the values of the linear predictors in a generalised
linear model to look for patterns in the residuals that may be related to the mean.

But sometimes it can be hard to see patterns in residual plots for generalised linear
models. For instance, in logistic regression where the responses are binary, the
residuals can only take on two possible values (depending on whether the response is
zero or one) and when residuals are plotted against linear predictor values, all points lie
on one of the two smooth curves.

It is nearly always helpful to superimpose a scatterplot smoother on the residual plot to help identify any trends. In the figure, the black line, a smoothing spline (more on this later), suggests that there may be a trend in the residuals: the mean is underestimated in the middle and overestimated at the two extremes.
69/80
Residuals


Figure 3: Pearson residuals against the predicted probability of being a member. Blue illustrates positive residuals (individual is a member) and red negative residuals. The black line represents the smoothing spline.

70/80
Residuals

1 # Prediction
2 pred . memb <- predict ( union . wage . age . glm )
3 pred . memb <- exp ( pred . memb ) / (1+ exp ( pred . memb )) # original scale
4

5 plot ( pred . memb , pr , col = c ( " red " ," blue " )[1+ union . member ] ,
6 xlab = " Prediction " , ylab = " Residuals " )
7 abline ( h =0 , lty =2 , col = " grey " , lwd =2)
8 ss <- smooth . spline ( pred . memb , pr )
9 lines ( ss , lwd =2)

71/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

72/80
Check of the link function

To test the goodness-of-fit of the logit link, one can consider a more general family of link functions,

g(\pi, \alpha) = \log\left[ \frac{(1 - \pi)^{-\alpha} - 1}{\alpha} \right]. \qquad (22)

• For \alpha = 1, g(\pi, 1) is the logit link.
• As \alpha \to 0, g(\pi, \alpha) \to \log[-\log(1 - \pi)], the complementary log-log link.

\alpha can be estimated from the data.
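
A small R sketch of this family (the function name g_alpha is illustrative), checking the two special cases numerically:

g_alpha <- function(pi, alpha) log(((1 - pi)^(-alpha) - 1) / alpha)

pi <- 0.3
c(g_alpha(pi, 1), log(pi / (1 - pi)))        # alpha = 1: the logit link
c(g_alpha(pi, 1e-8), log(-log(1 - pi)))      # alpha -> 0: the complementary log-log link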

73/80
Overdispersion

Observations Yi can have a variance which is greater than ni πi (1 − πi ).


An indicator of this is when the deviance D is much larger than its expected value, N − p.

This can be due to

• Omission of relevant explanatory variables


• More complex structure of association between observations and explanatory
variables
• Yi are not independent

A solution is to include an extra dispersion parameter φ (this is what R does when the family is specified as quasibinomial; see the sketch below).
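
For instance (a sketch, assuming the trade union variables attached earlier), the quasibinomial family lets R estimate the dispersion parameter:

union.quasi.glm <- glm(union.member ~ wage + age, family = quasibinomial)
summary(union.quasi.glm)$dispersion   # estimated phi; values well above 1 indicate overdispersion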
74/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

75/80
Odds ratios and prevalence ratios

The exponentiated parameter estimates from a logit-link model are odds ratios. In a logistic regression model, increasing X_1 by one unit changes the log-odds by \beta_1 or, equivalently, multiplies the odds by a factor e^{\beta_1}.

\beta_1 does not correspond to the change in \pi associated with a one-unit increase in X_1.

However, since the logarithm is a monotonic function,

• if \beta_1 is positive, then increasing X_1 will be associated with an increase in \pi;
• if \beta_1 is negative, then increasing X_1 will be associated with a decrease in \pi.

76/80
Odds ratios and prevalence ratios

Example
If the probability of disease in one group is 0.8 and in another is 0.2, then the odds ratio (OR) is

OR = \frac{0.8/(1 - 0.8)}{0.2/(1 - 0.2)} = \frac{4}{0.25} = 16,

whereas the prevalence ratio or relative risk (PR) is

PR = \frac{0.8}{0.2} = 4.
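
The same arithmetic in R (a trivial sketch):

p1 <- 0.8; p2 <- 0.2
c(OR = (p1 / (1 - p1)) / (p2 / (1 - p2)), PR = p1 / p2)   # 16 and 4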

77/80
Odds ratios and prevalence ratios

The prevalence ratio can be estimated through the log link instead of the logit link.

However, the log link presents some disadvantages:

• models using it can fail to converge: the logit transformation leads to values defined on all of \mathbb{R}, while the log transformation of probabilities leads to non-positive values, so the linear predictor must be constrained;
• for continuous explanatory variables, the prevalence ratio is not linearly related to changes in the explanatory variables, so it is necessary to state the level of the variable for each prevalence ratio value.

78/80
Odds ratios and prevalence ratios

1 exp ( union . wage . age . glm $ coefficients )


2 # ( Intercept ) wage age
3 # 0.05099156 1.06733959 1.02210191
4

5 union . wage . age . log <- glm ( union . member ~ wage + age ,
6 family = binomial ( link = " log " ))
7 # Error : no valid set of coefficients has been found :
8 # please supply starting values

79/80
References

Sources and recommended reading:

1. A. J. Dobson & A. G. Barnett (2018) An introduction to generalised linear


models, Chapman and Hall. Chapter 7 (Section 7.3 excluded).
2. G. James, D. Witten, T. Hastie & R. Tibshirani (2013) An Introduction to
Statistical Learning with Applications in R, Springer. Chapter 4, Sections 4.1–4.3
(Section 4.3.5 excluded).

80/80
