
MATH5806 Applied Regression Analysis

Lecture 4 - Logistic Regression

Boris Beranger
Term 2, 2021

1/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

2/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

3/80
Goals

This lecture is about obtaining estimates for

• Logistic regression

This is a case where numerical solutions are needed.

4/80
Introduction

For independent random variables Yi , linear models of the form

E(Y_i) = \mu_i = x_i^\top \beta, \qquad Y_i \sim N(\mu_i, \sigma^2),

are the basis of most analyses of continuous data.

Advances in theory and software allow us to extend the theory we have developed in
Chapter 3 to models where:

• Response variables have distributions other than the normal distribution


• association between the response and the explanatory variables is not linear

5/80
Introduction

Again, suppose we have a set of observations (y1 , x1 ), . . . , (yN , xN ).

We want to extend the estimation of the parameters β to the situation where

E(Y_i) = \mu_i \qquad (1)

is related to the covariates in a non-linear way,

g(\mu_i) = x_i^\top \beta. \qquad (2)

The function g is called the link function.

Most of the course will focus on situations where g is a simple mathematical function, but link functions may also be estimated numerically, as in generalised additive models, which we will introduce towards the end of the course.
6/80
Introduction

Generalised linear models are defined for independent random variables Y_1, Y_2, \ldots, Y_N distributed according to a distribution from the exponential family:

• The distribution of each Y_i has the canonical form

f(y_i \mid \theta_i) = \exp\left[ y_i\, b_i(\theta_i) + c_i(\theta_i) + d_i(y_i) \right] \qquad (3)

• The distributions of all the Y_i's are of the same form, i.e. b_i(\theta_i) = b(\theta_i), c_i(\theta_i) = c(\theta_i) and d_i(y_i) = d(y_i)
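
For example, the Bernoulli distribution with success probability \pi_i fits this canonical form. With \theta_i = \pi_i,

f(y_i \mid \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i} = \exp\left[ y_i \log\left( \frac{\pi_i}{1 - \pi_i} \right) + \log(1 - \pi_i) \right],

so that b(\theta_i) = \log\left( \frac{\pi_i}{1 - \pi_i} \right), c(\theta_i) = \log(1 - \pi_i) and d(y_i) = 0; this is the form used for logistic regression later in this lecture.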

7/80
Introduction

For model specification, we are interested in the effects of the covariates on the response variables, described by the parameters \beta_1, \ldots, \beta_p (with p < N).

Suppose E(Y_i) = \mu_i; then we are interested in a transformation of \mu_i,

g(\mu_i) = x_i^\top \beta.

• g is monotone and differentiable; it is called the link function
• The vector x_i is a p × 1 vector of explanatory variables, x_i^\top = [X_{i1}, \ldots, X_{ip}], and x_i^\top is the i-th row of the design matrix X
• \beta is a p × 1 vector of parameters, \beta^\top = [\beta_1, \ldots, \beta_p]
8/80
Introduction

Remark: The process of predicting qualitative responses, as in the case of logistic


regression, is often referred to as classification, because it involves assigning an
observation to a category or class.
Example
Problems of classification are extremely important in applied sciences, e.g.:

• A person arrives at the emergency room with a set of symptoms that could
possibly be attributed to one of three medical conditions; which of the three
conditions does the individual have?
• An online banking service must be able to determine whether or not a transaction
is fraudulent, on the basis of the IP address, past transactions, etc.
• On the basis of DNA sequence data, we would like to predict whether a strain of
tuberculosis is resistant to a particular drug.
9/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

10/80
Maximum likelihood estimation

Let’s recall the following results from the previous lecture.

The joint distribution is

f(Y_1, \ldots, Y_N \mid \theta_1, \ldots, \theta_N) = \prod_{i=1}^{N} \exp\left[ Y_i b(\theta_i) + c(\theta_i) + d(Y_i) \right]
= \exp\left[ \sum_{i=1}^{N} Y_i b(\theta_i) + \sum_{i=1}^{N} c(\theta_i) + \sum_{i=1}^{N} d(Y_i) \right]

For each Y_i, the log-likelihood is \ell_i = Y_i b(\theta_i) + c(\theta_i) + d(Y_i), which gives

E(Y_i) = \mu_i = -\frac{c'(\theta_i)}{b'(\theta_i)}, \qquad
Var(Y_i) = \frac{b''(\theta_i)\, c'(\theta_i) - c''(\theta_i)\, b'(\theta_i)}{[b'(\theta_i)]^3}, \qquad
g(\mu_i) = x_i^\top \beta = \eta_i.

11/80
Maximum likelihood estimation

The log-likelihood for all the Y_i's is then

\ell(\theta; Y_1, \ldots, Y_N) = \sum_{i=1}^{N} \ell_i = \sum_{i=1}^{N} Y_i b(\theta_i) + \sum_{i=1}^{N} c(\theta_i) + \sum_{i=1}^{N} d(Y_i).

To obtain the maximum likelihood estimator for the parameter \beta_j we derive the score function using the chain rule:

U_j = \frac{d\ell}{d\beta_j} = \sum_{i=1}^{N} \frac{d\ell_i}{d\beta_j} = \sum_{i=1}^{N} \left( \frac{d\ell_i}{d\theta_i} \cdot \frac{d\theta_i}{d\mu_i} \cdot \frac{d\mu_i}{d\beta_j} \right). \qquad (4)

12/80
Maximum likelihood estimation

Now, let's consider each term separately:

• \frac{d\ell_i}{d\theta_i} = Y_i b'(\theta_i) + c'(\theta_i) = b'(\theta_i)(Y_i - \mu_i)

• \frac{d\theta_i}{d\mu_i} = \left( \frac{d\mu_i}{d\theta_i} \right)^{-1} = \left( -\frac{c''(\theta_i)}{b'(\theta_i)} + \frac{c'(\theta_i)\, b''(\theta_i)}{[b'(\theta_i)]^2} \right)^{-1} = \left( b'(\theta_i)\, Var(Y_i) \right)^{-1}

• \frac{d\mu_i}{d\beta_j} = \frac{d\mu_i}{d\eta_i} \cdot \frac{d\eta_i}{d\beta_j} = \frac{d\mu_i}{d\eta_i}\, X_{ij}

Hence, the score function is

U_j = \sum_{i=1}^{N} \frac{(Y_i - \mu_i)}{Var(Y_i)}\, X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right). \qquad (5)

13/80
Maximum likelihood estimation

The variance-covariance matrix of the score is

I_{jk} = E[U_j U_k], \qquad (6)

which represents the information matrix:

I_{jk} = E\left\{ \left[ \sum_{i=1}^{N} \frac{(Y_i - \mu_i)}{Var(Y_i)} X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right) \right] \left[ \sum_{l=1}^{N} \frac{(Y_l - \mu_l)}{Var(Y_l)} X_{lk} \left( \frac{d\mu_l}{d\eta_l} \right) \right] \right\}
= \sum_{i=1}^{N} \frac{E[(Y_i - \mu_i)^2]\, X_{ij} X_{ik}}{[Var(Y_i)]^2} \left( \frac{d\mu_i}{d\eta_i} \right)^2,

because E[(Y_i - \mu_i)(Y_l - \mu_l)] = 0 for i \neq l.

14/80
Maximum likelihood estimation

Since E[(Y_i - \mu_i)^2] = Var(Y_i),

I_{jk} = \sum_{i=1}^{N} \frac{X_{ij} X_{ik}}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2. \qquad (7)

For the linear model, where E(Y_i) = x_i^\top \beta and hence \mu_i = \eta_i (i.e. g(\mu_i) = \mu_i), we have d\mu_i / d\eta_i = 1 and Var(Y_i) = \sigma^2, so the information matrix reduces to

I = \frac{1}{\sigma^2} X^\top X.

For a general link function we no longer have this simplification and cannot write I in this simple form.

15/80
Maximum likelihood estimation

If we want to apply the method of scoring to approximate the MLE, the estimating equation is

\hat{\beta}^{(m)} = \hat{\beta}^{(m-1)} + [I^{(m-1)}]^{-1} U^{(m-1)},

or equivalently

[I^{(m-1)}]\, \hat{\beta}^{(m)} = [I^{(m-1)}]\, \hat{\beta}^{(m-1)} + U^{(m-1)}. \qquad (8)

From Equation (7), the information matrix can be written as

I = X^\top W X, \qquad (9)

where W is diagonal with w_{ii} = \frac{1}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2, evaluated at \beta.

16/80
Maximum likelihood estimation

Finally, the expression on the right-hand side of (8) can be written as

\sum_{k=1}^{p} \sum_{i=1}^{N} \frac{X_{ij} X_{ik}}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 \hat{\beta}_k^{(m-1)} + \sum_{i=1}^{N} \frac{(Y_i - \mu_i)\, X_{ij}}{Var(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right),

which can be written in matrix terms as

I^{(m-1)}\, \hat{\beta}^{(m-1)} + U^{(m-1)} = X^\top W z, \qquad (10)

where

z_i = \sum_{k=1}^{p} X_{ik}\, \hat{\beta}_k^{(m-1)} + (Y_i - \mu_i) \left( \frac{d\eta_i}{d\mu_i} \right),

with \mu_i and d\eta_i / d\mu_i evaluated at \hat{\beta}^{(m-1)}.

17/80
Maximum likelihood estimation

Therefore (8) reads:

X^\top W^{(m-1)} X\, \hat{\beta}^{(m)} = X^\top W^{(m-1)} z^{(m-1)}

This has the same form as the weighted least squares equation. Note, however, this
needs to be solved iteratively, since W and z have to be recalculated at each
optimisation step.

Therefore, for generalised linear models, maximum likelihood estimators are obtained
by an iterative weighted least squares procedure.
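
As an illustration, the following is a minimal R sketch of this Fisher scoring / IRLS update, specialised to the logistic (binary) case treated later in this chapter. The function name irls_logistic and the simulated data are illustrative only, not part of the lecture material.

# Minimal sketch of Fisher scoring / IRLS for logistic regression.
irls_logistic <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                    # starting values
  for (m in seq_len(maxit)) {
    eta <- as.vector(X %*% beta)             # linear predictor eta_i
    mu  <- 1 / (1 + exp(-eta))               # fitted probabilities mu_i
    w   <- mu * (1 - mu)                     # w_ii = (1/Var)(dmu/deta)^2 = mu(1-mu) for the logit link
    z   <- eta + (y - mu) / w                # working response z_i
    beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))
X <- cbind(1, x)
irls_logistic(X, y)                          # should agree with glm()
coef(glm(y ~ x, family = binomial))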

18/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

19/80
General logistic regression

We now focus on models where the outcome variables are measured on a binary scale.

We define a binary random variable

Y = \begin{cases} 1 & \text{if the outcome is a “success” (probability } \pi) \\ 0 & \text{if the outcome is a “failure” (probability } 1 - \pi) \end{cases} \qquad (11)

i.e. Y has a Bernoulli distribution.

The goal of the analysis is to relate the probability of success of the Bernoulli distribution, \pi_i, to a set of explanatory variables x_i, i.e.

\Pr(Y_i = 1 \mid X_{i1}, \ldots, X_{ip}) = \pi_i \qquad \text{for } i = 1, \ldots, N. \qquad (12)


20/80
General logistic regression

The joint likelihood function is

f(y_1, \ldots, y_N \mid \pi) = \prod_{i=1}^{N} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}
= \exp\left[ \sum_{i=1}^{N} y_i \log\left( \frac{\pi_i}{1 - \pi_i} \right) + \sum_{i=1}^{N} \log(1 - \pi_i) \right]. \qquad (13)

21/80
General logistic regression

We want to describe the probability of success with respect to some predictors,

g(\pi_i) = x_i^\top \beta, \qquad (14)

so as to take into account that

• the response variable is binary and not continuous;
• the response variable is bounded (in [0, 1]);
• the variance is not constant: Var(Y_i) = \pi_i(1 - \pi_i).

Similar considerations apply to ordinal response variables.

22/80
General logistic regression

Consider the Default dataset from the ISLR R package. We want to estimate the
probability of default as a function of balance.

Figure 1: Estimated probability of default using linear (left) and logistic (right) regression.
23/80
General logistic regression

The following code can be used to replicate the previous plots.


1 library ( ISLR )
2 data ( " Default " )
3 attach ( Default )
4

5 default . bin <- rep (0 , length ( default )) # initialise binary vector


6 default . bin [ default == " Yes " ] <- 1
7

8 par ( mfrow = c (1 ,2)) # To put the plots next to each other


9

10 # Linear model
11 lm <- lm ( default . bin ~ balance )
12 plot ( balance , default . bin , pch =3 , col = " orange " ,
13 xlab = " Balance " , ylab = " Probability of Default " )
14 abline ( h = c (0 ,1) , lty =2)
15 abline ( a = lm $ coefficients [1] , b = lm $ coefficients [2] , col = " blue " , lwd =3)
24/80
General logistic regression

The following code can be used to replicate the previous plots.


16 # Logistic model
17 log <- glm ( default . bin ~ balance , family = " binomial " )
18 o <- order ( balance )
19

20 plot ( balance , default . bin , pch =3 , col = " orange " , xlab = " Balance " ,
21 ylab = " Probability of Default " )
22 abline ( h = c (0 ,1) , lty =2)
23 lines ( balance [ o ] , log $ fitted . values [ o ] , col = " blue " , lwd =3)

25/80
General logistic regression

Example
We want to predict the medical condition of a patient in the emergency room on the
basis of the symptoms. Let’s suppose we have three possible diagnoses:

Y = \begin{cases} 1 & \text{stroke} \\ 2 & \text{drug overdose} \\ 3 & \text{epileptic seizure} \end{cases}

Using a linear regression would assume that

• the ordering is meaningful (but the numbers 1, 2 and 3 are just labels!)
• the difference between “stroke” and “drug overdose” has the same meaning as that between “drug overdose” and “epileptic seizure”
26/80
General logistic regression

The general logistic regression model is

\text{logit}(\pi_i) = \log\left( \frac{\pi_i}{1 - \pi_i} \right) = x_i^\top \beta, \qquad (15)

where x_i is a vector of either continuous measurements or categorical variables and \beta is a parameter vector. Recall that \pi_i / (1 - \pi_i) is an odds, taking values between 0 and \infty, indicating very low to very high probabilities. This means that

\frac{\pi_i}{1 - \pi_i} = \exp[x_i^\top \beta]
\pi_i = \exp[x_i^\top \beta] - \pi_i \exp[x_i^\top \beta]
(1 + \exp[x_i^\top \beta])\, \pi_i = \exp[x_i^\top \beta]
\pi_i = \frac{\exp[x_i^\top \beta]}{1 + \exp[x_i^\top \beta]}
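
As a small sketch (not from the slides), the logit and its inverse are available in R as qlogis() and plogis(), which match the expressions above:

eta <- c(-2, 0, 1.5)               # some illustrative values of x_i' beta
pi  <- exp(eta) / (1 + exp(eta))   # inverse logit, as derived above
all.equal(pi, plogis(eta))         # TRUE: plogis() is the inverse logit
all.equal(qlogis(pi), eta)         # TRUE: qlogis() is the logit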
27/80
General logistic regression

The log-likelihood can then be rewritten with respect to \beta:

\ell(\beta; y, x) = \sum_{i=1}^{N} \left[ y_i \log\left( \frac{\exp[x_i^\top \beta]}{1 + \exp[x_i^\top \beta]} \right) + (1 - y_i) \log\left( \frac{1}{1 + \exp[x_i^\top \beta]} \right) \right]. \qquad (16)

The estimation process is the same if Yi is binomially distributed instead of Bernoulli


distributed, with the corresponding modification to consider the number of trials.
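
As a quick sketch (with simulated data and illustrative names, not from the lecture), the log-likelihood (16) can be evaluated directly and compared with logLik() applied to a fitted glm object:

set.seed(2)
x   <- rnorm(100)
y   <- rbinom(100, 1, plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial)

eta <- as.vector(cbind(1, x) %*% coef(fit))
ll  <- sum(y * log(exp(eta) / (1 + exp(eta))) + (1 - y) * log(1 / (1 + exp(eta))))
c(manual = ll, logLik = as.numeric(logLik(fit)))   # the two values should agree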

If the goal is prediction, one might predict

\hat{Y}_{N+1} = 1 \quad \text{if} \quad \hat{\pi}_{N+1} = \Pr(Y_{N+1} = 1 \mid x_{N+1}) > 0.5.

However, other thresholds could be used: e.g. if we want to be particularly conservative, we can set the threshold to 0.1.
28/80
Analysis of trade union dataset
Example
The trade union data were collected in 1985 and are available in the R package SemiPar. The variable union.member is binary while the variables age and wage are continuous. We illustrate model comparison with this data set.
The figure below shows logistic regression fitted to the two variables separately, with union membership as the response.

1 library ( SemiPar )
2

3 data ( " trade . union " )


4 attach ( trade . union )
5

6 plot ( wage , union . member , main = " Trade Union Dataset " , pch =3 ,
7 ylab = " Union Membership " , xlab = " Wage " )
8 abline ( h = c (0 ,1) , lty =2)
29/80
Analysis of trade union dataset

[Figure: Union Membership (0/1) plotted against Wage, Trade Union Dataset.]

30/80
Analysis of trade union dataset

The code below fits the logistic regression model, displays the fitted line and prints the output.
9 union . wage . glm <- glm ( union . member ~ wage , family = " binomial " )
10 o <- order ( wage )
11

12 lines ( wage [ o ] , union . wage . glm $ fitted . values [ o ] , col = " blue " , lwd =3)
13

14 summary ( union . wage . glm )

31/80
Analysis of trade union dataset

[Figure: Union Membership against Wage with the fitted logistic regression curve, Trade Union Dataset.]

32/80
Analysis of trade union dataset

15 # Call :
16 # glm ( formula = union . member ~ wage , family = " binomial ")
17 #
18 # Deviance Residuals :
19 # Min 1Q Median 3Q Max
20 # -1.6140 -0.6338 -0.5592 -0.5174 2.0590
21 #
22 # Coefficients :
23 # Estimate Std . Error z value Pr ( >| z |)
24 # ( Intercept ) -2.20696 0.23298 -9.473 < 2e -16 * * *
25 # wage 0.07174 0.02005 3.577 0.000347 * * *
26 # ---
27 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
28 #

33/80
Analysis of trade union dataset

29 # ( Dispersion parameter for binomial family taken to be 1)


30 #
31 # Null deviance : 503.08 on 533 degrees of freedom
32 # Residual deviance : 490.50 on 532 degrees of freedom
33 # AIC : 494.5
34 #
35 # Number of Fisher Scoring iterations : 4

34/80
Analysis of trade union dataset

Let's now model union.member as a function of age


36 plot ( jitter ( age ) , union . member , main = " Trade Union Dataset " ,
37 col = " orange " , pch =3 , ylab = " Union Membership " , xlab = " Age " )
38 abline ( h = c (0 ,1) , lty =2)

35/80
Analysis of trade union dataset

[Figure: Union Membership (0/1) plotted against Age, Trade Union Dataset.]

36/80
Analysis of trade union dataset

The code below fits the logistic regression model, displays the fitted line and prints the output.
39 union . age . glm <- glm ( union . member ~ age , family = " binomial " )
40 o <- order ( age )
41

42 lines ( age [ o ] , union . age . glm $ fitted . values [ o ] , col = " blue " , lwd =3)
43

44 summary ( union . age . glm )

37/80
Analysis of trade union dataset

[Figure: Union Membership against Age with the fitted logistic regression curve, Trade Union Dataset.]

38/80
Analysis of trade union dataset

45 # Call :
46 # glm ( formula = union . member ~ wage , family = " binomial ")
47 #
48 # Deviance Residuals :
49 # Min 1Q Median 3Q Max
50 # -1.6140 -0.6338 -0.5592 -0.5174 2.0590
51 #
52 # Coefficients :
53 # Estimate Std . Error z value Pr ( >| z |)
54 # ( Intercept ) -2.20696 0.23298 -9.473 < 2e -16 * * *
55 # wage 0.07174 0.02005 3.577 0.000347 * * *
56 # ---
57 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
58 #

39/80
Analysis of trade union dataset

59 # ( Dispersion parameter for binomial family taken to be 1)


60 #
61 # Null deviance : 503.08 on 533 degrees of freedom
62 # Residual deviance : 490.50 on 532 degrees of freedom
63 # AIC : 494.5#
64 # Number of Fisher Scoring iterations : 4

Both age and wage seem to be significant, so let’s fit a model with 3 parameters.

40/80
Analysis of trade union dataset

65 union . wage . age . glm <- glm ( union . member ~ wage + age , family = " binomial " )
66 summary ( union . wage . age . glm )
67

68 # Call :
69 # glm ( formula = union . member ~ wage + age , family = " binomial ")
70 #
71 # Deviance Residuals :
72 # Min 1Q Median 3Q Max
73 # -1.3438 -0.6506 -0.5565 -0.4620 2.1485
74 #
75 # Coefficients :
76 # Estimate Std . Error z value Pr ( >| z |)
77 # ( Intercept ) -2.976095 0.426685 -6.975 3.06 e -12 * * *
78 # wage 0.065169 0.020117 3.239 0.0012 * *
79 # age 0.021861 0.009722 2.249 0.0245 *
80 # ---
81 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
41/80
Analysis of trade union dataset

82 # ( Dispersion parameter for binomial family taken to be 1)


83 #
84 # Null deviance : 503.08 on 533 degrees of freedom
85 # Residual deviance : 485.52 on 531 degrees of freedom
86 # AIC : 491.52
87 #
88 # Number of Fisher Scoring iterations : 4

42/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

43/80
Prediction

Once the coefficients have been estimated, predictions are obtained by using those
estimates with the desired level of predictors.
Example
Analysis of trade union dataset. If we want to predict the probability of union membership for someone who is 56 years old and has a wage of 6.5, we compute

\pi_{new} = \frac{\exp(\hat{\beta}_0 + 6.5\,\hat{\beta}_1 + 56\,\hat{\beta}_2)}{1 + \exp(\hat{\beta}_0 + 6.5\,\hat{\beta}_1 + 56\,\hat{\beta}_2)}

1 p _ pred <- exp ( sum ( union . wage . age . glm $ coefficients * c (1 ,6.5 ,56))) /
2 (1+ exp ( sum ( union . wage . age . glm $ coefficients * c (1 ,6.5 ,56))))
3 p _ pred
4 # [1] 0.209446

Since \hat{\pi}_{new} < 0.5, we classify the new individual as not being a union member.
44/80
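
Equivalently (a sketch, assuming the model object union.wage.age.glm fitted earlier), the same prediction can be obtained with predict() on the response scale:

newdata <- data.frame(wage = 6.5, age = 56)
predict(union.wage.age.glm, newdata = newdata, type = "response")
# about 0.209, matching the manual calculation above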
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

45/80
Goodness of fit

In assessment of goodness-of-fit for a linear model, residual plots are useful in


exhibiting violations of model assumptions (e.g. independence, homoscedasticity).

In a GLM, we would like to assign a residual ei to each observation which measures the
discrepancy between Yi and the value predicted by the fitted model. There are two
main difficulties associated with generalised linear models:

• The model variances depend on the expectations;


• It is not obvious that data and fitted values should be compared on the original
scale of the responses.

46/80
Pearson chi-squared statistic

For calculation of Pearson residuals we take the difference between observed and fitted
values and divide by an estimate of the standard deviation of the observed values.
Residuals for the binomial model. For Y_i \sim Bin(n_i, \pi_i), the Pearson residuals are

P_i = \frac{y_i - n_i \hat{\pi}_i}{\sqrt{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}}, \qquad i = 1, \ldots, N.

47/80
Pearson chi-squared statistic

Instead of maximising the likelihood, we can estimate the parameters by minimising the weighted sum of squares

S_w = \sum_{i=1}^{N} \frac{(y_i - n_i \pi_i)^2}{n_i \pi_i (1 - \pi_i)},

where E(Y_i) = n_i \pi_i and Var(Y_i) = n_i \pi_i (1 - \pi_i). This is also called the Pearson chi-squared statistic,

P^2 = \sum_{i=1}^{N} \frac{(o_i - e_i)^2}{e_i},

where o_i represents the observed frequencies and e_i the expected frequencies.

48/80
Pearson chi-squared statistic

The reason for the equivalence is

P^2 = \sum_{i=1}^{N} \frac{(y_i - n_i \pi_i)^2}{n_i \pi_i} + \sum_{i=1}^{N} \frac{[(n_i - y_i) - n_i(1 - \pi_i)]^2}{n_i (1 - \pi_i)}
= \sum_{i=1}^{N} \frac{(y_i - n_i \pi_i)^2}{n_i \pi_i (1 - \pi_i)}\, (1 - \pi_i + \pi_i) = S_w.

When the Pearson chi-squared statistic is evaluated at the estimated expected frequencies, it becomes

P^2 = \sum_{i=1}^{N} \frac{(y_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}.

49/80
Deviance

The deviance for the logistic model is

D = 2 \sum_{i=1}^{N} \left[ y_i \log\left( \frac{y_i}{n_i \hat{\pi}_i} \right) + (n_i - y_i) \log\left( \frac{n_i - y_i}{n_i - n_i \hat{\pi}_i} \right) \right]. \qquad (17)

Check this assertion.
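
As a numerical (not analytical) check, the following sketch compares (17) with the deviance reported by glm() on simulated grouped binomial data; all names and data are illustrative:

set.seed(3)
x   <- runif(20)
n   <- rep(30, 20)
y   <- rbinom(20, size = n, prob = plogis(-1 + 2 * x))
fit <- glm(cbind(y, n - y) ~ x, family = binomial)
pihat <- fitted(fit)                                                 # estimated probabilities pi_hat_i

term1 <- ifelse(y > 0, y * log(y / (n * pihat)), 0)                  # y_i log(y_i / (n_i pi_hat_i))
term2 <- ifelse(y < n, (n - y) * log((n - y) / (n - n * pihat)), 0)  # (n_i - y_i) log((n_i - y_i) / (n_i - n_i pi_hat_i))
c(formula_17 = 2 * sum(term1 + term2), glm_deviance = deviance(fit)) # the two values should match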

50/80
Deviance

It is possible to prove that the deviance is asymptotically equivalent to the Pearson chi-squared statistic, through a Taylor approximation:

D = 2 \sum_{i=1}^{N} \left[ (y_i - n_i \hat{\pi}_i) + \frac{1}{2} \frac{(y_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i} + [(n_i - y_i) - (n_i - n_i \hat{\pi}_i)] + \frac{1}{2} \frac{[(n_i - y_i) - (n_i - n_i \hat{\pi}_i)]^2}{n_i - n_i \hat{\pi}_i} + \ldots \right]
\approx \sum_{i=1}^{N} \frac{(y_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i (1 - \hat{\pi}_i)} = P^2,

where the linear terms cancel since (y_i - n_i \hat{\pi}_i) + [(n_i - y_i) - (n_i - n_i \hat{\pi}_i)] = 0.

This comes from the fact that, for s \approx t, \; s \log\left( \frac{s}{t} \right) = (s - t) + \frac{1}{2} \frac{(s - t)^2}{t} + \ldots

51/80
Deviance

Under the null hypothesis H_0 (that the model is correctly specified), the asymptotic distribution of D is

D \sim \chi^2(N - p), \qquad (18)

and therefore P^2 \sim \chi^2(N - p) as well.

The adequacy of the approximation depends on how well D or P^2 are approximated by the \chi^2 distribution. There is some evidence that P^2 is better than D; however, both are affected by small frequencies, which are typical when there are continuous covariates.

52/80
Deviance
Example (Analysis of trade union dataset)
We fitted the logistic model (M_1)

\log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1\, wage + \beta_2\, age,

where \pi_i is the probability of union membership (3 parameters). The observed deviance is d_1 = 485.5239.

Compare with the nested model (M_0)

\log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0,

where the probability of trade union membership is constant (1 parameter). The observed deviance is d_0 = 503.0841.
53/80
Deviance

Example (Analysis of trade union dataset)

We wish to test

H_0: \beta_1 = \beta_2 = 0
H_1: \beta_1, \beta_2 \text{ not both zero}

If H_0 were true, then both models would describe the data well. We would have D_0 \sim \chi^2(N - 1) and D_1 \sim \chi^2(N - 3), so that D_0 - D_1 \sim \chi^2(2). However, we observe

d_0 - d_1 = 503.0841 - 485.5239 = 17.5602,

which is larger than the 0.95 quantile of the \chi^2(2) distribution, 5.9915. Hence we reject H_0 at the 5% significance level.

54/80
Deviance

This code reproduces the calculations of the previous example.


1 d0 <- union . wage . age . glm $ null . deviance
2 d1 <- union . wage . age . glm $ deviance
3

4 alpha <- 0.05


5

6 crit . val <- qchisq (1 - alpha , df =2)


7 if ( d0 - d1 > crit . val ){
8 cat ( " We reject H0 at alpha = " , alpha , " significance level . " )
9 } else {
10 cat ( " We cannot reject H0 at alpha = " , alpha , " significance level . " )
11 }

55/80
Hosmer-Lemeshow statistic

A possible solution is to group observations, with approximately equal numbers of


observations in each group. Then the Pearson chi-squared statistic is computed on the
contingency table obtained by grouping the observations. This statistic is called the
Hosmer-Lemeshow statistic.
1 library ( doBy )
2 uniongrp <- summaryBy ( union . member ~ wage , data = trade . union ,
3 FUN = c ( sum , length ))
4 names ( uniongrp ) = c ( " x " ," y " ," n " )
5 head ( uniongrp )
6 # x y n
7 # 1 1.00 0 1
8 # 2 1.75 0 1
9 # 3 2.01 0 1
10 # 4 2.85 0 1
11 # 5 3.00 1 2
12 # 6 3.35 0 12
56/80
Hosmer-Lemeshow statistic

16 union . grp . glm <- glm ( cbind ( uniongrp $y , uniongrp $n - uniongrp $ y ) ~ uniongrp $x ,
17 family = binomial )
18 summary ( union . grp . glm )
19 # Call :
20 # glm ( formula = cbind ( uniongrp $y , uniongrp $ n - uniongrp $ y ) ~ uniongrp $x ,
21 # family = binomial )
22 #
23 # Deviance Residuals :
24 # Min 1Q Median 3Q Max
25 # -2.4667 -0.7248 -0.5911 0.2845 2.2628
26 #
27 # Coefficients :
28 # Estimate Std . Error z value Pr ( >| z |)
29 # ( Intercept ) -2.20696 0.23298 -9.473 < 2e -16 * * *
30 # uniongrp $ x 0.07174 0.02005 3.577 0.000347 * * *
31 # ---
32 # Signif . codes : 0 ‘* * * ’ 0.001 ‘* * ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1
57/80
Hosmer-Lemeshow statistic

33 # ( Dispersion parameter for binomial family taken to be 1)


34 #
35 # Null deviance : 252.50 on 237 degrees of freedom
36 # Residual deviance : 239.92 on 236 degrees of freedom
37 # AIC : 317.24
38 #
39 # Number of Fisher Scoring iterations : 4

The estimates and standard errors are the same, but the goodness-of-fit statistics differ.

58/80
Likelihood ratio χ2 statistic

Sometimes the log-likelihood of the fitted model is compared with the log-likelihood of the minimal model, the model for which all \pi_i are equal; the corresponding estimate is \tilde{\pi} = \sum_{i=1}^{N} y_i / \sum_{i=1}^{N} n_i.

The statistic is defined as

C = 2[\ell(\hat{\pi}; y) - \ell(\tilde{\pi}; y)]
= 2 \sum_{i=1}^{N} \left[ y_i \log\left( \frac{\hat{y}_i}{n_i \tilde{\pi}} \right) + (n_i - y_i) \log\left( \frac{n_i - \hat{y}_i}{n_i - n_i \tilde{\pi}} \right) \right],

where \hat{y}_i = n_i \hat{\pi}_i. Therefore C \sim \chi^2(p - 1).

59/80
Likelihood ratio χ2 statistic

1 # Minimal model
2 union0 . glm <- glm ( union . member ~ 1 , family = binomial )
3

4 Cstat <- 2 * ( logLik ( union . wage . age . glm ) - logLik ( union0 . glm ))
5 alpha <- 0.05
6 p <- length ( union . wage . age . glm $ coefficients )
7

8 if ( Cstat [1] > qchisq (1 - alpha ,p -1)){


9 cat ( " We reject H0 at alpha = " , alpha , " significance level . " )
10 } else {
11 cat ( " We cannot reject H0 at alpha = " , alpha , " significance level . " )
12 }
13 # We reject the null hypothesis

In this example the fitted model is preferred compared to the minimal model.
60/80
Pseudo-R^2

Analogously to multiple linear regression, the likelihood ratio statistic can be normalised:

pseudo-R^2 = \frac{\ell(\tilde{\pi}; y) - \ell(\hat{\pi}; y)}{\ell(\tilde{\pi}; y)}, \qquad (19)
representing the proportional improvement in the log-likelihood function due to the
terms in the model of interest, compared with the minimal model.

As for the R^2, the distribution of the pseudo-R^2 cannot be determined, and it


increases as the number of predictors increases. Therefore, several adjustments have
been proposed.
1 R2 <- ( logLik ( union0 . glm ) - logLik ( union . wage . age . glm )) /
2   logLik ( union0 . glm )
3 R2
4 # ’ log Lik . ’ 0.03490527 ( df =1)
61/80
AIC and BIC

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC)
are very popular goodness of fit statistics based on the log-likelihood, with an
adjustment for the number of parameters and the sample size.

AIC = -2\,\ell(\hat{\pi}; y) + 2p \qquad (20)

where p is the number of parameters.

BIC = -2\,\ell(\hat{\pi}; y) + p \log N \qquad (21)

where N is the sample size.
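
As a sketch (assuming the model objects fitted earlier in the lecture), the AIC and BIC can be reproduced from logLik():

ll <- as.numeric(logLik(union.wage.age.glm))
p  <- length(coef(union.wage.age.glm))          # number of parameters (3)
N  <- nrow(trade.union)                         # sample size (534)
c(AIC_manual = -2 * ll + 2 * p,      AIC_R = AIC(union.wage.age.glm))
c(BIC_manual = -2 * ll + p * log(N), BIC_R = BIC(union.wage.age.glm))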

62/80
AIC and BIC

Remark: all these statistics (except the pseudo-R^2) summarise how well a particular model fits the data: a small value (or a large p-value) indicates that the model fits well.
1 BIC ( union0 . glm )
2 # [1] 509.3645
3 AIC ( union0 . glm )
4 # [1] 505.0841
5

6 BIC ( union . wage . glm )


7 # [1] 503.0606
8 AIC ( union . wage . glm )
9 # [1] 494.4998
10

11 BIC ( union . wage . age . glm )


12 # [1] 504.365
13 AIC ( union . wage . age . glm )
14 # [1] 491.5239
63/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

64/80
Residuals

The residuals correspond to some of the statistics we have already analysed.

For Y_i \sim Bin(n_i, \pi_i), the Pearson residuals are

P_i = \frac{Y_i - n_i \hat{\pi}_i}{\sqrt{n_i \hat{\pi}_i (1 - \hat{\pi}_i)}}, \qquad i = 1, \ldots, N,

which can be standardised using the leverages h_{ii}:

e_i^P = \frac{P_i}{\sqrt{1 - h_{ii}}}.

Notice that \sum_{i=1}^{N} P_i^2 = P^2.
1 pr <- residuals ( union . wage . age . glm , type = " pearson " )
2 prss <- sum ( pr ^2)
3 prss
4 # [1] 517.8315
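
A sketch of the standardised Pearson residuals using the leverages, assuming the model union.wage.age.glm from earlier (rstandard() returns the same quantity):

pr  <- residuals(union.wage.age.glm, type = "pearson")
h   <- hatvalues(union.wage.age.glm)          # leverages h_ii
spr <- pr / sqrt(1 - h)                       # standardised Pearson residuals
all.equal(spr, rstandard(union.wage.age.glm, type = "pearson"))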
65/80
Residuals

The deviance residuals are defined as

d_i = \text{sign}(Y_i - n_i \hat{\pi}_i) \left\{ 2 \left[ Y_i \log\left( \frac{Y_i}{n_i \hat{\pi}_i} \right) + (n_i - Y_i) \log\left( \frac{n_i - Y_i}{n_i - n_i \hat{\pi}_i} \right) \right] \right\}^{1/2}

(the sign term makes sure that the signs of d_i and P_i match).

Note that \sum_{i=1}^{N} d_i^2 = D, the deviance.

1 union . wage . age . glm $ deviance
2 # [1] 485.5239

66/80
Residuals

The residuals can be used in the usual way. They should be:

• plotted against each continuous explanatory variable, to check whether the assumption of linearity is appropriate
• plotted against other possible explanatory variables not included in the model
• plotted in the order of the measurements, to check for serial correlation
• examined through normality plots

In general, residual plots for GLMs are less informative than for multiple linear regression; it is therefore important to also check the other goodness-of-fit statistics.

67/80
Residuals


Figure 2: Plot of the residuals against each continuous explanatory variable.
68/80
Residuals

Usually we plot residuals against the values of the linear predictors in a generalised
linear model to look for patterns in the residuals that may be related to the mean.

But sometimes it can be hard to see patterns in residual plots for generalised linear
models. For instance, in logistic regression where the responses are binary, the
residuals can only take on two possible values (depending on whether the response is
zero or one) and when residuals are plotted against linear predictor values, all points lie
on one of the two smooth curves.

It is nearly always helpful to superimpose a scatterplot smoother on the residual plot to help identify any trends. In the figure, the black line, a smoothing spline (more on this later), suggests that there may be a trend in the residuals: the mean is underestimated in the middle and overestimated at the two extremes.
69/80
Residuals


Figure 3: Pearson residuals against the predicted probability of being a member. Blue illustrates positive residuals (individual is a member) and red negative residuals. The black line represents the smoothing spline.

70/80
Residuals

1 # Prediction
2 pred . memb <- predict ( union . wage . age . glm )
3 pred . memb <- exp ( pred . memb ) / (1+ exp ( pred . memb )) # original scale
4

5 plot ( pred . memb , pr , col = c ( " red " ," blue " )[1+ union . member ] ,
6 xlab = " Prediction " , ylab = " Residuals " )
7 abline ( h =0 , lty =2 , col = " grey " , lwd =2)
8 ss <- smooth . spline ( pred . memb , pr )
9 lines ( ss , lwd =2)

71/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

72/80
Check of the link function

To test the goodness-of-fit of the logit link, one can consider a more general family of link functions,

g(\pi, \alpha) = \log\left[ \frac{(1 - \pi)^{-\alpha} - 1}{\alpha} \right]. \qquad (22)

• For \alpha = 1, g(\pi, 1) is the logit link.
• As \alpha \to 0, g(\pi, \alpha) \to \log[-\log(1 - \pi)], the complementary log-log link.

\alpha can be estimated from the data.
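
A small R sketch of this family (the function name g_alpha is illustrative), checking the two special cases numerically:

g_alpha <- function(pi, alpha) log(((1 - pi)^(-alpha) - 1) / alpha)

pi <- 0.3
c(g_alpha(pi, 1), log(pi / (1 - pi)))        # alpha = 1: the logit link
c(g_alpha(pi, 1e-8), log(-log(1 - pi)))      # alpha -> 0: the complementary log-log link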

73/80
Overdispersion

Observations Yi can have a variance which is greater than ni πi (1 − πi ).


An indicator of this is when the deviance D is much larger than its expected value, N − p.

This can be due to

• Omission of relevant explanatory variables


• More complex structure of association between observations and explanatory
variables
• Yi are not independent

A solution is to include an extra dispersion parameter φ (this is what R does when the family is specified as quasibinomial; see the sketch below).
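
For instance (a sketch, assuming the trade union variables attached earlier), the quasibinomial family lets R estimate the dispersion parameter:

union.quasi.glm <- glm(union.member ~ wage + age, family = quasibinomial)
summary(union.quasi.glm)$dispersion   # estimated phi; values well above 1 indicate overdispersion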
74/80
Chapter 4 - Logistic Regression

4.1 - Introduction

4.2 - Maximum likelihood estimation

4.3 - General Logistic Regression

4.4 - Prediction

4.5 - Goodness of fit

4.6 - Residuals

4.7 - Other diagnostics

4.8 - Odds ratios and prevalence ratios

75/80
Odds ratios and prevalence ratios

The exponentiated parameter estimates from a logit-link model are odds ratios. In a logistic regression model, increasing X_1 by one unit changes the log-odds by \beta_1 or, equivalently, multiplies the odds by a factor e^{\beta_1}.

\beta_1 does not correspond to the change in \pi associated with a one-unit increase in X_1.

However, since the logarithm is a monotonic function,

• if \beta_1 is positive, then increasing X_1 will be associated with an increase in \pi;
• if \beta_1 is negative, then increasing X_1 will be associated with a decrease in \pi.

76/80
Odds ratios and prevalence ratios

Example
If the probability of disease in one group is 0.8 and in another is 0.2, then the odds ratio (OR) is

OR = \frac{0.8/(1 - 0.8)}{0.2/(1 - 0.2)} = \frac{4}{0.25} = 16,

whereas the prevalence ratio or relative risk (PR) is

PR = \frac{0.8}{0.2} = 4.
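
The same arithmetic in R (a trivial sketch):

p1 <- 0.8; p2 <- 0.2
c(OR = (p1 / (1 - p1)) / (p2 / (1 - p2)), PR = p1 / p2)   # 16 and 4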

77/80
Odds ratios and prevalence ratios

The prevalence ratio can be estimated through the log link instead of the logit link.

However, the log link presents some disadvantages:

• models using it can fail to converge: the logit transformation leads to values defined on all of \mathbb{R}, while the log transformation of probabilities leads to non-positive values, so the linear predictor must be constrained;
• for continuous explanatory variables, the prevalence ratio is not linearly related to changes in the explanatory variables, so it is necessary to state the level of the variable for each prevalence ratio value.

78/80
Odds ratios and prevalence ratios

1 exp ( union . wage . age . glm $ coefficients )


2 # ( Intercept ) wage age
3 # 0.05099156 1.06733959 1.02210191
4

5 union . wage . age . log <- glm ( union . member ~ wage + age ,
6 family = binomial ( link = " log " ))
7 # Error : no valid set of coefficients has been found :
8 # please supply starting values

79/80
References

Sources and recommended reading:

1. A. J. Dobson & A. G. Barnett (2018) An introduction to generalised linear


models, Chapman and Hall. Chapter 7 (Section 7.3 excluded).
2. G. James, D. Witten, T. Hastie & R. Tibshirani (2013) An Introduction to
Statistical Learning with Applications in R, Springer. Chapter 4, Sections 4.1–4.3
(Section 4.3.5 excluded).

80/80
