Week04 Lecture BB
Boris Beranger
Term 2, 2021
Chapter 4 - Logistic Regression
4.1 - Introduction
4.4 - Prediction
4.6 - Residuals
Goals
• Logistic regression
Introduction

In Chapter 3 we considered models where

$$ Y_i \sim N(\mu_i, \sigma^2), \qquad E(Y_i) = \mu_i = x_i^\top \beta. $$

Advances in theory and software allow us to extend the theory we have developed in Chapter 3 to models where:
Introduction
$$ E(Y_i) = \mu_i \qquad (1) $$

$$ g(\mu_i) = x_i^\top \beta \qquad (2) $$
Most of the course will focus on situations where g is a simple mathematical function, but functions may also be estimated numerically, as in generalised additive models, which we will introduce towards the end of the course.
Introduction

• The distributions of all the $Y_i$'s are of the same form, i.e. $b_i(\theta_i) = b(\theta_i)$, $c_i(\theta_i) = c(\theta_i)$ and $d_i(\theta_i) = d(\theta_i)$.
Introduction

$$ g(\mu_i) = x_i^\top \beta $$

• A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions; which of the three conditions does the individual have?
• An online banking service must be able to determine whether or not a transaction is fraudulent, on the basis of the IP address, past transactions, etc.
• On the basis of DNA sequence data, we would like to predict whether a strain of tuberculosis is resistant to a particular drug.
Maximum likelihood estimation
To obtain the maximum likelihood estimator for the parameter βj we derive the score
function using the chain rule:
$$ U_j = \frac{d\ell}{d\beta_j} = \sum_{i=1}^{N} \frac{d\ell_i}{d\beta_j} = \sum_{i=1}^{N} \frac{d\ell_i}{d\theta_i} \cdot \frac{d\theta_i}{d\mu_i} \cdot \frac{d\mu_i}{d\beta_j}. \qquad (4) $$
Maximum likelihood estimation

$$ U_j = \sum_{i=1}^{N} \frac{(y_i - \mu_i)}{\mathrm{Var}(Y_i)}\, x_{ij}\, \frac{d\mu_i}{d\eta_i} \qquad (5) $$
Maximum likelihood estimation

$$ \mathcal{I} = \frac{1}{\sigma^2} X^\top X $$
Maximum likelihood estimation

If we want to apply the method of scoring to approximate the MLE, we need the information matrix

$$ \mathcal{I} = X^\top W X \qquad (9) $$

where $w_{ii} = \frac{1}{\mathrm{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2$, evaluated at $\beta$.
Maximum likelihood estimation

Finally, the expression on the right-hand side of (8) can be written as

$$ \sum_{k=1}^{p} \sum_{i=1}^{N} \frac{x_{ij} x_{ik}}{\mathrm{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 \hat\beta_k^{(m-1)} \;+\; \sum_{i=1}^{N} \frac{(y_i - \mu_i)\, x_{ij}}{\mathrm{Var}(Y_i)} \frac{d\mu_i}{d\eta_i} $$

where

$$ z_i = \sum_{k=1}^{p} x_{ik} \hat\beta_k^{(m-1)} + (y_i - \mu_i) \frac{d\eta_i}{d\mu_i}. $$
Maximum likelihood estimation

This has the same form as the weighted least squares equation. Note, however, that it needs to be solved iteratively, since W and z have to be recalculated at each optimisation step.

Therefore, for generalised linear models, maximum likelihood estimators are obtained by an iterative weighted least squares procedure.
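Not part of the lecture code: a minimal NumPy sketch of this iterative weighted least squares procedure for the logit link, where $w_{ii} = \mu_i(1-\mu_i)$ and $z_i = \eta_i + (y_i - \mu_i)\,d\eta_i/d\mu_i$. The toy data are made up for illustration.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic regression by iterative weighted least squares.

    Each step solves (X^T W X) beta = X^T W z, where z is the working
    response z_i = eta_i + (y_i - mu_i) * d eta_i / d mu_i.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse logit
        w = mu * (1.0 - mu)               # w_ii for the canonical logit link
        z = eta + (y - mu) / w            # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy data: probability of "success" increasing with x
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([0., 0., 0., 1., 0., 1., 1., 1.])
X = np.column_stack([np.ones_like(x), x])
beta_hat = irls_logistic(X, y)
```

At convergence the score equations $X^\top(y - \mu) = 0$ hold for the canonical link, which gives a simple way to check the fit.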
General logistic regression

We now focus on models where the outcome variables are measured on a binary scale. The goal of the analysis is to relate the probability $\pi_i$ of the Bernoulli distribution to the explanatory variables:

$$ g(\pi_i) = x_i^\top \beta \qquad (14) $$
General logistic regression

Consider the Default dataset from the ISLR R package. We want to estimate the probability of default as a function of balance.
Figure 1: Estimated probability of default using linear (left) and logistic (right) regression.
General logistic regression

# Linear model
lm <- lm(default.bin ~ balance)
plot(balance, default.bin, pch = 3, col = "orange",
     xlab = "Balance", ylab = "Probability of Default")
abline(h = c(0, 1), lty = 2)
abline(a = lm$coefficients[1], b = lm$coefficients[2], col = "blue", lwd = 3)
General logistic regression

plot(balance, default.bin, pch = 3, col = "orange", xlab = "Balance",
     ylab = "Probability of Default")
abline(h = c(0, 1), lty = 2)
lines(balance[o], log$fitted.values[o], col = "blue", lwd = 3)
General logistic regression

Example
We want to predict the medical condition of a patient in the emergency room on the basis of the symptoms. Let's suppose we have three possible diagnoses:

Y = 1 (stroke), 2 (drug overdose), 3 (epileptic seizure).

library(SemiPar)

plot(wage, union.member, main = "Trade Union Dataset", pch = 3,
     ylab = "Union Membership", xlab = "Wage")
abline(h = c(0, 1), lty = 2)
Analysis of trade union dataset

[Figure: scatter plot "Trade Union Dataset" of Union Membership (0/1) against Wage, 0 to 40]
Analysis of trade union dataset

The code below fits the logistic regression model, displays the fitted line and prints the output.

union.wage.glm <- glm(union.member ~ wage, family = "binomial")
o <- order(wage)

lines(wage[o], union.wage.glm$fitted.values[o], col = "blue", lwd = 3)
Analysis of trade union dataset

[Figure: fitted logistic curve for Union Membership against Wage, 0 to 40]
Analysis of trade union dataset

# Call:
# glm(formula = union.member ~ wage, family = "binomial")
#
# Deviance Residuals:
#      Min       1Q   Median       3Q      Max
#  -1.6140  -0.6338  -0.5592  -0.5174   2.0590
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.20696    0.23298  -9.473  < 2e-16 ***
# wage         0.07174    0.02005   3.577 0.000347 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
Analysis of trade union dataset

[Figure: scatter plot "Trade Union Dataset" of Union Membership (0/1) against Age, 20 to 60]
Analysis of trade union dataset

The code below fits the logistic regression model, displays the fitted line and prints the output.

union.age.glm <- glm(union.member ~ age, family = "binomial")
o <- order(age)

lines(age[o], union.age.glm$fitted.values[o], col = "blue", lwd = 3)
Analysis of trade union dataset

[Figure: fitted logistic curve for Union Membership against Age, 20 to 60]
Analysis of trade union dataset

# Call:
# glm(formula = union.member ~ wage, family = "binomial")
#
# Deviance Residuals:
#      Min       1Q   Median       3Q      Max
#  -1.6140  -0.6338  -0.5592  -0.5174   2.0590
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.20696    0.23298  -9.473  < 2e-16 ***
# wage         0.07174    0.02005   3.577 0.000347 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
Analysis of trade union dataset
Both age and wage seem to be significant, so let’s fit a model with 3 parameters.
Analysis of trade union dataset

union.wage.age.glm <- glm(union.member ~ wage + age, family = "binomial")
summary(union.wage.age.glm)

# Call:
# glm(formula = union.member ~ wage + age, family = "binomial")
#
# Deviance Residuals:
#      Min       1Q   Median       3Q      Max
#  -1.3438  -0.6506  -0.5565  -0.4620   2.1485
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.976095   0.426685  -6.975 3.06e-12 ***
# wage         0.065169   0.020117   3.239   0.0012 **
# age          0.021861   0.009722   2.249   0.0245 *
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Prediction

Once the coefficients have been estimated, predictions are obtained by using those estimates with the desired level of predictors.

Example (Analysis of trade union dataset)
If we want to predict the probability of union membership for someone who is 56 years old and has a wage of 6.5, we compute

p_pred <- exp(sum(union.wage.age.glm$coefficients * c(1, 6.5, 56))) /
  (1 + exp(sum(union.wage.age.glm$coefficients * c(1, 6.5, 56))))
p_pred
# [1] 0.209446

Since $\hat\pi_{\mathrm{new}} < 0.5$, we classify the new individual as not a union member.
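The same calculation as a Python sketch, with the coefficients copied from the model summary on the earlier slide; the inverse logit maps the linear predictor back to a probability.

```python
import math

# Coefficients from the fitted model summary: intercept, wage, age
beta = [-2.976095, 0.065169, 0.021861]
x_new = [1.0, 6.5, 56.0]  # intercept term, wage = 6.5, age = 56

eta = sum(b * x for b, x in zip(beta, x_new))  # linear predictor
p_pred = 1.0 / (1.0 + math.exp(-eta))          # inverse logit
```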
Goodness of fit

In a GLM, we would like to assign a residual $e_i$ to each observation which measures the discrepancy between $Y_i$ and the value predicted by the fitted model. There are two main difficulties associated with generalised linear models:
Pearson chi-squared statistic

To calculate Pearson residuals, we take the difference between observed and fitted values and divide by an estimate of the standard deviation of the observed values.

Residuals for the binomial model. For $Y_i \sim \mathrm{Bin}(n_i, \pi_i)$, the Pearson residuals are

$$ P_i = \frac{y_i - n_i \hat\pi_i}{\sqrt{n_i \hat\pi_i (1 - \hat\pi_i)}}, \qquad i = 1, \ldots, N. $$
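The formula translates directly into code. A quick sketch with hypothetical grouped data (not the lecture's):

```python
import math

def pearson_residual(y, n, pi_hat):
    """Pearson residual for a Bin(n, pi_hat) observation y."""
    return (y - n * pi_hat) / math.sqrt(n * pi_hat * (1.0 - pi_hat))

# Hypothetical grouped data: (successes y, trials n, fitted probability)
obs = [(3, 10, 0.25), (7, 12, 0.55), (1, 8, 0.20)]
residuals = [pearson_residual(y, n, p) for y, n, p in obs]
```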
Pearson chi-squared statistic

where $o_i$ represents the observed frequencies and $e_i$ the expected frequencies.
Deviance

$$ D = 2 \sum_{i=1}^{N} \left[ y_i \log\left( \frac{y_i}{n_i \hat\pi_i} \right) + (n_i - y_i) \log\left( \frac{n_i - y_i}{n_i - n_i \hat\pi_i} \right) \right] \qquad (17) $$
Deviance

This comes from the fact that, for $s \approx t$, $\; s \log\frac{s}{t} = (s - t) + \frac{1}{2}\frac{(s-t)^2}{t} + \cdots$
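This expansion follows in one step by writing $s = t + (s - t)$ and using $\log(1+u) = u - u^2/2 + \cdots$ with $u = (s-t)/t$:

```latex
s\log\frac{s}{t}
  = \bigl(t+(s-t)\bigr)\log\Bigl(1+\frac{s-t}{t}\Bigr)
  = \bigl(t+(s-t)\bigr)\Bigl(\frac{s-t}{t}-\frac{(s-t)^2}{2t^2}+\cdots\Bigr)
  = (s-t)+\frac{(s-t)^2}{2t}+\cdots
```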
Deviance

$$ D \sim \chi^2(N - p) \qquad (18) $$

therefore $P^2 \sim \chi^2(N - p)$.
Deviance

Example (Analysis of trade union dataset)
We fitted the logistic model ($M_1$)

$$ \log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1\,\mathrm{wage} + \beta_2\,\mathrm{age}, $$

and we test

$H_0$: $\beta_1 = \beta_2 = 0$
$H_1$: $\beta_1, \beta_2$ not both zero

If $H_0$ were true, then both models would describe the data well. We would have $D_0 \sim \chi^2(N - 1)$ and $D_1 \sim \chi^2(N - 3)$, so that $D_0 - D_1 \sim \chi^2(2)$. However, the observed difference is larger than the 95% quantile of the $\chi^2(2)$ distribution, 5.9914645. Hence we reject $H_0$ at the 5% significance level.
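A numeric sketch of this test using values reported elsewhere in these slides. One assumption is made explicit below: for ungrouped binary data the saturated log-likelihood is zero, so the null deviance can be recovered from the minimal model's AIC.

```python
# Deviance of the fitted model (slide: union.wage.age.glm$deviance)
D1 = 485.5239
# For ungrouped binary data the saturated log-likelihood is 0, so the null
# deviance equals -2*logLik of the minimal model; from AIC(union0.glm) =
# 505.0841 with one parameter, -2*logLik = 505.0841 - 2.
D0 = 505.0841 - 2.0

lr_stat = D0 - D1          # ~ chi-squared(2) under H0
chi2_2_q95 = 5.9914645     # 95% quantile of chi-squared(2)
reject_H0 = lr_stat > chi2_2_q95
```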
Hosmer-Lemeshow statistic

union.grp.glm <- glm(cbind(uniongrp$y, uniongrp$n - uniongrp$y) ~ uniongrp$x,
                     family = binomial)
summary(union.grp.glm)
# Call:
# glm(formula = cbind(uniongrp$y, uniongrp$n - uniongrp$y) ~ uniongrp$x,
#     family = binomial)
#
# Deviance Residuals:
#      Min       1Q   Median       3Q      Max
#  -2.4667  -0.7248  -0.5911   0.2845   2.2628
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.20696    0.23298  -9.473  < 2e-16 ***
# uniongrp$x   0.07174    0.02005   3.577 0.000347 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow statistic

The estimates and standard errors are the same, but the goodness-of-fit statistics differ.
Likelihood ratio χ² statistic

Sometimes the log-likelihood of the fitted model is compared with the log-likelihood of the minimal model, the model for which all $\pi_i$ are equal; the estimate is then

$$ \tilde\pi = \sum_{i=1}^{N} y_i \Big/ \sum_{i=1}^{N} n_i. $$

Therefore $C \sim \chi^2(p - 1)$.
Likelihood ratio χ² statistic

# Minimal model
union0.glm <- glm(union.member ~ 1, family = binomial)

Cstat <- 2 * (logLik(union.wage.age.glm) - logLik(union0.glm))
alpha <- 0.05
p <- length(union.wage.age.glm$coefficients)

In this example the fitted model is preferred to the minimal model.
Pseudo-R²

Analogously to multiple linear regression, the likelihood ratio statistic can be normalised:

$$ \text{pseudo-}R^2 = \frac{\ell(\tilde\pi; y) - \ell(\hat\pi; y)}{\ell(\tilde\pi; y)} \qquad (19) $$
representing the proportional improvement in the log-likelihood function due to the
terms in the model of interest, compared with the minimal model.
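A small sketch of this computation with hypothetical log-likelihood values (labelled as such; in practice, plug in logLik values from the fitted and minimal models):

```python
def pseudo_r2(loglik_min, loglik_fit):
    """Proportional improvement in log-likelihood over the minimal model."""
    return (loglik_min - loglik_fit) / loglik_min

# Hypothetical log-likelihoods of the minimal and fitted models
r2 = pseudo_r2(-251.54, -242.76)
```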
The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are very popular goodness-of-fit statistics based on the log-likelihood, with an adjustment for the number of parameters and the sample size.
AIC and BIC

Remark: all these statistics (except the pseudo-R²) summarise how well a particular model fits the data: a small value (or a large p-value) indicates that the model fits well.

BIC(union0.glm)
# [1] 509.3645
AIC(union0.glm)
# [1] 505.0841
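These criteria adjust $-2\ell$ by $2p$ (AIC) and $p \log N$ (BIC). The sketch below reproduces the minimal-model values above, assuming N = 534 observations in the trade union data (inferred from the BIC−AIC gap, not stated on this slide):

```python
import math

def aic(loglik, p):
    # AIC = -2*loglik + 2*p
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    # BIC = -2*loglik + p*log(n)
    return -2.0 * loglik + p * math.log(n)

# Minimal model: -2*logLik recovered from the reported AIC with p = 1
ll0 = -(505.0841 - 2.0) / 2.0
```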
Residuals

(The sign term makes sure that the signs of $d_i$ and $P_i$ match.)

Note that $\sum_{i=1}^{N} d_i^2 = D$, the deviance.

union.wage.age.glm$deviance
# [1] 485.5239
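A minimal check of this identity for binary data (toy fitted probabilities, not the union model): the deviance residual is $\mathrm{sign}(y_i - \hat\pi_i)\sqrt{-2[y_i\log\hat\pi_i + (1-y_i)\log(1-\hat\pi_i)]}$, and the squared residuals sum to $-2$ times the log-likelihood, i.e. the deviance.

```python
import math

def deviance_residual(y, pi_hat):
    """Deviance residual for a binary observation (n_i = 1)."""
    ll = y * math.log(pi_hat) + (1.0 - y) * math.log(1.0 - pi_hat)
    return math.copysign(math.sqrt(-2.0 * ll), y - pi_hat)

# Toy responses and fitted probabilities
y = [1.0, 0.0, 1.0, 0.0]
pi_hat = [0.7, 0.4, 0.6, 0.2]
d = [deviance_residual(yi, pi) for yi, pi in zip(y, pi_hat)]
D = sum(di * di for di in d)  # equals -2 * log-likelihood
```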
Residuals

The residuals can be used in the usual way: they should be plotted against the fitted values and against each explanatory variable.

In general, residual plots are less informative for GLMs than for multiple linear regression, therefore it is important to check all the other goodness-of-fit statistics.
Residuals

Figure 2: Plot of the residuals against each continuous explanatory variable (Wage and Age).
Residuals

Usually we plot residuals against the values of the linear predictors in a generalised linear model to look for patterns in the residuals that may be related to the mean.

But sometimes it can be hard to see patterns in residual plots for generalised linear models. For instance, in logistic regression, where the responses are binary, the residuals can only take two possible values for a given fitted probability (depending on whether the response is zero or one), and when residuals are plotted against linear predictor values all points lie on one of two smooth curves.

Figure 3: Pearson residuals against predicted probability of being a member. Blue illustrates positive residuals (individual is a member) and red negative residuals. The black line represents the smoothing spline.
Residuals

# Prediction
pred.memb <- predict(union.wage.age.glm)
pred.memb <- exp(pred.memb) / (1 + exp(pred.memb))  # original scale

plot(pred.memb, pr, col = c("red", "blue")[1 + union.member],
     xlab = "Prediction", ylab = "Residuals")
abline(h = 0, lty = 2, col = "grey", lwd = 2)
ss <- smooth.spline(pred.memb, pr)
lines(ss, lwd = 2)
Check of the link function

To test the goodness-of-fit of the logit function, one can consider a more general family of link functions

$$ g(\pi, \alpha) = \log\left( \frac{(1 - \pi)^{-\alpha} - 1}{\alpha} \right) \qquad (22) $$
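Since $(1-\pi)^{-1} - 1 = \pi/(1-\pi)$, the choice $\alpha = 1$ recovers the logit link; a quick numerical sketch of the family:

```python
import math

def g(pi, alpha):
    """Generalised link family: log(((1 - pi)^(-alpha) - 1) / alpha)."""
    return math.log(((1.0 - pi) ** (-alpha) - 1.0) / alpha)

def logit(pi):
    return math.log(pi / (1.0 - pi))

# alpha = 1 recovers the logit link; small alpha approaches the
# complementary log-log link log(-log(1 - pi)).
```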
Overdispersion
Odds ratios and prevalence ratios

The parameter estimates using the logit link are interpreted in terms of odds ratios. In a logistic regression model, increasing $X_1$ by one unit changes the log-odds by $\beta_1$, or, equivalently, it multiplies the odds by a factor $e^{\beta_1}$.
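For instance, using the wage coefficient from the wage + age model summary earlier in these slides, the implied odds ratio per unit increase in wage is:

```python
import math

# Wage coefficient from the fitted logistic model (slide output)
beta_wage = 0.065169
odds_ratio = math.exp(beta_wage)  # multiplicative change in odds per unit wage
```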
Odds ratios and prevalence ratios

Example
If the probability of disease in one group is 0.8 and in another is 0.2, then the odds ratio (OR) is

$$ \mathrm{OR} = \frac{0.8/(1 - 0.8)}{0.2/(1 - 0.2)} = \frac{4}{0.25} = 16, $$

whereas the prevalence ratio or relative risk (PR) is

$$ \mathrm{PR} = \frac{0.8}{0.2} = 4. $$
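The arithmetic above, as a quick check:

```python
def odds(p):
    return p / (1.0 - p)

p1, p2 = 0.8, 0.2
OR = odds(p1) / odds(p2)  # odds ratio
PR = p1 / p2              # prevalence ratio / relative risk
```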
Odds ratios and prevalence ratios

The prevalence ratio can be estimated through the log link instead of the logit link. However:
• models using it may fail to converge: the logit transformation leads to values defined on all of R, while the log transformation of probabilities leads to non-positive values;
• for continuous explanatory variables, the prevalence ratio is not linearly related to changes in the explanatory variables, therefore it is necessary to state the level of the variable for each prevalence ratio value.
Odds ratios and prevalence ratios

union.wage.age.log <- glm(union.member ~ wage + age,
                          family = binomial(link = "log"))
# Error: no valid set of coefficients has been found:
# please supply starting values