Course Notes
Regression Analysis
Mayer Alvo
Contents

5 Regression Diagnostics
  5.1 Transformations and weighting
    5.1.1 Variance stabilizing transformations
    5.1.2 Transformations to linearize the model
    5.1.3 Box-Cox transformations
  5.2 Weighted least squares
  5.3 Checking on the linear relationship assumption
    5.3.1 Descriptive Measures of Linear Association
  5.4 Calculations Using R
  5.5 R Session
  5.6 Data Sets
  5.7 R session commercial properties
7 Different Models
  7.1 Polynomial regression models
  7.2 Indicator regression models
  7.3 R Session
  7.4 Suggested Problems
  7.5 Data Sets
8 Multicollinearity
  8.1 Calculations Using R
  8.2 Data Sets
    9.1.2 Mallows Cp
    9.1.3 Akaike information criterion
    9.1.4 Schwarz's Bayesian criterion (SBC)
    9.1.5 Prediction sum of squares criterion (PRESS)
  9.2 Model selections
    9.2.1 All possible models
    9.2.2 Forward, Backward and Stepwise Regression
    9.2.3 LASSO and LAR regression
  9.3 Calculations Using R
  9.4 R Session
  9.5 Data Sets
  9.6 Suggested Problems
10 Logistic Regression
  10.1 Repeat Observations
  10.2 Multiple Logistic models
  10.3 Inference on model parameters
  10.4 Test for Goodness of Fit
    10.4.1 Deviance Goodness of Fit Test
    10.4.2 Hosmer-Lemeshow Goodness of Fit Test
  10.5 Diagnostic Measures for Logistic Regression
    10.5.1 Detection of Influential Observations
    10.5.2 Influence on the Fitted Linear Predictor
  10.6 Calculations Using R
  10.7 Data Sets
  10.8 R Session
12 References
1 Introduction
At the start, there are measurements on explanatory variables, denoted X1 , ..., Xp as well
on a response variable Y . Regression analysis then proceeds to describe the behavior of
the response variable in terms of explanatory variables. Specifically, it seeks to establish
a relationship between the response and the explanatory variables in order to monitor
how changes in the latter affect the former. The relationship can also be used for
predicting the value of a response given new values of the explanatory variables.
In all instances, the primary goal in regression is to develop a model that relates
the response to the explanatory variables, to test it and ultimately to use it for inference
and prediction.
Example 1.1. Suppose we have Y = sale values for n = 25 houses and X = Assessed
values. Hence the given data consists of the pairs
{(Xi , Yi ) , i = 1, ..., n}
X (assessed value)   Y (sale value)
238                  251
270                  251
235                  253
239                  255
274                  275
242                  277
242                  279
320                  295
279                  297
413                  412

[Figure: scatter plot of sale value Y against assessed value X]
We first plot the n paired data Yi vs Xi . If it seems reasonable to fit a straight line
to the points, we then postulate the following simple regression model
Yi = β0 + β1 Xi + ϵi (1.1)
1.1 The method of least squares
Assumption: The random error terms are uncorrelated, have mean equal to 0 and common variance equal to σ². Consequently,

    E[Yi] = β0 + β1 Xi,   σ²[Yi] = σ²
CAUTION: We emphasize that a well fitting regression model does not imply cau-
sation. One can relate stock market prices in N.Y. to the price of bananas in an offshore
island. This does not mean there is a causal relationship.
The linearity assumption leads to two linear equations in two unknowns whose solutions, denoted b0 and b1, are

    b0 = Ȳ − b1 X̄                                        (1.2)

    b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²                 (1.3)
       = Σ(Xi − X̄)Yi / Σ(Xi − X̄)²                       (1.4)
       = Σ ki Yi

where

    ki = (Xi − X̄) / Σ(Xi − X̄)²

Then it can be shown that

    Σ ki = 0,   Σ ki Xi = 1,   Σ ki² = 1 / Σ(Xi − X̄)²
Ŷ = b0 + b1 X (1.5)
Alternatively,
Ŷ = b0 + b1 X̄ + b1 (X − X̄) (1.6)
Theorem 1.1. (Gauss–Markov) The least squares estimators b0, b1 are unbiased and have minimum variance among all unbiased linear estimators.

Proof. Consider an unbiased linear estimator of β1, say β̂1 = Σ ci Yi, which must satisfy

    β1 = E[β̂1] = Σ ci E[Yi] = Σ ci (β0 + β1 Xi)

Hence, Σ ci = 0, Σ ci Xi = 1, and σ²[β̂1] = σ² Σ ci².
Write ci = ki + di where di = ci − ki. Then

    Σ ki di = Σ ki ci − Σ ki² = 0

on using the properties of the ci. Hence {ki} and {di} are orthogonal and we have, by the Pythagorean theorem,

    σ²[β̂1] = σ² Σ ci² = σ² { Σ ki² + Σ di² }
4. Σ(Yi − Ȳ)² = b1² Σ(Xi − X̄)² + Σ(Yi − Ŷi)²

5. The point (X̄, Ȳ) is on the fitted line. This can be seen from (1.5).

6. Under the normality assumption {ϵi} ∼ i.i.d. N(0, σ²), the method of maximum likelihood leads to the method of least squares.

Proof. a) We see that b1 = Σ ki Yi where ki = (Xi − X̄)/Σ(Xi − X̄)². Hence, b1 is unbiased in view of the properties of the {ki}.
Since Yi ∼ N(β0 + β1 Xi, σ²), it follows that

    b1 = Σ ki Yi ∼ N( Σ ki (β0 + β1 Xi), σ² Σ ki² ) = N( β1, σ² / Σ(Xi − X̄)² )

b) As well, b0 = Ȳ − b1 X̄ = (1/n) Σ Yi − X̄ Σ ki Yi = Σ (1/n − ki X̄) Yi. The result follows from the properties of the ki.
c) We shall demonstrate this result using the matrix approach in subsequent sections.
This theorem can be used to test hypotheses about the parameters and to construct
confidence intervals.
1.3 Analysis of Variance (ANOVA) table
    Total: SSTO = Σ(Yi − Ȳ)², with n − 1 degrees of freedom.

Σ(Yi − Ȳ)² has n − 1 degrees of freedom because of the constraint Σ(Yi − Ȳ) = 0.
b1² Σ(Xi − X̄)² has one degree of freedom because it is a function of b1 alone.
Σ(Yi − Ŷi)² has n − 2 degrees of freedom because it is a function of two estimated parameters. Each of the sums of squares is a quadratic form where the rank of the corresponding matrix is the degrees of freedom indicated.
Cochran's theorem applies and we conclude that the quadratic forms are independent and have chi square distributions. It is well known that the ratio of two independent chi square variables, each divided by its degrees of freedom, has an F distribution:

    F = [SSR/(σ²(p − 1))] / [SSE/(σ²(n − p))] = MSR/MSE ∼ F(p − 1, n − p)
The ANOVA table indicates how one can test the null hypothesis

    H0: β1 = 0  vs  H1: β1 ≠ 0

The null hypothesis is that the slope of the line is equal to 0. Under the null hypothesis, the mean square for regression and the mean square error are separate independent estimates of the variance σ². Hence if the null hypothesis is true, the F ratio should be close to 1. On the other hand, if the alternative hypothesis H1 is true, then the numerator of the F ratio is expected to be large. Consequently, large values of the F statistic are consistent with the alternative, and we reject the null hypothesis for large values of F.
Example 1.2. We consider the following example on grade point averages at the end
    Ŷ = b0 + b1 X = Σ [1/n + ki (X − X̄)] Yi

Moreover,

    σ²[Ŷ] = σ² Σ [1/n + ki (X − X̄)]²
          = σ² [ 1/n + (X − X̄)² / Σ(Xi − X̄)² ]

We note that the variance increases with the distance of X from X̄.

1.4 Confidence Intervals

The variance σ²[Ŷ] is estimated by

    s²[Ŷ] = MSE [ 1/n + (X − X̄)² / Σ(Xi − X̄)² ]
Hence inference in the form of confidence intervals and hypothesis testing for the mean response E[Y] is conducted using the fact that

    (Ŷ − E[Y]) / s[Ŷ] ∼ t(n−2)

For a new observation,

    Ynew = β0 + β1 X + ϵnew

which we predict by Ŷnew = Ŷ. The prediction error involves both the estimation error and the new error term, so that

    σ²[Ŷnew] = σ²[Ŷ] + σ² = σ² [ 1 + 1/n + (X − X̄)² / Σ(Xi − X̄)² ]

The variance σ²[Ŷnew] is estimated by

    s²[Ŷnew] = MSE [ 1 + 1/n + (X − X̄)² / Σ(Xi − X̄)² ]

Hence inference in the form of confidence intervals and hypothesis testing for the prediction of a new value is conducted using the fact that

    (Ŷnew − Ynew) / s[Ŷnew] ∼ t(n−2)
1.6 R Session
1.6.1 Rocket Propellant data
To read the data
Rocket=read.table(file.choose(),header=TRUE,sep='\t')
Rocket # prints out the data (columns Shear.strength and Age.of.Propellant)
y=Rocket$Shear.strength
x=Rocket$Age.of.Propellant
plot(x,y)
hist(y,prob=TRUE,main='Density histogram of Shear Strength')

To convert data from table.b1 in R to csv format

write.csv(table.b1,'table.csv') # this will write the new table in the working directory of your computer
Regression model for Rocket data
fit=lm(y~x)
fit
Call: lm(formula = y ~ x)
Coefficients: (Intercept) x 2627.82 -37.15
summary(fit)
Call: lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-215.98 -50.68 28.74 66.61 106.76
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2627.822 44.184 59.48 < 2e-16 ***
x -37.154 2.889 -12.86 1.64e-10 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 96.11 on 18 degrees of freedom
Multiple R-squared: 0.9018,
Adjusted R-squared: 0.8964
F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
cor(x,y)
[1] -0.9496533
plot(Rocket$Age.of.Propellant, Rocket$Shear.strength, xlab='Age', ylab='Shear Strength', main='Shear Strength vs Age of Propellant')
abline(fit,col='lightblue') # abline takes the fitted model, not the data frame
Regression without intercept
fit=lm(y~x-1,data=Rocket)
summary(fit)
Call: lm(formula = y ~ x - 1, data = Rocket)
Residuals:
Min 1Q Median 3Q Max
-1044.7 -497.6 742.3 1529.4 2428.2
1.7.1 Homework
Problems 2.1, 2.10, 2.22
2 Matrix Approach to Regression
We will preamble the matrix presentation by describing some distributional results. Let Y be a random vector, A a matrix of constants and B a vector of constants. If Z = AY + B, then

    E[Z] = A E[Y] + B

Proof: E[Zi] = E[ Σj aij Yj + bi ] = Σj aij E[Yj] + bi.
Definition 2.1. The covariance matrix of Y is COV[Y] = E{ [Y − EY][Y − EY]′ } ≡ Σ.

Definition 2.2. A random vector Y has a multivariate normal distribution if its density is given by

    f(y1, ..., yn) = |Σ|^(−1/2) (2π)^(−n/2) exp{ −(1/2)(y − µ)′Σ⁻¹(y − µ) }

where

    y′ = (y1, ..., yn),  µ′ = (µ1, ..., µn),  Σ = COV[Y]

denoted Y ∼ Nn(µ, Σ).
A fundamental result is the following.

Theorem 2.1. If Y ∼ Nn(µ, Σ) and A is a constant matrix, then AY ∼ N(Aµ, AΣA′).

This theorem implies that any linear combination of normal variates has a normal distribution. We do not prove this theorem here.
In particular, if A is the 1 × n row vector (1, ..., 1), then

    AY ∼ N1(Aµ, AΣA′)

where

    Aµ = Σ µi,   AΣA′ = Σ σj² + 2 Σ_{i<j} σij
The regression model in matrix form is

    Y = Xβ + ϵ ∼ Nn(Xβ, σ² In)

Derivatives. If z = a′y, then ∂z/∂y = a.
If z = y′y, then ∂z/∂y = 2y.
If z = a′Ay, then ∂z/∂y = A′a.
If z = y′Ay, then ∂z/∂y = A′y + Ay.
If z = y′Ay and A is symmetric, then ∂z/∂y = 2Ay.
2.1 Distributional Results
The least squares criterion is

    Q = (Y − Xβ)′(Y − Xβ)

Setting the derivative to zero,

    ∂Q/∂β = −2X′(Y − Xβ) = −2(X′Y − X′Xβ) = 0        (2.1)

yields the least squares estimator

    b = (X′X)⁻¹X′Y ≡ AY

But

    AXβ = (X′X)⁻¹X′Xβ = β

and

    AA′ = (X′X)⁻¹X′X(X′X)⁻¹ = (X′X)⁻¹

Hence,

    b ∼ Np( β, σ²(X′X)⁻¹ )

The fitted values are then

    Ŷ = Xb = X(X′X)⁻¹X′Y = HY
HH = H
H′ = H
(I − H) H = H − HH = 0
e = Y − Ŷ
= Y − HY
= (I − H) Y
Y = HY + (I − H) Y
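These properties of H are easy to confirm numerically. A minimal Python/numpy sketch on a small made-up design matrix:

```python
import numpy as np

# Hat matrix of a small made-up design (intercept plus one predictor).
X = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 5.0],
              [1.0, 7.0],
              [1.0, 9.0]])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(X.shape[0])

print(np.allclose(H @ H, H))                 # idempotent: HH = H
print(np.allclose(H, H.T))                   # symmetric: H' = H
print(np.allclose((I - H) @ H, 0))           # (I - H)H = 0
print(np.isclose(np.trace(H), X.shape[1]))   # trace H = p
```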
2.2 Properties of the hat matrix H
We note that

    σ²[e] = (I − H) σ²[Y] (I − H)′ = σ²(I − H)

which is estimated by

    s²[e] = (MSE)(I − H)

Moreover,

    σ²[b] = (X′X)⁻¹X′ [σ² I] X(X′X)⁻¹ = σ²(X′X)⁻¹
Exercise 2.1. a) For the case p = 2, obtain the hat matrix. Show that rank H = Trace H = 2.
b) Show the relationship

    Σ(Yi − Ȳ)² = b1² Σ(Xi − X̄)² + Σ(Yi − Ŷi)²

Definition 2.3. Let Y1, ..., Yn be a random sample from N(µ, σ²). A quadratic form in the Y's is defined to be the real quantity

    Q = Y′AY

where A is a symmetric matrix with spectral decomposition A = P′ΛP.
To compute E[Q], write Y′AY = (PY)′Λ(PY) and note

    E[(PY)i²] = Var[(PY)i] + [E(PY)i]² = (PΣP′)ii + [(P EY)i]²

Hence

    E[Y′AY] = Trace(ΛPΣP′) + µ′Aµ = Trace(P′ΛPΣ) + µ′Aµ = Trace(AΣ) + µ′Aµ
Lemma 2.1. The sample variance Sn² is an unbiased estimate of the population variance.

Proof. Let

    A = I − (1/n)11′

the n × n matrix with diagonal entries 1 − 1/n and off-diagonal entries −1/n. Then (n − 1) Sn² = Σ(Yi − Ȳ)² = Y′AY and

    E[Y′AY] = σ² Trace A + µ² 1′A1 = σ²(n − 1) + 0
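The quadratic-form identity in the proof can be checked directly. A short Python/numpy sketch on made-up data:

```python
import numpy as np

# Check that (n-1) S_n^2 = Y'(I - 11'/n) Y on made-up data.
Y = np.array([3.0, 5.0, 4.0, 8.0, 6.0])
n = len(Y)
A = np.eye(n) - np.ones((n, n)) / n   # centering matrix

quad = Y @ A @ Y
print(np.isclose(quad, np.sum((Y - Y.mean()) ** 2)))  # quadratic form = (n-1) S_n^2
print(np.isclose(np.trace(A), n - 1))                 # trace (= rank) is n - 1
```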
Since I − H is idempotent,

    (Y − Ŷ)′(Y − Ŷ) = Y′(I − H)Y

and

    E[Y′(I − H)Y] = Trace((I − H)Σ) + µ′(I − H)µ = σ²(n − p) + 0
Definition 2.4. (a) A random variable U is said to have a χ²ν distribution with ν degrees of freedom if its density is given by

    f(u; ν) = u^(ν/2 − 1) e^(−u/2) / [2^(ν/2) Γ(ν/2)],   u > 0, ν > 0

(b) The non-central chi square distribution χ²ν(λ) with non-centrality parameter λ is a Poisson weighted mixture of central chi square distributions. Its mean and variance are respectively ν + λ and 2ν + 4λ.
(c) We include here the fact that the ratio of two independent central chi square variables, each divided by its respective degrees of freedom, has an F distribution:

    Fν1,ν2 = (χ²ν1 / ν1) / (χ²ν2 / ν2)
Theorem 2.2. (Cochran's Theorem) Let Y be a random vector with distribution Nn(µ, σ²I). Suppose that we have the decomposition

    Y′Y = Q1 + ... + Qk

where Qi = Y′AiY and rank(Ai) = ni. Then the {Qi/σ²} are independent and have χ²ni(λi) distributions if and only if Σ ni = n.
Consider again the matrix A = I − (1/n)11′. Then

    Y′Y = Σ Yi² = Y′AY + Y′(11′/n)Y

and the ranks add to n:

    n = rank A + rank(11′/n) = (n − 1) + 1

From Cochran's theorem, Q1 = Y′AY/σ² ∼ χ²(n−1) and Q2 = Y′(11′/n)Y/σ² ∼ χ²(1) are independent. But

    Q1 = Σ(Yi − Ȳ)²/σ²,   Q2 = nȲ²/σ²

and hence the ratio

    F(1, n − 1) = (Q2/1) / (Q1/(n − 1)) = nȲ²/Sn²
Decompose Y = Xb + (Y − Xb). Then

    ∥Y∥² = ∥Xb∥² + ∥Y − Xb∥² = Y′HY + Y′(I − H)Y

By Cochran's theorem,

    Y′HY/σ² ∼ χ²p,   Y′(I − H)Y/σ² ∼ χ²(n−p)
and are independent. The first term is the sum of squares due to the regression
whereas the second represents the error sum of squares. We summarize this in the
next section in the analysis of variance table.
3 Multiple Linear Regression
In practice, one is often presented with several predictor variables. For two predictors,
the linear regression model becomes
Yi = β0 + β1 Xi1 + β2 Xi2 + ϵi
with the assumptions that {ϵi} are i.i.d. N(0, σ²). This model describes a plane in three dimensions. It is an additive model where β1 represents the change in the mean response per unit increase in X1 when X2 is held fixed. An analogous interpretation can be made for β2.
In general, we may have the linear regression model involving (p − 1) explanatory
variables
p−1
X
Yi = β0 + βk Xik + ϵi
k=1
The predictor variables may be qualitative, taking values 0 or 1, as for example if one wishes to take into account gender. So here

    X = 0 if the subject is male, 1 if the subject is female.

The model may also accommodate polynomial terms, as in

    Yi = β0 + β1 Xi + β2 Xi² + ϵi

a transformed response, as in

    ln Yi = β0 + β1 Xi1 + β2 Xi2 + ϵi

as well as interaction effects.
In all cases, it is instructive to make use of the matrix approach to unify the development.
In matrix form,

    Y = Xβ + ϵ,   ϵ ∼ Nn(0, σ² In)

    Ŷ = Xb = HY

where the hat matrix H = X(X′X)⁻¹X′. The variance-covariance matrix of the residuals e = (I − H)Y is

    σ²[e] = σ²(I − H)

which is estimated by

    s²[e] = (MSE)(I − H)

Also

    s²[b] = (MSE)(X′X)⁻¹
3.1 Extra sum of squares principle
The ANOVA decomposition SSTO = SSTR + SSE holds, where

    SSTO = Y′Y − (1/n)Y′JY = Y′( I − (1/n)J )Y
    SSE = e′e = Y′(I − H)Y
    SSTR = b′X′Y − (1/n)Y′JY = Y′( H − (1/n)J )Y

and J = 11′ is the n × n matrix of ones. To test the hypothesis

    H0: β1 = β2 = ... = βp−1 = 0
    H1: not all βk = 0

we use the ratio F = MSR/MSE. Tests of

    H0: βk = 0  vs  H1: βk ≠ 0

for individual coefficients may be conducted using the fact that the standardized coefficient has a Student t distribution:

    bk / s[bk] ∼ t(n−p)
Step 1. Specify the Full (F) model Y = β0 + β1X + ϵ and obtain the error sum of squares

    SSE(F) = Σ(Yi − Ŷi)²

Step 2. Specify the Reduced (R) model

    Y = β0 + ϵ

and obtain the corresponding error sum of squares

    SSE(R) = Σ(Yi − Ȳ)²

The logic now is to compare the two error sums of squares. With more parameters in the model, we expect that

    SSE(F) ≤ SSE(R)

If we have near equality above, we may conclude the model is not of much help. As a result, we may test the benefit of the model by computing the test statistic

    F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]        (3.1)

and rejecting the null hypothesis H0: β1 = 0 for large values of F*, which has an F distribution F(dfR − dfF, dfF).
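The full/reduced comparison is mechanical to carry out. A short Python sketch on made-up data (the notes use R; here dfR = n − 1 and dfF = n − 2):

```python
# Extra sum of squares comparison of the full model Y = b0 + b1 X and the
# reduced model Y = b0, on made-up data.
X = [1, 2, 3, 4, 5, 6]
Y = [1.8, 3.1, 3.9, 5.2, 5.8, 7.1]
n = len(X)
Xbar = sum(X) / n
Ybar = sum(Y) / n
Sxx = sum((x - Xbar) ** 2 for x in X)
b1 = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y)) / Sxx
b0 = Ybar - b1 * Xbar

SSE_F = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))  # df_F = n - 2
SSE_R = sum((y - Ybar) ** 2 for y in Y)                      # df_R = n - 1

F_star = ((SSE_R - SSE_F) / ((n - 1) - (n - 2))) / (SSE_F / (n - 2))
print(SSE_F <= SSE_R)   # the full model can only reduce the error SS
print(F_star > 0)
```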
For the lack of fit test, suppose there are repeat observations at c distinct levels X1, ..., Xc of the predictor, with full model

    Yij = µj + ϵij,   i = 1, ..., nj,  j = 1, ..., c

The {µj} are unrestricted parameters when X = Xj. Their least squares estimates are

    Ȳj = Σᵢ Yij / nj

with degrees of freedom

    dfF = Σj (nj − 1) ≡ n − c

Note that if all nj = 1, then dfF = 0, SSE(F) = 0, and the analysis does not proceed any further.
Consider now the reduced model which specifies the linear model
Yij = β0 + β1 Xj + ϵij
where
Ŷij = b0 + b1 Xj (3.2)
The degrees of freedom are dfR = (n − 2). Hence, we may test
H0 : E[Y ] = β0 + β1 X
H1 : E[Y ] ̸= β0 + β1 X
So the test here is on whether a linear model is justified at all. This is different from
just testing that the slope is zero.
We may gain some insight into the components of the F* ratio. Note that

    Yij − Ŷij = (Yij − Ȳj) + (Ȳj − Ŷij)

and

    Σij (Yij − Ŷij)² = Σij (Yij − Ȳj)² + Σij (Ȳj − Ŷij)²
The lack of fit ANOVA table is

    Source        SS                        df     MS                  F     E[MS]
    Lack of fit   SSLF = Σij (Ȳj − Ŷij)²   c−2    MSLF = SSLF/(c−2)   F*    σ² + Σ nj(µj − β0 − β1Xj)²/(c−2)
    Pure error    SSPE = Σij (Yij − Ȳj)²   n−c    MSPE = SSPE/(n−c)         σ²
    Total         Σij (Yij − Ȳ)²           n−1
We note

    SSE(R) = SSLF + SSPE

The approach can be extended to multiple regression. We define

    SSR(X2 | X1) = SSE(X1) − SSE(X1, X2)

to be the reduction in the error sum of squares when, after X1 is included, an additional variable X2 is added to the model. Since SSE(X1, X2) ≤ SSE(X1), this quantity is nonnegative.
Similarly, when three variables are involved, we may break down the sum of squares due to the regression as

    SSR(X1, X2, X3) = SSR(X1) + SSR(X2|X1) + SSR(X3|X1, X2)

This decomposition enables us to judge the effect an added variable has on the sum of squares due to the regression. An ANOVA table would be decomposed as follows.
    Source        SS                 df     MS
    Regression    SSR(X1, X2, X3)    3      MSR(X1, X2, X3)
      X1          SSR(X1)            1      MSR(X1)
      X2|X1       SSR(X2|X1)         1      MSR(X2|X1)
      X3|X1,X2    SSR(X3|X1,X2)      1      MSR(X3|X1,X2)
    Error         SSE(X1, X2, X3)    n−4    MSE(X1, X2, X3)
    Total         SSTO               n−1
The extra sum of squares principle considers a full model and a reduced model. It makes use of the statistic

    F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ] ∼ F(dfR − dfF, dfF)

to determine the usefulness of the additional variables. Suppose S1 represents the residual sum of squares when q variables are included and S2 the residual sum of squares when p variables are included, with p > q. Then the difference S1 − S2 is defined to be the extra sum of squares. It will be used to test the hypothesis that

    H0: βq+1 = ... = βp = 0

It can be shown that under that hypothesis, (S1 − S2)/(p − q) is an unbiased estimate of σ², independent of MSE, and hence their ratio will have an F distribution. Let P1 and P2 denote the hat (projection) matrices of the full and reduced models respectively, so that

    S1 − S2 = Y′(P1 − P2)Y = ∥P1Y − P2Y∥²

[Figure: projection diagram with vertices P1Y and P2Y and sides (Y − P1Y) and (P1 − P2)Y]

    E(S1 − S2) = E{Y′(P1 − P2)Y} = (p − q)σ² + 0
By repeated application of this principle, we can successively obtain for any regression model

    SS(b0), SS(b1|b0), SS(b2|b1, b0), ..., SS(bp|bp−1, ..., b0)

All these sums of squares are distributed as chi square with one degree of freedom, independent of MSE. The corresponding tests are conducted using t tests.

3.2 Simultaneous confidence intervals
There are occasions when we require simultaneous or joint confidence intervals for the
entire set of parameters. As an example, suppose we wish to obtain confidence intervals
for both the intercept and the slope of a simple linear regression. Computed separately,
we may obtain 95% confidence interval for each. If the statements are independent,
then the probability that both statements are correct is given by (0.95)2 = 0.9025.
Moreover, the intervals make use of the same data and consequently, the events are not
independent.
One approach that is frequently used begins with the Bonferroni inequality. For two events A1, A2,

    P(A1 ∪ A2) ≤ P(A1) + P(A2)

so that

    P(A1′ ∩ A2′) = 1 − P(A1 ∪ A2) ≥ 1 − P(A1) − P(A2)

If we set P(A1) = P(A2) = α, then

    P(A1′ ∩ A2′) ≥ 1 − 2α
Now the event (A1′ ∩ A2′) is the event that both confidence intervals simultaneously cover their parameters. For joint confidence of at least 0.95 we set

    1 − 2α = 0.95,  i.e.  α = 0.025

so that each interval is computed at level 1 − α = 0.975, using the quantile

    t(0.9875; n − 2)
3.3 R Session
We will use the Delivery Time data
a) Graphic
Delivery=read.table(file.choose(),header=TRUE,sep=’\t’)
names(Delivery)
[1] "Delivery.Time" "Number.of.Cases" "Distance"
plot(Delivery) #two-dimensional scatter plot
install.packages("plot3D") #install three-dimensional plot routine
library("plot3D")
x=Delivery$Number.of.Cases #define the variables
y=Delivery$Distance
z=Delivery$Delivery.Time
scatter3D(x,y,z,theta=15,phi=20,xlim=c(1,30),ylim=c(30,150)) # plot in 3D; many options are available
b) Model fitting
X1=Delivery$Number.of.Cases
X2=Delivery$Distance
Y=Delivery$Delivery.Time
model=lm(Y~X1+X2,data=Delivery)
summary(model)
Call: lm(formula = Y ~ X1 + X2, data = Delivery)
Residuals:
Min 1Q Median 3Q Max
-5.7880 -0.6629 0.4364 1.1566 7.4197
3.4 DATA SETS
4 Model adequacy checking
After fitting a regression model, it is important to verify whether or not the assumptions that led to the analysis are satisfied. The basic assumptions were that the relationship is linear and that the errors are uncorrelated, with mean 0, common variance σ² and, for inference, a normal distribution.
As well, we need to check for influential observations which may unduly influence the fitted model. Very large or very small values of the response may sometimes heavily alter the values of the estimated coefficients.
Definition 4.1. The basic tool that is used consists of analyzing the residuals

    ei = Yi − Ŷi        (4.1)

To check normality, plot e(k) vs Ek, where e(k) is the residual with rank k and Ek is the expected value of the kth smallest observation in a standard normal sample. Under normality, one expects a straight line plot.
It can be shown that

    σ²[ei] = σ²(1 − hii)

where hij is the ij-th element of the hat matrix. This demonstrates that the variances of the residuals are not equal. For this reason, we may define the Studentized (standardized) residuals, which have equal variance:

    ei* = ei / σ[ei] = ei / ( s √(1 − hii) )

where s² = MSE.

Definition 4.2. The semi-studentized residuals are defined as

    ei / √(1 − hii)
A plot of the standardized residuals vs fitted values is a useful check for non-constancy of variance. The plot should show a random distribution of the points. Alternatively, a non-constancy of variance would appear as a telescoping increasing or decreasing collection of points.

plot(fit,3) # plots of standardized residuals ei* vs fitted values

A scale-location plot can also be used to examine the homogeneity of the variance of the residuals. This is a plot of √|ei*| vs Ŷi.

Definition 4.3. PRESS residuals (see Montgomery et al., p. 139).
4.4 Residual plots against the regressor
4.8 R Session
We will use the Delivery time data
boxplot(X1,X2)
fit=lm(Y~X1+X2,data=Delivery)
plot(fit,2) # normal QQ plot of standardized residuals vs theoretical quantiles
library(MASS)
sresid=studres(fit)
hist(sresid,freq=FALSE)
sfit=seq(min(sresid),max(sresid),length=25)
yfit=dnorm(sfit)
lines(sfit,yfit) #superimposes a normal density on the histogram
Alternatively, superimpose a normal density with the mean and standard deviation of the residuals:
x=fit$residuals
m=mean(x); s=sd(x)
curve(dnorm(x,m,s),add=T) # inside curve(), x denotes the plotting grid
boxcox(fit) # BoxCox transformation for normality
plot(fit,3) #plots of standardized residuals e∗i vs Ŷi
plot(fit)
Hit <Return> to see next plot: (repeated for each of the four plots)
To see all 4 plots in a single page
par(mfrow=c(2,2))
plot(fit)
z=fit$residuals
qqnorm(z) # qq plot
qqline(z) # normal line superimposed
4.9 DATA SETS
5 Regression Diagnostics
We begin by considering transformations to linearize the model and weighting corrections
for violation of the variance assumption.
The usual assumptions would then have to be made and verified on the transformed
model.
(b) The model
Y = β0 + β1 X −1 + ϵ
can be linearized using the reciprocal transformation X ∗ = X −1 .
(c) The model
1
= β0 + β1 X + ϵ
Y
can be linearized using the reciprocal transformation Y ∗ = Y −1 .
(d) The model
X
Y =
β0 + β1 X
can be linearized using the reciprocal transformation in two steps. First
Y ∗ = Y −1
and then
X ∗ = X −1
to obtain
Y ∗ = β1 + β0 X ∗
where

    Ẏ = exp( Σ ln Yi / n )

is the geometric mean of the observations. The value of λ is usually determined by trial and error, whereby a model is fitted to Y(λ) for various values of λ and we select the one which minimizes the residual sum of squares from a graphical plot.
We note as well that a confidence interval can be constructed for λ. This is useful in that one may select a simple value of λ which is in the interval, such as λ = 0.5 (see Section 5.4.1, p. 189 in Montgomery et al.).
5.2 Weighted least squares

When the error variances are unequal, the weighted least squares criterion minimizes

    Σ wi ϵi²
Different weights may be chosen, as for example wi = √Xi or wi = √Yi.
The theory proceeds as follows. Define the diagonal matrix of weights

    W = diag(w1, w2, ..., wn)

Premultiplying by W^(1/2) transforms the model

    Y = Xβ + ε

to

    YW ≡ XW β + εW,  with YW = W^(1/2)Y and XW = W^(1/2)X

The weighted least squares estimator is

    bW = (XW′XW)⁻¹XW′YW = (X′WX)⁻¹X′WY

with error mean square

    Σ wi ei² / (n − p)
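The closed form bW = (X′WX)⁻¹X′WY is straightforward to verify: it minimizes the weighted sum of squares, so no other estimate (including ordinary least squares) can do better under that criterion. A Python/numpy sketch on made-up data, with weights chosen purely for illustration:

```python
import numpy as np

# Weighted least squares b_W = (X'WX)^{-1} X'WY, compared with OLS under
# the weighted criterion (made-up data; weights 1/x chosen for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.2, 3.9, 6.4, 7.7, 10.5, 11.6])
X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / x)   # e.g. error variance proportional to x

bW = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)

def wss(b):
    # weighted sum of squared residuals
    r = Y - X @ b
    return r @ W @ r

print(wss(bW) <= wss(b_ols) + 1e-12)  # WLS minimizes the weighted SS
```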
One way to proceed is to perform the usual regression. Then, group the data using the X variable. Estimate the variances si² of the Yi for each group. Then fit the variances against the averages of the Xi of the groups. We illustrate this approach with the Turkey data.
The Breusch-Pagan test for constancy of variance assumes the model

    log σi² = γ0 + γ1 Xi
We regress the squared residuals ei² against Xi in the usual way and calculate the regression sum of squares SSR*. We reject the null hypothesis that γ1 = 0 whenever

    χ²₁ ≡ (SSR*/2) / (SSE/n)²

is larger than the chi square critical value. This test can be conducted in R using

library(lmtest)
bptest(fit)
[Figures: histogram of residuals; Box-Cox log-likelihood curve with 95% interval for λ; residuals vs fitted values and standardized residual plots; Cook's distance plot; scatter plot of Y vs X]
5.3 Checking on the linear relationship assumption
The coefficient of determination is

    R² = SSR/SSTO = 1 − SSE/SSTO

This coefficient may be interpreted as the proportion of variation explained by the regression model. The larger the proportion, the better the model.
Care must be taken, however, when using this measure, because a large value of R² may arise even when the relationship is not linear, for example if the points lie on a quadratic.
5.4 Calculations Using R

b=boxcox(model) # profiles λ for the transformation Z = Y^λ
lambda <- b$x[which.max(b$y)] # lambda maximizing the profile log-likelihood
lambda
c) Durbin Watson test
durbinWatsonTest(fit) # tests for autocorrelated errors
d) Test on constancy of variance
library(lmtest)
bptest(fit) # Breusch-Pagan test
e) Installing olsrr
install.packages("olsrr")
library(olsrr)
ols_plot_dfbetas(fit) #plot of dfbetas
ols_plot_dffits(fit)# plot of dffits
5.5 R Session
We use the Delivery data
a) Breusch-Pagan test
bptest(fit)
studentized Breusch-Pagan test
data: fit BP = 11.988, df = 2, p-value = 0.002493
b) Weighted regression
wls_model <- lm(Y ~ X1+X2, data = Delivery, weights=1/Y)
summary(wls_model )
Call: lm(formula = Y ~ X1 + X2, data = Delivery, weights = 1/Y)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.04397 -0.26730 0.00011 0.23581 1.33217
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.622929 0.903842 4.008 0.000591 ***
X1 1.583045 0.163823 9.663 2.24e-09 ***
X2 0.011142 0.003123 3.567 0.001722 **
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6156 on 22 degrees of freedom
Multiple R-squared: 0.9369, Adjusted R-squared: 0.9312
F-statistic: 163.3 on 2 and 22 DF, p-value: 6.321e-14
5.6 Data Sets
c) Cook’s distance
plot(fit,5)
plot(fit,4,id.n=5)
d) Installing olsrr
install.packages("olsrr")
library(olsrr)
ols_plot_dfbetas(fit) #plot of dfbetas
ols_plot_dffits(fit)# plot of dffits
6 Diagnostics for Leverage and Measures of Influence
A single observation may unduly influence the results of a regression analysis. Hence,
the detection of such influential observations is important. In this connection the hat
matrix plays a very important role. We begin with the minimized sum of squares
    R(β̂) = (Y − Xβ̂)′(Y − Xβ̂)
         = Y′Y − Y′X(X′X)⁻¹X′Y
         = Y′[ I − X(X′X)⁻¹X′ ]Y
         = Y′(I − H)Y

where H = X(X′X)⁻¹X′. Let xi′ be the ith row of X. Then the ith diagonal element of H is, for i = 1, ..., n,

    hii = xi′(X′X)⁻¹xi

which for simple linear regression equals

    hii = 1/n + (Xi − X̄)² / Σ(Xi − X̄)²
Therefore, hii is a measure of how far the ith observation is from the mean. If Xi = X̄,
then hii = n1 which is the minimum value.
Definition 6.1. The quantity hii is called the leverage of the ith observation.
A further insight is gained by writing the mean X̄ in terms of the mean X̄(i) when the ith observation is deleted. We can show

    X̄ = (1/n)[ Xi + (n − 1)X̄(i) ]

so that

    Xi − X̄ = Xi − (1/n)[ Xi + (n − 1)X̄(i) ] = ((n − 1)/n)[ Xi − X̄(i) ]

Hence,

    hii = 1/n + (Xi − X̄)² / Σ(Xi − X̄)²
        = 1/n + ((n − 1)/n)² (Xi − X̄(i))² / Σ(Xi − X̄)²
This shows that the leverage of the ith observation will be large if Xi is far from the
mean of the other observations. So the leverage is concerned with the location of points
in the space of the independent variables which may be influential.
Summing the leverages,

    Σ hii = Trace(H) = Trace[ (X′X)⁻¹X′X ]        (6.2)
          = Trace(Ip) = p
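Both the closed form for hii and the trace identity can be checked numerically. A Python/numpy sketch on a made-up simple-regression design:

```python
import numpy as np

# Leverage h_ii from 1/n + (x_i - xbar)^2 / Sxx, checked against the
# diagonal of the hat matrix (made-up design).
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
n = len(x)
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T

Sxx = np.sum((x - x.mean()) ** 2)
h_formula = 1 / n + (x - x.mean()) ** 2 / Sxx

print(np.allclose(np.diag(H), h_formula))  # closed form matches diag(H)
print(np.isclose(np.trace(H), 2))          # sum of leverages = p = 2
```

Note that the point farthest from x̄ receives the largest leverage.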
6.1 Properties of the leverage
6.1.1 DFFITS

A useful measure of the influence that case i has on the fitted value Ŷi is given by

    DFFITSi = (Ŷi − Ŷi(i)) / √( MSE(i) hii )        (6.3)
            = ti ( hii / (1 − hii) )^(1/2)

where

    ti = ei [ (n − p − 1) / (SSE(1 − hii) − ei²) ]^(1/2)

represents the Studentized residual. This shows that DFFITSi can be calculated from the original residuals, the error sum of squares and the hat matrix values. The value of DFFITSi represents the number of estimated standard deviations of Ŷi that the fitted value increases or decreases with the inclusion of the ith case in fitting the regression model.
If case i is an X outlier and has high leverage, then ( hii / (1 − hii) )^(1/2) > 1 and DFFITSi will be large in absolute value. As a guideline, influential cases are flagged if

    |DFFITSi| > 1
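The claim that DFFITS needs no refitting can be verified by computing it both ways: from the closed form above, and by actually deleting case i and refitting. A Python/numpy sketch on made-up data:

```python
import numpy as np

# DFFITS from the closed form t_i * sqrt(h_ii/(1-h_ii)), checked against a
# direct leave-one-out refit (made-up data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])
Y = np.array([1.1, 2.3, 2.8, 4.1, 5.2, 9.9])
n, p = len(x), 2
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = Y - H @ Y
SSE = e @ e

i = n - 1  # examine the last (high-leverage) case
# Studentized residual computed from the original fit only
t_i = e[i] * ((n - p - 1) / (SSE * (1 - h[i]) - e[i] ** 2)) ** 0.5
dffits_formula = t_i * (h[i] / (1 - h[i])) ** 0.5

# Direct computation: refit with case i deleted
Xi = np.delete(X, i, axis=0)
Yi = np.delete(Y, i)
b_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ Yi)
resid_i = Yi - Xi @ b_i
MSE_i = (resid_i @ resid_i) / (n - 1 - p)
dffits_direct = ((H @ Y)[i] - X[i] @ b_i) / (MSE_i * h[i]) ** 0.5

print(np.isclose(dffits_formula, dffits_direct))  # the two routes agree
```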
Cook's distance is a function of the residual ei and the leverage hii. It can be large if either the residual is large and the leverage moderate, or if the residual is moderate and the leverage is large, or both are large. It can be shown that approximately

    Di ≈ F(p, n − p)
6.1.3 DFBETAS

DFBETAS measure the influence that case i has on each of the regression coefficients bk, k = 0, 1, ..., p − 1:

    DFBETASk(i) = (bk − bk(i)) / √( MSE(i) ckk )

where ckk is the kth diagonal element of (X′X)⁻¹. The MSE(i) may be computed from the relationship

    (n − p) MSE = (n − p − 1) MSE(i) + ei² / (1 − hii)

A large value of DFBETASk(i) indicates a large impact of the ith case on the kth regression coefficient. As a guideline, a case is flagged if

    |DFBETASk(i)| > 2/√n  (large n),   |DFBETASk(i)| > 1  (small n)
where I is the m × m identity matrix. The matrix identity is verified by multiplying the
right hand side by (A − U V ′ ).
Consider the partition of the X matrix
$$X = \begin{pmatrix} X_{(M)} \\ X_M \end{pmatrix}$$
where \(X_{(M)}\) is the reduced matrix when M rows are deleted and \(X_M\) is the matrix containing the deleted rows. Similarly, let
$$Y = \begin{pmatrix} Y_{(M)} \\ Y_M \end{pmatrix}$$
Then,
$$X'X = X_{(M)}'X_{(M)} + X_M'X_M, \qquad Y'Y = Y_{(M)}'Y_{(M)} + Y_M'Y_M, \qquad X'Y = X_{(M)}'Y_{(M)} + X_M'Y_M$$
Applying the matrix identity with \(A = X'X\) and \(U = V = X_M'\) gives
$$\left(X_{(M)}'X_{(M)}\right)^{-1} = (X'X)^{-1} + (X'X)^{-1}X_M'(I-H_M)^{-1}X_M(X'X)^{-1}$$
where
$$H_M = X_M(X'X)^{-1}X_M'$$
is the hat matrix for the observations left out.
If M observations are left out and the model is refitted, the least squares estimate
becomes
$$\hat{\beta}_{(M)} = \left(X_{(M)}'X_{(M)}\right)^{-1}X_{(M)}'Y_{(M)} \tag{6.6}$$
$$= \left(X_{(M)}'X_{(M)}\right)^{-1}\left(X'Y - X_M'Y_M\right) \tag{6.8}$$
Substituting the expression for \(\left(X_{(M)}'X_{(M)}\right)^{-1}\) above, we have
$$\hat{\beta}_{(M)} = \left[(X'X)^{-1} + (X'X)^{-1}X_M'(I-H_M)^{-1}X_M(X'X)^{-1}\right]\left(X'Y - X_M'Y_M\right)$$
Let \(\hat{Y}_M = X_M\hat{\beta}\) be the estimate of the response for the M values left out and let \(e_M = Y_M - \hat{Y}_M\) be the corresponding vector of residuals. A little algebra leads to the expression
$$\hat{\beta}_{(M)} - \hat{\beta} = -(X'X)^{-1}X_M'(I-H_M)^{-1}e_M \tag{6.9}$$
Expression (6.9) shows the change in the parameter estimate when M data points are deleted. When M = 1,
$$\hat{\beta}_{(i)} - \hat{\beta} = \frac{-(X'X)^{-1}x_i\,e_i}{1-h_{ii}}$$
We may also compute the effect of deletion on the error sum of squares. The error
sum of squares for the full model is
$$SSE = (n-p)\,S^2 = Y'Y - \hat{\beta}'X'Y$$
When M observations are deleted,
$$\begin{aligned}
(n-p-M)\,S_{(M)}^2 &= Y_{(M)}'Y_{(M)} - \hat{\beta}_{(M)}'X_{(M)}'Y_{(M)} \\
&= Y'Y - Y_M'Y_M - \hat{\beta}_{(M)}'\left(X'Y - X_M'Y_M\right) \\
&= (n-p)\,S^2 + \hat{\beta}'X'Y - \hat{\beta}_{(M)}'X'Y - Y_M'Y_M + \hat{\beta}_{(M)}'X_M'Y_M \\
&\equiv (n-p)\,S^2 + A + B
\end{aligned}$$
where \(A = \left(\hat{\beta}' - \hat{\beta}_{(M)}'\right)X'Y\) and \(B = -Y_M'Y_M + \hat{\beta}_{(M)}'X_M'Y_M\).
Using (6.9),
$$A = \left(\hat{\beta}' - \hat{\beta}_{(M)}'\right)X'Y = e_M'(I-H_M)^{-1}X_M\hat{\beta} = \hat{Y}_M'(I-H_M)^{-1}e_M$$
Similarly,
$$\begin{aligned}
B &= -Y_M'Y_M + \hat{\beta}_{(M)}'X_M'Y_M \\
&= -Y_M'Y_M + \left[\hat{\beta}' - e_M'(I-H_M)^{-1}X_M(X'X)^{-1}\right]X_M'Y_M \\
&= -Y_M'Y_M + \hat{\beta}'X_M'Y_M - e_M'(I-H_M)^{-1}H_M Y_M
\end{aligned}$$
so that
$$(n-p-M)\,S_{(M)}^2 = (n-p)\,S^2 + A + B$$
Since \(\hat{\beta}'X_M'Y_M = \hat{Y}_M'Y_M\), we have \(-Y_M'Y_M + \hat{\beta}'X_M'Y_M = -e_M'Y_M\), and by the identity
$$(I-H_M)^{-1} - H_M(I-H_M)^{-1} = I$$
$$e_M'(I-H_M)^{-1}H_M Y_M = e_M'(I-H_M)^{-1}Y_M - e_M'Y_M$$
Hence
$$\begin{aligned}
(n-p-M)\,S_{(M)}^2 &= (n-p)\,S^2 + e_M'(I-H_M)^{-1}\hat{Y}_M - e_M'(I-H_M)^{-1}Y_M \\
&= (n-p)\,S^2 - e_M'(I-H_M)^{-1}e_M
\end{aligned}$$
This shows that the residual sum of squares when M observations are deleted is reduced
by an amount that depends on eM and on the inverse of a matrix containing elements
of the hat matrix.
When M = 1,
$$(n-p-1)\,S_{(i)}^2 = (n-p)\,S^2 - \frac{e_i^2}{1-h_{ii}}$$
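Both deletion identities — (6.9) with M = 1 and the reduction of the error sum of squares — can be verified numerically. A Python/numpy sketch with simulated data (the design and coefficients are invented, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii
S2 = e @ e / (n - p)

i = 7
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
S2_i = np.sum((yi - Xi @ beta_i) ** 2) / (n - p - 1)

# change in the estimate, equation (6.9) with M = 1
lhs = beta_i - beta
rhs = -XtX_inv @ X[i] * e[i] / (1 - h[i])

# change in the error sum of squares
sse_lhs = (n - p - 1) * S2_i
sse_rhs = (n - p) * S2 - e[i] ** 2 / (1 - h[i])

print(np.allclose(lhs, rhs), np.isclose(sse_lhs, sse_rhs))
```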
Finally, we may compare the value of the ith observation Yi with the prediction Ŷ(i)
when that observation is not used in the fitting.
Let
$$d_i = Y_i - \hat{Y}_{(i)} = Y_i - x_i'\hat{\beta}_{(i)} = \frac{e_i}{1-h_{ii}}$$
with variance
$$\sigma^2[d_i] = \frac{\sigma^2}{1-h_{ii}}$$
Moreover,
$$x_i'\left(X_{(i)}'X_{(i)}\right)^{-1}x_i = x_i'(X'X)^{-1}x_i + \frac{\left[x_i'(X'X)^{-1}x_i\right]^2}{1-h_{ii}} = h_{ii} + \frac{h_{ii}^2}{1-h_{ii}} = \frac{h_{ii}}{1-h_{ii}}$$
and
$$1 + x_i'\left(X_{(i)}'X_{(i)}\right)^{-1}x_i = \frac{1}{1-h_{ii}}$$
Consequently, the studentized residuals are
$$\frac{d_i}{S_{(i)}\sqrt{1 + x_i'\left(X_{(i)}'X_{(i)}\right)^{-1}x_i}} = \frac{e_i\sqrt{1-h_{ii}}}{S_{(i)}(1-h_{ii})} = \frac{e_i}{S_{(i)}\sqrt{1-h_{ii}}} \sim t_{(n-p-1)}$$
It follows that
$$\begin{aligned}
DFFITS_i &= \frac{\hat{Y}_i - \hat{Y}_{(i)}}{S_{(i)}\sqrt{h_{ii}}} \\
&= e_i\left[\frac{n-p-1}{SSE\,(1-h_{ii}) - e_i^2}\right]^{1/2}\left(\frac{h_{ii}}{1-h_{ii}}\right)^{1/2} \\
&= t_i\left(\frac{h_{ii}}{1-h_{ii}}\right)^{1/2}
\end{aligned}$$
since
$$(n-p-1)\,S_{(i)}^2 = (n-p)\,S^2 - \frac{e_i^2}{1-h_{ii}}$$
We flag influential cases when
$$|DFFITS_i| > \begin{cases} 1 & \text{small/medium data sets} \\ 2\sqrt{p/n} & \text{large data sets} \end{cases}$$
Example 6.3. a) Illustrate lack of fit test using bank data Table 3.4 where
X= size of minimum deposit
Y=# of new accounts
Branch 1 2 3 4 5 6 7 8 9 10 11
X 125 100 200 75 150 175 75 175 125 200 100
Y 160 112 124 28 152 156 42 124 150 104 136
b) Use the following to create diagnostic plots of residuals vs fitted (to check lin-
earity), normal qq plots, scale-location vs fitted values and residuals vs leverage
6.3 R Session
Textbook 4.1
In R, the diagnostic plots are exhibited using the package olsrr:

library(olsrr)
ols_plot_cooksd_bar(fit)    # plot of Cook's distance vs the observations
ols_plot_cooksd_chart(fit)  # also plots Cook's distance vs the observations
ols_plot_dfbetas(fit)       # plot of DFBETAS vs the observations
ols_plot_dffits(fit)        # plot of DFFITS vs the observations

The following will also display: the hat matrix diagonal elements; under coefficients, a matrix whose i-th row contains the change in the estimated coefficients when the i-th case is dropped from the regression; under sigma, a vector whose i-th element contains the estimate of the residual standard deviation obtained when the i-th case is dropped from the regression; and under wt.res, a vector of weighted (or, for class glm, deviance) residuals:

diag=lm.influence(model)
diag
To see the values of Y, X, .fitted, .resid, .hat, .sigma, .cooksd and .std.resid, type in R:

library(broom)
model.diag.metrics=augment(model)
head(model.diag.metrics, 20)

influence.measures(model)
summary(influence.measures(model)) # exhibits the potentially influential observations
Alternatively use
hatvalues(model)
dfbetas(model,~)
dffits(model)
cooks.distance(model)
6.4 Data Sets
6.5 R session
Using the Weighted data:

Weighted=read.table(file.choose(),header=TRUE,sep='\t')
names(Weighted)   # [1] "X" "Y" "W"
Diameter=Weighted$X
Area=Weighted$Y
install.packages("ggfortify")
library(ggfortify)
fit=lm(Y~X, data=Weighted)
autoplot(fit)
[Figure: autoplot diagnostic panels — Residuals vs Fitted, Normal Q–Q, Scale–Location, and Residuals vs Leverage; cases 29, 32 and 33 are flagged.]
variance. A plot is good if there is a horizontal line with points spread equally and randomly around it.
A plot of residuals vs leverage helps to locate influential cases, if any. Some cases may be influential even if they appear to be within a reasonable range of the values. We watch for outlying values in the upper right or lower right of the plot. Look for values outside the dashed lines, where Cook's distance scores are highest.
7 Different Models
The regression set-up permits us to consider a variety of models, which we discuss here.

7.1 Polynomial regression models
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \cdots + \beta_k X^k + \epsilon$$
which may be fitted using the matrix approach. It is important to keep the order k as low as possible: for large k, the inversion of the matrix \(X'X\) will be numerically inaccurate, resulting in poor estimates of the parameters and their variances.
Often, orthogonal polynomials are used in the modeling because they simplify the fitting process. These are polynomials \(P_0, P_1, \ldots, P_k\), with \(P_j\) of degree j and
$$P_0(X_i) = 1$$
chosen so that \(\sum_{i=1}^n P_j(X_i)P_l(X_i) = 0\) whenever \(j \neq l\). Such polynomials have been tabulated (see Biometrika Tables for Statisticians). The least squares estimates are given by
$$\hat{\beta}_j = \frac{\sum_{i=1}^n P_j(X_i)\,Y_i}{\sum_{i=1}^n P_j^2(X_i)}, \qquad j = 0, 1, \ldots, k$$
The principal advantage of using orthogonal polynomials is that the model can be
fitted sequentially. This specific advantage is less important today in the age of high
speed computing compared to the times when much of the modeling was done using
calculators.
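The sequential-fitting property can be seen concretely by building orthogonal polynomial columns from a QR decomposition of the Vandermonde matrix. A Python/numpy sketch (the x and y values are made up for illustration):

```python
import numpy as np

# Hypothetical data: a quadratic trend plus a little noise.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 15)
y = 1.0 + 2.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=15)

k = 3
V = np.vander(x, k + 1, increasing=True)   # raw columns 1, x, x^2, x^3
Q, R = np.linalg.qr(V)                     # Q's columns are orthogonal polynomials P_j(x_i)

# Because the columns are orthogonal, each coefficient is the simple ratio
# from the least squares formula above.
beta_hat = np.array([(Q[:, j] @ y) / (Q[:, j] @ Q[:, j]) for j in range(k + 1)])

# Sequential fitting: the degree-2 fit uses only the first three terms, and it
# coincides with an ordinary least squares fit of a raw quadratic.
fit2_orth = Q[:, :3] @ beta_hat[:3]
V2 = np.vander(x, 3, increasing=True)
fit2_raw = V2 @ np.linalg.lstsq(V2, y, rcond=None)[0]
print(np.allclose(fit2_orth, fit2_raw))
```

Adding the cubic column leaves the lower-order coefficients unchanged, which is the advantage mentioned above.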
Sometimes a low order polynomial does not fit the data well. This can be due to
the fact that the function in question behaves differently in different parts of the range.
In that case, it is common to use splines functions or piece wise polynomial fitting. We
do not pursue this topic here. (Refer to the text book p.236-242)
When two or more variables are involved, cross terms are included, as in the following second-order model in two variables:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{11} X_1^2 + \beta_{22} X_2^2 + \beta_{12} X_1 X_2 + \epsilon$$
Such models are called response surface models. They are often used in control theory problems to optimize the selection of the control settings of the variables.
7.2 Indicator regression models
An interesting application is the case where one wishes to fit a simple linear model as a function of gender. Set
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
where \(X_2 = 1\) for males and \(X_2 = 0\) for females. In that case,
$$Y = \begin{cases} (\beta_0 + \beta_2) + \beta_1 X_1 + \epsilon & \text{males} \\ \beta_0 + \beta_1 X_1 + \epsilon & \text{females} \end{cases}$$
So here the two lines are parallel. This can be generalized to 2 or more dummy variables:

X2  X3
0   0    observation from category 1
1   0    observation from category 2
0   1    observation from category 3
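The parallel-lines model can be sketched numerically. A Python/numpy example with simulated data (the coefficients 2.0, 1.5 and 3.0 are invented for illustration):

```python
import numpy as np

# Hypothetical data: X1 continuous, X2 = 1 for males, 0 for females.
rng = np.random.default_rng(3)
x1 = np.tile(np.arange(1.0, 11.0), 2)
x2 = np.repeat([1.0, 0.0], 10)               # gender indicator
y = 2.0 + 1.5 * x1 + 3.0 * x2 + 0.2 * rng.normal(size=20)

X = np.column_stack([np.ones(20), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Two parallel fitted lines: intercept b0 + b2 for males, b0 for females,
# with the common slope b1.
print(f"males:   y = {b0 + b2:.2f} + {b1:.2f} x1")
print(f"females: y = {b0:.2f} + {b1:.2f} x1")
```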
Example 7.1. a) Suppose that we have the following time data and we wish to fit two
7.3 R Session
Textbook 8.1
8 Multicollinearity
When the predictor variables are correlated among themselves, multicollinearity is said to exist. This can cause serious problems, chief among them that the estimates become highly unstable. What are some of the symptoms of multicollinearity?
1. Large variation in the estimated coefficients when a new variable is either added or deleted.
2. Non-significant results in individual tests on the coefficients of important variables.
3. Large coefficients of simple correlation between pairs of variables.
4. Wide confidence intervals for the regression coefficients of important variables.
The principal difficulty is that the matrix \(X'X\) may not be invertible. As well, multicollinearity affects the interpretation of the coefficients in that they may vary in value. To illustrate, consider the case of two predictor variables \(X_1, X_2\). If the variables are standardized, then
$$X'X = \begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix}$$
where \(r_{12}\) is the correlation between the two variables. Moreover, the variance–covariance matrix is
$$\sigma^2\,(X'X)^{-1} = \frac{\sigma^2}{1-r_{12}^2}\begin{pmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{pmatrix}$$
The two regression coefficients have the same variance, and this increases as the correlation increases. Consequently, as \(|r_{12}| \to 1\), \(Var(\hat{\beta}_k) \to \infty\) and \(Cov(\hat{\beta}_1, \hat{\beta}_2) \to \pm\infty\) as \(r_{12} \to \pm 1\).
The estimates are
$$\hat{\beta} = (X'X)^{-1}X'Y$$
Hence,
$$\hat{\beta}_1 = \frac{r_{1Y} - r_{12}\,r_{2Y}}{1-r_{12}^2}, \qquad \hat{\beta}_2 = \frac{r_{2Y} - r_{12}\,r_{1Y}}{1-r_{12}^2}$$
In general, the diagonal elements of \((X'X)^{-1}\) are
$$C_{jj} = \frac{1}{1-R_j^2}$$
where \(R_j^2\) is the R-square value obtained from the regression of \(X_j\) on the other \(p-1\) variables. If there is strong multicollinearity between \(X_j\) and the other variables, then \(R_j^2 \approx 1\) and
$$Var(\hat{\beta}_j) = \frac{\sigma^2}{1-R_j^2} \approx \infty$$
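The inflation of \(C_{jj} = 1/(1-R_j^2)\) can be seen numerically. A Python/numpy sketch (the correlation values and the simulated design are hypothetical):

```python
import numpy as np

# For two standardized predictors with correlation r12, the diagonal of
# (X'X)^{-1} is 1/(1 - r12^2): both coefficient variances blow up as |r12| -> 1.
for r12 in (0.0, 0.9, 0.99):
    XtX = np.array([[1.0, r12], [r12, 1.0]])
    print(r12, np.diag(np.linalg.inv(XtX))[0])   # equals 1/(1 - r12**2)

# The same quantity from data: VIF_j = 1/(1 - R_j^2), with R_j^2 from the
# regression of X_j on the other predictors.
rng = np.random.default_rng(4)
z = rng.normal(size=(200, 2))
X = np.column_stack([z[:, 0],
                     0.95 * z[:, 0] + 0.3 * z[:, 1],
                     rng.normal(size=200)])
Xs = (X - X.mean(0)) / X.std(0)              # standardized predictors
Rmat = (Xs.T @ Xs) / 200                     # their correlation matrix
vif = np.diag(np.linalg.inv(Rmat))

coef = np.linalg.lstsq(Xs[:, 1:], Xs[:, 0], rcond=None)[0]
R2 = 1 - np.sum((Xs[:, 0] - Xs[:, 1:] @ coef) ** 2) / np.sum(Xs[:, 0] ** 2)
print(vif[0], 1 / (1 - R2))                  # the two agree
```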
As well, under multicollinearity, the values of the estimates tend to be large. Set
$$L = \left\|\hat{\beta} - \beta\right\|^2$$
Then,
$$E(L) = \sum_{j=1}^p E\left(\hat{\beta}_j - \beta_j\right)^2 = \sum_{j=1}^p Var\left(\hat{\beta}_j\right) = \sigma^2\,\operatorname{Trace}\left[(X'X)^{-1}\right] = \sigma^2\sum_{j=1}^p \frac{1}{\lambda_j}$$
Expanding \(L = \hat{\beta}'\hat{\beta} - 2\hat{\beta}'\beta + \beta'\beta\) and taking expectations gives
$$E\left(\hat{\beta}'\hat{\beta}\right) = \|\beta\|^2 + \sigma^2\sum_{j=1}^p \frac{1}{\lambda_j}$$
so that \(\hat{\beta}'\hat{\beta}\) is inflated when some eigenvalues \(\lambda_j\) of \(X'X\) are small.
The eigenvalues \(\lambda_j\) can also be used to measure the extent of multicollinearity in the system. If one or more are small, then there are near linear dependencies in the columns of \(X'X\). The condition number \(\kappa\) and condition indices \(\kappa_j\) of \(X'X\) are defined to be
$$\kappa = \frac{\lambda_{max}}{\lambda_{min}}, \qquad \kappa_j = \frac{\lambda_{max}}{\lambda_j}$$
The starting point is to first standardize the variables as
$$Y_i^* = \frac{1}{\sqrt{n-1}}\left(\frac{Y_i - \bar{Y}}{s_Y}\right), \qquad X_{ik}^* = \frac{1}{\sqrt{n-1}}\left(\frac{X_{ik} - \bar{X}_k}{s_k}\right), \quad k = 1, \ldots, p-1$$
variables in the model. Under perfect correlation, i.e. \(R_k^2 = 1\), the variance is unbounded. As a rule of thumb, a value \((VIF)_k > 10\) indicates that multicollinearity exists.

Table 3: VIF interpretation

VIF value       Conclusion
VIF = 1         Not correlated
1 < VIF < 5     Moderately correlated
VIF > 5         Highly correlated

Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables; it is in fact \(1 - R_k^2\). Tolerance values less than 0.10 indicate collinearity.
The eigenvalues of X’X are the squares of the singular values of X. The condition
indices are the square roots of the ratio of the largest eigenvalue to each individual
eigenvalue. The largest condition index is the condition number of the scaled X matrix.
Alternatively, as a diagnostic tool, we may compute the average
$$\overline{VIF} = \frac{\sum_k (VIF)_k}{p-1}$$
Mean values much greater than 1 point to serious multicollinearity.
Ridge regression is considered as a remedial measure for multicollinearity. The theory is as follows. The normal equations
$$(X'X)\,b = X'Y$$
become, in standardized form,
$$r_{XX}\,b = r_{YX}$$
and the ridge normal equations are
$$(r_{XX} + cI)\,b^R = r_{YX} \tag{8.1}$$
where \(c \ge 0\) is a constant and the superscript R indicates "ridge". The ridge standardized regression coefficients become
$$b^R = (r_{XX} + cI)^{-1}\,r_{YX}$$
The constant c reflects the fact that the ridge estimators will be biased, but they tend to be more stable (less variable) than the ordinary least squares estimators. The constant c is usually chosen in such a way that the estimates \(b_k^R\) are stable in value, or, alternatively, so that the \((VIF)_k\) are stable in value. A plot of the coefficients
against c is called the ridge trace and this helps in the selection of c.
Finally, we note that ridge regression can also be obtained from the method of
penalized regression. From (8.1), we have the following system of equations:
$$\begin{aligned}
(1+c)\,b_1^R + r_{12}\,b_2^R + \cdots + r_{1,p-1}\,b_{p-1}^R &= r_{Y1} \\
r_{21}\,b_1^R + (1+c)\,b_2^R + \cdots + r_{2,p-1}\,b_{p-1}^R &= r_{Y2} \\
&\;\;\vdots \\
r_{p-1,1}\,b_1^R + r_{p-1,2}\,b_2^R + \cdots + (1+c)\,b_{p-1}^R &= r_{Y,p-1}
\end{aligned} \tag{8.2}$$
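The ridge system is easy to solve numerically for a grid of c values. A Python/numpy sketch using hypothetical correlations with strong collinearity (r12 = 0.95); note how the solution shrinks as c grows:

```python
import numpy as np

# Hypothetical standardized correlations (invented for illustration).
r_XX = np.array([[1.0, 0.95], [0.95, 1.0]])
r_YX = np.array([0.80, 0.78])

# Solve the ridge normal equations (r_XX + c I) b^R = r_YX for several c.
for c in (0.0, 0.02, 0.1, 0.5):
    bR = np.linalg.solve(r_XX + c * np.eye(2), r_YX)
    print(c, bR, np.linalg.norm(bR))
```

Printing the norm shows the trade-off directly: a small c already stabilizes the wildly different OLS coefficients at c = 0.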
8.1 Calculations Using R
We obtain plots showing model fit assessment. These plots are used to detect non-
linearity, influential observations and outliers. They consist of side-by-side quantile plots
of the centered fit and the residuals. It shows how much variation in the data is explained
by the fit and how much remains in the residuals. For inappropriate models, the spread
of the residuals in such a plot is often greater than the spread of the centered fit.
[Figure: side-by-side quantile plots of the centered fit and the residuals, each plotted against Proportion Less.]
ols_correlations(model)
Next we compare the observed vs predicted plot to assess the fit of the model. Ideally, all points should be close to the diagonal line. Draw such a diagonal line on the graph and check where the points lie. If the model had a high R-square, all the points would be close to this diagonal line; the lower the R-square, the weaker the goodness of fit and the more dispersed the points are about the diagonal.
ols_plot_obs_fit(model)
[Figure: observed (V4) vs fitted values plot.]
Next we diagnose
ols_coll_diag(model) #Tolerance and Variance Inflation Factor -
Variables Tolerance VIF
V1 0.001410750 708.8429
V2 0.001771971 564.3434
V3 0.009559681 104.6060
Tolerance is the % of variation in the predictor not explained by the other predictors. To calculate it, regress the k-th predictor on the other predictors in the model and compute \(R_k^2\). Then
$$\text{Tolerance} = 1 - R_k^2$$
b) Diagnostics Panel Panel of plots for regression diagnostics
ols_plot_diagnostics(model)
c) Ridge regression
Next we load the package MASS for ridge regression
library(MASS)
y= c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.10)
lm.ridge(V4~V1+V2+V3,Bodyfat,lambda=y)
y V1 V2 V3
0.01 64.389989 2.7395079 -1.49200145 -1.3458552
0.02 42.218200 2.0683914 -0.91772066 -0.9921824
0.03 30.007043 1.6986330 -0.60142746 -0.7972812
0.04 22.276858 1.4644460 -0.40119461 -0.6738065
0.05 16.944517 1.3028056 -0.26306759 -0.5885534
0.06 13.044863 1.1845111 -0.16204816 -0.5261373
0.07 10.069445 1.0941794 -0.08496703 -0.4784538
0.08 7.724954 1.0229365 -0.02422730 -0.4408273
[Figure: ridge trace — standardized coefficients plotted against x$lambda.]
8.2 Data Sets
9 Building the Regression Model
When several predictor variables are involved, the issue that arises naturally is how to select the variables as parsimoniously as possible and still produce a "good" model. If p − 1 predictors are available, then there are \(2^{p-1}\) possible models which can be constructed. An approach which may be misleading is to include all the predictors initially and then discard the ones whose studentized coefficients are not significant. If multicollinearity exists, this approach can lead to error. The use of diagnostic procedures is important in the final selection of the model, as outliers uncovered by a residual analysis can greatly influence the solution. Some criterion is essential for the ultimate selection of the model.
9.1 Criteria for model selection
9.1.1 R²
This criterion chooses the model with the largest value of explained variation. A plot of R² vs the number of variables in the model will appear as a parabola with the last entry "curving" up a bit; this is the model with all variables in. One may draw a horizontal line parallel to the x-axis: the point where it meets the parabola determines the best fitting model, since it will be as good as when all the variables are included.
An adjusted R² takes into account the values of n and p:
$$R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO} = 1 - \frac{MSE(p)}{SSTO/(n-1)}$$
9.1.2 Mallows Cp
To derive the Mallows criterion, suppose that the true model has q predictor variables:
$$Y = X_q\beta_q + \varepsilon$$
Suppose instead we fit a model using only p predictor variables, and let \(H_p\) be the hat matrix using only the p variables. The bias for the i-th fitted value is
$$E\left(\hat{Y}_i\right) - \mu_i$$
The total mean squared error for all the fitted values, divided by \(\sigma^2\), is
$$\Gamma_p = \frac{1}{\sigma^2}\left\{\sum_i\left[E\left(\hat{Y}_i\right) - \mu_i\right]^2 + \sum_i \sigma^2\left[\hat{Y}_i\right]\right\}$$
We may estimate \(\sigma^2\) by the MSE when all the variables are included. The vector of residuals becomes
$$e_p = (I - H_p)\,Y$$
and the error sum of squares is
$$SSE_p = e_p'e_p$$
It follows that
$$\text{bias} = E(e_p) = (I - H_p)\,E(Y) = E(Y) - E\left(\hat{Y}\right)$$
since \(E[H_pY] = H_pE[Y] = E\left[\hat{Y}\right]\).
It can be shown that
$$E(SSE_p) = \sigma^2(n-p) + \sum_i\left[E\left(\hat{Y}_i\right) - \mu_i\right]^2$$
Hence the total mean squared error for all the fitted values, divided by \(\sigma^2\), is
$$\begin{aligned}
\Gamma_p &= \frac{1}{\sigma^2}\left\{\sum_i\left[E\left(\hat{Y}_i\right) - \mu_i\right]^2 + \sum_i \sigma^2\left[\hat{Y}_i\right]\right\} \\
&= \frac{1}{\sigma^2}\left[E(SSE_p) - \sigma^2(n-p) + \sum_i \sigma^2\left[\hat{Y}_i\right]\right] \\
&= \frac{1}{\sigma^2}E(SSE_p) - (n-p) + p
\end{aligned}$$
since
$$\sum_{i=1}^n \sigma^2\left[\hat{Y}_i\right] = \sigma^2\,\operatorname{Trace}\left[X(X'X)^{-1}X'\right] = \sigma^2\,\operatorname{Trace}(H) = p\,\sigma^2$$
Hence,
$$\Gamma_p = \frac{1}{\sigma^2}E(SSE_p) - (n-2p)$$
If we estimate \(\sigma^2\) by MSE and \(E(SSE_p)\) by \(SSE_p\), the Mallows criterion becomes
$$C_p = \frac{SSE_p}{MSE} - (n-2p)$$
Mallows proposed a graphical method for finding the best model: plot \(C_p\) vs p. Models having little bias will be close to the line \(C_p = p\); models with substantial bias will lie above the line. Sometimes a model may show some bias but contain fewer variables, and as a result may be preferred.

9.1.4 Schwartz's Bayesian criterion (SBC)
This criterion places a greater penalty on the number of parameters than the Akaike criterion, and it is the one used by R.

9.1.5 Prediction sum of squares criterion (PRESS)
$$PRESS = \sum_{i=1}^n\left(Y_i - \hat{Y}_{i(i)}\right)^2$$
where \(\hat{Y}_{i(i)}\) is the fitted value when the i-th observation is deleted.
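As a numerical check on the \(C_p\) formula, a Python/numpy sketch with simulated data (predictors and coefficients are invented): for the model containing all candidate predictors, \(C_p\) equals p exactly, and submodels that omit active predictors have much larger \(C_p\).

```python
import numpy as np

# Hypothetical data: 4 candidate predictors, of which only the first two matter.
rng = np.random.default_rng(5)
n = 40
Z = rng.normal(size=(n, 4))
y = 3.0 + 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)

def sse(cols):
    """Error sum of squares for the submodel using the given predictor columns."""
    X = np.column_stack([np.ones(n)] + [Z[:, j] for j in cols])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

full = (0, 1, 2, 3)
MSE = sse(full) / (n - len(full) - 1)        # sigma^2 estimated from the full model

def cp(cols):
    p = len(cols) + 1                        # parameters including the intercept
    return sse(cols) / MSE - (n - 2 * p)

print(round(cp(full), 6))                    # -> 5.0, i.e. p for the full model
print(cp((0, 1)), cp((2, 3)))                # good submodel vs badly biased one
```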
9.2 Model selections
Forward
Step 1. Begin with no regressors in the model. Compute the standardized Student t statistic for each variable and choose the one with the greatest absolute value to include in the model. This is also the variable that has the largest simple correlation with the response. A pre-selected critical F value, say \(F_{in}\), is chosen.
Step 2. With the variable from Step 1 in, choose the next variable using the same criterion as in Step 1, after adjusting for the effect of the first variable selected. The criterion makes use of partial correlations, which are computed between the residuals from Step 1 and the residuals from the regressions of the other regressors on \(X_1\): that is, residuals from \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1X_1\) and residuals \(\hat{X}_j = \hat{\alpha}_{0j} + \hat{\alpha}_{1j}X_1\) for \(j = 2, \ldots, K\). If \(X_2\) is selected, it implies that it has the largest partial F statistic. If \(F > F_{in}\), then \(X_2\) is entered into the model. Check to drop a variable already in the model if its t value is below a preset limit.
Step 3. Repeat the steps above until the largest partial F statistic no longer exceeds \(F_{in}\), or until all the variables are included.
Backward
Begin with all the regressors in the model. Compute the partial F statistic for each
regressor as if it were the last one to enter the model. We compare the smallest partial
F with the preselected Fout . If it is smaller, then that variable is removed from the
model. The procedure is repeated until the smallest partial F statistic is not less than
Fout . Backward elimination is often preferred to forward regression because it begins
with all the variables in the model.
Stepwise
9.4 R Session
k = ols_step_backward_aic(model)
k
plot(k)
Next, both
ols_step_both_aic(model) or
k = ols_step_both_aic(model)
plot(k)
10 Logistic Regression
Sometimes the response variable is discrete. For example, we may wish to model gender
or to estimate the likelihood that a person is wearing a life jacket. Consider the model
$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$$
where
$$Y_i = \begin{cases} 1 & \text{with probability } \pi_i \\ 0 & \text{with probability } 1-\pi_i \end{cases}$$
Then
$$E[Y_i] = \pi_i$$
The usual least squares fitting approach is problematic for the following reasons:
1. The errors \(\epsilon_i\) are not normally distributed.
2. The error variance \(\pi_i(1-\pi_i)\) is not constant.
3. There is no guarantee that the fitted model will force the estimate \(\hat{Y}_i\) to be in the interval (0, 1).
Instead, consider the logistic density
$$f(x) = \frac{e^x}{(1+e^x)^2}, \quad -\infty < x < \infty \tag{10.1}$$
with distribution function
$$F(t) = \frac{e^t}{1+e^t}$$
We can show that
$$E(X) = 0, \qquad \sigma^2[X] = \frac{\pi^2}{3}$$
$$\pi_i = F\left(k - \beta_0^* - \beta_1^*X_i\right) = F\left(\beta_0 + \beta_1X_i\right) = \frac{\exp(\beta_0 + \beta_1X_i)}{1+\exp(\beta_0 + \beta_1X_i)}$$
where
$$\beta_0 = k - \beta_0^*, \qquad \beta_1 = -\beta_1^*$$
It is common practice to model the logarithm of the odds:
$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \log\left(\frac{P(Y_i=1)}{1-P(Y_i=1)}\right) = \beta_0 + \beta_1X_i \tag{10.2}$$
The log likelihood is
$$\log\left(\prod_i f(y_i)\right) = \sum_i y_i(\beta_0 + \beta_1X_i) - \sum_i \log\left(1+\exp(\beta_0 + \beta_1X_i)\right)$$
10.1 Repeat Observations
The fitted values are
$$\hat{\pi}_i = \frac{\exp(b_0 + b_1X_i)}{1+\exp(b_0 + b_1X_i)}$$
To interpret the parameters in the logistic regression model, let us consider the fitted value at a specific value of X, say \(X_0\). The difference between the log odds at \(X_0 + 1\) and the log odds at \(X_0\) is \(\hat{\beta}_1\), so the estimated odds ratio is
$$\widehat{OR} = e^{\hat{\beta}_1}$$
The odds ratio is the estimated multiplicative change in the odds of success associated with a one unit change in the value of the predictor variable. For a change of d units, the odds ratio becomes
$$\widehat{OR} = e^{d\hat{\beta}_1}$$
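The maximum likelihood fit has no closed form, but Newton–Raphson (equivalently, Fisher scoring) on the log likelihood above converges quickly. A Python/numpy sketch with hypothetical binary data loosely echoing the experience example (the particular X and Y values are invented); after fitting, exp(b1) estimates the odds ratio:

```python
import numpy as np

# Hypothetical data: X = years of experience, Y = success indicator.
X = np.array([14, 29, 6, 25, 18, 4, 18, 12, 22, 6,
              30, 11, 30, 5, 20, 13, 9, 32, 24, 13], float)
Y = np.array([0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
              1, 0, 1, 0, 1, 0, 0, 1, 1, 0], float)
D = np.column_stack([np.ones(len(X)), X])

b = np.zeros(2)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-D @ b))
    grad = D.T @ (Y - pi)                 # score vector
    W = pi * (1 - pi)
    info = D.T @ (W[:, None] * D)         # Fisher information
    b = b + np.linalg.solve(info, grad)   # Newton-Raphson step

odds_ratio = np.exp(b[1])                 # estimated OR for +1 unit of X
print(b, odds_ratio)
```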
In the multiple logistic regression model,
$$E[Y] = \frac{\exp(\beta'X)}{1+\exp(\beta'X)}$$
so that
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta'X$$
For large samples, the Wald statistics
$$\frac{b_k - \beta_k}{s[b_k]} \sim N(0, 1), \qquad k = 0, \ldots, p-1$$
10.4 Test for Goodness of Fit
$$-2\left[Y\ln Y + (n-Y)\ln(n-Y) - n\ln n\right]$$
where Y is the total number of successes observed and n is the total number of observations; this is −2 times the maximized log likelihood under the null model with no predictor. We reject the null hypothesis that the regression is not significant if L is large.
$$H_1: E[Y] \neq \left(1+e^{-X'\beta}\right)^{-1}$$
Here we will make use of a Pearson chi-square goodness of fit test. The expected number of successes is \(n_i\hat{\pi}_i\) and the expected number of failures is \(n_i(1-\hat{\pi}_i)\); a Pearson chi-square statistic is then formed from the observed and expected counts.
If the fitted model is correct, the HL statistic follows a chi-square distribution with g − 1 degrees of freedom. We reject the null hypothesis for large values of the statistic HL.
Another test for the model is based on the likelihood ratio test, whereby we compare the reduced and the full models. Here we compare the current model to a saturated model, whereby each observation (or group when \(n_i > 1\)) has its own probability of success, estimated by \(Y_i/n_i\).
Under the reduced model
$$E[Y_i] = \left(1+e^{-X_i'\beta}\right)^{-1}$$
whereas under the full model (also called the saturated model) \(E[Y_i] = \pi_i\). The deviance is
$$DEV = -2\sum_{i=1}^n\left[Y_i\log\left(\frac{Y_i}{n_i\hat{\pi}_i}\right) + (n_i - Y_i)\log\left(\frac{n_i - Y_i}{n_i(1-\hat{\pi}_i)}\right)\right]$$
We reject for large values, i.e. \(DEV > \chi^2_{n-p}\). The deviance in logistic regression plays a role analogous to the residual sum of squares in ordinary regression.
R computes the null deviance (the deviance of the worst model without any predic-
tor) and the residual deviance. The quantity
Deviance
1−
N ull Deviance
is equal to 1 for a perfect fit and equal to 0 if the predictors do not add anything to the
model.
10.5 Diagnostic Measures for Logistic Regression
The Hosmer–Lemeshow statistic is
$$HL = \sum_{j=1}^g \frac{(O_j - N_j\hat{\pi}_j)^2}{N_j\hat{\pi}_j} + \sum_{j=1}^g \frac{\left(N_j - O_j - N_j(1-\hat{\pi}_j)\right)^2}{N_j(1-\hat{\pi}_j)} = \sum_{j=1}^g \frac{(O_j - N_j\hat{\pi}_j)^2}{N_j\hat{\pi}_j(1-\hat{\pi}_j)}$$
The ordinary residuals are
$$e_i = Y_i - \hat{\pi}_i$$
These do not have constant variance. The deviance residual is, for \(i = 1, \ldots, n\),
$$d_i = \pm\left\{2\left[Y_i\log\left(\frac{Y_i}{n_i\hat{\pi}_i}\right) + (n_i - Y_i)\log\left(\frac{n_i - Y_i}{n_i(1-\hat{\pi}_i)}\right)\right]\right\}^{1/2}$$
The Pearson residual is
$$r_{Pi} = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}}$$
and V is the diagonal matrix with \(V_{ii} = n_i\hat{\pi}_i(1-\hat{\pi}_i)\). The studentized Pearson residuals
so that
$$DEV = \sum_{i=1}^n (dev_i)^2$$
For a good model, E [Yi ] = π̂i and plots of rSP i vs π̂i and rSP i vs linear predictor Xi′ β
should show a smooth horizontal Lowess line through 0. Plots of the deviance and the
studentized Pearson residuals are useful to check for outliers. A normal probability plot
of the deviance residuals can be used to check for the fit of the model and for outliers.
A plot of the deviance vs the estimated probability of success can be used to determine
where the model is poorly fitted, at high or low probabilities.
Similarly for a plot of devi vs linear predictor Xi′ β.
10.6 Calculations Using R
Data for logistic analysis may come in one of two forms: either Bernoulli or binomial.
In this example it is binary.
library(ggplot2)
names(file)[1]="experience"
names(file)[2]="success"
mlogit=glm(success~experience,data=file,family="binomial")
summary(mlogit)
confint(mlogit)   # confidence intervals
exp(coef(mlogit)) # odds ratio
exp(cbind(OR=coef(mlogit),confint(mlogit))) # odds ratio and 95% confidence interval
newdata=with(file,data.frame(experience=10))
predict(mlogit,newdata=newdata,se=TRUE)
We now proceed with the analysis for the experience data
mydata=read.table(file.choose(),header=TRUE,sep=’\t’)
head(mydata)
experience success Fittedvalue
1 14 0 0.310262
2 29 0 0.835263
3 6 0 0.109996
4 25 1 0.726602
5 18 1 0.461837
6 4 0 0.082130
summary(mydata)
           experience  success  Fittedvalue
Min.          4.00      0.00      0.08213
1st Qu.       9.00      0.00      0.16710
Median       18.00      0.00      0.46184
Mean         16.88      0.44      0.44000
3rd Qu.      24.00      1.00      0.69338
Max.         32.00      1.00      0.89166

sapply(mydata,sd)   # standard deviations
  9.0752410  0.5066228  0.2874901
mlogit=glm(success~experience, data=mydata,family="binomial")
summary(mlogit)
Call: glm(formula = success ~ experience, family = "binomial", data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8992 -0.7509 -0.4140 0.7992 1.9624
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05970 1.25935 -2.430 0.0151 *
experience 0.16149 0.06498 2.485 0.0129 *
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.296 on 24 degrees of freedom
Residual deviance: 25.425 on 23 degrees of freedom
AIC: 29.425
Number of Fisher Scoring iterations: 4
confint(mlogit)
Waiting for profiling to be done
2.5 % 97.5 %
(Intercept) -6.03725238 -0.9160349
experience 0.05002505 0.3140397
exp(coef(mlogit)) (Intercept) experience
0.04690196 1.17525591
exp(cbind(OR=coef(mlogit),confint(mlogit)))
OR 2.5 % 97.5 %
(Intercept) 0.04690196 0.002388112 0.4001024
experience 1.17525591 1.051297434 1.3689441
newdata=with(mydata,data.frame(experience=10))
predict(mlogit,newdata=newdata,se=TRUE)
$fit 1 -1.444837
$se.fit [1] 0.7072129
$residual.scale [1] 1
Another package
# Installing the package
install.packages("dplyr")
# Loading package
library(dplyr)
# For logistic regression
install.packages("caTools")
# For ROC curve to evaluate the model
install.packages("ROCR")
# Loading packages
library(caTools)
library(ROCR)
Sometimes the data can be split into a training set and a testing set:
# Splitting dataset
split = sample.split(mtcars, SplitRatio = 0.8)
split
train_reg = subset(mtcars, split == "TRUE")
test_reg = subset(mtcars, split == "FALSE")
# Training model
logistic_model = glm(vs ~ wt + disp, data = train_reg, family = "binomial")
logistic_model
# Summary
summary(logistic_model)
When the data is aggregated,
p=Y/n
mlogit=glm(p~X,data=Toxicity,weights=n,family="binomial")
mlogit
predict_reg =predict(mlogit, type = "response")
predict_reg
Stepwise logistic regression can also be done when several variables are involved
install.packages("MASS")
library(MASS)
stepAIC(model,trace=FALSE)
10.8 R Session
Problem 13.1 in Montgomery et al.
11 Poisson Regression
In Poisson regression, we have count data Y which follows a Poisson distribution with mean µ and hence variance µ:
$$f(y) = \frac{e^{-\mu}\,\mu^y}{y!}, \qquad y = 0, 1, \ldots$$
The estimation of the parameters β is obtained using the method of maximum likelihood. As in the logistic case, there is no closed form for the solution. The log likelihood is given by
$$\log L(y, \beta) = \sum_{i=1}^n y_i\log\mu_i - \sum_{i=1}^n \mu_i - \sum_{i=1}^n \log y_i!$$
The fitted means are
$$\hat{\mu}_i = \begin{cases} X_i'\hat{\beta} & \text{identity link} \\ \exp\left(X_i'\hat{\beta}\right) & \text{log link} \end{cases}$$
Inference on the Poisson model is conducted as in the logistic case. Both the logistic and the Poisson models are particular examples of the generalized linear model (GLM).
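As in the logistic case, the Poisson likelihood is maximized iteratively. A Fisher-scoring sketch for the log link (Python/numpy; the data and the coefficients 0.5 and 0.8 are simulated, purely for illustration):

```python
import numpy as np

# Hypothetical count data generated from a log-linear Poisson model.
rng = np.random.default_rng(6)
x = rng.uniform(0.0, 2.0, size=50)
mu_true = np.exp(0.5 + 0.8 * x)
y = rng.poisson(mu_true).astype(float)
D = np.column_stack([np.ones(50), x])

b = np.array([np.log(y.mean()), 0.0])     # simple starting value
for _ in range(25):
    mu = np.exp(D @ b)
    grad = D.T @ (y - mu)                 # score for the log link
    info = D.T @ (mu[:, None] * D)        # Fisher information
    b = b + np.linalg.solve(info, grad)   # Fisher scoring step

print(b)
```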
$$f(y_i, \theta_i, \phi) = \frac{e^{-\lambda}\,\lambda^{y_i}}{y_i!}$$
11.1 R Session
11.2 Data Sets
12 References
The following references were used in the preparation of these notes.
[1] F. J. Anscombe. The transformation of Poisson, binomial and negative-binomial data. Biometrika, Volume 35, Issue 3–4, December 1948, pages 246–254. https://doi-org.proxy.bib.uottawa.ca/10.1093/biomet/35.3-4.246
[2] Julian J. Faraway. Linear Models with R, second edition, 2014.
[3] David Kleinbaum, Larry Kupper, Azhar Nizam, Eli S. Rosenberg. Applied Regression Analysis and Other Multivariable Methods, fifth edition, 2014.
[4] Yu Guan. Variance stabilizing transformations of Poisson, binomial and negative binomial distributions. Statistics and Probability Letters 79 (2009), 1621–1629.
[5] Michael H. Kutner, Christopher J. Nachtsheim, John Neter, William Li. Applied Linear Statistical Models, fifth edition, 2005.
[6] Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining. Introduction to Linear Regression Analysis, sixth edition, 2021.