
Mat 3375

Regression Analysis

Mayer Alvo

November 28, 2023


Contents
1 Introduction 1
1.1 The method of least squares . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Inference in regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Analysis of Variance (ANOVA) table . . . . . . . . . . . . . . . . . . . . 7
1.4 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Calculations using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6.1 Rocket Propellant data . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6.2 R Session for Plumbing Supplies data . . . . . . . . . . . . . . . . 13
1.6.3 R Session for the GPA data . . . . . . . . . . . . . . . . . . . . . 13
1.7 Other DATA SETS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7.1 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Matrix Approach to Regression 15


2.1 Distributional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Properties of the hat matrix H . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Multiple Linear Regression 25


3.1 Extra sum of squares principle . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Simultaneous confidence intervals . . . . . . . . . . . . . . . . . . . . . . 33
3.3 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 DATA SETS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Suggested Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4 Model adequacy checking 39


4.1 Checking for normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Checking for constancy of variance . . . . . . . . . . . . . . . . . . . . . 40
4.3 Residual plots against fitted values . . . . . . . . . . . . . . . . . . . . . 40
4.4 Residual plots against the regressor . . . . . . . . . . . . . . . . . . . . . 41
4.5 Residuals in time sequence plot . . . . . . . . . . . . . . . . . . . . . . . 41
4.6 Lack of fit of the regression model . . . . . . . . . . . . . . . . . . . . . . 41
4.7 Calculations Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.8 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


4.9 DATA SETS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5 Regression Diagnostics 45
5.1 Transformations and weighting . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 Variance stabilizing transformations . . . . . . . . . . . . . . . . . 45
5.1.2 Transformations to linearize the model . . . . . . . . . . . . . . . 45
5.1.3 Box-Cox transformations . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Weighted least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Checking on the linear relationship assumption . . . . . . . . . . . . . . . 53
5.3.1 Descriptive Measures of Linear Association . . . . . . . . . . . . . 53
5.4 Calculations Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.6 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.7 R session commercial properties . . . . . . . . . . . . . . . . . . . . . . . 55

6 Diagnostics for Leverage and Measures of Influence 57


6.1 Properties of the leverage . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.1.1 DFFITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1.2 Cook’s Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1.3 DFBETAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1.4 Deletion of Observations: theoretical developments . . . . . . . . 61
6.2 Calculations Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.5 R session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.6 Suggested Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7 Different Models 69
7.1 Polynomial regression models . . . . . . . . . . . . . . . . . . . . . . . . 69
7.2 Indicator regression models . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.3 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.4 Suggested Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.5 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

8 Multicollinearity 73
8.1 Calculations Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

9 Building the Regression Model 83


9.1 Criteria for model selection . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.1.1 R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83


9.1.2 Mallows Cp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.1.3 Akaike information criterion . . . . . . . . . . . . . . . . . . . . . 86
9.1.4 Schwarz’s Bayesian criterion (SBC) . . . . . . . . . . . . . . . . 86
9.1.5 Prediction sum of squares criterion (PRESS) . . . . . . . . . . . . 86
9.2 Model selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.2.1 All possible models . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.2.2 Forward, Backward and Stepwise Regression . . . . . . . . . . . . 87
9.2.3 LASSO and LAR regression . . . . . . . . . . . . . . . . . . . . . 88
9.3 Calculations Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.4 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.5 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.6 Suggested Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

10 Logistic Regression 91
10.1 Repeat Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
10.2 Multiple Logistic models . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.3 Inference on model parameters . . . . . . . . . . . . . . . . . . . . . . . . 94
10.4 Test for Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.4.1 Deviance Goodness of Fit Test . . . . . . . . . . . . . . . . . . . . 96
10.4.2 Hosmer-Lemeshow Goodness of Fit Test . . . . . . . . . . . . . . 97
10.5 Diagnostic Measures for Logistic Regression . . . . . . . . . . . . . . . . 97
10.5.1 Detection of Influential Observations . . . . . . . . . . . . . . . . 98
10.5.2 Influence on the Fitted Linear Predictor . . . . . . . . . . . . . . 98
10.6 Calculations Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
10.7 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10.8 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

11 Poisson Regression 103


11.1 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

12 References 107

1 Introduction

At the start, there are measurements on explanatory variables, denoted X1, ..., Xp, as well as
on a response variable Y. Regression analysis then proceeds to describe the behavior of
the response variable in terms of explanatory variables. Specifically, it seeks to establish
a relationship between the response and the explanatory variables in order to monitor
how changes in the latter affect the former. The relationship can also be used for
predicting the value of a response given new values of the explanatory variables.

In all instances, the primary goal in regression is to develop a model that relates
the response to the explanatory variables, to test it and ultimately to use it for inference
and prediction.

Example 1.1. Suppose we have Y = sale values for n = 25 houses and X = Assessed
values. Hence the given data consists of the pairs

{(Xi , Yi ) , i = 1, ..., n}


Assessed value X    Sale value Y
238    251
270    251
235    253
239    255
274    275
242    277
242    279
320    295
279    297
413    412
389    417
361    435
408    469
389    471
471    475
476    475
430    487
440    490
461    628
573    640
465    645
619    739
640    790
788    800
793    911
958    945

[Figure: scatter plot of Sale value Y against Assessed value X.]

We first plot the n paired data Yi vs Xi . If it seems reasonable to fit a straight line
to the points, we then postulate the following simple regression model

Yi = β0 + β1 Xi + ϵi (1.1)

Here, ϵ represents an unobserved random error term, β0 is the intercept, and β1 represents the slope of the line. Both β0 and β1 are called parameters. They are unknown and would need to be estimated in some way from the observed data.
Alternatively, the model may be expressed in terms of (Xi − X̄):

Yi = (β0 + β1 X̄) + β1 (Xi − X̄) + ϵi

where X̄ represents the average of the Xi .


The proposed model (1.1) is linear in the parameters β0 , β1 . The model would still
be referred to as linear if instead we had Xi2 instead of Xi . It is common practice to
make the following assumption:

Assumption: The random error terms are uncorrelated, have mean equal to 0 and com-
mon variance equal to σ 2 .

Under this assumption

E[Yi ] = β0 + β1 Xi

σ 2 [Yi ] = σ 2

CAUTION: We emphasize that a well fitting regression model does not imply cau-
sation. One can relate stock market prices in N.Y. to the price of bananas in an offshore
island. This does not mean there is a causal relationship.

1.1 The method of least squares


The method of least squares due to Gauss-Legendre is the most popular approach to
fitting a regression model.
Set Q as the sum of squared errors

Q = Σ ϵi² = Σ [Yi − β0 − β1 Xi]²

where the sums run over i = 1, ..., n.

Then minimize Q with respect to the parameters by differentiating with respect to β0 , β1 .


∂Q/∂β0 = −2 Σ [Yi − β0 − β1 Xi] = 0

∂Q/∂β1 = −2 Σ [Yi − β0 − β1 Xi] Xi = 0

3
1 Introduction

The linearity assumption leads to two linear equations in two unknowns whose
solutions denoted b0 , b1 are

b0 = Ȳ − b1 X̄    (1.2)

b1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²    (1.3)
   = Σ (Xi − X̄) Yi / Σ (Xi − X̄)²    (1.4)
   = Σ ki Yi

where

ki = (Xi − X̄) / Σ (Xi − X̄)²

Then it can be shown that

Σ ki = 0,   Σ ki Xi = 1,   Σ ki² = 1 / Σ (Xi − X̄)².

The equation of the fitted line is

Ŷ = b0 + b1 X    (1.5)

Alternatively,

Ŷ = (b0 + b1 X̄) + b1 (X − X̄)    (1.6)
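As a quick check, the estimates (1.2)–(1.4) can be computed directly and compared with lm(). This is a minimal sketch on simulated data; the values and variable names are purely illustrative.

set.seed(1)
X <- runif(25, 200, 1000)                    # hypothetical "assessed values"
Y <- 30 + 0.95 * X + rnorm(25, sd = 40)      # hypothetical "sale values"

k  <- (X - mean(X)) / sum((X - mean(X))^2)   # the k_i weights
b1 <- sum(k * Y)                             # slope, equation (1.4)
b0 <- mean(Y) - b1 * mean(X)                 # intercept, equation (1.2)
c(b0 = b0, b1 = b1)
coef(lm(Y ~ X))                              # agrees with the hand computation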

Theorem 1.1. (Gauss-Markov) The least squares estimators b0, b1 are unbiased and have minimum variance among all unbiased linear estimators.
P
Proof. Consider an unbiased estimator for β1, say β̂1 = Σ ci Yi, which must satisfy

β1 = E[β̂1] = Σ ci E[Yi] = Σ ci [β0 + β1 Xi]

Hence, Σ ci = 0, Σ ci Xi = 1 and σ²[β̂1] = σ² Σ ci².


Consider setting ci = ki + di where di is arbitrary. Then, substituting,

Σ ki di = Σ ki (ci − ki) = Σ ci (Xi − X̄) / Σ (Xi − X̄)² − 1 / Σ (Xi − X̄)² = 0

on using the properties of the ci. Hence {ki} and {di} are orthogonal and we have, by the Pythagorean theorem,

σ²[β̂1] = σ² Σ ci² = σ² { Σ ki² + Σ di² }

showing that the variance is minimized when the di are all 0.


We may write Ŷ = b0 + b1 X for the estimated or fitted line, ei = Yi − Ŷi for the estimated ith residual, and σ̂² = Σ ei² / (n − 2) for the estimate of the variance σ².

Theorem 1.2. The variances of the least squares estimators are

σ²[b0] = σ² [ 1/n + X̄² / Σ (Xi − X̄)² ]

σ²[b1] = σ² / Σ (Xi − X̄)²

These may be estimated by replacing σ² by

σ̂² = Σ ei² / (n − 2)    (1.7)

also known as the mean square error and denoted MSE.

Properties of the fitted Regression line


1. Σ ei = 0

2. Σ Yi = Σ Ŷi

3. Σ Xi ei = 0

4. Σ (Yi − Ȳ)² = b1² Σ (Xi − X̄)² + Σ (Yi − Ŷi)²

5. The point (X̄, Ȳ) is on the fitted line. This can be seen from (1.5)

6. Under the normality assumption {ϵi } ∼ i.i.d.N (0, σ 2 ), the method of maximum
likelihood leads to the method of least squares.

1.2 Inference in regression


The method of least squares was used to obtain the equation of the fitted regression
line. For the purpose of drawing inference, it is necessary to make some assumptions
on the distribution of the error terms, the most common of which is that the errors
{ϵi } ∼ i.i.d.N (0, σ 2 ).

Theorem 1.3. Suppose that we have the model Yi = β0 + β1 Xi + ϵi where {ϵi } ∼


i.i.d.N (0, σ 2 ) for i = 1, ..., n . Then
a) (b1 − β1) / s(b1) ∼ t_{n−2}, where s²(b1) = MSE / Σ (Xi − X̄)²
b) (b0 − β0) / s(b0) ∼ t_{n−2}, where s²(b0) = MSE [ 1/n + X̄² / Σ (Xi − X̄)² ]
c) MSE is an unbiased estimate of σ², (n − 2)MSE/σ² ∼ χ²_{n−2}, and MSE is independent of b0, b1.

Proof. a) We see that b1 = Σ ki Yi where ki = (Xi − X̄) / Σ (Xi − X̄)². Hence, b1 is unbiased in view of the properties of the {ki}.
Since Yi ∼ N(β0 + β1 Xi, σ²), it follows that

b1 = Σ ki Yi ∼ N( Σ ki (β0 + β1 Xi), σ² Σ ki² ) = N( β1, σ² / Σ (Xi − X̄)² )

b) As well, b0 = Ȳ − b1 X̄ = (1/n) Σ Yi − X̄ Σ ki Yi = Σ (1/n − ki X̄) Yi.
The result follows from the properties of the ki.
c) We shall demonstrate this result using the matrix approach in subsequent sections.

This theorem can be used to test hypotheses about the parameters and to construct
confidence intervals.
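To see Theorem 1.3 in action, the standard errors and t ratios can be reproduced by hand and compared with summary(); a sketch, reusing the same kind of simulated data as above (all values illustrative).

set.seed(1)
X <- runif(25, 200, 1000); Y <- 30 + 0.95 * X + rnorm(25, sd = 40)
fit <- lm(Y ~ X)

n    <- length(Y)
MSE  <- sum(resid(fit)^2) / (n - 2)
s.b1 <- sqrt(MSE / sum((X - mean(X))^2))                      # s(b1)
s.b0 <- sqrt(MSE * (1/n + mean(X)^2 / sum((X - mean(X))^2)))  # s(b0)
t.b1 <- coef(fit)[2] / s.b1                                   # tests H0: beta1 = 0
c(s.b0 = s.b0, s.b1 = s.b1, t.b1 = t.b1)
summary(fit)$coefficients   # same standard errors and t values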


1.3 Analysis of Variance (ANOVA) table


It is customary and revealing to summarize the statistical analysis in the form of a table.
We illustrate this for the case p = 2 exhibited in the table below.
Source        SS                              df      MS = SS/df   F statistic   E[MS]
Regression    SSR = b1² SXX = b1² Σ (Xi − X̄)²  p − 1   MSR          MSR/MSE       σ² + β1² Σ (Xi − X̄)²
Error         SSE = Σ (Yi − Ŷi)²               n − p   MSE                        σ²
Total         SSTO = Σ (Yi − Ȳ)²               n − 1

Σ (Yi − Ȳ)² has n − 1 degrees of freedom because of the constraint Σ (Yi − Ȳ) = 0.
b1² Σ (Xi − X̄)² has one degree of freedom because it is a function of b1 alone.
Σ (Yi − Ŷi)² has n − 2 degrees of freedom because it is a function of two estimated parameters.
Each of the sums of squares is a quadratic form whose matrix has rank equal to the degrees of freedom indicated.
Cochran’s theorem applies and we conclude that the quadratic forms are indepen-
dent and have chi square distributions. It is well known that the ratio of two independent
chi square divided by their degrees of freedom has a F-distribution

F = [SSR / (σ² (p − 1))] / [SSE / (σ² (n − p))] = MSR / MSE ∼ F(p−1, n−p)
The ANOVA table indicates how one can test the null hypothesis

H0 : β1 = 0

H1 : β1 ̸= 0

The null hypothesis is that the slope of the line is equal to 0. Under the null hypothesis, the regression mean square and the error mean square are independent unbiased estimates of the variance σ², so the F ratio should be near 1. On the other hand, if the alternative hypothesis H1 is true, the numerator of the F ratio is expected to be large. Consequently, large values of the F statistic are consistent with the alternative, and we reject the null hypothesis for large values of F.
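The ANOVA quantities are easy to reproduce directly; a sketch continuing with the simulated fit from the previous sketch (p = 2 here).

SSTO  <- sum((Y - mean(Y))^2)
SSE   <- sum(resid(fit)^2)
SSR   <- SSTO - SSE
Fstat <- (SSR / 1) / (SSE / (n - 2))   # MSR/MSE with p - 1 = 1 and n - p = n - 2
c(F = Fstat, p.value = pf(Fstat, 1, n - 2, lower.tail = FALSE))
anova(fit)                             # same F statistic and p-value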

Example 1.2. We consider the following example on grade point averages at the end of the freshman year (Y) as a function of the ACT test scores (X).


a) We plot the data
b) We obtain the least squares estimates
c) We plot the estimated regression function and estimate Y when X = 30
d) Compute the ANOVA table
e) Compute confidence intervals for the parameters

Exercise 1.1. Consider the following data on airfreight breakage (Y ) as a function of


shipment route (X). CH01PR21
i 1 2 3 4 5 6 7 8 9 10
Xi 1 0 2 0 3 1 0 1 2 0
Yi 16 9 17 12 22 13 8 15 19 11

a) Compute the ANOVA table


b) Compute confidence intervals for the parameters
c) Compute a confidence interval for the average response when X = 1

1.4 Confidence Intervals


It is of interest to construct confidence intervals for
a) the average E [Y ] = β0 + β1 X for a new observation X
b) the prediction of a new value of Y for a given X
The point estimate Ŷ = b0 + b1 X is used as the point estimate in both cases a) and
b).
It is unbiased and has a normal distribution as seen from

Ŷ = b0 + b1 X = Σ [ 1/n + ki (X − X̄) ] Yi

Moreover,

σ²[Ŷ] = σ² Σ [ 1/n + ki (X − X̄) ]² = σ² [ 1/n + (X − X̄)² / Σ (Xi − X̄)² ]

We note that the variance increases with the distance of X from X̄. The variance σ²[Ŷ]


is estimated by

s²[Ŷ] = MSE [ 1/n + (X − X̄)² / Σ (Xi − X̄)² ]

Hence inference, in the form of confidence intervals and hypothesis tests for the average E[Y], is conducted using the fact that

(Ŷ − E[Y]) / s[Ŷ] ∼ t_{n−2}

a Student t distribution with n − 2 degrees of freedom.


For the prediction problem, note that

Ynew = β0 + β1 X + ϵnew

The point prediction is Ŷnew = Ŷ, and the prediction error Ynew − Ŷnew has variance

σ²[Ŷnew] = σ²[Ŷ] + σ² = σ² [ 1 + 1/n + (X − X̄)² / Σ (Xi − X̄)² ]

The variance σ²[Ŷnew] is estimated by

s²[Ŷnew] = MSE [ 1 + 1/n + (X − X̄)² / Σ (Xi − X̄)² ]

Hence inference, in the form of a prediction interval for the new value, is conducted using the fact that

(Ŷnew − Ynew) / s[Ŷnew] ∼ t_{n−2}
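A sketch of both intervals at a new value X0, computed from the formulas above and checked against predict(); it continues with fit, MSE, X, Y and n from the earlier sketches, and X0 = 500 is an arbitrary illustrative value.

X0     <- 500
Y.hat  <- coef(fit)[1] + coef(fit)[2] * X0
s.mean <- sqrt(MSE * (1/n + (X0 - mean(X))^2 / sum((X - mean(X))^2)))
s.pred <- sqrt(MSE * (1 + 1/n + (X0 - mean(X))^2 / sum((X - mean(X))^2)))
tc     <- qt(0.975, n - 2)

rbind(confidence = Y.hat + c(-1, 1) * tc * s.mean,
      prediction = Y.hat + c(-1, 1) * tc * s.pred)
predict(fit, newdata = data.frame(X = X0), interval = "confidence")
predict(fit, newdata = data.frame(X = X0), interval = "prediction")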

Example 1.3. Consider the grade point average data (ACT).


a) Compute a confidence interval for the average response when ACT = 3.5
b) Compute a prediction interval for a new response when ACT = 3.5


1.5 Calculations using R


Computations for regression can be conveniently performed with the free software R.
Here is a list of the most useful commands.
a) Load the data
Suppose we have data in a file, usually in csv or text format. It can be read into R using the command
data=read.table(file.choose(),header=TRUE,sep='\t')
R will then open a window to browse for the document. The argument header=TRUE (the default is FALSE) indicates that the first row of the file contains the column names.
Finally, the argument sep='\t' specifies that tabs delimit the columns.
To verify that data is a data frame
is.data.frame(data)
[1] TRUE
We may display the names of the columns
names(data)
[1] ”length” ”Width”
To change the name of the first column
names(data)[1]="volume"
b) Accessing the data
To access the data length
data$length
c) Descriptive measures
mean(data$length)
We may assign
x=data$length
y=data$width
plot(x,y,ylab=’width in inches’,xlab=’length in inches’)
cor(x,y)
cor.test(x,y)
summary(data$length)
boxplot(data$length, data$width,names=c("length","width"))
We can determine the number of rows and columns
nrow(data)
ncol(data)
d) Graphics
We can draw a histogram
hist(data$length,prob=TRUE,xlab=’length’,main=’Density histogram of length’)
To superimpose a normal density
curve(dnorm(x,mean(data$length),sd(data$length)),add=TRUE)


e) Fitting the model


fit=lm(y~x)
fit
Alternatively, we may use the original names
fit=lm(width~length,data=data)
Regression without intercept
fit=lm(y~0+x,data=data) or
fit=lm(y~x-1,data=data)
f) Confidence intervals
The construction of confidence intervals can easily be done using the R commands
once the model has been fitted.
Suppose that the data is labeled ACT
summary (fit) #provides the coefficients and their standard errors
For confidence intervals for the coefficients
confint(fit,level=0.95)
g) For prediction of the mean for a new value say X0
new.dat=data.frame(ACT=X0 )
predict(fit, newdata=new.dat, interval=”confidence”) # this provides a 95% confi-
dence interval
For a prediction interval
predict(fit, newdata=new.dat, interval="prediction") # this provides a 95% prediction interval
h) Confidence bands using ggplot2
ggplot(data,aes(x=ACT,y=GPA))+
geom_point()+
geom_smooth(method=lm,se=TRUE) #yields 95% confidence interval
temp_var=predict(fit,interval=’prediction’)
new_df=cbind(data,temp_var)
ggplot(new_df,aes(ACT,GPA))+
geom_point()+
geom_line(aes(y=lwr),color=’red’,linetype=’dashed’)+
geom_line(aes(y=upr),color=’red’,linetype=’dashed’)+
geom_smooth(method=lm,se=TRUE) #95% confidence and prediction intervals

1.6 R Session
1.6.1 Rocket Propellant data
To read the data


Rocket=read.table(file.choose(),header=TRUE,sep=’\t’)
Rocket #prints out the data
Shear.strength Age.of.Propellant
y=Rocket$Shear.strength
x=Rocket$Age.of.Propellant
plot(x,y)
hist(y,prob=TRUE,main=’Density histogram of Shear Strength’)
To convert data from table.b1 in R to csv format
write.csv(table.b1,’table.csv’)# this will write the new table in the working directory
of your computer
Regression model for Rocket data
fit=lm(y~x)
fit
Call: lm(formula = y ~ x)
Coefficients: (Intercept) x 2627.82 -37.15
summary(fit)
Call: lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-215.98 -50.68 28.74 66.61 106.76
Coefficients Estimate Std. Error t value Pr(>|t|)
(Intercept) 2627.822 44.184 59.48 < 2e-16 ***
x -37.154 2.889 -12.86 1.64e-10 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 96.11 on 18 degrees of freedom
Multiple R-squared: 0.9018,
Adjusted R-squared: 0.8964
F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
cor(x,y)
[1] -0.9496533
plot(Rocket$Age.of.Propellant, Rocket$Shear.strength, xlab='Age', ylab='Shear Strength', main='Rocket Propellant')
abline(fit,col='lightblue')
Regression without intercept
fit=lm(y~x-1,data=Rocket)
summary(fit)
Call: lm(formula = y ~ x - 1, data = Rocket)
Residuals:
Min 1Q Median 3Q Max
-1044.7 -497.6 742.3 1529.4 2428.2


Coefficients Estimate Std. Error t value Pr(>|t|)


x 112.98 19.22 5.878 1.16e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1315 on 19 degrees of freedom
Multiple R-squared: 0.6452,
Adjusted R-squared: 0.6265
F-statistic: 34.55 on 1 and 19 DF, p-value: 1.165e-05
plots using standard R routines
plots using ggplot
ggplot(Rocket,aes(x,y))+
+ geom_point()+
+ geom_smooth(method=lm,se=TRUE)

1.6.2 R Session for Plumbing Supplies data


Use this data to fit a regression without intercept

1.6.3 R Session for the GPA data


data=read.table(file.choose(),header=TRUE,sep=’\t’)
names(data)
[1] "GPA" "ACT"
fit=lm(GPA~ACT,data=data)
fit
Call: lm(formula = GPA ~ ACT, data = data)
Coefficients: (Intercept) ACT 2.14596 0.03735
ggplot(data,aes(x=ACT,y=GPA))+
geom_point()+
geom_smooth(method=lm,se=TRUE) #yields 95% confidence interval
temp_var=predict(fit,interval=’prediction’)
new_df=cbind(data,temp_var)
ggplot(new_df,aes(ACT,GPA))+
geom_point()+
geom_line(aes(y=lwr),color=’red’,linetype=’dashed’)+
geom_line(aes(y=upr),color=’red’,linetype=’dashed’)+
geom_smooth(method=lm,se=TRUE) #95% confidence and prediction intervals


1.7 Other DATA SETS


Rocket Propellant Data
Delivery Time Data
Patient Satisfaction Data
ACT Scores
Airfreight Data
Copier Maintenance Data
Crime Data
Toluca Refrigeration Data
Grocery Retailer Data
Plumbing Supplies Data

1.7.1 Homework
Problems 2.1, 2.10, 2.22

2 Matrix Approach to Regression
We will preamble the matrix presentation by describing some distributional results.

2.1 Distributional Results


Let Y = [Y1 , ..., Yn ]′ be the transpose of the column data vector.
Define the expectation
E[Y ] = [EY1 , ..., EYn ]′

Proposition. If Z = AY + B for some matrices of constants A, B, then

E[Z] = A E[Y] + B

Proof: E[Zi] = E[ Σj aij Yj + bi ] = Σj aij E[Yj] + bi

Definition 2.1. The covariance matrix is COV[Y] = E{ [Y − EY][Y − EY]′ } ≡ Σ

Proposition COV [AY ] = AΣA′

Definition 2.2. A random vector Y has a multivariate normal distribution if its density is given by

f(y1, ..., yn) = |Σ|^{−1/2} / (2π)^{n/2} · exp{ −(1/2) (y − µ)′ Σ⁻¹ (y − µ) }

where

y′ = (y1, ..., yn),   µ′ = (µ1, ..., µn),   Σ = COV[Y]

denoted Y ∼ Nn(µ, Σ).
A fundamental result is

Theorem 2.1. Let Y ∼ Nn(µ, Σ), let A be an arbitrary p × n matrix of constants, and let B be a constant p × 1 vector. Then

Z = AY + B ∼ Np(Aµ + B, AΣA′)

This theorem implies that any linear combination of normal variates has a normal dis-
tribution. We do not prove this theorem here.


Example 2.1. Let Y ∼ Nn(µ, Σ) and let A = (1, ..., 1). Then

AY ∼ N1(Aµ, AΣA′)

where

Aµ = Σᵢ µi,   AΣA′ = Σj σj² + 2 Σ_{i<j} σij

The matrix representation of regression makes it easy to generalize to fitting several


independent variables.
Let Y = [Y1 , ..., Yn ]′ be the transpose of the column data vector.
Let β = [β0 , β1 , ...βp−1 ]′ be the transpose of the coefficients
Let ϵ = [ϵ1 , ϵ2 , ..., ϵn ]′ be the transpose of the random error terms
 
Let X be the n × p matrix which incorporates the explanatory variables, with ith row (1, Xi1, ..., Xi,p−1)
If ϵ ∼ Nn (0, σ 2 In ) , then the regression model may be expressed as

Y = Xβ + ϵ ∼ Nn (Xβ, σ 2 In )

where In is the n × n identity matrix and Nn is the multivariate normal distribution.

Derivatives. If z = a′y, then ∂z/∂y = a.
If z = y′y, then ∂z/∂y = 2y.
If z = a′Ay, then ∂z/∂y = A′a.
If z = y′Ay, then ∂z/∂y = A′y + Ay.
If z = y′Ay and A is symmetric, then ∂z/∂y = 2A′y = 2Ay.


The sum of squares is given by

Q = (Y − Xβ)′ (Y − Xβ)

Differentiating with respect to the vector β

∂Q/∂β = −2 X′(Y − Xβ) = −2 (X′Y − X′Xβ) = 0    (2.1)

Hence the solutions to the normal equations are

b = (X′X)⁻¹ X′Y = AY

where A = (X′X)⁻¹ X′, provided the inverse of (X′X) exists. It follows that

b ∼ Np( AXβ, σ² AA′ )

But

AXβ = (X′X)⁻¹ X′Xβ = β

and

AA′ = (X′X)⁻¹ X′X (X′X)⁻¹ = (X′X)⁻¹

Hence,

b ∼ Np( β, σ² (X′X)⁻¹ )

The fitted line is then

Ŷ = Xb = X (X′X)⁻¹ X′Y = HY

where the “hat” matrix H (because it puts a hat on Y) is given by

H = X (X′X)⁻¹ X′    (2.2)

2.2 Properties of the hat matrix H

The hat matrix has some nice properties.

a) It is a projection matrix, idempotent and symmetric

HH = H

H′ = H

b) The matrix H is orthogonal to the matrix I − H

(I − H) H = H − HH = 0

Moreover, (I − H) is idempotent and is a projection matrix as well.

c) The residual vector is expressible as

e = Y − Ŷ

= Y − HY

= (I − H) Y

d) Properties b) and c) imply that the observation vector Y is projected onto a


space spanned by the columns of H and the residuals are in a space orthogonal to it

Y = HY + (I − H) Y

By the Pythagorean theorem

∥Y ∥2 = ∥HY ∥2 + ∥(I − H) Y ∥2 (2.3)


We note that

σ 2 [e] = V ariance [(I − H) Y ]

= (I − H) σ 2 [Y ] (I − H)′

= σ 2 (I − H)

which is estimated by
s2 [e] = (M SE) (I − H)
Moreover,

σ²[b] = (X′X)⁻¹ X′ (σ² I) X (X′X)⁻¹ = σ² (X′X)⁻¹
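The matrix formulas and the properties of H are easy to verify numerically; a minimal sketch on simulated data with two predictors (all names and values illustrative).

set.seed(2)
n <- 20
X <- cbind(1, runif(n), runif(n))            # design matrix including the intercept column
Y <- drop(X %*% c(2, 1, -1)) + rnorm(n)

b <- solve(t(X) %*% X) %*% t(X) %*% Y        # b = (X'X)^(-1) X'Y
H <- X %*% solve(t(X) %*% X) %*% t(X)        # hat matrix

all.equal(H %*% H, H)                        # idempotent
all.equal(t(H), H)                           # symmetric
sum(diag(H))                                 # trace(H) = number of columns of X = 3
e <- (diag(n) - H) %*% Y                     # residual vector (I - H)Y
crossprod(X, e)                              # X'e is (numerically) zero
coef(lm(Y ~ X[, 2] + X[, 3]))                # matches b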

Exercise 2.1. a) For the case p = 2, obtain the hat matrix. Show that rank H= Trace
H =2
b) Show the relationship
Σ (Yi − Ȳ)² = b1² Σ (Xi − X̄)² + Σ (Yi − Ŷi)²

Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

Definition 2.3. Let Y1 , ..., Yn be a random sample from N (µ, σ 2 ). A quadratic form in
the Y ′ s is defined to be the real quantity

Q = Y ′ AY

where A is a symmetric positive definite matrix.


Our next results permit us to compute the expectation of quadratic forms.
Let A be a symmetric matrix and let Y be a random vector. Then the singular
value decomposition of A implies that there exists an orthogonal matrix P such that if
Λ = (λi ) is the diagonal matrix of eigenvalues of A,

A = P ′ ΛP.

Proposition. E[Y′AY] = Trace[AΣ] + (EY)′A(EY)

Proof: Y′AY = Y′P′ΛPY = (PY)′Λ(PY) = Σ λi (PY)i²,

where (PY)i indicates the ith element of PY.

(PY)i is a random variable and its second moment is

E[(PY)i²] = Var[(PY)i] + [E(PY)i]² = (PΣP′)ii + [(PEY)i]²

Hence

E[ Σ λi (PY)i² ] = Σ λi (PΣP′)ii + Σ λi [(PEY)i]²
= Trace(ΛPΣP′) + µ′Aµ
= Trace(P′ΛPΣ) + µ′Aµ = Trace(AΣ) + µ′Aµ

Lemma 2.1. The sample variance Sn2 is an unbiased estimate of the population variance.

Proof. Suppose Y1, ..., Yn are i.i.d. N(µ, σ²). Let Y = [Y1, ..., Yn]′ and let

A = I − 11′/n

the n × n matrix with diagonal entries 1 − 1/n and off-diagonal entries −1/n. Then (n − 1) Sn² = Σ (Yi − Ȳ)² = Y′AY and

E[Y′AY] = Trace[AΣ] + (EY)′A(EY) = σ² Trace A + µ² 1′A1 = σ² (n − 1) + 0

In the regression model,


Y − Ŷ = (I − H) Y


Since I − H is idempotent,
(Y − Ŷ)′ (Y − Ŷ) = Y′(I − H)Y

and

E[Y′(I − H)Y] = Trace[(I − H)Σ] + µ′(I − H)µ = σ² Trace(I − H) + (Xβ)′(I − H)(Xβ) = σ² (n − p) + 0

Definition 2.4. (a) A random variable U is said to have a χ²_ν distribution with ν degrees of freedom if its density is given by

f(u; ν) = u^{(ν/2)−1} e^{−u/2} / ( 2^{ν/2} Γ(ν/2) ),   u > 0, ν > 0

The mean and variance of U are respectively ν and 2ν.

(b) A random variable U is said to have a non-central χ²_ν(λ) distribution with ν degrees of freedom and non-centrality parameter λ if its density is given by

f(u; ν, λ) = Σ_{i=0}^∞ [ e^{−λ/2} (λ/2)^i / i! ] f(u; ν + 2i),   u > 0, ν > 0

The non-central chi square distribution is a Poisson weighted mixture of central chi
square distributions.The mean and variance are respectively (ν + λ) and (2ν + 4λ).
(c) We include here the fact that the distribution of the ratio of two independent
central chi square distributions divided by their respective degrees of freedom
 
F_{ν1,ν2} = ( χ²_{ν1} / ν1 ) / ( χ²_{ν2} / ν2 )

is an F distribution with ν1 and ν2 degrees of freedom. If the numerator is a non central


chi square distribution, then the F becomes a non central F distribution.

Theorem 2.2. Cochran’s Theorem Let Y be a random vector with distribution Nn (µ, σ 2 I).
Suppose that we have the decomposition

Y′Y = Q1 + ... + Qk

where Qi = Y′AiY and rank(Ai) = ni. Then the {Qi/σ²} are independent and have χ²_{ni}(λi) distributions if and only if

Σ ni = n

The ranks ni are referred to as degrees of freedom. Here, λi = µ′ Ai µ.

Applications. Suppose Y1, ..., Yn are i.i.d. N(µ, σ²). Let Y = [Y1, ..., Yn]′ and, as above,

A = I − 11′/n

Then

Σᵢ Yi² = Y′Y = Y′AY + Y′(11′/n)Y

and

n = rank A + rank(11′/n) = (n − 1) + 1

From Cochran’s theorem, Q1 = Y′AY/σ² ∼ χ²_{n−1} and Q2 = Y′(11′/n)Y/σ² ∼ χ²_1 are independent. But

Q1 = Σ (Yi − Ȳ)² / σ²,   Q2 = n Ȳ² / σ²

and hence the ratio

F_{1,n−1} = (Q2/1) / (Q1/(n − 1)) = n Ȳ² / Sn²

has an F distribution with degrees of freedom 1 and (n − 1). Equivalently,

T_{n−1} = √n Ȳ / Sn

has a Student distribution with (n − 1) degrees of freedom.


Application In linear regression,

Y = Xb + (Y − Xb)

∥Y ∥2 = ∥Xb∥2 + ∥Y − Xb∥2

= Y ′ HY + Y ′ (I − H) Y

By Cochran’s theorem,

Y′HY/σ² ∼ χ²_p ,   Y′(I − H)Y/σ² ∼ χ²_{n−p}

and are independent. The first term is the sum of squares due to the regression
whereas the second represents the error sum of squares. We summarize this in the
next section in the analysis of variance table.

3 Multiple Linear Regression
In practice, one is often presented with several predictor variables. For two predictors,
the linear regression model becomes

Yi = β0 + β1 Xi1 + β2 Xi2 + ϵi

with the assumptions that {ϵi } are i.i.d. N (0, σ 2 ). This model describes a plane in
three dimensions. It is an additive model where β1 represents the change in the mean response per unit increase in X1 when X2 is held fixed. An analogous interpretation can be made for β2.
In general, we may have the linear regression model involving (p − 1) explanatory
variables
p−1
X
Yi = β0 + βk Xik + ϵi
k=1

The predictor variables may be qualitative, taking values 0 or 1, as for example if one wishes to take into account gender. So here

X = 0 if the subject is male, and X = 1 if the subject is female.

We may also have a second degree polynomial

Yi = β0 + β1 Xi + β2 Xi2 + ϵi

a transformed response

ln Yi = β0 + β1 Xi1 + β2 Xi2 + ϵi

interaction effects

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1 Xi2 + ϵi

In all cases, it is instructive to make use of the matrix approach to unify the devel-
opment.
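All of these special cases are fitted with the same lm() machinery; a sketch of the corresponding model formulas on simulated data (the data frame and variable names are illustrative only).

set.seed(9)
dat <- data.frame(X1 = runif(50), X2 = runif(50),
                  gender = sample(c("male", "female"), 50, replace = TRUE))
dat$Y <- exp(1 + dat$X1 - 0.5 * dat$X2 + rnorm(50, sd = 0.2))

lm(Y ~ X1 + I(X1^2), data = dat)          # second degree polynomial in X1
lm(log(Y) ~ X1 + X2, data = dat)          # transformed response
lm(Y ~ X1 + X2 + X1:X2, data = dat)       # interaction effects (equivalently X1*X2)
lm(Y ~ X1 + factor(gender), data = dat)   # 0/1 indicator for a qualitative predictor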


We recall from Chapter 2


Let Y = [Y1 , ..., Yn ]′ be the transpose of the column data vector.
Let β = [β0 , β1 , ...βp−1 ]′ be the transpose of the coefficients
Let ϵ = [ϵ1 , ϵ2 , ..., ϵn ]′ be the transpose of the random error terms
 
Let X be the n × p matrix which incorporates the explanatory variables, with ith row (1, Xi1, ..., Xi,p−1)
Then the regression model may be expressed as

Y = Xβ + ϵ, ϵ ∼ Nn (0, σ 2 In )

where In is the n × n identity matrix and Nn is the multivariate normal distribution.


Letting
b′ = [b0 , b1 , ..., bp−1 ]
be the least squares estimate of β we have
b = (X′X)⁻¹ X′Y

The fitted values are

Ŷ = Xb

= HY

where the hat matrix H = X (X′X)⁻¹ X′. The variance-covariance matrix of the resid-
uals e = (I − H) Y is
σ 2 [e] = σ 2 (I − H)
which is estimated by
s2 [e] = (M SE) (I − H)

Also
s²[b] = (MSE) (X′X)⁻¹

We may summarize the results in an ANOVA table


Source SS df MS
Regression SSR p-1 MSR=SSR/(p-1)
Error SSE n-p MSE=SSE/(n-p)
Total SSTO n-1


where J is the n × n matrix of ones and

SSTO = Y′Y − (1/n) Y′JY = Y′[ I − (1/n)J ]Y

SSE = e′e = Y′(I − H)Y

SSTR = b′X′Y − (1/n) Y′JY = Y′[ H − (1/n)J ]Y
To test the hypothesis

H0 : β1 = β2 = ... = βp−1 = 0

H1 : not all βk = 0

we use the test statistic


F = MSR / MSE ∼ F(p − 1, n − p)
and reject H0 for large values.
Tests

H0 : βk = 0

H1 : βk ̸= 0

for individual coefficients may be conducted using the fact that the standardized coeffi-
cient has a Student t distribution

bk / s[bk] ∼ t_{n−p}

3.1 Extra sum of squares principle


A more general approach to regression, labeled the extra sum of squares principle, which
will be useful for more complex models consists of the following steps illustrated here
for p = 2.


Step 1. Specify the Full (F) model Y = β0 + β1 X + ϵ and obtain the error sum of squares

SSE(F) = Σ (Yi − Ŷi)²

Step 2. Consider the Reduced (R) model whereby β1 = 0,

Y = β0 + ϵ

and obtain the corresponding error sum of squares

SSE(R) = Σ (Yi − Ȳ)²

The logic now is to compare the two error sums of squares. With more parameters in the model, we expect that

SSE(F) ≤ SSE(R)

If we have equality above, we may conclude the model is not of much help. As a result, we may test the benefit of the model by computing the test statistic

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]    (3.1)

and rejecting the null hypothesis H0 : β1 = 0 for large values of F*, which has an F distribution F(dfR − dfF, dfF).

Application An immediate application of this approach is to the situation where there


are repeat observations at the same values of X. Suppose that the full model is
given by
Yij = µj + ϵij , i = 1, ..., nj ; j = 1, ..., c
and {ϵij } are i.i.d. N (0, σ 2 ).

The {µj} are unrestricted parameters, µj being the mean response when X = Xj. Their least squares estimates are

Ȳj = Σᵢ₌₁^{nj} Yij / nj

The error sum of squares for this full unrestricted model is

SSE(F) = Σij (Yij − Ȳj)²


The corresponding degrees of freedom are

dfF = Σⱼ₌₁ᶜ (nj − 1) ≡ n − c

Note that if all nj = 1, then dfF = 0 , SSE(F ) = 0 and the analysis does not proceed
any further.

Consider now the reduced model which specifies the linear model

Yij = β0 + β1 Xj + ϵij

which has error sum of squares equal to


SSE(R) = Σij (Yij − Ŷij)²

where

Ŷij = b0 + b1 Xj    (3.2)
The degrees of freedom are dfR = (n − 2). Hence, we may test

H0 : E[Y ] = β0 + β1 X

H1 : E[Y ] ̸= β0 + β1 X

by computing the ratio

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]

So the test here is on whether a linear model is justified at all. This is different from
just testing that the slope is zero.

We may gain some insight into the components of the F* ratio. Note that

Yij − Ŷij = (Yij − Ȳj) + (Ȳj − Ŷij)

and

Σij (Yij − Ŷij)² = Σij (Yij − Ȳj)² + Σij (Ȳj − Ŷij)²


The corresponding degrees of freedom are dfR = (n − 2), dfP E = (n − c) , dfLF = (c − 2)


We label these sums of squares as follows:

Definition 3.1. SSE(R) = Σij (Yij − Ŷij)², the error sum of squares for the reduced model;
SSPE = Σij (Yij − Ȳj)², the pure error sum of squares;
SSLF = Σij (Ȳj − Ŷij)², the error sum of squares due to lack of fit, which in view of (3.2) does not depend on i.

An ANOVA table summarizes the analysis.

Source          SS                          df      MS                     F                E(MS)
Regression      SSR = Σij (Ŷij − Ȳ)²        1       MSR = SSR/1            F = MSR/MSE
Residual error  SSE(R) = Σij (Yij − Ŷij)²   n − 2   MSE = SSE(R)/(n − 2)
Lack of fit     SSLF = Σij (Ȳj − Ŷij)²      c − 2   MSLF = SSLF/(c − 2)    F* = MSLF/MSPE   σ² + Σ nj [µj − (β0 + β1 Xj)]² / (c − 2)
Pure error      SSPE = Σij (Yij − Ȳj)²      n − c   MSPE = SSPE/(n − c)                     σ²
Total           Σij (Yij − Ȳ)²              n − 1

We note
SSE (R) = SSLF + SSP E
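The pure error and lack-of-fit components can be computed directly when the X values are repeated; a minimal sketch on made-up data, which reproduces what anova(Reduced, Full) reports.

x <- rep(c(1, 2, 3, 4), each = 3)                 # repeat observations at each X (c = 4 groups)
set.seed(3)
y <- 2 + 3 * x + rnorm(length(x))

fit  <- lm(y ~ x)                                 # reduced (linear) model
ybar <- ave(y, x)                                 # cell means Ybar_j
SSPE <- sum((y - ybar)^2)                         # pure error
SSLF <- sum((ybar - fitted(fit))^2)               # lack of fit
c(SSE = sum(resid(fit)^2), SSLF = SSLF, SSPE = SSPE)   # SSE(R) = SSLF + SSPE

Fstar <- (SSLF / (4 - 2)) / (SSPE / (length(y) - 4))
c(Fstar = Fstar, p.value = pf(Fstar, 2, length(y) - 4, lower.tail = FALSE))
anova(fit, lm(y ~ 0 + as.factor(x)))              # same lack-of-fit test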
The approach can be extended to multiple regression. We define

SSR (X2 |X1 ) = SSE (X1 ) − SSE (X1 , X2 ) (3.3)

to be the reduction in the error sum of squares when, after X1 is included, an additional variable X2 is added to the model. Since

SST O = SSR + SSE

we may re-express (3.3) as

SSR (X2 |X1 ) = SSR (X1 , X2 ) − SSR (X1 )

Similarly, when three variables are involved, we may breakdown the sum of squares
due to the regression as

SSR (X1 , X2 , X3 ) = SSR (X1 ) + SSR (X2 |X1 ) + SSR (X3 |X1 , X2 )

This decomposition enables us to judge the effect an added variable has on the sum
of squares due to the regression. An ANOVA table would be decomposed as follows


Source SS df MS
Regression SSR (X1 , X2 , X3 ) 3 M SR (X1 , X2 , X3 )
X1 SSR (X1 ) 1 M SR (X1 )
X2 |X1 SSR (X2 |X1 ) 1 M SR (X2 |X1 )
X3 |X1 , X2 SSR (X3 |X1 , X2 ) 1 M SR (X3 |X1 , X2 )
Error SSE (X1 , X2 , X3 ) n-4 M SE (X1 , X2 , X3 )
Total SST O n-1

The extra sum of squares principle described above considers a full model and a reduced model. It then makes use of the statistic below to determine the usefulness of the reduced model:

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ] ∼ F(dfR − dfF, dfF)

In general, suppose that


S1 = SS (b0 (1) , ..., bp (1))
represents the sum of squares residual when p variables are included and

S2 = SS (b0 (2) , ..., bq (2))

represents the residual sum of squares when q variables are included, with p > q. Then the difference S2 − S1 is defined to be the extra sum of squares. It will be used to test the hypothesis that

H0 : βq+1 = ... = βp = 0

It can be shown that under this hypothesis, (S2 − S1)/(p − q) is an unbiased estimate of σ², independent of MSE, and hence their ratio will have an F distribution. Define
independent of MSE and hence their ratio will have an F distribution. Define

P1 = H1

the projection matrix of Y on the p + 1 dimensional space and let

P2 = H2

be the projection matrix of Y under H0 on the q + 1 dimensional space. The difference


is used to construct the extra sum of squares. In fact

∥P1Y − P2Y∥² = Y′(P1 − P2)Y

Pictorially we have


[Figure: projection diagram showing Y, P1Y, P2Y, the residual Y − P1Y, and the difference (P1 − P2)Y.]

By construction, P2′(P1 − P2) = 0 since P2(P1Y) = P2Y. This can be seen in the simple case when p = 2: the vectors 1 and X − X̄1 are orthogonal and span the space onto which P1 projects, while P2 projects onto the space spanned by 1.
We may compute

E(S2 − S1) = E{ Y′(P1 − P2)Y } = σ² Trace(P1 − P2) + µ′(P1 − P2)µ = (p − q) σ² + 0

By repeated application of this principle, we can successively obtain for any regression


model
SS (b0 ) , SS (b1 |b0 ) , SS (b2 |b1 , b0 ) , ..., SS (bp |bp−1 , ..., b0 )
All these sums of squares are distributed as chi square with one degree of freedom
independent of MSE. The tests are conducted using t tests.

3.2 Simultaneous confidence intervals

There are occasions when we require simultaneous or joint confidence intervals for the
entire set of parameters. As an example, suppose we wish to obtain confidence intervals
for both the intercept and the slope of a simple linear regression. Computed separately,
we may obtain 95% confidence interval for each. If the statements are independent,
then the probability that both statements are correct is given by (0.95)2 = 0.9025.
Moreover, the intervals make use of the same data and consequently, the events are not
independent.
One approach that is frequently used begins with the Bonferroni inequality. For
two events A1 , A2

P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) − P (A1 ∩ A2 )

≤ P (A1 ) + P (A2 )

Consequently, using DeMorgan’s identity

P (A′1 ∩ A′2 ) = 1 − P (A1 ∪ A2 )

≥ 1 − P (A1 ) − P (A2 )

Suppose now that the events are such that

P (A1 ) = P (A2 ) = α

and hence

P (A′1 ∩ A′2 ) ≥ 1 − P (A1 ) − P (A2 )

≥ 1 − 2α


Now the event (A′1 ∩ A′2 ) is the event that the intervals

A′1 : b0 ± t (1 − α/2; n − 2) s[b0 ]

A′2 : b1 ± t (1 − α/2; n − 2) s[b1 ]

simultaneously cover β0 , β1 . If α = 0.05, 1 − 2α = 0.90.


On the other hand, if we wish to have a confidence of 0.95 for the two intervals,
then we should choose

1 − 2α = 0.95

α = 0.025

which implies we need to compute

t (0.9875; n − 2)

In general, if p parameters are involved, then

P(∩i A′i) ≥ 1 − pα* = 1 − α

so that α* = α/p and each confidence interval is computed at confidence level 1 − α/p.


Calculations using R
a) Model fitting
Suppose we wish to fit a regression of a response against 4 variables X1, X2, X3, X4.
model=lm(Y ~ X1 + X2 + X3 + X4, data = CH)
A more efficient command is
model=lm(Y ~ ., data = CH)
If it is desired to exclude a specific variable from a long list of other variables to be included
model=lm(Y ~ . - X1, data = CH)
b) ANOVA table
After fitting a model, say Retailer, you may obtain an ANOVA table using the command
anova(Retailer)
c) Extra sum of squares
The following R commands carry out the analysis for the extra sum of squares


Reduced=lm(y~x) # fits the reduced model


Full=lm(y~0+as.factor(x)) #fits the full model
anova(Reduced, Full) # gets the lack of fit test
d) Simultaneous confidence intervals
to obtain the t distribution cutoff
confint(fit, level=1-0.05/2)
Alternatively we can obtain the quantile,
qt(0.9875,n-2)
The Bonferroni intervals are often used because they provide shorter length confi-
dence intervals than some other methods such as Scheffe.

3.3 R Session
We will use the Delivery Time data
a) Graphic
Delivery=read.table(file.choose(),header=TRUE,sep=’\t’)
names(Delivery)
[1] "Delivery.Time" "Number.of.Cases" "Distance"
plot(Delivery) #two-dimensional scatter plot
install.packages("plot3D") #install three-dimensional plot routine
library("plot3D")
x=Delivery$Number.of.Cases #define the variables
y=Delivery$Distance
z=Delivery$Delivery.Time
scatter3D(x,y,z,theta=15,phi=20,xlim=c(1,30),ylim=c(30,150)) #plot in 3D; many
options are available
b) Model fitting
X1=Delivery$Number.of.Cases
X2=Delivery$Distance
Y=Delivery$Delivery.Time
model=lm(Y~X1+X2,data=Delivery)
summary(model)
Call: lm(formula = Y ~ X1 + X2, data = Delivery)
Residuals:
Min 1Q Median 3Q Max
-5.7880 -0.6629 0.4364 1.1566 7.4197


Coefficients Estimate Std. Error t value Pr(>|t|)


(Intercept) 2.341231 1.096730 2.135 0.044170 *
X1 1.615907 0.170735 9.464 3.25e-09 ***
X2 0.014385 0.003613 3.981 0.000631 ***
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.259 on 22 degrees of freedom
Multiple R-squared: 0.9596,
Adjusted R-squared: 0.9559
F-statistic: 261.2 on 2 and 22 DF,
p-value: 4.687e-16
c) ANOVA
In R, the default anova() function provides sequential (Type I) sums of squares; that is, the sum of squares for each variable is conditional on the variables entered before it. To obtain Type II sums of squares (output similar to Minitab), use the car package:
library(car)
Anova(mylm, type="II")
anova(model)
Analysis of Variance Table
Response: Y
DF Sum Sq MeanSq F value Pr(>F)
X1 1 5382.4 5382.4 506.619 < 2.2e-16 ***
X2 1 168.4 168.4 15.851 0.0006312 ***
Residuals 22 233.7 10.6
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
d) Extra Sum of Squares
Full =lm(Y~X1+X2,data=Delivery)
Reduced=lm(Y~X1)
anova(Reduced, Full)
Analysis of Variance Table
Model 1: Y ~ X1
Model 2: Y ~ X1 + X2
Res Df RSS DF Sum of Sq F Pr(>F)
1 23 402.13
2 22 233.73 1 168.4 15.851 0.0006312 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


3.4 DATA SETS


Patient Satisfaction Data
Grocer Retailer Data
Delivery Time Data

3.5 Suggested Problems


Problem 3.7

4 Model adequacy checking
After fitting a regression model, it is important to verify whether or not the assumptions
that led to the analysis are satisfied.
The basic assumptions that were made were

1. {ϵi } are normally distributed

2. E[ϵi ] = 0 and σ 2 [ϵi ] = σ 2

3. {ϵi } are independent

As well, we need to check for influential observations which may unduly influence the
fitted model. Very large or very small values of the response may sometimes heavily
alter the value of the estimated coefficients.

Definition 4.1. The basic tool that is used consists of analyzing the residuals

ei = Yi − Ŷi (4.1)

4.1 Checking for normality


a) Box plots of residuals under normality should indicate a symmetric box around the
median of 0
b) A histogram of the residuals provides a graphical check on normality
c) A qq- plot (i.e. quantile-quantile plot) consists of comparing the quantiles of the
residual data with the quantiles from a normal distribution. This is a plot of the ranked
residuals against their expected values under normality. Set

Ek = √MSE · Φ⁻¹( (k − 0.375) / (n + 0.25) ),   k = 1, ..., n

Then plot e(k) vs Ek, where e(k) is the residual with rank k. Under normality, one expects a straight line plot.
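A sketch of the normal probability plot built from this formula on simulated data; qqnorm() gives essentially the same picture.

set.seed(10)
x <- runif(30); y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

e   <- resid(fit)
n   <- length(e)
MSE <- sum(e^2) / df.residual(fit)
Ek  <- sqrt(MSE) * qnorm((1:n - 0.375) / (n + 0.25))  # expected values under normality
plot(Ek, sort(e), xlab = "Expected under normality", ylab = "Ordered residuals")
abline(0, 1)   # points close to this line support the normality assumption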


4.2 Checking for constancy of variance


We note that the variances and covariances of the residuals are respectively

σ²[ei] = σ² [1 − hii]

Cov[ei, ej] = −σ² hij,   i ≠ j

where hij is the ijth element of the hat matrix. This demonstrates that the variances of the residuals are not equal. For this reason, we may define the Studentized or standardized residuals, which have equal variance:

e*i = ei / s[ei] = ei / ( s √(1 − hii) )

where s2 = M SE.
Definition 4.2. The semi-studentized residuals are defined as ei / √(1 − hii).

A plot of the standardized residuals vs fitted values is a useful check for non-constancy of variance. The plot should show a random scatter of points; non-constant variance would appear as a telescoping, increasing or decreasing, spread of points.
A scale-location plot can be used to examine the homogeneity of the variance of the residuals. This is a plot of √|e*i| vs Ŷi:
plot(fit,3) #scale-location plot: √|standardized residuals| vs fitted values Ŷi
Definition 4.3. The PRESS (deleted) residuals are defined as ei / (1 − hii) (see p. 139 of Montgomery et al.).
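A sketch showing how these residuals are obtained in R; any fitted lm object will do (the built-in cars data set is used only for illustration), and the built-in functions agree with the formulas above.

fit <- lm(dist ~ speed, data = cars)          # any fitted lm object

h <- hatvalues(fit)                           # leverages h_ii
e <- resid(fit)
s <- sqrt(sum(e^2) / df.residual(fit))        # sqrt(MSE)
r <- e / (s * sqrt(1 - h))                    # standardized (studentized) residuals
all.equal(unname(r), unname(rstandard(fit)))  # agrees with R's rstandard()

press <- e / (1 - h)                          # PRESS (deleted) residuals
sum(press^2)                                  # PRESS statistic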

4.3 Residual plots against fitted values


If the residuals lie in a narrow band around 0 then there are no obvious needs for
corrections.
If the residuals show a telescoping pattern, either increasing or decreasing, this is
a sign that the variance is non constant.
A double-bow pattern is a sign that the variance in the middle is larger than the
variance at the extremes as is the case for binomial data.
If the residuals exhibit a quadratic relationship, we may have a nonlinear relation-
ship that has not been accounted for.


See p.144 for the graphics.


If some residuals are very large, they may arise from an outlier. Of course, they
may be due to a non constant variance or a missing term.

4.4 Residual plots against the regressor


As in section 4.3 above, plots of the residuals against the independent variables are
similarly interpreted.
Residuals may also be plotted against independent variables not in the model.
See p.146 in Montgomery et al for the graphics.

4.5 Residuals in time sequence plot


Such plots would reveal a time dependence if they appear as in Section 4.3 above (see p. 148 of Montgomery et al. for the graphics).

4.6 Lack of fit of the regression model


See section 3.1 of these notes where the topic was addressed. See also section 4.5 of
Montgomery et al

4.7 Calculations Using R


Suppose the data (Y, X1 , ..., Xp ) is stored in ”file”
plot(file) # provides a scatter plot matrix
boxplot(y~x) # creates side by side boxplots
cor(file) # computes a correlation matrix
cor.test(x,y) # test plus confidence interval for rho
plot(fit,2) #normal QQ plot of standardized residuals vs theoretical quantiles
Other plots may be obtained using
library(MASS)
sresid=studres(fit) #provides the Studentized residuals
hist(sresid,freq=FALSE, main="Distribution of Studentized residuals")
sfit=seq(min(sresid),max(sresid),length=n)
yfit=dnorm(sfit)
lines(sfit,yfit)# superimposes a normal density on the histogram
Transformations
We may also try different transformations of Y to obtain a better fit


boxcox(fit) # determines the value of λ so that Z = Y^λ is approximately normally distributed



4.8 R Session
We will use the Delivery time data
boxplot(X1,X2)
fit=lm(Y~X1+X2,data=Delivery)
plot(fit,2) #normal QQ plot of standardized residuals vs theoretical quantiles
library(MASS)
sresid=studres(fit)
hist(sresid,freq=FALSE)
sfit=seq(min(sresid),max(sresid),length=25)
yfit=dnorm(sfit)
lines(sfit,yfit) #superimposes a normal density on the histogram
Alternatively
res=fit$residuals
curve(dnorm(x,mean(res),sd(res)),add=TRUE) # superimposes a normal density; x here is curve()'s plotting variable
boxcox(fit) # BoxCox transformation for normality
plot(fit,3) #scale-location plot: √|standardized residuals| vs fitted values Ŷi
plot(fit)
Hit <Return> to see next plot:
Hit <Return> to see next plot: Hit <Return> to see next plot:
Hit <Return> to see next plot:
To see all 4 plots in a single page
par(mfrow=c(2,2))
plot(fit)
z=fit$residuals
qqnorm(z) # qq plot
qqline(z) # normal line superimposed


4.9 DATA SETS


Restaurant Data
Rocket Propellant Data
Electric Utility Data
Windmill Data

5 Regression Diagnostics
We begin by considering transformations to linearize the model and weighting corrections
for violation of the variance assumption.

5.1 Transformations and weighting


5.1.1 Variance stabilizing transformations
It may happen that the variance in the general linear model is not constant. In those
cases, it may be useful to transform the data. Here are some examples of transformations.
(a) Poisson. In the case of the Poisson, the variance is equal to the mean. Bartlett (Anscombe) showed that if Y is distributed as a Poisson variable with mean λ, then √Y is distributed more nearly normally, with variance approximately 1/4 if λ is large. (The demonstration is through a Taylor series expansion of √Y.) In that case, √Y may be used and regressed against X.
(b) The similar transformation for a binomial variable Y ∼ B(p, n), with mean m = np, is

sin⁻¹( √( (Y + c) / (n + 2c) ) )

The optimal value of c is 3/8 if m and n − m are large. The variance is approximately (1/4)(n + 1/2)⁻¹.
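A sketch of how these transformations are applied before fitting; the simulated counts, proportions and parameter values are illustrative only.

set.seed(7)
X <- runif(50, 1, 10)
counts <- rpois(50, lambda = 2 * X)             # Poisson-type response
fit.sqrt <- lm(sqrt(counts) ~ X)                # regress sqrt(Y) on X

m <- 30                                         # binomial number of trials
successes <- rbinom(50, size = m, prob = 0.2 + 0.005 * X)
# arcsine square-root transform with c = 3/8, so n + 2c = m + 3/4
fit.asin <- lm(asin(sqrt((successes + 3/8) / (m + 3/4))) ~ X)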

5.1.2 Transformations to linearize the model


It may happen that a plot of Y against X does not appear linear. In those cases, it may
be useful to transform the dependent variable. Here are some examples of transforma-
tions.
(a) The exponential model
Y = β0 eβ1 X ϵ
may be transformed by taking logs

lnY = lnβ0 + β1 X + lnϵ


The usual assumptions would then have to be made and verified on the transformed
model.
(b) The model
Y = β0 + β1 X −1 + ϵ
can be linearized using the reciprocal transformation X ∗ = X −1 .
(c) The model
1
= β0 + β1 X + ϵ
Y
can be linearized using the reciprocal transformation Y ∗ = Y −1 .
(d) The model
X
Y =
β0 + β1 X
can be linearized using the reciprocal transformation in two steps. First

Y ∗ = Y −1

and then
X ∗ = X −1
to obtain
Y ∗ = β1 + β0 X ∗
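A sketch of case (a), fitting the exponential model on the log scale, together with the reciprocal transformation of case (c); the simulated data are illustrative only.

set.seed(8)
X <- runif(40, 0, 3)
Y <- 2 * exp(0.7 * X) * exp(rnorm(40, sd = 0.2))   # exponential model with multiplicative error

fit.exp <- lm(log(Y) ~ X)        # ln Y = ln beta0 + beta1 X + ln eps
exp(coef(fit.exp)[1])            # estimate of beta0 back on the original scale
coef(fit.exp)[2]                 # estimate of beta1

fit.rec <- lm(I(1/Y) ~ X)        # case (c): reciprocal response 1/Y = beta0 + beta1 X + eps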

5.1.3 Box-Cox transformations


At times, the data may not appear to be normally distributed. Box and Cox suggested a power transformation of the type

Y(λ) = (Y^λ − 1) / (λ Ẏ^{λ−1})   if λ ≠ 0
Y(λ) = Ẏ ln Y                    if λ = 0        (5.1)

where

Ẏ = exp( Σ ln Yi / n )

is the geometric mean of the observations. The value of λ is usually determined by trial and error, whereby a model is fitted to Y(λ) for various values of λ and the value which minimizes the residual sum of squares is selected from a graphical plot.
We note as well that a confidence interval can be constructed for λ. This is useful in that one may select a simple value of λ which is in the interval, such as λ = 0.5 (see Section 5.4.1, p. 189 in Montgomery et al.).


The theory behind the transformation is as follows.


The original model was Y ∼ Nn(Xβ, σ²I). The transformed data Y(λ) therefore has likelihood function given by

[ 1 / ( (2π)^{n/2} σⁿ ) ] exp{ − ( y(λ) − Xθ )′ ( y(λ) − Xθ ) / (2σ²) } J(λ; y)

with new parameters θ and Jacobian for the transformation

J(λ; y) = Πᵢ₌₁ⁿ dyᵢ(λ)/dyᵢ

The maximum likelihood estimator of the variance, for fixed λ, is given by

σ̂² = Y(λ)′ [ I − X(X′X)⁻¹X′ ] Y(λ) / n ≡ S(λ)/n

The maximized log likelihood for fixed λ is

Lmax(λ) = −(n/2) log σ̂² + log J(λ; y) = −(n/2) log σ̂² + (λ − 1) Σ log yi

under the proposed (5.1).
We may then plot Lmax(λ) vs λ to find the value of λ which yields the maximum. The exact value of the maximizing λ can be determined by differentiating Lmax(λ) and then solving numerically for λ.
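A sketch of the profile log-likelihood computed over a grid of λ values, compared with MASS::boxcox(). It uses the scaled transform (5.1), under which the Jacobian is absorbed by the geometric-mean factor, so Lmax reduces to −(n/2) log(S(λ)/n); the simulated data and grid are illustrative.

library(MASS)
set.seed(5)
X <- runif(40, 1, 5)
Y <- exp(0.3 + 0.5 * X + rnorm(40, sd = 0.3))   # positive, right-skewed response
fit <- lm(Y ~ X)

ydot    <- exp(mean(log(Y)))                    # geometric mean Ydot
lambdas <- seq(-2, 2, by = 0.1)
Lmax <- sapply(lambdas, function(lam) {
  ylam <- if (abs(lam) < 1e-8) ydot * log(Y) else (Y^lam - 1) / (lam * ydot^(lam - 1))
  -length(Y) / 2 * log(sum(resid(lm(ylam ~ X))^2) / length(Y))
})
lambdas[which.max(Lmax)]                        # lambda maximizing the profile likelihood

b <- boxcox(fit, lambda = lambdas)              # MASS::boxcox gives the same curve (up to a constant)
b$x[which.max(b$y)]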

5.2 Weighted least squares


Suppose that the error terms are such that σ 2 [ϵi ] = σi2 . Instead of minimizing the sum
of the square errors, we may minimize the sum of the weighted squared errors

Σ wi ϵi²

where the weights satisfy

σ²[ √wi ϵi ] = σ²

Different weights may be chosen, for example wi = √Xi or wi = √Y.
The theory proceeds as follows.
Define the diagonal matrix W of weights

W = diag(w1, w2, ..., wn)

Then the original model goes from

Y = Xβ + ε

to

W^{1/2} Y = W^{1/2} Xβ + W^{1/2} ε,   i.e.   YW ≡ XW β + εW

The least squares approach leads to

bW = (XW′ XW)⁻¹ XW′ YW = (X′WX)⁻¹ X′WY

The MSEW becomes

MSEW = Σ wi (Yi − Ŷi)² / (n − p) = Σ wi ei² / (n − p)

One way to proceed is to perform the usual regression. Then, group the data using the X variable, estimate the variances si² of the Yi within each group, and fit these variances against the group averages of the Xi. We illustrate this approach with the Turkey data.
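A sketch comparing the matrix form bW = (X′WX)⁻¹X′WY with lm(..., weights=); the data and the choice of weights here are illustrative only.

set.seed(4)
x <- runif(30, 1, 10)
y <- -0.5 + 1.1 * x + rnorm(30, sd = 0.3 * x)   # variance grows with x
w <- 1 / x^2                                    # illustrative weights

X  <- cbind(1, x)
W  <- diag(w)
bW <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y   # (X'WX)^(-1) X'Wy
drop(bW)
coef(lm(y ~ x, weights = w))                    # identical estimates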
The Breusch-Pagan test for constancy of variance assumes the model

log σi² = γ0 + γ1 Xi


and we wish to test


H0 : γ1 = 0

We regress the squared residuals ei² against Xi in the usual way and obtain the regression sum of squares SSR* from that auxiliary fit. We reject the null hypothesis that γ1 = 0 whenever

χ²₁ ≡ (SSR*/2) / (SSE/n)²

is larger than the chi-square critical value, where SSE is the error sum of squares from the original regression. This test can be conducted in R using

library(lmtest)

bptest(fit) #Breusch-Pagan test
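A sketch computing the statistic from the definition above, alongside bptest(); on simulated data it should agree with bptest(fit, studentize = FALSE), the non-studentized form of the test.

library(lmtest)
set.seed(6)
X <- runif(40, 1, 10)
Y <- 1 + 2 * X + rnorm(40, sd = 0.5 * X)        # variance increases with X
fit <- lm(Y ~ X)

e2      <- resid(fit)^2
aux     <- lm(e2 ~ X)                           # regress squared residuals on X
SSRstar <- sum((fitted(aux) - mean(e2))^2)      # regression sum of squares SSR*
SSE     <- sum(resid(fit)^2)
n       <- length(Y)
BP      <- (SSRstar / 2) / (SSE / n)^2          # compare with chi-square, 1 df
c(BP = BP, p.value = pchisq(BP, df = 1, lower.tail = FALSE))

bptest(fit, studentize = FALSE)                 # should reproduce the same statistic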

Example 5.1. Commercial properties

a) Obtain boxplots for each variable

b) scatter plot matrix

c) Fit the regression model

d) Obtain the residuals and compute a boxplot

e) Plot residuals against each predictor variable

f) constancy of variance test

library(lmtest)

bptest(fit)

studentized Breusch-Pagan test

data: fit BP = 12.978, df = 4, p-value = 0.01139

[Figure: side-by-side boxplots of the predictor variables for the commercial properties data.]

[Figure: diagnostic plots for the commercial properties fit lm(Y ~ .): histogram of residuals, Box-Cox log-likelihood vs λ with 95% interval, Normal Q-Q plot, residuals vs fitted values, scale-location plot, and Cook's distance by observation number.]

Example 5.2. Weighted least squares data


We consider the weighted least squares turkey data.

[Figure: scatter plot of Y against X for the turkey data.]

It can be seen that there is a telescoping effect.


Next we computed averages and variances for subsets of the data and then fitted the variances against the averages:

X̄_j:  3.0     5.4     7.8     9.1     10.2
s_j²:  0.0072  0.3440  1.7404  0.8683  3.8964

ŝ² = 1.5329 − 0.7334X̄ + 0.0883X̄²

The weights were then computed as the inverses of the estimated variances. The unweighted regression was

Y = −0.579 + 1.14X

and the weighted regression was

Y = −0.892 + 1.16X

Other R commands for comparisons


fit=lm(Y~X, data=Weighted)
wls_model <- lm(Y ~ X, data = Weighted, weights=W)
plot(fit,1)
plot(wls_model,1)
plot(fit,2)
plot(wls_model,2)
plot(fit,3)
plot(wls_model,3)


5.3 Checking on the linear relationship assumption


5.3.1 Descriptive Measures of Linear Association
We may define the coefficient of determination

R² = SSR/SSTO = 1 − SSE/SSTO

This coefficient may be interpreted as the proportion of variation explained by the regression model; the larger the proportion, the better the model. Care must be taken when using this measure, however, because a large value of R² may arise even when the points lie on a quadratic rather than a straight line.

5.4 Calculations Using R


a) Plots
A plot of the residuals vs fitted values can be used to check on the linearity assump-
tion. The residuals should hover around 0 and appear random. A plot that exhibits a
pattern presents a flag that perhaps one or more terms are missing from the fit.
plot(fit,1) # plots the residuals vs Ŷ
R can be used to display 4 diagnostic plots on a 2 x 2 layout as follows
par(mfrow=c(2,2))
plot(fit) # four plots appear on a single page (which = 1, 2, 3 and 5)
plot(fit,1) plots e vs Ŷ and is used to check the linearity assumption. A horizontal line with no pattern is indicative of a good fit.
plot(fit,2) is a normal qq plot used to examine whether the residuals are normally distributed. Normality is accepted if we see a "straight" line.
plot(fit,3) is a scale-location plot: √|standardized residual| vs the fitted values Ŷ. It is used to check homogeneity of variance of the residuals. A horizontal line with equally spread points is a good indication that the variance is constant (homoscedasticity).
plot(fit,5) shows the standardized residuals vs leverage, which is used to flag influential observations.
plot(fit,4) shows Cook's distance, labelling the 3 most extreme values. If you want the top 5 extreme values:
plot(fit,4,id.n=5)
model.diag.metrics %>% top_n(3, wt = .cooksd)
A plot of the residuals against the time ordered sequence may reveal dependencies.
A formal test using the Durbin-Watson statistic can be used to check for autocorrelated
errors.
b) Box Cox transformation


Using R
library(MASS)
b <- boxcox(model) # determines the value of λ so that Y^λ is approximately normally distributed
lambda <- b$x[which.max(b$y)] # exact lambda
lambda
c) Durbin Watson test
library(car)
durbinWatsonTest(fit) # tests for autocorrelated errors
d) Test on constancy of variance
library(lmtest)
bptest(fit) # Breusch-Pagan test
e) Installing olsrr
install.packages("olsrr")
library(olsrr)
ols_plot_dfbetas(fit) #plot of dfbetas
ols_plot_dffits(fit)# plot of dffits

5.5 R Session
We use the Delivery data
a) Breusch-Pagan test
bptest(fit)
studentized Breusch-Pagan test
data: fit BP = 11.988, df = 2, p-value = 0.002493
b) Weighted regression
wls_model <- lm(Y ~ X1+X2, data = Delivery, weights=1/Y)
summary(wls_model )
Call: lm(formula = Y ~ X1 + X2, data = Delivery, weights = 1/Y)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.04397 -0.26730 0.00011 0.23581 1.33217

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.622929 0.903842 4.008 0.000591 ***
X1 1.583045 0.163823 9.663 2.24e-09 ***
X2 0.011142 0.003123 3.567 0.001722 **
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6156 on 22 degrees of freedom
Multiple R-squared: 0.9369, Adjusted R-squared: 0.9312
F-statistic: 163.3 on 2 and 22 DF, p-value: 6.321e-14


c) Cook’s distance
plot(fit,5)
plot(fit,4,id.n=5)
d) Installing olsrr
install.packages("olsrr")
library(olsrr)
ols_plot_dfbetas(fit) #plot of dfbetas
ols_plot_dffits(fit)# plot of dffits

5.6 Data Sets


Commercial Properties
Weighted least squares data
Restaurant Data
Electric Utility Data
Windmill Data
Plumbing Supplies Data

5.7 R session commercial properties


library(olsrr)
Properties=read.table(file.choose(),header=TRUE,sep=’\t’)
plot(Properties)
fit=lm(Y~.,data=Properties)
RentalRates=Properties$Y
Age=Properties$X1
Expenses=Properties$X2
Vacancyrates=Properties$X3
Footage=Properties$X4
boxplot(RentalRates,Age,Vacancyrates,Expenses)
boxplot(Vacancyrates)
boxplot(Footage)

6 Diagnostics for Leverage and
Measures of Influence

A single observation may unduly influence the results of a regression analysis. Hence,
the detection of such influential observations is important. In this connection the hat
matrix plays a very important role. We begin with the minimized sum of squares

R(β̂) = (Y − Xβ̂)′(Y − Xβ̂)
     = Y′Y − Y′X(X′X)⁻¹X′Y
     = Y′[I − X(X′X)⁻¹X′]Y
     = Y′(I − H)Y

where H = X(X′X)⁻¹X′. Let xᵢ′ be the ith row of X. Then the ith diagonal element of H is, for i = 1, ..., n,

h_ii = xᵢ′(X′X)⁻¹xᵢ

In simple linear regression with p = 2 and xᵢ′ = [1, Xᵢ], we have

(X′X)⁻¹ = [1/(n ΣXⱼ² − (ΣXⱼ)²)] [ ΣXⱼ²   −ΣXⱼ ;  −ΣXⱼ   n ]

so that

h_ii = [ΣXⱼ² − 2Xᵢ ΣXⱼ + nXᵢ²] / [n ΣXⱼ² − (ΣXⱼ)²]                  (6.1)
     = [ΣXⱼ² − nX̄² + nX̄² − 2nXᵢX̄ + nXᵢ²] / [n ΣXⱼ² − (ΣXⱼ)²]
     = 1/n + (Xᵢ − X̄)² / Σ(Xⱼ − X̄)²


Therefore, h_ii is a measure of how far the ith observation is from the mean. If Xᵢ = X̄, then h_ii = 1/n, which is the minimum value.

Definition 6.1. The quantity hii is called the leverage of the ith observation.

A further insight is gained by writing the mean X̄ in terms of the mean X̄_(i) when the ith observation is deleted. We can show

X̄ = (1/n)[Xᵢ + (n − 1)X̄_(i)]

so that

Xᵢ − X̄ = Xᵢ − (1/n)[Xᵢ + (n − 1)X̄_(i)] = [(n − 1)/n][Xᵢ − X̄_(i)]

Hence,

h_ii = 1/n + (Xᵢ − X̄)²/Σ(Xⱼ − X̄)² = 1/n + [(n − 1)/n]² (Xᵢ − X̄_(i))²/Σ(Xⱼ − X̄)²

This shows that the leverage of the ith observation will be large if Xi is far from the
mean of the other observations. So the leverage is concerned with the location of points
in the space of the independent variables which may be influential.

6.1 Properties of the leverage


The leverage from (6.1) can be utilized to flag influential observations. This follows from the fact that

Trace(H) = Tr[X(X′X)⁻¹X′] = Tr[(X′X)⁻¹X′X] = Tr[I_p] = p          (6.2)


and hence the average is

Σ h_ii / n = p/n

Consequently, observations with a leverage value greater than twice the average should be flagged, i.e.

h_ii > 2p/n
We note that not all points with high leverage will be influential. A point with high
leverage may lie close to the regression line and hence will not be influential on the fit.
On the other hand, a point with high leverage may be quite far from the fitted line and
may be quite influential.
It is usually important to look at the studentized residuals in conjunction with the
leverage. Observations with large leverage and large residuals are likely to be influential.
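A short sketch of this flagging rule in R, assuming a fitted lm object fit:

h <- hatvalues(fit)        # the diagonal elements h_ii of the hat matrix
p <- length(coef(fit))     # number of parameters, so the average leverage is p/n
n <- length(h)
which(h > 2 * p / n)       # observations whose leverage exceeds twice the average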

6.1.1 DFFITS
A useful measure of the influence that case i has on the fitted value Ŷᵢ is given by

DFFITSᵢ = (Ŷᵢ − Ŷ_i(i)) / √(MSE_(i) h_ii) = tᵢ [h_ii/(1 − h_ii)]^(1/2)          (6.3)

where

tᵢ = eᵢ [(n − p − 1)/(SSE(1 − h_ii) − eᵢ²)]^(1/2)
represents the Studentized residual. This shows that it can be calculated from the
original residuals, the error sum of squares and the hat matrix values. The value of
DF F IT Si represents the number of estimated standard deviations of Ŷi that the fitted
value increases or decreases with the inclusion of the ith case in fitting the regression
model.
If case i is an X outlier and has high leverage, then [h_ii/(1 − h_ii)]^(1/2) > 1 and DFFITSᵢ will be large in absolute value. As a guideline, influential cases are flagged if

|DFFITSᵢ| > 1

for small to medium data sets and

|DFFITSᵢ| > 2√(p/n)


for large data sets.

6.1.2 Cook’s Distance


Unlike the previous section, which considered the influence of the ith case on the fitted value Ŷᵢ, Cook's distance considers the influence of the ith case on the entire collection of n fitted values:

Dᵢ = Σⱼ₌₁ⁿ (Ŷⱼ − Ŷ_j(i))² / (p·MSE)          (6.4)
   = [eᵢ²/(p·MSE)] [h_ii/(1 − h_ii)²]

Cook’s distance is a function of the residual ei and the leverage hii . It can be large
if either the residual is large and the leverage moderate, or if the residual is moderate
and the leverage is large, or both are large. It can be shown that approximately

Dᵢ ≈ F(p, n − p)

Since F₀.₅₀(p, n − p) ≈ 1, we consider points for which Dᵢ > 1 to be influential. Ideally, we want the estimated β̂_(i) to remain within the boundary of the 10–20% confidence region. In R these regions are indicated in red.

6.1.3 DFBETAS
DFBETAS measure the influence that case i has on each of the regression coefficients b_k, k = 0, 1, ..., p − 1:

DFBETAS_k(i) = (b_k − b_k(i)) / √(MSE_(i) c_kk)

where c_kk is the kth diagonal element of (X′X)⁻¹. The MSE_(i) may be computed from the relationship

(n − p) MSE = (n − p − 1) MSE_(i) + eᵢ²/(1 − h_ii)

A large value of DFBETAS_k(i) indicates a large impact of the ith case on the kth regression coefficient. As a guideline, case i is flagged when

|DFBETAS_k(i)| > 2/√n for large n,    |DFBETAS_k(i)| > 1 for small n.


6.1.4 Deletion of Observations: theoretical developments


In this subsection we present the theoretical developments underlying the calculation of the diagnostic measures above. This development follows A. C. Atkinson, Plots, Transformations and Regression (1987).
We begin with a matrix identity. Let A be a p × p square matrix and let U, V be p × m matrices. Then

(A − UV′)⁻¹ = A⁻¹ + A⁻¹U(I − V′A⁻¹U)⁻¹V′A⁻¹          (6.5)

where I is the m × m identity matrix. The matrix identity is verified by multiplying the right hand side by (A − UV′).
Consider the partition of the X matrix

X = ( X_(M)
      X_M )

where X_(M) is the reduced matrix when M rows are deleted and X_M is the matrix containing the deleted rows. Similarly, let

Y = ( Y_(M)
      Y_M )

Then,

X′X = X_(M)′X_(M) + X_M′X_M
Y′Y = Y_(M)′Y_(M) + Y_M′Y_M
X′Y = X_(M)′Y_(M) + X_M′Y_M

Setting A = X′X and U′ = V′ = X_M in (6.5) we have

(X_(M)′X_(M))⁻¹ = (X′X)⁻¹ + (X′X)⁻¹X_M′(I − H_M)⁻¹X_M(X′X)⁻¹

where

H_M = X_M(X′X)⁻¹X_M′

is the hat matrix for the observations left out.


If M observations are left out and the model is refitted, the least squares estimate becomes

β̂_(M) = (X_(M)′X_(M))⁻¹X_(M)′Y_(M)          (6.6)
       = (X_(M)′X_(M))⁻¹(X′Y − X_M′Y_M)          (6.8)

Substituting the expression for (X_(M)′X_(M))⁻¹ above, we have

β̂_(M) = [(X′X)⁻¹ + (X′X)⁻¹X_M′(I − H_M)⁻¹X_M(X′X)⁻¹](X′Y − X_M′Y_M)

Let Ŷ_M = X_M β̂ be the estimated response for the M values left out and set e_M = Y_M − Ŷ_M to be the corresponding vector of residuals. A little algebra leads to the expression

β̂_(M) − β̂ = −(X′X)⁻¹X_M′(I − H_M)⁻¹e_M          (6.9)

The expression (6.9) shows the change in the parameter estimate when M data points are deleted. When M = 1,

β̂_(i) − β̂ = −(X′X)⁻¹xᵢeᵢ/(1 − h_ii)

We may also compute the effect of deletion on the error sum of squares. The error sum of squares for the full model is

SSE = (n − p)S² = Y′Y − β̂′X′Y

After deleting M observations we have

(n − p − M)S²_(M) = Y_(M)′Y_(M) − β̂_(M)′X_(M)′Y_(M)
                  = Y′Y − Y_M′Y_M − β̂_(M)′(X′Y − X_M′Y_M)
                  = (n − p)S² + β̂′X′Y − β̂_(M)′X′Y − Y_M′Y_M + β̂_(M)′X_M′Y_M
                  ≡ (n − p)S² + A + B

where A = (β̂′ − β̂_(M)′)X′Y and B = −Y_M′Y_M + β̂_(M)′X_M′Y_M.


From (6.9), and since Ŷ_M = X_M β̂, we have

A = e_M′(I − H_M)⁻¹X_M(X′X)⁻¹X′Y = Ŷ_M′(I − H_M)⁻¹e_M

Similarly,

B = −Y_M′Y_M + β̂_(M)′X_M′Y_M
  = −Y_M′Y_M + [β̂′ − e_M′(I − H_M)⁻¹X_M(X′X)⁻¹]X_M′Y_M
  = −Y_M′Y_M + β̂′X_M′Y_M − e_M′(I − H_M)⁻¹H_M Y_M
  = −e_M′Y_M − Y_M′H_M(I − H_M)⁻¹e_M

since each term is a scalar. It follows that

(n − p − M)S²_(M) = (n − p)S² + A + B
                  = (n − p)S² + Ŷ_M′(I − H_M)⁻¹e_M − e_M′Y_M − Y_M′H_M(I − H_M)⁻¹e_M

Since

(I − H_M)⁻¹ − H_M(I − H_M)⁻¹ = I

we have

Y_M′(I − H_M)⁻¹e_M − Y_M′H_M(I − H_M)⁻¹e_M = e_M′Y_M

and hence

(n − p − M)S²_(M) = (n − p)S² + e_M′(I − H_M)⁻¹Ŷ_M − e_M′(I − H_M)⁻¹Y_M
                  = (n − p)S² − e_M′(I − H_M)⁻¹e_M

This shows that the residual sum of squares when M observations are deleted is reduced
by an amount that depends on eM and on the inverse of a matrix containing elements
of the hat matrix.

When M = 1,

(n − p − 1)S²_(i) = (n − p)S² − eᵢ²/(1 − h_ii)


Finally, we may compare the value of the ith observation Yi with the prediction Ŷ(i)
when that observation is not used in the fitting.

Let

dᵢ = Yᵢ − Ŷ_(i) = Yᵢ − xᵢ′β̂_(i) = eᵢ/(1 − h_ii).

We note that its variance is equal to

σ²[dᵢ] = σ²[1 + xᵢ′(X_(i)′X_(i))⁻¹xᵢ] = σ²/(1 − h_ii)

where σ² is estimated by MSE_(i) ≡ S²_(i), which is independent of Yᵢ.

The above expression follows from

xᵢ′(X_(i)′X_(i))⁻¹xᵢ = xᵢ′(X′X)⁻¹xᵢ + [xᵢ′(X′X)⁻¹xᵢ][xᵢ′(X′X)⁻¹xᵢ]/(1 − h_ii)
                     = h_ii + h_ii²/(1 − h_ii)

and

1 + xᵢ′(X_(i)′X_(i))⁻¹xᵢ = 1/(1 − h_ii)

Consequently, the studentized deleted residuals are

dᵢ / [S_(i)√(1 + xᵢ′(X_(i)′X_(i))⁻¹xᵢ)] = [eᵢ/(1 − h_ii)] · [√(1 − h_ii)/S_(i)]
                                        = eᵢ / [S_(i)√(1 − h_ii)] ∼ t(n−p−1)


We may also compare Ŷᵢ with Ŷ_(i) through

DFFITSᵢ = (Ŷᵢ − Ŷ_(i))/(S_(i)√h_ii)
        = eᵢ [(n − p − 1)/(SSE(1 − h_ii) − eᵢ²)]^(1/2) [h_ii/(1 − h_ii)]^(1/2)
        = tᵢ [h_ii/(1 − h_ii)]^(1/2)

since

(n − p − 1)S²_(i) = (n − p)S² − eᵢ²/(1 − h_ii)
We flag influential cases when

|DFFITSᵢ| > 1 for small/medium data sets, and |DFFITSᵢ| > 2√(p/n) for large data sets.

Example 6.1. For the GPA data example,


a) Obtain the residuals
b) Obtain a histogram plot of the residuals and a qq-plot to check for normality
c) Obtain confidence and prediction bands using ggplot2
d) Plot of residuals against predictor variable
e) Plot of absolute residuals against predictor variable
f) Plot of residuals against fitted value
g) Plot of residuals against time
h) Calculate test for the non constancy of variance

Example 6.2. Consider the crime data for a city where


X= % of individuals having at least a high school diploma
Y= # crimes reported per 100,000 residents last year

Example 6.3. a) Illustrate lack of fit test using bank data Table 3.4 where
X= size of minimum deposit
Y=# of new accounts
Branch 1 2 3 4 5 6 7 8 9 10 11
X 125 100 200 75 150 175 75 175 125 200 100
Y 160 112 124 28 152 156 42 124 150 104 136
b) Use the following to create diagnostic plots of residuals vs fitted (to check lin-
earity), normal qq plots, scale-location vs fitted values and residuals vs leverage


6.2 Calculations Using R


Another package that plots diagnostic measures
install.packages("ggfortify")
library(ggfortify)
autoplot(fit)

6.3 R Session
Textbook 4.1
In R
the diagnostic plots are exhibited using the package
library(olsrr)
ols_plot_cooksd_bar(fit) # yields a plot of Cook’s distance vs the observations
ols_plot_cooksd_chart(fit) # also yields a plot of Cook's distance vs the observations
ols_plot_dfbetas(fit) # yields a plot of DFBETAS vs the observations
ols_plot_dffits(fit) # yields a plot of DFFITS vs the observations
Also the following will display the Hat matrix diagonal elements;
under coefficients, a matrix whose i-th row contains the change in the estimated
coefficients which results when the i-th case is dropped from the regression;
under sigma, a vector whose i-th element contains the estimate of the residual
standard deviation obtained when the i-th case is dropped from the regression
under wt.res. a vector of weighted (or for class glm rather deviance) residuals
diag=lm.influence(model)
diag
To see values of Y, X, .fitted, .resid, .hat, .sigma, .cooksd and .std.resid, type in R
library(broom)
model.diag.metrics = augment(model)
head(model.diag.metrics, 20) # shows values of the fit, hat, sigma, Cook's distance and standardized residuals
influence.measures(model)
summary(influence.measures(model)) # exhibits the potentially influential observations
Alternatively use
hatvalues(model)
dfbetas(model)
dffits(model)
cooks.distance(model)


6.4 Data Sets


Weighted Data
Crime Data
Delivery Time Data
Bank Data
Housing Data
Grades Data

6.5 R session
Using Weighted data
Weighted=read.table(file.choose(),header=TRUE,sep=’\t’)
names(Weighted) [1] "X" "Y" "W"
Diameter=Weighted$X
Area=Weighted$Y
install.packages("ggfortify")
library(ggfortify)
fit=lm(Y~X, data=Weighted)
autoplot(fit)
[Figure: autoplot(fit) diagnostic panels — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage — with observations 29, 32 and 33 flagged.]

A plot of residuals vs fitted values reveals whether the residuals exhibit non-linear patterns. In a good model, such a plot will show points randomly scattered about zero; in a poor model there will be some pattern.
A scale-location plot (or spread-location plot) shows whether the residuals are spread equally along the range of the predictors. It enables us to check the assumption of equal variance. The plot is good if there is a horizontal line with an equal, random spread of points.
A plot of residuals vs leverage helps to locate influential cases, if any. Some cases may be influential even if they appear to lie in a reasonable range of values. We watch for outlying values in the upper right or lower right corners and look for values outside the dashed lines, where Cook's distance scores are highest.

6.6 Suggested Problems


4.3, 4.21,

7 Different Models
The regression setup permits us to consider a variety of models, which we discuss here.

7.1 Polynomial regression models


A k-order polynomial regression model in one variable takes the form

Y = β0 + β1 X + β2 X 2 + β3 X 3 + ... + βk X k + ϵ

which may be fitted using the matrix approach. It is important to keep in mind that the order k should be kept as low as possible: for large k the columns of X become nearly collinear, so the inversion of the matrix X′X will be numerically inaccurate, resulting in poor estimates of the parameters and their variances.
Often, orthogonal polynomials defined below, are used in the modeling because they
simplify the fitting process

Yi = β0 P0 (Xi ) + β1 P1 (Xi ) + β2 P2 (Xi ) + ... + βk Pk (Xi ) + ϵi

where Pⱼ is a jth-order orthogonal polynomial satisfying

Σᵢ₌₁ⁿ Pⱼ(Xᵢ)Pₗ(Xᵢ) = 0, j ≠ l,     P₀(Xᵢ) = 1

Such polynomials have been tabulated (see Biometrika Tables for Statisticians). The least squares estimates are given by

β̂ⱼ = Σᵢ₌₁ⁿ Pⱼ(Xᵢ)Yᵢ / Σᵢ₌₁ⁿ Pⱼ(Xᵢ)²,   j = 0, 1, ..., k

The principal advantage of using orthogonal polynomials is that the model can be
fitted sequentially. This specific advantage is less important today in the age of high
speed computing compared to the times when much of the modeling was done using


calculators.
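In R, orthogonal polynomials are generated automatically by poly(). A minimal sketch, assuming a hypothetical data frame dat with columns X and Y:

fit_orth <- lm(Y ~ poly(X, 3), data = dat)              # poly() builds orthogonal polynomials P1, P2, P3
fit_raw  <- lm(Y ~ poly(X, 3, raw = TRUE), data = dat)  # raw powers X, X^2, X^3
# Both give the same fitted values, but the orthogonal version has uncorrelated coefficient
# estimates, so lower-order terms can be assessed without refitting the model.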
Sometimes a low order polynomial does not fit the data well. This can be due to
the fact that the function in question behaves differently in different parts of the range.
In that case, it is common to use spline functions or piecewise polynomial fitting. We do not pursue this topic here (refer to the textbook, pp. 236-242).
When two or more variables are involved cross terms are included as in the following
model involving two variables

Y = β₀ + β₁X₁ + β₂X₂ + β₁₁X₁² + β₂₂X₂² + β₁₂X₁X₂ + ϵ

Such models are called response surface models. They are often used in control theory problems to optimize the selection of control settings of the variables.

7.2 Indicator regression models


This subject is treated in Chapter 8 of Montgomery et al.
Regression analysis allows the use of indicator variables which are qualitative or
categorical in nature. Such variables are labeled dummy variables. For example, to take
into account gender we may define



X₂ = 1 if male, 0 if female

An interesting application is to the case where one wishes to fit a simple linear model
as a function of gender. Set

Y = β0 + β1 X1 + β2 X2 + ϵ

In that case,

Y = β₀ + β₂ + β₁X₁ + ϵ   for males
Y = β₀ + β₁X₁ + ϵ        for females
So here the two lines are parallel. This can be generalized to 2 or more dummy variables
X₂  X₃
0   0   observation from category 1
1   0   observation from category 2
0   1   observation from category 3
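A sketch of how such indicator variables are handled in R, assuming a hypothetical data frame dat with columns Y, X1 and a categorical column group:

dat$group <- factor(dat$group)          # a factor with 2 (or 3) categories
fit <- lm(Y ~ X1 + group, data = dat)   # R creates the 0/1 dummy columns automatically
coef(fit)                               # common slope for X1, separate intercept shifts per category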
Example 7.1. a) Suppose that we have the following time data and we wish to fit two lines whose join point (the abscissa) is known to be at 1972.

Year   X₀  X₁  X₂  Y
1970 1 1 0 2.3
1 2 0 3.8
1971 1 3 0 6.5
1 4 0 7.4
1972 1 5 0 10.2
1 5 1 10.5
1973 1 5 2 12.1
1 5 3 13.2
1974 1 5 4 13.6
The model is
Y = β0 + β1 X1 + β2 X2 + ϵ
b) Suppose that we have the same data as in a) but the abscissa (join point) is unknown. In that case we need an additional dummy variable.

Year   X₀  X₁  X₂  X₃  Y
1970 1 1 0 0 2.3
1 2 0 0 3.8
1971 1 3 0 0 6.5
1 4 0 0 7.4
1972 1 5 0 1 10.2
1 5 1 1 10.5
1973 1 5 2 1 12.1
1 5 3 1 13.2
1974 1 5 4 1 13.6
The model is
Y = β0 + β1 X1 + β2 X2 + β3 X3 + ϵ
For other examples, see Example 8.5 p.280 of Montgomery et al.
We conclude by noting that analysis of variance models make extensive use of
indicator variables.

7.3 R Session
Textbook 8.1

7.4 Suggested Problems


7.4; 7.6; 8.16


7.5 Data Sets


Hardwood Data
Voltage Drop Data
Windmill Data
Tool Life Data
Turkey data

8 Multicollinearity
When the predictor variables are correlated among themselves, multicollinearity is said to exist. This can cause serious problems, among them that the estimates become highly unstable. What are some of the symptoms of multicollinearity?
1. Large variation in the estimated coefficients when a new variable is either added or deleted.
2. Non-significant results in individual tests on the coefficients of important variables.
3. Large coefficients of simple correlation between pairs of variables.
4. Wide confidence intervals for the regression coefficients of important variables.
The principal difficulty is that the matrix (X ′ X) may not be invertible. As well,
multicollinearity affects the interpretation of the coefficients in that they may vary in
value. To illustrate, consider the case of two predictor variables X₁, X₂. If the variables are standardized, then

(X′X)⁻¹ = [1/(1 − r₁₂²)] [ 1  −r₁₂ ;  −r₁₂  1 ]

where r₁₂ is the correlation between the two variables. Moreover, the variance-covariance matrix of the estimates is

σ²(X′X)⁻¹ = [σ²/(1 − r₁₂²)] [ 1  −r₁₂ ;  −r₁₂  1 ]

The two regression coefficients have the same variance, and this variance increases as the correlation increases. Consequently, as |r₁₂| → 1, Var(β̂ₖ) → ∞, and Cov(β̂₁, β̂₂) → ±∞ as r₁₂ → ±1.
The estimates are

β̂ = (X′X)⁻¹X′Y

and hence β̂₁ = (r₁Y − r₁₂ r₂Y)/(1 − r₁₂²) and β̂₂ = (r₂Y − r₁₂ r₁Y)/(1 − r₁₂²).
In general, the diagonal elements of (X′X)⁻¹ are Cⱼⱼ = 1/(1 − Rⱼ²), where Rⱼ² is the R-square value obtained from the regression of Xⱼ on the other predictor variables. If there is strong multicollinearity between Xⱼ and the other predictors, then Rⱼ² ≈ 1 and Var(β̂ⱼ) ≈ σ²/(1 − Rⱼ²) → ∞.
j


As well, under multicollinearity, the values of the estimates will tend to be large. Set

L = ‖β̂ − β‖²

Then,

E[L] = Σⱼ₌₁ᵖ E(β̂ⱼ − βⱼ)² = Σⱼ₌₁ᵖ Var(β̂ⱼ) = σ² Trace[(X′X)⁻¹] = σ² Σⱼ₌₁ᵖ 1/λⱼ

where {λⱼ} are the eigenvalues of X′X. Under multicollinearity, some of these eigenvalues will be small and hence their inverses will be large.
On the other hand,

L = β̂′β̂ − 2β̂′β + β′β

Taking the expectation, and using E[β̂] = β, we see

E[L] = E‖β̂‖² − ‖β‖² = σ² Σⱼ₌₁ᵖ 1/λⱼ

Hence, E‖β̂‖² = ‖β‖² + σ² Σⱼ₌₁ᵖ 1/λⱼ.

The eigenvalues {λⱼ} can also be used to measure the extent of multicollinearity in the system. If one or more are small, then there are near linear dependencies among the columns of X. The condition number κ and condition indices κⱼ of X′X are defined to be

κ = λmax/λmin,    κⱼ = λmax/λⱼ

As a guideline: κ < 100 indicates no serious multicollinearity; 100 < κ < 1000 indicates moderate to strong multicollinearity; κ > 1000 indicates severe multicollinearity.

The starting point is to first standardize the variables as

Yᵢ* = [1/√(n − 1)] (Yᵢ − Ȳ)/s_Y
Xᵢₖ* = [1/√(n − 1)] (Xᵢₖ − X̄ₖ)/sₖ,   k = 1, ..., p − 1

When the standardized regression model with no intercept

Yᵢ* = Σₖ₌₁ᵖ⁻¹ βₖ* Xᵢₖ* + εᵢ*

is fitted, we have the relationships

βₖ = (s_Y/sₖ) βₖ*,    β₀ = Ȳ − Σₖ₌₁ᵖ⁻¹ βₖ X̄ₖ,    r_XX b* = r_YX

where r_XX is the (p − 1) × (p − 1) correlation matrix of the predictors,

r_XX = [ 1  r₁₂  ...  r₁,p−1 ;  r₁₂  1  ...  r₂,p−1 ;  ...  ;  r₁,p−1  ...  1 ]

and r_YX′ = (r_Y1, ..., r_Y,p−1), with rᵢⱼ = cor(Xᵢ, Xⱼ) and r_Yi = cor(Y, Xᵢ).


Mathematically, multicollinearity may be diagnosed using variance inflation factors. Specifically, suppose that the regression is fitted using the standardized predictor variables. Then

σ²[b*] = σ² r_XX⁻¹

We define the variance inflation factor (VIF)

(VIF)ₖ = (1 − Rₖ²)⁻¹

where Rₖ² is the coefficient of multiple determination when Xₖ is regressed on the p − 2 other X variables. Hence,

σ²[bₖ] = σ²(1 − Rₖ²)⁻¹

The (VIF)ₖ = 1 when Rₖ² = 0, i.e. whenever Xₖ is not linearly related to the other X variables in the model. Under perfect correlation, i.e. Rₖ² = 1, the variance is unbounded. As a rule of thumb, a value (VIF) > 10 indicates that multicollinearity exists.

VIF interpretation:
VIF = 1         not correlated
1 < VIF < 5     moderately correlated
VIF > 5         highly correlated

Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables; it is in fact 1 − Rₖ². Tolerance values less than 0.10 indicate collinearity.
The eigenvalues of X’X are the squares of the singular values of X. The condition
indices are the square roots of the ratio of the largest eigenvalue to each individual
eigenvalue. The largest condition index is the condition number of the scaled X matrix.
Alternatively, as a diagnostic tool, we may compute the average

V̄IF = Σₖ (VIF)ₖ / (p − 1)

Mean values much greater than 1 point to serious multicollinearity.
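A sketch of computing the (VIF)ₖ directly from their definition, assuming a fitted lm object fit (the function name vif_manual is ours):

vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]         # predictor columns (drop the intercept)
  sapply(seq_len(ncol(X)), function(k) {
    Rk2 <- summary(lm(X[, k] ~ X[, -k]))$r.squared   # R_k^2 from regressing X_k on the others
    1 / (1 - Rk2)                                    # (VIF)_k = (1 - R_k^2)^(-1)
  })
}
# The same values are reported by ols_coll_diag(fit) in the olsrr package.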
Ridge regression is considered as a remedial measure to multicollinearity. The theory
is as follows. The normal equation

(X ′ X) b = X ′ Y

is transformed by using the standardized variables so that it becomes

rXX b = rY X

We suppress the * symbol for b for ease of notation.


Consider solving instead the equation

(rXX + cI) bR = rY X

where c ≥ 0 is a constant and the superscript R indicates “ridge”. The ridge standardized
regression coefficients become

bR = (rXX + cI)−1 rY X (8.1)

The constant c reflects the fact that the ridge estimators will be biased but they tend
to be more stable or less variable than the ordinary least squares estimators.
The constant c is usually chosen in such a way that the estimators bₖᴿ are stable in value or, alternatively, so that the (VIF)ₖ are stable in value. A plot of the coefficients


against c is called the ridge trace and this helps in the selection of c.
Finally, we note that ridge regression can also be obtained from the method of penalized regression. From (8.1), we have the following system of equations:

(1 + c)b₁ᴿ + r₁₂b₂ᴿ + ... + r₁,p−1 b_{p−1}ᴿ = r_Y1
r₂₁b₁ᴿ + (1 + c)b₂ᴿ + ... + r₂,p−1 b_{p−1}ᴿ = r_Y2          (8.2)
...
r_{p−1,1}b₁ᴿ + r_{p−1,2}b₂ᴿ + ... + (1 + c)b_{p−1}ᴿ = r_{Y,p−1}

On the other hand, consider the penalized least squares criterion

Q = Σᵢ [Yᵢ − β₁Xᵢ₁ − ... − β_{p−1}Xᵢ,p−1]² + c Σⱼ₌₁ᵖ⁻¹ βⱼ²          (8.3)

Differentiating (8.3) with respect to each of the parameters leads to (8.2).
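A sketch of solving (8.1) directly in correlation form, with a hypothetical predictor matrix X, response y and ridge constant c (the function name ridge_coef is ours):

ridge_coef <- function(X, y, c) {
  Xs  <- scale(X) / sqrt(nrow(X) - 1)      # correlation transformation of the predictors
  ys  <- scale(y) / sqrt(length(y) - 1)    # correlation transformation of the response
  rXX <- t(Xs) %*% Xs                      # correlation matrix of the predictors
  rYX <- t(Xs) %*% ys
  solve(rXX + c * diag(ncol(X)), rYX)      # b^R = (r_XX + cI)^(-1) r_YX, equation (8.1)
}
# Evaluating ridge_coef over a grid of c values and plotting the coefficients gives the ridge trace.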


The major drawback of ridge regression is that the ordinary inference procedures
are no longer applicable. In that case we need to use bootstrapping methods to obtain
the precision of the estimators.

8.1 Calculations Using R


We consider the Body fat Data which consists of
V1 Tricepts skinfold thickness
V2 Thigh circumference
V3 Midarm circumference
V4 Body fat
Bodyfat=read.table(file.choose(),header=TRUE,sep=’\t’)
names(Bodyfat) [1] "Tricepts" "Thigh" "Midarm" "Fat"
V1=Bodyfat$Tricepts
V2=Bodyfat$Thigh
V3=Bodyfat$Midarm
V4=Bodyfat$Fat
a) Collinearity diagnostics use the olsrr package
library(olsrr)
model=lm(V4~V1+V2+V3,data=Bodyfat)
ols_coll_diag(model)
ols_plot_resid_fit_spread(model)


We obtain plots showing model fit assessment. These plots are used to detect non-
linearity, influential observations and outliers. They consist of side-by-side quantile plots
of the centered fit and the residuals. It shows how much variation in the data is explained
by the fit and how much remains in the residuals. For inappropriate models, the spread
of the residuals in such a plot is often greater than the spread of the centered fit.

[Figure: residual fit spread plot — side-by-side quantile plots of the centered fit (Fit − Mean) and of the residuals against Proportion Less.]

Next we obtain correlations

ols_correlations(model)

Correlations - Variable Zero Order Partial Part -

V1 0.843 0.338 0.160

V2 0.878 -0.267 -0.123

V3 0.142 -0.324 -0.153

Next we examine the observed vs fitted plot to assess the fit of the model. Ideally, all the points should lie close to the diagonal line; draw such a line on the graph and check where the points fall. A model with a high R-square will have points close to this diagonal, while the lower the R-square, the weaker the goodness of fit and the more dispersed the points are about the diagonal.

ols_plot_obs_fit(model)


[Figure: Actual vs Fitted plot for V4.]

Next we diagnose
ols_coll_diag(model) #Tolerance and Variance Inflation Factor -
Variables Tolerance VIF
V1 0.001410750 708.8429
V2 0.001771971 564.3434
V3 0.009559681 104.6060
Tolerance is the % of variation in the predictor not explained by the other predictors. To calculate it, regress the kth predictor on the other predictors in the model and compute Rₖ². Then

Tolerance = 1 − Rₖ²
b) Diagnostics panel: a panel of plots for regression diagnostics
ols_plot_diagnostics(model)
c) Ridge regression
Next we load the package MASS for ridge regression
library(MASS)
y= c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.10)
lm.ridge(V4~V1+V2+V3,Bodyfat,lambda=y)
lambda  (Intercept)  V1  V2  V3
0.01 64.389989 2.7395079 -1.49200145 -1.3458552
0.02 42.218200 2.0683914 -0.91772066 -0.9921824
0.03 30.007043 1.6986330 -0.60142746 -0.7972812
0.04 22.276858 1.4644460 -0.40119461 -0.6738065
0.05 16.944517 1.3028056 -0.26306759 -0.5885534
0.06 13.044863 1.1845111 -0.16204816 -0.5261373
0.07 10.069445 1.0941794 -0.08496703 -0.4784538
0.08 7.724954 1.0229365 -0.02422730 -0.4408273


0.09 5.830318 0.9653040 0.02486091 -0.4103718


0.10 4.267704 0.9177169 0.06534957 -0.3852088
plot(lm.ridge(V4~V1+V2+V3,Bodyfat,lambda=y))
Alternatively, instead of computing y, write
lm.ridge(V4~V1+V2+V3,Bodyfat,lambda=seq(0,0.1,0.01))
[Figure: ridge trace — plot of t(x$coef) against x$lambda for λ between 0.02 and 0.10.]

Here we see plots of the estimated coefficients with varying values of c


d) Condition Index
Most multivariate statistical approaches involve decomposing a correlation matrix
into linear combinations of variables. The linear combinations are chosen so that the
first combination has the largest possible variance (subject to some restrictions we won’t
discuss), the second combination has the next largest variance, subject to being un-
correlated with the first, the third has the largest possible variance, subject to being
uncorrelated with the first and second, and so forth. The variance of each of these lin-
ear combinations is called an eigenvalue. Collinearity is spotted by finding 2 or more
variables that have large proportions of variance (.50 or more) that correspond to large
condition indices. A rule of thumb is to label as large those condition indices in the
range of 30 or larger.
The R command is
model = lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_eigen_cindex(model)
e) Added Variable Plot
An added variable plot provides information about the marginal importance of a
predictor variable Xk , given the other predictor variables already in the model. It shows
the marginal importance of the variable in reducing the residual variability.
The added variable plot was introduced by Mosteller and Tukey (1977). It enables
us to visualize the regression coefficient of a new variable being considered to be included
in a model. The plot can be constructed for each predictor variable.


Let us assume we want to test the effect of adding/removing variable X from a


model. Let the response variable of the model be Y
Steps to construct an added variable plot:
1. Regress Y on all variables other than X and store the residuals (Y residuals).
2. Regress X on all the other variables included in the model and store the residuals (X residuals).
3. Construct a scatter plot of the Y residuals against the X residuals.
What do the Y and X residuals represent? The Y residuals represent the part of Y not explained by all the variables other than X; the X residuals represent the part of X not explained by the other variables. The slope of the line fitted to the points in the added variable plot is equal to the regression coefficient when Y is regressed on all variables including X.
A strong linear relationship in the added variable plot indicates the increased im-
portance of the contribution of X to the model already containing the other predictors.
model = lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_plot_added_variable(model)
f) Residual Plus Component Plot
The residual plus component plot was introduced by Ezekiel (1924). It was called the partial residual plot by Larsen and McCleary (1972); Hadi and Chatterjee (2012) called it the residual plus component plot.
Steps to construct the plot:
1. Regress Y on all variables including X and store the residuals (e).
2. Add to e the component b·X, where b is the regression coefficient of X, giving the partial residuals e + bX.
3. Construct a scatter plot of e + bX against X.
The residual plus component plot indicates whether any non-linearity is present in the relationship between Y and X and can suggest possible transformations for linearizing the data.
model = lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
K=ols_plot_comp_plus_resid(model)
K
g) Now we consider fitting the ridge regression model.
install.packages("glmnet")
> library(glmnet)
> y=Bodyfat$Fat
> x=data.matrix(Bodyfat[,c(’Tricepts’,’Thigh’,’Midarm’)])
> model <- glmnet(x, y, alpha = 0) #fit ridge regression model
> cv_model=cv.glmnet(x,y,alpha=0) #perform k-fold cross-validation to find op-
timal lambda value
>plot(cv_model)
> best_lambda=cv_model$lambda.min
> best_lambda
[1] 0.4370159
> best_model <- glmnet(x, y, alpha = 0, lambda = best_lambda)
> coef(best_model) #find coefficients of best model


> plot(model, xvar = "lambda") #produce Ridge trace plot


#use fitted best model to make predictions and obtain an R square
> y_predicted <- predict(model, s = best_lambda, newx = x)
> #find SST and SSE
> sst <- sum((y - mean(y))^2)
> sse <- sum((y_predicted - y)^2)
> #find R-Squared
> rsq <- 1 - sse/sst
> rsq
[1] 0.7789357

8.2 Data Sets


Body fat Data
Acetylene Data

9 Building the Regression Model
When several predictor variables are involved, the issue that comes up naturally is how to select the variables as parsimoniously as possible and still produce a "good" model. If p − 1 predictors are available, then there are 2^(p−1) possible models which can be constructed. An approach which may be misleading is one where all the predictors are initially included and the ones whose studentized coefficients are not significant are then discarded; if multicollinearity exists, this approach can lead to error. The use of diagnostic procedures is important in the final selection of the model, as outliers uncovered by a residual analysis can greatly influence the solution. Some criterion is essential for the ultimate selection of the model.

9.1 Criteria for model selection


When presented with a set of possible models it is important to develop some criteria
for selection.

9.1.1 R2
This criterion chooses the model with the largest value of explained variation. A plot of R² vs the number of variables in the model will appear as a curve with the last entry "curving" up a bit; this is the one with all variables included. One may draw a horizontal line parallel to the x-axis: the point where it meets the curve determines the best fitting model, since it will be as good as when all the variables are included.
An adjusted R² takes into account the values of n and p:

Rₐ² = 1 − [(n − 1)/(n − p)] (SSE/SSTO) = 1 − MSE(p)/[SSTO/(n − 1)]

The criteria of minimum MSE(p) and maximum adjusted R² are equivalent.


9.1.2 Mallows Cp

To derive the Mallows criterion, suppose that the true model has q predictor variables,

Y = X_q β_q + ε

Suppose instead that we fit a model using only p predictor variables, and let H_p be the hat matrix using only those p variables. The bias for the ith fitted value is

E(Ŷᵢ) − µᵢ

where µᵢ is the true mean. Consequently,

E(Ŷᵢ − µᵢ)² = [E(Ŷᵢ) − µᵢ]² + σ²[Ŷᵢ]

The total mean squared error for all the fitted values, divided by σ², is

Γ_p = (1/σ²){ Σᵢ [E(Ŷᵢ) − µᵢ]² + Σᵢ σ²[Ŷᵢ] }

We may estimate σ² by the MSE when all the variables are included. The vector of residuals becomes

e_p = (I − H_p)Y

and the error sum of squares is

SSE_p = e_p′e_p

It follows that

bias = E(e_p) = (I − H_p)E[Y] = E[Y] − E[Ŷ]

since E[H_pY] = H_pE[Y] = E[Ŷ].


When p = q, bias = E[Y] − E[Ŷ] = 0. Now, using the idempotency of (I − H_p),

E[SSE_p] = E[e_p′e_p]
         = E[Y′(I − H_p)(I − H_p)Y]
         = E[Y′(I − H_p)Y]
         = σ² Trace(I − H_p) + (bias)′(bias)
         = σ²(n − p) + Σᵢ [E(Ŷᵢ) − µᵢ]²

The total mean squared error for all the fitted values, divided by σ², is therefore

Γ_p = (1/σ²){ Σᵢ [E(Ŷᵢ) − µᵢ]² + Σᵢ σ²[Ŷᵢ] }
    = (1/σ²){ E[SSE_p] − σ²(n − p) + Σᵢ σ²[Ŷᵢ] }
    = (1/σ²)E[SSE_p] − (n − p) + (1/σ²) Σᵢ₌₁ⁿ σ²[Ŷᵢ]
    = (1/σ²)E[SSE_p] − (n − p) + p

The last term follows since Ŷᵢ = xᵢ′β̂ and

Σᵢ₌₁ⁿ σ²[Ŷᵢ] = σ² Σᵢ₌₁ⁿ xᵢ′(X′X)⁻¹xᵢ = σ² Trace[X(X′X)⁻¹X′] = σ² Trace(H) = pσ²

Hence,

Γ_p = (1/σ²)E[SSE_p] − (n − 2p)

If we estimate σ² by the MSE of the full model and E[SSE_p] by SSE_p, then the Mallows criterion becomes

C_p = SSE_p/MSE − (n − 2p)

If the p-term model has negligible bias, then E(SSE_p) ≃ (n − p)σ² and C_p ≃ p.
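A sketch of computing C_p for a candidate subset in R, assuming lm fits of the subset and of the full model (the objects fit_sub and fit_full, and the function name mallows_cp, are hypothetical):

mallows_cp <- function(fit_sub, fit_full) {
  n   <- length(residuals(fit_full))
  p   <- length(coef(fit_sub))                    # number of parameters in the subset model
  MSE <- summary(fit_full)$sigma^2                # estimate of sigma^2 from the full model
  sum(residuals(fit_sub)^2) / MSE - (n - 2 * p)   # C_p = SSE_p / MSE - (n - 2p)
}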


Mallows proposed a graphical method for finding the best model. Plot Cp vs p.
Models having little bias will be close to the line Cp = p. Models with substantial bias
will be above the line. Sometimes a model may show some bias but contains fewer
variables and as a result may be preferred.

9.1.3 Akaike information criterion


Akaike proposed a criteria based on minimizing the expected entropy of the model.
which is essentially a penalized likelihood measure, In the case of ordinary least squares
regression, it becomes
AICp = nln (SSEp ) − nln n + 2p
As more variables are included, AICp decreases and the issue becomes whether or
not the decrease justifies the inclusion of more variables.

9.1.4 Schwarz's Bayesian criterion (SBC)

A Bayesian extension of the Akaike criterion was proposed by Schwarz:

BIC_Sch = n ln(SSE_p) − n ln n + p(ln n)

This criterion places a greater penalty on additional variables than the Akaike criterion, and it is the one used by R.

9.1.5 Prediction sum of squares criterion(PRESS)


Sometimes regression equations are used to predict future values. A criterion that is then used is to select the model which minimizes

PRESS_p = Σᵢ [Yᵢ − Ŷ_i(i)]²

where Ŷ_i(i) is the fitted value for case i when the ith observation is deleted.
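A sketch of computing PRESS_p in R without refitting, using the deleted-residual identity Yᵢ − Ŷ_i(i) = eᵢ/(1 − h_ii) from Chapter 6 (assuming a fitted lm object fit):

press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)
# press(fit)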

9.2 Model selections


We shall consider various methods for selection of a model

9.2.1 All possible models


This method is self-explanatory. We consider all 2^(p−1) possible models.
In R,


with the package olsrr:

K = ols_step_all_possible(model) # yields all possible subsets and computes R², Rₐ² and C_p

9.2.2 Forward ,Backward and Stepwise Regression


The evaluation of all possible regressions becomes computationally challenging when
there are many predictor variables. Methods for evaluating a small number of subset
regression models are the forward, backward and stepwise regression approaches. These
are iterative procedures. We begin with the Forward Selection.

Forward

Step 1. Begin with no regressors in the model. Compute the standardized Student t statistic for each variable and choose the one with the greatest absolute value to include in the model. This is also the variable that has the largest simple correlation with the response. A pre-selected critical F value, say F_in, is chosen.
Step 2. With the variable from Step 1 in the model, choose the next variable using the same criterion as in Step 1 after adjusting for the effect of the first variable selected. The criterion makes use of partial correlations, which are computed between the residuals from Step 1 and the residuals from the regressions of the other regressors on X₁, that is, residuals from Ŷ = β̂₀ + β̂₁X₁ and residuals from X̂ⱼ = α̂₀ⱼ + α̂₁ⱼX₁ for j = 2, ..., K. If X₂ is selected, it implies that the largest partial F statistic is

F = SSR(X₂|X₁)/MSE(X₁, X₂)

If F > F_in, then X₂ is entered into the model. Check to drop a variable already in the model if its t value is below a preset limit.
Step 3. Repeat the steps above until the largest partial F statistic no longer exceeds F_in or until all the variables are included.

Backward

Begin with all the regressors in the model. Compute the partial F statistic for each
regressor as if it were the last one to enter the model. We compare the smallest partial
F with the preselected Fout . If it is smaller, then that variable is removed from the
model. The procedure is repeated until the smallest partial F statistic is not less than
Fout . Backward elimination is often preferred to forward regression because it begins
with all the variables in the model.

Stepwise


Stepwise regression combines the previous two approaches. It is a modification of forward regression in that it reassesses each of the regressors already in the model to see whether any has become redundant. Here we need to prespecify both F_in and F_out. Usually we choose F_in > F_out, so that it is more difficult to add a variable than to remove one.

9.2.3 LASSO and LAR regression


See original paper and see R output
Reference: Montgomery et al p.321.

9.3 Calculations Using R


We begin with the all possible model part
Cement=read.table("C:\\Users\\malvo\\OneDrive - University of Ottawa\\Documents\\CO
3375\\2023\\Hald Cement data.txt",header=TRUE,sep="\t")
library(olsrr)
model = lm(Y ~ ., data = Cement)
ols_step_all_possible(model)
k = ols_step_all_possible(model)
k
plot(k)
Next, the best subset selection
ols_step_best_subset(model)
k = ols_step_best_subset(model)
plot(k)
Next, forward regression
ols_step_forward_p(model)
k = ols_step_forward_p(model)
plot(k)
ols_step_forward_p(model, details = TRUE)
Next, backward regression
ols_step_backward_p(model)
k = ols_step_backward_p(model)
plot(k)
ols_step_backward_p(model, details = TRUE)
Next, stepwise regression
ols_step_both_p(model)
Next, stepwise using AIC
ols_step_forward_aic(model)


k = ols_step_backward_aic(model)
k
plot(k)
Next, both
ols_step_both_aic(model) or
k = ols_step_both_aic(model)
plot(k)

9.4 R Session

9.5 Data Sets


Hald Cement Data. The data relate to an engineering application that was concerned
with the effect of the composition of cement on heat evolved during hardening. The
response variable Y is the heat evolved in a cement mix.
X1 percentage weight in clinkers of 3CaO.Al2O3
X2 percentage weight in clinkers of 3CaO.SiO2
X3 percentage weight in clinkers of 4CaO.Al2O3.Fe2O3
X4 percentage weight in clinkers of 2CaO.SiO2
Y heat evolved (calories/gram)

9.6 Suggested Problems


10.6,10.12,10.14

10 Logistic Regression
Sometimes the response variable is discrete. For example, we may wish to model gender
or to estimate the likelihood that a person is wearing a life jacket. Consider the model

Yi = β0 + β1 Xi + ϵi

where

Yᵢ = 1 with probability πᵢ, and Yᵢ = 0 with probability 1 − πᵢ.

Then

E[Yᵢ] = πᵢ
The usual least squares fitting approach is problematic for the following reasons

1. the variance of Yi = πi (1 − πi ) which is not constant

2. the error terms ϵi are not normally distributed

3. There is no guarantee that the fitted model will force the estimate Ŷi to be in the
interval (0, 1).

Definition 10.1. The logistic distribution has density

f(x) = eˣ/(1 + eˣ)²,   −∞ < x < ∞          (10.1)

and cumulative distribution function

F(t) = eᵗ/(1 + eᵗ)

We can show that

E[X] = 0,   σ²[X] = π²/3


Suppose that a random variable Y is binary with

Yᵢ = 1 if β₀* + β₁*Xᵢ + ϵᵢ* < k,   Yᵢ = 0 if β₀* + β₁*Xᵢ + ϵᵢ* > k

for some constant k, where ϵᵢ* has a logistic distribution. Then

πᵢ = P(Yᵢ = 1) = P(β₀* + β₁*Xᵢ + ϵᵢ* < k)
   = P(ϵᵢ* < k − β₀* − β₁*Xᵢ)
   = F(k − β₀* − β₁*Xᵢ)
   = F(β₀ + β₁Xᵢ)
   = exp(β₀ + β₁Xᵢ)/[1 + exp(β₀ + β₁Xᵢ)]

where β₀ = k − β₀* and β₁ = −β₁*.
It is common practice to model the logarithm of the odds:

log[πᵢ/(1 − πᵢ)] = log[P(Yᵢ = 1)/(1 − P(Yᵢ = 1))] = β₀ + β₁Xᵢ          (10.2)

The estimation of the parameters is based on maximizing the likelihood

Πᵢ f(yᵢ) = Πᵢ πᵢ^(yᵢ)(1 − πᵢ)^(1−yᵢ) = Πᵢ [πᵢ/(1 − πᵢ)]^(yᵢ)(1 − πᵢ)

so that

log(Πᵢ f(yᵢ)) = Σ yᵢ log[πᵢ/(1 − πᵢ)] + Σ log(1 − πᵢ)
              = Σ yᵢ(β₀ + β₁Xᵢ) − Σ log(1 + exp(β₀ + β₁Xᵢ))

There is no closed form solution.


Instead, iterative methods are used to obtain a solution b0 , b1 and then

π̂ᵢ = exp(b₀ + b₁Xᵢ)/[1 + exp(b₀ + b₁Xᵢ)]
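A minimal Newton-Raphson sketch of such an iterative solution for the simple logistic model, with hypothetical vectors x and y (the function name logit_nr is ours):

logit_nr <- function(x, y, iters = 25) {
  X <- cbind(1, x)
  b <- c(0, 0)
  for (it in seq_len(iters)) {
    p <- as.vector(1 / (1 + exp(-X %*% b)))             # current fitted probabilities
    W <- diag(p * (1 - p))                              # weight matrix
    b <- b + solve(t(X) %*% W %*% X, t(X) %*% (y - p))  # Newton step: score X'(y - p), information X'WX
  }
  as.vector(b)                                          # (b0, b1)
}
# glm(y ~ x, family = binomial) performs the same iteration (iteratively reweighted least squares).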

To interpret the parameters in the logistic regression model, let us consider the
fitted value at a specific value of X, say X0 . Then the difference between the log odds
at X0 + 1 and the log odds at X0 is

logodds (X0 + 1) − logodds (X0 ) = β̂1

Taking the antilogarithms, we obtain the odds ratio

ÔR = eβ̂1

The odds ratio is the estimated multiplicative change in the odds of success associated with a one unit change in the value of the predictor variable. For a change of d units, the odds ratio becomes

ÔR = e^(dβ̂₁)

10.1 Repeat Observations


Suppose that we have repeat observations at each of the levels of the x variables and
set Yi to be the number of 1’s observed for the ith observation. Let ni be the number of
trials at each observation. Then Yi ∼ Binomial(ni , πi ). In that case, estimation is done
by maximizing
logL(β₀, β₁) = log Πᵢ₌₁ⁿ (nᵢ choose Yᵢ) πᵢ^(Yᵢ)(1 − πᵢ)^(nᵢ−Yᵢ)
             = Σᵢ₌₁ⁿ { log(nᵢ choose Yᵢ) + Yᵢ log πᵢ + (nᵢ − Yᵢ) log(1 − πᵢ) }

Example 10.1. Snoring and heart failure


Snoring          Score X   Heart Disease (Yes)   No     n      π̂
Never            0         24                    1355   1379   0.021
Sometimes        2         35                    603    638    0.044
Almost nightly   4         21                    192    213    0.093
Every night      5         30                    224    254    0.132

b₀ = −3.866, b₁ = 0.397, so the fitted logit is π̂′ = −3.866 + 0.397X.


OR = e^(b₁(X₂−X₁))

Comparing X₁ = 2 and X₂ = 5, we have OR = 3.2904.

10.2 Multiple Logistic models


Multiple logistic models can also be fitted. Specifically, we replace β₀ + β₁Xᵢ by

Xᵢ′β = β₀ + β₁Xᵢ₁ + ... + β_{p−1}Xᵢ,p−1

so that

E[Yᵢ] = e^(Xᵢ′β)/(1 + e^(Xᵢ′β))   and   log[π/(1 − π)] = X′β

10.3 Inference on model parameters


The maximum likelihood estimators are for large sample sizes approximately normally
distributed with variances and covariances that are functions of the second order partial
derivatives of the likelihood function.
Let

G = (∂²logL(β)/∂βᵢ∂βⱼ) ≡ (gᵢⱼ)

labeled the Hessian, where

logL(β) = Σᵢ₌₁ⁿ Yᵢ(Xᵢ′β) − Σᵢ₌₁ⁿ log(1 + e^(Xᵢ′β))

It can be shown that asymptotically E[b] = β. The variance estimate is given by

Var(b) = (X′VX)⁻¹

where V = diag(nᵢπ̂ᵢ(1 − π̂ᵢ)). Moreover, approximately,

(bₖ − βₖ)/s[bₖ] ∼ N(0, 1),   k = 0, ..., p − 1

which is used for testing and constructing confidence intervals.


To test whether several coefficients are 0, we make use of the likelihood ratio test.
We consider likelihood ratio tests whereby we compare the full model (FM) with the


reduced model (RM). Let

G² = −2 log[L(RM)/L(FM)]

If the reduced model is correct, G² follows asymptotically a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the full and reduced models, df_RM − df_FM = (n − q) − (n − p). We reject for large values, i.e. G² > χ²_{p−q}.
In the present situation, for the simple logistic model, the Full Model is the one that has been fitted, whereas the Reduced Model is the one with constant probability of success,

E(Y) = π = e^(β₀)/(1 + e^(β₀))
Under RM, the mle of π is Y/n. Hence,

lnL(RM) = Y lnY + (n − Y)ln(n − Y) − n ln n

and the likelihood ratio statistic for testing significance of the regression is

L = 2{ Σ Yᵢ ln π̂ᵢ + Σ(nᵢ − Yᵢ)ln(1 − π̂ᵢ) } − 2[Y lnY + (n − Y)ln(n − Y) − n ln n]

where Y is the total number of successes observed and n is the total number of observations. We reject the null hypothesis that the regression is not significant when L is large.

10.4 Test for Goodness of Fit


Before a logistic regression model is accepted, it needs to be examined. This is analogous
to the usual lack of fit testing regression problem. In that context, we required repeat
observations as we do here. We would like to test

H₀: E[Y] = (1 + e^(−X′β))⁻¹   vs   H₁: E[Y] ≠ (1 + e^(−X′β))⁻¹

Here we will make use of a Pearson chi-square goodness of fit test. The expected number
of successes is ni π̂i and the expected number of failures is ni (1 − π̂i ) . The Pearson Chi


square test rejects the null hypothesis whenever

χ² = Σᵢ₌₁ⁿ [ (Yᵢ − nᵢπ̂ᵢ)²/(nᵢπ̂ᵢ) + (nᵢ − Yᵢ − nᵢ(1 − π̂ᵢ))²/(nᵢ(1 − π̂ᵢ)) ] > χ²_{α,n−p}

If the fitted model is correct, the HL statistic introduced below follows a chi-square distribution with g − 1 degrees of freedom. We reject the null hypothesis for large values of the statistic HL.

10.4.1 Deviance Goodness of Fit Test

Another test for the model is based on the likelihood ratio test whereby we consider the
reduced and the full models. Here we compare the current model to a saturated model
whereby each observation (or group when ni > 1) has its own probability of success
estimated by Yi /ni .
Under the reduced model

E[Yᵢ] = (1 + e^(−Xᵢ′β))⁻¹

whereas under the full model (also called the saturated model)

E[Yᵢ] = πᵢ

The deviance goodness of fit statistic (also simply called the deviance) is given by

DEV(X₀, X₁, ..., X_{p−1}) = −2[logL(RM) − logL(FM)]
                          = 2 Σᵢ₌₁ⁿ [ Yᵢ log(Yᵢ/(nᵢπ̂ᵢ)) + (nᵢ − Yᵢ) log((nᵢ − Yᵢ)/(nᵢ(1 − π̂ᵢ))) ]

We reject for large values, i.e. DEV > χ²_{n−p}. The deviance in logistic regression plays an analogous role to the residual mean square in ordinary regression.
R computes the null deviance (the deviance of the worst model, without any predictor) and the residual deviance. The quantity

1 − Deviance/Null Deviance

is equal to 1 for a perfect fit and equal to 0 if the predictors do not add anything to the model.
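Both deviances are stored in a fitted glm object, so the quantity above is one line of R (assuming a fitted object named mlogit, as in the session later in this chapter):

1 - mlogit$deviance / mlogit$null.deviance   # near 1 for a good fit, near 0 if predictors add little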


10.4.2 Hosmer-Lemeshow Goodness of Fit Test

When there are no replicates on the regressor variables, observations may be grouped before performing a Pearson chi-square test. Generally about g = 10 groups are used. Let Oⱼ and Nⱼ − Oⱼ be the observed number of successes and failures respectively in group j, where Nⱼ is the total number of observations in the group, and let the estimated probability of success π̂ⱼ in the jth group be the average estimated success probability. Then the Hosmer-Lemeshow test statistic is

HL = Σⱼ₌₁ᵍ (Oⱼ − Nⱼπ̂ⱼ)²/(Nⱼπ̂ⱼ) + Σⱼ₌₁ᵍ (Nⱼ − Oⱼ − Nⱼ(1 − π̂ⱼ))²/(Nⱼ(1 − π̂ⱼ))
   = Σⱼ₌₁ᵍ (Oⱼ − Nⱼπ̂ⱼ)²/(Nⱼπ̂ⱼ(1 − π̂ⱼ))
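A sketch of this statistic computed by hand for ungrouped binary data, assuming a fitted glm object mlogit and its 0/1 response vector y (the function name hl_stat is ours):

hl_stat <- function(y, pihat, g = 10) {
  grp <- cut(pihat, quantile(pihat, probs = seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)        # form g groups by the fitted probabilities
  O   <- tapply(y, grp, sum)               # observed successes O_j
  N   <- tapply(y, grp, length)            # group sizes N_j
  pj  <- tapply(pihat, grp, mean)          # average estimated success probability per group
  sum((O - N * pj)^2 / (N * pj * (1 - pj)))
}
# hl_stat(y, fitted(mlogit))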

10.5 Diagnostic Measures for Logistic Regression


We shall consider the ungrouped case only. Residuals in that case can be used to diagnose
the adequacy of the fitted model. The ordinary residuals are defined as

ei = Yi − π̂i

These do not have constant variance. The deviance residual is, for i = 1, ..., n,

dᵢ = ±{ 2[ Yᵢ log(Yᵢ/(nᵢπ̂ᵢ)) + (nᵢ − Yᵢ) log((nᵢ − Yᵢ)/(nᵢ(1 − π̂ᵢ))) ] }^(1/2)

where the sign of di is the same as the sign of ei


Similarly we may compute the standardized Pearson residuals

r_Pi = (Yᵢ − π̂ᵢ)/√(π̂ᵢ(1 − π̂ᵢ))

which do not have unit variance, or the studentized Pearson residuals

sr_Pi = r_Pi/√(1 − h_ii)

where h_ii is the ith diagonal element of the hat matrix

H = V^(1/2)X(X′VX)⁻¹X′V^(1/2)

and V is the diagonal matrix with Vᵢᵢ = nᵢπ̂ᵢ(1 − π̂ᵢ). The studentized Pearson residuals


are useful to check for outliers.


For a binary response the deviance (deviation) residual may be written as

devᵢ = sgn(Yᵢ − π̂ᵢ) √( −2[ Yᵢ ln π̂ᵢ + (1 − Yᵢ) ln(1 − π̂ᵢ) ] )

so that

DEV = Σᵢ₌₁ⁿ (devᵢ)²

For a good model, E [Yi ] = π̂i and plots of rSP i vs π̂i and rSP i vs linear predictor Xi′ β
should show a smooth horizontal Lowess line through 0. Plots of the deviance and the
studentized Pearson residuals are useful to check for outliers. A normal probability plot
of the deviance residuals can be used to check for the fit of the model and for outliers.
A plot of the deviance vs the estimated probability of success can be used to determine
where the model is poorly fitted, at high or low probabilities.
Similarly for a plot of devi vs linear predictor Xi′ β.

10.5.1 Detection of Influential Observations


In order to flag influential cases, we consider deleting one observation at a time and
measuring its effect on both the χ2 and DEV statistics. Plots of these vs i will show
spikes for influential observations.
Similarly for plots vs π̂i

10.5.2 Influence on the Fitted Linear Predictor


Cook’s distance here measures the standardized change in the linear predictor X ′ β when
the ith case is deleted.
Indexed plots of Cook’s distance identify cases that have a large influence on the
fitted predictor.
Indexed plots of the leverage values hii help to identify outliers in the X space.
In all cases, visual assessments are needed because there is no hard rule of thumb for flagging outlying cases.

10.6 Calculations Using R


We will use the data on programming experience. Twenty-five individuals were selected. They had varying amounts of experience in programming. All were given the same task and the results are recorded as a binary variable (success or failure).


Data for logistic analysis may come in one of two forms: either Bernoulli or binomial.
In this example it is binary.
library(ggplot2)
names(file)[1]="experience"
names(file)[2]="success"
mlogit=glm(success~experience,data=file,family="binomial")
summary(mlogit)
confint(mlogit) #confidence intervals
exp(coef(mlogit)) #odds ratio
exp(cbind(OR=coef(mlogit),confint(mlogit))) #odds ratio and 95% confidence in-
terval
newdata=with(file,data.frame(experience=10))
predict(mlogit,newdata=newdata,se=TRUE)
We now proceed with the analysis for the experience data
mydata=read.table(file.choose(),header=TRUE,sep=’\t’)
head(mydata)
experience success Fittedvalue
1 14 0 0.310262
2 29 0 0.835263
3 6 0 0.109996
4 25 1 0.726602
5 18 1 0.461837
6 4 0 0.082130
summary(mydata)
experience success Fittedvalue
Min 4.00 0.00 0.08213
1st Qu 9.00 0.00 0.16710
Median 18.00 0.00 0.46184
Mean 16.88 0.44 0.44000
3rd Qu 24.00 1.00 0.69338
Max 32.00 1.00 0.89166
sapply(mydata,sd) # standard deviations: 9.0752410 0.5066228 0.2874901
mlogit=glm(success~experience, data=mydata,family="binomial")
summary(mlogit)
Call: glm(formula = success ~ experience, family = "binomial", data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8992 -0.7509 -0.4140 0.7992 1.9624


Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05970 1.25935 -2.430 0.0151 *
experience 0.16149 0.06498 2.485 0.0129 *
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.296 on 24 degrees of freedom
Residual deviance: 25.425 on 23 degrees of freedom
AIC: 29.425
Number of Fisher Scoring iterations: 4
confint(mlogit)
Waiting for profiling to be done
2.5 % 97.5 %
(Intercept) -6.03725238 -0.9160349
experience 0.05002505 0.3140397
exp(coef(mlogit)) (Intercept) experience
0.04690196 1.17525591
exp(cbind(OR=coef(mlogit),confint(mlogit)))
OR 2.5 % 97.5 %
(Intercept) 0.04690196 0.002388112 0.4001024
experience 1.17525591 1.051297434 1.3689441
newdata=with(mydata,data.frame(experience=10))
predict(mlogit,newdata=newdata,se=TRUE)
$fit 1 -1.444837
$se.fit [1] 0.7072129
$residual.scale [1] 1
Other packages:
install.packages("dplyr")   # installing the package
library(dplyr)              # loading the package
install.packages("caTools") # for splitting the data for logistic regression
install.packages("ROCR")    # for ROC curves to evaluate the model
library(caTools)
library(ROCR)
Sometimes the data can be split into a training set and a testing set
# Splitting dataset
split = sample.split(mtcars, SplitRatio = 0.8)
split
train_reg = subset(mtcars, split == "TRUE")
test_reg = subset(mtcars, split == "FALSE")
# Training the model
logistic_model = glm(vs ~ wt + disp, data = train_reg, family = "binomial")


logistic_model
# Summary
summary(logistic_model)
When the data are aggregated (binomial form), the response is the observed proportion of successes and the number of trials is supplied through the weights argument:
p = Y/n                # observed proportion of successes in each group
mlogit = glm(p ~ X, data = Toxicity, weights = n, family = "binomial")
mlogit
predict_reg = predict(mlogit, type = "response")
predict_reg
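Equivalently (a sketch, assuming the Toxicity data frame has a column Y with the number of successes and a column n with the number of trials), the binomial response can be given as a two-column matrix of successes and failures:
mlogit2 = glm(cbind(Y, n - Y) ~ X, data = Toxicity, family = "binomial")
summary(mlogit2)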
Stepwise logistic regression can also be carried out when several explanatory variables are involved; see the sketch below.
install.packages("MASS")
library(MASS)
stepAIC(model, trace = FALSE)   # model is a fitted glm containing several predictors
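A minimal sketch (the predictors below are an illustrative choice from the mtcars data used earlier, not a prescribed model): stepAIC() starts from a fitted model and adds or drops terms to minimize the AIC.
# Full logistic model with several candidate predictors (illustrative)
full_model = glm(vs ~ wt + disp + hp + mpg, data = mtcars, family = "binomial")
# Stepwise selection in both directions based on AIC
step_model = stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)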

10.7 Data Sets


Programming experience data
Pneumoconiosis data

10.8 R Session
Problem 13.1 in Montgomery et al.

11 Poisson Regression
In Poisson regression, we have count data Y which follows a Poisson distribution with mean µ and hence variance µ:

$$f(y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$

The model is then given as

$$Y_i = \mu_i + \varepsilon_i, \qquad i = 1, \ldots, n.$$

We assume that there is a link function g(µi) relating the mean to the linear predictor, which may be one of the following:

$$g(\mu_i) = \mu_i = X_i'\beta \quad \text{(identity link)},$$

$$g(\mu_i) = \log(\mu_i) = X_i'\beta \quad \text{(log link)}.$$
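In R the link is chosen through the family argument of glm(); a brief sketch (y and x are placeholder variable names assumed to be available):
# Poisson regression with the log link (the default for family = poisson)
fit_log = glm(y ~ x, family = poisson(link = "log"))
# Poisson regression with the identity link
fit_id = glm(y ~ x, family = poisson(link = "identity"))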

The parameters β are estimated by the method of maximum likelihood. As in the logistic case, there is no closed-form solution, so the estimates are computed iteratively (R uses Fisher scoring, as reported in the glm output).
The log likelihood is given by

$$\log L(y, \beta) = \sum_{i=1}^{n} y_i \log \mu_i - \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \log y_i!$$
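As an illustration of the quantity being maximized, here is a sketch of the Poisson log likelihood under the log link written as an R function (X denotes the model matrix and y the vector of counts; both are assumed to be available):
# Poisson log likelihood under the log link: mu_i = exp(x_i' beta)
poisson_loglik = function(beta, X, y) {
  mu = exp(X %*% beta)
  sum(y * log(mu) - mu - lgamma(y + 1))   # lgamma(y + 1) = log(y!)
}
# Numerical maximization (optim() minimizes, so supply the negative log likelihood)
# fit = optim(rep(0, ncol(X)), function(b) -poisson_loglik(b, X, y))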

The fitted Poisson model is then

$$\hat{Y}_i = g^{-1}\!\left(X_i'\hat{\beta}\right) = \begin{cases} X_i'\hat{\beta}, & \text{identity link} \\ \exp\!\left(X_i'\hat{\beta}\right), & \text{log link} \end{cases}$$
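In R both scales are available from predict(): type = "link" returns the linear predictor and type = "response" applies the inverse link (a sketch, using the fit_log object from the earlier sketch):
predict(fit_log, type = "link")       # linear predictor X_i' beta-hat
predict(fit_log, type = "response")   # fitted means exp(X_i' beta-hat)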

Inference on the Poisson model is conducted as in the case of the logistic model.
Both the logistic and the Poisson models are particular examples of a generalized linear model (GLM).


The response is assumed to have a distribution which is a member of the exponential family of distributions:

$$f(y_i, \theta_i, \phi) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + h(y_i, \phi)\right\}$$

Here,

$$\mu = E(Y) = \frac{db(\theta_i)}{d\theta_i}, \qquad Var(Y) = \frac{d^2 b(\theta_i)}{d\theta_i^2}\, a(\phi).$$
The basic idea is to develop a linear model for a function of the mean.
Set ηi = g (µi ) = Xi′ β. The function g is called the link function.
For the logistic,

$$f(y_i, \theta_i, \phi) = \binom{n_i}{y_i} p^{y_i} (1-p)^{n_i - y_i} = \exp\left\{\left[y_i \log\left(\frac{p}{1-p}\right) + n_i \log(1-p)\right] + \log\binom{n_i}{y_i}\right\},$$

so the canonical parameter is $\theta_i = \log\left(\frac{p}{1-p}\right)$, the logit link.

For the Poisson,

$$f(y_i, \theta_i, \phi) = \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} = \exp\left\{\left[y_i \log\lambda - \lambda\right] - \log y_i!\right\},$$

so the canonical parameter is $\theta_i = \log\lambda$, the log link.

11.1 R Session

11.2 Data Sets


Aircraft Damage Data: This data refers to 30 strike missions involving two types of
aircraft during the Vietnam war. X1 is an indicator variable for the type of aircraft used.
X2 and X3 are bomb loads in tons and total months of aircrew experience respectively.
The response variable Y is the number of locations where damage was inflicted.
Aircraft=read.table(file.choose(),header=TRUE,sep='\t')
names(Aircraft)
[1] "Locationnumber" "Indicator" "load" "experience"
V1=Aircraft$Locationnumber
V2=Aircraft$Indicator
V3=Aircraft$load
V4=Aircraft$experience
summary(Aircraft)


        Locationnumber Indicator  load  experience
Min.             0.000       0.0   4.0       50.00
1st Qu.          0.250       0.0   6.0       66.45
Median           1.000       0.5   7.5       80.25
Mean             1.533       0.5   8.1       80.77
3rd Qu.          2.000       1.0  10.0       94.50
Max.             7.000       1.0  14.0      120.00
mlogit=glm(V1~V2+V3+V4, data=Aircraft,family=poisson(link="log"))
summary(mlogit)
Call: glm(formula = V1 ~ V2 + V3 + V4, family = poisson(link = "log"), data =
Aircraft)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6418 -1.0064 -0.0180 0.5581 1.9094
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.406023 0.877489 -0.463 0.6436
V2 0.568772 0.504372 1.128 0.2595
V3 0.165425 0.067541 2.449 0.0143 *
V4 -0.013522 0.008281 -1.633 0.1025
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 53.883 on 29 degrees of freedom
Residual deviance: 25.953 on 26 degrees of freedom
AIC: 87.649
Number of Fisher Scoring iterations: 5
Does the model exhibit over-dispersion or under-dispersion? If the residual deviance is substantially greater than its degrees of freedom, over-dispersion is present; the coefficient estimates are still usable, but the standard errors reported by the Poisson model are too small and the extra variability is unaccounted for. The null deviance shows how well the response is predicted by a model containing only the intercept (the grand mean), while the residual deviance shows the fit once the explanatory variables are included. Here the residual deviance is 25.953 on 26 degrees of freedom, a ratio of about 1, so there is no strong evidence of over-dispersion. To obtain standard errors that allow for possible over- or under-dispersion we can fit a quasi-Poisson model:
mlogit=glm(V1~V2+V3+V4, data=Aircraft,family=quasipoisson(link="log"))
summary(mlogit)
Call: glm(formula = V1 ~ V2 + V3 + V4, family = quasipoisson(link = "log"),
data = Aircraft)
Deviance Residuals:
Min 1Q Median 3Q Max


-1.6418 -1.0064 -0.0180 0.5581 1.9094


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.406023 0.841982 -0.482 0.6337
V2 0.568772 0.483963 1.175 0.2505
V3 0.165425 0.064808 2.553 0.0169 *
V4 -0.013522 0.007946 -1.702 0.1007
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 0.9207088)
Null deviance: 53.883 on 29 degrees of freedom
Residual deviance: 25.953 on 26 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
For estimation and prediction we can use the same commands as for logistic regression; a sketch follows.
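(The predictor values below are only illustrative, chosen near the centre of the observed ranges; mlogit is the quasi-Poisson fit above.)
# Estimated mean number of damaged locations for a type-0 aircraft,
# an 8-ton bomb load and 80 months of aircrew experience (illustrative values)
newdata = data.frame(V2 = 0, V3 = 8, V4 = 80)
predict(mlogit, newdata = newdata, type = "response", se.fit = TRUE)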

12 References
The following references were used in the preparation of these notes.
[1] F. J. Anscombe. The transformation of Poisson, binomial and negative-binomial data. Biometrika, Volume 35, Issue 3-4, December 1948, Pages 246–254. https://doi-org.proxy.bib.uottawa.ca/10.1093/biomet/35.3-4.246
[2] Julian J. Faraway. Linear Models with R, second edition, 2014.
[3] David Kleinbaum, Larry Kupper, Azhar Nizam, Eli S. Rosenberg. Applied Regression Analysis and Other Multivariable Methods, fifth edition, 2014.
[4] Yu Guan. Variance stabilizing transformations of Poisson, binomial and negative binomial distributions. Statistics and Probability Letters 79 (2009) 1621–1629.
[5] Michael H. Kutner, Christopher J. Nachtsheim, John Neter, William Li. Applied Linear Statistical Models, fifth edition, 2005.
[6] Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining. Introduction to Linear Regression Analysis, sixth edition, 2021.
