Multiple Linear Regression
FIGURE 9.1 Strategy for Building a Regression Model. [Flowchart: data collection and preparation; reduction of explanatory variables (for exploratory observational studies); investigate curvature and interaction effects more fully; model refinement and selection; model validation.]
Model validation. The final step in the model-building process is to validate the selected regression model. Validation is concerned with assessing the validity of the fitted model and of the inferences drawn from the regression analysis. Several methods of assessing model validity are described in Section 9.6.
A hospital surgical unit was interested in predicting survival in patients undergoing a particular type of liver operation. A random selection of patients was available for analysis. From each patient record, the preoperative information listed in Table 9.1 was extracted.

TABLE 9.1 Potential Predictor Variables and Response Variable - Surgical Unit Example.

X1      blood clotting score
X2      prognostic index
X3      enzyme function test score
X4      liver function test score
X5      age, in years
X6      indicator variable for gender (0 = male; 1 = female)
X7, X8  indicator variables for history of alcohol use:

    Alcohol Use    X7    X8
    None            0     0
    Moderate        1     0
    Severe          0     1

The response variable is survival time.
To illustrate the model-building procedures discussed in this and the next section, we will use the Surgical Unit example. The investigator first examined scatter plots of the response against each of the explanatory variables; the scatter plot matrix and the correlation matrix were also obtained (not shown). A first-order regression model based on all predictor variables was fitted to serve as a starting point. A plot of residuals against predicted values for this fitted model is shown in Figure 9.2a. The plot suggests that both curvature and nonconstant error variance are apparent. In addition, some departure from normality is suggested by the normal probability plot of residuals in Figure 9.2b. To make the distribution of the error terms more nearly normal and to see if the same transformation would also reduce the apparent curvature, the investigator examined the logarithmic transformation Y' = ln Y.
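The effect of the logarithmic transformation can be sketched in a few lines of code. The data below are hypothetical (not the Surgical Unit data), generated so that ln Y is exactly linear in a single predictor; fitting the first-order model to the transformed response Y' = ln Y then recovers the linear structure that a raw-scale fit would distort.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    XtX = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    Xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(p)]
    return solve(XtX, Xty)

# Hypothetical survival times: ln Y = 0.5 + 0.3 X exactly
x = [float(i) for i in range(10)]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]

# First-order model fitted to the transformed response Y' = ln Y
b = ols([[xi] for xi in x], [math.log(yi) for yi in y])
print(b)  # recovers [0.5, 0.3] up to rounding
```

Fitting Y itself on x in this situation would leave curvature in the residual plot; fitting ln Y removes it entirely.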
FIGURE 9.2 Some Preliminary Residual Plots - Surgical Unit Example. (a) Residuals against predicted values, first-order model for Y. (b) Normal probability plot of residuals, first-order model for Y. (c) Residuals against predicted values, first-order model for Y' = ln Y. (d) Normal probability plot of residuals, first-order model for Y'.
Figures 9.2c and 9.2d present the corresponding plots for the first-order model when the transformed response Y' = ln Y is used. The curvature is much reduced, and the distribution of the error terms is more nearly normal, as shown by the normal probability plot. The investigator also obtained a scatter plot matrix and the correlation matrix with the transformed Y variable; these are presented in Figure 9.3. In addition, various scatter and
FIGURE 9.3 Scatter Plot Matrix and Correlation Matrix when Response Variable Is Y' - Surgical Unit Example. (Scatter plot matrix panels: LnSurvival, Bloodclot, Progindex, Enzyme, Liver.)

Multivariate Correlations

              Bloodclot   Progindex   Enzyme
Bloodclot        1.0000      0.2462   0.4699
Progindex        0.2462      1.0000   0.0901
Enzyme           0.4699      0.0901   1.0000
Liver            0.6539     -0.1496  -0.0236
LnSurvival       0.6493      0.5024   0.3690
residual plots were obtained (not shown here). All of these plots indicate that each of the predictor variables is linearly associated with Y', with X3 and X4 showing the highest degrees of association and X1 the lowest. The scatter plot matrix and the correlation matrix further show intercorrelations among the potential predictor variables. In particular, X4 has moderately high pairwise correlations with X1, X2, and X3. On the basis of these analyses, the investigator concluded to use, at this stage of the model-building process, Y' = ln Y as the response variable, to represent the predictor variables in linear terms, and not to include any interaction terms. The next stage in the model-building process is to examine whether all of the potential predictor variables are needed or whether a subset of them is adequate. A number of useful measures have been developed to assess the adequacy of the various subsets. We now turn to a discussion of these measures.
9.3 Criteria for Model Selection
TABLE 9.2 SSEp, Rp2, Ra,p2, Cp, AICp, SBCp, and PRESSp Values for All Possible Regression Models - Surgical Unit Example.

(1) Variables     (2)   (3)      (4)     (5)      (6)       (7)        (8)        (9)
in Model           p    SSEp     Rp2     Ra,p2    Cp        AICp       SBCp       PRESSp
None               1    12.808   0.000   0.000    151.498    -75.703    -73.714   13.296
X1                 2    12.031   0.061   0.043    141.164    -77.079    -73.101   13.512
X2                 2     9.979   0.221   0.206    108.556    -87.178    -83.200   10.744
X3                 2     7.332   0.428   0.417     66.489   -103.827    -99.849    8.327
X4                 2     7.409   0.422   0.410     67.715   -103.262    -99.284    8.025
X1, X2             3     9.443   0.263   0.234    102.031    -88.162    -82.195   11.062
X1, X3             3     5.781   0.549   0.531     43.852   -114.658   -108.691    6.988
X1, X4             3     7.299   0.430   0.408     67.972   -102.067    -96.100    8.472
X2, X3             3     4.312   0.663   0.650     20.520   -130.483   -124.516    5.065
X2, X4             3     6.622   0.483   0.463     57.215   -107.324   -101.357    7.476
X3, X4             3     5.130   0.599   0.584     33.504   -121.113   -115.146    6.121
X1, X2, X3         4     3.109   0.757   0.743      3.391   -146.161   -138.205    3.914
X1, X2, X4         4     6.570   0.487   0.456     58.388   -105.748    -97.792    7.903
X1, X3, X4         4     4.968   0.612   0.589     32.933   -120.844   -112.888    6.207
X2, X3, X4         4     3.614   0.718   0.701     11.421   -138.023   -130.067    4.597
X1, X2, X3, X4     5     3.084   0.759   0.740      5.000   -144.590   -134.645    4.069
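The criteria tabulated above are straightforward to compute for any candidate subset. The sketch below is a minimal pure-Python illustration on hypothetical data (not the Surgical Unit data), using Cp = SSEp/MSE(full) - (n - 2p), AICp = n ln SSEp - n ln n + 2p, SBCp = n ln SSEp - n ln n + p ln n, and PRESSp built from the deleted residuals ei/(1 - hii).

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols_fit(X, y):
    """Fit y on X (intercept added); return coefficients and residuals."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    XtX = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    Xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    resid = [yi - sum(bj * zj for bj, zj in zip(beta, z)) for z, yi in zip(Z, y)]
    return beta, resid

def hat_diagonal(X):
    """Diagonal of the hat matrix H = Z (Z'Z)^{-1} Z', with Z = [1 | X]."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    XtX = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    inv_cols = [solve(XtX, [1.0 if r == c else 0.0 for r in range(p)]) for c in range(p)]
    return [sum(z[a] * inv_cols[b][a] * z[b] for a in range(p) for b in range(p)) for z in Z]

def subset_criteria(X_full, y, cols):
    """SSEp, Cp, AICp, SBCp, PRESSp for the subset of predictor columns `cols`."""
    n = len(y)
    Xs = [[row[c] for c in cols] for row in X_full]
    _, e = ols_fit(Xs, y)
    sse = sum(ei * ei for ei in e)
    p = len(cols) + 1                      # parameters, including the intercept
    _, ef = ols_fit(X_full, y)
    mse_full = sum(v * v for v in ef) / (n - len(X_full[0]) - 1)
    cp = sse / mse_full - (n - 2 * p)
    aic = n * math.log(sse) - n * math.log(n) + 2 * p
    sbc = n * math.log(sse) - n * math.log(n) + p * math.log(n)
    h = hat_diagonal(Xs)
    press = sum((ei / (1.0 - hi)) ** 2 for ei, hi in zip(e, h))
    return sse, cp, aic, sbc, press

# Hypothetical data: y depends on the first two predictors only
n = 30
X = [[float(i), math.sin(i), math.cos(3 * i)] for i in range(n)]
y = [2.0 + 1.5 * i + 2.0 * math.sin(i) + 0.1 * math.sin(7 * i) for i in range(n)]

crit_12 = subset_criteria(X, y, [0, 1])       # the "right" subset
crit_full = subset_criteria(X, y, [0, 1, 2])  # full model: Cp equals p exactly
print(crit_12, crit_full)
```

Two built-in checks: for the full model, Cp always equals p exactly, and PRESSp can never be smaller than SSEp because each deleted residual magnifies the ordinary residual by 1/(1 - hii).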
considered sufficiently helpful to enter the regression model. Since the degrees of freedom associated with MSE vary depending on the number of X variables in the model, and since repeated tests on the same data are undertaken, fixed t* limits for adding or deleting a variable have no precise probabilistic meaning. For this reason, software programs often favor the use of predetermined α limits.
fits all regression models with two X variables, where X7 is one of the pair. For each such regression model, the t* test statistic corresponding to the newly added predictor Xk is obtained. This is the statistic for testing whether or not βk = 0 when X7 and Xk are the variables in the model. The X variable with the largest t* value, or equivalently the smallest P-value, is the candidate for addition at the second stage. If this t* value exceeds a predetermined level (i.e., the P-value falls below a predetermined level), the second X variable is added. Otherwise, the program terminates.
3. Suppose X3 is added at the second stage. Now the stepwise regression routine examines whether any of the other X variables already in the model should be dropped. For our illustration, there is at this stage only one other X variable in the model, X7, so that only one t* test statistic is obtained:

    t* = b7 / s{b7}     (9.19)
At later stages, there would be a number of these t* statistics, one for each of the variables in the model besides the one last added. The variable for which this t* value is smallest (or equivalently the variable for which the P-value is largest) is the candidate for deletion. If this t* value falls below, or the P-value exceeds, a predetermined limit, the variable is dropped from the model; otherwise, it is retained.
4. Suppose X7 is retained so that both X3 and X7 are now in the model. The stepwise regression routine now examines which X variable is the next candidate for addition, then examines whether any of the variables already in the model should now be dropped, and so on until no further X variables can either be added or deleted, at which point the search terminates. Note that the stepwise algorithm allows an X variable, brought into the model at an earlier stage, to be dropped subsequently if it is no longer helpful in conjunction with variables added at later stages.

Figure 9.7 shows the MINITAB computer printout for the forward stepwise regression procedure. The maximum acceptable α limit for adding a variable is 0.10 and the minimum acceptable α limit for removing a variable is 0.15, as shown at the top of Figure 9.7. We now follow through the steps.
1. At the start of the stepwise search, no X variable is in the model so that the model to be fitted is Yi = β0 + εi. In step 1, the t* statistics (9.18) and corresponding P-values are calculated for each potential X variable, and the predictor with the smallest P-value (largest t* value) is chosen to enter the equation. We see that Enzyme (X3) had the largest
FIGURE 9.7 MINITAB Forward Stepwise Regression Output - Surgical Unit Example. (Alpha-to-Enter: 0.1; Alpha-to-Remove: 0.15; N = 54. Predictors entered in order: Enzyme, Proglnde, Histheav, Bloodclo; for each step the printout gives the estimated coefficient, T-Value, P-Value, S, R-Sq, R-Sq(adj), and C-p.)
test statistic:

    t3* = b3 / s{b3} = 0.015124 / 0.002427 = 6.23
The P-value for this test statistic is 0.000, which falls below the maximum acceptable α-to-enter value of 0.10; hence Enzyme (X3) is added to the model.
2. At this stage, step 1 has been completed. The printout displays, near the top of the column labeled Step 1, the estimated regression coefficient for Enzyme together with its t* statistic and P-value; also given for the fitted model are s, R-Sq (41.66), R-Sq(adj), and C-p (117.4).
3. Next, all regression models containing X3 and one of the remaining potential X variables are fitted, and the t* statistics and corresponding P-values are obtained; these t* tests are equivalent to the partial F* tests based on MSR(Xk | X3) and MSE(X3, Xk). Progindex (X2) has the highest t* value, and its P-value (0.000) falls below 0.10, so that X2 now enters the model.
Step 2 in Figure 9.7 summarizes the situation at this point. Enzyme and Progindex (X3 and X2) are now in the model, and information about this model is provided. At this point, a test of whether Enzyme (X3) should be dropped is undertaken, but because the P-value (0.000) corresponding to X3 is not above 0.15, this variable is retained.
4. Next, all regression models containing X2, X3, and one of the remaining potential X variables are fitted. The appropriate t* statistics now are:

    tk* = bk / s{bk}
The predictor labeled Histheavy (X8) had the largest t* value (P-value = 0.000) and was next added to the model.
5. The column labeled Step 3 in Figure 9.7 summarizes the situation at this point. X2, X3, and X8 are now in the model. Next, a test is undertaken to determine whether X2 or X3 should be dropped. Since both of the corresponding P-values are less than 0.15, neither predictor is dropped from the model.
6. At step 4 Bloodclot (X1) is added, and no terms previously included were dropped. The right-most column of Figure 9.7 summarizes the addition of variable X1 into the model containing variables X2, X3, and X8. Next, a test is undertaken to determine whether X2, X3, or X8 should be dropped. Since all P-values are less than 0.15 (all are 0.000), all variables are retained.
7. Finally, the stepwise regression routine considers adding one of X4, X5, X6, or X7 to the model containing X1, X2, X3, and X8. In each case, the P-values are greater than 0.10 (not shown); therefore, no additional variables can be added to the model and the search process is terminated. Thus, the stepwise search algorithm identifies (X1, X2, X3, X8) as the "best" subset of X variables. This model also happens to be the model identified by both the SBCp and PRESSp criteria in our previous analyses based on an assessment of "best" subset selection.
Comments
1. The choice of α-to-enter and α-to-remove values essentially represents a balancing of opposing tendencies. Simulation studies have shown that for large pools of predictor variables that have been generated to be uncorrelated with the response variable, use of large or moderately large α-to-enter values as the entry criterion results in a procedure that is too liberal; that is, it allows too many predictor variables into the model. On the other hand, models produced by an automatic selection procedure with small α-to-enter values are often underspecified, resulting in σ² being badly overestimated and the procedure being too conservative (see, for example, References 9.2 and 9.3).
2. The maximum acceptable α-to-enter value should never be larger than the minimum acceptable α-to-remove value; otherwise, cycling is possible where a variable is continually entered and removed.
3. The order in which variables enter the regression model does not reflect their importance. At times, a variable may enter the model, only to be dropped at a later stage because it can be predicted well from the other predictors that have been subsequently added.

Other stepwise procedures are available to find a "best" subset of the predictor variables. We mention two of these.
Comments
1. Some computer regression programs use the reciprocal of the variance inflation factor to detect instances where an X variable should not be allowed into the fitted regression model because of excessively high interdependence between this variable and the other X variables in the model. Tolerance limits for 1/(VIF)k = 1 - Rk² frequently used are .01, .001, or .0001, below which the variable is not entered into the model.
2. A limitation of variance inflation factors is that they cannot distinguish between several simultaneous multicollinearities.
3. A number of other formal methods for detecting multicollinearity have been proposed. These are more complex than variance inflation factors and are discussed in specialized texts such as References 10.5 and 10.6.
10.6 Surgical Unit Example - Continued
Next, the six two-factor interaction terms among X1, X2, X3, and X8 were examined. Plots of the residuals against each interaction term (not shown) did not suggest that any strong two-variable interactions are present and need to be added to the model. The absence of any strong interaction effects was confirmed formally: the P-value of the test for dropping the interaction terms from the model containing both the first-order effects and the interaction effects is .35, indicating that interaction effects are not needed. Figure 10.9 contains some of the additional plots that were generated to check on the adequacy of the first-order model:
    Yi' = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β8Xi8 + εi     (10.45)

where Yi' = ln Yi.
1. The residual plot against the fitted values in Figure 10.9a shows no evidence of serious departures from the model.
2. One of the three candidate models identified in the model-building studies of Section 9.6 contained X5 (patient age). The sign of b5 was negative in model (9.23) but became positive in other fitted models. We will now use a residual plot and an added-variable plot to study graphically
FIGURE 10.9 Residual and Added-Variable Plots - Surgical Unit Example. (a) Residuals against predicted values. (b) Residuals against X5 for the model containing X1, X2, X3, X8. (c) Added-variable plot: e(Y' | X1, X2, X3, X8) against e(X5 | X1, X2, X3, X8). (d) Normal probability plot of residuals against expected values.
the strength of the marginal relationship between X5 and the response, when X1, X2, X3, and X8 are already in the model. Figure 10.9b shows the plot of the residuals for the model containing X1, X2, X3, and X8 against X5, the predictor variable not in the model. This plot shows no need to include patient age (X5) in the model to predict logarithm of survival time. A better view of this marginal relationship is provided by the added-variable plot in Figure 10.9c. The slope coefficient b5 can be seen again to be slightly negative, as depicted by the solid line in the added-variable plot. Overall, however, the marginal relationship between X5 and Y' is weak. The P-value of the formal t test (9.18) for dropping X5 from the model containing X1, X2, X3, X8, and X5 is 0.194. In addition, the plot shows that the negative slope is driven largely by one or two outliers: one in the upper left region of the plot, and one in the lower right region. In this way the added-variable plot provides additional support for dropping X5.
3. The normal probability plot of the residuals in Figure 10.9d shows little departure from linearity. The coefficient of correlation between the ordered residuals and their expected values under normality is .982, which is larger than the critical value for significance level .05 in Table B.6.
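The quantities plotted in an added-variable plot are just two sets of residuals, and the plot's key property is easy to verify in code: the slope of the least-squares line through the plotted points equals the coefficient of the scrutinized variable in the full model (the Frisch-Waugh result). The sketch below uses small hypothetical data, with "x5" standing in for the variable under scrutiny.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    XtX = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    Xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(p)]
    return solve(XtX, Xty)

def residuals(X, y):
    """Residuals from regressing y on X (intercept included)."""
    beta = ols(X, y)
    return [yi - beta[0] - sum(bj * xj for bj, xj in zip(beta[1:], row))
            for row, yi in zip(X, y)]

# Hypothetical data: two predictors already in the model, plus x5 under scrutiny
n = 20
base = [[float(i), math.sin(i)] for i in range(n)]
x5 = [math.cos(2 * i) + 0.1 * i for i in range(n)]
y = [1.0 + 1.0 * i + 0.5 * math.sin(i) - 0.3 * x5[i] + 0.01 * math.sin(5 * i)
     for i in range(n)]

# Coefficient of x5 in the full regression
b_full = ols([row + [x5[i]] for i, row in enumerate(base)], y)[-1]

# Added-variable plot coordinates: e(Y | base) against e(x5 | base)
e_y = residuals(base, y)
e_x = residuals(base, x5)
slope = sum(a * b for a, b in zip(e_x, e_y)) / sum(a * a for a in e_x)
print(b_full, slope)  # equal, by the Frisch-Waugh result
```

Plotting e_y against e_x therefore shows exactly the marginal contribution of x5 given the variables already in the model, which is what Figure 10.9c displays for patient age.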
Multicollinearity was studied by calculating the variance inflation factors (VIF)k for the four predictor variables X1, X2, X3, and X8; all were close to 1. As may be seen from these results, multicollinearity among the four predictor variables is not a problem.
Figure 10.10 contains index plots of four key regression diagnostics: the residuals ei in Figure 10.10a, the leverage values hii in Figure 10.10b, Cook's distances Di in Figure 10.10c, and (DFFITS)i values in Figure 10.10d. These plots suggest further study of cases 17, 28, and 38. Table 10.6 lists numerical diagnostic values for these cases. The measures presented in these columns are the residuals ei, the studentized deleted residuals ti in (10.24), the leverage values hii in (10.18), the (DFFITS)i values in (10.30), and the Cook's distance measures Di in (10.33). The following points about the diagnostics in Table 10.6 are noteworthy:
1. Case 17 was identified as outlying with regard to its Y value because its studentized deleted residual exceeds three in absolute value. We test formally whether case 17 is outlying by means of the Bonferroni outlier test procedure. For a family significance level of .05 with n = 54, we require t(1 - α/2n; n - p - 1) = t(.99954; 48) = 3.528. Since |t17| = 3.3696 < 3.528, the formal outlier test indicates that case 17 is not an outlier. Still, |t17| is very close to the critical value, and although case 17 does not appear to be outlying to any substantial extent, we may wish to investigate its circumstances to remove any doubts.
2. Cases 23, 28, 32, 38, 42, and 52 were identified as outlying with regard to their X values. Here we see the value of the leverage measure for identifying multivariable outliers.
3. To determine the influence of cases 17, 23, 28, 32, 38, 42, and 52, we consider each of these measures. According to the (DFFITS)i and Cook's distance values, case 17 is the most influential. Referring its Cook's distance value to the F distribution with 5 and 49 degrees of freedom, we note that it corresponds only to about the 11th percentile, so even case 17 does not appear to be unduly influential; the other cases also do not appear to be influential.
An analysis of the effect of case 17 on the inferences of interest was also conducted. Since the model is intended to be used for making predictions, each fitted value Ŷi based on all 54 observations was compared with the fitted value Ŷi(17) obtained when case 17 is deleted in fitting the regression model, using the percent differences:

    100 [Ŷi(17) - Ŷi] / Ŷi
The largest absolute percent difference (which is for case 17) is only minor; case 17 therefore does not have such a disproportionate influence on the fitted values that remedial action would be required.
4. In summary, the diagnostic analyses identified a number of potential problems, but none of these was considered to be serious enough to require further remedial action.
Cited References
10.1. Atkinson, A. C. Plots, Transformations, and Regression. Oxford: Clarendon Press, 1987.
10.2. Mansfield, E. R., and M. D. Conerly. "Diagnostic Value of Residual and Partial Residual Plots," The American Statistician.
10.5. Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons, 1980.
Problems
10.1. A student asked: "Why is it necessary to perform diagnostic checks of the fit when R² is large?" Comment.
10.2. A researcher stated: "One good thing about added-variable plots is that they are extremely useful for identifying model adequacy even when the predictor variables are not properly specified in the regression model." Comment.
a. Prepare an added-variable plot for each of the predictor variables.
b. Do your plots in part (a) suggest that the regression relationships in the fitted regression function in Problem 6.5b are inappropriate for any of the predictor variables? Explain.
c. Obtain the fitted regression function in Problem 6.5b by separately regressing both Y and X2 on X1, and then regressing the residuals in an appropriate fashion.
b. Prepare an added-variable plot for each of the predictor variables X1 and X2.
c. Do your plots in part (a) suggest that the regression relationships in the fitted regression function in part (a) are inappropriate for any of the predictor variables? Explain.
d. Obtain the fitted regression function in part (a) by separately regressing both Y and X2 on X1, and then regressing the residuals in an appropriate fashion.
10.7. Refer to Patient satisfaction Problem 6.15c.
a. Prepare an added-variable plot for each of the predictor variables.