Multiple Linear Regression in Data Mining
Contents
2.1. A Review of Multiple Linear Regression
2.2. Illustration of the Regression Process
2.3. Subset Selection in Linear Regression
fitted (predicted) values at the observed values in the data. The sum of squared
differences is given by
$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip} \right)^2.$$
Let us denote the values of the coefficients that minimize this expression by
β̂0 , β̂1 , β̂2 , . . . , β̂p . These are our estimates for the unknown values and are called
OLS (ordinary least squares) estimates in the literature. Once we have com-
puted the estimates β̂0 , β̂1 , β̂2 , . . . , β̂p we can calculate an unbiased estimate σ̂²
for σ² using the formula:
$$\hat{\sigma}^2 = \frac{1}{n-p-1} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip} \right)^2.$$
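For concreteness, the OLS estimates and σ̂² can be computed with any least-squares routine; the sketch below is a minimal illustration using numpy, assuming the independent variables are held in an n × p array X and the dependent variable in a vector y (all names are illustrative).

```python
import numpy as np

def ols_fit(X, y):
    """OLS coefficient estimates and the unbiased estimate of sigma^2.

    X : (n, p) array of independent variables (without a constant column)
    y : (n,) array of the dependent variable
    """
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])               # prepend the constant term beta_0
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes the sum of squared differences
    residuals = y - X1 @ beta_hat
    sigma2_hat = residuals @ residuals / (n - p - 1)    # unbiased: divide by n - p - 1
    return beta_hat, sigma2_hat
```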
$$E(Y \mid x_1, x_2, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$
3. Unbiasedness: The noise random variable $\varepsilon_i$ has zero mean, i.e., $E(\varepsilon_i) = 0$
for $i = 1, 2, \ldots, n$.
will give the smallest value of squared error on the average. We elaborate on
this idea in the next section.
The Normal distribution assumption was required to derive confidence in-
tervals for predictions. In data mining applications we have two distinct sets of
data: the training data set and the validation data set that are both representa-
tive of the relationship between the dependent and independent variables. The
training data is used to estimate the regression coefficients β̂0 , β̂1 , β̂2 , . . . , β̂p .
The validation data set constitutes a “hold-out” sample and is not used in
computing the coefficient estimates. This enables us to estimate the error in
our predictions without having to assume that the noise variables follow the
Normal distribution. We use the training data to fit the model and to estimate
the coefficients. These coefficient estimates are used to make predictions for
each case in the validation data. The prediction for each case is then compared
to the value of the dependent variable that was actually observed in the validation
data. The average of the squared errors enables us to compare different
models and to assess the accuracy of the model in making predictions.
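A minimal sketch of this workflow, assuming the training and validation cases are held in numpy arrays with illustrative names (X_train, y_train, X_valid, y_valid); models with different sets of independent variables can be compared by the value this function returns.

```python
import numpy as np

def validation_mse(X_train, y_train, X_valid, y_valid):
    """Fit the regression on the training data and return the average
    squared prediction error on the hold-out validation data."""
    X1 = np.column_stack([np.ones(len(y_train)), X_train])
    beta_hat, *_ = np.linalg.lstsq(X1, y_train, rcond=None)
    X1v = np.column_stack([np.ones(len(y_valid)), X_valid])
    errors = X1v @ beta_hat - y_valid     # predicted minus actual, one error per validation case
    return np.mean(errors ** 2)           # average squared error
```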
each variable is 25 and the maximum value is 125. These ratings are answers
to survey questions given to a sample of 25 clerks in each of 30 departments.
The purpose of the analysis was to explore the feasibility of using a question-
naire for predicting the effectiveness of departments, thus saving the considerable
effort required to measure effectiveness directly. The variables are answers to
questions on the survey and are described below.
In Table 2.3 we use ten more cases as the validation data. Applying the previous
equation to the validation data gives the predictions and errors shown in Table
2.3. The last column, entitled error, is simply the difference between the predicted
and the actual rating. For example, for Case 21 the error is 44.46 − 50 = −5.54.
We note that the average error in the predictions is small (−0.52), and so
the predictions are unbiased. Further, the errors are roughly Normal, so this
model gives prediction errors that are within ±14.34 (two standard deviations)
of the true value approximately 95% of the time.
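The two summary figures quoted above come directly from the validation errors; a small sketch of the calculation (array names are illustrative):

```python
import numpy as np

def error_summary(predicted, actual):
    """Average validation error (a check on bias) and a rough 95% band
    of +/- two standard deviations of the errors."""
    errors = np.asarray(predicted) - np.asarray(actual)
    bias = errors.mean()                  # close to zero for unbiased predictions
    band = 2 * errors.std(ddof=1)         # roughly 95% of errors fall within +/- band
    return bias, band
```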
Case Y X1 X2 X3 X4 X5 X6
1 43 51 30 39 61 92 45
2 63 64 51 54 63 73 47
3 71 70 68 69 76 86 48
4 61 63 45 47 54 84 35
5 81 78 56 66 71 83 47
6 43 55 49 44 54 49 34
7 58 67 42 56 66 68 35
8 71 75 50 55 70 66 41
9 72 82 72 67 71 83 31
10 67 61 45 47 62 80 41
11 64 53 53 58 58 67 34
12 67 60 47 39 59 74 41
13 69 62 57 42 55 63 25
14 68 83 83 45 59 77 35
15 77 77 54 72 79 77 46
16 81 90 50 72 60 54 36
17 74 85 64 69 79 79 63
18 65 60 65 75 55 80 60
19 65 70 46 57 75 85 46
20 50 58 68 54 64 78 52
Let us illustrate the last two points using the simple case of two indepen-
dent variables. The reasoning remains valid in the general situation of more
than two independent variables.
$$Y = \beta_1 X_1 + \varepsilon \qquad (2.2)$$
$$Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \qquad (2.3)$$
$$E(\hat{\beta}_2) = 0, \qquad Var(\hat{\beta}_2) = \frac{\sigma^2}{(1 - R_{12}^2)\sum_{i=1}^{n} x_{i2}^2},$$
where $R_{12}$ is the correlation coefficient between $X_1$ and $X_2$.
We notice that β̂1 is an unbiased estimator of β1, and β̂2 is an unbiased
estimator of β2, since β2 = 0 (X2 is irrelevant) and β̂2 has an expected value of
zero. If we use Model (2) we obtain that
$$E(\hat{\beta}_1) = \beta_1, \qquad Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} x_{i1}^2}.$$
The variance is the expected value of the squared error for an unbiased
estimator, so we are worse off using the model that includes the irrelevant
variable when making predictions. Even if X2 happens to be uncorrelated with
X1, so that $R_{12}^2 = 0$ and the variance of β̂1 is the same in both models, we can
show that a prediction based on Model (3) will have larger variance than a
prediction based on Model (2), due to the added variability introduced by the
estimation of β2.
Although our analysis has been based on one useful independent variable
and one irrelevant independent variable, the result holds true in general. It is
always better to make predictions with models that do not include
irrelevant variables.
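This effect is easy to see in a small Monte Carlo experiment. The sketch below is purely illustrative (the sample size, coefficients, and correlation are arbitrary choices, not part of the chapter's example): it generates data from Y = β1 X1 + ε, fits Model (2) and Model (3), and compares their squared prediction errors at a new point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta1, sigma, rho = 30, 2.0, 1.0, 0.7      # arbitrary illustrative values
x_new = np.array([1.0, 0.5])                  # point (X1, X2) at which we predict
reps = 10_000
mse2 = mse3 = 0.0

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # correlated with X1 but irrelevant
    y = beta1 * x1 + sigma * rng.normal(size=n)
    y_new = beta1 * x_new[0] + sigma * rng.normal()

    b2, *_ = np.linalg.lstsq(x1[:, None], y, rcond=None)                 # Model (2): X1 only
    b3, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)   # Model (3): X1 and X2

    mse2 += (x_new[0] * b2[0] - y_new) ** 2
    mse3 += (x_new @ b3 - y_new) ** 2

print(mse2 / reps, mse3 / reps)   # Model (2) typically shows the smaller average squared error
```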
If, on the other hand, X2 is a relevant variable (β2 ≠ 0) but we fit Model (2),
then β̂1 is a biased estimator of β1 with bias equal to R12 β2, and its mean
square error is given by:
$$\begin{aligned}
MSE(\hat{\beta}_1) &= E[(\hat{\beta}_1 - \beta_1)^2] \\
&= E[\{\hat{\beta}_1 - E(\hat{\beta}_1) + E(\hat{\beta}_1) - \beta_1\}^2] \\
&= [Bias(\hat{\beta}_1)]^2 + Var(\hat{\beta}_1) \\
&= (R_{12}\beta_2)^2 + \sigma^2.
\end{aligned}$$
If we use Model (3), the least squares estimates have the following expected
values and variances:
$$E(\hat{\beta}_1) = \beta_1, \qquad Var(\hat{\beta}_1) = \frac{\sigma^2}{(1 - R_{12}^2)},$$
$$E(\hat{\beta}_2) = \beta_2, \qquad Var(\hat{\beta}_2) = \frac{\sigma^2}{(1 - R_{12}^2)}.$$
Now let us compare the mean square errors for predicting $Y$ at $X_1 = u_1$, $X_2 = u_2$.
$$\begin{aligned}
MSE_2(\hat{Y}) &= E[(\hat{Y} - Y)^2] \\
&= E[(u_1\hat{\beta}_1 - u_1\beta_1 - \varepsilon)^2] \\
&= u_1^2\, MSE_2(\hat{\beta}_1) + \sigma^2 \\
&= u_1^2 (R_{12}\beta_2)^2 + u_1^2\sigma^2 + \sigma^2.
\end{aligned}$$
$$\begin{aligned}
MSE_3(\hat{Y}) &= E[(\hat{Y} - Y)^2] \\
&= E[(u_1\hat{\beta}_1 + u_2\hat{\beta}_2 - u_1\beta_1 - u_2\beta_2 - \varepsilon)^2] \\
&= Var(u_1\hat{\beta}_1 + u_2\hat{\beta}_2) + \sigma^2, \quad \text{because now } \hat{Y} \text{ is unbiased} \\
&= u_1^2\, Var(\hat{\beta}_1) + u_2^2\, Var(\hat{\beta}_2) + 2u_1 u_2\, Covar(\hat{\beta}_1, \hat{\beta}_2) + \sigma^2 \\
&= \frac{u_1^2 + u_2^2 - 2u_1 u_2 R_{12}}{1 - R_{12}^2}\,\sigma^2 + \sigma^2.
\end{aligned}$$
Model (2) can lead to lower mean squared error for many combinations
of values for u1 , u2 , R12 , and (β2 /σ)2 . For example, if u1 = 1, u2 = 0, then
M SE2(Ŷ ) < M SE3(Ŷ ), when
$$(R_{12}\beta_2)^2 + \sigma^2 < \frac{\sigma^2}{1 - R_{12}^2},$$
i.e., when
$$\frac{|\beta_2|}{\sigma} < \frac{1}{\sqrt{1 - R_{12}^2}}.$$
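As a quick numerical illustration (the value of R12 here is arbitrary), take R12 = 0.8:
$$\frac{1}{\sqrt{1 - 0.8^2}} = \frac{1}{\sqrt{0.36}} \approx 1.67,$$
so at u1 = 1, u2 = 0 the smaller Model (2) gives the lower mean square error whenever |β2| is less than about 1.67 σ, that is, whenever the effect of X2 is modest relative to the noise.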
Forward Selection
Here we keep adding variables one at a time to construct what we hope is a
reasonably good subset. The steps are as follows:
2. Compute the reduction in the sum of squares of the residuals (SSR) ob-
tained by including each variable that is not presently in S. We denote
by SSR(S) the sum of squared residuals when the model consists of the
set S of variables. Let σ̂²(S) be an unbiased estimate of σ² for the model
consisting of the set S of variables. For the variable, say i, that gives the
largest reduction in SSR, compute
Backward Elimination
1. Start with all variables in S.
2. Compute the increase in the sum of squares of the residuals (SSR) ob-
tained by excluding each variable that is presently in S. For the variable,
say, i, that gives the smallest increase in SSR compute
$$F_i = \min_{i \in S} \frac{SSR(S - \{i\}) - SSR(S)}{\hat{\sigma}^2(S)}$$
If Fi < Fout , where Fout is a threshold (typically between 2 and 4) then
drop i from S.
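A minimal sketch of one Backward Elimination pass, using numpy; the helper ssr and the list S of column indices are illustrative names, and the sketch assumes a constant term is always included in the model.

```python
import numpy as np

def ssr(X, y, cols):
    """Sum of squared residuals for the model with a constant and the given columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ beta
    return r @ r

def backward_step(X, y, S, F_out=2.0):
    """Try to drop one variable from the list of column indices S using the F criterion."""
    n = len(y)
    sigma2_S = ssr(X, y, S) / (n - len(S) - 1)       # unbiased estimate of sigma^2 for model S
    F = {i: (ssr(X, y, [j for j in S if j != i]) - ssr(X, y, S)) / sigma2_S for i in S}
    i_min = min(F, key=F.get)                        # variable whose removal increases SSR least
    if F[i_min] < F_out:                             # drop it if F_i is below the threshold
        S = [j for j in S if j != i_min]
    return S
```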
Backward Elimination has the advantage that all variables are included in
S at some stage. This addresses a drawback of Forward Selection: it will never
select a variable that is better than a previously selected variable with which it
is strongly correlated. The disadvantage is that the full model with all variables
is required at the start, and this can be time-consuming and numerically unstable.
Step-wise Regression
This procedure is like Forward Selection except that at each step we consider
dropping variables as in Backward Elimination.
Convergence is guaranteed if the thresholds Fout and Fin satisfy: Fout <
Fin . It is possible, however, for a variable to enter S and then leave S at a
subsequent step and even rejoin S at a yet later step.
As stated above, these methods pick one best subset. There are straight-
forward variations of the methods that identify several close-to-best choices
for different sizes of independent variable subsets.
None of the above methods is guaranteed to yield the best subset for a
criterion such as adjusted R² (defined later in this note). They are reasonable
methods for situations with large numbers of independent variables, but for
moderate numbers of independent variables the method discussed next is
preferable.
$$R^2_{adj} = 1 - \frac{n-1}{n-k-1}\,(1 - R^2).$$
It can be shown that using $R^2_{adj}$ to choose a subset is equivalent to picking
the subset that minimizes σ̂².
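The computation itself is one line; the sketch below (function name illustrative) takes k to be the number of independent variables in the subset, not counting the constant, a convention under which the values in Table 2.4 are reproduced.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k independent variables
    (excluding the constant) fitted to n observations."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Size-2 model of Table 2.4 (constant and X1, so k = 1; n = 20 training cases):
# adjusted_r2(0.593, 20, 1) is approximately 0.570, matching the table.
```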
Table 2.4 gives the results of the subset selection procedures applied to
the training data in the Example on supervisor data in Section 2.2.
Notice that the step-wise method fails to find the best subset for sizes of 4,
5, and 6 variables. The Forward and Backward methods do find the best subsets
of all sizes and so give results identical to the All Subsets algorithm. The best
subset of size 3, consisting of {X1, X3}, maximizes $R^2_{adj}$ for all the algorithms.
This suggests that we may be better off in terms of MSE of predictions if we
use this subset rather than the full model of size 7 with all six variables in the
model. Using this model on the validation data gives a slightly higher standard
deviation of error (7.3) than the full model (7.1) but this may be a small price to
pay if the cost of the survey can be reduced substantially by having 2 questions
instead of 6. This example also underscores the fact that we are basing our
analysis on small (tiny by data mining standards!) training and validation data
sets. Small data sets make our estimates of R² unreliable.
A criterion that is often used for subset selection is known as Mallows'
Cp. This criterion assumes that the full model is unbiased, although it may have
variables that, if dropped, would improve the MSE. With this assumption
we can show that if a subset model is unbiased, E(Cp) equals k, the size of the
subset. Thus a reasonable approach to identifying subset models with small bias
is to examine those with values of Cp that are near k. Cp is also an estimate
of the sum of MSE (standardized by dividing by σ²) for predictions (the fitted
values) at the x-values observed in the training set. Thus good models are those
that have values of Cp near k and that have small k (i.e., are of small size). Cp
is computed from the formula:
is computed from the formula:
SSR
Cp = + 2k − n,
σ̂F2 ull
where $\hat{\sigma}^2_{Full}$ is the estimated value of σ² in the full model that includes all the
variables. It is important to remember that the usefulness of this approach
depends heavily on the reliability of the estimate of σ² for the full model. This
requires that the training set contain a large number of observations relative to
the number of variables. We note that for our example only the subsets of size
6 and 7 appear to be unbiased, since for the other models Cp differs substantially
from k. This is a consequence of having too few observations to estimate σ²
accurately in the full model.

Stepwise Selection

Size   SSR       RSq     RSq(adj)   Cp       Variables in model
2      874.467   0.593   0.570      -0.615   Constant, X1
3      786.601   0.634   0.591      -0.161   Constant, X1, X3
4      783.970   0.635   0.567       1.793   Constant, X1, X2, X3
5      781.089   0.637   0.540       3.742   Constant, X1, X2, X3, X4
6      775.094   0.639   0.511       5.637   Constant, X1, X2, X3, X4, X5
7      738.900   0.656   0.497       7.000   Constant, X1, X2, X3, X4, X5, X6
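Since the full model in the example has six variables plus a constant fitted to 20 training cases, σ̂²Full = 738.900/13, and the Cp column of the table can be reproduced directly; a small sketch (names illustrative):

```python
def mallows_cp(ssr_subset, sigma2_full, k, n):
    """Mallows' Cp for a subset model with k coefficients (including the constant)."""
    return ssr_subset / sigma2_full + 2 * k - n

sigma2_full = 738.900 / 13          # full-model SSR over its 20 - 6 - 1 degrees of freedom
for size, ssr in [(2, 874.467), (3, 786.601), (4, 783.970),
                  (5, 781.089), (6, 775.094), (7, 738.900)]:
    print(size, round(mallows_cp(ssr, sigma2_full, size, 20), 3))
# prints -0.615, -0.161, 1.793, 3.742, 5.637, 7.000 -- the Cp column of the table
```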