Multiple Linear Regression
The objective of this section is to introduce multiple linear regression. By the end of the chapter you will be able to express the multiple linear regression model in matrix form, estimate its parameters by the least squares method, test the significance of individual parameters and of the overall regression, and fit such models using statistical software.
1.1 Introduction
Up to now, we have been dealing with regression relationships in which only two variables, one dependent and one independent variable, were involved. Multiple linear regression analysis is merely an extension of simple linear regression. In multiple linear regression there are 𝑝 − 1 explanatory variables, and the relationship between the dependent variable and the explanatory variables is represented by the following equation:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝑝−1 𝑋𝑝−1,𝑖 + 𝜀𝑖
where 𝛽0 is a constant term and 𝛽1 to 𝛽𝑝−1 are the coefficients relating the 𝑝 − 1 explanatory variables (𝑋1𝑖 , 𝑋2𝑖 , … , 𝑋𝑝−1,𝑖 ) to the variable of interest, 𝑌𝑖.
Note: The multiple linear regression model has 𝑝 parameters, that is, 𝛽0 to 𝛽𝑝−1.
So, multiple linear regression can be thought of as an extension of simple linear regression, where there are 𝑝 parameters; equivalently, simple linear regression can be thought of as a special case of multiple linear regression, where 𝑝 = 2.
The term ‘linear’ is used because in multiple linear regression we assume that 𝑌 is directly
related to a linear combination of the explanatory variables.
There are many ways of estimating the parameters in a regression model. As we did in simple linear regression, we shall focus attention on the least squares method. There are two ways we can apply the least squares method: estimation by substitution and the matrix approach.
In multiple linear regression, the matrix approach seems more appropriate than estimation by substitution because, as the number of explanatory variables increases, the substitution method becomes complex.
In matrix notation the model is written as 𝒀 = 𝑿𝜷 + 𝜺, where
𝒀 = [𝑦1 , 𝑦2 , … , 𝑦𝑛 ]′ is the 𝑛 × 1 vector of observations on the dependent variable,
𝑿 is the 𝑛 × 𝑝 design matrix whose 𝑖th row is (1, 𝑥1𝑖 , 𝑥2𝑖 , … , 𝑥𝑝−1,𝑖 ),
𝜷 is the vector of parameters, 𝜷 = [𝛽0 , 𝛽1 , … , 𝛽𝑝−1 ]′, and
𝜺 is the error vector, 𝜺 = [𝜀1 , 𝜀2 , … , 𝜀𝑛 ]′.
𝜺 ~ 𝑁(𝟎, 𝜎²𝐈), where 𝐈 is the 𝑛 × 𝑛 identity matrix. The ones on the diagonal of 𝐈 specify that the variance of each 𝜀𝑖 is 𝜎², and the zeros off the diagonal specify that the covariance between different 𝜀𝑖 is zero, implying that the correlations are zero.
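To make this assumption concrete, here is a small simulation sketch in Python that generates data satisfying the model; the sample size and parameter values are made up purely for illustration.

```python
import numpy as np

# Simulate data satisfying Y = X*beta + eps with eps ~ N(0, sigma^2 * I),
# i.e. independent errors with common variance sigma^2.
rng = np.random.default_rng(0)
n, sigma = 50, 2.0
beta = np.array([1.0, 0.5, -0.3])            # beta_0, beta_1, beta_2 (illustrative values)
X = np.column_stack([np.ones(n),
                     rng.uniform(0, 10, n),  # x_1i
                     rng.uniform(0, 10, n)]) # x_2i
eps = rng.normal(0.0, sigma, size=n)         # independent N(0, sigma^2) errors
y = X @ beta + eps
```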
The least squares estimates are obtained by minimising the error sum of squares, which leads to the normal equations
𝑿′𝑿𝜷 = 𝑿′𝒀
Solving this equation for 𝜷 gives the least squares solution 𝜷̂ = [𝛽̂0 , 𝛽̂1 , … , 𝛽̂𝑝−1 ]′. Pre-multiplying both sides by the inverse of 𝑿′𝑿 (assuming it exists), that is, [𝑿′𝑿]⁻¹𝑿′𝑿𝜷 = [𝑿′𝑿]⁻¹𝑿′𝒀, we have
𝜷̂ = [𝑿′𝑿]⁻¹𝑿′𝒀
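As an illustration, this matrix formula can be evaluated in a few lines of Python with NumPy; the function name below is just illustrative.

```python
import numpy as np

def ols_estimates(X, y):
    """Least squares estimates beta_hat = (X'X)^(-1) X'y.

    X : (n, p) design matrix whose first column is all ones.
    y : (n,) response vector.
    Solving the normal equations with np.linalg.solve avoids forming the
    inverse explicitly, which is numerically safer.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```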
Having estimated the regression coefficients, we have to test the significance of each parameter. If the coefficient of a given variable is insignificant, it implies that that variable should be removed from the model. To test 𝐻0 : 𝛽𝑖 = 𝑏 we use the test statistic
𝑡 = (𝛽̂𝑖 − 𝑏)/√𝑉𝑎𝑟(𝛽̂𝑖 ) ~ 𝑡(𝑛 − 𝑝)
Note: In most cases 𝑏 = 0, because we will be testing for the significance of the regression parameters.
(A) For the two-sided alternative 𝐻1 : 𝛽𝑖 ≠ 𝑏, we reject 𝐻0 if
|𝑡| = |(𝛽̂𝑖 − 𝑏)/√𝑉𝑎𝑟(𝛽̂𝑖 )| > 𝑡𝛼/2 (𝑛 − 𝑝)
(B) For the alternative 𝐻1 : 𝛽𝑖 < 𝑏, we reject 𝐻0 if
𝑡 = (𝛽̂𝑖 − 𝑏)/√𝑉𝑎𝑟(𝛽̂𝑖 ) < −𝑡𝛼 (𝑛 − 𝑝)
(C) For the alternative 𝐻1 : 𝛽𝑖 > 𝑏, we reject 𝐻0 if
𝑡 = (𝛽̂𝑖 − 𝑏)/√𝑉𝑎𝑟(𝛽̂𝑖 ) > 𝑡𝛼 (𝑛 − 𝑝)
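A small Python sketch (the function name is illustrative) showing how these t statistics and their two-sided p-values can be computed from 𝑀𝑆𝐸 and the diagonal of [𝑿′𝑿]⁻¹:

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(X, y, b=0.0):
    """t statistic and two-sided p-value for each H0: beta_i = b."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    mse = resid @ resid / (n - p)                      # estimate of sigma^2
    var_beta = mse * np.diag(np.linalg.inv(X.T @ X))   # Var(beta_hat_i)
    t = (beta_hat - b) / np.sqrt(var_beta)
    p_values = 2 * stats.t.sf(np.abs(t), df=n - p)     # two-sided test (case A)
    return t, p_values
```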
Sum of Squares
The total sum of squares can be partitioned into two components, the regression and error sums of squares. In matrix terms they are defined as
𝑆𝑆𝑇 = 𝒀′𝒀 − 𝑛𝑦̄²
𝑆𝑆𝑅 = 𝜷̂′𝑿′𝒀 − 𝑛𝑦̄²
𝑆𝑆𝐸 = 𝒀′𝒀 − 𝜷̂′𝑿′𝒀
where 𝑦̄ is the mean of the observed 𝑦 values. Thus, once 𝜷 has been estimated, the sums of squares can easily be computed.
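These matrix expressions translate directly into a few lines of NumPy; the following is a minimal sketch (function name illustrative):

```python
import numpy as np

def sums_of_squares(X, y):
    """SST, SSR and SSE computed from the matrix expressions above."""
    n = len(y)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sst = y @ y - n * y.mean() ** 2            # Y'Y - n*ybar^2
    ssr = beta_hat @ (X.T @ y) - n * y.mean() ** 2
    sse = y @ y - beta_hat @ (X.T @ y)         # equals sst - ssr
    return sst, ssr, sse
```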
Degrees of freedom
The total sum of squares has 𝑛 − 1 degrees of freedom, the regression sum of squares has 𝑝 − 1 and the error sum of squares has 𝑛 − 𝑝, so that (𝑛 − 1) = (𝑝 − 1) + (𝑛 − 𝑝).
Mean Squares
A sum of squares divided by its degrees of freedom is called a mean square. The two important mean squares are the regression mean square (𝑀𝑆𝑅) and the error mean square (𝑀𝑆𝐸), and these are given by
𝑀𝑆𝑅 = 𝑆𝑆𝑅/(𝑝 − 1)
𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 𝑝)
F-ratio
The ratio of the two mean squares is used to test the overall significance of the regression:
𝐹 = 𝑀𝑆𝑅/𝑀𝑆𝐸 ~ 𝐹(𝑝 − 1, 𝑛 − 𝑝)
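A minimal Python sketch of the overall F test, using SSR and SSE from the previous sketch (function name illustrative):

```python
from scipy import stats

def overall_f_test(ssr, sse, n, p):
    """F ratio and p-value for H0: beta_1 = ... = beta_{p-1} = 0."""
    msr = ssr / (p - 1)
    mse = sse / (n - p)
    f = msr / mse
    p_value = stats.f.sf(f, p - 1, n - p)   # upper-tail probability of F(p-1, n-p)
    return f, p_value
```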
Table 4.1: ANOVA table (Multiple regression)
Source of variation   Sum of squares   Degrees of freedom   Mean square   F
Regression            𝑆𝑆𝑅              𝑝 − 1                𝑀𝑆𝑅           𝑀𝑆𝑅/𝑀𝑆𝐸
Error                 𝑆𝑆𝐸              𝑛 − 𝑝                𝑀𝑆𝐸
Total                 𝑆𝑆𝑇              𝑛 − 1
To test for the significance of the regression using ANOVA, our hypotheses are of the form
𝐻0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑝−1 = 0 versus 𝐻1 : at least one 𝛽𝑗 ≠ 0.
Failing to reject 𝐻0 implies that there is no regression relationship between the response variable 𝑌 and the 𝑝 − 1 explanatory variables. If 𝐻0 is rejected, it implies that there is a regression relationship. However, we should then go on to test the significance of each parameter to find out which variable(s) led to the rejection of the null hypothesis.
Example 4.1 In a small-scale regression study, the following data were obtained:
X1i 7 4 16 3 21 8
X2i 33 41 7 49 5 31
Yi 42 33 75 28 91 55
(a) Express the regression model in matrix form, defining all the terms
(b) Find the least squares estimates of 𝜷, given that
[𝑿′𝑿]⁻¹ = [ 34.5785574  −1.65089268  −0.65704022
            −1.6508927    0.08030796   0.03112763
            −0.6570402    0.03112763   0.01268501 ]
(c) Construct the ANOVA table and test for the significance of the regression line using
α=0.05.
(d) Test the hypothesis H0:β2=0 versus H1: 𝛽2 ≠ 0 at α=0.05.
(e) Find the 95% confidence interval for the intercept and test whether it is significant.
Solution
(a) 𝒀 = 𝑿𝜷 + 𝜺, where
𝒀 = [42, 33, 75, 28, 91, 55]′,
𝑿 is the 6 × 3 matrix whose rows are (1, 7, 33), (1, 4, 41), (1, 16, 7), (1, 3, 49), (1, 21, 5) and (1, 8, 31),
𝜷 = [𝛽0 , 𝛽1 , 𝛽2 ]′ and 𝜺 = [𝜀1 , 𝜀2 , … , 𝜀6 ]′.
(b) 𝜷̂ = [𝑿′𝑿]⁻¹𝑿′𝒀, with 𝑿′𝒀 = [324, 4061, 6796]′.
Thus,
𝜷̂ = [ 34.5785574   −1.65089268   −0.65704022 ] [ 324  ]   [ 33.9321 ]
     [ −1.6508927     0.08030796    0.03112763 ] [ 4061 ] = [  2.7848 ]
     [ −0.6570402     0.03112763    0.01268501 ] [ 6796 ]   [ −0.2644 ]
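As a quick check of this hand calculation, the estimates can be reproduced with NumPy:

```python
import numpy as np

# Data from Example 4.1
x1 = np.array([7, 4, 16, 3, 21, 8])
x2 = np.array([33, 41, 7, 49, 5, 31])
y = np.array([42, 33, 75, 28, 91, 55])

X = np.column_stack([np.ones(6), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 4))   # approximately [33.9321  2.7848 -0.2644]
```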
(c) 𝑆𝑆𝑇 = 𝒀′𝒀 − 𝑛𝑦̄² = 20568 − 6(54)² = 3072
𝑆𝑆𝑅 = 𝜷̂′𝑿′𝒀 − 𝑛𝑦̄² = [33.9321  2.7848  −0.2644][324, 4061, 6796]′ − 6(54)² = 3010.2108
𝑆𝑆𝐸 = 𝑆𝑆𝑇 − 𝑆𝑆𝑅 = 61.7892
𝑀𝑆𝑅 = 𝑆𝑆𝑅/(𝑝 − 1) = 3010.2108/2 = 1505.1054
𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 𝑝) = 61.7892/3 = 20.5964
𝐹 = 𝑀𝑆𝑅/𝑀𝑆𝐸 = 73.0761
Table 4.2: ANOVA table
Source of variation   Sum of squares   Degrees of freedom   Mean square   F
Regression            3010.2108        2                    1505.1054     73.0761
Error                 61.7892          3                    20.5964
Total                 3072             5
𝐻0 : 𝛽1 = 𝛽2 = 0 versus 𝐻1 : at least one of 𝛽1 , 𝛽2 is non-zero.
Conclusion: Since 𝐹 = 73.0761 > 𝐹0.05 (2, 3) = 9.55, we reject 𝐻0 and conclude that the regression relationship is significant.
(d) 𝐻0 : 𝛽2 = 0 versus 𝐻1 : 𝛽2 ≠ 0
Test statistic: 𝑡 = 𝛽̂2 /√𝑉𝑎𝑟(𝛽̂2 ) ~ 𝑡(𝑛 − 𝑝), where 𝑉𝑎𝑟(𝛽̂2 ) = 𝑀𝑆𝐸 × 0.01268501 = 0.2613 (𝑀𝑆𝐸 times the last diagonal element of [𝑿′𝑿]⁻¹).
Thus |𝑡| = |−0.2644/√0.2613| ≈ 0.52.
Conclusion: Since |𝑡| < 𝑡0.025 (3) = 3.182, we fail to reject 𝐻0 and conclude that the variable 𝑋2𝑖 is insignificant, that is, it should be removed from the model.
(e) The 95% confidence interval for the intercept is 𝛽̂0 ± 𝑡0.025 (3)√𝑉𝑎𝑟(𝛽̂0 ), where 𝑉𝑎𝑟(𝛽̂0 ) = 𝑀𝑆𝐸 × 34.5785574 = 712.19. This gives 33.9321 ± 3.182 × 26.69, that is, approximately (−50.99, 118.85). Since this interval contains zero, the intercept is not significantly different from zero at the 5% level.
Activity 4.1
Consider the following data set, where 𝑌 is the dependent variable and 𝑋1𝑖 and 𝑋2𝑖 are the
regressors.
Suppose the data can be described by the model 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝜀𝑖 , where
𝜀𝑖 ~𝑁(0, 𝜎 2 ) and 𝐶𝑜𝑣(𝜀𝑖 , 𝜀𝑗 ) = 0 for 𝑖 ≠ 𝑗.
Coefficient of Determination
The coefficient of multiple determination, 𝑅², measures the proportion of the total variation in 𝑌 that is explained by the regression:
𝑅² = 𝑆𝑆𝑅/𝑆𝑆𝑇 = 1 − 𝑆𝑆𝐸/𝑆𝑆𝑇
As we add more and more variables to the model (even irrelevant ones), 𝑅² never decreases and is pushed towards 1. The adjusted 𝑅² tries to take this into account by replacing sums of squares with mean squares:
𝑅²(𝑎𝑑𝑗) = 1 − (𝑆𝑆𝐸⁄(𝑛 − 𝑝)) / (𝑆𝑆𝑇⁄(𝑛 − 1))
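A short Python sketch of these two formulas (function name illustrative):

```python
def r_squared(sse, sst, n, p):
    """Ordinary and adjusted R^2 from the formulas above."""
    r2 = 1 - sse / sst
    r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
    return r2, r2_adj
```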
Example 4.2 Referring to Example 4.1, calculate 𝑅² and 𝑅²(𝑎𝑑𝑗) and comment.
Solution: 𝑅² = 𝑆𝑆𝑅/𝑆𝑆𝑇 = 0.9799 and 𝑅²(𝑎𝑑𝑗) = 1 − (𝑆𝑆𝐸⁄(𝑛 − 𝑝))/(𝑆𝑆𝑇⁄(𝑛 − 1)) = 0.9665.
Comment: The two values 𝑅² and 𝑅²(𝑎𝑑𝑗) are almost the same and are quite high, indicating that the model fits the data well.
In Example 4.1 we dealt with two explanatory variables. As the number of explanatory variables increases, the estimation of parameters becomes complex, so we would want to use statistical software to estimate the parameters rather than doing it manually (a brief sketch of how the same fit can be obtained with other software is given below). The following example illustrates how we use SPSS for multiple linear regression.
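Although SPSS is used in this chapter, the same model can be fitted with other statistical software. As a minimal sketch, the model of Example 4.1 fitted in Python with the statsmodels package (assuming it is installed) is:

```python
import numpy as np
import statsmodels.api as sm

# Data from Example 4.1
x1 = np.array([7, 4, 16, 3, 21, 8])
x2 = np.array([33, 41, 7, 49, 5, 31])
y = np.array([42, 33, 75, 28, 91, 55])

X = sm.add_constant(np.column_stack([x1, x2]))   # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, t tests, R^2 and the overall F statistic
```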
Example 4.3: An auctioneer of rugs kept records of his weekly auctions in order to determine
the relationships among price, age of carpet or rug, number of people attending the auction, and
number of times the winning bidder had previously attended his auctions. He felt that, with this
information, he could plan his auctions better, serve his steady customers better and make a
higher overall profit for himself. The results shown in the table below were obtained.
To perform a multiple linear regression analysis, go to Analyze > Regression > Linear.
We will be presented with a dialog box:
Choose the dependent and independent (explanatory) variables you require; in this case price is the dependent variable and age, audience size and previous attendance are the independent variables. The default 'Enter' method puts all the explanatory variables you specify into the model, in the order that you specify them. Note that the order is unimportant in terms of the modeling process. There are other methods available for model building based on statistical significance, such as backward elimination or forward selection, but when building the model on a substantive basis the Enter method is best: variables are included in the regression equation regardless of whether or not they are statistically significant. Having chosen the variables, we have the following dialog box:
Then we have the following output if we press 'OK' in the above dialog box.
[SPSS output for the full model:
Variables Entered/Removed — Model 1: Previous attendance, Audience size, Age entered; Method: Enter.
Model Summary.
ANOVA — Total sum of squares 5724333.333 on 23 degrees of freedom.
Coefficients — unstandardized and standardized coefficients for each term.]
The first table confirms that price is the dependent variable and age, audience size and previous attendance are the independent variables.
The second table, the model summary, shows that we have explained about 92.4% of the variation in price with the three explanatory variables, so our model is quite good.
The third table, ANOVA, indicates that the model is highly significant, since 𝑝-value = 0.00 < 0.05. The table of coefficients shows us that not all parameters are significant: the constant term, 𝛽0, is not significant, and the coefficient of previous attendance is not significant. This basically means that previous attendance should be removed as an explanatory variable. However, the other two explanatory variables, age and audience size, are significant. The following output is obtained if age and audience size are our independent variables:
[SPSS output for the reduced model:
Variables Entered/Removed — Model 1: Audience size, Age entered; Method: Enter.
Model Summary.
ANOVA — Total sum of squares 5724333.333 on 23 degrees of freedom.
Coefficients — unstandardized and standardized coefficients for each term.]
Using the ANOVA table, our model is significant. Also, from the coefficients table above, all the 𝑝-values are less than 0.05 (the default level of significance), implying that all the parameters are significant; hence the fitted model is the one given by the unstandardized coefficients in the table above.
Activity 4.2
In an effort to model annual company executive salaries for the year 2010, thirty-three firms
were selected and data were gathered on salaries, sales, profits and employment. The following
table shows the data:
Firm   Annual salary (thousands)   Sales (thousands)   Profits (thousands)   Employment
1 45 460.6 128.1 480
2 38.7 925.5 783.9 559
3 36.8 152.6 80.2 137
4 27.7 168.3 79.0 277
5 67.6 752.8 231.5 340
6 45.4 205.8 129.5 265
7 50.7 384.6 281.8 308
8 49.6 746.0 237.9 410
9 48.7 434.0 222.3 259
10 38.3 470.6 63.7 860
11 31.1 508 149.5 210
12 27.1 464.4 62.0 680
13 52.4 329.3 277.3 390
14 49.8 377.5 250.7 343
15 84.3 1174.3 820.6 940
16 34.3 174.3 82.6 194
17 32.4 724.7 190.8 400
18 22.5 178.9 63.3 56
19 25.4 66.8 42.8 139
20 20.8 191 48.5 106
21 51.8 933.1 310.6 392
22 40.6 613.2 491.6 400
23 33.2 457.8 228.0 96
24 34.0 545.3 254.6 78
25 69.8 2286.2 1011.3 571
26 30.6 361.0 203.1 52
27 61.3 614.1 201.0 500
28 30.2 101.3 81.3 47
29 54.0 560.3 194.6 300
30 29.3 855.7 260.3 123
31 52.8 421.6 352.1 180
32 45.6 544.04 455.2 177
33 41.7 229.9 97.5 146
Using SPSS, fit an appropriate multiple linear regression model to the data.