Chapter 3 - Multiple Linear Regression
1. Introduction
A regression model that involves more than one regressor variable is called a multiple regression
model.
In multiple regression, the mean of the response variable is a function of two or more explanatory
variables.
In Chapter 2 we examined the relationship between HSGPA and College GPA. There are some
other possible factors that may be related to College GPA, such as ACT Scores, Rank in high school
class, etc.
A multiple linear regression model could relate College GPA to all of these predictors simultaneously.
In general, the multiple linear regression model with k regressors or predictor variables is:
y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
where
1. y is the response variable that we want to predict.
2. x1, x2, ..., xk are the k predictor (regressor) variables.
3. β0, β1, β2, ..., βk are unknown parameters.
4. β0 is the intercept – the average value of Y when X1, X2, ..., Xk are all zero.
5. β1, ..., βk are called the (partial) regression coefficients: βj represents the expected change in
   the response y per unit change in xj when all the remaining regressor variables xi (i ≠ j) are
   held constant. For this reason, the parameters βj, j = 1, 2, ..., k, are often called partial
   regression coefficients.
6. ε is the random error.
Models with complex structure may often still be analyzed by multiple linear regression techniques.
For example:
y = β0 + β1x + β2x² + β3x³ + ε   can be written as   y = β0 + β1x1 + β2x2 + β3x3 + ε
where x1 = x, x2 = x², x3 = x³.
In matrix notation, the model for n observations is y = Xβ + ε, where

y = (y1, y2, ..., yn)' is the n × 1 vector of responses,

X is the n × p matrix (p = k + 1) whose ith row is (1, xi1, xi2, ..., xik),

β = (β0, β1, ..., βk)' is the p × 1 vector of parameters, and

ε = (ε1, ε2, ..., εn)' is the n × 1 vector of random errors.
The method of least squares can be used to estimate the regression coefficients βj. Suppose that
n > k observations are available, with sample regression model

yi = β0 + β1xi1 + β2xi2 + ⋯ + βkxik + εi = β0 + Σj βj xij + εi ,   i = 1, ..., n .

The least squares function is

S(β0, β1, ..., βk) = Σi εi² = Σi (yi − β0 − Σj βj xij)² ,

or, in matrix notation,

S(β) = (y − Xβ)'(y − Xβ) = y'y − β'X'y − y'Xβ + β'X'Xβ = y'y − 2β'X'y + β'X'Xβ .

Minimizing S(β) gives the least squares estimator

β̂ = (X'X)⁻¹X'y ,

provided that the inverse matrix (X'X)⁻¹ exists. (X'X)⁻¹ will always exist if the regressors
are linearly independent, that is, if none of the columns of the X matrix is a linear combination of
the other columns.
The vector of fitted values is

ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy ,

where H = X(X'X)⁻¹X' is an n × n matrix called the hat matrix. It maps the vector of
observed values into the vector of fitted values. The hat matrix and its properties play a central role
in regression analysis.
LSE Properties:
1. E(β̂) = E[(X'X)⁻¹X'y] = β, i.e. β̂ is unbiased.
2. β̂ = (X'X)⁻¹X'y is the best linear unbiased estimator (BLUE) of β.
3. cov(β̂) = E[(β̂ − E(β̂))(β̂ − E(β̂))'] = var(β̂) = var[(X'X)⁻¹X'y] = σ²(X'X)⁻¹ ,
   which is a p × p symmetric matrix whose jth diagonal element is the variance of β̂j and whose
   (i, j)th off-diagonal element is the covariance between β̂i and β̂j.
Residuals
The difference between the observed value yi and the corresponding fitted value ŷi is the residual
ei = yi − ŷi. The n residuals may be conveniently written in matrix notation as

e = y − ŷ = y − Xβ̂ = y − Hy = (I − H)y .
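As an illustration of these matrix formulas, the following sketch (not part of the original example) computes β̂, the hat matrix, the fitted values and the residuals for any given X (with a column of ones) and y:

# Minimal sketch: least squares via matrix algebra, assuming X'X is invertible
ls.by.matrix <- function(X, y) {
  XtX.inv <- solve(t(X) %*% X)           # (X'X)^(-1)
  Beta    <- XtX.inv %*% t(X) %*% y      # Beta-hat = (X'X)^(-1) X'y
  H       <- X %*% XtX.inv %*% t(X)      # hat matrix H = X (X'X)^(-1) X'
  yhat    <- H %*% y                     # fitted values y-hat = H y
  e       <- (diag(nrow(X)) - H) %*% y   # residuals e = (I - H) y
  list(Beta = Beta, fitted = yhat, residuals = e, H = H)
}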
Example 3.1:
In a small-scale experimental study of the relation between degree of brand liking (𝑌) and moisture
content (𝑋 ) and sweetness (𝑋 ) of the product, the following results were obtained from the
experiment based on a completely randomized design. Fit a multiple linear regression model
relating the brand liking to the content and sweetness of the product.
i      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
xi1    4   4   4   4   6   6   6   6   8   8   8   8  10  10  10  10
xi2    2   4   2   4   2   4   2   4   2   4   2   4   2   4   2   4
yi    64  73  61  76  72  80  71  83  83  89  86  93  88  95  94 100
Solution:
X is the 16 × 3 matrix whose ith row is (1, xi1, xi2), and y is the 16 × 1 vector of responses:

X =
  [ 1   4  2 ]
  [ 1   4  4 ]
  [ ⋮   ⋮  ⋮ ]
  [ 1  10  4 ] ,     y = (64, 73, ..., 100)' .

X'X =
  [  16  112   48 ]
  [ 112  864  336 ]
  [  48  336  160 ] ,     X'y = (1308, 9510, 3994)' .

β̂ = (X'X)⁻¹ X'y
  = [  1.2375  −0.0875  −0.1875 ] [ 1308 ]   [ 37.65 ]
    [ −0.0875   0.0125   0      ] [ 9510 ] = [ 4.425 ]
    [ −0.1875   0        0.0625 ] [ 3994 ]   [ 4.375 ]

The fitted regression equation is ŷ = 37.65 + 4.425x1 + 4.375x2.
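The same results can be obtained in R with matrix algebra. The sketch below enters the data from the table above; the vector names content, sweetness and B.liking match the lm() call shown later, and the objects Y, XPY and Beta are reused in Example 3.3.

# Enter the Example 3.1 data (moisture content, sweetness, brand liking)
content   <- c(4,4,4,4, 6,6,6,6, 8,8,8,8, 10,10,10,10)
sweetness <- c(2,4,2,4, 2,4,2,4, 2,4,2,4, 2,4,2,4)
B.liking  <- c(64,73,61,76, 72,80,71,83, 83,89,86,93, 88,95,94,100)

# Build X and y, then compute X'X, X'y and Beta-hat = (X'X)^(-1) X'y
X    <- cbind(1, content, sweetness)
Y    <- matrix(B.liking, ncol = 1)
XPX  <- t(X) %*% X        # reproduces the 3 x 3 matrix X'X above
XPY  <- t(X) %*% Y        # (1308, 9510, 3994)'
Beta <- solve(XPX) %*% XPY
Beta                      # 37.65, 4.425, 4.375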
Output:
Call:
lm(formula = B.liking ~ content + sweetness)

Coefficients:
(Intercept)      content    sweetness
     37.650        4.425        4.375

(The three printed coefficients are β̂0, β̂1 and β̂2, respectively.)
(Delivery-time data, from Ex. 2.2: n = 25 observations on delivery time (Delivery), number of
cases (NoCases) and distance (Distance); the data are read from the file C02EX2.2Delivery.txt below.)
R-Codes
#read data from file
setwd("E:/… ")
Ex2.2delivery.dat <- read.table(file = "C02EX2.2Delivery.txt", header = TRUE)
#Fit the model (the Delivery.Reg object is used in the output and code below)
Delivery.Reg <- lm(Delivery ~ NoCases + Distance, data = Ex2.2delivery.dat)
summary(Delivery.Reg)
Output:
Call:
lm(formula = Delivery ~ NoCases + Distance, data = Ex2.2delivery.dat)

Residuals:
    Min      1Q  Median      3Q     Max
-5.7880 -0.6629  0.4364  1.1566  7.4197

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.341231   1.096730   2.135 0.044170 *
NoCases      1.615907   0.170735   9.464 3.25e-09 ***
Distance     0.014385   0.003613   3.981 0.000631 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 22 degrees of freedom
Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16

(Annotations: the three estimates are β̂0, β̂1 and β̂2; MSE = 3.259², and the error degrees of
freedom are df = n − p = 22.)
Solution:
The multiple regression model is

y = β0 + β1x1 + β2x2 + ε   or   E(Y) = β0 + β1X1 + β2X2 .

a. β̂1 = 1.6159: the delivery time is expected to increase by 1.62 minutes for each additional
   case (X1) when the distance (X2) is held constant.

b. For x1 = 15 cases and x2 = 800 ft,

   ŷ = β̂0 + β̂1(15) + β̂2(800) = 2.3412 + 1.6159(15) + 0.0144(800) ≈ 38.1 minutes.
Consider the data shown in the table below. These data were generated from the equation

y = 8 − 5x1 + 12x2 .
y X1 X2
10 2 1
17 3 2
48 4 5
27 1 2
55 5 6
26 6 4
9 7 3
16 8 4
The matrix of scatterplots is shown in the figure. The y-versus-x1 plot does not exhibit any apparent
relationship between the two variables. The y-versus-x2 plot indicates that a linear relationship
exists, with a slope of approximately 8. Note that both scatter diagrams convey erroneous
information.
This example illustrates that constructing scatter diagrams of y versus xj (j = 1, ..., k) can be
misleading, even in the case of only two regressors operating in a perfectly additive fashion with
no noise.
Estimation of 𝝈𝟐
As in simple linear regression, we may develop an estimator of σ² from the residual sum of squares

SS_Res = Σ(yi − ŷi)² = Σ ei² = e'e = (y − Xβ̂)'(y − Xβ̂) .

Substituting e = y − Xβ̂ and simplifying,

SS_Res = y'y − β̂'X'y .

This residual sum of squares has n − k − 1 = n − p degrees of freedom associated with it, since
k + 1 = p parameters are estimated in the regression model. The estimator of σ² is therefore
MS_Res = SS_Res / (n − p).
Example 3.3:
Estimate the error variance σ² for Example 3.1 (brand liking of product).
Solution:
From Example 3.1,

X'y = (1308, 9510, 3994)'   and   β̂ = (X'X)⁻¹X'y = (37.65, 4.425, 4.375)' ,

so the fitted equation is ŷ = 37.65 + 4.425x1 + 4.375x2.

y'y = [64 73 ⋯ 100] (64, 73, ..., 100)' = 108896

β̂'X'y = [37.65 4.425 4.375] (1308, 9510, 3994)' = 108801.7

MS_Res = (y'y − β̂'X'y) / (n − k − 1) = (108896 − 108801.7) / (16 − 2 − 1) = 7.2538 .
# To compute y'y (Y, Beta and XPY as constructed for Example 3.1)
YPY <- t(Y) %*% Y
YPY
     [,1]
[1,] 108896

# To compute Beta'X'y
BPXPY <- t(Beta) %*% XPY
BPXPY
         [,1]
[1,] 108801.7

# To compute MSE; 13 = n - k - 1 error degrees of freedom
MSE <- (YPY - BPXPY)/13
MSE
         [,1]
[1,] 7.253846
R-Codes
#Use lm( ) function to fit a linear regression
Brand.Reg <- lm(formula = B.liking~content+sweetness)
summary(Brand.Reg)
Output:
Call:
lm(formula = B.liking ~ content + sweetness)
Residuals:
Min 1Q Median 3Q Max
-4.400 -1.762 0.025 1.587 4.200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.6500 2.9961 12.566 1.20e-08 ***
content 4.4250 0.3011 14.695 1.78e-09 ***
sweetness 4.3750 0.6733 6.498 2.01e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.693 on 13 degrees of freedom     (so MSE = 2.693² = 7.2522)
Multiple R-squared: 0.9521, Adjusted R-squared: 0.9447
F-statistic: 129.1 on 2 and 13 DF, p-value: 2.658e-09
Once we have estimated the parameters in the model, we face two immediate questions:
1. Is the overall model adequate, i.e. is there a linear relationship between the response and at
   least one of the regressors?
2. Which individual regressors contribute significantly to the model?
Several hypothesis testing procedures prove useful for addressing these questions. The formal tests
require that our random errors be NID(0, σ²).
The test procedure is a generalization of the analysis of variance used in simple linear regression.
SS_T = Σ(yi − ȳ)² = y'y − (1/n) y'Jy   is a measure of how well the simple predictor ȳ does.

Computational formula:  SS_T = y'y − (Σ yi)²/n .

SS_E = e'e = Σ(yi − ŷi)²
     = (y − Xβ̂)'(y − Xβ̂)
     = y'y − β̂'X'y   is a measure of how well ŷ does.

SS_R = SS_T − SS_E
     = Σ(yi − ȳ)² − Σ(yi − ŷi)²
     = (y'y − (1/n) y'Jy) − (y'y − β̂'X'y)
     = β̂'X'y − (1/n) y'Jy   is the amount "gained" by doing the regression.

Computational formula:  SS_R = β̂'X'y − (Σ yi)²/n .
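These computational formulas translate directly into R. A minimal sketch follows, reusing the Y, Beta and XPY objects constructed for Example 3.1 above:

# Sums of squares from the computational formulas
n   <- length(Y)
SST <- t(Y) %*% Y - sum(Y)^2 / n         # SS_T = y'y - (sum y)^2 / n
SSR <- t(Beta) %*% XPY - sum(Y)^2 / n    # SS_R = Beta' X'y - (sum y)^2 / n
SSE <- SST - SSR                         # SS_E = SS_T - SS_R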
Note that:

SS_R/σ² ~ χ²(k) ;   SS_E/σ² ~ χ²(n−k−1) ;   F0 = (SS_R/k) / (SS_E/(n−k−1)) = MS_R/MS_E ~ F(k, n−k−1) .

ANOVA table:

Source        df         SS       MS                      F0
Regression    k          SS_R     MS_R = SS_R/k           MS_R/MS_E
Residual      n − k − 1  SS_E     MS_E = SS_E/(n−k−1)
Total         n − 1      SS_T
Notes:
1) SS_R has k degrees of freedom. (Note: when there is only one independent variable, the degrees
   of freedom are 1.)
2) SS_E has n − k − 1 degrees of freedom.
3) Reject H0 if F0 > F(α; k, n−k−1).
Is the regression equation that uses the information provided by the predictor variables x1, x2, ..., xk
substantially better than the simple predictor ȳ, which does not rely on any of the X-values?
The test for significance of regression is a test to determine if there is a linear relationship
between the response y and any of the regressor variables x1, x2, ..., xk. This procedure is often
thought of as an overall or global test of model adequacy.

H0: β1 = β2 = ⋯ = βk = 0
H1: βj ≠ 0 for at least one j

Test statistic:

F0 = MS_R / MS_E ~ F(k, n−k−1) ;   Reject H0 when F0 > F(α; k, n−k−1) .
R-Codes:
#Use lm( ) function to obtain the ANOVA table
anv.delivery <- anova(Delivery.Reg)
anv.delivery
SSR <- sum(anv.delivery$"Sum Sq"[1:2])
SSR
Output:
Analysis of Variance Table
Response: Delivery
SSR
[1] 5550.811
Solution:
H0: β1 = β2 = 0
H1: βj ≠ 0 for at least one j

Test statistic:

F0 = MS_R / MS_E = (5550.8/2) / 10.6 = 261.83 ~ F(2, 22) .

Decision rule: Reject H0 when F0 > F(0.01; 2, 22).
With α = 0.01, F(0.01; 2, 22) ≈ 5.72, and F0 = 261.83 > F(0.01; 2, 22), so we reject H0.
Solution:
SS_T = y'y − (Σ yi)²/n ,   SS_R = β̂'X'y − (Σ yi)²/n ,   SS_E = SS_T − SS_R ,

y'y = [64 73 ⋯ 100] (64, 73, ..., 100)' = 108896 ,   (Σ yi)²/n = (1308)²/16 = 106929 ,

β̂'X'y = [37.65 4.425 4.375] (1308, 9510, 3994)' = 108801.7 ,

SS_T = 108896 − 106929 = 1967 ,   SS_R = 108801.7 − 106929 = 1872.7 ,   SS_E = 94.3 ;

F0 = MS_R / MS_E = (1872.7/2) / (94.3/13) = 936.35 / 7.2538 = 129.08 .

F(0.05; 2, 13) = 3.8056 ;   F0 = 129.08 > F(0.05; 2, 13) .

Reject H0: we have sufficient evidence to conclude that the model using content (X1) and sweetness
(X2) of the product as predictor variables is useful for estimating brand liking.
R-Codes:
# To compute the ANOVA table
anv.Brand <- anova(Brand.Reg)
anv.Brand

Output:
Analysis of Variance Table
Response: B.liking
Once we have determined that the model is useful for predicting 𝑌, we should explore the nature of
the “usefulness” in more detail. Do all the predictor variables add important information for
prediction in the presence of other predictors already in the model?
To test whether an individual regression coefficient is zero, the hypotheses are H0: βj = 0 versus
H1: βj ≠ 0.

Test statistic:

t0 = β̂j / se(β̂j) = β̂j / sqrt(σ̂² Cjj) ~ t(n−k−1) ;   Reject H0 when |t0| > t(α/2; n−k−1) ,

where Cjj is the jth diagonal element of (X'X)⁻¹.
If H0 is not rejected, this indicates that the regressor xj does not contribute significantly to
the model; in other words, xj can be deleted from the model.
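In R, the pieces of this t-test can be read from the coefficient table returned by summary(). A small sketch using the Delivery.Reg fit from the earlier example (row and column names are those printed by R):

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef.tab <- summary(Delivery.Reg)$coefficients
coef.tab["NoCases", ]     # beta1-hat, se(beta1-hat), t statistic, two-sided p-value

# Compare |t| with the critical value t(alpha/2; n - k - 1)
alpha <- 0.01
abs(coef.tab["NoCases", "t value"]) > qt(1 - alpha/2, df = Delivery.Reg$df.residual)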
Is number of cases (X1) significantly related to delivery time in the model, given that distance (X2)
is already in the model? Use α = 0.01 for the hypothesis test.
(i.e. Should cases be used in the model (with distance) to estimate delivery time?)
Solution:
H0: β1 = 0
H1: β1 ≠ 0

t0 = β̂1 / se(β̂1) = 1.615907 / 0.170735 = 9.4644 ~ t(22) ;   t(0.005; 22) = 2.819 .

Since |t0| = 9.4644 > 2.819, reject H0.
Output:
Call:
lm(formula = Delivery ~ NoCases + Distance, data = Ex2.2delivery.dat)
Residuals:
Min 1Q Median 3Q Max
-5.7880 -0.6629 0.4364 1.1566 7.4197
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.341231   1.096730   2.135 0.044170 *
NoCases      1.615907   0.170735   9.464 3.25e-09 ***
Distance     0.014385   0.003613   3.981 0.000631 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 22 degrees of freedom
Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16

(In the NoCases row: Estimate = β̂1, Std. Error = se(β̂1), t value is the test statistic for
H0: β1 = 0, and Pr(>|t|) is its p-value.)
Notes:
Since number of cases and distance each have a p-value < 0.01 = α, both variables should be used in
the model.
Example 3.6.1:
Call:
lm(formula = Price ~ Area + HValue + LValue, data = Home.dat)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1470.2759 5746.3246 0.256 0.80132
Area 13.5286 6.5857 2.054 0.05666 .
HValue 0.8204 0.2112 3.885 0.00131 **
LValue 0.8145 0.5122 1.590 0.13137
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This is a test to investigate the contribution of a subset of the regressor variables to the model.
Let the vector of regression coefficients be partitioned into two groups:

β = (β1', β2')'   where   β1 = (β0, β1, ..., β(r−1))'   and   β2 = (βr, β(r+1), ..., βk)' .

The regression sum of squares due to β2, given that β1 is already in the model, is

SS_R(β2 | β1) = SS_E(Reduced) − SS_E(Full) = SS_E(β1) − SS_E(β) = SS_R(Full) − SS_R(Reduced) ,

with (k − r + 1) degrees of freedom.
This is called the "extra sum of squares" because it measures the increase in the regression sum of
squares that results from adding the regressors Xr, ..., Xk to a model that already contains
X1, ..., X(r−1).

The hypotheses are H0: β2 = 0 versus H1: β2 ≠ 0.

Test statistic:

F0 = [SS_R(β2 | β1) / (k − r + 1)] / MS_E(Full) ~ F(k−r+1, n−k−1) ;
Reject H0 when F0 > F(α; k−r+1, n−k−1) .
Remark:
The partial F-test on a single regressor Xj is equivalent to the t-test on βj.
R-Codes:
# Read data from file
setwd("E:/… ")
Real.dat <-read.table("C02EX2.7RealEs.txt", header=TRUE)
#Use lm( ) function to fit the reduced and full models
# (the reduced model keeps only the significant variables)
PropertyR.Reg <- lm(Price~Area+Bath, data=Real.dat)
PropertyF.Reg <- lm(Price~Area+Bath+Floor+Bedroom, data=Real.dat)
# To obtain ANOVA Tables
ANVR <- anova(PropertyR.Reg)
ANVR
ANVF <- anova(PropertyF.Reg)
ANVF
Output:
Analysis of Variance Table
SSR
Response: Price
Df Sum Sq Mean Sq F value Pr(>F)
Area 1 14829.3 14829.3 221.870 4.21e-09 ***
Bath 1 750.8 750.8 11.233 0.005763 **
Residuals 12 802.1 66.8
---
Response: Price
F0 = [ (SS_R(Full) − SS_R(Reduced)) / 2 ] / MS_E(Full) = 3.55 ,   F(0.05; 2, 10) = 4.10 .

Since F0 = 3.55 < 4.10, do not reject H0; Floor and Bedroom do not contribute significantly once
Area and Bath are already in the model.
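A more direct way to carry out the same partial F test in R is to pass the reduced and full fits to anova() together; a short sketch with the PropertyR.Reg and PropertyF.Reg objects defined above:

# Partial F test for Floor and Bedroom, given Area and Bath:
# anova(reduced, full) reports the extra sum of squares, its df, F and p-value
anova(PropertyR.Reg, PropertyF.Reg)

# Critical value F(0.05; 2, 10) for comparison
qf(0.95, df1 = 2, df2 = 10)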
Confidence intervals on individual regression coefficients and confidence intervals on the mean
response given specific levels of the regressors play the same important role in multiple regression
that they do in simple linear regression.
R Output
Call:
lm(formula = Delivery ~ NoCases + Distance, data = Ex2.2delivery.dat)
Residuals:
Min 1Q Median 3Q Max
-5.7880 -0.6629 0.4364 1.1566 7.4197
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.341231 1.096730 2.135 0.044170 *
NoCases 1.615907 0.170735 9.464 3.25e-09 ***
Distance 0.014385 0.003613 3.981 0.000631 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Solution:
We have found that β̂1 = 1.6159, se(β̂1) = 0.1707, and t(0.025; 22) = 2.074, so the 95% CI is

β̂1 ± t(0.025; 22) se(β̂1) = 1.6159 ± 2.074(0.1707) = (1.26, 1.97) .

With 95% confidence, we estimate that the change in the mean delivery time when the number of cases
increases by one unit, holding distance constant, is somewhere between 1.26 minutes and 1.97 minutes.
R-Codes for CI
# Compute CIs for the regression parameters
confint(Delivery.Reg, level=0.95)
Output:
2.5 % 97.5 %
(Intercept) 0.066751987 4.61571030
NoCases 1.261824662 1.96998976
Distance 0.006891745 0.02187791
Recall that the least squares line yields the same value for both the estimate of the mean E(Y_h) and
the prediction of some future value y_h. The confidence interval for the mean E(Y_h) is narrower than
the prediction interval for y_h because of the additional uncertainty attributable to the random error
ε when predicting a future value of y_h.

Note:

se(ŷ_h) = sqrt( MS_E (1 + x_h'(X'X)⁻¹x_h) ) = sqrt( MS_E + MS_E x_h'(X'X)⁻¹x_h ) = sqrt( MS_E + (se[Ê(Y_h)])² ) .

A 95% confidence level means that 95% of such intervals, constructed over repeated samples, will
contain the true mean delivery time.
R-Codes
#read data from file
setwd("E:/… ")
Ex2.2delivery.dat<-read.table(file = "C02EX2.2Delivery.txt", header=TRUE)
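The call that produced the CIM output below is not shown in the handout; a minimal reconstruction follows, assuming the point of interest is x1 = 8 cases and x2 = 275 ft (these assumed values reproduce the fitted value 19.224 printed in the output):

# Fit the model and construct a 95% CI for the mean delivery time at the
# assumed point x_h: NoCases = 8, Distance = 275
Delivery.Reg <- lm(Delivery ~ NoCases + Distance, data = Ex2.2delivery.dat)
ND  <- data.frame(NoCases = 8, Distance = 275)
CIM <- predict(object = Delivery.Reg, newdata = ND, se.fit = TRUE,
               interval = c("confidence"), level = 0.95)
CIM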
Output:
> CIM
$fit
       fit      lwr      upr
[1,] 19.22432 17.6539 20.79474

$se.fit
[1] 0.7572407

$df
[1] 22

$residual.scale
[1] 3.259473

(Here se.fit = 0.7572 is se[Ê(Y_h)], not se(ŷ_h); df = 22 is the error degrees of freedom; and
residual.scale = 3.2595 = sqrt(MS_E).)
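As the note above indicates, se.fit is se[Ê(Y_h)], not se(ŷ_h); the latter can be recovered from the printed pieces, since residual.scale = sqrt(MS_E). A quick check using the values in the output:

# se(y_h-hat) = sqrt(MS_E + se[E-hat(Y_h)]^2), using the printed values
sqrt(3.259473^2 + 0.7572407^2)   # approximately 3.35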
Example 3.10:
When performing a regression of Y on X1 and X2, we find that

i.  ŷ = 20 − 1.5x1 + 1.8x2

ii. Source       DF   SS   MS
    Regression    2   42   21
    Error         3   12    4

iii. (X'X)⁻¹ =
     [  4/3  −1/4  −1/3 ]
     [ −1/4  1/16    0  ]
     [ −1/3    0    2/3 ]

a. Find se(β̂2).
b. Calculate the value of the test statistic for testing H0: β2 = 1.
c. Suppose x1 = 2 and x2 = 3; find se[Ê(Y_h)] and se(ŷ_h).
Solutions:
(a) se(β̂2) = sqrt(MS_E C22) = sqrt(4(2/3)) = 1.633 .

(b) H0: β2 = 1

    t0 = (β̂2 − 1) / se(β̂2) = (1.8 − 1) / 1.633 = 0.49 .

(c) With x_h = (1, 2, 3)',

    x_h'(X'X)⁻¹x_h = (1 2 3) [  4/3  −1/4  −1/3 ] [ 1 ]
                             [ −1/4  1/16    0  ] [ 2 ]  = 55/12 ,
                             [ −1/3    0    2/3 ] [ 3 ]

    so se[Ê(Y_h)] = sqrt(MS_E x_h'(X'X)⁻¹x_h) = sqrt(4 × 55/12) ≈ 4.28 , and
       se(ŷ_h) = sqrt(MS_E (1 + x_h'(X'X)⁻¹x_h)) = sqrt(4 (1 + 55/12)) ≈ 4.73 .
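The arithmetic in parts (a)–(c) can be verified with a few lines of R, using only the quantities given in the example:

# Quantities given in Example 3.10
MSE     <- 4
XtX.inv <- matrix(c( 4/3, -1/4, -1/3,
                    -1/4, 1/16,  0,
                    -1/3,  0,   2/3), nrow = 3, byrow = TRUE)
xh <- c(1, 2, 3)                        # (1, x1, x2) with x1 = 2, x2 = 3

se.b2   <- sqrt(MSE * XtX.inv[3, 3])    # se(beta2-hat) = 1.633
t0      <- (1.8 - 1) / se.b2            # test statistic for H0: beta2 = 1
quad    <- t(xh) %*% XtX.inv %*% xh     # x_h'(X'X)^(-1) x_h = 55/12
se.mean <- sqrt(MSE * quad)             # se[E-hat(Y_h)]
se.pred <- sqrt(MSE * (1 + quad))       # se(y_h-hat)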
Two other ways to assess the overall adequacy of the model are R² and the adjusted R², denoted
R²_adj. Recall that

R² = SS_R / SS_T = 1 − SS_E / SS_T .

R² has the same interpretation as before, but with respect to the k independent variables
(i.e. R² × 100% of the variation in Y can be explained by using the independent variables to
predict Y).
Notes:
1) Use R² as a measure of fit when the sample size is substantially larger than the number of
   variables in the model; otherwise, R² may be artificially high.
2) As more variables are added to the model, R² will always increase, even if the additional
   variables do a poor job of estimating Y (SS_E can never become larger with more predictor
   variables, and SS_T is always the same for a given set of responses). Therefore, some regression
   model builders prefer to use the adjusted R².
R²_adj = 1 − [SS_E/(n − k − 1)] / [SS_T/(n − 1)] = 1 − ((n − 1)/(n − k − 1)) (1 − R²)
Note:
1. Since SS_E/(n − k − 1) is the residual mean square and SS_T/(n − 1) is constant regardless
   of how many variables are in the model, R²_adj will only increase on adding a variable
   to the model if the addition of that variable reduces the residual mean square.
2. The interpretation of R²_adj is about the same as that of R².
3. R²_adj ≤ R².
4. R²_adj can be less than 0.

The overall F statistic can also be written in terms of R²:

F0 = (SS_R/k) / (SS_E/(n − k − 1)) = (R²/k) / ((1 − R²)/(n − k − 1)) .
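For the delivery-time example this identity can be checked numerically: with R² = 0.9596, k = 2 and n = 25 it reproduces the F statistic reported by lm() (a quick check in R):

# F from R-squared: F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
R2 <- 0.9596; k <- 2; n <- 25
(R2 / k) / ((1 - R2) / (n - k - 1))    # about 261, matching the output below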
Solution:
R² = SS_R / SS_T = (5382.4 + 168.4) / (5382.4 + 168.4 + 233.7) = 0.9596

R²_adj = 1 − ((n − 1)/(n − k − 1)) (1 − R²) = 1 − (24/22)(1 − 0.9596) = 0.9559

Since R²_adj = 0.9559, approximately 95.6% of the variation in delivery time can be explained by
using cases and distance to predict delivery time.
R Output
Call:
lm(formula = Delivery ~ NoCases + Distance, data = Ex2.2delivery.dat)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.341231 1.096730 2.135 0.044170 *
NoCases 1.615907 0.170735 9.464 3.25e-09 ***
Distance 0.014385 0.003613 3.981 0.000631 ***
---
Residual standard error: 3.259 on 22 degrees of freedom
Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
----------------------------------------------
Analysis of Variance Table
Response: Delivery
Df Sum Sq Mean Sq F value Pr(>F)
NoCases 1 5382.4 5382.4 506.619 < 2.2e-16 ***
Distance 1 168.4 168.4 15.851 0.0006312 ***
Residuals 22 233.7 10.6
Example 3.12:
Examine what happens to R² and R²_adj when additional variables are added to the model.
Consider the model

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε .

The data for Y, X1, X2, X3, X4 for n = 15 observations were input into the R program below.
Output:
model R.sq adj.R.sq
1 x1 0.9052093 0.8979178
2 x1, x2 0.9510411 0.9428813
3 x1, x2, x3 0.9703503 0.9622640
4 x1, x2, x3, x4 0.9713634 0.9599088
>mod.fit1
lm(formula = y ~ x1)
Coefficients:
(Intercept) x1
35.55 7.50
> mod.fit2
lm(formula = y ~ x1 + x2)
Coefficients:
(Intercept) x1 x2
18.62 5.75 19.49
> mod.fit3
lm(formula = Price ~ x1 + x2 + x3)
Coefficients:
(Intercept) x1 x2 x3
15.811 6.044 28.222 -14.664
> mod.fit4
lm(formula = y ~ x1 + x2 + x3 + x4)
Coefficients:
(Intercept) x1 x2 x3 x4
18.763 6.270 30.271 -16.203 -2.673
Model                                                     R²        R²_adj
Y = 35.5 + 7.5X1                                          0.9052    0.8979
Y = 18.62 + 5.75X1 + 19.49X2                              0.9510    0.9429
Y = 15.81 + 6.04X1 + 28.22X2 − 14.66X3                    0.9704    0.9623
Y = 18.76 + 6.27X1 + 30.27X2 − 16.20X3 − 2.67X4           0.9714    0.9599

a) Note that X1, X2 and X3 were significantly related to Y in the model, but X4 was not.
   When a variable that "may not" be useful is added to the model, the adjusted R² decreases. Thus,
   the decrease in R²_adj after X4 is added to the model suggests that X4 may not be useful in
   estimating Y.
b) Notice that R² increased after each variable was added to the model.
Note:
R²_adj is mainly used to compare two or more models that use different numbers of predictor
variables; the model with the highest adjusted R² is preferred.
Example 3.13:
R² and R²_adj were calculated for all possible subsets of the three independent variables. The
results are as follows:

Subsets Regression: Y versus X1, X2, X3

Independent Variables      R²        R²_adj
X1                         0.9052    0.8979
X2                         0.6948    0.6713
X3                         0.5565    0.5223
X1, X2                     0.9510    0.9429
X1, X3                     0.9150    0.9008
X2, X3                     0.7565    0.7159
X1, X2, X3                 0.9519    0.9388   (after adding X3 the adjusted R² decreases,
                                               so X3 may not be useful)
If you had to compare these models and choose the best one, which model would you choose?
Explain.
Solution:
Comparing the adjusted R² values, the model using X1 and X2 has the highest adjusted R² (0.9429);
adding X3 lowers it to 0.9388. Therefore, the model with X1 and X2 would be chosen.

Polynomial (Quadratic) Regression Models

The term involving x², β11x², is called a quadratic term (or second-order term). When the curve opens
upward, the sign of β11 is positive (see Figure 2.2a); when the curve opens downward, the sign of
β11 is negative (see Figure 2.2b). This polynomial model, E(Y) = β0 + β1x + β11x², is a second-order
model with one predictor variable.
Figure 2.2a: Graph of the quadratic model when β11 > 0 (concave up).
Figure 2.2b: Graph of the quadratic model when β11 < 0 (concave down).
Example 3.14:
In all-electric homes, the amount of electricity expended is of interest to consumers. Suppose we
wish to investigate the monthly electric usage, 𝑌, in all-electric homes and its relationship to the
size, 𝑋, of the home. Moreover, suppose we think that monthly electrical usage in all-electric homes
is related to the size of the home by the quadratic model
y = β0 + β1x + β11x² + ε
To fit the model, the values of Y and X are collected for 10 homes during a particular month. The
data are entered in the R code below.
a. Fit a regression model to the data. Plot the fitted regression function and the data. Does the
   quadratic regression function appear to be a good fit here? Find R².
b. Explain why the value β̂0 = 1703 has no practical interpretation.
c. Explain why the value β̂1 = 0.7068 should not be interpreted as a slope.
d. Examine the value of β̂11 to determine the nature of the curvature (concave upward or
   downward) in the sample data.
e. Test whether or not there is a regression relation; use α = 0.01.
f. Is there sufficient evidence of concave-down curvature in the electricity-usage/home-size
   relationship? Test with α = 0.01.
g. Estimate the mean electric usage for all 1200 sq ft houses with a 95% confidence interval.
   Interpret your interval.
h. Predict the electric usage for a 1200 sq ft house with a 95% prediction interval. Interpret your
   interval.
R-codes
# To enter the data
Size <- c(1290,1350,1470,1600,1710,1840,1980,2230,2400,2930)
Usage <-c(1182,1172,1264,1493,1571,1711,1804,1840,1956,1954)
sSize <- Size-mean(Size)
# To fit the regression function. I() is the identity function, used to prevent
# special interpretation of operators in a model formula.
Electric.Reg <- lm(Usage ~ sSize + I(sSize^2))
summary(Electric.Reg)

# Plot the fitted regression function and the data
plot(x = sSize, y = Usage, xlab = "Size", ylab = "Usage",
     main = "Usage vs. Size", col = "red", pch = 19, cex = 1.5)
# Overlay the fitted quadratic curve over the observed range of sSize
curve(expr = Electric.Reg$coefficients[1] + Electric.Reg$coefficients[2]*x +
        Electric.Reg$coefficients[3]*x^2,
      col = "blue", lwd = 2, add = TRUE, from = min(sSize), to = max(sSize))
Output:
lm(formula = Usage ~ sSize + I(sSize^2))
Residuals:
Min 1Q Median 3Q Max
-73.792 -22.426 5.886 31.689 52.436
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.703e+03  2.054e+01  82.914 9.77e-12 ***
sSize        7.068e-01  3.723e-02  18.985 2.80e-07 ***
I(sSize^2)  -4.500e-04  5.908e-05  -7.618 0.000124 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(The Pr(>|t|) column gives the p-value for the individual t-test on each coefficient.)
Note:
Notice how the I() (identity) function is used in the formula statement of lm(). The I() function
protects the meaning of the expression inside it. Writing sSize^2 directly in the formula will not
work as intended: inside a formula the ^ operator has a special meaning, so sSize^2 is interpreted
simply as sSize because there are no other terms crossed with it.
Solution:
a. Fit a regression model to the data. Plot the fitted regression function and the data. Does the
   quadratic regression function appear to be a good fit here? Find R².

[Figure: scatterplot of Usage versus Size with the fitted quadratic curve.]

The figure illustrates that electrical usage appears to increase in a curvilinear manner with the
size of the home. This provides some support for the inclusion of the quadratic term x² in the
model, and the fitted function appears to provide a good fit to the data.

R² = 0.9819. This implies that almost 98.19% of the sample variation in electrical usage (Y) can be
explained by the quadratic model.
c. Explain why the value β̂1 = 0.7068 should not be interpreted as a slope.

β̂1 = 0.7068 no longer represents a slope in the presence of the quadratic term x². The estimated
coefficient of the first-order term x does not have a meaningful interpretation on its own in the
quadratic model.

d. Examine the value of β̂11 to determine the nature of the curvature (concave upward or
   downward) in the sample data.

β̂11 = −0.00045. The negative sign of β̂11 indicates that the curve is concave downward.
e. Test whether or not there is a regression relation; use α = 0.01.

H0: β1 = β11 = 0
H1: at least one of β1, β11 is not zero.
F0 = 189.7 with ν1 = 2 and ν2 = 7 degrees of freedom; p-value = 0.0000008 < α = 0.01.
Reject H0: we have sufficient evidence to say that the overall model is useful for predicting
electrical usage.
f. Is there sufficient evidence of concave-down curvature in the electricity-usage/home-size
   relationship? Test with α = 0.01.

H0: β11 = 0
H1: β11 < 0
From the output, t0 = −7.618 and the one-sided p-value = 0.000124/2 = 0.000062 < α = 0.01.
Reject H0: there is sufficient evidence of concave-down curvature in the relationship between
electric usage and home size.
g. Estimate the mean electric usage for all 1200 sq ft houses with a 95% confidence interval.
   Interpret your interval.

With 95% confidence, we conclude that the mean electric usage for all 1200 sq ft houses falls
between 925.43 kilowatt-hours and 1103.60 kilowatt-hours.
h. Predict the electric usage for a 1200 sq ft house with a 95% prediction interval. Interpret your
   interval.

# Part (h): Construct a 95% PI for y_h at a size of 1200 sq ft
# (ND holds the centered size; assumed reconstruction: ND <- data.frame(sSize = 1200 - mean(Size)))
PI <- predict(object = Electric.Reg, newdata = ND, se.fit = TRUE,
              interval = c("prediction"), level = 0.95)
PI
Output:
> PI
$fit
fit lwr upr
[1,] 1014.514 872.4483 1156.580
$se.fit
[1] 37.67245
$df
[1] 7
$residual.scale
[1] 46.80133
With 95% confidence, we predict that the electric usage for a 1200 sq ft house falls somewhere
between 872.45 kilowatt-hours and 1156.58 kilowatt-hours.
3D regression plane
[Figure: the fitted regression plane E(Y) shown as a surface over the (x1, x2) plane.]
Example 3.15:
A collector of antique clocks knows that the price received for the clocks increases with the age of
the clocks. Moreover, the collector believes that the rate of increase of the auction price with age
will be driven upward by a large number of bidders. Consequently, the interaction model is
proposed:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
A sample of 32 auction prices of clocks, along with their age and the number of bidders, is given
below.
Age, x1 127 115 127 150 156 182 156 132 137 113 137
# of bidders, x2 13 12 7 9 6 11 12 10 9 9 15
Auction Price, y 1235 1080 845 1522 1047 1979 1822 1253 1297 946 1713
Age, x1 117 137 153 117 126 170 182 162 184 143 159
# of bidders, x2 11 8 6 13 10 14 8 11 10 6 9
Auction Price, y 1024 1147 1092 1152 1336 2131 1550 1884 2041 845 1483
Age, x1 108 175 108 179 111 187 111 115 194 168
# of bidders, x2 14 8 6 9 15 8 7 7 5 7
Auction Price, y 1055 1545 729 1792 1175 1593 785 744 1356 1262
The 32 data points were used to fit the model with interaction.
R Codes:
# to enter the data
Age <- c(127, 115, 127, 150, 156, 182, 156, 132, 137, 113, 137, 117, 137, 153, 117, 126, 170,
182, 162, 184, 143, 159, 108, 175, 108, 179, 111, 187, 111, 115, 194, 168)
Bidder <- c(13, 12, 7, 9, 6, 11, 12, 10, 9, 9, 15, 11, 8, 6, 13, 10, 14, 8, 11, 10, 6, 9, 14, 8, 6, 9, 15,
8, 7, 7, 5, 7)
Price <- c(1235, 1080, 845, 1522, 1047, 1979, 1822, 1253, 1297, 946, 1713, 1024, 1147, 1092,
1152, 1336, 2131, 1550, 1884, 2041, 845, 1483, 1055, 1545, 729, 1792, 1175, 1593, 785, 744,
1356, 1262)
# To fit the model
auc.reg <- lm(Price ~ Age+Bidder+I(Age*Bidder))
summary(auc.reg)
R Output:
Call:
lm(formula = Price ~ Age + Bidder + I(Age * Bidder))
Residuals:
Min 1Q Median 3Q Max
-154.995 -70.431 2.069 47.880 202.259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 320.4580 295.1413 1.086 0.28684
Age 0.8781 2.0322 0.432 0.66896
Bidder -93.2648 29.8916 -3.120 0.00416
I(Age * Bidder) 1.2978 0.2123 6.112 1.35e-06
R Codes:
auc.anv <- anova(auc.reg)
# SSR is the total regression sum of squares (Age + Bidder + interaction rows)
SSR <- sum(auc.anv$"Sum Sq"[1:3])
# Overall F statistic: (SSR / regression df) / (SSE / error df)
Fstat <- (SSR/sum(auc.anv$"Df"[1:3])) / (auc.anv$"Sum Sq"[4]/auc.anv$"Df"[4])
auc.anv
Fstat
R Output:
Analysis of Variance Table
Response: Price
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 2555224 2555224 323.209 < 2.2e-16 ***
Bidder 1 1727838 1727838 218.554 9.382e-15 ***
I(Age * Bidder) 1 295364 295364 37.361 1.353e-06 ***
Residuals 28 221362 7906
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#F-statistic
[1] 193.0411
Solution:
(a) H0: β1 = β2 = β3 = 0
    H1: at least one of the βj is not 0.
    F0 = 193.04 , p-value < 0.05.
    Reject H0: we conclude that there is statistical evidence that the regression model is useful for
    estimating auction price.

(b) H0: β3 = 0
    H1: β3 > 0
    t0 = 6.112 , one-sided p-value = 1.35e-06/2 < 0.05.
    Reject H0. We have sufficient evidence that the price–age slope increases as the number of
    bidders increases, i.e. age and number of bidders interact positively.

(c) The change in mean auction price per additional bidder is ∂E(Y)/∂X2 = β2 + β3X1. At x1 = 150,
    the estimated change is β̂2 + β̂3(150) = −93.2648 + 1.2978(150) ≈ 101.4.
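The quantity in part (c) can also be computed directly from the fitted object; a short sketch (the coefficient name I(Age * Bidder) is as printed in the R output above):

# Estimated price change per extra bidder for a 150-year-old clock
coef(auc.reg)["Bidder"] + coef(auc.reg)["I(Age * Bidder)"] * 150
# -93.2648 + 1.2978 * 150, about 101.4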
Note:
Once the interaction effect is significant in the model E(Y) = β0 + β1X1 + β2X2 + β3X1X2, do not
conduct t-tests on the β coefficients of the first-order terms X1 and X2. These terms should
be kept in the model regardless of the magnitude of their associated p-values.
Example 3.15.1:
Refer to Ex. 2.7. Real Estate
R-Output
Call:
lm(formula = Price ~ Area + Bath + Floor + Bedroom, data = Real.dat)
Residuals:
Min 1Q Median 3Q Max
-12.700 -1.616 0.984 2.510 11.759
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.7633 9.2074 2.038 0.06889 .
Area 6.2698 0.7252 8.645 5.93e-06 ***
Bath 30.2705 6.8487 4.420 0.00129 **
Floor -16.2033 6.2121 -2.608 0.02611 *
Bedroom -2.6730 4.4939 -0.595 0.56519
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Output for the model with the Area interaction terms added:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.1270 31.4595 1.466 0.1860
Area 4.8124 2.3829 2.020 0.0832 .
Bath 11.6684 22.1825 0.526 0.6151
Floor -40.3080 26.5576 -1.518 0.1729
Bedroom 9.3871 17.3444 0.541 0.6051
I(Area * Bath) 1.4876 1.6551 0.899 0.3986
I(Area * Floor) 1.2242 1.7087 0.716 0.4969
I(Area * Bedroom) -0.9592 1.2965 -0.740 0.4835
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple regression models can also be written to include qualitative predictor variables. Qualitative
variables, unlike quantitative variables, cannot be measured on a numerical scale. Therefore, we
must code the values of the qualitative variable (called levels) as numbers before we can fit the
model. These coded qualitative variables are called dummy (or indicator) variables.
Example 3.16:
To enter gender as a variable, use

X1 = 1 if the employee is male, and X1 = 0 if female.
Qualitative variables that involve 𝑘 categories are entered into the model by using 𝑘 − 1 dummy
variables.
Example 3.17:
In a model that relates the mean salary of group of employees to a number of predictor variables,
you may want to include the employee’s ethnic background. If each employee included in your
study belongs to one of the three ethnic groups – say, A, B, or C –you can enter the qualitative
variable “ethnicity” into your model using two dummy variables:
X1 = 1 if group B, 0 if not ;   X2 = 1 if group C, 0 if not.

The model is E(Y) = β0 + β1X1 + β2X2.

For employees in group A:  E(Y) = β0 + β1(0) + β2(0) = β0
For employees in group B:  E(Y) = β0 + β1(1) + β2(0) = β0 + β1
For employees in group C:  E(Y) = β0 + β1(0) + β2(1) = β0 + β2

The model allows a different average response for each group:
β0 measures the average response for group A.
β1 measures the difference in average response between groups B and A.
β2 measures the difference in average response between groups C and A.
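In R these dummy variables need not be created by hand: storing the group as a factor and letting model.matrix() (or lm()) build the design matrix produces the same k − 1 = 2 indicators. A minimal sketch with hypothetical group labels:

# Hypothetical ethnic-group labels for six employees
ethnic <- factor(c("A", "B", "C", "A", "B", "C"))

# model.matrix() creates k - 1 = 2 dummy columns (group A is the baseline level)
model.matrix(~ ethnic)
#   columns: (Intercept), ethnicB, ethnicC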
Example 3.18:
Consider the following model:
E(Y) = β0 + β1X1 + β2X2

where Y is the annual salary of a college lecturer,
      X1 is the number of years of teaching experience, and
      X2 = 1 for a male college lecturer, 0 otherwise.

The model above contains one quantitative variable (years of teaching experience) and one qualitative
variable (gender), which has two categories, male and female.

Mean salary for a female college lecturer:  E(Y) = β0 + β1X1 + β2(0) = β0 + β1X1
Mean salary for a male college lecturer:    E(Y) = β0 + β1X1 + β2(1) = (β0 + β2) + β1X1
Figure 2.4: Salary (Y) versus years of experience for male and female lecturers – two parallel lines
with common slope β1; the male line is shifted up by β2.
The fact that the slopes of the two lines may differ means that the two predictor variables interact;
that is, the change in 𝐸(𝑌) corresponding to a change in 𝑿𝟏 depends on whether the lecturer is
a man or a woman. To allow for this interaction, the interaction term 𝑋 𝑋 is introduced into the
model.
E(Y) = β0 + β1X1 + β2X2 + β12X1X2

Mean salary for a female college lecturer:
E(Y) = β0 + β1X1 + β2(0) + β12X1(0) = β0 + β1X1 ,

which is a straight line with slope β1 and intercept β0 (see Figure 2.4).

Mean salary for a male college lecturer:
E(Y) = β0 + β1X1 + β2(1) + β12X1(1) = (β0 + β2) + (β1 + β12)X1 ,

which is a straight line with slope β1 + β12 and intercept β0 + β2 (see Figure 2.5).
Figure 2.5: Salary (Y) versus years of experience for male and female lecturers under the interaction
model – the female line has slope β1 and intercept β0; the male line has slope β1 + β12 and
intercept β0 + β2.
The two lines have different slopes and different intercepts, which allows the relationship between
salary 𝑌 and years of experience 𝑋 to behave differently for men and women.
Example 3.19:
Table below gives hypothetical data on starting annual salaries and years of experience of 10 college
lecturers.
Years of Experience, 𝑋 Salary for Men (in RM1000) Salary for women (in RM1000)
5 27 24
4 26.7 23
3 26 23.5
2 25.5 22
1 26.2 22.5
R-Codes:
# to enter the data
Year <- c(5,5,4,4,3,3,2,2,1,1)
Gender <- c(1,0,1,0,1,0,1,0,1,0)
Salary <- c(27,24,26.7,23,26,23.5,25.5,22,26.2,22.5)
# Fit the model
Pay.Reg <- lm(Salary~Year+Gender+I(Year*Gender))
summary(Pay.Reg)
# Plot the fitted line for male lecturers (Gender = 1): intercept b0 + b2, slope b1 + b12
curve(expr = Pay.Reg$coefficients[1] + Pay.Reg$coefficients[2]*x +
        Pay.Reg$coefficients[3] + Pay.Reg$coefficients[4]*x,
      col = "red", lty = "solid", lwd = 2, xlim = c(1,6), ylim = c(22,27),
      xlab = "Year", ylab = "Salary", main = "Salary vs. Years",
      panel.first = grid(col = "gray", lty = "dotted"))
# Add the fitted line for female lecturers (Gender = 0): intercept b0, slope b1
curve(expr = Pay.Reg$coefficients[1] + Pay.Reg$coefficients[2]*x,
      col = "blue", lty = "solid", lwd = 2, xlim = c(1,6), ylim = c(22,27), add = TRUE)
legend(x = 4.55, y = 26.5, legend = c("Male", "Female"), col = c("red", "blue"),
       lty = "solid", bty = "n", cex = 1, lwd = 2)
b. Fit the model and graph the prediction equations for Men and Woman lecturer
[Figure: "Salary vs. Years" – fitted salary lines for male and female lecturers plotted against years
of experience (1 to 6).]
The fitted model is ŷ = 21.8 + 0.4X1 + 3.64X2 − 0.12X1X2 .

The prediction equation for women (X2 = 0):  ŷ = 21.8 + 0.4X1 + 3.64(0) − 0.12X1(0) = 21.8 + 0.4X1

The prediction equation for men (X2 = 1):    ŷ = 21.8 + 0.4X1 + 3.64(1) − 0.12X1(1) = 25.44 + 0.28X1
c. Use the prediction equation to find the mean salary for male with 3.5 years of
experience.
ŷ = 25.44 + 0.28(3.5) = 26.42 (i.e. RM26,420)
$fit
fit lwr upr
[1,] 26.42 25.83889 27.00111
$se.fit
[1] 0.2374868
$df
[1] 6
d. Find a 95% confidence interval for the mean salary of all male lecturers with 3.5 years of
experience.
With 95% confidence, the mean salary of all male lecturers with 3.5 years of experience is
between 25.84 and 27.00 (in RM1000), i.e. between RM25,839 and RM27,001.
R-Codes:
ND<- data.frame(Year = c(3.5), Gender = c(1))
PI <- predict(object = Pay.Reg, newdata = ND, se.fit = TRUE,interval=c("prediction"),
level=0.95)
PI
$fit
fit lwr upr
1 26.42 25.06408 27.77592
$se.fit
[1] 0.2374868
$df
e. Find a 95% prediction interval for the salary of a male lecturer with 3.5 years of
experience.
With 95% confidence, the salary of a male lecturer with 3.5 years of experience is predicted to be
between 25.06 and 27.78 (in RM1000), i.e. between RM25,064 and RM27,776.
f. Do the data provide sufficient evidence to indicate that the annual rate of increase in female
lecturer salaries exceeds the annual rate of increase in male lecturer salaries? Test at 𝛼 =
0.1
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        21.8000     0.5251  41.516 1.31e-08 ***
Year                0.4000     0.1583   2.526  0.04489 *
Gender              3.6400     0.7426   4.902  0.00271 **
I(Year * Gender)   -0.1200     0.2239  -0.536  0.61127
Since β12 measures the difference in slopes, the slopes of the two lines will be identical if β12 = 0.
H0: β12 = 0    H1: β12 < 0
t0 = −0.536 , one-sided p-value = 0.61127/2 ≈ 0.31 > 0.10.
Do not reject H0: the data do not provide sufficient evidence to indicate that the annual rate of
increase in female lecturer salaries exceeds the annual rate of increase in male lecturer salaries.
Remark:
If the indicator variable is instead defined as X2 = 1 for a female college lecturer and 0 otherwise,
the model is still

E(Y) = β0 + β1X1 + β2X2 + β12X1X2 ,

where Y is the annual salary of a college lecturer and X1 is the number of years of teaching
experience; only the interpretation (and sign) of the gender-related coefficients changes, as the
output below shows.
R-code
Year <- c(5,5,4,4,3,3,2,2,1,1)
Gender <- c(0,1,0,1,0,1,0,1,0,1)
Salary <- c(27,24,26.7,23,26,23.5,25.5,22,26.2,22.5)
Pay.Reg <- lm(Salary~Year+Gender+I(Year*Gender))
summary(Pay.Reg)
R-output:
Residuals:
Min 1Q Median 3Q Max
-0.600 -0.370 0.150 0.275 0.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.4400 0.5251 48.448 5.19e-09 ***
Year 0.2800 0.1583 1.769 0.12738
Gender -3.6400 0.7426 -4.902 0.00271 **
I(Year * Gender) 0.1200 0.2239 0.536 0.61127
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1