10 Inference For Regression Part 2
Topics Outline
Analysis of Variance (ANOVA)
F Test for the Slope
Calculating Confidence and Prediction Intervals
Normal Probability Plots
Analysis of Variance (ANOVA)
Question: If the least squares fit is the best fit, how good is it?
The answer to this question depends on the variability in the values of the response variable, that is, on the deviations of the observed y's from their mean ȳ:

    yi − ȳ    =    (ŷi − ȳ)    +    (yi − ŷi)

    total variation = variation explained by the regression line + unexplained variation

If we square these deviations and add them up, we obtain the following three sources of variability:
    Total Sum of Squares        SST = Σ (yi − ȳ)²       (sums taken over i = 1, ..., n)
    Regression Sum of Squares   SSR = Σ (ŷi − ȳ)²
    Error Sum of Squares        SSE = Σ (yi − ŷi)²

    SST = SSR + SSE

Note that the sample variance of the y's is s_y² = (1/(n − 1)) Σ (yi − ȳ)² = SST / (n − 1).
So, the total sum of squares SST really is a measure of total variation. It has n − 1 degrees of freedom.
The regression sum of squares SSR represents variation due to the relationship between x and y. It has 1 degree of freedom.
The error sum of squares SSE measures the amount of variability in the response variable due to factors other than the relationship between x and y. It has n − 2 degrees of freedom.
For the car plant electricity usage example (see Excel output on pages 5 and 6),
    SST    =    SSR    +    SSE
    1.5115 =    1.2124 +    0.2991
By themselves, SSR, SSE, and SST provide little that can be directly interpreted.
However, a simple ratio of the regression sum of squares SSR to the total sum of squares SST
provides a measure of the goodness of fit for the estimated regression equation. This ratio is the
coefficient of determination:
    r² = SSR / SST = 1 − SSE / SST

The coefficient of determination is the proportion of the total sum of squares that can be explained by the sum of squares due to regression. In other words, r² measures the proportion of variation in the response variable y that can be explained by y's linear dependence on x in the regression model.
For our data,
    r² = SSR / SST = 1.2124 / 1.5115 = 0.802
This high proportion of explained variation indicates that the estimated regression equation
provides a good fit and can be very useful for predictions of electricity usage.
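The sum-of-squares decomposition can be checked numerically. Here is a short Python sketch on a small made-up data set (not the car plant data, which live in the Excel file): it fits the least squares line by hand, computes SST, SSR, and SSE, and verifies that SST = SSR + SSE.

```python
# Small made-up data set, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept: b = Sxy / Sxx, a = y_bar - b * x_bar
Sxx = sum((x - x_bar) ** 2 for x in xs)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = Sxy / Sxx
a = y_bar - b * x_bar

# Fitted values and the three sums of squares
y_hat = [a + b * x for x in xs]
SST = sum((y - y_bar) ** 2 for y in ys)
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)
SSE = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

r_squared = SSR / SST      # coefficient of determination
print(SST, SSR + SSE, r_squared)
```

For these illustrative numbers SST and SSR + SSE agree exactly (up to floating point), and r² is close to 1 because the made-up points lie nearly on a line.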
If we divide the sum of squares for regression and error by their degrees of freedom,
we obtain the regression mean square MSR and the error mean square MSE:
    MSR = SSR / 1 = SSR         (variance due to regression)
    MSE = SSE / (n − 2)         (variance due to error)
Taking the square root of the variance due to error, we obtain the regression standard error or standard error of estimate:

    s_e = √MSE = √( SSE / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) )

For our data,

    MSR = SSR / 1 = 1.2124
    MSE = SSE / (n − 2) = 0.2991 / (12 − 2) = 0.02991

The ratio of these two variances gives the F-statistic:

    F = MSR / MSE = 1.2124 / 0.02991 = 40.53
It can be proved that the F-statistic follows an F-distribution with 1 and (n − 2) degrees of freedom and can be used to test the hypothesis for a linear relationship between x and y.
Recall that
    H0: β1 = 0    (no linear relationship)
    Ha: β1 ≠ 0    (there is a linear relationship between x and y)
If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship.
For our example, the corresponding F distribution has df1 = 1 and df2 = n − 2 = 12 − 2 = 10 degrees of freedom. To calculate the P-value associated with the value F = 40.53 of the test statistic, Excel can be used as follows:
P-value = FDIST(test statistic,df1,df2) = FDIST(40.53,1,10) = 0.000082
Figure 2 Testing for significance of slope using F distribution with 1 and 10 degrees of freedom
With P-value this small, we reject the null hypothesis and conclude that there is a significant
linear relationship between the electricity usage and the production levels.
Notice that the P-value = 0.000082 for the F test of the slope is the same as the P-value for the t test of the slope performed earlier. Moreover, it can be shown that the square of a t distribution with n − 2 degrees of freedom equals the F distribution with 1 and n − 2 degrees of freedom:

    t²(n−2) = F(1, n−2)

For our data,

    t = b / SEb = 0.4988301 / 0.078352 = 6.3665267    and    t² = 6.3665267² ≈ 40.53 = F
With only one explanatory variable, the F test will provide the same conclusion as the t test.
But with more than one explanatory variable, only the F test can be used to test for an overall
significant relationship.
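The t² = F identity can be verified directly from the numbers in the regression output. A quick Python check, with the values copied from the Excel output for the electricity example:

```python
# Values taken from the Excel regression output for the electricity example
b, SE_b = 0.4988301, 0.078352    # slope and its standard error
MSR, MSE = 1.212382, 0.029911    # mean squares from the ANOVA table

t = b / SE_b       # t statistic for the slope (about 6.3665)
F = MSR / MSE      # F statistic (about 40.53)

# The square of the t statistic equals the F statistic, up to rounding.
print(t, F, t ** 2)
```

The two statistics agree to the rounding in the output, which is why the F test and the t test give the same P-value in simple regression.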
Regression Statistics
    Multiple R           0.895606
    R Square             0.802109
    Adjusted R Square    0.782320
    Standard Error       0.172948
    Observations         12

ANOVA
                  df    SS          MS          F            Significance F
    Regression     1    1.212382    1.212382    40.532970    0.000082
    Residual      10    0.299110    0.029911
    Total         11    1.511492

                  Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
    Intercept       0.409049        0.385991        1.059736    0.314190    -0.450992    1.269089
    Production      0.498830        0.078352        6.366551    0.000082     0.324252    0.673409

[Scatter plot: electricity usage versus Production ($ million)]
In general, the ANOVA table has the following layout:

                  Degrees of    Sum of     Mean Square
                  Freedom       Squares    (Variance)            F-statistic    P-value
    Regression    1             SSR        MSR = SSR / 1         F = MSR/MSE    Prob > F
    Error         n − 2         SSE        MSE = SSE / (n − 2)
    Total         n − 1         SST

For our example:

                  df    SS          MS          F            Significance F
    Regression     1    1.212382    1.212382    40.532970    0.000082
    Residual      10    0.299110    0.029911
    Total         11    1.511492

The coefficients table has the following layout:

                  Coefficients       P-value       Lower 95%    Upper 95%
    Intercept     a (y-intercept)    Prob > |t|    a − t*SEa    a + t*SEa
    Slope         b                  Prob > |t|    b − t*SEb    b + t*SEb

For our example:

                  t Stat      P-value     Lower 95%    Upper 95%
    Intercept     1.059736    0.314190    -0.450992    1.269089
    Production    6.366551    0.000082     0.324252    0.673409
Example 2
Sunflowers Apparel
The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased
during the past 12 years as the chain has expanded the number of stores. Until now, Sunflowers
managers selected sites based on subjective factors, such as the availability of a good lease or the
perception that a location seemed ideal for an apparel store.
The new director of planning wants to develop a systematic approach that will lead to making
better decisions during the site selection process. He believes that the size of the store significantly
contributes to store sales, and wants to use this relationship in the decision-making process.
To examine the relationship between the store size and its annual sales, data were collected from
a sample of 14 stores. The data are stored in Sunflowers_Apparel.xlsx.
(a) Use Excel's Regression tool or StatTools to run a linear regression.
Does a straight line provide a useful mathematical model for this relationship?
Regression Statistics
    Multiple R           0.950883
    R Square             0.904179
    Adjusted R Square    0.896194
    Standard Error       0.966380
    Observations         14

ANOVA
                  df    SS            MS            F             Significance F
    Regression     1    105.747610    105.747610    113.233513    0.00000018
    Residual      12    11.206676     0.933890
    Total         13    116.954286

                  Coefficients    Standard Error    t Stat       P-value     Lower 95%    Upper 95%
    Intercept       0.964474        0.526193        1.832927     0.091727    -0.182003    2.110950
    Square Feet     1.669862        0.156925        10.641124    0.000000     1.327951    2.011773

[Scatter plot: annual sales versus Square Feet (thousands), with trendline y = 1.6699x + 0.9645 and R² = 0.9042]
As the scatter plot and the high r = 0.95 show, as the size of the store increases, annual sales
increase approximately as a straight line. Thus, we can assume that a straight line provides a
useful mathematical model for this relationship.
(b) Can you safely predict the annual sales for a store whose size is 7 thousand square feet?
The square footage varies from 1.1 to 5.8 thousand square feet. Therefore, annual sales should be predicted only for stores whose size is between 1.1 and 5.8 thousand square feet.
It would be improper to use the prediction line to forecast the sales for a new store containing 7,000 square feet because the relationship between sales and store size may have a point of diminishing returns. If that is true, as square footage increases beyond 5,800 square feet, the effect on sales becomes smaller and smaller.
(c) What are the values of SST, SSR, and SSE? Please verify that SST = SSR + SSE.
SST = 116.954286
SSR = 105.747610
SSE = 11.206676
116.954286 = 105.747610 + 11.206676
(d) Use the above sums to calculate the coefficient of determination and interpret it.
    r² = SSR / SST = 105.747610 / 116.954286 = 0.904179
Therefore, 90.42% of the variation in annual sales is explained by the variability in the size of
the store as measured by the square footage. This large r2 indicates a strong linear relationship
between these two variables because the use of a regression model has reduced the variability in
predicting annual sales by 90.42%. Only 9.58% of the sample variability in annual sales is due to
factors other than what is accounted for by the linear regression model that uses square footage.
(e) Interpret the standard error of the estimate.
s = 0.966380
Recall that the standard error of the estimate represents a measure of the variation around the
prediction line. It is measured in the same units as the dependent variable y.
Here, the typical difference between actual annual sales at a store and the predicted annual sales
using the regression equation is approximately 0.966380 millions of dollars or $966,380.
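Both r² and the standard error of the estimate can be recomputed from the ANOVA sums of squares. A quick Python check, with the values copied from the Sunflowers output:

```python
import math

# Sums of squares and sample size from the Sunflowers Apparel ANOVA table
SSR = 105.747610
SSE = 11.206676
SST = 116.954286
n = 14

r_squared = SSR / SST              # coefficient of determination
s_e = math.sqrt(SSE / (n - 2))     # standard error of the estimate

print(r_squared, s_e)   # about 0.904179 and 0.966380, matching the output
```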
(f) What are the expected annual sales and the residual value for the last data pair (x = 3, y = 4.1)?
Interpret both of these values in business terms.
From the regression output, the expected (or predicted) annual sales equal 5.974061,
indicating that we expect annual sales to be $5,974,061, on average, for a store with size of
3,000 square feet. The residual equals 1.874061, indicating that for the store corresponding
to the last pair in the data set the actual annual sales were $1,874,061 lower than expected.
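The predicted value and the residual follow directly from the fitted equation. A quick check in Python, using the coefficients from the Sunflowers output:

```python
# Coefficients from the Sunflowers regression output
a = 0.964474     # intercept
b = 1.669862     # slope (per thousand square feet)

x, y = 3, 4.1    # last data pair: store size and actual annual sales ($ millions)

y_hat = a + b * x        # predicted annual sales
residual = y - y_hat     # actual minus predicted

print(y_hat, residual)   # about 5.974 and -1.874, matching the text
```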
[Residual plot: residuals versus Square Feet]
(h) Use the residual plot to evaluate the regression model assumptions about linearity
(mean of zero), independence, and equal spread of the residuals.
Linearity
Although there is widespread scatter in the residual plot, there is no clear pattern or relationship between the residuals and the xi's. The residuals appear to be evenly spread above and below 0 for different values of x.
Independence
You can evaluate the assumption of independence of the errors by plotting the residuals in
the order or sequence in which the data were collected. If the values of y are part of a time
series, one residual may sometimes be related to the previous residual.
If this relationship exists between consecutive residuals (which violates the assumption of
independence), the plot of the residuals versus the time in which the data were collected will
often show a cyclical pattern. Because the Sunflowers Apparel data were collected during the
same time period, you can assume that the independence assumption is satisfied for these data.
Equal spread
There do not appear to be major differences in the variability of the residuals for different xi
values. Thus, you can conclude that there is no apparent violation in the assumption of equal
spread at each level of x.
(i) Use StatTools to construct a normal probability plot and evaluate the regression model
assumption about normality of the residuals.
[Q-Q Normal Plot of Residual / Data Set #2: Standardized Q-Value versus Z-Value]
From the QQ plot of the residuals, the data do not appear to depart substantially from a
normal distribution. The robustness of regression analysis with modest departures from
normality enables you to conclude that you should not be overly concerned about departures
from this normality assumption in the Sunflowers Apparel data.
Note: Excel does not readily provide a normal probability plot of residuals, but one can be obtained in the following way. Run a regression with y = Residuals and x = any numbers (for example, x = 1, 2, 3, ...) and check the Normal Probability Plots box. Here is the result for our example.
[Excel normal probability plot: Residuals versus Sample Percentile]
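The coordinates of a normal probability (Q-Q) plot can also be built by hand: sort the residuals and pair them with the corresponding standard normal quantiles. A minimal sketch in Python using only the standard library, with made-up residuals (the actual Sunflowers residuals are in the Excel file):

```python
import statistics

# Hypothetical residuals, used only to illustrate the construction
residuals = [-1.9, -0.6, -0.2, 0.1, 0.3, 0.5, 0.8, 1.0]

n = len(residuals)
ordered = sorted(residuals)

# Theoretical standard normal quantiles at the plotting positions (i + 0.5)/n
z = [statistics.NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Plotting the pairs (z[i], ordered[i]) gives the Q-Q plot;
# an approximately straight line suggests roughly normal residuals.
for zi, ri in zip(z, ordered):
    print(round(zi, 3), ri)
```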
(j) Find the standard error of the slope coefficient. What does this number indicate?
The standard error of the slope coefficient indicates the uncertainty in the estimated slope. It measures approximately how far the estimated slope (the regression coefficient computed from the sample) differs from the true population slope, β1, due to the randomness of sampling. Here, the estimated slope b = 1.669862 (or $1,669,862 per thousand square feet) typically differs from the population slope by about SEb = 0.156925 (or $156,925).
(k) Find and interpret the 95% confidence interval for the slope coefficient.
(Note: In economics language for the slope, this question sounds like the following.
Find the 95% confidence interval for the expected marginal value of an additional 1,000
square feet to Sunflowers Apparel.)
The 95% confidence interval extends from 1.327951 to 2.011773, or $1,327,951 to $2,011,773. We are 95% confident that the additional annual sales, for each additional 1,000 square feet in store size, are between $1,327,951 and $2,011,773, on average.
(In economics language: We are 95% sure that the expected marginal value of an additional 1,000 square feet is between $1,327,951 and $2,011,773.)
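The interval can be reproduced from b ± t*·SEb. A quick Python check; 2.1788 is the standard t critical value for 95% confidence with n − 2 = 12 degrees of freedom, taken from a t table:

```python
# Slope and its standard error from the Sunflowers regression output
b = 1.669862
SE_b = 0.156925

t_star = 2.1788   # t critical value: 95% confidence, 12 degrees of freedom

lower = b - t_star * SE_b
upper = b + t_star * SE_b

print(lower, upper)   # about 1.3280 and 2.0118, matching the Excel output
```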
(l) Use the F statistic from the regression output to determine whether the slope is statistically
significant.
The ANOVA table shows that the computed test statistic is F = 113.2335 and the P-value is approximately 0. Therefore, you reject the null hypothesis H0: β1 = 0 and conclude that the size of the store is significantly related to annual sales.
(m) Use the t statistic from the regression output to determine whether the slope is statistically
significant.
The slope is significantly different from 0. This can be seen either from the P-value of the test statistic (for t = 10.6411, P-value ≈ 0) or from the confidence interval for the slope (1.33 to 2.01), which does not include 0. This says that the size of the store significantly contributes to store annual sales. Larger stores have larger annual sales, on average.
(n) Use the ConfPredInt.xlsx to construct a 95% confidence interval of the mean annual sales
for the entire population of stores that contain 4,000 square feet (x = 4).
The confidence interval is 6.971119 to 8.316727. Therefore, the mean annual sales are
between $6,971,119 and $8,316,727 for the population of stores with 4,000 square feet.
(o) Use the ConfPredInt.xlsx to construct a 95% prediction interval of the annual sales for an
individual store that contains 4,000 square feet (x = 4).
The prediction interval is 5.433482 to 9.854364. Therefore, with 95% confidence, you predict that the annual sales for an individual store with 4,000 square feet are between $5,433,482 and $9,854,364.
(p) Compare the intervals constructed in (n) and (o).
If you compare the results of the confidence interval estimate and the prediction interval estimate, you see that the prediction interval for an individual store is much wider than the confidence interval for the mean. There is more variation in predicting the annual sales of a single store than in estimating the mean annual sales of all stores of a given size.
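The difference in width comes from the standard textbook interval formulas: both intervals are centered at ŷ and use the same critical value, but the prediction interval's standard error carries an extra "1 +" under the square root. A small Python sketch on made-up data (not the Sunflowers data) showing that the prediction interval is always wider:

```python
import math

# Made-up sample, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares fit by hand
Sxx = sum((x - x_bar) ** 2 for x in xs)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = Sxy / Sxx
a = y_bar - b * x_bar

# Standard error of the estimate
SSE = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(SSE / (n - 2))

x0 = 3.5                                      # point at which to form the intervals
extra = 1 / n + (x0 - x_bar) ** 2 / Sxx

se_mean = s_e * math.sqrt(extra)              # for the CI of the mean response
se_pred = s_e * math.sqrt(1 + extra)          # for the PI of an individual response

# The prediction interval is wider because of the extra 1 under the root.
print(se_mean, se_pred)
```

Multiplying each standard error by the appropriate t critical value gives the half-widths of the two intervals, so the prediction interval is wider at every x0.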