Module 11 Unit 2 Simple Linear Regression
Learning Outcomes:
(1) Develop an estimated simple linear regression model to predict the value of a
dependent variable based on one independent variable.
(2) Interpret the constants in the estimated simple linear regression equation.
At the end of this learning module, you are expected to know how to model certain
phenomena using simple linear regression. You will be tasked to derive linear regression
models given some business-related data using a scientific calculator.
Regression analysis is a tool for building and developing a statistical (regression) model that
will characterize the association between a dependent variable and one or more
independent, or explanatory, variables. If the regression model is found to be adequate, it
can then be used to estimate or forecast values of the dependent variable. In simple linear
regression, there is only one independent variable, while multiple linear regression uses two
or more independent variables.
Correlation and regression analysis are closely related since both involve the relationship
between two variables and both use paired observations obtained from the same (or
matched) subjects. While correlation is used to determine the degree as well as the direction
of the relationship between variables, regression analysis deals with the use of that relationship
for forecasting or predicting the value of a dependent variable.
For instance, regression analysis can be used for the following situations:
• Managers wish to predict the level of sales based on selling price, or extrapolate
a trend into the future.
• A company may wish to predict sales based on the GDP and the 10-year treasury
bond rate to capture the influence of the business cycle.
• A marketing researcher might want to predict the intent of buying a particular car
model based on a survey that measured consumer attitudes toward the brand,
negative word-of-mouth, and income level.
Before proceeding with (simple linear) regression analysis, a scatter diagram of 𝑌 versus 𝑋
should be drawn, since it may give an idea of the form of the relationship between them. It is
important to note here that the variable being predicted is always the dependent variable 𝑌,
and it must be placed on the vertical (𝑦) axis.
Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 1
SIMPLE LINEAR REGRESSION
Simple linear regression attempts to model the relationship between two variables by fitting
a linear equation to observed data. One variable is considered to be a regressor/predictor
or independent variable, and the other is considered to be a response or dependent
variable (the variable being predicted). The simple linear regression model postulates that
𝑌 = 𝛽₀ + 𝛽₁𝑋 + 𝑒
In practice, the parameter values 𝛽₀ and 𝛽₁ are not known and must be estimated using
sample data. In general, the goal of simple linear regression is to find the line that best
predicts 𝑌 from 𝑋, that is, to find the line 𝑌̂ = 𝑎 + 𝑏𝑋 (the fitted regression line) that best
estimates the regression model 𝑌 = 𝛽₀ + 𝛽₁𝑋 + 𝑒. It determines the values 𝑎 and 𝑏 that best
estimate 𝛽₀ and 𝛽₁, where the variables involved are defined as follows:
𝑌̂ = predicted value of the dependent variable 𝑌
𝑋 = value of the independent variable
𝑎 = estimate of the 𝑦-intercept 𝛽₀
𝑏 = estimate of the slope 𝛽₁
𝑒 = random error term of the model
The value of the slope 𝑏 and 𝑦-intercept 𝑎 can be obtained using the method of least
squares, using the following formulas:
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \bar{y} - b\bar{x}$$
The 𝑦-intercept 𝑎 is interpreted as the predicted value of 𝑌 when 𝑋 is zero (if such a case
exists). The slope 𝑏 is interpreted as the amount of increase (if it is positive) or decrease (if it
is negative) in the predicted value of 𝑌 for every unit increase in the value of 𝑋.
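The least-squares formulas come from choosing 𝑎 and 𝑏 so that the sum of squared vertical deviations of the points from the line is as small as possible. A brief sketch of that derivation, using the same symbols 𝑎, 𝑏, 𝑥, 𝑦, and 𝑛 (the name SSE for the sum of squared errors is introduced here only for illustration):

```latex
\begin{aligned}
\text{SSE}(a,b) &= \sum (y - a - bx)^2 \\
\frac{\partial\,\text{SSE}}{\partial a} = 0 &\;\Rightarrow\; \sum y = na + b\sum x \\
\frac{\partial\,\text{SSE}}{\partial b} = 0 &\;\Rightarrow\; \sum xy = a\sum x + b\sum x^2 \\
\text{Solving these two equations:}\quad b &= \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad a = \bar{y} - b\bar{x}
\end{aligned}
```

The first equation, divided by 𝑛, is exactly 𝑎 = 𝑦̅ − 𝑏𝑥̅, so the fitted line always passes through the point (𝑥̅, 𝑦̅).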
Example 1:
To illustrate the interpretation of the values 𝑎 and 𝑏, suppose we wish to find the line that best
estimates the relationship between the number of hours of study of a student (𝑋) and the
score obtained in a test (𝑌). Suppose that the fitted regression line is found to be 𝑌̂ = 12 + 5𝑋.
Then a student who does not study at all is predicted to get a score of 12. In addition, every
additional hour of study is predicted to increase the student’s score by 5 points.
Example 2:
As a second example, suppose the linear regression equation 𝑌̂ = 14 − 2.5𝑋 predicts the
selling price 𝑌 of a second-hand laptop in thousands of pesos based on the age 𝑋 of the
laptop in years. Then the equation indicates that the predicted price of a brand-new laptop
(𝑋 = 0) is P14,000, and for every additional year of use, its predicted selling price decreases
by P2,500.
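The interpretations in Examples 1 and 2 can be checked numerically. The Python sketch below is only an illustration (the helper name `predict` is an assumption and not part of this module): evaluating a fitted line at 𝑋 = 0 gives the intercept, and the difference between predictions one unit apart gives the slope.

```python
def predict(a, b, x):
    """Return the predicted value Y-hat = a + b*x for a fitted line."""
    return a + b * x

# Example 1: test score vs. hours of study, Y-hat = 12 + 5X
score_no_study = predict(12, 5, 0)                     # prediction at X = 0 (intercept)
score_gain = predict(12, 5, 3) - predict(12, 5, 2)     # change per extra hour (slope)

# Example 2: laptop price (thousand pesos) vs. age, Y-hat = 14 - 2.5X
price_new = predict(14, -2.5, 0)                          # prediction at X = 0 (intercept)
price_change = predict(14, -2.5, 4) - predict(14, -2.5, 3)  # change per extra year (slope)

print(score_no_study, score_gain)
print(price_new, price_change)
```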
Let us now show how to find the estimated simple linear regression equation.
Example 3:
In the 1990s, research efforts focused on the problem of predicting a manufacturer’s
market share using information on the quality of its product. Suppose that the following data
are available on market share, in percent (𝑌), and product quality, on a scale of 0 to 100,
determined by an objective evaluation procedure (𝑋).
𝑿 27 39 73 66 33 43 47 55 60 68 70 75
𝒀 2 3 10 9 4 6 5 8 7 9 10 13
Solution:
a. Here is the scatter diagram for the data set. It appears that there is a positive
relationship between product quality rating and market share.
b. To find the estimated simple linear regression equation, we determine the values
of the following using the calculator: 𝑛, 𝑥̅, ∑𝑥, ∑𝑥², 𝑦̅, ∑𝑦, ∑𝑦², ∑𝑥𝑦. The process is
the same as the procedure shown in the previous unit on finding the Pearson
correlation coefficient (choose 𝑦 = 𝑎 + 𝑏𝑥 in the statistics mode of the calculator).
Verify that these are the values for this problem:
𝑛 = 12, 𝑥̅ = 54.66666667, ∑𝑥 = 656, ∑𝑥² = 38856
𝑦̅ = 7.166666667, ∑𝑦 = 86, ∑𝑦² = 734, ∑𝑥𝑦 = 5267
We copied the entire values of 𝑥̅ and 𝑦̅ to minimize round-off errors in computing 𝑎.
Hence, the slope and intercept are given respectively as follows:
$$b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{12(5267) - 656(86)}{12(38856) - 656^2} = 0.1888913624$$
$$a = \bar{y} - b\bar{x} = 7.166666667 - 0.1888913624(54.66666667) = -3.159394479$$
Copying the values of 𝑎 and 𝑏 up to the fourth decimal place (rounding off
properly), the estimated simple linear regression equation is given by
𝑌̂ = 𝑎 + 𝑏𝑋
𝑌̂ = −3.1594 + 0.1889𝑋
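The value 𝑎 = −3.1594 has no practical interpretation here because a market share cannot be
negative. On the other hand, the value 𝑏 = 0.1889 means that the predicted market share
increases by 0.1889 percentage points for every unit increase in the product quality rating.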
The figure below shows the scatter plot and graph of the fitted regression line. It
can be seen that most of the data points are close to, but not exactly on, the
fitted regression line.
c. Next, let us estimate the market share when the product quality rating is 95. The
market share can be predicted using the estimated simple regression equation,
by substituting 𝑋 = 95:
𝑌̂ = −3.1594 + 0.1889(95) = 14.7861
Thus, the market share is predicted to be 14.7861% when the product quality
rating is 95.
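The calculator steps in parts b and c can be cross-checked with a short script. The following Python sketch is one possible way to do it, not the procedure required in this module; the variable names are chosen here for illustration. It recomputes the summations from the data of Example 3, applies the least-squares formulas, and predicts the market share at 𝑋 = 95 both with the rounded coefficients and with the unrounded ones.

```python
# Data from Example 3: product quality rating (X) and market share in % (Y)
x = [27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75]
y = [2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Least-squares slope and intercept from the summation formulas
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Prediction at X = 95: first with coefficients rounded to four decimals
# (as in the text), then with the unrounded coefficients
a4, b4 = round(a, 4), round(b, 4)
y_hat_rounded = a4 + b4 * 95
y_hat_full = a + b * 95

print(n, sum_x, sum_y, sum_x2, sum_xy)
print(a4, b4)
print(round(y_hat_rounded, 4), round(y_hat_full, 4))
```

With the rounded coefficients the prediction is 14.7861, as in the text; carrying full precision gives about 14.7853, so rounding 𝑎 and 𝑏 already affects the third decimal place. A rating of 95 also lies outside the observed ratings (27 to 75), so this prediction extends the line beyond the data.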
When solving for the estimated linear regression equation, it is usually advisable to solve for
the Pearson correlation coefficient to see the magnitude and direction of the linear
relationship between the two variables. Using the values for 𝑛 and the summations above,
we have
$$r = \frac{n\sum xy - \sum x\sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
$$r = \frac{12(5267) - 656(86)}{\sqrt{12(38856) - 656^2}\,\sqrt{12(734) - 86^2}} = 0.9529$$
This means that there is a very strong positive correlation between the product quality rating
and the market share. Thus, the estimated linear regression equation gives a good prediction
of market share.
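The substitution for 𝑟 can be verified in the same way. This is a minimal Python sketch that assumes only the summations already obtained for Example 3; the names `numerator` and `denominator` are introduced here for illustration.

```python
import math

# Summations from Example 3
n, sum_x, sum_y = 12, 656, 86
sum_x2, sum_y2, sum_xy = 38856, 734, 5267

# Pearson correlation coefficient from the summation formula
numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 4))
```

The numerator is the same quantity that appears in the formula for the slope 𝑏, which is why 𝑟 and 𝑏 always have the same sign.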
Note that the values of 𝑎, 𝑏, and 𝑟 can all be obtained directly from the calculator. However,
just as I mentioned in the previous unit, I will still ask you to show your substitution into the
formula in exercises and graded activities. It would also be good for you to compare these
values given by the statistics mode output with the values you obtained by using the formula.
Another way to determine how well the estimated simple regression equation fits the data
is to compute the coefficient of determination. The coefficient of
determination, 𝑟², is the square of the coefficient of correlation. It is used to determine the
proportion of the variance (fluctuation) of one variable that is predictable from the other
variable. It allows us to determine how certain one can be in making predictions from a
certain model/graph. It has values from 0 to +1, and measures how well the fitted regression
line represents the data. That is, 𝑟² is the proportion of the total variation in the dependent
variable 𝑌 that is explained, or accounted for, by the variation in the independent variable 𝑋.
For example, if 𝑟 = 0.922, then 𝑟² = 0.8501. This means that 85.01% of the total variation in 𝑌
can be explained by the linear relationship between 𝑋 and 𝑌. Alternatively, we can say that
85.01% of the variation in 𝑌 is explained by the variation in 𝑋. The other 14.99% of the total
variation in 𝑌 remains unexplained. If the regression line passed exactly through every point
on the scatter plot, it would explain all of the variation. The farther the points lie from the
line, the less of the variation it is able to explain.
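The phrase "proportion of variation explained" can be made precise with the usual sums of squares. The names SST, SSR, and SSE below do not appear elsewhere in this module and are introduced only for illustration; the identities hold for a simple linear regression line fitted by least squares.

```latex
\begin{aligned}
\text{SST} &= \sum (y - \bar{y})^2 && \text{(total variation in } Y\text{)} \\
\text{SSR} &= \sum (\hat{y} - \bar{y})^2 && \text{(variation explained by the fitted line)} \\
\text{SSE} &= \sum (y - \hat{y})^2 && \text{(unexplained variation)} \\
\text{SST} &= \text{SSR} + \text{SSE}, \qquad r^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}
\end{aligned}
```

With 𝑟² = 0.8501, SSR is 85.01% of SST and SSE is the remaining 14.99%.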
Example 4:
Compute the coefficient of determination for the example above and interpret the resulting
value.
Solution:
Using the values from Example 3, we have
𝑟² = (0.9529)² = 0.9080
Thus, 90.80% of the variation in the market share is explained by the variation in the
product quality rating. Alternatively, we can say that 90.80% of the variation in market
share is explained by the linear relationship between product quality rating and
market share.
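The coefficient of determination can also be computed directly as the explained share of the variation in 𝑌, without going through 𝑟. The Python sketch below is an illustration under that assumption; it uses the deviation form of the least-squares formulas, which is algebraically equivalent to the summation form, and the names `sst` and `sse` are introduced here for the total and unexplained variation.

```python
# Data from Example 3
x = [27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75]
y = [2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares fit (deviation form of the slope and intercept formulas)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Total variation in Y and variation left unexplained by the fitted line
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r2_from_sums = 1 - sse / sst        # proportion of variation explained
r2_from_rounded_r = 0.9529 ** 2     # squaring the rounded r, as in Example 4

print(round(r2_from_sums, 4), round(r2_from_rounded_r, 4))
```

Squaring the rounded value 𝑟 = 0.9529 gives 0.9080 as in Example 4, while the unrounded computation gives about 0.9081. The small difference comes only from rounding 𝑟 before squaring.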
When R or any other statistical software is used for regression analysis, the output includes
not only the coefficients of the regression model but also hypothesis test results for the
significance of the regression coefficients. Let us work on
Example 3 using R.
The scatterplot shows a linear trend among the points, which suggests a linear
relationship between the X and Y variables. We can therefore proceed with simple
linear regression modeling.
Call:
lm(formula = Y ~ X, data = market)
Residuals:
    Min      1Q  Median      3Q     Max
-1.2074 -0.6935 -0.1852  0.8093  1.9925
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.15939    1.08148  -2.921   0.0153 *
X            0.18889    0.01901   9.939 1.68e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The R output gives the regression intercept and slope (or coefficient of X). With these, we
can present the estimated simple linear regression model as
𝑌̂ = −3.1594 + 0.1889𝑋
Before using this regression model to predict market share (Y) based on product quality
rating (X), we need to make sure that the regression model is statistically significant.
The Coefficients portion of the output not only gives us the coefficients of the regression
equation but also gives us the p-value for testing the significance of each coefficient. For
the intercept, the p-value is 0.0153, while for the coefficient of X, the p-value is 1.68 × 10⁻⁶.
Both p-values are less than the 0.05 significance level for the hypothesis test, which indicates
that these coefficients are significant in the model.
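The Std. Error and t value columns of the Coefficients table can be reproduced from the data. The Python sketch below uses the standard textbook formulas for the standard errors in simple linear regression, which this module does not state, so treat it as an illustration rather than as the method R uses internally. The p-values are not computed here because they require the t distribution with 𝑛 − 2 = 10 degrees of freedom, which is not available in the Python standard library.

```python
import math

# Data from Example 3
x = [27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75]
y = [2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual variance with n - 2 degrees of freedom
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)

# Standard errors of the slope and of the intercept
se_b = math.sqrt(s2 / s_xx)
se_a = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))

# t values: each estimate divided by its standard error
t_b = b / se_b
t_a = a / se_a

print(round(se_a, 5), round(t_a, 3))
print(round(se_b, 5), round(t_b, 3))
```

The printed values can be compared with the (Intercept) and X rows of the R output above.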
To assess the significance of the regression model as a whole, we now look at the last row
of the output, labeled F-statistic, and its p-value. Here the p-value is 1.682 × 10⁻⁶, which
is less than the 0.05 significance level. This indicates that the regression
model is significant, that is, it can be used to predict or estimate market share based on
product quality rating.
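In simple linear regression, the overall F statistic is the square of the t value of the slope, which is why the p-value of the F test agrees with the p-value reported for X. The Python sketch below illustrates this relationship for Example 3; the names `ssr`, `sse`, and `f_stat` are introduced here for illustration, and the p-value itself is not computed.

```python
# Data from Example 3
x = [27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75]
y = [2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Split the total variation into explained and unexplained parts
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sst - sse

# F statistic with 1 and n - 2 degrees of freedom (one predictor)
f_stat = (ssr / 1) / (sse / (n - 2))

# t value of the slope, for comparison with the Coefficients table
t_b = b / (sse / (n - 2) / s_xx) ** 0.5

print(round(f_stat, 2), round(t_b ** 2, 2))
```

The two printed numbers agree, so for a model with a single independent variable the test of the whole model and the test of the slope lead to the same conclusion.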
(1) A department of transportation’s study on driving speed and miles per gallon for
midsize automobiles resulted in the following data.
a. Plot and interpret the scatter diagram.
b. Find the estimated simple linear regression equation to predict gas
consumption from speed.
c. Compute the coefficient of determination and interpret.
d. Estimate the gas consumption when the speed is 45 miles per hour.
(2) The marketing manager of a large supermarket chain would like to know the
correlation between shelf space and sales of pet food. A random sample of 12 equal-
sized stores is selected, with the following results.
a. Plot and interpret the scatter diagram.
b. Find the estimated simple linear regression equation to predict weekly sales
from shelf space.
c. Compute the coefficient of determination and interpret.
d. Estimate the weekly sales when the shelf space is 18 feet.
Store Shelf Space (Feet) Weekly Sales ($)
1 5 160
2 5 220
3 5 140
4 10 190
5 10 240
6 10 260
7 15 230
8 15 270
9 15 280
10 20 260
11 20 290
12 20 310