Simple Linear Regression
Material from Devore's book (Ed. 8) and Cengagebrain.com
Simple Linear Regression
[Scatterplot: Rating (y-axis, 0 to 80) vs. Sugar (x-axis, 0 to 15)]
[The same scatterplot repeated on the next slide]
[The same scatterplot repeated, with two points marked "x"]
The Simple Linear Regression Model

The simplest deterministic mathematical relationship
between two variables x and y is a linear relationship:
y = β₀ + β₁x.

The objective of this section is to develop an
equivalent linear probabilistic model.

If the two (random) variables are probabilistically related,
then for a fixed value of x, there is uncertainty in the
value of the second variable.

So we assume Y = β₀ + β₁x + ε, where ε is a
random variable.

Two variables are related linearly "on average" if, for
fixed x, the actual value of Y differs from its expected
value by a random amount (i.e., there is random error).
A Linear Probabilistic Model

Definition: The Simple Linear Regression Model

There are parameters β₀, β₁, and σ², such that for
any fixed value of the independent variable x, the
dependent variable is a random variable related to x
through the model equation

Y = β₀ + β₁x + ε

The quantity ε in the model equation is the "error"
-- a random variable, assumed to be symmetrically
distributed with

E(ε) = 0 and V(ε) = σ_ε² = σ²

(no assumption made about the distribution of ε, yet)
A Linear Probabilistic Model

X: the independent, predictor, or explanatory variable
(usually known). NOT RANDOM.

Y: the dependent or response variable. For fixed x, Y is
a random variable.

ε: the random deviation or random error term. For fixed x,
ε is a random variable.

What exactly does ε do?
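To see concretely what ε does, here is a minimal simulation sketch (not from the text; the parameter values β₀ = 2, β₁ = 0.5, σ = 1 are arbitrary choices): for each fixed x, Y lands a random distance ε away from the line β₀ + β₁x.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0
rng = np.random.default_rng(0)

x = np.linspace(0, 10, 8)               # fixed (non-random) x values
eps = rng.normal(0.0, sigma, x.size)    # random error: E(eps) = 0, V(eps) = sigma^2
y = beta0 + beta1 * x + eps             # Y = beta0 + beta1*x + eps

# For each fixed x, Y differs from its expected value beta0 + beta1*x
# by the random amount eps.
print(np.column_stack([x, beta0 + beta1 * x, y]))
```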
A Linear Probabilistic Model

The points (x₁, y₁), …, (xₙ, yₙ) resulting from n
independent observations will then be scattered
about the true regression line:

[Figure: observed points scattered about the true regression line]
A Linear Probabilistic Model
How do we know simple linear
regression is appropriate?
- Theoretical considerations
- Scatterplots
A Linear Probabilistic Model

If we think of an entire population of (x, y) pairs, then
μ_{Y|x*} is the mean of all y values for which x = x*, and
σ²_{Y|x*} is a measure of how much these y values spread
out about the mean value.

If, for example, x = age of a child and y = vocabulary size,
then μ_{Y|5} is the average vocabulary size for all
5-year-old children in the population, and σ²_{Y|5}
describes the amount of variability in vocabulary size
for this part of the population.
A Linear Probabilistic Model

Interpreting parameters:

β₀ (the intercept of the true regression line): the
average value of Y when x is zero.

β₁ (the slope of the true regression line): the expected
(average) change in Y associated with a 1-unit increase
in the value of x.
A Linear Probabilistic Model

What is σ²_{Y|x}? How do we interpret σ²_{Y|x}?

Homoscedasticity: we assume the variance (amount of
variability) of the distribution of Y values to be the
same at each different value of fixed x (i.e., the
homogeneity of variance assumption).
When errors are normally distributed…

[Figure: (a) distribution of ε; (b) distribution of Y for
different values of x]

The variance parameter σ² determines the extent to
which each normal curve spreads out about the
regression line.
A Linear Probabilistic Model

When σ² is small, an observed point (x, y) will almost
always fall quite close to the true regression line,
whereas observations may deviate considerably from
their expected values (corresponding to points far from
the line) when σ² is large.

Thus, this variance can be used to tell us how good
the linear fit is.

But how do we define "good"?
Estimating Model Parameters

The values of β₀, β₁, and σ² will almost never be
known to an investigator.

Instead, sample data consists of n observed pairs
(x₁, y₁), …, (xₙ, yₙ),
from which the model parameters and the true
regression line itself can be estimated.

The data (pairs) are assumed to have been
obtained independently of one another.
Estimating Model Parameters

Where

Yᵢ = β₀ + β₁xᵢ + εᵢ for i = 1, 2, …, n

and the n deviations ε₁, ε₂, …, εₙ are independent
r.v.'s. (Y₁, Y₂, …, Yₙ are independent too. Why?)
Estimating Model Parameters

The "best fit" line is motivated by the principle of
least squares, which can be traced back to the German
mathematician Gauss (1777–1855):

A line provides the best fit to the data if the sum of
the squared vertical distances (deviations) from the
observed points to that line is as small as it can be.
Estimating Model Parameters

The sum of squared vertical deviations from the points
(x₁, y₁), …, (xₙ, yₙ) to the line y = b₀ + b₁x is then

f(b₀, b₁) = Σᵢ [yᵢ – (b₀ + b₁xᵢ)]²

The point estimates of β₀ and β₁, denoted by β̂₀ and β̂₁,
are called the least squares estimates – they are those
values that minimize f(b₀, b₁).
Estimating Model Parameters

The fitted regression line or least squares line is
then the line whose equation is y = β̂₀ + β̂₁x.

The minimizing values of b₀ and b₁ are found by
taking partial derivatives of f(b₀, b₁) with respect to
both b₀ and b₁, equating them both to zero
[analogously to f′(b) = 0 in univariate calculus],
and solving the equations

Σ(yᵢ – b₀ – b₁xᵢ) = 0
Σxᵢ(yᵢ – b₀ – b₁xᵢ) = 0
Estimating Model Parameters

The least squares estimate of the slope coefficient β₁
of the true regression line is

β̂₁ = Sxy / Sxx

Shortcut formulas for the numerator and denominator
of β̂₁ are

Sxy = Σxᵢyᵢ – (Σxᵢ)(Σyᵢ)/n and Sxx = Σxᵢ² – (Σxᵢ)²/n

(Typically, columns for xᵢ, yᵢ, xᵢyᵢ, and xᵢ² are
constructed, and then Sxy and Sxx are calculated.)
Estimating Model Parameters

The least squares estimate of the intercept β₀ of
the true regression line is

β̂₀ = (Σyᵢ – β̂₁Σxᵢ)/n = ȳ – β̂₁x̄

The computational formulas for Sxy and Sxx require
only the summary statistics Σxᵢ, Σyᵢ, Σxᵢ², and Σxᵢyᵢ.
(Σyᵢ² will be needed shortly for the variance.)
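The estimation recipe on the last two slides translates directly into code. A minimal sketch, assuming the data arrive as two equal-length Python lists xs and ys (hypothetical names):

```python
def least_squares(xs, ys):
    """Least squares estimates via the shortcut formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    sxy = sum_xy - sum_x * sum_y / n     # Sxy = Σxy - (Σx)(Σy)/n
    sxx = sum_x2 - sum_x ** 2 / n        # Sxx = Σx² - (Σx)²/n

    b1 = sxy / sxx                       # slope estimate
    b0 = (sum_y - b1 * sum_x) / n        # intercept estimate: ȳ - b1·x̄
    return b0, b1
```

Only the four running sums are needed, which is why the columns-then-sums hand calculation on the previous slide works.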
Example (fitted regression line)
The cetane number is a critical property in
specifying the ignition quality of a fuel used in a
diesel engine.
Determination of this number for a
biodiesel fuel is expensive and time-
consuming.
The article “Relating the Cetane Number of
Biodiesel Fuels to Their Fatty Acid Composition:
A Critical Study” (J. of Automobile Engr., 2009:
565–583) included the following data on x =
iodine value (g) and y = cetane number for a
sample of 14 biofuels (see next slide).
Example (fitted regression line) cont'd

The iodine value (x) is the amount of iodine necessary to
saturate a sample of 100 g of oil. The article's authors fit
the simple linear regression model to this data, so let's do
the same.

Calculating the relevant statistics gives

Σxᵢ = 1307.5, Σyᵢ = 779.2,
Σxᵢ² = 128,913.93, Σxᵢyᵢ = 71,347.30,

from which

Sxx = 128,913.93 – (1307.5)²/14 = 6802.7693
Sxy = 71,347.30 – (1307.5)(779.2)/14 = –1424.41429
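These sums can be plugged straight into the shortcut formulas as a check (n = 14 from the example); the resulting fitted line comes out to roughly ŷ = 75.21 – 0.2094x:

```python
n = 14
sum_x, sum_y = 1307.5, 779.2
sum_x2, sum_xy = 128_913.93, 71_347.30

sxx = sum_x2 - sum_x ** 2 / n       # 6802.7693
sxy = sum_xy - sum_x * sum_y / n    # -1424.41429

b1 = sxy / sxx                      # about -0.2094
b0 = (sum_y - b1 * sum_x) / n       # about 75.21
print(f"fitted line: y-hat = {b0:.3f} + ({b1:.4f})x")
```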
Example (fitted regression line) cont'd

[Figure: scatter plot with the least squares line superimposed]
Fitted Values

Fitted values: the fitted (or predicted) values ŷ₁, …, ŷₙ
are obtained by substituting x₁, …, xₙ into the equation
of the estimated regression line: ŷᵢ = β̂₀ + β̂₁xᵢ.

Residuals: the differences yᵢ – ŷᵢ between the observed
and fitted y values.

Residuals are estimates of the true error – WHY?
Sum of the residuals

When the estimated regression line is obtained via the
principle of least squares, the sum of the residuals
should in theory be zero, if the error distribution is
symmetric, since

Σ(yᵢ – ŷᵢ) = 0
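A quick numerical check of this fact, on made-up data (hypothetical values, for illustration only):

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least squares fit
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(residuals.sum())   # essentially 0, up to floating-point rounding
```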
Example (fitted values)

Suppose we have the following data on filtration rate (x)
versus moisture content (y):

Relevant summary quantities (summary statistics) are

Σxᵢ = 2817.9, Σyᵢ = 1574.8, Σxᵢ² = 415,949.85,
Σxᵢyᵢ = 222,657.88, and Σyᵢ² = 124,039.58,

from which Sxx = 18,921.8295 and Sxy = 776.434.

Calculation of residuals?
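The slide does not state n here, but the quoted Sxx is consistent with n = 20 (since (2817.9)²/20 = 415,949.85 – 18,921.8295); under that assumption the coefficient estimates follow immediately. Computing the residuals themselves additionally requires the raw (xᵢ, yᵢ) pairs from the table on the next slide.

```python
n = 20   # assumption: inferred from the quoted Sxx value
sum_x, sum_y = 2817.9, 1574.8
sxx, sxy = 18_921.8295, 776.434

b1 = sxy / sxx                    # about 0.04103
b0 = (sum_y - b1 * sum_x) / n     # about 72.96
print(f"y-hat = {b0:.3f} + {b1:.5f} x")
```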
Example (fitted values) cont'd

All predicted values (fits) and residuals appear in the
accompanying table.
Fitted Values

We interpret the fitted value ŷᵢ as the value of y that we
would predict or expect when using the estimated regression
line with x = xᵢ; thus ŷᵢ is the estimated true mean for
that population when x = xᵢ (based on the data).

The residual is a positive number if the point lies above
the line and a negative number if it lies below the line;
the point (xᵢ, ŷᵢ) falls on the line itself.

The residual can be thought of as a measure of deviation,
and we can summarize the notation in the following way:

Yᵢ = β̂₀ + β̂₁xᵢ + ε̂ᵢ = Ŷᵢ + ε̂ᵢ  ⇒  Yᵢ – Ŷᵢ = ε̂ᵢ
Residual Plots

Revenue = 2.7 * Temperature – 35
Residual = Observed – Predicted

Temperature (Celsius) | Revenue (Observed) | Revenue (Predicted) | Residual (Observed – Predicted)
28.2                  | $44                | $41                 | $3
21.4                  | $23                | $23                 | $0
32.9                  | $43                | $54                 | -$11
24.0                  | $30                | $29                 | $1
etc.                  | etc.               | etc.                | etc.
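The table rows follow mechanically from the stated model; a sketch (predictions are rounded to whole dollars here, so a row may differ from the table by $1):

```python
def predicted_revenue(temp_c):
    return 2.7 * temp_c - 35   # Revenue = 2.7 * Temperature - 35

rows = [(28.2, 44), (21.4, 23), (32.9, 43), (24.0, 30)]  # (temp, observed $)
for temp, observed in rows:
    pred = round(predicted_revenue(temp))
    residual = observed - pred          # Residual = Observed - Predicted
    print(f"{temp:5.1f} C  obs ${observed}  pred ${pred}  resid ${residual:+d}")
```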
Residual Plots (contd.)

Same regression run on two different lemonade stands: one
where the model is very accurate, one where the model is not.
Residual Plots (contd.)

Ideally, residual plots look like these, i.e.:

1. They're pretty symmetrically distributed, tending to
cluster towards the middle of the plot.

2. They're clustered around the lower single digits of the
y-axis (e.g., 0.5 or 1.5, not 30 or 150).

3. In general, there aren't any clear patterns.
Residual Plots (contd.)
Some not-so-ideal residual plots:
Example Residual Plots and Their Diagnoses: Y-Axis Imbalanced

An exceptionally high value of Y for normal values of X.
Example Residual Plots and Their Diagnoses: Heteroscedasticity

Heteroscedasticity means that the residuals get larger as the
prediction moves from small to large (or from large to small).
Example Residual Plots and Their Diagnoses: Nonlinear

A nonlinear pattern means your model doesn't accurately
represent the relationship between "Temperature" and "Revenue."
Example Residual Plots and Their Diagnoses: Outliers

• If the outlier is a data entry error (the value is just
wrong), delete it.
• If it is a legitimate outlier, assess its impact on the fit.
Outliers
Data points that diverge in a big way from the overall
pattern are called outliers. There are four ways that a data
point might be considered an outlier.
• It could have an extreme X value compared to other
data points.
• It could have an extreme Y value compared to other
data points.
• It could have extreme X and Y values.
• It might be distant from the rest of the data, even
without extreme X or Y values.
Outliers (contd.)

Each type of outlier is depicted graphically in the
scatterplots below.

[Figure: scatterplots illustrating the four types of outliers]
Influential Points
An influential point is an outlier that greatly affects the
slope of the regression line. One way to test the influence
of an outlier is to compute the regression equation with
and without the outlier.
Influential Points (contd.)

This type of analysis is illustrated below. The scatterplots
are identical, except that one plot includes an outlier. When
the outlier is present, the slope is flatter (-4.10 vs. -3.32),
so this outlier would be considered an influential point.
Influential Points (contd.)
Here, one chart has a single outlier, located at the high
end of the X axis (where x = 24). As a result of that single
outlier, the slope of the regression line changes greatly,
from -2.5 to -1.6; so the outlier would be considered an
influential point.
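A sketch of the with/without comparison described above, on hypothetical data (the y values are made up; the x = 24 outlier echoes the slide's example):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 24.0])   # last point: suspected outlier
y = np.array([20.0, 15.0, 12.0, 7.0, 3.0, 10.0])

slope_with, _ = np.polyfit(x, y, 1)               # fit including the outlier
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)  # fit excluding it

print(f"slope with outlier:    {slope_with:.2f}")
print(f"slope without outlier: {slope_without:.2f}")
# A large change in slope marks the point as influential.
```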