Simple Linear Regression
Samir Orujov
Introduction
In this lecture we will learn about the following:
1. The nature of the simple linear regression (SLR) model;
2. How to estimate the parameters of the SLR model using least
squares (LS);
3. The three key assumptions needed for LS to produce reliable
estimators;
4. What happens when these assumptions are violated;
5. Simulations of the sampling distribution of the LS estimators.
R packages
We need the following packages:
▶ AER - accompanies the book (Kleiber and Zeileis 2008) and
provides useful functions and data sets.
▶ MASS - a collection of functions for applied statistics
library(AER)
library(MASS)
Motivating example: Student Performance and
Student-to-Teacher Ratio
For now, let us suppose that the systematic relationship between the
test score and the student-teacher ratio is
$$TestScore = 713 - 3 \times STR.$$
# Create sample data
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
TestScore <- c(680, 640, 670, 660, 630, 660, 635)
Motivating example: Scatterplot
R code to generate a scatterplot:
# create a scatterplot of the data
plot(TestScore ~ STR,ylab="Test Score",pch=20)
# add the systematic relationship to the plot
abline(a = 713, b = -3)
[Figure: scatterplot of the sample data with the line $TestScore = 713 - 3 \times STR$ overlaid; Test Score on the vertical axis, STR on the horizontal axis.]
Motivating Example: The Relation Is not Systematic
We find that the line does not pass through any of the points,
although we claimed that it represents the systematic relationship.
▶ Randomness and/or additional influences.
▶ To account for these differences between observed data and
the systematic relationship, we add an error term 𝑢.
▶ Put differently, 𝑢 accounts for all the differences between the
regression line and the actual observed data.
▶ These deviations could also arise from measurement errors or from
leaving out other factors that are relevant in explaining the
dependent variable.
▶ We summarize everything that we cannot account for under
𝑢, and thus the model becomes:
$$TestScore = \beta_0 + \beta_1 \times STR + \text{other factors}.$$
Terminology for the Regression Model With A Single
Regressor
The linear regression model is
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖
where the index 𝑖 runs over the observations, 𝑖 = 1, ..., 𝑛
▶ 𝑌𝑖 is the dependent variable, the regressand, or simply the
left-hand variable
▶ 𝑋𝑖 is the independent variable, the regressor, or simply the
right-hand variable
▶ 𝑌 = 𝛽0 + 𝛽1 𝑋 is the population regression line, also called
the population regression function
▶ 𝛽0 is the intercept of the population regression line
▶ 𝛽1 is the slope of the population regression line
▶ 𝑢𝑖 is the error term.
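To make the terminology concrete, here is a minimal sketch (the parameter values, sample size, and error standard deviation are illustrative assumptions) that generates data from a population regression line plus an error term:
# a minimal sketch: data generated from a population regression line plus noise
# (beta_0, beta_1, the sample size, and sd are illustrative choices)
set.seed(42)
beta_0 <- 713
beta_1 <- -3
X <- runif(20, min = 14, max = 26)   # regressor
u <- rnorm(20, sd = 15)              # error term: everything else that affects Y
Y <- beta_0 + beta_1 * X + u         # observations scatter around the line
plot(X, Y, pch = 20)
abline(a = beta_0, b = beta_1)       # the population regression line
The vertical distance between each point and the line is the realized error 𝑢𝑖.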
Estimating the Coefficients of the Linear Regression Model
▶ First, we will introduce the California Schools dataset
(CASchools) from the AER package;
▶ It contains California schools with their characteristics, e.g.,
number of teachers, students, location, etc.;
▶ We create two new variables called STR and score, compute
descriptive statistics for them, and make a scatterplot:
▶ STR is the student-to-teacher ratio in each school;
▶ score is the mean of the average math and reading scores in each
school.
▶ Estimate Model Parameters Using the Data:
▶ Brute Force Method;
▶ Using the built-in lm() function in R.
The California Schools Dataset
# attach the AER package and load the CASchools dataset
library(AER)
data("CASchools")
# compute STR and append it to CASchools
CASchools$STR <- CASchools$students/CASchools$teachers
# compute TestScore and append it to CASchools
CASchools$score <- (CASchools$read + CASchools$math)/2
# compute sample averages of STR and score
avg_STR <- mean(CASchools$STR)
avg_score <- mean(CASchools$score)
# compute sample standard deviations of STR and score
sd_STR <- sd(CASchools$STR)
sd_score <- sd(CASchools$score)
Distribution Summary for California Schools
# set up a vector of percentiles and compute the quantiles
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)
# gather everything in a data.frame
DistributionSummary <-
data.frame(Average = c(avg_STR, avg_score),
StandardDeviation = c(sd_STR, sd_score),
quantile = rbind(quant_STR, quant_score))
Distribution Summary - California Dataset
Average StandardDeviation quantile.10. quantile.25. quantile.40. quantile.50. quantile.60. quantile.75. quantile.90.
quant_STR 19.64043 1.891812 17.3486 18.58236 19.26618 19.72321 20.0783 20.87181 21.86741
quant_score 654.15655 19.053347 630.3950 640.05000 649.06999 654.45000 659.4000 666.66249 678.85999
Scatterplot of STR and score
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of Test Score and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)")
[Figure: "Scatterplot of Test Score and STR" — Test Score (Y) on the vertical axis, STR (X) on the horizontal axis.]
Correlation of STR and score
We can compute the correlation in the following way:
cor(CASchools$STR, CASchools$score)
[1] -0.2263627
▶ We can see that the correlation is negative but rather weak.
▶ The next, and most important, question is: how do we fit the
best-fitting line?
The Ordinary Least Squares (OLS) Estimator
Suppose the true regression relationship is:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖 .
Suppose the pairs (𝑌𝑖 , 𝑋𝑖 ) are i.i.d. draws for all 𝑖 = 1, … , 𝑛.
Then OLS seeks to minimize the following:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$
Taking derivatives w.r.t. 𝛽0 and 𝛽1 and setting the results to zero
(remember your calculus?) gives two FOCs in two unknowns.
Thus, the solution for 𝛽0 and 𝛽1 is unique:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
OLS Estimator
▶ Note that the hat (ˆ) indicates that a quantity is estimated from
the sample data; thus 𝛽0̂ and 𝛽1̂ are not the true parameters but
estimators of them.
How to get OLS estimates using R?
There are two ways:
▶ We can explicitly compute the values of estimates using R;
▶ We can use the built-in lm() function in R.
# compute beta_1_hat
STR <- CASchools$STR
score <- CASchools$score
beta_1 <- sum((STR - mean(STR)) * (score - mean(score))) /
sum((STR - mean(STR))^2)
# compute beta_0_hat
beta_0 <- mean(score) - beta_1 * mean(STR)
beta_0
[1] 698.9329
beta_1
[1] -2.279808
How to Get OLS Estimators Using The Built-in Function?
Now, perform the same calculations using the built-in function:
linear_model <- lm(score~STR, data = CASchools)
summary(linear_model)
How to Get OLS Estimators Using The Built-in Function?
Call:
lm(formula = score ~ STR, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
STR -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
Scatterplot with the best fit line
# plot the data
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of Test Score and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)",
xlim = c(10, 30),
ylim = c(600, 720))
# add the regression line
abline(linear_model)
Scatterplot with the best fit line
[Figure: "Scatterplot of Test Score and STR" with the fitted OLS regression line; Test Score (Y) on the vertical axis, STR (X) on the horizontal axis.]
Measures of Fit
After fitting a linear regression model, a natural question is how
well the model describes the data.
▶ Visually, this amounts to assessing whether the observations
are tightly clustered around the regression line.
▶ Both the coefficient of determination and the standard error
of the regression measure how well the OLS Regression line
fits the data.
The Coefficient of Determination
𝑅2 , the coefficient of determination, is the fraction of the sample variance of 𝑌𝑖 that is explained by 𝑋𝑖 .
▶ Mathematically, 𝑅2 is the ratio of the explained sum of squares to the total sum of squares.
▶ The explained sum of squares (𝐸𝑆𝑆) is the sum of squared deviations of the predicted values 𝑌𝑖̂ from the
average of the 𝑌𝑖 .
▶ The total sum of squares (𝑇𝑆𝑆) is the sum of squared deviations of the 𝑌𝑖 from their average. Thus we have
$$ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad R^2 = \frac{ESS}{TSS}.$$
▶ Since 𝑇𝑆𝑆 = 𝐸𝑆𝑆 + 𝑆𝑆𝑅, we can also write
$$R^2 = 1 - \frac{SSR}{TSS},$$
where 𝑆𝑆𝑅 is the sum of squared residuals, a measure of the errors made when predicting 𝑌 by 𝑋:
$$SSR = \sum_{i=1}^{n} \hat{u}_i^2.$$
▶ 𝑅2 lies between 0 and 1. A perfect fit, i.e., no errors made when fitting the regression line, implies
𝑆𝑆𝑅 = 0 and hence 𝑅2 = 1.
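As a quick numerical illustration of the ESS/TSS definition, here is a minimal sketch, assuming the linear_model object and the score vector from the earlier slides are still in the workspace:
# compute R^2 via the ESS/TSS route
Y_hat <- fitted(linear_model)          # predicted values
ESS <- sum((Y_hat - mean(score))^2)    # explained sum of squares
TSS <- sum((score - mean(score))^2)    # total sum of squares
ESS / TSS                              # matches the Multiple R-squared (about 0.051)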
The Standard Error of the Regression
The Standard Error of the Regression (𝑆𝐸𝑅) is an estimator of
the standard deviation of the residuals 𝑢̂𝑖 . As such it measures the
magnitude of a typical deviation from the regression line, i.e., the
magnitude of a typical residual.
$$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2} \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n-2}$$
Remember that the 𝑢𝑖 are unobserved. This is why we use their
estimated counterparts, the residuals 𝑢̂𝑖 , instead. See Chapter 4.3
of the book for a more detailed comment on the 𝑆𝐸𝑅.
Application to the Test Score Data
The 𝑅2 in the output is called Multiple R-squared and has a value of 0.051. Hence, 5.1% of the variance of
the dependent variable 𝑠𝑐𝑜𝑟𝑒 is explained by the explanatory variable 𝑆𝑇 𝑅. That is, the regression explains
little of the variance in 𝑠𝑐𝑜𝑟𝑒, and much of the variation in test scores remains unexplained.
mod_summary <- summary(linear_model)
mod_summary
Call:
lm(formula = score ~ STR, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
STR -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
SER
The 𝑆𝐸𝑅 is called Residual standard error in the output and equals 18.58. The unit of the 𝑆𝐸𝑅 is the same as the
unit of the dependent variable. That is, on average, the actual test score deviates from the regression line by
18.58 points.
# compute R^2 manually
SSR <- sum(mod_summary$residuals^2)
TSS <- sum((score - mean(score))^2)
R2 <- 1 - SSR/TSS
# print the value to the console
R2
[1] 0.05124009
# compute SER manually
n <- nrow(CASchools)
SER <- sqrt(SSR / (n-2))
# print the value to the console
SER
[1] 18.58097
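As a cross-check, both quantities are also stored in the object returned by summary() for an lm fit (its r.squared and sigma components):
# read the same values directly from the summary object
mod_summary$r.squared   # R^2
mod_summary$sigma       # SER (residual standard error)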
The Least Squares Assumptions
OLS performs well in many circumstances, but we need the
following assumptions for the estimators to be normally distributed
in large samples (so that the CLT applies):
▶ The error term 𝑢𝑖 has conditional mean zero given 𝑋𝑖 :
𝐸(𝑢𝑖 |𝑋𝑖 ) = 0.
▶ (𝑋𝑖 , 𝑌𝑖 ), 𝑖 = 1, … , 𝑛 are independent and identically
distributed (i.i.d.) draws from their joint distribution.
▶ Large outliers are unlikely: 𝑋𝑖 and 𝑌𝑖 have nonzero finite
fourth moments.
Assumption 1. 𝐸(𝑢𝑖 |𝑋𝑖 ) = 0.
▶ This assumption makes sure that error terms do not have any
systematic pattern, because otherwise, the model specification
is wrong.
▶ To put it differently, if we fit a linear model to data that were
generated by a non-linear model, the errors will show a
systematic pattern and Assumption 1 will be violated.
How to show it by simulations?
▶ We generate data based on a quadratic model:
$$Y_i = X_i^2 + 2X_i + u_i.$$
▶ We fit both a linear and a quadratic model to the data and
overlay both fits on the same figure.
▶ We will observe that Assumption 1 is violated for the linear
model, as there is a systematic pattern in the errors, but it holds
for the quadratic model.
R functions
We need the following R functions for simulation:
▶ runif() - generates uniformly distributed random numbers
▶ rnorm() - generates normally distributed random numbers
▶ predict() - does predictions based on the results of model
fitting functions like lm()
▶ lines() - adds line segments to an existing plot
R Simulation about Assumption 1.
# set a seed to make the results reproducible
set.seed(321)
# simulate the data
X <- runif(50, min = -5, max = 5)
u <- rnorm(50, sd = 1)
# the true relation
Y <- X^2 + 2 * X + u
# estimate a simple regression model
mod_simple <- lm(Y ~ X)
# estimate a quadratic regression model
mod_quadratic <- lm( Y ~ X + I(X^2))
# predict using a quadratic model
prediction <- predict(mod_quadratic, data.frame(X = sort(X)))
# plot the results
plot( Y ~ X, col = "black", pch = 20, xlab = "X", ylab = "Y")
# blue line: the misspecified linear fit (its residuals show a systematic
# pattern, violating the first OLS assumption)
abline(mod_simple, col = "blue", lwd = 2)
# red line: the correctly specified quadratic fit
lines(sort(X), prediction, col = "red", lwd = 2)
legend("topleft",
legend = c("Simple Regression Model",
"Quadratic Model"),
cex = 1,
lty = 1,
col = c("blue","red"))
R Simulation about Assumption 1.
[Figure: simulated data with the misspecified simple regression fit (blue) and the quadratic fit (red); Y on the vertical axis, X on the horizontal axis.]
Assumption 2. Independently and Identically Distributed
Data
A sampling scheme based on simple random sampling ensures that the observations (𝑌𝑖 , 𝑋𝑖 ) are i.i.d. for all
𝑖 = 1, … , 𝑛.
▶ A prominent example where i.i.d. does not hold is time series data: a company that cuts the number of its
employees over time produces employment figures with a systematic pattern.
▶ Likewise, a country's GDP shows a pattern over time.
The R simulation about the employee example:
# set seed
set.seed(123)
# generate a date vector
Date <- seq(as.Date("1951/1/1"), as.Date("2000/1/1"), "years")
# initialize the employment vector
X <- c(5000, rep(NA, length(Date)-1))
# generate time series observations with random influences
for (t in 2:length(Date)) {
  X[t] <- -50 + 0.98 * X[t-1] + rnorm(n = 1, sd = 200)
}
# plot the results
plot(x = Date,
y = X,
type = "l",
col = "steelblue",
ylab = "Workers",
xlab = "Time",
lwd=2)
R Simulation about Assumption 2.
[Figure: simulated number of workers over time (1951–2000), declining over the sample period; Workers on the vertical axis, Time on the horizontal axis.]
Assumption 3. Large Outliers are Unlikely
This assumption makes sure that both 𝑋 and 𝑌 have a finite
kurtosis. Why do we need it?
▶ OLS suffers from sensitivity to outliers.
▶ One can show that extreme observations receive heavy
weighting in the estimation of the unknown regression
coefficients when using OLS.
▶ Therefore, outliers can lead to strongly distorted estimates of
regression coefficients.
Simulation about Assumption 3.
# set seed
set.seed(123)
# generate the data
X <- sort(runif(10, min = 30, max = 70))
Y <- rnorm(10 , mean = 200, sd = 50)
Y[9] <- 2000
# fit model with outlier
fit <- lm(Y ~ X)
# fit model without outlier
fitWithoutOutlier <- lm(Y[-9] ~ X[-9])
# plot the results
plot(Y ~ X,pch=20)
abline(fit,lwd=2,col="blue")
abline(fitWithoutOutlier, col = "red",lwd=2)
legend("topleft",
legend = c("Model with Outlier",
"Model without Outlier"),
cex = 1,
lty = 1,
col = c("blue","red"))
Simulation about Assumption 3.
[Figure: scatterplot of the simulated data; the fitted line including the outlier (blue) is pulled strongly away from the fit excluding it (red); Y on the vertical axis, X on the horizontal axis.]
Sampling Distribution of OLS estimators.
As 𝑛 → ∞, the sampling distribution of the OLS estimators
approaches normality, given Assumptions (1)-(3). Mathematically,
$$\hat{\beta}_1 \overset{d}{\sim} N\!\left(\beta_1, Var(\hat{\beta}_1)\right)$$
$$Var(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma_i^2}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^2}$$
$$Var(\hat{\beta}_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)$$
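A minimal simulation sketch of this large-sample normality (the data-generating process, sample size, and number of replications are illustrative assumptions): draw many samples, re-estimate the slope on each, and compare the histogram of 𝛽1̂ with a normal density.
# simulate the sampling distribution of the slope estimator
# (true parameters, n, and the number of replications are illustrative choices)
set.seed(1)
n <- 100                      # observations per sample
reps <- 1000                  # number of simulated samples
true_beta_0 <- 698.9          # illustrative intercept
true_beta_1 <- -2.28          # illustrative slope
beta_1_hats <- numeric(reps)
for (r in 1:reps) {
  X <- runif(n, min = 14, max = 26)            # regressor
  u <- rnorm(n, sd = 18.6)                     # error term
  Y <- true_beta_0 + true_beta_1 * X + u       # data-generating process
  beta_1_hats[r] <- coef(lm(Y ~ X))[2]         # store the slope estimate
}
# histogram of the estimates with a normal density overlaid
hist(beta_1_hats, breaks = 30, freq = FALSE,
     main = "Sampling Distribution of the Slope Estimator",
     xlab = expression(hat(beta)[1]))
curve(dnorm(x, mean = mean(beta_1_hats), sd = sd(beta_1_hats)),
      col = "red", lwd = 2, add = TRUE)
The histogram should be approximately bell-shaped and centred near the true slope, illustrating the large-sample normality of the OLS estimator.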
Conclusion
To estimate the linear relationship between two variables we use
OLS. OLS estimators are approximately normally distributed if the
following assumptions hold:
▶ 𝐸(𝑢𝑖 |𝑋𝑖 ) = 0;
▶ (𝑌𝑖 , 𝑋𝑖 ) are i.i.d. samples from a population;
▶ Large outliers are unlikely, meaning both X and Y have finite
kurtosis.
References
Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics
with R. Springer New York.
https://doi.org/10.1007/978-0-387-77318-6.