Simple Linear Regression
Samir Orujov
Introduction
In this lecture we will learn about the following:
1. The nature of the simple linear regression (SLR) model;
2. How to estimate the parameters of the SLR model using least
squares (LS);
3. The three key assumptions needed for LS to produce reliable
estimators;
4. What happens when these assumptions are violated;
5. Simulations of the sampling distribution of the LS estimators.
R packages
We need the following packages:
▶ AER - accompanies the book (Kleiber and Zeileis 2008) and
provides useful functions and data sets.
▶ MASS - a collection of functions for applied statistics
library(AER)
library(MASS)
Motivating example: Student Performance and
Student-to-Teacher Ratio
For now, let us suppose that the systematic relationship between the
test score and the student-teacher ratio is
$$TestScore = 713 - 3 \times STR.$$
# Create sample data
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
TestScore <- c(680, 640, 670, 660, 630, 660, 635)
Motivating example: Scatterplot
R code to generate a scatterplot:
# create a scatterplot of the data
plot(TestScore ~ STR,ylab="Test Score",pch=20)
# add the systematic relationship to the plot
abline(a = 713, b = -3)
[Figure: scatterplot of the sample data with the line $TestScore = 713 - 3 \times STR$ overlaid; Test Score on the vertical axis, STR on the horizontal axis.]
Motivating Example: The Relation Is not Systematic
We find that the line does not pass through any of the points,
although we claimed that it represents the systematic relationship.
▶ Randomness and/or additional influences.
▶ To account for these differences between observed data and
the systematic relationship, we add an error term 𝑢.
▶ Put differently, 𝑢 accounts for all the differences between the
regression line and the actual observed data.
▶ These deviations could also arise from measurement errors or from
leaving out other factors that are relevant in explaining the
dependent variable.
▶ We summarize everything that we cannot account for under
𝑢, and thus the model becomes:
$$TestScore = \beta_0 + \beta_1 \times STR + \text{other factors}.$$
Terminology for the Regression Model With A Single
Regressor
The linear regression model is
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖
where the index 𝑖 runs over the observations, 𝑖 = 1, ..., 𝑛
▶ 𝑌𝑖 is the dependent variable, the regressand, or simply the
left-hand variable
▶ 𝑋𝑖 is the independent variable, the regressor, or simply the
right-hand variable
▶ 𝑌 = 𝛽0 + 𝛽1 𝑋 is the population regression line, also called
the population regression function
▶ 𝛽0 is the intercept of the population regression line
▶ 𝛽1 is the slope of the population regression line
▶ 𝑢𝑖 is the error term.
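To make the terminology concrete, here is a minimal sketch (the parameter values, sample size, and error standard deviation are illustrative assumptions) that generates data from a population regression line plus an error term:
# a minimal sketch: data generated from a population regression line plus noise
# (beta_0, beta_1, the sample size, and sd are illustrative choices)
set.seed(42)
beta_0 <- 713
beta_1 <- -3
X <- runif(20, min = 14, max = 26)   # regressor
u <- rnorm(20, sd = 15)              # error term: everything else that affects Y
Y <- beta_0 + beta_1 * X + u         # observations scatter around the line
plot(X, Y, pch = 20)
abline(a = beta_0, b = beta_1)       # the population regression line
The vertical distance between each point and the line is the realized error 𝑢𝑖.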
Estimating the Coefficients of the Linear Regression Model
▶ First, we will introduce the California Schools dataset
(CASchools) from the AER package;
▶ It contains California schools with their characteristics, e.g.,
number of teachers, students, location, etc.;
▶ We create two new variables called STR and score, compute
descriptive statistics for them, and make a scatterplot:
▶ STR is the student-to-teacher ratio in each school;
▶ score is the mean of the average math and reading scores in each
school.
▶ Estimate Model Parameters Using the Data:
▶ Brute Force Method;
▶ Using the built-in lm() function in R.
The California Schools Dataset
# attach the AER package and load the CASchools dataset
library(AER)
data("CASchools")
# compute STR and append it to CASchools
CASchools$STR <- CASchools$students/CASchools$teachers
# compute TestScore and append it to CASchools
CASchools$score <- (CASchools$read + CASchools$math)/2
# compute sample averages of STR and score
avg_STR <- mean(CASchools$STR)
avg_score <- mean(CASchools$score)
# compute sample standard deviations of STR and score
sd_STR <- sd(CASchools$STR)
sd_score <- sd(CASchools$score)
Distribution Summary for California Schools
# set up a vector of percentiles and compute the quantiles
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)
# gather everything in a data.frame
DistributionSummary <-
data.frame(Average = c(avg_STR, avg_score),
StandardDeviation = c(sd_STR, sd_score),
quantile = rbind(quant_STR, quant_score))
Distribution Summary - California Dataset
Average StandardDeviation quantile.10. quantile.25. quantile.40. quantile.50. quantile.60. quantile.75. quantile.90.
quant_STR 19.64043 1.891812 17.3486 18.58236 19.26618 19.72321 20.0783 20.87181 21.86741
quant_score 654.15655 19.053347 630.3950 640.05000 649.06999 654.45000 659.4000 666.66249 678.85999
Scatterplot of STR and score
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of Test Score and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)")
[Figure: "Scatterplot of Test Score and STR" — Test Score (Y) on the vertical axis, STR (X) on the horizontal axis.]
Correlation of STR and score
We can compute the correlation in the following way:
cor(CASchools$STR, CASchools$score)
[1] -0.2263627
▶ We can see that the correlation is negative but rather weak.
▶ The next, and most important, question is: how do we fit the
best-fitting line?
The Ordinary Least Squares (OLS) Estimator
Suppose the true regression relationship is:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖 .
Suppose the pairs (𝑌𝑖 , 𝑋𝑖 ) are i.i.d. draws for all 𝑖 = 1, … , 𝑛.
Then OLS seeks to minimize the following:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$
Taking derivatives w.r.t. 𝛽0 and 𝛽1 and setting the results to zero
(remember your calculus?) gives two FOCs in two unknowns.
Thus, the solution for 𝛽0 and 𝛽1 is unique:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
OLS Estimator
▶ Note that the hat (ˆ) indicates that a quantity is estimated from
the sample data; thus 𝛽0̂ and 𝛽1̂ are not the true parameters but
estimators of them.
How to get OLS estimates using R?
There are two ways:
▶ We can explicitly compute the values of estimates using R;
▶ We can use the built-in lm() function in R.
# compute beta_1_hat
STR <- CASchools$STR
score <- CASchools$score
beta_1 <- sum((STR - mean(STR)) * (score - mean(score))) /
sum((STR - mean(STR))^2)
# compute beta_0_hat
beta_0 <- mean(score) - beta_1 * mean(STR)
beta_0
[1] 698.9329
beta_1
[1] -2.279808
How to Get OLS Estimators Using The Built-in Function?
Now, perform the same calculations using the built-in function:
linear_model <- lm(score~STR, data = CASchools)
summary(linear_model)
How to Get OLS Estimators Using The Built-in Function?
Call:
lm(formula = score ~ STR, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
STR -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
Scatterplot with the best fit line
# plot the data
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of Test Score and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)",
xlim = c(10, 30),
ylim = c(600, 720))
# add the regression line
abline(linear_model)
Scatterplot with the best fit line
[Figure: "Scatterplot of Test Score and STR" with the fitted OLS regression line; Test Score (Y) on the vertical axis, STR (X) on the horizontal axis.]
Measures of Fit
After fitting a linear regression model, a natural question is how
well the model describes the data.
▶ Visually, this amounts to assessing whether the observations
are tightly clustered around the regression line.
▶ Both the coefficient of determination and the standard error
of the regression measure how well the OLS Regression line
fits the data.
The Coefficient of Determination
𝑅2 , the coefficient of determination, is the fraction of the sample variance of 𝑌𝑖 that is explained by 𝑋𝑖 .
▶ Mathematically, 𝑅2 is the ratio of the explained sum of squares to the total sum of squares.
▶ The explained sum of squares (𝐸𝑆𝑆) is the sum of squared deviations of the predicted values 𝑌𝑖̂ from the
average of the 𝑌𝑖 .
▶ The total sum of squares (𝑇𝑆𝑆) is the sum of squared deviations of the 𝑌𝑖 from their average. Thus we have
$$ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad R^2 = \frac{ESS}{TSS}.$$
▶ Since 𝑇𝑆𝑆 = 𝐸𝑆𝑆 + 𝑆𝑆𝑅, we can also write
$$R^2 = 1 - \frac{SSR}{TSS},$$
where 𝑆𝑆𝑅 is the sum of squared residuals, a measure of the errors made when predicting 𝑌 by 𝑋:
$$SSR = \sum_{i=1}^{n} \hat{u}_i^2.$$
▶ 𝑅2 lies between 0 and 1. A perfect fit, i.e., no errors made when fitting the regression line, implies
𝑆𝑆𝑅 = 0 and hence 𝑅2 = 1.
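As a quick numerical illustration of the ESS/TSS definition, here is a minimal sketch, assuming the linear_model object and the score vector from the earlier slides are still in the workspace:
# compute R^2 via the ESS/TSS route
Y_hat <- fitted(linear_model)          # predicted values
ESS <- sum((Y_hat - mean(score))^2)    # explained sum of squares
TSS <- sum((score - mean(score))^2)    # total sum of squares
ESS / TSS                              # matches the Multiple R-squared (about 0.051)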
The Standard Error of the Regression
The Standard Error of the Regression (𝑆𝐸𝑅) is an estimator of
the standard deviation of the residuals 𝑢̂𝑖 . As such it measures the
magnitude of a typical deviation from the regression line, i.e., the
magnitude of a typical residual.
$$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2} \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n-2}$$
Remember that the 𝑢𝑖 are unobserved. This is why we use their
estimated counterparts, the residuals 𝑢̂𝑖 , instead. See Chapter 4.3
of the book for a more detailed comment on the 𝑆𝐸𝑅.
Application to the Test Score Data
The 𝑅2 in the output is called Multiple R-squared and has a value of 0.051. Hence, 5.1% of the variance of
the dependent variable 𝑠𝑐𝑜𝑟𝑒 is explained by the explanatory variable 𝑆𝑇 𝑅. That is, the regression explains
little of the variance in 𝑠𝑐𝑜𝑟𝑒, and much of the variation in test scores remains unexplained.
mod_summary <- summary(linear_model)
mod_summary
Call:
lm(formula = score ~ STR, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
STR -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
SER
The 𝑆𝐸𝑅 is called Residual standard error in the output and equals 18.58. The unit of the 𝑆𝐸𝑅 is the same as the
unit of the dependent variable. That is, on average, the actual test score deviates from the regression line by
18.58 points.
# compute R^2 manually
SSR <- sum(mod_summary$residuals^2)
TSS <- sum((score - mean(score))^2)
R2 <- 1 - SSR/TSS
# print the value to the console
R2
[1] 0.05124009
# compute SER manually
n <- nrow(CASchools)
SER <- sqrt(SSR / (n-2))
# print the value to the console
SER
[1] 18.58097
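As a cross-check, both quantities are also stored in the object returned by summary() for an lm fit (its r.squared and sigma components):
# read the same values directly from the summary object
mod_summary$r.squared   # R^2
mod_summary$sigma       # SER (residual standard error)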
The Least Squares Assumptions
OLS performs well in many circumstances, but we need the
following assumptions for the estimators to be normally distributed
in large samples (so that the CLT applies):
▶ The error term 𝑢𝑖 has conditional mean zero given 𝑋𝑖 :
𝐸(𝑢𝑖 |𝑋𝑖 ) = 0.
▶ (𝑋𝑖 , 𝑌𝑖 ), 𝑖 = 1, … , 𝑛 are independent and identically
distributed (i.i.d.) draws from their joint distribution.
▶ Large outliers are unlikely: 𝑋𝑖 and 𝑌𝑖 have nonzero finite
fourth moments.
Assumption 1. 𝐸(𝑢𝑖 |𝑋𝑖 ) = 0.
▶ This assumption makes sure that error terms do not have any
systematic pattern, because otherwise, the model specification
is wrong.
▶ To put it differently, if we fit a linear model to data that were
generated by a non-linear model, the errors will show a
systematic pattern and Assumption 1 will be violated.
How to show it by simulations?
▶ We generate data based on a quadratic model:
$$Y_i = X_i^2 + 2X_i + u_i.$$
▶ We fit both a linear and a quadratic model to the data and
overlay both fits on the same figure.
▶ We will observe that Assumption 1 is violated for the linear
model, as there is a systematic pattern in the errors, but it holds
for the quadratic model.
R functions
We need the following R functions for simulation:
▶ runif() - generates uniformly distributed random numbers
▶ rnorm() - generates normally distributed random numbers
▶ predict() - does predictions based on the results of model
fitting functions like lm()
▶ lines() - adds line segments to an existing plot
R Simulation about Assumption 1.
# set a seed to make the results reproducible
set.seed(321)
# simulate the data
X <- runif(50, min = -5, max = 5)
u <- rnorm(50, sd = 1)
# the true relation
Y <- X^2 + 2 * X + u
# estimate a simple regression model
mod_simple <- lm(Y ~ X)
# estimate a quadratic regression model
mod_quadratic <- lm( Y ~ X + I(X^2))
# predict using a quadratic model
prediction <- predict(mod_quadratic, data.frame(X = sort(X)))
# plot the results
plot( Y ~ X, col = "black", pch = 20, xlab = "X", ylab = "Y")
# blue line: the misspecified linear fit (its residuals show a systematic
# pattern, violating the first OLS assumption)
abline(mod_simple, col = "blue", lwd = 2)
# red line: the correctly specified quadratic fit
lines(sort(X), prediction, col = "red", lwd = 2)
legend("topleft",
legend = c("Simple Regression Model",
"Quadratic Model"),
cex = 1,
lty = 1,
col = c("blue","red"))
R Simulation about Assumption 1.
[Figure: simulated data with the misspecified simple regression fit (blue) and the quadratic fit (red); Y on the vertical axis, X on the horizontal axis.]
Assumption 2. Independently and Identically Distributed
Data
A sampling scheme based on simple random sampling ensures that the observations (𝑌𝑖 , 𝑋𝑖 ) are i.i.d. for all
𝑖 = 1, … , 𝑛.
▶ A prominent example where i.i.d. does not hold is time series data: a company that cuts the number of its
employees over time produces employment figures with a systematic pattern.
▶ Likewise, a country's GDP shows a pattern over time.
The R simulation about the employee example:
# set seed
set.seed(123)
# generate a date vector
Date <- seq(as.Date("1951/1/1"), as.Date("2000/1/1"), "years")
# initialize the employment vector
X <- c(5000, rep(NA, length(Date)-1))
# generate time series observations with random influences
for (t in 2:length(Date)) {
  X[t] <- -50 + 0.98 * X[t-1] + rnorm(n = 1, sd = 200)
}
# plot the results
plot(x = Date,
y = X,
type = "l",
col = "steelblue",
ylab = "Workers",
xlab = "Time",
lwd=2)
R Simulation about Assumption 2.
[Figure: simulated number of workers over time (1951–2000), declining over the sample period; Workers on the vertical axis, Time on the horizontal axis.]
Assumption 3. Large Outliers are Unlikely
This assumption makes sure that both 𝑋 and 𝑌 have a finite
kurtosis. Why do we need it?
▶ OLS suffers from sensitivity to outliers.
▶ One can show that extreme observations receive heavy
weighting in the estimation of the unknown regression
coefficients when using OLS.
▶ Therefore, outliers can lead to strongly distorted estimates of
regression coefficients.
Simulation about Assumption 3.
# set seed
set.seed(123)
# generate the data
X <- sort(runif(10, min = 30, max = 70))
Y <- rnorm(10 , mean = 200, sd = 50)
Y[9] <- 2000
# fit model with outlier
fit <- lm(Y ~ X)
# fit model without outlier
fitWithoutOutlier <- lm(Y[-9] ~ X[-9])
# plot the results
plot(Y ~ X,pch=20)
abline(fit,lwd=2,col="blue")
abline(fitWithoutOutlier, col = "red",lwd=2)
legend("topleft",
legend = c("Model with Outlier",
"Model without Outlier"),
cex = 1,
lty = 1,
col = c("blue","red"))
Simulation about Assumption 3.
[Figure: scatterplot of the simulated data; the fitted line including the outlier (blue) is pulled strongly away from the fit excluding it (red); Y on the vertical axis, X on the horizontal axis.]
Sampling Distribution of OLS estimators.
As 𝑛 → ∞, the sampling distribution of the OLS estimators
approaches normality, given Assumptions (1)-(3). Mathematically,
$$\hat{\beta}_1 \overset{d}{\sim} N\!\left(\beta_1, Var(\hat{\beta}_1)\right)$$
$$Var(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma_i^2}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^2}$$
$$Var(\hat{\beta}_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)$$
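A minimal simulation sketch of this large-sample normality (the data-generating process, sample size, and number of replications are illustrative assumptions): draw many samples, re-estimate the slope on each, and compare the histogram of 𝛽1̂ with a normal density.
# simulate the sampling distribution of the slope estimator
# (true parameters, n, and the number of replications are illustrative choices)
set.seed(1)
n <- 100                      # observations per sample
reps <- 1000                  # number of simulated samples
true_beta_0 <- 698.9          # illustrative intercept
true_beta_1 <- -2.28          # illustrative slope
beta_1_hats <- numeric(reps)
for (r in 1:reps) {
  X <- runif(n, min = 14, max = 26)            # regressor
  u <- rnorm(n, sd = 18.6)                     # error term
  Y <- true_beta_0 + true_beta_1 * X + u       # data-generating process
  beta_1_hats[r] <- coef(lm(Y ~ X))[2]         # store the slope estimate
}
# histogram of the estimates with a normal density overlaid
hist(beta_1_hats, breaks = 30, freq = FALSE,
     main = "Sampling Distribution of the Slope Estimator",
     xlab = expression(hat(beta)[1]))
curve(dnorm(x, mean = mean(beta_1_hats), sd = sd(beta_1_hats)),
      col = "red", lwd = 2, add = TRUE)
The histogram should be approximately bell-shaped and centred near the true slope, illustrating the large-sample normality of the OLS estimator.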
Conclusion
To estimate the linear relationship between two variables we use
OLS. OLS estimators are approximately normally distributed if the
following assumptions hold:
▶ 𝐸(𝑢𝑖 |𝑋𝑖 ) = 0;
▶ (𝑌𝑖 , 𝑋𝑖 ) are i.i.d. samples from a population;
▶ Large outliers are unlikely, meaning both X and Y have finite
kurtosis.
References
Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics
with R. Springer New York.
https://doi.org/10.1007/978-0-387-77318-6.