Introduction to Data Science
DSA1101
Semester 1, 2019/2020
Week 2
Introduction to linear regression
1 / 50
Linear regression
Linear regression is an analytical technique used to model the
relationship between several input variables and a continuous
outcome variable.
A key assumption is that the relationships between the input
variables and the outcome variable are linear.
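As a quick illustration (a minimal sketch, not from the original slides; the data here are simulated), we can generate an outcome that depends linearly on one input and recover the relationship with R's lm() function:

# Simulated example: outcome depends linearly on the input, plus noise
set.seed(1)
x <- runif(100, 0, 10)           # input variable
y <- 2 + 0.5 * x + rnorm(100)    # true intercept 2, true slope 0.5
coef(lm(y ~ x))                  # estimates should be close to (2, 0.5)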
2 / 50
Linear regression: examples
Real estate: Linear regression analysis can be used to model residential home prices as a function of the home's living area. Such a model helps set or evaluate the list price of a home on the market.
The model could be further improved by including other input variables such as number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.
Source: The Business Times
3 / 50
Linear regression: examples
Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services.
For example, coffee shops can appropriately prepare for the predicted type and quantity of food that customers will consume based upon the weather, the day of the week, whether an item is offered as a special, the time of day, and the reservation volume.
Source: The Straits Times
4 / 50
Linear regression: examples
Similar forecasting models
can be built to predict taxi
demand, emergency room
visits, and ambulance
dispatches.
Source: The Straits Times
5 / 50
Linear regression: examples
Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes.
Multiple input variables might include duration of a single radiation treatment, frequency of radiation treatment, and patient attributes such as age or weight.
Source: The Straits Times
6 / 50
Linear regression: examples
Finance: Linear regression is used to model the relationships between stock market prices and other variables such as economic performance, interest rates and geopolitical risks.
Source: Bloomberg, Sunday Times Graphics
7 / 50
Linear regression: examples
Pharmaceutical Industry: A linear regression model can be used to analyze the clinical efficacies of drugs.
Input variables may include age, gender and other patient characteristics such as blood pressure and blood sugar level.
Source: The Straits Times
8 / 50
Example closer to home...
Data on resale HDB prices based on registration date is publicly available from https://data.gov.sg/dataset/resale-flat-prices.
We have extracted a subset of all the resale records from March 2012 to December 2014 based on registration date.
Available as the data set HDBresale_reg.csv on the course website.
Source: The Straits Times
9 / 50
Example closer to home...
Source: data.gov.sg
10 / 50
GovTech datasets
https://data.gov.sg
hosts many publicly
available datasets for data
analytics.
Source: data.gov.sg
11 / 50
HDB resale data
Let us take a closer look at the HDB resale dataset:

> resale = read.csv("hdbresale_reg.csv")

> head(resale[, 1:5])
    X   month         town flat_type block
1 580 2012-03 CENTRAL AREA    3 ROOM   640
2 581 2012-03 CENTRAL AREA    3 ROOM   640
3 582 2012-03 CENTRAL AREA    3 ROOM   668
4 583 2012-03 CENTRAL AREA    3 ROOM     5
5 584 2012-03 CENTRAL AREA    3 ROOM   271
6 585 2012-03 CENTRAL AREA    4 ROOM  671A
12 / 50
HDB resale data
Let us take a closer look at the HDB resale dataset:

> head(resale[, 6:8])
     street_name storey_range floor_area_sqm
1      ROWELL RD     01 TO 05             74
2      ROWELL RD     06 TO 10             74
3     CHANDER RD     01 TO 05             73
4 TG PAGAR PLAZA     11 TO 15             59
5       QUEEN ST     11 TO 15             68
6     KLANG LANE     01 TO 05             75
13 / 50
HDB resale data
Let us take a closer look at the HDB resale dataset:

> head(resale[, 9:11])
  flat_model lease_commence_date resale_price
1    Model A                1984       380000
2    Model A                1984       388000
3    Model A                1984       400000
4   Improved                1977       460000
5   Improved                1979       488000
6    Model A                2003       495000
14 / 50
Multiple linear regression
Suppose we are interested in building a linear regression model that estimates an HDB unit's resale price as a function of town, flat_type and floor_area_sqm.
With more than one input variable, we use multiple linear regression.
15 / 50
Multiple linear regression
In the multiple linear regression model with p input variables,

y = β0 + β1 x(1) + β2 x(2) + ... + βp x(p) + ε,

where
∗ y is the outcome variable
∗ x(j) are the input variables, j = 1, 2, ..., p
∗ β0 is the value of y when each x(j) equals zero
∗ βj is the change in y based on a unit change in x(j), for j = 1, 2, ..., p
∗ ε is a random error term
16 / 50
Multiple linear regression
For example, when there are three input variables, the linear model is

y = β0 + β1 x(1) + β2 x(2) + β3 x(3) + ε.
The parameters (β0 , β1 , β2 , β3 ) can be estimated by the
method of least squares.
We will review the least squares method in the simple linear
regression case to consolidate understanding.
17 / 50
Review: least squares for simple linear regression
Suppose we have three observations. Each observation has an
outcome y and an input variable x.
We are interested in the linear relationship
yi ≈ β0 + β1 xi
Since there is only one input variable, this is an example of a simple linear regression model.
18 / 50
Review: least squares for simple linear regression
i xi yi
1 -1 -1
2 3 3.5
3 5 3
Plot of the three data points.
19 / 50
Review: least squares for simple linear regression
We are interested in the linear relationship

yi ≈ f(xi) = β0 + β1 xi

Recall that the above is actually the formula for a straight line.
There are many different lines that can be used to model x and y, as shown in the plot.
20 / 50
Review: least squares for simple linear regression
Intuitively, we want the line
to be as close to the data
points as possible.
This "closeness" can be measured in terms of the vertical distance between each point and the line (represented by the length of the purple lines in the plot).
21 / 50
Review: least squares for simple linear regression
i xi yi β0 + β1 xi residual: ei = yi − (β0 + β1 xi )
1 -1 -1 β0 + (−1)β1 −1 − (β0 + (−1)β1 ) = −1 − β0 + β1
2 3 3.5 β0 + (3)β1 3.5 − (β0 + (3)β1 ) = 3.5 − β0 − 3β1
3 5 3 β0 + (5)β1 3 − (β0 + (5)β1 ) = 3 − β0 − 5β1
22 / 50
Review: least squares for simple linear regression
The residual for each point may be positive or negative. We do not want the residuals to "cancel out" each other, so we square each of them, leading to the squared residuals.
i   residual: ei = yi − (β0 + β1 xi)              squared residual: ei²
1   −1 − (β0 + (−1)β1) = −1 − β0 + β1             [−1 − β0 + β1]²
2   3.5 − (β0 + (3)β1) = 3.5 − β0 − 3β1           [3.5 − β0 − 3β1]²
3   3 − (β0 + (5)β1) = 3 − β0 − 5β1               [3 − β0 − 5β1]²
23 / 50
Review: least squares for simple linear regression
i   residual: ei = yi − (β0 + β1 xi)              squared residual: ei²
1   −1 − (β0 + (−1)β1) = −1 − β0 + β1             [−1 − β0 + β1]²
2   3.5 − (β0 + (3)β1) = 3.5 − β0 − 3β1           [3.5 − β0 − 3β1]²
3   3 − (β0 + (5)β1) = 3 − β0 − 5β1               [3 − β0 − 5β1]²
To express the total magnitude of the deviations, we sum up
the squared residuals for all the data points.
The resulting sum is the Residual Sum of Squares, abbreviated as RSS.
In the above example,

RSS = e1² + e2² + e3²
    = [−1 − β0 + β1]² + [3.5 − β0 − 3β1]² + [3 − β0 − 5β1]².
24 / 50
Review: least squares for simple linear regression
We now seek the values of β0 and β1 such that the RSS,
given by
RSS = [−1 − β0 + β1]² + [3.5 − β0 − 3β1]² + [3 − β0 − 5β1]²,
is minimized.
This process is known as the method of least squares.
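Since the RSS is just a function of the two parameters, the minimization can also be carried out numerically; here is a minimal sketch (not part of the original slides) using R's general-purpose optimizer optim():

# RSS for the toy data as a function of candidate values b = (b0, b1)
x <- c(-1, 3, 5)
y <- c(-1, 3.5, 3)
rss <- function(b) sum((y - (b[1] + b[2] * x))^2)

# Minimize RSS numerically; the result matches the closed-form solution
optim(c(0, 0), rss)$par   # approximately (0.1250, 0.7321)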
25 / 50
Review: least squares for simple linear regression
We now have a function in terms of β0 and β1 . Let’s call it
h(β0 , β1 ) so that
h(β0, β1) = [−1 − β0 + β1]² + [3.5 − β0 − 3β1]² + [3 − β0 − 5β1]².
To find the minimum value of h(β0 , β1 ), first differentiate
with respect to β0 , while holding β1 constant:
∂h(β0, β1)/∂β0 = 2[−1 − β0 + β1](−1) + 2[3.5 − β0 − 3β1](−1) + 2[3 − β0 − 5β1](−1)
               = 2 + 2β0 − 2β1 − 7 + 2β0 + 6β1 − 6 + 2β0 + 10β1
               = −11 + 6β0 + 14β1.
26 / 50
Review: least squares for simple linear regression
Then differentiate with respect to β1 , while holding β0
constant:
∂h(β0, β1)/∂β1 = 2[−1 − β0 + β1](1) + 2[3.5 − β0 − 3β1](−3) + 2[3 − β0 − 5β1](−5)
               = −2 − 2β0 + 2β1 − 21 + 6β0 + 18β1 − 30 + 10β0 + 50β1
               = −53 + 14β0 + 70β1.
27 / 50
Review: least squares for simple linear regression
Finally, by setting both derivatives to zero, we have the system of equations

−11 + 6β0 + 14β1 = 0
−53 + 14β0 + 70β1 = 0

Solving the first equation for β0,

β0 = 11/6 − (14/6)β1,

and substituting into the second equation, −53 + 14β0 + 70β1 = 0, leads to the least squares estimates

β0 ≈ 0.1250
β1 ≈ 0.7321
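The same 2×2 linear system can also be solved directly in R; a minimal sketch (not from the original slides):

# Rearranged system: 6*b0 + 14*b1 = 11 and 14*b0 + 70*b1 = 53
A <- matrix(c(6, 14,
              14, 70), nrow = 2, byrow = TRUE)
b <- c(11, 53)
solve(A, b)   # returns approximately (0.1250, 0.7321)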
28 / 50
Review: least squares for simple linear regression
We usually add the "hat" sign on top of a parameter to denote its estimated value, so the least squares estimates are denoted as
β̂0 ≈ 0.1250
β̂1 ≈ 0.7321.
29 / 50
Review: least squares for simple linear regression
We can check that the least squares estimates we computed are equivalent to those returned by the lm() function in R:

> x <- c(-1, 3, 5)
> y <- c(-1, 3.5, 3)
> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     0.1250       0.7321
30 / 50
Review: least squares for simple linear regression
Notice that on slide 22, we begin with this table:
i xi yi β0 + β1 xi residual: ei = yi − (β0 + β1 xi )
1 -1 -1 β0 + (−1)β1 −1 − β0 + β1
2 3 3.5 β0 + (3)β1 3.5 − β0 − 3β1
3 5 3 β0 + (5)β1 3 − β0 − 5β1
However, the values for β0 and β1 were unknown.
Now that we have obtained the least squares estimates
β̂0 ≈ 0.1250 and β̂1 ≈ 0.7321, we can plug those values into
the table above!
31 / 50
Review: least squares for simple linear regression
i xi yi ŷi = β̂0 + β̂1 xi residual: ei = yi − (β̂0 + β̂1 xi )
1 -1 -1 -0.6071 -0.3929
2 3 3.5 2.3213 1.1787
3 5 3 3.7855 -0.7855
The column for ŷi = β̂0 + β̂1 xi contains the fitted values for
outcome y .
The column for ei = yi − (β̂0 + β̂1 xi ) contains the residuals
after fitting the simple linear model.
32 / 50
Review: least squares for simple linear regression
i xi yi ŷi = β̂0 + β̂1 xi residual: ei = yi − (β̂0 + β̂1 xi )
1 -1 -1 -0.6071 -0.3929
2 3 3.5 2.3213 1.1787
3 5 3 3.7855 -0.7855
We see that R can also output the fitted values and residuals
directly after fitting the linear model:
> x <- c(-1, 3, 5)
> y <- c(-1, 3.5, 3)
> lmout <- lm(y ~ x)
> lmout$fitted.values
         1          2          3
-0.6071429  2.3214286  3.7857143
> lmout$residuals
         1          2          3
-0.3928571  1.1785714 -0.7857143
33 / 50
Review: least squares for simple linear regression
We can follow up by plotting the residuals against the fitted
outcome values for model diagnostics, as discussed in the
previous lecture.
x <- c(-1, 3, 5)
y <- c(-1, 3.5, 3)

lmout <- lm(y ~ x)

plot(x = lmout$fitted.values, y = lmout$residuals,
     xlab = "Fitted values", ylab = "Residuals",
     cex = 2, cex.lab = 2, cex.axis = 2, pch = 16)

abline(0, 0)
34 / 50
Review: least squares for simple linear regression
35 / 50
Back to multiple linear regression...
In the multiple linear regression model with p input variables,

y = β0 + β1 x(1) + β2 x(2) + ... + βp x(p) + ε,

where
∗ y is the outcome variable
∗ x(j) are the input variables, j = 1, 2, ..., p
∗ β0 is the value of y when each x(j) equals zero
∗ βj is the change in y based on a unit change in x(j), for j = 1, 2, ..., p
∗ ε is a random error term
We can also estimate the unknown parameters in the multiple linear regression model via the method of least squares, as shown in the sketch below.
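In matrix form, with design matrix X and outcome vector y, the least squares estimates are β̂ = (XᵀX)⁻¹Xᵀy; here is a minimal sketch (not from the original slides) verifying this on the toy data:

# Design matrix with a column of ones for the intercept
x <- c(-1, 3, 5)
y <- c(-1, 3.5, 3)
X <- cbind(1, x)

# Solve the normal equations (X'X) beta = X'y
solve(t(X) %*% X, t(X) %*% y)   # approximately (0.1250, 0.7321)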
36 / 50
Back to our HDB resale data example...
Source: data.gov.sg
37 / 50
Multiple linear regression
Suppose we are interested in building a linear regression model that estimates an HDB unit's resale price as a function of town, flat_type and floor_area_sqm.
With more than one input variable, we use multiple linear regression.
38 / 50
Multiple linear regression
> resale = read.csv("hdbresale_reg.csv")
> head(resale[, c(3, 4, 8, 11)])
          town flat_type floor_area_sqm resale_price
1 CENTRAL AREA    3 ROOM             74       380000
2 CENTRAL AREA    3 ROOM             74       388000
3 CENTRAL AREA    3 ROOM             73       400000
4 CENTRAL AREA    3 ROOM             59       460000
5 CENTRAL AREA    3 ROOM             68       488000
6 CENTRAL AREA    4 ROOM             75       495000
39 / 50
HDB resale data
Note that in our resale dataset, variables such as town and
flat_type are called categorical variables, since they consist
of different categories instead of numerical values.
The function levels() in R will display all the different
categories in a variable.
> levels(resale$town)
[1] "CENTRAL AREA" "JURONG EAST"  "WOODLANDS"
> levels(resale$flat_type)
[1] "2 ROOM"    "3 ROOM"    "4 ROOM"    "5 ROOM"    "EXECUTIVE"
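One caveat worth noting (an addition, not in the original slides): since R 4.0, read.csv() no longer converts character columns to factors by default, so levels() may return NULL on newer R versions. A hedged sketch of the fix:

# On R >= 4.0, convert the character columns to factors explicitly
resale$town <- as.factor(resale$town)
resale$flat_type <- as.factor(resale$flat_type)
levels(resale$town)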
40 / 50
HDB resale data
In R, we can perform multiple linear regression using the lm()
function:
> lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)

Call:
lm(formula = resale_price ~ town + floor_area_sqm + flat_type,
    data = resale)

Coefficients:
       (Intercept)     townJURONG EAST
            193438             -122748
     townWOODLANDS      floor_area_sqm
           -169896                2526
   flat_type3 ROOM     flat_type4 ROOM
             98827              129929
   flat_type5 ROOM  flat_typeEXECUTIVE
            142570              214622
41 / 50
Categorical variables
It is not meaningful to assign numerical values to a categorical variable such as town. For example, it is not meaningful to consider Jurong East to be one unit greater than Woodlands and two units greater than Central Area.
HDB resales that took place in the Central Area are regarded as the reference case.
So, for example, the coefficient estimate of −122748 for townJURONG EAST means that the HDB resale price in Jurong East is on average $122,748 less than the resale price in the Central Area, other variables being held constant.
Similarly, the coefficient estimate of −169896 for townWOODLANDS means that the HDB resale price in Woodlands is on average $169,896 less than the resale price in the Central Area, other variables being held constant.
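Under the hood, R encodes a categorical variable with dummy (indicator) variables and absorbs the first level into the intercept as the reference case. A small sketch to inspect this encoding (not part of the original slides):

# Inspect the dummy variables R creates; CENTRAL AREA and 2 ROOM are
# the reference levels absorbed into the intercept
head(model.matrix(resale_price ~ town + floor_area_sqm + flat_type,
                  data = resale))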
42 / 50
Multiple linear regression
Suppose we are interested in computing confidence intervals for the parameter estimates from our multiple linear regression model.
As in the simple linear regression setting, we assume that the error terms are independent and normally distributed with mean zero and constant variance.
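Under these assumptions, a 95% confidence interval for βj takes the form β̂j ± t(0.975, df) · SE(β̂j), where df is the residual degrees of freedom. A minimal sketch of this computation (not from the original slides; confint() on the next slide does the same thing):

# Manual confidence intervals from the estimates and standard errors
out <- lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
est <- coef(summary(out))                # columns: Estimate, Std. Error, ...
t975 <- qt(0.975, df = out$df.residual)  # t critical value
cbind(lower = est[, "Estimate"] - t975 * est[, "Std. Error"],
      upper = est[, "Estimate"] + t975 * est[, "Std. Error"])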
43 / 50
Multiple linear regression
R simplifies the computation of confidence intervals on the
parameters with the use of the confint() function.
For example, the following R command provides 95%
confidence intervals on our parameter estimates:
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
> confint(out, level = .95)
                         2.5 %      97.5 %
(Intercept)         180791.622  206084.668
townJURONG EAST    -127894.901 -117601.145
townWOODLANDS      -174833.773 -164959.166
floor_area_sqm        2403.141    2649.163
flat_type3 ROOM      87053.089  110601.512
flat_type4 ROOM     117163.666  142693.511
flat_type5 ROOM     128354.088  156786.844
flat_typeEXECUTIVE  197641.718  231602.642
44 / 50
Multiple linear regression
R also computes the p-value for testing H0: βj = 0 versus H1: βj ≠ 0 for j = 0, 1, 2, ..., p.
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
> summary(out)

Call:
lm(formula = resale_price ~ town + floor_area_sqm + flat_type,
    data = resale)

Residuals:
    Min      1Q  Median      3Q     Max
-139026  -23350   -1453   19284  336649

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         193438.15    6451.13   29.98   <2e-16 ***
townJURONG EAST    -122748.02    2625.48  -46.75   <2e-16 ***
townWOODLANDS      -169896.47    2518.57  -67.46   <2e-16 ***
floor_area_sqm        2526.15      62.75   40.26   <2e-16 ***
flat_type3 ROOM      98827.30    6006.16   16.45   <2e-16 ***
flat_type4 ROOM     129928.59    6511.53   19.95   <2e-16 ***
flat_type5 ROOM     142570.47    7251.94   19.66   <2e-16 ***
flat_typeEXECUTIVE  214622.18    8661.93   24.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
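The t value and p-value columns can be reproduced by hand; a minimal sketch (not from the original slides), using floor_area_sqm as an example and assuming the model object out fitted above:

# t statistic = estimate / standard error; p-value from the t distribution
est <- coef(summary(out))
t_stat <- est["floor_area_sqm", "Estimate"] / est["floor_area_sqm", "Std. Error"]
t_stat                                       # approximately 40.26
2 * pt(-abs(t_stat), df = out$df.residual)   # two-sided p-value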
46 / 50
Confidence interval on the expected outcome
Suppose we are interested in the expected resale price for 3-room HDB units in the Central Area with a floor area of 70 square meters.
The predict() function in R can provide a 95% confidence interval on this expected resale price:
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
>
> town = "CENTRAL AREA"
> flat_type = "3 ROOM"
> floor_area_sqm = 70
>
> new_pt <- data.frame(town, flat_type, floor_area_sqm)
> conf_int_pt <- predict(out, new_pt, level = .95, interval = "confidence")
>
> conf_int_pt
       fit    lwr      upr
1 469096.1 464369 473823.2
47 / 50
Confidence interval on a particular outcome
Suppose we are interested in the resale price for a particular 3-room HDB unit in the Central Area with a floor area of 70 square meters.
The predict() function in R can also provide a 95% prediction interval on this resale price. Note that the prediction interval below is wider than the confidence interval on the previous slide: it accounts for the variability of an individual unit's price around the expected price.
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
>
> town = "CENTRAL AREA"
> flat_type = "3 ROOM"
> floor_area_sqm = 70
>
> new_pt <- data.frame(town, flat_type, floor_area_sqm)
> conf_int_pt <- predict(out, new_pt, level = .95, interval = "prediction")
>
> conf_int_pt
       fit      lwr      upr
1 469096.1 391692.2 546499.9
48 / 50
Model Diagnostics: Evaluating the Residuals
We have made assumptions on the error terms in the multiple linear regression model.
We can plot the residuals against the fitted values for visual inspection:
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
>
> plot(x = out$fitted.values, y = out$residuals,
+      xlab = "Fitted values", ylab = "Residuals", col = "red")
> abline(0, 0)
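To check the normality assumption as well, a common additional diagnostic (a suggestion, not shown in the original slides) is a normal Q-Q plot of the residuals:

# Residuals should lie close to the reference line if roughly normal
qqnorm(out$residuals)
qqline(out$residuals)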
49 / 50
Model Diagnostics: Evaluating the Residuals
50 / 50