Introduction to Data Science
DSA1101
Semester 1, 2019/2020
Week 2
Introduction to linear regression
1 / 50
Linear regression
Linear regression is an analytical technique used to model the
relationship between several input variables and a continuous
outcome variable.
A key assumption is that the relationships between the input
variables and the outcome variable are linear.
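As a quick illustration (a minimal sketch, not from the original slides; the data here are simulated), we can generate an outcome that depends linearly on one input and recover the relationship with R's lm() function:

# Simulated example: outcome depends linearly on the input, plus noise
set.seed(1)
x <- runif(100, 0, 10)           # input variable
y <- 2 + 0.5 * x + rnorm(100)    # true intercept 2, true slope 0.5
coef(lm(y ~ x))                  # estimates should be close to (2, 0.5)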
2 / 50
Linear regression: examples
Real estate: Linear regression analysis can be used to model residential home prices as a function of the home's living area. Such a model helps set or evaluate the list price of a home on the market.
The model could be further improved by including other input variables such as number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.
Source: The Business Times
3 / 50
Linear regression: examples
Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services.
For example, coffee shops can appropriately prepare for the predicted type and quantity of food that customers will consume based upon the weather, the day of the week, whether an item is offered as a special, the time of day, and the reservation volume.
Source: The Straits Times
4 / 50
Linear regression: examples
Similar forecasting models
can be built to predict taxi
demand, emergency room
visits, and ambulance
dispatches.
Source: The Straits Times
5 / 50
Linear regression: examples
Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes.
Multiple input variables might include duration of a single radiation treatment, frequency of radiation treatment, and patient attributes such as age or weight.
Source: The Straits Times
6 / 50
Linear regression: examples
Finance: Linear regression is used to model the relationships between stock market prices and other variables such as economic performance, interest rates and geopolitical risks.
Source: Bloomberg, Sunday Times Graphics
7 / 50
Linear regression: examples
Pharmaceutical Industry: A linear regression model can be used to analyze the clinical efficacies of drugs.
Input variables may include age, gender and other patient characteristics such as blood pressure and blood sugar level.
Source: The Straits Times
8 / 50
Example closer to home...
Data on resale HDB prices based on registration date is publicly available from https://data.gov.sg/dataset/resale-flat-prices.
We have extracted a subset of all the resale records from March 2012 to December 2014 based on registration date.
Available as the data set HDBresale_reg.csv on the course website.
Source: The Straits Times
9 / 50
Example closer to home...
Source: data.gov.sg
10 / 50
GovTech datasets
https://data.gov.sg
hosts many publicly
available datasets for data
analytics.
Source: data.gov.sg
11 / 50
HDB resale data
Let us take a closer look at the HDB resale dataset:

> resale = read.csv("hdbresale_reg.csv")

> head(resale[, 1:5])
    X   month         town flat_type block
1 580 2012-03 CENTRAL AREA    3 ROOM   640
2 581 2012-03 CENTRAL AREA    3 ROOM   640
3 582 2012-03 CENTRAL AREA    3 ROOM   668
4 583 2012-03 CENTRAL AREA    3 ROOM     5
5 584 2012-03 CENTRAL AREA    3 ROOM   271
6 585 2012-03 CENTRAL AREA    4 ROOM  671A
12 / 50
HDB resale data
Let us take a closer look at the HDB resale dataset:

> head(resale[, 6:8])
     street_name storey_range floor_area_sqm
1      ROWELL RD     01 TO 05             74
2      ROWELL RD     06 TO 10             74
3     CHANDER RD     01 TO 05             73
4 TG PAGAR PLAZA     11 TO 15             59
5       QUEEN ST     11 TO 15             68
6     KLANG LANE     01 TO 05             75
13 / 50
HDB resale data
Let us take a closer look at the HDB resale dataset:

> head(resale[, 9:11])
  flat_model lease_commence_date resale_price
1    Model A                1984       380000
2    Model A                1984       388000
3    Model A                1984       400000
4   Improved                1977       460000
5   Improved                1979       488000
6    Model A                2003       495000
14 / 50
Multiple linear regression
Suppose we are interested in building a linear regression model that estimates an HDB unit's resale price as a function of town, flat_type and floor_area_sqm.
With more than one input variable, we use multiple linear regression.
15 / 50
Multiple linear regression
In the multiple linear regression model with p input variables,

y = β0 + β1 x(1) + β2 x(2) + ... + βp x(p) + ε,

where
∗ y is the outcome variable
∗ x(j) are the input variables, j = 1, 2, ..., p
∗ β0 is the value of y when each x(j) equals zero
∗ βj is the change in y based on a unit change in x(j), for j = 1, 2, ..., p
∗ ε is a random error term
16 / 50
Multiple linear regression
For example, when there are three input variables, the linear model is

y = β0 + β1 x(1) + β2 x(2) + β3 x(3) + ε.
The parameters (β0 , β1 , β2 , β3 ) can be estimated by the
method of least squares.
We will review the least squares method in the simple linear
regression case to consolidate understanding.
17 / 50
Review: least squares for simple linear regression
Suppose we have three observations. Each observation has an
outcome y and an input variable x.
We are interested in the linear relationship
yi ≈ β0 + β1 xi
Since there is only one input variable, this is an example of a simple linear regression model.
18 / 50
Review: least squares for simple linear regression
i xi yi
1 -1 -1
2 3 3.5
3 5 3
Plot of the three data points.
19 / 50
Review: least squares for simple linear regression
We are interested in the linear relationship

yi ≈ f(xi) = β0 + β1 xi

Recall that the above is actually the formula for a straight line.
There are many different lines that can be used to model x and y, as shown in the plot.
20 / 50
Review: least squares for simple linear regression
Intuitively, we want the line
to be as close to the data
points as possible.
This "closeness" can be measured in terms of the vertical distance between each point and the line (represented by the length of the purple lines in the plot).
21 / 50
Review: least squares for simple linear regression
i xi yi β0 + β1 xi residual: ei = yi − (β0 + β1 xi )
1 -1 -1 β0 + (−1)β1 −1 − (β0 + (−1)β1 ) = −1 − β0 + β1
2 3 3.5 β0 + (3)β1 3.5 − (β0 + (3)β1 ) = 3.5 − β0 − 3β1
3 5 3 β0 + (5)β1 3 − (β0 + (5)β1 ) = 3 − β0 − 5β1
22 / 50
Review: least squares for simple linear regression
The residual for each point may be positive or negative. We do not want the residuals to "cancel out" each other, so we square each of them, leading to the squared residuals.
i   residual: ei = yi − (β0 + β1 xi)              squared residual: ei²
1   −1 − (β0 + (−1)β1) = −1 − β0 + β1             [−1 − β0 + β1]²
2   3.5 − (β0 + (3)β1) = 3.5 − β0 − 3β1           [3.5 − β0 − 3β1]²
3   3 − (β0 + (5)β1) = 3 − β0 − 5β1               [3 − β0 − 5β1]²
23 / 50
Review: least squares for simple linear regression
i   residual: ei = yi − (β0 + β1 xi)              squared residual: ei²
1   −1 − (β0 + (−1)β1) = −1 − β0 + β1             [−1 − β0 + β1]²
2   3.5 − (β0 + (3)β1) = 3.5 − β0 − 3β1           [3.5 − β0 − 3β1]²
3   3 − (β0 + (5)β1) = 3 − β0 − 5β1               [3 − β0 − 5β1]²
To express the total magnitude of the deviations, we sum up
the squared residuals for all the data points.
The resulting sum is the Residual Sum of Squares, abbreviated as RSS.
In the above example,

RSS = e1² + e2² + e3²
    = [−1 − β0 + β1]² + [3.5 − β0 − 3β1]² + [3 − β0 − 5β1]².
24 / 50
Review: least squares for simple linear regression
We now seek the values of β0 and β1 such that the RSS,
given by
RSS = [−1 − β0 + β1]² + [3.5 − β0 − 3β1]² + [3 − β0 − 5β1]²,
is minimized.
This process is known as the method of least squares.
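Since the RSS is just a function of the two parameters, the minimization can also be carried out numerically; here is a minimal sketch (not part of the original slides) using R's general-purpose optimizer optim():

# RSS for the toy data as a function of candidate values b = (b0, b1)
x <- c(-1, 3, 5)
y <- c(-1, 3.5, 3)
rss <- function(b) sum((y - (b[1] + b[2] * x))^2)

# Minimize RSS numerically; the result matches the closed-form solution
optim(c(0, 0), rss)$par   # approximately (0.1250, 0.7321)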
25 / 50
Review: least squares for simple linear regression
We now have a function in terms of β0 and β1 . Let’s call it
h(β0 , β1 ) so that
h(β0, β1) = [−1 − β0 + β1]² + [3.5 − β0 − 3β1]² + [3 − β0 − 5β1]².
To find the minimum value of h(β0 , β1 ), first differentiate
with respect to β0 , while holding β1 constant:
∂h(β0, β1)/∂β0 = 2[−1 − β0 + β1](−1) + 2[3.5 − β0 − 3β1](−1) + 2[3 − β0 − 5β1](−1)
               = 2 + 2β0 − 2β1 − 7 + 2β0 + 6β1 − 6 + 2β0 + 10β1
               = −11 + 6β0 + 14β1.
26 / 50
Review: least squares for simple linear regression
Then differentiate with respect to β1 , while holding β0
constant:
∂h(β0, β1)/∂β1 = 2[−1 − β0 + β1](1) + 2[3.5 − β0 − 3β1](−3) + 2[3 − β0 − 5β1](−5)
               = −2 − 2β0 + 2β1 − 21 + 6β0 + 18β1 − 30 + 10β0 + 50β1
               = −53 + 14β0 + 70β1.
27 / 50
Review: least squares for simple linear regression
Finally, by setting both derivatives to zero, we have the system of equations

−11 + 6β0 + 14β1 = 0
−53 + 14β0 + 70β1 = 0

Solving the first equation for β0,

β0 = 11/6 − (14/6)β1,

and substituting into the second equation, −53 + 14β0 + 70β1 = 0, leads to the least squares estimates

β0 ≈ 0.1250
β1 ≈ 0.7321
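The same 2×2 linear system can also be solved directly in R; a minimal sketch (not from the original slides):

# Rearranged system: 6*b0 + 14*b1 = 11 and 14*b0 + 70*b1 = 53
A <- matrix(c(6, 14,
              14, 70), nrow = 2, byrow = TRUE)
b <- c(11, 53)
solve(A, b)   # returns approximately (0.1250, 0.7321)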
28 / 50
Review: least squares for simple linear regression
We usually add the "hat" sign on top of a parameter to denote its estimated value, so the least squares estimates are denoted as
β̂0 ≈ 0.1250
β̂1 ≈ 0.7321.
29 / 50
Review: least squares for simple linear regression
We can check that the least squares estimates we computed are equivalent to those returned by the lm() function in R:

> x <- c(-1, 3, 5)
> y <- c(-1, 3.5, 3)
> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     0.1250       0.7321
30 / 50
Review: least squares for simple linear regression
Notice that on slide 22, we begin with this table:
i xi yi β0 + β1 xi residual: ei = yi − (β0 + β1 xi )
1 -1 -1 β0 + (−1)β1 −1 − β0 + β1
2 3 3.5 β0 + (3)β1 3.5 − β0 − 3β1
3 5 3 β0 + (5)β1 3 − β0 − 5β1
However, the values for β0 and β1 were unknown.
Now that we have obtained the least squares estimates
β̂0 ≈ 0.1250 and β̂1 ≈ 0.7321, we can plug those values into
the table above!
31 / 50
Review: least squares for simple linear regression
i xi yi ŷi = β̂0 + β̂1 xi residual: ei = yi − (β̂0 + β̂1 xi )
1 -1 -1 -0.6071 -0.3929
2 3 3.5 2.3213 1.1787
3 5 3 3.7855 -0.7855
The column for ŷi = β̂0 + β̂1 xi contains the fitted values for
outcome y .
The column for ei = yi − (β̂0 + β̂1 xi ) contains the residuals
after fitting the simple linear model.
32 / 50
Review: least squares for simple linear regression
i xi yi ŷi = β̂0 + β̂1 xi residual: ei = yi − (β̂0 + β̂1 xi )
1 -1 -1 -0.6071 -0.3929
2 3 3.5 2.3213 1.1787
3 5 3 3.7855 -0.7855
We see that R can also output the fitted values and residuals
directly after fitting the linear model:
> x <- c(-1, 3, 5)
> y <- c(-1, 3.5, 3)
> lmout <- lm(y ~ x)
> lmout$fitted.values
         1          2          3
-0.6071429  2.3214286  3.7857143
> lmout$residuals
         1          2          3
-0.3928571  1.1785714 -0.7857143
33 / 50
Review: least squares for simple linear regression
We can follow up by plotting the residuals against the fitted
outcome values for model diagnostics, as discussed in the
previous lecture.
x <- c(-1, 3, 5)
y <- c(-1, 3.5, 3)

lmout <- lm(y ~ x)

plot(x = lmout$fitted.values, y = lmout$residuals,
     xlab = "Fitted values", ylab = "Residuals",
     cex = 2, cex.lab = 2, cex.axis = 2, pch = 16)

abline(0, 0)
34 / 50
Review: least squares for simple linear regression
35 / 50
Back to multiple linear regression...
In the multiple linear regression model with p input variables,

y = β0 + β1 x(1) + β2 x(2) + ... + βp x(p) + ε,

where
∗ y is the outcome variable
∗ x(j) are the input variables, j = 1, 2, ..., p
∗ β0 is the value of y when each x(j) equals zero
∗ βj is the change in y based on a unit change in x(j), for j = 1, 2, ..., p
∗ ε is a random error term
We can also estimate the unknown parameters in the multiple linear regression model via the method of least squares, as shown in the sketch below.
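In matrix form, with design matrix X and outcome vector y, the least squares estimates are β̂ = (XᵀX)⁻¹Xᵀy; here is a minimal sketch (not from the original slides) verifying this on the toy data:

# Design matrix with a column of ones for the intercept
x <- c(-1, 3, 5)
y <- c(-1, 3.5, 3)
X <- cbind(1, x)

# Solve the normal equations (X'X) beta = X'y
solve(t(X) %*% X, t(X) %*% y)   # approximately (0.1250, 0.7321)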
36 / 50
Back to our HDB resale data example...
Source: data.gov.sg
37 / 50
Multiple linear regression
Suppose we are interested in building a linear regression model that estimates an HDB unit's resale price as a function of town, flat_type and floor_area_sqm.
With more than one input variable, we use multiple linear regression.
38 / 50
Multiple linear regression
> resale = read.csv("hdbresale_reg.csv")
> head(resale[, c(3, 4, 8, 11)])
          town flat_type floor_area_sqm resale_price
1 CENTRAL AREA    3 ROOM             74       380000
2 CENTRAL AREA    3 ROOM             74       388000
3 CENTRAL AREA    3 ROOM             73       400000
4 CENTRAL AREA    3 ROOM             59       460000
5 CENTRAL AREA    3 ROOM             68       488000
6 CENTRAL AREA    4 ROOM             75       495000
39 / 50
HDB resale data
Note that in our resale dataset, variables such as town and
flat_type are called categorical variables, since they consist
of different categories instead of numerical values.
The function levels() in R will display all the different
categories in a variable.
> levels(resale$town)
[1] "CENTRAL AREA" "JURONG EAST"  "WOODLANDS"
> levels(resale$flat_type)
[1] "2 ROOM"    "3 ROOM"    "4 ROOM"    "5 ROOM"    "EXECUTIVE"
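One caveat worth noting (an addition, not in the original slides): since R 4.0, read.csv() no longer converts character columns to factors by default, so levels() may return NULL on newer R versions. A hedged sketch of the fix:

# On R >= 4.0, convert the character columns to factors explicitly
resale$town <- as.factor(resale$town)
resale$flat_type <- as.factor(resale$flat_type)
levels(resale$town)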
40 / 50
HDB resale data
In R, we can perform multiple linear regression using the lm()
function:
> lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)

Call:
lm(formula = resale_price ~ town + floor_area_sqm + flat_type,
    data = resale)

Coefficients:
       (Intercept)     townJURONG EAST
            193438             -122748
     townWOODLANDS      floor_area_sqm
           -169896                2526
   flat_type3 ROOM     flat_type4 ROOM
             98827              129929
   flat_type5 ROOM  flat_typeEXECUTIVE
            142570              214622
41 / 50
Categorical variables
It is not meaningful to assign numerical values to a categorical variable such as town. For example, it is not meaningful to consider Jurong East to be one unit greater than Woodlands and two units greater than Central Area.
HDB resales that took place in the Central Area are regarded as the reference case.
So, for example, the coefficient estimate of −122748 for townJURONG EAST means that the HDB resale price in Jurong East is on average $122,748 less than the resale price in the Central Area, other variables being held constant.
Similarly, the coefficient estimate of −169896 for townWOODLANDS means that the HDB resale price in Woodlands is on average $169,896 less than the resale price in the Central Area, other variables being held constant.
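Under the hood, R encodes a categorical variable with dummy (indicator) variables and absorbs the first level into the intercept as the reference case. A small sketch to inspect this encoding (not part of the original slides):

# Inspect the dummy variables R creates; CENTRAL AREA and 2 ROOM are
# the reference levels absorbed into the intercept
head(model.matrix(resale_price ~ town + floor_area_sqm + flat_type,
                  data = resale))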
42 / 50
Multiple linear regression
Suppose we are interested in computing confidence intervals for the parameter estimates from our multiple linear regression model.
As in the simple linear regression setting, we assume that the error terms are independent and normally distributed with mean zero and constant variance.
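Under these assumptions, a 95% confidence interval for βj takes the form β̂j ± t(0.975, df) · SE(β̂j), where df is the residual degrees of freedom. A minimal sketch of this computation (not from the original slides; confint() on the next slide does the same thing):

# Manual confidence intervals from the estimates and standard errors
out <- lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
est <- coef(summary(out))                # columns: Estimate, Std. Error, ...
t975 <- qt(0.975, df = out$df.residual)  # t critical value
cbind(lower = est[, "Estimate"] - t975 * est[, "Std. Error"],
      upper = est[, "Estimate"] + t975 * est[, "Std. Error"])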
43 / 50
Multiple linear regression
R simplifies the computation of confidence intervals on the
parameters with the use of the confint() function.
For example, the following R command provides 95%
confidence intervals on our parameter estimates:
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
> confint(out, level = .95)
                         2.5 %      97.5 %
(Intercept)         180791.622  206084.668
townJURONG EAST    -127894.901 -117601.145
townWOODLANDS      -174833.773 -164959.166
floor_area_sqm        2403.141    2649.163
flat_type3 ROOM      87053.089  110601.512
flat_type4 ROOM     117163.666  142693.511
flat_type5 ROOM     128354.088  156786.844
flat_typeEXECUTIVE  197641.718  231602.642
44 / 50
Multiple linear regression
R also computes the p-value for testing H0: βj = 0 versus H1: βj ≠ 0 for j = 0, 1, 2, ..., p.
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
> summary(out)

Call:
lm(formula = resale_price ~ town + floor_area_sqm + flat_type,
    data = resale)

Residuals:
    Min      1Q  Median      3Q     Max
-139026  -23350   -1453   19284  336649

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         193438.15    6451.13   29.98   <2e-16 ***
townJURONG EAST    -122748.02    2625.48  -46.75   <2e-16 ***
townWOODLANDS      -169896.47    2518.57  -67.46   <2e-16 ***
floor_area_sqm        2526.15      62.75   40.26   <2e-16 ***
flat_type3 ROOM      98827.30    6006.16   16.45   <2e-16 ***
flat_type4 ROOM     129928.59    6511.53   19.95   <2e-16 ***
flat_type5 ROOM     142570.47    7251.94   19.66   <2e-16 ***
flat_typeEXECUTIVE  214622.18    8661.93   24.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
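The t value and p-value columns can be reproduced by hand; a minimal sketch (not from the original slides), using floor_area_sqm as an example and assuming the model object out fitted above:

# t statistic = estimate / standard error; p-value from the t distribution
est <- coef(summary(out))
t_stat <- est["floor_area_sqm", "Estimate"] / est["floor_area_sqm", "Std. Error"]
t_stat                                       # approximately 40.26
2 * pt(-abs(t_stat), df = out$df.residual)   # two-sided p-value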
46 / 50
Confidence interval on the expected outcome
Suppose we are interested in the expected resale price for 3-room HDB units in the Central Area with a floor area of 70 square meters.
The predict() function in R can provide a 95% confidence interval on this expected resale price:
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
>
> town = "CENTRAL AREA"
> flat_type = "3 ROOM"
> floor_area_sqm = 70
>
> new_pt <- data.frame(town, flat_type, floor_area_sqm)
> conf_int_pt <- predict(out, new_pt, level = .95, interval = "confidence")
>
> conf_int_pt
       fit    lwr      upr
1 469096.1 464369 473823.2
47 / 50
Confidence interval on a particular outcome
Suppose we are interested in the resale price for a particular 3-room HDB unit in the Central Area with a floor area of 70 square meters.
The predict() function in R can also provide a 95% prediction interval on this resale price. Note that the prediction interval below is wider than the confidence interval on the previous slide: it accounts for the variability of an individual unit's price around the expected price.
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
>
> town = "CENTRAL AREA"
> flat_type = "3 ROOM"
> floor_area_sqm = 70
>
> new_pt <- data.frame(town, flat_type, floor_area_sqm)
> conf_int_pt <- predict(out, new_pt, level = .95, interval = "prediction")
>
> conf_int_pt
       fit      lwr      upr
1 469096.1 391692.2 546499.9
48 / 50
Model Diagnostics: Evaluating the Residuals
We have made assumptions on the error terms in the multiple linear regression model.
We can plot the residuals against the fitted values for visual inspection:
> out = lm(resale_price ~ town + floor_area_sqm + flat_type, data = resale)
>
> plot(x = out$fitted.values, y = out$residuals,
+      xlab = "Fitted values", ylab = "Residuals", col = "red")
> abline(0, 0)
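To check the normality assumption as well, a common additional diagnostic (a suggestion, not shown in the original slides) is a normal Q-Q plot of the residuals:

# Residuals should lie close to the reference line if roughly normal
qqnorm(out$residuals)
qqline(out$residuals)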
49 / 50
Model Diagnostics: Evaluating the Residuals
50 / 50