FIT2086 Lecture 6
Linear Regression
Daniel F. Schmidt
Faculty of Information Technology, Monash University
September 4, 2017
Outline
1 Linear Regression Models
Supervised Learning
Linear Regression Models
2 Model Selection for Linear Regression
Under and Overfitting
Model Selection Methods
Revision from last week
Hypothesis testing; test null hypothesis vs alternative
H0 : null hypothesis
vs
HA : alternative hypothesis
A test-statistic measures how different our observed sample is
from the null hypothesis
A p-value quantifies the evidence against the null hypothesis
A p-value is the probability of seeing a sample that results in a
test statistic as extreme as, or more extreme than, the one we
observed, purely by chance, if the null hypothesis were true.
Supervised Learning (1)
Over the last three weeks we have looked at parameter
inference
In week 3 we examined point estimation using maximum
likelihood
Selecting our “best guess” at a single value of the parameter
In week 4 we examined interval estimation using confidence
intervals
Give a range of plausible values for the unknown population
parameter
In week 5 we examined hypothesis testing
Quantify statistical evidence against a given hypothesis
Supervised Learning (2)
Now we will start to see how these tools can be used to build
more complex models
Over the next three weeks we will look at supervised learning
In particular, we will look at linear regression
But first, what is supervised learning?
Supervised Learning (3)
Imagine we have measured p + 1 variables on n individuals
(people, objects, things)
We would like to predict one of the variables using the
remaining p variables
If the variable we are predicting is categorical, we are
performing classification
Example: predicting if someone has diabetes from medical
measurements.
If the variable we are predicting is numerical, we are
performing regression
Example: Predicting the quality of a wine from chemical and
seasonal information.
Supervised Learning (4)
The variable we are predicting is designated the “y” variable
We have (y1 , . . . , yn )
This variable is often called the:
target;
response;
outcome.
The other variables are usually designated “X” variables
We have (xi,1 , . . . , xi,p ) for i = 1, . . . , n
These variables are often called the
explanatory variables;
predictors;
covariates;
exposures.
Usually we assume the targets are random variables and the
predictors are known without error
Supervised Learning (4)
Supervised learning: find a relationship between the targets yi
and associated predictors xi,1 , . . . , xi,p .
That is, learn a function f (·) such that
yi = f (xi,1 , . . . , xi,p )
Usually there is error in measuring yi , so no f (·) fits the data perfectly
⇒ we model yi as a realisation of a RV Yi
So instead, find an f (·) that is “close” to y1 , . . . , yn
It is “supervised” because we have examples to learn from
Supervised learning model depends on form of f (·)
Linear Regression
Linear regression is a special type of supervised learning
In this case, we take the function f (·) that relates the
predictors to the target as being linear
One of the most important models in statistics
The resulting model is highly interpretable
It is very flexible and can even handle nonlinear relationships
It is computationally efficient to fit, even for very large p
Enormous area of research and work
⇒ we will get acquainted with the basics
Simple Linear Regression (1)
Consider the following dataset (which we examined in Studio 5):
Imagine we want to model blood pressure
Simple Linear Regression (2)
Blood pressure plotted against patient ID
[Figure: Blood Pressure (mmHg) vs Patient ID]
Simple Linear Regression (3)
Our blood pressure variable BP1 , . . . , BP20 is continuous
⇒ we choose to model it using a normal distribution
The maximum likelihood estimate of the mean µ is
\[ \hat{\mu} = \frac{1}{20} \sum_{i=1}^{20} y_i = 114 \]
which is equivalent to the sample mean
We have a new person from the population this sample was
drawn from and we want to predict their blood pressure
Using our simple model, our best guess of this person's blood
pressure is 114, i.e., the estimated mean µ̂
Simple Linear Regression (4)
Prediction of BP using the mean
[Figure: Blood Pressure (mmHg) vs Patient ID]
Simple Linear Regression (5)
How good is our model at predicting?
One way we could measure this is through prediction error
We don’t know future data, but we can look to see how well it
predicts the data we have
Let ŷi denote the prediction of sample yi using a model; then
ei = yi − ŷi
are the errors between our model predictions ŷi and the
observed data yi
⇒ often called residual errors, or just residuals
A good fit would lead to overall small errors
Simple Linear Regression (6)
Prediction of BP using the mean, showing errors/residuals
[Figure: Blood Pressure (mmHg) vs Patient ID]
Simple Linear Regression (7)
We can summarise the total error of fit of our model by
\[ \mathrm{RSS} = \sum_{i=1}^{n} e_i^2 \]
which is called the residual sum-of-squared errors.
For our simple mean model RSS = 560
Can we do better (smaller error) if we use one of the other
measured variables to help predict blood pressure?
For example, if we took a person's weight into account, could
we build a better predictor of their blood pressure?
To get an idea if there is scope for improvement we can plot
blood pressure vs weight
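To make this concrete, the following is a minimal Python sketch of computing the mean-model prediction, the residuals and the RSS. The values below are hypothetical, not the actual Studio 5 data (with the real data these come out to µ̂ = 114 and RSS = 560, as quoted above):

import numpy as np

# Hypothetical blood pressure values (mmHg); not the actual Studio 5 data
bp = np.array([110, 122, 118, 113, 112, 117, 110, 115, 114, 114,
               116, 125, 114, 106, 115, 112, 121, 108, 110, 111])

mu_hat = bp.mean()              # ML estimate of the mean (the "mean model" prediction)
residuals = bp - mu_hat         # e_i = y_i - mu_hat
rss = np.sum(residuals ** 2)    # residual sum-of-squared errors
print(mu_hat, rss)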
Simple Linear Regression (8)
Blood pressure vs weight – BP appears to increase with weight
[Figure: Blood Pressure (mmHg) vs Weight (kg)]
Simple Linear Regression (9)
Our simple mean model is clearly not a good fit
[Figure: Blood Pressure (mmHg) vs Weight (kg)]
Simple Linear Regression (10)
Our simple mean model predicts blood pressure by
E [BPi ] = µ
irrespective of any other data on individual i
Let (Weight1 , . . . , Weight20 ) be the weights of our 20
individuals
We can let the mean vary as a linear function of weight, i.e.,
E [BPi | Weighti ] = β0 + β1 Weighti
This says that the conditional mean of blood pressure BPi for
individual i, given the individual’s weight Weighti , is equal to
β0 plus β1 times the weight Weighti
Note our simple mean model is a linear model with β1 = 0
Simple Linear Regression (11)
The linear model E [BPi | Weighti ] = 2.2053 + 1.2009 Weighti
[Figure: Blood Pressure (mmHg) vs Weight (kg), with the fitted line]
Simple Linear Regression (12)
Residuals: ei = BPi − 2.2053 − 1.2009 Weighti (RSS = 54.5)
[Figure: Blood Pressure (mmHg) vs Weight (kg), with residuals shown]
Simple Linear Regression (13) – Key Slide
A linear model of the form
E [Yi | xi ] = ŷi = β0 + β1 xi
is called a simple linear regression.
It has two free regression parameters
β0 is the intercept; it is the value of the predicted value ŷi
when the predictor xi = 0
β1 is a regression coefficient; it is the amount the predicted
value ŷi changes by in one unit change of the predictor xi
Simple Linear Regression (14)
In our example yi is blood pressure and xi is weight;
ŷi = 2.2053 + 1.2009 xi
so
For every additional kilogram a person weighs, their blood
pressure increases by 1.2009 mmHg
For a person who weighs zero kilograms, the predicted blood
pressure is 2.2053 mmHg
The predictions might not make sense outside of sensible
ranges of the predictors!
Fitting Simple Linear Regressions (1)
How did we arrive at β̂0 = 2.2053 and β̂1 = 1.2009 in our
blood pressure vs weight example?
Measure fit of a model by its RSS
\[ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 \]
Smaller error = better fit
Fitting Simple Linear Regressions (2)
So the least-squares principle says we choose (estimate) β0 , β1 to
minimise the RSS
Formally,
\[ (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \left\{ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \right\} \]
These are often called least-squares (LS) estimates.
There are alternative measures of error; for example least sum
of absolute errors.
Least squares is popular due to simplicity, computational
efficiency and connections to normal models
Fitting Simple Linear Regressions (3)
The RSS is a function of β0 , β1 , i.e.,
\[ \mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
The least-squares estimates are the solutions to the equations
\[ \frac{\partial\, \mathrm{RSS}(\beta_0, \beta_1)}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \]
\[ \frac{\partial\, \mathrm{RSS}(\beta_0, \beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]
where we use the chain rule.
Fitting Simple Linear Regressions (4)
The solution for β0 is
\[ \hat{\beta}_0 = \frac{\left( \sum_{i=1}^{n} y_i \right)\left( \sum_{i=1}^{n} x_i^2 \right) - \left( \sum_{i=1}^{n} y_i x_i \right)\left( \sum_{i=1}^{n} x_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \]
and the solution for β1 is
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2} \]
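As a sanity check, a minimal Python sketch (with hypothetical x and y values) that evaluates these closed-form expressions directly, and verifies the residual properties discussed on the next slide:

import numpy as np

# Hypothetical predictor (weight) and target (blood pressure) values
x = np.array([86.0, 90.0, 92.0, 95.0, 98.0, 100.0])
y = np.array([105.0, 110.0, 112.0, 116.0, 120.0, 123.0])
n = len(x)

# Closed-form least-squares estimates, exactly as in the equations above
beta0_hat = (np.sum(y) * np.sum(x**2) - np.sum(y * x) * np.sum(x)) / \
            (n * np.sum(x**2) - np.sum(x)**2)
beta1_hat = (np.sum(y * x) - beta0_hat * np.sum(x)) / np.sum(x**2)

# Predictions and residuals
y_hat = beta0_hat + beta1_hat * x
e = y - y_hat

print(beta0_hat, beta1_hat)
print(np.sum(e))                  # approximately zero
print(np.corrcoef(x, e)[0, 1])    # approximately zero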
Fitting Simple Linear Regressions (5)
Given LS estimates β̂0 , β̂1 we can find the predictions for our
data
ŷi = β̂0 + β̂1 xi
and residuals
ei = yi − ŷi
The vector of residuals e = (e1 , . . . , en ) has the properties
\[ \sum_{i=1}^{n} e_i = 0 \quad \text{and} \quad \mathrm{corr}(x, e) = 0 \]
where x = (x1 , . . . , xn ) is our predictor variable.
This means least-squares fits a line such that the mean of the
resulting residuals is zero, and the residuals are uncorrelated
with the predictor.
Multiple Linear Regression (1) – Key Slide
We have used one explanatory variable in our linear model
A great strength of linear models is that they easily handle
multiple variables
Let xi,j denote the variable j for individual i, where
j = 1, . . . , p; i.e., we have p explanatory variables. Then
\[ \mathbb{E}[\, y_i \mid x_{i,1}, \ldots, x_{i,p} \,] = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} \]
The intercept is now the expected value of the target when
xi,1 = xi,2 = · · · = xi,p = 0
The coefficient βj is the increase in the expected value of the
target per unit change in explanatory variable j
Multiple Linear Regression (2) – Key Slide
Fit a multiple linear regression using least-squares
⇒ assume p < n, otherwise solution is non-unique
Given coefficients β0 , β1 , . . . , βp the RSS is
\[ \mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2 \]
Now we have to solve
\[ (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \arg\min_{\beta_0, \beta_1, \ldots, \beta_p} \left\{ \mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) \right\} \]
Efficient algorithms exist to find these estimates
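For instance, NumPy's least-squares solver can be used; a minimal sketch on hypothetical data (the variable names here are illustrative):

import numpy as np

# Hypothetical data: n = 100 individuals, p = 3 predictors
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.7, 0.0]) + rng.normal(scale=0.3, size=n)

# Prepend a column of ones so that the intercept beta_0 is estimated as well
Xd = np.column_stack([np.ones(n), X])

# Least-squares estimates (beta_0 first, then beta_1, ..., beta_p)
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)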
Multiple Linear Regression (3)
Matrix algebra can simplify linear regression equations
We have a vector of targets y = (y1 , . . . , yn )
We have a vector of coefficients β = (β1 , . . . , βp )
We can treat each variable as a vector xj = (x1,j , . . . , xn,j )
Arrange these vectors into a matrix X of predictors:
\[ X = (x_1, x_2, \ldots, x_p) = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]
We call this the design matrix
⇒ has p columns (predictors) and n rows (individuals)
Multiple Linear Regression (4) – Key Slide
We can form our predictions and residuals using
ŷ = Xβ + β0 1n and e = y − ŷ.
where 1n is a vector of n ones.
We can then write our RSS very compactly as
RSS(β0 , β) = eᵀe
If β̂0 , β̂ are least-squares estimates, then
corr(xj , e) = 0 for all j
That is, least-squares finds the plane such that the residuals
(errors) are uncorrelated with all predictors in the model
R-squared (R2 ) (1)
Residual sum-of-squares tells us how well we fit the data
But the scale is arbitrary – what does an RSS of 2,352 mean?
Instead, we define the RSS relative to some reference point
We use the total sum-of-squares as the reference:
\[ \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
which is the residual sum-of-squares obtained by fitting the
intercept only (the “mean model”)
R-squared (R2 ) (2) – Key Slide
The R² value is then defined as
\[ R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]
which is also called the coefficient-of-determination
R² lies between 0 (model has no explanatory power)
and 1 (model completely explains the data)
The higher the R², the better the fit to the data
Adding an extra predictor always increases R²
⇒ predictors that greatly increase R² are potentially
important
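A minimal sketch of computing R² from a model's fitted values, assuming hypothetical y and ŷ vectors:

import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - RSS/TSS."""
    rss = np.sum((y - y_hat) ** 2)         # residual sum-of-squares of the fitted model
    tss = np.sum((y - np.mean(y)) ** 2)    # RSS of the intercept-only ("mean") model
    return 1.0 - rss / tss

# Hypothetical observed and fitted values
y = np.array([105.0, 110.0, 112.0, 116.0, 120.0, 123.0])
y_hat = np.array([106.0, 109.5, 112.5, 116.5, 119.0, 123.5])
print(r_squared(y, y_hat))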
Example: Multiple regression and R2 (1)
Let us revisit our blood pressure data
The residual sum-of-squares of our mean model was 560
⇒ this is our reference model (total sum-of-squares)
Regression of blood pressure (BP) onto weight gave us
E [BP | Weight] = 2.20 + 1.2 Weight
which had an RSS of 54.52 ⇒ R2 ≈ 0.9
Example: Multiple regression and R2 (2)
In our data we also have an individual’s age
We fit a multiple linear regression of BP onto weight and age
E [BP | Weight, Age] = −16.57 + 1.03 Weight + 0.71 Age
This says that:
for every kilogram, a person’s blood pressure rises by
1.03 mmHg;
for every year, a person’s blood pressure rises by 0.71 mmHg;
This model has an RSS of 4.82 ⇒ R2 = 0.99
So including age seems to improve our fit substantially
Handling Categorical Predictors (1)
Sometimes our predictors are categorical variables
This means the numerical values they take are just codes
for different categories
It makes no sense to “add” or “multiply” them
Instead we turn them into K − 1 new predictors (if K is the
number of categories)
These predictors take on a one when an individual is in a
particular category, and zero otherwise
They are called indicator variables.
Handling Categorical Predictors (2) – Key Slide
Example variable with four categories coded as 1, 2, 3 and 4
\[ \begin{pmatrix} 1 \\ 2 \\ 1 \\ 3 \\ 4 \\ 2 \\ 3 \\ 2 \\ 4 \end{pmatrix} \Longrightarrow \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \]
We do not build an indicator for the first category
Regression coefficients for the other categories are increases in the
target relative to being in the first category
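A minimal sketch of building the K − 1 indicator variables for the coded categorical predictor shown above (the first category is used as the reference):

import numpy as np

codes = np.array([1, 2, 1, 3, 4, 2, 3, 2, 4])   # categorical predictor coded 1..4
categories = np.unique(codes)                    # array([1, 2, 3, 4])

# One indicator column per category except the first (the reference category)
indicators = np.column_stack([(codes == c).astype(int) for c in categories[1:]])
print(indicators)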
Nonlinear effects (1)
Sometimes predictors are related to the target in a nonlinear
fashion
We can still use linear models by transforming the predictors
If the transformed predictors are linearly related to the target,
regression will work well
We can often detect this by plotting the residuals against a
variable – if they exhibit a nonlinear trend or curve, it is a sign
that a transformation might be needed
Nonlinear effects (2)
Example dataset
Nonlinear effects (3)
Fitted model: ŷ = −1.07 + 9.55x; RSS = 0.95
Nonlinear effects (4)
Example data: residuals exhibit clear nonlinear trend
Nonlinear effects (5)
There are several common transformations
A logarithmic transformation can be used if the predictor
seems to be more variable for larger values of the predictor
xi,j ⇒ log xi,j
Can only be used if all xi,j > 0
Polynomial transformations offer general purpose nonlinear fits
We turn our variable into q new variables of the form:
\[ x_{i,j} \;\Rightarrow\; x_{i,j},\, x_{i,j}^2,\, x_{i,j}^3,\, \ldots,\, x_{i,j}^q \]
The higher the q the more nonlinear the fit can become, but
at risk of overfitting
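A minimal sketch of a polynomial transformation followed by a least-squares fit, using hypothetical data and q = 2 (a quadratic model like the one on the next slide):

import numpy as np

# Hypothetical data with an underlying quadratic relationship
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=50)
y = 2.0 * x + 8.0 * x**2 + rng.normal(scale=0.1, size=50)

q = 2
# Design matrix with columns 1, x, x^2, ..., x^q
Xd = np.column_stack([x**k for k in range(q + 1)])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)   # roughly (0, 2, 8) for this simulated data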
Nonlinear effects (6)
New model: ŷ = −0.02 + 2.16x + 7.77x², R² = 0.999
Connecting LS to ML (1)
We will show that least squares is equivalent to maximum likelihood
under a normal error model. Let our targets Y1 , . . . , Yn be RVs
Write the linear regression model as
\[ Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} + \varepsilon_i \]
where εi is a random, unobserved “error”
Now assume that εi ∼ N (0, σ²)
This is equivalent to saying that
\[ Y_i \mid x_{i,1}, \ldots, x_{i,p} \;\sim\; N\!\left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j},\; \sigma^2 \right) \]
so each Yi is normally distributed with variance σ² and a
mean that depends on the values of the associated predictors
Connecting LS to ML (2)
Each Yi is independent
Given target data y the likelihood function can be written
\[ p(y \mid \beta_0, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2}{2\sigma^2} \right) \]
Noting $e^{-a} e^{-b} = e^{-a-b}$, this simplifies to
\[ \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2}{2\sigma^2} \right) \]
where we can see that the term in the numerator inside the exp(·) is the
residual sum-of-squares.
Connecting LS to ML (2) – Key Slide
Taking the negative-logarithm of this yields
\[ L(y \mid \beta_0, \beta, \sigma^2) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{\mathrm{RSS}(\beta_0, \beta)}{2\sigma^2} \]
As the value of σ² scales the RSS term, it is easy to see that
the values of β0 and β that minimise the negative
log-likelihood are the least-squares estimates β̂0 and β̂
LS estimates are same as the maximum likelihood estimates
assuming the random “errors” εi are normally distributed
Our residuals
ei = yi − ŷi
can be viewed as our estimates of the errors εi .
Connecting LS to ML (3)
How to estimate the error variance σ²?
The maximum likelihood estimate is:
\[ \hat{\sigma}^2_{\mathrm{ML}} = \frac{\mathrm{RSS}(\hat{\beta}_0, \hat{\beta})}{n} \]
but this tends to underestimate the actual variance.
A better estimate is the unbiased estimate
\[ \hat{\sigma}^2_{u} = \frac{\mathrm{RSS}(\hat{\beta}_0, \hat{\beta})}{n - p - 1} \]
where p is the number of predictors used to fit the model.
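A minimal sketch of both variance estimates, assuming we already have the residuals from a least-squares fit with p predictors:

import numpy as np

def variance_estimates(residuals, p):
    """ML and unbiased estimates of the error variance sigma^2."""
    rss = np.sum(residuals ** 2)
    n = len(residuals)
    sigma2_ml = rss / n              # maximum likelihood estimate (tends to be too small)
    sigma2_u = rss / (n - p - 1)     # unbiased estimate
    return sigma2_ml, sigma2_u

# Hypothetical residuals from a fit with p = 2 predictors
e = np.array([0.3, -0.5, 0.1, 0.4, -0.2, -0.1, 0.2, -0.2])
print(variance_estimates(e, p=2))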
Making predictions with a linear model
Given estimates β̂0 , β̂ we can make predictions about new data
To estimate the value of the target for some new predictor values
x′1 , x′2 , . . . , x′p :
\[ \hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x'_j \]
Using the normal model of residuals, we can also get a probability
distribution over future data:
\[ \hat{Y} \sim N\!\left( \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x'_j,\; \sigma^2 \right) \]
By changing the predictors we can see how the target changes
Example: seeing how weight and age affect blood pressure
Be careful using predictions outside of sensible predictor values!
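A minimal sketch of a point prediction and a normal predictive distribution for a new individual, using the weight-and-age coefficients quoted earlier; the error variance and the new individual's values are hypothetical, and estimation uncertainty in β̂ is ignored:

import numpy as np
from scipy import stats

# Coefficients from the fitted BP ~ Weight + Age model quoted earlier
beta0_hat = -16.57
beta_hat = np.array([1.03, 0.71])        # (Weight, Age) coefficients

sigma2 = 0.3                             # hypothetical error variance estimate
x_new = np.array([95.0, 50.0])           # hypothetical new individual: 95 kg, 50 years

y_hat = beta0_hat + beta_hat @ x_new     # point prediction of blood pressure
pred = stats.norm(loc=y_hat, scale=np.sqrt(sigma2))

print(y_hat)
print(pred.interval(0.95))               # central 95% predictive interval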
Outline
1 Linear Regression Models
Supervised Learning
Linear Regression Models
2 Model Selection for Linear Regression
Under and Overfitting
Model Selection Methods
Underfitting/Overfitting (1)
We often have many measured predictors
In our blood pressure example, we have weight, body surface
area, age, pulse rate and a measure of stress
Should we use them all, and if not, why not?
The R2 always improves as we include more predictors
⇒ so model always fits the data we have better
But prediction on new, unseen data might be worse
⇒ the ability to predict new, unseen data is called generalisation
Underfitting/Overfitting (2) – Key Slide
Risks of including/excluding predictors
Omitting important predictors
Called underfitting
Leads to systematic error, bias in predicting the target
Including spurious predictors
Called overfitting
Leads our model to “learn” noise and random variation
Poorer ability to predict new, unseen data from our
population
Underfitting/Overfitting Example (1)
Example: we observe x and y data and want to build a
prediction model for y using x
Data looks nonlinear so we use polynomial regression
We take x, x², x³, . . . , x²⁰ ⇒ very flexible model
How many terms to include?
For example, do we use
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \]
or
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + \varepsilon \]
or another model with some other number of polynomial
terms.
Underfitting/Overfitting Example (2)
Example dataset of 50 samples
Underfitting/Overfitting Example (3)
Use (x, x²), too simple – underfitting
Underfitting/Overfitting Example (4)
Use (x, x², . . . , x²⁰), too complex – overfitting
Underfitting/Overfitting Example (5)
(x, x², . . . , x⁶) seems “just right”. But how to find this model?
Using Hypothesis Testing – Key Slide
One approach is to use hypothesis testing
We know that a predictor j is unimportant if βj = 0
So we can test the hypothesis:
H0 : βj = 0
vs
HA : βj ≠ 0
which, in this setting, is a variant of the t-test (see Ross,
Chapter 9 and Studio 6)
Strengths: easy to apply, easy to understand
Weaknesses: difficult to directly compare two different models
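A minimal sketch of these coefficient t-tests computed by hand with NumPy/SciPy on hypothetical data; the second predictor is spurious, so its p-value should typically be large:

import numpy as np
from scipy import stats

# Hypothetical data: y depends on the first predictor but not the second
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])              # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
e = y - Xd @ beta_hat
sigma2_u = e @ e / (n - p - 1)                     # unbiased error variance estimate

se = np.sqrt(sigma2_u * np.diag(np.linalg.inv(Xd.T @ Xd)))
t_stat = beta_hat / se                             # t-statistics for H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stat), df=n - p - 1)
print(p_values)                                    # small p-value => evidence beta_j != 0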
Model Selection (1)
A different approach is through model selection
In the context of linear regression, we define a model by
specifying which predictors are included in the linear regression
For example, in our blood pressure example:
{Weight}
{Weight, Age}
{Age, Stress}
{Age, Stress, Pulse}
are some of the possible models we could build
Given a model, we can estimate the associated linear
regression coefficients using least-squares/maximum likelihood
The question then becomes how to choose a good model
Model Selection (2)
We use maximum likelihood to choose the parameters
Remember, this means we adjust the parameters of our
distribution until we find the ones that maximise the
probability of seeing the data y we have observed
Can we use this to select a model as well as parameters?
Assume normal distribution for our regression errors
The minimised negative log-likelihood (i.e., L(y | β0 , β, σ²)
evaluated at the ML estimates β̂0 , β̂, σ̂²ML ) is
\[ L(y \mid \hat{\beta}_0, \hat{\beta}, \hat{\sigma}^2_{\mathrm{ML}}) = \frac{n}{2} \log\!\left( 2\pi \, \mathrm{RSS}(\hat{\beta}_0, \hat{\beta})/n \right) + \frac{n}{2} \]
This always decreases as we add more predictors to our model
⇒ cannot be used to select models, only parameters
Model Selection (3) – Key Slide
Let M denote a model (set of predictors to use)
Let L(y | β̂0 , β̂, σ̂²ML , M) denote the minimised negative
log-likelihood for the model M
We can select a model by minimising an information criterion
\[ L(y \mid \hat{\beta}_0, \hat{\beta}, \hat{\sigma}^2_{\mathrm{ML}}, M) + \alpha(n, k_M) \]
where
α(·) is a model complexity penalty;
kM is the number of predictors in model M;
n is the size of our data sample.
This is a form of penalized likelihood estimation
⇒ a model is penalized by its complexity (ability to fit data)
Model Selection (4)
How to measure complexity, i.e., choose α(·)?
Akaike Information Criterion (AIC):
\[ \alpha(n, k_M) = k_M \]
Bayesian Information Criterion (BIC):
\[ \alpha(n, k_M) = \frac{k_M}{2} \log n \]
The AIC penalty is smaller than the BIC penalty ⇒ increased chance of overfitting
The BIC penalty is bigger than the AIC penalty ⇒ increased chance of underfitting
Differences in scores of 3 or more are considered significant
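A minimal sketch of scoring a fitted linear model with AIC and BIC under these definitions, given its RSS, the sample size n and the number of predictors k (the two RSS values below are the ones quoted for the blood pressure models):

import numpy as np

def neg_log_likelihood(rss, n):
    """Minimised negative log-likelihood under the normal error model."""
    return 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n

def aic_score(rss, n, k):
    return neg_log_likelihood(rss, n) + k

def bic_score(rss, n, k):
    return neg_log_likelihood(rss, n) + 0.5 * k * np.log(n)

# Weight-only model vs weight + age model for the blood pressure data (n = 20)
print(aic_score(54.52, 20, 1), bic_score(54.52, 20, 1))
print(aic_score(4.82, 20, 2), bic_score(4.82, 20, 2))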
Finding a Good Model (1)
Most obvious approach is to try all possible combinations of
predictors, and choose one that has smallest information
criterion score
Called the all subsets approach
If we have p predictors then we have 2ᵖ models to try
For p = 50, 2ᵖ ≈ 1.1 × 10¹⁵!
So this method is computationally intractable for moderate p
Finding a Good Model (2)
An alternative is to search through the model space
Forward selection algorithm (sketched in code below):
1 Start with the empty model;
2 Find the predictor that reduces the info criterion the most;
3 If no predictor improves the model, end;
4 Otherwise, add this predictor to the model;
5 Return to Step 2.
Backwards selection is a related algorithm
Start with the full model and remove predictors
These searches are computationally tractable for large p, but may miss
important predictors
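A minimal sketch of forward selection with the BIC penalty on hypothetical data; the helper functions and variable names here are illustrative, not from the unit materials:

import numpy as np

def fit_rss(X_subset, y):
    """Least-squares fit with an intercept; returns the RSS."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X_subset]) if X_subset.shape[1] > 0 else np.ones((n, 1))
    beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ beta_hat
    return e @ e

def bic(rss, n, k):
    """Negative log-likelihood plus the BIC complexity penalty."""
    return 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n + 0.5 * k * np.log(n)

def forward_selection(X, y):
    n, p = X.shape
    selected = []                                        # 1. start with the empty model
    best_score = bic(fit_rss(X[:, []], y), n, 0)
    while True:
        candidates = [j for j in range(p) if j not in selected]
        scores = {j: bic(fit_rss(X[:, selected + [j]], y), n, len(selected) + 1)
                  for j in candidates}                   # 2. score adding each remaining predictor
        if not scores or min(scores.values()) >= best_score:
            return selected                              # 3. stop if nothing improves the criterion
        best_j = min(scores, key=scores.get)
        selected.append(best_j)                          # 4. add the best predictor
        best_score = scores[best_j]                      # 5. and repeat

# Hypothetical example: only predictors 0 and 2 matter
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)
print(forward_selection(X, y))    # typically selects predictors 0 and 2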
Reading/Terms to Revise
Reading for this week: Chapter 9 of Ross.
Terms you should know:
Target, predictor, explanatory variable;
Intercept, coefficient;
R2 value;
Categorical predictors;
Polynomial regression;
Model, model selection;
Overfitting, underfitting
Information Criteria;
This week we looked at supervised learning for continuous
targets; next week we will examine supervised learning for
categorical targets (classification).