FIT2086 Lecture 6
Linear Regression
Daniel F. Schmidt
Faculty of Information Technology, Monash University
September 4, 2017
Outline
1 Linear Regression Models
Supervised Learning
Linear Regression Models
2 Model Selection for Linear Regression
Under and Overfitting
Model Selection Methods
Revision from last week
Hypothesis testing; test null hypothesis vs alternative
H0 : null hypothesis
vs
HA : alternative hypothesis
A test-statistic measures how different our observed sample is
from the null hypothesis
A p-value quantifies the evidence against the null hypothesis
A p-value is the probability of seeing a sample that results in a
test statistic as extreme as, or more extreme than, the one we
observed, purely by chance, if the null hypothesis were true.
Supervised Learning (1)
Over the last three weeks we have looked at parameter
inference
In week 3 we examined point estimation using maximum
likelihood
Selecting our “best guess” at a single value of the parameter
In week 4 we examined interval estimation using confidence
intervals
Give a range of plausible values for the unknown population
parameter
In week 5 we examined hypothesis testing
Quantify statistical evidence against a given hypothesis
Supervised Learning (2)
Now we will start to see how these tools can be used to build
more complex models
Over the next three weeks we will look at supervised learning
In particular, we will look at linear regression
But first, what is supervised learning?
Supervised Learning (3)
Imagine we have measured p + 1 variables on n individuals
(people, objects, things)
We would like to predict one of the variables using the
remaining p variables
If the variable we are predicting is categorical, we are
performing classification
Example: predicting if someone has diabetes from medical
measurements.
If the variable we are predicting is numerical, we are
performing regression
Example: Predicting the quality of a wine from chemical and
seasonal information.
Supervised Learning (4)
The variable we are predicting is designated the “y” variable
We have (y1 , . . . , yn )
This variable is often called the:
target;
response;
outcome.
The other variables are usually designated “X” variables
We have (xi,1 , . . . , xi,p ) for i = 1, . . . , n
These variables are often called the
explanatory variables;
predictors;
covariates;
exposures.
Usually we assume the targets are random variables and the
predictors are known without error
Supervised Learning (4)
Supervised learning: find a relationship between the targets yi
and associated predictors xi,1 , . . . , xi,p .
That is, learn a function f (·) such that
yi = f (xi,1 , . . . , xi,p )
Usually there is error in measuring yi , so no f (·) fits the data perfectly
⇒ we model yi as a realisation of a RV Yi
So instead, find an f (·) that is “close” to y1 , . . . , yn
It is “supervised” because we have examples to learn from
Supervised learning model depends on form of f (·)
Linear Regression
Linear regression is a special type of supervised learning
In this case, we take the function f (·) that relates the
predictors to the target as being linear
One of the most important models in statistics
The resulting model is highly interpretable
It is very flexible and can even handle nonlinear relationships
It is computationally efficient to fit, even for very large p
Enormous area of research and work
⇒ we will get acquainted with the basics
Simple Linear Regression (1)
Consider the following dataset (which we examined in Studio 5):
Imagine we want to model blood pressure
Simple Linear Regression (2)
Blood pressure plotted against patient ID
[Figure: Blood Pressure (mmHg) vs Patient ID]
Simple Linear Regression (3)
Our blood pressure variable BP1 , . . . , BP20 is continuous
⇒ we choose to model it using a normal distribution
The maximum likelihood estimate of the mean µ is
\[ \hat{\mu} = \frac{1}{20} \sum_{i=1}^{20} y_i = 114 \]
which is equivalent to the sample mean
We have a new person from the population this sample was
drawn from and we want to predict their blood pressure
Using our simple model, our best guess of this person's blood
pressure is 114, i.e., the estimated mean µ̂
Simple Linear Regression (4)
Prediction of BP using the mean
[Figure: Blood Pressure (mmHg) vs Patient ID]
Simple Linear Regression (5)
How good is our model at predicting?
One way we could measure this is through prediction error
We don’t know future data, but we can look to see how well it
predicts the data we have
Let ŷi denote the prediction of sample yi using a model; then
ei = yi − ŷi
are the errors between our model predictions ŷi and the
observed data yi
⇒ often called residual errors, or just residuals
A good fit would lead to overall small errors
Simple Linear Regression (6)
Prediction of BP using the mean, showing errors/residuals
[Figure: Blood Pressure (mmHg) vs Patient ID]
Simple Linear Regression (7)
We can summarise the total error of fit of our model by
\[ \mathrm{RSS} = \sum_{i=1}^{n} e_i^2 \]
which is called the residual sum-of-squared errors.
For our simple mean model RSS = 560
Can we do better (smaller error) if we use one of the other
measured variables to help predict blood pressure?
For example, if we took a person's weight into account, could
we build a better predictor of their blood pressure?
To get an idea if there is scope for improvement we can plot
blood pressure vs weight
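To make this concrete, the following is a minimal Python sketch of computing the mean-model prediction, the residuals and the RSS. The values below are hypothetical, not the actual Studio 5 data (with the real data these come out to µ̂ = 114 and RSS = 560, as quoted above):

import numpy as np

# Hypothetical blood pressure values (mmHg); not the actual Studio 5 data
bp = np.array([110, 122, 118, 113, 112, 117, 110, 115, 114, 114,
               116, 125, 114, 106, 115, 112, 121, 108, 110, 111])

mu_hat = bp.mean()              # ML estimate of the mean (the "mean model" prediction)
residuals = bp - mu_hat         # e_i = y_i - mu_hat
rss = np.sum(residuals ** 2)    # residual sum-of-squared errors
print(mu_hat, rss)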
Simple Linear Regression (8)
Blood pressure vs weight – BP appears to increase with weight
[Figure: Blood Pressure (mmHg) vs Weight (kg)]
Simple Linear Regression (9)
Our simple mean model is clearly not a good fit
[Figure: Blood Pressure (mmHg) vs Weight (kg)]
Simple Linear Regression (10)
Our simple mean model predicts blood pressure by
E [BPi ] = µ
irrespective of any other data on individual i
Let (Weight1 , . . . , Weight20 ) be the weights of our 20
individuals
We can let the mean vary as a linear function of weight, i.e.,
E [BPi | Weighti ] = β0 + β1 Weighti
This says that the conditional mean of blood pressure BPi for
individual i, given the individual’s weight Weighti , is equal to
β0 plus β1 times the weight Weighti
Note our simple mean model is a linear model with β1 = 0
Simple Linear Regression (11)
The linear model E [BPi | Weighti ] = 2.2053 + 1.2009 Weighti
[Figure: Blood Pressure (mmHg) vs Weight (kg), with the fitted line]
Simple Linear Regression (12)
Residuals: ei = BPi − 2.2053 − 1.2009 Weighti (RSS = 54.5)
[Figure: Blood Pressure (mmHg) vs Weight (kg), with residuals shown]
Simple Linear Regression (13) – Key Slide
A linear model of the form
E [Yi | xi ] = ŷi = β0 + β1 xi
is called a simple linear regression.
It has two free regression parameters
β0 is the intercept; it is the value of the predicted value ŷi
when the predictor xi = 0
β1 is a regression coefficient; it is the amount the predicted
value ŷi changes by in one unit change of the predictor xi
Simple Linear Regression (14)
In our example yi is blood pressure and xi is weight;
ŷi = 2.2053 + 1.2009 xi
so
For every additional kilogram a person weighs, their blood
pressure increases by 1.2009 mmHg
For a person who weighs zero kilograms, the predicted blood
pressure is 2.2053 mmHg
The predictions might not make sense outside of sensible
ranges of the predictors!
Fitting Simple Linear Regressions (1)
How did we arrive at β̂0 = 2.2053 and β̂1 = 1.2009 in our
blood pressure vs weight example?
Measure fit of a model by its RSS
\[ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 \]
Smaller error = better fit
Fitting Simple Linear Regressions (2)
So the least-squares principle says we choose (estimate) β0 , β1 to
minimise the RSS
Formally,
\[ (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \left\{ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \right\} \]
These are often called least-squares (LS) estimates.
There are alternative measures of error; for example least sum
of absolute errors.
Least squares is popular due to simplicity, computational
efficiency and connections to normal models
Fitting Simple Linear Regressions (3)
The RSS is a function of β0 , β1 , i.e.,
\[ \mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
The least-squares estimates are the solutions to the equations
\[ \frac{\partial\, \mathrm{RSS}(\beta_0, \beta_1)}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \]
\[ \frac{\partial\, \mathrm{RSS}(\beta_0, \beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]
where we use the chain rule.
Fitting Simple Linear Regressions (4)
The solution for β0 is
\[ \hat{\beta}_0 = \frac{\left( \sum_{i=1}^{n} y_i \right)\left( \sum_{i=1}^{n} x_i^2 \right) - \left( \sum_{i=1}^{n} y_i x_i \right)\left( \sum_{i=1}^{n} x_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \]
and the solution for β1 is
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2} \]
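As a sanity check, a minimal Python sketch (with hypothetical x and y values) that evaluates these closed-form expressions directly, and verifies the residual properties discussed on the next slide:

import numpy as np

# Hypothetical predictor (weight) and target (blood pressure) values
x = np.array([86.0, 90.0, 92.0, 95.0, 98.0, 100.0])
y = np.array([105.0, 110.0, 112.0, 116.0, 120.0, 123.0])
n = len(x)

# Closed-form least-squares estimates, exactly as in the equations above
beta0_hat = (np.sum(y) * np.sum(x**2) - np.sum(y * x) * np.sum(x)) / \
            (n * np.sum(x**2) - np.sum(x)**2)
beta1_hat = (np.sum(y * x) - beta0_hat * np.sum(x)) / np.sum(x**2)

# Predictions and residuals
y_hat = beta0_hat + beta1_hat * x
e = y - y_hat

print(beta0_hat, beta1_hat)
print(np.sum(e))                  # approximately zero
print(np.corrcoef(x, e)[0, 1])    # approximately zero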
Fitting Simple Linear Regressions (5)
Given LS estimates β̂0 , β̂1 we can find the predictions for our
data
ŷi = β̂0 + β̂1 xi
and residuals
ei = yi − ŷi
The vector of residuals e = (e1 , . . . , en ) has the properties
\[ \sum_{i=1}^{n} e_i = 0 \quad \text{and} \quad \mathrm{corr}(x, e) = 0 \]
where x = (x1 , . . . , xn ) is our predictor variable.
This means least-squares fits a line such that the mean of the
resulting residuals is zero, and the residuals are uncorrelated
with the predictor.
Multiple Linear Regression (1) – Key Slide
We have used one explanatory variable in our linear model
A great strength of linear models is that they easily handle
multiple variables
Let xi,j denote the variable j for individual i, where
j = 1, . . . , p; i.e., we have p explanatory variables. Then
\[ \mathbb{E}[\, y_i \mid x_{i,1}, \ldots, x_{i,p} \,] = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} \]
The intercept is now the expected value of the target when
xi,1 = xi,2 = · · · = xi,p = 0
The coefficient βj is the increase in the expected value of the
target per unit change in explanatory variable j
Multiple Linear Regression (2) – Key Slide
Fit a multiple linear regression using least-squares
⇒ assume p < n, otherwise solution is non-unique
Given coefficients β0 , β1 , . . . , βp the RSS is
\[ \mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2 \]
Now we have to solve
\[ (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \arg\min_{\beta_0, \beta_1, \ldots, \beta_p} \left\{ \mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) \right\} \]
Efficient algorithms exist to find these estimates
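For instance, NumPy's least-squares solver can be used; a minimal sketch on hypothetical data (the variable names here are illustrative):

import numpy as np

# Hypothetical data: n = 100 individuals, p = 3 predictors
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.7, 0.0]) + rng.normal(scale=0.3, size=n)

# Prepend a column of ones so that the intercept beta_0 is estimated as well
Xd = np.column_stack([np.ones(n), X])

# Least-squares estimates (beta_0 first, then beta_1, ..., beta_p)
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)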
Multiple Linear Regression (3)
Matrix algebra can simplify linear regression equations
We have a vector of targets y = (y1 , . . . , yn )
We have a vector of coefficients β = (β1 , . . . , βp )
We can treat each variable as a vector xj = (x1,j , . . . , xn,j )
Arrange these vectors into a matrix X of predictors:
\[ X = (x_1, x_2, \ldots, x_p) = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]
We call this the design matrix
⇒ has p columns (predictors) and n rows (individuals)
Multiple Linear Regression (4) – Key Slide
We can form our predictions and residuals using
ŷ = Xβ + β0 1n and e = y − ŷ.
where 1n is a vector of n ones.
We can then write our RSS very compactly as
RSS(β0 , β) = eᵀe
If β̂0 , β̂ are least-squares estimates, then
corr(xj , e) = 0 for all j
That is, least-squares finds the plane such that the residuals
(errors) are uncorrelated with all predictors in the model
R-squared (R2 ) (1)
Residual sum-of-squares tells us how well we fit the data
But the scale is arbitrary – what does an RSS of 2,352 mean?
Instead, we define the RSS relative to some reference point
We use the total sum-of-squares as the reference:
\[ \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
which is the residual sum-of-squares obtained by fitting the
intercept only (the “mean model”)
R-squared (R2 ) (2) – Key Slide
The R² value is then defined as
\[ R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \]
which is also called the coefficient-of-determination
R² lies between 0 (model has no explanatory power)
and 1 (model completely explains the data)
The higher the R², the better the fit to the data
Adding an extra predictor always increases R²
⇒ predictors that greatly increase R² are potentially
important
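A minimal sketch of computing R² from a model's fitted values, assuming hypothetical y and ŷ vectors:

import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - RSS/TSS."""
    rss = np.sum((y - y_hat) ** 2)         # residual sum-of-squares of the fitted model
    tss = np.sum((y - np.mean(y)) ** 2)    # RSS of the intercept-only ("mean") model
    return 1.0 - rss / tss

# Hypothetical observed and fitted values
y = np.array([105.0, 110.0, 112.0, 116.0, 120.0, 123.0])
y_hat = np.array([106.0, 109.5, 112.5, 116.5, 119.0, 123.5])
print(r_squared(y, y_hat))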
Example: Multiple regression and R2 (1)
Let us revisit our blood pressure data
The residual sum-of-squares of our mean model was 560
⇒ this is our reference model (total sum-of-squares)
Regression of blood pressure (BP) onto weight gave us
E [BP | Weight] = 2.20 + 1.2 Weight
which had an RSS of 54.52 ⇒ R2 ≈ 0.9
Example: Multiple regression and R2 (2)
In our data we also have an individual’s age
We fit a multiple linear regression of BP onto weight and age
E [BP | Weight, Age] = −16.57 + 1.03 Weight + 0.71 Age
This says that:
for every kilogram, a person’s blood pressure rises by
1.03 mmHg;
for every year, a person’s blood pressure rises by 0.71 mmHg;
This model has an RSS of 4.82 ⇒ R2 = 0.99
So including age seems to improve our fit substantially
Handling Categorical Predictors (1)
Sometimes our predictors are categorical variables
This means the numerical values they take are just codes
for different categories
It makes no sense to “add” or “multiply” them
Instead we turn them into K − 1 new predictors (if K is the
number of categories)
These predictors take on a one when an individual is in a
particular category, and zero otherwise
They are called indicator variables.
Handling Categorical Predictors (2) – Key Slide
Example variable with four categories coded as 1, 2, 3 and 4
\[ \begin{pmatrix} 1 \\ 2 \\ 1 \\ 3 \\ 4 \\ 2 \\ 3 \\ 2 \\ 4 \end{pmatrix} \Longrightarrow \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \]
We do not build an indicator for the first category
Regression coefficients for the other categories are increases in the
target relative to being in the first category
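A minimal sketch of building the K − 1 indicator variables for the coded categorical predictor shown above (the first category is used as the reference):

import numpy as np

codes = np.array([1, 2, 1, 3, 4, 2, 3, 2, 4])   # categorical predictor coded 1..4
categories = np.unique(codes)                    # array([1, 2, 3, 4])

# One indicator column per category except the first (the reference category)
indicators = np.column_stack([(codes == c).astype(int) for c in categories[1:]])
print(indicators)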
Nonlinear effects (1)
Sometimes predictors are related to the target in a nonlinear
fashion
We can still use linear models by transforming the predictors
If the transformed predictors are linearly related to the target,
regression will work well
We can often detect this by plotting the residuals against a
variable – if they exhibit a nonlinear trend or curve, it is a sign
that a transformation might be needed
Nonlinear effects (2)
Example dataset
Nonlinear effects (3)
Fitted model: ŷ = −1.07 + 9.55x; RSS = 0.95
Nonlinear effects (4)
Example data: residuals exhibit clear nonlinear trend
Nonlinear effects (5)
There are several common transformations
A logarithmic transformation can be used if the predictor
seems to be more variable for larger values of the predictor
xi,j ⇒ log xi,j
Can only be used if all xi,j > 0
Polynomial transformations offer general purpose nonlinear fits
We turn our variable into q new variables of the form:
\[ x_{i,j} \;\Rightarrow\; x_{i,j},\, x_{i,j}^2,\, x_{i,j}^3,\, \ldots,\, x_{i,j}^q \]
The higher the q the more nonlinear the fit can become, but
at risk of overfitting
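A minimal sketch of a polynomial transformation followed by a least-squares fit, using hypothetical data and q = 2 (a quadratic model like the one on the next slide):

import numpy as np

# Hypothetical data with an underlying quadratic relationship
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=50)
y = 2.0 * x + 8.0 * x**2 + rng.normal(scale=0.1, size=50)

q = 2
# Design matrix with columns 1, x, x^2, ..., x^q
Xd = np.column_stack([x**k for k in range(q + 1)])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)   # roughly (0, 2, 8) for this simulated data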
Nonlinear effects (6)
New model: ŷ = −0.02 + 2.16x + 7.77x², R² = 0.999
Connecting LS to ML (1)
We will show that least squares is equivalent to maximum likelihood
under a normal error model. Let our targets Y1 , . . . , Yn be RVs
Write the linear regression model as
\[ Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} + \varepsilon_i \]
where εi is a random, unobserved “error”
Now assume that εi ∼ N (0, σ²)
This is equivalent to saying that
\[ Y_i \mid x_{i,1}, \ldots, x_{i,p} \;\sim\; N\!\left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j},\; \sigma^2 \right) \]
so each Yi is normally distributed with variance σ² and a
mean that depends on the values of the associated predictors
Connecting LS to ML (2)
Each Yi is independent
Given target data y the likelihood function can be written
\[ p(y \mid \beta_0, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2}{2\sigma^2} \right) \]
Noting $e^{-a} e^{-b} = e^{-a-b}$, this simplifies to
\[ \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2}{2\sigma^2} \right) \]
where we can see that the term in the numerator inside the exp(·) is the
residual sum-of-squares.
Connecting LS to ML (2) – Key Slide
Taking the negative-logarithm of this yields
\[ L(y \mid \beta_0, \beta, \sigma^2) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{\mathrm{RSS}(\beta_0, \beta)}{2\sigma^2} \]
As the value of σ² scales the RSS term, it is easy to see that
the values of β0 and β that minimise the negative
log-likelihood are the least-squares estimates β̂0 and β̂
LS estimates are same as the maximum likelihood estimates
assuming the random “errors” εi are normally distributed
Our residuals
ei = yi − ŷi
can be viewed as our estimates of the errors εi .
Connecting LS to ML (3)
How to estimate the error variance σ²?
The maximum likelihood estimate is:
\[ \hat{\sigma}^2_{\mathrm{ML}} = \frac{\mathrm{RSS}(\hat{\beta}_0, \hat{\beta})}{n} \]
but this tends to underestimate the actual variance.
A better estimate is the unbiased estimate
\[ \hat{\sigma}^2_{u} = \frac{\mathrm{RSS}(\hat{\beta}_0, \hat{\beta})}{n - p - 1} \]
where p is the number of predictors used to fit the model.
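A minimal sketch of both variance estimates, assuming we already have the residuals from a least-squares fit with p predictors:

import numpy as np

def variance_estimates(residuals, p):
    """ML and unbiased estimates of the error variance sigma^2."""
    rss = np.sum(residuals ** 2)
    n = len(residuals)
    sigma2_ml = rss / n              # maximum likelihood estimate (tends to be too small)
    sigma2_u = rss / (n - p - 1)     # unbiased estimate
    return sigma2_ml, sigma2_u

# Hypothetical residuals from a fit with p = 2 predictors
e = np.array([0.3, -0.5, 0.1, 0.4, -0.2, -0.1, 0.2, -0.2])
print(variance_estimates(e, p=2))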
Making predictions with a linear model
Given estimates β̂0 , β̂ we can make predictions about new data
To estimate the value of the target for some new predictor values
x′1 , x′2 , . . . , x′p :
\[ \hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x'_j \]
Using the normal model of residuals, we can also get a probability
distribution over future data:
\[ \hat{Y} \sim N\!\left( \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x'_j,\; \sigma^2 \right) \]
By changing the predictors we can see how the target changes
Example: seeing how weight and age affect blood pressure
Be careful using predictions outside of sensible predictor values!
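A minimal sketch of a point prediction and a normal predictive distribution for a new individual, using the weight-and-age coefficients quoted earlier; the error variance and the new individual's values are hypothetical, and estimation uncertainty in β̂ is ignored:

import numpy as np
from scipy import stats

# Coefficients from the fitted BP ~ Weight + Age model quoted earlier
beta0_hat = -16.57
beta_hat = np.array([1.03, 0.71])        # (Weight, Age) coefficients

sigma2 = 0.3                             # hypothetical error variance estimate
x_new = np.array([95.0, 50.0])           # hypothetical new individual: 95 kg, 50 years

y_hat = beta0_hat + beta_hat @ x_new     # point prediction of blood pressure
pred = stats.norm(loc=y_hat, scale=np.sqrt(sigma2))

print(y_hat)
print(pred.interval(0.95))               # central 95% predictive interval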
Outline
1 Linear Regression Models
Supervised Learning
Linear Regression Models
2 Model Selection for Linear Regression
Under and Overfitting
Model Selection Methods
Underfitting/Overfitting (1)
We often have many measured predictors
In our blood pressure example, we have weight, body surface
area, age, pulse rate and a measure of stress
Should we use them all, and if not, why not?
The R2 always improves as we include more predictors
⇒ so model always fits the data we have better
But prediction on new, unseen data might be worse
⇒ the ability to predict new, unseen data is called generalisation
Underfitting/Overfitting (2) – Key Slide
Risks of including/excluding predictors
Omitting important predictors
Called underfitting
Leads to systematic error, bias in predicting the target
Including spurious predictors
Called overfitting
Leads our model to “learn” noise and random variation
Poorer ability to predict new, unseen data from our
population
Underfitting/Overfitting Example (1)
Example: we observe x and y data and want to build a
prediction model for y using x
Data looks nonlinear so we use polynomial regression
We take x, x², x³, . . . , x²⁰ ⇒ very flexible model
How many terms to include?
For example, do we use
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \]
or
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + \varepsilon \]
or another model with some other number of polynomial
terms.
Underfitting/Overfitting Example (2)
Example dataset of 50 samples
Underfitting/Overfitting Example (3)
Use (x, x²), too simple – underfitting
Underfitting/Overfitting Example (4)
Use (x, x², . . . , x²⁰), too complex – overfitting
Underfitting/Overfitting Example (5)
(x, x², . . . , x⁶) seems “just right”. But how to find this model?
Using Hypothesis Testing – Key Slide
One approach is to use hypothesis testing
We know that a predictor j is unimportant if βj = 0
So we can test the hypothesis:
H0 : βj = 0
vs
HA : βj ≠ 0
which, in this setting, is a variant of the t-test (see Ross,
Chapter 9 and Studio 6)
Strengths: easy to apply, easy to understand
Weaknesses: difficult to directly compare two different models
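A minimal sketch of these coefficient t-tests computed by hand with NumPy/SciPy on hypothetical data; the second predictor is spurious, so its p-value should typically be large:

import numpy as np
from scipy import stats

# Hypothetical data: y depends on the first predictor but not the second
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])              # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
e = y - Xd @ beta_hat
sigma2_u = e @ e / (n - p - 1)                     # unbiased error variance estimate

se = np.sqrt(sigma2_u * np.diag(np.linalg.inv(Xd.T @ Xd)))
t_stat = beta_hat / se                             # t-statistics for H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stat), df=n - p - 1)
print(p_values)                                    # small p-value => evidence beta_j != 0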
Model Selection (1)
A different approach is through model selection
In the context of linear regression, we define a model by
specifying which predictors are included in the linear regression
For example, in our blood pressure example:
{Weight}
{Weight, Age}
{Age, Stress}
{Age, Stress, Pulse}
are some of the possible models we could build
Given a model, we can estimate the associated linear
regression coefficients using least-squares/maximum likelihood
The question then becomes how to choose a good model
Model Selection (2)
We use maximum likelihood to choose the parameters
Remember, this means we adjust the parameters of our
distribution until we find the ones that maximise the
probability of seeing the data y we have observed
Can we use this to select a model as well as parameters?
Assume normal distribution for our regression errors
The minimised negative log-likelihood (i.e., L(y | β0 , β, σ²)
evaluated at the ML estimates β̂0 , β̂, σ̂²ML ) is
\[ L(y \mid \hat{\beta}_0, \hat{\beta}, \hat{\sigma}^2_{\mathrm{ML}}) = \frac{n}{2} \log\!\left( 2\pi \, \mathrm{RSS}(\hat{\beta}_0, \hat{\beta})/n \right) + \frac{n}{2} \]
This always decreases as we add more predictors to our model
⇒ cannot be used to select models, only parameters
Model Selection (3) – Key Slide
Let M denote a model (set of predictors to use)
Let L(y | β̂0 , β̂, σ̂²ML , M) denote the minimised negative
log-likelihood for the model M
We can select a model by minimising an information criterion
\[ L(y \mid \hat{\beta}_0, \hat{\beta}, \hat{\sigma}^2_{\mathrm{ML}}, M) + \alpha(n, k_M) \]
where
α(·) is a model complexity penalty;
kM is the number of predictors in model M;
n is the size of our data sample.
This is a form of penalized likelihood estimation
⇒ a model is penalized by its complexity (ability to fit data)
Model Selection (4)
How to measure complexity, i.e., choose α(·)?
Akaike Information Criterion (AIC):
\[ \alpha(n, k_M) = k_M \]
Bayesian Information Criterion (BIC):
\[ \alpha(n, k_M) = \frac{k_M}{2} \log n \]
The AIC penalty is smaller than the BIC penalty ⇒ increased chance of overfitting
The BIC penalty is bigger than the AIC penalty ⇒ increased chance of underfitting
Differences in scores of 3 or more are considered significant
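A minimal sketch of scoring a fitted linear model with AIC and BIC under these definitions, given its RSS, the sample size n and the number of predictors k (the two RSS values below are the ones quoted for the blood pressure models):

import numpy as np

def neg_log_likelihood(rss, n):
    """Minimised negative log-likelihood under the normal error model."""
    return 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n

def aic_score(rss, n, k):
    return neg_log_likelihood(rss, n) + k

def bic_score(rss, n, k):
    return neg_log_likelihood(rss, n) + 0.5 * k * np.log(n)

# Weight-only model vs weight + age model for the blood pressure data (n = 20)
print(aic_score(54.52, 20, 1), bic_score(54.52, 20, 1))
print(aic_score(4.82, 20, 2), bic_score(4.82, 20, 2))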
Finding a Good Model (1)
Most obvious approach is to try all possible combinations of
predictors, and choose one that has smallest information
criterion score
Called the all subsets approach
If we have p predictors then we have 2ᵖ models to try
For p = 50, 2ᵖ ≈ 1.1 × 10¹⁵!
So this method is computationally intractable for moderate p
Finding a Good Model (2)
An alternative is to search through the model space
Forward selection algorithm (sketched in code below):
1 Start with the empty model;
2 Find the predictor that reduces the info criterion the most;
3 If no predictor improves the model, end;
4 Otherwise, add this predictor to the model;
5 Return to Step 2.
Backwards selection is a related algorithm
Start with the full model and remove predictors
These searches are computationally tractable for large p, but may miss
important predictors
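A minimal sketch of forward selection with the BIC penalty on hypothetical data; the helper functions and variable names here are illustrative, not from the unit materials:

import numpy as np

def fit_rss(X_subset, y):
    """Least-squares fit with an intercept; returns the RSS."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X_subset]) if X_subset.shape[1] > 0 else np.ones((n, 1))
    beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ beta_hat
    return e @ e

def bic(rss, n, k):
    """Negative log-likelihood plus the BIC complexity penalty."""
    return 0.5 * n * np.log(2 * np.pi * rss / n) + 0.5 * n + 0.5 * k * np.log(n)

def forward_selection(X, y):
    n, p = X.shape
    selected = []                                        # 1. start with the empty model
    best_score = bic(fit_rss(X[:, []], y), n, 0)
    while True:
        candidates = [j for j in range(p) if j not in selected]
        scores = {j: bic(fit_rss(X[:, selected + [j]], y), n, len(selected) + 1)
                  for j in candidates}                   # 2. score adding each remaining predictor
        if not scores or min(scores.values()) >= best_score:
            return selected                              # 3. stop if nothing improves the criterion
        best_j = min(scores, key=scores.get)
        selected.append(best_j)                          # 4. add the best predictor
        best_score = scores[best_j]                      # 5. and repeat

# Hypothetical example: only predictors 0 and 2 matter
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)
print(forward_selection(X, y))    # typically selects predictors 0 and 2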
Reading/Terms to Revise
Reading for this week: Chapter 9 of Ross.
Terms you should know:
Target, predictor, explanatory variable;
Intercept, coefficient;
R2 value;
Categorical predictors;
Polynomial regression;
Model, model selection;
Overfitting, underfitting
Information Criteria;
This week we looked at supervised learning for continuous
targets; next week we will examine supervised learning for
categorical targets (classification).