Week 2 Introduction To Linear Models - Revised - v1

The document discusses key concepts in linear regression and logistic regression models: 1) It introduces linear regression, gradient descent, and regularization techniques for modeling relationships between variables. 2) Logistic regression is covered for classification problems, including choosing a hypothesis class of linear classifiers and optimizing classifiers using a loss function and gradient descent. 3) The concepts are applied to examples of modeling the orbit of Ceres using linear regression and classifying examples using a linear classifier with a decision boundary.

Uploaded by

Ahmad Hammoudeh
Copyright: © All Rights Reserved

AI701: Foundations of Artificial Intelligence

Week 2
Agenda

• Introduction to linear regression

• Logistic regression fundamentals

• Gradient descent (batch/stochastic)

• Underfitting and overfitting

• Introduction to regularization
Linear Regression
The discovery of Ceres
1801: Astronomer Piazzi discovered Ceres
Made 19 observations of location before it was obscured by the sun

Time                  Right ascension   Declination
Jan 01, 20:43:17.8    50.91             15.24
Jan 02, 20:39:04.6    50.84             15.30
...                   ...               ...
Feb 11, 18:11:58.2    53.51             18.43

Where and when will it be observed again?


Gauss's triumph
September 1801: Gauss took Piazzi’s data and created a model of Ceres’s orbit, then used it to make a prediction.

December 7, 1801: Ceres was located within 1/2 degree of Gauss’s prediction, much more accurate than the predictions of other astronomers.

Method: least squares linear regression
Linear regression framework

Design decisions:

Which predictors are possible? Hypothesis class
How good is a predictor? Loss function
How do we compute the best predictor? Optimization algorithm
Hypothesis class: which predictors?
[Figure: example linear predictors plotted as y against x, for x from 0 to 5]

Example predictors:
f(x) = 1 + 0.57x
f(x) = 2 + 0.2x

General form:
f(x) = w1 + w2 x

Vector notation:
weight vector: w = [w1, w2]
feature extractor: φ(x) = [1, x] (its output is the feature vector)

f_w(x) = w · φ(x)   (score)

f_w(3) = [1, 0.57] · [1, 3] = 2.71

Hypothesis class:
F = { f_w : w ∈ R^2 }
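
As a minimal sketch of the vector notation above, the score is just a dot product between the weight vector and the feature vector. The helper names `phi`, `dot`, and `predict` are illustrative choices, not names from the slides.

```python
def phi(x):
    """Feature extractor from the slide: phi(x) = [1, x]."""
    return [1.0, float(x)]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(w, x):
    """Score f_w(x) = w . phi(x)."""
    return dot(w, phi(x))


w = [1.0, 0.57]
score = predict(w, 3)
print(round(score, 2))  # 2.71
```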
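Loss function: how good is a predictor?
Predictor: f_w(x) = w · φ(x), with w = [1, 0.57] and φ(x) = [1, x]

Training data Dtrain:
x    y
1    1
2    3
4    3

[Figure: the training points and the line f_w plotted as y against x; the residual is the vertical gap between a point and the line]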

Loss(x, y, w) = (f_w(x) − y)^2   (squared loss)

Loss(1, 1, [1, 0.57]) = ([1, 0.57] · [1, 1] − 1)^2 = 0.32
Loss(2, 3, [1, 0.57]) = ([1, 0.57] · [1, 2] − 3)^2 = 0.74
Loss(4, 3, [1, 0.57]) = ([1, 0.57] · [1, 4] − 3)^2 = 0.08

TrainLoss(w) = (1 / |Dtrain|) Σ_{(x,y)∈Dtrain} Loss(x, y, w)
TrainLoss([1, 0.57]) = 0.38
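
The arithmetic above can be checked with a short sketch. It assumes the three training points and the weight vector w = [1, 0.57] from this slide; the function names are illustrative.

```python
# Training data from the slide: (x, y) pairs.
train_data = [(1, 1), (2, 3), (4, 3)]


def squared_loss(x, y, w):
    """Squared loss (f_w(x) - y)^2 with phi(x) = [1, x]."""
    residual = w[0] * 1.0 + w[1] * x - y
    return residual ** 2


def train_loss(w, data):
    """Average of the per-example losses over the training set."""
    return sum(squared_loss(x, y, w) for x, y in data) / len(data)


w = [1.0, 0.57]
losses = [squared_loss(x, y, w) for x, y in train_data]
print([round(loss, 2) for loss in losses])   # [0.32, 0.74, 0.08]
print(round(train_loss(w, train_data), 2))   # 0.38
```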
Visualizing Loss Function
TrainLoss(w) = (1 / |Dtrain|) Σ_{(x,y)∈Dtrain} (f_w(x) − y)^2

Goal: min_w TrainLoss(w)

[Figure: TrainLoss plotted as a function of w]
Gradient Descent
Optimization algorithm: how to compute best?
Goal: min_w TrainLoss(w)

Definition: gradient

The gradient ∇_w TrainLoss(w) is the direction that increases the training loss the most.
Computing the gradient
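Objective function:
TrainLoss(w) = (1 / |Dtrain|) Σ_{(x,y)∈Dtrain} (w · φ(x) − y)^2

Gradient (use chain rule):
∇_w TrainLoss(w) = (1 / |Dtrain|) Σ_{(x,y)∈Dtrain} 2 (w · φ(x) − y) φ(x)
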
Gradient descent example
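
A minimal sketch of gradient descent on the squared-loss training objective, using the three training points from the loss-function slide. The step size 0.1 and the 1000 iterations are assumptions chosen for this illustration, not values from the slides.

```python
# Training data from the squared-loss slide: (x, y) pairs.
train_data = [(1, 1), (2, 3), (4, 3)]


def phi(x):
    """Feature extractor phi(x) = [1, x]."""
    return [1.0, float(x)]


def gradient(w, data):
    """Gradient of the mean squared loss: average of 2 * (w . phi(x) - y) * phi(x)."""
    grad = [0.0, 0.0]
    for x, y in data:
        features = phi(x)
        residual = sum(wi * fi for wi, fi in zip(w, features)) - y
        for j, fj in enumerate(features):
            grad[j] += 2 * residual * fj / len(data)
    return grad


def gradient_descent(data, eta=0.1, num_steps=1000):
    """Start at w = [0, 0] and repeatedly step against the gradient."""
    w = [0.0, 0.0]
    for _ in range(num_steps):
        grad = gradient(w, data)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w


w = gradient_descent(train_data)
print([round(wi, 2) for wi in w])  # approximately [1.0, 0.57]
```

The result is close to the weight vector [1, 0.57] used in the earlier loss calculations, which is the least squares fit for these three points.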
Linear Classification
Linear classification framework
training data → learning algorithm → classifier f
input x = [2, 0] → f → label −1

Training data:
x1    x2    y
0     2     1
-2    0     1
1     -1    -1

[Figure: the training examples and the decision boundary plotted in the (x1, x2) plane]

Design decisions:
Which classifiers are possible? Hypothesis class
How good is a classifier? Loss function
How do we compute the best classifier? Optimization algorithm
An example linear classifier
[Figure: an example linear classifier and its decision boundary in the (x1, x2) plane]

f ([0, 2]) = sign([−0.6, 0.6] ·[0, 2]) = sign(1.2) = 1
f ([−2, 0]) = sign([−0.6, 0.6] ·[−2, 0]) = sign(1.2) = 1
f ([1, −1]) = sign([−0.6, 0.6] ·[1, −1]) = sign(−1.2) = −1

Decision boundary: x such that w ·φ(x) = 0
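
The three predictions above can be reproduced with a short sketch. It assumes φ(x) = [x1, x2] and w = [−0.6, 0.6] as on this slide; the convention that `sign(0)` returns 0 (a point exactly on the decision boundary) is an assumption of this example.

```python
def sign(z):
    """Return +1 for positive scores, -1 for negative scores, 0 on the boundary."""
    if z > 0:
        return 1
    if z < 0:
        return -1
    return 0


def classify(w, x):
    """Linear classifier f_w(x) = sign(w . phi(x)) with phi(x) = [x1, x2]."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return sign(score)


w = [-0.6, 0.6]
inputs = [[0, 2], [-2, 0], [1, -1]]
predictions = [classify(w, x) for x in inputs]
print(predictions)  # [1, 1, -1]
```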


Hypothesis class: which classifiers?
φ(x) = [x1, x2]

Example classifiers:
f(x) = sign([−0.6, 0.6] · φ(x))
f(x) = sign([0.5, 1] · φ(x))

[Figure: decision boundaries of the two classifiers in the (x1, x2) plane]

General binary classifier:
f_w(x) = sign(w · φ(x))

Hypothesis class:
F = { f_w : w ∈ R^2 }
Loss function: how good is a classifier?
Classifier: f_w(x) = sign(w · φ(x)), with w = [0.5, 1] and φ(x) = [x1, x2]

Training data Dtrain:
x1    x2    y
0     2     1
-2    0     1
1     -1    -1

[Figure: the training examples and the decision boundary of f_w in the (x1, x2) plane]

Loss0-1(x, y, w) = 1[f_w(x) ≠ y]   (zero-one loss)

Loss0-1([0, 2], 1, [0.5, 1]) = 1[sign([0.5, 1] · [0, 2]) ≠ 1] = 0
Loss0-1([−2, 0], 1, [0.5, 1]) = 1[sign([0.5, 1] · [−2, 0]) ≠ 1] = 1
Loss0-1([1, −1], −1, [0.5, 1]) = 1[sign([0.5, 1] · [1, −1]) ≠ −1] = 0

TrainLoss([0.5, 1]) = 0.33
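
A short sketch that reproduces the zero-one losses above. It assumes the three training examples and w = [0.5, 1] from this slide; the function names are illustrative.

```python
# Training data from the slide: ([x1, x2], y) pairs.
train_data = [([0, 2], 1), ([-2, 0], 1), ([1, -1], -1)]


def predict(w, x):
    """Predicted label sign(w . phi(x)) with phi(x) = [x1, x2]."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


def zero_one_loss(x, y, w):
    """1 if the predicted label differs from the target label, else 0."""
    return 1 if predict(w, x) != y else 0


w = [0.5, 1]
losses = [zero_one_loss(x, y, w) for x, y in train_data]
train_loss = sum(losses) / len(losses)
print(losses, round(train_loss, 2))  # [0, 1, 0] 0.33
```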


Score and margin
Predicted label: f_w(x) = sign(w · φ(x))
Target label: y

[Figure: training examples and decision boundary in the (x1, x2) plane]

Definition: score

The score on an example (x, y) is w · φ(x), how confident we are in predicting +1.

Definition: margin

The margin on an example (x, y) is (w · φ(x))y, how correct we are.


Zero-one loss rewritten
Loss0-1(x, y, w) = 1[(w · φ(x))y ≤ 0]

[Figure: zero-one loss plotted against the margin (w · φ(x))y, for margins from −3 to 3]
Optimization algorithm: how to compute best?
Goal: min_w TrainLoss(w)
To run gradient descent, compute the gradient:
∇_w TrainLoss(w) = (1 / |Dtrain|) Σ_{(x,y)∈Dtrain} ∇_w Loss0-1(x, y, w)

∇_w Loss0-1(x, y, w) = ∇_w 1[(w · φ(x))y ≤ 0]

Gradient is zero almost everywhere, so gradient descent gets no signal about which way to move w.

[Figure: zero-one loss plotted against the margin (w · φ(x))y]
Hinge loss
Losshinge(x, y, w) = max{1 − (w · φ(x))y, 0}

[Figure: Loss0-1 and Losshinge plotted against the margin (w · φ(x))y]


Another: logistic regression
Losslogistic(x, y, w) = log(1 + e^(−(w · φ(x))y))

[Figure: Loss0-1, Losshinge, and Losslogistic plotted against the margin (w · φ(x))y]

Intuition: try to increase the margin even when it already exceeds 1.
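
To compare the three losses, the sketch below writes each one as a function of the margin (w · φ(x))y. The list of margins is an arbitrary choice for illustration. At a margin of 2 the hinge loss is exactly zero while the logistic loss is still positive, which is why logistic regression keeps pushing the margin up.

```python
import math


def zero_one_loss(margin):
    """1 when the margin is non-positive (a mistake), else 0."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """max{1 - margin, 0}: zero once the margin reaches 1."""
    return max(1.0 - margin, 0.0)


def logistic_loss(margin):
    """log(1 + exp(-margin)): positive for every finite margin."""
    return math.log(1.0 + math.exp(-margin))


for margin in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(margin, zero_one_loss(margin), hinge_loss(margin),
          round(logistic_loss(margin), 3))
```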
Gradient of the hinge loss
[Figure: Losshinge plotted against the margin (w · φ(x))y]

∇_w Losshinge(x, y, w) = −φ(x)y if (w · φ(x))y < 1, and 0 otherwise
Hinge loss on training data
Classifier: f_w(x) = sign(w · φ(x)), with w = [0.5, 1] and φ(x) = [x1, x2]

Training data Dtrain:
x1    x2    y
0     2     1
-2    0     1
1     -1    -1

[Figure: the training examples and the decision boundary of f_w in the (x1, x2) plane]

Losshinge(x, y, w) = max{1 − (w ·φ(x))y, 0}


Loss([0, 2], 1, [0.5, 1]) = max{1 − [0.5, 1] · [0, 2](1), 0} = 0
∇Loss([0, 2], 1, [0.5, 1]) = [0, 0]

Loss([−2, 0], 1, [0.5, 1]) = max{1 − [0.5, 1] · [−2, 0](1), 0} = 2
∇Loss([−2, 0], 1, [0.5, 1]) = [2, 0]

Loss([1, −1], −1, [0.5, 1]) = max{1 − [0.5, 1] · [1, −1](−1), 0} = 0.5
∇Loss([1, −1], −1, [0.5, 1]) = [1, −1]

TrainLoss([0.5, 1]) = 0.83
∇TrainLoss([0.5, 1]) = [1, −0.33]
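
The hinge losses and gradients above can be reproduced with the sketch below. It assumes the (sub)gradient convention −φ(x)y when the margin is below 1 and the zero vector otherwise, which matches the values listed on this slide; the function names are illustrative.

```python
# Training data from the slide: ([x1, x2], y) pairs.
train_data = [([0, 2], 1), ([-2, 0], 1), ([1, -1], -1)]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def hinge_loss(x, y, w):
    """max{1 - (w . phi(x)) y, 0} with phi(x) = [x1, x2]."""
    return max(1 - dot(w, x) * y, 0)


def hinge_gradient(x, y, w):
    """Assumed convention: -phi(x) * y if the margin is below 1, else zeros."""
    if dot(w, x) * y < 1:
        return [-xi * y for xi in x]
    return [0 for _ in x]


w = [0.5, 1]
losses = [hinge_loss(x, y, w) for x, y in train_data]
grads = [hinge_gradient(x, y, w) for x, y in train_data]
train_loss = sum(losses) / len(losses)
train_grad = [sum(g[j] for g in grads) / len(grads) for j in range(len(w))]

print(losses)                               # [0, 2.0, 0.5]
print(grads)                                # [[0, 0], [2, 0], [1, -1]]
print(round(train_loss, 2))                 # 0.83
print([round(g, 2) for g in train_grad])    # [1.0, -0.33]
```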
Regression vs Classification

                      Regression              Classification

Prediction f_w(x)     score                   sign(score)

Relate to target y    residual (score − y)    margin (score · y)

Loss functions        squared,                zero-one,
                      absolute deviation      hinge,
                                              logistic

Algorithm             gradient descent        gradient descent
Stochastic Gradient Descent
Gradient descent is slow

Algorithm: gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
    w ← w − η ∇_w TrainLoss(w)

Problem: each iteration requires going over all training examples.

Expensive when we have lots of data!


Stochastic gradient descent

Algorithm: stochastic gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
    For (x, y) ∈ Dtrain:
        w ← w − η ∇_w Loss(x, y, w)
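
A minimal sketch of the stochastic gradient descent loop above, applied to the squared loss on the three regression points from earlier in the deck. The step size 0.01, the 1000 passes, and the fixed visiting order are assumptions of this example; in practice the examples are often shuffled on each pass.

```python
# Training data from the squared-loss slide: (x, y) pairs.
train_data = [(1, 1), (2, 3), (4, 3)]


def phi(x):
    """Feature extractor phi(x) = [1, x]."""
    return [1.0, float(x)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def squared_loss_gradient(x, y, w):
    """Gradient of (w . phi(x) - y)^2 for a single example."""
    features = phi(x)
    residual = dot(w, features) - y
    return [2 * residual * fj for fj in features]


def train_loss(w, data):
    return sum((dot(w, phi(x)) - y) ** 2 for x, y in data) / len(data)


def stochastic_gradient_descent(data, eta=0.01, num_epochs=1000):
    """Update w after every single example instead of after a full pass."""
    w = [0.0, 0.0]
    for _ in range(num_epochs):
        for x, y in data:
            grad = squared_loss_gradient(x, y, w)
            w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w


initial_loss = train_loss([0.0, 0.0], train_data)
w = stochastic_gradient_descent(train_data)
final_loss = train_loss(w, train_data)
print(round(initial_loss, 2), round(final_loss, 2))
```

With a constant step size the weights settle near the least squares solution rather than exactly on it, because each update follows the gradient of only one example.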
Step size

Question: what should η be?

[Scale: η from 0 to 1. Small η is conservative and more stable; large η is aggressive and faster.]

Strategies:
• Constant: η = 0.1
• Decreasing:
GD vs SGD

[Figure: gradient descent compared with stochastic gradient descent]

Key idea: stochastic updates

It’s not about quality, it’s about quantity.


Overfitting and Regularization
Minimizing the training loss
Hypothesis class:
f_w(x) = w · φ(x)
Training objective (loss function):
TrainLoss(w) = (1 / |Dtrain|) Σ_{(x,y)∈Dtrain} Loss(x, y, w)

Optimization algorithm:
stochastic gradient descent

Is the training loss a good objective to optimize?


Rote Learning
Algorithm: rote learning

Training: just store Dtrain.

Predictor f(x):
    If (x, y) ∈ Dtrain: return y.
    Else: segfault.

Minimizes the objective perfectly (zero), but clearly bad...
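
A small sketch of rote learning makes the problem concrete. The class name `RoteLearner` is hypothetical, and since Python does not segfault on a failed lookup, this example raises a `KeyError` for unseen inputs instead.

```python
class RoteLearner:
    """Memorizes the training set and can only answer for inputs it has seen."""

    def fit(self, data):
        # Training: just store Dtrain as a lookup table from x to y.
        self.memory = {x: y for x, y in data}
        return self

    def predict(self, x):
        if x in self.memory:
            return self.memory[x]
        # Stand-in for the slide's "segfault": no answer for unseen inputs.
        raise KeyError(f"input {x!r} was not in the training data")


train_data = [(1, 1), (2, 3), (4, 3)]
model = RoteLearner().fit(train_data)

# Zero training error: every stored example is reproduced exactly.
train_errors = sum(1 for x, y in train_data if model.predict(x) != y)
print(train_errors)  # 0
```

Calling `model.predict(3)` raises `KeyError`, since x = 3 never appeared in the training data: the training loss is zero, yet the predictor says nothing about new inputs.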


Overfitting scenarios

[Figure: overfitting examples for classification and for regression]
Overfitting
Overfitting – Possible reasons

• Too little training data

• Noise in the data

• The hypothesis space is too large

• The input space is high-dimensional


Overfitting vs. model complexity
• We talk of overfitting when decreasing E_in leads to increasing E_out

• Major source of failure for machine learning systems

• Overfitting leads to bad generalization

• A model can exhibit bad generalization even if it does not overfit

[Figure: error plotted against model complexity (low to high), showing the in-sample error E_in and the out-of-sample error E_out. Low complexity is labeled high bias, low variance; high complexity is labeled low bias, high variance; an overfitting region is marked.]
Overfitting vs. model complexity
Evaluation
Dtrain → learning algorithm → f

How good is the predictor f ?

Key idea: the real learning objective

Our goal is to minimize error on unseen future examples.

Don’t have unseen examples; next best thing:

Definition: test set

Test set Dtest contains examples not used for training.


Generalization
When will a learning algorithm generalize well?

[Figure: Dtrain and Dtest]
Approximation & estimation error
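[Figure: the set of all predictors, containing the hypothesis class F. The predictors f*, g, and f̂ are marked; the gap between f* and g is labeled approximation error, and the gap between g and f̂, reached by learning, is labeled estimation error.]

• Approximation error: how good is the hypothesis class?

• Estimation error: how good is the learned predictor relative to the potential of the hypothesis class?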
Effect of hypothesis class
[Figure: all predictors, the hypothesis class F, and the predictors f*, g, and f̂, with approximation and estimation error marked]

As the hypothesis class size increases...

Approximation error decreases because:
taking min over larger set

Estimation error increases because:
harder to estimate something more complex

How do we control the hypothesis class size?


Cure 1: Dimensionality
w ∈ R^d

Reduce the dimensionality d (number of features):


Controlling the dimensionality
Manual feature (template) selection:
• Add feature templates if they help
• Remove feature templates if they don’t help

Automatic feature selection (beyond the scope of this class):
• Forward selection
• Boosting
• L1 regularization

It’s the number of features that matters


Cure 2: Norm
Controlling the Norm
Regularized objective:
min_w TrainLoss(w) + (λ/2) ‖w‖^2

Algorithm: gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
    w ← w − η (∇_w TrainLoss(w) + λw)

Same as gradient descent, except shrink the weights towards zero by λ.
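
The sketch below runs the regularized update on the three regression points from earlier in the deck and compares the result with the unregularized fit. The values λ = 1, η = 0.1, and 1000 steps are assumptions chosen for illustration.

```python
import math

# Training data from the squared-loss slide: (x, y) pairs.
train_data = [(1, 1), (2, 3), (4, 3)]


def train_loss_gradient(w, data):
    """Gradient of the mean squared loss with phi(x) = [1, x]."""
    grad = [0.0, 0.0]
    for x, y in data:
        features = [1.0, float(x)]
        residual = sum(wi * fi for wi, fi in zip(w, features)) - y
        for j, fj in enumerate(features):
            grad[j] += 2 * residual * fj / len(data)
    return grad


def gradient_descent(data, lam, eta=0.1, num_steps=1000):
    """w <- w - eta * (grad TrainLoss(w) + lam * w), starting from zeros."""
    w = [0.0, 0.0]
    for _ in range(num_steps):
        grad = train_loss_gradient(w, data)
        w = [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad)]
    return w


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


w_plain = gradient_descent(train_data, lam=0.0)
w_reg = gradient_descent(train_data, lam=1.0)
print([round(wi, 3) for wi in w_plain], round(norm(w_plain), 3))
print([round(wi, 3) for wi in w_reg], round(norm(w_reg), 3))
```

The regularized weight vector has a smaller norm than the unregularized one, at the cost of a slightly worse fit to the training data.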


Controlling the Norm: Early Stopping
Algorithm: gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
    w ← w − η ∇_w TrainLoss(w)

Idea: simply make T smaller

Intuition: if we have fewer updates, then ‖w‖ can’t get too big.

Lesson: try to minimize the training error, but don’t try too hard.
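
As an illustration of the early stopping intuition, the sketch below runs plain gradient descent from w = [0, 0] for different numbers of steps T and reports ‖w‖. It reuses the three regression points from earlier in the deck and assumes a small step size η = 0.01; with a large step size the iterates can overshoot, and the norm need not grow steadily.

```python
import math

# Training data from the squared-loss slide: (x, y) pairs.
train_data = [(1, 1), (2, 3), (4, 3)]


def train_loss_gradient(w, data):
    """Gradient of the mean squared loss with phi(x) = [1, x]."""
    grad = [0.0, 0.0]
    for x, y in data:
        features = [1.0, float(x)]
        residual = sum(wi * fi for wi, fi in zip(w, features)) - y
        for j, fj in enumerate(features):
            grad[j] += 2 * residual * fj / len(data)
    return grad


def weight_norm_after(num_steps, data, eta=0.01):
    """Run num_steps of gradient descent from zeros and return ||w||."""
    w = [0.0, 0.0]
    for _ in range(num_steps):
        grad = train_loss_gradient(w, data)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return math.sqrt(sum(wi * wi for wi in w))


step_counts = [1, 10, 100, 1000]
norms = [weight_norm_after(T, train_data) for T in step_counts]
for T, n in zip(step_counts, norms):
    print(T, round(n, 3))
```

On this data the norm increases with T, so stopping earlier leaves the weights closer to zero.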
Regularization and bias-variance
The effects of regularization can be observed in the bias and variance terms:

• Regularization trades some added bias for a considerable decrease in the variance of the model

• Regularization strives for smoother hypotheses, thus reducing the opportunities to overfit

• The amount of regularization λ has to be chosen specifically for each type of regularizer

• Usually λ is chosen by cross-validation


How overfitting affects predictions

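[Figure: predictive error plotted against model complexity, showing error on training data and error on test data. Underfitting is marked at low complexity and overfitting at high complexity, with an ideal range for model complexity in between.]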
Regularization
• A method for automatically controlling the complexity of the learned hypothesis

• Idea: penalize large values of θ_j
  o Incorporate the penalty into the cost function
  o Works well when we have a lot of features, each of which contributes a bit to predicting the label

• Can also address overfitting by eliminating features (either manually or via model selection)
L2 Regularization
• Regularized linear regression objective function:

[Equation: objective = model fit to data + regularization]

  o λ is the regularization parameter (λ ≥ 0)
  o No regularization on θ_0!
Slide Credits

Percy Liang
Dorsa Sadigh
Mirko Mazzoleni
Ryan P. Adams
Thank You

Mohamed bin Zayed University of Artificial Intelligence
Masdar City
Abu Dhabi
United Arab Emirates

mbzuai.ac.ae
