Week 2 Introduction To Linear Models - Revised - v1
Agenda
• Linear regression
• Linear classification and loss functions
• Gradient descent and stochastic gradient descent
• Overfitting and evaluation
• Introduction to regularization
Linear Regression
The discovery of Ceres
1801: Astronomer Piazzi discovered Ceres
He made 19 observations of its location before it was obscured by the sun.
Design decisions: which predictors f are possible? Hypothesis class: f(x) = w1 + w2·x

[Figure: two candidate lines, f(x) = 1 + 0.57x and f(x) = 2 + 0.2x, plotted against the observations]
Vector notation:
• weight vector w = [w1, w2]
• feature extractor φ(x) = [1, x], producing the feature vector
• f_w(x) = w · φ(x)

Example: w = [1, 0.57], φ(x) = [1, x], on the training data

x  y
1  1
2  3
4  3

[Figure: the fitted line, with the residual f_w(x) − y shown for each point]
TrainLoss(w) = (1/|D_train|) Σ_{(x,y)∈D_train} Loss(x, y, w)

TrainLoss([1, 0.57]) = 0.38
Visualizing the loss function

TrainLoss(w) = (1/|D_train|) Σ_{(x,y)∈D_train} (f_w(x) − y)²

Goal: min_w TrainLoss(w)
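To make the objective concrete, here is a minimal NumPy sketch (the helper names are our own) that evaluates the squared training loss on the three points read off the figure, (1, 1), (2, 3), (4, 3), recovering the 0.38 above:

import numpy as np

def phi(x):
    return np.array([1.0, x])          # feature vector [1, x]

def f(w, x):
    return np.dot(w, phi(x))           # f_w(x) = w · φ(x)

def train_loss(w, data):
    # TrainLoss(w) = mean of (f_w(x) - y)^2 over the training set
    return np.mean([(f(w, x) - y) ** 2 for x, y in data])

data = [(1, 1), (2, 3), (4, 3)]
print(train_loss(np.array([1.0, 0.57]), data))  # ≈ 0.38, matching the slide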
Gradient Descent
Optimization algorithm: how do we compute the best w?
Goal: min_w TrainLoss(w)
Definition: gradient — the gradient ∇_w TrainLoss(w) is the direction that increases the training loss the most.
Linear Classification

Training data D_train:

x1  x2  y
0   2   +1
−2  0   +1
1   −1  −1

Pipeline: training data → learning algorithm → classifier f; a new input such as [2, 0] is mapped by f to a label.

[Figure: the three training examples and a decision boundary in the (x1, x2) plane]
Design decisions:
• Which classifiers are possible? → hypothesis class
• How good is a classifier? → loss function
• How do we compute the best classifier? → optimization algorithm
An example linear classifier

φ(x) = [x1, x2]
f(x) = sign([−0.6, 0.6] · φ(x))

f([0, 2]) = sign([−0.6, 0.6] · [0, 2]) = sign(1.2) = 1
f([−2, 0]) = sign([−0.6, 0.6] · [−2, 0]) = sign(1.2) = 1
f([1, −1]) = sign([−0.6, 0.6] · [1, −1]) = sign(−1.2) = −1

[Figure: the decision boundary of f in the (x1, x2) plane]
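A short sketch of this classifier (assuming the sign(0) = +1 convention; the function name is our own), which reproduces the three evaluations above:

import numpy as np

w = np.array([-0.6, 0.6])

def classify(x):
    # sign of the score w · φ(x), with φ(x) = [x1, x2]
    return 1 if np.dot(w, x) >= 0 else -1

for x in ([0, 2], [-2, 0], [1, -1]):
    print(x, classify(np.array(x)))    # -> 1, 1, -1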
A linear model on the training data:

f_w(x) = w · φ(x), with w = [0.5, 1] and φ(x) = [x1, x2]

Training data D_train:

x1  x2  y
0   2   +1
−2  0   +1
1   −1  −1

[Figure: the training examples and the decision boundary of f_w in the (x1, x2) plane]
Target label: y
Definition: score — the score on an example (x, y) is w · φ(x), how confident we are in predicting +1.
Definition: margin — the margin on an example (x, y) is (w · φ(x)) y, how correct we are.

Zero-one loss: Loss0-1(x, y, w) = 1[(w · φ(x)) y ≤ 0]

[Figure: Loss0-1 as a function of the margin (w · φ(x)) y]
Optimization algorithm: how do we compute the best classifier?
Goal: min_w TrainLoss(w)
To run gradient descent, compute the gradient:

∇_w TrainLoss(w) = (1/|D_train|) Σ_{(x,y)∈D_train} ∇_w Loss0-1(x, y, w)

Problem: the gradient of Loss0-1 is zero almost everywhere!

[Figure: Loss0-1 is a step function of the margin (w · φ(x)) y, so its gradient carries no signal]
Hinge loss

Losshinge(x, y, w) = max{1 − (w · φ(x)) y, 0}

[Figure: Losshinge upper-bounds Loss0-1 as a function of the margin (w · φ(x)) y]
Logistic loss

Losslogistic(x, y, w) = log(1 + e^(−(w · φ(x)) y))

[Figure: Loss0-1, Losshinge, and Losslogistic as functions of the margin (w · φ(x)) y]

Intuition: the logistic loss tries to increase the margin even when it already exceeds 1.
Gradient of the hinge loss

∇_w Losshinge(x, y, w) = −φ(x) y if (w · φ(x)) y < 1, and 0 otherwise.

[Figure: Losshinge as a function of the margin; slope −1 for margins below 1, flat above]
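A minimal sketch of the hinge loss and its gradient as just defined (helper names are our own; at a margin of exactly 1 we return the zero subgradient):

import numpy as np

def hinge_loss(w, phi_x, y):
    # Losshinge = max{1 - (w · φ(x)) y, 0}
    return max(1 - np.dot(w, phi_x) * y, 0)

def hinge_gradient(w, phi_x, y):
    # -φ(x) y if the margin (w · φ(x)) y is below 1, else zero
    if np.dot(w, phi_x) * y < 1:
        return -phi_x * y
    return np.zeros_like(w)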
Hinge loss on the training data

f_w(x) = w · φ(x), with w = [0.5, 1] and φ(x) = [x1, x2]

Training data D_train:

x1  x2  y
0   2   +1
−2  0   +1
1   −1  −1

The margins are 2, −1, and 0.5, so the per-example hinge losses are 0, 2, and 0.5, giving TrainLoss(w) = 2.5/3 ≈ 0.83.

[Figure: the training examples and the decision boundary in the (x1, x2) plane]
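Applying the hinge loss to this training set with w = [0.5, 1] confirms the numbers above; a minimal NumPy check:

import numpy as np

w = np.array([0.5, 1.0])
D_train = [([0, 2], 1), ([-2, 0], 1), ([1, -1], -1)]

# per-example hinge losses max{1 - (w · φ(x)) y, 0}
losses = [max(1 - np.dot(w, x) * y, 0) for x, y in D_train]
print(losses, sum(losses) / len(losses))   # [0, 2.0, 0.5] -> TrainLoss ≈ 0.83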
The same recipe applies to both regression and classification; a runnable sketch of both loops follows the two boxes.

Gradient descent (GD):
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
  w ← w − η ∇_w TrainLoss(w)

Stochastic gradient descent (SGD):
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
  For (x, y) ∈ D_train:
    w ← w − η ∇_w Loss(x, y, w)
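A minimal Python sketch of both loops; the gradient functions and the step size η are placeholders to be supplied (e.g. plug in the hinge gradient above for classification):

import numpy as np

def gradient_descent(grad_train_loss, d, eta=0.1, T=100):
    w = np.zeros(d)
    for _ in range(T):
        w -= eta * grad_train_loss(w)          # one update per full pass
    return w

def stochastic_gradient_descent(grad_loss, data, d, eta=0.1, T=100):
    w = np.zeros(d)
    for _ in range(T):
        for x, y in data:
            w -= eta * grad_loss(w, x, y)      # one cheap update per example
    return w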
Step size η

[Diagram: η on a spectrum from 0 (conservative, more stable) to 1 (aggressive, faster)]

Strategies:
• Constant: e.g. η = 0.1
• Decreasing: e.g. η = 1/√(number of updates made so far)
GD vs. SGD

GD computes the exact gradient of the full training loss for each update; SGD makes a cheaper, noisier update per example, which is often much faster on large datasets. Both apply to classification and regression.
Overfitting

Overfitting – possible reasons

[Figure: error vs. model complexity; the in-sample error E_in decreases as complexity grows, while the out-of-sample error E_out eventually increases. In the high-variance regime, decreasing E_in leads to increasing E_out.]

• Overfitting leads to bad generalization
• A model can exhibit bad generalization even if it does not overfit
Overfitting vs. model complexity

Evaluation

Pipeline: D_train → learning algorithm → f. To estimate generalization, split the data into D_train and D_test, learn f on D_train, and report its error on D_test.
Approximation & estimation error

[Diagram: within the set of all predictors, the hypothesis class F contains g (the best predictor in F) and f̂ (the predictor found by learning); f∗ is the best predictor overall. Approximation error: distance from f∗ to g. Estimation error: distance from g to f̂.]
Regularized gradient descent (L2 penalty):
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
  w ← w − η (∇_w TrainLoss(w) + λw)

Early stopping (fewer iterations T):
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
  w ← w − η ∇_w TrainLoss(w)

Intuition: if we make fewer updates, then ‖w‖ can't get too big.
Lesson: try to minimize the training error, but don't try too hard.
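A sketch of the regularized update, reusing the placeholder gradient function from the GD sketch above; the only change is the added λw term, which shrinks w toward zero on every step:

import numpy as np

def regularized_gd(grad_train_loss, d, eta=0.1, lam=0.01, T=100):
    w = np.zeros(d)
    for _ in range(T):
        w -= eta * (grad_train_loss(w) + lam * w)   # extra λw term
    return w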
Regularization and bias-variance
The effects of regularization can be observed in the bias and variance terms:
• Regularization trades a small increase in bias for a considerable decrease in the variance of the model
• Regularization strives for a smoother hypothesis, reducing the opportunities to overfit
• The amount of regularization λ has to be chosen specifically for each type of regularizer
[Figure: error vs. model complexity, with the ideal range for model complexity between underfitting and overfitting]
Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Overfitting can also be addressed by eliminating features (either manually or via model selection)
L2 Regularization
• Regularized linear regression objective function:

min_w TrainLoss(w) + (λ/2) ‖w‖²
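A minimal NumPy sketch of this objective, with the feature vectors φ(x) stacked row-wise in X; the closed-form ridge solution is included as an aside (it is not on the slides, and it minimizes the sum-of-squares variant, so λ rescales accordingly):

import numpy as np

def ridge_objective(w, X, y, lam):
    # mean squared error plus the L2 penalty (λ/2) ‖w‖²
    return np.mean((X @ w - y) ** 2) + (lam / 2) * np.dot(w, w)

def ridge_closed_form(X, y, lam):
    # w = (XᵀX + λI)⁻¹ Xᵀy minimizes ½‖Xw − y‖² + (λ/2)‖w‖²
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)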
Credits: Percy Liang, Dorsa Sadigh, Mirko Mazzoleni, Ryan P. Adams

Thank You
mbzuai.ac.ae