Regularization Techniques in Machine Learning

The document discusses supervised learning and regularization techniques, focusing on optimization, gradient descent, and the concepts of underfitting and overfitting. It highlights various regularization methods such as L1 and L2 penalties, dataset augmentation, and early stopping to improve model generalization and combat overfitting. Additionally, it provides insights into the bias-variance trade-off and the importance of managing model complexity to enhance performance on unseen data.


Supervised Learning

Regularization Techniques
LEARNING OBJECTIVES

At the end of this session, you will be able to understand:

Some Basic Concepts
• What is Optimization?
• Generalization
• Underfitting-Overfitting
• Bias-Variance

Optimization
• Gradient Descent
• Stochastic Gradient Descent
• Parameter Initialization
• Adagrad
• RMSProp
• Batch Normalization

Regularization
• Parameter Norm-Penalties (L1-norm and L2-norm)
• Dataset Augmentation
• Early Stopping
• Bagging and other ensemble methods
• Dropout
Data Management for Training and Evaluation

[Figure: the complete dataset is split into a training set, a testing set, and an optional validation set. The training set is processed in batches (Batch 1 … Batch M); one full pass over all batches is an epoch, and training runs for several epochs (Epoch 1 … Epoch N), with validation and testing performed between them.]
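The split-and-iterate scheme above can be sketched in code. This is an illustrative sketch, not taken from the slides; the split fractions and batch size are assumed values:

```python
import numpy as np

def split_dataset(X, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle and split the complete dataset into train / validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    n_val = int(val_frac * len(X))
    return (X[idx[:n_train]],
            X[idx[n_train:n_train + n_val]],
            X[idx[n_train + n_val:]])

X = np.arange(100).reshape(100, 1)
train, val, test = split_dataset(X)

batch_size = 16
n_epochs = 3
for epoch in range(n_epochs):                       # Epoch 1 ... Epoch N
    for start in range(0, len(train), batch_size):  # Batch 1 ... Batch M
        batch = train[start:start + batch_size]
        # ... one training update on `batch` would go here ...
    # validation after each epoch; testing once at the very end
```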
Generalization
• The ability of a trained model to perform well on test data is known as its generalization ability. Two kinds of problems can afflict machine learning models in general:
- Even after the model has been fully trained so that its training error is small, it exhibits a high test error rate. This is known as the problem of overfitting.
- The training error fails to come down in spite of several epochs of training. This is known as the problem of underfitting.
Recipe for Learning

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Recipe for Learning

Don't forget overfitting! The recipe branches into two strategies: modifying the network with a better optimization strategy, and preventing overfitting.

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Underfitting-Overfitting
• Overfitting becomes especially problematic as you make your model increasingly complex.
• Underfitting is a related issue, where your model is not complex enough to capture the underlying trend in the data.
• The problem of overfitting is not limited to computers; humans are often no better.
Underfitting-Overfitting

Source: Quora: Luis Argerich

What is Regularization?
How to Combat Overfitting?
• Two ways to combat overfitting:
1. Use more training data. The more you have, the harder it is to overfit the data by learning too much from any single training example.
2. Use regularization. Add a penalty term to the loss function that discourages building an overly complex model.
How to Combat Overfitting?
• The first piece of the sum is our normal cost function.
• The second piece is a regularization term that adds a penalty for large beta coefficients.
• With these two elements in place, the cost function now balances between two priorities: explaining the training data and preventing that explanation from becoming overly specific.
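The sum itself is not reproduced on the slide; a plausible reconstruction for a linear model, with coefficients β and penalty weight λ (both symbol names assumed here), is:

```latex
J(\beta) \;=\;
\underbrace{\sum_{i=1}^{n} \bigl(y_i - \mathbf{x}_i^{\top}\beta\bigr)^2}_{\text{normal cost function}}
\;+\;
\underbrace{\lambda \sum_{j=1}^{p} \beta_j^{2}}_{\text{penalty on large }\beta\text{ coefficients}}
```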
Regularization in Machine Learning

Illustrates the relationship between model capacity and the concepts of underfitting and overfitting by plotting the training and test errors as a function of model capacity. When the capacity is low, both the training and test errors are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.
Regularization in Machine Learning
• How to make an algorithm/model perform well not just on the
training data, but also on new inputs?

• Many strategies are designed explicitly to reduce the test error,


possibly at the expense of increased training error. These
strategies are known collectively as regularization.

Regularization in Machine Learning
• The following factors determine how well a model is able to
generalize from the training dataset to the test dataset:
- The model capacity and its relation to data complexity:
- In general if the model capacity is less than the data complexity
then it leads to underfitting,
- while if the converse is true, then it can lead to overfitting. Hence,
the objective is to choose a model whose capacity matches the
complexity of the training and test datasets.
- Even if the model capacity and the data complexity are well matched, we can still encounter overfitting due to an insufficient number of training samples.
How to Regularize?
• Put extra constraints on the model, e.g., add restrictions on the parameter values.
• Add extra terms (penalties) to the objective function, indirectly constraining the parameter values.
Regularization Techniques
• Parameter Norm-Penalties. (L1-norm and L2-norm)
• Dataset Augmentation
• Early Stopping
• Bagging and other ensemble methods
• Dropout

Parameter Norm-Penalties
Limit the capacity of models such as neural networks, linear regression, or logistic regression by adding a parameter norm penalty Ω(θ) to the objective function J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term Ω relative to the standard objective function J.

Setting α to zero results in no regularization. Larger values of α correspond to more regularization.

We use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.
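As a concrete illustration (a sketch, not from the slides), the penalized objective J̃ = J + αΩ(w) for a linear model with a mean-squared-error J and a squared L2 norm penalty can be written as:

```python
import numpy as np

def mse_loss(w, X, y):
    """Standard objective J(w; X, y): mean squared error of a linear model."""
    return np.mean((X @ w - y) ** 2)

def l2_penalty(w):
    """Norm penalty Omega(w) = 1/2 * ||w||_2^2."""
    return 0.5 * np.sum(w ** 2)

def regularized_loss(w, X, y, alpha):
    """Total objective J_tilde(w; X, y) = J(w; X, y) + alpha * Omega(w)."""
    return mse_loss(w, X, y) + alpha * l2_penalty(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = np.zeros(3)

# alpha = 0 gives no regularization; larger alpha adds more.
print(regularized_loss(w, X, y, alpha=0.0))
print(regularized_loss(w, X, y, alpha=0.1))
```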
L2-Regularization
• The L2 parameter norm penalty is commonly known as weight decay.
• L2-regularization is also known as ridge regression or Tikhonov regularization.
• L2 regularization drives the weights closer to the origin by adding a regularization term Ω(θ) = (1/2)‖w‖₂² to the objective function, where ‖·‖₂ is the 2-norm (also known as the L2 norm or Euclidean norm).
L2-Regularization
• Such a model has the following total objective function:
J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)
where ᵀ denotes transpose. To simplify the presentation, we assume no bias parameter, so θ is just w.
• The corresponding parameter gradient:
∇w J̃(w; X, y) = αw + ∇w J(w; X, y)
• Gradient step without weight decay:
w ← w − ε ∇w J(w; X, y)
L2-Regularization
• To take a single gradient step to update the weights, we perform this update:
w ← w − ε(αw + ∇w J(w; X, y))
• Written another way, the update is:
w ← (1 − εα)w − ε ∇w J(w; X, y)
• We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor (1 − εα) on each step, just before performing the usual gradient update.
• The purpose of weight decay is to prevent overfitting by penalizing large weights, which can make a model too complex so that it fits the noise in the training data rather than the underlying pattern. The multiplicative shrinkage encourages the model to maintain smaller, more generalizable weights.
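The shrinkage step above can be sketched directly (illustrative values for ε and α, not from the slides):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_J, eps=0.1, alpha=0.5):
    """One update w <- (1 - eps*alpha) * w - eps * grad_J:
    multiplicative shrinkage plus the usual gradient step."""
    return (1.0 - eps * alpha) * w - eps * grad_J

w = np.array([2.0, -4.0])
grad = np.array([0.0, 0.0])  # a zero gradient isolates the shrinkage effect

# With a zero gradient, each weight is simply multiplied by
# 1 - eps*alpha = 0.95, pulling it toward the origin.
w_next = sgd_step_with_weight_decay(w, grad)
print(w_next)  # [ 1.9 -3.8]
```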
L2-Regularization
- What are the effects over the entire course of training?
- L2-regularization causes the learning algorithm to "perceive" the input X as having higher variance. Thus, it shrinks the weights on features.
- L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors.
- Due to multiplicative interactions between weights and inputs, this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
L1-Regularization
• The regularized objective function J̃(w; X, y) is given by:
J̃(w; X, y) = α‖w‖₁ + J(w; X, y)
• With the corresponding gradient (actually, sub-gradient):
∇w J̃(w; X, y) = α · sign(w) + ∇w J(w; X, y)
where sign(w) is simply the sign of w applied element-wise, and ‖·‖₁ is the 1-norm (also known as the L1 norm).
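A small sketch (illustrative, not from the slides) makes the difference between the two penalty gradients concrete: L2 contributes αw, which is proportional to the weight, while L1 contributes α·sign(w), which has the same magnitude for every nonzero weight. That constant pull is what can drive small weights all the way to zero, producing sparsity:

```python
import numpy as np

alpha = 0.5
w = np.array([0.01, -0.01, 3.0, -3.0])

l2_pull = alpha * w           # tiny for tiny weights, large for large ones
l1_pull = alpha * np.sign(w)  # constant magnitude for all nonzero weights

print(l2_pull)
print(l1_pull)
```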
L1 and L2-Regularization

The graphs above show what the functions used in L1 and L2 regularization look like. The penalty in both cases is zero at the center of the plot, but this also implies that the weights are zero and the model will not work. The values of the weights try to be as low as possible to minimize this function, but inevitably they will leave the center and head outward.
L1 and L2-Regularization
• In case of L2 regularization, going towards any direction (from
the center) is okay because, as we can see in the plot, the
function increases equally in all directions. Thus, L2
regularization mainly focuses on keeping the weights as low as
possible.
• In contrast, L1 regularization's shape is diamond-like, and the weights are lower in the corners of the diamond. These corners are where one of the axes/features is zero, thus leading to sparse matrices.
• Note how the shapes of the functions show their differentiability: L2 is smooth and differentiable while L1 is sharp and non-differentiable.
L1 and L2-Regularization
• In a few words, L2 will aim to find small weight values, whereas L1 could put all the weight on a single feature.
• L1 and L2 regularization methods are also combined in what is
called elastic net regularization.
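The combined penalty can be sketched as follows. This is an illustrative form of the elastic net penalty with a mixing ratio r (name assumed here) trading off the L1 and L2 terms; it is not taken from the slides:

```python
import numpy as np

def elastic_net_penalty(w, r=0.5):
    """Omega(w) = r * ||w||_1 + (1 - r)/2 * ||w||_2^2.
    r = 1 recovers pure L1 (sparsity); r = 0 recovers pure L2 (shrinkage)."""
    l1 = np.sum(np.abs(w))
    l2 = 0.5 * np.sum(w ** 2)
    return r * l1 + (1.0 - r) * l2

w = np.array([1.0, -2.0, 0.0])
print(elastic_net_penalty(w, r=1.0))  # pure L1: 3.0
print(elastic_net_penalty(w, r=0.0))  # pure L2: 2.5
print(elastic_net_penalty(w, r=0.5))  # mixture: 2.75
```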
L1 Vs L2-Regularization
For a Linear Regression problem with only 2 parameters w1 and w2

[Figure: the L2/Ridge solution as a function of α, the L2 penalty, and J(w, X, y) (left), and the L1/Lasso solution as a function of α, the L1 penalty, and J(w, X, y) (right), plotted in the (w1, w2) plane.]
Bias / Variance Trade-off
[Figure: training error and cross-validation error (Loss) plotted against the degree of the polynomial. Source: Andrew Ng]
Bias / Variance Trade-off

[Figure: the same training and cross-validation error curves, annotated: a low polynomial degree gives high bias, a high degree gives high variance.]
Bias / Variance Trade-off with Regularization
[Figure: training error and cross-validation error (Loss) plotted against the regularization strength λ. Source: Andrew Ng]
Bias / Variance Trade-off with Regularization

[Figure: the same curves annotated: a small λ gives high variance, a large λ gives high bias. Source: Andrew Ng]
What does L1 regularization tend to produce in a model's weight distribution?
• A. Larger weight values
• B. Smaller weight values
• C. Zeroed weights for some features
• D. Equal weights for all features

What can be a potential disadvantage of using L1 regularization with a very high λ value?
• A. It may lead to underfitting.
• B. It increases the risk of overfitting.
• C. It improves the model's interpretability.
• D. It makes the model training faster.

L2 regularization is also known as:
• A. Lasso
• B. Ridge
• C. Elastic Net
• D. Max Norm
What is the effect of L2 regularization on the weight values of a model?
• A. It eliminates some weight values.
• B. It reduces the magnitude of the weight values.
• C. It increases the magnitude of the weight values.
• D. It has no effect on the weight values.

The main purpose of using regularization in a machine learning model is to:
• A. Increase model accuracy on the training set.
• B. Decrease the training time of the model.
• C. Prevent overfitting to the training data.
• D. Increase the number of features in the model.
Dataset Augmentation

• The best way to make a machine learning model generalize better is to train it on more data.
• How to generate more data?
- Label more data.
- Create fake data.
- Inject noise into the data.
- Inject noise into the model parameters.
- Inject noise into the output.
- For image data: rotation, translation and other transformations, noise injection.
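The image-data transformations listed above can be sketched with plain NumPy. This is a minimal illustration, not from the slides; `img` is assumed to be a 2D grayscale array:

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-to-right."""
    return img[:, ::-1]

def translate_right(img, shift=1):
    """Shift the image right by `shift` pixels, zero-filling the left edge."""
    out = np.zeros_like(img)
    out[:, shift:] = img[:, :-shift]
    return out

def add_noise(img, sigma=0.1, seed=0):
    """Inject Gaussian noise into the pixel values."""
    rng = np.random.default_rng(seed)
    return img + rng.normal(scale=sigma, size=img.shape)

img = np.arange(9, dtype=float).reshape(3, 3)
augmented = [horizontal_flip(img), translate_right(img), add_noise(img)]
# Each augmented variant keeps the original label, multiplying the
# effective size of the training set.
```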
Dataset Augmentation

[Figures: examples of augmented images; an example of classes; an example of examples for one class.]
What is the primary purpose of data augmentation in machine learning?
• A. To reduce the size of the dataset
• B. To make the training process faster
• C. To increase the diversity of the training set by generating altered versions of the data
• D. To increase the accuracy of the test set

Which of the following is a common data augmentation technique used in image processing?
• A. Fourier transformation
• B. Random rotation
• C. Tokenization
• D. Lemmatization

Data augmentation can help reduce overfitting because it:


• A. Decreases the number of parameters in the model
• B. Increases the quantity and variety of the training data
• C. Simplifies the features in the dataset
• D. Reduces the dimensionality of the input data

Which of the following data augmentation techniques is typically not used for tabular data?
• A. Noise injection
• B. Feature cross
• C. Horizontal flipping
• D. Oversampling
Early Stopping
• When training large models with sufficient representational
capacity to overfit the task, it is often observed that training error
decreases steadily over time, but validation set error begins to rise
again.

• This means we can obtain a model with better validation set error
(and thus, hopefully better test set error) by returning to the
parameter setting at the point in time with the lowest validation
set error.

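The idea of returning to the parameters with the lowest validation error can be sketched as a schematic training loop (illustrative, not from the slides; the `patience` stopping rule is one common variant):

```python
def train_with_early_stopping(train_step, validate, params,
                              max_epochs=100, patience=5):
    """Keep the parameters from the epoch with the lowest validation
    error; stop after `patience` epochs without improvement."""
    best_params, best_val = params, float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        params = train_step(params)
        val_error = validate(params)
        if val_error < best_val:
            best_val, best_params = val_error, params
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error is rising: stop training
    return best_params, best_val

# Toy run: "training" moves a scalar parameter toward 3, but validation
# error is minimized at 2, so early stopping returns a value near 2.
step = lambda p: p + 0.1
val = lambda p: (p - 2.0) ** 2
best_p, best_e = train_with_early_stopping(step, val, params=0.0)
```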
Early Stopping

Learning curves showing how the negative log-likelihood loss changes over time (indicated as the number of training iterations over the dataset, or epochs). In this example, a network is trained on MNIST. Observe that the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
Bagging and Other Ensemble Methods
• Bagging is a technique for reducing generalization error by
combining several models.
• The idea is to train several different models separately, then have
all of the models vote on the output for test examples.
• This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.
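The bagging procedure can be sketched as follows. This is a minimal illustration, not from the slides, using a trivially simple "model" (predict the mean of its training labels) so the bootstrap-and-average structure stays visible:

```python
import numpy as np

def fit_mean_model(y):
    """A trivially simple 'model': predict the mean of its training labels."""
    return float(np.mean(y))

def bagging_predict(y_train, n_models=10, seed=0):
    """Train several models on bootstrap resamples (sampling with
    replacement) and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # bootstrap resample
        predictions.append(fit_mean_model(y_train[idx]))
    return float(np.mean(predictions))     # model averaging

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ensemble_estimate = bagging_predict(y)
# Each resampled model can differ noticeably, but the ensemble average
# stays close to the full-sample mean (3.0).
```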
Bagging and Other Ensemble Methods

A cartoon depiction of how bagging works. Suppose we train an '8' detector on the dataset depicted above, containing an '8', a '6' and a '9', and we make two different resampled datasets. The bagging training procedure constructs each of these datasets by sampling with replacement. The first dataset omits the '9' and repeats the '8'; on this dataset, the detector learns that a loop on top of the digit corresponds to an '8'. The second dataset repeats the '9' and omits the '6'; in this case, the detector learns that a loop on the bottom of the digit corresponds to an '8'. Each of these individual classification rules is brittle, but if we average their outputs then the detector is robust, achieving maximal confidence only when both loops of the '8' are present.