Regularization Techniques in Machine Learning

The document discusses supervised learning and regularization techniques, focusing on optimization, gradient descent, and the concepts of underfitting and overfitting. It highlights various regularization methods such as L1 and L2 penalties, dataset augmentation, and early stopping to improve model generalization and combat overfitting. Additionally, it provides insights into the bias-variance trade-off and the importance of managing model complexity to enhance performance on unseen data.


Supervised Learning

Regularization Techniques
LEARNING OBJECTIVES

At the end of this session, you will be able to understand:

Some Basic Concepts
• What is Optimization?
• Generalization
• Underfitting-Overfitting
• Bias-Variance

Optimization
• Gradient Descent
• Stochastic Gradient Descent
• Parameter Initialization
• Adagrad
• RMSProp
• Batch Normalization

Regularization
• Parameter Norm-Penalties (L1-norm and L2-norm)
• Dataset Augmentation
• Early Stopping
• Bagging and other ensemble methods
• Dropout
Data Management for Training and Evaluation

[Figure: the complete dataset is split into a training set, a testing set, and an optional validation set. The training set is processed in batches (Batch 1 … Batch M); one full pass over all batches is an epoch, and training runs for several epochs (Epoch 1 … Epoch N), with validation and testing performed between them.]
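The split-and-iterate scheme above can be sketched in code. This is an illustrative sketch, not taken from the slides; the split fractions and batch size are assumed values:

```python
import numpy as np

def split_dataset(X, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle and split the complete dataset into train / validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    n_val = int(val_frac * len(X))
    return (X[idx[:n_train]],
            X[idx[n_train:n_train + n_val]],
            X[idx[n_train + n_val:]])

X = np.arange(100).reshape(100, 1)
train, val, test = split_dataset(X)

batch_size = 16
n_epochs = 3
for epoch in range(n_epochs):                       # Epoch 1 ... Epoch N
    for start in range(0, len(train), batch_size):  # Batch 1 ... Batch M
        batch = train[start:start + batch_size]
        # ... one training update on `batch` would go here ...
    # validation after each epoch; testing once at the very end
```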
Generalization
• The ability of a trained model to perform well on test data is known as its generalization ability. Two kinds of problems can afflict machine learning models in general:
- Even after the model has been fully trained so that its training error is small, it exhibits a high test error rate. This is known as the problem of overfitting.
- The training error fails to come down in spite of several epochs of training. This is known as the problem of underfitting.
Recipe for Learning

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Recipe for Learning

Don't forget overfitting! The recipe branches into two strategies: modifying the network with a better optimization strategy, and preventing overfitting.

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Underfitting-Overfitting
• Overfitting becomes especially problematic as you make your model increasingly complex.
• Underfitting is a related issue, where your model is not complex enough to capture the underlying trend in the data.
• The problem of overfitting is not limited to computers; humans are often no better.
Underfitting-Overfitting

Source: Quora: Luis Argerich

What is Regularization?
How to Combat Overfitting?
• Two ways to combat overfitting:
1. Use more training data. The more you have, the harder it is to overfit the data by learning too much from any single training example.
2. Use regularization. Add a penalty term to the loss function that discourages building an overly complex model.
How to Combat Overfitting?
• The first piece of the sum is our normal cost function.
• The second piece is a regularization term that adds a penalty for large beta coefficients.
• With these two elements in place, the cost function now balances between two priorities: explaining the training data and preventing that explanation from becoming overly specific.
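The sum itself is not reproduced on the slide; a plausible reconstruction for a linear model, with coefficients β and penalty weight λ (both symbol names assumed here), is:

```latex
J(\beta) \;=\;
\underbrace{\sum_{i=1}^{n} \bigl(y_i - \mathbf{x}_i^{\top}\beta\bigr)^2}_{\text{normal cost function}}
\;+\;
\underbrace{\lambda \sum_{j=1}^{p} \beta_j^{2}}_{\text{penalty on large }\beta\text{ coefficients}}
```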
Regularization in Machine Learning

Illustrates the relationship between model capacity and the concepts of underfitting and overfitting by plotting the training and test errors as a function of model capacity. When the capacity is low, both the training and test errors are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.
Regularization in Machine Learning
• How to make an algorithm/model perform well not just on the
training data, but also on new inputs?

• Many strategies are designed explicitly to reduce the test error,


possibly at the expense of increased training error. These
strategies are known collectively as regularization.

Regularization in Machine Learning
• The following factors determine how well a model is able to
generalize from the training dataset to the test dataset:
- The model capacity and its relation to data complexity:
- In general if the model capacity is less than the data complexity
then it leads to underfitting,
- while if the converse is true, then it can lead to overfitting. Hence,
the objective is to choose a model whose capacity matches the
complexity of the training and test datasets.
- Even if the model capacity and the data complexity are well matched, we can still encounter overfitting due to an insufficient number of training samples.
How to Regularize?
• Put extra constraints on the model, e.g., add restrictions on the parameter values.
• Add extra terms (penalties) to the objective function, indirectly constraining the parameter values.
Regularization Techniques
• Parameter Norm-Penalties. (L1-norm and L2-norm)
• Dataset Augmentation
• Early Stopping
• Bagging and other ensemble methods
• Dropout

Parameter Norm-Penalties
Limit the capacity of models such as neural networks, linear regression, or logistic regression by adding a parameter norm penalty Ω(θ) to the objective function J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term Ω relative to the standard objective function J.

Setting α to zero results in no regularization. Larger values of α correspond to more regularization.

We use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.
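As a concrete illustration (a sketch, not from the slides), the penalized objective J̃ = J + αΩ(w) for a linear model with a mean-squared-error J and a squared L2 norm penalty can be written as:

```python
import numpy as np

def mse_loss(w, X, y):
    """Standard objective J(w; X, y): mean squared error of a linear model."""
    return np.mean((X @ w - y) ** 2)

def l2_penalty(w):
    """Norm penalty Omega(w) = 1/2 * ||w||_2^2."""
    return 0.5 * np.sum(w ** 2)

def regularized_loss(w, X, y, alpha):
    """Total objective J_tilde(w; X, y) = J(w; X, y) + alpha * Omega(w)."""
    return mse_loss(w, X, y) + alpha * l2_penalty(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = np.zeros(3)

# alpha = 0 gives no regularization; larger alpha adds more.
print(regularized_loss(w, X, y, alpha=0.0))
print(regularized_loss(w, X, y, alpha=0.1))
```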
L2-Regularization
• The L2 parameter norm penalty is commonly known as weight decay.
• L2-regularization is also known as ridge regression or Tikhonov regularization.
• L2 regularization drives the weights closer to the origin by adding a regularization term Ω(θ) = (1/2)‖w‖₂² to the objective function, where ‖·‖₂ is the 2-norm (also known as the L2 norm or Euclidean norm).
L2-Regularization
• Such a model has the following total objective function:
J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)
where ᵀ denotes transpose. To simplify the presentation, we assume no bias parameter, so θ is just w.
• The corresponding parameter gradient:
∇w J̃(w; X, y) = αw + ∇w J(w; X, y)
• Gradient step without weight decay:
w ← w − ε ∇w J(w; X, y)
L2-Regularization
• To take a single gradient step to update the weights, we perform this update:
w ← w − ε(αw + ∇w J(w; X, y))
• Written another way, the update is:
w ← (1 − εα)w − ε ∇w J(w; X, y)
• We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor (1 − εα) on each step, just before performing the usual gradient update.
• The purpose of weight decay is to prevent overfitting by penalizing large weights, which can make a model too complex so that it fits the noise in the training data rather than the underlying pattern. The multiplicative shrinkage encourages the model to maintain smaller, more generalizable weights.
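The shrinkage step above can be sketched directly (illustrative values for ε and α, not from the slides):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_J, eps=0.1, alpha=0.5):
    """One update w <- (1 - eps*alpha) * w - eps * grad_J:
    multiplicative shrinkage plus the usual gradient step."""
    return (1.0 - eps * alpha) * w - eps * grad_J

w = np.array([2.0, -4.0])
grad = np.array([0.0, 0.0])  # a zero gradient isolates the shrinkage effect

# With a zero gradient, each weight is simply multiplied by
# 1 - eps*alpha = 0.95, pulling it toward the origin.
w_next = sgd_step_with_weight_decay(w, grad)
print(w_next)  # [ 1.9 -3.8]
```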
L2-Regularization
- What are the effects over the entire course of training?
- L2-regularization causes the learning algorithm to "perceive" the input X as having higher variance. Thus, it shrinks the weights on features.
- L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors.
- Due to multiplicative interactions between weights and inputs, this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
L1-Regularization
• The regularized objective function J̃(w; X, y) is given by:
J̃(w; X, y) = α‖w‖₁ + J(w; X, y)
• With the corresponding gradient (actually, sub-gradient):
∇w J̃(w; X, y) = α · sign(w) + ∇w J(w; X, y)
where sign(w) is simply the sign of w applied element-wise, and ‖·‖₁ is the 1-norm (also known as the L1 norm).
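A small sketch (illustrative, not from the slides) makes the difference between the two penalty gradients concrete: L2 contributes αw, which is proportional to the weight, while L1 contributes α·sign(w), which has the same magnitude for every nonzero weight. That constant pull is what can drive small weights all the way to zero, producing sparsity:

```python
import numpy as np

alpha = 0.5
w = np.array([0.01, -0.01, 3.0, -3.0])

l2_pull = alpha * w           # tiny for tiny weights, large for large ones
l1_pull = alpha * np.sign(w)  # constant magnitude for all nonzero weights

print(l2_pull)
print(l1_pull)
```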
L1 and L2-Regularization

The graphs above show what the functions used in L1 and L2 regularization look like. The penalty in both cases is zero at the center of the plot, but this also implies that the weights are zero and the model will not work. The values of the weights try to be as low as possible to minimize this function, but inevitably they will leave the center and head outward.
L1 and L2-Regularization
• In case of L2 regularization, going towards any direction (from
the center) is okay because, as we can see in the plot, the
function increases equally in all directions. Thus, L2
regularization mainly focuses on keeping the weights as low as
possible.
• In contrast, L1 regularization's shape is diamond-like, and the weights are lower in the corners of the diamond. These corners are where one of the axes/features is zero, thus leading to sparse matrices.
• Note how the shapes of the functions show their differentiability: L2 is smooth and differentiable while L1 is sharp and non-differentiable.
L1 and L2-Regularization
• In a few words, L2 will aim to find small weight values, whereas L1 could put all the weight on a single feature.
• L1 and L2 regularization methods are also combined in what is
called elastic net regularization.
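The combined penalty can be sketched as follows. This is an illustrative form of the elastic net penalty with a mixing ratio r (name assumed here) trading off the L1 and L2 terms; it is not taken from the slides:

```python
import numpy as np

def elastic_net_penalty(w, r=0.5):
    """Omega(w) = r * ||w||_1 + (1 - r)/2 * ||w||_2^2.
    r = 1 recovers pure L1 (sparsity); r = 0 recovers pure L2 (shrinkage)."""
    l1 = np.sum(np.abs(w))
    l2 = 0.5 * np.sum(w ** 2)
    return r * l1 + (1.0 - r) * l2

w = np.array([1.0, -2.0, 0.0])
print(elastic_net_penalty(w, r=1.0))  # pure L1: 3.0
print(elastic_net_penalty(w, r=0.0))  # pure L2: 2.5
print(elastic_net_penalty(w, r=0.5))  # mixture: 2.75
```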
L1 Vs L2-Regularization
For a Linear Regression problem with only 2 parameters w1 and w2

[Figure: the L2/Ridge solution as a function of α, the L2 penalty, and J(w, X, y) (left), and the L1/Lasso solution as a function of α, the L1 penalty, and J(w, X, y) (right), plotted in the (w1, w2) plane.]
Bias / Variance Trade-off
[Figure: training error and cross-validation error (Loss) plotted against the degree of the polynomial. Source: Andrew Ng]
Bias / Variance Trade-off

[Figure: the same training and cross-validation error curves, annotated: a low polynomial degree gives high bias, a high degree gives high variance.]
Bias / Variance Trade-off with Regularization
[Figure: training error and cross-validation error (Loss) plotted against the regularization strength λ. Source: Andrew Ng]
Bias / Variance Trade-off with Regularization

[Figure: the same curves annotated: a small λ gives high variance, a large λ gives high bias. Source: Andrew Ng]
What does L1 regularization tend to produce in a model's weight distribution?
• A. Larger weight values
• B. Smaller weight values
• C. Zeroed weights for some features
• D. Equal weights for all features

What can be a potential disadvantage of using L1 regularization with a very high λ value?
• A. It may lead to underfitting.
• B. It increases the risk of overfitting.
• C. It improves the model's interpretability.
• D. It makes the model training faster.

L2 regularization is also known as:
• A. Lasso
• B. Ridge
• C. Elastic Net
• D. Max Norm
What is the effect of L2 regularization on the weight values of a model?
• A. It eliminates some weight values.
• B. It reduces the magnitude of the weight values.
• C. It increases the magnitude of the weight values.
• D. It has no effect on the weight values.

The main purpose of using regularization in a machine learning model is to:
• A. Increase model accuracy on the training set.
• B. Decrease the training time of the model.
• C. Prevent overfitting to the training data.
• D. Increase the number of features in the model.
Dataset Augmentation

• The best way to make a machine learning model generalize better is to train it on more data.
• How to generate more data?
- Label more data.
- Create fake data.
- Inject noise into the data.
- Inject noise into the model parameters.
- Inject noise into the output.
- For image data: rotation, translation and other transformations, noise injection.
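The image-data transformations listed above can be sketched with plain NumPy. This is a minimal illustration, not from the slides; `img` is assumed to be a 2D grayscale array:

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-to-right."""
    return img[:, ::-1]

def translate_right(img, shift=1):
    """Shift the image right by `shift` pixels, zero-filling the left edge."""
    out = np.zeros_like(img)
    out[:, shift:] = img[:, :-shift]
    return out

def add_noise(img, sigma=0.1, seed=0):
    """Inject Gaussian noise into the pixel values."""
    rng = np.random.default_rng(seed)
    return img + rng.normal(scale=sigma, size=img.shape)

img = np.arange(9, dtype=float).reshape(3, 3)
augmented = [horizontal_flip(img), translate_right(img), add_noise(img)]
# Each augmented variant keeps the original label, multiplying the
# effective size of the training set.
```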
Dataset Augmentation

[Figures: examples of augmented images; an example of classes; an example of examples for one class.]
What is the primary purpose of data augmentation in machine learning?
• A. To reduce the size of the dataset
• B. To make the training process faster
• C. To increase the diversity of the training set by generating altered versions of the data
• D. To increase the accuracy of the test set

Which of the following is a common data augmentation technique used in image processing?
• A. Fourier transformation
• B. Random rotation
• C. Tokenization
• D. Lemmatization

Data augmentation can help reduce overfitting because it:


• A. Decreases the number of parameters in the model
• B. Increases the quantity and variety of the training data
• C. Simplifies the features in the dataset
• D. Reduces the dimensionality of the input data

Which of the following data augmentation techniques is typically not used for tabular data?
• A. Noise injection
• B. Feature cross
• C. Horizontal flipping
• D. Oversampling
Early Stopping
• When training large models with sufficient representational
capacity to overfit the task, it is often observed that training error
decreases steadily over time, but validation set error begins to rise
again.

• This means we can obtain a model with better validation set error
(and thus, hopefully better test set error) by returning to the
parameter setting at the point in time with the lowest validation
set error.

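The idea of returning to the parameters with the lowest validation error can be sketched as a schematic training loop (illustrative, not from the slides; the `patience` stopping rule is one common variant):

```python
def train_with_early_stopping(train_step, validate, params,
                              max_epochs=100, patience=5):
    """Keep the parameters from the epoch with the lowest validation
    error; stop after `patience` epochs without improvement."""
    best_params, best_val = params, float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        params = train_step(params)
        val_error = validate(params)
        if val_error < best_val:
            best_val, best_params = val_error, params
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error is rising: stop training
    return best_params, best_val

# Toy run: "training" moves a scalar parameter toward 3, but validation
# error is minimized at 2, so early stopping returns a value near 2.
step = lambda p: p + 0.1
val = lambda p: (p - 2.0) ** 2
best_p, best_e = train_with_early_stopping(step, val, params=0.0)
```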
Early Stopping

Learning curves showing how the negative log-likelihood loss changes over time (indicated as the number of training iterations over the dataset, or epochs). In this example, a network is trained on MNIST. Observe that the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
Bagging and Other Ensemble Methods
• Bagging is a technique for reducing generalization error by
combining several models.
• The idea is to train several different models separately, then have
all of the models vote on the output for test examples.
• This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.
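The bagging procedure can be sketched as follows. This is a minimal illustration, not from the slides, using a trivially simple "model" (predict the mean of its training labels) so the bootstrap-and-average structure stays visible:

```python
import numpy as np

def fit_mean_model(y):
    """A trivially simple 'model': predict the mean of its training labels."""
    return float(np.mean(y))

def bagging_predict(y_train, n_models=10, seed=0):
    """Train several models on bootstrap resamples (sampling with
    replacement) and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # bootstrap resample
        predictions.append(fit_mean_model(y_train[idx]))
    return float(np.mean(predictions))     # model averaging

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ensemble_estimate = bagging_predict(y)
# Each resampled model can differ noticeably, but the ensemble average
# stays close to the full-sample mean (3.0).
```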
Bagging and Other Ensemble Methods

A cartoon depiction of how bagging works. Suppose we train an '8' detector on the dataset depicted above, containing an '8', a '6' and a '9', and we make two different resampled datasets. The bagging training procedure constructs each of these datasets by sampling with replacement. The first dataset omits the '9' and repeats the '8'; on this dataset, the detector learns that a loop on top of the digit corresponds to an '8'. The second dataset repeats the '9' and omits the '6'; in this case, the detector learns that a loop on the bottom of the digit corresponds to an '8'. Each of these individual classification rules is brittle, but if we average their outputs then the detector is robust, achieving maximal confidence only when both loops of the '8' are present.