Deep Learning Course File Aiml-1
COURSE FILE
Department of CSE
(Artificial Intelligence & Machine Learning)
(2023-2024)
Neural Networks and Deep Learning
COURSE FILE
SUBJECT: Neural Networks and Deep Learning
ACADEMIC YEAR: 2023-2024
REGULATION: R18
SUBJECT CODE:
INDEX
COURSE FILE
S.NO | TOPIC | PAGE NO
1 | PEOs, POs, PSOs | 3
2 | Syllabus Copy | 5
5 | Lesson Plan | 9
    a) Notes of Units
    b) Assignment Questions
    e) Objective Questions
PROGRAM EDUCATIONAL OBJECTIVES
PEO1: The graduates of the program will understand the concepts and principles of Computer Science and Engineering, inclusive of the basic sciences.
PEO2: The program equips learners with the technical skills necessary to design and implement computer systems and applications, conduct open-ended problem solving, and apply critical thinking.
PEO3: The graduates of the program will practice the profession by working effectively on teams, communicating in written and oral form, and upholding ethics, integrity, leadership, and social responsibility through safe engineering, contributing to the good of society.
PEO4: The program encourages students to treat learning as a lifelong activity and as a means to the creative discovery, development, and implementation of technology, and to keep up with the dynamic nature of the Computer Science and Engineering discipline.
PROGRAM OUTCOMES
Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
The engineer and society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
2. Syllabus Copy
Course Outcomes:
UNIT-I
Artificial Neural Networks: Introduction, Basic models of ANN, important terminologies, Supervised Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Network. Associative Memory Networks: Training Algorithms for pattern association, BAM and Hopfield Networks.
UNIT-II
Unsupervised Learning Networks: Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation Networks, Adaptive Resonance Theory Networks. Special Networks: Introduction to various networks.
UNIT - III
Introduction to Deep Learning, Historical Trends in Deep Learning, Deep Feed-forward Networks, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms.
UNIT - IV
Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise Robustness, Semi-Supervised Learning, Multi-task Learning, Early Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier.
UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms.
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing.
TEXT BOOKS:
1. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press.
2. Neural Networks and Learning Machines, Simon Haykin, 3rd Edition, Pearson Prentice Hall.
Day/Time | 09.15-10.15 | 10.15-11.15 | 11.15-12.15 | 12.15-01.15 | 01.15-02.00 | 02.00-03.00 | 03.00-04.00
TUE | IPR | RF | NN | CC | LUNCH | ASN | LIBRARY/SPORTS
WED | NN | ASN | RF | CC | LUNCH | SEMINAR |
THU | RF | IPR | NN LAB | CC | LUNCH | ASN |
Day (periods 09:15 a.m.-01:15 p.m.) | 01:15-02:00 | 02:00-04:00
MON | DL(CS), NN&DL | LUNCH | IOMP/SUMMER INTERNSHIP
TUE | DL(CS), NN&DL | LUNCH |
WED | NN&DL, DL(CS) | LUNCH |
THU | DL(CS), DL LAB | LUNCH |
FRI | DL(CS), NN&DL | LUNCH | IOMP/SUMMER INTERNSHIP
SAT | NN&DL | LUNCH | IOMP/SUMMER INTERNSHIP
5. Student List
MALLAREDDY INSTITUTE OF TECHNOLOGY & SCIENCE
CSE - AIML
Class: IV Year-I Sem B. Tech. Branch: B.Tech – CSE (AIML)
Batch: 2020-2024 A.Y:2023-2024
ROLL LIST
S. No H.T.NO NAME OF THE STUDENT
1 20S11A6601 AJAY KYADAVENI
2 20S11A6602 AKHIL DESAI
3 20S11A6603 AMRUTHA S.V.S
4 20S11A6604 ANUDEEP DHURGAM
5 20S11A6605 ASHWINI KASHI
6 20S11A6606 BHARATH D
7 20S11A6607 BHAVANA GOLLAPALLY
8 20S11A6608 BHAVANA KAMMARI
9 20S11A6609 BHAVANI SHANKER C V
10 20S11A6610 CHANDANA TIGULLA
11 20S11A6611 CHANDRASHEKHARA PRAMOD
12 20S11A6612 DINESH SADANANAD
13 20S11A6613 HARI KRISHNA SAMBARI
14 20S11A6614 JASHWANTH BOMMAKANTI
15 20S11A6615 KAMAL SANJAY
16 20S11A6616 LAXMI PRASANNA A A S
17 20S11A6617 M N AJAY VARMA PENMATSA
18 20S11A6618 MAHESH BOLLABATHULA
19 20S11A6619 MANISH SAI KUMAR KOSURU
20 20S11A6620 NANDINI REPALLE
21 20S11A6621 NIHARIKA CH
22 20S11A6622 NISHANK YARLAGADDA
23 20S11A6623 NITHIN GOUD BEESU
24 20S11A6624 PAVAN SAI GORUPUTI
5. Lesson Plan

Lesson No. | Date | No. of Periods | Topic/Sub Topic | Mode of Teaching | Course Outcome | Reference Text Book
1.5  | 4.08.23  | 1 | Adaptive Linear Neuron | PPT | CO1 | T1
1.6  | 5.08.23  | 1 | Back-propagation Network, Associative Memory Networks | PPT | | T1
1.7  | 7.08.23  | 1 | Training Algorithms for pattern association | PPT | | T1
1.8  | 8.08.23  | 1 | BAM and Hopfield Networks | PPT | | T1
2.1  | 9.08.23  | 1 | Unsupervised Learning Network - Introduction | PPT | | T1
4.10 | 23.09.23 | 1 | Dropout, Adversarial Training | PPT | CO1 | T1
4.11 | 25.09.23 | 1 | Tangent Distance, Tangent Prop and Manifold Tangent Classifier | PPT | | T1
5.1  | 25.09.23 | 1 | Introduction, Challenges in Neural Network Optimization | PPT | | T1
5.2  | 23.09.23 | 1 | Basic Algorithms | PPT | | T1
5.3  | 23.09.23 | 1 | Parameter Initialization Strategies | PPT | | T1
5.4  | 23.09.23 | 1 | Algorithms with Adaptive Learning Rates | PPT | | T1
5.5  | 16.11.23 | 1 | Approximate Second-Order Methods | PPT | | T1
5.6  | 18.11.23 | 1 | Optimization Strategies and Meta-Algorithms | PPT | | T1
5.7  | 20.11.23 | 1 | Optimization Strategies and Meta-Algorithms | PPT | | T1
5.8  | 22.11.23 | 1 | Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing | PPT | | T1
5.11 | 27.11.23 | 1 | Virtualization Services Provided by SAP, Salesforce, Sales Cloud | PPT | | T1
5.12 | 29.11.23 | 1 | Service Cloud: Knowledge as a Service, Rackspace, VMware, Manjrasoft, Aneka Platform | PPT | | T1

Assignment Test: Unit 5
6. Lecture notes
Unit 1
Artificial Neural Networks
Machine Learning is a subset of artificial intelligence that helps you build AI-driven
applications.
Deep Learning is a subset of machine learning that uses vast volumes of data and complex
algorithms to train a model.
Artificial intelligence, commonly referred to as AI, is the process of imparting data, information,
and human intelligence to machines. The main goal of Artificial Intelligence is to develop self-
reliant machines that can think and act like humans. These machines can mimic human behavior
and perform tasks by learning and problem-solving. Most of the AI systems simulate natural
intelligence to solve complex problems.
Amazon Echo is a smart speaker that uses Alexa, the virtual assistant AI technology developed
by Amazon. Amazon Alexa is capable of voice interaction, playing music, setting alarms,
playing audiobooks, and giving real-time information such as news, weather, sports, and traffic
reports.
For example, suppose the person wants to know the current temperature in Chicago. The person's voice is first converted into a machine-readable format. The formatted data is then fed into the Amazon Alexa system for processing and analysis. Finally, Alexa returns the desired voice output via Amazon Echo.
Now that you’ve been given a simple introduction to the basics of artificial intelligence, let’s
have a look at its different types.
Reactive Machines - These are systems that only react. These systems don’t form memories, and
they don’t use any past experiences for making new decisions.
Limited Memory - These systems reference the past, and information is added over a period of
time. The referenced information is short-lived.
Theory of Mind - This covers systems that are able to understand human emotions and how they
affect decision making. They are trained to adjust their behavior accordingly.
Self-awareness - These systems are designed and created to be aware of themselves. They
understand their own internal states, predict other people’s feelings, and act appropriately.
Now that we have gone over the basics of artificial intelligence, let’s move on to machine
learning and see how it works.
Machine learning is a discipline of computer science that uses computer algorithms and analytics
to build predictive models that can solve business problems.
As per McKinsey & Co., machine learning is based on algorithms that can learn from data
without relying on rules-based programming.
Tom Mitchell’s book on machine learning says “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.”
So you see, machine learning has numerous definitions. But how does it really work?
Machine learning accesses vast amounts of data (both structured and unstructured) and learns
from it to predict the future. It learns from the data by using multiple algorithms and techniques.
Now that you have been introduced to the basics of machine learning and how it works, let’s see
the different types of machine learning methods.
1. Supervised Learning
In supervised learning, the data is already labeled, which means you know the target variable.
Using this method of learning, systems can predict future outcomes based on past data. It
requires that at least an input and output variable be given to the model for it to be trained.
Below is an example of a supervised learning method. The algorithm is trained using labeled
data of dogs and cats. The trained model predicts whether the new image is that of a cat or a dog.
Some examples of supervised learning include linear regression, logistic regression, support
vector machines, Naive Bayes, and decision tree.
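The labeled-data workflow above can be sketched in a few lines of code. The example below uses a toy nearest-centroid classifier (simpler than the algorithms listed, but the same idea: learn from labeled examples, then predict on new data); the cat/dog feature values are invented purely for illustration:

```python
import numpy as np

# Toy labeled data: two made-up features per animal (say, weight in kg and ear length in cm)
X_train = np.array([[30.0, 8.0], [25.0, 9.0],   # dogs
                    [4.0, 4.0],  [5.0, 3.5]])   # cats
y_train = np.array(["dog", "dog", "cat", "cat"])

def predict(x):
    """Nearest-centroid rule: assign x to the class whose training mean is closest."""
    classes = np.unique(y_train)
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# A new, unseen animal is classified using what was learned from the labeled data
print(predict(np.array([28.0, 7.5])))  # -> dog
```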
2. Unsupervised Learning
Unsupervised learning algorithms employ unlabeled data to discover patterns from the data on
their own. The systems are able to identify hidden features from the input data provided. Once
the data is more readable, the patterns and similarities become more evident.
Below is an example of an unsupervised learning method that trains a model using unlabeled data. In this case, the data consists of different vehicles, and the purpose of the model is to group each kind of vehicle into its own cluster.
Some examples of unsupervised learning include k-means clustering, hierarchical clustering, and
anomaly detection.
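The clustering idea can be illustrated with a minimal k-means sketch. The one-dimensional "measurement" data below is synthetic, generated only so that two natural groups emerge without any labels:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic unlabeled data with two natural groups (centred near 1.0 and 5.0)
data = np.concatenate([rng.normal(1.0, 0.1, 20), rng.normal(5.0, 0.1, 20)])

def kmeans(x, k, iters=10):
    """Plain k-means on 1-D data: repeatedly assign each point to the nearest
    centre, then move each centre to the mean of its assigned points."""
    centres = np.linspace(x.min(), x.max(), k)   # simple deterministic initialisation
    for _ in range(iters):
        assign = np.abs(x[:, None] - centres[None, :]).argmin(axis=1)
        centres = np.array([x[assign == j].mean() for j in range(k)])
    return np.sort(centres)

print(kmeans(data, 2))  # two centres, close to 1.0 and 5.0
```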
3. Reinforcement Learning
The goal of reinforcement learning is to train an agent to complete a task within an uncertain
environment. The agent receives observations and a reward from the environment and sends
actions to the environment. The reward measures how successful an action is with respect to completing the task goal.
Examples of reinforcement learning algorithms include Q-learning and Deep Q-learning Neural
Networks.
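Tabular Q-learning, the first algorithm named above, can be sketched on a toy problem. The corridor environment here is invented for the example (states 0 to 3, with reward 1 for reaching state 3):

```python
import random

# Toy environment: states 0..3 in a corridor; action 0 = left, 1 = right.
# Reaching state 3 yields reward 1 and ends the episode.
N_STATES, GOAL = 4, 3
Q = [[0.0, 0.0] for _ in range(N_STATES)]     # Q-table: one value per (state, action)
alpha, gamma, eps = 0.5, 0.9, 0.1             # learning rate, discount, exploration
random.seed(0)

for _ in range(500):                          # episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda i: Q[s][i])
        s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])  # state values grow toward the goal
```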
Now that we’ve explored machine learning and its applications, let’s turn our attention to deep
learning, what it is, and how it is different from AI and machine learning.
Deep learning is a subset of machine learning that deals with algorithms inspired by the structure
and function of the human brain. Deep learning algorithms can work with an enormous amount
of both structured and unstructured data. Deep learning’s core concept lies in artificial neural
networks, which enable machines to make decisions.
The major difference between deep learning and machine learning is the way data is presented to the machine. Machine learning algorithms usually require structured data, whereas deep learning networks work on multiple layers of artificial neural networks.
The network has an input layer that accepts inputs from the data. The hidden layer is used to find
any hidden features from the data. The output layer then provides the expected output.
Here is an example of a neural network that uses large sets of unlabeled data of eye retinas. The
network model is trained on this data to find out whether or not a person has diabetic retinopathy.
Now that we have an idea of what deep learning is, let’s see how it works.
3. The activation function takes the “weighted sum of input” as the input to the function,
adds a bias, and decides whether the neuron should be fired or not.
5. The model output is compared with the actual output. After training the neural network,
the model uses the backpropagation method to improve the performance of the network.
The cost function helps to reduce the error rate.
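The steps above can be traced for a single neuron. The numbers below are arbitrary; the point is the order of operations: weighted sum, bias, activation, then a cost comparing the output with the target:

```python
import numpy as np

def sigmoid(z):
    """Squashes the net input into (0, 1); values near 1 mean the neuron 'fires'."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])    # inputs (arbitrary illustrative values)
w = np.array([0.4, 0.1, -0.6])    # weights
b = 0.2                           # bias

z = w @ x + b                     # weighted sum of the inputs plus bias
y = sigmoid(z)                    # activation decides how strongly the neuron fires

target = 1.0
cost = 0.5 * (y - target) ** 2    # squared-error cost that training tries to reduce
print(z, y, cost)
```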
In the following example, deep learning and neural networks are used to identify the number on a license plate. This technique is used by many countries to identify rule violators and speeding vehicles.
Convolutional Neural Network (CNN) - CNN is a class of deep neural networks most commonly
used for image analysis.
Recurrent Neural Network (RNN) - RNN uses sequential information to build a model. It often
works better for models that have to memorize past data.
Generative Adversarial Network (GAN) - GANs are algorithmic architectures that use two neural networks to create new, synthetic instances of data that can pass for real data. A GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers.
Deep Belief Network (DBN) - A DBN is a generative graphical model composed of multiple layers of latent variables called hidden units. There are connections between the layers, but not between the units within each layer.
Music generation
Image coloring
Object detection
1. There are three layers in the network architecture: the input layer, the hidden layer (there can be more than one), and the output layer. Because of the numerous layers, these networks are sometimes referred to as deep neural networks.
2. It is possible to think of the hidden layer as a “distillation layer,” which extracts some of
the most relevant patterns from the inputs and sends them on to the next layer for further
analysis. It accelerates and improves the efficiency of the network by recognizing just the
most important information from the inputs and discarding the redundant information.
3. The activation function is important for two reasons: first, it decides whether a neuron should be activated; second, it introduces non-linearity into the network.
4. This model captures the presence of non-linear relationships between the inputs.
4. Finding the optimal values of W (the weights) that minimize prediction error is critical to building a successful model. The backpropagation algorithm does this, turning the ANN into a learning algorithm that learns from its mistakes.
5. The optimization approach uses a gradient descent technique to reduce prediction errors. To find the optimum value of W, small adjustments to W are tried and their impact on the prediction error is examined. The values of W for which further changes no longer reduce the error are chosen as the final weights.
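Point 5 can be made concrete with a one-weight example. The sketch below fits y = w*x by gradient descent on toy data; here the gradient is computed analytically rather than by trying small adjustments, but the idea of repeatedly nudging W against the prediction error is the same:

```python
import numpy as np

# Toy data generated by the true relationship y = 2 * x
x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 4.0, 6.0])

w = 0.0                               # start from an arbitrary weight
lr = 0.05                             # learning rate (size of each adjustment)
for _ in range(200):
    y = w * x                         # predictions with the current weight
    grad = np.mean(2 * (y - t) * x)   # derivative of mean squared error w.r.t. w
    w -= lr * grad                    # adjust w opposite to the gradient

print(round(w, 3))  # converges close to the true value 2.0
```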
ANNs offer many key benefits that make them particularly well-suited to specific problems and situations:
1. ANNs can learn and model non-linear and complicated interactions, which is critical since many of the relationships between inputs and outputs in real life are non-linear and complex.
2. ANNs can generalize: after learning from the original inputs and their associations, the model can infer unseen relationships from new data, allowing it to generalize and predict on unknown data.
3. ANNs do not impose any constraints on the input variables, unlike many other prediction approaches (for example, on how they should be distributed). Furthermore, numerous studies have demonstrated that ANNs can better model heteroskedasticity, i.e. data with high volatility and non-constant variance, because of their capacity to discover latent correlations in the data without imposing preset associations. This is particularly helpful in financial time-series forecasting (for example, stock prices) where data volatility is significant.
ANNs have a wide range of applications because of their unique properties. A few of the important
applications of ANNs include:
1. Image Recognition:
Image recognition is a rapidly evolving discipline with several applications, ranging from facial identification on social media to cancer detection in medicine to satellite image processing for agricultural and defense purposes.
Deep neural networks, which form the core of "deep learning," have opened up new and transformative advances in computer vision, speech recognition, and natural language processing thanks to ANN research, with self-driving vehicles being a notable example.
2. Forecasting:
Forecasting is widely used in everyday company decisions (sales, the financial allocation between
goods, and capacity utilization), economic and monetary policy, finance, and the stock market.
Forecasting issues are frequently complex; for example, predicting stock prices is complicated by many underlying variables (some known, some unseen). Traditional forecasting models have flaws when it comes to accounting for these complicated, non-linear interactions. Given their capacity to model and extract previously unknown characteristics and correlations, ANNs can provide a reliable alternative when used correctly. ANNs also place no restrictions on the input and residual distributions, unlike conventional models.
1. Hardware Dependence:
The construction of artificial neural networks requires parallel processors; as a result, realizing the network depends on the available hardware.
The structure of artificial neural networks is not determined by any precise rule; a suitable network structure is developed through experience and trial and error.
Training is considered complete when the network's error on the sample decreases to a specified amount, but that value does not necessarily produce the best outcomes.
The ANN learns through various learning algorithms that are described as supervised or
unsupervised learning.
In supervised learning algorithms, the target values are labeled. Its goal is to try to reduce
the error between the desired output (target) and the actual output for optimization. Here, a
supervisor is present.
In unsupervised learning algorithms, the target values are not labeled and the network learns
by itself by identifying the patterns through repeated trials and experiments.
ANN Terminology:
Weights: each neuron is linked to other neurons through connection links that carry a weight. The weight holds information about the input signal, and the output depends solely on the weights and the input signal. The weights can be presented in matrix form, known as the connection matrix; if there are "n" nodes with each node having "m" weights, the weights form an n x m matrix.
Bias: Bias is a constant that is added to the weighted sum of the inputs to compute the net input. It is used to shift the result to the positive or negative side: a positive bias increases the net input, while a negative bias decreases it.
Here, {1, x1, ..., xn} are the inputs, and the output Y of the neuron is computed by the function g(x), which sums up all the inputs and adds the bias to it:
g(x) = ∑ xi + b, where i = 1 to n
     = x1 + ... + xn + b
and the role of the activation is to provide the output depending on the result of the summation function:
Y = 1 if g(x) >= 0
Y = 0 otherwise
Threshold: A threshold value is a constant value that is compared to the net input to get the
output. The activation function is defined based on the threshold value to calculate the output.
For Example:
Y=1 if net input>=threshold
Y=0 else
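The summation-plus-threshold rule above can be written directly in code. This is a sketch of the definitions exactly as given (unweighted sum of the inputs plus bias, compared against a threshold):

```python
def neuron_output(inputs, bias, threshold=0.0):
    """Sum the inputs, add the bias, and fire (output 1) only when the
    net input reaches the threshold, as described above."""
    g = sum(inputs) + bias
    return 1 if g >= threshold else 0

print(neuron_output([0.5, 0.3], bias=-0.6))                 # net input 0.2 >= 0 -> 1
print(neuron_output([0.5, 0.3], bias=-0.6, threshold=0.5))  # 0.2 < 0.5 -> 0
```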
Learning Rate: The learning rate, denoted α, ranges from 0 to 1. It scales the weight adjustments during the learning of the ANN.
Target value: Target values are the correct values of the output variable, also known simply as targets.
Error: The error is the inaccuracy of the predicted output values compared to the target values.
Supervised Learning
As the name suggests, supervised learning takes place under the supervision of a
teacher. This learning process is dependent. During the training of ANN under
supervised learning, the input vector is presented to the network, which will produce
an output vector. This output vector is compared with the desired/target output
vector. An error signal is generated if there is a difference between the actual output
and the desired/target output vector. On the basis of this error signal, the weights
would be adjusted until the actual output is matched with the desired output.
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means the input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data held out from training, and then it predicts the output.
The working of supervised learning can be easily understood by the following example:
o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
o Determine the suitable algorithm for the model, such as support vector
machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation
sets as the control parameters, which are the subset of training datasets.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs, the model is accurate.
1. Regression
Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which
means there are two classes such as Yes-No, Male-Female, True-false, etc.
Spam filtering is a common example. Popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
o With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Perceptron
o Input Nodes:
This is the primary component of the perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units and is another important component of the perceptron. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Bias can be considered as the intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire
or not. Activation Function can be considered primarily as a step function.
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the desired outputs. The activation function used in a perceptron model (e.g., sign, step, or sigmoid) may differ depending on whether the learning process is slow or suffers from vanishing or exploding gradients.
This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is indicative
of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation
function curve up or down.
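The three activation functions named above, in their common textbook forms, can be written as:

```python
import math

def step(z):
    """Binary step: maps the net input to {0, 1}."""
    return 1 if z >= 0 else 0

def sign(z):
    """Sign function: maps the net input to {-1, 1}."""
    return 1 if z >= 0 else -1

def sigmoid(z):
    """Smoothly squashes the net input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(step(-0.3), sign(-0.3), round(sigmoid(-0.3), 3))
```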
Step-1
First, multiply all input values by their corresponding weight values and add them together to determine the weighted sum. A special term called the bias 'b' is then added to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
In a single-layer perceptron model, the algorithm has no previously recorded data, so it begins with randomly allocated weight parameters. It then sums the weighted inputs; if the total sum exceeds a pre-determined value, the model is activated and shows the output value as +1.
If the outcome matches the pre-determined or threshold value, the performance of the model is considered satisfactory and no weight changes are demanded. However, this model runs into discrepancies when multiple input values are fed into it. Hence, to obtain the desired output and minimize errors, some changes to the weights are necessary.
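The two steps, together with the weight adjustment just described, amount to the classic perceptron learning rule. The sketch below trains it on the AND function (a linearly separable toy problem); the learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

# Linearly separable toy problem: the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights start at zero
b = 0.0           # bias
lr = 0.1          # learning rate

for _ in range(20):                        # a few passes over the training set
    for x, t in zip(X, T):
        y = 1 if w @ x + b > 0 else 0      # step activation on the weighted sum
        w += lr * (t - y) * x              # weights change only when the output is wrong
        b += lr * (t - y)

print([1 if w @ x + b > 0 else 0 for x in X])  # -> [0, 0, 0, 1]
```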
The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes
in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and terminate
on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. The error between the actual and the demanded output is propagated backward, starting at the output layer and ending at the input layer.
Hence, a multi-layered perceptron model can be considered as an artificial neural network with multiple layers, in which the activation function does not remain linear as in a single-layer perceptron model. Instead, non-linear activation functions such as sigmoid, TanH, and ReLU can be used for deployment.
A multi-layer perceptron model has greater processing power and can process linear and non-linear
patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR,
NOR.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input x with the learned weight coefficient w, adding the bias b, and thresholding the result:
f(x) = 1 if w.x + b > 0
f(x) = 0 otherwise
Characteristics of Perceptron
The perceptron model has the following characteristics.
o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer
function.
o Perceptron can only be used to classify the linearly separable sets of input vectors. If input vectors
are non-linear, it is not easy to classify them properly.
Future of Perceptron
The future of the perceptron model is bright and significant, as it helps to interpret data by building intuitive patterns and applying them in the future. Machine learning is a rapidly growing technology of artificial intelligence that is continuously evolving; hence perceptron technology will continue to support and facilitate analytical behavior in machines, which will in turn add to the efficiency of computers.
The perceptron model is continuously becoming more advanced and working efficiently on
complex problems with the help of artificial neurons.
An artificial neural network, inspired by the human neural system, is a network used to process data, and it consists of three types of layers: the input layer, the hidden layer, and the output layer. The basic neural network contains only two layers, the input and output layers. The layers are connected by weighted paths that are used to find the net input. In this section, we will discuss two basic types of neural networks: Adaline, which doesn't have any hidden layer, and Madaline, which has one hidden layer.
In Adaline, the weights between the input unit and the output unit are adjustable. It uses the delta rule, i.e. ∆wi = α(t - yin)xi, where wi, yin and t are the weight, predicted output, and true value respectively.
The learning rule minimizes the mean squared error between the activation and the target value. Adaline consists of trainable weights; it compares the actual output with the calculated output and, based on the error, the training algorithm is applied.
Workflow (Adaline):
First, calculate the net input to the Adaline network, then apply the activation function to its output and compare it with the original output. If both are equal, give the output; otherwise send the error back to the network and update the weights according to the error, which is calculated by the delta learning rule, i.e. ∆wi = α(t - yin)xi, where wi, yin and t are the weight, predicted output, and true value respectively.
Architecture:
In Adaline, all the input neurons are directly connected to the output neuron through weighted paths. A bias b, whose activation is always 1, is also present.
Algorithm:
Step 1: Initialize the weights to small random values (not zero). Set the learning rate α.
Step 2: While the stopping condition is false, do steps 3 to 7.
Step 3: For each training pair, perform steps 4 to 6.
Step 4: Set the activation of each input unit, xi = si for i = 1 to n.
Step 5: Compute the net input to the output unit:
yin = b + Σ xi wi (i = 1 to n)
Here, b is the bias and n is the total number of input neurons.
Step 6: Update the weights and bias for i = 1 to n:
wi(new) = wi(old) + α(t − yin)xi
b(new) = b(old) + α(t − yin)
and calculate the error (t − yin)². When the predicted output and the true value are the same, the weights do not change.
Step 7: Test the stopping condition. The stopping condition may be met when the weights change at a low rate or not at all.
Implementations
Problem: train an Adaline network on the bipolar OR function with initial weights w1 = w2 = 0.1, bias b = 0.1, and learning rate α = 0.1.

x1   x2   t
 1    1   1
 1   -1   1
-1    1   1
-1   -1  -1

For the first input (x1 = x2 = 1, t = 1):
yin = b + x1 w1 + x2 w2 = 0.1 + 0.1 + 0.1 = 0.3
Now compute (t − yin) = (1 − 0.3) = 0.7.
Now update the weights and bias:
w1 = 0.1 + 0.1 × 0.7 × 1 = 0.17, w2 = 0.17, b = 0.17
and calculate the error (t − yin)² = 0.49.
Similarly, repeat the same steps for the other input vectors and you will get:

x1   x2   t    yin      (t-yin)   Δw1       Δw2       Δb        w1       w2       b        (t-yin)²
 1    1   1    0.3       0.7       0.07      0.07      0.07     0.17     0.17     0.17     0.49
 1   -1   1    0.17      0.83      0.083    -0.083     0.083    0.253    0.087    0.253    0.69
-1    1   1    0.087     0.913    -0.0913    0.0913    0.0913   0.1617   0.1783   0.3443   0.83
-1   -1  -1    0.0043   -1.0043    0.1004    0.1004   -0.1004   0.2621   0.2787   0.2439   1.01

This is epoch 1, where the total error is 0.49 + 0.69 + 0.83 + 1.01 = 3.02, so more epochs are run until the total error becomes less than or equal to the least squared error, i.e., 2.
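The epoch computation above can be reproduced with a short script. This is a minimal sketch of the Adaline delta rule on the bipolar OR data from the worked example; the variable names are illustrative.

```python
import numpy as np

# Bipolar OR training set from the worked example
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1, 1, 1, -1], dtype=float)

w = np.array([0.1, 0.1])   # initial weights
b = 0.1                    # initial bias
alpha = 0.1                # learning rate
epoch_errors = []

for epoch in range(100):
    total_error = 0.0
    for x, target in zip(X, t):
        y_in = b + x @ w              # Step 5: net input
        delta = target - y_in         # (t - yin)
        w += alpha * delta * x        # Step 6: delta-rule weight update
        b += alpha * delta
        total_error += delta ** 2     # accumulate (t - yin)^2
    epoch_errors.append(total_error)
    if total_error <= 2:              # Step 7: stopping condition from the text
        break

# epoch_errors[0] is about 3.02, matching the epoch-1 total in the table
```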
Back-propagation Network
Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights reduces error rates and makes the model reliable by increasing its generalization.
Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks. This method helps calculate the gradient of a loss function with respect to all the weights in the network.
The backpropagation algorithm computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
Consider the following backpropagation example to understand how a network is trained:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W, which are usually selected randomly.
3. Calculate the output of every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs: Error = Actual Output − Desired Output.
5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
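The weight-adjustment step can be sketched numerically. The following is an illustrative chain-rule computation for a tiny two-layer network with tanh hidden units and squared-error loss; all sizes and values are made up for the example.

```python
import numpy as np

x = np.array([0.5, -0.2])                    # input
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])     # input-to-hidden weights
b1 = np.zeros(2)
W2 = np.array([[0.2, -0.5]])                 # hidden-to-output weights
b2 = np.zeros(1)
t = np.array([1.0])                          # desired output

def forward(W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)                 # hidden activations
    y = W2 @ h + b2                          # output (linear)
    return h, y

h, y = forward(W1, b1, W2, b2)
loss_before = 0.5 * np.sum((y - t) ** 2)     # squared-error loss

# Backward pass: apply the chain rule one layer at a time
dy = y - t                                   # dL/dy (actual - desired)
dW2 = np.outer(dy, h)                        # dL/dW2
db2 = dy
dh = W2.T @ dy                               # error propagated to hidden layer
dh_in = dh * (1 - h ** 2)                    # through the tanh derivative
dW1 = np.outer(dh_in, x)                     # dL/dW1
db1 = dh_in

# Gradient-descent update: adjust weights so the error decreases
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

_, y_new = forward(W1, b1, W2, b2)
loss_after = 0.5 * np.sum((y_new - t) ** 2)  # smaller than loss_before
```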
Two types of backpropagation networks exist:
Static back-propagation
Recurrent backpropagation
Static back-propagation:
This kind of backpropagation network produces a mapping from a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
The main difference between the two methods is that the mapping is rapid in static back-propagation, while it is non-static in recurrent backpropagation.
History of Backpropagation
In 1961, the basic concepts of continuous backpropagation were derived in the context of control theory by Henry J. Kelley and Arthur E. Bryson.
In 1969, Bryson and Ho gave a multi-stage dynamic system optimization method.
In 1974, Werbos stated the possibility of applying this principle in an artificial neural
network.
In 1982, Hopfield brought his idea of a neural network.
In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams,
backpropagation gained recognition.
In 1993, Wan was the first person to win an international pattern recognition contest with
the help of the backpropagation method.
Summary
A neural network is a group of connected I/O units where each connection has an associated weight.
Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks.
The backpropagation algorithm is fast, simple, and easy to program.
A feedforward backpropagation network (BPN) is an artificial neural network.
Two types of backpropagation networks are 1) static back-propagation and 2) recurrent backpropagation.
In 1961, the basic concepts of continuous backpropagation were derived in the context of control theory by Henry J. Kelley and Arthur E. Bryson.
Backpropagation simplifies the network structure by removing weighted links that have a minimal effect on the trained network.
It is especially useful for deep neural networks working on error-prone projects, such as image or speech recognition.
The biggest drawback of backpropagation is that it can be sensitive to noisy data.
Auto Associative Memory
Architecture
As shown in the following figure, the architecture of an auto-associative memory network has 'n' input training vectors and a similar 'n' output target vectors.
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero: wij = 0 (i = 1 to n, j = 1 to n)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit: xi = si (i = 1 to n)
Step 4 − Activate each output unit: yj = sj (j = 1 to n)
Step 5 − Adjust the weights: wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training with Hebb's rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit, j = 1 to n:
yinj = Σ xi wij (i = 1 to n)
Step 5 − Apply the following activation function to calculate the output:
yj = f(yinj) = +1 if yinj > 0, −1 if yinj ⩽ 0
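The training and testing steps above can be sketched in a few lines, storing a single bipolar pattern with the Hebb rule (the pattern itself is arbitrary):

```python
import numpy as np

s = np.array([1, 1, -1, -1])         # bipolar pattern to store (here x = y = s)

# Training: w_ij(new) = w_ij(old) + x_i * y_j, starting from zero weights
W = np.zeros((4, 4))
W += np.outer(s, s)

# Testing: net input yin_j = sum_i x_i w_ij, then the bipolar step function
y_in = s @ W
y = np.where(y_in > 0, 1, -1)        # +1 if yin > 0, else -1
# y equals s: the network recalls the stored pattern

probe = np.array([1, 1, -1, 1])      # the stored pattern with one bit flipped
y2 = np.where(probe @ W > 0, 1, -1)  # the net still recalls s from the noisy probe
```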
Hetero Associative Memory
Architecture
As shown in the following figure, the architecture of a hetero-associative memory network has 'n' input training vectors and 'm' output target vectors.
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero: wij = 0 (i = 1 to n, j = 1 to m)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit: xi = si (i = 1 to n)
Step 4 − Activate each output unit: yj = sj (j = 1 to m)
Step 5 − Adjust the weights: wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training with Hebb's rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit, j = 1 to m:
yinj = Σ xi wij (i = 1 to n)
Step 5 − Apply the following activation function to calculate the output:
yj = f(yinj) = +1 if yinj > 0, 0 if yinj = 0, −1 if yinj < 0
UNIT 2
Unsupervised Learning Network
In general, data collected for an unsupervised machine learning model is unstructured as it's in a
more raw format. Even though unsupervised data sets are much bigger than labeled or supervised
data sets, they are usually cheaper to collect, as they require no specific labeling or processing in
order for the data set to be used.
As we'll see in some of the unsupervised machine learning algorithms, unlike supervised
algorithms, such algorithms take in unlabeled data and try to make sense of it. This can be done
by clustering all data points into given clusters or by discovering hidden patterns and trends.
To make sure that our model returns accurate results, we must deliberately test its output on a variety of input variables. We can then tune the model's parameters to improve the final result.
Clustering is the task of grouping unlabeled data into multiple groups (or 'clusters') based on their similarities and differences. Data points with the most similar features are clustered together. Two of the most well-known unsupervised clustering algorithms are K-Means clustering and hierarchical clustering.
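A minimal K-Means sketch illustrates the idea; the data points and the farthest-point initialization are illustrative choices, not part of the standard algorithm description.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal K-Means: assign points to the nearest centroid, recompute centroids."""
    # Farthest-point initialization so the k starting centroids are spread out
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # nearest-centroid assignment
        for j in range(k):                        # move centroids to cluster means
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups of points
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
# points 0-2 share one label and points 3-5 the other
```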
Association
Association is an unsupervised machine learning technique used for discovering relations between variables. Association learning is commonly used in market basket analysis, in which the algorithm tries to find relationships between products. For example, 90 percent of customers who buy product A also buy product B. Such hidden insights and patterns are incredibly useful for marketing purposes, boosting a company's sales.
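The "90 percent" figure in the example is the rule's confidence. A toy computation over hypothetical transactions:

```python
# Hypothetical market-basket transactions (sets of purchased items)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"A", "C"},
    {"B", "D"},
]

# Support of {A, B}: fraction of all transactions containing both A and B
both = sum(1 for t in transactions if {"A", "B"} <= t)
support = both / len(transactions)

# Confidence of the rule A -> B: of the customers who buy A, how many also buy B
has_a = sum(1 for t in transactions if "A" in t)
confidence = both / has_a
# here 3 of the 4 baskets containing A also contain B, so confidence = 0.75
```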
1. Sentiment Analysis
Sentiment analysis is the process of clustering different sentences depending on the semantic meaning that they hold. In sentiment analysis, a sentence can be labeled as positive, neutral, or negative depending on the writer's attitude toward a certain topic. For example, take the sentence "I love the rain". This sentence shows that the writer holds a positive attitude towards a certain topic, the rain. The sentiment of a given sentence can be identified using a list of keywords such as love, hate, like, dislike, etc. [TowardsDataScience, Unsupervised sentiment analysis]
Sentiment analysis is incredibly useful in real-world applications, as it is heavily implemented in social media apps in order to detect and eliminate hate speech.
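The keyword-list approach described above can be sketched in a few lines; the word lists are illustrative, not a real sentiment lexicon:

```python
# Tiny hand-made keyword lists (illustrative only)
POSITIVE = {"love", "like", "great", "enjoy"}
NEGATIVE = {"hate", "dislike", "awful", "terrible"}

def keyword_sentiment(sentence):
    """Label a sentence positive/negative/neutral by counting keyword hits."""
    words = sentence.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("I love the rain"))   # positive
```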
2. Speech Recognition
Speech recognition is the ability of a machine learning model to extract meaning from human speech. Such models take as input an audio recording of human speech and decode it, extracting all relevant information. While supervised speech recognition models offer great precision, unsupervised speech recognition enables us to generate precise predictions on never-before-seen data sets.
Apple's Siri and Amazon's Alexa are two of the most popular speech recognition applications.
3. Artificial Intelligence Chatbots
AI chatbots are being heavily implemented in nearly every business and government sector nowadays.
Such chatbots are capable of providing users with human-like interactions, answering questions,
providing assistance, and more. Some companies that infuse chatbots into their services include Lyft,
Spotify, and Starbucks.
There exist three generations of chatbots. The first generation was based on written rules. The
programmer provided a specific list of answers for a specific list of questions. Moving on to the second
generation, chatbots were infused with artificial intelligence, starting with supervised learning. Such chatbots were trained on massive amounts of labeled user chats. Second-generation, or AI, chatbots provided a far more dynamic answering mechanism. As you may have guessed, the third and final chatbot generation integrates unsupervised learning into its training. Such models are trained on even bigger data sets that are unlabeled. Third-generation chatbots offer all the advantages of the first generation while also having additional capacity to handle trickier and more complex situations.
[Rulia, The 3 Different Generations Of Chatbot Technology]
1. Cancer Diagnosis
Computers can recognize anomalies in a medical scan using unsupervised learning in computer vision. The model is initially provided with a massive quantity of unlabeled images that include both healthy and cancer-positive inputs. The model is then able to examine and contrast various images, correctly detecting variations between them. As a result, the model can later determine whether a particular scan contains a tumor. Even in situations where there are several different tumor types, the model will be able to distinguish and label each form of tumor on its own. Since we did not identify each tumor type up front, it is important to note that the model will assign each kind a number to identify it.
2. X-ray Diagnosis
Similar to the cancer diagnosis model, the x-ray diagnosis model is fed a multitude of X-ray scans. Any
irregularities in the image can then be detected by the model. These models are excellent at seeing minute
anomalies that doctors would overlook. We can use AI to find anomalies in the heart, lungs, pleura,
mediastinum, bones, and diaphragm.
It is important to keep in mind that this model is still in its early stages of development. Therefore, X-ray imaging will still require the occasional doctor checkup.
Conclusion
To answer the age-old question, which is superior: supervised or unsupervised learning? The answer is
that it depends.
While some machine learning practitioners may prefer supervised learning algorithms, since they are easier to use and in most cases return more accurate results, unsupervised learning also has its advantages, such as being more resistant to overfitting and better suited to complex and unstructured data.
In some cases, the user would be unsure where to start looking for hidden insights in a given data set,
making unsupervised learning approaches extremely useful in such cases. Furthermore, while supervised
learning data sets are much smaller in size than unsupervised data sets, they are far more difficult to
collect and maintain. This is because each data point must be manually checked and labeled separately. A
process like this could take months or even years to complete. Unsupervised data, on the other hand, has
no definite structure and does not require labeling.
Thus, whether supervised or unsupervised learning is used is highly dependent on the problem at hand.
Additional structure can be included in the net to force it to make a definitive decision. The mechanism by which this can be accomplished is called competition. The most extreme form of competition among a group of neurons is called Winner-Take-All, where only one neuron (the winner) in the group will have a nonzero output signal when the competition is completed.
In these competitive networks the weights remain fixed, even during the training process. The idea of competition is used among neurons to enhance the contrast in their activations. Two such networks are discussed here: the Maxnet and the Hamming network.
Maxnet
The Maxnet network was developed by Lippmann in 1987. Maxnet serves as a subnet for picking the node whose input is largest. All the nodes in this subnet are fully interconnected, and symmetrical weights are present on all these weighted interconnections.
Architecture of Maxnet
In the architecture of Maxnet, fixed symmetrical weights are present over the weighted interconnections. The weights between the neurons are inhibitory and fixed. A Maxnet with this structure can be used as a subnet to select the particular node whose net input is the largest.
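A minimal Maxnet iteration, assuming n nodes with a mutual inhibitory weight ε (ε must be smaller than 1/n for this update to behave well); the initial activations are made up:

```python
import numpy as np

def maxnet(a, eps=0.15, max_iter=100):
    """Winner-take-all by mutual inhibition: each node is inhibited by the others."""
    a = np.array(a, dtype=float)
    for _ in range(max_iter):
        # a_j(new) = f(a_j - eps * sum of the other activations), f = ramp at 0
        a = np.maximum(0.0, a - eps * (a.sum() - a))
        if np.count_nonzero(a) <= 1:   # competition is over: one winner remains
            break
    return a

a = maxnet([0.2, 0.4, 0.6, 0.8])
# only the node with the largest initial input keeps a nonzero activation
```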
Counterpropagation networks (CPN) were proposed by Hecht-Nielsen in 1987. They are multilayer networks based on combinations of input, output, and clustering layers. Applications of counterpropagation nets include data compression, function approximation, and pattern association. The counterpropagation network is basically constructed from an instar-outstar model. This model is a three-layer neural network that performs input-output data mapping, producing an output vector y in response to an input vector x, on the basis of competitive learning. The three layers in an instar-outstar model are the input layer, the hidden (competitive) layer, and the output layer.
There are two stages involved in the training process of a counterpropagation net. The input vectors are clustered in the first stage. In the second stage of training, the weights from the cluster-layer units to the output units are tuned to obtain the desired response.
If the Euclidean distance method is used, find the cluster unit Zj whose squared distance from the input vector is the smallest.
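The winning cluster unit can be found by computing the squared Euclidean distance from the input to each unit's weight vector; the numbers below are made up:

```python
import numpy as np

x = np.array([0.8, 0.2])                     # input vector
V = np.array([[0.9, 0.1],                    # weight vector of cluster unit Z1
              [0.2, 0.7],                    # Z2
              [0.5, 0.5]])                   # Z3

# D(j) = sum_i (x_i - v_ij)^2 ; the winner is the unit with the smallest D
D = ((x - V) ** 2).sum(axis=1)
winner = D.argmin()
# D = [0.02, 0.61, 0.18], so Z1 (index 0) wins
```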
The F1 layer accepts the inputs, performs some processing, and transfers them to the F2 layer, which best matches the classification factor. There exist two sets of weighted interconnections for controlling the degree of similarity between the units in the F1 and F2 layers. The F2 layer is a competitive layer: the cluster unit with the largest net input becomes the candidate to learn the input pattern first, and the rest of the F2 units are ignored. The reset unit decides whether or not the cluster unit is allowed to learn the input pattern, depending on how similar its top-down weight vector is to the input vector. This is called the vigilance test. Thus we can say that the vigilance parameter helps to incorporate new memories or new information. Higher vigilance produces more detailed memories; lower vigilance produces more general memories.
Generally two types of learning exist: slow learning and fast learning. In fast learning, the weight update during resonance occurs rapidly; it is used in ART1. In slow learning, the weight change occurs slowly relative to the duration of a learning trial; it is used in ART2.
Advantages of Adaptive Resonance Theory (ART)
It exhibits stability and is not disturbed by a wide variety of inputs provided to the network.
It can be integrated and used with various other techniques to give better results.
It can be used in various fields such as mobile robot control, face recognition, land cover classification, target recognition, medical diagnosis, signature verification, clustering web users, etc.
It has advantages over competitive learning (such as BPNN), which lacks the capability to add new clusters when deemed necessary and does not guarantee stability in forming clusters.
Limitations of Adaptive Resonance Theory
Some ART networks are inconsistent (such as Fuzzy ART and ART1), as they depend upon the order of the training data or upon the learning rate.
Special Networks
In deep learning, "special networks" generally refers to specialized network architectures designed for specific tasks. Some popular specialized architectures used in deep learning are:
1. Convolutional Neural Networks (CNNs): CNNs are commonly used for image and video
processing tasks. They are designed to process data with a grid-like structure, such as images, by
using convolutional layers that capture local patterns and hierarchical representations.
2. Recurrent Neural Networks (RNNs): RNNs are used for sequential data processing tasks, such as
natural language processing and speech recognition. RNNs have feedback connections that allow
them to retain information from previous inputs, making them suitable for tasks with temporal
dependencies.
3. Generative Adversarial Networks (GANs): GANs consist of two neural networks, a generator and
a discriminator, that are trained together in a competitive manner. GANs are commonly used for
generative tasks, such as generating realistic images, by learning to capture the underlying
distribution of the training data.
4. Transformer Networks: Transformers have gained popularity in natural language processing
tasks, especially for machine translation and text generation. They rely on self-attention
mechanisms to capture the relationships between different words or tokens in a sequence,
enabling them to model long-range dependencies effectively.
5. Autoencoders: Autoencoders are neural networks used for unsupervised learning and
dimensionality reduction tasks. They are composed of an encoder network that compresses the
input data into a latent representation and a decoder network that reconstructs the original input
from the latent space.
These are just a few examples of specialized network architectures used in deep learning. There are many
other architectures and variations tailored to specific tasks and domains, such as object detection, speech
synthesis, and reinforcement learning.
When it comes to machine learning, artificial neural networks perform really well. Neural networks are used on various kinds of data, such as images, audio, and text. Different types of neural networks are used for different purposes: for example, for predicting a sequence of words we use a Recurrent Neural Network (more precisely, an LSTM), while for image classification we use a Convolutional Neural Network. In this section, we build the basic building block of a CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The number of
neurons in this layer is equal to the total number of features in our data (number of
pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers depending on our model and data size. Each hidden layer can have a different number of neurons, generally greater than the number of features. The output of each layer is computed by matrix multiplication of the previous layer's output with the learnable weights of that layer, followed by the addition of learnable biases and an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function
like sigmoid or softmax which converts the output of each class into the probability
score of each class.
Feeding the data into the model and obtaining the output of each layer as in the above steps is called feedforward. We then calculate the error using an error function; some common error functions are cross-entropy, squared loss error, etc. The error function measures how well the network is performing. After that, we backpropagate through the model by calculating the derivatives. This step, called backpropagation, is used to minimize the loss.
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, represented vertically. Now slide that neural network across the whole image; as a result, we will get another image with different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but less width and height. This operation is called convolution. If the patch size is the same as that of the image, it is a regular neural network. Because of this small patch, we have fewer weights.
As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a
result, we’ll get output volume having a depth equal to the number of filters. The network will
learn all the filters.
Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a ConvNet. A ConvNet is a sequence of layers, and every layer transforms one volume to another through a differentiable function.
Types of layers:
Let's take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
Input Layer: This is the layer in which we give input to our model. In a CNN, the input is generally an image or a sequence of images. This layer holds the raw input image with width 32, height 32, and depth 3.
Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer (with padding so the spatial size is preserved); we'll get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. This layer applies an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU (max(0, x)), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume has dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted in the ConvNet. Its main function is to reduce the size of the volume, which makes the computation fast, reduces memory, and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.
Fully Connected Layers: The pooled feature maps are flattened into a one-dimensional vector and fed into one or more fully connected layers, which combine the extracted features to compute the final class scores.
Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks, such as sigmoid or softmax, which converts the output of each class into the probability score of each class.
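The shape arithmetic in the layer walkthrough can be verified with a naive convolution and max-pool; the random image and kernels are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))            # one channel of the 32 x 32 input

def conv2d_same(img, kernel):
    """Naive 'same'-padded convolution: output keeps the input's height/width."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + k, j:j + k]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return out

kernels = rng.normal(size=(12, 3, 3))        # 12 filters of shape 3 x 3
feature_maps = np.stack([conv2d_same(image, k) for k in kernels], axis=-1)
# feature_maps.shape == (32, 32, 12), as in the walkthrough

def maxpool2x2(vol):
    """2 x 2 max pooling with stride 2 halves the height and width."""
    H, W, C = vol.shape
    return vol.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

pooled = maxpool2x2(feature_maps)
# pooled.shape == (16, 16, 12)
```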
The working of an RNN can be written as
Y = f(X, h; W, U, V, b, c)
Here S is the state matrix, with element si the state of the network at timestep i. The parameters W, U, V, b, c are shared across timesteps.
The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step. Each unit has an internal state, called the hidden state of the unit. This hidden state signifies the past knowledge that the network currently holds at a given time step, and it is updated at every time step to reflect the change in the network's knowledge about the past. The hidden state is updated using the following recurrence relation.
The formula for calculating the current state:
ht = f(ht−1, xt)
where:
ht -> current state
ht−1 -> previous state
xt -> input at the current step
The formula for applying the activation function (tanh):
ht = tanh(Whh · ht−1 + Wxh · xt)
where:
Whh -> weight matrix at the recurrent neuron
Wxh -> weight matrix at the input neuron
The output is then calculated as yt = Why · ht.
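One step of this recurrence can be sketched directly; the sizes and random weights are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: 3-dimensional input, 4-dimensional hidden state, 2 outputs
W_hh = 0.1 * rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights
W_xh = 0.1 * rng.normal(size=(4, 3))   # input-to-hidden weights
W_hy = 0.1 * rng.normal(size=(2, 4))   # hidden-to-output weights

def rnn_step(h_prev, x):
    h = np.tanh(W_hh @ h_prev + W_xh @ x)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y = W_hy @ h                            # y_t = W_hy h_t
    return h, y

h = np.zeros(4)                             # initial hidden state
xs = rng.normal(size=(5, 3))                # a sequence of 5 input vectors
for x in xs:                                # the same weights are shared each step
    h, y = rnn_step(h, x)
```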
Advantages of Recurrent Neural Network
1. An RNN remembers each piece of information through time. This ability to remember previous inputs is what makes it useful in time series prediction; the variant built for long-range memory is called Long Short-Term Memory (LSTM).
2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation function.
Applications of Recurrent Neural Network
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face detection
5. Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
One to One
One to Many
Many to One
Many to Many
One to One
This type of RNN behaves like a simple neural network and is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence of multiple words.
Many to One
In this type of network, many inputs are fed to the network at several states, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem.
One Example of this Problem will be language translation. In language translation, we provide multiple
words from one language as input and predict multiple words from the second language as output.
It has been noticed that most mainstream neural nets can be easily fooled into misclassifying things by adding only a small amount of noise to the original data. Surprisingly, the model after adding noise has higher confidence in the wrong prediction than it had when it predicted correctly. The reason for such an adversary is that most machine learning models learn from a limited amount of data, which is a huge drawback, as it makes them prone to overfitting. Also, the mapping between the input and the output is almost linear. Although it may seem that the boundaries of separation between the various classes are linear, in reality they are composed of linearities, and even a small change in a point in feature space might lead to misclassification of the data.
How do GANs work?
Generative Adversarial Networks (GANs) can be broken down into three parts:
Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.
Adversarial: The training of a model is done in an adversarial setting.
Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training
purposes.
In GANs, there is a Generator and a Discriminator. The Generator generates fake samples of data(be it
an image, audio, etc.) and tries to fool the Discriminator. The Discriminator, on the other hand, tries to
distinguish between the real and fake samples. The Generator and the Discriminator are both Neural
Networks and they both run in competition with each other in the training phase. The steps are repeated
several times and in this, the Generator and Discriminator get better and better in their respective jobs
after each repetition. The work can be visualized by the diagram given below:
Here, the generative model captures the distribution of the data and is trained in such a manner that it tries to maximize the probability of the Discriminator making a mistake. The Discriminator, on the other hand, is based on a model that estimates the probability that the sample it received came from the training data and not from the Generator. The GANs are formulated as a minimax game, where the Discriminator is trying to maximize its reward V(D, G) and the Generator is trying to minimize the Discriminator's reward, or in other words, maximize its loss. It can be mathematically described by the formula below:

min_G max_D V(D, G) = E_x∼pdata(x)[log D(x)] + E_z∼pz(z)[log(1 − D(G(z)))]
where,
G = Generator
D = Discriminator
pdata(x) = distribution of real data
pz(z) = distribution of the generator's input noise
x = sample from pdata(x)
z = sample from pz(z)
D(x) = Discriminator's output for x
G(z) = Generator's output for noise z
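The value function V(D, G) can be evaluated directly for toy discriminator outputs; all the probabilities below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical discriminator outputs: D(x) on real samples, D(G(z)) on fakes
D_real = np.array([0.9, 0.8, 0.95])    # D wants these close to 1
D_fake = np.array([0.1, 0.2, 0.05])    # D wants these close to 0

# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
V = np.mean(np.log(D_real)) + np.mean(np.log(1 - D_fake))

# A confident discriminator keeps V close to 0 (from below); a fooled one,
# with D(G(z)) near 1, drives the second term strongly negative, which is
# exactly what the generator is trying to achieve.
```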
Generator Model
The Generator is trained while the Discriminator is kept idle. Using the Discriminator's predictions on
the fake data the Generator produced, the Generator is updated so that it improves on its previous state
and gets better at fooling the Discriminator.
Discriminator Model
The Discriminator is trained while the Generator is kept idle; the Generator is only forward
propagated, and no back-propagation is done through it in this phase. The Discriminator is trained on
real data for n epochs to see whether it can correctly predict them as real, and it is also trained on the
fake data generated by the Generator to see whether it can correctly predict them as fake.
Deep Convolutional GAN (DCGAN): One of the most popular and successful GAN architectures,
implemented with ConvNets in place of multi-layer perceptrons. The ConvNets are implemented
without max pooling, which is in fact replaced by convolutional stride. Also, the layers are not fully
connected.
Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image
representation consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency
residual. This approach uses multiple Generator and Discriminator networks at different levels of the
Laplacian pyramid. This approach is mainly used because it produces very
high-quality images. The image is down-sampled at first at each layer of the pyramid and then it is
again up-scaled at each layer in a backward pass where the image acquires some noise from the
Conditional GAN at these layers until it reaches its original size.
Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of designing a GAN in
which a deep neural network is used along with an adversarial network in order to produce higher-
resolution images. This type of GAN is particularly useful in optimally up-scaling native low-resolution
images to enhance their details while minimizing errors in doing so.
The encoders each convert their input into another sequence of vectors called encodings. The decoders do
the reverse: they convert the encodings back into a sequence of probabilities of different output words.
The output probabilities can be converted into another natural language sentence using the softmax
function.
Each encoder and decoder contains a component called the attention mechanism, which allows the
processing of one input word to include relevant data from certain other words, while masking the words
which do not contain relevant information.
Because this must be calculated many times, we implement multiple attention mechanisms in parallel,
taking advantage of the parallel computing offered by GPUs. This is called the multi-head attention
mechanism. The ability to pass multiple words through a neural network simultaneously is one advantage
of transformers over LSTMs and RNNs.
The architecture of a transformer neural network. In the original paper, there were 6 encoders chained to 6
decoders.
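The attention mechanism described above can be sketched in a few lines of plain Python (the helper names `softmax` and `attention` and the tiny Q, K, V matrices are illustrative, not from any specific library). Each query is compared against every key, the scaled scores become weights via softmax, and the output is the weighted sum of the values:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: softmax(Q.K^T / sqrt(d)) . V
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # how strongly each word is attended to
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Two 2-D encodings; each query lines up with the matching key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

In multi-head attention, several independent copies of this computation run in parallel and their outputs are concatenated.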
Autoencoders
What is an autoencoder?
An autoencoder is a type of artificial neural network used to learn data encodings in an unsupervised
manner.
1. Encoder: A module that compresses the train-validate-test set input data into an encoded
representation that is typically several orders of magnitude smaller than the input data.
2. Bottleneck: A module that contains the compressed knowledge representations and is therefore the
most important part of the network.
3. Decoder: A module that helps the network “decompress” the knowledge representations and
reconstructs the data back from its encoded form. The output is then compared with a ground truth.
Encoder
The encoder is a set of convolutional blocks followed by pooling modules that compress the input to
the model into a compact section called the bottleneck.
The bottleneck is followed by the decoder that consists of a series of upsampling modules to bring the
compressed feature back into the form of an image. In case of simple autoencoders, the output is
expected to be the same as the input data with reduced noise.
However, for variational autoencoders it is a completely new image, formed with information the
model has been provided as input.
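A minimal sketch of this encoder–bottleneck–decoder idea, assuming purely linear maps and plain gradient descent (no convolutions and no framework; all names here are illustrative): 2-D points lying on a line are squeezed through a 1-D bottleneck and reconstructed, and the reconstruction loss falls during training.

```python
import random

random.seed(0)

# Toy data that lies on a 1-D line inside 2-D space: (x, 2x).
data = [(x, 2 * x) for x in [0.1, 0.2, 0.3, 0.4, 0.5]]

# Encoder and decoder are each a single linear map; the bottleneck is 1-D.
we = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
wd = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]

def loss():
    total = 0.0
    for x1, x2 in data:
        h = we[0] * x1 + we[1] * x2      # bottleneck code
        r1, r2 = wd[0] * h, wd[1] * h    # reconstruction
        total += (r1 - x1) ** 2 + (r2 - x2) ** 2
    return total

initial = loss()
lr = 0.05
for _ in range(2000):
    for x1, x2 in data:
        h = we[0] * x1 + we[1] * x2
        r1, r2 = wd[0] * h, wd[1] * h
        e1, e2 = r1 - x1, r2 - x2
        # Gradients of the squared reconstruction error w.r.t. each weight.
        g_wd = [2 * e1 * h, 2 * e2 * h]
        g_h = 2 * e1 * wd[0] + 2 * e2 * wd[1]
        g_we = [g_h * x1, g_h * x2]
        wd = [wd[i] - lr * g_wd[i] for i in range(2)]
        we = [we[i] - lr * g_we[i] for i in range(2)]

print(initial, loss())
```

Because the data is exactly one-dimensional, a 1-D bottleneck suffices for near-perfect reconstruction; real autoencoders stack nonlinear layers around the same skeleton.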
Bottleneck
The most important part of the neural network, and ironically the smallest one, is the bottleneck. The
bottleneck exists to restrict the flow of information to the decoder from the encoder, thus,allowing only
the most vital information to pass through.
Since the bottleneck is designed in such a way that the maximum information possessed by an image is
captured in it, we can say that the bottleneck helps us form a knowledge-representation of the input.
Thus, the encoder-decoder structure helps us extract the most from an image in the form of data and
establish useful correlations between various inputs within the network.
A bottleneck as a compressed representation of the input further prevents the neural network from
memorising the input and overfitting on the data.
As a rule of thumb, remember this: The smaller the bottleneck, the lower the risk of overfitting.
However—
Very small bottlenecks would restrict the amount of information storable, which increases the chances
of important information slipping out through the pooling layers of the encoder.
Decoder
Finally, the decoder is a set of upsampling and convolutional blocks that reconstructs the bottleneck's
output.
Since the input to the decoder is a compressed knowledge representation, the decoder serves as a
“decompressor” and builds back the image from its latent attributes.
1. Code size: The code size or the size of the bottleneck is the most important hyperparameter
used to tune the autoencoder. The bottleneck size decides how much the data has to be
compressed. This can also act as a regularisation term.
2. Number of layers: Like all neural networks, the depth of the encoder and the decoder is an
important hyperparameter to tune. A higher depth increases model complexity, while a lower
depth is faster to process.
3. Number of nodes per layer: The number of nodes per layer defines the weights we use per
layer. Typically, the number of nodes decreases with each subsequent layer in the autoencoder
as the input to each of these layers becomes smaller across the layers.
4. Reconstruction Loss: The loss function we use to train the autoencoder is highly dependent on
the type of input and output we want the autoencoder to adapt to. If we are working with image
data, the most popular loss functions for reconstruction are MSE Loss and L1 Loss. In case the
inputs and outputs are within the range [0,1], as in MNIST, we can also make use of Binary
Cross Entropy as the reconstruction loss.
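The two reconstruction losses named above can be written out directly; a short plain-Python sketch (variable names are illustrative):

```python
import math

def mse_loss(x, x_hat):
    # Mean squared error between the input x and the reconstruction x_hat.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def bce_loss(x, x_hat):
    # Binary cross-entropy; valid when values lie in [0, 1], as in MNIST.
    eps = 1e-12  # guard against log(0)
    return -sum(a * math.log(b + eps) + (1 - a) * math.log(1 - b + eps)
                for a, b in zip(x, x_hat)) / len(x)

x = [0.0, 1.0, 0.5]        # original input
perfect = [0.0, 1.0, 0.5]  # exact reconstruction
rough = [0.3, 0.6, 0.5]    # imperfect reconstruction
print(mse_loss(x, perfect), mse_loss(x, rough))
print(bce_loss(x, perfect), bce_loss(x, rough))
```

Both losses penalize the rough reconstruction more heavily than the exact one, which is what drives the autoencoder's weights during training.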
Unit 3
Introduction to Deep Learning
Deep learning is a branch of machine learning that is based on artificial neural networks. It is capable
of learning complex patterns and relationships within data. In deep learning, we do not need to explicitly
program everything. It has become increasingly popular in recent years due to advances in
processing power and the availability of large datasets. It is based on artificial neural networks
(ANNs), also known as deep neural networks (DNNs). These neural networks are inspired by the
structure and function of the human brain's biological neurons, and they are designed to learn from
large amounts of data.
1. Deep Learning is a subfield of Machine Learning that involves the use of neural networks to model
and solve complex problems. Neural networks are modeled after the structure and function of the
human brain and consist of layers of interconnected nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks, which have multiple
layers of interconnected nodes. These networks can learn complex representations of data by
discovering hierarchical patterns and features in the data. Deep Learning algorithms can
automatically learn and improve from data without the need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including image recognition,
natural language processing, speech recognition, and recommendation systems. Some of the
popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent
Neural Networks (RNNs), and Deep Belief Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and computational
resources. However, the availability of cloud computing and the development of specialized
hardware, such as Graphics Processing Units (GPUs), has made it easier to train deep neural
networks.
In summary, Deep Learning is a subfield of Machine Learning that involves the use of deep neural
networks to model and solve complex problems. Deep Learning has achieved significant success in
various fields, and its use is expected to continue to grow as more data becomes available, and more
powerful computing resources become available.
What is Deep Learning?
Deep learning is the branch of machine learning which is based on artificial neural network
architecture. An artificial neural network or ANN uses layers of interconnected nodes called neurons
that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the input
layer. The output of one neuron becomes the input to other neurons in the next layer of the network,
and this process continues until the final layer produces the output of the network. The layers of the
neural network transform the input data through a series of nonlinear transformations, allowing the
network to learn complex representations of the input data.
Today Deep learning has become one of the most popular and visible areas of machine learning, due to
its success in a variety of applications, such as computer vision, natural language processing, and
Reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as reinforcement machine learning,
and it processes each of these in a different way.
Supervised Machine Learning: Supervised machine learning is the machine learning technique in
which the neural network learns to make predictions or classify data based on labeled datasets.
Here we provide the input features along with the target variables. The neural network learns to
make predictions based on the cost or error that comes from the difference between the predicted
and the actual target; this process is known as backpropagation. Deep learning algorithms like
convolutional neural networks and recurrent neural networks are used for many supervised tasks
such as image classification and recognition, sentiment analysis, and language translation.
Unsupervised Machine Learning: Unsupervised machine learning is the machine
learning technique in which the neural network learns to discover patterns or to cluster the
dataset based on unlabeled datasets. Here there are no target variables; the machine has to
discover the hidden patterns or relationships within the datasets on its own. Deep learning algorithms
like autoencoders and generative models are used for unsupervised tasks like clustering,
dimensionality reduction, and anomaly detection.
Reinforcement Machine Learning: Reinforcement machine learning is the machine
learning technique in which an agent learns to make decisions in an environment to maximize a
reward signal. The agent interacts with the environment by taking actions and observing the
resulting rewards. Deep learning can be used to learn policies, or sets of actions, that maximize
the cumulative reward over time. Deep reinforcement learning algorithms like Deep Q-Networks
(DQN) and Deep Deterministic Policy Gradient (DDPG) are used for tasks like robotics and game
playing.
Difference between Machine Learning and Deep Learning:
Machine learning and deep learning are both subsets of artificial intelligence, with many
similarities but also important differences between them.
Machine Learning: Can work on a smaller amount of data; takes less time to train the model.
Deep Learning: Requires a larger volume of data; takes more time to train the model.
Computer vision:
In computer vision, deep learning models enable machines to identify and understand visual
data. Some of the main applications of deep learning in computer vision include:
Object detection and recognition: Deep learning models can be used to identify and
locate objects within images and videos, making it possible for machines to perform
tasks such as self-driving cars, surveillance, and robotics.
Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such as
medical imaging, quality control, and image retrieval.
Image segmentation: Deep learning models can be used to segment images into
different regions, making it possible to identify specific features within images.
Natural language processing (NLP):
In NLP, deep learning models enable machines to understand and generate
human language. Some of the main applications of deep learning in NLP include:
Automatic text generation: Deep learning models can learn from a corpus of text, and
new text such as summaries or essays can be automatically generated using these
trained models.
Language translation: Deep learning models can translate text from one language
to another, making it possible to communicate with people from different linguistic
backgrounds.
Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or
neutral. This is used in applications such as customer service, social media
monitoring, and political analysis.
Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion, voice
search, and voice-controlled devices.
Reinforcement learning:
In reinforcement learning, deep learning works as training agents to take action in an
environment to maximize a reward. Some of the main applications of deep learning in
reinforcement learning include:
Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess, and Atari.
Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.
Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.
Challenges in Deep Learning
Deep learning has made significant advancements in various fields, but there are still
some challenges that need to be addressed. Here are some of the main challenges in
deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and
gathering enough data for training is a major concern.
We are still making use of the gradient descent optimization algorithm, which acts to minimize the
error of our model by iteratively moving in the direction of steepest descent, the direction that
updates the parameters of our model while ensuring minimal error. It updates the weights of every
layer in the model. We will talk more about optimization algorithms and backpropagation later.
The subsequent training of our neural network amounts to learning to separate our data samples
with some decision boundary.
"The process of receiving an input to produce some kind of output to make some kind of prediction
is known as Feed Forward." Feed Forward neural network is the core of many other important
neural networks such as convolution neural network.
In the feed-forward neural network, there are no feedback loops or connections in the network.
There is simply an input layer, a hidden layer, and an output layer.
So, what we will do is use our non-linear model to produce an output that describes the probability
of the point being in the positive region. The point is represented by the coordinates (2, 2). Along
with the bias, we represent the input as shown.
Recall the first linear model in the hidden layer and the equation that defined it.
In the first layer, to obtain the linear combination, the inputs are multiplied by -4 and -1,
and the bias value is multiplied by twelve.
The weights of the inputs are multiplied by -1/5 and 1, and the bias is multiplied by three to obtain
the linear combination of that same point in our second model.
Now, to obtain the probability that the point is in the positive region relative to both models, we
apply the sigmoid to both outputs:
The second layer contains the weights that dictate the combination of the linear models in the
first layer to obtain the non-linear model in the second layer. The weights are 1.5 and 1, with a bias
value of 0.5.
Now, we multiply our probabilities from the first layer by the second set of weights:
This is the complete math behind the feed-forward process, where the inputs from the input layer
traverse the entire depth of the neural network. In this example, there is only one hidden layer.
Whether there is one hidden layer or twenty, the computational process is the same for all hidden
layers.
Back-Propagation
Backpropagation is one of the important concepts of a neural network. Our task is to classify
our data as well as possible. For this, we have to update the weights and biases, but how can we do
that in a deep neural network? In the linear regression model, we use gradient descent to
optimize the parameters. Similarly, here we also use a gradient descent algorithm, via
backpropagation.
For a single training example, Backpropagation algorithm calculates the gradient of the error
function. Backpropagation can be written as a function of the neural network. Backpropagation
algorithms are a set of methods used to efficiently train artificial neural networks following a
gradient descent approach which exploits the chain rule.
The main feature of backpropagation is that it is an iterative, recursive, and efficient method for
calculating the weight updates that improve the network until it is able to perform the task for
which it is being trained. Backpropagation requires the derivatives of the activation function to be
known at network design time.
Now, how is the error function used in backpropagation, and how does backpropagation work? Let
us start with an example and do it mathematically to understand exactly how the weights are
updated using backpropagation.
Input values
X1=0.05
X2=0.10
Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1, we first multiply the input values by the weights:
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
Passing these net inputs through the sigmoid activation gives the hidden-layer outputs:
H1=1/(1+e^(-0.3775))=0.593269992
H2=1/(1+e^(-0.3925))=0.596884378
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we multiply the hidden-layer outputs H1 and H2 by the
weights:
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
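The forward pass above can be reproduced in a few lines. Note that, as in the standard version of this worked example, the output-layer net inputs y1 and y2 are also passed through the sigmoid; that is what yields the total error 0.298371109 quoted later in the text.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Inputs, weights, and biases from the worked example.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Hidden layer: net input, then sigmoid activation.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # sigmoid(0.3775)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # sigmoid(0.3925)

# Output layer: net input, then sigmoid activation.
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# Total error against the targets T1 = 0.01, T2 = 0.99.
e_total = 0.5 * (0.01 - y1) ** 2 + 0.5 * (0.99 - y2) ** 2
print(h1, h2, y1, y2, e_total)
```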
Passing y1 and y2 through the sigmoid activation gives the final outputs:
out_y1=1/(1+e^(-1.10590597))=0.75136507
out_y2=1/(1+e^(-1.2249214))=0.772928465
Our target values are 0.01 and 0.99, so the outputs do not match the targets T1 and T2.
Now, we find the total error, which is simply the sum of the squared differences between the
outputs and the target outputs:
Etotal=½(T1−out_y1)²+½(T2−out_y2)²
Etotal=½(0.01−0.75136507)²+½(0.99−0.772928465)²
Etotal=0.274811083+0.023560026=0.298371109
Now, we will backpropagate this error to update the weights using a backward pass.
From equation (2), it is clear that we cannot partially differentiate it with respect to w5, because
w5 does not appear in it. We split equation (1) into multiple terms so that we can easily differentiate
it with respect to w5.
Now, we calculate each term one by one to differentiate Etotal with respect to w5.
So, we put these values into equation (3) to find the final result.
Now, we will calculate the updated weight w5new with the help of the following formula:
In the same way, we calculate w6new,w7new, and w8new and this will give us the following values
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
From equation (2), it is clear that we cannot partially differentiate it with respect to w1, because
w1 does not appear in it. We split equation (1) into multiple terms so that we can easily differentiate
it with respect to w1.
Now, we calculate each term one by one to differentiate Etotal with respect to w1.
We again split both terms, because there are no y1 and y2 terms in E1 and E2. We split them as
follows:
Now, we find the value by putting the values into equations (18) and (19):
We calculate the partial derivative of the total net input to H1 with respect to w1 the same as we
did for the output neuron:
So, we put these values into equation (13) to find the final result.
Now, we will calculate the updated weight w1new with the help of the following formula:
In the same way, we calculate w2new, w3new, and w4new, which gives us the following values:
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have updated all the weights. We found the error 0.298371109 on the network when we fed
forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total error is down to
0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At
this point, the output neurons generate 0.015912196 and 0.984065734, i.e., close to our target
values, when we feed forward 0.05 and 0.1.
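The update of w5 can be checked in code. The learning rate η = 0.5 is an assumption (the text does not state it), but it reproduces the updated weight w5new = 0.35891648 given above.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Forward-pass values from the worked example.
h1 = sigmoid(0.05 * 0.15 + 0.10 * 0.20 + 0.35)   # 0.593269992
h2 = sigmoid(0.05 * 0.25 + 0.10 * 0.30 + 0.35)   # 0.596884378
y1 = sigmoid(h1 * 0.40 + h2 * 0.45 + 0.60)        # 0.75136507

# Chain rule for dEtotal/dw5:
#   dE/dout_y1 = (out_y1 - T1), dout_y1/dnet = out_y1 * (1 - out_y1),
#   dnet/dw5 = h1
grad_w5 = (y1 - 0.01) * y1 * (1 - y1) * h1

eta = 0.5  # learning rate (assumed; it reproduces the values in the text)
w5_new = 0.40 - eta * grad_w5
print(w5_new)
```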
In addition to traditional gradient-based learning, there are several differentiation algorithms and
techniques used in deep learning to train neural networks more effectively. Here are some
notable ones:
1. Stochastic Gradient Descent with Momentum (SGD with Momentum): This algorithm
enhances the standard SGD by incorporating momentum. Momentum helps accelerate
gradient descent by accumulating a weighted average of past gradients and using it to
update the parameters. This momentum term reduces oscillations and helps the optimizer
navigate flat or shallow regions more efficiently.
2. Adaptive Learning Rate Methods: These algorithms dynamically adjust the learning rate
during training based on the gradient information or historical update statistics. Some
popular methods include:
a. Adam (Adaptive Moment Estimation): Adam combines the advantages of adaptive learning
rates and momentum. It adapts the learning rate for each parameter based on estimates of first-
order moments (mean) and second-order moments (variance) of the gradients.
b. RMSprop (Root Mean Square Propagation): RMSprop adjusts the learning rate for each
parameter by dividing it by a moving average of the root mean square of past gradients. This
technique helps in controlling the learning rate based on the magnitude of the gradients.
c. Adagrad (Adaptive Gradient): Adagrad adapts the learning rate for each parameter by scaling
it inversely proportional to the cumulative sum of the historical squared gradients. This method
gives larger updates for parameters with infrequent updates and smaller updates for frequently
updated parameters.
3. Nesterov Accelerated Gradient (NAG): NAG is an optimization algorithm that improves
upon SGD with Momentum. It computes an intermediate step in the direction of the
accumulated momentum before calculating the gradient. This lookahead step allows
NAG to make better-informed updates and often leads to faster convergence.
4. Second-Order Methods: While most deep learning optimization algorithms rely on first-
order gradients, second-order methods consider second-order derivatives (Hessian) as
well. These methods can provide more accurate and faster convergence but come with
increased computational complexity. Examples include Newton's method and Quasi-
Newton methods like L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno).
5. Regularization Techniques: Regularization techniques are used to prevent overfitting and
improve generalization. These techniques modify the loss function or add additional
terms to the optimization process. Some commonly used regularization techniques
include L1 regularization (Lasso), L2 regularization (Ridge), Dropout, and Batch
Normalization.
6. Learning Rate Scheduling: Instead of using a fixed learning rate throughout training,
learning rate scheduling adjusts the learning rate dynamically over time. Techniques like
step decay, exponential decay, or polynomial decay reduce the learning rate periodically
or gradually during training. Learning rate scheduling can help fine-tune the optimization
process and improve convergence.
These differentiation algorithms and techniques are employed to optimize deep learning models,
enhance convergence, and improve generalization. The choice of algorithm depends on the
specific task, dataset, and model architecture, and it often involves experimentation to identify
the most effective approach.
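As a rough sketch of how two of these optimizers differ, the following plain-Python example (a toy objective with illustrative hyperparameter values, not tuned defaults) minimizes f(w) = (w − 3)² with SGD with momentum and with Adam:

```python
import math

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2, minimized at w = 3.
    return 2 * (w - 3)

# SGD with momentum: accumulate a decaying average of past gradients.
w, v = 0.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(300):
    v = beta * v + grad(w)
    w -= lr * v
w_momentum = w

# Adam: adapt the step per parameter using first/second moment estimates.
w, m, s = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w)
    m = b1 * m + (1 - b1) * g        # first-moment (mean) estimate
    s = b2 * s + (1 - b2) * g * g    # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)        # bias correction
    s_hat = s / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(s_hat) + eps)
w_adam = w

print(w_momentum, w_adam)
```

Both runs approach the minimum at w = 3; the momentum term damps oscillation for SGD, while Adam rescales each step by the gradient's running magnitude.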
Unit 4
Regularization for Deep Learning
Parameter norm Penalties
In our last post, we learned about feedforward neural networks and how to design them. In this
post, we will learn how to tackle one of the most central problems that arises in the domain of
machine learning: how to make our algorithm fit not only the training set but also the testing set.
When an algorithm performs well on the training set but poorly on the testing set, the algorithm
is said to be overfitted on the training data. After all, our main goal is to perform well on
never-before-seen data, i.e., to reduce overfitting. To tackle this problem, we have to make our
model generalize over the training data, which is done using the various regularization techniques
we will learn about in this post.
Strategies or techniques that are used to reduce the error on the test set, at the expense of
increased training error, are collectively known as regularization. Many such techniques are
available to the deep learning practitioner. In fact, developing more effective regularization
strategies has been one of the major research efforts in the field.
Regularization can be defined as any modification made to a learning algorithm that is
intended to reduce its generalization error but not its training error. This regularization is often
done by putting some extra constraints on a machine learning model, such as adding restrictions
on the parameter values, or by adding extra terms to the objective function that can be thought of
as corresponding to a soft constraint on the parameter values. If chosen correctly, these can lead to
a reduced testing error. An effective regularizer is one that makes a profitable trade-off, reducing
variance significantly while not overly increasing the bias.
Parameter norm penalties limit the capacity of models by adding a norm penalty Ω(θ) to the
objective function J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
Where alpha is a hyperparameter that weighs the relative contribution of the norm penalty omega.
Setting alpha to 0 means no regularization and larger values of alpha correspond to more
regularization.
For neural networks, we choose a parameter norm penalty that penalizes only the weights of
the affine transformations and leaves the biases unregularized. This is because biases require
less data to fit accurately than weights: weights specify the relationship between two variables
and require observing both variables in various conditions, whereas each bias controls only a
single variable, so the biases can be left unregularized.
L2 Parameter Regularization
This regularization is popularly known as weight decay. This strategy drives the weights closer to
the origin by adding the regularization term Ω, which is defined as:
Ω(θ) = (1/2)||w||₂²
We can see that the weight decay term is now multiplicatively shrinking the weight vector by a
constant factor on each step, before performing the usual gradient update.
L1 Regularization
Here the regularization term is defined as the sum of the absolute values of the individual
parameters: Ω(θ) = ||w||₁ = Σ|wᵢ|.
Corresponding gradient:
By observing the gradient, we can notice that the regularization contribution no longer scales
linearly with each wᵢ; instead it is a constant factor with a sign equal to sign(wᵢ).
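The contrast between the two penalties shows up directly in a single gradient step; a minimal sketch, assuming the data-loss gradient is zero so that only the penalty term acts (the values of α and the learning rate are arbitrary illustrative choices):

```python
# One gradient step under L2 (weight decay) and under L1 regularization.
alpha, lr = 0.1, 0.5

# L2: the penalty gradient is alpha * w, so the update multiplicatively
# shrinks the weight toward the origin by a constant factor.
w = 2.0
w_l2 = (1 - lr * alpha) * w

# L1: the penalty gradient is alpha * sign(w), a constant-magnitude pull
# toward zero, which is why L1 tends to produce sparse weights.
def sign(x):
    return (x > 0) - (x < 0)

w = 2.0
w_l1 = w - lr * alpha * sign(w)

print(w_l2, w_l1)
```

Iterating the L2 step shrinks a weight geometrically but never reaches zero exactly, whereas the constant L1 step can drive small weights all the way to zero.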
Dataset Augmentation
The best and easiest way to make a model generalize is to train it on a large amount of data, but
mostly we are provided with limited data. One way around this is to create fake data and add it to
our training set.
This approach is mostly taken for classification problems. A classifier needs to take a complicated,
high-dimensional input x and summarize it with a single category identity y, which means the task
is invariant to a wide variety of transformations, so we can generate new (x, y) pairs easily just by
transforming the x inputs in our training set. This approach isn't always suitable; for a task such as
density estimation, it is difficult to generate fake data unless we have already solved the density
estimation problem.
Dataset augmentation is a very popular approach for computer vision tasks such as image
classification or object recognition, as images are high-dimensional and include an enormous
variety of factors of variation, many of which can be easily simulated. Operations like translating
the training images a few pixels in each direction, rotating the image, or scaling the image can
often greatly improve generalization, even if the model has already been designed to be partially
translation invariant.
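Two of the operations mentioned above, mirroring and translation, can be sketched on a toy 3×3 "image" held as nested lists (no image library is assumed; function names are illustrative):

```python
# A 3x3 grayscale "image" as nested lists of pixel intensities.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

def hflip(image):
    # Mirror the image left-to-right.
    return [list(reversed(row)) for row in image]

def translate_right(image, pixels, fill=0):
    # Shift the image right, padding the vacated columns with `fill`.
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

print(hflip(img))
print(translate_right(img, 1))
```

Each transformed copy keeps the original label y, giving the classifier extra (x, y) pairs for free.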
Noise Robustness
Noise is often introduced to the inputs as a dataset augmentation strategy. The addition of noise
with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the
norm of the weights. Noise injection can be much more powerful than simply shrinking the
parameters, especially when the noise is added to the hidden units.
Another way that noise has been used in the service of regularizing models is by adding it to the
weights. This technique has been used primarily in the context of recurrent neural networks.
Semi-Supervised Learning
In semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x,
y) are used to estimate P(y | x) or predict y from x. In the context of deep learning, semi-supervised
learning usually refers to learning a representation h = f(x). The goal is to learn a representation
so that examples from the same class have similar representations. Unsupervised learning
provides cues about how to group training examples in representation space. Using principal
component analysis as a pre-processing step before applying our classifier is an example of this
approach.
Instead of using separate models for unsupervised and supervised components, one can construct
models in which a generative model of either P (x) or P(x, y) shares parameters with a
discriminative model of P(y | x). Now the structure of P(x) is connected to the structure of P(y | x)
in a way that is captured by the shared parametrization. By controlling how much of the
generative criterion is included in the total criterion, one can find a better trade-off than with a
purely generative or a purely discriminative training criterion.
Multi-Task Learning
Multi-task learning is a way to improve generalization by pooling the examples arising out of
several tasks. In the same way that additional training examples put more pressure on the
parameters of the model towards values that generalize well, when part of a model is shared
across tasks, that part of the model is more constrained towards good values, often yielding better
generalization.
The model can generally be divided into two kinds of parts and associated parameters:
• Task-specific parameters, which only benefit from the examples of their own task to achieve good generalization.
• Generic parameters, shared across all the tasks, which benefit from the pooled data of all the tasks.
Early Stopping
When training a large model on a sufficiently large dataset, if training runs for too long then, rather than improving the model's generalization capability, it increases overfitting. During training, the training error keeps decreasing, but after a certain point the validation error starts to increase, signifying that the model has started to overfit.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. The idea of early stopping is that as soon as the validation error starts to increase, we freeze the parameters and stop the training process. Alternatively, we can store a copy of the model parameters every time the error on the validation set improves, and return these parameters, rather than the latest ones, when training terminates.
Early stopping has an advantage over weight decay in that it automatically determines the correct amount of regularization, while weight decay requires many training experiments with different values of its hyperparameter.
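The checkpoint-and-stop logic above can be sketched in a few lines (a simulated loop over pre-computed validation losses; the `patience` threshold is an illustrative choice, not from the text):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Checkpoint parameters whenever validation loss improves, and stop
    once it has failed to improve for `patience` consecutive epochs."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0  # save checkpoint
        else:
            waited += 1
            if waited >= patience:
                break                                       # early stop
    return best_epoch, best_loss

# Validation loss falls, then rises: the model starts to overfit.
val_losses = [1.0, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7]
best_epoch, best_loss = train_with_early_stopping(val_losses)
```

The returned checkpoint is the one from the epoch with the lowest validation loss, not the last epoch trained.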
Bagging
Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models. The idea is to train several different models separately, then have all of the
models vote on the output for test examples. This is an example of a general strategy in machine
learning called model averaging. Techniques employing this strategy are known
as ensemble methods. This is an efficient method as different models don’t make the same types
of errors.
Bagging involves constructing k different datasets. Each dataset has the same number of examples
as the original dataset, but each dataset is constructed by sampling with replacement from the
original dataset. This means that, with high probability, each dataset is missing some of the
examples from the original dataset and also contains several duplicate examples. Model i is then
trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models.
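The bootstrap construction of the k datasets can be sketched directly (illustrative sizes; on average each resampled dataset contains about 63.2% of the unique original examples):

```python
import numpy as np

rng = np.random.default_rng(42)
original = np.arange(1000)            # indices of the original examples

# Construct k bootstrap datasets by sampling with replacement.
k = 5
datasets = [rng.choice(original, size=len(original), replace=True)
            for _ in range(k)]

# Each resampled dataset has the original size, but with high probability
# misses some original examples and contains several duplicates.
unique_fractions = [len(np.unique(d)) / len(original) for d in datasets]
```

Model i would then be trained on `datasets[i]`, and the ensemble's prediction would average or vote over the k trained models.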
Dropout
Dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks. The method of bagging cannot be directly applied to large neural networks, as it involves training multiple models and evaluating multiple models on each test example. Since training and evaluating such networks is costly in terms of runtime and memory, this method is impractical for neural networks. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks. Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network.
In most modern neural networks, based on a series of affine transformations and nonlinearities,
we can effectively remove a unit from a network by multiplying its output value by zero. This
procedure requires some slight modification for models such as radial basis function networks,
which take the difference between the unit’s state and some reference value. Here, we present
the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially
modified to work with other operations that remove a unit from the network.
Dropout training is not quite the same as bagging training. In the case of bagging, the models are
all independent. In the case of dropout, the models share parameters, with each model inheriting a
different subset of parameters from the parent neural network. This parameter sharing makes it
possible to represent an exponential number of models with a tractable amount of memory. One
advantage of dropout is that it is very computationally cheap. Using dropout during training
requires only O(n) computation per example per update, to generate n random binary numbers
and multiply them by the state. Another significant advantage of dropout is that it does not
significantly limit the type of model or training procedure that can be used. It works well with
nearly any model that uses a distributed representation and can be trained with stochastic gradient
descent.
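The "n random binary numbers multiplied by the state" can be sketched as follows (an inverted-dropout variant, which also rescales the survivors so the expected activation is unchanged; the shapes and probabilities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_prob, training=True):
    """Zero each unit with probability drop_prob; rescale survivors so
    the expected activation matches the no-dropout network."""
    if not training or drop_prob == 0.0:
        return h
    mask = rng.random(h.shape) >= drop_prob   # n random binary numbers
    return h * mask / (1.0 - drop_prob)

h = np.ones(10000)                      # a layer of unit activations
out = dropout(h, drop_prob=0.5)
kept_fraction = (out != 0).mean()       # close to 0.5
mean_activation = out.mean()            # close to 1.0 in expectation
```

This is the O(n)-per-example cost mentioned above: one random number and one multiplication per unit.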
Adversarial Training
In many cases, neural networks seem to have achieved human-level performance on a task, but to check whether they really perform at human level, networks are tested on adversarial examples. An adversarial example is an input a, constructed near a data point x, such that the model output at a is very different from the output at x. Adversarial examples are intentionally constructed using an optimization procedure, and models often have a nearly 100% error rate on them.
Adversarial training helps regularize models: when models are trained on training sets that are augmented with adversarial examples, the generalization of the model improves.
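The construction can be illustrated with a toy linear classifier (the weights and input are made up for illustration; this is the fast-gradient-sign idea, not a trained model):

```python
import numpy as np

# A toy linear classifier: predict class 1 if w.x > 0, else class 0.
w = np.array([1.0, -2.0, 3.0, -4.0])

def predict(x):
    return int(w @ x > 0)

# The gradient of the score w.x with respect to x is w, so stepping
# against sign(w) (for a class-1 input) lowers the score as fast as
# possible per unit of max-norm perturbation.
def adversarial(x, eps):
    step = -np.sign(w) if predict(x) == 1 else np.sign(w)
    return x + eps * step

x = np.array([0.5, 0.0, 0.0, 0.0])   # classified as 1, with a small margin
x_adv = adversarial(x, eps=0.1)      # every coordinate moves by at most 0.1
```

Even though `x_adv` stays within 0.1 of `x` in every coordinate, the predicted class flips, which is exactly what makes such examples useful for adversarial training.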
Norm penalties can also be used in combination, leading to elastic net regularization, which combines
both the L1 and L2 penalties. The elastic net penalty helps balance the benefits of sparsity from the L1
norm and the robustness to correlated features from the L2 norm.
Norm Penalties as Constrained Optimization
Apart from adding a penalty term Ω(θ) to the objective function J and minimizing their sum J̃, we can also keep Ω(θ) small by optimizing J subject to the constraint Ω(θ) < k.
This can be done by constructing a generalized Lagrange function, consisting of the original objective function plus a penalty term:

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)

The solution is:

θ* = argmin_θ max_{α ≥ 0} L(θ, α; X, y)

Note that both θ and α are variables in this objective function. When α* is fixed (say, we already know the best α), the optimization problem becomes

θ* = argmin_θ L(θ, α*) = argmin_θ J(θ; X, y) + α*Ω(θ)

which is the same as regularization with a parameter norm penalty. For example, if Ω is the L2 norm, we can think of it as constraining the weights to lie inside an L2 ball. Benefits of regularization as constraints:
• We can specify a concrete constraint region, while the effect of adjusting α on Ω(θ) is vague. We can take a step with stochastic gradient descent and then re-project θ back into the feasible region Ω(θ) < k.
• We can avoid the dead spots that penalty terms can introduce into the objective function: explicit constraints only take effect when the weights attempt to leave the constraint region.
• The optimization procedure is more stable. With a penalty, a large learning rate may result in a positive feedback loop in which large weights induce large gradients; explicit constraints with re-projection prevent this.
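The re-projection step for an L2-ball constraint is a one-liner (illustrative parameter values):

```python
import numpy as np

def project_l2_ball(theta, k):
    """Re-project parameters into the feasible region ||theta||_2 <= k.
    Inside the ball, theta is untouched; outside, it is rescaled onto
    the boundary."""
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

theta = np.array([3.0, 4.0])              # ||theta|| = 5, outside the ball
theta = project_l2_ball(theta, k=1.0)     # rescaled back onto ||theta|| = 1
```

In constrained SGD, this projection is applied after every gradient step, so the weights can never drift outside Ω(θ) < k regardless of the learning rate.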
Regularization and Under-Constrained Problems
In the context of deep learning, regularization and under-constrained problems are important
concepts.
Regularization in Deep Learning: In deep learning, regularization techniques are employed to
prevent overfitting, which occurs when a model performs well on the training data but fails to
generalize to new, unseen data. Overfitting often happens when a model becomes too complex
and starts memorizing the training examples instead of learning meaningful patterns.
Regularization techniques in deep learning typically involve adding a regularization term to the
loss function during training. This extra term encourages the model to have certain desirable
properties, such as smaller weights or sparsity. The most common regularization techniques used
in deep learning include L1 and L2 regularization (also known as weight decay), dropout, and
batch normalization.
In a linear regression problem, when the number of instances is smaller than the number of variables, the problem is under-constrained: XᵀX is singular, so the closed-form solution w = (XᵀX)⁻¹Xᵀy cannot be computed.
In a logistic regression problem, when the two classes are linearly separable by a vector w, 2w is also a feasible solution. An iterative optimization algorithm may keep increasing the magnitude of w and never stop.
However, when we add a regularization term to the loss function, convergence is guaranteed. For example, w will not be updated to 2w, because the likelihood loss barely decreases while the regularization term grows substantially. The idea of using regularization to solve under-determined problems extends beyond machine learning.
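The linear-regression case can be checked numerically (illustrative dimensions; with 5 instances and 10 variables, XᵀX is rank-deficient, but adding a weight-decay term αI makes the system solvable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fewer instances (5) than variables (10): X^T X is singular.
X = rng.normal(size=(5, 10))
y = rng.normal(size=5)

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)     # < 10, so XtX has no inverse

# With an L2 penalty alpha * ||w||^2, the regularized solution
# w = (X^T X + alpha I)^{-1} X^T y exists for any alpha > 0.
alpha = 0.1
w = np.linalg.solve(XtX + alpha * np.eye(10), X.T @ y)
```

This is exactly the sense in which regularization turns an under-determined problem into a well-posed one.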
Dataset Augmentation
Deep learning algorithms are getting increasingly complex, and neural nets are getting deeper and deeper. More layers in neural nets means more parameters that your model is learning from your data. In some recent state-of-the-art models, there can be more than 100 million parameters learned during training.
When your model is trying to understand a relationship this deeply, it needs a lot of examples to learn from. That's why popular datasets for models like these might have something like 10,000 images for training. That size of data is not at all easy to come by.
Even if you're using simpler or smaller types of models, it's challenging to organize a dataset large enough to train effectively. Especially as machine learning gets applied to newer and newer verticals, it's becoming harder and harder to find reliable training data. If you wanted to create a classifier to distinguish iPhones from Google Pixels, how would you get thousands of different photos?
Finally, even with the right size training set, things can still go awry. Remember that algorithms
don’t think like humans: while you classify images based on a natural understanding of what’s in
the image, algorithms are learning that on the fly. If you’re creating a cat / dog classifier and
most of your training images for dogs have a snowy background, your algorithm might end up
learning the wrong rules. Having images from varied perspectives and with different contexts is
crucial.
For an idea of just how much this process can help, check out this benchmark that NanoNets ran
in their explainer post. Their results showed an almost 20 percentage point increase in test
accuracy with dataset augmentation applied.
It’s safer for us to assume the cause of this accuracy boost was a bit more complicated than just
dataset augmentation, but the message is clear: it can really help.
Before we dive into what you might practically do to augment your data, it's worth noting that there are two broad approaches to when to augment it. In offline dataset augmentation, transforms are applied en masse to your dataset before training. You might, for example, flip each of your images horizontally, resulting in a training set with twice as many examples. In online dataset augmentation, transforms are applied in real time as batches are passed into training. This won't produce a larger stored dataset, but is much more practical for larger training sets.
Most of these transformations have fairly simple implementations in packages like Tensorflow.
And though they might seem simple, combining them in creative ways across your dataset can
yield impressive improvements in model accuracy.
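An offline flip transform of the kind described above takes only a few lines (toy 8×8 images for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 8, 8), dtype=np.uint8)  # 4 toy images

# Offline augmentation: add a horizontally flipped copy of every image,
# doubling the size of the training set before training starts.
flipped = images[:, :, ::-1]
augmented = np.concatenate([images, flipped], axis=0)
```

Rotations, crops, and color shifts compose the same way; in an online pipeline, the same transforms would instead be applied per batch as it is fed to the trainer.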
One issue that often comes up is input size requirements, which are one of the most frustrating
parts of neural nets for practitioners. If you shift or rotate an image, you’re going to end up with
something that’s a different size, and that needs to be fixed before training. Different approaches
advocate filling in empty space with constant values, zooming in until you’ve reached the right
size, or reflecting pixel values into your empty space. As with any preprocessing, testing and
validating is the best way to find a definitive answer.
You can utilize pre-trained nets that transfer exterior styles onto your training images as part of a
dataset augmentation pipeline.
Noise Robustness
Noise with infinitesimal variance imposes a penalty on the norm of the weights. Noise added to
hidden units is very important and is discussed later in Dropout. Noise can even be added to the
weights. This has several interpretations. One of them is that adding noise to weights is a
stochastic implementation of Bayesian inference over the weights, where the weights are
considered to be uncertain, with the uncertainty being modelled by a probability distribution. For example, in the linear regression case, we want to learn the mapping y(x) for each feature vector x by minimizing the mean squared error. Now, suppose a zero-mean, unit-variance Gaussian random noise ϵ is added to the weights. We still want to learn the appropriate mapping by reducing the mean squared error. Minimizing the loss
after adding noise to the weights is equivalent to adding another regularization term which makes
sure that small perturbations in the weight values don’t affect the predictions much, thus
stabilising training.
Sometimes we may have wrong output labels, in which case maximizing p(y | x) may not be a good idea. In such a case, we can add noise to the labels by assigning a probability of (1 − ϵ) that the label is correct and a probability of ϵ that it is not. In the latter case, all the other labels are equally likely. Label smoothing regularizes a model with k softmax outputs by assigning the correct class a target probability of (1 − ϵ) and each of the remaining (k − 1) classes a probability of ϵ / (k − 1).
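The smoothed targets are easy to construct directly (illustrative ϵ and class count):

```python
import numpy as np

def smooth_labels(y, k, eps=0.1):
    """Replace hard one-hot targets with (1 - eps) on the correct class
    and eps / (k - 1) spread over the remaining k - 1 classes."""
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

y = np.array([0, 2, 1])              # true class indices
T = smooth_labels(y, k=3, eps=0.1)   # e.g. row 0 is [0.9, 0.05, 0.05]
```

Each row still sums to 1, so the smoothed targets remain valid probability distributions for the softmax cross-entropy loss.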
Semi-Supervised Learning
Today's machine learning algorithms can be broadly classified into three categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Casting Reinforcement Learning aside, the primary two categories of machine learning problems are Supervised and Unsupervised Learning. The basic difference between the two is that Supervised Learning datasets have an output label associated with each tuple while Unsupervised Learning datasets do not.
Intuitively, one may imagine the three types of learning algorithms as Supervised
learning where a student is under the supervision of a teacher at both home and school,
Unsupervised learning where a student has to figure out a concept himself and Semi-
Supervised learning where a teacher teaches a few concepts in class and gives questions
as homework which are based on similar concepts.
P(x, y) denotes the joint distribution of x and y, i.e., corresponding to a training sample x, we have a label y. P(x) denotes the marginal distribution of x, i.e., just the training examples without any labels. Semi-supervised learning uses both P(x, y) (labelled samples) and P(x) (unlabelled samples) to estimate P(y|x) (since we want to predict the class, given the training sample). We want to learn some representation h = f(x) such that samples which are closer in the input space have similar representations, and a linear classifier in the new space achieves good generalization.
Instead of separating the supervised and unsupervised criteria, we can instead have a generative
model of P(x) (or P(x, y)) which shares parameters with the discriminative model. The idea is to
share the unsupervised/generative criterion with the supervised criterion to express a prior belief
that the structure of P(x) (or P(x, y)) is connected to the structure of P(y|x), which is expressed by
the shared parameters.
Multi-Task Learning (MTL)
Multi-Task Learning (MTL) is a type of machine learning technique where a model is
trained to perform multiple tasks simultaneously. In deep learning, MTL refers to
training a neural network to perform multiple tasks by sharing some of the network’s
layers and parameters across tasks.
In MTL, the goal is to improve the generalization performance of the model by
leveraging the information shared across tasks. By sharing some of the network’s
parameters, the model can learn a more efficient and compact representation of the data,
which can be beneficial when the tasks are related or have some commonalities.
There are different ways to implement MTL in deep learning, but the most common
approach is to use a shared feature extractor and multiple task-specific heads. The
shared feature extractor is a part of the network that is shared across tasks and is used to
extract features from the input data. The task-specific heads are used to make
predictions for each task and are typically connected to the shared feature extractor.
Another approach is to use a shared decision-making layer, where the decision-making
layer is shared across tasks, and the task-specific layers are connected to the shared
decision-making layer.
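The shared-extractor-plus-heads architecture described above can be sketched as a forward pass (all dimensions and weights here are made-up illustrations: a 16-D input, an 8-D shared representation, a 3-class head and a scalar regression head):

```python
import numpy as np

rng = np.random.default_rng(0)

W_shared = rng.normal(size=(16, 8)) * 0.1   # shared feature extractor
W_cls = rng.normal(size=(8, 3)) * 0.1       # task-specific head: classification
W_reg = rng.normal(size=(8, 1)) * 0.1       # task-specific head: regression

def forward(x):
    h = np.tanh(x @ W_shared)   # features shared across both tasks
    logits = h @ W_cls          # classification head
    value = h @ W_reg           # regression head
    return logits, value

x = rng.normal(size=(5, 16))    # a batch of 5 examples
logits, value = forward(x)
```

During training, gradients from both task losses flow into `W_shared`, which is what constrains the shared parameters toward values that work for every task.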
MTL can be useful in many applications such as natural language processing, computer vision, and healthcare, where multiple tasks are related or have some commonalities. It is also useful when data is limited: MTL can help improve the generalization performance of the model by leveraging the information shared across tasks. However, MTL also has its own limitations: when the tasks are very different or unrelated, sharing parameters can hurt performance (negative transfer).
Multi-Task Learning is a sub-field of Deep Learning. It is recommended that you familiarize yourself with the concepts of neural networks to understand what multi-task learning means.
Soft Parameter Sharing – Each model has their own sets of weights and biases and the
distance between these parameters in different models is regularized so that the
parameters become similar and can represent all the tasks.
Assumptions and Considerations – Using MTL to share knowledge among tasks is useful only when the tasks are similar; when this assumption is violated, performance can decline significantly.
Applications: MTL techniques have found various uses; some of the major applications are:
Object detection and Facial recognition
Self Driving Cars: Pedestrians, stop signs and other obstacles can be detected
together
Multi-domain collaborative filtering for web applications
Stock Prediction
Language Modelling and other NLP applications
Important points:
Here are some important points to consider when implementing Multi-Task Learning
(MTL) for deep learning:
1. Task relatedness: MTL is most effective when the tasks are related or have some
commonalities, such as natural language processing, computer vision, and
healthcare.
2. Data limitation: MTL can be useful when the data is limited, as it allows the model
to leverage the information shared across tasks to improve the generalization
performance.
3. Shared feature extractor: A common approach in MTL is to use a shared feature
extractor, which is a part of the network that is shared across tasks and is used to
extract features from the input data.
4. Task-specific heads: Task-specific heads are used to make predictions for each task
and are typically connected to the shared feature extractor.
5. Shared decision-making layer: Another approach is to use a shared decision-making
layer, where the decision-making layer is shared across tasks, and the task-specific
layers are connected to the shared decision-making layer.
6. Careful architecture design: The architecture of MTL should be carefully designed
to accommodate the different tasks and to make sure that the shared features are
useful for all tasks.
7. Overfitting: MTL models can be prone to overfitting if the model is not regularized
properly.
8. Avoiding negative transfer: when the tasks are very different or independent, MTL
can lead to suboptimal performance compared to training a single-task model.
Therefore, it is important to make sure that the shared features are useful for all
tasks to avoid negative transfer.
Early stopping
A significant challenge when training a machine learning model is deciding how
many epochs to run. Too few epochs might not lead to model convergence, while too
many epochs could lead to overfitting.
Although the validation set strategy is the best in terms of preventing overfitting, it usually takes a large number of epochs before a model begins to overfit, which can cost a lot of computing power. A smart way to get the best of both worlds is a hybrid approach: monitor the validation error, and also stop when the loss function update becomes smaller than a threshold. Training stops as soon as either criterion is met.
Parameter Tying
Some standard regularisers like L1 and L2 penalize model parameters for deviating from the fixed value of zero. One of the side effects of Lasso or group-Lasso regularization in learning a deep neural network is that many of the parameters may become zero, reducing the amount of memory required to store the model and lowering the computational cost of applying it. A significant drawback of Lasso (or group-Lasso)
regularization is that in the presence of groups of highly correlated features, it tends to select
only one or an arbitrary convex combination of elements from each group. Moreover, the
learning process of Lasso tends to be unstable because the subsets of parameters that end up
selected may change dramatically with minor changes in the data or algorithmic procedure.
In Deep Neural Networks, it is almost unavoidable to encounter correlated features due to the
high dimensionality of the input to each layer and because neurons tend to adapt, producing
strongly correlated features that we pass as an input to the subsequent layer.
The issues we face while using Lasso or group-Lasso are countered by a regularizer known as the group version of the ordered weighted one norm, group-OWL (GrOWL). GrOWL promotes sparsity and simultaneously learns which parameters should share a similar value.
GrOWL has been effective in linear regression, identifying and coping with strongly correlated
covariates.
Unlike standard sparsity-inducing regularizers (e.g., Lasso), GrOWL eliminates unimportant
neurons by setting all their weights to zero and explicitly identifies strongly correlated neurons
by tying the corresponding weights to an expected value.
This ability of GrOWL motivates the following two-stage procedure:
(i) use GrOWL regularization during training to simultaneously identify significant neurons and
groups of parameters that should be tied together.
(ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only
the significant neurons and implementing the learned tying structure.
Parameter Sharing
With this method, the parameters of one model, trained as a classifier in a supervised paradigm, are regularised to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data).
Many of the parameters in the classifier model can be paired with similar parameters in the unsupervised model thanks to such designs. While a parameter norm penalty is one way to regularise parameters to be close to one another, the more prevalent approach is to use constraints: to force sets of parameters to be equal. Because we view the various models or model components as sharing a unique set of parameters, this form of regularisation is commonly referred to as parameter sharing. The fact that only a subset of the parameters (the unique set) needs to be retained in memory is a significant advantage of parameter sharing over regularising the parameters to be close (through a norm penalty).
This can result in a large reduction in the memory footprint of certain models, such as the convolutional
neural network.
Convolutional neural networks (CNNs) used in computer vision are by far the most widespread and
extensive usage of parameter sharing. Many statistical features of natural images are translation
insensitive. A shot of a cat, for example, can be translated one pixel to the right and still be a shot of a
cat. By sharing parameters across several picture locations, CNNs take this property into account.
Different locations in the input are computed with the same feature (a hidden unit with the same
weights). This indicates that whether the cat appears in column i or column i + 1 in the image, we can
find it with the same cat detector.
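The "same detector at every position" idea is exactly a convolution; a 1-D toy version shows that shifting the input simply shifts the detector's response (the kernel and signal are illustrative):

```python
import numpy as np

# One shared 3-tap detector applied at every position of the input:
# the same weights find the pattern wherever it occurs.
kernel = np.array([1.0, 2.0, 1.0])
signal = np.zeros(10)
signal[3] = 1.0                              # the "cat" at position 3

response = np.convolve(signal, kernel, mode="same")
shifted = np.roll(signal, 1)                 # the same cat, one step right
shifted_response = np.convolve(shifted, kernel, mode="same")
```

Because the weights are shared across positions, the shifted input produces the same response shifted by one step, rather than requiring a separately learned detector per location.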
CNN’s have been able to reduce the number of unique model parameters and raise network sizes
greatly without requiring a comparable increase in training data thanks to parameter sharing. It’s still
one of the best illustrations of how domain knowledge can be efficiently integrated into the network
architecture.
Bagging and other Ensemble Methods
Ensemble Methods
The general principle of an ensemble method in machine learning is to combine the predictions of several models. These models are built with a given learning algorithm in order to improve robustness over a single model. Ensemble methods can be divided into two groups:
Parallel ensemble methods: In these methods, the base learners are generated in
parallel simultaneously. For example, when deciding the movie you want to watch, you
may ask multiple friends for suggestions and probably watch the movie which got the
highest votes.
Sequential ensemble methods: In this technique, different learners learn sequentially
with early learners fitting simple models to the data. Then the data is analyzed for
errors. The goal is to solve for net error from the prior model. The overall performance
can be boosted by weighing previously mislabeled examples with higher weight.
Most ensemble methods use a single base learning algorithm to produce homogeneous base
learners, i.e. learners of the same type, leading to homogeneous ensembles. For
example, Random forests (Parallel ensemble method) and Adaboost(Sequential ensemble
methods).
Some methods use heterogeneous learners, i.e. learners of different types. This leads to
heterogeneous ensembles. For ensemble methods to be more accurate than any of its members,
the base learners have to be as accurate and as diverse as possible. In Scikit-learn, there is a
model known as a voting classifier. This is an example of heterogeneous learners.
Bagging
Bagging, a Parallel ensemble method (stands for Bootstrap Aggregating), is a way to decrease
the variance of the prediction model by generating additional data in the training stage. This is
produced by random sampling with replacement from the original set. By sampling with
replacement, some observations may be repeated in each new training data set. In the case of bagging, every element has the same probability of appearing in a new dataset. Increasing the size of the training set this way does not improve the model's predictive force by itself; rather, it decreases the variance and narrowly tunes the prediction to an expected outcome.
These multisets of data are used to train multiple models. As a result, we end up with an
ensemble of different models, and the average of all their predictions is used, which is more robust than a single model. In the case of regression, the prediction is the average of all the predictions given by the different models. In the case of classification, the majority vote is taken into consideration.
For example, decision tree models tend to have a high variance, hence we apply bagging to them. Usually, the Random Forest model is used for this purpose. It is an extension of bagging: it takes a random selection of features rather than using all features to grow each tree. When you have many such random trees, it is called a Random Forest.
Boosting
Boosting is a sequential ensemble method that in general decreases the bias error and builds
strong predictive models. The term ‘Boosting’ refers to a family of algorithms which converts
a weak learner to a strong learner.
Boosting gets multiple learners. The data samples are weighted and therefore, some of them
may take part in the new sets more often.
In each iteration, data points that are mispredicted are identified and their weights are
increased so that the next learner pays extra attention to get them right. The following figure
illustrates the boosting process.
During training, the algorithm allocates a weight to each resulting model. A learner with good prediction results on the training data will be assigned a higher weight than a poor one. So when evaluating a new learner, boosting also needs to keep track of the learners' errors.
Some of the Boosting techniques include an extra-condition to keep or discard a single learner.
For example, in AdaBoost an error of less than 50% is required to maintain the model;
otherwise, the iteration is repeated until achieving a learner better than a random guess.
Bagging vs Boosting
There’s no outright winner, it depends on the data, the simulation, and the
circumstances. Bagging and Boosting in machine learning decrease the variance of a single
estimate as they combine several estimates from different models. As a result, the performance
of the model increases, and the predictions are much more robust and stable.
But how do we measure the performance of a model? One of the ways is to compare its
training accuracy with its validation accuracy which is done by splitting the data into two sets,
viz- training set and validation set.
The model is trained on the training set and evaluated on the validation set. Thus, the training
accuracy is evaluated on the training set and gives us a measure of how good the model can fit
the training data. On the other hand, validation accuracy is evaluated on the validation set and
reveals the generalization ability of the model. A model’s ability to generalize is crucial to the
success of a model. Thus, we can say that the performance of a model is good if it can fit the
training data well and also predict the unknown data points accurately.
If a single model gets low performance, bagging will rarely get a better bias. However, boosting can generate a combined model with lower errors, as it optimizes the advantages and reduces the pitfalls of the single model. On the other hand, bagging can increase the generalization ability of the model and help it better predict the unknown samples. Let us see an example of this in the next section.
Implementation
In this section, we demonstrate the effect of Bagging and Boosting on the decision boundary of
a classifier. Let us start by introducing some of the algorithms used in this code.
Decision Tree Classifier: Decision Tree Classifier is a simple and widely used
classification technique. It applies a straightforward idea to solve the classification
problem. Decision Tree Classifier poses a series of carefully crafted questions about the
attributes of the test record. Each time it receives an answer, a follow-up question is
asked until a conclusion about the class label of the record is reached.
Decision Stump: A decision stump is a machine learning model consisting of a one-
level decision tree. That is, it is a decision tree with one internal node (the root) which
is immediately connected to the terminal nodes (its leaves). A decision stump makes a
prediction based on the value of just a single input feature. Here we take a decision stump
as the weak learner for the AdaBoost algorithm.
RandomForest: Random forest is an ensemble learning algorithm that uses the concept
of Bagging.
AdaBoost: AdaBoost, short for Adaptive Boosting, is a machine learning meta-
algorithm that works on the principle of Boosting. We use a Decision stump as a weak
learner here.
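The boosting loop with decision stumps can be sketched in pure NumPy (a minimal illustration of the weighting mechanics, not the scikit-learn classes described above; the tiny 1-D dataset is made up, and no single stump can classify it perfectly):

```python
import numpy as np

def stump_predict(X, feat, thresh, polarity):
    """One-level decision tree: predict +1/-1 from a single feature."""
    return np.where(polarity * (X[:, feat] - thresh) > 0, 1, -1)

def adaboost_fit(X, y, rounds=30):
    n = len(y)
    w = np.full(n, 1.0 / n)              # example weights, updated each round
    learners = []
    for _ in range(rounds):
        best = None
        for feat in range(X.shape[1]):
            u = np.unique(X[:, feat])
            # Midpoint thresholds, plus one below the range so the stump
            # set also contains constant (all +1 / all -1) predictors.
            thresholds = np.concatenate(([u[0] - 1.0], (u[:-1] + u[1:]) / 2.0))
            for thresh in thresholds:
                for polarity in (1, -1):
                    pred = stump_predict(X, feat, thresh, polarity)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, polarity)
        err, feat, thresh, polarity = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = stump_predict(X, feat, thresh, polarity)
        w *= np.exp(-alpha * y * pred)   # increase weights of mistakes
        w /= w.sum()
        learners.append((alpha, feat, thresh, polarity))
    return learners

def adaboost_predict(X, learners):
    score = sum(a * stump_predict(X, f, t, p) for a, f, t, p in learners)
    return np.where(score > 0, 1, -1)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])       # not separable by any one stump
learners = adaboost_fit(X, y, rounds=30)
accuracy = float((adaboost_predict(X, learners) == y).mean())
```

Each round picks the stump with the lowest weighted error, then boosts the weights of the points it got wrong, so later stumps concentrate on the hard examples; the weighted vote of the stumps classifies points no single stump can.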
Dropout
INTRODUCTION
So before diving deep into its world, let’s address the first question. What is
the problem that we are trying to solve?
The best way to reduce overfitting, and the best way to regularise a fixed-size model, is to average the predictions from all possible settings of the parameters. But this is far too computationally expensive and isn't feasible for real-time inference/prediction.
What is a Dropout?
The term “dropout” refers to dropping out nodes (in the input and hidden
layers) of a neural network (as seen in Figure 1). All the forward and
backward connections of a dropped node are temporarily removed,
thus creating a new network architecture out of the parent network. The
nodes are dropped with a dropout probability of p.
For example, suppose the input is x = {1, 2, 3, 4, 5} and dropout is applied at the input
layer with a keep probability of 0.8. During the forward propagation (training), 20% of
the nodes would be dropped on average, i.e. x could become {1, 0, 3, 4, 5} or
{1, 2, 0, 4, 5}, and so on. Similarly, dropout is applied to the hidden layers.
For instance, if the hidden layers have 1000 neurons (nodes) and a dropout is
applied with drop probability = 0.5, then 500 neurons would be randomly
dropped in every iteration (batch).
Generally, for the input layer, the keep probability, i.e. 1 − drop probability, is kept
closer to 1, with 0.8 suggested as the best value by the authors. For the hidden
layers, the greater the drop probability, the more sparse the model; 0.5 is the most
commonly used drop probability, which means dropping 50% of the nodes.
In the overfitting problem, the model learns the statistical noise. To be precise,
the main motive of training is to decrease the loss function, given all the units
(neurons). So in overfitting, a unit may change in a way that fixes up the
mistakes of the other units. This leads to complex co-adaptations, which in
turn leads to the overfitting problem because this complex co-adaptation fails
to generalise on the unseen dataset.
Now, if we use dropout, it prevents these units from fixing up the mistakes of other
units, thus preventing co-adaptation, because in every iteration the presence of a
unit is highly unreliable. So by randomly dropping a few units (nodes), it
forces the layers to take more or less responsibility for the input by taking a
probabilistic approach.
This ensures that the model is getting generalised and hence reducing the
overfitting problem.
Figure 2:(a) Hidden layer features without dropout; (b) Hidden layer features with
dropout
From figure 2, we can easily make out that the hidden layer with dropout is
learning more of the generalised features than the co-adaptations in the layer
without dropout. It is quite apparent, that dropout breaks such inter-unit
relations and focuses more on generalisation.
Dropout Implementation
Figure 3 (a) A unit (neuron) during training is present with a probability p and is
connected to the next layer with weights ‘w’ ; (b) A unit during inference/prediction is
always present and is connected to the next layer with weights, ‘pw’
In the standard neural network, during the forward propagation we have the
following equation:

z^(l+1) = w^(l+1) · y^(l) + b^(l+1)

where:
z: the vector of outputs from layer (l + 1) before activation
y: the vector of outputs from layer l
w: the weights connecting layer l to layer (l + 1)
b: the bias of layer (l + 1)
Further, with the activation function, z is transformed into the output for layer
(l+1).
Figure 6: Comparison of the dropout network with the standard network for a given
layer during forward propagation
Now we know how dropout works mathematically, but what happens during
inference/prediction? Do we use the network with dropout, or do we remove the dropout during
inference?
This is one of the most important concepts of dropout, which very few data scientists are aware of.
According to the original implementation (Figure 3b), during inference we do not use a dropout
layer; all the units are considered during the prediction step. But because every unit of a layer is
now active, the activations would be larger than those seen during training. To deal with this
problem, the weights are scaled by the keep probability p (i.e. w becomes pw, as in Figure 3b),
so that the expected output of each unit matches its training-time output. With this, the network is
able to make accurate predictions.
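A minimal NumPy sketch of this behaviour (my own illustration, not the original paper's code): during training each unit is kept with probability p and zeroed otherwise; at inference all units stay active and the activations (equivalently, the outgoing weights) are scaled by p.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(y, p_keep, training):
    """Original (non-inverted) dropout: drop units while training,
    scale by the keep probability p at inference (w -> p*w in Figure 3b)."""
    if training:
        mask = rng.random(y.shape) < p_keep   # keep each unit with prob p_keep
        return y * mask                       # dropped units output 0
    return y * p_keep                         # expected activation matches training

h = np.ones(10_000)                           # a layer of 10,000 unit activations
train_out = dropout(h, 0.8, training=True)
infer_out = dropout(h, 0.8, training=False)
```

With p_keep = 0.8, roughly 20% of the activations in train_out are zero, and infer_out is uniformly 0.8, matching the expected value of the training-time output.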
Adversarial Training
In this figure, we can see that the classification of the perturbed image
(rightmost) is clearly absurd. We can also notice that the difference between
the original and modified images is very slight.
With articles like “California’s finally ready for truly driverless cars” or “The
Pentagon’s ‘Terminator Conundrum’: Robots That Could Kill on Their Own”,
applications of adversarial examples are not exactly hard to come up with…
In this post I will first explain how such images are created and then go
through the main defenses that have been published. For more details and
more rigorous information, please refer to the research papers referenced.
In the figure on the left, you can see a simple curve. Suppose that we want to
find a local minimum (a value of x for which f(x) is locally minimal).
Gradient descent consists of the following steps: first pick an initial value
for x, then compute the derivative f’ of f with respect to x and evaluate it at our
initial guess. f’(x) is the slope of the tangent to the curve at x. Depending on the
sign of this slope, we know whether we have to increase or decrease x to
make f(x) decrease. In the example on the left, the slope is negative, so we
should increase the value of x to make f(x) decrease. As the tangent is a good
local approximation of the curve, repeatedly taking small steps in this way
brings us to a local minimum.
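The procedure can be written out as a toy loop (curve and starting point are my own choices for illustration), minimizing f(x) = (x − 2)² + 1 from a point where the slope is negative:

```python
def f_prime(x):
    return 2 * (x - 2)          # derivative of f(x) = (x - 2)**2 + 1

x, lr = -3.0, 0.1               # initial guess; the slope here is negative
for _ in range(100):
    x -= lr * f_prime(x)        # slope < 0  =>  x increases, f(x) decreases
```

After 100 steps x has converged to the local minimum at x = 2, which for this convex curve is also the global one.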
Suppose that we have a set of points and we want to find a line that is a
reasonable approximation for these values.
Our machine learning model will be the line y = ax + b and the model
parameters will be a and b.
Now to use the gradient descent, we are going to define a function of which
we will want to find a local minimum. This is our loss function.
This loss takes a data point x, its corresponding value y and the model
parameters a and b. The loss is the squared difference of the real
value y and ax + b, the prediction of our model. The bigger the difference
between the real and predicted values is, the bigger the value of the loss
function will be. Intuitively, we chose the squaring operation to penalize big
differences between real and predicted values more than small ones.
Taking the derivatives of this loss with respect to the model parameters gives
∂L/∂a = −2x(y − (ax + b)) and ∂L/∂b = −2(y − (ax + b)). As before, we can
evaluate these derivatives with our current values of a and b for each data
point (x, y), which gives us the slopes of the tangents to the loss function, and
use these slopes to update a and b in order to minimize L.
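A compact sketch of this fitting loop (the data points are generated by me for illustration, from the line y = 3x + 1):

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = 3 * x + 1                      # points generated from the line y = 3x + 1

a, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = y - (a * x + b)          # residual of the current line at each point
    a += lr * np.mean(2 * x * err) # step against dL/da, averaged over the data
    b += lr * np.mean(2 * err)     # step against dL/db
```

The loop recovers a ≈ 3 and b ≈ 1, the line the data was generated from.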
OK, that’s cool and all but that’s not how we’re going to generate our
adversarial examples…
Well, in fact it is exactly how we are going to do it. Suppose now that
the model is fixed (you can’t change a and b) and you want to increase the
value of the loss. The only thing left to modify are the data points (x, y). As
modifying the ys does not really make sense, we will modify the xs.
We could just replace the x by random values and the loss value would
increase by a tremendous amount but that’s not really subtle, in particular, it
would be really obvious to a human plotting the data points. To make our
changes in a way that is not obviously detected by an observer, we will
compute the derivative of the loss function according to x.
And now, just as before, we can evaluate this derivative on our data points,
get the slope of the tangent and update the x values by a small amount
accordingly. The loss will increase and, as we are modifying all the points by
a small amount, our perturbation will be hard to detect.
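Continuing the toy line model (the specific numbers are invented for illustration): with a and b frozen, we nudge each x a small amount along the gradient of the loss with respect to x, and the loss goes up.

```python
import numpy as np

a, b = 3.0, 1.0                          # frozen model parameters
x = np.array([0., 1., 2.])
y = np.array([1.1, 3.9, 7.2])            # slightly noisy targets

def loss(x):
    return np.sum((y - (a * x + b)) ** 2)

grad_x = -2 * a * (y - (a * x + b))      # dL/dx for each data point
x_adv = x + 0.01 * grad_x                # small step *up* the loss surface

before, after = loss(x), loss(x_adv)
```

The perturbation moves each point by about 0.01 or less, far too little to notice on a plot, yet the loss strictly increases.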
Well that was a very simple model that we’ve just messed with, deep learning
is much more complicated than that…
Guess what? It’s not. Everything we just did has a direct equivalent in the
world of deep learning. When we are training a neural network to classify
images, the loss function is usually a categorical cross entropy, the model
parameters are the weights of the network and the inputs are the pixel values
of the image.
Let x be the original image, y the class of x, θ the weights of the network
and L(θ, x, y) the loss function used to train the network.
First, we compute the gradient of the loss function with respect to the input
pixels. The ∇ operator is just a concise mathematical way of taking the
derivatives of a function with respect to many of its parameters. You can think
of ∇ₓL(θ, x, y) as a matrix of shape [width, height, channels] containing the
slopes of the tangents.
As before, we are only interested in the sign of the slopes, to know whether we
want to increase or decrease the pixel values. We multiply these signs by a very
small value ε to ensure that we do not go too far on the loss function surface
and that the perturbation stays imperceptible. This gives our perturbation:
η = ε · sign(∇ₓL(θ, x, y)). Our final image is just the original image to which
we add the perturbation η.
FGSM applied to an image. The original is classified as ‘king penguin’ with 100%
confidence and the perturbed one is classified as ‘tripod’ with 71% confidence.
The family of attacks where you are able to compute gradients using the
target model is called white-box attacks.
Now you could tell me that the attack I’ve just presented is not really realistic
as you’re unlikely to get access to the gradients of the loss function on a self-
driving car. Researchers thought exactly the same thing and, in this paper,
they found a way to deal with it.
In a more realistic context, you would want to attack a system having only
access to its outputs. The problem with this is that you would not be able to
apply the FGSM algorithm anymore as you would not have access to the
network itself.
The solution proposed is to train a new neural network M’ to solve the same
classification task as the target model M. Then, when M’ is trained, use it to
generate adversarial samples with the gradient-based attack described above.
What they found is that M will very often misclassify adversarial samples
generated using M’. Moreover, if we do not have access to a proper training
set for M’, we can build one using M’s predictions as truth values. The authors
call these synthetic inputs. This is an excerpt of their article in which they
describe their attack on the MetaMind network to which they did not have
access:
“After labeling 6,400 synthetic inputs to train our substitute (an order of
magnitude smaller than the training set used by MetaMind) we find that their
DNN misclassifies adversarial examples crafted with our substitute at a rate
of 84.24%”.
This kind of attack is called a black-box attack, as you treat the target model as a
black box.
So, even when the attacker does not have access to the internals of the model,
they can still produce adversarial samples that will fool it. But this attack
context is not realistic either: in a real scenario, the attacker would not be
allowed to provide their own image files; the neural network would take camera
pictures as input. That’s the problem the authors of this article are trying to
solve.
What they noticed is that when you print adversarial samples which have been
generated with a high-enough ε and then take a picture of the print and
classify it, the neural network is still fooled a significant portion of the time.
The authors recorded a video to showcase their results.
UNIT 5
Optimization for Training Deep Models
Challenges in Neural Network Optimization
Optimization is one of the broadest areas of research in the deep learning space. In
previous articles, I explained the differences between optimization and
regularization as two of the fundamental techniques used to improve deep learning
models. There are several types of optimization in deep learning algorithms but the
most interesting ones are focused on reducing the value of cost functions.
When we say that optimization is one of the key areas of deep learning, we are not
exaggerating. In real-world deep learning implementations, data scientists often
spend more time refining and optimizing models than building new ones. What
makes deep learning optimization such a difficult endeavor? To answer that, we
need to understand some of the principles behind this type of optimization.
The core of deep learning optimization relies on trying to minimize the cost
function of a model without affecting its training performance. That type of
optimization problem contrasts with the general optimization problem in which the
objective is to simply minimize a specific indicator without being constrained by
the performance of other elements (e.g., training).
Deep learning optimization techniques are typically classified into different groups of
algorithms depending on the way they interact with the training dataset. For
instance, algorithms that use the entire training set at once are called deterministic.
Other techniques that use one training example at a time have come to be known as
online algorithms. Similarly, algorithms that use more than one but fewer than all of
the training examples during the optimization process are known as minibatch
stochastic, or simply stochastic. The most famous method of stochastic optimization,
which is also the most common algorithm in deep learning solutions, is known as
stochastic gradient descent (SGD) (read my previous article about SGD).
There are plenty of challenges in deep learning optimization but most of them are
related to the nature of the gradient of the model. Below, I’ve listed some of the
most common challenges in deep learning optimization that you are likely to run
into:
b) Flat Regions: In deep learning optimization models, flat regions are common
areas that represent both a local minimum for a sub-region and a local maximum for
another. That duality often causes the gradient to get stuck.
c) Inexact Gradients: There are many deep learning models in which the cost
function is intractable, which forces an inexact estimation of the gradient. In these
cases, the inexact gradients introduce a second layer of uncertainty into the model.
d) Local vs. Global Structures: Another very common challenge in the optimization
of deep learning models is that local regions of the cost function don’t correspond
with its global structure, producing a misleading gradient.
Optimization Strategies
Optimizer algorithms are optimization methods that help improve a deep learning
model’s performance. These optimization algorithms, or optimizers, widely affect the
accuracy and training speed of the deep learning model. But first of all, the question
arises: what is an optimizer?
While training a deep learning model, an optimizer modifies the weights in each epoch
and minimizes the loss function. An optimizer is a function or an algorithm that
adjusts the attributes of the neural network, such as weights and learning rates. Thus,
it helps in reducing the overall loss and improving accuracy. The problem of
choosing the right weights for the model is a daunting task, as a deep learning model
generally consists of millions of parameters. It raises the need to choose a suitable
optimization algorithm for your application. Hence understanding these machine
learning algorithms is necessary for data scientists before having a deep dive into
the field.
You can use different optimizers in the machine learning model to change your weights
and learning rate. However, choosing the best optimizer depends upon the application.
As a beginner, one tempting thought is to try all the possibilities
and choose the one that shows the best results. This might be fine initially, but when
dealing with hundreds of gigabytes of data, even a single epoch can take considerable
time. So randomly choosing an algorithm is no less than gambling with your precious
time, as you will realize sooner or later in your journey.
This guide will cover various deep-learning optimizers, such as Gradient Descent,
Stochastic Gradient Descent, Stochastic Gradient descent with momentum, Mini-Batch
Gradient Descent, Adagrad, RMSProp, AdaDelta, and Adam. By the end of the article,
you can compare various optimizers and the procedure they are based upon.
Before proceeding, there are a few terms that you should be familiar with.
Epoch – The number of times the algorithm runs on the whole training dataset.
Batch – It denotes the number of samples taken for updating the model
parameters.
Learning rate – It is a parameter that tells the model how much the model
weights should be updated at each step.
Cost Function/Loss Function – A cost function is used to calculate the cost, which is
the difference between the predicted value and the actual value.
Weights/Bias – The learnable parameters in a model that control the signal between
two neurons.
Gradient Descent can be considered the popular kid among the class of optimizers. This
optimization algorithm uses calculus to modify the values consistently and to achieve
the local minimum. Before moving ahead, you might have the question of what a
gradient is.
In simple terms, imagine you are holding a ball resting at the top of a bowl. When you
release the ball, it rolls along the steepest direction and eventually settles at the bottom of
the bowl. The gradient points the ball in the steepest direction to reach the local
minimum, which is the bottom of the bowl.
The gradient descent update rule is w := w − α · ∇L(w). Here alpha (α) is the step size
that represents how far to move against the gradient in each iteration.
1. It starts with some initial coefficients, evaluates their cost, and searches for a cost
value lower than the current one.
2. It moves towards the lower cost and updates the values of the coefficients.
3. The process repeats until the local minimum is reached. A local minimum is a
point beyond which it cannot proceed.
Gradient descent works best for most purposes. However, it has some downsides too.
It is expensive to calculate the gradients if the size of the data is huge. Gradient descent
works well for convex functions, but it doesn’t know how far to travel along the gradient
for nonconvex functions.
At the end of the previous section, you learned why using gradient descent on
massive data might not be the best option. To tackle the problem, we have stochastic
gradient descent. The term stochastic refers to the randomness the algorithm is
based upon. In stochastic gradient descent, instead of taking the whole dataset for
each iteration, we randomly select batches of data; that means we only take a
few samples from the dataset at a time.
The procedure is to first select the initial parameters w and learning rate η, then
randomly shuffle the data at each iteration to reach an approximate minimum.
Since we are not using the whole dataset but batches of it for each iteration, the
path taken by the algorithm is noisier than that of the gradient descent
algorithm. Thus, SGD uses a higher number of iterations to reach the local minimum,
and due to this increase in the number of iterations, the overall computation time
increases. But even after increasing the number of iterations, the computation cost
is still less than that of the gradient descent optimizer. So the conclusion is: if the data
is enormous and computation time is an essential factor, stochastic gradient
descent should be preferred over the batch gradient descent algorithm.
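A quick sketch of the idea on a toy line-fitting problem (the data and hyperparameters are my own choices for illustration): each update uses a small random batch instead of the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3 * x + 1                              # targets from the line y = 3x + 1

a, b, lr, batch = 0.0, 0.0, 0.1, 16
for _ in range(3000):
    idx = rng.integers(0, len(x), batch)   # pick a random mini-batch
    err = (a * x[idx] + b) - y[idx]        # residuals on the batch only
    a -= lr * np.mean(2 * err * x[idx])    # noisy estimate of the full gradient
    b -= lr * np.mean(2 * err)
```

Each step is cheap (16 points instead of 200), and despite the noisy path the parameters still converge to a ≈ 3, b ≈ 1.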
As discussed in the earlier section, you have learned that stochastic gradient descent
takes a much more noisy path than the gradient descent algorithm. Due to this reason,
it requires a more significant number of iterations to reach the optimal minimum, and
hence computation time is very slow. To overcome the problem, we use stochastic
gradient descent with a momentum algorithm.
Momentum helps the loss function converge faster. Stochastic
gradient descent oscillates between either direction of the gradient and updates the
weights accordingly; adding a fraction of the previous update to the current
update makes the process faster. One thing to remember while
using this algorithm is that the learning rate should be decreased when a high momentum
term is used.
In the above image, the left part shows the convergence graph of the stochastic gradient
descent algorithm. At the same time, the right side shows SGD with momentum. From
the image, you can compare the path chosen by both algorithms and realize that using
momentum helps reach convergence in less time. You might be thinking of using a
large momentum and learning rate to make the process even faster. But remember that
while increasing the momentum, the possibility of passing the optimal minimum also
increases. This might result in poor accuracy and even more oscillations.
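The update can be sketched in a few lines (toy objective f(w) = (w − 4)² and constants chosen by me for illustration): the velocity v carries a fraction beta of the previous update into the current one.

```python
def sgd_momentum(lr=0.05, beta=0.9, steps=200):
    w, v = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (w - 4)        # gradient of f(w) = (w - 4)**2
        v = beta * v + g       # fraction of the previous update + new gradient
        w -= lr * v            # step along the accumulated velocity
    return w
```

Note that the learning rate here is deliberately modest (0.05) to pair with the high momentum term (0.9), as discussed above; the iterate spirals into the minimum at w = 4.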
In this variant of gradient descent, instead of taking all the training data, only a subset
of the dataset is used for calculating the loss function. Since we are using a batch of
data instead of taking the whole dataset, fewer iterations are needed. That is why the
mini-batch gradient descent algorithm is faster than both stochastic gradient descent
and batch gradient descent algorithms. This algorithm is more efficient and robust than
the earlier variants of gradient descent. As the algorithm uses batching, all the training
data need not be loaded in the memory, thus making the process more efficient to
implement. Moreover, the cost function in mini-batch gradient descent is noisier than
the batch gradient descent algorithm but smoother than that of the stochastic gradient
descent algorithm. Because of this, mini-batch gradient descent is ideal and provides a
good balance between speed and accuracy.
Despite all that, the mini-batch gradient descent algorithm has some downsides too. It
needs a hyperparameter, the mini-batch size, which must be tuned to achieve
the required accuracy (although a batch size of 32 is considered appropriate for
almost every case). Also, in some cases, it results in poor final accuracy. This gives
rise to the need to look for other alternatives.
The adaptive gradient descent algorithm (Adagrad) is slightly different from other gradient
descent algorithms, because it uses a different learning rate for each iteration. The change
in learning rate depends on how much the parameters change during training: the more
a parameter gets updated, the smaller its learning rate becomes. This
modification is highly beneficial because real-world datasets contain sparse as well as
dense features, so it is unfair to have the same learning rate for all the features.
The Adagrad algorithm uses the formula α(t) = η / √(Gₜ + ε) for the learning rate used to
update the weights. Here alpha(t) denotes the learning rate at iteration t, Gₜ is the
accumulated sum of squared gradients, η is a constant, and ε is a small
positive value to avoid division by 0.
The benefit of using Adagrad is that it abolishes the need to modify the learning rate
manually. It is more reliable than gradient descent algorithms and their variants, and it
reaches convergence at a higher speed.
One downside of the AdaGrad optimizer is that it decreases the learning rate
aggressively and monotonically. There might be a point when the learning rate becomes
extremely small. This is because the squared gradients in the denominator keep
accumulating, and thus the denominator part keeps on increasing. Due to small learning
rates, the model eventually becomes unable to acquire more knowledge, and hence the
accuracy of the model is compromised.
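The accumulation just described can be sketched as follows (toy objective f(w) = (w − 4)² and constants are my own choices); note how G only ever grows, shrinking the effective step.

```python
import numpy as np

def adagrad(lr=1.0, eps=1e-8, steps=500):
    w, G = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (w - 4)                    # gradient of f(w) = (w - 4)**2
        G += g * g                         # squared gradients accumulate monotonically
        w -= lr * g / (np.sqrt(G) + eps)   # per-parameter, ever-shrinking step
    return w
```

On this one-dimensional problem the shrinking step is harmless and the iterate reaches w ≈ 4; in long training runs the same accumulation is what eventually stalls learning.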
RMS prop is one of the most popular optimizers among deep learning enthusiasts.
Interestingly, it has never been formally published, yet it is very well known in the community.
RMS prop is essentially an extension of the earlier Rprop algorithm. It resolves the problem of
varying gradients: some gradients are small while others may be huge, so defining a
single learning rate might not be the best idea.
Rprop uses the gradient sign, adapting the step size individually for each weight. In
this algorithm, two successive gradients are first compared by sign. If they have the same
sign, we are going in the right direction, so the step size is increased by a small fraction. If
they have opposite signs, we must decrease the step size. The step size is then bounded,
and the weight update can be applied.
The problem with Rprop is that it doesn’t work well with large datasets and
mini-batch updates. So, achieving the robustness of Rprop and the efficiency of
mini-batches simultaneously was the main motivation behind the rise of RMS prop.
RMS prop keeps a leaky (exponentially decaying) average of the squared gradients,
E[g²]ₜ = γ · E[g²]ₜ₋₁ + (1 − γ) · gₜ²,
where gamma (γ) is the forgetting factor. Weights are then updated as wₜ₊₁ = wₜ − (η / √(E[g²]ₜ + ε)) · gₜ.
In simpler terms, if there exists a parameter due to which the cost function oscillates a
lot, we want to penalize the update of this parameter. Suppose you built a model to
classify a variety of fishes. The model relies on the factor ‘color’ mainly to differentiate
between the fishes. Due to this, it makes a lot of errors. What RMS Prop does is
penalize the updates of the parameter ‘color’ so that the model can rely on other features
too. This prevents the algorithm from adapting too quickly to changes in the parameter
‘color’ compared to other parameters. This algorithm has several benefits over earlier
versions of gradient descent algorithms: it converges quickly and requires less
tuning than gradient descent algorithms and their variants.
The problem with RMS Prop is that the learning rate has to be defined manually, and
the suggested value doesn’t work for every application.
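The leaky-average update above can be sketched directly (same toy objective f(w) = (w − 4)² and constants of my own choosing):

```python
import numpy as np

def rmsprop(lr=0.01, gamma=0.9, eps=1e-8, steps=2000):
    w, Eg2 = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (w - 4)                          # gradient of f(w) = (w - 4)**2
        Eg2 = gamma * Eg2 + (1 - gamma) * g * g  # leaky average; gamma = forgetting factor
        w -= lr * g / (np.sqrt(Eg2) + eps)       # gradient normalized by its recent scale
    return w
```

Because the average is leaky rather than a running sum, the effective step does not decay to zero the way Adagrad's does; the learning rate itself still has to be chosen by hand.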
AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is based
upon adaptive learning and is designed to deal with significant drawbacks of AdaGrad
and RMS prop optimizer. The main problem with the above two optimizers is that the
initial learning rate must be defined manually. One other problem is the decaying
learning rate which becomes infinitesimally small at some point. Due to this, a certain
number of iterations later, the model can no longer learn new knowledge.
To deal with these problems, AdaDelta uses two state variables: one stores a leaky
average of the second moment of the gradients, and the other a leaky average of the
second moment of the changes in the model’s parameters.
Here St and delta Xt denote the state variables, g’t denotes the rescaled gradient, delta Xt-1
denotes the squared rescaled gradients, and epsilon represents a small positive constant to
handle division by 0.
The Adam optimizer is an extension of the stochastic gradient descent (SGD)
algorithm and is designed to update the weights of a neural network during training.
The name “Adam” is derived from “adaptive moment estimation,” highlighting its
ability to adaptively adjust the learning rate for each network weight individually.
Unlike SGD, which maintains a single learning rate throughout training, Adam
optimizer dynamically computes individual learning rates based on the past gradients
and their second moments.
By incorporating both the first moment (mean) and second moment (uncentered
variance) of the gradients, Adam optimizer achieves an adaptive learning rate that can
efficiently navigate the optimization landscape during training. This adaptivity helps in
faster convergence and improved performance of the neural network.
In practice, Adam also offers a fast running time and low memory requirements, and it
requires less tuning than most other optimization algorithms.
The above formula represents the working of the Adam optimizer. Here β1 and β2 represent
the decay rates of the moving averages of the gradients and their squares.
If the Adam optimizer combines the good properties of all these algorithms and is the best
available optimizer, why shouldn’t you use Adam in every application? And what
was the need to learn about the other algorithms in depth? Because even Adam has
some downsides. It tends to focus on faster computation time, whereas algorithms like
stochastic gradient descent focus on the data points. That’s why algorithms like SGD
generalize the data in a better manner, at the cost of lower computation speed. So the
optimization algorithm should be picked according to the requirements and
the type of data.
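A toy sketch of the Adam update (my own constants, minimizing f(w) = (w − 4)² rather than a real network): both moment estimates start at zero, so the early estimates are bias-corrected before use.

```python
import numpy as np

def adam(lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=800):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 4)                 # gradient of f(w) = (w - 4)**2
        m = b1 * m + (1 - b1) * g       # first moment: leaky mean of gradients
        v = b2 * v + (1 - b2) * g * g   # second moment: leaky uncentered variance
        m_hat = m / (1 - b1 ** t)       # bias corrections for the zero init
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```

The per-parameter normalization by √v̂ makes the early steps roughly lr in size regardless of the gradient's scale, which is one reason Adam needs so little tuning.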
The above visualizations create a better picture in mind and help in comparing the
results of various optimization algorithms.
Meta-Algorithms
Introduction
The performance of a learning model depends on its training dataset, algorithm, and
parameters, and many experiments are needed to find the best-performing algorithm and
algorithm parameters. Meta-learning approaches help researchers find these and reduce the
number of experiments needed. The result is better predictions in less time.
Meta-learning can be used for various machine learning models (e.g., few-
shot, Reinforcement Learning, natural language processing, etc.). Meta-learning
algorithms make predictions by inputting the outputs and metadata of machine
learning algorithms. Meta-learning algorithms can learn to use the best predictions
from machine learning algorithms to make better predictions. In computer science,
meta-learning studies and approaches started in the 1980s and became popular after
the works of Jürgen Schmidhuber and Yoshua Bengio on the subject.
What is Meta-learning?
Meta-learning, described as “learning to learn”, is a subset of machine learning in the
field of computer science. It is used to improve the results and performance of the
learning algorithm by changing some aspects of the learning algorithm based on the
results of the experiment. Meta-learning helps researchers understand which
algorithms generate the best/better predictions from datasets.
Meta-learning algorithms use learning algorithm metadata as input. They then make
predictions and provide information about the performance of these learning algorithms.
Applications
Large-scale deep learning refers to the application of deep learning
techniques to massive datasets and high-computational resources. It
involves training deep neural networks on large datasets with millions or
even billions of data points using powerful hardware such as multiple
GPUs or specialized hardware like TPUs (Tensor Processing Units).
The need for large-scale deep learning arises from the complexity and
capacity of deep neural networks. Deep learning models, particularly
deep neural networks with multiple hidden layers, have the ability to
learn intricate patterns and representations from data, making them
highly effective in various tasks like image recognition and natural language processing.
Computer Vision
Computer Vision in deep learning is a specialized field of artificial
intelligence (AI) that focuses on teaching computers to interpret and
understand visual information from the world. It involves the use of
deep learning techniques to process, analyze, and extract meaningful
insights from images and videos.
The primary goal of computer vision is to enable machines to perceive
the visual world in a manner similar to how humans do, and to make
decisions or take actions based on that understanding. Deep learning,
specifically deep neural networks, has revolutionized computer vision by enabling
models to learn visual features directly from raw image data.
MID 1
PART A
Fill in the Blanks
1. Artificial Neural Networks (ANNs) are a fundamental component
of _________ and _________.
Answer: Machine learning and artificial intelligence.
2. ANNs are designed to mimic the structure and functioning of the
_________.
Answer: Human brain.
3. The simplest form of an artificial neuron is called the _________.
Answer: Perceptron.
4. Multi-Layer Perceptron (MLP) consists of an input layer, one or
more _________ layers, and an output layer.
Answer: Hidden.
5. Convolutional Neural Networks (CNNs) are specialized for tasks
related to _________.
Answer: Image processing and computer vision.
6. Recurrent Neural Networks (RNNs) are suitable for processing
_________ data.
Answer: Sequential.
7. The basic computational unit in an ANN is called a _________.
Answer: Neuron or Node.
8. Weights indicate the _________ of connections between neurons.
Answer: Strength.
9. The function applied to the weighted sum of inputs to introduce
non-linearity is known as the _________ function.
Answer: Activation.