Deep Learning Course File Aiml-1
COURSE FILE
Department of CSE
(Artificial Intelligence & Machine Learning)
(2023-2024)
Neural Networks and Deep Learning
COURSE FILE
SUBJECT: Neural Networks and Deep Learning
ACADEMIC YEAR: 2023-2024
REGULATION: R18
SUBJECT CODE:
INDEX
COURSE FILE
S.NO | TOPIC | PAGE NO
1 | PEOs, POs, PSOs | 3
2 | Syllabus Copy | 5
5 | Lesson Plan | 9
    a) Notes of Units
    b) Assignment Questions
    e) Objective Questions
PROGRAM EDUCATIONAL OBJECTIVES
PEO1: The graduates of the program will understand the concepts and principles of Computer Science and Engineering, inclusive of the basic sciences.
PEO2: The program equips learners with the technical skills necessary to design and implement computer systems and applications, conduct open-ended problem solving, and apply critical thinking.
PEO3: The graduates of the program will practice the profession by working effectively on teams, communicating in written and oral form, and upholding ethics, integrity, leadership, and social responsibility through safe engineering, contributing to the good of society.
PEO4: The program encourages students to treat learning as a lifelong activity and as a means to the creative discovery, development, and implementation of technology, and to keep up with the dynamic nature of the Computer Science and Engineering discipline.
PROGRAM OUTCOMES
Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
The engineer and society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
2. Syllabus Copy
Course Outcomes:
UNIT-I
Artificial Neural Networks: Introduction, Basic models of ANN, important terminologies, Supervised Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Network. Associative Memory Networks: Training Algorithms for pattern association, BAM and Hopfield Networks.
UNIT-II
Unsupervised Learning Networks: Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation Networks, Adaptive Resonance Theory Networks. Special Networks: Introduction to various networks.
UNIT - III
Introduction to Deep Learning, Historical Trends in Deep Learning, Deep Feed-forward Networks, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms.
UNIT - IV
Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise Robustness, Semi-Supervised Learning, Multi-task Learning, Early Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier.
UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms.
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing.
TEXT BOOKS:
1. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press.
2. Neural Networks and Learning Machines, Simon Haykin, 3rd Edition, Pearson Prentice Hall.
Day/Time | 09.15-10.15 | 10.15-11.15 | 11.15-12.15 | 12.15-01.15 | 01.15-02.00 | 02.00-03.00 | 03.00-04.00
TUE | IPR | RF | NN | CC | LUNCH | ASN | LIBRARY/SPORTS
WED | NN | ASN | RF | CC | LUNCH | SEMINAR |
THU | RF | IPR | NN LAB | CC | LUNCH | ASN |
Day (periods 09:15 a.m.-01:15 p.m.) | 01:15-02:00 | 02:00-04:00
MON | DL(CS), NN&DL | LUNCH | IOMP/SUMMER INTERNSHIP
TUE | DL(CS), NN&DL | LUNCH |
WED | NN&DL, DL(CS) | LUNCH |
THU | DL(CS), DL LAB | LUNCH |
FRI | DL(CS), NN&DL | LUNCH | IOMP/SUMMER INTERNSHIP
SAT | NN&DL | LUNCH | IOMP/SUMMER INTERNSHIP
5. Student List
MALLAREDDY INSTITUTE OF TECHNOLOGY & SCIENCE
CSE - AIML
Class: IV Year-I Sem B. Tech. Branch: B.Tech – CSE (AIML)
Batch: 2020-2024 A.Y:2023-2024
ROLL LIST
S. No H.T.NO NAME OF THE STUDENT
1 20S11A6601 AJAY KYADAVENI
2 20S11A6602 AKHIL DESAI
3 20S11A6603 AMRUTHA S.V.S
4 20S11A6604 ANUDEEP DHURGAM
5 20S11A6605 ASHWINI KASHI
6 20S11A6606 BHARATH D
7 20S11A6607 BHAVANA GOLLAPALLY
8 20S11A6608 BHAVANA KAMMARI
9 20S11A6609 BHAVANI SHANKER C V
10 20S11A6610 CHANDANA TIGULLA
11 20S11A6611 CHANDRASHEKHARA PRAMOD
12 20S11A6612 DINESH SADANANAD
13 20S11A6613 HARI KRISHNA SAMBARI
14 20S11A6614 JASHWANTH BOMMAKANTI
15 20S11A6615 KAMAL SANJAY
16 20S11A6616 LAXMI PRASANNA A A S
17 20S11A6617 M N AJAY VARMA PENMATSA
18 20S11A6618 MAHESH BOLLABATHULA
19 20S11A6619 MANISH SAI KUMAR KOSURU
20 20S11A6620 NANDINI REPALLE
21 20S11A6621 NIHARIKA CH
22 20S11A6622 NISHANK YARLAGADDA
23 20S11A6623 NITHIN GOUD BEESU
24 20S11A6624 PAVAN SAI GORUPUTI
5. Lesson Plan

Lesson No. | Date | No. of Periods | Topic/Sub Topic | Mode of Teaching | Course Outcome | Reference Text Book
1.5  | 4.08.23  | 1 | Adaptive Linear Neuron | PPT | CO1 | T1
1.6  | 5.08.23  | 1 | Back-propagation Network, Associative Memory Networks | PPT | | T1
1.7  | 7.08.23  | 1 | Training Algorithms for pattern association | PPT | | T1
1.8  | 8.08.23  | 1 | BAM and Hopfield Networks | PPT | | T1
2.1  | 9.08.23  | 1 | Unsupervised Learning Network - Introduction | PPT | | T1
4.10 | 23.09.23 | 1 | Dropout, Adversarial Training | PPT | CO1 | T1
4.11 | 25.09.23 | 1 | Tangent Distance, Tangent Prop and Manifold Tangent Classifier | PPT | | T1
5.1  | 25.09.23 | 1 | Introduction, Challenges in Neural Network Optimization | PPT | | T1
5.2  | 23.09.23 | 1 | Basic Algorithms | PPT | | T1
5.3  | 23.09.23 | 1 | Parameter Initialization Strategies | PPT | | T1
5.4  | 23.09.23 | 1 | Algorithms with Adaptive Learning Rates | PPT | | T1
5.5  | 16.11.23 | 1 | Approximate Second-Order Methods | PPT | | T1
5.6  | 18.11.23 | 1 | Optimization Strategies and Meta-Algorithms | PPT | | T1
5.7  | 20.11.23 | 1 | Optimization Strategies and Meta-Algorithms | PPT | | T1
5.8  | 22.11.23 | 1 | Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing | PPT | | T1
5.11 | 27.11.23 | 1 | Virtualization Services Provided by SAP, Salesforce, Sales Cloud | PPT | | T1
5.12 | 29.11.23 | 1 | Service Cloud: Knowledge as a Service, Rackspace, VMware, Manjrasoft, Aneka Platform | PPT | | T1

Assignment Test: Unit 5
6. Lecture notes
Unit 1
Artificial Neural Networks
Machine Learning is a subset of artificial intelligence that helps you build AI-driven
applications.
Deep Learning is a subset of machine learning that uses vast volumes of data and complex
algorithms to train a model.
Artificial intelligence, commonly referred to as AI, is the process of imparting data, information,
and human intelligence to machines. The main goal of Artificial Intelligence is to develop self-
reliant machines that can think and act like humans. These machines can mimic human behavior
and perform tasks by learning and problem-solving. Most of the AI systems simulate natural
intelligence to solve complex problems.
Amazon Echo is a smart speaker that uses Alexa, the virtual assistant AI technology developed
by Amazon. Amazon Alexa is capable of voice interaction, playing music, setting alarms,
playing audiobooks, and giving real-time information such as news, weather, sports, and traffic
reports.
For example, suppose the person wants to know the current temperature in Chicago. The person's voice is first converted into a machine-readable format. The formatted data is then fed into the Amazon Alexa system for processing and analysis. Finally, Alexa returns the desired voice output via Amazon Echo.
Now that you’ve been given a simple introduction to the basics of artificial intelligence, let’s
have a look at its different types.
Reactive Machines - These are systems that only react. These systems don’t form memories, and
they don’t use any past experiences for making new decisions.
Limited Memory - These systems reference the past, and information is added over a period of
time. The referenced information is short-lived.
Theory of Mind - This covers systems that are able to understand human emotions and how they
affect decision making. They are trained to adjust their behavior accordingly.
Self-awareness - These systems are designed and created to be aware of themselves. They
understand their own internal states, predict other people’s feelings, and act appropriately.
Now that we have gone over the basics of artificial intelligence, let’s move on to machine
learning and see how it works.
Machine learning is a discipline of computer science that uses computer algorithms and analytics
to build predictive models that can solve business problems.
As per McKinsey & Co., machine learning is based on algorithms that can learn from data
without relying on rules-based programming.
Tom Mitchell’s book on machine learning says “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.”
So you see, machine learning has numerous definitions. But how does it really work?
Machine learning accesses vast amounts of data (both structured and unstructured) and learns
from it to predict the future. It learns from the data by using multiple algorithms and techniques.
Now that you have been introduced to the basics of machine learning and how it works, let’s see
the different types of machine learning methods.
1. Supervised Learning
In supervised learning, the data is already labeled, which means you know the target variable.
Using this method of learning, systems can predict future outcomes based on past data. It
requires that at least an input and output variable be given to the model for it to be trained.
Below is an example of a supervised learning method. The algorithm is trained using labeled
data of dogs and cats. The trained model predicts whether the new image is that of a cat or a dog.
Some examples of supervised learning include linear regression, logistic regression, support
vector machines, Naive Bayes, and decision tree.
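The labeled-data workflow above can be sketched in a few lines of code. The example below uses a toy nearest-centroid classifier (simpler than the algorithms listed, but the same idea: learn from labeled examples, then predict on new data); the cat/dog feature values are invented purely for illustration:

```python
import numpy as np

# Toy labeled data: two made-up features per animal (say, weight in kg and ear length in cm)
X_train = np.array([[30.0, 8.0], [25.0, 9.0],   # dogs
                    [4.0, 4.0],  [5.0, 3.5]])   # cats
y_train = np.array(["dog", "dog", "cat", "cat"])

def predict(x):
    """Nearest-centroid rule: assign x to the class whose training mean is closest."""
    classes = np.unique(y_train)
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# A new, unseen animal is classified using what was learned from the labeled data
print(predict(np.array([28.0, 7.5])))  # -> dog
```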
2. Unsupervised Learning
Unsupervised learning algorithms employ unlabeled data to discover patterns from the data on
their own. The systems are able to identify hidden features from the input data provided. Once
the data is more readable, the patterns and similarities become more evident.
Below is an example of an unsupervised learning method that trains a model using unlabeled data. In this case, the data consists of different vehicles, and the purpose of the model is to group each kind of vehicle into its own cluster.
Some examples of unsupervised learning include k-means clustering, hierarchical clustering, and
anomaly detection.
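The clustering idea can be illustrated with a minimal k-means sketch. The one-dimensional "measurement" data below is synthetic, generated only so that two natural groups emerge without any labels:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic unlabeled data with two natural groups (centred near 1.0 and 5.0)
data = np.concatenate([rng.normal(1.0, 0.1, 20), rng.normal(5.0, 0.1, 20)])

def kmeans(x, k, iters=10):
    """Plain k-means on 1-D data: repeatedly assign each point to the nearest
    centre, then move each centre to the mean of its assigned points."""
    centres = np.linspace(x.min(), x.max(), k)   # simple deterministic initialisation
    for _ in range(iters):
        assign = np.abs(x[:, None] - centres[None, :]).argmin(axis=1)
        centres = np.array([x[assign == j].mean() for j in range(k)])
    return np.sort(centres)

print(kmeans(data, 2))  # two centres, close to 1.0 and 5.0
```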
3. Reinforcement Learning
The goal of reinforcement learning is to train an agent to complete a task within an uncertain
environment. The agent receives observations and a reward from the environment and sends
actions to the environment. The reward measures how successful an action is with respect to completing the task goal.
Examples of reinforcement learning algorithms include Q-learning and Deep Q-learning Neural
Networks.
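Tabular Q-learning, the first algorithm named above, can be sketched on a toy problem. The corridor environment here is invented for the example (states 0 to 3, with reward 1 for reaching state 3):

```python
import random

# Toy environment: states 0..3 in a corridor; action 0 = left, 1 = right.
# Reaching state 3 yields reward 1 and ends the episode.
N_STATES, GOAL = 4, 3
Q = [[0.0, 0.0] for _ in range(N_STATES)]     # Q-table: one value per (state, action)
alpha, gamma, eps = 0.5, 0.9, 0.1             # learning rate, discount, exploration
random.seed(0)

for _ in range(500):                          # episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda i: Q[s][i])
        s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])  # state values grow toward the goal
```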
Now that we’ve explored machine learning and its applications, let’s turn our attention to deep
learning, what it is, and how it is different from AI and machine learning.
Deep learning is a subset of machine learning that deals with algorithms inspired by the structure
and function of the human brain. Deep learning algorithms can work with an enormous amount
of both structured and unstructured data. Deep learning’s core concept lies in artificial neural
networks, which enable machines to make decisions.
The major difference between deep learning and machine learning is the way data is presented to the machine. Machine learning algorithms usually require structured data, whereas deep learning networks work on multiple layers of artificial neural networks.
The network has an input layer that accepts inputs from the data. The hidden layer is used to find
any hidden features from the data. The output layer then provides the expected output.
Here is an example of a neural network that uses large sets of unlabeled data of eye retinas. The
network model is trained on this data to find out whether or not a person has diabetic retinopathy.
Now that we have an idea of what deep learning is, let’s see how it works.
3. The activation function takes the “weighted sum of input” as the input to the function,
adds a bias, and decides whether the neuron should be fired or not.
5. The model output is compared with the actual output. After training the neural network,
the model uses the backpropagation method to improve the performance of the network.
The cost function helps to reduce the error rate.
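The steps above can be traced for a single neuron. The numbers below are arbitrary; the point is the order of operations: weighted sum, bias, activation, then a cost comparing the output with the target:

```python
import numpy as np

def sigmoid(z):
    """Squashes the net input into (0, 1); values near 1 mean the neuron 'fires'."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])    # inputs (arbitrary illustrative values)
w = np.array([0.4, 0.1, -0.6])    # weights
b = 0.2                           # bias

z = w @ x + b                     # weighted sum of the inputs plus bias
y = sigmoid(z)                    # activation decides how strongly the neuron fires

target = 1.0
cost = 0.5 * (y - target) ** 2    # squared-error cost that training tries to reduce
print(z, y, cost)
```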
In the following example, deep learning and neural networks are used to identify the number on a license plate. This technique is used by many countries to identify rule violators and speeding vehicles.
Convolutional Neural Network (CNN) - CNN is a class of deep neural networks most commonly
used for image analysis.
Recurrent Neural Network (RNN) - RNN uses sequential information to build a model. It often
works better for models that have to memorize past data.
Generative Adversarial Network (GAN) - GANs are algorithmic architectures that use two neural networks to create new, synthetic instances of data that can pass for real data. A GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers.
Deep Belief Network (DBN) - A DBN is a generative graphical model composed of multiple layers of latent variables called hidden units. There are connections between the layers, but not between the units within each layer.
Music generation
Image coloring
Object detection
1. There are three layers in the network architecture: the input layer, the hidden layer (there can be more than one), and the output layer. Because of the numerous layers, these networks are sometimes referred to as deep neural networks.
2. It is possible to think of the hidden layer as a “distillation layer,” which extracts some of
the most relevant patterns from the inputs and sends them on to the next layer for further
analysis. It accelerates and improves the efficiency of the network by recognizing just the
most important information from the inputs and discarding the redundant information.
3. The activation function is important for two reasons: first, it decides whether a neuron should be activated; second, it introduces non-linearity into the network.
4. This model captures the presence of non-linear relationships between the inputs.
4. Finding the optimal values of W (the weights) that minimize prediction error is critical to building a successful model. The backpropagation algorithm does this, turning the ANN into a learning algorithm that learns from its mistakes.
5. The optimization approach uses a gradient descent technique to reduce prediction errors. To find the optimum value of W, small adjustments to W are tried and their impact on the prediction error is examined. The values of W for which further changes no longer reduce the error are chosen as the final weights.
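Point 5 can be made concrete with a one-weight example. The sketch below fits y = w*x by gradient descent on toy data; here the gradient is computed analytically rather than by trying small adjustments, but the idea of repeatedly nudging W against the prediction error is the same:

```python
import numpy as np

# Toy data generated by the true relationship y = 2 * x
x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 4.0, 6.0])

w = 0.0                               # start from an arbitrary weight
lr = 0.05                             # learning rate (size of each adjustment)
for _ in range(200):
    y = w * x                         # predictions with the current weight
    grad = np.mean(2 * (y - t) * x)   # derivative of mean squared error w.r.t. w
    w -= lr * grad                    # adjust w opposite to the gradient

print(round(w, 3))  # converges close to the true value 2.0
```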
ANNs offer many key benefits that make them particularly well-suited to specific problems and situations:
1. ANNs can learn and model non-linear and complicated interactions, which is critical since many of the relationships between inputs and outputs in real life are non-linear and complex.
2. ANNs can generalize: after learning from the original inputs and their associations, the model can infer unseen relationships from new data, allowing it to generalize and predict on unknown data.
3. ANNs do not impose any constraints on the input variables, unlike many other prediction approaches (for example, on how they should be distributed). Furthermore, numerous studies have demonstrated that ANNs can better model heteroskedasticity, i.e. data with high volatility and non-constant variance, because of their capacity to discover latent correlations in the data without imposing preset associations. This is particularly helpful in financial time-series forecasting (for example, stock prices) where data volatility is significant.
ANNs have a wide range of applications because of their unique properties. A few of the important
applications of ANNs include:
1. Image Recognition:
Image recognition is a rapidly evolving discipline with several applications, ranging from facial identification on social media to cancer detection in medicine to satellite image processing for agricultural and defense purposes.
Deep neural networks, which form the core of "deep learning," have opened up new and transformative advances in computer vision, speech recognition, and natural language processing thanks to ANN research, with self-driving vehicles being a notable example.
2. Forecasting:
Forecasting is widely used in everyday company decisions (sales, the financial allocation between
goods, and capacity utilization), economic and monetary policy, finance, and the stock market.
Forecasting issues are frequently complex; for example, predicting stock prices is complicated by many underlying variables (some known, some unseen). Traditional forecasting models have flaws when it comes to accounting for these complicated, non-linear interactions. Given their capacity to model and extract previously unknown characteristics and correlations, ANNs can provide a reliable alternative when used correctly. ANNs also place no restrictions on the input and residual distributions, unlike conventional models.
1. Hardware Dependence:
The construction of artificial neural networks requires parallel processors; as a result, realizing the network depends on the available hardware.
The structure of artificial neural networks is not determined by any precise rule; a suitable network structure is developed through experience and trial and error.
Training is considered complete when the network's error on the sample decreases to a specified amount, but that value does not necessarily produce the best outcomes.
The ANN learns through various learning algorithms that are described as supervised or
unsupervised learning.
In supervised learning algorithms, the target values are labeled. Its goal is to try to reduce
the error between the desired output (target) and the actual output for optimization. Here, a
supervisor is present.
In unsupervised learning algorithms, the target values are not labeled and the network learns
by itself by identifying the patterns through repeated trials and experiments.
ANN Terminology:
Weights: each neuron is linked to other neurons through connection links that carry a weight. The weight holds information about the input signal, and the output depends solely on the weights and the input signal. The weights can be presented in matrix form, known as the connection matrix; if there are "n" nodes with each node having "m" weights, the weights form an n x m matrix.
Bias: Bias is a constant that is added to the weighted sum of the inputs to compute the net input. It is used to shift the result to the positive or negative side: a positive bias increases the net input, while a negative bias decreases it.
Here, {1, x1, ..., xn} are the inputs, and the output Y of the neuron is computed by the function g(x), which sums up all the inputs and adds the bias to it:
g(x) = ∑ xi + b, where i = 1 to n
     = x1 + ... + xn + b
and the role of the activation is to provide the output depending on the result of the summation function:
Y = 1 if g(x) >= 0
Y = 0 otherwise
Threshold: A threshold value is a constant value that is compared to the net input to get the
output. The activation function is defined based on the threshold value to calculate the output.
For Example:
Y=1 if net input>=threshold
Y=0 else
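The summation-plus-threshold rule above can be written directly in code. This is a sketch of the definitions exactly as given (unweighted sum of the inputs plus bias, compared against a threshold):

```python
def neuron_output(inputs, bias, threshold=0.0):
    """Sum the inputs, add the bias, and fire (output 1) only when the
    net input reaches the threshold, as described above."""
    g = sum(inputs) + bias
    return 1 if g >= threshold else 0

print(neuron_output([0.5, 0.3], bias=-0.6))                 # net input 0.2 >= 0 -> 1
print(neuron_output([0.5, 0.3], bias=-0.6, threshold=0.5))  # 0.2 < 0.5 -> 0
```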
Learning Rate: The learning rate, denoted α, ranges from 0 to 1. It scales the weight adjustments during the learning of the ANN.
Target value: Target values are the correct values of the output variable, also known simply as targets.
Error: The error is the inaccuracy of the predicted output values compared to the target values.
Supervised Learning
As the name suggests, supervised learning takes place under the supervision of a
teacher. This learning process is dependent. During the training of ANN under
supervised learning, the input vector is presented to the network, which will produce
an output vector. This output vector is compared with the desired/target output
vector. An error signal is generated if there is a difference between the actual output
and the desired/target output vector. On the basis of this error signal, the weights
would be adjusted until the actual output is matched with the desired output.
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means the input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data held out from training, and then it predicts the output.
The working of supervised learning can be easily understood by the following example:
o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
o Determine the suitable algorithm for the model, such as support vector
machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation
sets as the control parameters, which are the subset of training datasets.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs, the model is accurate.
1. Regression
Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which
means there are two classes such as Yes-No, Male-Female, True-false, etc.
Spam filtering is a common example. Popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
o With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Perceptron
o Input Nodes:
This is the primary component of the perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units and is another important component of the perceptron. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Bias can be considered as the intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire
or not. Activation Function can be considered primarily as a step function.
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the desired outputs. The activation function used in a perceptron model (e.g., sign, step, or sigmoid) may differ depending on whether the learning process is slow or suffers from vanishing or exploding gradients.
This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is indicative
of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation
function curve up or down.
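The three activation functions named above, in their common textbook forms, can be written as:

```python
import math

def step(z):
    """Binary step: maps the net input to {0, 1}."""
    return 1 if z >= 0 else 0

def sign(z):
    """Sign function: maps the net input to {-1, 1}."""
    return 1 if z >= 0 else -1

def sigmoid(z):
    """Smoothly squashes the net input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(step(-0.3), sign(-0.3), round(sigmoid(-0.3), 3))
```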
Step-1
First, multiply all input values by their corresponding weight values and add them together to determine the weighted sum. A special term called the bias 'b' is then added to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
In a single-layer perceptron model, the algorithm has no previously recorded data, so it begins with randomly allocated weight parameters. It then sums the weighted inputs; if the total sum exceeds a pre-determined value, the model is activated and shows the output value as +1.
If the outcome matches the pre-determined or threshold value, the performance of the model is considered satisfactory and no weight changes are demanded. However, this model runs into discrepancies when multiple input values are fed into it. Hence, to obtain the desired output and minimize errors, some changes to the weights are necessary.
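The two steps, together with the weight adjustment just described, amount to the classic perceptron learning rule. The sketch below trains it on the AND function (a linearly separable toy problem); the learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

# Linearly separable toy problem: the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights start at zero
b = 0.0           # bias
lr = 0.1          # learning rate

for _ in range(20):                        # a few passes over the training set
    for x, t in zip(X, T):
        y = 1 if w @ x + b > 0 else 0      # step activation on the weighted sum
        w += lr * (t - y) * x              # weights change only when the output is wrong
        b += lr * (t - y)

print([1 if w @ x + b > 0 else 0 for x in X])  # -> [0, 0, 0, 1]
```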
The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes
in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and terminate
on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. The error between the actual and the demanded output is propagated backward, starting at the output layer and ending at the input layer.
Hence, a multi-layered perceptron model can be considered as an artificial neural network with multiple layers, in which the activation function does not remain linear as in a single-layer perceptron model. Instead, non-linear activation functions such as sigmoid, TanH, and ReLU can be used for deployment.
A multi-layer perceptron model has greater processing power and can process linear and non-linear
patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR,
NOR.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input x with the learned weight coefficient w, adding the bias b, and thresholding the result:
f(x) = 1 if w.x + b > 0
f(x) = 0 otherwise
Characteristics of Perceptron
The perceptron model has the following characteristics.
o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer
function.
o Perceptron can only be used to classify the linearly separable sets of input vectors. If input vectors
are non-linear, it is not easy to classify them properly.
Future of Perceptron
The future of the perceptron model is bright and significant, as it helps to interpret data by building intuitive patterns and applying them in the future. Machine learning is a rapidly growing technology of artificial intelligence that is continuously evolving; hence perceptron technology will continue to support and facilitate analytical behavior in machines, which will in turn add to the efficiency of computers.
The perceptron model is continuously becoming more advanced and working efficiently on
complex problems with the help of artificial neurons.
An artificial neural network, inspired by the human neural system, is a network used to process data, and it consists of three types of layers: the input layer, the hidden layer, and the output layer. The basic neural network contains only two layers, the input and output layers. The layers are connected by weighted paths that are used to find the net input. In this section, we will discuss two basic types of neural networks: Adaline, which doesn't have any hidden layer, and Madaline, which has one hidden layer.
In Adaline, the weights between the input unit and the output unit are adjustable. It uses the delta rule, i.e. ∆wi = α(t - yin)xi, where wi, yin and t are the weight, predicted output, and true value respectively.
The learning rule minimizes the mean squared error between the activation and the target value. Adaline consists of trainable weights; it compares the actual output with the calculated output and, based on the error, the training algorithm is applied.
Workflow (Adaline):
First, calculate the net input to the Adaline network, then apply the activation function to its output and compare it with the original output. If both are equal, give the output; otherwise send the error back to the network and update the weights according to the error, which is calculated by the delta learning rule, i.e. ∆wi = α(t - yin)xi, where wi, yin and t are the weight, predicted output, and true value respectively.
Architecture:
In Adaline, all the input neurons are directly connected to the output neuron through weighted paths. A bias b, whose activation is always 1, is also present.
Algorithm:
Step 1: Initialize the weights to small random values (not zero). Set the learning rate α.
Step 2: While the stopping condition is false, do steps 3 to 7.
Step 3: For each training pair, perform steps 4 to 6.
Step 4: Set the activation of each input unit, xi = si for i = 1 to n.
Step 5: Compute the net input to the output unit:
yin = b + Σ xi wi (i = 1 to n)
Here, b is the bias and n is the total number of input neurons.
Step 6: Update the weights and bias for i = 1 to n:
wi(new) = wi(old) + α(t − yin)xi
b(new) = b(old) + α(t − yin)
and calculate the error (t − yin)². When the predicted output and the true value are the same, the weights do not change.
Step 7: Test the stopping condition. The stopping condition may be met when the weights change at a low rate or not at all.
Implementations
Problem: train an Adaline network on the bipolar OR function with initial weights w1 = w2 = 0.1, bias b = 0.1, and learning rate α = 0.1.

x1   x2   t
 1    1   1
 1   -1   1
-1    1   1
-1   -1  -1

For the first input (x1 = x2 = 1, t = 1):
yin = b + x1 w1 + x2 w2 = 0.1 + 0.1 + 0.1 = 0.3
Now compute (t − yin) = (1 − 0.3) = 0.7.
Now update the weights and bias:
w1 = 0.1 + 0.1 × 0.7 × 1 = 0.17, w2 = 0.17, b = 0.17
and calculate the error (t − yin)² = 0.49.
Similarly, repeat the same steps for the other input vectors and you will get:

x1   x2   t    yin      (t-yin)   Δw1       Δw2       Δb        w1       w2       b        (t-yin)²
 1    1   1    0.3       0.7       0.07      0.07      0.07     0.17     0.17     0.17     0.49
 1   -1   1    0.17      0.83      0.083    -0.083     0.083    0.253    0.087    0.253    0.69
-1    1   1    0.087     0.913    -0.0913    0.0913    0.0913   0.1617   0.1783   0.3443   0.83
-1   -1  -1    0.0043   -1.0043    0.1004    0.1004   -0.1004   0.2621   0.2787   0.2439   1.01

This is epoch 1, where the total error is 0.49 + 0.69 + 0.83 + 1.01 = 3.02, so more epochs are run until the total error becomes less than or equal to the least squared error, i.e., 2.
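The epoch computation above can be reproduced with a short script. This is a minimal sketch of the Adaline delta rule on the bipolar OR data from the worked example; the variable names are illustrative.

```python
import numpy as np

# Bipolar OR training set from the worked example
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1, 1, 1, -1], dtype=float)

w = np.array([0.1, 0.1])   # initial weights
b = 0.1                    # initial bias
alpha = 0.1                # learning rate
epoch_errors = []

for epoch in range(100):
    total_error = 0.0
    for x, target in zip(X, t):
        y_in = b + x @ w              # Step 5: net input
        delta = target - y_in         # (t - yin)
        w += alpha * delta * x        # Step 6: delta-rule weight update
        b += alpha * delta
        total_error += delta ** 2     # accumulate (t - yin)^2
    epoch_errors.append(total_error)
    if total_error <= 2:              # Step 7: stopping condition from the text
        break

# epoch_errors[0] is about 3.02, matching the epoch-1 total in the table
```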
Back-propagation Network
Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights reduces error rates and makes the model reliable by increasing its generalization.
Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks. This method helps calculate the gradient of a loss function with respect to all the weights in the network.
The backpropagation algorithm computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
Consider the following backpropagation example to understand how a network is trained:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W, which are usually selected randomly.
3. Calculate the output of every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs: Error = Actual Output − Desired Output.
5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
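The weight-adjustment step can be sketched numerically. The following is an illustrative chain-rule computation for a tiny two-layer network with tanh hidden units and squared-error loss; all sizes and values are made up for the example.

```python
import numpy as np

x = np.array([0.5, -0.2])                    # input
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])     # input-to-hidden weights
b1 = np.zeros(2)
W2 = np.array([[0.2, -0.5]])                 # hidden-to-output weights
b2 = np.zeros(1)
t = np.array([1.0])                          # desired output

def forward(W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)                 # hidden activations
    y = W2 @ h + b2                          # output (linear)
    return h, y

h, y = forward(W1, b1, W2, b2)
loss_before = 0.5 * np.sum((y - t) ** 2)     # squared-error loss

# Backward pass: apply the chain rule one layer at a time
dy = y - t                                   # dL/dy (actual - desired)
dW2 = np.outer(dy, h)                        # dL/dW2
db2 = dy
dh = W2.T @ dy                               # error propagated to hidden layer
dh_in = dh * (1 - h ** 2)                    # through the tanh derivative
dW1 = np.outer(dh_in, x)                     # dL/dW1
db1 = dh_in

# Gradient-descent update: adjust weights so the error decreases
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

_, y_new = forward(W1, b1, W2, b2)
loss_after = 0.5 * np.sum((y_new - t) ** 2)  # smaller than loss_before
```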
Two types of backpropagation networks exist:
Static back-propagation
Recurrent backpropagation
Static back-propagation:
This kind of backpropagation network produces a mapping from a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
The main difference between the two methods is that the mapping is rapid in static back-propagation, while it is non-static in recurrent backpropagation.
History of Backpropagation
In 1961, the basic concepts of continuous backpropagation were derived in the context of control theory by Henry J. Kelley and Arthur E. Bryson.
In 1969, Bryson and Ho gave a multi-stage dynamic system optimization method.
In 1974, Werbos stated the possibility of applying this principle in an artificial neural
network.
In 1982, Hopfield brought his idea of a neural network.
In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams,
backpropagation gained recognition.
In 1993, Wan was the first person to win an international pattern recognition contest with
the help of the backpropagation method.
Summary
A neural network is a group of connected I/O units where each connection has an associated weight.
Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks.
The backpropagation algorithm is fast, simple, and easy to program.
A feedforward backpropagation network (BPN) is an artificial neural network.
Two types of backpropagation networks are 1) static back-propagation and 2) recurrent backpropagation.
In 1961, the basic concepts of continuous backpropagation were derived in the context of control theory by Henry J. Kelley and Arthur E. Bryson.
Backpropagation simplifies the network structure by removing weighted links that have a minimal effect on the trained network.
It is especially useful for deep neural networks working on error-prone projects, such as image or speech recognition.
The biggest drawback of backpropagation is that it can be sensitive to noisy data.
Auto Associative Memory
Architecture
As shown in the following figure, the architecture of an auto-associative memory network has 'n' input training vectors and a similar 'n' output target vectors.
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero: wij = 0 (i = 1 to n, j = 1 to n)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit: xi = si (i = 1 to n)
Step 4 − Activate each output unit: yj = sj (j = 1 to n)
Step 5 − Adjust the weights: wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training with Hebb's rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit, j = 1 to n:
yinj = Σ xi wij (i = 1 to n)
Step 5 − Apply the following activation function to calculate the output:
yj = f(yinj) = +1 if yinj > 0, −1 if yinj ⩽ 0
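The training and testing steps above can be sketched in a few lines, storing a single bipolar pattern with the Hebb rule (the pattern itself is arbitrary):

```python
import numpy as np

s = np.array([1, 1, -1, -1])         # bipolar pattern to store (here x = y = s)

# Training: w_ij(new) = w_ij(old) + x_i * y_j, starting from zero weights
W = np.zeros((4, 4))
W += np.outer(s, s)

# Testing: net input yin_j = sum_i x_i w_ij, then the bipolar step function
y_in = s @ W
y = np.where(y_in > 0, 1, -1)        # +1 if yin > 0, else -1
# y equals s: the network recalls the stored pattern

probe = np.array([1, 1, -1, 1])      # the stored pattern with one bit flipped
y2 = np.where(probe @ W > 0, 1, -1)  # the net still recalls s from the noisy probe
```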
Hetero Associative Memory
Architecture
As shown in the following figure, the architecture of a hetero-associative memory network has 'n' input training vectors and 'm' output target vectors.
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero: wij = 0 (i = 1 to n, j = 1 to m)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit: xi = si (i = 1 to n)
Step 4 − Activate each output unit: yj = sj (j = 1 to m)
Step 5 − Adjust the weights: wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training with Hebb's rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit, j = 1 to m:
yinj = Σ xi wij (i = 1 to n)
Step 5 − Apply the following activation function to calculate the output:
yj = f(yinj) = +1 if yinj > 0, 0 if yinj = 0, −1 if yinj < 0
UNIT 2
Unsupervised Learning Network
In general, data collected for an unsupervised machine learning model is unstructured as it's in a
more raw format. Even though unsupervised data sets are much bigger than labeled or supervised
data sets, they are usually cheaper to collect, as they require no specific labeling or processing in
order for the data set to be used.
As we'll see in some of the unsupervised machine learning algorithms, unlike supervised
algorithms, such algorithms take in unlabeled data and try to make sense of it. This can be done
by clustering all data points into given clusters or by discovering hidden patterns and trends.
To make sure that our model returns accurate results, we must deliberately test its output on a variety of input variables. We can then tune the model's parameters to improve the final result.
Clustering is the task of grouping unlabeled data into multiple groups (or 'clusters') based on their similarities and differences. Data points with the most similar features are clustered together. Two of the most well-known unsupervised clustering algorithms are K-Means clustering and hierarchical clustering.
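A minimal K-Means sketch illustrates the idea; the data points and the farthest-point initialization are illustrative choices, not part of the standard algorithm description.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal K-Means: assign points to the nearest centroid, recompute centroids."""
    # Farthest-point initialization so the k starting centroids are spread out
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # nearest-centroid assignment
        for j in range(k):                        # move centroids to cluster means
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups of points
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1]])
labels, centroids = kmeans(X, k=2)
# points 0-2 share one label and points 3-5 the other
```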
Association
Association is an unsupervised machine learning technique used for discovering relations between variables. Association learning is commonly used in market basket analysis, in which the algorithm tries to find relationships between products. For example, 90 percent of customers who buy product A also buy product B. Such hidden insights and patterns are incredibly useful for marketing purposes, boosting a company's sales.
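The "90 percent" figure in the example is the rule's confidence. A toy computation over hypothetical transactions:

```python
# Hypothetical market-basket transactions (sets of purchased items)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"A", "C"},
    {"B", "D"},
]

# Support of {A, B}: fraction of all transactions containing both A and B
both = sum(1 for t in transactions if {"A", "B"} <= t)
support = both / len(transactions)

# Confidence of the rule A -> B: of the customers who buy A, how many also buy B
has_a = sum(1 for t in transactions if "A" in t)
confidence = both / has_a
# here 3 of the 4 baskets containing A also contain B, so confidence = 0.75
```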
1. Sentiment Analysis
Sentiment analysis is the process of clustering different sentences depending on the semantic meaning that they hold. In sentiment analysis, a sentence can be labeled as positive, neutral, or negative depending on the writer's attitude toward a certain topic. For example, take the sentence "I love the rain". This sentence shows that the writer holds a positive attitude towards a certain topic, the rain. The sentiment of a given sentence can be identified using a list of keywords such as love, hate, like, dislike, etc. [TowardsDataScience, Unsupervised sentiment analysis]
Sentiment analysis is incredibly useful in real-world applications, as it is heavily implemented in social media apps in order to detect and eliminate hate speech.
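The keyword-list approach described above can be sketched in a few lines; the word lists are illustrative, not a real sentiment lexicon:

```python
# Tiny hand-made keyword lists (illustrative only)
POSITIVE = {"love", "like", "great", "enjoy"}
NEGATIVE = {"hate", "dislike", "awful", "terrible"}

def keyword_sentiment(sentence):
    """Label a sentence positive/negative/neutral by counting keyword hits."""
    words = sentence.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("I love the rain"))   # positive
```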
2. Speech Recognition
Speech recognition is the ability of a machine learning model to extract meaning from human speech. Such models take as input an audio recording of human speech and decode it, extracting all relevant information. While supervised speech recognition models offer great precision, unsupervised speech recognition enables us to generate precise predictions on never-before-seen data sets.
Apple's Siri and Amazon's Alexa are two of the most popular speech recognition applications.
3. Artificial Intelligence Chatbots
AI chatbots are being heavily implemented in nearly every business and government sector nowadays.
Such chatbots are capable of providing users with human-like interactions, answering questions,
providing assistance, and more. Some companies that infuse chatbots into their services include Lyft,
Spotify, and Starbucks.
There exist three generations of chatbots. The first generation was based on written rules. The
programmer provided a specific list of answers for a specific list of questions. Moving on to the second
generation, chatbots were infused with artificial intelligence, starting with supervised learning. Such chatbots were trained on massive amounts of labeled user chats. Second-generation, or AI, chatbots provided a far more dynamic answering mechanism. As you may have guessed, the third and final chatbot generation integrates unsupervised learning into its training. Such models are trained on even bigger data sets that are unlabeled. Third-generation chatbots offer all the advantages of the first generation while also having additional capacity to handle trickier and more complex situations.
[Rulia, The 3 Different Generations Of Chatbot Technology]
1. Cancer Diagnosis
Computers can recognize anomalies in a medical scan using unsupervised learning in computer vision. The model is initially provided with a massive quantity of unlabeled images that include both healthy and cancer-positive inputs. The model is then able to examine and contrast various images, correctly detecting variations between them. As a result, the model can later determine whether a particular scan contains a tumor. Even in situations where there are several different tumor types, the model will be able to distinguish and label each form of tumor on its own. Since we did not identify each tumor type up front, it is important to note that the model will assign each kind a number to identify it.
2. X-ray Diagnosis
Similar to the cancer diagnosis model, the x-ray diagnosis model is fed a multitude of X-ray scans. Any
irregularities in the image can then be detected by the model. These models are excellent at seeing minute
anomalies that doctors would overlook. We can use AI to find anomalies in the heart, lungs, pleura,
mediastinum, bones, and diaphragm.
It is important to keep in mind that this model is still in its early stages of development. Therefore, X-ray imaging will still require the occasional doctor checkup.
Conclusion
To answer the age-old question, which is superior: supervised or unsupervised learning? The answer is
that it depends.
While some machine learning practitioners may prefer supervised learning algorithms, since they are easier to use and in most cases return more accurate results, unsupervised learning also has its advantages, such as being more resistant to overfitting and better suited to complex and unstructured data.
In some cases, the user would be unsure where to start looking for hidden insights in a given data set,
making unsupervised learning approaches extremely useful in such cases. Furthermore, while supervised
learning data sets are much smaller in size than unsupervised data sets, they are far more difficult to
collect and maintain. This is because each data point must be manually checked and labeled separately. A
process like this could take months or even years to complete. Unsupervised data, on the other hand, has
no definite structure and does not require labeling.
Thus, whether supervised or unsupervised learning is used is highly dependent on the problem at hand.
Additional structure can be included in the net to force it to make a definitive decision. The mechanism by which this can be accomplished is called competition. The most extreme form of competition among a group of neurons is called Winner-Take-All, where only one neuron (the winner) in the group will have a nonzero output signal when the competition is completed.
In these competitive networks the weights remain fixed, even during the training process. The idea of competition is used among neurons to enhance the contrast in their activations. Two such networks are discussed here: the Maxnet and the Hamming network.
Maxnet
The Maxnet network was developed by Lippmann in 1987. Maxnet serves as a subnet for picking the node whose input is largest. All the nodes in this subnet are fully interconnected, and symmetrical weights are present on all these weighted interconnections.
Architecture of Maxnet
In the architecture of Maxnet, fixed symmetrical weights are present over the weighted interconnections. The weights between the neurons are inhibitory and fixed. A Maxnet with this structure can be used as a subnet to select the particular node whose net input is the largest.
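A minimal Maxnet iteration, assuming n nodes with a mutual inhibitory weight ε (ε must be smaller than 1/n for this update to behave well); the initial activations are made up:

```python
import numpy as np

def maxnet(a, eps=0.15, max_iter=100):
    """Winner-take-all by mutual inhibition: each node is inhibited by the others."""
    a = np.array(a, dtype=float)
    for _ in range(max_iter):
        # a_j(new) = f(a_j - eps * sum of the other activations), f = ramp at 0
        a = np.maximum(0.0, a - eps * (a.sum() - a))
        if np.count_nonzero(a) <= 1:   # competition is over: one winner remains
            break
    return a

a = maxnet([0.2, 0.4, 0.6, 0.8])
# only the node with the largest initial input keeps a nonzero activation
```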
Counterpropagation networks (CPN) were proposed by Hecht-Nielsen in 1987. They are multilayer networks based on combinations of input, output, and clustering layers. Applications of counterpropagation nets include data compression, function approximation, and pattern association. The counterpropagation network is basically constructed from an instar-outstar model. This model is a three-layer neural network that performs input-output data mapping, producing an output vector y in response to an input vector x, on the basis of competitive learning. The three layers in an instar-outstar model are the input layer, the hidden (competitive) layer, and the output layer.
There are two stages involved in the training process of a counterpropagation net. The input vectors are clustered in the first stage. In the second stage of training, the weights from the cluster-layer units to the output units are tuned to obtain the desired response.
If the Euclidean distance method is used, find the cluster unit Zj whose squared distance from the input vector is the smallest.
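The winning cluster unit can be found by computing the squared Euclidean distance from the input to each unit's weight vector; the numbers below are made up:

```python
import numpy as np

x = np.array([0.8, 0.2])                     # input vector
V = np.array([[0.9, 0.1],                    # weight vector of cluster unit Z1
              [0.2, 0.7],                    # Z2
              [0.5, 0.5]])                   # Z3

# D(j) = sum_i (x_i - v_ij)^2 ; the winner is the unit with the smallest D
D = ((x - V) ** 2).sum(axis=1)
winner = D.argmin()
# D = [0.02, 0.61, 0.18], so Z1 (index 0) wins
```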
The F1 layer accepts the inputs, performs some processing, and transfers them to the F2 layer, which best matches the classification factor. There exist two sets of weighted interconnections for controlling the degree of similarity between the units in the F1 and F2 layers. The F2 layer is a competitive layer: the cluster unit with the largest net input becomes the candidate to learn the input pattern first, and the rest of the F2 units are ignored. The reset unit decides whether or not the cluster unit is allowed to learn the input pattern, depending on how similar its top-down weight vector is to the input vector. This is called the vigilance test. Thus we can say that the vigilance parameter helps to incorporate new memories or new information. Higher vigilance produces more detailed memories; lower vigilance produces more general memories.
Generally two types of learning exist: slow learning and fast learning. In fast learning, the weight update during resonance occurs rapidly; it is used in ART1. In slow learning, the weight change occurs slowly relative to the duration of a learning trial; it is used in ART2.
Advantages of Adaptive Resonance Theory (ART)
It exhibits stability and is not disturbed by a wide variety of inputs provided to the network.
It can be integrated and used with various other techniques to give better results.
It can be used in various fields such as mobile robot control, face recognition, land cover classification, target recognition, medical diagnosis, signature verification, clustering web users, etc.
It has advantages over competitive learning (such as BPNN), which lacks the capability to add new clusters when deemed necessary and does not guarantee stability in forming clusters.
Limitations of Adaptive Resonance Theory
Some ART networks are inconsistent (such as Fuzzy ART and ART1), as they depend upon the order of the training data or upon the learning rate.
Special Networks
In deep learning, "special networks" generally refers to specialized network architectures designed for specific tasks. Some popular specialized architectures used in deep learning are:
1. Convolutional Neural Networks (CNNs): CNNs are commonly used for image and video
processing tasks. They are designed to process data with a grid-like structure, such as images, by
using convolutional layers that capture local patterns and hierarchical representations.
2. Recurrent Neural Networks (RNNs): RNNs are used for sequential data processing tasks, such as
natural language processing and speech recognition. RNNs have feedback connections that allow
them to retain information from previous inputs, making them suitable for tasks with temporal
dependencies.
3. Generative Adversarial Networks (GANs): GANs consist of two neural networks, a generator and
a discriminator, that are trained together in a competitive manner. GANs are commonly used for
generative tasks, such as generating realistic images, by learning to capture the underlying
distribution of the training data.
4. Transformer Networks: Transformers have gained popularity in natural language processing
tasks, especially for machine translation and text generation. They rely on self-attention
mechanisms to capture the relationships between different words or tokens in a sequence,
enabling them to model long-range dependencies effectively.
5. Autoencoders: Autoencoders are neural networks used for unsupervised learning and
dimensionality reduction tasks. They are composed of an encoder network that compresses the
input data into a latent representation and a decoder network that reconstructs the original input
from the latent space.
These are just a few examples of specialized network architectures used in deep learning. There are many
other architectures and variations tailored to specific tasks and domains, such as object detection, speech
synthesis, and reinforcement learning.
When it comes to machine learning, artificial neural networks perform really well. Neural networks are used on various kinds of data, such as images, audio, and text. Different types of neural networks are used for different purposes: for example, for predicting a sequence of words we use a Recurrent Neural Network (more precisely, an LSTM), while for image classification we use a Convolutional Neural Network. In this section, we build the basic building block of a CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The number of
neurons in this layer is equal to the total number of features in our data (number of
pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers depending on our model and data size. Each hidden layer can have a different number of neurons, generally greater than the number of features. The output of each layer is computed by matrix multiplication of the previous layer's output with the learnable weights of that layer, followed by the addition of learnable biases and an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function
like sigmoid or softmax which converts the output of each class into the probability
score of each class.
Feeding the data into the model and obtaining the output of each layer as in the above steps is called feedforward. We then calculate the error using an error function; some common error functions are cross-entropy, squared loss error, etc. The error function measures how well the network is performing. After that, we backpropagate through the model by calculating the derivatives. This step, called backpropagation, is used to minimize the loss.
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, represented vertically. Now slide that neural network across the whole image; as a result, we will get another image with different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but less width and height. This operation is called convolution. If the patch size is the same as that of the image, it is a regular neural network. Because of this small patch, we have fewer weights.
As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a
result, we’ll get output volume having a depth equal to the number of filters. The network will
learn all the filters.
Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a ConvNet. A ConvNet is a sequence of layers, and every layer transforms one volume to another through a differentiable function.
Types of layers:
Let's take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
Input Layer: This is the layer in which we give input to our model. In a CNN, the input is generally an image or a sequence of images. This layer holds the raw input image with width 32, height 32, and depth 3.
Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer (with padding so the spatial size is preserved); we'll get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. This layer applies an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU (max(0, x)), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume has dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted in the ConvNet. Its main function is to reduce the size of the volume, which makes the computation fast, reduces memory, and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.
Fully Connected Layers: The pooled feature maps are flattened into a one-dimensional vector and fed into one or more fully connected layers, which combine the extracted features to compute the final class scores.
Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks, such as sigmoid or softmax, which converts the output of each class into the probability score of each class.
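The shape arithmetic in the layer walkthrough can be verified with a naive convolution and max-pool; the random image and kernels are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))            # one channel of the 32 x 32 input

def conv2d_same(img, kernel):
    """Naive 'same'-padded convolution: output keeps the input's height/width."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + k, j:j + k]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return out

kernels = rng.normal(size=(12, 3, 3))        # 12 filters of shape 3 x 3
feature_maps = np.stack([conv2d_same(image, k) for k in kernels], axis=-1)
# feature_maps.shape == (32, 32, 12), as in the walkthrough

def maxpool2x2(vol):
    """2 x 2 max pooling with stride 2 halves the height and width."""
    H, W, C = vol.shape
    return vol.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

pooled = maxpool2x2(feature_maps)
# pooled.shape == (16, 16, 12)
```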
The working of an RNN can be written as
Y = f(X, h; W, U, V, b, c)
Here S is the state matrix, with element si the state of the network at timestep i. The parameters W, U, V, b, c are shared across timesteps.
The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step. Each unit has an internal state, called the hidden state of the unit. This hidden state signifies the past knowledge that the network currently holds at a given time step, and it is updated at every time step to reflect the change in the network's knowledge about the past. The hidden state is updated using the following recurrence relation.
The formula for calculating the current state:
ht = f(ht−1, xt)
where:
ht -> current state
ht−1 -> previous state
xt -> input at the current step
The formula for applying the activation function (tanh):
ht = tanh(Whh · ht−1 + Wxh · xt)
where:
Whh -> weight matrix at the recurrent neuron
Wxh -> weight matrix at the input neuron
The output is then calculated as yt = Why · ht.
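One step of this recurrence can be sketched directly; the sizes and random weights are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: 3-dimensional input, 4-dimensional hidden state, 2 outputs
W_hh = 0.1 * rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights
W_xh = 0.1 * rng.normal(size=(4, 3))   # input-to-hidden weights
W_hy = 0.1 * rng.normal(size=(2, 4))   # hidden-to-output weights

def rnn_step(h_prev, x):
    h = np.tanh(W_hh @ h_prev + W_xh @ x)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y = W_hy @ h                            # y_t = W_hy h_t
    return h, y

h = np.zeros(4)                             # initial hidden state
xs = rng.normal(size=(5, 3))                # a sequence of 5 input vectors
for x in xs:                                # the same weights are shared each step
    h, y = rnn_step(h, x)
```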
Advantages of Recurrent Neural Network
1. An RNN remembers each piece of information through time. This ability to remember previous inputs is what makes it useful in time series prediction; the variant built for long-range memory is called Long Short-Term Memory (LSTM).
2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation function.
Applications of Recurrent Neural Network
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face detection
5. Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
One to One
One to Many
Many to One
Many to Many
One to One
This type of RNN behaves like a simple neural network and is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence of multiple words.
Many to One
In this type of network, many inputs are fed to the network at several states, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem.
One Example of this Problem will be language translation. In language translation, we provide multiple
words from one language as input and predict multiple words from the second language as output.
It has been noticed that most mainstream neural nets can be easily fooled into misclassifying things by adding only a small amount of noise to the original data. Surprisingly, the model after adding noise has higher confidence in the wrong prediction than it had when it predicted correctly. The reason for such an adversary is that most machine learning models learn from a limited amount of data, which is a huge drawback, as it makes them prone to overfitting. Also, the mapping between the input and the output is almost linear. Although it may seem that the boundaries of separation between the various classes are linear, in reality they are composed of linearities, and even a small change in a point in feature space might lead to misclassification of the data.
How do GANs work?
Generative Adversarial Networks (GANs) can be broken down into three parts:
Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.
Adversarial: The training of a model is done in an adversarial setting.
Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training
purposes.
In GANs, there is a Generator and a Discriminator. The Generator generates fake samples of data(be it
an image, audio, etc.) and tries to fool the Discriminator. The Discriminator, on the other hand, tries to
distinguish between the real and fake samples. The Generator and the Discriminator are both Neural
Networks and they both run in competition with each other in the training phase. The steps are repeated
several times and in this, the Generator and Discriminator get better and better in their respective jobs
after each repetition. The work can be visualized by the diagram given below:
Here, the generative model captures the distribution of the data and is trained in such a manner that it tries to maximize the probability of the Discriminator making a mistake. The Discriminator, on the other hand, is based on a model that estimates the probability that the sample it received came from the training data and not from the Generator. The GANs are formulated as a minimax game, where the Discriminator is trying to maximize its reward V(D, G) and the Generator is trying to minimize the Discriminator's reward, or in other words, maximize its loss. It can be mathematically described by the formula below:

min_G max_D V(D, G) = E_x∼pdata(x)[log D(x)] + E_z∼pz(z)[log(1 − D(G(z)))]
where,
G = Generator
D = Discriminator
pdata(x) = distribution of real data
pz(z) = distribution of the generator's input noise
x = sample from pdata(x)
z = sample from pz(z)
D(x) = Discriminator's output for x
G(z) = Generator's output for noise z
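The value function V(D, G) can be evaluated directly for toy discriminator outputs; all the probabilities below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical discriminator outputs: D(x) on real samples, D(G(z)) on fakes
D_real = np.array([0.9, 0.8, 0.95])    # D wants these close to 1
D_fake = np.array([0.1, 0.2, 0.05])    # D wants these close to 0

# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
V = np.mean(np.log(D_real)) + np.mean(np.log(1 - D_fake))

# A confident discriminator keeps V close to 0 (from below); a fooled one,
# with D(G(z)) near 1, drives the second term strongly negative, which is
# exactly what the generator is trying to achieve.
```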
Generator Model
The Generator is trained while the Discriminator is kept idle. Using the Discriminator's predictions on
the fake data the Generator produced, the Generator is updated so that it improves on its previous state
and gets better at fooling the Discriminator.
Discriminator Model
The Discriminator is trained while the Generator is kept idle; the Generator is only forward
propagated, and no back-propagation is done through it in this phase. The Discriminator is trained on
real data for n epochs to see whether it can correctly predict them as real, and it is also trained on the
fake data generated by the Generator to see whether it can correctly predict them as fake.
Deep Convolutional GAN (DCGAN): One of the most popular and successful GAN architectures,
implemented with ConvNets in place of multi-layer perceptrons. The ConvNets are implemented
without max pooling, which is in fact replaced by convolutional stride. Also, the layers are not fully
connected.
Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image
representation consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency
residual. This approach uses multiple Generator and Discriminator networks at different levels of the
Laplacian pyramid. This approach is mainly used because it produces very
high-quality images. The image is down-sampled at first at each layer of the pyramid and then it is
again up-scaled at each layer in a backward pass where the image acquires some noise from the
Conditional GAN at these layers until it reaches its original size.
Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of designing a GAN in
which a deep neural network is used along with an adversarial network in order to produce higher-
resolution images. This type of GAN is particularly useful in optimally up-scaling native low-resolution
images to enhance their details while minimizing errors in doing so.
The encoders each convert their input into another sequence of vectors called encodings. The decoders do
the reverse: they convert the encodings back into a sequence of probabilities of different output words.
The output probabilities can be converted into another natural language sentence using the softmax
function.
Each encoder and decoder contains a component called the attention mechanism, which allows the
processing of one input word to include relevant data from certain other words, while masking the words
which do not contain relevant information.
Because this must be calculated many times, we implement multiple attention mechanisms in parallel,
taking advantage of the parallel computing offered by GPUs. This is called the multi-head attention
mechanism. The ability to pass multiple words through a neural network simultaneously is one advantage
of transformers over LSTMs and RNNs.
The architecture of a transformer neural network. In the original paper, there were 6 encoders chained to 6
decoders.
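The attention mechanism described above can be sketched in a few lines of plain Python (the helper names `softmax` and `attention` and the tiny Q, K, V matrices are illustrative, not from any specific library). Each query is compared against every key, the scaled scores become weights via softmax, and the output is the weighted sum of the values:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: softmax(Q.K^T / sqrt(d)) . V
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # how strongly each word is attended to
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Two 2-D encodings; each query lines up with the matching key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

In multi-head attention, several independent copies of this computation run in parallel and their outputs are concatenated.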
Autoencoders
What is an autoencoder?
An autoencoder is a type of artificial neural network used to learn data encodings in an unsupervised
manner.
1. Encoder: A module that compresses the train-validate-test set input data into an encoded
representation that is typically several orders of magnitude smaller than the input data.
2. Bottleneck: A module that contains the compressed knowledge representations and is therefore the
most important part of the network.
3. Decoder: A module that helps the network “decompress” the knowledge representations and
reconstructs the data back from its encoded form. The output is then compared with a ground truth.
Encoder
The encoder is a set of convolutional blocks followed by pooling modules that compress the input to
the model into a compact section called the bottleneck.
The bottleneck is followed by the decoder that consists of a series of upsampling modules to bring the
compressed feature back into the form of an image. In case of simple autoencoders, the output is
expected to be the same as the input data with reduced noise.
However, for variational autoencoders it is a completely new image, formed with information the
model has been provided as input.
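A minimal sketch of this encoder–bottleneck–decoder idea, assuming purely linear maps and plain gradient descent (no convolutions and no framework; all names here are illustrative): 2-D points lying on a line are squeezed through a 1-D bottleneck and reconstructed, and the reconstruction loss falls during training.

```python
import random

random.seed(0)

# Toy data that lies on a 1-D line inside 2-D space: (x, 2x).
data = [(x, 2 * x) for x in [0.1, 0.2, 0.3, 0.4, 0.5]]

# Encoder and decoder are each a single linear map; the bottleneck is 1-D.
we = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
wd = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]

def loss():
    total = 0.0
    for x1, x2 in data:
        h = we[0] * x1 + we[1] * x2      # bottleneck code
        r1, r2 = wd[0] * h, wd[1] * h    # reconstruction
        total += (r1 - x1) ** 2 + (r2 - x2) ** 2
    return total

initial = loss()
lr = 0.05
for _ in range(2000):
    for x1, x2 in data:
        h = we[0] * x1 + we[1] * x2
        r1, r2 = wd[0] * h, wd[1] * h
        e1, e2 = r1 - x1, r2 - x2
        # Gradients of the squared reconstruction error w.r.t. each weight.
        g_wd = [2 * e1 * h, 2 * e2 * h]
        g_h = 2 * e1 * wd[0] + 2 * e2 * wd[1]
        g_we = [g_h * x1, g_h * x2]
        wd = [wd[i] - lr * g_wd[i] for i in range(2)]
        we = [we[i] - lr * g_we[i] for i in range(2)]

print(initial, loss())
```

Because the data is exactly one-dimensional, a 1-D bottleneck suffices for near-perfect reconstruction; real autoencoders stack nonlinear layers around the same skeleton.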
Bottleneck
The most important part of the neural network, and ironically the smallest one, is the bottleneck. The
bottleneck exists to restrict the flow of information to the decoder from the encoder, thus,allowing only
the most vital information to pass through.
Since the bottleneck is designed in such a way that the maximum information possessed by an image is
captured in it, we can say that the bottleneck helps us form a knowledge-representation of the input.
Thus, the encoder-decoder structure helps us extract the most from an image in the form of data and
establish useful correlations between various inputs within the network.
A bottleneck as a compressed representation of the input further prevents the neural network from
memorising the input and overfitting on the data.
As a rule of thumb, remember this: The smaller the bottleneck, the lower the risk of overfitting.
However—
Very small bottlenecks would restrict the amount of information storable, which increases the chances
of important information slipping out through the pooling layers of the encoder.
Decoder
Finally, the decoder is a set of upsampling and convolutional blocks that reconstructs the bottleneck's
output.
Since the input to the decoder is a compressed knowledge representation, the decoder serves as a
“decompressor” and builds back the image from its latent attributes.
1. Code size: The code size or the size of the bottleneck is the most important hyperparameter
used to tune the autoencoder. The bottleneck size decides how much the data has to be
compressed. This can also act as a regularisation term.
2. Number of layers: Like all neural networks, the depth of the encoder and the decoder is an
important hyperparameter to tune. A higher depth increases model complexity, while a lower
depth is faster to process.
3. Number of nodes per layer: The number of nodes per layer defines the weights we use per
layer. Typically, the number of nodes decreases with each subsequent layer in the autoencoder
as the input to each of these layers becomes smaller across the layers.
4. Reconstruction Loss: The loss function we use to train the autoencoder is highly dependent on
the type of input and output we want the autoencoder to adapt to. If we are working with image
data, the most popular loss functions for reconstruction are MSE Loss and L1 Loss. In case the
inputs and outputs are within the range [0,1], as in MNIST, we can also make use of Binary
Cross Entropy as the reconstruction loss.
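The two reconstruction losses named above can be written out directly; a short plain-Python sketch (variable names are illustrative):

```python
import math

def mse_loss(x, x_hat):
    # Mean squared error between the input x and the reconstruction x_hat.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def bce_loss(x, x_hat):
    # Binary cross-entropy; valid when values lie in [0, 1], as in MNIST.
    eps = 1e-12  # guard against log(0)
    return -sum(a * math.log(b + eps) + (1 - a) * math.log(1 - b + eps)
                for a, b in zip(x, x_hat)) / len(x)

x = [0.0, 1.0, 0.5]        # original input
perfect = [0.0, 1.0, 0.5]  # exact reconstruction
rough = [0.3, 0.6, 0.5]    # imperfect reconstruction
print(mse_loss(x, perfect), mse_loss(x, rough))
print(bce_loss(x, perfect), bce_loss(x, rough))
```

Both losses penalize the rough reconstruction more heavily than the exact one, which is what drives the autoencoder's weights during training.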
Unit 3
Introduction to Deep Learning
Deep learning is a branch of machine learning that is based on artificial neural networks. It is capable
of learning complex patterns and relationships within data. In deep learning, we do not need to explicitly
program everything. It has become increasingly popular in recent years due to advances in
processing power and the availability of large datasets. It is based on artificial neural networks
(ANNs), also known as deep neural networks (DNNs). These neural networks are inspired by the
structure and function of the human brain's biological neurons, and they are designed to learn from
large amounts of data.
1. Deep Learning is a subfield of Machine Learning that involves the use of neural networks to model
and solve complex problems. Neural networks are modeled after the structure and function of the
human brain and consist of layers of interconnected nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks, which have multiple
layers of interconnected nodes. These networks can learn complex representations of data by
discovering hierarchical patterns and features in the data. Deep Learning algorithms can
automatically learn and improve from data without the need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including image recognition,
natural language processing, speech recognition, and recommendation systems. Some of the
popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent
Neural Networks (RNNs), and Deep Belief Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and computational
resources. However, the availability of cloud computing and the development of specialized
hardware, such as Graphics Processing Units (GPUs), has made it easier to train deep neural
networks.
In summary, Deep Learning is a subfield of Machine Learning that involves the use of deep neural
networks to model and solve complex problems. Deep Learning has achieved significant success in
various fields, and its use is expected to continue to grow as more data becomes available, and more
powerful computing resources become available.
What is Deep Learning?
Deep learning is the branch of machine learning which is based on artificial neural network
architecture. An artificial neural network or ANN uses layers of interconnected nodes called neurons
that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the input
layer. The output of one neuron becomes the input to other neurons in the next layer of the network,
and this process continues until the final layer produces the output of the network. The layers of the
neural network transform the input data through a series of nonlinear transformations, allowing the
network to learn complex representations of the input data.
Today Deep learning has become one of the most popular and visible areas of machine learning, due to
its success in a variety of applications, such as computer vision, natural language processing, and
Reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as reinforcement machine learning,
and it processes each of these in a different way.
Supervised Machine Learning: Supervised machine learning is the machine learning technique in
which the neural network learns to make predictions or classify data based on labeled datasets.
Here we provide the input features along with the target variables. The neural network learns to
make predictions based on the cost or error that comes from the difference between the predicted
and the actual target; this process is known as backpropagation. Deep learning algorithms like
convolutional neural networks and recurrent neural networks are used for many supervised tasks
such as image classification and recognition, sentiment analysis, and language translation.
Unsupervised Machine Learning: Unsupervised machine learning is the machine
learning technique in which the neural network learns to discover patterns or to cluster the
dataset based on unlabeled datasets. Here there are no target variables; the machine has to
discover the hidden patterns or relationships within the datasets on its own. Deep learning algorithms
like autoencoders and generative models are used for unsupervised tasks like clustering,
dimensionality reduction, and anomaly detection.
Reinforcement Machine Learning: Reinforcement machine learning is the machine
learning technique in which an agent learns to make decisions in an environment to maximize a
reward signal. The agent interacts with the environment by taking actions and observing the
resulting rewards. Deep learning can be used to learn policies, or sets of actions, that maximize
the cumulative reward over time. Deep reinforcement learning algorithms like Deep Q-Networks
(DQN) and Deep Deterministic Policy Gradient (DDPG) are used for tasks like robotics and game
playing.
Difference between Machine Learning and Deep Learning:
Machine learning and deep learning are both subsets of artificial intelligence, with many
similarities but also important differences between them.
Machine Learning: Can work on a smaller amount of data; takes less time to train the model.
Deep Learning: Requires a larger volume of data; takes more time to train the model.
Computer vision:
In computer vision, deep learning models enable machines to identify and understand visual
data. Some of the main applications of deep learning in computer vision include:
Object detection and recognition: Deep learning models can be used to identify and
locate objects within images and videos, making it possible for machines to perform
tasks such as self-driving cars, surveillance, and robotics.
Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such as
medical imaging, quality control, and image retrieval.
Image segmentation: Deep learning models can be used to segment images into
different regions, making it possible to identify specific features within images.
Natural language processing (NLP):
In NLP, deep learning models enable machines to understand and generate
human language. Some of the main applications of deep learning in NLP include:
Automatic text generation: Deep learning models can learn from a corpus of text, and
new text such as summaries or essays can be automatically generated using these
trained models.
Language translation: Deep learning models can translate text from one language
to another, making it possible to communicate with people from different linguistic
backgrounds.
Sentiment analysis: Deep learning models can analyze the sentiment of a piece of
text, making it possible to determine whether the text is positive, negative, or
neutral. This is used in applications such as customer service, social media
monitoring, and political analysis.
Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion, voice
search, and voice-controlled devices.
Reinforcement learning:
In reinforcement learning, deep learning works as training agents to take action in an
environment to maximize a reward. Some of the main applications of deep learning in
reinforcement learning include:
Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess, and Atari.
Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.
Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.
Challenges in Deep Learning
Deep learning has made significant advancements in various fields, but there are still
some challenges that need to be addressed. Here are some of the main challenges in
deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and
gathering enough data for training is a major concern.
We are still making use of the gradient descent optimization algorithm, which acts to minimize the
error of our model by iteratively moving in the direction of steepest descent, the direction that
updates the parameters of our model while ensuring minimal error. It updates the weights of every
layer in the model. We will talk more about optimization algorithms and backpropagation later.
The subsequent training of our neural network amounts to learning to separate our data samples
with some decision boundary.
"The process of receiving an input to produce some kind of output to make some kind of prediction
is known as Feed Forward." Feed Forward neural network is the core of many other important
neural networks such as convolution neural network.
In the feed-forward neural network, there are no feedback loops or connections in the network.
There is simply an input layer, a hidden layer, and an output layer.
So, what we will do is use our non-linear model to produce an output that describes the probability
of the point being in the positive region. The point is represented by the coordinates (2, 2). Along
with the bias, we represent the input as shown.
Recall the first linear model in the hidden layer and the equation that defined it.
In the first layer, to obtain the linear combination, the inputs are multiplied by -4 and -1,
and the bias value is multiplied by twelve.
The weights of the inputs are multiplied by -1/5 and 1, and the bias is multiplied by three to obtain
the linear combination of that same point in our second model.
Now, to obtain the probability that the point is in the positive region relative to both models, we
apply the sigmoid to both outputs:
The second layer contains the weights that dictate the combination of the linear models in the
first layer to obtain the non-linear model in the second layer. The weights are 1.5 and 1, with a bias
value of 0.5.
Now, we multiply our probabilities from the first layer by the second set of weights:
This is the complete math behind the feed-forward process, where the inputs from the input layer
traverse the entire depth of the neural network. In this example, there is only one hidden layer.
Whether there is one hidden layer or twenty, the computational process is the same for all hidden
layers.
Back-Propagation
Backpropagation is one of the important concepts of a neural network. Our task is to classify
our data as well as possible. For this, we have to update the weights and biases, but how can we do
that in a deep neural network? In the linear regression model, we use gradient descent to
optimize the parameters. Similarly, here we also use a gradient descent algorithm, via
backpropagation.
For a single training example, Backpropagation algorithm calculates the gradient of the error
function. Backpropagation can be written as a function of the neural network. Backpropagation
algorithms are a set of methods used to efficiently train artificial neural networks following a
gradient descent approach which exploits the chain rule.
The main feature of backpropagation is that it is an iterative, recursive, and efficient method for
calculating the weight updates that improve the network until it is able to perform the task for
which it is being trained. Backpropagation requires the derivatives of the activation function to be
known at network design time.
Now, how is the error function used in backpropagation, and how does backpropagation work? Let
us start with an example and do it mathematically to understand exactly how the weights are
updated using backpropagation.
Input values
X1=0.05
X2=0.10
Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1, we first multiply the input values by the weights:
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
Passing these net inputs through the sigmoid activation gives the hidden-layer outputs:
H1=1/(1+e^(-0.3775))=0.593269992
H2=1/(1+e^(-0.3925))=0.596884378
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we multiply the hidden-layer outputs H1 and H2 by the
weights:
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
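The forward pass above can be reproduced in a few lines. Note that, as in the standard version of this worked example, the output-layer net inputs y1 and y2 are also passed through the sigmoid; that is what yields the total error 0.298371109 quoted later in the text.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Inputs, weights, and biases from the worked example.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Hidden layer: net input, then sigmoid activation.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # sigmoid(0.3775)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # sigmoid(0.3925)

# Output layer: net input, then sigmoid activation.
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# Total error against the targets T1 = 0.01, T2 = 0.99.
e_total = 0.5 * (0.01 - y1) ** 2 + 0.5 * (0.99 - y2) ** 2
print(h1, h2, y1, y2, e_total)
```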
Passing y1 and y2 through the sigmoid activation gives the final outputs:
out_y1=1/(1+e^(-1.10590597))=0.75136507
out_y2=1/(1+e^(-1.2249214))=0.772928465
Our target values are 0.01 and 0.99, so the outputs do not match the targets T1 and T2.
Now, we find the total error, which is simply the sum of the squared differences between the
outputs and the target outputs:
Etotal=½(T1−out_y1)²+½(T2−out_y2)²
Etotal=½(0.01−0.75136507)²+½(0.99−0.772928465)²
Etotal=0.274811083+0.023560026=0.298371109
Now, we will backpropagate this error to update the weights using a backward pass.
From equation (2), it is clear that we cannot partially differentiate it with respect to w5, because
w5 does not appear in it. We split equation (1) into multiple terms so that we can easily differentiate
it with respect to w5.
Now, we calculate each term one by one to differentiate Etotal with respect to w5.
So, we put these values into equation (3) to find the final result.
Now, we will calculate the updated weight w5new with the help of the following formula:
In the same way, we calculate w6new,w7new, and w8new and this will give us the following values
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
From equation (2), it is clear that we cannot partially differentiate it with respect to w1, because
w1 does not appear in it. We split equation (1) into multiple terms so that we can easily differentiate
it with respect to w1.
Now, we calculate each term one by one to differentiate Etotal with respect to w1.
We again split both terms, because there are no y1 and y2 terms in E1 and E2. We split them as
follows:
Now, we find the value by putting the values into equations (18) and (19):
We calculate the partial derivative of the total net input to H1 with respect to w1 the same as we
did for the output neuron:
So, we put these values into equation (13) to find the final result.
Now, we will calculate the updated weight w1new with the help of the following formula:
In the same way, we calculate w2new, w3new, and w4new, which gives us the following values:
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have updated all the weights. We found the error 0.298371109 on the network when we fed
forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total error is down to
0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At
this point, the output neurons generate 0.015912196 and 0.984065734, i.e., close to our target
values, when we feed forward 0.05 and 0.1.
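The update of w5 can be checked in code. The learning rate η = 0.5 is an assumption (the text does not state it), but it reproduces the updated weight w5new = 0.35891648 given above.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Forward-pass values from the worked example.
h1 = sigmoid(0.05 * 0.15 + 0.10 * 0.20 + 0.35)   # 0.593269992
h2 = sigmoid(0.05 * 0.25 + 0.10 * 0.30 + 0.35)   # 0.596884378
y1 = sigmoid(h1 * 0.40 + h2 * 0.45 + 0.60)        # 0.75136507

# Chain rule for dEtotal/dw5:
#   dE/dout_y1 = (out_y1 - T1), dout_y1/dnet = out_y1 * (1 - out_y1),
#   dnet/dw5 = h1
grad_w5 = (y1 - 0.01) * y1 * (1 - y1) * h1

eta = 0.5  # learning rate (assumed; it reproduces the values in the text)
w5_new = 0.40 - eta * grad_w5
print(w5_new)
```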
In addition to traditional gradient-based learning, there are several differentiation algorithms and
techniques used in deep learning to train neural networks more effectively. Here are some
notable ones:
1. Stochastic Gradient Descent with Momentum (SGD with Momentum): This algorithm
enhances the standard SGD by incorporating momentum. Momentum helps accelerate
gradient descent by accumulating a weighted average of past gradients and using it to
update the parameters. This momentum term reduces oscillations and helps the optimizer
navigate flat or shallow regions more efficiently.
2. Adaptive Learning Rate Methods: These algorithms dynamically adjust the learning rate
during training based on the gradient information or historical update statistics. Some
popular methods include:
a. Adam (Adaptive Moment Estimation): Adam combines the advantages of adaptive learning
rates and momentum. It adapts the learning rate for each parameter based on estimates of first-
order moments (mean) and second-order moments (variance) of the gradients.
b. RMSprop (Root Mean Square Propagation): RMSprop adjusts the learning rate for each
parameter by dividing it by a moving average of the root mean square of past gradients. This
technique helps in controlling the learning rate based on the magnitude of the gradients.
c. Adagrad (Adaptive Gradient): Adagrad adapts the learning rate for each parameter by scaling
it inversely proportional to the cumulative sum of the historical squared gradients. This method
gives larger updates for parameters with infrequent updates and smaller updates for frequently
updated parameters.
3. Nesterov Accelerated Gradient (NAG): NAG is an optimization algorithm that improves
upon SGD with Momentum. It computes an intermediate step in the direction of the
accumulated momentum before calculating the gradient. This lookahead step allows
NAG to make better-informed updates and often leads to faster convergence.
4. Second-Order Methods: While most deep learning optimization algorithms rely on first-
order gradients, second-order methods consider second-order derivatives (Hessian) as
well. These methods can provide more accurate and faster convergence but come with
increased computational complexity. Examples include Newton's method and Quasi-
Newton methods like L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno).
5. Regularization Techniques: Regularization techniques are used to prevent overfitting and
improve generalization. These techniques modify the loss function or add additional
terms to the optimization process. Some commonly used regularization techniques
include L1 regularization (Lasso), L2 regularization (Ridge), Dropout, and Batch
Normalization.
6. Learning Rate Scheduling: Instead of using a fixed learning rate throughout training,
learning rate scheduling adjusts the learning rate dynamically over time. Techniques like
step decay, exponential decay, or polynomial decay reduce the learning rate periodically
or gradually during training. Learning rate scheduling can help fine-tune the optimization
process and improve convergence.
These differentiation algorithms and techniques are employed to optimize deep learning models,
enhance convergence, and improve generalization. The choice of algorithm depends on the
specific task, dataset, and model architecture, and it often involves experimentation to identify
the most effective approach.
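As a rough sketch of how two of these optimizers differ, the following plain-Python example (a toy objective with illustrative hyperparameter values, not tuned defaults) minimizes f(w) = (w − 3)² with SGD with momentum and with Adam:

```python
import math

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2, minimized at w = 3.
    return 2 * (w - 3)

# SGD with momentum: accumulate a decaying average of past gradients.
w, v = 0.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(300):
    v = beta * v + grad(w)
    w -= lr * v
w_momentum = w

# Adam: adapt the step per parameter using first/second moment estimates.
w, m, s = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w)
    m = b1 * m + (1 - b1) * g        # first-moment (mean) estimate
    s = b2 * s + (1 - b2) * g * g    # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)        # bias correction
    s_hat = s / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(s_hat) + eps)
w_adam = w

print(w_momentum, w_adam)
```

Both runs approach the minimum at w = 3; the momentum term damps oscillation for SGD, while Adam rescales each step by the gradient's running magnitude.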
Unit 4
Regularization for Deep Learning
Parameter norm Penalties
In our last post, we learned about feedforward neural networks and how to design them. In this
post, we will learn how to tackle one of the most central problems that arises in the domain of
machine learning: how to make our algorithm fit not only the training set but also the testing set.
When an algorithm performs well on the training set but poorly on the testing set, the algorithm
is said to be overfitted on the training data. After all, our main goal is to perform well on
never-before-seen data, i.e., to reduce overfitting. To tackle this problem, we have to make our
model generalize over the training data, which is done using the various regularization techniques
we will learn about in this post.
Strategies or techniques that are used to reduce the error on the test set, at the expense of
increased training error, are collectively known as regularization. Many such techniques are
available to the deep learning practitioner. In fact, developing more effective regularization
strategies has been one of the major research efforts in the field.
Regularization can be defined as any modification made to a learning algorithm that is
intended to reduce its generalization error but not its training error. This regularization is often
done by putting some extra constraints on a machine learning model, such as adding restrictions
on the parameter values, or by adding extra terms to the objective function that can be thought of
as corresponding to a soft constraint on the parameter values. If chosen correctly, these can lead to
a reduced testing error. An effective regularizer is one that makes a profitable trade-off, reducing
variance significantly while not overly increasing the bias.
Parameter norm penalties limit the capacity of models by adding a norm penalty Ω(θ) to the
objective function J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
Where alpha is a hyperparameter that weighs the relative contribution of the norm penalty omega.
Setting alpha to 0 means no regularization and larger values of alpha correspond to more
regularization.
For neural networks, we choose a parameter norm penalty that penalizes only the weights of
the affine transformations and leaves the biases unregularized. This is because biases require
less data to fit accurately than weights: weights specify the relationship between two variables
and require observing both variables in various conditions, whereas each bias controls only a
single variable, so the biases can be left unregularized.
L2 Parameter Regularization
This regularization is popularly known as weight decay. This strategy drives the weights closer to
the origin by adding the regularization term Ω, which is defined as:
Ω(θ) = (1/2)||w||₂²
We can see that the weight decay term is now multiplicatively shrinking the weight vector by a
constant factor on each step, before performing the usual gradient update.
L1 Regularization
Here the regularization term is defined as the sum of the absolute values of the individual
parameters: Ω(θ) = ||w||₁ = Σ|wᵢ|.
Corresponding gradient:
By observing the gradient, we can notice that the regularization contribution no longer scales
linearly with each wᵢ; instead it is a constant factor with a sign equal to sign(wᵢ).
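The contrast between the two penalties shows up directly in a single gradient step; a minimal sketch, assuming the data-loss gradient is zero so that only the penalty term acts (the values of α and the learning rate are arbitrary illustrative choices):

```python
# One gradient step under L2 (weight decay) and under L1 regularization.
alpha, lr = 0.1, 0.5

# L2: the penalty gradient is alpha * w, so the update multiplicatively
# shrinks the weight toward the origin by a constant factor.
w = 2.0
w_l2 = (1 - lr * alpha) * w

# L1: the penalty gradient is alpha * sign(w), a constant-magnitude pull
# toward zero, which is why L1 tends to produce sparse weights.
def sign(x):
    return (x > 0) - (x < 0)

w = 2.0
w_l1 = w - lr * alpha * sign(w)

print(w_l2, w_l1)
```

Iterating the L2 step shrinks a weight geometrically but never reaches zero exactly, whereas the constant L1 step can drive small weights all the way to zero.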
Dataset Augmentation
The best and easiest way to make a model generalize is to train it on a large amount of data, but
mostly we are provided with limited data. One way around this is to create fake data and add it to
our training set.
This approach is mostly taken for classification problems. A classifier needs to take a complicated,
high-dimensional input x and summarize it with a single category identity y, which means the task
is invariant to a wide variety of transformations, so we can generate new (x, y) pairs easily just by
transforming the x inputs in our training set. This approach isn't always suitable; for a task such as
density estimation, it is difficult to generate fake data unless we have already solved the density
estimation problem.
Dataset augmentation is a very popular approach for computer vision tasks such as image
classification or object recognition, as images are high-dimensional and include an enormous
variety of factors of variation, many of which can be easily simulated. Operations like translating
the training images a few pixels in each direction, rotating the image, or scaling the image can
often greatly improve generalization, even if the model has already been designed to be partially
translation invariant.
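Two of the operations mentioned above, mirroring and translation, can be sketched on a toy 3×3 "image" held as nested lists (no image library is assumed; function names are illustrative):

```python
# A 3x3 grayscale "image" as nested lists of pixel intensities.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

def hflip(image):
    # Mirror the image left-to-right.
    return [list(reversed(row)) for row in image]

def translate_right(image, pixels, fill=0):
    # Shift the image right, padding the vacated columns with `fill`.
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

print(hflip(img))
print(translate_right(img, 1))
```

Each transformed copy keeps the original label y, giving the classifier extra (x, y) pairs for free.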
Noise Robustness
Noise is often introduced to the inputs as a dataset augmentation strategy. The addition of noise
with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the
norm of the weights. Noise injection can be much more powerful than simply shrinking the
parameters, especially when the noise is added to the hidden units.
Another way that noise has been used in the service of regularizing models is by adding it to the
weights. This technique has been used primarily in the context of recurrent neural networks.
Semi-Supervised Learning
In semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x,
y) are used to estimate P(y | x) or predict y from x. In the context of deep learning, semi-supervised
learning usually refers to learning a representation h = f(x). The goal is to learn a representation
so that examples from the same class have similar representations. Unsupervised learning
provides cues about how to group training examples in representation space. Using principal
component analysis as a pre-processing step before applying our classifier is an example of this
approach.
Instead of using separate models for unsupervised and supervised components, one can construct
models in which a generative model of either P (x) or P(x, y) shares parameters with a
discriminative model of P(y | x). Now the structure of P(x) is connected to the structure of P(y | x)
in a way that is captured by the shared parametrization. By controlling how much of the
generative criterion is included in the total criterion, one can find a better trade-off than with a
purely generative or a purely discriminative training criterion.
Multi-Task Learning
Multi-task learning is a way to improve generalization by pooling the examples arising out of
several tasks. In the same way that additional training examples put more pressure on the
parameters of the model towards values that generalize well, when part of a model is shared
across tasks, that part of the model is more constrained towards good values, often yielding better
generalization.
The model can generally be divided into two kinds of parts and associated parameters:
• Task-specific parameters, which only benefit from the examples of their own task to achieve good generalization.
• Generic parameters, shared across all the tasks, which benefit from the pooled data of all the tasks.
Early Stopping
When training a large model on a sufficiently large dataset, if training runs for too long then, rather than improving the model's generalization capability, it increases overfitting. During training, the training error keeps decreasing, but after a certain point the validation error starts to increase, signifying that the model has started to overfit.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. The idea of early stopping is that as soon as the validation error starts to increase, we freeze the parameters and stop the training process. Alternatively, we can store a copy of the model parameters every time the error on the validation set improves, and return these parameters, rather than the latest ones, when training terminates.
Early stopping has an advantage over weight decay in that it automatically determines the correct amount of regularization, while weight decay requires many training experiments with different values of its hyperparameter.
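The checkpoint-and-stop logic above can be sketched in a few lines (a simulated loop over pre-computed validation losses; the `patience` threshold is an illustrative choice, not from the text):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Checkpoint parameters whenever validation loss improves, and stop
    once it has failed to improve for `patience` consecutive epochs."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0  # save checkpoint
        else:
            waited += 1
            if waited >= patience:
                break                                       # early stop
    return best_epoch, best_loss

# Validation loss falls, then rises: the model starts to overfit.
val_losses = [1.0, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7]
best_epoch, best_loss = train_with_early_stopping(val_losses)
```

The returned checkpoint is the one from the epoch with the lowest validation loss, not the last epoch trained.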
Bagging
Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models. The idea is to train several different models separately, then have all of the
models vote on the output for test examples. This is an example of a general strategy in machine
learning called model averaging. Techniques employing this strategy are known
as ensemble methods. This is an efficient method as different models don’t make the same types
of errors.
Bagging involves constructing k different datasets. Each dataset has the same number of examples
as the original dataset, but each dataset is constructed by sampling with replacement from the
original dataset. This means that, with high probability, each dataset is missing some of the
examples from the original dataset and also contains several duplicate examples. Model i is then
trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models.
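The bootstrap construction of the k datasets can be sketched directly (illustrative sizes; on average each resampled dataset contains about 63.2% of the unique original examples):

```python
import numpy as np

rng = np.random.default_rng(42)
original = np.arange(1000)            # indices of the original examples

# Construct k bootstrap datasets by sampling with replacement.
k = 5
datasets = [rng.choice(original, size=len(original), replace=True)
            for _ in range(k)]

# Each resampled dataset has the original size, but with high probability
# misses some original examples and contains several duplicates.
unique_fractions = [len(np.unique(d)) / len(original) for d in datasets]
```

Model i would then be trained on `datasets[i]`, and the ensemble's prediction would average or vote over the k trained models.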
Dropout
Dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks. The method of bagging cannot be directly applied to large neural networks, as it involves training multiple models and evaluating multiple models on each test example. Since training and evaluating such networks is costly in terms of runtime and memory, this method is impractical for neural networks. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks. Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network.
In most modern neural networks, based on a series of affine transformations and nonlinearities,
we can effectively remove a unit from a network by multiplying its output value by zero. This
procedure requires some slight modification for models such as radial basis function networks,
which take the difference between the unit’s state and some reference value. Here, we present
the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially
modified to work with other operations that remove a unit from the network.
Dropout training is not quite the same as bagging training. In the case of bagging, the models are
all independent. In the case of dropout, the models share parameters, with each model inheriting a
different subset of parameters from the parent neural network. This parameter sharing makes it
possible to represent an exponential number of models with a tractable amount of memory. One
advantage of dropout is that it is very computationally cheap. Using dropout during training
requires only O(n) computation per example per update, to generate n random binary numbers
and multiply them by the state. Another significant advantage of dropout is that it does not
significantly limit the type of model or training procedure that can be used. It works well with
nearly any model that uses a distributed representation and can be trained with stochastic gradient
descent.
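The "n random binary numbers multiplied by the state" can be sketched as follows (an inverted-dropout variant, which also rescales the survivors so the expected activation is unchanged; the shapes and probabilities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_prob, training=True):
    """Zero each unit with probability drop_prob; rescale survivors so
    the expected activation matches the no-dropout network."""
    if not training or drop_prob == 0.0:
        return h
    mask = rng.random(h.shape) >= drop_prob   # n random binary numbers
    return h * mask / (1.0 - drop_prob)

h = np.ones(10000)                      # a layer of unit activations
out = dropout(h, drop_prob=0.5)
kept_fraction = (out != 0).mean()       # close to 0.5
mean_activation = out.mean()            # close to 1.0 in expectation
```

This is the O(n)-per-example cost mentioned above: one random number and one multiplication per unit.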
Adversarial Training
In many cases, neural networks seem to have achieved human-level performance on a task, but to check whether they really perform at human level, networks are tested on adversarial examples. An adversarial example is an input a, constructed near a data point x, such that the model output at a is very different from the output at x. Adversarial examples are intentionally constructed using an optimization procedure, and models often have a nearly 100% error rate on them.
Adversarial training helps regularize models: when models are trained on training sets that are augmented with adversarial examples, the generalization of the model improves.
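The construction can be illustrated with a toy linear classifier (the weights and input are made up for illustration; this is the fast-gradient-sign idea, not a trained model):

```python
import numpy as np

# A toy linear classifier: predict class 1 if w.x > 0, else class 0.
w = np.array([1.0, -2.0, 3.0, -4.0])

def predict(x):
    return int(w @ x > 0)

# The gradient of the score w.x with respect to x is w, so stepping
# against sign(w) (for a class-1 input) lowers the score as fast as
# possible per unit of max-norm perturbation.
def adversarial(x, eps):
    step = -np.sign(w) if predict(x) == 1 else np.sign(w)
    return x + eps * step

x = np.array([0.5, 0.0, 0.0, 0.0])   # classified as 1, with a small margin
x_adv = adversarial(x, eps=0.1)      # every coordinate moves by at most 0.1
```

Even though `x_adv` stays within 0.1 of `x` in every coordinate, the predicted class flips, which is exactly what makes such examples useful for adversarial training.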
Norm penalties can also be used in combination, leading to elastic net regularization, which combines
both the L1 and L2 penalties. The elastic net penalty helps balance the benefits of sparsity from the L1
norm and the robustness to correlated features from the L2 norm.
Norm Penalties as Constrained Optimization
Apart from adding a penalty term Ω(θ) to the objective function J and minimizing their sum J̃, we can also keep Ω(θ) small by optimizing J subject to the constraint Ω(θ) < k.
This can be done by constructing a generalized Lagrange function, consisting of the original objective function plus a penalty term:

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)

The solution is:

θ* = argmin_θ max_{α ≥ 0} L(θ, α; X, y)

Note that both θ and α are variables in this objective function. When α* is fixed (say, we already know the best α), the optimization problem becomes

θ* = argmin_θ L(θ, α*) = argmin_θ J(θ; X, y) + α*Ω(θ)

which is the same as regularization with a parameter norm penalty. For example, if Ω is the L2 norm, we can think of it as constraining the weights to lie inside an L2 ball. Benefits of regularization as constraints:
• We can specify a concrete constraint region, while the effect of adjusting α on Ω(θ) is vague. We can take a step with stochastic gradient descent and then re-project θ back into the feasible region Ω(θ) < k.
• We can avoid the dead spots that penalty terms can introduce into the objective function: explicit constraints only take effect when the weights attempt to leave the constraint region.
• The optimization procedure is more stable. With a penalty, a large learning rate may result in a positive feedback loop in which large weights induce large gradients; explicit constraints with re-projection prevent this.
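The re-projection step for an L2-ball constraint is a one-liner (illustrative parameter values):

```python
import numpy as np

def project_l2_ball(theta, k):
    """Re-project parameters into the feasible region ||theta||_2 <= k.
    Inside the ball, theta is untouched; outside, it is rescaled onto
    the boundary."""
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

theta = np.array([3.0, 4.0])              # ||theta|| = 5, outside the ball
theta = project_l2_ball(theta, k=1.0)     # rescaled back onto ||theta|| = 1
```

In constrained SGD, this projection is applied after every gradient step, so the weights can never drift outside Ω(θ) < k regardless of the learning rate.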
Regularization and Under-Constrained Problems
In the context of deep learning, regularization and under-constrained problems are important
concepts.
Regularization in Deep Learning: In deep learning, regularization techniques are employed to
prevent overfitting, which occurs when a model performs well on the training data but fails to
generalize to new, unseen data. Overfitting often happens when a model becomes too complex
and starts memorizing the training examples instead of learning meaningful patterns.
Regularization techniques in deep learning typically involve adding a regularization term to the
loss function during training. This extra term encourages the model to have certain desirable
properties, such as smaller weights or sparsity. The most common regularization techniques used
in deep learning include L1 and L2 regularization (also known as weight decay), dropout, and
batch normalization.
In a linear regression problem, when the number of instances is smaller than the number of variables, the problem is under-constrained: XᵀX is singular, so the closed-form solution w = (XᵀX)⁻¹Xᵀy cannot be computed.
In a logistic regression problem, when the two classes are linearly separable by a vector w, 2w is also a feasible solution. An iterative optimization algorithm may keep increasing the magnitude of w and never stop.
However, when we add a regularization term to the loss function, convergence is guaranteed. For example, w will not be updated to 2w, because the likelihood loss barely decreases while the regularization term grows substantially. The idea of using regularization to solve under-determined problems extends beyond machine learning.
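The linear-regression case can be checked numerically (illustrative dimensions; with 5 instances and 10 variables, XᵀX is rank-deficient, but adding a weight-decay term αI makes the system solvable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fewer instances (5) than variables (10): X^T X is singular.
X = rng.normal(size=(5, 10))
y = rng.normal(size=5)

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)     # < 10, so XtX has no inverse

# With an L2 penalty alpha * ||w||^2, the regularized solution
# w = (X^T X + alpha I)^{-1} X^T y exists for any alpha > 0.
alpha = 0.1
w = np.linalg.solve(XtX + alpha * np.eye(10), X.T @ y)
```

This is exactly the sense in which regularization turns an under-determined problem into a well-posed one.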
Dataset Augmentation
Deep learning algorithms are getting increasingly complex, and neural nets are getting deeper and deeper. More layers in neural nets means more parameters that your model is learning from your data. In some recent state-of-the-art models, there can be more than 100 million parameters learned during training.
When your model is trying to understand a relationship this deeply, it needs a lot of examples to learn from. That's why popular datasets for models like these might have something like 10,000 images for training. That size of data is not at all easy to come by.
Even if you're using simpler or smaller types of models, it's challenging to organize a dataset large enough to train effectively. Especially as machine learning gets applied to newer and newer verticals, it's becoming harder and harder to find reliable training data. If you wanted to create a classifier to distinguish iPhones from Google Pixels, how would you get thousands of different photos?
Finally, even with the right size training set, things can still go awry. Remember that algorithms
don’t think like humans: while you classify images based on a natural understanding of what’s in
the image, algorithms are learning that on the fly. If you’re creating a cat / dog classifier and
most of your training images for dogs have a snowy background, your algorithm might end up
learning the wrong rules. Having images from varied perspectives and with different contexts is
crucial.
For an idea of just how much this process can help, check out this benchmark that NanoNets ran
in their explainer post. Their results showed an almost 20 percentage point increase in test
accuracy with dataset augmentation applied.
It’s safer for us to assume the cause of this accuracy boost was a bit more complicated than just
dataset augmentation, but the message is clear: it can really help.
Before we dive into what you might practically do to augment your data, it's worth noting that there are two broad approaches to when to augment it. In offline dataset augmentation, transforms are applied en masse to your dataset before training. You might, for example, flip each of your images horizontally, resulting in a training set with twice as many examples. In online dataset augmentation, transforms are applied in real time as batches are passed into training. This won't produce a larger stored dataset, but is much more practical for larger training sets.
Most of these transformations have fairly simple implementations in packages like Tensorflow.
And though they might seem simple, combining them in creative ways across your dataset can
yield impressive improvements in model accuracy.
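An offline flip transform of the kind described above takes only a few lines (toy 8×8 images for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 8, 8), dtype=np.uint8)  # 4 toy images

# Offline augmentation: add a horizontally flipped copy of every image,
# doubling the size of the training set before training starts.
flipped = images[:, :, ::-1]
augmented = np.concatenate([images, flipped], axis=0)
```

Rotations, crops, and color shifts compose the same way; in an online pipeline, the same transforms would instead be applied per batch as it is fed to the trainer.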
One issue that often comes up is input size requirements, which are one of the most frustrating
parts of neural nets for practitioners. If you shift or rotate an image, you’re going to end up with
something that’s a different size, and that needs to be fixed before training. Different approaches
advocate filling in empty space with constant values, zooming in until you’ve reached the right
size, or reflecting pixel values into your empty space. As with any preprocessing, testing and
validating is the best way to find a definitive answer.
You can utilize pre-trained nets that transfer exterior styles onto your training images as part of a
dataset augmentation pipeline.
Noise Robustness
Noise with infinitesimal variance imposes a penalty on the norm of the weights. Noise added to
hidden units is very important and is discussed later in Dropout. Noise can even be added to the
weights. This has several interpretations. One of them is that adding noise to weights is a
stochastic implementation of Bayesian inference over the weights, where the weights are
considered to be uncertain, with the uncertainty being modelled by a probability distribution. For example, in the linear regression case, we want to learn the mapping y(x) for each feature vector x by minimizing the mean squared error. Now, suppose a zero-mean, unit-variance Gaussian random noise ϵ is added to the weights. We still want to learn the appropriate mapping by reducing the mean squared error. Minimizing the loss
after adding noise to the weights is equivalent to adding another regularization term which makes
sure that small perturbations in the weight values don’t affect the predictions much, thus
stabilising training.
Sometimes we may have wrong output labels, in which case maximizing p(y | x) may not be a good idea. In such a case, we can add noise to the labels by assigning a probability of (1 − ϵ) that the label is correct and a probability of ϵ that it is not. In the latter case, all the other labels are equally likely. Label smoothing regularizes a model with k softmax outputs by assigning the correct class a target probability of (1 − ϵ) and each of the remaining (k − 1) classes a probability of ϵ / (k − 1).
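The smoothed targets are easy to construct directly (illustrative ϵ and class count):

```python
import numpy as np

def smooth_labels(y, k, eps=0.1):
    """Replace hard one-hot targets with (1 - eps) on the correct class
    and eps / (k - 1) spread over the remaining k - 1 classes."""
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

y = np.array([0, 2, 1])              # true class indices
T = smooth_labels(y, k=3, eps=0.1)   # e.g. row 0 is [0.9, 0.05, 0.05]
```

Each row still sums to 1, so the smoothed targets remain valid probability distributions for the softmax cross-entropy loss.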
Semi-Supervised Learning
Today's machine learning algorithms can be broadly classified into three categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Casting Reinforcement Learning aside, the primary two categories of machine learning problems are Supervised and Unsupervised Learning. The basic difference between the two is that Supervised Learning datasets have an output label associated with each tuple while Unsupervised Learning datasets do not.
Intuitively, one may imagine the three types of learning algorithms as Supervised
learning where a student is under the supervision of a teacher at both home and school,
Unsupervised learning where a student has to figure out a concept himself and Semi-
Supervised learning where a teacher teaches a few concepts in class and gives questions
as homework which are based on similar concepts.
P(x, y) denotes the joint distribution of x and y, i.e., corresponding to a training sample x, we have a label y. P(x) denotes the marginal distribution of x, i.e., just the training examples without any labels. Semi-supervised learning uses both P(x, y) (labelled samples) and P(x) (unlabelled samples) to estimate P(y|x) (since we want to predict the class, given the training sample). We want to learn some representation h = f(x) such that samples which are closer in the input space have similar representations, and a linear classifier in the new space achieves good generalization.
Instead of separating the supervised and unsupervised criteria, we can instead have a generative
model of P(x) (or P(x, y)) which shares parameters with the discriminative model. The idea is to
share the unsupervised/generative criterion with the supervised criterion to express a prior belief
that the structure of P(x) (or P(x, y)) is connected to the structure of P(y|x), which is expressed by
the shared parameters.
Multi-Task Learning (MTL)
Multi-Task Learning (MTL) is a type of machine learning technique where a model is
trained to perform multiple tasks simultaneously. In deep learning, MTL refers to
training a neural network to perform multiple tasks by sharing some of the network’s
layers and parameters across tasks.
In MTL, the goal is to improve the generalization performance of the model by
leveraging the information shared across tasks. By sharing some of the network’s
parameters, the model can learn a more efficient and compact representation of the data,
which can be beneficial when the tasks are related or have some commonalities.
There are different ways to implement MTL in deep learning, but the most common
approach is to use a shared feature extractor and multiple task-specific heads. The
shared feature extractor is a part of the network that is shared across tasks and is used to
extract features from the input data. The task-specific heads are used to make
predictions for each task and are typically connected to the shared feature extractor.
Another approach is to use a shared decision-making layer, where the decision-making
layer is shared across tasks, and the task-specific layers are connected to the shared
decision-making layer.
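The shared-extractor-plus-heads architecture described above can be sketched as a forward pass (all dimensions and weights here are made-up illustrations: a 16-D input, an 8-D shared representation, a 3-class head and a scalar regression head):

```python
import numpy as np

rng = np.random.default_rng(0)

W_shared = rng.normal(size=(16, 8)) * 0.1   # shared feature extractor
W_cls = rng.normal(size=(8, 3)) * 0.1       # task-specific head: classification
W_reg = rng.normal(size=(8, 1)) * 0.1       # task-specific head: regression

def forward(x):
    h = np.tanh(x @ W_shared)   # features shared across both tasks
    logits = h @ W_cls          # classification head
    value = h @ W_reg           # regression head
    return logits, value

x = rng.normal(size=(5, 16))    # a batch of 5 examples
logits, value = forward(x)
```

During training, gradients from both task losses flow into `W_shared`, which is what constrains the shared parameters toward values that work for every task.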
MTL can be useful in many applications such as natural language processing, computer vision, and healthcare, where multiple tasks are related or have some commonalities. It is also useful when data is limited: MTL can help improve the generalization performance of the model by leveraging the information shared across tasks. However, MTL also has its own limitations: when the tasks are very different or unrelated, sharing parameters can hurt performance (negative transfer).
Multi-Task Learning is a sub-field of Deep Learning. It is recommended that you familiarize yourself with the concepts of neural networks to understand what multi-task learning means.
Soft Parameter Sharing – Each model has their own sets of weights and biases and the
distance between these parameters in different models is regularized so that the
parameters become similar and can represent all the tasks.
Assumptions and Considerations – Using MTL to share knowledge among tasks is useful only when the tasks are similar; when this assumption is violated, performance can decline significantly.
Applications: MTL techniques have found various uses; some of the major applications are:
Object detection and Facial recognition
Self Driving Cars: Pedestrians, stop signs and other obstacles can be detected
together
Multi-domain collaborative filtering for web applications
Stock Prediction
Language Modelling and other NLP applications
Important points:
Here are some important points to consider when implementing Multi-Task Learning
(MTL) for deep learning:
1. Task relatedness: MTL is most effective when the tasks are related or have some
commonalities, such as natural language processing, computer vision, and
healthcare.
2. Data limitation: MTL can be useful when the data is limited, as it allows the model
to leverage the information shared across tasks to improve the generalization
performance.
3. Shared feature extractor: A common approach in MTL is to use a shared feature
extractor, which is a part of the network that is shared across tasks and is used to
extract features from the input data.
4. Task-specific heads: Task-specific heads are used to make predictions for each task
and are typically connected to the shared feature extractor.
5. Shared decision-making layer: Another approach is to use a shared decision-making
layer, where the decision-making layer is shared across tasks, and the task-specific
layers are connected to the shared decision-making layer.
6. Careful architecture design: The architecture of MTL should be carefully designed
to accommodate the different tasks and to make sure that the shared features are
useful for all tasks.
7. Overfitting: MTL models can be prone to overfitting if the model is not regularized
properly.
8. Avoiding negative transfer: when the tasks are very different or independent, MTL
can lead to suboptimal performance compared to training a single-task model.
Therefore, it is important to make sure that the shared features are useful for all
tasks to avoid negative transfer.
Early stopping
A significant challenge when training a machine learning model is deciding how
many epochs to run. Too few epochs might not lead to model convergence, while too
many epochs could lead to overfitting.
Although the validation set strategy is the best in terms of preventing overfitting, it usually takes a large number of epochs before a model begins to overfit, which can cost a lot of computing power. A smart way to get the best of both worlds is a hybrid approach: monitor the validation error, and also stop when the loss function update becomes smaller than a threshold. Training stops as soon as either criterion is met.
Parameter Tying
Some standard regularisers like L1 and L2 penalize model parameters for deviating from the fixed value of zero. One of the side effects of Lasso or group-Lasso regularization in learning a deep neural network is that many of the parameters may become zero, reducing the amount of memory required to store the model and lowering the computational cost of applying it. A significant drawback of Lasso (or group-Lasso)
regularization is that in the presence of groups of highly correlated features, it tends to select
only one or an arbitrary convex combination of elements from each group. Moreover, the
learning process of Lasso tends to be unstable because the subsets of parameters that end up
selected may change dramatically with minor changes in the data or algorithmic procedure.
In Deep Neural Networks, it is almost unavoidable to encounter correlated features due to the
high dimensionality of the input to each layer and because neurons tend to adapt, producing
strongly correlated features that we pass as an input to the subsequent layer.
The issues we face while using Lasso or group-Lasso are countered by a regularizer known as the group version of the ordered weighted one norm, group-OWL (GrOWL). GrOWL promotes sparsity and simultaneously learns which parameters should share a similar value.
GrOWL has been effective in linear regression, identifying and coping with strongly correlated
covariates.
Unlike standard sparsity-inducing regularizers (e.g., Lasso), GrOWL eliminates unimportant
neurons by setting all their weights to zero and explicitly identifies strongly correlated neurons
by tying the corresponding weights to an expected value.
This ability of GrOWL motivates the following two-stage procedure:
(i) use GrOWL regularization during training to simultaneously identify significant neurons and
groups of parameters that should be tied together.
(ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only
the significant neurons and implementing the learned tying structure.
Parameter Sharing
With this method, the parameters of one model, trained as a classifier in a supervised paradigm, are regularised to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data).
Many of the parameters in the classifier model can be paired with similar parameters in the unsupervised model thanks to such designs. While a parameter norm penalty is one way to regularise parameters to be close to one another, the more prevalent approach is to use constraints: to force sets of parameters to be equal. Because we view the various models or model components as sharing a unique set of parameters, this form of regularisation is commonly referred to as parameter sharing. The fact that only a subset of the parameters (the unique set) needs to be retained in memory is a significant advantage of parameter sharing over regularising the parameters to be close (through a norm penalty).
This can result in a large reduction in the memory footprint of certain models, such as the convolutional
neural network.
Convolutional neural networks (CNNs) used in computer vision are by far the most widespread and
extensive usage of parameter sharing. Many statistical features of natural images are translation
insensitive. A shot of a cat, for example, can be translated one pixel to the right and still be a shot of a
cat. By sharing parameters across several picture locations, CNNs take this property into account.
Different locations in the input are computed with the same feature (a hidden unit with the same
weights). This indicates that whether the cat appears in column i or column i + 1 in the image, we can
find it with the same cat detector.
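The "same detector at every position" idea is exactly a convolution; a 1-D toy version shows that shifting the input simply shifts the detector's response (the kernel and signal are illustrative):

```python
import numpy as np

# One shared 3-tap detector applied at every position of the input:
# the same weights find the pattern wherever it occurs.
kernel = np.array([1.0, 2.0, 1.0])
signal = np.zeros(10)
signal[3] = 1.0                              # the "cat" at position 3

response = np.convolve(signal, kernel, mode="same")
shifted = np.roll(signal, 1)                 # the same cat, one step right
shifted_response = np.convolve(shifted, kernel, mode="same")
```

Because the weights are shared across positions, the shifted input produces the same response shifted by one step, rather than requiring a separately learned detector per location.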
CNN’s have been able to reduce the number of unique model parameters and raise network sizes
greatly without requiring a comparable increase in training data thanks to parameter sharing. It’s still
one of the best illustrations of how domain knowledge can be efficiently integrated into the network
architecture.
Bagging and other Ensemble Methods
Ensemble Methods
The general principle of an ensemble method in machine learning is to combine the predictions of several models. These models are built with a given learning algorithm in order to improve robustness over a single model. Ensemble methods can be divided into two groups:
Parallel ensemble methods: In these methods, the base learners are generated in
parallel simultaneously. For example, when deciding the movie you want to watch, you
may ask multiple friends for suggestions and probably watch the movie which got the
highest votes.
Sequential ensemble methods: In this technique, different learners learn sequentially
with early learners fitting simple models to the data. Then the data is analyzed for
errors. The goal is to solve for net error from the prior model. The overall performance
can be boosted by weighing previously mislabeled examples with higher weight.
Most ensemble methods use a single base learning algorithm to produce homogeneous base
learners, i.e. learners of the same type, leading to homogeneous ensembles. For
example, Random forests (Parallel ensemble method) and Adaboost(Sequential ensemble
methods).
Some methods use heterogeneous learners, i.e. learners of different types. This leads to
heterogeneous ensembles. For ensemble methods to be more accurate than any of its members,
the base learners have to be as accurate and as diverse as possible. In Scikit-learn, there is a
model known as a voting classifier. This is an example of heterogeneous learners.
Bagging
Bagging, a Parallel ensemble method (stands for Bootstrap Aggregating), is a way to decrease
the variance of the prediction model by generating additional data in the training stage. This is
produced by random sampling with replacement from the original set. By sampling with
replacement, some observations may be repeated in each new training data set. In the case of bagging, every element has the same probability of appearing in a new dataset. Increasing the size of the training set this way does not improve the model's predictive force by itself; rather, it decreases the variance and narrowly tunes the prediction to an expected outcome.
These multisets of data are used to train multiple models. As a result, we end up with an
ensemble of different models, and the average of all their predictions is used, which is more robust than a single model. In the case of regression, the prediction is the average of all the predictions given by the different models. In the case of classification, the majority vote is taken into consideration.
For example, decision tree models tend to have a high variance, hence we apply bagging to them. Usually, the Random Forest model is used for this purpose. It is an extension of bagging: it takes a random selection of features rather than using all features to grow each tree. When you have many such random trees, it is called a Random Forest.
Boosting
Boosting is a sequential ensemble method that in general decreases the bias error and builds
strong predictive models. The term ‘Boosting’ refers to a family of algorithms which converts
a weak learner to a strong learner.
Boosting gets multiple learners. The data samples are weighted and therefore, some of them
may take part in the new sets more often.
In each iteration, data points that are mispredicted are identified and their weights are
increased so that the next learner pays extra attention to get them right. The following figure
illustrates the boosting process.
During training, the algorithm allocates a weight to each resulting model. A learner with good prediction results on the training data will be assigned a higher weight than a poor one. So when evaluating a new learner, boosting also needs to keep track of the learners' errors.
Some of the Boosting techniques include an extra-condition to keep or discard a single learner.
For example, in AdaBoost an error of less than 50% is required to maintain the model;
otherwise, the iteration is repeated until achieving a learner better than a random guess.
Bagging vs Boosting
There’s no outright winner, it depends on the data, the simulation, and the
circumstances. Bagging and Boosting in machine learning decrease the variance of a single
estimate as they combine several estimates from different models. As a result, the performance
of the model increases, and the predictions are much more robust and stable.
But how do we measure the performance of a model? One of the ways is to compare its
training accuracy with its validation accuracy which is done by splitting the data into two sets,
viz- training set and validation set.
The model is trained on the training set and evaluated on the validation set. Thus, the training
accuracy is evaluated on the training set and gives us a measure of how good the model can fit
the training data. On the other hand, validation accuracy is evaluated on the validation set and
reveals the generalization ability of the model. A model’s ability to generalize is crucial to the
success of a model. Thus, we can say that the performance of a model is good if it can fit the
training data well and also predict the unknown data points accurately.
If a single model gets low performance, bagging will rarely get a better bias. However, boosting can generate a combined model with lower errors, as it optimizes the advantages and reduces the pitfalls of the single model. On the other hand, bagging can increase the generalization ability of the model and help it better predict the unknown samples. Let us see an example of this in the next section.
Implementation
In this section, we demonstrate the effect of Bagging and Boosting on the decision boundary of
a classifier. Let us start by introducing some of the algorithms used in this code.
Decision Tree Classifier: Decision Tree Classifier is a simple and widely used
classification technique. It applies a straightforward idea to solve the classification
problem. Decision Tree Classifier poses a series of carefully crafted questions about the
attributes of the test record. Each time it receives an answer, a follow-up question is
asked until a conclusion about the class label of the record is reached.
Decision Stump: A decision stump is a machine learning model consisting of a one-
level decision tree. That is, it is a decision tree with one internal node (the root) which
is immediately connected to the terminal nodes (its leaves). A decision stump makes a
prediction based on the value of just a single input feature. Here we take a decision stump
as the weak learner for the AdaBoost algorithm.
RandomForest: Random forest is an ensemble learning algorithm that uses the concept
of Bagging.
AdaBoost: AdaBoost, short for Adaptive Boosting, is a machine learning meta-
algorithm that works on the principle of Boosting. We use a Decision stump as a weak
learner here.
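The boosting loop with decision stumps can be sketched in pure NumPy (a minimal illustration of the weighting mechanics, not the scikit-learn classes described above; the tiny 1-D dataset is made up, and no single stump can classify it perfectly):

```python
import numpy as np

def stump_predict(X, feat, thresh, polarity):
    """One-level decision tree: predict +1/-1 from a single feature."""
    return np.where(polarity * (X[:, feat] - thresh) > 0, 1, -1)

def adaboost_fit(X, y, rounds=30):
    n = len(y)
    w = np.full(n, 1.0 / n)              # example weights, updated each round
    learners = []
    for _ in range(rounds):
        best = None
        for feat in range(X.shape[1]):
            u = np.unique(X[:, feat])
            # Midpoint thresholds, plus one below the range so the stump
            # set also contains constant (all +1 / all -1) predictors.
            thresholds = np.concatenate(([u[0] - 1.0], (u[:-1] + u[1:]) / 2.0))
            for thresh in thresholds:
                for polarity in (1, -1):
                    pred = stump_predict(X, feat, thresh, polarity)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, polarity)
        err, feat, thresh, polarity = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = stump_predict(X, feat, thresh, polarity)
        w *= np.exp(-alpha * y * pred)   # increase weights of mistakes
        w /= w.sum()
        learners.append((alpha, feat, thresh, polarity))
    return learners

def adaboost_predict(X, learners):
    score = sum(a * stump_predict(X, f, t, p) for a, f, t, p in learners)
    return np.where(score > 0, 1, -1)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])       # not separable by any one stump
learners = adaboost_fit(X, y, rounds=30)
accuracy = float((adaboost_predict(X, learners) == y).mean())
```

Each round picks the stump with the lowest weighted error, then boosts the weights of the points it got wrong, so later stumps concentrate on the hard examples; the weighted vote of the stumps classifies points no single stump can.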
Dropout
INTRODUCTION
So before diving deep into its world, let’s address the first question. What is
the problem that we are trying to solve?
The best way to reduce overfitting, and the best way to regularise a fixed-size model, is to average the predictions from all possible settings of the parameters. But this is far too computationally expensive and isn't feasible for real-time inference/prediction.
What is a Dropout?
The term “dropout” refers to dropping out nodes (in the input and hidden
layers) of a neural network (as seen in Figure 1). All the forward and
backward connections of a dropped node are temporarily removed,
thus creating a new network architecture out of the parent network. The
nodes are dropped with a dropout probability of p.
For example, suppose the input is x = {1, 2, 3, 4, 5} and dropout is applied at the input
layer with a keep probability of 0.8. During the forward propagation (training), 20% of
the nodes would be dropped on average, i.e. x could become {1, 0, 3, 4, 5} or
{1, 2, 0, 4, 5}, and so on. Similarly, dropout is applied to the hidden layers.
For instance, if the hidden layers have 1000 neurons (nodes) and a dropout is
applied with drop probability = 0.5, then 500 neurons would be randomly
dropped in every iteration (batch).
Generally, for the input layer, the keep probability, i.e. 1 − drop probability, is kept
closer to 1, with 0.8 suggested as the best value by the authors. For the hidden
layers, the greater the drop probability, the more sparse the model; 0.5 is the most
commonly used drop probability, which means dropping 50% of the nodes.
In the overfitting problem, the model learns the statistical noise. To be precise,
the main motive of training is to decrease the loss function, given all the units
(neurons). So in overfitting, a unit may change in a way that fixes up the
mistakes of the other units. This leads to complex co-adaptations, which in
turn leads to the overfitting problem because this complex co-adaptation fails
to generalise on the unseen dataset.
Now, if we use dropout, it prevents these units from fixing up the mistakes of other
units, thus preventing co-adaptation, because in every iteration the presence of a
unit is highly unreliable. So by randomly dropping a few units (nodes), it
forces the layers to take more or less responsibility for the input by taking a
probabilistic approach.
This ensures that the model is getting generalised and hence reducing the
overfitting problem.
Figure 2:(a) Hidden layer features without dropout; (b) Hidden layer features with
dropout
From figure 2, we can easily make out that the hidden layer with dropout is
learning more of the generalised features than the co-adaptations in the layer
without dropout. It is quite apparent, that dropout breaks such inter-unit
relations and focuses more on generalisation.
Dropout Implementation
Figure 3 (a) A unit (neuron) during training is present with a probability p and is
connected to the next layer with weights ‘w’ ; (b) A unit during inference/prediction is
always present and is connected to the next layer with weights, ‘pw’
In the standard neural network, during the forward propagation we have the
following equation:

z^(l+1) = w^(l+1) · y^(l) + b^(l+1)

where:
z: the vector of outputs from layer (l + 1) before activation
y: the vector of outputs from layer l
w: the weights connecting layer l to layer (l + 1)
b: the bias of layer (l + 1)
Further, with the activation function, z is transformed into the output for layer
(l+1).
Figure 6: Comparison of the dropout network with the standard network for a given
layer during forward propagation
Now we know how dropout works mathematically, but what happens during
inference/prediction? Do we use the network with dropout, or do we remove the dropout during
inference?
This is one of the most important concepts of dropout, which very few data scientists are aware of.
According to the original implementation (Figure 3b), during inference we do not use a dropout
layer; all the units are considered during the prediction step. But because every unit of a layer is
now active, the activations would be larger than those seen during training. To deal with this
problem, the weights are scaled by the keep probability p (i.e. w becomes pw, as in Figure 3b),
so that the expected output of each unit matches its training-time output. With this, the network is
able to make accurate predictions.
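A minimal NumPy sketch of this behaviour (my own illustration, not the original paper's code): during training each unit is kept with probability p and zeroed otherwise; at inference all units stay active and the activations (equivalently, the outgoing weights) are scaled by p.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(y, p_keep, training):
    """Original (non-inverted) dropout: drop units while training,
    scale by the keep probability p at inference (w -> p*w in Figure 3b)."""
    if training:
        mask = rng.random(y.shape) < p_keep   # keep each unit with prob p_keep
        return y * mask                       # dropped units output 0
    return y * p_keep                         # expected activation matches training

h = np.ones(10_000)                           # a layer of 10,000 unit activations
train_out = dropout(h, 0.8, training=True)
infer_out = dropout(h, 0.8, training=False)
```

With p_keep = 0.8, roughly 20% of the activations in train_out are zero, and infer_out is uniformly 0.8, matching the expected value of the training-time output.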
Adversarial Training
In this figure, we can see that the classification of the perturbed image
(rightmost) is clearly absurd. We can also notice that the difference between
the original and modified images is very slight.
With articles like “California’s finally ready for truly driverless cars” or “The
Pentagon’s ‘Terminator Conundrum’: Robots That Could Kill on Their Own”,
applications of adversarial examples are not exactly hard to come up with…
In this post I will first explain how such images are created and then go
through the main defenses that have been published. For more details and
more rigorous information, please refer to the research papers referenced.
In the figure on the left, you can see a simple curve. Suppose that we want to
find a local minimum (a value of x for which f(x) is locally minimal).
Gradient descent consists of the following steps: first pick an initial value
for x, then compute the derivative f’ of f with respect to x and evaluate it at our
initial guess. f’(x) is the slope of the tangent to the curve at x. Depending on the
sign of this slope, we know whether we have to increase or decrease x to
make f(x) decrease. In the example on the left, the slope is negative, so we
should increase the value of x to make f(x) decrease. As the tangent is a good
local approximation of the curve, repeatedly taking small steps in this way
brings us to a local minimum.
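The procedure can be written out as a toy loop (curve and starting point are my own choices for illustration), minimizing f(x) = (x − 2)² + 1 from a point where the slope is negative:

```python
def f_prime(x):
    return 2 * (x - 2)          # derivative of f(x) = (x - 2)**2 + 1

x, lr = -3.0, 0.1               # initial guess; the slope here is negative
for _ in range(100):
    x -= lr * f_prime(x)        # slope < 0  =>  x increases, f(x) decreases
```

After 100 steps x has converged to the local minimum at x = 2, which for this convex curve is also the global one.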
Suppose that we have a set of points and we want to find a line that is a
reasonable approximation for these values.
Our machine learning model will be the line y = ax + b and the model
parameters will be a and b.
Now to use the gradient descent, we are going to define a function of which
we will want to find a local minimum. This is our loss function.
This loss takes a data point x, its corresponding value y and the model
parameters a and b. The loss is the squared difference of the real
value y and ax + b, the prediction of our model. The bigger the difference
between the real and predicted values is, the bigger the value of the loss
function will be. Intuitively, we chose the squaring operation to penalize big
differences between real and predicted values more than small ones.
Taking the derivatives of this loss with respect to the model parameters gives
∂L/∂a = −2x(y − (ax + b)) and ∂L/∂b = −2(y − (ax + b)). As before, we can
evaluate these derivatives with our current values of a and b for each data
point (x, y), which gives us the slopes of the tangents to the loss function, and
use these slopes to update a and b in order to minimize L.
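A compact sketch of this fitting loop (the data points are generated by me for illustration, from the line y = 3x + 1):

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = 3 * x + 1                      # points generated from the line y = 3x + 1

a, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = y - (a * x + b)          # residual of the current line at each point
    a += lr * np.mean(2 * x * err) # step against dL/da, averaged over the data
    b += lr * np.mean(2 * err)     # step against dL/db
```

The loop recovers a ≈ 3 and b ≈ 1, the line the data was generated from.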
OK, that’s cool and all but that’s not how we’re going to generate our
adversarial examples…
Well, in fact it is exactly how we are going to do it. Suppose now that
the model is fixed (you can’t change a and b) and you want to increase the
value of the loss. The only thing left to modify are the data points (x, y). As
modifying the ys does not really make sense, we will modify the xs.
We could just replace the x by random values and the loss value would
increase by a tremendous amount but that’s not really subtle, in particular, it
would be really obvious to a human plotting the data points. To make our
changes in a way that is not obviously detected by an observer, we will
compute the derivative of the loss function according to x.
And now, just as before, we can evaluate this derivative on our data points,
get the slope of the tangent and update the x values by a small amount
accordingly. The loss will increase and, as we are modifying all the points by
a small amount, our perturbation will be hard to detect.
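Continuing the toy line model (the specific numbers are invented for illustration): with a and b frozen, we nudge each x a small amount along the gradient of the loss with respect to x, and the loss goes up.

```python
import numpy as np

a, b = 3.0, 1.0                          # frozen model parameters
x = np.array([0., 1., 2.])
y = np.array([1.1, 3.9, 7.2])            # slightly noisy targets

def loss(x):
    return np.sum((y - (a * x + b)) ** 2)

grad_x = -2 * a * (y - (a * x + b))      # dL/dx for each data point
x_adv = x + 0.01 * grad_x                # small step *up* the loss surface

before, after = loss(x), loss(x_adv)
```

The perturbation moves each point by about 0.01 or less, far too little to notice on a plot, yet the loss strictly increases.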
Well that was a very simple model that we’ve just messed with, deep learning
is much more complicated than that…
Guess what? It’s not. Everything we just did has a direct equivalent in the
world of deep learning. When we are training a neural network to classify
images, the loss function is usually a categorical cross entropy, the model
parameters are the weights of the network and the inputs are the pixel values
of the image.
Let x be the original image, y the class of x, θ the weights of the network
and L(θ, x, y) the loss function used to train the network.
First, we compute the gradient of the loss function with respect to the input
pixels. The ∇ operator is just a concise mathematical way of taking the
derivatives of a function with respect to many of its parameters. You can think
of ∇ₓL(θ, x, y) as a matrix of shape [width, height, channels] containing the
slopes of the tangents.
As before, we are only interested in the sign of the slopes, to know whether we
want to increase or decrease the pixel values. We multiply these signs by a very
small value ε to ensure that we do not go too far on the loss function surface
and that the perturbation stays imperceptible. This gives our perturbation:
η = ε · sign(∇ₓL(θ, x, y)). Our final image is just the original image to which
we add the perturbation η.
FGSM applied to an image. The original is classified as ‘king penguin’ with 100%
confidence and the perturbed one is classified as ‘tripod’ with 71% confidence.
The family of attacks where you are able to compute gradients using the
target model is called white-box attacks.
Now you could tell me that the attack I’ve just presented is not really realistic
as you’re unlikely to get access to the gradients of the loss function on a self-
driving car. Researchers thought exactly the same thing and, in this paper,
they found a way to deal with it.
In a more realistic context, you would want to attack a system having only
access to its outputs. The problem with this is that you would not be able to
apply the FGSM algorithm anymore as you would not have access to the
network itself.
The solution proposed is to train a new neural network M’ to solve the same
classification task as the target model M. Then, when M’ is trained, use it to
generate adversarial samples with the gradient-based attack described above.
What they found is that M will very often misclassify adversarial samples
generated using M’. Moreover, if we do not have access to a proper training
set for M’, we can build one using M’s predictions as truth values. The authors
call these synthetic inputs. This is an excerpt of their article in which they
describe their attack on the MetaMind network to which they did not have
access:
“After labeling 6,400 synthetic inputs to train our substitute (an order of
magnitude smaller than the training set used by MetaMind) we find that their
DNN misclassifies adversarial examples crafted with our substitute at a rate
of 84.24%”.
This kind of attack is called a black-box attack, as you treat the target model as a
black box.
So, even when the attacker does not have access to the internals of the model,
they can still produce adversarial samples that will fool it. But this attack
context is not realistic either: in a real scenario, the attacker would not be
allowed to provide their own image files; the neural network would take camera
pictures as input. That’s the problem the authors of this article are trying to
solve.
What they noticed is that when you print adversarial samples which have been
generated with a high-enough ε and then take a picture of the print and
classify it, the neural network is still fooled a significant portion of the time.
The authors recorded a video to showcase their results.
UNIT 5
Optimization for Training Deep Models
Challenges in Neural Network Optimization
Optimization is one of the broadest areas of research in the deep learning space. In
previous articles, I explained the differences between optimization and
regularization as two of the fundamental techniques used to improve deep learning
models. There are several types of optimization in deep learning algorithms but the
most interesting ones are focused on reducing the value of cost functions.
When we say that optimization is one of the key areas of deep learning, we are not
exaggerating. In real-world deep learning implementations, data scientists often
spend more time refining and optimizing models than building new ones. What
makes deep learning optimization such a difficult endeavor? To answer that, we
need to understand some of the principles behind this type of optimization.
The core of deep learning optimization relies on trying to minimize the cost
function of a model without affecting its training performance. That type of
optimization problem contrasts with the general optimization problem in which the
objective is to simply minimize a specific indicator without being constrained by
the performance of other elements (e.g., training).
Deep learning optimization techniques are typically classified into different groups of
algorithms depending on the way they interact with the training dataset. For
instance, algorithms that use the entire training set at once are called deterministic.
Other techniques that use one training example at a time have come to be known as
online algorithms. Similarly, algorithms that use more than one but fewer than all of
the training examples during the optimization process are known as minibatch
stochastic, or simply stochastic. The most famous method of stochastic optimization,
which is also the most common algorithm in deep learning solutions, is known as
stochastic gradient descent (SGD) (read my previous article about SGD).
There are plenty of challenges in deep learning optimization but most of them are
related to the nature of the gradient of the model. Below, I’ve listed some of the
most common challenges in deep learning optimization that you are likely to run
into:
b) Flat Regions: In deep learning optimization models, flat regions are common
areas that represent both a local minimum for a sub-region and a local maximum for
another. That duality often causes the gradient to get stuck.
c) Inexact Gradients: There are many deep learning models in which the cost
function is intractable, which forces an inexact estimation of the gradient. In these
cases, the inexact gradients introduce a second layer of uncertainty into the model.
d) Local vs. Global Structures: Another very common challenge in the optimization
of deep learning models is that local regions of the cost function don’t correspond
with its global structure, producing a misleading gradient.
Optimization Strategies
Optimizer algorithms are optimization methods that help improve a deep learning
model’s performance. These optimization algorithms, or optimizers, widely affect the
accuracy and training speed of the deep learning model. But first of all, the question
arises: what is an optimizer?
While training a deep learning model, an optimizer modifies the weights in each epoch
and minimizes the loss function. An optimizer is a function or an algorithm that
adjusts the attributes of the neural network, such as weights and learning rates. Thus,
it helps in reducing the overall loss and improving accuracy. The problem of
choosing the right weights for the model is a daunting task, as a deep learning model
generally consists of millions of parameters. It raises the need to choose a suitable
optimization algorithm for your application. Hence understanding these machine
learning algorithms is necessary for data scientists before having a deep dive into
the field.
You can use different optimizers in the machine learning model to change your weights
and learning rate. However, choosing the best optimizer depends upon the application.
As a beginner, one tempting thought is to try all the possibilities
and choose the one that shows the best results. This might be fine initially, but when
dealing with hundreds of gigabytes of data, even a single epoch can take considerable
time. So randomly choosing an algorithm is no less than gambling with your precious
time, as you will realize sooner or later in your journey.
This guide will cover various deep-learning optimizers, such as Gradient Descent,
Stochastic Gradient Descent, Stochastic Gradient descent with momentum, Mini-Batch
Gradient Descent, Adagrad, RMSProp, AdaDelta, and Adam. By the end of the article,
you can compare various optimizers and the procedure they are based upon.
Before proceeding, there are a few terms that you should be familiar with.
Epoch – The number of times the algorithm runs on the whole training dataset.
Batch – It denotes the number of samples taken for updating the model
parameters.
Learning rate – It is a parameter that tells the model how much the model
weights should be updated at each step.
Cost Function/Loss Function – A cost function is used to calculate the cost, which is
the difference between the predicted value and the actual value.
Weights/Bias – The learnable parameters in a model that control the signal between
two neurons.
Gradient Descent can be considered the popular kid among the class of optimizers. This
optimization algorithm uses calculus to modify the values consistently and to achieve
the local minimum. Before moving ahead, you might have the question of what a
gradient is.
In simple terms, imagine you are holding a ball resting at the top of a bowl. When you
release the ball, it rolls along the steepest direction and eventually settles at the bottom of
the bowl. The gradient points the ball in the steepest direction to reach the local
minimum, which is the bottom of the bowl.
The gradient descent update rule is w := w − α · ∇L(w). Here alpha (α) is the step size
that represents how far to move against the gradient in each iteration.
1. It starts with some initial coefficients, evaluates their cost, and searches for a cost
value lower than the current one.
2. It moves towards the lower cost and updates the values of the coefficients.
3. The process repeats until the local minimum is reached. A local minimum is a
point beyond which it cannot proceed.
Gradient descent works best for most purposes. However, it has some downsides too.
It is expensive to calculate the gradients if the size of the data is huge. Gradient descent
works well for convex functions, but it doesn’t know how far to travel along the gradient
for nonconvex functions.
At the end of the previous section, you learned why using gradient descent on
massive data might not be the best option. To tackle the problem, we have stochastic
gradient descent. The term stochastic refers to the randomness the algorithm is
based upon. In stochastic gradient descent, instead of taking the whole dataset for
each iteration, we randomly select batches of data; that means we only take a
few samples from the dataset at a time.
The procedure is to first select the initial parameters w and learning rate η, then
randomly shuffle the data at each iteration to reach an approximate minimum.
Since we are not using the whole dataset but batches of it for each iteration, the
path taken by the algorithm is noisier than that of the gradient descent
algorithm. Thus, SGD uses a higher number of iterations to reach the local minimum,
and due to this increase in the number of iterations, the overall computation time
increases. But even after increasing the number of iterations, the computation cost
is still less than that of the gradient descent optimizer. So the conclusion is: if the data
is enormous and computation time is an essential factor, stochastic gradient
descent should be preferred over the batch gradient descent algorithm.
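A quick sketch of the idea on a toy line-fitting problem (the data and hyperparameters are my own choices for illustration): each update uses a small random batch instead of the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3 * x + 1                              # targets from the line y = 3x + 1

a, b, lr, batch = 0.0, 0.0, 0.1, 16
for _ in range(3000):
    idx = rng.integers(0, len(x), batch)   # pick a random mini-batch
    err = (a * x[idx] + b) - y[idx]        # residuals on the batch only
    a -= lr * np.mean(2 * err * x[idx])    # noisy estimate of the full gradient
    b -= lr * np.mean(2 * err)
```

Each step is cheap (16 points instead of 200), and despite the noisy path the parameters still converge to a ≈ 3, b ≈ 1.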
As discussed in the earlier section, you have learned that stochastic gradient descent
takes a much more noisy path than the gradient descent algorithm. Due to this reason,
it requires a more significant number of iterations to reach the optimal minimum, and
hence computation time is very slow. To overcome the problem, we use stochastic
gradient descent with a momentum algorithm.
Momentum helps the loss function converge faster. Stochastic
gradient descent oscillates between either direction of the gradient and updates the
weights accordingly; adding a fraction of the previous update to the current
update makes the process faster. One thing to remember while
using this algorithm is that the learning rate should be decreased when a high momentum
term is used.
In the above image, the left part shows the convergence graph of the stochastic gradient
descent algorithm. At the same time, the right side shows SGD with momentum. From
the image, you can compare the path chosen by both algorithms and realize that using
momentum helps reach convergence in less time. You might be thinking of using a
large momentum and learning rate to make the process even faster. But remember that
while increasing the momentum, the possibility of passing the optimal minimum also
increases. This might result in poor accuracy and even more oscillations.
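The update can be sketched in a few lines (toy objective f(w) = (w − 4)² and constants chosen by me for illustration): the velocity v carries a fraction beta of the previous update into the current one.

```python
def sgd_momentum(lr=0.05, beta=0.9, steps=200):
    w, v = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (w - 4)        # gradient of f(w) = (w - 4)**2
        v = beta * v + g       # fraction of the previous update + new gradient
        w -= lr * v            # step along the accumulated velocity
    return w
```

Note that the learning rate here is deliberately modest (0.05) to pair with the high momentum term (0.9), as discussed above; the iterate spirals into the minimum at w = 4.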
In this variant of gradient descent, instead of taking all the training data, only a subset
of the dataset is used for calculating the loss function. Since we are using a batch of
data instead of taking the whole dataset, fewer iterations are needed. That is why the
mini-batch gradient descent algorithm is faster than both stochastic gradient descent
and batch gradient descent algorithms. This algorithm is more efficient and robust than
the earlier variants of gradient descent. As the algorithm uses batching, all the training
data need not be loaded in the memory, thus making the process more efficient to
implement. Moreover, the cost function in mini-batch gradient descent is noisier than
the batch gradient descent algorithm but smoother than that of the stochastic gradient
descent algorithm. Because of this, mini-batch gradient descent is ideal and provides a
good balance between speed and accuracy.
Despite all that, the mini-batch gradient descent algorithm has some downsides too. It
needs a hyperparameter, the mini-batch size, which must be tuned to achieve
the required accuracy (although a batch size of 32 is considered appropriate for
almost every case). Also, in some cases, it results in poor final accuracy. This gives
rise to the need to look for other alternatives.
The adaptive gradient descent algorithm (Adagrad) is slightly different from other gradient
descent algorithms, because it uses a different learning rate for each iteration. The change
in learning rate depends on how much the parameters change during training: the more
a parameter gets updated, the smaller its learning rate becomes. This
modification is highly beneficial because real-world datasets contain sparse as well as
dense features, so it is unfair to have the same learning rate for all the features.
The Adagrad algorithm uses the formula α(t) = η / √(Gₜ + ε) for the learning rate used to
update the weights. Here alpha(t) denotes the learning rate at iteration t, Gₜ is the
accumulated sum of squared gradients, η is a constant, and ε is a small
positive value to avoid division by 0.
The benefit of using Adagrad is that it abolishes the need to modify the learning rate
manually. It is more reliable than gradient descent algorithms and their variants, and it
reaches convergence at a higher speed.
One downside of the AdaGrad optimizer is that it decreases the learning rate
aggressively and monotonically. There might be a point when the learning rate becomes
extremely small. This is because the squared gradients in the denominator keep
accumulating, and thus the denominator part keeps on increasing. Due to small learning
rates, the model eventually becomes unable to acquire more knowledge, and hence the
accuracy of the model is compromised.
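The accumulation just described can be sketched as follows (toy objective f(w) = (w − 4)² and constants are my own choices); note how G only ever grows, shrinking the effective step.

```python
import numpy as np

def adagrad(lr=1.0, eps=1e-8, steps=500):
    w, G = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (w - 4)                    # gradient of f(w) = (w - 4)**2
        G += g * g                         # squared gradients accumulate monotonically
        w -= lr * g / (np.sqrt(G) + eps)   # per-parameter, ever-shrinking step
    return w
```

On this one-dimensional problem the shrinking step is harmless and the iterate reaches w ≈ 4; in long training runs the same accumulation is what eventually stalls learning.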
RMS prop is one of the most popular optimizers among deep learning enthusiasts.
Interestingly, it has never been formally published, yet it is very well known in the community.
RMS prop is essentially an extension of the earlier Rprop algorithm. It resolves the problem of
varying gradients: some gradients are small while others may be huge, so defining a
single learning rate might not be the best idea.
Rprop uses the gradient sign, adapting the step size individually for each weight. In
this algorithm, two successive gradients are first compared by sign. If they have the same
sign, we are going in the right direction, so the step size is increased by a small fraction. If
they have opposite signs, we must decrease the step size. The step size is then bounded,
and the weight update can be applied.
The problem with Rprop is that it doesn’t work well with large datasets and
mini-batch updates. So, achieving the robustness of Rprop and the efficiency of
mini-batches simultaneously was the main motivation behind the rise of RMS prop.
RMS prop keeps a leaky (exponentially decaying) average of the squared gradients,
E[g²]ₜ = γ · E[g²]ₜ₋₁ + (1 − γ) · gₜ²,
where gamma (γ) is the forgetting factor. Weights are then updated as wₜ₊₁ = wₜ − (η / √(E[g²]ₜ + ε)) · gₜ.
In simpler terms, if there exists a parameter due to which the cost function oscillates a
lot, we want to penalize the update of this parameter. Suppose you built a model to
classify a variety of fishes. The model relies on the factor ‘color’ mainly to differentiate
between the fishes. Due to this, it makes a lot of errors. What RMS Prop does is
penalize the updates of the parameter ‘color’ so that the model can rely on other features
too. This prevents the algorithm from adapting too quickly to changes in the parameter
‘color’ compared to other parameters. This algorithm has several benefits over earlier
versions of gradient descent algorithms: it converges quickly and requires less
tuning than gradient descent algorithms and their variants.
The problem with RMS Prop is that the learning rate has to be defined manually, and
the suggested value doesn’t work for every application.
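The leaky-average update above can be sketched directly (same toy objective f(w) = (w − 4)² and constants of my own choosing):

```python
import numpy as np

def rmsprop(lr=0.01, gamma=0.9, eps=1e-8, steps=2000):
    w, Eg2 = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (w - 4)                          # gradient of f(w) = (w - 4)**2
        Eg2 = gamma * Eg2 + (1 - gamma) * g * g  # leaky average; gamma = forgetting factor
        w -= lr * g / (np.sqrt(Eg2) + eps)       # gradient normalized by its recent scale
    return w
```

Because the average is leaky rather than a running sum, the effective step does not decay to zero the way Adagrad's does; the learning rate itself still has to be chosen by hand.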
AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is based
upon adaptive learning and is designed to deal with significant drawbacks of AdaGrad
and RMS prop optimizer. The main problem with the above two optimizers is that the
initial learning rate must be defined manually. One other problem is the decaying
learning rate which becomes infinitesimally small at some point. Due to this, a certain
number of iterations later, the model can no longer learn new knowledge.
To deal with these problems, AdaDelta uses two state variables: one stores a leaky
average of the second moment of the gradients, and the other a leaky average of the
second moment of the changes in the model’s parameters.
Here St and delta Xt denote the state variables, g’t denotes the rescaled gradient, delta Xt-1
denotes the squared rescaled gradients, and epsilon represents a small positive constant to
handle division by 0.
The Adam optimizer is an extension of the stochastic gradient descent (SGD)
algorithm and is designed to update the weights of a neural network during training.
The name “Adam” is derived from “adaptive moment estimation,” highlighting its
ability to adaptively adjust the learning rate for each network weight individually.
Unlike SGD, which maintains a single learning rate throughout training, Adam
optimizer dynamically computes individual learning rates based on the past gradients
and their second moments.
By incorporating both the first moment (mean) and second moment (uncentered
variance) of the gradients, Adam optimizer achieves an adaptive learning rate that can
efficiently navigate the optimization landscape during training. This adaptivity helps in
faster convergence and improved performance of the neural network.
In practice, Adam also offers a fast running time and low memory requirements, and it
requires less tuning than most other optimization algorithms.
The above formula represents the working of the Adam optimizer. Here β1 and β2 represent
the decay rates of the moving averages of the gradients and their squares.
If the Adam optimizer combines the good properties of all these algorithms and is the best
available optimizer, why shouldn’t you use Adam in every application? And what
was the need to learn about the other algorithms in depth? Because even Adam has
some downsides. It tends to focus on faster computation time, whereas algorithms like
stochastic gradient descent focus on the data points. That’s why algorithms like SGD
generalize the data in a better manner, at the cost of lower computation speed. So the
optimization algorithm should be picked according to the requirements and
the type of data.
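A toy sketch of the Adam update (my own constants, minimizing f(w) = (w − 4)² rather than a real network): both moment estimates start at zero, so the early estimates are bias-corrected before use.

```python
import numpy as np

def adam(lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=800):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 4)                 # gradient of f(w) = (w - 4)**2
        m = b1 * m + (1 - b1) * g       # first moment: leaky mean of gradients
        v = b2 * v + (1 - b2) * g * g   # second moment: leaky uncentered variance
        m_hat = m / (1 - b1 ** t)       # bias corrections for the zero init
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```

The per-parameter normalization by √v̂ makes the early steps roughly lr in size regardless of the gradient's scale, which is one reason Adam needs so little tuning.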
The above visualizations create a better picture in mind and help in comparing the
results of various optimization algorithms.
Meta-Algorithms
Introduction
The performance of a learning model depends on its training dataset, algorithm, and
parameters, and many experiments are needed to find the best-performing algorithm and
algorithm parameters. Meta-learning approaches help researchers find these and reduce the
number of experiments needed. The result is better predictions in less time.
Meta-learning can be used for various machine learning models (e.g., few-
shot, Reinforcement Learning, natural language processing, etc.). Meta-learning
algorithms make predictions by inputting the outputs and metadata of machine
learning algorithms. Meta-learning algorithms can learn to use the best predictions
from machine learning algorithms to make better predictions. In computer science,
meta-learning studies and approaches started in the 1980s and became popular after
the works of Jürgen Schmidhuber and Yoshua Bengio on the subject.
What is Meta-learning?
Meta-learning, described as “learning to learn”, is a subset of machine learning in the
field of computer science. It is used to improve the results and performance of the
learning algorithm by changing some aspects of the learning algorithm based on the
results of the experiment. Meta-learning helps researchers understand which
algorithms generate the best/better predictions from datasets.
Meta-learning algorithms use learning algorithm metadata as input. They then make
predictions and provide information about the performance of these learning algorithms.
Applications
Large-scale deep learning refers to the application of deep learning
techniques to massive datasets and high-computational resources. It
involves training deep neural networks on large datasets with millions or
even billions of data points using powerful hardware such as multiple
GPUs or specialized hardware like TPUs (Tensor Processing Units).
The need for large-scale deep learning arises from the complexity and
capacity of deep neural networks. Deep learning models, particularly
deep neural networks with multiple hidden layers, have the ability to
learn intricate patterns and representations from data, making them
highly effective in various tasks like image recognition and natural language processing.
Computer Vision
Computer Vision in deep learning is a specialized field of artificial
intelligence (AI) that focuses on teaching computers to interpret and
understand visual information from the world. It involves the use of
deep learning techniques to process, analyze, and extract meaningful
insights from images and videos.
The primary goal of computer vision is to enable machines to perceive
the visual world in a manner similar to how humans do, and to make
decisions or take actions based on that understanding. Deep learning,
specifically deep neural networks, has revolutionized computer vision by enabling
models to learn visual features directly from raw image data.
MID 1
PART A
Fill in the Blanks
1. Artificial Neural Networks (ANNs) are a fundamental component
of _________ and _________.
Answer: Machine learning and artificial intelligence.
2. ANNs are designed to mimic the structure and functioning of the
_________.
Answer: Human brain.
3. The simplest form of an artificial neuron is called the _________.
Answer: Perceptron.
4. Multi-Layer Perceptron (MLP) consists of an input layer, one or
more _________ layers, and an output layer.
Answer: Hidden.
5. Convolutional Neural Networks (CNNs) are specialized for tasks
related to _________.
Answer: Image processing and computer vision.
6. Recurrent Neural Networks (RNNs) are suitable for processing
_________ data.
Answer: Sequential.
7. The basic computational unit in an ANN is called a _________.
Answer: Neuron or Node.
8. Weights indicate the _________ of connections between neurons.
Answer: Strength.
9. The function applied to the weighted sum of inputs to introduce
non-linearity is known as the _________ function.
Answer: Activation.