Notes On Introduction To Deep Learning
Recently, deep learning has gained mass popularity in scientific computing. When we fire up Alexa or Siri, we sometimes wonder how the machine is able to make decisions on its own and choose correctly.
Deep learning and AI enable machines to perform these functions, making our lives easier and simpler. Deep learning also tends to elevate the customer experience, giving products an unmatched feeling of premium quality.
Deep learning is a subset of AI that imitates the functioning of the human brain, borrowing its way of processing data in order to detect objects, recognize speech, translate languages, and make decisions. It is a type of machine learning that works based on the structure and functioning of the human brain.
Deep learning has evolved hand-in-hand with the digital era, which has brought
about a revolution in terms of data extraction in all forms and from every region of
the world.
This data, known as big data, is drawn from sources such as social media, internet search engines, e-commerce platforms, and online cinemas, among others.
Slide: 8
Neural networks can adapt to changing input, so the network generates the best possible result without needing to redesign the output criteria. The concept of neural networks, which has its roots in artificial intelligence, is swiftly gaining popularity in the development of trading systems.
Neural networks are broadly used, with applications for financial operations,
enterprise planning, trading, business analytics, and product maintenance. Neural
networks have also gained widespread adoption in business applications such as
forecasting and marketing research solutions, fraud detection, and risk assessment.
A neural network evaluates price data and unearths opportunities for making trade
decisions based on the data analysis. The networks can distinguish subtle nonlinear
interdependencies and patterns other methods of technical analysis cannot.
Slide: 10
Feedforward Propagation –
is the movement of information from the input layer (left) to the output layer (right) of the neural network. The flow of information occurs in the forward direction: the input is used to calculate some intermediate function in the hidden layer, which is then used to calculate the output.
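As an illustration, here is a minimal NumPy sketch of forward propagation through one hidden layer; the layer sizes, weights, and the choice of sigmoid activation are assumptions made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs -> 4 hidden units -> 1 output (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden -> output weights and biases

x = np.array([0.5, -1.2, 3.0])                  # one input example

h = sigmoid(W1 @ x + b1)                        # hidden layer: intermediate function of the input
y = sigmoid(W2 @ h + b2)                        # output layer: computed from the hidden activations
print(y)
```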
What is Backpropagation?
Slide: 11
How Backpropagation Algorithm Works
5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
Slide: 12
Explanation :-
We know that a neural network has neurons that work together through weights, biases and their respective activation functions. In a neural network, we update the weights and biases of the neurons on the basis of the error at the output. This process is known as back-propagation. Activation functions make back-propagation possible, since the gradients are supplied along with the error to update the weights and biases.
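As a hedged sketch of that idea, consider a single sigmoid neuron trained with squared error: the error at the output is propagated back through the activation function's derivative and used to nudge the weights and bias. The data, target, and learning rate below are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input
t = 1.0                          # target output
w = np.zeros(3)                  # weights
b = 0.0                          # bias
lr = 0.1                         # learning rate (arbitrary)

for _ in range(100):
    y = sigmoid(w @ x + b)            # forward pass
    error = y - t                     # error at the output
    grad = error * y * (1.0 - y)      # backprop through the sigmoid (its derivative is y*(1-y))
    w -= lr * grad * x                # adjust weights so the error decreases
    b -= lr * grad
```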
3). Tanh Function :- An activation that almost always works better than the sigmoid function is the Tanh function, also known as the Hyperbolic Tangent function. It is actually a mathematically shifted version of the sigmoid function. Both are similar and can be derived from each other.
Equation :-
f(x) = tanh(x) = 2/(1 + e^(-2x)) - 1
OR
tanh(x) = 2 * sigmoid(2x) - 1
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in the hidden layers of a neural network, as its values lie between -1 and +1; the mean of the hidden-layer activations therefore comes out as 0 or very close to it, which helps in centering the data by bringing the mean close to 0. This makes learning for the next layer much easier.
4). RELU :- Stands for Rectified linear unit. It is the most widely used
activation function. Chiefly implemented in hidden layers of Neural network.
Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0
otherwise.
Value Range :- [0, inf)
Nature :- non-linear, which means we can easily backpropagate the
errors and have multiple layers of neurons being activated by the ReLU
function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
In simple words, ReLU learns much faster than the sigmoid and Tanh functions.
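The three activation functions above can be written in a few lines; this is a simple NumPy sketch of the formulas given in these notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = 2 * sigmoid(2x) - 1, value range (-1, +1)
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    # A(x) = max(0, x), value range [0, inf)
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```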
Slide: 15
Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. Attempting to make the model conform too closely to slightly inaccurate data can infect the model with substantial errors and reduce its predictive power.
Overfitting Example
For example, a university that is seeing a college dropout rate that is higher than what
it would like decides it wants to create a model to predict the likelihood that an
applicant will make it all the way through to graduation.
To do this, the university trains a model from a dataset of 5,000 applicants and their
outcomes. It then runs the model on the original dataset—the group of 5,000
applicants—and the model predicts the outcome with 98% accuracy. But to test its
accuracy, they also run the model on a second dataset—5,000 more applicants.
However, this time, the model is only 50% accurate, as the model was too closely fit
to a narrow data subset, in this case, the first 5,000 applications.
L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the
general cost function by adding another term known as the regularization
term.
Due to the addition of this regularization term, the values of weight matrices
decrease because it assumes that a neural network with smaller weight
matrices leads to simpler models. Therefore, it will also reduce overfitting to
quite an extent.
In L2, we have:
Cost function = Loss + (λ / 2m) * Σ ||w||²
Here, lambda (λ) is the regularization parameter, whose value is tuned for better results.
In L1, we have:
Cost function = Loss + (λ / 2m) * Σ ||w||
In this, we penalize the absolute value of the weights. Unlike L2, the weights
may be reduced to zero here. Hence, it is very useful when we are trying
to compress our model. Otherwise, we usually prefer L2 over it.
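As a rough sketch of how the regularization term changes the cost, here is a small function assuming a mean-squared-error base loss and a single weight vector; the lambda value and the function name are assumptions for the example.

```python
import numpy as np

def regularized_cost(y_true, y_pred, w, lam=0.01, kind="l2"):
    """Base loss plus an L1 or L2 penalty on the weights."""
    m = len(y_true)
    loss = np.mean((y_true - y_pred) ** 2)             # base cost (MSE here)
    if kind == "l2":
        penalty = (lam / (2 * m)) * np.sum(w ** 2)     # L2: sum of squared weights
    else:
        penalty = (lam / (2 * m)) * np.sum(np.abs(w))  # L1: sum of absolute weights
    return loss + penalty
```

Because the penalty grows with the size of the weights, minimizing this cost pushes the weight values down, which is exactly the effect described above.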
Dropout
So what does dropout do? At every iteration, it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections.
So each iteration has a different set of nodes and this results in a different
set of outputs. It can also be thought of as an ensemble technique in
machine learning.
Ensemble models usually perform better than a single model as they capture
more randomness. Similarly, dropout also performs better than a normal
neural network model.
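A minimal sketch of (inverted) dropout applied to one layer's activations during training; the keep probability and the toy activations are assumptions for the example.

```python
import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    """Randomly zero out nodes; scale the rest so the expected output is unchanged."""
    if not training:
        return activations                              # at test time all nodes are kept
    rng = np.random.default_rng()
    mask = rng.random(activations.shape) < keep_prob    # pick which nodes survive this iteration
    return activations * mask / keep_prob

h = np.array([0.2, 1.5, -0.3, 0.9, 2.1])
print(dropout(h))   # a different subset of nodes is dropped on every call
```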
Early stopping
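Early stopping halts training once the error on held-out validation data stops improving, so the model does not keep fitting noise in the training set. Below is a minimal sketch of that loop; train_one_epoch, validation_loss, and model.get_weights are hypothetical helpers, not a real API.

```python
def early_stopping_train(model, patience=5, max_epochs=200):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_weights, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)               # hypothetical: one pass over the training data
        loss = validation_loss(model)        # hypothetical: loss on held-out validation data
        if loss < best_loss:
            best_loss, best_weights = loss, model.get_weights()   # hypothetical accessor
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation loss stopped improving: stop early
    return best_weights
```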
Slide: 17
Deep learning models are full of hyper-parameters and finding the best
configuration for these parameters in such a high dimensional space is not a
trivial challenge.
The cost or loss function has an important job in that it must faithfully distill
all aspects of the model down into a single number in such a way that
improvements in that number are a sign of a better model.
Yann LeCun developed the first CNN, called LeNet, in the late 1980s. It was primarily used for recognizing characters like ZIP codes and digits.
CNNs, also known as ConvNets, consist of multiple layers and are mostly used for image processing and object detection.
A CNN has a convolution layer with several filters that perform the convolution operation.
CNNs also have a Rectified Linear Unit (ReLU) layer that performs element-wise operations and produces a rectified feature map as output.
The rectified feature map is next fed into a pooling layer, which down-samples it; a flattening step then converts the resulting two-dimensional arrays from the pooled feature map into a single, continuous, linear vector.
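To make the layer sequence concrete, here is a rough NumPy sketch of one convolution filter followed by ReLU, 2x2 max pooling, and flattening; the toy image and filter values are invented for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution of a 2-D image with a single filter."""
    kh, kw = kernel.shape
    h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Down-sample the feature map by taking the max of each size x size block."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1., 0.], [0., -1.]])           # toy 2x2 filter

feature_map = conv2d(image, kernel)                # convolution layer
rectified = np.maximum(0.0, feature_map)           # ReLU layer -> rectified feature map
pooled = max_pool(rectified)                       # pooling layer down-samples
flat = pooled.flatten()                            # flattening -> single linear vector
```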
2. Long Short Term Memory Networks
LSTMs are a subset of Recurrent Neural Networks (RNN) that are specialised
in learning and memorizing long-term information. By default, LSTMs are
supposed to recall past information for long periods of time.
LSTMs have a chain-like structure where four unique layers are stacked.
LSTMs are typically used for time-series predictions, speech synthesis,
language modeling and translation, music composition, and pharmaceutical
development.
They are programmed to forget irrelevant parts of the data and selectively
update the cell-state values.
Because of this recurrent behaviour, the output from the previous step is fed back as an input to the current step, allowing the network to memorize previous inputs through its efficient internal memory.
RNNs are mostly used for image captioning, time-series analysis, natural-language processing, handwriting recognition, and machine translation.
RNNs can process inputs of varied lengths. The longer the computation runs, the more information can be gathered, and the model size does not increase with the input size.
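As a hedged sketch of that recurrence, here is a plain (Elman-style) RNN cell whose previous hidden state is fed back in at every step; the sizes, weights, and toy sequence are arbitrary assumptions, and the same weights handle a sequence of any length.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 5
W_x = rng.normal(size=(hidden_size, input_size))   # input -> hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden (the recurrent feedback)
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the previous state h_prev acts as the network's memory."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = rng.normal(size=(4, input_size))        # a variable-length input sequence
h = np.zeros(hidden_size)
for x_t in sequence:                               # model size is fixed regardless of length
    h = rnn_step(x_t, h)
```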
GANs are generative deep learning algorithms that produce new data instances resembling the training data provided.
GANs have two main components: a generator, which learns to generate fake data, and a discriminator, which learns to tell that fake data apart from real data.
GANs are used to generate realistic images and cartoon characters, create photographs of human faces, and render 3D objects.
Video game developers use GANs to upgrade low-resolution, 2D textures in vintage video games by recreating them in 4K or higher resolutions via image training.
They are also used to improve astronomical images and simulate gravitational
lensing for dark-matter research.
During the training period, the Discriminator learns to distinguish between real and fake data and penalizes the Generator whenever it produces implausible fake data.
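At a high level, this adversarial training alternates between updating the two components. The sketch below shows only that structure; generator, discriminator, and their sample/update/score helpers are hypothetical placeholders rather than a real library API.

```python
def train_gan(generator, discriminator, real_batches, steps=1000):
    """Alternate discriminator and generator updates (structure only, hypothetical objects)."""
    for step in range(steps):
        real = next(real_batches)               # a batch of real training data
        fake = generator.sample(len(real))      # hypothetical: generator produces fake data
        # 1) The discriminator learns to label real data as real and the fakes as fake.
        discriminator.update(real, label=1)     # hypothetical update helpers
        discriminator.update(fake, label=0)
        # 2) The generator is updated so the discriminator scores its fakes as more real.
        generator.update(discriminator.score(generator.sample(len(real))))
```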
RBFNs are an example of artificial neural networks, mainly used for function
approximation problems.
Radial basis function networks are considered to have an edge over other neural networks because of their universal approximation capability and faster learning speed.
An RBF network is a special type of feed forward neural network. It consists
of three different layers, namely the input layer, the hidden layer and the output
layer.
For example, an RBF network with 10 nodes in its hidden layer might be chosen. The training of the RBF model is terminated once the calculated error reaches the desired value (e.g. 0.01) or the maximum number of training iterations (e.g. 500) has been completed.
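A minimal sketch of the RBF forward pass: each of the 10 hidden nodes applies a Gaussian radial basis function to the distance between the input and its centre, and a linear output layer combines the activations. The centres, width, and output weights here are untrained placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n_hidden, input_dim = 10, 2                        # 10 nodes in the hidden layer, as in the example
centres = rng.uniform(size=(n_hidden, input_dim))  # placeholder RBF centres
sigma = 0.5                                        # placeholder width of each basis function
w_out = rng.normal(size=n_hidden)                  # placeholder output-layer weights

def rbf_forward(x):
    """Input layer -> Gaussian hidden layer -> linear output layer."""
    dists = np.linalg.norm(centres - x, axis=1)        # distance of x to each centre
    hidden = np.exp(-(dists ** 2) / (2 * sigma ** 2))  # Gaussian radial basis activations
    return w_out @ hidden                              # linear combination at the output

print(rbf_forward(np.array([0.3, 0.7])))
```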
6. Multilayer Perceptrons
MLPs consist of an input layer and an output layer that are fully connected
with the hidden layers in between.
Besides the input and output layers, they may have multiple hidden layers, which act as the true computation engine of MLPs.
They are used to build speech-recognition and financial-prediction systems and to carry out data compression.
The data is fed to the input layer of the network, and the signal then passes through the layers of neurons in one direction.
MLPs combine the input with the weights that exist between the input layer and the hidden layers.
Activation functions like ReLUs, sigmoid functions, and Tanhs allow MLPs to
determine which nodes to use.
MLPs assist the model in understanding correlations and learning the dependencies between the independent and the target variables from a particular training data set.
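As a usage example, scikit-learn's MLPClassifier builds this kind of fully connected network; the hidden-layer sizes and the toy data below are arbitrary assumptions for the sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))                # toy independent variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy target variable

# One input layer, two hidden layers of 16 ReLU units each, one output layer.
clf = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))                       # accuracy on the training data
```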
SOMs are created to help users access and understand high-dimensional information.
SOMs do not use activation functions in their neurons; instead they initialize weights for each node and choose a vector at random from the training data.
SOMs then examine every node to find which node's weights are closest to the input vector; the most suitable node is called the Best Matching Unit (BMU).
SOMs next identify the nodes in the BMU's neighbourhood, whose radius tends to shrink over time. The closer a node is to the BMU, the more its weights change, with the winning (BMU) weights moved closest to the sample vector.
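A hedged sketch of one SOM training step on a small grid: find the Best Matching Unit by distance, then pull the BMU and its neighbours toward the sample, with the pull shrinking with grid distance. The grid size, learning rate, and neighbourhood radius are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
grid_h, grid_w, dim = 5, 5, 3
weights = rng.random((grid_h, grid_w, dim))        # initialize weights for each node

def som_step(sample, weights, lr=0.1, radius=1.5):
    # Best Matching Unit: the node whose weights are closest to the sample vector
    dists = np.linalg.norm(weights - sample, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Move the BMU's neighbourhood toward the sample; closer nodes change more
    for i in range(grid_h):
        for j in range(grid_w):
            grid_dist = np.hypot(i - bmu[0], j - bmu[1])
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights[i, j] += lr * influence * (sample - weights[i, j])
    return weights

sample = rng.random(dim)                           # a vector chosen at random from the data
weights = som_step(sample, weights)
```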
DBNs are generative graphical models, a class of deep neural networks that consist of multiple layers of stochastic latent variables. The latent variables have binary values and are often called hidden units; there are connections between the layers but not between the units within a single layer.
DBNs draw samples from the visible units using a single pass of ancestral
sampling throughout the model.
DBNs learn that the values of the latent variables in every layer can be inferred by a single, bottom-up pass.
RBMs accept the input and encode it as numbers in the forward pass. RBMs combine each input with its own weight and one overall bias unit.
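A minimal sketch of that RBM forward pass: each visible input is combined with its weights and a bias, and the binary hidden units are then activated stochastically. The sizes and weights are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))  # weight between each visible/hidden pair
b_hidden = np.zeros(n_hidden)                          # hidden-unit bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = rng.integers(0, 2, size=n_visible).astype(float)   # binary visible units (the input)

# Forward pass: combine the input with the weights and bias, then sample binary hidden units
p_hidden = sigmoid(W @ v + b_hidden)
h = (rng.random(n_hidden) < p_hidden).astype(float)
```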
10. Autoencoders
An autoencoder first encodes its input into a compressed representation, then learns how to reconstruct the data from that encoding into a representation that is as close as possible to the original input.
Autoencoders are supposed to first encode the image, then reduce the size of
the input into a smaller entity. Finally, the autoencoder decodes the image in
order to generate the reconstructed image.
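A rough sketch of the encode-then-decode idea as a tiny linear autoencoder forward pass in NumPy; the sizes and untrained weights are placeholders, and in practice the weights would be learned by minimizing the reconstruction error shown at the end.

```python
import numpy as np

rng = np.random.default_rng(6)
input_dim, code_dim = 8, 2                      # compress 8 values down to a 2-value code

W_enc = rng.normal(scale=0.3, size=(code_dim, input_dim))   # encoder weights (placeholder)
W_dec = rng.normal(scale=0.3, size=(input_dim, code_dim))   # decoder weights (placeholder)

x = rng.random(input_dim)                       # original input

code = W_enc @ x                                # encode: reduce the input to a smaller entity
reconstruction = W_dec @ code                   # decode: rebuild something close to the input

reconstruction_error = np.mean((x - reconstruction) ** 2)   # what training would minimize
```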
Conclusion
These deep learning algorithms show us why they are preferred over other techniques. They have become the norm in recent years, and they save us time and effort while remaining easy to use.
Deep learning has made computers genuinely smarter and able to work according to our needs.
With ever-growing data, it can be concluded that these algorithms will only become more efficient with time and will come ever closer to replicating the workings of the human brain.