Unit 2 Deep Learning
Feedforward neural networks were among the first and most successful learning
algorithms. They are also called deep networks, multi-layer perceptrons (MLPs),
or simply neural networks. Feedforward neural networks are made up of the
following:
Input layer: This layer consists of the neurons that receive inputs and
pass them on to the other layers. The number of neurons in the input layer
should be equal to the number of attributes or features in the dataset.
Output layer: The output layer produces the predicted feature, and its form
depends on the type of model you're building.
Hidden layer: In between the input and output layers there are hidden
layers, whose number depends on the type of model. Hidden layers contain many
neurons that apply transformations to the inputs before passing them on.
As the network is trained, the weights are updated to become more predictive.
Neuron weights: Weights refer to the strength or amplitude of a
connection between two neurons. If you are familiar with linear
regression, you can think of the weights on inputs as the coefficients on the
input variables. Weights are often initialized to small random values, such as
values in the range 0 to 1.
Activation Function:
1- Sigmoid: squashes its input into the range (0, 1): sigmoid(x) = 1 / (1 + e^(-x)).
2- Tanh: squashes its input into the range (-1, 1): tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
3- ReLU: only positive values are allowed to flow through this function; negative
values get mapped to 0: ReLU(x) = max(0, x).
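A minimal sketch of these three activation functions in Python (using NumPy; the
function names are just for illustration):

import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real input into the range (-1, 1).
    return np.tanh(x)

def relu(x):
    # Lets positive values pass through unchanged; maps negatives to 0.
    return np.maximum(0.0, x)

print(sigmoid(0.0))  # 0.5
print(tanh(0.0))     # 0.0
print(relu(-3.0))    # 0.0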
As an example, consider a perceptron that predicts whether it will rain from
three inputs:
x1 - day/night
x2 - temperature
x3 - month
Let us assume the threshold value to be 20: if the output is higher than 20,
it will be raining; otherwise it is a sunny day.
We are given a data tuple with inputs (x1, x2, x3) = (0, 12, 11), initial weights
of the feedforward network (w1, w2, w3) = (0.1, 1, 1), and biases (b1, b2, b3) =
(1, 0, 0).
Here is how the neural network computes the data in four simple steps:
1. Multiplication of weights and inputs: Each input is first multiplied by its
corresponding weight.
x1 * w1 = 0 * 0.1 = 0
x2 * w2 = 12 * 1 = 12
x3 * w3 = 11 * 1 = 11
2. Adding the biases: In the next step, each product found in the previous step
is added to its respective bias.
(x1 * w1) + b1 = 0 + 1 = 1
(x2 * w2) + b2 = 12 + 0 = 12
(x3 * w3) + b3 = 11 + 0 = 11
3. Summation: The modified inputs are then summed up to a single value:
1 + 12 + 11 = 24.
4. Output signal: Finally, the weighted sum obtained is turned into an output
signal by feeding it into an activation function (also called a transfer
function). Since the weighted sum in our example (24) is greater than the
threshold of 20, the perceptron predicts it to be a rainy day.
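A minimal sketch of this computation in Python (the inputs, weights, biases, and
the threshold of 20 are taken from the example above):

import numpy as np

x = np.array([0.0, 12.0, 11.0])  # inputs: day/night, temperature, month
w = np.array([0.1, 1.0, 1.0])    # initial weights
b = np.array([1.0, 0.0, 0.0])    # biases

weighted_sum = np.sum(x * w + b)  # (0 + 1) + (12 + 0) + (11 + 0) = 24

threshold = 20
prediction = "rainy" if weighted_sum > threshold else "sunny"
print(weighted_sum, prediction)   # 24.0 rainy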
In simple terms, a loss function quantifies how “good” or “bad” a given model
is in classifying the input data. In most learning networks, the loss is calculated
as the difference between the actual output and the predicted output.
Mathematically: error = y_actual - y_predicted
The function that is used to compute this error is known as the loss function J(.).
Different loss functions will return different errors for the same prediction,
having a considerable effect on the performance of the model.
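For example, the mean squared error is one common choice of J (shown here only as
an illustrative sketch in Python):

import numpy as np

def mse_loss(y_actual, y_predicted):
    # Average of the squared differences between actual and predicted outputs.
    y_actual, y_predicted = np.asarray(y_actual), np.asarray(y_predicted)
    return np.mean((y_actual - y_predicted) ** 2)

print(mse_loss([1.0, 0.0, 1.0], [0.9, 0.2, 0.8]))  # 0.03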
Backpropagation
Feedforward neural networks are artificial neural networks in which nodes do
not form loops. This type of neural network is also known as a multi-layer
neural network, because all information is passed only forward.
During data flow, input nodes receive data, which travels through the hidden
layers and exits at the output nodes. There are no links in the network that
could send information back from the output nodes.
In its simplest form, a feedforward neural network appears as a single-layer
perceptron.
This model multiplies inputs by weights as they enter the layer.
Afterward, the weighted input values are added together to get the sum.
If the sum rises above a certain threshold, typically set at zero, the
output value is usually 1, while if it falls below the threshold, it is
usually -1.
As a feedforward neural network model, the single-layer perceptron is often
used for classification, as sketched just below. Machine learning can also be
integrated into single-layer perceptrons.
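A minimal sketch of such a single-layer perceptron in Python (the inputs, weights,
and bias here are arbitrary illustrative values):

import numpy as np

def perceptron_predict(x, w, b, threshold=0.0):
    # Multiply inputs by weights, add the bias, and sum.
    total = np.dot(x, w) + b
    # Output +1 if the sum is above the threshold, otherwise -1.
    return 1 if total > threshold else -1

x = np.array([0.5, -1.2, 3.0])  # example inputs
w = np.array([0.4, 0.6, 0.2])   # example weights
print(perceptron_predict(x, w, b=0.1))  # 1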
Through training and learning, the perceptron's weights are updated by
gradient descent. Multi-layer perceptrons update their weights in a similar
way, but this process is known as back-propagation.
In that case, the network's hidden layers are adjusted according to the
output values (and hence the error) produced by the final layer.
In the third step, a vector of ones gets multiplied by the output of our hidden
layer.
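A minimal back-propagation sketch for a network with one hidden layer, written in
NumPy (the data, layer sizes, sigmoid activations, and learning rate are all
illustrative assumptions; the rows of ones used for the bias gradients are one
place where a 'vector of ones' appears in back-propagation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                 # 4 samples, 3 features (illustrative)
y = np.array([[0.0], [1.0], [1.0], [0.0]])  # illustrative targets

W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros((1, 5))  # hidden layer (5 units)
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros((1, 1))  # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(1000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)       # hidden layer output
    y_hat = sigmoid(h @ W2 + b2)   # network prediction

    # Backward pass: gradients of the squared error, layer by layer.
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)
    grad_W2 = h.T @ delta2                 # hidden output times output-layer error
    grad_b2 = np.ones((1, 4)) @ delta2     # vector of ones sums the error over samples
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ delta1
    grad_b1 = np.ones((1, 4)) @ delta1

    # Gradient descent update.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print(y_hat.round(2))  # predictions after training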
Gradient Descent
Gradient descent is an optimization algorithm that is commonly used to
train machine learning models and neural networks.
Training data helps these models learn over time, and the cost function
within gradient descent specifically acts as a barometer, gauging its
accuracy with each iteration of parameter updates.
Until the function is close to or equal to zero, the model will continue to
adjust its parameters to yield the smallest possible error.
Once machine learning models are optimized for accuracy, they can be
powerful tools for artificial intelligence (AI) and computer science
applications.
Learning rate (also referred to as step size or the alpha) is the size of the
steps that are taken to reach the minimum. This is typically a small value,
and it is evaluated and updated based on the behaviour of the cost
function.
A high learning rate results in larger steps but risks overshooting the
minimum.
Conversely, a low learning rate takes small steps. While this has the
advantage of more precision, the larger number of iterations compromises
overall efficiency, since it takes more time and computation to reach the
minimum.
The cost (or loss) function measures the difference, or error, between
actual y and predicted y at its current position. This improves the machine
learning model's efficacy by providing feedback to the model so that it
can adjust the parameters to minimize the error and find the local or
global minimum.
It continuously iterates, moving along the direction of steepest descent
(or the negative gradient) until the cost function is close to or at zero. At
this point, the model will stop learning.
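A minimal sketch of this loop in Python, for the illustrative one-dimensional cost
function J(w) = (w - 3)^2, whose gradient is 2 * (w - 3):

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size (alpha)

for step in range(200):
    grad = gradient(w)
    if abs(grad) < 1e-6:          # stop when the slope is close to zero
        break
    w = w - learning_rate * grad  # move along the negative gradient

print(w)  # close to 3.0, the minimum of J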
For convex problems, gradient descent can find the global minimum with
ease, but as nonconvex problems emerge, gradient descent can struggle to
find the global minimum, where the model achieves the best results.
Recall that when the slope of the cost function is at or close to zero, the
model stops learning. A few scenarios beyond the global minimum can
also yield this slope, which are local minima and saddle points.
Local minima mimic the shape of a global minimum, where the slope of
the cost function increases on either side of the current point.
However, with saddle points, the negative gradient only exists on one
side of the point, which reaches a local maximum on one side and a local
minimum on the other. Noisy gradients can help gradient descent escape local
minima and saddle points.
To understand how gradient descent finds a minimum, recall the equation of a
straight line from linear regression:
Y = mX + c
where 'm' represents the slope of the line and 'c' represents the intercept on
the y-axis.
The starting point is just an arbitrary point used to evaluate the
performance.
At this starting point, we take the first derivative, or slope, and use a
tangent line to measure the steepness of that slope. This slope then informs
the updates to the parameters (weights and bias).
The slope is steepest at the starting (arbitrary) point; as new parameters are
generated, the steepness gradually reduces until it reaches the lowest point,
which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, or
the error between the expected and actual values. To minimize the cost
function, two things are required: a direction (the gradient) and a learning
rate.
Gradient Descent with Momentum
Gradient descent with momentum typically converges much faster than standard
gradient descent. The basic idea of gradient descent with momentum is to
calculate an exponentially weighted average of the gradients and then use that
average, instead of the raw gradient, to update the weights.
Momentum-based gradient descent is a gradient descent optimization
algorithm variant that adds a momentum term to the update rule.
The momentum term is computed as a moving average of the past
gradients, and the weight of the past gradients is controlled by a
hyperparameter called Beta.
The momentum term helps to accelerate the optimization process by
allowing the updates to build up in the direction of the steepest descent.
This can help to address some of the problems with gradient descent,
such as oscillations, slow convergence, and getting stuck in local minima.
By using momentum-based gradient descent, it is possible to train
machine learning models more efficiently and achieve better
performance.
We can use gradient descent with momentum to address some of these
problems. It works by adding a fraction of the previous weight update to
the current weight update so that the optimization algorithm can build
momentum as it descends the loss function. This can help the algorithm
escape from local minima and saddle points and can also help the
algorithm converge faster by avoiding oscillations.
To apply gradient descent with momentum, you can update the weights as
follows:
v = beta * v + learning_rate * gradient
weights = weights - v
It's important to note that momentum is just one of many techniques we can use
to improve the convergence of gradient descent. Other techniques include
Nesterov momentum, adaptive learning rates, and mini-batch gradient descent.
Let's say we have a set of weights w and a loss function L, and we want to use
gradient descent to find the weights that minimize the loss function. The
standard gradient descent update rule is:
w = w - alpha * gradient
Where alpha is the learning rate, and the gradient is the gradient of the loss
function with respect to the weights.
To incorporate momentum into this update rule, we can add a momentum term v
that is based on the previous weight update:
v = beta * v + alpha * gradient
w = w - v
where beta controls how much of the previous update carries over into the
current one.
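A minimal sketch of this update in Python, reusing the illustrative cost
J(w) = (w - 3)^2 from before (the alpha and beta values are arbitrary):

def gradient(w):
    # Gradient of the illustrative cost J(w) = (w - 3)**2.
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
alpha = 0.1  # learning rate
beta = 0.9   # fraction of the previous update carried over

for step in range(300):
    v = beta * v + alpha * gradient(w)  # accumulate momentum from past gradients
    w = w - v                           # update the weights using the momentum term

print(w)  # converges toward 3.0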
As we can see, in momentum-based gradient descent the steps become larger and
larger due to the accumulated momentum, and then we overshoot at the
4th step. We then must take steps in the opposite direction to reach the
minimum point.
Nesterov Accelerated Gradient (NAG)
NAG reduces this overshooting by 'looking ahead': its update happens in two
steps.
First, a partial step to reach the look-ahead point, and then the final
update. We calculate the gradient at the look-ahead point and then use it
to calculate the final update.
If the gradient at the look-ahead point is negative, our final update will be
smaller than that of a regular momentum-based gradient.
As in the above example, the updates of NAG are similar to those of the
momentum-based gradient for the first three steps, because the gradients at
both the current point and the look-ahead point are positive. But at step 4,
the gradient of the look-ahead point is negative.
In NAG, the first partial update 4a will be used to go to the look-ahead
point and then the gradient will be calculated at that point without
updating the parameters. Since the gradient at step 4b is negative, the
overall update will be smaller than the momentum-based gradient
descent.
We can see in the above example that the momentum-based gradient
descent takes six steps to reach the minimum point, while NAG takes
only five steps.
This looking ahead helps NAG to converge to the minimum points in
fewer steps and reduce the chances of overshooting.
We saw how NAG solves the problem of overshooting by 'looking ahead'. Let us
see how this is calculated and the actual math behind it:
w_lookahead = w - beta * v
v = beta * v + alpha * gradient(w_lookahead)
w = w - v
The momentum term beta * v is how the gradient of all the previous updates is
added to the current update, and the gradient itself is evaluated at the
look-ahead point. This look-ahead gradient will be used in our update and will
prevent overshooting.
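A minimal sketch of NAG in Python, again on the illustrative cost
J(w) = (w - 3)^2 (the alpha and beta values are arbitrary):

def gradient(w):
    # Gradient of the illustrative cost J(w) = (w - 3)**2.
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
alpha = 0.1  # learning rate
beta = 0.9   # momentum coefficient

for step in range(300):
    w_lookahead = w - beta * v                    # partial update to the look-ahead point
    v = beta * v + alpha * gradient(w_lookahead)  # gradient taken at the look-ahead point
    w = w - v                                     # final update

print(w)  # converges toward 3.0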
Adagrad (Adaptive Gradient) Optimizer
Equation:
α(t) = α(t-1) + g(t)²
w(t+1) = w(t) - (η / sqrt(α(t) + ε)) * g(t)
In the above Adagrad optimizer equation, the learning rate is modified in such
a way that it automatically decreases, because the summation of the previous
squared gradients α will always keep increasing after every time step.
Now, let us take a simple example to check how the learning rate is
different for every parameter in a single time step. For this example, we
will consider a single neuron with 2 inputs and 1 output. So, the total
number of parameters will be 3 including bias.
The above computation is done at a single time step, where each of the three
parameters has its learning rate "η" divided by the square root of its own "α",
which is different for each parameter.
So, we can see that the learning rate is different for all three parameters.
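A minimal sketch of the Adagrad update for three parameters in Python (the
parameter values and gradients are arbitrary illustrative numbers):

import numpy as np

eta = 0.1      # base learning rate
eps = 1e-8     # small constant to avoid division by zero
params = np.array([0.5, -0.3, 0.1])  # w1, w2, bias (illustrative)
alpha = np.zeros(3)                  # running sum of squared gradients

for t in range(5):
    grads = np.array([0.2, -1.0, 0.05])  # illustrative gradients per parameter
    alpha += grads ** 2                  # the summation keeps growing every time step
    # Each parameter gets its own effective learning rate eta / sqrt(alpha + eps).
    params -= (eta / np.sqrt(alpha + eps)) * grads

print(params)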
RMSProp (Root Mean Square Propagation) Optimizer
The Adagrad equation above shows that as the time step "t" increases, the
summation of squared gradients "α" increases, which leads to a decrease in the
effective learning rate.
To resolve this unbounded growth of the summation of squared gradients "α",
RMSProp replaces "α" with an exponentially weighted average of squared
gradients:
α(t) = β * α(t-1) + (1 - β) * g(t)²
So here, unlike the "α" in Adagrad, which keeps increasing after every time
step, the exponentially weighted average keeps "α" bounded, so the learning
rate does not shrink toward zero.
The typical “β” value is 0.9 or 0.95.
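A minimal sketch of the RMSProp update in Python, where the exponentially
weighted average replaces Adagrad's running sum (all values are illustrative):

import numpy as np

eta = 0.01     # learning rate
beta = 0.9     # decay rate of the moving average (typically 0.9 or 0.95)
eps = 1e-8
params = np.array([0.5, -0.3, 0.1])  # illustrative parameters
alpha = np.zeros(3)                  # exponentially weighted average of squared gradients

for t in range(5):
    grads = np.array([0.2, -1.0, 0.05])  # illustrative gradients
    alpha = beta * alpha + (1 - beta) * grads ** 2
    params -= (eta / np.sqrt(alpha + eps)) * grads

print(params)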