Chapter 3
Artificial Neural Networks
3.1. Introduction to ANN
• ANNs have been around for quite a while.
• They were first introduced back in 1943 by the neurophysiologist
Warren McCulloch and the mathematician Walter Pitts.
• McCulloch and Pitts presented a simplified computational model
of how biological neurons might work together in human brains to
perform complex computations using propositional logic.
• This was the first artificial neural network architecture.
• Since then many other architectures have been invented.
3.1. Introduction to ANN…
Figure A biological neuron
3.1. Introduction to ANN…
• Biological neurons receive short electrical impulses called signals
from other neurons via these synapses.
• When a neuron receives a sufficient number of signals from other
neurons within a few milliseconds, it fires its own signals.
• Thus, individual biological neurons seem to behave in a rather
simple way, but they are organized in a vast network of billions of
neurons
• And each neuron is typically connected to thousands of other
neurons.
• Highly complex computations can be performed by a vast
network of fairly simple neurons.
3.1. Introduction to ANN…
• A neural network, or more properly an artificial neural network, is
a type of machine learning model.
• Under the surface, an ANN is a collection of mathematical
operations that takes raw input data and, based on the internal
structure and parameters of the network, produces an estimate or
prediction of the output.
• Just as with other ML models, the parameters of the neural
network are learned from training data.
• ANNs have been around for many years, and they have gone
through several periods during which they have fallen in and out
of favor.
3.1. Introduction to ANN…
• ANNs are at the very core of Deep Learning.
• They are versatile, powerful, and scalable, making them ideal to
tackle large and highly complex Machine Learning tasks such as
– classifying billions of images (e.g., Google image search)
– powering speech recognition services (e.g., Apple Siri)
– recommending the best videos to watch to hundreds of millions
of users every day (e.g., YouTube)
– learning to beat the world champion at the game of Go
(DeepMind’s AlphaGo).
3.1. Introduction to ANN…
• A neural network is a mathematical model for information
processing.
• The characteristics of a neural network are as follows:
– Information processing occurs in simple elements called
neurons
– Neurons are connected and they exchange signals between
them through connection links
– Connection links between neurons can be stronger or weaker,
and this determines how information is processed
– Each neuron has an internal state that is determined by all the
incoming connections from other neurons
– Each neuron has an activation function that is calculated on its
state, and determines its output
3.1. Introduction to ANN…
• The building block of a neural network is a single computational
unit called a node (also called a neuron).
• A node (neuron) takes a set of real-valued numbers as input,
performs some computation on them, and produces an output.
Figure A neuron, taking 3 inputs x1, x2, and x3 and a bias b and producing an output y
• The components of the basic neuron are:
• Inputs: inputs are the set of values for which we need to predict an
output.
• The inputs are typically the raw data points of your dataset.
• This could be:
– Numerical values: like temperature, stock prices, pixel values in an
image or video, or amplitude values in speech.
– Categorical values: represented as numerical codes (e.g., 0 for "cat", 1
for "dog", 2 for "parrot").
– Text data: converted into numerical representations (e.g., using word
embeddings).
• They can be viewed as features or attributes in a dataset.
• Note: neural networks can accept only numbers as input.
• In the figure above, they are indicated with x1, x2, and x3.
• Weights: weights are the values that are attached to each input.
• They convey the importance of the corresponding input in predicting
the final output.
• Weight determines the strength of the input to the neuron.
• It determines what impact the input will have on the output.
• Weights are coefficients that scale the input signal (amplify or
minimize) to a given neuron in the network.
• In representation of neural networks, these are the lines/arrows
going from one node to another.
• Often, connections are notated as w in mathematical
representations of ANNs.
• In the above figure, they are represented by w1, w2, and w3.
• Bias: Bias is an additional parameter used to produce an output.
• Bias in ANN is required to shift the activation function across the
plane either towards the left or the right.
• You can compare the bias to the y-intercept in the line equation.
• The bias can be viewed as an extra input whose value is always 1
and which has a weight of its own, just like the other inputs.
• Biases are scalar values added to the input to ensure that at least a
few nodes per layer are activated regardless of signal strength.
• Biases allow learning to happen by giving the network action in
the event of low signal.
• They allow the network to try new interpretations or behaviors.
• Biases are generally notated b, and like weights, they are modified
throughout the learning process.
• Summation function: The work of the summation function is to
bind the weights and inputs together and calculate their sum.
• The bias is also added to the summation.
• Activation function: It introduces non-linearity in the model.
Figure components of a neuron
z = w1x1 + w2x2 + w3x3 + b
3.1. Introduction to ANN…
• The output z is just a real-valued number.
• In this computation, z is a linear function of x.
• This means that
– any changes in x will result in proportional changes in z, and
– the relationship between x and z is straightforward and predictable.
• But, instead of using z directly as the output, neural units apply a
non-linear function f to z.
• We call this function f an activation function.
• The output of the activation function is represented as a.
• For a single neuron, the activation for the node is in fact the final
output of the network, which we will generally represent as ŷ.
• So, the value ŷ is defined as:
ŷ = a = f(z)
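• To make this concrete, here is a minimal sketch in Python (the function name and the example numbers are illustrative, not from the slides) of a single neuron computing ŷ = f(z) with a sigmoid activation:

```python
import math

def neuron(x, w, b):
    # Weighted sum: z = w1*x1 + w2*x2 + w3*x3 + b
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Non-linear activation: a = f(z), here the sigmoid
    return 1.0 / (1.0 + math.exp(-z))

# A neuron with 3 inputs, as in the figure above
y_hat = neuron(x=[0.5, -1.0, 2.0], w=[0.1, 0.2, 0.3], b=0.25)
print(y_hat)  # a single real value in (0, 1)
```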
3.1. Introduction to ANN…
Activation Functions
• Activation functions allow for machine learning models to solve
non-linear problems.
• If we have a neural network working without the activation
function, then every neuron will only be performing a linear
transformation on the inputs using the weights and biases.
• It doesn’t matter how many hidden layers we attach in the neural
network.
• All layers will behave in the same way because the composition
of two linear functions is a linear function itself.
• Hence, the purpose of an activation function is to add
non-linearity to the neural networks.
• Every activation function takes a single number, performs a
certain fixed mathematical operation on it, and produces an output.
3.1. Introduction to ANN…
• There are several activation functions you may encounter in
practice.
Figure Linearly-separable classes vs non-linearly separable classes
3.1. Introduction to ANN…
• The Sigmoid (Logistic) activation function is defined as:
σ(z) = 1 / (1 + e^(−z))
• It maps any real-valued input into the range (0, 1).
3.1. Introduction to ANN…
• The Sigmoid Function curve looks like an S-shape.
3.1. Introduction to ANN…
3.1. Introduction to ANN…
• The output of the Logistic Sigmoid function is saturated for
higher and lower inputs, which leads to vanishing gradient
problem.
• The vanishing gradient problem refers to a scenario where the
gradient of the objective function with respect to a parameter
becomes very close to zero.
• This leads to almost no update in the parameters during the
training of the network.
• Hence, training effectively stalls under the vanishing gradient
scenario.
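• The saturation can be seen numerically. Here is a minimal sketch (illustrative values, using the standard identity σ′(z) = σ(z)(1 − σ(z))) showing the sigmoid's gradient shrinking toward zero for large inputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # at most 0.25, reached at z = 0

for z in [0, 2, 5, 10]:
    print(f"z={z:>2}  sigmoid={sigmoid(z):.5f}  gradient={sigmoid_grad(z):.5f}")
# At z=10 the gradient is ~0.000045: almost no learning signal flows back.
```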
3.1. Introduction to ANN…
• The ReLU (Rectified Linear Unit) activation function is defined as:
f(z) = max(0, z)
• The ReLU function retains only positive elements and discards all
negative elements by setting the corresponding activations to 0.
• But the issue is that all negative values become 0 immediately,
which can decrease the model's ability to fit the data properly.
• That means any negative input given to the ReLU activation
function turns into zero immediately.
• This, in turn, affects the resulting graph by not mapping the
negative values appropriately.
• There are several pros to using the ReLUs:
– It was found to greatly accelerate the convergence of stochastic
gradient descent compared to the sigmoid/tanh functions. It is argued
that this is due to its linear, non-saturating form.
– Compared to tanh/sigmoid neurons that involve expensive operations
(exponentials, etc.), the ReLU can be implemented by simply
thresholding activations at zero.
3.1. Introduction to ANN…
• The Leaky ReLU activation function is defined as:
f(z) = z if z > 0, and a·z otherwise, where a is a small constant
• Figure ReLU vs Leaky ReLU
• The leak helps to increase the range of the ReLU function.
• Usually, the value of a is 0.01.
• Therefore, the range of the Leaky ReLU is negative infinity to
positive infinity (-∞ to +∞).
• If a neuron gets stuck in the negative side, it will always output 0,
and its gradient will be 0 during backpropagation.
• This can cause the neuron to become inactive and stop learning.
• Variants like Leaky ReLU or Parametric ReLU are designed to
address this issue.
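• A minimal sketch of ReLU and Leaky ReLU side by side (using the common leak value a = 0.01 mentioned above):

```python
def relu(z):
    # Keep positive values; set negative values to 0
    return max(0.0, z)

def leaky_relu(z, a=0.01):
    # Let a small gradient "leak" through for negative inputs
    return z if z > 0 else a * z

for z in [-2.0, -0.5, 0.0, 1.5]:
    print(z, relu(z), leaky_relu(z))
# relu(-2.0) = 0.0: the unit is silent and its gradient is 0
# leaky_relu(-2.0) = -0.02: a small signal remains, so learning can continue
```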
3.1. Introduction to ANN…
3.1. Introduction to ANN…
e. Softmax
• Softmax function calculates the probability distribution of the
event over ‘n’ different events.
• This function will calculate the probabilities of each target class
over all possible target classes.
• Later, the calculated probabilities will be helpful for determining
the target class for the given inputs.
• The softmax is usually used for the output layer.
• When there are multiple output classes, softmax handles the
multiclass classification problem.
• Softmax normalizes each class's output to the range between 0 and
1 so that the outputs can be interpreted as probabilities.
3.1. Introduction to ANN…
• For a vector of n raw scores z, the softmax is defined as:
softmax(zi) = e^(zi) / Σj e^(zj), for i = 1, …, n
3.1. Introduction to ANN…
• For example, Softmax might produce likelihoods of an image
belonging to a particular class such as 0.7 for "cat", 0.2 for "dog",
and 0.1 for "parrot".
• The main difference between the Sigmoid and Softmax activation
function is that
– the Sigmoid is used in binary classification (two classes)
– the Softmax is used for multiclass classification tasks
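• A minimal softmax sketch. Subtracting the maximum score before exponentiating is a common numerical-stability trick (an implementation detail, not part of the definition above):

```python
import math

def softmax(scores):
    m = max(scores)                       # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]      # probabilities summing to 1

raw_scores = [2.0, 1.0, 0.1]              # e.g. raw outputs for 3 classes
probs = softmax(raw_scores)
print(probs, sum(probs))                  # ~[0.659, 0.242, 0.099], 1.0
```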
Why use bias in a neural network?
• Bias can be defined as a constant which is added to the product of
inputs and weights.
• It is used to offset the result.
• It helps the models to shift the activation function towards the
positive or negative side.
• If we vary the values of the weight ‘w’, keeping bias ‘b’=0, we
will get the following graph.
Figure Activation function with bias b=0 (i.e. no bias)
• While changing the values of weight w, there is no way we can shift
the origin of the activation function.
• By changing the values of weight w, only the steepness of the curve
will change.
• There is only one way to shift the origin and that is to include bias b.
• On keeping the value of weight ‘w’ fixed and varying the value of
bias ‘b’, we will get the graph below.
Figure Activation function with bias b=1 with weight constant
3.2. The Basic Architecture of ANN
• A neural network is made up of many neurons connected together
which are arranged in layers.
• Groups of neurons at the same level form layers.
• Hence, a layer usually consists of multiple nodes.
• However, some layers (the output layer, for example) may consist
of only a single node.
• A feed-forward neural network or multi-layer perceptron
(MLP) consists of an input layer of input nodes, one or more hidden
layers in the middle, and a layer of output nodes at the end.
• The number of input nodes depends on the input data and how it
is encoded.
• For example, if you had seven numerical features about houses for
sale, you could use seven input nodes in the input layer.
• The number and size of the hidden layers is up to the user to define.
• And finally, the size of the output layer depends on the number of
outputs needed.
• For a classification task with N classes, you would use N output
nodes, one corresponding to each class.
• For a standard regression problem, you only need a single output
node to provide a single numerical value.
Input layer
• This layer is how we get input data fed into our network.
• The number of neurons in an input layer is typically the same as
the number of input features fed to the network.
• Input layers are followed by one or more hidden layers.
• Input layers in feed-forward neural networks are fully connected
to the next hidden layer.
• No data processing happens in the input layer.
Hidden layer
• There are one or more hidden layers in a feed-forward neural
network.
• The weight values on the connections between the layers are how
neural networks encode the learned information extracted from
the raw training data.
• Hidden layers are the key to allowing neural networks to model
nonlinear functions, solving the limitations of the single-layer
perceptron networks.
Output layer
• We get the prediction from the output layer.
• Given that we are mapping an input space to an output space with
the ANN model, the output layer gives us an output based on the
input from the input layer.
• Depending on the setup of the neural network, the final output
may be:
– a real-valued output (regression)
– a set of probabilities (classification)
• This is controlled by the type of activation function we use on the
neurons in the output layer.
• The output layer typically uses either a softmax or sigmoid
activation function for classification.
• For a real-valued output (regression), the most common activation
function used in the output layer is the linear activation function.
• This means no activation function is used.
• It can be expressed as:
f(z) = z
3.2. The Basic Architecture of ANN…
Connections, weights, and biases
• Data flows in only one direction in a feed-forward network.
• It flows from the input nodes, through the hidden layers, and to
the output nodes.
• The nodes in each layer are connected to all of the nodes in the
layer immediately before and after it.
• The connections represent where the data is flowing.
• In the network, each connection has its own weight and each node
has its own bias.
• The sum of the weight-multiplied data, along with the bias term,
is fed into the activation function.
• The output of the activation function at each node then becomes
the input for the next layer.
3.2. The Basic Architecture of ANN…
• These weights and biases are the parameters that the model learns
during the training process.
• Data moving all the way through the network once, from input to
output is called a forward pass.
• This is why it is called a feed-forward network architecture.
• As the number of layers grows, the number of nodes also grows.
• This causes the number of parameters in the network to grow
extremely rapidly.
• This is a big part of what gives neural networks their power.
• But it is also what can make them difficult to train:
– a lot of parameters typically means you need a lot of training
data and a lot of computing resources.
• ANNs are very good function approximators, provided that a large
dataset from the corresponding domain is available.
• Almost all practical problems such as playing a game of Go or
mimicking intelligent behavior can be represented by
mathematical functions.
• The corresponding theorem that formulates this basic idea of
approximation is called Universal Approximation Theorem.
• This theorem suggests that a neural network is capable of learning
complex patterns and relationships in data as long as certain
conditions are fulfilled.
Figure A continuous function
• The Universal Approximation Theorem is a foundational result in
the theory of neural networks.
• It states that:
A feedforward neural network with at least one hidden layer and a
finite number of neurons can approximate any continuous
function on a compact subset of Rn, provided the activation
function is non-linear and satisfies certain properties.
• Key points of the Theorem are:
• i. Function Approximation: The theorem guarantees that such a neural
network can approximate a target function f(x) as closely as desired, given
enough neurons in the hidden layer.
• ii. Compact Subset of Rn: The theorem applies to functions defined on
compact subsets of Rn, which are bounded and closed regions.
• This means the function can be approximated well within a finite domain.
• iii. Nonlinear Activation Function: For the theorem to hold, the
activation function in the hidden layer must be nonlinear and satisfy
certain properties, like being continuous and non-constant.
• iv. Finite Number of Neurons: Although the number of neurons required
for a specific approximation may be very large, the theorem does not
impose any strict limit on this number.
Multilayer Perceptron (MLP)
• The simplest kind of neural network is the feedforward network.
• A feedforward network is a multilayer network in which the units
are connected with no cycles.
• The outputs from units in each layer are passed to units in the next
higher layer, and no outputs are passed back to lower layers.
• For historical reasons multilayer networks, especially feedforward
networks, are sometimes called multi-layer perceptrons (MLP).
• However, this is a technical misnomer since the units in modern
multilayer networks aren’t perceptrons.
• Perceptrons are purely linear, but modern networks are made up
of units with non-linearities such as sigmoids or ReLUs.
Multilayer Perceptron (MLP)…
• Simple feedforward networks have three kinds of nodes: input
nodes, hidden nodes, and output nodes.
• The input layer x is a vector of scalar values.
• This feeds input to the neural network.
• No operation is performed at input layer.
Figure A simple 2-layer
feedforward network
Multilayer Perceptron (MLP)…
• Every hidden and output node has parameters:
– a weight vector and a bias.
• We can represent the parameters for any hidden layer by combining
the weight vector and bias for each neuron i into a single weight
matrix W and a single bias vector b for the layer.
• Each element Wji of the weight matrix W represents the weight of the
connection from the i-th input unit xi to the j-th hidden unit hj.
• The advantage of using matrix W for the weights of the entire layer is
that the computation for a feedforward neural network can be done
very efficiently with simple matrix operations.
• For each layer, the computation only has three steps:
– multiplying the weight matrix by the input vector x
– adding the bias vector b
– applying the activation function g.
Multilayer Perceptron (MLP)…
• The output of each hidden layer will be a vector.
• The output of the hidden layer, the vector h, is thus the following:
h = σ(Wx + b1)
• Here, we are applying the σ activation function to a vector.
• We thus take σ(·), and indeed any activation function g(·),
to apply to a vector element-wise. So,
g[z1, z2, z3] = [g(z1), g(z2), g(z3)].
• Like the hidden layer, the output layer has a weight matrix, let us
call it U.
• The weight matrix U is multiplied by the hidden layer output
vector, h, to produce the output.
z = Uh + b2
a = softmax(z)
• The resulting value h forms a representation of the input.
• The role of the output layer is to take this new hidden layer output
h and compute a final output.
• There are n2 output nodes, so the weight matrix U has
dimensionality U ∈ Rn2×n1, and element Uij is the weight from
unit j in the hidden layer to unit i in the output layer.
• In many cases, the goal of the network is to make some kind of
classification decision, and so we will focus on the case of
classification.
• Here are the final equations for a feedforward network with a
single hidden layer, which takes an input vector x, outputs a
probability distribution y, and is parameterized by weight matrices
W and U and a bias vector b:
h = σ(Wx + b)
z = Uh
a = softmax(z)
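• These three equations translate directly into a few matrix operations. Here is a minimal NumPy sketch (the sizes and values are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # stable softmax
    return e / e.sum()

# Illustrative sizes: n0 = 3 inputs, n1 = 4 hidden units, n2 = 2 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))     # hidden-layer weight matrix
b = rng.normal(size=4)          # hidden-layer bias vector
U = rng.normal(size=(2, 4))     # output-layer weight matrix

x = np.array([0.5, -1.2, 3.0])
h = sigmoid(W @ x + b)          # h = sigma(Wx + b)
z = U @ h                       # z = Uh
y = softmax(z)                  # a probability distribution over 2 classes
print(y, y.sum())
```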
Example:
• Suppose we want to classify whether a person is male or female
by using their height and weight as features.
Multilayer Perceptron (MLP)…
• It is necessary to set up some notation to make it easier to talk
about deeper networks of depth more than 2.
• We use superscripts in square brackets to indicate layer numbers,
starting at 0 for the input layer.
• So, W[1] will mean the weight matrix for the first hidden layer,
and b[1] will mean the bias vector for the first hidden layer.
• The value nj will mean the number of units at layer j.
• We use g(·) to stand for the activation function,
• This could be ReLU or tanh for intermediate layers and sigmoid
or softmax for output layers.
• We will use a[i] to mean the output from layer i, and z[i] to mean
the summation of weights and biases W[i]a[i−1] + b[i].
• The 0th layer is for inputs, so we will refer to the inputs x more
generally as a[0].
• The forward computation for a network of depth n can then be
written as, for i = 1, …, n:
z[i] = W[i]·a[i−1] + b[i]
a[i] = g[i](z[i])
ŷ = a[n]
• Example: compute the forward propagation for the following 2-layer
neural network.
• Remember that each unit of a neural network performs two
operations: compute weighted sum and pass the weighted sum
through an activation function.
• For hidden layer node h1:
Zh1 = i1*w1 + i2*w3 + b1 = 0.1*0.1 + 0.5*0.3 + 0.25 = 0.41
h1 = 1/(1 + e^(−Zh1)) = 1/(1 + e^(−0.41)) = 0.60108
• For hidden layer node h2:
Zh2 = i1*w2 + i2*w4 + b1 = 0.1*0.2 + 0.5*0.4 + 0.25 = 0.47
h2 = 1/(1 + e^(−Zh2)) = 1/(1 + e^(−0.47)) = 0.61538
• For output layer node o1:
Zo1 = h1*w5 + h2*w6 + b2 = 0.60108*0.5 + 0.61538*0.6 + 0.35 = 1.01977
o1 = 1/(1 + e^(−Zo1)) = 1/(1 + e^(−1.01977)) = 0.73492
• For output layer node o2:
Zo2 = h1*w7 + h2*w8 + b2 = 0.60108*0.7 + 0.61538*0.8 + 0.35 = 1.26306
o2 = 1/(1 + e^(−Zo2)) = 1/(1 + e^(−1.26306)) = 0.77955
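• The arithmetic above can be verified with a short script. This sketch hard-codes the values used in the example (i1 = 0.1, i2 = 0.5, b1 = 0.25, b2 = 0.35):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

i1, i2 = 0.1, 0.5
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4
w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8
b1, b2 = 0.25, 0.35

zh1 = i1 * w1 + i2 * w3 + b1           # 0.41
zh2 = i1 * w2 + i2 * w4 + b1           # 0.47
h1, h2 = sigmoid(zh1), sigmoid(zh2)    # 0.60108, 0.61538

zo1 = h1 * w5 + h2 * w6 + b2           # 1.01977
zo2 = h1 * w7 + h2 * w8 + b2           # 1.26306
o1, o2 = sigmoid(zo1), sigmoid(zo2)    # 0.73492, 0.77955
print(round(o1, 5), round(o2, 5))
```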
Exercise: calculate the output of the following neural network
• Architecture:
• Input Layer: 2 neurons
• Hidden Layer: 3 neurons
• Output Layer: 1 neuron
• Activation Function:
• Hidden Layer: ReLU
• Output Layer: Sigmoid
• Weights and Biases:
• Hidden Layer:
• Weights: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
• Biases: [0.1, 0.2, 0.3]
• Output Layer:
• Weights: [0.7, 0.8, 0.9]
• Bias: 0.1
Exercise
1. calculate the output of the following neural network
• Architecture: three layers
• Activation functions:
• Hidden Layer: ReLU
• Output Layer: Sigmoid
• Input layer:
– Input: [2, 5]
• Hidden Layer:
• Weights: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
• Biases: [0.1, 0.2, 0.3]
• Output Layer:
• Weights: [[0.7, 0.8, 0.9], [0.4, 0.2, 0.6]]
• Bias: [0.1, 0.2]
2. Given the inputs, weights, and biases in the figure, calculate the
output. Use RELU for hidden layer and softmax for output layer.
3.3 Training ANN with Backpropagation
• A feedforward neural net is a supervised machine learning model
in which we know the correct output y for each observation x.
• What the system produces is ŷ, the system’s estimate of the true y.
• The goal of the training procedure is
– to learn parameters W[i] and b[i] for each layer i that make ŷ for
each training example as close as possible to the true y.
• This requires three components.
• 1. We need a metric for checking how close the current output
(ŷ) is to the true gold label y.
• Rather than measuring similarity, we usually talk about the
opposite of this:
– the distance between the system output and the gold output.
• We will need a loss function that computes this distance between
the system output and the gold output.
• We call this function the loss function or the cost function.
• The loss function that is commonly used for neural networks is
the cross-entropy loss.
• 2. We need an optimization algorithm.
• In order to find the parameters that minimize the loss function, we
use an optimization algorithm.
• The optimization algorithm iteratively updates the weights and
biases so as to minimize the loss function.
• The standard optimization algorithm for this is gradient descent.
• 3. Gradient descent requires knowing the gradient of the loss
function.
• This is a vector that contains the partial derivative of the loss
function with respect to each of the parameters.
• How do we partial out the loss over all those intermediate layers
of neural networks?
• The answer is the algorithm called error backpropagation or
backward differentiation.
1. Loss function
• Loss functions are a measurement of how good your model is in
terms of predicting the expected outcome.
• A loss function is also called a cost function.
• A cost function is a measure of error between what value your
model predicts and what the value actually is.
• Loss function is a method of evaluating how well your algorithm
models your dataset.
• It helps us to understand how much the predicted value differs
from the actual value.
– If the predictions are totally off, the loss function will output a
higher number.
– If the algorithm is performing well, the loss function will
output a lower number.
• As you train your model, the loss function will tell you if the
machine learning algorithm is improving or not.
• There is no one-size-fits-all loss function that works for all
machine learning algorithms.
• There are various factors involved in choosing a loss function for
a specific problem, such as
– type of machine learning algorithm chosen
– ease of calculating the derivatives
– to some degree the percentage of outliers in the data set
• Broadly speaking, loss functions can be grouped into two major
categories:
– classification loss function
– Regression loss function
• In classification problems, the task is to predict the respective
probabilities of all classes the problem is dealing with.
• In regression problems, the task is to predict the continuous value
concerning a given set of independent features to the learning
algorithm.
3.3 Training ANN with Backpropagation…
• The commonly used losses are:
i. Regression losses
– MSE (Mean Squared Error)
– MAE (Mean Absolute Error)
– Huber loss
ii. Classification losses
– Binary cross-entropy
– Categorical cross-entropy
– Hinge loss
3.3 Training ANN with Backpropagation…
A. Mean Squared Error (MSE)
• Mean squared error (also called L2 loss) measures the average of
the squared differences between the true values and the predicted
values:
MSE = (1/n) Σ (y − ŷ)²
3.3 Training ANN with Backpropagation…
• Lower MSE numbers indicate a better fit.
• MSE is only concerned with the average magnitude of error
irrespective of their direction.
• However, due to squaring, predictions which are far away from
actual values are penalized heavily in comparison to less deviated
predictions.
• Since errors are squared, larger errors contribute
disproportionately to the loss, making MSE sensitive to outliers.
• MSE has nice mathematical properties which makes it easier to
calculate gradients.
• The squaring operation simplifies the derivative calculation,
which is useful for gradient-based methods.
B. Mean Absolute Error (MAE)
• Mean absolute error (also called L1 loss) is used to minimize the
error which is the sum of all the absolute differences between the
true value and the predicted value.
• MAE measures the average absolute differences:
MAE = (1/n) Σ |y − ŷ|
• Here, n is the number of total data points or samples, y is the
expected output, and ŷ is the value predicted by the model.
• The calculation of MAE is simple and computationally efficient.
• The MAE loss function is more robust to outliers compared to the
MSE loss function.
• Therefore, you should use it if the data is prone to many outliers.
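• A minimal sketch of both regression losses, following the definitions above (n samples, expected outputs y, predictions ŷ; the example numbers are illustrative):

```python
def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared differences
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of the absolute differences
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(mse(y_true, y_pred))  # 0.375 (the largest error contributes most)
print(mae(y_true, y_pred))  # 0.5
```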
C. Cross-Entropy Loss
• A commonly used loss function for classification is the
cross-entropy loss.
• We can define cross entropy as the difference between two
probability distributions p and q, where p is our true output and q
is our estimate of this true output.
• As an example, consider that we have a classification problem of
3 classes: orange, apple, and tomato.
• The deep learning model will give a probability distribution of
these 3 classes as output for a given input data.
• The class with the highest probability is considered as a winner
class for prediction.
output = [p(orange), p(apple), p(tomato)]
• The sum of the probabilities p(orange), p(apple), p(tomato) is 1.
• The actual probability distribution for each class is shown below.
orange = [1, 0, 0]
apple = [0, 1, 0]
tomato = [0, 0, 1]
• During training, if the input class is Tomato, the predicted probability
distribution should tend towards the actual probability distribution of
Tomato.
• If the predicted probability distribution is not closer to the actual one, the
model has to adjust its weight.
• This is where cross-entropy becomes a tool to calculate how far
the predicted probability distribution is from the actual distribution.
• Cross-entropy can be considered as a way to measure the distance
between two probability distributions.
• For two distributions p (true) and q (estimated), cross-entropy is
computed as follows:
3.3 Training ANN with Backpropagation…
CrossEntropy(p, q) = −Σi p(xi) · log q(xi)
3.3 Training ANN with Backpropagation…
• Binary cross-entropy is the loss function used for classification
problems between two categories only, also known as binary
classification:
L = −[y · log(ŷ) + (1 − y) · log(1 − ŷ)]
• Categorical cross-entropy is a loss function used for multiclass
classification.
• It is also known as Softmax Loss.
• Using this loss, we can train a neural network to output a
probability distribution over the k classes:
L = −Σk yk · log(ŷk)
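• A minimal sketch of both cross-entropy variants (the small epsilon that guards against log(0) is an implementation detail, not part of the formulas):

```python
import math

EPS = 1e-12

def binary_cross_entropy(y, y_hat):
    # y is 0 or 1; y_hat is the predicted probability of class 1
    return -(y * math.log(y_hat + EPS) + (1 - y) * math.log(1 - y_hat + EPS))

def categorical_cross_entropy(p, q):
    # p: true one-hot distribution; q: predicted distribution (e.g. softmax output)
    return -sum(pi * math.log(qi + EPS) for pi, qi in zip(p, q))

tomato = [0, 0, 1]                                    # true class
prediction = [0.1, 0.2, 0.7]                          # model output
print(categorical_cross_entropy(tomato, prediction))  # -log(0.7) ~ 0.357
print(binary_cross_entropy(1, 0.9))                   # -log(0.9) ~ 0.105
```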
3.3 Training ANN with Backpropagation…
2. Optimization Algorithms
• In DL, we have the loss function which tells us how poorly the
model is performing currently.
• We need to use this loss to train our network such that it performs
better.
• Essentially what we need to do is to take the loss and try to
minimize it, because a lower loss means our model is performing
better.
• The process of minimizing (or maximizing) any mathematical
expression is called optimization.
• Optimization is the process of training the model iteratively to
find the maximum or minimum of a function.
• It is one of the most important operations in ML to get better
results.
3.3 Training ANN with Backpropagation…
• Optimizers are algorithms used to change the attributes of the
neural network, such as weights and biases, to reduce the losses.
• Optimizers are used to solve optimization problems by
minimizing the cost function.
Figure Finding minima and maxima
• First, we will define a loss function.
• Once the loss function is defined, we can use an optimization
algorithm in attempt to minimize the loss.
• In optimization, a loss function is often referred to as the objective
function of the optimization problem.
• If we ever need to maximize an objective function, there is a
simple solution: just flip the sign on the objective.
• There are many types of optimization algorithms.
• First-order optimization algorithms explicitly involve using the
first derivative (gradient) to choose the direction to move in the
search space.
• The procedure involves
– first calculating the gradient of the function
– then following the gradient in the opposite direction (e.g. downhill
to the minimum for minimization problems) using a step size
(also called the learning rate).
3.3 Training ANN with Backpropagation…
• First-order algorithms are generally referred to as gradient descent.
• More specific names refer to minor extensions of gradient
descent.
• These are:
– Gradient Descent
– Momentum
– Adagrad
– RMSProp
– Adam
• A second-order optimization algorithm uses the first and second
derivatives of a function to find its minimum or maximum value.
• Second-order optimization algorithms are more powerful than
first-order optimization algorithms, but they are more complex to
implement.
Gradient Descent
• What is a gradient?
• In ML, a gradient is the derivative of a function that has more than
one input variable.
• Known as the slope of a function in mathematical terms, the gradient
simply measures the change in all weights with regard to the change
in error.
• It is computed using partial derivative.
• Gradient descent is an optimization algorithm which is
commonly-used to train neural networks.
• Training data helps these models learn over time, and the cost
function within gradient descent specifically acts as a barometer,
gauging its accuracy with each iteration of parameter updates.
• Until the cost function is close to or equal to zero, the model will
continue to adjust its parameters to yield the smallest possible error.
• Once machine learning models are optimized for accuracy, they can
be powerful tools for artificial intelligence.
• Gradient Descent is an optimization algorithm for finding a local
minimum of a differentiable function.
• Gradient descent in ML is simply used to find the values of a function's
parameters that minimize the cost function as far as possible.
• It is a minimization algorithm that minimizes a given function.
• A gradient simply measures the change in all weights with regard to the
change in error.
• You can also think of a gradient as the slope of a function.
• The higher the gradient, the steeper the slope and the faster a model can
learn.
• But if the slope is zero, the model stops learning.
• A gradient is a partial derivative of the cost function with respect
to its inputs.
• Hence, the goal of gradient descent is to minimize the cost function, or
the error between predicted and actual output.
• In order to do this, it requires two things:
– a direction and a learning rate.
• These factors determine the partial derivative calculations of future
iterations, allowing it to gradually arrive at the local or global minimum.
• Learning rate (also called step size) is the size of the steps that are taken
to reach the minimum.
• This is typically a small value, usually a small fraction.
• High learning rates result in larger steps but risk overshooting the
minimum.
• Conversely, a low learning rate has small step sizes.
• While it has the advantage of more precision, the number of iterations
compromises overall efficiency as this takes more time and
computations to reach the minimum.
• The learning rate is the determinant of how big the steps gradient descent
takes in the direction of the local minimum.
• It determines the speed with which the algorithm moves towards the
optimum values of the cost function.
3.3 Training ANN with Backpropagation…
• Because of this, the choice of the learning rate, η, is important and
has a significant impact on the effectiveness of the algorithm.
• If the learning rate is too big, then in a bid to find the optimal
point, gradient descent moves from a point on one side of the
valley all the way past the minimum to the other side.
• In that case, you see that the cost function has gotten worse.
• On the other hand, if the learning rate is too small, then gradient
descents will work, albeit very slowly.
• It is important to pick the learning rate carefully.
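• The effect of the learning rate can be seen on a simple one-dimensional example. This sketch minimizes f(w) = w², whose gradient is 2w (the function and the rates are illustrative):

```python
def gradient_descent(eta, steps=10, w=5.0):
    # Minimize f(w) = w**2; the gradient is 2*w and the minimum is at w = 0
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(gradient_descent(eta=0.1))   # ~0.54: steady progress toward 0
print(gradient_descent(eta=0.01))  # ~4.09: converging, but very slowly
print(gradient_descent(eta=1.1))   # ~31:   diverges, overshooting every step
```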
3.3 Training ANN with Backpropagation…
• The cost function provides feedback to the model so that it can
adjust the parameters, such as weight and bias, to minimize the
error and find the local or global minimum.
• It continuously iterates, moving along the direction of steepest
descent (or the negative gradient) until the cost function is close to
or at zero.
• At this point, the model will stop learning.
• There are different variants of gradient descent algorithm.
• In this example, we will use what is called stochastic gradient
descent.
Computing the Gradient
• To check how a network is performing, we apply the cost function
to find the difference between the system output and the expected
output.
• The cost function gives us the error of the network in the current
state.
• Once the error has been computed, what do we do in order to
improve the network performance?
• We calculate gradients that can be used to update network
parameters.
• How do we compute the gradient of this loss function?
• Computing the gradient requires the partial derivative of the loss
function with respect to each parameter of the neural network.
• Partial derivatives calculate the rate of change of a function of
several variables with respect to one of those variables while
holding the other variables fixed or constant.
3.3 Training ANN with Backpropagation…
• To minimize the difference between the network’s output and the
target output, we need to know how the model performance
changes with respect to each parameter of the neural network
model.
• This means we need to define the relationship between the cost
function and each weight and bias.
• This can be done using partial derivative.
• A partial derivative tells us how the cost function changes with
small changes in the parameters of the network i.e. weights and
biases.
• It quantifies how much of the error is contributed by each of the
parameters of the neural network.
• This helps us to update these parameters in order to minimize
error.
Figure how error moves backward in
logistic regression
3.3 Training ANN with Backpropagation…
• But these derivatives only give correct updates for one weight
layer: the last one.
• For deep networks, computing the gradients for each weight is
much more complex.
• In this case, we must compute the derivative with respect to weight
parameters that appear all the way back in the very early layers of
the network, i.e. those closest to the input layer.
• The solution to computing this gradient is an algorithm called
error backpropagation or backprop.
• While backprop was invented specially for neural networks, it
turns out to be the same as a more general procedure called
backward differentiation, which depends on the notion of
computation graphs.
• Reading about computation graphs will help you understand
backpropagation.
Backward differentiation for a neural network
• Computation graphs for real neural networks are much more
complex.
• The figure below shows a simple neural network that has 2 layers
with n0=2, n1=2, and n2=1, doing binary classification and hence
using a sigmoid output unit for simplicity.
3.3 Training ANN with Backpropagation…
• For an output node o with target y, using a sigmoid activation and
squared-error loss, the error term (delta) is:
δo = (o − y) · σ′(zo), where σ′(z) = σ(z) · (1 − σ(z))
• For a hidden node h, the error term is the weighted sum of the
error terms of the nodes it feeds into, passed back through the
derivative of its own activation:
δh = (Σk δk · wh,k) · σ′(zh), where wh,k is the weight on the
connection from h to node k.
• The gradient of the loss with respect to any weight is then the
node's delta times the input flowing through that weight:
∂L/∂w = δ · a
3.3 Training ANN with Backpropagation…
Example:
• Let us compute the backward pass for the following neural
network, which we used as an example for the forward pass
computation.
3.3 Training ANN with Backpropagation…
• In the forward pass, we got:
Zh1 = 0.41, h1 = 0.60108
Zh2 = 0.47, h2 = 0.61538
Zo1 = 1.01977, o1 = 0.73492
Zo2 = 1.26306, o2 = 0.77955
• For the backward pass or backpropagation:
• The expected outputs of the network are 0.05 and 0.95
respectively for o1 and o2.
• In the forward pass, the network produced outputs 0.73492 and
0.77955 for o1 and o2 respectively.
• Now we will compute the errors based on network outputs (ŷ) and
the expected outputs (y).
3.3 Training ANN with Backpropagation…
• i. Compute the error at each output node. Using the squared error
E = ½(y − ŷ)²:
Eo1 = ½(0.05 − 0.73492)² = 0.23456
Eo2 = ½(0.95 − 0.77955)² = 0.01453
Etotal = Eo1 + Eo2 = 0.24908
• ii. Compute the error term (delta) of each node by propagating the
error backward (the numeric values are computed below).
• iii. Update the weights:
• For output layer weights:
w5 = w5 − η ⋅ δo1 ⋅ ah1
w6 = w6 − η ⋅ δo1 ⋅ ah2
w7 = w7 − η ⋅ δo2 ⋅ ah1
w8 = w8 − η ⋅ δo2 ⋅ ah2
• For hidden layer weights:
w1 = w1 − η ⋅ δh1 ⋅ i1
w2 = w2 − η ⋅ δh2 ⋅ i1
w3 = w3 − η ⋅ δh1 ⋅ i2
w4 = w4 − η ⋅ δh2 ⋅ i2
• Bias update:
b11 = b1 − η⋅δh1
b12 = b1 − η⋅δh2
b21 = b2 − η⋅δo1
b22 = b2 − η⋅δo2
• Output layer (using the forward-pass outputs o1 = 0.73492 and o2 = 0.77955):
• δo1 = (o1 − y1) * σ′(zo1) = (0.73492 − 0.05) * 0.19481 = 0.13343
• δo2 = (o2 − y2) * σ′(zo2) = (0.77955 − 0.95) * 0.17185 = −0.02929
• Hidden layer (h1 feeds o1 through w5 and o2 through w7; h2 feeds
o1 through w6 and o2 through w8):
• δh1 = (δo1 * w5 + δo2 * w7) * σ′(zh1) = (0.06672 − 0.02050) * 0.23978 = 0.01108
• δh2 = (δo1 * w6 + δo2 * w8) * σ′(zh2) = (0.08006 − 0.02343) * 0.23669 = 0.01340
• Update weights (learning rate η = 0.1):
• w5 = w5 − η ⋅ δo1 ⋅ ah1 = 0.5 − 0.1 * 0.13343 * 0.60108 = 0.49198
• w6 = w6 − η ⋅ δo1 ⋅ ah2 = 0.6 − 0.1 * 0.13343 * 0.61538 = 0.59179
• w7 = w7 − η ⋅ δo2 ⋅ ah1 = 0.7 − 0.1 * (−0.02929) * 0.60108 = 0.70176
• w8 = w8 − η ⋅ δo2 ⋅ ah2 = 0.8 − 0.1 * (−0.02929) * 0.61538 = 0.80180
• w1 = w1 − η ⋅ δh1 ⋅ i1 = 0.1 − 0.1 * 0.01108 * 0.1 = 0.09989
• w2 = w2 − η ⋅ δh2 ⋅ i1 = 0.2 − 0.1 * 0.01340 * 0.1 = 0.19987
• w3 = w3 − η ⋅ δh1 ⋅ i2 = 0.3 − 0.1 * 0.01108 * 0.5 = 0.29945
• w4 = w4 − η ⋅ δh2 ⋅ i2 = 0.4 − 0.1 * 0.01340 * 0.5 = 0.39933
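• The numbers above can be verified with a short script (squared-error loss, sigmoid activations, η = 0.1, targets 0.05 and 0.95 as in the example):

```python
# Forward-pass values from the earlier example
h1, h2 = 0.60108, 0.61538
o1, o2 = 0.73492, 0.77955
y1, y2 = 0.05, 0.95
w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8
eta = 0.1

# Output deltas: (o - y) * sigma'(z), with sigma'(z) = o * (1 - o)
d_o1 = (o1 - y1) * o1 * (1 - o1)                # ~ 0.13343
d_o2 = (o2 - y2) * o2 * (1 - o2)                # ~ -0.02929

# Hidden deltas: h1 feeds o1 via w5 and o2 via w7; h2 via w6 and w8
d_h1 = (d_o1 * w5 + d_o2 * w7) * h1 * (1 - h1)  # ~ 0.01108
d_h2 = (d_o1 * w6 + d_o2 * w8) * h2 * (1 - h2)  # ~ 0.01340

print(w5 - eta * d_o1 * h1)   # new w5 ~ 0.49198
print(w6 - eta * d_o1 * h2)   # new w6 ~ 0.59179
print(w7 - eta * d_o2 * h1)   # new w7 ~ 0.70176
print(w8 - eta * d_o2 * h2)   # new w8 ~ 0.80180
```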
Exercise
• Compute the backpropagation and update the weights.
Local and Global Minima
• Neural networks learn the best parameters by training on data.
• They can learn in a supervised manner via gradient descent:
iteratively adjusting the model’s parameters to decrease the
overall prediction error, as measured by the loss function.
• Gradient descent can be thought of as stepping “downhill” on a
“loss surface”, which has as many dimensions as there are model
parameters.
• The best parameter value combination will produce the lowest
prediction error, which corresponds to the so-called “global
minimum” in the loss surface.
• Typically, there are many valleys and dips in the loss surface that
are lower than the surroundings, but not quite as low as the global
minimum.
• These are called “local minima”.
• It’s not always possible to find the global minimum, but there are
several strategies to do as well as possible.
• A few of the techniques that help in reaching the global minimum
are:
– using stochastic gradient descent
– using different weight initialization techniques.
• Reaching a point in which gradient descent makes very small
changes in your objective function is called convergence.
• At the minimum, the model has optimized the weights such
that they minimize the cost function.
• To perform gradient descent, the training process needs to know
three things:
– the value of the loss function for a given set of parameters (i.e. at that
point on the loss surface),
– the slope or gradient of the loss surface at that point, which indicates
which direction to move in, and
– how big of a step to take.
Figure A 3D plot of the cost function of a neural network
• The figure below shows gradient descent experiments with two
different routes.
• Your starting position on the hill corresponds to the initial values
given to θ0, and θ1.
• Black route has a slightly different starting point compared to the
white one, which reveals an interesting property of the gradient
descent algorithm: changing the initial value of weights might lead
you to a different minimum.
Figure Two different gradient descent routes on a 3D loss surface,
determined by the starting point.
3.4. Types of Gradient Descent
• Gradient Descent is a widely used iterative optimization algorithm
that is used to find the minimum of any differentiable function.
• A common problem is that GD may converge to a local minimum
and may not be able to get away from it.
• Thus, it may fail to reach the minimum of a function i.e. the
global minimum.
• There are three types of gradient descent learning algorithms:
– batch gradient descent,
– stochastic gradient descent
– mini-batch gradient descent
3.4. Types of Gradient Descent…
i. Stochastic gradient descent (SGD)
• SGD runs a training step for each example within the dataset,
updating the network parameters after each training example.
• It is a variation of gradient descent that calculates the error and
updates the model for each example in the training dataset.
• Since only one training example needs to be held at a time, it is
easier to store in memory.
• While these frequent updates can offer more detail and speed, it
can result in losses in computational efficiency when compared to
batch gradient descent.
• Its frequent updates can result in noisy gradients.
• But this can also be helpful in escaping the local minimum and
finding the global minimum.
3.4. Types of Gradient Descent…
• SGD updates the parameters for each training example one by one.
• Depending on the problem, this can make SGD faster than batch
gradient descent.
• One advantage is the frequent updates allow us to have a pretty
detailed rate of improvement.
• However, the frequent updates are more computationally expensive
than the batch gradient descent approach.
• Additionally, the frequency of those updates can result in noisy
gradients, which may cause the error rate to jump around instead of
slowly decreasing.
1. Take one training example
2. Feed it to the neural network
3. Calculate its gradient with respect to its parameters
4. Use the calculated gradient to update the weights
Repeat steps 1–4 for all the examples in the training dataset
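• The four steps map directly onto a few lines of code. Here is a minimal sketch of SGD on a toy one-dimensional linear-regression problem (the data and model are illustrative, not from the slides):

```python
import random

# Toy data generated from y = 2x + 1; SGD should recover w ~ 2 and b ~ 1
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b, eta = 0.0, 0.0, 0.1

for epoch in range(100):
    random.shuffle(data)
    for x, y in data:                  # step 1: take one training example
        y_hat = w * x + b              # step 2: feed it to the model
        grad_w = 2 * (y_hat - y) * x   # step 3: gradient of (y_hat - y)**2
        grad_b = 2 * (y_hat - y)
        w -= eta * grad_w              # step 4: update the parameters
        b -= eta * grad_b

print(round(w, 2), round(b, 2))        # close to the true values 2 and 1
```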
3.4. Types of Gradient Descent…
ii. Batch gradient descent
• Batch gradient descent sums the error for each example in a
training set, updating the model only after all training examples
have been evaluated.
• We take the average of the gradients of all the training examples
and then use that mean gradient to update our parameters.
• Hence, there is just one step of gradient descent in one epoch.
• While this batching provides computation efficiency, it can still
have a long processing time for large training datasets as it still
needs to store all of the data into memory.
• Batch gradient descent also usually produces a stable error
gradient and convergence, but sometimes that convergence point
isn’t the most ideal, finding the local minimum versus the global
one.
3.4. Types of Gradient Descent…
• Some advantages of batch gradient descent are its computational
efficiency: it produces a stable error gradient and a stable
convergence.
• Some disadvantages are that the stable error gradient can
sometimes result in a state of convergence that isn’t the best the
model can achieve.
• It also requires the entire training dataset to be in memory and
available to the algorithm.
3.4. Types of Gradient Descent…
iii. Mini-batch gradient descent
• Mini-batch gradient descent combines concepts from both batch
gradient descent and stochastic gradient descent.
• It splits the training dataset into small batch sizes and performs
updates on each of those batches.
• This approach strikes a balance between the computational
efficiency of batch gradient descent and the speed of stochastic
gradient descent.
• Mini-batch gradient descent strikes a balance between batch
gradient descent and stochastic gradient descent.
• We split the training dataset into small batches, and then we
calculate the gradient of the loss function over each batch.
• At each step, instead of computing the gradients based on the full
training set (as in Batch GD) or based on just one instance (as in
SGD), mini-batch GD computes the gradients on small random
sets of instances called mini-batches.
• The main advantage of Minibatch GD over SGD is that you can
get a performance boost from hardware optimization of matrix
operations, especially when using GPUs.
• The algorithm’s progress in parameter space is less erratic than
with Stochastic GD, especially with fairly large mini-batches.
• This approach has several advantages:
– It is faster than batch gradient descent, especially for large datasets.
– It is less noisy than stochastic gradient descent, which can lead to
faster convergence.
– It is less likely to overfit than batch gradient descent.
3.4. Types of Gradient Descent…
• Mini-batch gradient descent is the go-to method since it is a
combination of the concepts of SGD and batch gradient descent.
• It simply splits the training dataset into small batches and performs
an update for each of those batches.
• This creates a balance between the robustness of stochastic gradient
descent and the efficiency of batch gradient descent.
• Instead of updating the weights after just one example or after the
entire dataset of examples, you choose a batch size of examples,
after which the weights are updated.
• Common mini-batch sizes range between 50 and 256, but like any
other machine learning technique, there is no clear rule because it
varies for different applications.
• This is the go-to algorithm when training a neural network and it is
the most common type of gradient descent within deep learning.
• Figure Batch gradient descent approach
• Figure Stochastic gradient descent approach
Figure Minibatch
gradient descent
Mini-batch gradient descent
1. Pick a mini-batch from the training dataset
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient calculated in step 3 to update the weights
Repeat steps 1–4 for the remaining mini-batches
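• The same toy problem as in the SGD sketch, but now updating once per mini-batch using the mean gradient (batch size 2, purely illustrative):

```python
# Mini-batch gradient descent on toy data generated from y = 2x + 1
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b, eta, batch_size = 0.0, 0.0, 0.1, 2

for epoch in range(200):
    for i in range(0, len(data), batch_size):   # step 1: pick a mini-batch
        batch = data[i:i + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:                      # steps 2-3: mean gradient
            err = (w * x + b) - y
            grad_w += 2 * err * x / len(batch)
            grad_b += 2 * err / len(batch)
        w -= eta * grad_w                       # step 4: one update per batch
        b -= eta * grad_b

print(round(w, 2), round(b, 2))                 # close to 2 and 1
```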
• Comparison of the effects of each type of gradient descent on the
learning process.
3.5. Regularization
• In machine learning, an error is a measure of how accurately an
algorithm can make predictions for a previously unseen dataset.
• On the basis of these errors, the best machine learning model is
selected to perform best on the particular dataset.
• There are two types of errors that can be reduced if we take the
proper action:
– Bias: Bias measures the difference between the model’s prediction
and the target value. If the model is oversimplified, then the
predicted value would be far from the ground truth resulting in high
bias.
– Variance: Variance is the measure of the inconsistency of different
predictions over varied datasets. If the model’s performance is tested
on different datasets, the closer the prediction, the lesser the
variance. Higher variance is an indication of overfitting in which the
model loses the ability to generalize.
3.5. Regularization…
• Ways to reduce high variance:
– Reduce the input features or the number of parameters when the
model is overfitted.
– Do not use an overly complex model.
– Increase the training data.
– Increase the regularization term.
• High bias mainly occurs due to an overly simple model.
• Below are some ways to reduce the high bias:
– Increase the input features when the model is underfitted.
– Decrease the regularization term.
– Use a more complex model, for example by increasing the number
of layers.
3.5. Regularization
Overfitting (High Variance)
• One of the most common problems in deep learning is overfitting.
• Overfitting occurs when a deep learning model fits exactly on its
training data.
• Generalization of a model to new data is ultimately what allows us
to use machine learning algorithms to make predictions and
classify data.
• When overfitting happens, the algorithm unfortunately cannot
perform accurately against unseen data, defeating its purpose.
• In overfitting (high variance), the model performs exceptionally
well on training data but is not able to predict test data properly.
• Here, the model tries to fit the training data entirely and ends up
memorizing the data patterns and the noise and random
fluctuations.
3.5. Regularization…
• These models fail to generalize and perform poorly in the case of
unseen data scenarios, defeating the model's purpose.
• When machine learning algorithms are constructed, they leverage
a sample dataset to train the model.
• However, when the model trains for too long on sample data or
when the model is too complex, it can start to learn the “noise,”
or irrelevant information, within the dataset.
• When the model memorizes the noise and fits too closely to the
training set, the model becomes “overfitted,” and it is unable to
generalize well to new data.
• If a model cannot generalize well to new data, then it will not be
able to perform the classification task that it was intended for.
• How do we identify overfitting?
– If the training data has a low error rate and the test data has a
high error rate, it signals overfitting.
• The problem of overfitting (high variance) in neural networks
can be handled in many ways.
• One method is expanding the training set to include more data.
• This can increase the accuracy of the model by providing more
opportunities to parse out the dominant relationship among the
input and output variables.
• This is a more effective method when clean, relevant data is
injected into the model.
• The other most commonly used approaches are:
1. Early stopping
2. L1 regularization
3. L2 regularization
4. Dropout
• Training set error vs test set error as the number of epochs
increases during model training.
3.5. Regularization…
i. Early Stopping
• Early stopping was invented in the early days of neural networks.
• In early stopping approach, the training of the network is stopped
before it overfits to the training data.
• The training should be stopped when generalization performance
starts to decrease.
• Exactly when this occurs can be clear in some cases but less so in
others.
• Therefore, a problem with this method can be deciding an
appropriate stopping criterion.
• A common stopping criterion is to stop training when the
validation loss has not reached a new minimum in the last 5
epochs.
• Hence, continued improvement does not trigger early stopping.
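• A minimal sketch of that patience-based criterion, run here on a made-up sequence of validation losses (the training loop itself is omitted):

```python
def early_stopping(val_losses, patience=5):
    # Stop when the validation loss has not reached a new minimum
    # in the last `patience` epochs.
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # new minimum: reset the counter
        elif epoch - best_epoch >= patience:
            return epoch, best               # stop early
    return len(val_losses) - 1, best

# Illustrative losses: improving at first, then overfitting sets in
losses = [0.9, 0.7, 0.6, 0.55, 0.54, 0.56, 0.58, 0.60, 0.63, 0.66, 0.70]
print(early_stopping(losses))  # (9, 0.54): stop at epoch 9, best loss 0.54
```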
Figure Early stopping
3.5. Regularization…
ii. L1 regularization
• L1 regularization adds a penalty term proportional to the absolute
values of the weights to the loss function:
Loss = Loss0 + λ Σ |w|
• Because this penalty pushes weights toward exactly zero, L1 tends
to produce sparse models.
3.5. Regularization…
iii. L2 regularization
• L2 regularization (also known as weight decay) adds a penalty
term proportional to the squares of the weights:
Loss = Loss0 + λ Σ w²
• L2 keeps the weights small without forcing them all the way to
zero.
3.5. Regularization…
• λ is called the regularization rate and it is an additional
hyperparameter introduced into the network.
• Simply speaking, λ determines how much we regularize the
model.
• To stop regularizing, we can simply set λ to 0.
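• A minimal sketch of how the λ penalty enters the total loss (L2 shown; swapping w * w for abs(w) gives L1; the base loss and weights are illustrative):

```python
def l2_penalty(weights, lam):
    # Regularization term: lambda * sum of squared weights
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.2, 3.0]
data_loss = 0.42                    # illustrative unregularized loss
for lam in [0.0, 0.01, 0.1]:
    total = data_loss + l2_penalty(weights, lam)
    print(lam, round(total, 4))     # lam = 0 turns regularization off
```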
iv. Dropout
• The term "dropout" refers to dropping out nodes from the input
and hidden layers of a neural network.
• During the training, randomly selected neurons are turned off or
dropped out.
• All the forward and backwards connections with a dropped node
are temporarily removed, thus creating a new network
architecture out of the parent network.
• The nodes are dropped with a dropout probability p.
• When we apply dropout to a hidden layer, zeroing out each
hidden unit with probability p, the result can be viewed as a
network containing only a subset of the original neurons.
• During training, dropout is implemented by only keeping a
neuron active with probability 1 − p, or setting it to zero
otherwise.
• If a node is dropped, then all incoming and outgoing connections
from that node need to be dropped as well.
3.5. Regularization…
• Dropout prevents the network from becoming too dependent on
any one or any small combination of neurons.
• Dropping results in more independent internal representations
being learned by the network, making the network less sensitive to
the specific weights of individual neurons.
• Such a network is better generalized and has fewer chances of
producing overfitting.
• Dropout has become one of the most favored methods of
preventing overfitting in deep neural networks.
• Typically, we disable dropout at test time. Given a trained model
and a new example, we do not drop out any nodes.
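• A minimal sketch of dropout at training time. It uses "inverted dropout" (scaling the kept activations by 1/(1 − p)), a common implementation choice that lets the network be used unchanged at test time:

```python
import random

def dropout(activations, p, training=True):
    # p is the probability of dropping (zeroing) each unit
    if not training:
        return activations                # dropout is disabled at test time
    scale = 1.0 / (1.0 - p)               # keep expected activations unchanged
    return [0.0 if random.random() < p else a * scale for a in activations]

random.seed(0)
h = [0.2, 0.9, 0.5, 0.7, 0.1]
print(dropout(h, p=0.5))                  # roughly half the units zeroed out
print(dropout(h, p=0.5, training=False))  # unchanged at test time
```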
Figure How dropout works, diagrammatically
3.5. Regularization…
Underfitting (High Bias)
• Underfitting occurs when the model has not trained for enough
time, or the input data is not informative enough to determine
a meaningful relationship between the input and output variables.
3.5. Regularization…
• When the bias is high, the assumptions made by the model are too
basic, and the model can't capture the important features of the data.
• This means that the model hasn’t captured patterns in the training
data and hence cannot perform well on the testing data too.
• If this is the case, the model cannot perform well on training data
as well as new data (test data).
• This instance, where the model cannot find patterns in our
training set and hence fails for both seen and unseen data, is
called Underfitting.
Exercise
1. Use a wine dataset to classify wines into 10 categories based
on their chemical properties. The dataset has 13 chemical
properties of wines: Alcohol content, Malic acid, Ash, Alkalinity
of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid
phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of
diluted wines, and Proline.
• Based on these properties, wine can be divided into 10 different
quality scores (1 to 10).
• You can download the dataset from:
https://www.kaggle.com/datasets/yasserh/wine-quality-dataset
• Using this dataset, design and train a neural network that
classifies wines into 10 quality scores.
Exercise
2. Use a house dataset to predict the price of a house. The dataset
contains the number of bedrooms and bathrooms, number of
floors, size of the lot in square feet, the year the property was
built, the city, the country, price, etc. Based on these data
attributes, create an ANN that can predict the price of the
house.
• You can download the dataset from
https://www.kaggle.com/datasets/fratzcan/usa-house-prices