ARTIFICIAL NEURAL NETWORKS

Biological Neural Network


Consider the schematic view of a biological neuron. It consists of a bush of thin fibers called dendrites, a cell body (also known as the soma), a long cylindrical fiber known as the axon, synapses, and other structures. A synapse is that part of a neuron where its axon makes contact with the dendrites of a neighboring neuron. A neuron collects information from its neighbors with the help of its dendrites. The collected information is summed up in the cell body before it passes through the axon. The information is then transferred to the next neuron through the synapse, using the difference in concentration of Na+ and K+ ions between them.

Artificial Neuron
Consider the schematic view of an artificial neuron, in which a biological neuron is modeled artificially.

Let us suppose that there are n inputs (I1, I2, ..., In) to a neuron j. The weights connecting the n inputs to the jth neuron are represented by [W] = [w1j, w2j, ..., wnj]. The function of the summing junction of an artificial neuron is to collect the weighted inputs and sum them up; it is thus similar to the function of the combined dendrites and soma. The activation function (also known as the transfer function) performs the task of the axon and synapse. The output of the summing junction may sometimes become equal to zero, and to prevent such a situation a bias of fixed value bj is added to it. Thus, the input to the transfer function f is determined as

netj = I1w1j + I2w2j + ... + Inwnj + bj

The output of the summing junction is also called the linear combiner output, induced field input, net input, or pre-activation value. The output of the jth neuron, that is Oj, can be obtained as

Oj = f(netj)
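As a quick illustration of these two equations, the following minimal NumPy sketch computes the net input and output of a single neuron j; the input values, weights, bias, and the choice of a sigmoid transfer function are all assumptions made for demonstration:

import numpy as np

# Assumed example values: three inputs I1..I3, their weights, and a bias b_j
I = np.array([0.8, 0.6, 0.4])            # inputs I1, I2, I3
W = np.array([0.1, 0.3, -0.2])           # weights w1j, w2j, w3j
b_j = 0.35                               # fixed bias added to the summing junction

net_j = I @ W + b_j                      # linear combiner output (net input)
O_j = 1.0 / (1.0 + np.exp(-net_j))       # output O_j = f(net_j), here f = sigmoid
print(net_j, O_j)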

Do it yourself
Q.1 For the network shown in Figure below, calculate the net input to the output neuron.

Q.2 For the network shown in Figure below, calculate the net input to the output neuron.

Difference between Artificial Neural Network (ANN) and Biological Neural Network (BNN):

| Artificial Neural Network (ANN) | Biological Neural Network (BNN) |
|---|---|
| Processing speed is fast compared to a BNN; cycle time for execution is in nanoseconds. | Slow in processing information; cycle time for execution is in milliseconds. |
| Can perform massive parallel operations simultaneously, like a BNN. | Can perform massive parallel operations simultaneously. |
| Size and complexity depend on the application chosen, but it is less complex than a BNN. | The size and complexity of a BNN are greater than an ANN's, with about 10^11 neurons and 10^15 interconnections. |
| Information is stored in contiguous memory locations. | Information is stored in the interconnections, i.e., in synapse strengths. |
| To store new information, old information is deleted if there is a shortage of storage. | New information is stored in the interconnections, and old information is retained with lesser strength. |
| There is no fault tolerance: corrupted information cannot be processed. | Fault tolerant: information can be stored and retrieved even if an interconnection is disconnected. |
| The control unit processes the information. | The chemicals present in neurons do the processing. |

Threshold is a set value based upon which the final output of the network may be calculated. The threshold value is used in the activation function: a comparison is made between the calculated net input and the threshold to obtain the network output. For each and every application, there is a threshold limit. Consider a direct current (DC) motor: if its maximum speed is 1500 rpm, then the threshold based on speed is 1500 rpm, and if the motor is run at a speed higher than this threshold, it may damage the motor coils. Similarly, in neural networks, the activation functions are defined based on the threshold value and the output is calculated. An activation function using a threshold can be defined as

f(net) = +1 if net ≥ θ; −1 if net < θ

where θ is the fixed threshold value.

BASIC MODELS OF ARTIFICIAL NEURAL NETWORK


The models of ANN are specified by the three basic entities namely:
1.​ The model’s synaptic interconnections
2.​ The training or learning rules adopted for updating and adjusting the connection weights
3.​ Their activation functions

The model’s synaptic interconnections


The network architecture refers to the arrangement of neurons into layers and the connection patterns formed within and between layers. A layer is formed by taking a processing element and combining it with other processing elements. Practically, a layer represents a stage of processing: going stage by stage, the input stage and the output stage are linked with each other, and these linked interconnections lead to the formation of various network architectures. There exist five basic types of neuron connection architectures. They are:

i.​ Single-layer feed-forward Network


When a layer of the processing nodes is formed, the inputs can be connected to these nodes with
various weights, resulting in a series of outputs, one per node. Thus, a single-layer feed-forward
network is formed.

Here are a few important functions of the input layer


1.​ The input layer receives raw data (e.g., images, text, numerical features) and passes it to the
subsequent layers for processing. Each neuron corresponds to one input feature (pixel, word, sensor
reading, etc.).
2.​ Proper data formatting (normalization, flattening) is essential before data is fed to the input layer.
3.​ It does not perform computations (like weighted sums or activations). Instead, it distributes the data
to the first hidden layer. Unlike hidden/output layers, input layer neurons have no trainable
parameters (weights/biases). The input layer is passive i.e. it does not have an activation function
hence activations are applied in later layers.

A note about the role of weight


1.​ Weights quantify the importance of each input feature (or neuron) in influencing the output. A high weight value for an input parameter shows that the feature is important, while a near-zero weight implies the feature is irrelevant.
2.​ Weights are adjusted during backpropagation to minimize the loss function (e.g., MSE,
cross-entropy) i.e. they play a significant role in the learning.

ii.​ Multilayer feed-forward network


A multilayer feed-forward network is formed by the interconnection of several layers. The input layer receives the input, and this layer has no function except buffering the input signal. The output layer generates the output of the network. Any layer formed between the input and output layers is called a hidden layer. The hidden layer is internal to the network and has no direct contact with the external environment. It should be noted that there may be zero to several hidden layers in an ANN. The greater the number of hidden layers, the greater the complexity of the network; this may, however, provide a more efficient output response. In a fully connected network, every output from one layer is connected to each and every node in the next layer.
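To make the layered arrangement concrete, here is a minimal sketch of one fully connected feed-forward pass; the layer sizes, weight values, and the ReLU non-linearity are all assumed for illustration:

import numpy as np

x = np.array([0.5, -0.2, 0.1])                 # the input layer only buffers this signal
W1, b1 = np.full((3, 4), 0.1), np.zeros(4)     # input -> hidden (fully connected)
W2, b2 = np.full((4, 2), 0.1), np.zeros(2)     # hidden -> output

h = np.maximum(0.0, x @ W1 + b1)               # hidden layer with ReLU non-linearity
o = h @ W2 + b2                                # output layer response
print(h, o)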

Difference between hidden layer and output layer


| Hidden Layer | Output Layer |
|---|---|
| Extracts and transforms intermediate features from the input data; the learned features are often abstract and hard to interpret. | Produces the final prediction (e.g., class probabilities, regression values), which is human-readable (e.g., class labels, scalars). |
| Introduces non-linearity (via activation functions) to model complex relationships; typically uses non-linear activations like ReLU or Tanh. | Maps the learned features to the target format using task-specific activations, e.g., sigmoid for binary classification, softmax for multi-class classification. |
| In case of error, propagates the error backward. | In case of error, computes the initial loss gradient. |

iii.​ Single node with its own feedback


❖​ A network is said to be a feed-forward network if no neuron in the output layer is an input to a
node in the same layer or in the preceding layer.
❖​ When outputs can be directed back as inputs to same or preceding layer nodes then it results in
the formation of feedback networks.
❖​ If the feedback of the output of the processing elements is directed back as input to the
processing elements in the same layer then it is called lateral feedback.
❖​ Recurrent networks are feedback networks with closed loop. The loops allow information to be
retained over time (feedback). It is used for sequential data like time series or natural language
processing. It is used in Speech recognition and language modeling.
❖​ The simplest recurrent neural network is a single neuron with feedback to itself.

iv.​ Single-layer recurrent network


A single-layer recurrent network has a feedback connection in which a processing element's output can be directed back to the processing element itself, to another processing element, or to both.

v.​ Multilayer recurrent network
A processing element output can be directed back to the nodes in a preceding layer, forming a
multilayer recurrent network. Also, in these networks, a processing element output can be directed back
to the processing element itself and to other processing elements in the same layer.

Note: Maxnet is a type of neural network used for competitive learning, specifically to determine the maximum activation among a set of neurons. It is commonly used in winner-take-all (WTA) networks, where only the most strongly activated neuron remains active while suppressing the others. Each neuron excites itself and inhibits the others through an inhibitory weight −ε, where ε is a small positive constant. Over multiple iterations, neurons with lower activations are suppressed until only the neuron with the highest activation remains: the network iteratively updates the neuron activations until a single neuron dominates while the others are completely suppressed.
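The Maxnet update can be sketched in a few lines of NumPy; the initial activations and the value of ε below are assumed for illustration:

import numpy as np

a = np.array([0.3, 0.5, 0.7, 0.9])    # initial (distinct) activations, assumed
eps = 0.2                             # inhibitory weight magnitude, 0 < eps < 1/m

while np.count_nonzero(a) > 1:
    # each unit keeps its own activation and is inhibited by the sum of the others
    a = np.maximum(0.0, a - eps * (a.sum() - a))
print(a)                              # only the unit with the highest activation survives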

The training or learning rules adopted for updating and adjusting the connection weights
The main property of an ANN is its capability to learn. Learning or training is a process by means of which a
neural network adapts itself to a stimulus by making proper parameter adjustments, resulting in the production
of desired response. Broadly, there are two kinds of learning in ANNs:
1.​ Parameter learning: It updates the connecting weights in a neural net.
2.​ Structure learning: It focuses on the change in network structure (which includes the number of
processing elements as well as their connection types).

The above two types of learning can be performed simultaneously or separately. Apart from these two
categories of learning, the learning in an ANN can be generally classified into three categories as: supervised
learning; unsupervised learning & reinforcement learning.

1.​ Supervised Learning
Each input vector requires a corresponding target vector, which represents the desired output. The
input vector along with the target vector is called a training pair. The network here is informed precisely
about what should be emitted as output.

During training, the input vector is presented to the network, which results in an output vector. This
output vector is the actual output vector. Then the actual output vector is compared with the desired
(target) output vector. If there exists a difference between the two output vectors then an error signal is
generated by the network. This error signal is used for adjustment of weights until the actual output
matches the desired (target) output. In this type of training, a supervisor or teacher is required for error
minimization. Hence, the network trained by this method is said to be using supervised training
methodology. In supervised learning, it is assumed that the correct "target" output values are known for
each input pattern.

Key Features:
Requires labelled training data.
Uses loss functions to measure prediction accuracy.
Common algorithms: Neural Networks, Support Vector Machines, Decision Trees

Step-by-Step Training Process:


1.​ Initialize weights and biases.
2.​ Compute output for each input data point.
3.​ Calculate the error (difference between predicted output and actual output).
4.​ Update weights and biases using an optimization algorithm (e.g., Gradient Descent).
5.​ Repeat the process for several epochs (iterations) until the error is minimized.
Example: In binary classification (e.g., XOR problem), the network adjusts weights to correctly map
inputs to the correct class.
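The five steps above can be condensed into a short gradient-descent loop for a single sigmoid neuron; the AND-style data, learning rate, and epoch count are assumptions made purely for illustration:

import numpy as np

# Toy labelled training pairs (assumed): inputs X with target outputs y (AND gate)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])

rng = np.random.default_rng(0)
w, b, lr = rng.normal(size=2), 0.0, 0.5        # step 1: initialize weights and bias

for epoch in range(5000):                      # step 5: repeat for several epochs
    o = 1 / (1 + np.exp(-(X @ w + b)))         # step 2: compute output (sigmoid neuron)
    grad = (o - y) * o * (1 - o)               # step 3: error times sigmoid derivative
    w -= lr * X.T @ grad / len(y)              # step 4: gradient-descent weight update
    b -= lr * grad.mean()
print(np.round(o, 2))                          # outputs approach the targets 0, 0, 0, 1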

Regression technique in supervised learning


Regression is a type of supervised learning where the goal is to predict a continuous output variable
based on one or more input features. The model learns the relationship between the input features and
the continuous output.

Scenario: You want to predict the price of a house based on its size (in square feet) and other features
like the number of bedrooms, location, and age of the house.

Input Features:
●​ Size of the house (square feet)
●​ Number of bedrooms
●​ Location
●​ Age of the house
Output: House price (a continuous value)

A regression model, such as Linear Regression, can be used to predict the house price. The model
learns the relationship between the input features and the house price during training. For example, it
might learn that larger houses with more bedrooms in desirable locations tend to have higher prices.
Equation: In simple linear regression, the relationship can be represented as

price = θ0 + θ1·size + θ2·bedrooms + θ3·location + θ4·age

where θ0, θ1, θ2, θ3, θ4 are the parameters learned by the model.
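A least-squares fit of these parameters takes only a few lines of NumPy; the house data below is invented purely for illustration:

import numpy as np

# Assumed toy data: [size_sqft, bedrooms, location_score, age] -> price in $1000
X = np.array([[1400, 3, 8, 10],
              [1600, 3, 7, 5],
              [1700, 4, 9, 3],
              [1100, 2, 5, 20],
              [2000, 4, 9, 1]], dtype=float)
y = np.array([245., 312., 379., 150., 460.])

A = np.hstack([np.ones((len(X), 1)), X])       # column of 1s multiplies theta0
theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # learn theta0..theta4 by least squares

new_house = np.array([1., 1500., 3., 8., 7.])  # 1 for the theta0 term, then the features
print(new_house @ theta)                       # predicted price for the new house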

Classification technique in supervised learning


Classification is a type of supervised learning where the goal is to predict a discrete label or category
based on one or more input features. The model learns to assign input data to one of several
predefined classes.

Scenario: You want to classify animals into different categories based on their features, such as the
number of legs, type of skin covering, and whether they can fly.

Input Features:
●​ Number of legs
●​ Type of skin covering (e.g., fur, feathers, scales)
●​ Ability to fly (yes/no)
Output: Animal category (e.g., mammal, bird, reptile, amphibian)

A classification model, such as Logistic Regression, Decision Trees, or Support Vector Machines, can
be used to classify the animals. The model learns the relationship between the input features and the
animal category during training. For example, it might learn that animals with feathers and the ability to
fly are likely to be birds.
Decision Boundary: The model creates a decision boundary that separates the different classes. For
instance, it might determine that if an animal has feathers and can fly, it should be classified as a bird.
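If scikit-learn is available, this scenario can be sketched in a few lines; the feature encoding and the tiny training set are assumptions made for illustration:

from sklearn.tree import DecisionTreeClassifier

# Assumed encoding: [legs, skin (0=fur, 1=feathers, 2=scales), can_fly (0/1)]
X = [[4, 0, 0], [2, 1, 1], [4, 2, 0], [2, 1, 1], [4, 0, 0]]
y = ["mammal", "bird", "reptile", "bird", "mammal"]

clf = DecisionTreeClassifier().fit(X, y)   # learns a decision boundary from the labels
print(clf.predict([[2, 1, 1]]))            # feathers + can fly -> expected "bird"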

2.​ Un-supervised Learning
The input vectors of similar type are grouped without the use of training data that specifies how a member of each group looks or to which group a member belongs. In the training process, the network receives
the input patterns and organizes these patterns to form clusters. When a new input pattern is applied,
the neural network gives an output response indicating the class to which the input pattern belongs. If
for an input, a pattern class cannot be found then a new class is generated.

It is clear that there is no feedback from the environment to inform what the outputs should be or whether the outputs are correct. In this case, the network must itself discover patterns, regularities, features or categories in the input data and the relations of the input data to the output. While discovering these features, the network undergoes changes in its parameters. This process is called self-organization, in which exact clusters are formed by discovering similarities and dissimilarities among the objects.
Example: Clustering, Anomaly detection

The two popular learning algorithms are self organizing maps (SOMs) and k-means Clustering.

Self-Organizing Maps (SOMs)


●​ Feature mapping is a process that converts patterns of arbitrary dimensionality into the response of a one- or two-dimensional array of neurons. A network performing such a mapping is called a feature map.
●​ Apart from its capability to reduce dimensionality, it has to preserve the neighborhood relations of the input patterns, i.e., it has to obtain a topology-preserving map. To obtain such feature maps, a self-organizing neural array is required, consisting of neurons arranged in a one-dimensional or two-dimensional array.

Topological preservation refers to the ability of the Kohonen Self-Organizing Map (KSOM) to
maintain the spatial relationships between input data points when mapping them onto a
lower-dimensional space (typically a 1D or 2D grid).

●​ Similar input vectors should be mapped to neighboring neurons in the output map.
●​ The network should retain the structure of the input data after training.

To depict this, a typical network structure where each component of the input vector x is connected to
each of the nodes is shown in figure below

On the other hand, if the input vector is two-dimensional, the inputs, say x(a, b), can arrange
themselves in a two-dimensional array defining the input space (a, b) as in Figure below; Here, the two
layers are fully connected

A typical architecture of Kohonen self-organizing feature map (KSOFM) is shown in below-

The architecture consists of two layers: input layer and output layer (cluster). There are “n” units in the
input layer and “m” units in the output layer. Basically, here the winner unit is identified by using either
dot product or Euclidean distance method and the weight update using Kohonen learning rules is
performed over the winning cluster unit. At the time of self-organization, the weight vector of the cluster
unit which matches the input pattern very closely is chosen as the winner unit. The closeness of the

weight vector of the cluster unit to the input pattern may be based on the square of the minimum
Euclidean distance. The weights are updated for the winning unit and its neighboring units.

The steps involved in the training algorithm are as shown below.


Step-1: Initialize the weights wij: Random values may be assumed. They can be chosen as the same
range of values as the components of the input vector. If information related to distribution of clusters is
known, the initial weights can be taken to reflect that prior knowledge. Initialize the learning rate α: It
should be a slowly decreasing function of time.
Step 2: Perform Steps 3–8 when the stopping condition is false.
Step-3: Take the sample training input vector x from the input layer.
Step 4: Compute the square of the Euclidean distance for each cluster unit j = 1 to m:

D(j) = Σi=1 to n (xi − wij)²

Find the winning unit index J such that D(J) is minimum. (In Step 4, the dot product method can also be used to find the winner; it is basically the calculation of the net input, and the winner is the unit with the largest dot product.)
Step-5: For all units j within a specific neighborhood of J and for all i, calculate the new weights:
wij(new) = wij(old) + α[xi − wij(old)]
or
wij(new) = (1 - α)wij(old) + αxi
Step-6: Repeat Steps 3–5 until the change in the weights is negligible.
Step-7: Update the learning rate using the formula α(t +1)= 0.5α(t).
Step 8: Test for stopping condition of the network.
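A compressed NumPy sketch of Steps 1–8 (with winner-only updates, i.e., a neighborhood of just the winning unit) is shown here; the random initialization and the halving of α follow the steps above, while the epoch count and seed are assumed. It is applied to the four vectors of the example below:

import numpy as np

def train_ksom(X, m, alpha=0.5, epochs=10, seed=0):
    # Step 1: random initial weights w_ij, one column per cluster unit
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[1], m))
    for _ in range(epochs):                              # Steps 2-8
        for x in X:                                      # Step 3: next training vector
            D = ((x[:, None] - W) ** 2).sum(axis=0)      # Step 4: squared distances D(j)
            J = int(np.argmin(D))                        # winning unit index J
            W[:, J] += alpha * (x - W[:, J])             # Step 5: update winner's weights
        alpha *= 0.5                                     # Step 7: decay the learning rate
    return W

X = np.array([[0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]], dtype=float)
print(train_ksom(X, m=2))                                # two cluster units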

An Example
Construct a Kohonen self-organizing map to cluster the four given vectors, [0 0 1 1], [1 0 0 0], [0 1 1 0]
and [0 0 0 1]. The number of clusters to be formed is two. Assume an initial learning rate of 0.5.

Do it yourself
Consider a Kohonen self-organizing net with two cluster units and five input units. The weight vectors
for the cluster units are given by
w1 = [1.0 0.9 0.7 0.5 0.3]
w2 = [0.3 0.5 0.7 0.9 1.0]
Use the square of the Euclidean distance to find the winning cluster unit for the input pattern x = [0.0 0.5 1.0 0.5 0.0]. Using a learning rate of 0.25, find the new weights for the winning unit.

Applications of SOMs
●​ Data Clustering: Identifying patterns in customer behavior, genetics, and more.
●​ Anomaly Detection: Detecting fraud or unusual patterns in financial transactions.
●​ Feature Extraction: Reducing data dimensions for visualization and analysis.
●​ Image Recognition: Organizing images based on similarities.

K-means clustering

Clustering is the task of grouping similar data points together based on their features.

K-means clustering is an iterative algorithm that divides an unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of similar properties. The goal is to maximize intra-cluster similarity and minimize inter-cluster similarity. Intra-cluster similarity means elements in the same cluster should be close to one another, i.e., the Euclidean distance between them should be as small as possible; inter-cluster similarity means the Euclidean distance between the centroids of two clusters should be as large as possible, i.e., there should be no common element between two clusters. The number of clusters is represented by the letter k. The algorithm discovers patterns without prior knowledge of groups, i.e., it falls under the category of unsupervised learning.
Here are the steps of K-Means clustering:
1.​ Choose k: Select the number of clusters (e.g., k = 2).
2.​ Initialize Centroids: Randomly pick k data points as initial centroids.

3.​ Assign Clusters:


Calculate Euclidean distance between each point and centroids.
Assign each point to the nearest centroid.

4.​ Update Centroids: Recompute centroids as the mean of all points in the cluster.

5.​ Repeat step 3 & 4: Reassign points and update centroids until convergence (no further
changes).

The step of computing the centroid and assigning all the points to the cluster based on their distance
from the centroid is a single iteration. There are essentially three stopping criteria that can be adopted
to stop the K-means algorithm:
1.​ Centroids of newly formed clusters do not change
2.​ Points remain in the same cluster
3.​ Maximum number of iterations is reached
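Putting steps 1–5 and the first stopping criterion together gives a compact NumPy sketch (it assumes no cluster ever becomes empty, which holds for well-separated toy data like the example below):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]     # step 2: random initial centroids
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # step 3: squared distances
        labels = d.argmin(axis=1)                        # assign points to nearest centroid
        newC = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 4
        if np.allclose(newC, C):                         # stop: centroids no longer change
            break
        C = newC
    inertia = ((X - C[labels]) ** 2).sum()               # sum of squared distances to centroids
    return labels, C, inertia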

We have to understand the effect of choosing the value of k. Before this, let us understand the meaning of inertia, which is the sum of squared distances of points to their centroid (it measures cluster compactness).
●​ If the value of k is too small, the clusters will be big, resulting in high inertia; it gives poor insight because distinct groups are merged into the same cluster.
●​ If the value of k is too large, the clusters will be small, resulting in low inertia; but natural groups get split into fragmented clusters, which again gives poor insight.
The impact of increasing the value of k can be understood like stretching a rubber band: initial effort (low k) yields big changes (large drops in inertia); later effort (high k) barely stretches it further.

An Example
Dataset: 12 Customers with Annual Spending ($1000) and Visits/Year
+-------------------------------------------+
| Customer | Spending ($1000) | Visits/Year |
|----------|------------------|-------------|
| 1 | 5 | 2 |
| 2 | 10 | 4 |
| 3 | 8 | 3 |
| 4 | 50 | 15 |
| 5 | 55 | 18 |
| 6 | 60 | 20 |
| 7 | 100 | 40 |
| 8 | 95 | 35 |
| 9 | 110 | 45 |
| 10 | 120 | 50 |
| 11 | 4 | 1 |
| 12 | 6 | 2 |
+-------------------------------------------+

Step-01: Say the value of k = 2


Step-02: Initialize the random centroids
C₁: Customer-1 (5, 2) C₂: Customer 4 (50, 15)
Iteration-1
For step-03: Assign Clusters; Compute Euclidean distance (squared) for all points:

​Cluster 1: Customers 1, 2, 3, 11, 12


​Cluster 2: Customers 4, 5, 6, 7, 8, 9, 10

For step-04: update centroids
New C₁ = mean of Cluster 1 = ((5+10+8+4+6)/5, (2+4+3+1+2)/5) = (6.6, 2.4)
New C₂ = mean of Cluster 2 = ((50+55+60+100+95+110+120)/7, (15+18+20+40+35+45+50)/7) ≈ (84.3, 31.9)

Iteration-2

For step-03: Assign Clusters; Compute Euclidean distance (squared) for all points:

Clusters remain unchanged → Algorithm converges.

Inertia for k=2


●​ For cluster-01: 2.92 + 14.12 + 2.32 + 8.32 + 0.52 = 28.2​
●​ For cluster-02: 1424 + 1048 + 703 + 275 + 118 + 685 + 1492 = 5745​
●​ Total Inertia: 28.2 + 5745 = 5773​

Application of k-means clustering


●​ Customer segmentation in marketing.
●​ Image compression.
●​ Anomaly detection.
●​ Document clustering.

Differences Between SOMs and K-Means


| Aspect | Self-Organizing Maps | K-means clustering |
|---|---|---|
| Output | Low-dimensional map (2D grid) | Cluster assignments (no visualization) |
| Topology | Preserves topological relationships | Does not preserve topology |
| Representation | Neurons with weight vectors | Centroids (mean of cluster points) |
| Use case | Visualization, dimensionality reduction | Pure clustering |
| Complexity | More complex, involves neighborhood updates | Simpler, only updates centroids |

3.​ Reinforcement Learning
Reinforcement learning is a form of supervised learning because the network receives some feedback
from its environment. However, the feedback obtained here is only evaluative and not instructive. The
external reinforcement signals are processed in the critic signal generator, and the obtained critic
signals are sent to the ANN for adjustment of weights properly so as to get better critic feedback in
future. The critic signal is like a reward or a penalty. The reinforcement learning is also called learning
with a critic as opposed to learning with a teacher, which indicates supervised learning.

Key Features:
●​ Trial-and-error learning.
●​ Uses rewards and penalties as feedback.

Role of Reward and Punishment in Learning


The RL (Reinforcement Learning) agent receives rewards for desirable actions and penalties for undesired ones. RL uses Markov Decision Processes (MDPs) to model sequential decision-making. MDPs are mathematical frameworks for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.
Example: A robot navigating a maze receives positive rewards for correct paths and negative rewards for hitting obstacles.

Use of Neural Networks in Reinforcement Learning


Deep Q-Networks (DQN): Uses deep learning for complex decision-making.
Policy Gradient Methods: Directly learn the optimal action policy.
Applications: Used in game playing, robotic movements, and financial modelling.

Applications of reinforcement learning


●​ Self-driving cars – Learning optimal driving strategies.
●​ Robotics – Automated control systems.
●​ Stock trading – Portfolio management.
●​ AlphaGo (DeepMind) –AlphaGo is an artificial intelligence (AI) program developed by
DeepMind, a subsidiary of Google. It became famous for defeating human world champions in
the board game Go, which is considered one of the most complex games due to its vast number
of possible moves.

| Features | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Definition | Learning from labeled data (input-output pairs) | Learning from unlabeled data to find patterns | Learning through trial and error using rewards and penalties |
| Training Data | Labeled data | Unlabeled data | No predefined data; learns from actions taken |
| Objective | Predict output based on given input | Find structure and patterns in data | Maximize cumulative rewards through interactions |
| Algorithm Examples | Decision Trees, Neural Networks, Support Vector Machines | K-Means Clustering, PCA (Principal Component Analysis), Autoencoders | Q-Learning, Deep Q-Networks, Policy Gradient Methods |
| Example Use Cases | Spam detection, face recognition, fraud detection | Customer segmentation, anomaly detection, topic modeling | Self-driving cars, robotics, game AI |
| Supervision Required | Yes | No | Indirect (via rewards) |

A network is generally trained using either an incremental (also known as a sequential) or a batch mode, the
principles of which are discussed below.

1.​ Incremental Training/On-Line Training


Here, a particular training scenario is passed through the network and depending on its output, the
error is calculated using the corresponding target value. The said error is then propagated in the
backward direction to update the connecting weights of the neurons and biases i.e. the model
continuously updates its weights after processing each instance.

Let us consider the incremental training of an NN using a number of scenarios (say 20), sent one after another. There is a chance that the optimal network obtained after passing the 20th training scenario will be quite different from that obtained after the 1st training scenario.

2.​ Batch Mode of Training/ Off-line


In this approach, the whole training set consisting of a large number of scenarios is passed through the
network and an average error in predictions is determined. It is important to mention that the whole
training set mentioned above may also be called an epoch. The average error in prediction is then
propagated back to update the weights and bias values of the network, so that it can yield a more
accurate prediction. In this mode of training, the necessary data set is to be collected before the actual
commencement of training. As the network is optimized using an average error in predictions, there is a chance of the network being adaptive in nature. The adaptability generally grows due to the interpolation capability of the trained network.

Note: It is important to mention that incremental training is easier to implement and computationally
faster than the batch mode of training.
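The contrast between the two modes can be sketched for a single linear neuron; the toy data and learning rate are assumed:

import numpy as np

X = np.array([0., 1., 2., 3.])
y = np.array([1., 3., 5., 7.])          # assumed toy targets (y = 2x + 1)
lr = 0.05

# Incremental (on-line): weights change after every individual scenario
w, b = 0.0, 0.0
for x_i, y_i in zip(X, y):
    e = (w * x_i + b) - y_i             # error for this one scenario
    w -= lr * e * x_i
    b -= lr * e

# Batch (off-line): one update per epoch, driven by the average error
wb, bb = 0.0, 0.0
e = (wb * X + bb) - y                   # errors for the whole training set (one epoch)
wb -= lr * (e * X).mean()
bb -= lr * e.mean()
print((w, b), (wb, bb))                 # the two modes end the epoch at different weights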

Some important questions


a.​ What are the key components of an artificial neural network (ANN)?
b.​ Explain the function of the input layer in an artificial neural network.
c.​ What role do weights play in the architecture of an ANN?
d.​ How does the output layer differ from the hidden layers in an ANN?
e.​ What is a single-layer artificial neural network (ANN), and how does its architecture differ from a
multi-layer neural network?
f.​ How does a single-layer perceptron (SLP) work in terms of input, processing, and output?
g.​ What is the role of weights and biases in a single-layer neural network?
h.​ How does the activation function influence the output of a single-layer ANN?
i.​ What are the limitations of a single-layer neural network in solving complex problems?

The activation function


An activation function in a neural network determines whether a neuron should be activated or not. It takes the weighted sum of inputs and applies a mathematical function to introduce non-linearity, allowing the network to learn complex patterns. It also aids decision making and normalizes the output: the output is conditioned or dampened in response to large or small activating stimuli and is thus controllable. Nonlinear functions are widely used in multilayer networks compared to linear functions, because when a signal is fed through a multilayer network with linear activation functions, the output obtained is the same as could be obtained using a single-layer network. There are several activation functions. Let us discuss a few in this section:

1.​ Identity function: It is a linear function and can be defined as

f(x) = x for all x

The output here remains the same as the input. The input layer uses the identity activation function.

Identity function

2.​ Binary step function: This function can be defined as

f(x) = 1 if x ≥ θ; 0 if x < θ

where θ represents the threshold value. This function is most widely used in single-layer nets to convert the net input to an output that is binary (1 or 0).

Binary Step function

3.​ Bipolar step function: This function can be defined as

f(x) = +1 if x ≥ θ; −1 if x < θ

where θ represents the threshold value. This function is also used in single-layer nets to convert the net input to an output that is bipolar (+1 or −1).
Bipolar step function

4.​ Sigmoidal functions: The sigmoidal functions are widely used in back-propagation nets because of
the relationship between the value of the functions at a point and the value of the derivative at that point
which reduces the computational burden during training. Sigmoidal functions are of two types:
a.​ Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar sigmoid function. It can be defined as

f(x) = 1 / (1 + e^(−λx))

where λ is the steepness parameter. For the standard binary sigmoid, λ = 1. If more than one input is available to the neuron, use the summed net input for x. The range of the binary sigmoid function is from 0 to 1.

Binary sigmoid function


Do it yourself
Q.1 Obtain the output of the neuron Y for the network shown in Figure below using binary
sigmoidal activation function.

The derivative of f(x) is

f′(x) = λ f(x) [1 − f(x)]

This derivative is important in neural networks because it is used during backpropagation, the process of updating the weights of the network to minimize the error. Here f(x) is the output of the sigmoid function and 1 − f(x) represents the complement of the sigmoid output. The derivative f′(x) tells us how sensitive the output of the sigmoid function is to changes in its input x. It is maximum when f(x) = 0.5 and decreases as f(x) approaches 0 or 1.
Do it yourself
Q.2 Assume that λ = 1 and x = 0.53; compute f’(x)
[Answer: f(0.53)≈0.625 & f′(0.53)≈0.233]
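The answer to Q.2 can be checked with a pair of helper functions; up to rounding, the printed values agree with those quoted above:

import numpy as np

def f(x, lam=1.0):                     # binary (logistic) sigmoid
    return 1.0 / (1.0 + np.exp(-lam * x))

def f_prime(x, lam=1.0):               # derivative: lam * f(x) * (1 - f(x))
    return lam * f(x, lam) * (1.0 - f(x, lam))

print(f(0.53), f_prime(0.53))          # approx. 0.63 and 0.233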

b.​ Bipolar sigmoid function: This function is defined as

f(x) = (1 − e^(−λx)) / (1 + e^(−λx))

where λ is the steepness parameter. For the standard bipolar sigmoid, λ = 1. If more than one input is available to the neuron, use the summed net input for x. The range of the bipolar sigmoid function is between −1 and +1.

Bipolar sigmoid function

Do it yourself
Q.1 Obtain the output of the neuron Y for the network shown in Figure below using bipolar
sigmoidal activation function.

The derivative of f(x) is

f′(x) = (λ/2) [1 + f(x)] [1 − f(x)]

For λ = 1, the derivative of f(x) reduces to

f′(x) = (1/2) [1 − f(x)²]
Do it yourself
Q.2 Assume that λ = 1 and x = 0.53; compute f’(x)
[Answer: f(0.53)≈0.259 & f′(0.53)≈0.466]

The bipolar sigmoidal function is closely related to the hyperbolic tangent function, which is written as

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Hyperbolic tangent function

The derivative of the hyperbolic tangent function is

tanh′(x) = 1 − tanh²(x)

If the network uses binary data, it is better to convert it to bipolar form and use the bipolar sigmoidal
activation function or hyperbolic tangent function.

5.​ Rectified linear unit (ReLU)/Ramp function: The ReLU (ramp) function is defined as

f(x) = max(0, x), i.e., f(x) = x for x ≥ 0 and f(x) = 0 for x < 0

Ramp function
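For reference, the activation functions discussed in this section can be collected into one small sketch; the sample points below are arbitrary:

import numpy as np

def identity(x):                 return x
def binary_step(x, theta=0.0):   return np.where(x >= theta, 1, 0)
def bipolar_step(x, theta=0.0):  return np.where(x >= theta, 1, -1)
def binary_sigmoid(x, lam=1.0):  return 1 / (1 + np.exp(-lam * x))
def bipolar_sigmoid(x, lam=1.0): return (1 - np.exp(-lam * x)) / (1 + np.exp(-lam * x))
def ramp_relu(x):                return np.maximum(0, x)

x = np.linspace(-2, 2, 5)        # arbitrary sample points
for g in (identity, binary_step, bipolar_step, binary_sigmoid, bipolar_sigmoid, ramp_relu):
    print(g.__name__, np.round(g(x), 3))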

Key Roles of Activation Functions:


1.​ Introducing Non-Linearity: Without an activation function, a neural network would simply be a linear
regression model, regardless of the number of layers. Activation functions allow the network to learn
and represent complex, non-linear relationships between inputs and outputs.
2.​ Enabling Backpropagation: Activation functions provide differentiable gradients, which are essential for
the backpropagation algorithm. Backpropagation uses these gradients to update the weights of the
network, minimizing the error in predictions.
3.​ Determining Output: Activation functions decide whether a neuron should be activated or not, based on
the weighted sum of inputs. This helps in propagating the signal through the network and producing the
final output.

Desirable properties of activation functions


1.​ Non Linearity: The purpose of the activation function is to introduce non-linearity. Non-linear means that
the output cannot be reproduced from a linear combination of the inputs. Without a non-linear activation
function in the network, a NN, no matter how many layers it had, would behave just like a single-layer
perceptron, because summing these layers would give you just another linear function.
2.​ Continuously differentiable: This property is necessary for enabling gradient-based optimization
methods. The binary step activation function is not differentiable at 0, and it differentiates to 0 for all
other values, so gradient-based methods can make no progress with it

3.​ Range: When the range of the activation function is finite, gradient-based training methods tend to be
more stable, because pattern presentations significantly affect only limited weights. When the range is
infinite, training is generally more efficient because pattern presentations significantly affect most of the
weights. In the latter case, smaller learning rates are typically necessary.
4.​ Monotonic: When the activation function is monotonic, the error surface associated with a single-layer
model is guaranteed to be convex.

Some important questions


a.​ What is a Feedforward Neural Network (FNN), and how does it work?
b.​ How does a Multi-Layer Perceptron (MLP) differ from a Single-Layer Perceptron (SLP)?
c.​ What is a Convolutional Neural Network (CNN), and what makes it suitable for image processing
tasks?
d.​ How does a Recurrent Neural Network (RNN) handle sequential data differently from feedforward
networks?

Other than those discussed above, we also have the following types of neural networks.

Convolution Neural Network


CNNs are a feed-forward neural network and a type of deep learning model commonly used for image
processing tasks. They consist of several layers that work together to extract features from images and make
predictions. The main layers in a CNN are: Input Layer, Convolutional Layer, Pooling Layer, Fully
Connected Layer and Output Layer.

The input layer is where the image data is fed into the network. Each pixel in the image is represented as a
value, and these values form the input to the network.

Now let us have a detailed discussion about the convolution layer


The convolutional layer possesses a set of trainable filters. Every filter is spatially small (along the width and height) but extends through the full depth of the input volume. When the forward pass begins, each filter slides across the height and width of the input volume, and the dot product is computed between the input at each position and the entries of the filter. As the filter slides across the height and width of the input volume, a two-dimensional activation (feature) map is produced that gives the responses of that filter at every spatial position. The filters become activated when they come across certain types of visual features (such as edges or color blobs on the first layer, or more specific patterns such as honeycombs on higher layers of the network), and the network learns from the filters that get activated. The convolutional layer consists of the complete set of filters, and each of these filters produces a separate 2-dimensional activation map. These activation maps are stacked along the depth dimension to produce the output volume.
●​ The input presented to the convolutional layer is an n × n × p image, where "n" is the height and width of the image and "p" is the number of channels (e.g., an RGB image has 3 channels, so p = 3; for a black-and-white image p = 1; for some medical images p is more than 3).
●​ The convolutional layer possesses "m" filters of size r × r × q, where "r" is smaller than the dimension of the image and "q" can be the same size as "p" or smaller, and may vary for each filter. The filter size enables the design of a locally connected structure which gets convolved with the image to produce "m" feature maps, each of size (n − r + 1) × (n − r + 1).

An Example
A 5×5 grayscale image might have pixel values like:

Each value represents intensity (0 = black, 255 = white). Say we apply a 3×3 edge-detection kernel over the pixel matrix:

 1  0  -1
 1  0  -1
 1  0  -1

We take the top-left 3×3 region from the image and apply the kernel. Extracted 3×3 region from the image:

 1   2   3
 5   6   7
 9  10  11

Now, we apply the kernel:

(1×1)+(2×0)+(3×−1)+(5×1)+(6×0)+(7×−1)+(9×1)+(10×0)+(11×−1) = −6

Second convolution operation (next 3×3 region, sliding right): move the filter one step to the right. New 3×3 region from the image:

 2   3   4
 6   7   8
10  11  12

Now, we apply the kernel:

(2×1)+(3×0)+(4×−1)+(6×1)+(7×0)+(8×−1)+(10×1)+(11×0)+(12×−1) = −6

Continuing for the other regions and following the same 3×3 sliding process, we fill up the feature map. For this image every position yields −6, so the final feature map output is

-6  -6  -6
-6  -6  -6
-6  -6  -6
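The sliding-window computation can be reproduced with a short NumPy routine; since the original pixel matrix is not shown, a 5×5 ramp image with the same row structure as the extracted regions is assumed:

import numpy as np

def conv2d(img, kernel):
    r = kernel.shape[0]                       # filter size r x r
    n_out = img.shape[0] - r + 1              # feature-map size: n - r + 1
    out = np.zeros((n_out, img.shape[1] - r + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the filter and the region under it
            out[i, j] = (img[i:i+r, j:j+r] * kernel).sum()
    return out

kernel = np.array([[1, 0, -1]] * 3)           # the vertical edge-detection kernel used above
img = np.arange(1., 26.).reshape(5, 5)        # stand-in 5x5 ramp image (assumed values)
print(conv2d(img, kernel))                    # every position yields -6, as computed above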

Now let us have a detailed discussion about the pooling layer


Pooling layers are placed between successive convolutional layers. The purpose of a pooling layer between convolutional layers is to gradually decrease the spatial size of the representation and to reduce the computation in the network. This placement of the pooling layer also controls overfitting. The generally used pooling mechanism is "max pooling".
●​ Each of the feature maps then gets pooled (sub-sampled) based on maximum or average pooling over
r × r connecting regions. The value of “r” is 2 for small images and 5 for larger images.
●​ A bias and a non-linear sigmoidal function can be applied to each of the feature map before or after the
pooling layer.

Continuing the example, we apply max pooling over every 2×2 region with stride = 1 (the stride specifies how far the window slides at each step).

Region 1 (top-left 2×2): all entries are −6, so the max value is −6.
Region 2 (top-middle 2×2): all entries are again −6, so the max value is −6.

And so on; the final max-pooled output is a 2×2 map in which every entry is −6. Since all values in the feature map were the same, the result remains unchanged.
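The pooling step can be sketched the same way, applied to the 3×3 feature map of −6s produced above:

import numpy as np

def max_pool(fmap, r=2, stride=1):
    rows = (fmap.shape[0] - r) // stride + 1
    cols = (fmap.shape[1] - r) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = fmap[i*stride:i*stride + r, j*stride:j*stride + r]
            out[i, j] = window.max()          # keep the strongest response in the window
    return out

fmap = np.full((3, 3), -6.0)                  # the feature map from the example above
print(max_pool(fmap))                         # a 2x2 map, every entry still -6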

After several convolutional and pooling layers, the final output is flattened into a single vector and passed
through one or more fully connected layers. These layers are similar to those in a regular neural network and
are used to combine the features extracted by the previous layers to make final predictions, such as classifying
the image into different categories.

The output layer produces the final output of the network, such as the class scores for classification tasks. The
number of neurons in this layer corresponds to the number of classes the network is trying to predict.

What makes CNN suitable for image processing tasks over ANN?
Convolutional Neural Networks (CNNs) are highly effective for image processing due to their ability to
automatically learn spatial hierarchies of features. Here’s why they work so well:
1.​ Unlike traditional Artificial Neural Networks (ANNs), CNNs do not require manually extracted features.
CNN’s convolutional layers automatically detect edges, textures, patterns, and complex structures
without human intervention.
2.​ CNNs use pooling layers (e.g., max pooling) to reduce spatial dimensions while keeping the most
important features. This makes CNNs robust to position changes (i.e., an object can be anywhere in the
image, and CNN can still detect it).
3.​ Instead of fully connecting each pixel (like ANN), CNNs use small filters (kernels) that slide over the
image. This reduces the number of parameters, making CNNs computationally efficient. A 100×100
image with ANN requires 10,000 neurons, but CNN just needs a few filters to process it.
4.​ CNNs learn directly from raw pixel data and adjust filters automatically using backpropagation. They do
not require handcrafted features, making them highly adaptable.

Recurrent Neural Network (RNN)


It is a specialized class of neural networks designed to handle sequential data. It is a specific type of feedback
network that utilizes a feedback mechanism to process sequences by maintaining hidden states that capture
information about previous inputs. RNNs have connections that form directed cycles, allowing them to maintain
a 'memory' of previous inputs. This unique architecture enables RNNs to process sequences of data, such as
time series data, text data, speech data, and video data, making them particularly effective for tasks where the
order and context of data points are crucial.

In an RNN, information is fed back into the system after each step. Think of it like reading a sentence: when you are trying to predict the next word, you don't just look at the current word but also need to remember the words that came before to make an accurate guess. RNNs allow the network to "remember" past information by feeding the output from one step into the next step. This helps the network understand the context of what has already happened and make better predictions based on that. For example, when predicting the next word in a sentence, the RNN uses the previous words to help decide which word is most likely to come next.

The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a hidden state that
maintains information about previous inputs in a sequence. Recurrent units can “remember” information from
prior steps by feeding back their hidden state, allowing them to capture dependencies across time. RNN
unfolding or unrolling is the process of expanding the recurrent structure over time steps. During unfolding
each step of the sequence is represented as a separate layer in a series illustrating how information flows
across each time step.

This unrolling enables “backpropagation through time (BPTT)” which is a learning process where errors are
propagated across time steps to adjust the network’s weights enhancing the RNN’s ability to learn
dependencies within sequential data. RNNs share similarities in input and output structures with other deep
learning architectures but differ significantly in how information flows from input to output. Unlike traditional
deep neural networks, where each dense layer has distinct weight matrices, RNNs use shared weights across
time steps, allowing them to remember information over sequences.
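The recurrent update itself is a single line; the following sketch runs one hidden state across an assumed random sequence, with the weight matrices shared by every time step:

import numpy as np

# Dimensions and weight values are assumed for illustration
n_in, n_h = 3, 4
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(n_in, n_h))   # input -> hidden
Whh = rng.normal(scale=0.1, size=(n_h, n_h))    # hidden -> hidden (the feedback loop)
b = np.zeros(n_h)

h = np.zeros(n_h)                               # hidden state: the network's "memory"
for x_t in rng.normal(size=(5, n_in)):          # a sequence of 5 time steps
    h = np.tanh(x_t @ Wxh + h @ Whh + b)        # recurrent update with shared weights
print(h)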

Application areas include Natural Language Processing, Time Series Prediction, Music Generation, and more.

Generative Adversarial Network (GAN)

A generative adversarial network (GAN) has two parts:


●​ The generator part of a GAN learns to create fake data by incorporating feedback from the
discriminator. It learns to make the discriminator classify its output as real. Generator training requires
tighter integration between the generator and the discriminator than discriminator training requires. The
portion of the GAN that trains the generator includes:
o​ Random input
o​ Generator network, which transforms the random input into a data instance
o​ Discriminator network, which classifies the generated data
o​ Discriminator output
o​ Generator loss, which penalizes the generator for failing to fool the discriminator
The generator is trained with the following procedure:

1.​ Sample random noise.
2.​ Produce generator output from sampled random noise.
3.​ Get discriminator "Real" or "Fake" classification for generator output.
4.​ Calculate loss from discriminator classification.
5.​ Backpropagate through both the discriminator and generator to obtain gradients.
6.​ Use gradients to change only the generator weights.

●​ The discriminator learns to distinguish the generator's fake data from real data. The discriminator penalizes the generator for producing implausible results. The discriminator's training data comes from two sources: real data instances, such as real pictures of people, which the discriminator uses as positive examples during training; and fake data instances created by the generator, which the discriminator uses as negative examples during training. During discriminator training the generator does not train; its weights remain constant while it produces examples for the discriminator to train on. During discriminator training:
1.​ The discriminator classifies both real data and fake data from the generator.
2.​ The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a
fake instance as real.
3.​ The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.

When training begins, the generator produces obviously fake data, and the discriminator quickly learns to tell
that it's fake:

As training progresses, the generator gets closer to producing output that can fool the discriminator:

Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and
fake. It starts to classify fake data as real, and its accuracy decreases.

A GAN can have two loss functions: one for generator training and one for discriminator training. Across many implementations, the common loss function is the minimax loss. The generator tries to minimize the following function while the discriminator tries to maximize it:

Ex[log D(x)] + Ez[log(1 − D(G(z)))]
In this function:
❖​ D(x) is the discriminator's estimate of the probability that real data instance x is real.
❖​ Ex is the expected value over all real data instances.
❖​ G(z) is the generator's output when given noise z.
❖​ D(G(z)) is the discriminator's estimate of the probability that a fake instance is real.
❖​ Ez is the expected value over all random inputs to the generator (in effect, the expected value over all
generated fake instances G(z)).

The generator can't directly affect the log(D(x)) term in the function, so, for the generator, minimizing the loss is
equivalent to minimizing log(1 - D(G(z))).
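Numerically, the two sides of the minimax game can be sketched as follows; the discriminator scores are assumed values, not the output of a trained model:

import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; we minimize the negative of it
    return -(np.log(d_real).mean() + np.log(1 - d_fake).mean())

def generator_loss(d_fake):
    # G cannot affect log D(x), so it simply minimizes E[log(1 - D(G(z)))]
    return np.log(1 - d_fake).mean()

d_real = np.array([0.9, 0.8, 0.95])   # assumed D(x) estimates on real instances
d_fake = np.array([0.1, 0.2, 0.05])   # assumed D(G(z)) estimates on fake instances
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))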

RBFN (Radial Basis Function Network)


The radial basis function (RBF) network is a classification and function-approximation neural network. This network uses the most common nonlinearities, such as sigmoidal and Gaussian kernel functions. The Gaussian functions are also used in regularization networks. The response of such a function is positive for all values of y, and decreases to 0 as y increases to infinity. The Gaussian function is generally defined as

f(y) = e^(−y²)

The derivative of this function is given by

f′(y) = −2y e^(−y²)

The graphical representation of this Gaussian function is as follows:

Gaussian Kernel function


When Gaussian functions are used, each node produces an identical output for inputs lying within a fixed radial distance from the center of the kernel, i.e., the nodes are radially symmetric (distance matters, not direction; the same distance in either direction leads to the same value), and hence the name radial basis function network. The entire network forms a linear combination of the nonlinear basis functions.

The architecture of the radial basis function network (RBFN) is shown here:

Architecture of RBF

The architecture consists of two layers whose output nodes form a linear combination of the kernel (or basis) functions computed by the RBF nodes, i.e., the hidden layer nodes. The basis function (nonlinearity) in the hidden layer produces a significant nonzero response to an input stimulus only when the input falls within a small localized region of the input space. Hence, this network is also called a localized receptive field network.

The training algorithm describes in detail all the calculations involved in the training process depicted in the
flowchart. The training is started in the hidden layer with an unsupervised learning algorithm. The training is
continued in the output layer with a supervised learning algorithm. Simultaneously, we can apply supervised
learning algorithms to the hidden and output layers for fine-tuning of the network. The training algorithm is
given as follows.

Step 0: Set the weights to small random values.


Step 1: Perform Steps 2-8 when the stopping condition is false.
Step 2: Perform Steps 3-7 for each input.
Step 3: Each input unit (xi for all i = 1 to n) receives input signals and transmits to the next hidden layer unit.
Step 4: Calculate the radial basis function.
Step 5: Select the centers for the radial basis function. The centers are selected from the set of input vectors. It
should be noted that a sufficient number of centers have to be selected to ensure adequate sampling of the
input vector space.
Step 6: Calculate the output of each hidden layer (RBF) unit:

φi(x) = exp(−‖x − ci‖² / (2σi²))

where
x: input vector (xj1, xj2, …, xjn)
ci: center of the ith RBF unit
σi: width (spread) of the ith RBF unit
‖x − ci‖: Euclidean distance between x and ci

Step 7: Calculate the output of the neural network:

ynet = Σi=1 to m wim φi(x) + w0

where
m: the number of hidden layer nodes (RBF units)
wim: weight connecting the ith hidden unit to the mth output node
w0: bias term (optional)

Step 8: Calculate the error and test for the stopping condition. The stopping condition may be the number of
epochs or to a certain extent weight change.
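Steps 6 and 7 amount to the following forward pass; the centers, widths, and output weights below are assumed values (in practice they come from the training procedure just described):

import numpy as np

def rbfn_forward(x, centers, sigmas, w, w0=0.0):
    # Step 6: Gaussian response of each hidden (RBF) unit
    phi = np.exp(-np.linalg.norm(x - centers, axis=1) ** 2 / (2 * sigmas ** 2))
    # Step 7: linear combination of the basis functions plus the bias
    return w @ phi + w0

centers = np.array([[0., 0.], [1., 1.]])   # chosen from the input vectors (assumed)
sigmas = np.array([0.5, 0.5])              # widths of the two RBF units (assumed)
w = np.array([0.3, -0.2])                  # output weights (assumed)
print(rbfn_forward(np.array([0.9, 1.1]), centers, sigmas, w))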

Applications of RBFN:
RBFNs are primarily used for classification tasks, but they can also be applied to regression and function approximation problems. Some common application areas include:
1.​ Pattern Recognition: RBFNs are effective in recognizing patterns in data, making them useful in
image and speech recognition.
2.​ Time Series Prediction: They can be used to predict future values in a time series based on past data.
3.​ Control Systems: RBFNs are used in adaptive control systems to model and control dynamic systems.
4.​ Medical Diagnosis: They can assist in diagnosing diseases by classifying medical data.

Comparison between different types of Networks


| Type of Network | Best Used For | Key Features |
|---|---|---|
| Feed-Forward Neural Network | Classification, regression | Simple structure, fast training |
| Convolutional Neural Network | Image and video processing | Feature extraction, deep learning |
| Recurrent Neural Network | Time series, NLP, sequential data | Memory of past inputs |
| Generative Adversarial Network | Data generation (e.g., images) | Generator vs discriminator framework |
| Radial Basis Function Network | Pattern recognition, classification | Gaussian-based activation functions |
| Self-Organizing Maps | Clustering, data visualization | Unsupervised learning, topology based |

Identify the type of learning for following


1.​ Facebook face recognition
2.​ Netflix movie recommendation
3.​ Fraud detection
4.​ A spam detection system learns from labelled emails (spam or not spam) to classify new emails
automatically
5.​ A self-driving car learns to adjust its speed by receiving rewards for safe driving and penalties for
unsafe behavior.
6.​ A clustering algorithm groups customers based on their purchasing behaviour without prior labels.
7.​ A speech recognition system is trained on labelled voice samples to convert speech into text accurately.
8.​ A robot in a factory learns how to pick up objects by trial and error, receiving rewards when successful.
9.​ A recommendation system suggests new movies to users based on patterns in their previous watch
history, without predefined labels.
10.​A credit card fraud detection system is trained using labelled transactions (fraudulent or non-fraudulent)
to detect fraudulent activities
11.​A chess-playing AI improves by playing millions of games against itself and adjusting its strategies
based on wins and losses.
12.​A genetic algorithm clusters different plant species based on their genetic similarities, with no
predefined classifications
13.​A virtual assistant like Siri or Google Assistant learns to recognize commands by training on labeled
datasets of voice recordings.

Some important questions


a.​ What is the difference between supervised and unsupervised learning in neural networks?
b.​ How does the process of learning occur in an artificial neural network?
c.​ What is backpropagation, and how does it contribute to the learning process in ANNs?
d.​ How do neural networks adapt to new data during training?

McCulloch–Pitts Neuron (M–P neuron)


●​ The McCulloch–Pitts neuron, proposed in 1943, was the earliest neural network model.
●​ M–P neurons are connected by directed weighted paths. The weights associated with the communication links may be excitatory (positive weight) or inhibitory (negative weight). All the excitatory connections entering a particular neuron carry the same weight.
●​ The threshold plays a major role in M–P neuron. There is a fixed threshold for each neuron, and if the
net input to the neuron is greater than the threshold then the neuron fires. Also, it should be noted that
any nonzero inhibitory input would prevent the neuron from firing. The M–P neurons are most widely
used in the case of logic functions.
●​ It should be noted that the activation of a M–P neuron is binary, that is, at any time step the neuron may
fire or may not fire.

The inputs from x1 to xn possess excitatory weighted connections and inputs from xn+1 to xn+m possess inhibitory
weighted interconnections. Since the firing of the output neuron is based upon the threshold, the activation
function here is defined as

f(yin) = 1, if yin ≥ Ө
f(yin) = 0, if yin < Ө
For inhibition to be absolute, the threshold with the activation function should satisfy the following condition:
Ө > nw - P

In the above equation, P refers to the total contribution from all inhibitory inputs. The equation applies when all
the inhibitory inputs are active, i.e., the case of weak absolute inhibition.

For strong absolute inhibition, i.e., when only one inhibitory input is active, the equation should be modified as
Ө > nw - Pmin

Here Pmin refers to the minimum contribution from inhibitory inputs (e.g., the weight of a single inhibitory input).

Do not confuse this with the firing condition, which is that a neuron fires when the net input is greater than or
equal to the threshold:
Ө ≤ nw - P

The output neuron will fire if it receives say “k” or more excitatory inputs but no inhibitory inputs, where
kw ≥ Ө > (k - 1)w

If the neuron receives k excitatory inputs, the net input (kw) will be greater than or equal to the threshold,
causing the neuron to fire. If the neuron receives fewer than k excitatory inputs (k−1), the net input ((k−1)w) will
be less than the threshold, and the neuron will not fire.

The M–P neuron has no particular training algorithm. An analysis has to be performed to determine the values
of the weights and the threshold. Here the weights of the neuron are set along with the threshold to make the
neuron perform a simple logic function. The M-P neurons are used as building blocks on which we can model
any function or phenomenon, which can be represented as a logic function.
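
To make this concrete, here is a minimal Python sketch of an M–P neuron computing the AND function, assuming both excitatory weights are 1 and the threshold Ө is 2 (so that kw ≥ Ө > (k − 1)w holds with k = 2, w = 1):

def mp_neuron(inputs, weights, theta):
    # Fires (outputs 1) only when the net input reaches the threshold
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= theta else 0

# AND function: two excitatory inputs with weight 1, threshold 2
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron([x1, x2], [1, 1], theta=2))

Only the input (1, 1) produces a net input of 2 that reaches the threshold, so only that pattern fires the neuron.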

Do it yourself
Q.1 Implement AND function using McCulloch–Pitts neuron (take binary data).
Q.2 Implement ANDNOT function using McCulloch–Pitts neuron (use binary data representation). In the case
of the ANDNOT function, the response is true if the first input is true and the second input is false. For all other
input variations, the response is false.
Q.3 Implement XOR function using McCulloch–Pitts neuron (use binary data representation).

Hebb Network
Donald Hebb stated in 1949

“When an axon of cell A is near enough to excite cell B, and repeatedly or persistently takes part in firing it,
some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one
of the cells firing B, is increased”.

According to the Hebb rule, the weight vector is found to increase proportionately to the product of the input
and the learning signal which is equal to the neuron’s output. In Hebb learning, if two interconnected neurons
are ‘on’ simultaneously then the weights associated with these neurons can be increased by the modification
made in their synaptic gap (strength). The weight update in the Hebb rule is given by

wi(new) = wi(old) + xi y

The Hebb rule is more suited for bipolar data than binary data.

Flowchart of Training Algorithm


❖	Step 0: First initialize the weights. In this network they may be set to zero, i.e., wi = 0 for i = 1 to n,
where n is the total number of input neurons.
❖​ Step 1: Steps 2–4 have to be performed for each
input training vector and target output pair, s : t.
❖	Step 2: Input unit activations are set. Generally, the activation function of the input layer is the identity
function: xi = si for i = 1 to n.
❖​ Step 3: Output units activations are set: y = t.
❖​ Step 4: Weight adjustments and bias adjustments are
performed:
wi(new) = wi(old) + xiy
b(new)= b(old) + y

In Step 4, the weight updation formula can also be given in vector form as
w(new) = w(old) + xy
Here the change in weight can be expressed as
Δw = xy
As a result,
w(new) = w(old) + Δw
The Hebb rule can be used for pattern association,
pattern categorization, pattern classification and over a
range of other areas.

Hebbian Learning Rule


Hebbian Learning is one of the oldest and most influential learning rules in Artificial Neural Networks (ANN). It
was proposed by Donald Hebb in 1949 in his book "The Organization of Behavior". The core concept of
Hebbian Learning is based on associative learning, which means: "Neurons that fire together, wire together." In
simple terms, if two neurons activate together, their connection strength (weight) increases. The Hebbian
Learning Rule is based on the idea that:
●​ If an input and output neuron activate simultaneously, their connection strength (weight) should be
increased.
●​ If one activates and the other doesn’t, no significant change occurs in the connection.
●​ There is no error calculation in Hebbian Learning as there is no target output.
●​ It is a type of Unsupervised Learning because there is no target output to compare.
●	The strength of the connection (weight) between neurons is increased in proportion to the activity of the
neurons.
●​ The standard mathematical form of the Hebbian Learning Rule is:
wi = wi + η × x × y

wi: weight of the ith connection between input and output neuron
η: Learning rate
x: Input value from the input neuron
y: Output value from the output neuron

The Hebbian Learning Rule can be explained as:

Characteristics of the Learning Rule

Limitations of Basic Hebbian Learning


1.​ Unbounded weight growth: Without normalization or decay, weights may explode.
2.​ No weakening of connections: Hebbian learning alone does not handle cases where weights should
decrease (e.g., when x and y are anti-correlated).
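
As an illustration, a minimal Python sketch of Hebbian training on the AND function with bipolar data (for which the rule is better suited, as noted above), taking η = 1:

# Bipolar AND function: inputs and targets are in {-1, +1}
samples = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]

w = [0.0, 0.0]   # Step 0: initialize the weights to zero
b = 0.0          # and the bias

for x, y in samples:             # Steps 1-4: one pass over the s : t pairs
    for i in range(len(w)):
        w[i] = w[i] + x[i] * y   # wi(new) = wi(old) + xi * y
    b = b + y                    # b(new) = b(old) + y

print("weights:", w, "bias:", b) # gives w = [2, 2], b = -2

The resulting net input 2x1 + 2x2 − 2 is positive only for the input (1, 1), which is exactly the bipolar AND function.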

Perceptron Networks
Let us understand the linear separability first with an example. Imagine you have a table with a bunch of fruits:
apples and oranges. Your task is to separate the apples from the oranges using a straight stick (like a ruler).

Scenario 1: Easy Separation (Linearly Separable)


Suppose all the apples are on one side of the table, and all the oranges are on the other side. You can easily
place the stick in such a way that all the apples are on one side of the stick, and all the oranges are on the
other side. This is an example of linear separability because a straight line (the stick) can perfectly separate
the two types of fruits.

Scenario 2: Mixed Fruits (Not Linearly Separable)


Now, imagine the apples and oranges are mixed up on the table. Some apples are on the left, some on the
right, and the same goes for the oranges. No matter how you place the stick, you cannot separate all the
apples from all the oranges with a single straight line. This is an example of non-linearly separable data.

Summarization: Linear separability means you can draw a straight line (or a flat plane in higher dimensions)
to separate two groups of things (like apples and oranges). If you cannot draw such a straight line, the data is
not linearly separable.

The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly
separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with
adjustable synaptic weights and bias.

Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly
separable classes, then the perceptron algorithm converges (i.e., it eventually finds a solution) and positions the
decision surface in the form of a hyperplane between the two classes. The proof of convergence of the
algorithm is known as the perceptron convergence theorem.

The perceptron built around a single neuron is limited to performing pattern classification with only two classes
(hypotheses). By expanding the output (computation) layer of the perceptron to include more than one neuron,
we may correspondingly perform classification with more than two classes.

Consider the figure

Signal-flow Graph of the Perceptron


The summing node of the neural model computes a linear combination of the inputs applied to its synapses, as
well as incorporates an externally applied bias. The resulting sum, that is, the induced local field, is applied to a
hard limiter. Accordingly, the neuron produces an output equal to +1 if the hard limiter input is positive, and -1 if
it is negative.

The goal of the perceptron is to correctly classify the set of externally applied stimuli (i.e. input data) x1, x2 ... xm
into one of two classes C1 and C2. The decision rule for the classification is to assign the point represented by
the inputs x1, x2, ..., xm to class C1 if the perceptron output y is +1 and to class C2 if it is -1.

The synaptic weights of the perceptron are denoted by w1, w2 ...,wm. Correspondingly, the inputs applied to the
perceptron are denoted by x1, x2, ..., xm. The externally applied bias is denoted by b. From the model, we find
that the hard limiter input, or induced local field, of the neuron is

v = w1x1 + w2x2 + ... + wmxm + b
To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions
in the m-dimensional signal space spanned by the m input variables x1, x2, ..., xm. In the simplest form of the
perceptron, there are two decision regions separated by a hyperplane, which is defined by

w1x1 + w2x2 + ... + wmxm + b = 0
Take a look at the figure for the case of two input variables x1 and x2, for which the decision boundary takes the
form of a straight line.

A point (x1, x2) that lies above the boundary line is assigned to class C1, and a point (x1, x2) that lies below the
boundary line is assigned to class C2. Note also that the effect of the bias b is merely to shift the decision
boundary away from the origin. The synaptic weights w1, w2, ...,wm of the perceptron can be adapted on an
iteration-by-iteration basis.

For the perceptron to function properly, the two classes C1 and C2 must be linearly separable. This, in turn,
means that the patterns to be classified must be sufficiently separated from each other to ensure that the
decision surface consists of a hyperplane. This requirement is illustrated in Figure below for the case of a
two-dimensional perceptron. In the (a) part of the figure, the two classes C1 and C2 are sufficiently separated
from each other for us to draw a hyperplane (in this case, a straight line) as the decision boundary. If, however,
the two classes C1 and C2 are allowed to move too close to each other, as in (b) part of the figure, they become
nonlinearly separable, a situation that is beyond the computing capability of the perceptron.

Gradient Descent Learning


Terminology
1. Cost Function
The cost function (or loss function) measures how well the model is performing. It quantifies the difference
between the predicted values and the actual values. For linear regression, the cost function is the Mean Squared
Error (MSE):

J = (1/n) Σ (yi − ŷi)²
Where:
yi: Actual value
ŷi: Predicted value
n: Number of data points.
2. Gradient Vector
A partial derivative measures how a function changes when you vary only one variable, while keeping all
other variables constant.

Consider a function of two variables:

f(x, y) = x² + 3y
The function depends on both x and y.


●​ A partial derivative with respect to x means we treat y as a constant and differentiate only with respect
to x.
●​ Similarly, a partial derivative with respect to y means we treat x as a constant and differentiate only with
respect to y.

Partial derivative with respect to x:

∂f/∂x = 2x

The derivative of x² with respect to x is 2x. Since 3y is treated as a constant, its derivative is 0.

Partial derivative with respect to y:

∂f/∂y = 3

The term x² is treated as a constant, so its derivative is 0. The derivative of 3y with respect to y is 3.

The gradient vector is simply a vector of partial derivatives and points in the direction of steepest ascent.
The gradient vector (denoted as ∇f, pronounced "nabla f") is formed by collecting all partial derivatives:

∇f = [∂f/∂x, ∂f/∂y]

For our function:

∇f = [2x, 3]

The function changes most rapidly in the direction of (2x, 3). If we move in the direction of this gradient, the
function f(x, y) increases fastest.

In this case, for a weight wi, the gradient is:

∂J/∂wi

This represents the rate of change of the cost function with respect to wi.

3. Chain Rule
The chain rule helps us find the derivative of a function that is composed of two or more functions. In simple
terms, it tells us how to take the derivative of a "function inside a function."
●​ If y=f(g(x)), then f is the "outer function," and g is the "inner function."
●​ The chain rule helps us find the derivative of y with respect to x.

The derivative of y with respect to x for y = f(g(x)) is:

dy/dx = f′(g(x)) · g′(x)
In words:
●​ Take the derivative of the outer function (f) with respect to the inner function (g).
●​ Multiply it by the derivative of the inner function (g) with respect to x.

An Example
Let’s say:

y = (3x + 2)²

Here
●	The outer function is f(g) = g².
●	The inner function is g(x) = 3x + 2.

The derivative of f(g) = g² with respect to g is:

df/dg = 2g

The derivative of g(x) = 3x + 2 with respect to x is:

dg/dx = 3

Multiply the two derivatives:

dy/dx = 2g · 3 = 6g

Substitute g = 3x + 2:

dy/dx = 6(3x + 2) = 18x + 12

So, the derivative of y is 18x + 12.
Note:
In Gradient Descent, we use the chain rule to compute the gradient of the cost function. For example:
●	The cost function J(w) depends on the predicted value ŷ.
●	The predicted value ŷ depends on the weight w.
We use the chain rule:

∂J/∂w = (∂J/∂ŷ) · (∂ŷ/∂w)
4. Learning Rate (α)
The learning rate (α) affects the convergence of the ANN. It controls the size of the steps taken during
parameter updates. The range of α is from 0 to 1.

The Gradient Descent Algorithm


Gradient Descent is a powerful optimization algorithm used to minimize cost functions in machine learning. It
works by iteratively updating model parameters (weights) in the direction of the negative gradient. It is widely
used in training neural networks, linear regression, and other models.

Step-1: Initialize Weights: Start with random values for the weights (wi), bias (b) and learning rate (α).
Step-2: Compute Gradient: Calculate the gradient of the cost function with respect to each weight and bias:

∂J/∂wi and ∂J/∂b

Step-3: Update Weights: Adjust the weights in the opposite direction of the gradient:

wi = wi − α · (∂J/∂wi)	b = b − α · (∂J/∂b)
Step-4: Repeat: Repeat steps 2 and 3 until one of the stopping criteria is met
●​ Maximum number of iterations is reached.
●​ The step size becomes smaller than a predefined tolerance.

An Example
Input: House sizes (x) in square feet
Output: House prices (y) in thousands of dollars.
Model: Linear regression model ŷ = wx + b, where w is the weight (slope) and b is the bias (intercept)
Goal: Use Gradient Descent to find the optimal values of w and b that minimize the Mean Squared Error (MSE)
cost function.

Here is the dataset


House size (x)	House Price (y)
1	2
2	4
3	6
4	8

Solution
Step-1: Let us initialize w = 0 and b = 0 and α = 0.1
Step-2 & 3:
Iteration-01
	Compute predicted output using the formula ŷ = wx + b
	ŷ1 = 0 × 1 + 0 = 0	ŷ2 = 0 × 2 + 0 = 0	ŷ3 = 0 × 3 + 0 = 0	ŷ4 = 0 × 4 + 0 = 0

	Compute the Gradient
	∂J/∂w = −(1/n) Σ xi(yi − ŷi) = −(1/4)(1·2 + 2·4 + 3·6 + 4·8) = −15
	∂J/∂b = −(1/n) Σ (yi − ŷi) = −(1/4)(2 + 4 + 6 + 8) = −5

	Update Parameters:
	w = 0 − 0.1 × (−15) = 1.5	b = 0 − 0.1 × (−5) = 0.5

Iteration-02
	Compute predicted output using the formula ŷ = wx + b
	ŷ1 = 1.5 × 1 + 0.5 = 2	ŷ2 = 1.5 × 2 + 0.5 = 3.5	ŷ3 = 1.5 × 3 + 0.5 = 5	ŷ4 = 1.5 × 4 + 0.5 = 6.5

	Compute the Gradient
	∂J/∂w = −(1/4)(1·0 + 2·0.5 + 3·1 + 4·1.5) = −2.5
	∂J/∂b = −(1/4)(0 + 0.5 + 1 + 1.5) = −0.75

	Update Parameters:
	w = 1.5 − 0.1 × (−2.5) = 1.75	b = 0.5 − 0.1 × (−0.75) = 0.575

Iteration-03
	Compute predicted output using the formula ŷ = wx + b
	ŷ1 = 1.75 × 1 + 0.575 = 2.325	ŷ2 = 1.75 × 2 + 0.575 = 4.075
	ŷ3 = 1.75 × 3 + 0.575 = 5.825	ŷ4 = 1.75 × 4 + 0.575 = 7.575

	Compute the Gradient
	∂J/∂w = −(1/4)(1·(−0.325) + 2·(−0.075) + 3·0.175 + 4·0.425) = −0.4375
	∂J/∂b = −(1/4)(−0.325 − 0.075 + 0.175 + 0.425) = −0.05

	Update Parameters:
	w = 1.75 − 0.1 × (−0.4375) ≈ 1.794	b = 0.575 − 0.1 × (−0.05) = 0.58

●	Continue iterating until the changes in w and b become very small (e.g., < 0.001).
●	After several iterations, w and b will converge to their optimal values. For this example, they will
approach w = 2, b = 0, and the final model will be ŷ = 2x (see the sketch below).
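
The same computation can be written as a short Python sketch (NumPy assumed; the gradients use the same 1/n convention as the hand calculation above):

import numpy as np

# Dataset from the worked example: house sizes and prices
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b, alpha, tol = 0.0, 0.0, 0.1, 1e-3    # Step-1: initialization

for i in range(10000):
    y_pred = w * x + b                    # predictions with the current parameters
    grad_w = -np.mean(x * (y - y_pred))   # Step-2: gradient w.r.t. w
    grad_b = -np.mean(y - y_pred)         # gradient w.r.t. b
    new_w = w - alpha * grad_w            # Step-3: move against the gradient
    new_b = b - alpha * grad_b
    if max(abs(new_w - w), abs(new_b - b)) < tol:   # Step-4: tolerance-based stop
        w, b = new_w, new_b
        break
    w, b = new_w, new_b

print(f"w = {w:.3f}, b = {b:.3f}")        # approaches w = 2, b = 0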

A note about the tolerance based stopping


The predefined tolerance (ϵ) is a parameter that defines the accepted level of error in the predicted output (i.e.,
the output generated by the model). Once the change in the cost function is smaller than the predefined
tolerance, the algorithm stops. The other stopping condition is when the maximum number of iterations has
already been performed.

Least Mean Squares (LMS)


1.​ LMS is an adaptive filtering algorithm used to minimize the Mean Squared Error (MSE) between
predicted and actual values.
2.​ It is an online learning algorithm, meaning it updates the model parameters (weights and bias)
iteratively using one data point (or a small batch) at a time. It is computationally efficient because it
processes one data point at a time.

The MSE is the cost function that LMS aims to minimize:

J = (1/n) Σ (yi − ŷi)²

For a single sample, the gradients of the error with respect to w and b are:

∂J/∂w = −(y − ŷ) x	∂J/∂b = −(y − ŷ)

LMS Update Rule to Update w & b:

w = w + α (y − ŷ) x	b = b + α (y − ŷ)

Where α is the learning rate.

Let us do the same example again


Step-01: Initialize w = 0, b = 0 and α = 0.1
Step-02:
Iteration-01 (first sample: x = 1, y = 2)
	ŷ = 0 × 1 + 0 = 0; error = 2 − 0 = 2
	w = 0 + 0.1 × 2 × 1 = 0.2	b = 0 + 0.1 × 2 = 0.2

Iteration-02 (second sample: x = 2, y = 4)
	ŷ = 0.2 × 2 + 0.2 = 0.6; error = 4 − 0.6 = 3.4
	w = 0.2 + 0.1 × 3.4 × 2 = 0.88	b = 0.2 + 0.1 × 3.4 = 0.54

Iteration-03 (third sample: x = 3, y = 6)
	ŷ = 0.88 × 3 + 0.54 = 3.18; error = 6 − 3.18 = 2.82
	w = 0.88 + 0.1 × 2.82 × 3 = 1.726	b = 0.54 + 0.1 × 2.82 = 0.822

●	Continue iterating until the changes in w and b become very small (e.g., < 0.001).
●	After several iterations, w and b will converge to their optimal values. For this example, they will
approach w = 2, b = 0, and the final model will be ŷ = 2x.
●​ After the first four iterations (where you’ve used all four data points), you simply start over from the first
data point and continue the process. This is called cycling through the dataset.
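
A minimal Python sketch of this online procedure, cycling through the dataset sample by sample:

x_data = [1.0, 2.0, 3.0, 4.0]
y_data = [2.0, 4.0, 6.0, 8.0]

w, b, alpha = 0.0, 0.0, 0.1

for epoch in range(50):                # cycle through the dataset repeatedly
    for xi, yi in zip(x_data, y_data):
        y_pred = w * xi + b            # predict with the current parameters
        error = yi - y_pred            # per-sample error
        w = w + alpha * error * xi     # update immediately after each sample
        b = b + alpha * error

print(f"w = {w:.3f}, b = {b:.3f}")     # converges toward w = 2, b = 0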

Do it Yourself
Do the same question again, adjusting the weight only, not the bias.

Key difference between the gradient descent and LMS


Aspect | Full MSE (Gradient Descent) | LMS
Data usage | Predicts outputs for all inputs using the current weights and bias. | Uses one data point (or small batch) at a time.
Gradient Computation | Computes the exact gradient using all data. | Approximates the gradient using one data point.
Update Frequency | Updates weights and bias after processing all data. | Updates weights and bias after each data point.
Convergence | Slower but more stable. | Faster but noisier (may oscillate around the minimum).
Approach | Batch learning: uses the entire dataset in each iteration. | Online learning: processes one data point (or small batch) at a time.

Effect of learning rate


The learning rate determines how quickly or slowly the model parameters (e.g., weights w and bias b) are
updated during training. It scales the gradient of the cost function:

w = w − α · (∂J/∂w)

Learning Rate (α) | Behavior | Pros | Cons
Too Low | Small steps toward the minimum. | 1. Stable convergence. 2. Less likely to overshoot the minimum. | 1. Slow convergence (requires many iterations). 2. May get stuck in local minima or saddle points.
Optimal | Balanced steps that converge efficiently to the minimum. | 1. Fast and stable convergence. 2. Efficient use of computational resources. | 1. Requires tuning to find the right value.
Too High | Large steps that may overshoot the minimum. | 1. Faster initial progress. | 1. Oscillations around the minimum. 2. Risk of divergence (moving away from the minimum).

Local minima and saddle points


➢	A local minimum is a point in the cost function where the error is lower than at all nearby points but higher
than at the global minimum (the best possible solution). It arises because the cost function of complex models
(e.g., neural networks) is often non-convex, meaning it has many "hills" and "valleys." If the model
parameters (weights and biases) land in a local minimum during training, Gradient Descent or other
optimization algorithms may get stuck there because the gradient is zero (or close to zero). An example is
rolling a ball down a hilly terrain: if the ball lands in a small valley (local minimum), it won't roll further
even if there's a deeper valley (global minimum) nearby.
➢​ The saddle point is a point in the cost function where the gradient is zero, but it is neither a minimum nor a
maximum (it’s a flat region). It happens in high-dimensional spaces (common in neural networks), saddle
points are more prevalent than local minima. The model may get stuck at a saddle point because the
gradient is zero, and the optimization algorithm stops making progress.

Oscillations & Divergence


➢​ Oscillation occurs when the learning rate is too high, causing the model parameters to bounce around the
minimum instead of converging smoothly. It happens when large steps cause the model to overshoot the
minimum, and the next update overshoots in the opposite direction, creating a cycle. An example of it is a
ball rolling down a hill: if the ball has too much momentum (high learning rate), it will overshoot the bottom
and roll up the other side, then roll back, and so on.
➢​ Divergence occurs when the learning rate is so high that the model parameters move away from the
minimum instead of converging toward it. It happens when extremely large steps cause the model to
overshoot the minimum by such a large margin that the cost function increases instead of decreasing. An
example of it is a ball rolling down a hill with so much momentum that it flies off the hill entirely and never
returns.

Multilayer Perceptron (MLP)


It is a type of artificial neural network (ANN) that consists of multiple layers of neurons. It is one of the most
commonly used architectures for deep learning tasks such as classification, regression, and pattern
recognition. The key components of an MLP include:
●​ Input Layer: Receives the raw data (e.g., features of a dataset). Each neuron in the input layer
represents a feature of the input data.
●​ Hidden Layers: It is the heart of an MLP. Perform complex computations like feature extraction and
transformation. Each neuron in a hidden layer computes a weighted sum of its inputs, applies an
activation function, and passes the result to the next layer.
●​ Output Layer: The output layer produces the final prediction (e.g., class label for classification,
continuous value for regression). The number of neurons in the output layer depends on the task:
Binary Classification: 1 neuron (e.g., sigmoid activation).
Regression: 1 neuron (linear activation)
Multi-Class Classification: n neurons

An MLP is a fully connected feedforward neural network, meaning that each neuron in one layer is connected
to every neuron in the next layer. It uses activation functions such as sigmoid or tanh to introduce
non-linearity, enabling it to learn complex patterns in data.
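
As an illustration, a hypothetical fully connected MLP in Keras (the layer sizes, activations, and 20-feature input are arbitrary choices for this sketch):

from tensorflow import keras

# Every neuron in one layer is connected to every neuron in the next layer
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),              # input layer: one value per feature
    keras.layers.Dense(16, activation="tanh"),    # hidden layer 1
    keras.layers.Dense(8, activation="tanh"),     # hidden layer 2
    keras.layers.Dense(1, activation="sigmoid"),  # output layer: binary classification
])
model.summary()                                   # lists the trainable parameters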

Role of Hidden Layers in ANN


The number of hidden layers and neurons in each layer is a hyperparameter that must be tuned.
Hyperparameters are parameters that are set before training a model. They are not learned from the data but
are crucial for controlling the learning process. Some examples of hyperparameters in an MLP are: number of
hidden layers, number of neurons in each hidden layer, learning rate, activation functions, batch size, and
number of epochs (one epoch is completed when the model has seen every training example once; training
for multiple epochs allows the model to gradually improve its performance by minimizing the loss function).
The parameters other than the hyperparameters are called trainable parameters, e.g., weights and biases.
Hidden layers play a crucial role in deep learning models by enabling hierarchical feature extraction. Their
functions include:

●​ Feature Learning
Lower Layers: Detect simple patterns, such as edges, textures, or basic shapes.
Deeper Layers: Detect abstract concepts, such as objects or high-level features.
●​ Non-Linearity: Introducing non-linearity using activation functions, which allows the model to solve
complex problems. Without non-linearity, an MLP would be equivalent to a linear model, incapable of
solving complex problems.
●​ Representation Learning: Hidden layers transform raw input data into meaningful representations that
make it easier for the output layer to perform classification or regression.
●​ Capturing Relationships: Hidden layers can capture complex relationships between input features
that are not easily separable in lower-dimensional space.

Applications of MLP
●​ Classification: Image classification, spam detection, sentiment analysis.
●​ Regression: Predicting house prices, stock prices, or temperature.
●​ Pattern Recognition: Handwriting recognition, speech recognition.
●​ Function Approximation: Approximating complex mathematical functions.

BACK-PROPAGATION NETWORK
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a
hidden layer and an output layer. The neurons present in the hidden and output layers have biases, which are
the connections from units whose activation is always 1. The bias terms also act as weights.

The figure above shows the architecture of a BPN, depicting only the direction of information flow for the
feed-forward phase. During the back-propagation phase of learning, signals are sent in the reverse direction.

The inputs are sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (–1,
+1). The activation function could be any function which increases monotonically and is also differentiable.

x = input training vector (x1, ..., xi , ..., xn)


t = target output vector ( t1, ..., tk , ..., tm)
α = learning rate parameter
xi = input unit i. (Since the input layer uses an identity activation function, the input and output signals here are
same.)
v0j = bias on jth hidden unit
w0k = bias on kth output unit
zj = hidden unit j.

The net input to zj is

zinj = v0j + Σ (i = 1 to n) xi vij

and the output is

zj = f(zinj)

yk = output unit k. The net input to yk is

yink = w0k + Σ (j = 1 to p) zj wjk

and the output is

yk = f(yink)

δk: Error correction weight adjustment for Wjk that is due to an error at output unit yk, which is back-propagated
to the hidden units that feed into unit yk
δj: Error correction weight adjustment for vij that is due to the back-propagation of error to the hidden unit zj.

Also, it should be noted that the commonly used activation functions are binary sigmoidal and bipolar sigmoidal
activation functions. The range of binary sigmoid is from 0 to 1, and for bipolar sigmoid it is from –1 to +1.
These functions are used in the BPN because of the following characteristics
(i) continuity
(ii) differentiability
(iii) nondecreasing monotony

The error back-propagation learning algorithm can be outlined in the following algorithm:
Step 0: Initialize weights and learning rate (take some small random values).
Step 1: Perform Steps 2–9 when stopping condition is false.
Step 2: Perform Steps 3–8 for each training pair.

Feed-forward phase (Phase I):


Step 3: Each input unit receives input signal xi and sends it to the hidden unit (i = 1 to n).
Step 4: Each hidden unit zj (j = 1 to p) sums its weighted input signals to calculate the net input:

zinj = v0j + Σ (i = 1 to n) xi vij

Calculate the output of the hidden unit by applying its activation function over zinj (binary or bipolar sigmoidal
activation function):

zj = f(zinj)

and send the output signal from the hidden unit to the input of the output layer units.
Step 5: For each output unit yk (k = 1 to m), calculate the net input:

yink = w0k + Σ (j = 1 to p) zj wjk

and apply the activation function to compute the output signal:

yk = f(yink)
Back-propagation of error (Phase II):


Step 6: Each output unit yk (k = 1 to m) receives a target pattern corresponding to the input training pattern and
computes the error correction term:

δk = (tk − yk) f′(yink)

On the basis of the calculated error correction term, update the change in weights and bias:

Δwjk = α δk zj	Δw0k = α δk

Step 7: Each hidden unit (zj, j = 1 to p) sums its delta inputs from the output units:

δinj = Σ (k = 1 to m) δk wjk

The term δinj gets multiplied with the derivative of f(zinj) to calculate the error term:

δj = δinj f′(zinj)

On the basis of the calculated δj, update the change in weights and bias:

Δvij = α δj xi	Δv0j = α δj
Weight and bias updation (Phase III):
Step 8: Each output unit (yk, k = 1 to m) updates the bias and weights:

wjk(new) = wjk(old) + Δwjk	w0k(new) = w0k(old) + Δw0k

Each hidden unit (zj, j = 1 to p) updates its bias and weights:

vij(new) = vij(old) + Δvij	v0j(new) = v0j(old) + Δv0j
Step 9: Check for the stopping condition. The stopping condition may be a certain number of epochs reached
or when the actual output equals the target output.
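
A minimal Python sketch of one training pass of this algorithm for a 2–2–1 network with the binary sigmoidal activation function; the initial weights below are hypothetical placeholders, not the ones from any worked example:

import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))    # binary sigmoid; its derivative is f(x)(1 - f(x))

x = np.array([0.0, 1.0])               # input training vector
t = 1.0                                # target output
alpha = 0.25                           # learning rate

v = np.array([[0.6, -0.1],             # v[i][j]: weight from input i to hidden unit j
              [-0.3, 0.4]])
v0 = np.array([0.3, 0.5])              # biases on the hidden units
w = np.array([0.4, 0.1])               # w[j]: weight from hidden unit j to the output
w0 = -0.2                              # bias on the output unit

# Phase I: feed-forward
z_in = v0 + x @ v                      # net inputs to the hidden units
z = f(z_in)                            # hidden unit outputs
y_in = w0 + z @ w                      # net input to the output unit
y = f(y_in)                            # output of the network

# Phase II: back-propagation of error
delta_k = (t - y) * y * (1 - y)        # error term at the output unit
delta_j = (delta_k * w) * z * (1 - z)  # error terms at the hidden units

# Phase III: weight and bias updation
w = w + alpha * delta_k * z
w0 = w0 + alpha * delta_k
v = v + alpha * np.outer(x, delta_j)   # delta_v[i][j] = alpha * delta_j[j] * x[i]
v0 = v0 + alpha * delta_j

print("updated output weights:", w, "updated output bias:", w0)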

The above algorithm uses the incremental approach for updation of weights, i.e., the weights are changed
immediately after a training pattern is presented, as in online training. There is another way of training called
batch-mode training, where the weights are changed only after all the training patterns are presented.
Batch-mode training requires additional local storage for each connection to maintain the immediate weight
changes.

The training of a BPN is based on the choice of various parameters. Also, the convergence of the BPN is
based on some important learning factors such as the initial weights, the learning rate, the updation rule, the
size and nature of the training set, and the architecture (number of layers and number of neurons per layer).

An Example
Using a back-propagation network, find the new weights for the net shown in Figure below. It is presented with
the input pattern [0, 1] and the target output is 1. Use a learning rate α = 0.25 and the binary sigmoidal
activation function.

Do it yourself
Find the new weights, using a back-propagation network, for the network shown in Figure below. The network is
presented with the input pattern [-1, 1] and the target output is +1. Use a learning rate α = 0.25 and the bipolar
sigmoidal activation function.

Difference between LMS and backpropagation

Aspect | Least Mean Square (LMS) | Backpropagation
Model type | Simple linear models (e.g., linear regression). | Complex models (e.g., neural networks).
Gradient Computation | Approximates the gradient, since it is computed from a single data point. | Computes the exact gradient using the chain rule over all data points.
Efficiency | Computationally efficient, because it does not require processing the entire dataset at once. | More computationally expensive than LMS, because it processes multiple data points at once.
Learning type | Online learning (real-time adaptation). | Batch or mini-batch learning.
Use Cases | Online learning, real-time systems. | Deep learning, multi-layer networks.

XOR Problem
In Rosenblatt’s single-layer perceptron, there are no hidden neurons. Consequently, it cannot classify input patterns that
are not linearly separable. However, nonlinearly separable patterns commonly occur. For example, this situation arises in
the exclusive-OR (XOR) problem, which may be viewed as a special case of a more general problem, namely, that of
classifying points in the unit hypercube. (An n-dimensional hypercube has 2^n vertices; here the space is
two-dimensional, so there are 4 vertices. "Unit" means that each dimension is constrained to values between 0 and 1,
and since the inputs are binary the values are exactly 0 and 1.)

However, in the special case of the XOR problem, we need consider only the four corners of a unit square that correspond
to the input patterns (0,0), (0,1), (1,1), and (1,0), where a single bit (i.e., binary digit) changes as we move from one corner
to the next.

The first and third input patterns are in class 0, as shown by


0⨁0=0
and
1⨁1=0

Where ⨁ denotes the exclusive-OR boolean function operator. The input patterns (0,0) and (1,1) are at opposite corners of
the unit square, yet they produce the identical output 0. On the other hand, the input patterns (0,1) and (1,0) are also at
opposite corners of the square, but they are in class 1, as shown by
1⨁0=1
and
0⨁1=1

We first recognize that the use of a single neuron with two inputs results in a straight line for a decision boundary in the
input space. For all points on one side of this line, the neuron outputs 1; for all points on the other side of the line, it
outputs 0. The position and orientation of the line in the input space are determined by the synaptic weights of the neuron
connected to the input nodes and the bias applied to the neuron. With the input patterns (0,0) and (1,1) located on opposite
corners of the unit square, and likewise for the other two input patterns (0,1) and (1,0), it is clear that we cannot construct
a straight line for a decision boundary so that (0,0) and (1,1) lie in one decision region and (0,1) and (1,0) lie in the other
decision region. In other words, the single-layer perceptron cannot solve the XOR problem.

However, we may solve the XOR problem by using a single hidden layer with two neurons (as in figure below along with
its diagram of signal flow)

Architectural graph of the network for solving the XOR problem (left); signal-flow graph of the network (right)

Take a look at following-


●​ wij​: Weight from input xi to hidden neuron j.
o​ w11​: Weight from x1 to Neuron 1​.
o​ w12​: Weight from x1 to Neuron 2​.
o​ w21​: Weight from x2 to Neuron 1.
o​ w22​: Weight from x2 to Neuron 2.
●​ bi​: Bias term for neuron i.

The top neuron, labeled as “Neuron 1” in the hidden layer, is characterized as


w11 = w21 = +1 & b1 = -1.5

The slope of the decision boundary constructed by this hidden neuron is equal to -1. Here is the calculation
For an input pattern (x1,x2), the weighted sum z1​for Neuron 1 is calculated as:
z1=w11⋅x1 + w21⋅x2 + b1

Substituting the given values:


z1=1⋅x1 + 1⋅x2 – 1.5

The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z1​is exactly 0:
z1 = 0 ⟹ x1 ​+ x2 ​− 1.5=0
x2 = -x1 + 1.5
This is the equation of the decision boundary line for Neuron 1. It has:
●​ A slope of −1 (since the coefficient of x1​is −1)
●​ A y-intercept of 1.5 (when x1=0, x2=1.5)

The decision boundary is a straight line that passes through the points (0, 1.5) & (1.5, 0) & positioned as

Decision boundary constructed by hidden neuron 1 of the network

The bottom neuron, labeled as “Neuron 2” in the hidden layer, is characterized as


w12 = w22 = +1 & b2 = -0.5

The slope of the decision boundary constructed by this hidden neuron is equal to -1. Here is the calculation
For an input pattern (x1,x2), the weighted sum z2​for Neuron 2 is calculated as:
z2=w12⋅x1 + w22⋅x2 + b2

Substituting the given values:


z2=1⋅x1 + 1⋅x2 – 0.5

The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z2 is exactly 0:
z2 = 0 ⟹ x1 ​+ x2 ​− 0.5=0
x2 = -x1 + 0.5

This is the equation of the decision boundary line for Neuron 2. It has:
●​ A slope of −1 (since the coefficient of x1​is −1)
●​ A y-intercept of 0.5 (when x1=0, x2=0.5)

The orientation and position of the decision boundary constructed by this second hidden neuron are as follow-

Decision boundary constructed by hidden neuron 2 of the network.

The output neuron, labeled as “Neuron 3” , is characterized as


w1 = -2, w2 = +1 & b3 = -0.5

Say the output from neuron-1 is a1 and the output from neuron-2 is a2; the output from neuron-3 is then
z3 = −2·a1 + 1·a2 − 0.5

The function of the output neuron is to construct a linear combination of the decision boundaries formed by the
two hidden neurons. The result of this computation as follow-

Decision boundaries constructed by the complete network.

The activation function for the neuron is assumed to be a step function, which outputs 1 if the weighted sum of
the inputs is greater than or equal to 0, and 0 otherwise.

Input: (0,0)
Neuron 1: z₁ = 1⋅0 + 1⋅0 - 1.5 = -1.5 < 0 ⟹ a₁ = 0
Neuron 2: z₂ = 1⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ a₂ = 0
Output Neuron (Neuron 3): z₃ = (-2)⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ Output = 0. Matches XOR: 0 ⊕ 0 = 0

Input: (0,1)
Neuron 1: z₁ = 1⋅0 + 1⋅1 - 1.5 = -0.5 < 0 ⟹ a1 = 0
Neuron 2: z2 = 1⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z3 = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 0 ⊕ 1 = 1

Input: (1,0)
Neuron 1: z₁ = 1⋅1 + 1⋅0 - 1.5 = -0.5 < 0 ⟹ a1 = 0
Neuron 2: z2 = 1⋅1 + 1⋅0 - 0.5 = +0.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z3 = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 1 ⊕ 0 = 1

Input: (1,1)
Neuron 1: z1 = 1⋅1 + 1⋅1 - 1.5 = +0.5 ≥ 0 ⟹ a1 = 1
Neuron 2: z2 = 1⋅1 + 1⋅1 - 0.5 = +1.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z₃ = (-2)⋅1 + 1⋅1 - 0.5 = -1.5 < 0 ⟹ Output = 0. Matches XOR: 1 ⊕ 1 = 0

The bottom hidden neuron has an excitatory (positive) connection to the output neuron, whereas the top
hidden neuron has an inhibitory (negative) connection to the output neuron. When both hidden neurons are off,
which occurs when the input pattern is (0,0), the output neuron remains off. When both hidden neurons are on,
which occurs when the input pattern is (1,1), the output neuron is switched off again because the inhibitory
effect of the larger negative weight connected to the top hidden neuron overpowers the excitatory effect of the
positive weight connected to the bottom hidden neuron. When the top hidden neuron is off and the bottom
hidden neuron is on, which occurs when the input pattern is (0,1) or (1,0), the output neuron is switched on
because of the excitatory effect of the positive weight connected to the bottom hidden neuron.
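
The whole verification above can be condensed into a few lines of Python using the given weights and a step activation:

def step(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    a1 = step(1*x1 + 1*x2 - 1.5)       # hidden neuron 1
    a2 = step(1*x1 + 1*x2 - 0.5)       # hidden neuron 2
    return step(-2*a1 + 1*a2 - 0.5)    # output neuron 3

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # prints 0, 1, 1, 0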

Applications of ANN to solve Real Life Problem


1. Artificial Neural Networks in Healthcare
Disease Prediction
Artificial Neural Networks have revolutionized disease prediction by identifying patterns in patient data that
might go unnoticed by traditional statistical methods.
Examples of ANN in Disease Prediction:
1.​ Cardiovascular Disease Prediction
●​ ANNs analyze multiple risk factors simultaneously (blood pressure, cholesterol levels, age,
family history)
●​ Studies show accuracy rates of 85-95% in predicting heart disease risk
●​ Deep learning models can incorporate time-series data to predict acute cardiac events
2.​ Cancer Detection and Classification
●​ Convolutional Neural Networks (CNNs) identify malignant patterns in imaging data
●​ Particularly successful in breast cancer detection from mammograms and skin cancer
identification from dermatological images
●​ Research shows some AI systems matching or exceeding dermatologist accuracy in melanoma
detection
3.​ Diabetes Risk Assessment
●​ ANNs predict diabetes onset by analyzing blood glucose patterns, BMI, age, and other
biomarkers
●​ Recurrent Neural Networks (RNNs) can track changes over time to predict progression from
pre-diabetes to diabetes
Case Study: Diabetic Retinopathy Detection
Google's DeepMind developed a system using CNNs to identify diabetic retinopathy from retinal scans.
The system achieved over 90% accuracy, comparable to human ophthalmologists, potentially allowing
earlier intervention in areas with limited specialist access.

2. Artificial Neural Networks in Finance
Artificial Neural Networks have transformed financial forecasting by identifying complex patterns in market data
that traditional statistical methods might miss.
Examples of ANN in Stock Market Prediction:
1.​ Price Movement Forecasting
●​ Time-series analysis using Recurrent Neural Networks (RNNs) and Long Short-Term Memory
(LSTM) networks to predict short-term price movements
●​ Models incorporate technical indicators, historical prices, and trading volumes
●​ Performance typically exceeds traditional time-series forecasting methods like ARIMA
2.​ Sentiment Analysis for Market Prediction
●​ Natural Language Processing (NLP) combined with neural networks analyzes news articles,
social media, and financial reports
●​ Models quantify market sentiment as an additional predictive feature
●​ Helps capture market reactions to breaking news and events
3.​ Portfolio Optimization
●​ Deep Reinforcement Learning (DRL) models dynamically adjust portfolio allocations
●​ Neural networks optimize for risk-adjusted returns across various market conditions
●​ Can incorporate multiple objectives like volatility minimization and return maximization

JPMorgan developed the LOXM (Limit Order Execution) system using deep learning to execute equity trades
at optimal prices. The system analyzes market conditions and historical patterns to minimize market impact
while achieving best execution prices, outperforming human traders in many scenarios.

3. ANN in Image Recognition


Image recognition is one of the most successful and widely implemented applications of artificial neural
networks, with transformative impacts across numerous industries.

1.​ Medical Imaging Analysis


●​ Detection of tumors, fractures, and anomalies in X-rays, MRIs, and CT scans
●​ Automated classification of skin lesions for cancer detection
●​ Retinal scan analysis for diabetic retinopathy and other eye conditions

2.​ Facial Recognition and Biometrics


●​ Identity verification for security systems
●​ Emotion recognition from facial expressions
●​ Age and gender estimation
●​ Face detection and tracking in photographs and video

3.​ Autonomous Vehicles


●​ Object detection and classification (pedestrians, vehicles, road signs)
●​ Lane detection and road condition analysis
●​ Environmental mapping and navigation
●​ Obstacle avoidance systems
Google Lens uses advanced CNNs to recognize objects in real-time through a smartphone camera. The
system can identify products, landmarks, plants, animals, and text, demonstrating how image recognition can
create intuitive user interfaces for information retrieval.
Another example is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a pivotal moment
in image recognition history. In 2012, AlexNet, a CNN developed by Alex Krizhevsky, Ilya Sutskever, and
Geoffrey Hinton, reduced the error rate from 26% to 15.3%, demonstrating the power of deep CNNs.
Subsequent architectures like VGG, ResNet, and Inception further improved performance, with modern
systems now surpassing human-level accuracy on certain image recognition tasks.
4. Artificial Neural Networks in Speech Recognition
Speech recognition has been revolutionized by artificial neural networks, enabling the voice-activated
assistants and transcription services we now use daily.
Applications of ANNs in Speech Recognition
1.	Voice Assistants
●	Virtual assistants like Siri, Alexa, Google Assistant
●	Command and control systems for smart homes and devices
●	In-car voice control systems

2.​ Transcription Services


●​ Real-time meeting transcription
●​ Medical dictation systems
●​ Legal and court reporting
●​ Accessibility tools for the hearing impaired

3.​ Language Learning and Education


●​ Pronunciation assessment and feedback
●​ Interactive language learning applications
●​ Automated scoring of spoken language tests

4.​ Call Centers and Customer Service


●​ Interactive voice response (IVR) systems
●​ Call routing based on spoken queries
●​ Real-time transcription for customer service quality monitoring
Google's speech recognition system has achieved near-human accuracy using a combination of LSTMs,
CNNs, and more recently, Transformer-based models. Their system processes audio using multiple neural
network layers trained on thousands of hours of speech data and has been deployed across various products
including Google Assistant, automatic YouTube captioning, and Google Translate.

5. ANN in Robotics and Automation


Artificial Neural Networks have revolutionized robotics and automation by enabling machines to perceive their
environment, make decisions, and perform complex tasks with increasing autonomy.

1.​ Manufacturing Automation


●​ Visual inspection and quality control
●​ Anomaly detection in assembly lines
●​ Robot learning from demonstration (LfD)
●​ Collaborative robots (cobots) working alongside humans

2.​ Warehouse and Logistics


●​ Autonomous mobile robots (AMRs) for material transport
●​ Pick-and-place robots for order fulfillment
●​ Optimized routing and resource allocation
●​ Package handling and sorting

3.​ Agriculture
●​ Autonomous harvesting robots
●​ Precision weeding and crop management
●​ Livestock monitoring systems
●​ Soil and crop health analysis

4.​ Autonomous Vehicles


●​ Self-driving cars and trucks
●​ Autonomous delivery robots
●​ Drones for aerial inspection and delivery
●​ Mining and construction equipment automation

NVIDIA has developed Isaac Sim, a robotics simulation platform that uses neural networks to generate
synthetic training data. This enables sim-to-real transfer learning, where robots train in virtual environments
before deploying skills in the physical world.

6. ANN in Natural Language Processing


Natural Language Processing (NLP) has been transformed by artificial neural networks, enabling machines to
understand, generate, and interact with human language in increasingly sophisticated ways.

Key Applications of ANNs in NLP


1.​ Machine Translation
●​ Neural Machine Translation (NMT) systems outperform traditional statistical approaches
●​ End-to-end sequence-to-sequence models with attention mechanisms
●​ Multilingual models capable of translating between dozens of languages
●​ Example systems: Google Translate, DeepL, and Meta's No Language Left Behind

2.​ Text Classification


●​ Sentiment analysis for product reviews and social media monitoring
●​ Topic classification for news articles and documents
●​ Spam detection and content moderation
●​ Intent recognition for conversational systems

3.​ Question Answering


●​ Extractive QA systems locate answers within reference documents
●​ Generative QA systems formulate original answers based on knowledge
●​ Domain-specific systems for customer support and information retrieval
●​ Open-domain QA for general knowledge questions

OpenAI's GPT models (and subsequently similar models like Claude) demonstrated that neural networks
trained on massive text corpora can generate coherent, contextually appropriate text across diverse topics.
These models showcase emergent abilities including complex reasoning, code generation, and creative
writing, highlighting how scale and architecture innovations can produce systems with capabilities beyond their
explicit training objectives.

Build a simple perceptron model to classify linearly separable data using


Python.
First, let us go through the working of the perceptron.
The perceptron is a binary classifier that learns a linear decision boundary. It works as follows:
1.​ Initialize weights and bias (often to zeros or small random values).
2.​ Iterate over the training data:
●​ Compute the prediction for each sample.
linear_output = input * weight + bias
y_pred = apply activation function on linear_output
●​ Update weights and bias if the prediction is wrong.
weight = weight + α * (y_true - y_pred) * x_i
bias = bias + α * (y_true - y_pred)
3.​ Repeat until all samples are correctly classified (or max epochs reached).

Here is the implementation of the perceptron learning rule:


https://colab.research.google.com/drive/1NruWaRyXYERbJDHJ3YHzbUyfFxCNlce1?usp=sharing
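
For quick reference, here is also a minimal self-contained sketch of the same rule, using the OR function as hypothetical linearly separable data:

import numpy as np

def step(z):
    return 1 if z >= 0 else 0

# OR function: linearly separable binary data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

w = np.zeros(2)    # initialize weights
b = 0.0            # initialize bias
alpha = 0.1        # learning rate

for epoch in range(100):
    mistakes = 0
    for xi, target in zip(X, y):
        y_pred = step(xi @ w + b)           # compute the prediction
        update = alpha * (target - y_pred)  # zero when the prediction is correct
        w = w + update * xi                 # adjust weights only on mistakes
        b = b + update
        mistakes += int(update != 0)
    if mistakes == 0:                       # all samples correctly classified
        break

print("weights:", w, "bias:", b)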
Use Python libraries (e.g., TensorFlow/Keras) to train a neural network on the
MNIST dataset for digit recognition
Introduction to tensorflow and Keras
TensorFlow is the powerhouse behind modern machine learning, acting as a versatile framework that handles
the heavy-duty computations required for training and deploying AI models. Developed by Google, it operates
at a low level, managing everything from tensor operations (multidimensional arrays) to GPU acceleration,
making it ideal for both research and production. Think of TensorFlow as the "electric grid" of machine
learning—it provides the essential infrastructure, ensuring energy (data) flows efficiently through complex
circuits (algorithms) to power everything from small devices to massive data centers.

Keras, on the other hand, is the user-friendly interface built on top of TensorFlow, designed to simplify the
process of creating neural networks. Originally an independent library, Keras is now TensorFlow’s official
high-level API, offering intuitive tools to construct models with minimal code. Imagine Keras as the "smart
home system" that lets you control the electric grid with a simple app. Instead of wiring circuits manually
(coding low-level math), you use preconfigured switches (layers like Dense or Conv2D) to build models
effortlessly.

Together, TensorFlow and Keras form a seamless partnership: TensorFlow handles the gritty details of
optimization and hardware acceleration, while Keras provides a clean, modular way to design experiments.
This combo is why they dominate industries—from healthcare (diagnosing diseases) to entertainment (Netflix
recommendations). For students, Keras lowers the barrier to entry, while TensorFlow ensures your skills scale
to real-world challenges.

"If TensorFlow is the engine and gears of a high-performance car, Keras is the steering wheel and
dashboard—giving you control without needing to be a mechanical engineer."

A simple code example to check the tensor flow version


import tensorflow as tf
print(tf.__version__)

Run the code; it prints the installed TensorFlow version (e.g., 2.18.0).

A note about the MNIST (Modified National Institute of Standards and Technology) Database
The MNIST dataset is the quintessential starting point for anyone learning machine learning and computer
vision. It consists of 70,000 handwritten digits (0–9), split into 60,000 training images and 10,000 test images,
each grayscale and sized at 28×28 pixels.

Here are the typical steps for using it:
1.​ Preprocessing: Pixel values (0–255) are scaled to 0–1 (normalization).
2.​ Model Input: Images are flattened into 1D arrays (784 values) for classic neural networks.
3.​ Labels: Each digit comes with a true label (0–9), enabling supervised learning.

The MNIST dataset has become the quintessential starting point for machine learning and computer vision due
to its simplicity, accessibility, and well-structured format. Its small image size (28x28 pixels) and grayscale
format reduce computational complexity, making it ideal for beginners to experiment with algorithms without
needing high-end hardware. The dataset's clean, centered digits and balanced class distribution allow
newcomers to focus on core concepts like data preprocessing, model training, and evaluation metrics without
getting bogged down by noise or class imbalances. Additionally, MNIST's integration into popular libraries like
TensorFlow and PyTorch ensures easy access, enabling rapid prototyping and benchmarking.

Despite its widespread use, MNIST has notable limitations. Its simplicity, while great for beginners, fails to
capture real-world challenges like varying backgrounds, lighting conditions, or distorted handwriting,
leading to inflated accuracy scores (often >99%) that don't translate to practical applications. The dataset's
uniformity also means models trained on MNIST struggle with more complex tasks, exposing a gap between
academic exercises and real-world problems.

https://colab.research.google.com/drive/1WO6Cq2ihoipDjkaq2YhUf2HMGgFXBoRm?usp=sharing
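
A minimal sketch of the full workflow with tf.keras (the architecture and the epoch count here are illustrative choices):

import tensorflow as tf
from tensorflow import keras

# Load MNIST: 60,000 training and 10,000 test images, 28x28 grayscale
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Preprocessing: scale pixel values from 0-255 to 0-1
x_train, x_test = x_train / 255.0, x_test / 255.0

# Flatten each 28x28 image into 784 values and classify into 10 digit classes
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, batch_size=32)

test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")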

**NOTES END HERE; DO NOT READ FURTHER PAGES**

The next topic is the perceptron convergence theorem, but before talking about it, let us discuss a few
important terms.
❖​ A vector is a one-dimensional array of numbers, either representing inputs, weights, or outputs. A
vector can indeed be thought of as a matrix of order n×1, where n is the number of elements in the
vector. A vector is often represented as a column vector, which is a matrix with n rows and 1 column
(n×1). For example, a vector v with 3 elements can be written as:

v = [v1, v2, v3]T

The transpose of a column vector is a row vector, and vice versa.

❖​ A Hyperplane is a geometric entity that separates a space into two distinct parts. In n-dimensional
space, a hyperplane is an (n−1)-dimensional subspace. The general equation of a hyperplane in
n-dimensional space is:

w1​x1​+ w2​x2 ​+ ⋯ + wn​xn​+ b = 0

Where:
●​ w1, w2, …, wn are the weights
●​ x1, x2, …, xn are the input features
●​ b is the bias term.

A.​In one-dimensional space, a hyperplane is simply a point. For example, on a number line, a point x =
c can separate the space into two regions: x < c and x > c.
B.​In two-dimensional space, a hyperplane is a line.
C.​In three-dimensional space, a hyperplane is a plane.

❖​ What is Norm notation ||W|| and Euclidean norm ||W||2


A norm is a function that assigns a non-negative scalar value to a vector. It measures the "size" or
"length" of the vector. The general form of the Lp norm for a vector v = [v1, v2, …, vd]T is defined as:

∥v∥p = (|v1|^p + |v2|^p + ⋯ + |vd|^p)^(1/p)

where p is a positive integer (e.g., p = 1, 2, …, ∞) and ∥v∥p is the Lp norm of v, which is a scalar value.

The specific type of norm depends on the subscript or context. For example:
||W||1​: L1 norm (sum of absolute values).
||W||2​: L2 norm (Euclidean norm).
||W||p​: Lp norm (generalized norm).

Euclidean Norm (L2 Norm): ∥W∥2

The Euclidean norm (or L2 norm) is the most commonly used norm. For a vector W = [w1, w2, …, wn] the
Euclidean norm is defined as:

∥W∥2 = √(w1² + w2² + ⋯ + wn²)

This represents the "straight-line distance" from the origin to the point defined by the vector W in
Euclidean space.

❖​ The Cauchy-Schwarz Inequality is a fundamental inequality in mathematics that establishes a


relationship between the inner product of two vectors and their norms. It's important in various fields
including linear algebra, analysis, probability theory, and machine learning. For two vectors u and v in
an inner product space, the inequality states:

|⟨u, v⟩| ≤ ||u||2 · ||v||2 (The same can be written as |⟨u, v⟩| ≤ ||u|| · ||v|| if euclidean norm is mentioned
explicitly for norm)

Where:
⟨u, v⟩ is the inner product of vectors u and v, and |⟨u, v⟩| its absolute value
||u||2 and ||v||2 represent the norms of the vectors u and v respectively

See the demonstration with a simple example using vectors:


Example:
Let's take two vectors:
u = (1, 2, 3)
v = (4, 5, 6)

Step 1: Calculate the inner product ⟨u, v⟩


Inner product = u₁v₁ + u₂v₂ + u₃v₃ = (1×4) + (2×5) + (3×6) = 4 + 10 + 18 = 32

Step 2: Calculate the norms of both vectors


||u|| = √(1² + 2² + 3²) = √(1 + 4 + 9) = √14 ≈ 3.742
||v|| = √(4² + 5² + 6²) = √(16 + 25 + 36) = √77 ≈ 8.775

Step 3: Verify the inequality


Left side: |⟨u, v⟩| = |32| = 32
Right side: ||u|| · ||v|| ≈ 3.742 × 8.775 ≈ 32.84

Therefore: 32 ≤ 32.84
The inequality holds, as expected. Note that the values aren't exactly equal, which tells us that these
two vectors aren't scalar multiples of each other.

Note: The equality holds if and only if one vector is a scalar multiple of the other, meaning they are
linearly dependent. When vectors are linearly dependent, the angle between them is either 0° (same
direction) or 180° (opposite direction).
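
The same check can be run in Python with NumPy:

import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

inner = abs(np.dot(u, v))                      # |<u, v>| = 32
bound = np.linalg.norm(u) * np.linalg.norm(v)  # ||u||2 * ||v||2 ≈ 32.83

print(inner, round(bound, 2), inner <= bound)  # 32 32.83 True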

❖​ General Strategy for Tightening Inequalities: The idea of eliminating redundant terms to tighten an
inequality is based on the following points:
1.​ Redundancy: If a term in an inequality is already included in another term (e.g., as part of a
sum), explicitly writing it separately does not provide additional information.
2.​ Monotonicity of Inequalities: If A ≤ B + C and C is already included in B (i.e. B = C + D), then A ≤
B + C can be rewritten as A ≤ B, which is a tighter bound.
The principle given above is useful when applying Cauchy–Schwarz inequality in the Perceptron
Convergence Theorem, where we try to remove unnecessary terms to tighten the inequalities.

The perceptron convergence theorem


Instruction: Multiple fonts are used here, so to avoid confusion between capital and small letters, the
capital-lettered characters are marked in bold.

To derive the error-correction learning algorithm for the perceptron, we find it more convenient to work with the
modified signal-flow graph model in figure below

The only difference here is that the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1. We
may thus define the (m + 1)-by-1 input vector

X(n) = [+1, x1(n), x2(n), ..., xm(n)]T

The T in the superscript stands for the transpose operation. The n denotes the time-step when the algorithm is
applied. A time-step (denoted by n) represents a specific iteration or update in the algorithm. It is the point at
which the algorithm processes a data point, updates its parameters (e.g., weights), and moves closer to finding
a solution.

Similarly, we define the (m + 1)-by-1 weight vector as

W(n) = [b(n), w1(n), w2(n), ..., wm(n)]T

The transposed weight vector WT(n) will be

WT(n) = [b(n), w1(n), w2(n), ..., wm(n)]

Accordingly, the linear combiner output is written in the compact form

v(n) = Σ (i = 0 to m) wi(n) xi(n) = WT(n)X(n)

i.e.

v(n) = [b(n), w1(n), w2(n), ..., wm(n)] X(n)

The dot product is computed as:


v(n)=WT(n).X(n)=b⋅(+1)+w1​(n)⋅x1​(n)+w2​(n)⋅x2​(n)+⋯+wm​(n)⋅xm​(n)
This is equivalent to the linear combiner output of the Perceptron.

In the summation, w0(n), corresponding to i = 0, represents the bias b(n). For fixed n, the equation WTX = 0, plotted in an m-dimensional space (and for some prescribed bias) with coordinates x1, x2, ..., xm, defines a hyperplane as the decision surface between two different classes of inputs.

Suppose then that the input variables of the perceptron originate from two linearly separable classes. Let H1 be
the subspace of training vectors X1(1), X1(2), ... that belong to class C1, and let H2 be the subspace of training
vectors X2(1), X2(2), ... that belong to class C2. The union of H1 and H2 is the complete space denoted by H.
Given the sets of vectors H1 and H2 to train the classifier, the training process involves the adjustment of the
weight vector W in such a way that the two classes C1 and C2 are linearly separable. That is, there exists a
weight vector W such that we may state

WTX > 0 for every input vector X belonging to class C1
WTX ≤ 0 for every input vector X belonging to class C2 ---------- (4)
Note:
If we used strict inequalities for both classes:
●​ wTx > 0 for C1.
●​ wTx < 0 for C2​.

This would leave input vectors with wTx=0 unclassified, which is undesirable. The Perceptron algorithm needs to classify
all input vectors, so it uses:
●​ wTx > 0 for C1​.
●​ wTx ≤ 0 for C2​.
This ensures that every input vector is assigned to one of the two classes.

Given the subsets of training vectors H1 and H2, the training problem for the perceptron is then to find a weight
vector w such that the two inequalities of above statements are satisfied.

The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:
1.​ If the nth member of the training set, X(n), is correctly classified by the weight vector W(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron, in accordance with the rule:

W(n + 1) = W(n) if WT(n)X(n) > 0 and X(n) belongs to class C1
W(n + 1) = W(n) if WT(n)X(n) ≤ 0 and X(n) belongs to class C2 ---------- (5)

2.​ Otherwise, the weight vector of the perceptron is updated in accordance with the rule

W(n + 1) = W(n) − η(n)X(n) if WT(n)X(n) > 0 and X(n) belongs to class C2
W(n + 1) = W(n) + η(n)X(n) if WT(n)X(n) ≤ 0 and X(n) belongs to class C1 ---------- (6)

Note: The learning rate is denoted by η (the Greek letter eta). It controls the amount of weight adjustment at each step of training. The learning rate, ranging from 0 to 1, determines the rate of learning at each time step and plays a significant role in how fast or slow a neural network learns: if the learning rate is low, the network learns slowly; if it is high, the network learns quickly.

We are using the fixed-increment adaptation rule for the perceptron, in which η is kept fixed, i.e., it is a constant independent of the iteration number n. This means the learning rate does not change over time.
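As a rough illustration of the fixed-increment rule with η = 1, here is a minimal Python sketch; the function name, the AND-style toy data, and the epoch limit are illustrative choices, not part of the theorem itself:

```python
import numpy as np

def train_perceptron(X, labels, max_epochs=100):
    """Fixed-increment perceptron rule (eta = 1, W(0) = 0).

    X      : array of shape (N, m) -- input vectors
    labels : array of +1 (class C1) or -1 (class C2)
    """
    # Prepend the fixed input +1 so the bias is just another weight.
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X.shape[1])                 # W(0) = 0

    for _ in range(max_epochs):
        errors = 0
        for x, d in zip(X, labels):
            v = np.dot(w, x)                 # linear combiner output v(n)
            if d == +1 and v <= 0:           # C1 vector misclassified
                w = w + x                    # W(n+1) = W(n) + X(n)
                errors += 1
            elif d == -1 and v > 0:          # C2 vector misclassified
                w = w - x                    # W(n+1) = W(n) - X(n)
                errors += 1
        if errors == 0:                      # every vector classified: converged
            break
    return w

# Toy linearly separable data (AND-like), purely illustrative:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([-1, -1, -1, +1])
print(train_perceptron(X, labels))
```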

Proof of the perceptron convergence algorithm is presented for the initial condition W(0) = 0. Suppose that
WT(n)X(n) < 0 for n = 1, 2, ..., and the input vector X(n) belongs to the subset H1. That is, the perceptron
incorrectly classifies the vectors X(1), X(2) ..., since the first condition of equation (4) is violated. Then, with the
constant η(n) = 1, we may use the second line of equation (6) to write
W(n + 1) = W(n) + X(n) for X(n) belonging to class C1 ---------- (7)
Note: - The update rule W(n+1) = W(n) + X(n) means that the weight vector at the next iteration, W(n+1), is computed from the current input X(n) and the current weight vector W(n). This ensures that the algorithm processes each input X(n) sequentially and updates the weights accordingly. The variable n is used in two different ways in the perceptron algorithm:
-​ For the weights, the index starts at 0: W(0) is either the zero vector or carried over from previous data.
-​ For the inputs, the index runs from 1 up to the total number of inputs.

For iteration n = 0, the weight vector is initialized as W(0) = 0.

For iteration n = 1, the algorithm uses W(0) to classify X(1); it then updates the weight vector based on X(1):
W(1) = W(0) + X(1) [because W(0) = 0]
hence W(1) = X(1)

For iteration n = 2, the algorithm uses W(1) to classify X(2); it then updates the weight vector based on X(2):
W(2) = W(1) + X(2) [because W(1) = X(1)]
hence W(2) = X(1) + X(2)

For iteration n = 3, the algorithm uses W(2) to classify X(3); it then updates the weight vector based on X(3):
W(3) = W(2) + X(3) [because W(2) = X(1) + X(2)]
hence W(3) = X(1) + X(2) + X(3)

From the above calculation, we can say that for W(0) = 0, we may iteratively solve this equation for W(n + 1), obtaining the result
W(n + 1) = X(1) + X(2) + . . . + X(n) - - - - - - - - - - (8)
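A small Python sketch of this cumulative behaviour, using three made-up inputs and forcing an update at every step:

```python
import numpy as np

# Three hypothetical inputs from class C1, each treated as "misclassified"
# so that an update W(n+1) = W(n) + X(n) happens at every step (equation 7).
X = [np.array([1.0, 0.5]), np.array([0.5, 1.0]), np.array([1.0, 1.0])]

w = np.zeros(2)          # W(0) = 0
for x in X:
    w = w + x            # equation (7)

print(w)                 # [2.5 2.5]
print(sum(X))            # same result: X(1) + X(2) + X(3), i.e. equation (8)
```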

Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution Wo for which WoTX(n) > 0 for the vectors X(1), ..., X(n) belonging to the subset H1. For a fixed solution Wo, we may then define a positive number α as

α = min WoTX(n), the minimum taken over all X(n) belonging to H1 ---------- (9)
Hence, multiplying both sides of Eq (8) by the row vector WoT, we get
WoTW(n + 1) = WoTX(1) + WoTX(2) + WoTX(3) + … + WoTX(n)

From equation (9), each term WoTX(k) on the right-hand side is at least α, so we can say that

WoTW(n + 1) ≥ nα ---------- (10)

Next we make use of an inequality known as the Cauchy–Schwarz inequality. Given the two vectors Wo and W(n + 1), the Cauchy–Schwarz inequality states that

||Wo||²||W(n + 1)||² ≥ [WoTW(n + 1)]²

Here ||Wo||² is the squared Euclidean norm; strictly it should be written ||Wo||₂², but for the sake of simplicity we write it ||Wo||².

Because ||Wo||²||W(n + 1)||² ≥ [WoTW(n + 1)]² and, from equation (10), WoTW(n + 1) ≥ nα > 0, squaring the latter gives [WoTW(n + 1)]² ≥ n²α², so

||Wo||²||W(n + 1)||² ≥ n²α² ---------- (11)

or, equivalently,

||W(n + 1)||² ≥ n²α² / ||Wo||² ---------- (12)
We next follow another development route.


W(k + 1) = W(k) + X(k) for k = 1, ..., n and X(k) ∈ H1 ---------- (13)

By taking the squared Euclidean norm of both sides of equation (13), we obtain

||W(k + 1)||² = ||W(k)||² + ||X(k)||² + 2WT(k)X(k) ---------- (14)

But WT(k)X(k) ≤ 0, since the update is made only when X(k) ∈ H1 is misclassified. We therefore deduce from equation (14) that

||W(k + 1)||² ≤ ||W(k)||² + ||X(k)||²

or equivalently
||W(k + 1)||² − ||W(k)||² ≤ ||X(k)||² ---------- (15)

Summing the inequalities (15) for k = 1, …, n gives:

Σ (from k = 1 to n) [ ||W(k + 1)||² − ||W(k)||² ] ≤ Σ (from k = 1 to n) ||X(k)||²

The left-hand side is a telescoping sum, meaning most terms cancel out:

[||W(2)||² − ||W(1)||²] + [||W(3)||² − ||W(2)||²] + … + [||W(n + 1)||² − ||W(n)||²]

After cancellation, this simplifies to:

||W(n + 1)||² − ||W(1)||² ≤ Σ (from k = 1 to n) ||X(k)||²

However, since W(0) = 0, we can say W(1) = X(1), so:

||W(1)||² = ||X(1)||²

So we will get

||W(n + 1)||² − ||X(1)||² ≤ Σ (from k = 1 to n) ||X(k)||²

Rearranging:

||W(n + 1)||² ≤ ||X(1)||² + Σ (from k = 1 to n) ||X(k)||²

Since the extra term ||X(1)||² is already counted inside the sum, applying the General Strategy for tightening inequalities lets us drop it and write

||W(n + 1)||² ≤ Σ (from k = 1 to n) ||X(k)||² ≤ nβ ---------- (16)

where β is a positive number defined by

β = max ||X(k)||², the maximum taken over all X(k) belonging to H1
Let us combine the results of equation (12) and equation (16):

n²α² / ||Wo||² ≤ ||W(n + 1)||² ≤ nβ

Here the lower bound comes from equation (12) and the upper bound from equation (16).
The upper bound in equation (16) is in conflict with the lower bound in equation (12) for sufficiently large values of n, because
1.​ the upper bound nβ grows linearly with n;
2.​ the lower bound n²α²/||Wo||² grows quadratically with n.
For sufficiently large n, the quadratic term will eventually exceed the linear term, which would violate the upper bound. This is why the two inequalities are in conflict for large n.

The Perceptron algorithm is guaranteed to converge (i.e., find a solution) after a finite number of updates (nmax) if the data is linearly separable. This means that the inequalities are only relevant for n ≤ nmax, where nmax is the maximum number of updates required for convergence.

Solving for nmax, given a solution vector Wo, we find it as the value of n at which the two bounds meet, i.e., nmax²α²/||Wo||² = nmaxβ, which gives

nmax = β||Wo||² / α²
We have thus proved that for η(n) = 1 for all n and W(0) = 0, and given that a solution vector Wo exists, the rule
for adapting the synaptic weights of the perceptron must terminate after at most nmax iterations. We may now
state the fixed-increment convergence theorem for the perceptron as follows

Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that
W(n0) = W(n0 + 1) = W(n0 + 2) = ….
is a solution vector for n0 ≤ nmax.

The Perceptron Convergence Algorithm guarantees that if the data is linearly separable, the algorithm will find a solution (i.e., a weight vector that correctly classifies all training examples) in a finite number of iterations. The goal is indeed to determine the value of n0 (or nmax) such that the algorithm will surely converge within n0 iterations.
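A small Python sketch of how the bound nmax = β||Wo||²/α² could be evaluated numerically; the class-C1 training vectors (with the bias input +1 prepended) and the solution vector Wo are made-up examples assumed to satisfy WoTX > 0:

```python
import numpy as np

# Hypothetical class-C1 training vectors, bias input +1 prepended.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 3.0, 1.5]])
# A hypothetical solution vector with Wo^T x > 0 for every row of X.
Wo = np.array([-0.5, 1.0, 1.0])

alpha = min(X @ Wo)                       # alpha = min Wo^T X(n)   (eq. 9)
beta  = max(np.sum(X**2, axis=1))         # beta  = max ||X(k)||^2
n_max = beta * np.dot(Wo, Wo) / alpha**2  # upper bound on updates

print(alpha, beta, n_max)                 # 2.5 12.25 4.41
```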

Back Propagation Network
A note about partial derivative and gradient vector:
A partial derivative measures how a function changes when you vary only one variable, while keeping all
other variables constant.

Consider a function of two variables:

f(x, y) = x² + 3y

The function depends on both x and y.


●​ A partial derivative with respect to x means we treat y as a constant and differentiate only with respect
to x.
●​ Similarly, a partial derivative with respect to y means we treat x as a constant and differentiate only with
respect to y.

Partial derivative with respect to x:

∂f/∂x = 2x

The derivative of x² with respect to x is 2x. Since 3y is treated as a constant, its derivative is 0.

Partial derivative with respect to y:

∂f/∂y = 3

The term x² is treated as a constant, so its derivative is 0. The derivative of 3y with respect to y is 3.

The gradient vector is simply a vector of partial derivatives and points in the direction of the steepest ascent. The gradient vector (denoted as ∇f, pronounced "nabla f") is formed by collecting all partial derivatives:

∇f = (∂f/∂x, ∂f/∂y)

For our function:

∇f = (2x, 3)

The function changes most rapidly in the direction of (2x, 3). If we move in the direction of this gradient, the function f(x, y) increases fastest.
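A quick numerical check of these partial derivatives in Python, using central finite differences; the step size h and the sample point (2.0, 1.0) are arbitrary choices:

```python
# Numerically check the gradient of f(x, y) = x**2 + 3*y at a sample point.
def f(x, y):
    return x**2 + 3*y

def numeric_grad(x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # vary x, hold y constant
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # vary y, hold x constant
    return dfdx, dfdy

print(numeric_grad(2.0, 1.0))   # ~ (4.0, 3.0), matching (2x, 3) at x = 2
```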

Convergence is made faster if a momentum factor is added to the weight-update process. This is generally done in the back propagation network. If momentum is to be used, the weights from one or more previous training patterns must be saved. Momentum allows the net to make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns.
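A minimal Python sketch of one common form of the momentum update, Δw(t) = −η·gradient + α·Δw(t−1); the values of η, α, and the gradients are purely illustrative:

```python
import numpy as np

eta, alpha = 0.1, 0.9            # learning rate and momentum factor (illustrative)

w = np.array([0.5, -0.3])
prev_delta = np.zeros_like(w)    # saved from the previous training pattern

# Two successive (hypothetical) error gradients pointing the same general way:
for grad in [np.array([0.2, -0.1]), np.array([0.25, -0.05])]:
    delta = -eta * grad + alpha * prev_delta   # momentum-augmented step
    w = w + delta
    prev_delta = delta           # must be stored for the next pattern
print(w)
```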

The vigilance parameter is denoted by “ρ”. It is generally used in adaptive resonance theory (ART) networks.
The vigilance parameter is used to control the degree of similarity required for patterns to be assigned to the
same cluster unit. In practice, the vigilance parameter is chosen in the approximate range 0.7 to 1, which is where it does useful work in controlling the number of clusters.
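A minimal Python sketch of an ART1-style vigilance check for binary patterns, assuming the usual match ratio |x AND w| / |x|; the vectors and the value of ρ are illustrative:

```python
import numpy as np

def vigilance_test(x, w, rho=0.8):
    """ART1-style match check: accept the candidate cluster only if the
    fraction of the input preserved by its prototype w meets the vigilance rho.
    x and w are binary (0/1) vectors; the inputs here are illustrative."""
    match = np.sum(np.minimum(x, w)) / np.sum(x)   # |x AND w| / |x|
    return match >= rho

x = np.array([1, 1, 0, 1])      # input pattern
w = np.array([1, 1, 0, 0])      # candidate cluster prototype
print(vigilance_test(x, w))     # 2/3 ~= 0.67 < 0.8 -> False (reset, try another cluster)
```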
