Asc Notes
Artificial Neuron
Consider the schematic view of an artificial neuron, in which a biological neuron has been modeled artificially.
Let us suppose that there are n inputs (such as I1, I2, . . . , In) to a neuron j. The weights connecting the n inputs to the jth neuron are represented by [W] = [W1j, W2j, ..., Wnj]. The function of the summing junction of an artificial neuron is to collect the weighted inputs and sum them up. Thus, it is similar to the function of the combined dendrites and soma. The activation function (also known as the transfer function) performs the task of the axon and synapse. The output of the summing junction may sometimes become equal to zero and, to prevent such a situation, a bias of fixed value bj is added to it. Thus, the input to the transfer function f is determined as

netj = I1W1j + I2W2j + ... + InWnj + bj

The output of the summing function is also called the linear combiner output/induced local field/net input/pre-activation value. The output of the jth neuron, that is Oj, can be obtained as

Oj = f(netj)
Do it yourself
Q.1 For the network shown in Figure below, calculate the net input to the output neuron.
Q.2 For the network shown in Figure below, calculate the net input to the output neuron.
Difference between Artificial Neural Network (ANN) and Biological Neural Network (BNN):
+----------------------------------------------------+----------------------------------------------------+
| Artificial Neural Network (ANN)                    | Biological Neural Network (BNN)                    |
|----------------------------------------------------|----------------------------------------------------|
| Processing speed is fast; cycle time for execution | Slow in processing information; cycle time for     |
| is in nanoseconds.                                 | execution is in milliseconds.                      |
| Can perform massive parallel operations            | Can perform massive parallel operations            |
| simultaneously, like the BNN.                      | simultaneously.                                    |
| Size and complexity depend on the chosen           | Size and complexity are greater than the ANN, with |
| application, but it is less complex than the BNN.  | about 10^11 neurons and 10^15 interconnections.    |
| To store new information, old information is       | New information is stored in the interconnections, |
| deleted if there is a shortage of storage.         | and old information is retained at lesser strength.|
| No fault tolerance: corrupted information cannot   | Fault tolerant: can store and retrieve information |
| be processed.                                      | even if an interconnection is disconnected.        |
| The control unit processes the information.        | The chemicals present in the neurons do the        |
|                                                    | processing.                                        |
+----------------------------------------------------+----------------------------------------------------+
Threshold is a set value based upon which the final output of the network may be calculated. The threshold
value is used in the activation function. A comparison is made between the calculated net input and the
threshold to obtain the network output. For each and every application, there is a threshold limit. Consider a
direct current (DC) motor: if its maximum speed is 1500 rpm, then the threshold based on speed is 1500
rpm. If the motor is run at a speed higher than this threshold, it may damage the motor coils. Similarly, in neural
networks, based on the threshold value, the activation functions are defined and the output is calculated. The
activation function using threshold can be defined as
f(net) = 1 if net ≥ θ
       = −1 if net < θ
Where θ is the fixed threshold value.
+----------------------------------------------------+----------------------------------------------------+
| Hidden layer(s)                                    | Output layer                                       |
|----------------------------------------------------|----------------------------------------------------|
| Extract and transform intermediate features from   | Produces the final prediction (e.g., class         |
| the input data; learned features are often         | probabilities, regression values), which is        |
| abstract and hard to interpret.                    | human-readable (e.g., class labels, scalars).      |
| Introduce non-linearity (via activation functions) | Maps the learned features to the target format     |
| to model complex relationships; typically use      | using task-specific activations, e.g., sigmoid for |
| non-linear activations like ReLU and tanh.         | binary and softmax for multi-class classification. |
| In case of error: propagate the error backward.    | In case of error: computes the initial loss        |
|                                                    | (error) gradient.                                  |
+----------------------------------------------------+----------------------------------------------------+
v. Multilayer recurrent network
A processing element output can be directed back to the nodes in a preceding layer, forming a
multilayer recurrent network. Also, in these networks, a processing element output can be directed back
to the processing element itself and to other processing elements in the same layer.
Note: Maxnet is a type of neural network used for competitive learning, specifically to determine the maximum
activation among a set of neurons. It is commonly used in winner-take-all (WTA) networks, where only the
most strongly activated neuron remains active while suppressing others. Each neuron excites itself and inhibits
others using a small inhibitory weight −ε, which is a small negative constant. Over multiple iterations, neurons
with lower activations are suppressed until only the neuron with the highest activation remains. The network
iteratively updates neuron activations until a single neuron dominates while others are completely suppressed.
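To make the update rule concrete, here is a minimal Python sketch of the Maxnet competition (the value ε = 0.15 and the starting activations are illustrative assumptions, not values from the text):

```python
import numpy as np

def maxnet(activations, epsilon=0.15, max_iter=100):
    """Iterate the Maxnet update until a single winner remains.

    Each neuron keeps its own activation (self-excitation with weight 1)
    and is inhibited by the sum of all other activations scaled by the
    small constant epsilon. Negative activations are clipped to zero,
    i.e. those neurons drop out of the competition.
    """
    a = np.asarray(activations, dtype=float)
    for _ in range(max_iter):
        # a_j(new) = f(a_j - epsilon * sum of the other activations)
        a = np.maximum(0.0, a - epsilon * (a.sum() - a))
        if np.count_nonzero(a) <= 1:  # winner-take-all reached
            break
    return a

print(maxnet([0.3, 0.5, 0.7, 0.9]))  # only the neuron that started at 0.9 stays active
```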
The training or learning rules adopted for updating and adjusting the connection weights
The main property of an ANN is its capability to learn. Learning or training is a process by means of which a
neural network adapts itself to a stimulus by making proper parameter adjustments, resulting in the production
of desired response. Broadly, there are two kinds of learning in ANNs:
1. Parameter learning: It updates the connecting weights in a neural net.
2. Structure learning: It focuses on the change in network structure (which includes the number of
processing elements as well as their connection types).
The above two types of learning can be performed simultaneously or separately. Apart from these two
categories of learning, the learning in an ANN can be generally classified into three categories as: supervised
learning; unsupervised learning & reinforcement learning.
1. Supervised Learning
Each input vector requires a corresponding target vector, which represents the desired output. The
input vector along with the target vector is called a training pair. The network here is informed precisely
about what should be emitted as output.
During training, the input vector is presented to the network, which results in an output vector. This
output vector is the actual output vector. Then the actual output vector is compared with the desired
(target) output vector. If there exists a difference between the two output vectors then an error signal is
generated by the network. This error signal is used for adjustment of weights until the actual output
matches the desired (target) output. In this type of training, a supervisor or teacher is required for error
minimization. Hence, the network trained by this method is said to be using supervised training
methodology. In supervised learning, it is assumed that the correct "target" output values are known for
each input pattern.
Key Features:
Requires labelled training data.
Uses loss functions to measure prediction accuracy.
Common algorithms: Neural Networks, Support Vector Machines, Decision Trees
Scenario: You want to predict the price of a house based on its size (in square feet) and other features
like the number of bedrooms, location, and age of the house.
Input Features:
● Size of the house (square feet)
● Number of bedrooms
● Location
● Age of the house
Output: House price (a continuous value)
A regression model, such as Linear Regression, can be used to predict the house price. The model
learns the relationship between the input features and the house price during training. For example, it
might learn that larger houses with more bedrooms in desirable locations tend to have higher prices.
Equation: In simple linear regression with a single feature, the relationship can be represented as Price = w × Size + b; with multiple features, one weight is learned per feature.
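As a sketch of this supervised workflow in code (the feature values and prices below are made-up illustrative data, and location is omitted for brevity):

```python
from sklearn.linear_model import LinearRegression

# Features: [size (sq ft), bedrooms, age (years)]; targets: price in $1000s
X = [[1400, 3, 20], [1600, 3, 15], [1700, 4, 10], [1875, 4, 5]]
y = [245, 312, 279, 308]

model = LinearRegression()
model.fit(X, y)                        # learn weights from (input, target) pairs
print(model.coef_, model.intercept_)   # one learned weight per feature, plus b
print(model.predict([[1500, 3, 12]]))  # predicted price for an unseen house
```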
Scenario: You want to classify animals into different categories based on their features, such as the
number of legs, type of skin covering, and whether they can fly.
Input Features:
● Number of legs
● Type of skin covering (e.g., fur, feathers, scales)
● Ability to fly (yes/no)
Output: Animal category (e.g., mammal, bird, reptile, amphibian)
A classification model, such as Logistic Regression, Decision Trees, or Support Vector Machines, can
be used to classify the animals. The model learns the relationship between the input features and the
animal category during training. For example, it might learn that animals with feathers and the ability to
fly are likely to be birds.
Decision Boundary: The model creates a decision boundary that separates the different classes. For
instance, it might determine that if an animal has feathers and can fly, it should be classified as a bird.
2. Un-supervised Learning
The input vectors of similar type are grouped without the use of training data to specify how a member
of each group looks or to which group a member belongs. In the training process, the network receives
the input patterns and organizes these patterns to form clusters. When a new input pattern is applied,
the neural network gives an output response indicating the class to which the input pattern belongs. If
for an input, a pattern class cannot be found then a new class is generated.
It is clear that there is no feedback from the environment to inform what the outputs should be or
whether the outputs are correct. In this case, the network must itself discover patterns, regularities,
features or categories from the input data and relations for the input data over the output. While
discovering all these features, the network undergoes change in its parameters. This process is called
self-organizing in which exact clusters will be formed by discovering similarities and dissimilarities
among the objects.
Example: Clustering, Anomaly detection
The two popular learning algorithms are self-organizing maps (SOMs) and k-means clustering.
Topological preservation refers to the ability of the Kohonen Self-Organizing Map (KSOM) to
maintain the spatial relationships between input data points when mapping them onto a
lower-dimensional space (typically a 1D or 2D grid).
● Similar input vectors should be mapped to neighboring neurons in the output map.
● The network should retain the structure of the input data after training.
To depict this, a typical network structure where each component of the input vector x is connected to
each of the nodes is shown in figure below
On the other hand, if the input vector is two-dimensional, the inputs, say x(a, b), can arrange
themselves in a two-dimensional array defining the input space (a, b) as in Figure below; Here, the two
layers are fully connected
The architecture consists of two layers: input layer and output layer (cluster). There are “n” units in the
input layer and “m” units in the output layer. Basically, here the winner unit is identified by using either
dot product or Euclidean distance method and the weight update using Kohonen learning rules is
performed over the winning cluster unit. At the time of self-organization, the weight vector of the cluster
unit which matches the input pattern very closely is chosen as the winner unit. The closeness of the
weight vector of the cluster unit to the input pattern may be based on the square of the minimum
Euclidean distance. The weights are updated for the winning unit and its neighboring units.
An Example
Construct a Kohonen self-organizing map to cluster the four given vectors, [0 0 1 1], [1 0 0 0], [0 1 1 0]
and [0 0 0 1]. The number of clusters to be formed is two. Assume an initial learning rate of 0.5.
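A minimal sketch of the winner-take-all Kohonen update for this problem (the initial weight matrix from the original figure is not reproduced, so random initial weights are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]], dtype=float)
W = rng.random((2, 4))   # one weight vector per cluster unit (assumed initial values)
alpha = 0.5              # initial learning rate from the problem statement

for epoch in range(2):
    for x in X:
        d = ((W - x) ** 2).sum(axis=1)   # squared Euclidean distance to each unit
        j = int(np.argmin(d))            # winning cluster unit
        W[j] += alpha * (x - W[j])       # Kohonen update for the winner only
    alpha *= 0.5                         # decay the learning rate after each epoch

print(np.round(W, 3))  # each row is one cluster prototype
```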
Do it yourself
Consider a Kohonen self-organizing net with two cluster units and five input units. The weight vectors
for the cluster units are given by
w1 = [1.0 0.9 0.7 0.5 0.3]
w2 = [0.3 0.5 0.7 0.9 1.0]
Use the square of the Euclidean distance to find the winning cluster unit for the input pattern
x = [0.0 0.5 1.0 0.5 0.0]. Using a learning rate of 0.25, find the new weights for the winning unit.
Applications of SOMs
● Data Clustering: Identifying patterns in customer behavior, genetics, and more.
● Anomaly Detection: Detecting fraud or unusual patterns in financial transactions.
● Feature Extraction: Reducing data dimensions for visualization and analysis.
● Image Recognition: Organizing images based on similarities.
K-means clustering
Clustering is the task of grouping similar data points together based on their features.
K-means clustering is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group of similar properties. The goal is to
maximize intra-cluster similarity and minimize inter-cluster similarity. Intra-cluster similarity means
elements in the same cluster should be close to one another, i.e., the Euclidean distance between them
should be as small as possible; inter-cluster similarity means the Euclidean distance between the
centroids of two clusters should be as large as possible, i.e., no element should be shared between two clusters.
The number of clusters is represented using letter K. This algorithm discovers patterns without prior
knowledge of groups i.e. it falls under the category of unsupervised learning.
Here are the steps of K-means clustering:
1. Choose k: Select the number of clusters (e.g., k = 2).
2. Initialize Centroids: Randomly pick k data points as initial centroids.
3. Assign Clusters: Assign each point to its nearest centroid (e.g., by Euclidean distance).
4. Update Centroids: Recompute centroids as the mean of all points in the cluster.
5. Repeat steps 3 & 4: Reassign points and update centroids until convergence (no further changes).
The step of computing the centroid and assigning all the points to the cluster based on their distance
from the centroid is a single iteration. There are essentially three stopping criteria that can be adopted
to stop the K-means algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached
We have to understand the effect of choosing the value of K. Before this, let us understand the meaning
of inertia, which is the sum of squared distances of points to their centroid (it measures cluster compactness).
● If the value of k is too small, the clusters will be large, resulting in high inertia; this gives poor
insights because distinct groups get merged into the same cluster.
● If the value of k is too large, the clusters will be small, resulting in low inertia; the clusters look
compact and well separated, but natural groups may end up fragmented across several clusters.
The impact of increasing the value of K can be understood like stretching a rubber band: Initial effort
(low k) yields big changes; later effort (high k) barely stretches it further.
An Example
Dataset: 12 Customers with Annual Spending ($1000) and Visits/Year
+-------------------------------------------+
| Customer | Spending ($1000) | Visits/Year |
|----------|------------------|-------------|
| 1 | 5 | 2 |
| 2 | 10 | 4 |
| 3 | 8 | 3 |
| 4 | 50 | 15 |
| 5 | 55 | 18 |
| 6 | 60 | 20 |
| 7 | 100 | 40 |
| 8 | 95 | 35 |
| 9 | 110 | 45 |
| 10 | 120 | 50 |
| 11 | 4 | 1 |
| 12 | 6 | 2 |
+-------------------------------------------+
Iteration-2
For step-03: Assign Clusters; Compute Euclidean distance (squared) for all points:
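The remaining assignment and update steps are easiest to verify with code; here is a sketch using scikit-learn's KMeans on the 12 customers (the k values tried are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# The 12 customers from the table: [spending ($1000), visits/year]
X = np.array([[5, 2], [10, 4], [8, 3], [50, 15], [55, 18], [60, 20],
              [100, 40], [95, 35], [110, 45], [120, 50], [4, 1], [6, 2]])

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia drops as k grows (elbow method)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each customer
print(km.cluster_centers_)   # final centroids after convergence
```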
3. Reinforcement Learning
Reinforcement learning is a form of supervised learning because the network receives some feedback
from its environment. However, the feedback obtained here is only evaluative and not instructive. The
external reinforcement signals are processed in the critic signal generator, and the obtained critic
signals are sent to the ANN for adjustment of weights properly so as to get better critic feedback in
future. The critic signal is like a reward or a penalty. The reinforcement learning is also called learning
with a critic as opposed to learning with a teacher, which indicates supervised learning.
Key Features:
● Trial-and-error learning.
● Uses rewards and penalties as feedback.
+-----------+------------------------+--------------------------+----------------------------+
| Aspect    | Supervised learning    | Unsupervised learning    | Reinforcement learning     |
|-----------|------------------------|--------------------------|----------------------------|
| Objective | Predict output based   | Find structure and       | Maximize cumulative        |
|           | on given input         | patterns in data         | rewards through            |
|           |                        |                          | interactions               |
+-----------+------------------------+--------------------------+----------------------------+
A network is generally trained using either an incremental (also known as a sequential) or a batch mode, the
principles of which are discussed below.
Let us consider the incremental training of an NN using a number of scenarios (say 20), presented one after
another. There is a chance that the optimal network obtained after passing the 20th training scenario
will be quite different from that obtained after using the 1st training scenario.
Note: It is important to mention that incremental training is easier to implement and computationally
faster than the batch mode of training.
1. Identity function: The output here remains the same as the input, f(x) = x. The input layer uses the identity activation function.
Identity function
2. Binary step function: It is defined as
f(x) = 1 if x ≥ θ
     = 0 if x < θ
Where θ represents the threshold value. This function is most widely used in single-layer nets to convert the net input to a binary output (1 or 0).
3. Bipolar step function: It is defined as
f(x) = 1 if x ≥ θ
     = −1 if x < θ
Where θ represents the threshold value. This function is also used in single-layer nets to convert the net input to a bipolar output (+1 or –1).
Bipolar step function
4. Sigmoidal functions: The sigmoidal functions are widely used in back-propagation nets because of
the relationship between the value of the functions at a point and the value of the derivative at that point
which reduces the computational burden during training. Sigmoidal functions are of two types:
a. Binary sigmoid function: It is also termed as logistic sigmoid function or unipolar sigmoid function.
It can be defined as
f(x) = 1 / (1 + e^(−λx))
Where λ is the steepness parameter. For standard binary sigmoid λ = 1. If more than one input is
available in the neuron then use the summed output value for x. Here the range of the sigmoid
function is from 0 to 1.
The derivative of the binary sigmoid is f′(x) = λ f(x) [1 − f(x)]. This derivative is important in neural networks because it is used during back propagation, which is
the process of updating the weights of the network to minimize the error. Here f(x) is the output of
the sigmoid function. 1−f(x) represents the complement of the sigmoid output. The derivative f′(x)
tells us how sensitive the output of the sigmoid function is to changes in its input x. It is maximum
when f(x)=0.5 and decreases as f(x) approaches 0 or 1.
Do it yourself
Q.2 Assume that λ = 1 and x = 0.53; compute f’(x)
[Answer: f(0.53) ≈ 0.629 & f′(0.53) ≈ 0.233]
b. Bipolar sigmoid function: It can be defined as f(x) = (1 − e^(−λx)) / (1 + e^(−λx)), where λ is the
steepness parameter. For the standard bipolar sigmoid, λ = 1. If more than one input is available to the
neuron, use the summed output value for x. Here the range of the sigmoid function is between −1 and +1.
Do it yourself
Q.1 Obtain the output of the neuron Y for the network shown in Figure below using bipolar
sigmoidal activation function.
f′(x) = (λ/2) (1 − [f(x)]²)
Do it yourself
Q.2 Assume that λ = 1 and x = 0.53; compute f’(x)
[Answer: f(0.53)≈0.259 & f′(0.53)≈0.466]
The bipolar sigmoidal function is closely related to the hyperbolic tangent function, which is written as tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
Hyperbolic tangent function
If the network uses binary data, it is better to convert it to bipolar form and use the bipolar sigmoidal
activation function or hyperbolic tangent function.
5. Rectified linear unit (ReLU)/Ramp function: The ramp function is defined as
Ramp function
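The activation functions discussed above, collected in a single NumPy sketch (θ and λ default to the standard values used in the text; the printed values match the Do-it-yourself answers above):

```python
import numpy as np

def binary_step(x, theta=0.0):
    return np.where(x >= theta, 1, 0)

def bipolar_step(x, theta=0.0):
    return np.where(x >= theta, 1, -1)

def binary_sigmoid(x, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * x))          # range (0, 1)

def bipolar_sigmoid(x, lam=1.0):
    return (1.0 - np.exp(-lam * x)) / (1.0 + np.exp(-lam * x))  # range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)

x = 0.53
f = binary_sigmoid(x)
print(f, f * (1 - f))          # ~0.629 and ~0.233, using f'(x) = lam*f(x)(1 - f(x))
g = bipolar_sigmoid(x)
print(g, 0.5 * (1 - g ** 2))   # ~0.259 and ~0.466, using f'(x) = (lam/2)(1 - f(x)^2)
```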
3. Range: When the range of the activation function is finite, gradient-based training methods tend to be
more stable, because pattern presentations significantly affect only limited weights. When the range is
infinite, training is generally more efficient because pattern presentations significantly affect most of the
weights. In the latter case, smaller learning rates are typically necessary.
4. Monotonic: When the activation function is monotonic, the error surface associated with a single-layer
model is guaranteed to be convex.
Other than those discussed above, we also have the following types of neural networks.
The input layer is where the image data is fed into the network. Each pixel in the image is represented as a
value, and these values form the input to the network.
An Example
A 5×5 grayscale image might have pixel values like:
Each value represents intensity (0 = black, 255 = white). Say we apply a filter over the pixel matrix; let it be a
3×3 edge-detection kernel.
We take the top-left 3×3 region from the image and apply the kernel. Extracted 3×3 Region from Image
Second Convolution Operation (Next 3×3 region, sliding right) Now, move the filter one step to the right. New
3×3 Region from Image
Continuing for the other regions: following the same process as the 3×3 window slides over the whole image, we
fill up the feature map. The final feature map output will be
Continuing the example, we apply max-pooling to every 2×2 region with stride = 1 (the stride specifies how far
the window slides at each step).
Region 1 (Top-left 2×2)
Max value: -6
Max value: -6
and so on; the final max-pooled output is
Since all values were the same, the result remains unchanged.
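A sketch of the convolution-plus-pooling pipeline just walked through (the 5×5 image and the 3×3 kernel below are stand-ins, since the original pixel values are in figures not reproduced here):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def max_pool(fmap, size=2, stride=1):
    """Take the max of each size x size region, moving by `stride`."""
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # stand-in 5x5 "grayscale image"
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)    # a common edge-detection kernel
fmap = conv2d_valid(image, kernel)                # 3x3 feature map
print(fmap)
print(max_pool(fmap))                             # 2x2 max-pooled output (stride 1)
```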
After several convolutional and pooling layers, the final output is flattened into a single vector and passed
through one or more fully connected layers. These layers are similar to those in a regular neural network and
are used to combine the features extracted by the previous layers to make final predictions, such as classifying
the image into different categories.
The output layer produces the final output of the network, such as the class scores for classification tasks. The
number of neurons in this layer corresponds to the number of classes the network is trying to predict.
What makes CNN suitable for image processing tasks over ANN?
Convolutional Neural Networks (CNNs) are highly effective for image processing due to their ability to
automatically learn spatial hierarchies of features. Here’s why they work so well:
1. Unlike traditional Artificial Neural Networks (ANNs), CNNs do not require manually extracted features.
CNN’s convolutional layers automatically detect edges, textures, patterns, and complex structures
without human intervention.
2. CNNs use pooling layers (e.g., max pooling) to reduce spatial dimensions while keeping the most
important features. This makes CNNs robust to position changes (i.e., an object can be anywhere in the
image, and CNN can still detect it).
3. Instead of fully connecting each pixel (like ANN), CNNs use small filters (kernels) that slide over the
image. This reduces the number of parameters, making CNNs computationally efficient. A 100×100
image with ANN requires 10,000 neurons, but CNN just needs a few filters to process it.
4. CNNs learn directly from raw pixel data and adjust filters automatically using backpropagation. They do
not require handcrafted features, making them highly adaptable.
In an RNN, information is fed back into the system after each step. Think of it like reading a sentence: when
you're trying to predict the next word, you don't just look at the current word but also need to remember the
words that came before to make an accurate guess. RNNs allow the network to "remember" past information by feeding
the output from one step into the next step. This helps the network understand the context of what has already
happened and make better predictions based on that. For example when predicting the next word in a
sentence the RNN uses the previous words to help decide what word is most likely to come next.
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a hidden state that
maintains information about previous inputs in a sequence. Recurrent units can “remember” information from
prior steps by feeding back their hidden state, allowing them to capture dependencies across time. RNN
unfolding or unrolling is the process of expanding the recurrent structure over time steps. During unfolding
each step of the sequence is represented as a separate layer in a series illustrating how information flows
across each time step.
This unrolling enables “backpropagation through time (BPTT)” which is a learning process where errors are
propagated across time steps to adjust the network’s weights enhancing the RNN’s ability to learn
dependencies within sequential data. RNNs share similarities in input and output structures with other deep
learning architectures but differ significantly in how information flows from input to output. Unlike traditional
deep neural networks, where each dense layer has distinct weight matrices, RNNs use shared weights across
time steps, allowing them to remember information over sequences.
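A minimal sketch of a vanilla recurrent unit's forward pass, showing the shared weights and the hidden state carried across time steps (dimensions and random weights are illustrative):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla recurrent unit over a sequence.

    The same weight matrices are reused at every time step (weight sharing),
    and the hidden state h carries information from previous inputs forward.
    """
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:                                  # one step per sequence element
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # new state depends on old state
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))                    # input -> hidden
W_hh = rng.normal(size=(4, 4))                    # hidden -> hidden (recurrence)
b_h = np.zeros(4)
sequence = [rng.normal(size=3) for _ in range(5)] # five 3-dimensional inputs
for t, h in enumerate(rnn_forward(sequence, W_xh, W_hh, b_h)):
    print(t, np.round(h, 3))
```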
Application areas include Natural Language Processing, Time Series Prediction, Music Generation, and more.
1. Sample random noise.
2. Produce generator output from sampled random noise.
3. Get discriminator "Real" or "Fake" classification for generator output.
4. Calculate loss from discriminator classification.
5. Backpropagate through both the discriminator and generator to obtain gradients.
6. Use gradients to change only the generator weights.
● The discriminator learns to distinguish the generator's fake data from real data. The discriminator
penalizes the generator for producing implausible (i.e., clearly unrealistic) results.
The discriminator data comes from two sources: Real data instances, such as real pictures of people.
The discriminator uses these instances as positive examples during training. Fake data instances
created by the generator. The discriminator uses these instances as negative examples during training.
During discriminator training the generator does not train. Its weights remain constant while it produces
examples for the discriminator to train on. During discriminator training:
1. The discriminator classifies both real data and fake data from the generator.
2. The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a
fake instance as real.
3. The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.
When training begins, the generator produces obviously fake data, and the discriminator quickly learns to tell
that it's fake:
As training progresses, the generator gets closer to producing output that can fool the discriminator:
Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and
fake. It starts to classify fake data as real, and its accuracy decreases.
A GAN can have two loss functions: one for generator training and one for discriminator training. Among
multiple implementations, the common loss function is the minimax loss. The generator tries to minimize the
following function while the discriminator tries to maximize it:

min_G max_D V(D, G) = E_x[log D(x)] + E_z[log(1 − D(G(z)))]
In this function:
❖ D(x) is the discriminator's estimate of the probability that real data instance x is real.
❖ Ex is the expected value over all real data instances.
❖ G(z) is the generator's output when given noise z.
❖ D(G(z)) is the discriminator's estimate of the probability that a fake instance is real.
❖ Ez is the expected value over all random inputs to the generator (in effect, the expected value over all
generated fake instances G(z)).
The generator can't directly affect the log(D(x)) term in the function, so, for the generator, minimizing the loss is
equivalent to minimizing log(1 - D(G(z))).
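A small numeric sketch of this value function (the discriminator outputs below are made-up probabilities, just to show how the two players pull V(D, G) in opposite directions):

```python
import numpy as np

def minimax_value(d_real, d_fake, eps=1e-12):
    """V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    The discriminator tries to make this large (push D(x) -> 1 and
    D(G(z)) -> 0); the generator tries to make it small.
    """
    return (np.mean(np.log(d_real + eps))
            + np.mean(np.log(1.0 - d_fake + eps)))

d_real = np.array([0.90, 0.80, 0.95])  # D(x) on real samples
d_fake = np.array([0.10, 0.20, 0.05])  # D(G(z)) on generated samples
print(minimax_value(d_real, d_fake))                    # confident D: close to 0
print(minimax_value(np.full(3, 0.5), np.full(3, 0.5)))  # fooled D: about -2*log(2)
```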
The architecture of the radial basis function network (RBFN) is shown here:
Architecture of RBF
The architecture consists of two layers whose output nodes form a linear combination of the kernel (or basis)
functions computed by the RBF nodes (hidden layer nodes). The basis function (nonlinearity) in the hidden layer
produces a significant nonzero response only when the input stimulus falls within a small localized region of the
input space. For this reason, the network is also called a localized receptive field network.
The training algorithm describes in detail all the calculations involved in the training process depicted in the
flowchart. The training is started in the hidden layer with an unsupervised learning algorithm. The training is
continued in the output layer with a supervised learning algorithm. Simultaneously, we can apply supervised
learning algorithms to the hidden and output layers for fine-tuning of the network. The training algorithm is
given as follows.
The response of the ith hidden (RBF) unit is typically the Gaussian kernel

φi(x) = exp(−‖x − ci‖² / (2σi²))

Where
x: input vector (e.g., xj1, xj2, …, xjn)
ci: center of the ith RBF unit
σi: width (spread) of the ith RBF unit
‖x − ci‖: Euclidean distance between x and ci
The output node then forms a weighted sum of the hidden-layer responses, e.g.,

y = Σ (i = 1 to m) wim φi(x) + w0

Where
m: the number of hidden layer nodes (RBF functions)
wim: weight connecting the ith hidden unit to the mth output node
w0: bias term (optional)
Step 8: Calculate the error and test for the stopping condition. The stopping condition may be a set number of
epochs or a sufficiently small weight change.
Applications of RBFN:
RBFNs are primarily used for classification tasks, but they can also be applied to regression and function
approximation problems. Some common application areas include:
1. Pattern Recognition: RBFNs are effective in recognizing patterns in data, making them useful in
image and speech recognition.
2. Time Series Prediction: They can be used to predict future values in a time series based on past data.
3. Control Systems: RBFNs are used in adaptive control systems to model and control dynamic systems.
4. Medical Diagnosis: They can assist in diagnosing diseases by classifying medical data.
The inputs from x1 to xn possess excitatory weighted connections and inputs from xn+1 to xn+m possess inhibitory
weighted interconnections. Since the firing of the output neuron is based upon the threshold, the activation
function here is defined as

f(yin) = 1 if yin ≥ Ө
       = 0 if yin < Ө
For inhibition to be absolute, the threshold with the activation function should satisfy the following condition:
Ө > nw - P
In the above equation, P refers to the total contribution from all the inhibitory inputs. The equation applies
when all the inhibitory inputs are active, i.e., the case of weak absolute inhibition.
For strong absolute inhibition, i.e., when only one inhibitory input is active, the equation should be modified
as
Ө > nw - Pmin
Here Pmin refers to the minimum contribution from inhibitory inputs (e.g., the weight of a single inhibitory input).
Do not confuse this with the firing condition, which is that the neuron fires when the net input reaches (equals or exceeds) the threshold:
Ө ≤ nw - P
The output will fire if it receives say “k” or more excitatory inputs but no inhibitory inputs, where
kw ≥ Ө > (k - 1)w
If the neuron receives k excitatory inputs, the net input (kw) will be greater than or equal to the threshold,
causing the neuron to fire. If the neuron receives fewer than k excitatory inputs (k−1), the net input ((k−1)w) will
be less than the threshold, and the neuron will not fire.
The M–P neuron has no particular training algorithm. An analysis has to be performed to determine the values
of the weights and the threshold. Here the weights of the neuron are set along with the threshold to make the
neuron perform a simple logic function. The M-P neurons are used as building blocks on which we can model
any function or phenomenon, which can be represented as a logic function.
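As a sketch of how such an analysis translates into code, here is the M–P firing rule illustrated with the OR function (w = 1 and θ = 1 follow from the kw ≥ Ө > (k − 1)w condition with k = 1; the AND, ANDNOT and XOR cases are left for the exercises below):

```python
def mp_neuron(excitatory, inhibitory, w, theta):
    """McCulloch-Pitts firing rule with absolute inhibition."""
    if any(inhibitory):           # any active inhibitory input vetoes firing
        return 0
    net = w * sum(excitatory)     # all excitatory connections share weight w
    return 1 if net >= theta else 0

# OR with two binary inputs: w = 1, theta = 1, so one active input suffices.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mp_neuron([x1, x2], [], w=1, theta=1))
```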
Do it yourself
Q.1 Implement AND function using McCulloch–Pitts neuron (take binary data).
Q.2 Implement ANDNOT function using McCulloch–Pitts neuron (use binary data representation). In the case
of the ANDNOT function, the response is true if the first input is true and the second input is false. For all other
input variations, the response is false.
Q.3 Implement XOR function using McCulloch–Pitts neuron (use binary data representation).
Hebb Network
Donald Hebb stated in 1949
“When an axon of cell A is near enough to excite cell B, and repeatedly or persistently takes part in firing it,
some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the
cells firing B, is increased.”
According to the Hebb rule, the weight vector is found to increase proportionately to the product of the input
and the learning signal which is equal to the neuron’s output. In Hebb learning, if two interconnected neurons
are ‘on’ simultaneously then the weights associated with these neurons can be increased by the modification
made in their synaptic gap (strength). The weight update in the Hebb rule is given by

wi(new) = wi(old) + η · xi · y

Where
wi: weight of the ith connection between input and output neuron
η: learning rate
xi: input value from the ith input neuron
y: output value from the output neuron
The Hebb rule is better suited for bipolar data than binary data (with binary data, a 0 input or a 0 output produces no weight change at all).
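A short sketch of the Hebb rule in action, using the common textbook task of learning the AND function with bipolar data (taking η = 1 and treating the bias as a third input fixed at 1):

```python
import numpy as np

X = np.array([[ 1,  1, 1],
              [ 1, -1, 1],
              [-1,  1, 1],
              [-1, -1, 1]])          # bipolar inputs x1, x2 plus a bias input of 1
T = np.array([1, -1, -1, -1])        # bipolar AND targets

w = np.zeros(3)                      # [w1, w2, b] all start at zero
for x, t in zip(X, T):
    w += x * t                       # Hebb update: w(new) = w(old) + eta*x*y, eta = 1
    print(w)

# The final weights [2, 2, -2] classify all four patterns correctly:
print(np.sign(X @ w) == T)
```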
Perceptron Networks
Let us understand the linear separability first with an example. Imagine you have a table with a bunch of fruits:
apples and oranges. Your task is to separate the apples from the oranges using a straight stick (like a ruler).
Summarization: Linear separability means you can draw a straight line (or a flat plane in higher dimensions)
to separate two groups of things (like apples and oranges). If you cannot draw such a straight line, the data is
not linearly separable.
The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly
separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with
adjustable synaptic weights and bias.
Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly
separable classes, then the perceptron algorithm converges (i.e., it eventually finds a solution) and positions the
decision surface in the form of a hyperplane between the two classes. The proof of convergence of the
algorithm is known as the perceptron convergence theorem.
The perceptron built around a single neuron is limited to performing pattern classification with only two classes
(hypotheses). By expanding the output (computation) layer of the perceptron to include more than one neuron,
we may correspondingly perform classification with more than two classes.
The goal of the perceptron is to correctly classify the set of externally applied stimuli (i.e. input data) x1, x2 ... xm
into one of two classes C1 and C2. The decision rule for the classification is to assign the point represented by
the inputs x1, x2, ..., xm to class C1 if the perceptron output y is +1 and to class C2 if it is -1.
The synaptic weights of the perceptron are denoted by w1, w2 ...,wm. Correspondingly, the inputs applied to the
perceptron are denoted by x1, x2, ..., xm. The externally applied bias is denoted by b. From the model, we find
that the hard limiter input, or induced local field, of the neuron is

v = w1x1 + w2x2 + ... + wmxm + b
To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions
in the m-dimensional signal space spanned by the m input variables x1, x2, ..., xm. In the simplest form of the
perceptron, there are two decision regions separated by a hyperplane, which is defined by

w1x1 + w2x2 + ... + wmxm + b = 0
Take a look at the figure for the case of two input variables x1 and x2, for which the decision boundary takes the
form of a straight line.
A point (x1, x2) that lies above the boundary line is assigned to class C1, and a point (x1, x2) that lies below the
boundary line is assigned to class C2. Note also that the effect of the bias b is merely to shift the decision
boundary away from the origin. The synaptic weights w1, w2, ...,wm of the perceptron can be adapted on an
iteration-by-iteration basis.
For the perceptron to function properly, the two classes C1 and C2 must be linearly separable. This, in turn,
means that the patterns to be classified must be sufficiently separated from each other to ensure that the
decision surface consists of a hyperplane. This requirement is illustrated in Figure below for the case of a
two-dimensional perceptron. In the (a) part of the figure, the two classes C1 and C2 are sufficiently separated
from each other for us to draw a hyperplane (in this case, a straight line) as the decision boundary. If, however,
the two classes C1 and C2 are allowed to move too close to each other, as in (b) part of the figure, they become
nonlinearly separable, a situation that is beyond the computing capability of the perceptron.
The cost function is the mean squared error (MSE), J = (1/n) Σ (yi − ŷi)², where:
yi: actual value
ŷi: predicted value
n: number of data points.
2. Gradient Vector
A partial derivative measures how a function changes when you vary only one variable, while keeping all
other variables constant.
For example, take f(x, y) = x² + 3y. For ∂f/∂x: the derivative of x² with respect to x is 2x, and since 3y is
treated as a constant, its derivative is 0. For ∂f/∂y: the term x² is treated as a constant, so its derivative is
0, and the derivative of 3y with respect to y is 3.
The gradient vector is simply a vector of partial derivatives and points in the direction of the steepest ascent.
The gradient vector (denoted ∇f, pronounced "nabla f") is formed by collecting all partial derivatives:

∇f = (∂f/∂x, ∂f/∂y) = (2x, 3)

The function changes most rapidly in the direction of (2x, 3). If we move in the direction of this gradient, the
function f(x, y) increases fastest.
This represents the rate of change of the cost function with respect to wi.
3. Chain Rule
The chain rule helps us find the derivative of a function that is composed of two or more functions. In simple
terms, it tells us how to take the derivative of a "function inside a function."
● If y=f(g(x)), then f is the "outer function," and g is the "inner function."
● The chain rule helps us find the derivative of y with respect to x.
In words:
● Take the derivative of the outer function (f) with respect to the inner function (g).
● Multiply it by the derivative of the inner function (g) with respect to x.
An Example
Let's say y = (3x + 2)².
Here
● The outer function is f(g) = g²
● The inner function is g(x) = 3x + 2
Take the derivative of the outer function, df/dg = 2g, and multiply it by the derivative of the inner function,
dg/dx = 3. Substituting g = 3x + 2 gives dy/dx = 2(3x + 2) · 3 = 6(3x + 2).
Note:
In Gradient Descent, we use the chain rule to compute the gradient of the cost function. For example:
● The cost function J(w) depends on the predicted value ŷ.
● The predicted value ŷ depends on the weight w.
We use the chain rule:

∂J/∂w = (∂J/∂ŷ) · (∂ŷ/∂w)

Here the first factor measures how the cost changes with the prediction, and the second how the prediction changes with the weight.
4. Learning Rate (α)
The learning rate (α) affects the convergence of the ANN. It controls the size of the steps taken during
parameter updates. The range of α is typically from 0 to 1.
Step-1: Initialize Weights: Start with random values for the weights (wi), bias (b) and learning rate (α).
Step-2: Compute Gradient: Calculate the gradient of the cost function with respect to each weight and bias: ∂J/∂wi and ∂J/∂b.
Step-3: Update Weights: Adjust the weights in the opposite direction of the gradient: wi = wi − α·∂J/∂wi and b = b − α·∂J/∂b.
Step-4: Repeat: Repeat steps 2 and 3 until one of the stopping criteria is met
● Maximum number of iterations is reached.
● The step size becomes smaller than a predefined tolerance.
An Example
Input: House sizes (x) in square feet
Output: House prices (y) in thousands of dollars. (The four training pairs used in the iterations below are
(x, y) = (1, 2), (2, 4), (3, 6), (4, 8).)
Model: Linear regression model ŷ = wx + b, where w is the weight (slope) and b is the bias (intercept)
Goal: Use Gradient Descent to find the optimal values of w and b that minimize the Mean Squared Error (MSE)
cost function.
Solution
Step-1: Let us initialize w = 0 and b = 0 and α = 0.1
Step-2 & 3:
Iteration-01
Compute the predicted outputs using the formula ŷ = wx + b:
ŷ1 = 0×1 + 0 = 0;  ŷ2 = 0×2 + 0 = 0;  ŷ3 = 0×3 + 0 = 0;  ŷ4 = 0×4 + 0 = 0
Compute the gradient: ∂J/∂w = −15, ∂J/∂b = −5
Update parameters: w = 0 − 0.1×(−15) = 1.5;  b = 0 − 0.1×(−5) = 0.5
Iteration-02
Compute the predicted outputs using the formula ŷ = wx + b:
ŷ1 = 1.5×1 + 0.5 = 2;  ŷ2 = 1.5×2 + 0.5 = 3.5;  ŷ3 = 1.5×3 + 0.5 = 5;  ŷ4 = 1.5×4 + 0.5 = 6.5
Compute the gradient: ∂J/∂w = −2.5, ∂J/∂b = −0.75
Update parameters: w = 1.5 − 0.1×(−2.5) = 1.75;  b = 0.5 − 0.1×(−0.75) = 0.575
Iteration-03
Compute the predicted outputs using the formula ŷ = wx + b:
ŷ1 = 1.75×1 + 0.575 = 2.325;  ŷ2 = 1.75×2 + 0.575 = 4.075
ŷ3 = 1.75×3 + 0.575 = 5.825;  ŷ4 = 1.75×4 + 0.575 = 7.575
Compute the gradient: ∂J/∂w = −0.4375, ∂J/∂b = −0.05
Update parameters: w = 1.75 − 0.1×(−0.4375) = 1.79375;  b = 0.575 − 0.1×(−0.05) = 0.58
● Continue iterating until the changes in w and b become very small (e.g., < 0.001).
● After several iterations, w and b will converge to their optimal values. For this example, they will
approach w = 2, b = 0 and the final model will be ŷ = 2x
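The whole worked example can be reproduced in a few lines (the training pairs and the J = (1/2n)·Σ(ŷ − y)² convention are inferred from the iteration values shown above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b, alpha = 0.0, 0.0, 0.1
for it in range(1, 1001):
    y_hat = w * x + b                  # forward pass
    grad_w = np.mean((y_hat - y) * x)  # dJ/dw for J = (1/2n) * sum((y_hat - y)^2)
    grad_b = np.mean(y_hat - y)        # dJ/db
    w -= alpha * grad_w                # step against the gradient
    b -= alpha * grad_b
    if it <= 3:
        print(it, round(w, 4), round(b, 4))  # 1.5/0.5, 1.75/0.575, 1.7938/0.58

print(round(w, 3), round(b, 3))        # converges toward w = 2, b = 0
```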
Iteration-02
Iteration-03
● Continue iterating until the changes in w and b become very small (e.g., < 0.001).
● After several iterations, w and b will converge to their optimal values. For this example, they will
approach w = 2, b = 0 and the final model will be ŷ = 2x
● After the first four iterations (where you’ve used all four data points), you simply start over from the first
data point and continue the process. This is called cycling through the dataset.
Do it Yourself
Repeat the same question, adjusting only the weight and not the bias.
+----------+----------------------------+---------------------------------+--------------------------------+
| α value  | Behavior                   | Advantages                      | Disadvantages                  |
|----------|----------------------------|---------------------------------|--------------------------------|
| Too Low  | Small steps toward the     | 1. Stable convergence.          | 1. Slow convergence (requires  |
|          | minimum.                   | 2. Less likely to overshoot     |    many iterations).           |
|          |                            |    the minimum.                 | 2. May get stuck in local      |
|          |                            |                                 |    minima or saddle points.    |
| Optimal  | Balanced steps that        | 1. Fast and stable convergence. | 1. Requires tuning to find     |
|          | converge efficiently to    | 2. Efficient use of             |    the right value.            |
|          | the minimum.               |    computational resources.     |                                |
| Too High | Large steps that may       | 1. Faster initial progress.     | 1. Oscillations around the     |
|          | overshoot the minimum.     |                                 |    minimum.                    |
|          |                            |                                 | 2. Risk of divergence (moving  |
|          |                            |                                 |    away from the minimum).     |
+----------+----------------------------+---------------------------------+--------------------------------+
➢ Oscillation occurs when the learning rate is high enough that each update overshoots the minimum, so the
parameters bounce back and forth around it. An example of it is a ball rolling down a hill: if the ball has too
much momentum (high learning rate), it will overshoot the bottom and roll up the other side, then roll back, and so on.
➢ Divergence occurs when the learning rate is so high that the model parameters move away from the
minimum instead of converging toward it. It happens when extremely large steps cause the model to
overshoot the minimum by such a large margin that the cost function increases instead of decreasing. An
example of it is a ball rolling down a hill with so much momentum that it flies off the hill entirely and never
returns.
An MLP is a fully connected feedforward neural network, meaning that each neuron in one layer is connected
to every neuron in the next layer. It uses activation functions such as Sigmoid, or Tanh to introduce
non-linearity, enabling it to learn complex patterns in data.
● Feature Learning
Lower Layers: Detect simple patterns, such as edges, textures, or basic shapes.
Deeper Layers: Detect abstract concepts, such as objects or high-level features.
● Non-Linearity: Activation functions introduce non-linearity into the network. Without it, an MLP would be
equivalent to a linear model, incapable of solving complex problems.
● Representation Learning: Hidden layers transform raw input data into meaningful representations that
make it easier for the output layer to perform classification or regression.
● Capturing Relationships: Hidden layers can capture complex relationships between input features
that are not easily separable in lower-dimensional space.
Applications of MLP
● Classification: Image classification, spam detection, sentiment analysis.
● Regression: Predicting house prices, stock prices, or temperature.
● Pattern Recognition: Handwriting recognition, speech recognition.
● Function Approximation: Approximating complex mathematical functions.
BACK-PROPAGATION NETWORK
A back-propagation neural network is a multilayer, feed-forward neural network consisting of an input layer, a
hidden layer and an output layer. The neurons present in the hidden and output layers have biases, which are
the connections from units whose activation is always 1. The bias terms also act as weights.
The figure above shows the architecture of a BPN, depicting only the direction of information flow for the
feed-forward phase. During the back-propagation phase of learning, signals are sent in the reverse direction.
The inputs are sent to the BPN and the output obtained from the net could be either binary (0, 1) or bipolar (–1,
+1). The activation function could be any function which increases monotonically and is also differentiable.
δk: Error correction weight adjustment for Wjk that is due to an error at output unit yk, which is back-propagated
to the hidden units that feed into unit yk
δj: Error correction weight adjustment for vij that is due to the back-propagation of error to the hidden unit zj.
Also, it should be noted that the commonly used activation functions are binary sigmoidal and bipolar sigmoidal
activation functions. The range of binary sigmoid is from 0 to 1, and for bipolar sigmoid it is from –1 to +1.
These functions are used in the BPN because of the following characteristics
(i) continuity
(ii) differentiability
(iii) nondecreasing monotony
The error back-propagation learning algorithm can be outlined in the following algorithm:
Step 0: Initialize weights and learning rate (take some small random values).
Step 1: Perform Steps 2–9 when stopping condition is false.
Step 2: Perform Steps 3–8 for each training pair.
Calculate the output of the hidden unit by applying its activation function over zinj (binary or bipolar sigmoidal
activation function):
and send the output signal from the hidden unit to the input of output layer units.
Step 5: For each output unit yk (k = 1 to m), calculate the net input:
On the basis of the calculated error correction term, update the change in weights and bias:
Step 7: Each hidden unit (zj, j = 1 to p) sums its delta inputs from the output units:
The term δinj gets multiplied with the derivative of f(zinj) to calculate the error term:
On the basis of the calculated δj, update the change in weights and bias:
Weight and bias updation (Phase III):
Step 8: Each output unit (yk, k = 1 to m) updates the bias and weights:
Step 9: Check for the stopping condition. The stopping condition may be a certain number of epochs reached
or when the actual output equals the target output.
The above algorithm uses the incremental approach for updation of weights, i.e., the weights are changed
immediately after a training pattern is presented (it works like online training). There is
another way of training called batch-mode training, where the weights are changed only after all the training
patterns are presented. The batch-mode training requires additional local storage for each connection to
maintain the immediate weight changes.
The training of a BPN is based on the choice of various parameters. Also, the convergence of the BPN is
based on some important learning factors such as the initial weights, the learning rate, the updation rule, the
size and nature of the training set, and the architecture (number of layers and number of neurons per layer).
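Since the figure's weights are not reproduced, here is a hedged NumPy sketch of one incremental back-propagation step for a 2-2-1 network with the binary sigmoid, following the algorithm above (the initial weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
v  = rng.uniform(-0.5, 0.5, (2, 2))   # input-to-hidden weights v_ij
v0 = rng.uniform(-0.5, 0.5, 2)        # hidden biases
w  = rng.uniform(-0.5, 0.5, 2)        # hidden-to-output weights w_jk
w0 = rng.uniform(-0.5, 0.5)           # output bias
alpha, x, t = 0.25, np.array([0.0, 1.0]), 1.0

f = lambda z: 1.0 / (1.0 + np.exp(-z))    # binary sigmoid; f' = f(1 - f)

# Feed-forward phase (Steps 3-5)
z_in = x @ v + v0; z = f(z_in)            # hidden-layer net input and output
y_in = z @ w + w0; y = f(y_in)            # output-layer net input and output

# Back-propagation of error (Steps 6-7)
delta_k = (t - y) * y * (1 - y)           # output error term
delta_j = delta_k * w * z * (1 - z)       # hidden error terms

# Weight and bias updation (Step 8)
w += alpha * delta_k * z;          w0 += alpha * delta_k
v += alpha * np.outer(x, delta_j); v0 += alpha * delta_j
print(round(float(y), 4), np.round(w, 4), np.round(v, 4))
```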
An Example
Using a back-propagation network, find the new weights for the net shown in Figure below. It is presented with
the input pattern [0, 1] and the target output is 1. Use a learning rate α = 0.25 and the binary sigmoidal
activation function.
Do it yourself
Find the new weights, using a back-propagation network, for the network shown in Figure below. The network is
presented with the input pattern [−1, 1] and the target output is +1. Use a learning rate of α = 0.25 and the
bipolar sigmoidal activation function.
+----------------------and-+---------------------------------------+---------------------------------------+
| Aspect               | Least Mean Square (LMS)               | Backpropagation                       |
|----------------------|---------------------------------------|---------------------------------------|
| Model type           | Simple linear models (e.g., linear    | Complex models (e.g., neural          |
|                      | regression).                          | networks).                            |
| Gradient computation | Approximates the gradient, since it   | Computes the exact gradient using the |
|                      | is computed from a single data point. | chain rule over all data points.      |
| Use cases            | Online learning, real-time systems.   | Deep learning, multi-layer networks.  |
+----------------------+---------------------------------------+---------------------------------------+
XOR Problem
In Rosenblatt’s single-layer perceptron, there are no hidden neurons. Consequently, it cannot classify input patterns that
are not linearly separable. However, nonlinearly separable patterns commonly occur. For example, this situation arises in
the exclusive-OR (XOR) problem, which may be viewed as a special case of a more general problem, namely, that of
classifying points in the unit hypercube (an n-dimensional hypercube has 2^n vertices; here the space is
two-dimensional, so there are 4 vertices; "unit" means each dimension is constrained to values between 0 and 1,
and since the inputs are binary the values are exactly 0 and 1).
However, in the special case of the XOR problem, we need consider only the four corners of a unit square that correspond
to the input patterns (0,0), (0,1), (1,1), and (1,0), where a single bit (i.e., binary digit) changes as we move from one corner
to the next.
0⨁0 = 0 and 1⨁1 = 0, where ⨁ denotes the exclusive-OR boolean function operator. The input patterns (0,0) and (1,1) are at opposite corners of
the unit square, yet they produce the identical output 0. On the other hand, the input patterns (0,1) and (1,0) are also at
opposite corners of the square, but they are in class 1, as shown by
1⨁0=1
and
0⨁1=1
We first recognize that the use of a single neuron with two inputs results in a straight line for a decision boundary in the
input space. For all points on one side of this line, the neuron outputs 1; for all points on the other side of the line, it
outputs 0. The position and orientation of the line in the input space are determined by the synaptic weights of the neuron
connected to the input nodes and the bias applied to the neuron. With the input patterns (0,0) and (1,1) located on opposite
corners of the unit square, and likewise for the other two input patterns (0,1) and (1,0), it is clear that we cannot construct
a straight line for a decision boundary so that (0,0) and (1,1) lie in one decision region and (0,1) and (1,0) lie in the other
decision region. In other words, the single layer perceptron cannot solve the XOR problem.
However, we may solve the XOR problem by using a single hidden layer with two neurons (as in figure below along with
its diagram of signal flow)
Architectural graph of network for solving the XOR problem Signal-flow graph of the network
The slope of the decision boundary constructed by this hidden neuron is equal to −1. Here is the calculation.
For an input pattern (x1, x2), the weighted sum z1 for Neuron 1 is calculated as:
z1 = w11·x1 + w21·x2 + b1
The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z1 is exactly 0:
z1 = 0 ⟹ x1 + x2 − 1.5 = 0
x2 = −x1 + 1.5
This is the equation of the decision boundary line for Neuron 1. It has:
● A slope of −1 (since the coefficient of x1 is −1)
● A y-intercept of 1.5 (when x1 = 0, x2 = 1.5)
The decision boundary is a straight line that passes through the points (0, 1.5) and (1.5, 0), positioned as shown in the figure.
The slope of the decision boundary constructed by the second hidden neuron is also equal to −1. Here is the calculation.
For an input pattern (x1, x2), the weighted sum z2 for Neuron 2 is calculated as:
z2 = w12·x1 + w22·x2 + b2
The decision boundary is the line where the neuron's output transitions from 0 to 1. This occurs when the
weighted sum z2 is exactly 0:
z2 = 0 ⟹ x1 + x2 − 0.5 = 0
x2 = −x1 + 0.5
This is the equation of the decision boundary line for Neuron 2. It has:
● A slope of −1 (since the coefficient of x1 is −1)
● A y-intercept of 0.5 (when x1 = 0, x2 = 0.5)
The orientation and position of the decision boundary constructed by this second hidden neuron are as shown in the figure.
Say the output from neuron-1 is a1 and the output from neuron-2 is a2; the output from neuron-3 is then
z3 = −2·a1 + 1·a2 − 0.5
The function of the output neuron is to construct a linear combination of the decision boundaries formed by the
two hidden neurons. The result of this computation as follow-
The activation function for the neuron is assumed to be a step function, which outputs 1 if the weighted sum of
the inputs is greater than or equal to 0, and 0 otherwise.
Input: (0,0)
Neuron 1: z₁ = 1⋅0 + 1⋅0 - 1.5 = -1.5 < 0 ⟹ a₁ = 0
Neuron 2: z₂ = 1⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ a₂ = 0
Output Neuron (Neuron 3): z₃ = (-2)⋅0 + 1⋅0 - 0.5 = -0.5 < 0 ⟹ Output = 0. Matches XOR: 0 ⊕ 0 = 0
Input: (0,1)
Neuron 1: z₁ = 1⋅0 + 1⋅1 - 1.5 = -0.5 < 0 ⟹ a₁ = 0
Neuron 2: z₂ = 1⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ a₂ = 1
Output Neuron (Neuron 3): z₃ = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 0 ⊕ 1 = 1
Input: (1,0)
Neuron 1: z₁ = 1⋅1 + 1⋅0 - 1.5 = -0.5 < 0 ⟹ a1 = 0
Neuron 2: z2 = 1⋅1 + 1⋅0 - 0.5 = +0.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z3 = (-2)⋅0 + 1⋅1 - 0.5 = +0.5 ≥ 0 ⟹ Output = 1. Matches XOR: 1 ⊕ 0 = 1
Input: (1,1)
Neuron 1: z1 = 1⋅1 + 1⋅1 - 1.5 = +0.5 ≥ 0 ⟹ a1 = 1
Neuron 2: z2 = 1⋅1 + 1⋅1 - 0.5 = +1.5 ≥ 0 ⟹ a2 = 1
Output Neuron (Neuron 3): z₃ = (-2)⋅1 + 1⋅1 - 0.5 = -1.5 < 0 ⟹ Output = 0. Matches XOR: 1 ⊕ 1 = 0
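The whole verification can be reproduced in a few lines, with the weights and thresholds taken directly from the decision boundaries derived above:

```python
def step(z):
    return 1 if z >= 0 else 0        # hard-limiter activation

def xor_net(x1, x2):
    a1 = step(1*x1 + 1*x2 - 1.5)     # hidden neuron 1: boundary x1 + x2 = 1.5
    a2 = step(1*x1 + 1*x2 - 0.5)     # hidden neuron 2: boundary x1 + x2 = 0.5
    return step(-2*a1 + 1*a2 - 0.5)  # output neuron combines the two boundaries

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # reproduces the XOR truth table
```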
The bottom hidden neuron has an excitatory (positive) connection to the output neuron, whereas the top
hidden neuron has an inhibitory (negative) connection to the output neuron. When both hidden neurons are off,
which occurs when the input pattern is (0,0), the output neuron remains off. When both hidden neurons are on,
which occurs when the input pattern is (1,1), the output neuron is switched off again because the inhibitory
effect of the larger negative weight connected to the top hidden neuron overpowers the excitatory effect of the
positive weight connected to the bottom hidden neuron. When the top hidden neuron is off and the bottom
hidden neuron is on, which occurs when the input pattern is (0,1) or (1,0), the output neuron is switched on
because of the excitatory effect of the positive weight connected to the bottom hidden neuron.
2. Artificial Neural Networks in Finance
Artificial Neural Networks have transformed financial forecasting by identifying complex patterns in market data
that traditional statistical methods might miss.
Examples of ANN in Stock Market Prediction:
1. Price Movement Forecasting
● Time-series analysis using Recurrent Neural Networks (RNNs) and Long Short-Term Memory
(LSTM) networks to predict short-term price movements
● Models incorporate technical indicators, historical prices, and trading volumes
● Performance typically exceeds traditional time-series forecasting methods like ARIMA
2. Sentiment Analysis for Market Prediction
● Natural Language Processing (NLP) combined with neural networks analyzes news articles,
social media, and financial reports
● Models quantify market sentiment as an additional predictive feature
● Helps capture market reactions to breaking news and events
3. Portfolio Optimization
● Deep Reinforcement Learning (DRL) models dynamically adjust portfolio allocations
● Neural networks optimize for risk-adjusted returns across various market conditions
● Can incorporate multiple objectives like volatility minimization and return maximization
JPMorgan developed the LOXM (Limit Order Execution) system using deep learning to execute equity trades
at optimal prices. The system analyzes market conditions and historical patterns to minimize market impact
while achieving best execution prices, outperforming human traders in many scenarios.
3. Agriculture
● Autonomous harvesting robots
● Precision weeding and crop management
● Livestock monitoring systems
● Soil and crop health analysis
NVIDIA has developed Isaac Sim, a robotics simulation platform that uses neural networks to generate
synthetic training data. This enables sim-to-real transfer learning, where robots train in virtual environments
before deploying skills in the physical world.
OpenAI's GPT models (and subsequently similar models like Claude) demonstrated that neural networks
trained on massive text corpora can generate coherent, contextually appropriate text across diverse topics.
These models showcase emergent abilities including complex reasoning, code generation, and creative
writing, highlighting how scale and architecture innovations can produce systems with capabilities beyond their
explicit training objectives.
Keras, on the other hand, is the user-friendly interface built on top of TensorFlow, designed to simplify the
process of creating neural networks. Originally an independent library, Keras is now TensorFlow’s official
high-level API, offering intuitive tools to construct models with minimal code. Imagine Keras as the "smart
home system" that lets you control the electric grid with a simple app. Instead of wiring circuits manually
(coding low-level math), you use preconfigured switches (layers like Dense or Conv2D) to build models
effortlessly.
Together, TensorFlow and Keras form a seamless partnership: TensorFlow handles the gritty details of
optimization and hardware acceleration, while Keras provides a clean, modular way to design experiments.
This combo is why they dominate industries—from healthcare (diagnosing diseases) to entertainment (Netflix
recommendations). For students, Keras lowers the barrier to entry, while TensorFlow ensures your skills scale
to real-world challenges.
"If TensorFlow is the engine and gears of a high-performance car, Keras is the steering wheel and
dashboard—giving you control without needing to be a mechanical engineer."
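The code being run is presumably the standard version check, something like:

```python
import tensorflow as tf
print(tf.__version__)   # prints the installed version, e.g. 2.18.0 in Colab
```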
Click on the run button available on the left-hand side of the code; you will get the TensorFlow version, which
is 2.18.0.
A note about the MNIST (Modified National Institute of Standards and Technology) Database
The MNIST dataset is the quintessential starting point for anyone learning machine learning and computer
vision. It consists of 70,000 handwritten digits (0–9), split into 60,000 training images and 10,000 test images,
each grayscale and sized at 28×28 pixels.
Here are the steps to use it:
1. Preprocessing: Pixel values (0–255) are scaled to 0–1 (normalization).
2. Model Input: Images are flattened into 1D arrays (784 values) for classic neural networks.
3. Labels: Each digit comes with a true label (0–9), enabling supervised learning.
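A minimal Keras sketch of these three steps (the layer sizes, optimizer, and epoch count below are illustrative choices, not prescribed by these notes):

import tensorflow as tf
from tensorflow import keras

# Load MNIST: 60,000 training and 10,000 test images, each 28x28 grayscale
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# 1. Preprocessing: scale pixel values from 0-255 down to 0-1
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# 2. Model input: flatten each 28x28 image into a 784-element vector
# 3. Labels: integer digits 0-9, so sparse categorical cross-entropy is used
model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))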
The MNIST dataset has earned this status due to its simplicity, accessibility, and well-structured format. Its small image size (28×28 pixels) and grayscale
format reduce computational complexity, making it ideal for beginners to experiment with algorithms without
needing high-end hardware. The dataset's clean, centered digits and balanced class distribution allow
newcomers to focus on core concepts like data preprocessing, model training, and evaluation metrics without
getting bogged down by noise or class imbalances. Additionally, MNIST's integration into popular libraries like
TensorFlow and PyTorch ensures easy access, enabling rapid prototyping and benchmarking.
Despite its widespread use, MNIST has notable limitations. Its simplicity, while great for beginners, fails to
capture real-world challenges like varying backgrounds, lighting conditions, or distorted handwriting,
leading to inflated accuracy scores (often >99%) that don't translate to practical applications. The dataset's
uniformity also means models trained on MNIST struggle with more complex tasks, exposing a gap between
academic exercises and real-world problems.
https://colab.research.google.com/drive/1WO6Cq2ihoipDjkaq2YhUf2HMGgFXBoRm?usp=sharing
The next topic is the perceptron convergence theorem, but before talking about it, let us discuss a few important terms.
❖ A vector is a one-dimensional array of numbers, representing inputs, weights, or outputs. A vector can indeed be thought of as a matrix of order n×1, where n is the number of elements in the vector. A vector is often represented as a column vector, which is a matrix with n rows and 1 column (n×1). For example, a vector v with 3 elements can be written as:
v = [v1, v2, v3]ᵀ (a 3×1 column vector)
❖ A Hyperplane is a geometric entity that separates a space into two distinct parts. In n-dimensional space, a hyperplane is an (n−1)-dimensional subspace. The general equation of a hyperplane in n-dimensional space is:
w1x1 + w2x2 + . . . + wnxn + b = 0
Where:
● w1, w2, …, wn are the weights
● x1, x2, …, xn are the input features
● b is the bias term.
A. In one-dimensional space, a hyperplane is simply a point. For example, on a number line, a point x = c can separate the space into two regions: x < c and x > c.
B. In two-dimensional space, a hyperplane is a line.
C. In three-dimensional space, a hyperplane is a plane.
❖ The norm ||W|| of a vector W measures its length (magnitude). The specific type of norm depends on the subscript or context. For example:
||W||1: L1 norm (sum of absolute values).
||W||2: L2 norm (Euclidean norm), computed as ||W||2 = √(w1² + w2² + . . . + wn²). This represents the "straight-line distance" from the origin to the point defined by the vector W in Euclidean space.
||W||p: Lp norm (generalized norm).
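For example (an illustrative vector): for W = (3, −4),
||W||1 = |3| + |−4| = 7
||W||2 = √(3² + (−4)²) = √25 = 5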
❖ The Cauchy–Schwarz inequality: for any two vectors u and v,
|⟨u, v⟩| ≤ ||u||2 · ||v||2
(the same can be written as |⟨u, v⟩| ≤ ||u|| · ||v|| if the Euclidean norm is understood)
Where:
⟨u, v⟩ represents the inner product of the vectors u and v (so |⟨u, v⟩| is its absolute value)
||u||2 and ||v||2 represent the Euclidean norms of the vectors u and v respectively
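For example, take u = (1, 2, 3) and v = (4, 5, 6) (these particular vectors are assumed here for illustration; they reproduce the numbers below):
|⟨u, v⟩| = |(1)(4) + (2)(5) + (3)(6)| = |4 + 10 + 18| = 32
||u||2 = √(1² + 2² + 3²) = √14 ≈ 3.742
||v||2 = √(4² + 5² + 6²) = √77 ≈ 8.775
||u||2 · ||v||2 ≈ 3.742 × 8.775 ≈ 32.84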
Therefore: 32 ≤ 32.84
The inequality holds, as expected. Note that the values aren't exactly equal, which tells us that these
two vectors aren't scalar multiples of each other.
Note: Equality holds if and only if one vector is a scalar multiple of the other (meaning they are linearly dependent). When vectors are linearly dependent, the angle between them is either 0° (same direction) or 180° (opposite direction).
❖ General Strategy for Tightening Inequalities: The idea of eliminating redundant terms to tighten an
inequality is based on the following points:
1. Redundancy: If a term in an inequality is already included in another term (e.g., as part of a
sum), explicitly writing it separately does not provide additional information.
2. Monotonicity of Inequalities: If A ≤ B + C and C is already included in B (i.e. B = C + D), then A ≤
B + C can be rewritten as A ≤ B, which is a tighter bound.
The principle given above is useful when applying the Cauchy–Schwarz inequality in the Perceptron Convergence Theorem, where we remove unnecessary terms to tighten the inequalities.
To derive the error-correction learning algorithm for the perceptron, we find it more convenient to work with the modified signal-flow graph model in the figure below.
The only difference here is that the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1. We may thus define the (m + 1)-by-1 input vector
X(n) = [+1, x1(n), x2(n), . . . , xm(n)]ᵀ
and, correspondingly, the (m + 1)-by-1 weight vector
W(n) = [b(n), w1(n), w2(n), . . . , wm(n)]ᵀ
The superscript T stands for the transpose operation, and n denotes the time-step at which the algorithm is applied. A time-step represents a specific iteration or update: the point at which the algorithm processes a data point, updates its parameters (e.g., the weights), and moves closer to finding a solution.
The linear combiner output is then written in the compact form
v(n) = w0(n)x0(n) + w1(n)x1(n) + . . . + wm(n)xm(n)
or
v(n) = Wᵀ(n)X(n)
In the first line, w0(n), corresponding to i = 0, represents the bias b. For fixed n, the equation WᵀX = 0, plotted in an m-dimensional space (and for some prescribed bias) with coordinates x1, x2, ..., xm, defines a hyperplane as the decision surface between two different classes of inputs.
Suppose then that the input variables of the perceptron originate from two linearly separable classes. Let H1 be the subset of training vectors X1(1), X1(2), ... that belong to class C1, and let H2 be the subset of training vectors X2(1), X2(2), ... that belong to class C2. The union of H1 and H2 is the complete space, denoted by H. Given the sets of vectors H1 and H2 to train the classifier, the training process involves the adjustment of the weight vector W in such a way that the two classes C1 and C2 are linearly separable. That is, there exists a weight vector W such that we may state
WᵀX > 0 for every input vector X belonging to class C1
WᵀX ≤ 0 for every input vector X belonging to class C2 ---------- (4)
Note:
If we used strict inequalities for both classes:
● WᵀX > 0 for C1.
● WᵀX < 0 for C2.
This would leave input vectors with WᵀX = 0 unclassified, which is undesirable. The Perceptron algorithm needs to classify all input vectors, so it uses:
● WᵀX > 0 for C1.
● WᵀX ≤ 0 for C2.
This ensures that every input vector is assigned to one of the two classes.
Given the subsets of training vectors H1 and H2, the training problem for the perceptron is then to find a weight vector W such that the two inequalities in (4) are satisfied.
The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:
1. If the nth member of the training set, X(n), is correctly classified by the weight vector W(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron, in accordance with the rule:
W(n + 1) = W(n) if Wᵀ(n)X(n) > 0 and X(n) belongs to class C1
W(n + 1) = W(n) if Wᵀ(n)X(n) ≤ 0 and X(n) belongs to class C2 ---------- (5)
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule
W(n + 1) = W(n) − η(n)X(n) if Wᵀ(n)X(n) > 0 and X(n) belongs to class C2
W(n + 1) = W(n) + η(n)X(n) if Wᵀ(n)X(n) ≤ 0 and X(n) belongs to class C1 ---------- (6)
Note: The learning rate is denoted by η (Greek letter eta). It controls the amount of weight adjustment at each step of training. The learning rate, ranging from 0 to 1, determines the rate of learning at each time-step and thus plays a significant role in how fast or slow a neural network learns: a low learning rate makes the neuron learn slowly, while a high learning rate makes it learn quickly.
We are using the fixed-increment adaptation rule for the perceptron, in which η is kept fixed, i.e., it is a constant independent of the iteration number n. This means the learning rate does not change over time.
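The rules above can be sketched in Python (NumPy) as a rough illustration, assuming bipolar labels (+1 for class C1, −1 for class C2) and η fixed at 1; the helper name train_perceptron and the toy data are illustrative, not taken from these notes:

import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    # X: inputs, one row per example; y: labels, +1 for C1 and -1 for C2
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # fixed +1 input drives the bias
    w = np.zeros(X.shape[1])                      # initial condition W(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            # Classify: C1 if w.x > 0, otherwise C2
            predicted = 1 if w @ x_n > 0 else -1
            if predicted != y_n:
                # Misclassified: add X(n) for C1, subtract it for C2
                w += eta * y_n * x_n
                errors += 1
        if errors == 0:        # converged: every example classified correctly
            return w
    return w

# Usage: a small linearly separable (AND-like) dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
print(train_perceptron(X, y))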
Proof of the perceptron convergence algorithm is presented for the initial condition W(0) = 0. Suppose that Wᵀ(n)X(n) < 0 for n = 1, 2, ..., and the input vector X(n) belongs to the subset H1. That is, the perceptron incorrectly classifies the vectors X(1), X(2), ..., since the first condition of equation (4) is violated. Then, with the constant η(n) = 1, we may use the second line of equation (6) to write
W(n + 1) = W(n) + X(n) for X(n) belonging to class C1 ---------- (7)
Note: The update rule W(n + 1) = W(n) + X(n) means that the weight vector at the next iteration, W(n + 1), is obtained from the current weight vector W(n) and the current input X(n). This ensures that the algorithm processes each input X(n) sequentially and updates the weights accordingly. The variable n is used in two different ways in the perceptron algorithm:
- For the weights: W(n) is either the initial value W(0) = 0 or the result of the update on the previous input.
- For the inputs: n runs from 1 to the total number of inputs.
For iteration n = 1, the algorithm starts from W(0) = 0 and updates the weights based on X(1):
W(1) = W(0) + X(1) = X(1)
For iteration n = 2, the algorithm uses W(1) to classify X(2); it then updates the weights based on X(2):
W(2) = W(1) + X(2) [because W(1) = X(1)]
hence W(2) = X(1) + X(2)
For iteration n = 3, the algorithm uses W(2) to classify X(3); it then updates the weights based on X(3):
W(3) = W(2) + X(3) [because W(2) = X(1) + X(2)]
hence W(3) = X(1) + X(2) + X(3)
From the above calculation we can say that, for W(0) = 0, we may iteratively solve this equation for W(n + 1), obtaining the result
W(n + 1) = X(1) + X(2) + . . . + X(n) - - - - - - - - - - (8)
Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution Wo for which WoᵀX(n) > 0 for all the vectors X(1), ..., X(n) belonging to the subset H1. For a fixed solution Wo, we may then define a positive number α as
α = min WoᵀX(n) over all X(n) belonging to the subset H1
Hence, multiplying both sides of Eq. (8) by the row vector Woᵀ, we get
WoᵀW(n + 1) = WoᵀX(1) + WoᵀX(2) + WoᵀX(3) + … + WoᵀX(n)
Since, by the definition of α, each term on the right-hand side is at least α, it follows that
WoᵀW(n + 1) ≥ nα ---------- (9)
Next we make use of an inequality known as the Cauchy–Schwarz inequality. Given two vectors Wo and W(n + 1), the Cauchy–Schwarz inequality states that
||Wo||² ||W(n + 1)||² ≥ [WoᵀW(n + 1)]² ---------- (10)
Combining this with inequality (9), we get
||Wo||² ||W(n + 1)||² ≥ n²α² ---------- (11)
Here ||Wo||² is the squared Euclidean norm; it could be written as ||Wo||₂², but for the sake of simplicity we write it as ||Wo||².
or, equivalently,
||W(n + 1)||² ≥ n²α² / ||Wo||² ---------- (12)
We now follow a second development route. Rewriting equation (7) with k as the running index,
W(k + 1) = W(k) + X(k) for X(k) belonging to class C1, with k = 1, . . . , n ---------- (13)
Taking the squared Euclidean norm of both sides, we get
||W(k + 1)||² = ||W(k)||² + ||X(k)||² + 2Wᵀ(k)X(k) ---------- (14)
But the perceptron misclassified X(k), so Wᵀ(k)X(k) < 0, and therefore
||W(k + 1)||² ≤ ||W(k)||² + ||X(k)||²
or equivalently
||W(k + 1)||² − ||W(k)||² ≤ ||X(k)||², k = 1, . . . , n ---------- (15)
Adding these inequalities for k = 1, . . . , n, the left-hand side is a telescoping sum, meaning most terms cancel out:
[||W(2)||² − ||W(1)||²] + [||W(3)||² − ||W(2)||²] + . . . + [||W(n + 1)||² − ||W(n)||²] = ||W(n + 1)||² − ||W(1)||²
So we will get
||W(n + 1)||² − ||W(1)||² ≤ ||X(1)||² + ||X(2)||² + . . . + ||X(n)||²
Rearranging:
||W(n + 1)||² ≤ ||W(1)||² + ||X(1)||² + ||X(2)||² + . . . + ||X(n)||²
Since ||W(1)||² = ||X(1)||² (recall that W(1) = X(1)) is already included in the sum, the above can be written as follows after applying the General Strategy for tightening inequalities:
||W(n + 1)||² ≤ ||X(1)||² + ||X(2)||² + . . . + ||X(n)||² ≤ nβ ---------- (16)
where β is a positive number defined by
β = max ||X(k)||² over all X(k) belonging to the subset H1
The inequality in the second part of equation (16) is in conflict with the inequality in equation (12) for sufficiently large values of n, because
1. The upper bound grows linearly with n
2. The lower bound grows quadratically with n
For sufficiently large n, the quadratic term will eventually exceed the linear term which would violate the upper
bound. This is why the two inequalities appear to be in conflict for large n.
The Perceptron algorithm is guaranteed to converge (i.e., to find a solution) after a finite number of updates (nmax) if the data is linearly separable. This means that the inequalities are only relevant for n ≤ nmax, where nmax is the maximum number of updates required for convergence. Equating the quadratic lower bound (12) with the linear upper bound in (16) at n = nmax, i.e. nmax²α² / ||Wo||² = nmax·β, and solving for nmax, we get
nmax = β ||Wo||² / α²
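To make the bound concrete, here is a small numeric sketch in Python; the separating vector Wo and the class-C1 samples below are made-up values for illustration:

import numpy as np

Wo = np.array([1.0, 1.0])                            # an assumed separating weight vector
H1 = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])  # assumed class-C1 samples

alpha = min(H1 @ Wo)                 # smallest margin Wo.X over H1: 3.0
beta = max((H1**2).sum(axis=1))      # largest squared norm ||X||^2: 18.0
n_max = beta * (Wo @ Wo) / alpha**2  # bound on number of updates: 4.0
print(alpha, beta, n_max)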
We have thus proved that, for η(n) = 1 for all n and W(0) = 0, and given that a solution vector Wo exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most nmax iterations. We may now state the fixed-increment convergence theorem for the perceptron as follows:
Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that
W(n0) = W(n0 + 1) = W(n0 + 2) = . . .
where W(n0) is a solution vector for n0 ≤ nmax.
The Perceptron Convergence Algorithm guarantees that if the data is linearly separable, the algorithm will find
a solution (i.e., a weight vector that correctly classifies all training examples) in a finite number of iterations.
The goal is indeed to determine the value of n0 (or nmax) such that the algorithm will surely converge within n0
iterations.
Back Propagation Network
A note about partial derivative and gradient vector:
A partial derivative measures how a function changes when you vary only one variable, while keeping all
other variables constant.
For example, take f(x, y) = x² + 3y.
Partial derivative with respect to x: the derivative of x² with respect to x is 2x, and since 3y is treated as a constant, its derivative is 0. Hence ∂f/∂x = 2x.
Partial derivative with respect to y: the term x² is treated as a constant, so its derivative is 0, while the derivative of 3y with respect to y is 3. Hence ∂f/∂y = 3.
The gradient vector (denoted ∇f, pronounced "nabla f") is simply the vector of all the partial derivatives, and it points in the direction of the steepest ascent:
∇f = [∂f/∂x, ∂f/∂y] = [2x, 3]
The function changes most rapidly in the direction of (2x,3). If we move in the direction of this gradient, the
function f(x,y) increases fastest.
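A small Python sketch of this example (the function names and step size are illustrative):

import numpy as np

def f(x, y):
    return x**2 + 3*y

def grad_f(x, y):
    # Partial derivatives: df/dx = 2x (3y treated as constant), df/dy = 3
    return np.array([2*x, 3.0])

x, y = 1.0, 2.0
g = grad_f(x, y)                              # gradient at (1, 2) is [2, 3]
step = 0.1
x_new, y_new = np.array([x, y]) + step * g    # move a small step along the gradient
print(f(x, y), f(x_new, y_new))               # f increases: 7.0 -> 8.34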
Convergence is made faster if a momentum factor is added to the weight-updation process. This is generally done in the back propagation network. If momentum is to be used, the weights from one or more previous training patterns must be saved. Momentum allows the net to make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns.
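A common form of the momentum update adds a fraction of the previous weight change to the current one. A minimal Python sketch (the symbols eta and mu, the function name, and the values below are illustrative assumptions):

import numpy as np

def momentum_update(w, gradient, prev_delta, eta=0.1, mu=0.9):
    # Current step blends the fresh gradient step with the previous adjustment;
    # repeated steps in the same direction accumulate into larger updates.
    delta = -eta * gradient + mu * prev_delta
    return w + delta, delta

w = np.array([0.5, -0.3])
prev_delta = np.zeros_like(w)
for grad in [np.array([0.2, -0.1])] * 3:      # same gradient direction 3 times
    w, prev_delta = momentum_update(w, grad, prev_delta)
    print(w, prev_delta)                       # the step size grows each iteration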
The vigilance parameter is denoted by “ρ”. It is generally used in adaptive resonance theory (ART) networks.
The vigilance parameter is used to control the degree of similarity required for patterns to be assigned to the
same cluster unit. The choice of vigilance parameter ranges approximately from 0.7 to 1 to perform useful work
in controlling the number of clusters.