Neural Network Backpropagation Insights
NumPy code:
import numpy as np  # import numpy library

def sigmoid(x):  # implement sigmoid function
    return 1 / (1 + np.exp(-x))

def derivative(x):  # calculate derivative of sigmoid
    # using the derived expression sigma'(x) = sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1 - sigmoid(x))
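As a quick sanity check of this formula, note that sigmoid(0) = 0.5, so the derivative at 0 should be 0.5 * (1 - 0.5) = 0.25. A short illustrative sketch:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivative(x):
    # derivative of sigmoid: sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1 - sigmoid(x))

# sigmoid(0) = 0.5, so the derivative at 0 is 0.5 * (1 - 0.5) = 0.25
print(derivative(0.0))  # 0.25
```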
DERIVATIVE OF ReLU ACTIVATION FUNCTION:
Before the ReLU function was widely used, the activation function in neural networks was mostly the Sigmoid. However, the Sigmoid function is prone to gradient vanishing (also called gradient dispersion): as the number of layers in the network increases, the gradient values become very small and the parameters of the network cannot be effectively updated. As a result, deeper neural networks could not be trained, and neural network research stayed at the shallow level for a long time.
With the introduction of the ReLU function, the phenomenon of gradient dispersion is
well alleviated, and the number
of layers of the neural network can reach deeper layers.
The expression of the ReLU function: ReLU (x) = max(0,x)
It can be seen that the derivative of the ReLU function is simple to compute. When x is greater than or equal to zero, the derivative value is 1; when x is less than zero, the derivative value is 0. In the process of backpropagation, it neither amplifies the gradient, causing gradient explosion, nor shrinks the gradient, causing gradient vanishing.
The derivative curve of the ReLU function is shown in Figure:
NumPy code:
def derivative(x):  # derivative of ReLU
    d = np.array(x, copy=True)  # copy the input
    d[x < 0] = 0  # derivative is 0 for negative inputs
    d[x >= 0] = 1  # derivative is 1 for non-negative inputs
    return d
The LeakyReLU function differs from the ReLU function in that, when x is less than zero, its derivative value is not 0 but a constant p, which is generally set to a small value such as 0.01 or 0.02.
The derivative curve of the LeakyReLU function is shown in Figure:
NumPy Code:
def derivative(x, p):  # p is the slope of the negative part of LeakyReLU
    dx = np.ones_like(x)  # initialize a vector with 1s
    dx[x < 0] = p  # set negative part to p
    return dx
Loss Function: The choice of loss function depends on the task. For example,
mean squared error (MSE) is common for regression tasks, while categorical
cross-entropy is often used for classification tasks.
It can be seen from the preceding formula that the partial derivative of the error with respect to the weight wj1 is related only to the output value o1, the true value t, and the input xj connected to the current weight.
NOTE:
The gradient of a single neuron in a neural network typically refers to the
derivative of the loss function with respect to the parameters (weights and
biases) of that neuron. This gradient is essential for updating the parameters
during the training process using optimization algorithms like gradient descent.
Forward Pass: During the forward pass, inputs are passed through the neuron,
and its output is calculated using the neuron's activation function.
Loss Calculation: The output of the neuron is compared to the target output
(ground truth) to compute the loss using a predefined loss function (e.g., mean
squared error, cross-entropy).
Backpropagation: The gradient of the loss with respect to the parameters of the
neuron is computed during the backpropagation phase. This involves applying
the chain rule to recursively compute gradients layer by layer, starting from the
output layer to the input layer.
Gradient Descent: Once the gradients are computed, they are used to update the
parameters of the neuron in the opposite direction of the gradient, aiming to
minimize the loss function. This is typically done iteratively over multiple
epochs until convergence.
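The four steps above can be sketched for a single sigmoid neuron with an MSE loss. All names and numeric values below are illustrative, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative data: one sample with two features and a scalar target
x = np.array([0.5, -1.0])
t = 1.0
w = np.array([0.1, 0.2])
b = 0.0
lr = 0.1  # learning rate

# 1. Forward pass
z = np.dot(w, x) + b
o = sigmoid(z)

# 2. Loss calculation (MSE for a single sample)
loss = 0.5 * (o - t) ** 2

# 3. Backpropagation via the chain rule:
#    dL/dw_j = (o - t) * o * (1 - o) * x_j
delta = (o - t) * o * (1 - o)
grad_w = delta * x
grad_b = delta

# 4. Gradient descent update in the opposite direction of the gradient
w -= lr * grad_w
b -= lr * grad_b
```

Repeating these four steps over many epochs drives the loss toward a minimum.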
GRADIENT OF FULLY CONNECTED LAYER
We generalize the single-neuron model to a single-layer network of fully connected layers, as shown in the figure.
The input layer obtains the output vector o(1) through a fully connected
layer and calculates the mean square error with the real label vector t.
The number of input nodes is J, and the number of output nodes is K.
The multi-output fully connected network layer model differs from the single-neuron model in that it has many more output nodes o1(1), o2(1), o3(1), …, oK(1), and each output node corresponds to a real label t1, t2, …, tK. wjk is the connection weight between the jth input node and the kth output node. The mean square error can be expressed as:
L = (1/2) Σk (ok(1) − tk)²
Substituting ok = σ(zk) and differentiating gives:
∂L/∂wjk = (ok − tk) · ok(1 − ok) · xj
It can be seen that the partial derivative with respect to wjk is related only to the output node ok(1) of the current connection, the corresponding true label tk, and the corresponding input node xj.
Defining δk = (ok − tk) · ok(1 − ok), the partial derivative becomes ∂L/∂wjk = δk · xj, which is related only to the start node xj and the end node δk of the current connection.
Now that the gradient propagation method of the single-layer neural
network (i.e., the output layer) has been derived, next we try to derive the
gradient propagation method of the penultimate layer.
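The result ∂L/∂wjk = δk · xj for a single fully connected layer can be checked numerically with a small NumPy sketch. The shapes and values below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

J, K = 3, 2                      # number of input and output nodes
x = np.array([1.0, -0.5, 2.0])   # input vector, shape [J]
t = np.array([0.0, 1.0])         # true labels, shape [K]
W = np.full((J, K), 0.1)         # weights w_jk, shape [J, K]

z = x @ W                        # pre-activations z_k
o = sigmoid(z)                   # outputs o_k

# delta_k = (o_k - t_k) * o_k * (1 - o_k)
delta = (o - t) * o * (1 - o)

# dL/dw_jk = delta_k * x_j: an outer product of the
# start nodes x and the end nodes delta
grad_W = np.outer(x, delta)

# Numerical check of one entry against a finite difference
eps = 1e-6
Wp = W.copy()
Wp[0, 1] += eps
num = (0.5 * np.sum((sigmoid(x @ Wp) - t) ** 2)
       - 0.5 * np.sum((sigmoid(x @ W) - t) ** 2)) / eps
assert abs(num - grad_W[0, 1]) < 1e-5
```

The analytic gradient agrees with the finite-difference estimate, which is exactly what the derivation above predicts.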
NOTE:
To compute the gradient of the loss function with respect to the parameters
(weights and biases) of the entire network, you typically use the
backpropagation algorithm. Backpropagation efficiently computes these
gradients by applying the chain rule of calculus layer by layer, starting from the
output layer and moving backward through the network.
Forward Pass: Inputs are passed through the network layer by layer, and the
output of the network is computed.
Loss Calculation: The output of the network is compared to the target output to
compute the loss using a predefined loss function.
Gradient Descent: Once the gradients are computed, they are used to update the
parameters of the network in the opposite direction of the gradient, aiming to
minimize the loss function. This is typically done using optimization algorithms
like stochastic gradient descent (SGD), Adam, or RMSprop.
CHAIN RULE:
The chain rule is a fundamental concept in calculus that allows you to find the derivative of a
composite function. In other words, it helps you calculate the rate of change of a function that
is composed of two or more functions nested inside each other. The chain rule is especially
important in calculus when dealing with functions where one quantity depends on another,
which in turn depends on another, and so on.
In words, the chain rule says that to find the derivative of the composite function y = f(g(x)), you multiply the derivative of the outer function f(u) with respect to its variable u by the derivative of the inner function g(x) with respect to x:
dy/dx = (df/du) · (dg/dx)
For example, if y = (3x − 1)², take f(u) = u² and g(x) = 3x − 1. Then df/du = 2u and dg/dx = 3, so
dy/dx = 2(3x − 1) · 3 = 6(3x − 1)
The chain rule is a powerful tool that enables you to find derivatives of complex functions by
breaking them down into simpler functions and considering how changes in the inner
function affect the outer function. It's a fundamental concept in calculus and is widely used in
various fields of mathematics and science.
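The chain-rule result dy/dx = 6(3x − 1) can be confirmed with a finite-difference check; the helper names below are illustrative:

```python
def f(u):          # outer function: u squared
    return u ** 2

def g(x):          # inner function: 3x - 1
    return 3 * x - 1

def dydx(x):       # chain rule result: 2(3x - 1) * 3 = 6(3x - 1)
    return 6 * (3 * x - 1)

# Compare with a central finite-difference approximation of d/dx f(g(x))
x0 = 0.7
h = 1e-6
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)
print(numeric, dydx(x0))  # both close to 6 * (3*0.7 - 1) = 6.6
```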
In the context of machine learning and neural networks, the chain rule plays a crucial role in
gradient propagation during the training process. Gradient propagation is the process of
computing gradients (derivatives) of a composite function, such as a neural network, with
respect to its parameters. The chain rule is used to calculate these gradients efficiently,
allowing the model to update its parameters during training through optimization algorithms
like gradient descent.
Let's break down how the chain rule is used in gradient propagation:
During the forward pass of a neural network, input data is processed through multiple layers.
Each layer applies an activation function to its input and produces an output. This process can
be represented as a sequence of functions:
Input data: x
...
Here, each layer has its own weights (W) and biases (b), and f₁, f₂, ..., fₒ are activation
functions.
Computing Loss:
The neural network makes predictions (y) based on the input data, and the predictions are
compared to the actual target values to compute a loss function (e.g., Mean Squared Error or
Cross-Entropy Loss).
The goal of training is to minimize the loss. To do this, you need to calculate the gradients of
the loss with respect to the parameters (weights and biases) in each layer of the network. This
is where the chain rule comes into play.
Starting from the output layer and moving backward through the network (hence the term
"backpropagation"), you calculate the gradients layer by layer using the chain rule. The key
steps are as follows:
Compute the gradient of the loss with respect to the output layer's inputs.
Propagate this gradient backward through each layer, multiplying it by the gradient of the
layer's inputs with respect to its parameters, using the chain rule.
...
The chain rule is repeatedly applied for each layer to calculate the gradients.
Updating Parameters:
Once you have computed the gradients of the loss with respect to the parameters, you can use
them to update the parameters through an optimization algorithm like gradient descent. The
goal is to adjust the parameters to minimize the loss, thereby improving the model's
performance.
In summary, the chain rule is a fundamental concept in gradient propagation within neural
networks. It enables the efficient calculation of gradients for each layer, allowing the network
to learn and adapt its parameters during training. This process is critical for the successful
training of machine learning models, including deep neural networks.
BACKPROPAGATION ALGORITHM:
Backpropagation, short for "backward propagation of errors," is an algorithm used to train
artificial neural networks, including deep learning models. It is a key component of the
training process and is responsible for updating the model's weights to minimize the error
between predicted and actual values. Here is a step-by-step explanation of the
backpropagation algorithm:
Initialize the weights and biases of the neural network. These values are typically initialized
randomly.
Input data is fed forward through the network layer by layer, from the input layer to the
output layer.
Calculate the weighted sum of inputs for each neuron in the layer.
Apply the activation function to the weighted sum to get the output of each neuron.
Calculate the loss (error) between the predicted output and the actual target values using a
suitable loss function (e.g., Mean Squared Error for regression or Cross-Entropy Loss for
classification).
Compute the gradient of the loss with respect to the output layer's inputs. This gradient
measures how much a small change in the output of the network affects the loss.
∂Loss/∂output_layer_inputs
Propagate this gradient backward through the network to calculate the gradients of the loss
with respect to the weights and biases of each layer. This is done using the chain rule.
Update the weights and biases of each layer using an optimization algorithm like gradient
descent. The goal is to adjust these parameters in a way that minimizes the loss.
Repeat:
Repeat the forward pass, loss calculation, backpropagation, and parameter updates for a fixed number of iterations (epochs) or until the loss converges to a satisfactory level.
After training, evaluate the model's performance on a separate validation dataset or test
dataset to assess its generalization ability.
Once the model is trained and evaluated, it can be used for making predictions on new,
unseen data.
Backpropagation is the core of how neural networks learn. Up until this point, you learned
that training a neural network typically happens by the repetition of the following three steps:
• Feedforward: get the linear combination (weighted sum), and apply the activation
• Compare the prediction with the label to calculate the error or loss function:
E(W, b) = |ŷᵢ − yᵢ|
• Use a gradient descent optimization algorithm to compute the Δw that optimizes the
error function:
Δwᵢ = −α dE/dwᵢ
Backpropagation, or the backward pass, means propagating derivatives of the error from the last layer (output) back to the first layer (input) to adjust the weights. By propagating the change in weights Δw backward from the prediction node (ŷ) all the way through the hidden layers and back to the input layer, the weights get updated:
This will take the error one step down the error mountain. Then the cycle starts again (steps 1
to 3) to update the weights and take the error another step down, until we get to the minimum
error.
Backpropagation might sound clearer when we have only one weight: we simply apply the update w_new = w − α dE/dw. But it gets complicated when we have a multilayer perceptron (MLP) network with many layers and weights. How do we compute the change of the total error with respect to a weight deep inside the network, such as dE/dw1,3(1)? How much will the total error change when we change that parameter? For a weight directly connected to the error function, the derivative is straightforward. But to compute the derivative of the total error with respect to a weight all the way back at the first layer, such as the third weight on the first input w1,3(1) (where the superscript (1) means the first layer), we multiply the local effects along the path using the chain rule:
effect of error on edge w1,3(1) = effect of error on edge 4 · effect on edge 3 · effect on edge 2 · effect on target edge
Thus, the backpropagation technique is used by neural networks to update the weights so as to minimize the error between predictions and labels.
This example will demonstrate a single-layer neural network for a simple regression problem
using backpropagation.
Let's start with a simple neural network with one hidden layer. We'll implement forward and backward passes for training.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is assumed to already be a sigmoid output
    return x * (1 - x)

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.weights_input_hidden = np.random.rand(input_size, hidden_size)
        self.weights_hidden_output = np.random.rand(hidden_size, output_size)

    def forward(self, x):
        # Forward pass
        self.hidden_output = sigmoid(np.dot(x, self.weights_input_hidden))
        self.output = sigmoid(np.dot(self.hidden_output, self.weights_hidden_output))
        return self.output

    def backward(self, x, y, output):
        # Backpropagation
        self.output_error = y - output
        self.output_delta = self.output_error * sigmoid_derivative(output)
        self.hidden_error = self.output_delta.dot(self.weights_hidden_output.T)
        self.hidden_delta = self.hidden_error * sigmoid_derivative(self.hidden_output)
        self.weights_hidden_output += self.hidden_output.T.dot(self.output_delta)
        self.weights_input_hidden += x.T.dot(self.hidden_delta)

    def train(self, X, y, epochs):
        for _ in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)

    def predict(self, x):
        return self.forward(x)

if __name__ == "__main__":
    # XOR dataset
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0], [1], [1], [0]])
    input_size = 2
    hidden_size = 4
    output_size = 1
    epochs = 10000
    nn = NeuralNetwork(input_size, hidden_size, output_size)
    nn.train(X, y, epochs)
    for i in range(len(X)):
        prediction = nn.predict(X[i])
        print(X[i], prediction)
This code defines a simple neural network with one hidden layer and uses the sigmoid
activation function. It trains the network using the XOR dataset.
EXAMPLE 2:
Here's an example of a four-layer fully connected neural network implemented in Python for
binary classification using backpropagation. This network has two input nodes, three hidden
layers with 20, 50, and 25 nodes respectively, and two output nodes for binary classification:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is assumed to already be a sigmoid output
    return x * (1 - x)

class NeuralNetwork:
    def __init__(self, input_size, hidden_sizes, output_size, learning_rate):
        self.input_size = input_size
        self.hidden_sizes = hidden_sizes
        self.output_size = output_size
        self.learning_rate = learning_rate
        self.weights = []
        self.biases = []
        # Input layer to first hidden layer
        self.weights.append(np.random.randn(input_size, hidden_sizes[0]))
        self.biases.append(np.zeros((1, hidden_sizes[0])))
        # Hidden layer to hidden layer
        for i in range(len(hidden_sizes) - 1):
            self.weights.append(np.random.randn(hidden_sizes[i], hidden_sizes[i+1]))
            self.biases.append(np.zeros((1, hidden_sizes[i+1])))
        # Last hidden layer to output layer
        self.weights.append(np.random.randn(hidden_sizes[-1], output_size))
        self.biases.append(np.zeros((1, output_size)))

    def forward(self, x):
        self.layer_outputs = []
        input_layer = x
        for w, b in zip(self.weights, self.biases):
            weighted_sum = np.dot(input_layer, w) + b
            layer_output = sigmoid(weighted_sum)
            self.layer_outputs.append(layer_output)
            input_layer = layer_output
        return input_layer

    def backward(self, x, y, output):
        # Backpropagation
        error = y - output
        deltas = [None] * len(self.weights)
        delta = error * sigmoid_derivative(output)
        deltas[-1] = delta
        for i in range(len(deltas) - 2, -1, -1):
            delta = deltas[i+1].dot(self.weights[i+1].T) * sigmoid_derivative(self.layer_outputs[i])
            deltas[i] = delta
        input_layer = x
        for i in range(len(self.weights)):
            self.weights[i] += self.learning_rate * input_layer.T.dot(deltas[i])
            self.biases[i] += self.learning_rate * np.sum(deltas[i], axis=0, keepdims=True)
            input_layer = self.layer_outputs[i]

    def train(self, X, y, epochs):
        for _ in range(epochs):
            for i in range(len(X)):
                x = X[i:i+1]
                target = y[i:i+1]
                output = self.forward(x)
                self.backward(x, target, output)

    def predict(self, x):
        return self.forward(x)

if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])  # One-hot encoded labels
    input_size = 2
    hidden_sizes = [20, 50, 25]
    output_size = 2
    learning_rate = 0.1
    epochs = 10000
    nn = NeuralNetwork(input_size, hidden_sizes, output_size, learning_rate)
    nn.train(X, y, epochs)
    for i in range(len(X)):
        prediction = nn.predict(X[i:i+1])
        print(X[i], prediction)
NUMERICAL EXAMPLE:
[Figure: hand-worked numerical example of backpropagation. One forward pass is computed through a small network with sigmoid activations, the MSE loss is evaluated, then the backward pass computes the output delta δ and the hidden-layer deltas via the chain rule, and gradient-descent updates are applied to all weights and biases, ending with a table of old and new parameter values.]
KERAS
INTRODUCTION:
Keras is an open-source neural network computing library mainly developed in the
Python language.
It was originally written by François Chollet.
It is designed as a highly modular and extensible high-level neural network interface,
so that users can quickly complete model building and training without excessive
professional knowledge.
The Keras library is divided into a frontend and a backend. The backend generally
calls the existing deep learning framework to implement the underlying operations,
such as Theano, CNTK, and TensorFlow.
The frontend interface is a set of unified interface functions abstracted by Keras.
Users can easily switch between different backend operations through Keras.
Since 2017, most components of Keras have been integrated into the TensorFlow
framework.
In 2019, Keras was officially designated the only high-level interface API for TensorFlow 2, replacing high-level interfaces such as tf.layers included in TensorFlow 1. In other words, you now use the Keras interface to complete TensorFlow model building and training. In TensorFlow 2, Keras is implemented in the tf.keras submodule.
2. Network Container:
For common networks, we need to manually call the class instance of each layer to
complete the forward propagation operation. When the network layer becomes deeper,
this part of the code appears very bloated. Multiple network layers can be encapsulated
into a large network model through the network container Sequential provided by Keras.
Only the instance of the network model needs to be called once to complete the sequential
propagation operation of the data from the first layer to the last layer.
For example, the two-layer fully connected network with a separate activation function
layer can be encapsulated as a network through the Sequential container.
from tensorflow.keras import layers, Sequential

layers_num = 2
network = Sequential([])  # create an empty container
for _ in range(layers_num):
    network.add(layers.Dense(3))  # add fully-connected layer
    network.add(layers.ReLU())  # add activation layer
network.build(input_shape=(4, 4))
network.summary()
model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=feature_dim))
model.add(Dense(units=num_classes, activation='softmax'))
1. THE tf.saved_model METHOD:
In TensorFlow, you can save and load models using the tf.saved_model method. This
method provides a standard way to save and load models, making it compatible with
TensorFlow Serving and other TensorFlow-based deployment environments. Here's
how you can save and load a model using the tf.saved_model method:
Saving a Model:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Build and save a small illustrative model
model = Sequential([Dense(64, activation='relu', input_shape=(10,)), Dense(1)])
tf.saved_model.save(model, 'my_model')
Loading a Model:
import tensorflow as tf

loaded_model = tf.saved_model.load('my_model')
In this example:
To load the model, we use tf.saved_model.load('my_model'). This will load the model
and return a callable object that you can use for making predictions.
Note that the model will be saved as a TensorFlow SavedModel, which is a directory
containing the model's architecture, weights, and other metadata. This format is
designed for easy deployment and compatibility with TensorFlow Serving.
Make sure to replace 'my_model' with the path where you want to save or load your
model. You can also specify different versions of the model by using different
directory names when saving.
Define a custom layer or set of layers. You can do this by creating a class that inherits from tf.keras.layers.Layer. Implement the __init__ method to define layer parameters and the call method to specify the layer's forward pass.
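As a sketch, a CustomLayer like the one composed in the model below might look as follows; the exact implementation (a dense layer with a configurable activation) is an assumption, and only the subclassing pattern comes from the text:

```python
import tensorflow as tf

class CustomLayer(tf.keras.layers.Layer):
    def __init__(self, num_units, activation=None):
        super(CustomLayer, self).__init__()
        self.num_units = num_units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # Create the layer's weights once the input shape is known
        self.w = self.add_weight(shape=(input_shape[-1], self.num_units),
                                 initializer='glorot_uniform', trainable=True)
        self.b = self.add_weight(shape=(self.num_units,),
                                 initializer='zeros', trainable=True)

    def call(self, inputs):
        # Forward pass: affine transform followed by the activation
        return self.activation(tf.matmul(inputs, self.w) + self.b)
```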
Create a custom model by subclassing tf.keras.Model and define the architecture by composing custom layers.
class CustomModel(tf.keras.Model):
    def __init__(self):
        super(CustomModel, self).__init__()
        self.layer1 = CustomLayer(num_units=64, activation='relu')
        self.layer2 = CustomLayer(num_units=10, activation='softmax')

    def call(self, inputs):
        # Forward pass through the composed custom layers
        x = self.layer1(inputs)
        return self.layer2(x)
Compile the custom model with an optimizer, loss function, and metrics.
model = CustomModel()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Train the model using your custom data and fit it to your training data:
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_val, y_val))
Evaluate and use the custom model just like any other Keras model:
MODEL ZOO:
For commonly used network models, such as ResNet and VGG, you do not need to manually create them. They can be created directly with the keras.applications submodule in a single line of code. At the same time, you can also load pre-trained models by setting the weights parameter.
For an example, use the Keras model zoo to load the pre-trained ResNet50 network by
ImageNet. The code is as follows:
# Load an ImageNet pre-trained network. Exclude the last layer.
resnet = keras.applications.ResNet50(weights='imagenet', include_top=False)
resnet.summary()
# test the output
x = tf.random.normal([4, 224, 224, 3])
out = resnet(x)  # get output
print(out.shape)  # (4, 7, 7, 2048)
For a specific task, we need to set a custom number of output nodes. Taking a 100-class classification task as an example, we rebuild a new network based on ResNet50. Create a new pooling layer (the pooling layer here can be understood as downsampling in the height and width dimensions) and reduce the feature dimension from [b, 7, 7, 2048] to [b, 2048] as in the following.
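The code for this pooling layer is not shown in the text; a sketch using Keras's GlobalAveragePooling2D (the variable name global_average_layer matches the Sequential call used later):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Global average pooling averages over the height and width dimensions,
# reducing features of shape [b, 7, 7, 2048] to [b, 2048]
global_average_layer = layers.GlobalAveragePooling2D()

x = tf.random.normal([4, 7, 7, 2048])
out = global_average_layer(x)
print(out.shape)  # (4, 2048)
```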
Finally, create a new fully connected layer and set the number of output nodes to 100. The
code is as follows:
In [7]:
# New fully connected layer
fc = layers.Dense(100)
# Use the previous layer's output as this layer's input
x = tf.random.normal([4, 2048])
out = fc(x)
print(out.shape)
Out[7]: (4, 100)
After creating a pre-trained ResNet50 feature sub-network, a new pooling layer, and a fully
connected layer, we re-use the Sequential container to encapsulate a new network:
# Build a new network using previous layers
mynet = Sequential([resnet, global_average_layer, fc])
[Link]()
You can see the structure information of the new network model is:
By setting resnet.trainable = False, you can choose to freeze the network parameters of the
ResNet part and only train the newly created network layer, so that the network model
training can be completed quickly and efficiently. Of course, you can also update all the
parameters of the network.
Metrics
In the training process of the network, metrics such as accuracy and recall rate are often required. Keras provides some commonly used metrics in the keras.metrics module.
There are four main steps in the use of Keras metrics: creating a new metrics container,
writing data, reading statistical data, and clearing the measuring container.
Create a Metrics Container:
The keras.metrics module provides many commonly used metric classes, such as mean, accuracy, and cosine similarity. In the following, we take the mean error as an example.
loss_meter = metrics.Mean()
Write Data
New data can be written through the update_state function, and the metric will record and
process the sampled data according to its own logic. For example, the loss value is collected
once at the end of each step:
# Record the sampled data, and convert the tensor to an ordinary value through the float() function
loss_meter.update_state(float(loss))
After the preceding sampling code is placed at the end of each batch
operation, the meter will automatically calculate the average value based
on the sampled data.
Read Statistical Data:
After sampling data multiple times, you can call the metric's result()
function to obtain statistical values. For example, the interval statistical loss average is as
follows:
# Print the average loss during the statistical period
print(step, 'loss:', loss_meter.result())
Clear the Container:
Since the metric container will record all historical data, it is necessary to clear the historical
status when starting a new round of statistics. It can be realized by reset_states() function. For
example, after reading the average error every time, clear the statistical information to start
the next round of statistics as follows:
if step % 100 == 0:
# Print the average loss
print(step, 'loss:', loss_meter.result())
loss_meter.reset_states() # reset the state
Hands-On Accuracy Metric:
According to the method of using the metric tool, we use the accuracy metric to count the
accuracy rate during the training process. First, create a new accuracy measuring container as
follows:
acc_meter = metrics.Accuracy()
After each forward calculation is completed, record the training accuracy rate. It should be
noted that the parameters of the update_state function of the accuracy class are the predicted
value and the true value, not the accuracy rate of the current batch. We write the label and
prediction result of the current batch sample into the metric as follows:
# [b, 784] => [b, 10], network output
out = network(x)
# [b, 10] => [b], feed into argmax()
pred = tf.argmax(out, axis=1)
pred = tf.cast(pred, dtype=tf.int32)
# record the accuracy
acc_meter.update_state(y, pred)
After counting the predicted values of all batches in the test set, print the average accuracy of
the statistics and clear the metric container. The code is as follows:
print(step, 'Evaluate Acc:', acc_meter.result().numpy())
acc_meter.reset_states() # reset metric
VISUALISATION IN KERAS : MODEL SIDE AND BROWSER SIDE:
In the process of network training, it is very important to improve the development efficiency
and monitor the training progress of the network through the web terminal and visualize the
training results. TensorFlow provides a special visualization tool called TensorBoard, which
writes monitoring data to the file system through TensorFlow and uses the web backend to
monitor the corresponding file directory, thus allowing users to view network monitoring
data.
Visualizing the performance and architecture of a Keras model can be done on both the
model side (inside your Python script) and the browser side (using tools like TensorBoard).
Here's how you can perform visualization in both contexts:
summary_writer = tf.summary.create_file_writer(log_dir)
We take monitoring error and visual image data as examples to introduce how to write
monitoring data. After the forward calculation is completed, for the scalar data such as error,
we record the monitoring data through the tf.summary.scalar function and specify the time
stamp step parameter. The step parameter here is similar to the time scale information
corresponding to each data and can also be understood as the coordinates of the data curve, so
it should not be repeated. Each type of data is distinguished by the name of the string, and
similar data needs to be written to the database with the same name. For example:
with summary_writer.as_default():
    # write the current loss to the train-loss database
    tf.summary.scalar('train-loss', float(loss), step=step)
TensorBoard distinguishes different types of monitoring data by string ID, so for the error data we named it “train-loss”; other types of data must not be written under the same name, to prevent data pollution.
For picture-type data, you can write monitoring picture data through the tf.summary.image function. For example, during training, sample images can be visualized with tf.summary.image. Since a tensor in TensorFlow generally contains multiple samples, the tf.summary.image function accepts tensor data containing multiple pictures and uses the max_outputs parameter to set the maximum number of pictures to display.
The code is as follows:
with summary_writer.as_default():
    # log accuracy
    tf.summary.scalar('test-acc', float(total_correct / total), step=step)
    # log images
    tf.summary.image("val-onebyone-images:", val_images, max_outputs=9, step=step)
Run the model program, and the corresponding data will be written to
the specified file directory in real time.
Plot Training History:
You can plot the training history of your Keras model directly within your Python script
using libraries like Matplotlib. This allows you to visualize metrics like loss and accuracy
during training.
import matplotlib.pyplot as plt

history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_val, y_val))

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.show()
Model Summary:
You can print a summary of your model's architecture to the console to understand its
structure and the number of trainable parameters.
model.summary()
Browser Side Visualization (TensorBoard):
TensorBoard is a powerful tool provided by TensorFlow that allows you to visualize various
aspects of your model, such as training metrics, model architecture, and even custom metrics.
When running the program, the monitoring data is written to the specified file directory. If
you want to remotely view and visualize these data in real time, you also need to use a
browser and a web backend. The first step is to open the web backend. Run command
“tensorboard --logdir path” in terminal and specify the file directory path monitored by the
web backend. Then the web backend monitoring process is opened.
On the upper end of the monitoring page, you can choose different types of data monitoring
pages, such as scalar monitoring page SCALARS and picture visualization page
IMAGES. For this example, we need to monitor the training error and test accuracy rate for
scalar data, and its curve can be viewed on the SCALARS page, as shown in Figures
In addition to monitoring scalar data and image data, TensorBoard also supports functions such as viewing the histogram distribution of tensor data through tf.summary.histogram and printing text information through tf.summary.text. For example:
with summary_writer.as_default():
    tf.summary.scalar('train-loss', float(loss), step=step)
    tf.summary.histogram('y-hist', y, step=step)
    tf.summary.text('loss-text', str(float(loss)), step=step)
You can view the histogram of the tensor on the HISTOGRAMS page, as shown in Figure A, and you can view the text information on the TEXT page, as shown in Figure B.
FIGURE A
Figure B
Here's how to use TensorBoard for visualization:
Make sure you have TensorFlow installed. You can install it using pip:
pip install tensorflow
Logging Metrics to TensorBoard: During model training, you can log metrics to
TensorBoard using a Keras callback.
Here's an example:
from tensorflow.keras.callbacks import TensorBoard
tensorboard_callback = TensorBoard(log_dir='./logs', histogram_freq=1)
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_val, y_val), callbacks=[tensorboard_callback])
Start TensorBoard:
In your terminal, navigate to the directory where you're running your Python script and use
the following command to start TensorBoard:
tensorboard --logdir=./logs
This command will start TensorBoard and provide a URL you can open in your browser to
access the visualization.
Open a web browser and go to the URL provided by TensorBoard (typically http://localhost:6006). Here, you can view various visualizations, including training metrics, model architecture, and more.
INTRODUCTION:
GENERALIZATION ABILITY:
• Generalization ability is the ability of machine learning to learn the real model of the data from the training set, so that the model also performs well on the unseen test set.
• When the model’s expressive power is weak, such as a single linear layer, it can only learn a linear model and does not perform well on nonlinear data.
• When the model’s expressive power is too strong, it may also fit the noise modes of the training set, which leads to poor performance on the test set (weak generalization ability).
• The model’s ability to fit complex functions is called model capacity.
• Consider the following example to understand the concept of model capacity better:
EXAMPLE:
• D = {(x, y) | y = sin(x), x ∈ [−5, 5]}
• A small number of points are sampled from the real distribution to form the training set,
which contains the observation error ϵ, as shown by the small dots in Figure.
• Initially, if we only search the model space of all first-degree polynomials with the bias set to
0, that is, y = ax, we get the straight line of the first-degree polynomial shown in the Figure.
• As the hypothesis space is enlarged to polynomials of degrees 7, 9, 11, 13, 15, and 17 (the
curves in the Figure), the larger the hypothesis space of functions, the more likely it is
to find a function model that better approximates the real distribution.
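As a rough sketch of this effect (using NumPy's polyfit, which is not part of the original example), fitting polynomials of increasing degree to noisy samples of y = sin(x) shows the training error shrinking as the hypothesis space grows:

```python
import numpy as np

# Sample noisy points from the real distribution y = sin(x), x in [-5, 5]
rng = np.random.default_rng(42)
x = rng.uniform(-5, 5, 60)
y = np.sin(x) + rng.normal(0, 0.1, 60)   # observation error epsilon

# Fit polynomials of increasing degree (a larger hypothesis space)
for degree in (1, 7, 13):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    mse = np.mean((y - y_hat) ** 2)
    print(f"degree {degree:2d}: training MSE = {mse:.4f}")
```

The training error drops with degree, but as the following sections note, a lower training error does not imply better generalization.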
CONS OF USING EXCESSIVELY LARGE HYPOTHESIS SPACE:
• The presence of observation errors in the training set hurts the generalization ability of the model.
• Because the distribution of real data is often unknown and complicated, it is impossible to
deduce the type of distribution function and related parameters.
• Therefore, when choosing the capacity of the learning model, people often choose a slightly
larger model capacity based on empirical values.
• When the capacity of the model is too large, it may appear to perform better on the training
set, but perform worse on the test set.
• When the capacity of the model is too small, it may have poor performance in both the
training set and the testing set as shown in the area to the left of the red vertical line in
Figure.
• When the capacity of the model is too large, in addition to learning the modalities of the
training set data, the network model also learns the additional observation errors, resulting in
a model that performs better on the training set but poorly on unseen samples. This
phenomenon is called overfitting.
• When the capacity of the model is too small, the model cannot learn the modalities of the
training set data well, resulting in poor performance on both the training set and unseen
samples. This phenomenon is called underfitting.
EXAMPLE:
• If we use a simple linear function to learn, we will find it difficult to learn a good function,
resulting in the underfitting phenomenon where neither the training set nor the test set
performs well, as shown in the Figure.
• If we use a more complex function model to learn, the learned function may excessively "fit"
the training set samples, resulting in poor performance on the test set, that is,
overfitting, as shown in the Figure.
• Only when the capacity of the learned model roughly matches that of the real model can the
model have good generalization ability, as shown in the Figure.
SOLUTION TO UNDERFITTING:
• The problem of underfitting can be solved by increasing the number of layers of the neural
network.
• However, because modern deep neural network models can easily reach deeper layers, the
capacity of the model used for learning is generally sufficient.
SOLUTION TO OVERFITTING:
1. Early Stopping
2. Model Design
3. Regularization
4. Dropout
5. Data Augmentation
DATASET DIVISION:
Earlier, we divided the dataset into only a training set and a test set.
In order to select model hyperparameters and detect overfitting, it is generally necessary to
split the original training set into three subsets:
training set, validation set, and test set.
• The training set Dtrain is used to train the model parameters.
• The test set Dtest is used to test the generalization ability of the model.
• Example: training set = 80% of the MNIST dataset and test set = 20% of the MNIST dataset.
• The performance on the test set cannot be used as feedback for model training.
• We need to be able to pick more suitable model hyperparameters during training and to
determine whether the model is overfitting.
• Therefore, we need to divide the training set into a training set and a validation set.
• The divided training set has the same function as the original training set and is used to train
the parameters of the model, while the validation set is used to select the
hyperparameters of the model.
FUNCTIONS OF VALIDATION DATASET:
• Adjust the learning rate, weight decay coefficient, training times, etc. according to the
performance of the validation set.
• Readjust the network topology according to the performance of the validation set.
• According to the performance of the validation set, determine whether it is overfitting or
underfitting.
• The training set, validation set, and test set can be divided according to a custom ratio, such
as the common 60%-20%-20% division.
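A minimal sketch of such a 60%-20%-20% division, assuming a plain (non-stratified) NumPy index shuffle; the helper name and dataset size are illustrative, not from the original text:

```python
import numpy as np

def split_dataset(n, train=0.6, val=0.2, seed=0):
    """Shuffle sample indices and split them into train/validation/test arrays."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_dataset(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Each index appears in exactly one subset, so the validation set never leaks into training.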
DIFFERENCE BETWEEN TEST & VALIDATION SETS:
• The algorithm designer can adjust the settings of various hyperparameters of the model
according to the performance of the validation set to improve the generalization ability of
the model.
• But the performance of the test set cannot be used to adjust the model.
EARLY STOPPING:
EPOCH:
• Updating the model on one batch of the training set is called one Step, and iterating through
all the samples in the training set once is called an Epoch.
• It is generally recommended to perform a validation operation only after every few Epochs,
since validation introduces additional computation costs.
• If the training error of the model is low and the training accuracy is high, but the validation
error is high and the validation accuracy rate is low, overfitting may occur.
• If the errors on both the training set and the validation set are high and the accuracy is low,
then underfitting may occur.
EXAMPLE: A TYPICAL CLASSIFICATION
NOTE 1: In the later stage of training, even with the same network structure, due to the
change in the actual capacity of the model, we observe the phenomenon of overfitting.
NOTE 2:
• This means that for neural networks, even if the network hyperparameters remain
unchanged (i.e., the maximum capacity of the network is fixed), the model may still
overfit.
• It is because the effective capacity of the neural network is closely related to the state of the
network parameters
• As the number of training Epochs increases, the overfitting becomes more and more serious.
• The vertical dotted line marks the early-stopping Epoch, at which the network is in its best
state: there is no obvious overfitting, and the generalization ability of the network is the
best.
When it is found that the validation accuracy has not improved for several consecutive Epochs, we
can predict that the most suitable Epoch may have been reached, so we can stop training.
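The stopping rule described above can be sketched in plain Python; the helper name and the accuracy values below are hypothetical:

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the epoch (0-indexed) at which training should stop: the first
    epoch after which validation accuracy has failed to improve for
    `patience` consecutive epochs; otherwise the last epoch."""
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop: no improvement for `patience` epochs
    return len(val_accuracies) - 1

# Validation accuracy peaks at epoch 3, then degrades (overfitting sets in)
accs = [0.71, 0.78, 0.82, 0.85, 0.84, 0.83, 0.83, 0.82]
print(early_stopping_epoch(accs))  # → 6
```

In practice, the weights saved at the best-validation epoch (epoch 3 here) would be restored, which is what Keras's EarlyStopping callback automates.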
REGULARIZATION:
• By designing network models with different layers and sizes, the initial function hypothesis
space can be provided for the optimization algorithm, but the actual capacity of the model
can change as the network parameters are optimized and updated.
In addition to the original loss function, a sparsity penalty term is added to the optimization objective:

min L(x, y) + λ · Ω(θ)

where Ω(θ) represents the sparsity constraint function on the network parameters θ.
The sparsity constraint on the parameters θ is achieved by constraining the L norm of the
parameters, that is:

Ω(θ) = Σi ‖θi‖l

where ‖θi‖l denotes the l-norm of the parameter tensor θi.
• The goal of the optimization algorithm is then to minimize the original loss function L(x, y)
while also keeping the network sparsity term Ω(θ) small.
• Here λ is the weight parameter to balance the importance of L(x, y) and Ω(θ).
• Larger λ means that the sparsity of the network is more important; smaller λ means that the
training error of the network is more important.
• By selecting an appropriate λ, you can get good training performance while ensuring the
sparsity of the network, which leads to good generalization ability.
• Commonly used regularization methods are L0, L1, and L2 regularization.
L0 regularization:
• L0 regularization refers to the regularization calculation method using the L0 norm as the
sparsity penalty term Ω(θ).
• This constraint can force the connection weights in the network to be mostly 0, thereby
reducing the actual amount of network parameters and network capacity.
• DISADVANTAGE: Because the L0 norm is not differentiable, the gradient descent algorithm
cannot be used for optimization, so the L0 norm is not often used in neural networks.
L1 Regularization
• The L1 regularization refers to the regularization calculation method using the L1 norm as
the sparsity penalty term Ω(θ).
• The L1 norm ‖θi‖1 is defined as the sum of the absolute values of all elements in the tensor
θi.
L2 regularization:
• The L2 regularization refers to the regularization calculation method using the L2 norm as
the sparsity penalty term Ω(θ).
• The L2 norm ‖θi‖2 is defined as the square root of the sum of squares of all elements in the
tensor θi; in regularization, the squared L2 norm (the plain sum of squares) is used as the
penalty term.
• IMPLEMENTATION:
• A standard least squares model tends to have some variance in it. Such a model won't
generalize well for a dataset different from its training data.
• So the tuning parameter λ, used in the regularization, controls the impact on bias and
variance.
• As the value of λ rises, it shrinks the coefficient values, reducing the variance.
• Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding
overfitting) without losing any important properties in the data.
• But beyond a certain value, the model starts losing important properties, giving rise to bias in
the model and thus underfitting.
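A minimal NumPy sketch of the regularized objective L(x, y) + λ·Ω(θ), with Ω computed as the L1 norm or the (squared) L2 norm; the function name and sample weights are illustrative:

```python
import numpy as np

def regularized_loss(base_loss, weights, lam, norm="l2"):
    """Add a sparsity penalty lambda * Omega(theta) to the base loss.
    Omega is the L1 norm (sum of absolute values) or the squared L2 norm
    (sum of squares), summed over all weight tensors."""
    if norm == "l1":
        penalty = sum(np.sum(np.abs(w)) for w in weights)
    else:
        penalty = sum(np.sum(w ** 2) for w in weights)
    return base_loss + lam * penalty

weights = [np.array([[1.0, -2.0], [0.5, 0.0]])]
print(regularized_loss(1.0, weights, lam=0.1, norm="l1"))  # 1.0 + 0.1 * 3.5  ≈ 1.35
print(regularized_loss(1.0, weights, lam=0.1, norm="l2"))  # 1.0 + 0.1 * 5.25 ≈ 1.525
```

A larger λ makes the penalty dominate, pushing the optimizer toward smaller (or zero) weights, exactly the bias/variance trade-off described above.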
DROPOUT:
Dropout works by essentially "dropping" neurons from the input or hidden layers. Multiple
neurons are removed from the network, meaning they practically do not exist: their
incoming and outgoing connections are also removed.
This artificially creates a multitude of smaller, less complex networks. This forces the model to
not become solely dependent on one neuron, meaning it has to diversify its approach and
develop a multitude of methods to achieve the same result.
Dropout is applied to a neural network by randomly dropping neurons in every layer
(including the input layer). A pre-defined dropout rate determines the chance of each neuron
being dropped. For example, a dropout rate of 0.25 means that each neuron has a 25% chance
of being dropped. Dropout is applied during every epoch of model training.
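A minimal NumPy sketch of dropout as described above; the function name is illustrative, and the scaling by 1/(1 − rate) is the common "inverted dropout" trick that keeps the expected activation unchanged between training and inference:

```python
import numpy as np

def dropout(activations, rate=0.25, training=True, rng=None):
    """Inverted dropout: zero each activation with probability `rate`
    and scale survivors by 1/(1 - rate) so the expected value is unchanged."""
    if not training or rate == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= rate   # keep with probability 1 - rate
    return activations * mask / (1.0 - rate)

a = np.ones(10_000)
out = dropout(a, rate=0.25, rng=np.random.default_rng(0))
print((out == 0).mean())   # roughly 0.25 of the neurons are dropped
print(out.mean())          # close to 1.0 on average, thanks to the rescaling
```

At inference time (`training=False`) the layer is an identity, which matches how frameworks such as Keras implement their Dropout layer.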
DATA AUGMENTATION:
One of the best techniques for reducing overfitting is to increase the size of the training dataset. As
discussed in the previous technique, when the size of the training data is small, the network
tends to memorize the training data.
So, to increase the size of the training data, i.e., to increase the number of images present in the
dataset, we can use data augmentation, which is the easiest way to diversify our data and make the
training data larger.
Some of the popular image augmentation techniques are flipping, translation, rotation,
scaling, cropping, changing brightness, adding noise, etc.
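A few of these augmentations can be sketched directly with NumPy on a toy image (real pipelines would use a library such as Keras preprocessing layers; the array values here are illustrative):

```python
import numpy as np

# Toy 2x3 "image" of grayscale pixel intensities
img = np.array([[1, 2, 3],
                [4, 5, 6]])

flipped = np.fliplr(img)             # horizontal flip
rotated = np.rot90(img)              # rotate 90 degrees counter-clockwise
shifted = np.roll(img, 1, axis=1)    # crude horizontal translation (wraps around)
brighter = np.clip(img + 1, 0, 255)  # change brightness
noisy = img + np.random.default_rng(0).normal(0, 0.1, img.shape)  # add noise

print(flipped.tolist())  # [[3, 2, 1], [6, 5, 4]]
```

Each transform produces a new labeled sample from an existing one, which is why augmentation effectively enlarges the training set without collecting new data.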
You are given a synthetic binary classification task with 20 input features, where 15
features are informative and 5 are redundant. This high-dimensional dataset is prone to
overfitting when trained using a neural network.
🔍 Objectives
🧪 Dataset
Distribution: Stratified split into 70% training, 15% validation, 15% test
✅ Deliverables
Graphs of loss and accuracy showing signs of overfitting and the effect of
regularization.
Code should clearly demonstrate the difference between:
o Overfitting scenario (without regularization – optionally)
Fully connected layers are used efficiently in neural networks by connecting every neuron in one layer to every neuron in the preceding layer, maximizing the amount of information transfer and learning potential. This structure allows the model to capture complex representations as it combines learned features at different levels effectively. Backpropagation through these layers is optimized using chain-rule-based gradient calculations that ensure efficient updates of the weights, ultimately enhancing the network's ability to minimize loss through iterative learning steps like stochastic gradient descent.
The gradient computation for a single neuron involves determining the partial derivative of the loss function with respect to its own weights and biases, which depends on its output, the true value, and the input connected to it. In contrast, a fully connected neural network layer involves calculating gradients for multiple neurons simultaneously, where each output node's gradient is influenced by all the input connections and the associated weights. Despite the complexity, both computations use the principle of backpropagation but vary in scale and interaction across connections.
Using the δ variable simplifies gradient calculation by encapsulating the gradient component related to error propagation at the end of a connection line. This simplification allows the backpropagation algorithm to focus on the relationship between the start node and the δ variable, effectively reducing complexity when calculating the partial derivatives of all parameters across layers. It streamlines the recursive gradient computations, making it easier to implement and understand the gradient propagation through layers.
The chain rule plays a critical role in the backpropagation algorithm by allowing the efficient calculation of gradients for the loss function with respect to each layer's parameters. It facilitates the gradient computation by providing a method to decompose the gradient of a composite function (i.e., the neural network layers) into simpler components. This is done by breaking down the derivative of the overall function into the products of derivatives of its constituent functions, thereby enabling the recursion necessary to propagate gradients from the output layer back to the input layer.
Custom neural network layers in Keras can be created by subclassing keras.layers.Layer and overriding essential methods like __init__, build, and call. The __init__ method initializes parameters, build defines layer-specific weights, and call implements the forward pass logic. Once defined, these layers can be incorporated into custom models by subclassing keras.Model, allowing developers to create unique architectures that fit specific tasks. Such a design offers flexibility in model creation while maintaining the ability to use Keras's high-level interface for training and evaluation.
In a single neuron, the forward pass involves calculating a weighted sum of inputs and applying an activation function to produce an output. In contrast, a fully connected layer performs this operation for each neuron in the layer, utilizing all inputs from the preceding layer, thereby producing an entire output vector. This layer-by-layer propagation allows capturing complex patterns through the successive transformations applied to the data, aligning with the intended task complexity.
In Keras, models can be saved using methods like Model.save('model.h5'), which saves both the architecture and weights in HDF5 format, and models can be loaded using keras.models.load_model. The SavedModel format in TensorFlow offers another way, which is platform-independent and does not require the network's source files for recovery. This method is particularly advantageous for deploying models across different platforms. Each method caters to different phases of the model lifecycle, where HDF5 is beneficial for iteration during development, while SavedModel is preferred for final deployment.
A model zoo, such as Keras Applications, provides a repository of commonly used pre-trained models like ResNet and VGG. These models can be implemented with minimal code, providing a significant head start in solving specific tasks due to their pre-trained weights on large datasets like ImageNet. Using a model zoo speeds up the development process as it requires fewer resources than training from scratch. Moreover, these models can be fine-tuned to adapt to specific tasks, thereby leveraging their learned features while customizing the network's output to fit the problem at hand.
Pre-trained models offer advantages such as reducing training time by using prior knowledge from large datasets and eliminating the need for substantial compute resources. They enable quick deployment of effective models and are particularly valuable in resource-constrained environments. However, limitations include potential overfitting to the original dataset, difficulty in adapting to significantly different tasks, and challenges in interpreting complex models. Their effectiveness hinges on the relevance of the originally learned features to the new task.
TensorBoard enhances neural network training by offering visualization tools that allow real-time monitoring of various metrics such as loss, accuracy, histograms, and images. It provides insights into model behavior and aids in debugging by displaying scalar metrics, model architecture, and custom metrics. Users can navigate through different visual pages to track training progress and make data-driven decisions to adjust hyperparameters or architecture for improved training outcomes. This real-time monitoring capability is crucial for iterative refinement and optimization of models.