
Neural Network Backpropagation Insights

The document discusses the Backward Propagation Algorithm, including derivatives and gradients, activation functions, and loss functions critical for training neural networks. It covers practical aspects of using Keras for model configuration, training, and addressing overfitting through techniques like regularization and dropout. Additionally, it explains the chain rule's importance in gradient propagation for optimizing neural network parameters during training.

UNIT-III:

Backward Propagation Algorithm: Derivatives and Gradients, Common Properties of Derivatives, Derivative of Activation Function, Gradient of Loss Function, Gradient of Fully Connected Layer, Chain Rule, Back Propagation Algorithm, Hands-On Handwritten Digit Image Recognition.
Keras Advanced API: Common Functional Modules, Model Configuration, Training, and Testing, Model Saving and Loading, Custom Network, Model Zoo, Metrics, Hands-On Accuracy Metric, Visualization.
Overfitting: Model Capacity, Overfitting and Underfitting, Dataset Division, Model Design, Regularization, Dropout, Data Augmentation, Hands-On Overfitting.

DERIVATIVES AND GRADIENTS:


 Neural network model expressions are usually very complex, and the number of model parameters can reach tens or even hundreds of millions.
 Almost all neural network optimization problems rely on deep learning frameworks
to automatically calculate the gradient of network parameters and then use gradient
descent to iteratively optimize the network parameters until the performance meets
the requirement.
 The main algorithms implemented in deep learning frameworks are back propagation
and gradient descent algorithms. So understanding the principles of these two
algorithms is helpful to understand the role of deep learning frameworks.
 Before introducing the back propagation algorithm of the multilayer neural network,
we first introduce the common attributes of the derivative, the gradient derivation of
the common activation function, and the loss function and then derive the gradient
propagation law of the multilayer neural network.
 COMMON DERIVATIVES:

d/dx(c) = 0, d/dx(x^a) = a·x^(a−1), d/dx(e^x) = e^x, d/dx(a^x) = a^x · ln a, d/dx(ln x) = 1/x

 COMMON PROPERTIES OF DERIVATIVES:

Sum rule: (f + g)′ = f′ + g′
Scalar multiple: (c·f)′ = c·f′
Product rule: (f·g)′ = f′·g + f·g′
Quotient rule: (f/g)′ = (f′·g − f·g′) / g²
Composite (chain) rule: (f(g(x)))′ = f′(g(x)) · g′(x)

DERIVATIVE OF ACTIVATION FUNCTIONS:
DERIVATIVE OF SIGMOID ACTIVATION FUNCTION:
dσ(x)/dx = σ(x)(1 − σ(x))
It can be seen that the derivative of the Sigmoid function can be expressed as a simple operation on the output value of the activation function itself. Using this property, we can compute its derivative by caching the output value of the Sigmoid function of each layer during the gradient calculation of the neural network.

NumPy code:
import numpy as np # import numpy library
def sigmoid(x): # implement sigmoid function
    return 1 / (1 + np.exp(-x))
def derivative(x): # calculate derivative of sigmoid
    # Using the derived expression of the derivative
    return sigmoid(x) * (1 - sigmoid(x))
DERIVATIVE OF ReLU ACTIVATION FUNCTION:
 Before the ReLU function was widely used, the activation function in neural networks was mostly Sigmoid. However, the Sigmoid function is prone to gradient dispersion: as the number of layers of the network increases, the gradient values become very small, and the parameters of the network cannot be effectively updated. As a result, deeper neural networks could not be trained, and neural network research stayed at the shallow level.
 With the introduction of the ReLU function, the phenomenon of gradient dispersion is well alleviated, and the number of layers of the neural network can reach much deeper levels.
 The expression of the ReLU function: ReLU (x) = max(0,x)

 The derivation of its derivative is very simple:


d/dx[ReLU] = 1 for x ≥ 0 and d/dx[ReLU] = 0 for x < 0

 It can be seen that the derivative calculation of the ReLU function is simple. When x
is greater than or equal to zero, the derivative value is always 1.
 In the process of back propagation, it will neither amplify the gradient, causing
gradient exploding, nor shrink the gradient, causing gradient vanishing phenomenon.
 The derivative curve of the ReLU function is shown in Figure:

NumPy code:
def derivative(x): # Derivative of ReLU
    d = np.array(x, copy=True)
    d[x < 0] = 0   # negative part has derivative 0
    d[x >= 0] = 1  # non-negative part has derivative 1
    return d

DERIVATIVE OF LEAKY ReLU ACTIVATION FUNCTION:

 The expression of LeakyReLU function:


LeakyReLU(x) = x for x ≥ 0 and LeakyReLU(x) = px for x < 0

 Its derivative can be derived as:


d/dx[LeakyReLU] = 1 for x ≥ 0 and d/dx[LeakyReLU] = p for x < 0

 It’s different from the ReLU function because when x is less than zero, the derivative
value of the LeakyReLU function is not 0, but a constant p, which is generally set to a
smaller value, such as 0.01 or 0.02.
 The derivative curve of the LeakyReLU function is shown in Figure:
NumPy Code:
def derivative(x, p): # p is the slope of the negative part of LeakyReLU
    dx = np.ones_like(x) # initialize a vector with 1
    dx[x < 0] = p        # set negative part to p
    return dx

DERIVATIVE OF TANH ACTIVATION FUNCTION:

The tanh function can be expressed through the Sigmoid function as tanh(x) = 2σ(2x) − 1, and its derivative is:

d/dx[tanh(x)] = 1 − tanh²(x)

NumPy code:
def sigmoid(x): # sigmoid function
    return 1 / (1 + np.exp(-x))
def tanh(x): # tanh function
    return 2*sigmoid(2*x) - 1
def derivative(x): # derivative of tanh
    return 1 - tanh(x)**2
GRADIENT OF LOSS FUNCTION
In neural networks, the derivative of the loss function with respect to the model
parameters (weights and biases) is crucial for training through techniques like
gradient descent. The goal is to minimize this loss function to improve the
model's performance. Here's a brief overview:

Loss Function: The choice of loss function depends on the task. For example,
mean squared error (MSE) is common for regression tasks, while categorical
cross-entropy is often used for classification tasks.

Derivative: The derivative of the loss function with respect to a parameter


measures how the loss changes as the parameter changes. This gradient guides
the optimization algorithm (like gradient descent) to update parameters
iteratively.

GRADIENT OF LOSS FUNCTION - MSE:


The loss function is a method of evaluating how well your machine learning algorithm models your featured data set. In other words, loss functions are a measurement of how good your model is in terms of predicting the expected outcome. The loss function drives the training process, which uses back-propagation to minimize the error between the actual and predicted outcome.

Here we mainly derive the gradient expressions of the mean square error loss function and the cross-entropy loss function. The mean square error loss function expression is:

L = 1/2 * Σ_k (o_k − t_k)^2

Then its partial derivative with respect to each predicted output is:

∂L/∂o_i = o_i − t_i

This derivative represents the gradient of the MSE with respect to the predicted outputs. It indicates how much the MSE changes when the predicted outputs change. This gradient is then used in the backpropagation algorithm to update the model parameters during training.
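The MSE expression and its gradient can be checked numerically with a short NumPy sketch; the sample outputs and labels below are illustrative, not taken from the text:

```python
import numpy as np

# MSE loss L = 1/2 * sum((o - t)^2) and its analytic gradient dL/do = o - t.
def mse(o, t):
    return 0.5 * np.sum((o - t) ** 2)

def mse_grad(o, t):
    return o - t

o = np.array([0.7, 0.2, 0.9])   # predicted outputs (illustrative)
t = np.array([1.0, 0.0, 1.0])   # true labels (illustrative)
analytic = mse_grad(o, t)

# Central finite differences should agree with the analytic gradient.
eps = 1e-6
numeric = np.zeros_like(o)
for i in range(len(o)):
    d = np.zeros_like(o)
    d[i] = eps
    numeric[i] = (mse(o + d, t) - mse(o - d, t)) / (2 * eps)
```

The agreement between the analytic and finite-difference gradients is exactly the kind of check deep learning frameworks perform internally when testing their automatic differentiation.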

GRADIENT OF CROSS ENTROPY FUNCTION:


The Cross Entropy Loss function is commonly used in classification tasks,
especially when the model outputs probabilities over multiple classes. When
calculating the cross-entropy loss function, the Softmax function and the cross-
entropy function are generally implemented in a unified manner.
We first derive the gradient of the Softmax function, and then derive the
gradient of the cross-entropy function.
Gradient of Softmax: The expression of Softmax is:

p_i = e^(z_i) / Σ_k e^(z_k)

We know that if i = j, the partial derivative is ∂p_i/∂z_j = p_i(1 − p_j), and if i ≠ j, it is ∂p_i/∂z_j = −p_i * p_j. Combined with the cross-entropy loss L = −Σ_k t_k log(p_k), the overall gradient simplifies to:

∂L/∂z_i = p_i − t_i
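These Softmax and cross-entropy gradients can be sketched in NumPy; the logits and one-hot label below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / np.sum(e)

def softmax_jacobian(z):
    # J[i, j] = p_i * (delta_ij - p_j)
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def softmax_cross_entropy_grad(z, t):
    # combined Softmax + cross-entropy gradient: p - t
    return softmax(z) - t

z = np.array([2.0, 1.0, 0.1])   # logits (illustrative)
t = np.array([1.0, 0.0, 0.0])   # one-hot label (illustrative)
J = softmax_jacobian(z)
g = softmax_cross_entropy_grad(z, t)
```

Implementing Softmax and cross-entropy together, as frameworks do, avoids computing the full Jacobian and is numerically more stable.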
GRADIENT OF SINGLE NEURON:

 For a neuron model using Sigmoid activation function, its mathematical


model can be written as:
o(1) = σ(w(1)^T x + b(1))
 The superscript of the variable represents the number of layers. For
example, o(1) represents the output of the first layer and x is the input of
the network
 We take the partial derivative derivation ∂L/ ∂wj1 of the weight
parameter wj1 as an example
 The number of input nodes is j. The weight connection from the input of
the jth node to the output o(1) is denoted as wj1 (1) , where the
superscript indicates the number of layers to which the weight parameter
belongs, and the subscript indicates the starting node number and the
ending node number of the current connection.
 For example, the subscript j1 indicates the jth node of the previous layer
to the first node of the current layer.
 The variable before the activation function σ is called z1(1) , and the
variable after the activation function σ is called o1(1) . Because there is
only one output node, so o1(1)=o(1)=o .
 The error value L is calculated by the error function between the output
and the real label

 The loss can be expressed as:


L = 1/2 (o1(1) − t)^2 = 1/2 (o(1) − t)^2
 Among them, t is the real label value. Adding 1/2 does not affect the
direction of the gradient, and the calculation is simpler
 We take the weight variable wj1 of the jth (j ∈ [1,J]) node as an example
and consider the partial derivative of the loss function L:
∂L/∂wj1 = (o1 − t) ∂o1/∂wj1
 Considering o1 = σ(z1) and the derivative of the Sigmoid function σ′ = σ(1 − σ), we have:

∂L/∂wj1 = (o1 − t) σ(z1)(1 − σ(z1)) ∂z1/∂wj1 = (o1 − t) o1(1 − o1) xj

 It can be seen from the preceding formula that the partial derivative of the error with respect to the weight wj1 is only related to the output value o1, the true value t, and the input xj connected to the current weight.
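A short sketch can confirm the single-neuron gradient (o − t) o(1 − o) xj against a finite-difference approximation; the sample values below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, -1.2, 0.3])   # inputs x_j (illustrative)
w = np.array([0.1, 0.4, -0.2])   # weights w_j1 (illustrative)
b = 0.05
t = 1.0                          # real label

o = sigmoid(np.dot(w, x) + b)
grad_w = (o - t) * o * (1 - o) * x   # analytic gradient for every w_j1

# finite-difference check on the first weight
def loss(w_):
    return 0.5 * (sigmoid(np.dot(w_, x) + b) - t) ** 2

eps = 1e-6
d = np.zeros_like(w)
d[0] = eps
numeric = (loss(w + d) - loss(w - d)) / (2 * eps)
```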

NOTE:
The gradient of a single neuron in a neural network typically refers to the
derivative of the loss function with respect to the parameters (weights and
biases) of that neuron. This gradient is essential for updating the parameters
during the training process using optimization algorithms like gradient descent.

Forward Pass: During the forward pass, inputs are passed through the neuron,
and its output is calculated using the neuron's activation function.

Loss Calculation: The output of the neuron is compared to the target output
(ground truth) to compute the loss using a predefined loss function (e.g., mean
squared error, cross-entropy).

Backpropagation: The gradient of the loss with respect to the parameters of the
neuron is computed during the backpropagation phase. This involves applying
the chain rule to recursively compute gradients layer by layer, starting from the
output layer to the input layer.

Gradient Descent: Once the gradients are computed, they are used to update the
parameters of the neuron in the opposite direction of the gradient, aiming to
minimize the loss function. This is typically done iteratively over multiple
epochs until convergence.
GRADIENT OF FULLY CONNECTED LAYER
 We generalize the single neuron model to a single-layer network of fully
connected layers, as shown in Figure .
 The input layer obtains the output vector o(1) through a fully connected
layer and calculates the mean square error with the real label vector t.
 The number of input nodes is J, and the number of output nodes is K.

 The multi-output fully connected network layer model differs from the single neuron model in that it has many more output nodes o1(1), o2(1), o3(1), …, oK(1), and each output node corresponds to a real label t1, t2, …, tK. wjk is the connection weight of the jth input node and the kth output node. The mean square error can be expressed as:

L = 1/2 * Σ_{i=1..K} (oi(1) − ti)^2

 Since ∂L/∂wjk is only associated with node ok(1), the summation symbol in the preceding formula can be removed, that is, i = k:

∂L/∂wjk = (ok − tk) ∂ok/∂wjk

 Substituting ok = σ(zk) and σ′ = σ(1 − σ):

∂L/∂wjk = (ok − tk) ok(1 − ok) xj

 It can be seen that the partial derivative of wjk is only related to the output node ok(1) of the current connection, the corresponding true label tk, and the corresponding input node xj.

 Define the variable δk ≜ (ok − tk) ok(1 − ok). It characterizes the error gradient propagated to the end node of the connection line, and the partial derivative becomes:

∂L/∂wjk = δk xj

 After using the representation δk, the partial derivative ∂L/∂wjk is only related to the start node xj and the end node δk of the current connection.
 Now that the gradient propagation method of the single-layer neural
network (i.e., the output layer) has been derived, next we try to derive the
gradient propagation method of the penultimate layer.

 After completing the propagation derivation of the penultimate layer,


similarly, the gradient propagation mode of all hidden layers can be
derived cyclically to obtain gradient calculation expressions of all layer
parameters.
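The per-connection rule ∂L/∂wjk = δk · xj can be sketched for a small layer and checked by finite differences; the layer sizes and values below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, -0.3])            # J = 2 input nodes (illustrative)
W = np.array([[0.1, 0.4, -0.2],      # W[j, k]: input j -> output k
              [0.3, -0.1, 0.2]])
b = np.array([0.05, -0.05, 0.1])
t = np.array([1.0, 0.0, 1.0])        # K = 3 real labels (illustrative)

o = sigmoid(x @ W + b)
delta = (o - t) * o * (1 - o)        # delta_k for each output node
grad_W = np.outer(x, delta)          # grad_W[j, k] = x_j * delta_k

# finite-difference check on w_00
def loss(W_):
    o_ = sigmoid(x @ W_ + b)
    return 0.5 * np.sum((o_ - t) ** 2)

eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
```

Note how the whole weight gradient is an outer product of the input vector and the δ vector, which is exactly the form used layer by layer in backpropagation.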

NOTE:

In a fully connected neural network, also known as a dense neural network,


each neuron in a given layer is connected to every neuron in the preceding
layer, and possibly to every neuron in the following layer as well (except for the
input and output layers).

To compute the gradient of the loss function with respect to the parameters
(weights and biases) of the entire network, you typically use the
backpropagation algorithm. Backpropagation efficiently computes these
gradients by applying the chain rule of calculus layer by layer, starting from the
output layer and moving backward through the network.

Here's an overview of how the gradient of a fully connected network is


computed:

Forward Pass: Inputs are passed through the network layer by layer, and the
output of the network is computed.

Loss Calculation: The output of the network is compared to the target output to
compute the loss using a predefined loss function.

Backpropagation: Gradients of the loss with respect to the parameters of each


neuron in each layer are computed using the chain rule. This involves
calculating the gradient of the loss with respect to the output of each neuron and
then recursively computing gradients for the preceding layers. This process is
efficient due to the way gradients are propagated backward through the
network.

Gradient Descent: Once the gradients are computed, they are used to update the
parameters of the network in the opposite direction of the gradient, aiming to
minimize the loss function. This is typically done using optimization algorithms
like stochastic gradient descent (SGD), Adam, or RMSprop.
CHAIN RULE:
The chain rule is a fundamental concept in calculus that allows you to find the derivative of a
composite function. In other words, it helps you calculate the rate of change of a function that
is composed of two or more functions nested inside each other. The chain rule is especially
important in calculus when dealing with functions where one quantity depends on another,
which in turn depends on another, and so on.

Mathematically, the chain rule is stated as follows:

If you have a composite function y = f(g(x)), where:

y is the final output or dependent variable,

f(u) is the outer function,

g(x) is the inner function, and u = g(x),

Then, the derivative of y with respect to x, denoted as dy/dx, is calculated as:

dy/dx = (df/du) * (du/dx)

In words, the chain rule says that to find the derivative of the composite function y = f(g(x)),
you multiply the derivative of the outer function f(u) with respect to its variable u by the
derivative of the inner function g(x) with respect to x.

Here's a more concrete example:

Let y = f(u) = u^2, and u = g(x) = 3x - 1. We want to find dy/dx.

Find df/du: df/du = 2u.

Find du/dx: du/dx = 3.

Now, apply the chain rule:

dy/dx = (df/du) * (du/dx)

dy/dx = (2u) * (3)

Since u = g(x), we can substitute u back in:

dy/dx = (2(3x - 1)) * 3

dy/dx = 6(3x - 1)

So, dy/dx = 18x - 6.

The chain rule is a powerful tool that enables you to find derivatives of complex functions by
breaking them down into simpler functions and considering how changes in the inner
function affect the outer function. It's a fundamental concept in calculus and is widely used in
various fields of mathematics and science.
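The worked example can be verified in a few lines of code:

```python
# Chain rule check for y = u^2 with u = 3x - 1: dy/dx should equal 18x - 6.
def dy_dx(x):
    u = 3 * x - 1
    return (2 * u) * 3        # (df/du) * (du/dx)

def y(x):
    return (3 * x - 1) ** 2

x0 = 2.0
eps = 1e-6
numeric = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)  # finite-difference slope
analytic = 18 * x0 - 6
```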

CHAIN RULE IN GRADIENT PROPAGATION:

In the context of machine learning and neural networks, the chain rule plays a crucial role in
gradient propagation during the training process. Gradient propagation is the process of
computing gradients (derivatives) of a composite function, such as a neural network, with
respect to its parameters. The chain rule is used to calculate these gradients efficiently,
allowing the model to update its parameters during training through optimization algorithms
like gradient descent.

Let's break down how the chain rule is used in gradient propagation:

Neural Network Forward Pass:

During the forward pass of a neural network, input data is processed through multiple layers.
Each layer applies an activation function to its input and produces an output. This process can
be represented as a sequence of functions:

Input data: x

Layer 1: h₁ = f₁(W₁ * x + b₁)

Layer 2: h₂ = f₂(W₂ * h₁ + b₂)

...

Output layer: y = fₒ(Wₒ * hₖ + bₒ)

Here, each layer has its own weights (W) and biases (b), and f₁, f₂, ..., fₒ are activation
functions.

Computing Loss:

The neural network makes predictions (y) based on the input data, and the predictions are
compared to the actual target values to compute a loss function (e.g., Mean Squared Error or
Cross-Entropy Loss).

Gradient Calculation - Backpropagation:

The goal of training is to minimize the loss. To do this, you need to calculate the gradients of
the loss with respect to the parameters (weights and biases) in each layer of the network. This
is where the chain rule comes into play.

Starting from the output layer and moving backward through the network (hence the term
"backpropagation"), you calculate the gradients layer by layer using the chain rule. The key
steps are as follows:

Compute the gradient of the loss with respect to the output layer's inputs.
Propagate this gradient backward through each layer, multiplying it by the gradient of the
layer's inputs with respect to its parameters, using the chain rule.

Mathematically, the chain rule is applied as follows:

∂Loss/∂Wₒ = ∂Loss/∂y * ∂y/∂zₒ * ∂zₒ/∂Wₒ, where zₒ = Wₒ * hₖ + bₒ

∂Loss/∂bₒ = ∂Loss/∂y * ∂y/∂zₒ * ∂zₒ/∂bₒ

∂Loss/∂W₂ = ∂Loss/∂y * ∂y/∂zₒ * ∂zₒ/∂h₂ * … * ∂h₂/∂z₂ * ∂z₂/∂W₂, where z₂ = W₂ * h₁ + b₂

∂Loss/∂b₂ = ∂Loss/∂y * ∂y/∂zₒ * ∂zₒ/∂h₂ * … * ∂h₂/∂z₂ * ∂z₂/∂b₂

...

The chain rule is repeatedly applied for each layer to calculate the gradients.

Updating Parameters:

Once you have computed the gradients of the loss with respect to the parameters, you can use
them to update the parameters through an optimization algorithm like gradient descent. The
goal is to adjust the parameters to minimize the loss, thereby improving the model's
performance.

In summary, the chain rule is a fundamental concept in gradient propagation within neural
networks. It enables the efficient calculation of gradients for each layer, allowing the network
to learn and adapt its parameters during training. This process is critical for the successful
training of machine learning models, including deep neural networks.
BACK PROPAGATION ALGORITHM:
Backpropagation, short for "backward propagation of errors," is an algorithm used to train
artificial neural networks, including deep learning models. It is a key component of the
training process and is responsible for updating the model's weights to minimize the error
between predicted and actual values. Here is a step-by-step explanation of the
backpropagation algorithm:

Step 1: Initialize Weights and Biases

Initialize the weights and biases of the neural network. These values are typically initialized
randomly.

Step 2: Forward Pass

Input data is fed forward through the network layer by layer, from the input layer to the
output layer.

For each layer:

Calculate the weighted sum of inputs for each neuron in the layer.

Apply the activation function to the weighted sum to get the output of each neuron.

Step 3: Compute Loss

Calculate the loss (error) between the predicted output and the actual target values using a
suitable loss function (e.g., Mean Squared Error for regression or Cross-Entropy Loss for
classification).

Step 4: Backward Pass (Backpropagation)

Compute the gradient of the loss with respect to the output layer's inputs. This gradient
measures how much a small change in the output of the network affects the loss.

∂Loss/∂output_layer_inputs

Propagate this gradient backward through the network to calculate the gradients of the loss
with respect to the weights and biases of each layer. This is done using the chain rule.

∂Loss/∂weights and ∂Loss/∂biases for each layer

Update the weights and biases of each layer using an optimization algorithm like gradient
descent. The goal is to adjust these parameters in a way that minimizes the loss.

Step 5: Repeat
Repeat steps 2-4 for a fixed number of iterations (epochs) or until the loss converges to a
satisfactory level.

Step 6: Evaluate Model

After training, evaluate the model's performance on a separate validation dataset or test
dataset to assess its generalization ability.

Step 7: Use the Model

Once the model is trained and evaluated, it can be used for making predictions on new,
unseen data.

That was the simple process; let's now understand it in detail.

Backpropagation is the core of how neural networks learn. Up until this point, you learned that training a neural network typically happens by the repetition of the following three steps:

• Feedforward: get the linear combination (weighted sum) and apply the activation function to get the output prediction (ŷ):

ŷ = σ(W(3) · σ(W(2) · σ(W(1) · x)))

• Compare the prediction with the label to calculate the error or loss function:

E(W, b) = |ŷᵢ – yᵢ|

• Use a gradient descent optimization algorithm to compute the Δw that optimizes the error function:

Δwᵢ = –α dE/dw

• Backpropagate the Δw through the network to update the weights.

Backpropagation, or the backward pass, means propagating the derivatives of the error with respect to each specific weight, dE/dwᵢ, from the last layer (output) back to the first layer (inputs) to adjust the weights. By propagating the change in weights Δw backward from the prediction node (ŷ) all the way through the hidden layers back to the input layer, the weights get updated:

w_next-step = w_current + Δw

This takes the error one step down the error mountain. Then the cycle starts again (steps 1 to 3) to update the weights and take the error another step down, until we get to the minimum error.

Backpropagation might sound clearer when we have only one weight: we simply adjust the weight by adding Δw to the old weight,

w_new = w – α dE/dw
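The one-weight update rule can be watched converging on a toy error function; E(w) = (w − 3)², with its minimum at w = 3, is an assumed example, not from the text:

```python
# Gradient descent on E(w) = (w - 3)^2, whose derivative is dE/dw = 2(w - 3).
def dE_dw(w):
    return 2 * (w - 3)

w = 0.0        # initial weight
alpha = 0.1    # learning rate
for _ in range(100):
    w = w - alpha * dE_dw(w)   # w_new = w - alpha * dE/dw
```

After enough iterations, w settles near the minimum at 3: each step moves the weight opposite the gradient, one step "down the error mountain".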

But it gets complicated when we have a multilayer perceptron (MLP) network with many weight variables. To make this clearer, consider the scenario in the figure:

How do we compute the change of the total error with respect to w1,3, that is, dE/dw1,3? How much will the total error change when we change the parameter w1,3?

For a weight that is directly connected to the error function, such as w2,1, this is straightforward: we simply apply the derivative rules to the error function. But to compute the derivatives of the total error with respect to the weights all the way back to the input, we need a calculus rule called the chain rule.

Let's apply the chain rule to calculate the derivative of the error with respect to the third weight on the first input, w1,3(1), where the (1) means layer 1, and the subscript 1,3 means node number 1 and weight number 3:

The error backpropagated to the edge w1,3(1) = effect of error on edge 4 · effect on edge 3 · effect on edge 2 · effect on target edge.

Thus the backpropagation technique is used by neural networks to update the weights to solve the best fit problem.


HANDS ON BACK PROPAGATION ALGORITHM:

This example demonstrates a small neural network trained with backpropagation. We start with a network that has two input features, one hidden layer, and a single output, and implement the forward and backward passes for training.

import numpy as np

# Define the sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x here is the sigmoid output, so the derivative is x * (1 - x)
    return x * (1 - x)

# Define the neural network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases with random values
        self.weights_input_hidden = np.random.randn(input_size, hidden_size)
        self.bias_hidden = np.zeros((1, hidden_size))
        self.weights_hidden_output = np.random.randn(hidden_size, output_size)
        self.bias_output = np.zeros((1, output_size))

    def forward(self, x):
        # Forward pass
        self.hidden_input = np.dot(x, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = sigmoid(self.hidden_input)
        self.output = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output
        return self.output

    def backward(self, x, y, output, learning_rate):
        # Backpropagation
        self.output_error = y - output
        self.output_delta = self.output_error  # linear output layer
        self.hidden_error = self.output_delta.dot(self.weights_hidden_output.T)
        self.hidden_delta = self.hidden_error * sigmoid_derivative(self.hidden_output)
        self.weights_hidden_output += learning_rate * self.hidden_output.T.dot(self.output_delta)
        self.weights_input_hidden += learning_rate * x.T.dot(self.hidden_delta)
        self.bias_output += learning_rate * np.sum(self.output_delta, axis=0)
        self.bias_hidden += learning_rate * np.sum(self.hidden_delta, axis=0)

    def train(self, x, y, epochs, learning_rate=0.1):
        for _ in range(epochs):
            output = self.forward(x)
            self.backward(x, y, output, learning_rate)

if __name__ == "__main__":
    # Define the dataset (XOR)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0], [1], [1], [0]])

    # Create and train the neural network
    input_size = 2
    hidden_size = 4
    output_size = 1
    nn = NeuralNetwork(input_size, hidden_size, output_size)
    epochs = 10000
    nn.train(X, y, epochs)

    # Test the trained network
    for i in range(len(X)):
        prediction = nn.forward(X[i:i+1])
        print("Input:", X[i], "Actual:", y[i], "Predicted:", prediction)

This code defines a simple neural network with one hidden layer that uses the sigmoid activation function in the hidden layer, and trains it on the XOR dataset.

EXAMPLE 2:

Here's an example of a four-layer fully connected neural network implemented in Python for
binary classification using backpropagation. This network has two input nodes, three hidden
layers with 20, 50, and 25 nodes respectively, and two output nodes for binary classification:

import numpy as np

# Define the sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x here is the sigmoid output
    return x * (1 - x)

# Define the neural network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_sizes, output_size):
        self.input_size = input_size
        self.hidden_sizes = hidden_sizes
        self.output_size = output_size
        self.num_layers = len(hidden_sizes) + 2  # including input and output layers

        # Initialize weights and biases with random values
        self.weights = [np.random.randn(input_size, hidden_sizes[0])]
        self.biases = [np.zeros((1, hidden_sizes[0]))]
        for i in range(len(hidden_sizes) - 1):
            self.weights.append(np.random.randn(hidden_sizes[i], hidden_sizes[i+1]))
            self.biases.append(np.zeros((1, hidden_sizes[i+1])))
        self.weights.append(np.random.randn(hidden_sizes[-1], output_size))
        self.biases.append(np.zeros((1, output_size)))

    def forward(self, x):
        self.layer_outputs = []
        input_layer = x
        # Forward pass through all layers
        for i in range(self.num_layers - 1):
            weighted_sum = np.dot(input_layer, self.weights[i]) + self.biases[i]
            layer_output = sigmoid(weighted_sum)
            self.layer_outputs.append(layer_output)
            input_layer = layer_output
        return input_layer

    def backward(self, x, y, output, learning_rate):
        # Backpropagation
        deltas = [None] * (self.num_layers - 1)
        error = y - output
        deltas[-1] = error * sigmoid_derivative(output)

        # Calculate deltas for the hidden layers, moving backward
        for i in range(self.num_layers - 3, -1, -1):
            error = deltas[i + 1].dot(self.weights[i + 1].T)
            deltas[i] = error * sigmoid_derivative(self.layer_outputs[i])

        # Update weights and biases; the input of layer i is the output of layer i-1
        inputs = [x] + self.layer_outputs[:-1]
        for i in range(self.num_layers - 1):
            self.weights[i] += learning_rate * inputs[i].T.dot(deltas[i])
            self.biases[i] += learning_rate * np.sum(deltas[i], axis=0, keepdims=True)

    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            for i in range(len(X)):
                x = X[i:i+1]        # keep a 2D row vector
                target = y[i:i+1]
                output = self.forward(x)
                self.backward(x, target, output, learning_rate)

    def predict(self, x):
        return self.forward(x)

if __name__ == "__main__":
    # Define the dataset (XOR with one-hot encoded labels)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])

    # Create and train the neural network
    input_size = 2
    hidden_sizes = [20, 50, 25]
    output_size = 2
    learning_rate = 0.1
    epochs = 10000
    nn = NeuralNetwork(input_size, hidden_sizes, output_size)
    nn.train(X, y, epochs, learning_rate)

    # Test the trained network
    for i in range(len(X)):
        prediction = nn.predict(X[i:i+1])
        print("Input:", X[i], "Actual:", y[i], "Predicted:", prediction)


Consider a simple two-layer model with Sigmoid activation. (The handwritten worked example here is largely illegible; the recoverable steps are summarized below.)

Step 1: Forward propagation. For each hidden node compute z = w1x1 + w2x2 + b1, then h = σ(z); the output is y = σ(Σ wk hk + b2). Use MSE as the loss (considering it to be a regression task): Loss = 1/2 (y – t)^2, where t is the target.

Step 2: Backpropagation at the output. Output layer gradient (chain rule): ∂Loss/∂w = ∂Loss/∂y · ∂y/∂z2 · ∂z2/∂w = δ2 · h, where δ2 = (y – t) · y(1 – y). Similarly, ∂Loss/∂b2 = δ2.

Step 3: Hidden layer gradient. The error term propagates backward through the connecting weight: δ1 = δ2 · w · h(1 – h), and ∂Loss/∂w1 = δ1 · x, ∂Loss/∂b1 = δ1.

Step 4: Update each parameter with the learning rate α: w ← w – α · ∂Loss/∂w, b ← b – α · ∂Loss/∂b.

Numerical example: the handwritten notes then substitute concrete inputs, weights, and a small learning rate (α ≈ 0.1) into these formulas — one forward pass produces the hidden outputs and the prediction, the loss is computed, the error is backpropagated to obtain δ2 and δ1, and each weight and bias is updated to its new value. (The specific handwritten numbers are not recoverable.)
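One complete numerical update of this kind can be reproduced in code; the inputs, weights, target, and learning rate below are illustrative stand-ins, since the handwritten values are not recoverable:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Tiny 2-2-1 network with sigmoid activations, MSE loss, learning rate 0.1.
x = np.array([0.1, 0.5])                  # inputs (illustrative)
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])   # W1[j, k]: input j -> hidden k
b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6])                 # hidden -> output weights
b2 = 0.1
t = 1.0                                   # target
alpha = 0.1                               # learning rate

# forward pass
h = sigmoid(x @ W1 + b1)
y = sigmoid(h @ W2 + b2)
loss = 0.5 * (y - t) ** 2

# backward pass
delta2 = (y - t) * y * (1 - y)            # output-layer error term
grad_W2 = delta2 * h
delta1 = delta2 * W2 * h * (1 - h)        # hidden-layer error terms
grad_W1 = np.outer(x, delta1)

# parameter update
W2_new = W2 - alpha * grad_W2
b2_new = b2 - alpha * delta2
W1_new = W1 - alpha * grad_W1
b1_new = b1 - alpha * delta1

# one gradient step should reduce the loss
loss_new = 0.5 * (sigmoid(sigmoid(x @ W1_new + b1_new) @ W2_new + b2_new) - t) ** 2
```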
KERAS
INTRODUCTION:
 Keras is an open-source neural network computing library mainly developed in the
Python language.
 It was originally written by François Chollet.
 It is designed as a highly modular and extensible high-level neural network interface,
so that users can quickly complete model building and training without excessive
professional knowledge.
 The Keras library is divided into a frontend and a backend. The backend generally
calls the existing deep learning framework to implement the underlying operations,
such as Theano, CNTK, and TensorFlow.
 The frontend interface is a set of unified interface functions abstracted by Keras.
Users can easily switch between different backend operations through Keras.
 Since 2017, most components of Keras have been integrated into the TensorFlow
framework.
 In 2019, Keras was officially identified as the only high-level interface API for TensorFlow 2, replacing high-level interfaces such as tf.layers included in TensorFlow 1. In other words, now you can only use the Keras interface to complete TensorFlow layer model building and training. In TensorFlow 2, Keras is implemented in the tf.keras submodule.

COMMON FUNCTIONAL MODULES:


 Keras is a popular deep learning framework that provides a high-level API for
building and training neural networks. It offers a variety of common functional
modules that you can use to construct neural network architectures.
 Keras provides a series of high-level neural network-related classes and functions,
such as classic dataset loading function, network layer class, model container, loss
function class, optimizer class, and classic model class.
1. Common Network Layer Classes:
 For a common neural network layer, we can use the tensor-mode low-level
interface functions, which are generally included in the tf.nn module.
 For common network layers, we generally use the layer method to complete the
model construction. A large number of common network layers are provided in the
tf.keras.layers namespace, such as fully connected layers, activation function layers,
pooling layers, convolutional layers, and recurrent neural network layers.
 For these network layer classes, you only need to specify the relevant parameters of
the network layer at the time of creation and use the __call__ method to complete the
forward calculation. When using the __call__ method, Keras will automatically call
the forward propagation logic of each layer, which is generally implemented in the
call function of the class.
Examples:
Dense Layer: The Dense layer is a fully connected layer where every neuron in the layer
is connected to every neuron in the previous layer. It's commonly used in feed forward
neural networks and can be used for tasks like image classification and regression.

from tensorflow.keras.layers import Dense


dense_layer = Dense(units=64, activation='relu')
Convolutional Layer: The Conv2D and Conv3D layers are used for 2D and 3D
convolution operations, respectively. They are essential for tasks like image and video
analysis.

from tensorflow.keras.layers import Conv2D


conv_layer = Conv2D(filters=32, kernel_size=(3, 3), activation='relu')
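The snippets above only create the layers. A brief sketch of calling them through __call__ to run the forward calculation, as described earlier (the input shapes here are chosen arbitrarily for illustration):

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv2D

dense_layer = Dense(units=64, activation='relu')
conv_layer = Conv2D(filters=32, kernel_size=(3, 3), activation='relu')

# __call__ builds the weights on first use and runs the forward pass
x = tf.random.normal([4, 100])           # a batch of 4 feature vectors
out = dense_layer(x)                     # => shape (4, 64)
print(out.shape)

img = tf.random.normal([4, 28, 28, 3])   # a batch of 4 RGB images
feat = conv_layer(img)                   # 'valid' padding: 28 - 3 + 1 = 26
print(feat.shape)                        # => (4, 26, 26, 32)
```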

2. Network Container:
For common networks, we need to manually call the class instance of each layer to
complete the forward propagation operation. When the network layer becomes deeper,
this part of the code appears very bloated. Multiple network layers can be encapsulated
into a large network model through the network container Sequential provided by Keras.
Only the instance of the network model needs to be called once to complete the sequential
propagation operation of the data from the first layer to the last layer.
For example, the two-layer fully connected network with a separate activation function
layer can be encapsulated as a network through the Sequential container.

import tensorflow as tf
from tensorflow.keras import layers, Sequential

network = Sequential([
    layers.Dense(3, activation=None),  # Fully-connected layer without activation function
    layers.ReLU(),                     # activation function layer
    layers.Dense(2, activation=None),  # Fully-connected layer without activation function
    layers.ReLU()                      # activation function layer
])
x = tf.random.normal([4, 3])
out = network(x)
The Sequential container can also continue to add a new network layer through the add()
method to dynamically create a network:

layers_num = 2
network = Sequential([])  # Create an empty container
for _ in range(layers_num):
    network.add(layers.Dense(3))  # add fully-connected layer
    network.add(layers.ReLU())    # add activation layer
network.build(input_shape=(4, 4))
network.summary()

When we encapsulate multiple network layers through the Sequential container, the parameter list of each layer is automatically incorporated into the Sequential container.
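This behavior can be checked with a short sketch (layer sizes as in the example above): after building the container, the parameters of every encapsulated layer appear in one list.

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

# Two Dense layers, each owning a kernel and a bias
network = Sequential([
    layers.Dense(3, activation=None),
    layers.ReLU(),
    layers.Dense(2, activation=None),
    layers.ReLU()
])
network.build(input_shape=(None, 4))  # create the weight tensors

# The container collects every layer's parameters into a single list:
# 2 kernels + 2 biases = 4 trainable variables
for v in network.trainable_variables:
    print(v.name, v.shape)
```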
MODEL CONFIGURATION, TRAINING AND TESTING IN KERAS
In Keras, you can create, configure, train, and test a neural network model by following a
series of steps. Below is an overview of these steps:

1. Model Creation and Configuration:


Import necessary libraries and modules from Keras.
Define the model architecture, including layers, activation functions, and input/output
shapes.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=feature_dim))
model.add(Dense(units=num_classes, activation='softmax'))

2. Compile the Model:


After defining the model, you need to configure its learning process by specifying the
optimizer, loss function, and evaluation metrics.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
3. Data Preparation:
Prepare your training and testing data by loading, preprocessing, and splitting it into input
and target (label) datasets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2,
random_state=42)
4. Model Training:
Train the model using the fit method. Pass in your training data (features and labels),
batch size, number of epochs, and validation data if necessary.
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_test, y_test))
5. Model Evaluation:
After training, you can evaluate the model's performance on the test dataset using the
evaluate method.
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")
6. Making Predictions:
Use the trained model to make predictions on new or unseen data.
predictions = model.predict(X_new_data)
7. Save and Load Models (Optional):
You can save your trained model to disk for later use and load it when needed.
model.save("my_model.h5")
loaded_model = tf.keras.models.load_model("my_model.h5")
8. Visualization (Optional):
You can visualize training history and performance using libraries like Matplotlib.
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()

MODEL SAVING AND LOADING

1. TENSOR METHOD:
In TensorFlow, you can save and load models using the tf.saved_model method. This
method provides a standard way to save and load models, making it compatible with
TensorFlow Serving and other TensorFlow-based deployment environments. Here's
how you can save and load a model using the tf.saved_model method:

Saving a Model:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Create a simple Sequential model


model = Sequential([
    Dense(units=64, activation='relu', input_dim=feature_dim),
    Dense(units=num_classes, activation='softmax')
])

# Compile the model


model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model


model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
          validation_data=(X_test, y_test))

# Save the model using the tf.saved_model method


tf.saved_model.save(model, 'my_model')
# 'my_model' is the directory where the model will be saved

Loading a Model:

import tensorflow as tf

# Load the model using the tf.saved_model.load method


loaded_model = tf.saved_model.load('my_model')

# Make predictions using the loaded model


predictions = loaded_model(X_new_data)
# X_new_data is your new data for inference.

In this example:

We first create a simple Sequential model and train it.

We save the model using tf.saved_model.save(model, 'my_model'), which will create a
directory named 'my_model' containing the saved model artifacts.

To load the model, we use tf.saved_model.load('my_model'). This will load the model
and return a callable object that you can use for making predictions.
Note that the model will be saved as a TensorFlow SavedModel, which is a directory
containing the model's architecture, weights, and other metadata. This format is
designed for easy deployment and compatibility with TensorFlow Serving.

Make sure to replace 'my_model' with the path where you want to save or load your
model. You can also specify different versions of the model by using different
directory names when saving.

2. USING NETWORK METHOD:


Let’s introduce a method that does not require network source files and only needs the
model file to recover the network. The model structure and model parameters can be
saved to a file at path through the network.save(path) function, and the network
structure and network parameters can be restored through tf.keras.models.load_model(path)
without the need for network source files.
First, save the MNIST handwritten digital picture recognition model to a file, and delete
the network object:
# Save model and parameters to a file
network.save('model.h5')
print('saved total model.')
del network # Delete the network
The structure and state of the network can be recovered through the model.h5 file, and
there is no need to create network objects in advance.
The code is as follows:
# Recover the model and parameters from a file
network = tf.keras.models.load_model('model.h5')
In addition to the model parameters, the model.h5 file also stores the network
structure information, so you can recover the network object directly from the file
without creating a model in advance.

3. SAVED MODEL METHOD:


TensorFlow is favored by the industry, not only because of the excellent neural network
layer API support, but also because it has powerful ecosystem, including mobile and web
support. When the model needs to be deployed to other platforms, the SavedModel
method proposed by TensorFlow is platform-independent.
By tf.saved_model.save(network, path), the model can be saved to the path directory as
follows:
# Save model and parameters to a file
tf.saved_model.save(network, 'model-savedmodel')
print('saving savedmodel.')
del network # Delete network object
After recovering the model instance, we complete the calculation of the test accuracy rate
and achieve the following:
print('load savedmodel from file.')
# Recover network and parameter from files
network = tf.saved_model.load('model-savedmodel')
# Accuracy metrics
acc_meter = tf.keras.metrics.CategoricalAccuracy()
for x, y in ds_val:  # Loop through test dataset
    pred = network(x)  # Forward calculation
    acc_meter.update_state(y_true=y, y_pred=pred)  # Update stats
# Print accuracy
print("Test Accuracy: %f" % acc_meter.result())
CUSTOM NETWORK LAYERS:
In Keras, you can create custom neural network architectures by subclassing the
tf.keras.layers.Layer and tf.keras.Model classes. This approach allows you to define your
own forward pass logic and create highly customized network architectures.

Here's how to create a custom neural network in Keras:

#Import the necessary libraries:


import tensorflow as tf
from tensorflow.keras.layers import Layer

Define a custom layer or set of layers. You can do this by creating a class that inherits from
tf.keras.layers.Layer. Implement the __init__ method to define layer parameters and the call
method to specify the layer's forward pass.

For example, here's how you can define a custom layer:


class CustomLayer(Layer):
    def __init__(self, num_units, activation='relu'):
        super(CustomLayer, self).__init__()
        self.num_units = num_units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # Define layer variables and weights here
        self.kernel = self.add_weight("kernel",
                                      shape=[input_shape[-1], self.num_units])
        self.bias = self.add_weight("bias", shape=[self.num_units])

    def call(self, inputs):
        # Define the layer's forward pass logic
        return self.activation(tf.matmul(inputs, self.kernel) + self.bias)

Create a custom model by subclassing tf.keras.Model and define the architecture by composing
custom layers.

class CustomModel(tf.keras.Model):
    def __init__(self):
        super(CustomModel, self).__init__()
        self.layer1 = CustomLayer(num_units=64, activation='relu')
        self.layer2 = CustomLayer(num_units=10, activation='softmax')

    def call(self, inputs):
        x = self.layer1(inputs)
        return self.layer2(x)

Compile the custom model with an optimizer, loss function, and metrics.

model = CustomModel()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Train the model using your custom data and fit it to your training data:
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
          validation_data=(X_val, y_val))

Evaluate and use the custom model just like any other Keras model:

loss, accuracy = model.evaluate(X_test, y_test)


predictions = model.predict(X_new_data)

MODEL ZOO:
For commonly used network models, such as ResNet and VGG, you do not need to manually
create them. They can be instantiated directly from the keras.applications submodule with a
single line of code. At the same time, you can also load pre-trained models by setting the
weights parameter.

For an example, use the Keras model zoo to load the pre-trained ResNet50 network by
ImageNet. The code is as follows:
# Load ImageNet pre-trained network. Exclude the last layer.
resnet = keras.applications.ResNet50(weights='imagenet', include_top=False)
resnet.summary()
# test the output
x = tf.random.normal([4, 224, 224, 3])
out = resnet(x)  # get output
out.shape

For a specific task, we need to set a custom number of output nodes. Taking a 100-class
classification task as an example, we rebuild a new network based on ResNet50. Create a new
pooling layer (the pooling layer here can be understood as downsampling along the height and
width dimensions) and reduce the feature dimension from [b, 7, 7, 2048] to [b, 2048] as in the
following.

# New pooling layer


global_average_layer = layers.GlobalAveragePooling2D()
# Use last layer's output as this layer's input
x = tf.random.normal([4, 7, 7, 2048])
# Use pooling layer to reduce dimension from [4,7,7,2048] to [4,2048]
out = global_average_layer(x)
print(out.shape)
Out[6]: (4, 2048)

Finally, create a new fully connected layer and set the number of output nodes to 100. The
code is as follows:
In [7]:
# New fully connected layer
fc = layers.Dense(100)
# Use last layer's output as this layer's input
x = tf.random.normal([4, 2048])
out = fc(x)
print(out.shape)
Out[7]: (4, 100)

After creating a pre-trained ResNet50 feature sub-network, a new pooling layer, and a fully
connected layer, we re-use the Sequential container to encapsulate a new network:
# Build a new network using previous layers
mynet = Sequential([resnet, global_average_layer, fc])
mynet.summary()
You can see the structure information of the new network model is:

Layer (type)                  Output Shape               Param #
===============================================================
resnet50 (Model)              (None, None, None, 2048)   23587712
_______________________________________________________________
global_average_pooling2d (Gl  (None, 2048)               0
_______________________________________________________________
dense_4 (Dense)               (None, 100)                204900
===============================================================
Total params: 23,792,612
Trainable params: 23,739,492
Non-trainable params: 53,120

By setting resnet.trainable = False, you can choose to freeze the network parameters of the
ResNet part and only train the newly created network layers, so that the network model
training can be completed quickly and efficiently. Of course, you can also update all the
parameters of the network.
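The freezing step can be sketched as follows. Note that weights=None is used here only to avoid the ImageNet download; pass weights='imagenet' to load the pre-trained parameters as in the example above.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Sequential

# weights=None avoids the download; use weights='imagenet' for pre-training
resnet = keras.applications.ResNet50(weights=None, include_top=False)
resnet.trainable = False  # freeze the ResNet feature extractor

global_average_layer = layers.GlobalAveragePooling2D()
fc = layers.Dense(100)
mynet = Sequential([resnet, global_average_layer, fc])
mynet.build(input_shape=(None, 224, 224, 3))

# Only the new Dense layer's parameters remain trainable:
# 2048 * 100 + 100 = 204,900
n_trainable = sum(int(np.prod(v.shape)) for v in mynet.trainable_variables)
print(n_trainable)
```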

Metrics
In the training process of a network, metrics such as accuracy and recall are often
required. Keras provides some commonly used metrics in the tf.keras.metrics module.
There are four main steps in the use of Keras metrics: creating a new metrics container,
writing data, reading statistical data, and clearing the measuring container.
Create a Metrics Container:
The tf.keras.metrics module provides many commonly used metric classes, such as mean,
accuracy, and cosine similarity. In the following, we take the mean error as an example.
loss_meter = tf.keras.metrics.Mean()
Write Data
New data can be written through the update_state function, and the metric will record and
process the sampled data according to its own logic. For example, the loss value is collected
once at the end of each step:
# Record the sampled data, and convert the tensor to an ordinary value through the float()
#function
loss_meter.update_state(float(loss))
After the preceding sampling code is placed at the end of each batch operation, the meter will automatically calculate the average value based on the sampled data.
Read Statistical Data:
After sampling multiple times of data, you can choose to call the measurer’s result()
function to obtain statistical values. For example, the interval statistical loss average is as
follows:
# Print the average loss during the statistical period
print(step, 'loss:', loss_meter.result())
Clear the Container:
Since the metric container will record all historical data, it is necessary to clear the historical
status when starting a new round of statistics. It can be realized by reset_states() function. For
example, after reading the average error every time, clear the statistical information to start
the next round of statistics as follows:
if step % 100 == 0:
# Print the average loss
print(step, 'loss:', loss_meter.result())
    loss_meter.reset_states()  # reset the state
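The four steps can be sketched together in one loop. The loss below is a random stand-in for a real batch loss; newer Keras versions name the clearing method reset_state() (the text uses the older spelling reset_states()).

```python
import tensorflow as tf

loss_meter = tf.keras.metrics.Mean()              # 1. create the container

for step in range(200):
    loss = tf.random.uniform([])                  # stand-in for a real batch loss
    loss_meter.update_state(float(loss))          # 2. write one sample per step

    if step % 100 == 0:
        print(step, 'loss:', float(loss_meter.result()))  # 3. read statistics
        loss_meter.reset_state()                  # 4. clear for the next round
```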
Hands-On Accuracy Metric:
According to the method of using the metric tool, we use the accuracy metric to count the
accuracy rate during the training process. First, create a new accuracy measuring container as
follows:
acc_meter = tf.keras.metrics.Accuracy()
After each forward calculation is completed, record the training accuracy rate. It should be
noted that the parameters of the update_state function of the accuracy class are the predicted
value and the true value, not the accuracy rate of the current batch. We write the label and
prediction result of the current batch sample into the metric as follows:
# [b, 784] => [b, 10], network output
out = network(x)
# [b, 10] => [b], feed into argmax()
pred = tf.argmax(out, axis=1)
pred = tf.cast(pred, dtype=tf.int32)
# record the accuracy
acc_meter.update_state(y, pred)
After counting the predicted values of all batches in the test set, print the average accuracy of
the statistics and clear the metric container. The code is as follows:
print(step, 'Evaluate Acc:', acc_meter.result().numpy())
acc_meter.reset_states()  # reset metric
VISUALISATION IN KERAS : MODEL SIDE AND BROWSER SIDE:

In the process of network training, it is very important to improve the development efficiency
and monitor the training progress of the network through the web terminal and visualize the
training results. TensorFlow provides a special visualization tool called TensorBoard, which
writes monitoring data to the file system through TensorFlow and uses the web backend to
monitor the corresponding file directory, thus allowing users to view network monitoring
data.
Visualizing the performance and architecture of a Keras model can be done on both the
model side (inside your Python script) and the browser side (using tools like TensorBoard).
Here's how you can perform visualization in both contexts:

Model Side Visualization:


On the model side, you need to create a summary writer that records monitoring data when
needed. First, create an instance of the monitoring class through
tf.summary.create_file_writer,
and specify the directory where the monitoring data is written. The code is as follows:
# Create a monitoring class; the monitoring data will be written to the log_dir directory

summary_writer = tf.summary.create_file_writer(log_dir)

We take monitoring errors and visualizing image data as examples to introduce how to write
monitoring data. After the forward calculation is completed, for scalar data such as the error,
we record the monitoring data through the tf.summary.scalar function and specify the
timestamp step parameter. The step parameter here is similar to the time-scale information
corresponding to each data point and can be understood as the x-coordinate of the data curve,
so it should not be repeated. Each type of data is distinguished by a string name, and similar
data needs to be written to the database under the same name. For example:
with summary_writer.as_default():
    # write the current loss to the train-loss database
    tf.summary.scalar('train-loss', float(loss), step=step)

TensorBoard distinguishes different types of monitoring data by string ID. For the error data,
we name it “train-loss”; other types of data must not be written under the same name, to
prevent data pollution.
For picture-type data, you can write monitoring picture data through the tf.summary.image
function. For example, during training, the sample images can be visualized by the
tf.summary.image function. Since tensors in TensorFlow generally contain multiple
samples, the tf.summary.image function accepts tensor data of multiple pictures and uses the
max_outputs parameter to set the maximum number of displayed pictures.
The code is as follows:
with summary_writer.as_default():
    # log accuracy
    tf.summary.scalar('test-acc', float(total_correct / total), step=step)
    # log images
    tf.summary.image("val-onebyone-images:", val_images,
                     max_outputs=9, step=step)
Run the model program, and the corresponding data will be written to the specified file
directory in real time.
Plot Training History:

You can plot the training history of your Keras model directly within your Python script
using libraries like Matplotlib. This allows you to visualize metrics like loss and accuracy
during training.
import matplotlib.pyplot as plt

history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_val, y_val))

# Plot training history


plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.xlabel('Epochs')
plt.ylabel('Loss')

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.xlabel('Epochs')
plt.ylabel('Accuracy')

plt.show()

Model Summary:

You can print a summary of your model's architecture to the console to understand its
structure and the number of trainable parameters.
model.summary()
Browser Side Visualization (TensorBoard):
TensorBoard is a powerful tool provided by TensorFlow that allows you to visualize various
aspects of your model, such as training metrics, model architecture, and even custom metrics.

When running the program, the monitoring data is written to the specified file directory. If
you want to remotely view and visualize these data in real time, you also need to use a
browser and a web backend. The first step is to open the web backend. Run command
“tensorboard --logdir path” in terminal and specify the file directory path monitored by the
web backend, then you can open the web backend monitoring process,

Open web server


Open a browser and enter the URL http://localhost:6006 (you can also access it remotely
through the IP address; the specific port number may change depending on the command
prompt) to monitor the progress of the network training. TensorBoard can display multiple
monitoring records at the same time. On the left side of the monitoring page, you can select
monitoring records, as shown in Figure

On the upper end of the monitoring page, you can choose different types of data monitoring
pages, such as scalar monitoring page SCALARS and picture visualization page
IMAGES. For this example, we need to monitor the training error and test accuracy rate for
scalar data, and its curve can be viewed on the SCALARS page, as shown in Figures
In addition to monitoring scalar data and image data, TensorBoard also supports functions
such as viewing the histogram distribution of tensor data through tf.summary.histogram, and
printing text information through tf.summary.text. For example:

with summary_writer.as_default():
    tf.summary.scalar('train-loss', float(loss), step=step)
    tf.summary.histogram('y-hist', y, step=step)
    tf.summary.text('loss-text', str(float(loss)))

You can view the histogram of the tensor on the HISTOGRAMS page, as shown in Figure A, and
you can view the text information on the TEXT page, as shown in Figure B

FIGURE A

Figure B
Here's how to use TensorBoard for visualization:

Install TensorFlow (if not already installed):

Make sure you have TensorFlow installed. You can install it using pip:

pip install tensorflow

Logging Metrics to TensorBoard: During model training, you can log metrics to
TensorBoard using a Keras callback.
Here's an example:
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir='./logs', histogram_freq=1)
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_val, y_val),
                    callbacks=[tensorboard_callback])

Start TensorBoard:

In your terminal, navigate to the directory where you're running your Python script and use
the following command to start TensorBoard:
tensorboard --logdir=./logs
This command will start TensorBoard and provide a URL you can open in your browser to
access the visualization.

Access TensorBoard in Your Browser:

Open a web browser and go to the URL provided by TensorBoard (typically
http://localhost:6006). Here, you can view various visualizations, including training metrics,
model architecture, and more.

By combining model-side visualization in your Python script with browser-side visualization


using TensorBoard, you can gain a comprehensive understanding of your Keras model's
performance and architecture during training and evaluation.
OVERFITTING

INTRODUCTION:

GENERALIZATION ABILITY:

• The ability of machine learning to learn the real model of the data from the training set, so
that it can perform well on the unseen test set.

CAPACITY OF THE MODEL:

• The expressive power of the model.

• When the model’s expressive power is weak, such as a single linear layer, it can only learn a
linear model and not perform well on nonlinear model.

• When the model’s expressive power is too strong, it may be possible to reduce the noise
modalities of the training set, but leads to poor performance on the test set (generalization
ability is weak).

• Thus, the model’s ability to fit complex functions is called Model capacity.

• Its major indicator is the size of its hypothesis space.

• Consider the following examples , to understand the concept of model capacity in a better
way:

EXAMPLE:

• D = {(x, y) | y = sin(x), x ∈ [−5, 5]}

• A small number of points are sampled from the real distribution to form the training set,
which contains the observation error ϵ, as shown by the small dots in Figure.

• Initially, if we only search the model space of all first-degree polynomials and set the bias to
0, that is, y = ax, we obtain the straight line of the first-degree polynomial shown in the figure.

• After increasing the hypothesis space further, as shown by the polynomial curves of degree
7, 9, 11, 13, 15, and 17 in the figure, the larger the hypothesis space of the function, the more
likely it is to find a function model that better approximates the real distribution.
CONS OF USING EXCESSIVELY LARGE HYPOTHESIS SPACE:

• will undoubtedly increase the search difficulty

• Increase in computational cost.

• Doesn’t guarantee a better model

• Presence of Observation errors in training hurts the generalization ability of the model.

OVERFITTING AND UNDERFITTING:

• Because the distribution of real data is often unknown and complicated, it is impossible to
deduce the type of distribution function and related parameters.

• Therefore, when choosing the capacity of the learning model, people often choose a slightly
larger model capacity based on empirical values.

• When the capacity of the model is too large, it may appear to perform better on the training
set, but perform worse on the test set.

• When the capacity of the model is too small, it may have poor performance in both the
training set and the testing set as shown in the area to the left of the red vertical line in
Figure.

REASON FOR OVER FITTING:

• When the capacity of the model is too large, in addition to learning the modalities of the
training set data, the network model also learns additional observation errors, resulting in
the learned model performing better on the training set, but poor in unseen samples.

• Thus, the generalization ability of the model is weak.

• We call this phenomenon overfitting.

REASON FOR UNDERFITTING:

• When the capacity of the model is too small, the model cannot learn the modalities of the
training set data well, resulting in poor performance on both the training set and the unseen
samples.
• We call this phenomenon underfitting.

EXAMPLE:

• Consider a degree 2 polynomial data distribution.

• If we use a simple linear function to learn, we will find it difficult to learn a better function,
resulting in the underfitting phenomenon that the training set and the test set do not
perform well, as shown in Figure.

• If you use a more complex function model to learn, it is possible that the learned function
will excessively “fit” the training set samples, but resulting in poor performance on the test
set, that is, overfitting, as shown in Figure

• Only when the capacity of the learned model roughly matches that of the real model can
the model have good generalization ability, as shown in Figure
SOLUTION TO UNDERFITTING:

• The problem of under fitting can be solved by increasing the number of layers of the neural
network.

• It can also be solved by increasing the size of the intermediate dimension.

• However, because modern deep neural network models can easily reach deeper layers, the
capacity of the model used for learning is generally sufficient.

• In real applications, more overfitting phenomena occur.

SOLUTION TO OVERFITTING:

1. Data set Division

2. Model Design

3. Regularization

4. Drop Out

5. Data Augmentation

DATASET DIVISION:

 Earlier we used to divide the data set into a training set and a test set.
 In order to select model hyperparameters and detect overfitting, it is generally necessary to
split the original training set into three subsets:
the training set, the validation set, and the test set.
• We know that training set Dtrain is used to train model parameters,
• The test set Dtest is used to test the generalization ability of the model.
• Example, Training set = 80% of MNIST dataset and Test set = 20% of MNIST data set.

• The performance of the test set cannot be used as feedback for model training.
• However, we need to pick suitable model hyperparameters during training and determine
whether the model is overfitting.
• Therefore, we divide the training set further into a training set and a validation set.
• The divided training set has the same function as the original training set and is used to train
the parameters of the model, while the validation set is used to select the
hyperparameters of the model.
FUNCTIONS OF VALIDATION DATASET:
• Adjust the learning rate, weight decay coefficient, training times, etc. according to the
performance of the validation set.
• Readjust the network topology according to the performance of the validation set.
• According to the performance of the validation set, determine whether it is overfitting or
underfitting.
• The training set, validation set, and test set can be divided according to a custom ratio, such
as the common 60%-20%-20% division.
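The 60%-20%-20% division can be sketched with scikit-learn's train_test_split by splitting twice; the data shapes below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples, 10 features
features = np.random.rand(1000, 10)
labels = np.random.randint(0, 2, size=1000)

# First carve out 20% as the test set ...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)
# ... then split the remaining 80% into 60%/20% train/validation
# (0.25 of 80% is 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```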
DIFFERENCE BETWEEN TEST & VALIDATION SETS:
• The algorithm designer can adjust the settings of various hyperparameters of the model
according to the performance of the validation set to improve the generalization ability of
the model.
• But the performance of the test set cannot be used to adjust the model.
EARLY STOPPING:
EPOCH:
• Updating the model parameters with one batch of training samples is called a Step, and
iterating through all the samples in the training set once is called an Epoch.
• It is generally recommended to perform a validation operation only after every few Epochs,
since validating more often introduces additional computation cost.
• If the training error of the model is low and the training accuracy is high, but the validation
error is high and the validation accuracy rate is low, overfitting may occur.
• If the errors on both the training set and the validation set are high and the accuracy is low,
then underfitting may occur.
EXAMPLE: A TYPICAL CLASSIFICATION
NOTE 1: In the later stage of training, even with the same network structure, due to the
change in the actual capacity of the model, we observe the phenomenon of overfitting.
NOTE2:
• This means that for neural networks, even if the network hyperparameters amount remains
unchanged (i.e., the maximum capacity of the network is fixed), the model may still appear
to be overfitting.
• It is because the effective capacity of the neural network is closely related to the state of the
network parameters
• As the number of training Epochs increases, the overfitting becomes more and more serious.
• We can observe the early-stopping Epoch at the vertical dotted line, where the network is in
its best state: there is no obvious overfitting, and the generalization ability of the
network is the best.
When the validation accuracy has not improved for several successive Epochs, we can
predict that the most suitable Epoch may have been reached, so we can stop training.
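In Keras, this early-stopping policy can be sketched with the built-in EarlyStopping callback; the patience value and the model/data names below are illustrative.

```python
import tensorflow as tf

# Stop training when validation accuracy has not improved for 5
# consecutive Epochs, and roll back to the best weights seen so far
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',
    patience=5,
    restore_best_weights=True)

# Hypothetical model and data:
# model.fit(X_train, y_train, epochs=100,
#           validation_data=(X_val, y_val), callbacks=[early_stop])
```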

REGULARIZATION:

• By designing network models with different layers and sizes, the initial function hypothesis
space can be provided for the optimization algorithm, but the actual capacity of the model
can change as the network parameters are optimized and updated.

 The capacity of the preceding model can be simply measured through n.


 By limiting the sparsity of network parameters, the actual capacity of the network can be
constrained.
 This constraint is generally achieved by adding additional parameter sparsity penalties to the
loss function.
 Optimization goal before adding the constraint:

min_θ L(x, y)

 Optimization goal after adding the constraint:

min_θ L(x, y) + λ·Ω(θ),  λ ≥ 0

 where Ω(θ) represents the sparsity constraint function on the network parameters θ.
 The sparsity constraint on the parameters θ is usually achieved by constraining the l-norm of the parameter tensors, that is:

Ω(θ) = Σᵢ ‖θᵢ‖ₗ

where ‖θᵢ‖ₗ is the l-norm of the i-th parameter tensor θᵢ, and l is typically 0, 1, or 2.
• The goal of the optimization algorithm is then to minimize the original loss function L(x, y) while also reducing the network sparsity penalty Ω(θ).
• Here λ is the weight parameter that balances the importance of L(x, y) and Ω(θ).
• A larger λ means that the sparsity of the network is more important; a smaller λ means that the training error of the network is more important.
• By selecting an appropriate λ, you can get good training performance while ensuring the sparsity of the network, which leads to good generalization ability.
• Commonly used regularization methods are L0, L1, and L2 regularization.

L0 regularization:

• L0 regularization refers to the regularization calculation method using the L0 norm as the
sparsity penalty term Ω(θ).

• The L0 norm ‖θi‖0 is defined as the number of non-zero elements in θi.

• This constraint can force the connection weights in the network to be mostly 0, thereby
reducing the actual amount of network parameters and network capacity.

• DISADVANTAGE: Because the L0 norm is not differentiable, the gradient descent algorithm cannot be used for optimization; therefore, L0 regularization is not often used in neural networks.

L1 Regularization

• The L1 regularization refers to the regularization calculation method using the L1 norm as
the sparsity penalty term Ω(θ).

• The L1 norm ‖θi‖1 is defined as the sum of the absolute values of all elements in the tensor
θi.

• L1 regularization is also called Lasso regularization. Unlike the L0 norm, it is differentiable (except at zero) and is widely used in neural networks.
• IMPLEMENTATION:

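A minimal NumPy sketch of the L1 penalty computation follows; the λ value and weight tensors are illustrative. In Keras, the same effect is obtained per layer with `kernel_regularizer=keras.regularizers.l1(λ)`.

```python
import numpy as np

def l1_penalty(params, lam=0.001):
    """Sparsity penalty lam * Σ_i ||θ_i||_1: the sum of absolute
    values over all parameter tensors, scaled by lam."""
    return lam * sum(np.abs(theta).sum() for theta in params)

# Hypothetical weight tensors of two layers
w1 = np.array([[1.0, -2.0], [0.0, 3.0]])
w2 = np.array([0.5, -0.5])

data_loss = 0.25                          # stand-in for L(x, y)
total_loss = data_loss + l1_penalty([w1, w2], lam=0.1)
print(total_loss)                         # 0.25 + 0.1 * (6 + 1) = 0.95
```

Minimizing `total_loss` instead of `data_loss` pushes many weights toward exactly zero, which is the sparsity effect described above.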
L2 regularization:

• The L2 regularization refers to the regularization calculation method using the L2 norm as
the sparsity penalty term Ω(θ).

• The L2 norm ‖θi‖2 is defined as the square root of the sum of squares of all elements in the tensor θi; in practice, the squared L2 norm (the plain sum of squares) is usually used as the penalty term.

• L2 regularization is also called Ridge regularization; it is continuously differentiable and widely used in neural networks.

• IMPLEMENTATION:

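The same sketch for the L2 penalty, again with an illustrative λ and weight tensor; in Keras this corresponds to `kernel_regularizer=keras.regularizers.l2(λ)`.

```python
import numpy as np

def l2_penalty(params, lam=0.001):
    """Penalty lam * Σ_i sum(θ_i ** 2): the squared L2 norm of
    every parameter tensor, scaled by lam."""
    return lam * sum((theta ** 2).sum() for theta in params)

w = np.array([[1.0, -2.0], [0.0, 3.0]])
total_loss = 0.25 + l2_penalty([w], lam=0.1)
print(total_loss)                         # 0.25 + 0.1 * (1 + 4 + 0 + 9) = 1.65
```

Unlike the L1 penalty, the L2 penalty shrinks all weights smoothly toward zero rather than driving them to exactly zero.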
What does Regularization achieve?

• A standard least squares model tends to have high variance: such a model won't generalize well to a data set different from its training data.

• Regularization, significantly reduces the variance of the model, without substantial


increase in its bias.

• So the tuning parameter λ, used in the regularization, controls the impact on bias and
variance.

• As the value of λ rises, it shrinks the values of the coefficients, thus reducing the variance.
• Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data.
• But beyond a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting.

• Therefore, the value of λ should be carefully selected.

DROPOUT:

 Dropout works by essentially "dropping" a neuron from the input or hidden layers. Multiple neurons are removed from the network, meaning they practically do not exist: their incoming and outgoing connections are also removed.
 This artificially creates a multitude of smaller, less complex networks. This forces the model to
not become solely dependent on one neuron, meaning it has to diversify its approach and
develop a multitude of methods to achieve the same result.
 Dropout is applied to a neural network by randomly dropping neurons in every layer (including the input layer). A pre-defined dropout rate determines the chance of each neuron being dropped; for example, a dropout rate of 0.25 means that each neuron has a 25% chance of being dropped. Dropout is applied in every epoch during the training of the model, and is disabled at inference time.
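The mechanism can be sketched as "inverted dropout" in NumPy (the function name is illustrative; Keras provides this as `keras.layers.Dropout(0.25)`). Survivors are rescaled by 1/(1−rate) during training so that the expected activation is unchanged and nothing needs rescaling at inference time.

```python
import numpy as np

def dropout(x, rate=0.25, rng=None, training=True):
    """Inverted dropout: zero each element with probability `rate` and
    scale survivors by 1/(1-rate) so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return x                           # inference: pass through unchanged
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= rate     # True where the neuron survives
    return x * mask / (1.0 - rate)

x = np.ones(1000)
y = dropout(x, rate=0.25, rng=np.random.default_rng(0))
print((y == 0).mean())                     # roughly 0.25 of activations dropped
```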

DATA AUGMENTATION:
One of the best techniques for reducing overfitting is to increase the size of the training dataset. As
discussed in the previous technique, when the size of the training data is small, then the network
tends to have greater control over the training data.
So, to increase the size of the training data i.e, increasing the number of images present in the
dataset, we can use data augmentation, which is the easiest way to diversify our data and make the
training data larger.
Some of the popular image augmentation techniques are flipping, translation, rotation, scaling, cropping, changing brightness, and adding noise.
Here are some commonly used data augmentation techniques for various types of data:

Image Data Augmentation:


a. Rotation: Rotate images by various degrees to simulate different angles.
b. Flip: Horizontally and/or vertically flip images.
c. Zoom: Randomly zoom in or out of images.
d. Crop: Randomly crop and resize images to different dimensions.
e. Translation: Shift images horizontally and/or vertically.
f. Brightness and Contrast Adjustments: Randomly adjust brightness and contrast.
g. Noise: Add random noise to the images.
h. Color Jitter: Randomly change hue, saturation, and brightness.
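Several of the image techniques above can be sketched with plain NumPy array operations; the function names are illustrative, and Keras also offers ready-made preprocessing layers such as `RandomFlip`, `RandomRotation`, and `RandomZoom`.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror an (H, W, C) image left-right."""
    return img[:, ::-1, :]

def translate(img, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels, padding with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def add_noise(img, sigma=0.05, rng=None):
    """Add Gaussian noise and clip the result back to [0, 1]."""
    rng = rng or np.random.default_rng()
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)
```

Applying these transforms with random parameters each epoch yields effectively new training samples from the same labeled images.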

Text Data Augmentation:


a. Synonym Replacement: Replace words with their synonyms.
b. Random Deletion: Randomly delete words from the text.
c. Random Swap: Randomly swap the positions of two words in a sentence.
d. Back-Translation: Translate text to another language and then back to the original language.
e. Insertion: Insert random words into the text.

Time Series Data Augmentation:


a. Time Warping: Slightly warp the time series by stretching or compressing it.
b. Jittering: Add small random noise to the data points.
c. Rolling Window: Apply rolling window transformations to create new data points.

Audio Data Augmentation:


a. Pitch Shifting: Change the pitch of the audio.
b. Time Stretching: Stretch or compress the audio in time.
c. Noise Injection: Add background noise to the audio.
d. Speed Perturbation: Vary the playback speed of the audio.

Tabular Data Augmentation:


a. Feature Scaling: Scale features within a certain range.
b. Feature Selection: Randomly select a subset of features.
c. Data Perturbation: Add random noise to the data.
Synthetic Data Generation:
a. Generative Adversarial Networks (GANs): Generate synthetic data samples that mimic the real
data distribution.
b. Variational Autoencoders (VAEs): Create new data points from the latent space of the VAE.
When applying data augmentation, it's essential to strike a balance. Too much augmentation can
lead to model training on noisy or unrealistic data, while too little may not effectively combat
overfitting. Experiment with different augmentation techniques and parameters to find the right
balance for your specific problem and dataset. Cross-validation can help in evaluating the impact of
data augmentation on your model's performance.
HANDS-ON: OVERFITTING

You are given a synthetic binary classification task with 20 input features, where 15
features are informative and 5 are redundant. This high-dimensional dataset is prone to
overfitting when trained using a neural network.

Your goal is to:

1. Build a fully connected feedforward neural network to perform classification.

2. Observe and understand the effects of overfitting.

3. Apply multiple regularization techniques to mitigate overfitting and improve


generalization.

🔍 Objectives

 Create a classification model using a synthetic tabular dataset.

 Split data into training, validation, and test sets.


 Introduce noise-based data augmentation to expand the training set.

 Apply the following regularization techniques:

o Validation set: to monitor generalization.

o Early stopping: to halt training when validation loss increases.

o Dropout: to reduce co-adaptation of neurons.


o L1/L2 regularization: to penalize model complexity.

 Visualize training vs. validation loss and accuracy to observe overfitting


behavior.
 Evaluate the effectiveness of each technique by comparing training and
validation curves.

🧪 Dataset

 Type: Synthetic (generated using sklearn.datasets.make_classification)

 Size: 5000 samples


 Features: 20 (15 informative, 5 redundant)

 Classes: 2 (binary classification)

 Distribution: Stratified split into 70% training, 15% validation, 15% test

✅ Deliverables

 A Python implementation of the full pipeline using TensorFlow/Keras.

 Printed metrics after each epoch for transparency in training.

 Graphs of loss and accuracy showing signs of overfitting and the effect of
regularization.
 Code should clearly demonstrate the difference between:
o Overfitting scenario (without regularization – optionally)

o Properly regularized model (with early stopping, dropout, L1/L2, data


augmentation)
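A compact sketch of the pipeline described above, combining the requested techniques. Layer sizes, the dropout rate, the L2 weight λ, the noise scale, and the patience value are illustrative hyperparameters, not prescribed values.

```python
import numpy as np
from tensorflow import keras
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset: 5000 samples, 20 features (15 informative, 5 redundant)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)              # 70% train
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)  # 15% / 15%

# Noise-based augmentation: append jittered copies of the training samples
rng = np.random.default_rng(42)
X_aug = np.concatenate([X_train, X_train + rng.normal(0, 0.05, X_train.shape)])
y_aug = np.concatenate([y_train, y_train])

# Model with L2 weight penalties and dropout between the hidden layers
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(1e-4)),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(64, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(1e-4)),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Early stopping halts training once validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                           restore_best_weights=True)
history = model.fit(X_aug, y_aug, validation_data=(X_val, y_val), epochs=30,
                    batch_size=64, callbacks=[early_stop], verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print('test accuracy:', round(test_acc, 3))
```

The `history.history` dictionary holds the per-epoch training and validation loss/accuracy, which can be plotted to produce the overfitting-vs-regularization curves asked for in the deliverables.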
