Machine Learning Unit 3 Part 1
Convolutional Neural Network: Introduction
• Like regular neural networks, CNNs are made up of neurons that have learnable weights and biases.
• The architecture of a CNN is a list of layers that transforms a 3-dimensional input volume (the height, width, and depth of the image) into a 3-dimensional output volume.
• It uses M filters, which are essentially feature extractors that extract features like edges, corners, and so on.
• A CNN can take the 2D structure of images into account, process them, and extract the properties that are specific to images.
• CNNs have the advantage of one or more convolutional layers and pooling layers, which are the main building blocks of CNNs.
• These layers are followed by one or more fully connected layers, as in standard multilayer NNs.
Convolutional Neural Network: Architecture
• Each input image passes through a series of convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers; a softmax function is then applied to classify the object with probabilistic values between 0 and 1. This is the complete flow of a CNN: it processes an input image and classifies the object based on these values.
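A rough sketch of this flow in code (a toy network; the layer sizes are illustrative assumptions, not values from these slides):

```python
import torch
import torch.nn as nn

# Toy CNN pipeline: convolution -> ReLU -> pooling -> flatten -> FC -> softmax.
# All layer sizes here are illustrative assumptions.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 64x64x3 -> 64x64x16
        self.pool = nn.MaxPool2d(2)                             # 64x64x16 -> 32x32x16
        self.fc = nn.Linear(16 * 32 * 32, num_classes)          # class scores

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = torch.flatten(x, start_dim=1)         # one long feature vector
        return torch.softmax(self.fc(x), dim=1)   # probabilities in [0, 1]

probs = TinyCNN()(torch.randn(1, 3, 64, 64))  # one 64x64 RGB image
print(probs.sum())  # probabilities over the classes sum to 1
```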
Following are the layers that are used to construct CNNs:
• INPUT: As the name implies, this layer holds the raw pixel values, i.e. the data of the image as it is. Example: INPUT [64×64×3] is a 3-channel RGB image of width 64, height 64, and depth 3.
• CONV: computes the convolution between local patches of the input and a set of filters (kernels), producing feature maps (covered in detail in the Convolution Layer slides below).
• RELU: applies the elementwise activation function f(x) = max(0, x) to introduce non-linearity (covered in detail in the ReLU slides below).
• POOL: the pooling layer is another building block of CNNs. The main task of this layer is down-sampling, which means it operates independently on every slice of the input.
Breaking it Down:
• Down-Sampling: reducing the size of the input feature maps. It helps in reducing computation and extracting dominant features while ignoring unnecessary details.
• Operates Independently on Every Slice: each slice refers to an individual feature map in the input; the pooling operation is applied separately to each feature map.
• FC: the Fully Connected layer, also called the output layer. It is used to compute the output class scores, and the resulting output is a volume of size 1×1×L, where L is the number of classes.
Breaking it Down:
• Fully Connected (FC) Layer: every neuron in this layer is connected to every neuron in the previous layer. It converts the extracted features into class scores for classification.
• Output Class Score: the FC layer processes the learned features and assigns a score to each possible class. A Softmax activation function is often used to convert these scores into probabilities.
• Output Size (1×1×L): the output of the FC layer is a vector of size L, where L is the number of classes. The shape 1×1×L means that there is only one value per class (a single probability or score for each class).
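A minimal sketch of that final softmax step; the class scores below are made-up values for illustration:

```python
import numpy as np

# Hypothetical FC output: one score per class (L = 3 classes).
scores = np.array([2.0, 1.0, 0.1])

# Softmax converts scores into probabilities between 0 and 1.
exp = np.exp(scores - scores.max())  # subtract max for numerical stability
probs = exp / exp.sum()
print(probs)        # [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```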
CNN Architecture: Convolution Layer
• Convolution is the first layer used to extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data.
• It is a mathematical operation that takes two inputs: an image matrix and a filter (or kernel).
• Convolution Operation: a mathematical operation that applies a small filter (also called a kernel) over an image. The filter slides over the image, multiplying its values with the pixel values and summing them up to create a new representation.
• Preserving Pixel Relationships: unlike fully connected layers, which treat all input pixels as independent, convolutions maintain spatial relationships. Nearby pixels are processed together, preserving structures like edges, textures, and shapes.
• Learning Image Features: the small squares (filters) learn different features such as edges, corners, textures, and patterns. Deeper layers in the CNN combine these basic features to recognize complex objects.
• Example: imagine a 3×3 filter applied to an image. The filter scans through the image one small section at a time, identifying patterns. Early layers may detect edges, while deeper layers detect faces, objects, or more abstract patterns.
• Consider a 5×5 image whose pixel values are 0 or 1, and a 3×3 filter matrix. Convolving the 5×5 image matrix with the 3×3 filter matrix produces an output called the "Feature Map".
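The slide's exact matrices are not reproduced here, so the sketch below uses an assumed 5×5 binary image and 3×3 filter to show the multiply-and-sum mechanics:

```python
import numpy as np

# Assumed 5x5 binary image and 3x3 filter; values are illustrative.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# Valid convolution: slide the 3x3 filter over the 5x5 image,
# multiply elementwise and sum -> a 3x3 feature map (5 - 3 + 1 = 3).
h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
print(feature_map)  # the 3x3 "Feature Map"
```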
CNN: Pooling
Pooling is the process of merging: its basic purpose is to reduce the size of the data, for example with 2×2 filters.
But isn't this losing valuable data? Why are we reducing the size? At first glimpse it may look like losing information, but it is rather getting more "meaningful" data than losing it.
By removing some noise in the data and extracting only the significant values, we can reduce overfitting and speed up the computation.
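A minimal max-pooling sketch with an assumed 4×4 feature map and a 2×2 filter:

```python
import numpy as np

# Assumed 4x4 feature map; values are illustrative.
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 2, 8]])

# 2x2 max pooling with stride 2: keep the largest value in each
# 2x2 block, halving height and width (4x4 -> 2x2).
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 4]
               #  [7 9]]
```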
CNN: Padding
The pixels of an image aren't all processed the same number of times: pixels at the corners are covered by the filter fewer times than those in the middle, so they don't get the same amount of weight. Also, if we just keep applying convolutions, we might shrink the data too fast. Padding is the trick we can use to fix these problems.
As its name suggests, padding means adding extra pixels at the boundary of the data. We have two options:
• Pad the picture with zeros (zero-padding) so that it fits (sketched below).
• Drop the part of the image where the filter does not fit. This is called valid padding, which keeps only the valid part of the image.
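A minimal sketch of the zero-padding option, using an arbitrary 4×4 input:

```python
import numpy as np

# Zero-pad a 4x4 input by one pixel on every side -> 6x6, so that a
# 3x3 filter applied afterwards keeps the output size at 4x4.
x = np.arange(16).reshape(4, 4)
padded = np.pad(x, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)  # (6, 6): 6 - 3 + 1 = 4, same size as the input
```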
Example: the input image has 4×4 pixels and the filter is 3×3. There is no padding, which is called "valid." The result is 2×2 pixels (4 - 3 + 1 = 2); the output data is downsized.
By the way, does a filter always have to move one pixel at a time? Of course not. We can also make it move two or three steps at a time, both horizontally and vertically. This is called "stride."
CNN: Stride
• Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2, we move the filter 2 pixels at a time, and so on. With a stride of 2, the output shrinks roughly by half in each dimension.
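Both the padding example and stride follow one output-size rule; a small helper makes it explicit:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for an n x n input, f x f filter,
    padding p, and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(4, 3))        # 2  (valid padding, stride 1)
print(conv_output_size(5, 3, s=2))   # 2  (stride 2)
print(conv_output_size(28, 1))       # 28 (a 1x1 convolution keeps the size)
```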
CNN: Non Linearity (ReLU)
• ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is f(x) = max(0, x).
• Why ReLU is important: ReLU's purpose is to introduce non-linearity into our ConvNet. Convolution itself is a linear operation, and the real-world data we want our ConvNet to learn is non-linear.
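A minimal sketch of ReLU applied elementwise to made-up values:

```python
import numpy as np

# ReLU: f(x) = max(0, x), applied elementwise.
x = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])
print(np.maximum(0, x))  # [0. 0. 0. 2. 5.]
```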
CNN: Flattening
Flattening converts the data into a 1-dimensional array for input to the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is connected to the final classification model, called a fully-connected layer. In other words, we put all the pixel data in one line and make connections with the final layer.
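A minimal flattening sketch with an assumed 2×2×3 feature volume:

```python
import numpy as np

# Flatten a small 2x2x3 feature volume into a single 12-element
# vector, ready to be fed into a fully connected layer.
volume = np.arange(12).reshape(2, 2, 3)
flat = volume.flatten()
print(flat.shape)  # (12,)
```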
By adding multiple convolutional and pooling layers, the image is processed for feature extraction. As the layers go deeper, the features the model deals with become more complex.
CNN: Fully Connected Layer
• In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, like a regular neural network.
CNN: Loss Layer
The "loss layer", or "loss function", specifies how training penalizes the deviation between the predicted output
of the network, and the true data labels (during supervised learning). Various loss functions can be used,
depending on the specific task.
•Softmax loss function is used for predicting a single class of K mutually exclusive classes.
• Sigmoid cross-entropy loss is used for predicting K independent probability values.
• Euclidean loss is used for regressing to real-valued labels .
The function used to evaluate a candidate solution (i.e. a set of weights) is referred to as the objective function.
We may seek to maximize or minimize the objective function. When we are minimizing it, we may also call it the cost function,
loss function, or error function.
Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost
function or a loss function and the value calculated by the loss function is referred to as simply “loss.”
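A minimal sketch of the first case, softmax (cross-entropy) loss, with made-up probabilities:

```python
import numpy as np

# Cross-entropy loss for one example: -log(probability assigned
# to the true class). Values here are illustrative.
probs = np.array([0.7, 0.2, 0.1])  # softmax output over 3 classes
true_class = 0
loss = -np.log(probs[true_class])
print(loss)  # ~0.357: small because the network is fairly confident
```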
•Following the fully-connected layer is the loss layer, which manages the adjustments of weights across the
network.
•Before the training of the network begins, the weights in the convolution and fully-connected layers are given
random values.
•Then during training, the loss layer continually checks the fully-connected layer's guesses against the actual
values with the goal of minimizing the difference between the guess and the real value as much as possible.
•The loss layer does this by adjusting the weights in both the convolution and fully-connected layers.
CNN: Summary
• Provide the input image to the convolution layer.
• Choose parameters and apply filters with strides, and padding if required. Perform convolution on the image and apply ReLU activation to the resulting matrix.
• Perform pooling to reduce the dimensionality.
• Add as many convolutional layers as needed until satisfied.
• Flatten the output and feed it into a fully connected (FC) layer.
• Output the class using an activation function (e.g., softmax, as in logistic regression with a cost function) to classify the image.
CNN: 1 x 1 convolution
• Convolution is an element-wise multiplication and summation of the input and kernel/filter elements.
• The key points to remember are:
1. The input matrix can, and in most cases will, have more than one channel; this is sometimes referred to as depth. Example: a 64×64-pixel RGB input image has 3 channels, so the input is 64×64×3.
2. The filter has the same depth as the input, except in some special cases. Example: a 3×3 filter will have 3 channels as well, hence the filter should be represented as 3×3×3.
3. Third and critical point: the output of the convolution step will have a depth equal to the number of filters we choose. Example: the output of convolving the 3D input (64×64×3) with the single filter we chose (3×3×3) will have a depth of 1 (because we have only one filter).
• The convolution step on the 3D input 64×64×3 with a filter of size 3×3×3 has the filter "sliding" along the width and height of the input.
CNN: What is a 1 x 1 Convolution?
A 1×1 conv (cross-channel pooling) is used to reduce the number of channels while introducing non-linearity.
1×1 convolution simply means the filter is of size 1×1 (yes, a single number, as opposed to a matrix like a 3×3 filter). This 1×1 filter convolves over the ENTIRE input image, pixel by pixel.
For an input of 64×64×3, if we choose a 1×1 filter (which would be 1×1×3), then the output will have the same height and width as the input but only one channel: 64×64×1.
For example, consider an input with a large number of channels, say 192. To reduce it to fewer channels, we convolve with that many 1×1 filters (remember: number of filters = output channels). This effect of cross-channel down-sampling is called "dimensionality reduction".
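A minimal sketch of this cross-channel reduction (the 192-channel, 28×28 input matches the example below):

```python
import torch
import torch.nn as nn

# 1x1 convolution as cross-channel dimensionality reduction:
# 192 input channels -> 32 output channels, height/width unchanged.
x = torch.randn(1, 192, 28, 28)
conv1x1 = nn.Conv2d(in_channels=192, out_channels=32, kernel_size=1)
print(conv1x1(x).shape)  # torch.Size([1, 32, 28, 28])
```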
Let us look at an example to understand how reducing dimensions reduces the computational load. Suppose we need to convolve 28×28×192 input feature maps with 32 filters of size 5×5 (each of depth 192). This results in about 120.422 million operations. Let us do the same math with the same input feature maps, but with a 1×1 conv layer placed before the 5×5 conv layer.
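A sketch of that arithmetic; the 16-channel bottleneck is an assumed value (the one commonly used in this well-known example):

```python
# Direct 5x5 convolution: 28x28x192 -> 28x28x32
direct = 28 * 28 * 32 * 5 * 5 * 192
print(f"{direct/1e6:.1f}M")  # 120.4M multiplications

# With a 1x1 bottleneck (assumed 16 channels) before the 5x5 conv:
# 28x28x192 -> 28x28x16 -> 28x28x32
bottleneck = 28 * 28 * 16 * 1 * 1 * 192 + 28 * 28 * 32 * 5 * 5 * 16
print(f"{bottleneck/1e6:.1f}M")  # ~12.4M, roughly a 10x reduction
```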
CNN: Transfer learning
The task of a convolutional neural network (CNN) is to identify objects in images.
Transfer learning: take a model trained on a large dataset and transfer its knowledge to a smaller dataset.
The idea is that the early convolutional layers extract general, low-level features that are applicable across images (such as edges, patterns, and gradients), while the later layers identify specific features within an image, such as eyes or wheels.
Thus, we can use a network trained on unrelated categories in a massive dataset (usually ImageNet) and apply it to our own problem, because there are universal, low-level features shared between images.
The general outline for transfer learning for object recognition is as follows (see the sketch after this list):
• Load a pre-trained CNN model trained on a large dataset.
• Freeze the parameters (weights) in the model's lower convolutional layers.
• Add a custom classifier with several layers of trainable parameters to the model.
• Train the classifier layers on the training data available for the task.
• Fine-tune hyperparameters and unfreeze more layers as needed.
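A minimal sketch of this outline in PyTorch; torchvision's ResNet-18 pretrained on ImageNet is an assumed choice of backbone (torchvision 0.13+ API):

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pretrained on ImageNet (assumed choice: ResNet-18).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained convolutional layers.
for param in model.parameters():
    param.requires_grad = False

# Replace the final FC layer with a trainable classifier for, say,
# 5 custom classes; only these new weights will be trained.
model.fc = nn.Linear(model.fc.in_features, 5)
```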
CNN: Inception Network
Let's suppose we want to build a more complex deep neural network. What challenges do we face when adding complexity? When building a deep neural network, we face two main questions when adding a new layer:
• What filter size should we choose: 3x3, 5x5, 1x3, or something else?
• Should we choose a convolutional layer or a pooling layer?
How should we proceed?
• Do we need to build multiple networks, testing every time with a different layer or maybe a different kernel size?
• Is there a way to do them all and let the network decide what is most relevant for the problem we are solving?
• Inception comes with a more complicated network architecture, but it works remarkably well.
• An inception network says: instead of choosing what filter size we want in the conv layer, or what kind of layer we need, let's do them all.
• To better understand how an inception network works, we first need to understand an inception module, sometimes called an inception block.
• An inception network is composed of multiple inception modules.
• Let's imagine that we have a 28x28x192 input. Applying 64 1x1 convolutions (each of depth 192) results in an output volume of 28x28x64, calculated following the rule ((n + 2p - f) / s) + 1 with n=28, f=1, p=0 and s=1.
• So now we have an output volume, but maybe we also want to try 128 3x3 convolutions; this gives a 28x28x128 output volume, and then all we need to do is stack both volumes together.
We can keep trying different filter sizes, or even different layers: for example, we can also convolve with a 5x5 filter and apply a max-pool layer to the input volume, where in each step we keep the height and width of the output volumes the same as the input volume. After stacking the different outputs, this Inception module has an output of 28x28x256 (256 = 64 + 128 + 32 + 32), and this is the heart of an Inception Network.
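A minimal sketch of such a module with the slide's channel counts; note that the full GoogLeNet module also adds 1x1 bottlenecks before the 3x3 and 5x5 branches, omitted here:

```python
import torch
import torch.nn as nn

# Inception-style module matching the slide's channel counts
# (64 + 128 + 32 + 32 = 256 output channels at 28x28).
class InceptionBlock(nn.Module):
    def __init__(self, in_ch=192):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)               # 1x1 branch
        self.b2 = nn.Conv2d(in_ch, 128, kernel_size=3, padding=1)   # 3x3 branch
        self.b3 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)    # 5x5 branch
        self.b4 = nn.Sequential(                                    # pooling branch
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),  # shrink pooled channels
        )

    def forward(self, x):
        # Run all branches in parallel and concatenate along channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock()(x).shape)  # torch.Size([1, 256, 28, 28])
```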
• In an Inception Network, instead of picking one of these filter sizes or pooling operations, we do them all; at the end we only need to concatenate all the outputs and let the network learn whatever combinations it wants to use.
Thank you so much!
For more info please contact us
0755 - 3501700
[email protected]