
Machine Learning Unit 3 Part 1

Machine Learning

Prof. Shraddha Kumar


Department of Computer Science

S. D. Bansal College, Indore


What we will learn

1. Convolutional Neural Network (CNN)
2. Pooling
3. Padding and Stride
4. Loss Layer
5. Transfer learning

Convolution Neural Network: Introduction

In an ordinary neural network (NN), every neuron receives one or more inputs, computes their weighted sum, and passes it through an activation function to produce its final output.
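As a minimal sketch of the neuron described above, the following computes a weighted sum and applies an activation. The sigmoid activation and the input, weight and bias values are assumptions chosen only for illustration.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    z = np.dot(weights, inputs) + bias   # weighted sum
    return 1.0 / (1.0 + np.exp(-z))      # activation function

x = np.array([0.5, -1.0, 2.0])   # illustrative inputs
w = np.array([0.4, 0.3, -0.2])   # illustrative weights
output = neuron(x, w, bias=0.1)
print(output)
```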

Convolution Neural Network: Introduction

An ordinary NN ignores the structure of the input data: all of the data is converted into a 1-D array before it is fed into the network.

Convolution Neural Network: Architecture

•CNNs are also made up of neurons that have learnable weights and biases.
•A CNN architecture is a list of layers that transforms a 3-dimensional input volume (the height, width and depth of the image) into a 3-dimensional output volume.
•It uses M filters, which are basically feature extractors that pick out features such as edges, corners and so on.

Convolution Neural Network: Introduction
• A CNN takes the 2D structure of an image into account while processing it, which allows it to extract properties that are specific to images.
• CNNs have one or more convolutional layers and pooling layers, which are their main building blocks.
• These layers are followed by one or more fully connected layers, as in a standard multilayer NN.

Convolution Neural Network: Architecture

•CNN image classification takes an input image, processes it and classifies it under certain categories (e.g., Dog, Cat, Tiger, Lion).
•A computer sees an input image as an array of pixels. Depending on the image resolution, it sees h x w x d (h = height, w = width, d = depth, i.e. the number of channels).
•E.g., an RGB image can be a 6 x 6 x 3 array (3 refers to the RGB channels), and a grayscale image can be a 4 x 4 x 1 array.

Convolution Neural Network: Architecture
•Each input image passes through a series of convolution layers with filters (kernels), pooling layers and fully connected (FC) layers. A Softmax function is then applied to classify the object with probabilistic values between 0 and 1. The slide's figure shows this complete flow: the CNN processes an input image and classifies the objects based on these values.

Convolution Neural Network: Architecture
The following layers are used to construct CNNs:
INPUT− As the name implies, this layer holds the raw pixel values, i.e. the image data as it is. For example, INPUT [64×64×3] is a 3-channel RGB image of width 64, height 64 and depth 3.

Convolution Neural Network: Architecture

CONV− This layer is one of the building blocks of CNNs, as most of the computation is done here. For example, if we use 6 filters on the above-mentioned INPUT [64×64×3], this may result in the volume [64×64×6] (the height and width stay at 64 only if the input is padded).

Convolution Neural Network: Architecture

•RELU− Also called the rectified linear unit layer, it applies an activation function to the output of the previous layer. In other words, RELU adds non-linearity to the network.

Convolution Neural Network: Architecture

•POOL− The pooling layer is another building block of CNNs. Its main task is down-sampling: it operates independently on every slice of the input and resizes it spatially.

Convolution Neural Network: Architecture
Breaking it Down:
Down-Sampling:
This means reducing the size of the input feature maps. It helps in reducing computation and extracting dominant features while ignoring unnecessary details.
Operates Independently on Every Slice:
Each slice refers to an individual feature map in the input. The pooling operation is applied separately to each feature map.
Resizes Spatially:
The operation reduces the height and width of the feature maps, but the depth remains the same.
Common pooling techniques include:
Max Pooling: Takes the maximum value from each region.
Average Pooling: Takes the average value from each region.

Example: If an input image has dimensions 32x32x3 (Width x Height x Channels) and we apply 2x2 Max Pooling with a stride of 2, the output size will be 16x16x3.
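The 32x32x3 example can be checked with a short NumPy sketch. The function name `max_pool_2x2` and the random input are illustrative assumptions; the reshape trick only works when the height and width are even.

```python
import numpy as np

def max_pool_2x2(volume):
    """2x2 max pooling with stride 2 on an (H, W, C) array; H and W must be even."""
    h, w, c = volume.shape
    # Split each spatial axis into (blocks, 2), then take the max inside every 2x2 block.
    blocks = volume.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # illustrative random input
pooled = max_pool_2x2(image)
print(pooled.shape)               # (16, 16, 3): height and width halved, depth unchanged
```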

Convolution Neural Network: Architecture

•FC− The fully connected layer, or more specifically the output layer, computes the output class scores. The resulting output is a volume of size 1×1×L, where L is the number of classes.

Convolution Neural Network: Architecture
Breaking it Down:
Fully Connected (FC) Layer:
Every neuron in this layer is connected to every neuron in the previous layer.
It converts the extracted features into class scores for classification.
Output Class Score:
The FC layer processes the learned features and assigns a score to each possible class.
A Softmax activation function is often used to convert these scores into probabilities.
Output Size (1×1×L):
The output of the FC layer is a vector of size L, where L is the number of classes.
The shape 1×1×L means that there is only one value per class (a single probability or score for each class).
Example:
For an image classification task with 10 classes (e.g., digits 0-9):
The FC layer output will be 1×1×10.
Each of the 10 values represents a class score.
The class with the highest score is the predicted class.
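To illustrate the 10-class example, here is a small sketch that turns class scores into probabilities with Softmax and picks the predicted class. The score values are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Illustrative class scores for the digits 0-9 (made-up numbers).
scores = np.array([1.2, 0.3, -0.8, 2.5, 0.0, 4.1, -1.5, 0.7, 1.9, 0.2])
probs = softmax(scores)
predicted = int(np.argmax(probs))       # class with the highest score
print(predicted, probs.round(3))
```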

CNN Architecture: Convolution Layer

•Convolution is the first layer to extract features from an input image. It preserves the relationship between pixels by learning image features from small squares of input data.
•It is a mathematical operation that takes two inputs: an image matrix and a filter (or kernel).

CNN Architecture: Convolution Layer
•Convolution Operation:
•It is a mathematical operation that applies a small filter (also called a kernel) over an image. The filter slides over the image, multiplying its values with the pixel values and summing them up to create a new representation.
•Preserving Pixel Relationships:
•Unlike fully connected layers, which treat all input pixels as independent, convolutions maintain spatial relationships.
•This means nearby pixels are processed together, preserving structures like edges, textures, and shapes.
•Learning Image Features:
•The small squares (filters) learn different features such as edges, corners, textures, and patterns. Deeper layers in the CNN combine these basic features to recognize complex objects.
•Example:
•Imagine a 3×3 filter applied to an image: the filter scans through the image one small section at a time, identifying patterns. Early layers may detect edges, while deeper layers detect faces, objects, or more abstract patterns.

CNN Architecture: Convolution Layer

•Consider a 5 x 5 image whose pixel values are 0 or 1, and a 3 x 3 filter matrix. Sliding the 3 x 3 filter over the 5 x 5 image, and at each position multiplying the overlapping values and summing them, produces the output, which is called the “Feature Map”.
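The slide's figure with the actual matrices is not available in this text, so the sketch below uses assumed 0/1 values for the 5 x 5 image and the 3 x 3 filter. It slides the filter with stride 1 and no padding, without flipping the filter (the usual convention in CNNs).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    n_rows = image.shape[0] - kernel.shape[0] + 1
    n_cols = image.shape[1] - kernel.shape[1] + 1
    feature_map = np.zeros((n_rows, n_cols), dtype=image.dtype)
    for i in range(n_rows):
        for j in range(n_cols):
            patch = image[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Assumed example values (not taken from the slide).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)   # 3 x 3 feature map: [[4 3 4] [2 4 3] [2 3 4]]
```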

CNN: Pooling
Pooling is the process of merging neighbouring values, so it is basically a way of reducing the size of the data. The slide's figure shows pooling with 2×2 filters.

But isn’t this losing valuable data? Why reduce the size? At first glance it looks like losing information, but it is rather a way of keeping the more ‘meaningful’ data.
By removing some noise from the data and keeping only the significant values, we can reduce overfitting and speed up the computation.

CNN: Padding
The filter does not visit every pixel of the image the same number of times: pixels at the corners are covered less often than those in the middle, so they contribute less to the output. Also, if we just keep applying convolutions, the data shrinks too fast. Padding is the trick we can use to fix both problems.
As its name suggests, padding means adding extra pixels at the boundary of the data. We have two options:
•Pad the picture with zeros (zero-padding) so that the filter fits.
•Drop the part of the image where the filter does not fit. This is called valid padding, which keeps only the valid part of the image.

CNN: Padding
Example: The input image has 4x4 pixels and the filter has 3x3. There is no padding, which is called ‘valid.’ The result is 2x2 pixels of data (4–3+1 = 2). We can see that the output data is downsized.
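The 4x4 example can be reproduced with NumPy's `np.pad`. This sketch compares the ‘valid’ case with zero-padding of one pixel; the padding width p = 1 and the input values are assumptions for illustration.

```python
import numpy as np

image = np.arange(16).reshape(4, 4)   # illustrative 4x4 input
f = 3                                  # filter size

# 'valid': no padding, so the output shrinks to (4 - 3 + 1) = 2 per side.
valid_size = image.shape[0] - f + 1

# Zero-padding with p = 1 adds a border of zeros on every side.
p = 1
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)
padded_size = padded.shape[0] - f + 1  # (4 + 2*1 - 3 + 1) = 4, same as the input

print(valid_size, padded.shape, padded_size)   # 2 (6, 6) 4
```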

By the way, does a filter always have to move one pixel at a time? Of course not. We can also make it move
two steps or three steps at a time both in the horizontal and vertical ways. This is called ‘stride.’

CNN: Stride
•Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time. When the stride is 2, we move the filter 2 pixels at a time, and so on. The slide's figure shows how convolution works with a stride of 2.
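A small sketch of how stride changes the output size, using the rule (n + 2p − f)/s + 1 with the division rounded down. The function names and the 7-pixel example are illustrative assumptions.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for input size n, filter size f, padding p and stride s."""
    return (n + 2 * p - f) // s + 1

def filter_positions(n, f, s=1):
    """Top-left offsets at which the filter is placed along one axis (no padding)."""
    return list(range(0, n - f + 1, s))

print(conv_output_size(4, 3))        # 2: the 'valid' 4x4 example
print(conv_output_size(7, 3, s=1))   # 5
print(conv_output_size(7, 3, s=2))   # 3
print(filter_positions(7, 3, s=2))   # [0, 2, 4]
```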

CNN: Non Linearity (ReLU)
•ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is ƒ(x) = max(0,x).
•Why ReLU is important: ReLU’s purpose is to introduce non-linearity into our ConvNet. Convolution by itself is a linear operation, whereas the real-world data we want the ConvNet to learn is mostly non-linear.

Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU. Most data scientists use ReLU because, in practice, it usually performs better than the other two.
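For comparison, the three activation functions mentioned above can be sketched in NumPy as follows. The sample input values are arbitrary.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

def sigmoid(x):
    """Squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # illustrative inputs
print(relu(x))       # negative values become 0, positive values pass through
print(np.tanh(x))    # values squashed into (-1, 1)
print(sigmoid(x))    # values squashed into (0, 1)
```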

CNN: Flattening

Flattening converts the data into a 1-dimensional array so that it can be fed to the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is connected to the final classification model, called a fully-connected layer. In other words, we put all the pixel data in one line and connect it to the final layer.

CNN: Flattening
By adding multiple convolutional and pooling layers, the image is processed for feature extraction.
As the layers go deeper and deeper, the features that the model deals with become more complex.

A flatten layer collapses the spatial dimensions of the input into the channel dimension. For example, if the input to the layer is an H-by-W-by-C-by-N-by-S array (sequences of images), then the flattened output is an (H*W*C)-by-N-by-S array. This layer supports sequence input only.
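The shape changes described above can be sketched with NumPy reshapes. The sizes H, W, C, N and S are assumed values, and only the shapes are meaningful here: the order in which elements are laid out in the flattened vector differs between libraries.

```python
import numpy as np

# A single stack of feature maps: H = 4, W = 4, C = 8 (assumed sizes).
feature_maps = np.arange(4 * 4 * 8).reshape(4, 4, 8)
vector = feature_maps.reshape(-1)        # one long feature vector
print(vector.shape)                      # (128,)

# A sequence of images: H-by-W-by-C-by-N-by-S becomes (H*W*C)-by-N-by-S.
H, W, C, N, S = 4, 4, 8, 2, 5
sequence = np.zeros((H, W, C, N, S))
flat_sequence = sequence.reshape(H * W * C, N, S)
print(flat_sequence.shape)               # (128, 2, 5)
```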

CNN: Fully Connected Layer
•In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, as in an ordinary neural network.

In the slide's diagram, the feature map matrix is converted into a vector (x1, x2, x3, …). With the fully connected layers, we combine these features to create a model. Finally, we have an activation function such as softmax or sigmoid to classify the outputs as cat, dog, car, truck, etc.
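A minimal sketch of one fully connected layer followed by softmax, assuming a flattened vector of 128 features and four classes. The weights are random placeholders rather than trained values, so the printed class is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["cat", "dog", "car", "truck"]

x = rng.random(128)                                    # flattened feature vector (x1, x2, x3, ...)
W = rng.normal(0, 0.1, size=(len(classes), x.size))    # one row of weights per class
b = np.zeros(len(classes))                             # one bias per class

scores = W @ x + b                      # every output neuron is connected to every input
probs = np.exp(scores - scores.max())   # softmax, shifted for numerical stability
probs /= probs.sum()

print(classes[int(np.argmax(probs))], probs.round(3))
```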

CNN: Loss Layer
The "loss layer", or "loss function", specifies how training penalizes the deviation between the predicted output of the network and the true data labels (during supervised learning). Various loss functions can be used, depending on the specific task.
•Softmax loss is used for predicting a single class out of K mutually exclusive classes.
•Sigmoid cross-entropy loss is used for predicting K independent probability values.
•Euclidean loss is used for regressing to real-valued labels.
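The three losses listed above can be sketched as follows. These are simplified single-example versions written for illustration; the factor 0.5 in the Euclidean loss is one common convention, and frameworks differ in how they scale and average these quantities.

```python
import numpy as np

def softmax_loss(scores, true_class):
    """Cross-entropy of softmax(scores) against one correct class index."""
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[true_class]

def sigmoid_cross_entropy_loss(logits, targets):
    """Sum of K independent binary cross-entropies; targets are 0 or 1."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))

def euclidean_loss(predictions, labels):
    """Half the squared Euclidean distance between predictions and labels."""
    return 0.5 * np.sum((predictions - labels) ** 2)

print(softmax_loss(np.array([2.0, 0.5, -1.0]), true_class=0))
print(sigmoid_cross_entropy_loss(np.array([1.5, -0.3]), np.array([1.0, 0.0])))
print(euclidean_loss(np.array([1.0, 2.0]), np.array([1.0, 4.0])))   # 2.0
```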

The function used to evaluate a candidate solution (i.e. a set of weights) is referred to as the objective function.
We may seek to maximize or minimize the objective function. When we are minimizing it, we may also call it the cost function,
loss function, or error function.
Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost
function or a loss function and the value calculated by the loss function is referred to as simply “loss.”

CNN: Loss Layer
•Following the fully-connected layer is the loss layer, which drives the adjustment of weights across the network.
•Before the training of the network begins, the weights in the convolution and fully-connected layers are given random values.
•Then, during training, the loss layer continually checks the fully-connected layer's guesses against the actual values, with the goal of minimizing the difference between the guess and the real value as much as possible.
•This is done by adjusting the weights in both the convolution and fully-connected layers, using the gradient of the loss with respect to each weight (backpropagation).
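The training idea above (random initial weight, compare guess with actual value, adjust the weight to reduce the loss) can be sketched on a toy one-weight model. This is not a CNN: the data, the learning rate and the number of steps are assumptions chosen to keep the example tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): the "actual values" follow y = 3x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_true = 3.0 * x

w = rng.normal()          # the weight starts at a random value
learning_rate = 0.01      # illustrative step size
losses = []

for step in range(200):
    y_guess = w * x                                # the model's guess
    loss = 0.5 * np.mean((y_guess - y_true) ** 2)  # how far the guess is from the truth
    grad = np.mean((y_guess - y_true) * x)         # derivative of the loss with respect to w
    w -= learning_rate * grad                      # adjust the weight to reduce the loss
    losses.append(loss)

print(w, losses[0], losses[-1])   # w ends close to 3 and the loss shrinks
```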

CNN: Summary
•Provide the input image to the convolution layer.
•Choose the parameters, and apply filters with strides and, if required, padding. Perform convolution on the image and apply ReLU activation to the resulting matrix.
•Perform pooling to reduce the dimensionality.
•Add as many convolutional layers as needed.
•Flatten the output and feed it into a fully connected layer (FC layer).
•Output the class using an activation function (logistic regression with cost functions) to classify the images.

CNN: 1 x 1 convolution
•Convolution is an element-wise multiplication and summation of the input and kernel/filter elements.
•The points to remember are:
1. The input matrix can, and in most cases will, have more than one channel. This is sometimes referred to as depth.
•Example: a 64×64 pixel RGB input from an image has 3 channels, so the input is 64×64×3.
2. The filter has the same depth as the input, except in some special cases.
•Example: a 3×3 filter will have 3 channels as well, hence the filter should be represented as 3×3×3.
3. The third and critical point: the output of the convolution step has a depth equal to the number of filters we choose.
•Example: the output of the convolution step on the 3D input (64×64×3) with the filter we chose (3×3×3) has a depth of 1 (because we have only one filter).
•The convolution step on the 3D input 64×64×3 with a filter of size 3×3×3 has the filter ‘sliding’ along the width and height of the input.
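The three points above can be checked with a small loop-based sketch. It assumes stride 1 and no padding, so the 64×64 input shrinks to 62×62; the function name, the random values and the choice of 6 filters in the second call are illustrative.

```python
import numpy as np

def conv_layer(volume, filters):
    """Valid, stride-1 convolution of an (H, W, C) volume with (K, f, f, C) filters."""
    h, w, c = volume.shape
    k, f, _, filter_depth = filters.shape
    assert filter_depth == c, "filter depth must match input depth"
    out = np.zeros((h - f + 1, w - f + 1, k))
    for n in range(k):                       # one output channel per filter
        for i in range(h - f + 1):           # slide along the height
            for j in range(w - f + 1):       # slide along the width
                out[i, j, n] = np.sum(volume[i:i + f, j:j + f, :] * filters[n])
    return out

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))              # 64x64 RGB input
one_filter = rng.random((1, 3, 3, 3))        # a single 3x3x3 filter
six_filters = rng.random((6, 3, 3, 3))       # six 3x3x3 filters

out_one = conv_layer(image, one_filter)
out_six = conv_layer(image, six_filters)
print(out_one.shape)   # (62, 62, 1): depth 1 because there is one filter
print(out_six.shape)   # (62, 62, 6): depth equals the number of filters
```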

CNN: 1 x 1 convolution What is it?
1×1 Conv (cross-channel pooling) is used to reduce the number of channels while introducing non-linearity.
A 1×1 convolution simply means that the filter is of size 1×1 (yes, that means a single number per channel, as opposed to a matrix like, say, a 3×3 filter). This 1×1 filter convolves over the ENTIRE input image, pixel by pixel.
For an input of 64×64×3, if we choose a 1×1 filter (which would be 1×1×3), the output has the same height and width as the input but only one channel: 64×64×1.
Now consider an input with a large number of channels, for example 192. To reduce the depth while keeping the height and width the same, we can choose fewer 1×1 filters than there are input channels (remember: number of filters = output channels). This effect of cross-channel down-sampling is called ‘dimensionality reduction’.
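Because a 1×1 filter only mixes channels at each pixel, a 1×1 convolution can be sketched as a matrix multiplication over the channel axis. The 28×28 spatial size and the choice of 16 filters are assumptions for illustration, and the weights are random.

```python
import numpy as np

def conv_1x1(volume, weights):
    """1x1 convolution followed by ReLU: mixes channels at every pixel independently.

    volume:  (H, W, C_in)
    weights: (C_in, C_out), one column per 1x1 filter
    """
    mixed = volume @ weights        # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)
    return np.maximum(0, mixed)     # the non-linearity

rng = np.random.default_rng(0)
feature_maps = rng.random((28, 28, 192))   # input with 192 channels
filters = rng.normal(size=(192, 16))       # 16 filters is an assumed choice
reduced = conv_1x1(feature_maps, filters)
print(reduced.shape)                       # (28, 28, 16): same height and width, fewer channels
```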

CNN: 1 x 1 convolution What is it?
Let us look at an example to understand how reducing the dimension reduces the computational load. Suppose we need to convolve 28×28×192 input feature maps with 32 filters of size 5×5. With padding that keeps the output at 28×28, this results in 28 × 28 × 32 × 5 × 5 × 192 ≈ 120.422 million multiplications.

Let us do the same math for the same input feature maps, but with a 1×1 conv layer placed before the 5×5 conv layer.
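The comparison can be worked out with a few lines of arithmetic. The figures for this slide are not available, so the sketch assumes the 1×1 layer reduces the input to 16 channels (the value used later in the inception slides) and that both layers keep the 28×28 spatial size.

```python
def conv_multiplications(out_h, out_w, n_filters, f, in_channels):
    """Multiplications needed for a conv layer producing an out_h x out_w x n_filters volume."""
    return out_h * out_w * n_filters * f * f * in_channels

# 5x5 convolution applied directly to the 192-channel input.
direct = conv_multiplications(28, 28, 32, 5, 192)

# 1x1 convolution down to 16 channels (assumed), then the 5x5 convolution.
reduce_step = conv_multiplications(28, 28, 16, 1, 192)
conv_step = conv_multiplications(28, 28, 32, 5, 16)
with_bottleneck = reduce_step + conv_step

print(direct)            # 120422400  (about 120.4 million)
print(with_bottleneck)   # 12443648   (about 12.4 million)
```

Under these assumptions the 1×1 layer cuts the number of multiplications by roughly a factor of ten.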

CNN: Transfer learning
The task of a convolutional neural network (CNN) here is to identify objects in images.
Transfer learning: take a model trained on a large dataset and transfer its knowledge to a smaller dataset.
The idea is that the early convolutional layers extract general, low-level features that are applicable across images (such as edges, patterns and gradients), while the later layers identify specific features within an image, such as eyes or wheels.
Thus, we can take a network trained on unrelated categories in a massive dataset (usually ImageNet) and apply it to our own problem, because there are universal, low-level features shared between images.

The general outline for transfer learning for object recognition is as follows:
•Load a pre-trained CNN model trained on a large dataset
•Freeze the parameters (weights) in the model’s lower convolutional layers
•Add a custom classifier with several layers of trainable parameters to the model
•Train the classifier layers on the training data available for the task
•Fine-tune hyperparameters and unfreeze more layers as needed
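The freeze-and-train steps in the outline can be illustrated with a toy NumPy sketch. It is only a stand-in: the "pre-trained" layer is a random matrix rather than a real CNN, the classifier head is a single logistic unit, and the dataset is made up. In practice a deep learning framework and a real pre-trained model would be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained lower layers. In a real setting these weights
# would be loaded from a trained CNN; here they are random placeholders.
frozen_weights = rng.normal(scale=0.1, size=(20, 8))
frozen_snapshot = frozen_weights.copy()

def extract_features(inputs):
    """Frozen layer followed by ReLU; its weights are never updated."""
    return np.maximum(0, inputs @ frozen_weights)

def head_forward(features, w, b):
    """Trainable classifier head: a single logistic unit."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

def cross_entropy(probs, labels):
    eps = 1e-12   # avoids log(0)
    return -np.mean(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))

# Small made-up dataset for the new task: 100 samples, 2 classes.
inputs = rng.normal(size=(100, 20))
labels = (inputs[:, 0] > 0).astype(float)

features = extract_features(inputs)   # computed once, since the frozen layer never changes
head_w, head_b = np.zeros(8), 0.0
initial_loss = cross_entropy(head_forward(features, head_w, head_b), labels)

for _ in range(300):                  # train only the classifier head
    error = head_forward(features, head_w, head_b) - labels
    head_w = head_w - 0.5 * features.T @ error / len(labels)
    head_b = head_b - 0.5 * error.mean()

final_loss = cross_entropy(head_forward(features, head_w, head_b), labels)
print(initial_loss, final_loss)       # the loss drops while frozen_weights stay unchanged
```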

CNN: Inception Network
Suppose we want to build a more complex deep neural network. What challenges do we face when adding complexity?
When building a deep neural network, we face two main questions whenever we add a new layer:
•Which filter size should we choose: 3x3, 5x5, 1x3 or something else?
•Should we choose a convolutional layer or a pooling layer?
•How should I proceed?
• Do I need to build multiple networks, testing a different layer or a different kernel size every time?
• Is there a way to do them all and let the network decide what is more relevant for the problem I am solving?

•Inception networks are the ingenious answer to all these questions.

CNN: Inception Network
•Inception comes with a more complicated network architecture, but it works remarkably well.
•An inception network says that instead of choosing which filter size we want in the conv layer, or which kind of layer we need, let’s do them all.
•To understand how an inception network works, we first need to understand an inception module, sometimes called an inception block.
•An inception network is composed of multiple inception modules.

CNN: Inception Network
•Let’s imagine that we have a 28x28x192 input. Applying a 1x1 convolution with 64 filters results in an output volume of 28x28x64, calculated with the rule (n + 2p − f)/s + 1, where n=28, f=1, p=0 and s=1.
•So now we have an output volume, but we may also want to try a 3x3 convolution with 128 filters. With padding that keeps the size (p=1), this gives a 28x28x128 output volume, and then all we need to do is stack both volumes together.

We can keep trying different filter sizes or even layer types. The slide's figure also convolves with a 5x5 filter and applies a max-pool layer to the input volume; in each branch we need to keep the height and width of the output volume the same as those of the input volume. After stacking the different outputs, this Inception module has an output of 28x28x256 (256 = 64+128+32+32), and this is the heart of an Inception Network.
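The stacking step can be sketched with `np.concatenate` along the channel axis. The branch outputs below are zero-filled placeholders with the right shapes rather than real convolutions. The padding values (p=1 for 3x3, p=2 for 5x5) and the assumption that the pooling branch ends with 32 channels are choices made so that every branch keeps the 28x28 size and the depths match the 64+128+32+32 total.

```python
import numpy as np

def branch_size(n, f, p, s=1):
    """Spatial output size from the rule (n + 2p - f)/s + 1."""
    return (n + 2 * p - f) // s + 1

n = 28   # height and width of the 28x28x192 input

# (spatial size, number of output channels) for each branch.
branches = {
    "1x1 conv, 64 filters": (branch_size(n, 1, 0), 64),
    "3x3 conv, 128 filters": (branch_size(n, 3, 1), 128),
    "5x5 conv, 32 filters": (branch_size(n, 5, 2), 32),
    "3x3 max-pool branch, 32 channels": (branch_size(n, 3, 1), 32),
}

# Placeholder outputs with the right shapes, stacked along the channel axis.
outputs = [np.zeros((size, size, depth)) for size, depth in branches.values()]
module_output = np.concatenate(outputs, axis=-1)
print(module_output.shape)   # (28, 28, 256)
```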

CNN: Inception Network
•In an Inception Network, instead of picking one of these filter sizes or pooling, we can do them all; at the end, we only need to concatenate all the outputs and let the network learn whatever combinations it wants to use.

Inception Module Computational cost

•One big problem with the above inception modules is that even a modest number of 5x5 convolutions can be expensive on top of a convolutional layer with a large number of filters. The computational cost is 28x28x32x5x5x192, which is about 120 million multiplications, a pretty expensive operation.
•To solve this problem, we can use a 1x1 convolution to reduce the volume to 16 channels, and then apply the 5x5 convolution on this smaller volume.
•An inception network is a pretty deep network, so it is subject to the vanishing gradient problem.
•To prevent the middle part of the network from “dying out”, we can use intermediate classifiers to counter this vanishing gradient problem.

Conclusion

01 Introduction of Convolutional Neural Network

02 Basic Architecture of CNN

Thank you so much!
For more info please contact us

0755 - 3501700
[email protected]

Video Lecture by : Prof. Shraddha Kumar
