A Probabilistic Theory of Deep Learning: Unit 2
Real-world data is noisy and chaotic. Since deep learning systems are trained on real-world data, they need a principled way to handle that uncertainty.
In practice it is often better to use a simple system that models its own uncertainty than a complex one that is certain but brittle.
Backpropagation
There are two kinds of backpropagation networks:
Static backpropagation
Recurrent backpropagation
Static backpropagation:
A static backpropagation network produces a mapping from a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward repeatedly until the network settles to a fixed value, and the error is then propagated back through that settled state. The main difference between the two methods is that the mapping is immediate in static backpropagation, while it is non-static in recurrent backpropagation.
The backpropagation algorithm proceeds as follows:
1. Input x: Set the corresponding activation a^1 for the input layer.
2. Feedforward: For each l = 2, 3, ..., L compute z^l = w^l a^{l-1} + b^l and a^l = σ(z^l).
3. Output error δ^L: Compute the vector δ^L = ∇_a C ⊙ σ'(z^L).
4. Backpropagate the error: For each l = L-1, L-2, ..., 2 compute δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).
5. Output: The gradient of the cost function is given by ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j and ∂C/∂b^l_j = δ^l_j.
Examining the algorithm you can see why it's called backpropagation.
We compute the error vectors δ^l backward, starting from the final
layer. It may seem peculiar that we're going through the network
backward. But if you think about the proof of backpropagation, the
backward movement is a consequence of the fact that the cost is a
function of outputs from the network. To understand how the cost
varies with earlier weights and biases we need to repeatedly apply the
chain rule, working backward through the layers to obtain usable
expressions.
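As a concrete illustration, here is a minimal NumPy sketch of the algorithm above for a fully-connected network with a quadratic cost and sigmoid activations. The variable names and the column-vector shapes are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """One pass of the algorithm above for the quadratic cost C = 0.5*||a^L - y||^2.
    x and y are column vectors; weights[l] and biases[l] connect layer l to l+1."""
    # 1. Input: set the activation a^1 for the input layer.
    activation = x
    activations = [x]          # a^1, a^2, ..., a^L
    zs = []                    # z^2, ..., z^L
    # 2. Feedforward: z^l = w^l a^{l-1} + b^l and a^l = sigma(z^l).
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # 3. Output error: delta^L = grad_a C (*) sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [np.zeros(w.shape) for w in weights]
    grad_b = [np.zeros(b.shape) for b in biases]
    grad_w[-1] = delta @ activations[-2].T
    grad_b[-1] = delta
    # 4. Backpropagate: delta^l = ((w^{l+1})^T delta^{l+1}) (*) sigma'(z^l).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l] = delta @ activations[-l - 1].T
        grad_b[-l] = delta
    # 5. Output: the gradients dC/dw^l and dC/db^l.
    return grad_w, grad_b
```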
3. Batch normalisation
Batch normalisation is a technique for improving the performance and stability of neural
networks, and also makes more sophisticated deep learning architectures work in practice
(like DCGANs).
The idea is to normalise the inputs of each layer in such a way that they have a mean output
activation of zero and standard deviation of one. This is analogous to how the inputs to
networks are standardised.
How does this help? We know that normalising the inputs to a network helps it learn. But a
network is just a series of layers, where the output of one layer becomes the input to the next.
That means we can think of any layer in a neural network as the first layer of a smaller
subsequent network.
Thought of as a series of neural networks feeding into each other, we normalise the output
of one layer before applying the activation function, and then feed it into the following layer
(sub-network).
In Keras, it is implemented using the following code. Note how the BatchNormalization call
occurs after each fully-connected layer, but before the activation function and dropout.
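The referenced code was not included in the notes; the snippet below is a minimal sketch of the described pattern using the Keras Sequential API, with layer sizes and dropout rate chosen as illustrative assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation, Dropout

# BatchNormalization is placed after each fully-connected layer,
# but before the activation function and dropout.
model = Sequential([
    Dense(64, input_shape=(784,)),
    BatchNormalization(),
    Activation('relu'),
    Dropout(0.5),

    Dense(64),
    BatchNormalization(),
    Activation('relu'),
    Dropout(0.5),

    Dense(10),
    Activation('softmax'),
])
```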
Since by now we have a clear idea of why we need batch normalization, let's understand
how it works. It is a two-step process: first the input is normalized, and then rescaling and
offsetting are applied.
Normalization is the process of transforming the data to have zero mean and unit standard
deviation. In this step we take our batch input from layer h and first calculate the mean of
the hidden activations.
Once we have the mean, the next step is to calculate the variance (and hence the standard
deviation) of the hidden activations.
With the mean and the standard deviation ready, we normalize the hidden activations using
these values: we subtract the mean from each input and divide the result by the square root
of the variance plus a small smoothing term (ε).
The smoothing term (ε) ensures numerical stability within the operation by preventing
division by zero.
Rescaling and Offsetting
In the final operation, the rescaling and offsetting of the input take place. Here two
components of the BN algorithm come into the picture: γ (gamma) and β (beta). These
parameters are used for rescaling (γ) and shifting (β) the vector containing the values from
the previous operation.
These two are learnable parameters: during training, the neural network finds the optimal
values of γ and β, which enables accurate normalization of each batch.
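Putting the two steps together, here is a small NumPy sketch of batch normalization at training time; the array shapes, the value of ε, and the initialisation of γ and β are illustrative assumptions.

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize a batch of hidden activations h (shape: batch_size x features),
    then rescale with gamma and shift with beta."""
    # Step 1: normalize to zero mean and unit variance per feature.
    mu = h.mean(axis=0)                      # mean of the hidden activations
    var = h.var(axis=0)                      # variance of the hidden activations
    h_hat = (h - mu) / np.sqrt(var + eps)    # eps prevents division by zero
    # Step 2: rescale (gamma) and shift (beta) -- both are learnable parameters.
    return gamma * h_hat + beta

# Usage: gamma and beta are typically initialised to ones and zeros.
h = np.random.randn(32, 64)                  # a batch of 32 activations, 64 features
out = batch_norm(h, gamma=np.ones(64), beta=np.zeros(64))
```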
In short, "shallow" neural networks is a term used to describe NN that usually have only
one hidden layer as opposed to deep NN which have several hidden layers, often of
various types. ... In lay man terms : Shallow means : NOT DEEP that is no of hidden
layer = 1
A shallow network has fewer hidden layers. While there are studies showing that a shallow
network can fit any function, it would need to be very wide, which causes the number of
parameters to increase greatly.
There are quite conclusive results that a deep network can fit functions better, with fewer
parameters, than a shallow network.
Besides an input layer and an output layer, a neural network has intermediate layers,
which might also be called hidden layers. They might also be called encoders.
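As a rough illustration of the parameter-count argument, the sketch below compares a single very wide hidden layer with a stack of narrower ones; the specific layer widths are arbitrary assumptions chosen only to make the comparison concrete.

```python
def dense_params(layer_sizes):
    """Number of weights and biases in a fully-connected network
    with the given layer sizes (input, hidden..., output)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# A shallow but very wide network vs. a deeper, narrower one.
shallow = dense_params([784, 4096, 10])          # one fat hidden layer
deep = dense_params([784, 256, 128, 64, 10])     # several smaller hidden layers
print(shallow, deep)   # the shallow network uses far more parameters
```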
5. Convolutional Networks
An RGB image can be separated into its three colour planes: Red, Green, and Blue. There are
a number of such colour spaces in which images exist: Grayscale, RGB, HSV, CMYK, etc.
You can imagine how computationally intensive things would get once the images reach
dimensions, say 8K (7680×4320). The role of the ConvNet is to reduce the images into a
form which is easier to process, without losing features which are critical for getting a
good prediction. This is important when we want to design an architecture that is not only
good at learning features but is also scalable to massive datasets.
Convolution Layer — The Kernel
Convoluting a 5x5x1 image with a 3x3x1 kernel to get a 3x3x1 convolved feature
Image Dimensions = 5 (Height) x 5 (Breadth) x 1 (Number of channels, eg. RGB)
In the above demonstration, the green section represents our 5x5x1 input image, I. The
element involved in carrying out the convolution operation in the first part of a
Convolutional Layer is called the Kernel/Filter, K, represented in yellow. We have selected
K as a 3x3x1 matrix:
Kernel/Filter, K =
1 0 1
0 1 0
1 0 1
The Kernel shifts 9 times because of Stride Length = 1 (Non-Strided), every time
performing an element-wise multiplication between K and the portion P of the image over
which the kernel is hovering, and summing the results.
In the case of images with multiple channels (e.g. RGB), the Kernel has the same depth as
that of the input image. The element-wise multiplication is performed between each kernel
channel and the corresponding image channel ([K1, I1]; [K2, I2]; [K3, I3]), and all the
results are summed with the bias to give us a squashed one-depth-channel Convolved
Feature Output.
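To make the sliding-window operation concrete, here is a small NumPy sketch that convolves a 5x5x1 image with the 3x3x1 kernel above at stride 1; the example image values are arbitrary assumptions.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image, taking an element-wise product
    and sum at each position (valid padding, single channel)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # the kernel "hovers" over patch P
    return out

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
image = np.arange(25).reshape(5, 5)              # an arbitrary 5x5x1 example image
print(convolve2d(image, kernel).shape)           # (3, 3): the 3x3x1 convolved feature
```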
The objective of the Convolution Operation is to extract features such as edges from the
input image. ConvNets need not be limited to only one Convolutional Layer.
Conventionally, the first ConvLayer is responsible for capturing Low-Level features such as
edges, color, and gradient orientation. With added layers, the architecture adapts to the
High-Level features as well, giving us a network which has a wholesome understanding of
the images in the dataset, similar to how we would.
There are two types of results to the operation — one in which the convolved feature is
reduced in dimensionality as compared to the input, and the other in which the
dimensionality is either increased or remains the same. This is done by applying Valid
Padding in case of the former, or Same Padding in the case of the latter.
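A quick way to see the difference is the output-size formula out = (n - k + 2p)/s + 1 for input size n, kernel size k, padding p, and stride s. The sketch below checks it for the 5x5 example; the parameter values are illustrative.

```python
def conv_output_size(n, k, padding, stride=1):
    """Spatial output size for input n, kernel k, padding p, stride s."""
    return (n - k + 2 * padding) // stride + 1

n, k = 5, 3
print(conv_output_size(n, k, padding=0))   # Valid padding: 3, dimensionality reduced
print(conv_output_size(n, k, padding=1))   # Same padding: 5, dimensionality preserved
```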
Pooling Layer
Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the
spatial size of the Convolved Feature. This is to decrease the computational power
required to process the data through dimensionality reduction. Furthermore, it is useful
for extracting dominant features which are rotational and positional invariant, thus
maintaining the process of effectively training of the model.
There are two types of Pooling: Max Pooling and Average Pooling. Max Pooling returns
the maximum value from the portion of the image covered by the Kernel. On the other
hand, Average Pooling returns the average of all the values from the portion of the
image covered by the Kernel.
Max Pooling also performs as a Noise Suppressant. It discards the noisy activations
altogether, performing de-noising along with dimensionality reduction. Average Pooling,
on the other hand, simply performs dimensionality reduction as a noise-suppressing
mechanism. Hence, Max Pooling often performs considerably better than Average Pooling.
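For comparison, here is a small NumPy sketch of 2x2 max and average pooling on a single-channel feature map; the feature-map values and window size are illustrative assumptions.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: take the max (or mean) of each size x size window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 2.],
               [7., 2., 9., 8.],
               [1., 5., 4., 3.]])
print(pool2d(fm, mode="max"))       # keeps the strongest activation in each window
print(pool2d(fm, mode="average"))   # averages, preserving overall magnitude
```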
The Convolutional Layer and the Pooling Layer, together form the i-th layer of a
Convolutional Neural Network. Depending on the complexities in the images, the number
of such layers may be increased to capture low-level details even further, but at the
cost of more computational power.
After going through the above process, we have successfully enabled the model to
understand the features. Moving on, we are going to flatten the final output and feed it to a
regular Neural Network for classification purposes.
Classification — Fully Connected Layer (FC Layer)
6. Generative Adversarial Networks
Generative adversarial networks (GANs) are algorithmic architectures that use two neural
networks, pitting one against the other (thus the “adversarial”) in order to generate new,
synthetic instances of data that can pass for real data. They are used widely in image
generation, video generation and voice generation.
GANs were introduced in a paper by Ian Goodfellow and other researchers at the
University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs,
Facebook’s AI research director Yann LeCun called adversarial training “the most
interesting idea in the last 10 years in ML.”
GANs’ potential for both good and evil is huge, because they can learn to mimic any
distribution of data. That is, GANs can be taught to create worlds eerily similar to our
own in any domain: images, music, speech, prose. They are robot artists in a sense, and
their output is impressive – poignant even. But they can also be used to generate fake
media content, and are the technology underpinning Deepfakes.
The discriminator is in a feedback loop with the ground truth of the images, which we know.
The generator is in a feedback loop with the discriminator.
You can think of a GAN as the opposition of a counterfeiter and a cop in a
game of cat and mouse, where the counterfeiter is learning to pass false
notes, and the cop is learning to detect them. Both are dynamic; i.e. the
cop is in training, too (to extend the analogy, maybe the central bank is
flagging bills that slipped through), and each side comes to learn the
other’s methods in a constant escalation.
For MNIST, the discriminator network is a standard convolutional network
that can categorize the images fed to it, a binomial classifier labeling
images as real or fake. The generator is an inverse convolutional network,
in a sense: While a standard convolutional classifier takes an image and
downsamples it to produce a probability, the generator takes a vector of
random noise and upsamples it to an image. The first throws away data
through downsampling techniques like maxpooling, and the second
generates new data.
Both nets are trying to optimize a different and opposing objective function,
or loss function, in a zero-sum game. This is essentially an actor-critic
model. As the discriminator changes its behavior, so does the generator,
and vice versa. Their losses push against each other.
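As a rough sketch of the architectures described above, here is a minimal Keras version of an MNIST-style discriminator and generator; the layer sizes, noise dimension, and activation choices are illustrative assumptions rather than a reference implementation.

```python
from tensorflow.keras import layers, models

NOISE_DIM = 100  # length of the random noise vector fed to the generator

# Discriminator: a standard convolutional network that downsamples an
# image to a single probability of "real" vs "fake".
discriminator = models.Sequential([
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu",
                  input_shape=(28, 28, 1)),
    layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])

# Generator: roughly an "inverse" convolutional network that upsamples
# a noise vector into a 28x28 image.
generator = models.Sequential([
    layers.Dense(7 * 7 * 128, activation="relu", input_shape=(NOISE_DIM,)),
    layers.Reshape((7, 7, 128)),
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),
])
```

In a training loop (not shown), the discriminator would be fitted on batches of real and generated images while the generator is trained through the combined model to fool it, reflecting the opposing loss functions described above.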
7. Semi-supervised Learning
Semi-supervised learning uses a small amount of labeled data and a large amount of unlabeled
data, which provides the benefits of both unsupervised and supervised learning while avoiding
the challenge of finding a large amount of labeled data. That means you can train a model to
label data without having to collect massive amounts of labeled training data first.
The way that semi-supervised learning manages to train the model with less labeled training
data than supervised learning is by using pseudo labeling. This can combine many neural
network models and training methods. The procedure works as follows (a code sketch follows
these steps):
1. Train the model on the small amount of labeled training data, as in ordinary supervised
learning.
2. Then use it with the unlabeled training dataset to predict the outputs, which are the
pseudo labels.
3. Link the labels from the labeled training data with the pseudo labels created in the
previous step.
4. Link the data inputs in the labeled training data with the inputs in the unlabeled data.
5. Then, train the model the same way as you did with the labeled set in the beginning, to
reduce the error and improve the model's accuracy.
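A minimal sketch of this pseudo-labeling loop, assuming a generic scikit-learn-style classifier with fit/predict methods; the choice of model and the absence of any confidence filtering are illustrative simplifications.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_train(X_labeled, y_labeled, X_unlabeled):
    # 1. Train on the small labeled set, as in ordinary supervised learning.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)
    # 2. Predict outputs for the unlabeled data: these are the pseudo labels.
    pseudo_labels = model.predict(X_unlabeled)
    # 3./4. Link the true labels with the pseudo labels and the labeled
    #       inputs with the unlabeled inputs into one training set.
    X_all = np.vstack([X_labeled, X_unlabeled])
    y_all = np.concatenate([y_labeled, pseudo_labels])
    # 5. Retrain the model the same way as before on the combined data.
    model.fit(X_all, y_all)
    return model
```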
A common example is classifying a large collection of text documents. This is the type of
situation where semi-supervised learning is ideal, because it would be nearly impossible to
find a large amount of labeled text documents: it is simply not time-efficient to have a person
read through entire text documents just to assign them a simple classification.
So, semi-supervised learning allows for the algorithm to learn from a small amount of labeled
text documents while still classifying a large amount of unlabeled text documents in the training
data.