A Probabilistic Theory of Deep Learning: Unit 2
Real-world data is noisy and chaotic. Since deep learning systems are trained on real-world data, they need a principled way to handle that uncertainty.
In practice it is often better to use a simple system that models its own uncertainty than a complex one that is certain but brittle.
Backpropagation
There are two kinds of backpropagation networks:
Static backpropagation
Recurrent backpropagation
Static backpropagation:
A static backpropagation network produces a mapping from a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward repeatedly until the network settles to a fixed value, and the error is then propagated back through that settled state. The main difference between the two methods is that the mapping is immediate in static backpropagation, while it is non-static in recurrent backpropagation.
The backpropagation algorithm proceeds as follows:
1. Input x: Set the corresponding activation a^1 for the input layer.
2. Feedforward: For each l = 2, 3, ..., L compute z^l = w^l a^{l-1} + b^l and a^l = σ(z^l).
3. Output error δ^L: Compute the vector δ^L = ∇_a C ⊙ σ'(z^L).
4. Backpropagate the error: For each l = L-1, L-2, ..., 2 compute δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).
5. Output: The gradient of the cost function is given by ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j and ∂C/∂b^l_j = δ^l_j.
Examining the algorithm you can see why it's called backpropagation.
We compute the error vectors δ^l backward, starting from the final
layer. It may seem peculiar that we're going through the network
backward. But if you think about the proof of backpropagation, the
backward movement is a consequence of the fact that the cost is a
function of outputs from the network. To understand how the cost
varies with earlier weights and biases we need to repeatedly apply the
chain rule, working backward through the layers to obtain usable
expressions.
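As a concrete illustration, here is a minimal NumPy sketch of the algorithm above for a fully-connected network with a quadratic cost and sigmoid activations. The variable names and the column-vector shapes are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """One pass of the algorithm above for the quadratic cost C = 0.5*||a^L - y||^2.
    x and y are column vectors; weights[l] and biases[l] connect layer l to l+1."""
    # 1. Input: set the activation a^1 for the input layer.
    activation = x
    activations = [x]          # a^1, a^2, ..., a^L
    zs = []                    # z^2, ..., z^L
    # 2. Feedforward: z^l = w^l a^{l-1} + b^l and a^l = sigma(z^l).
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # 3. Output error: delta^L = grad_a C (*) sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [np.zeros(w.shape) for w in weights]
    grad_b = [np.zeros(b.shape) for b in biases]
    grad_w[-1] = delta @ activations[-2].T
    grad_b[-1] = delta
    # 4. Backpropagate: delta^l = ((w^{l+1})^T delta^{l+1}) (*) sigma'(z^l).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l] = delta @ activations[-l - 1].T
        grad_b[-l] = delta
    # 5. Output: the gradients dC/dw^l and dC/db^l.
    return grad_w, grad_b
```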
3. Batch normalisation
Batch normalisation is a technique for improving the performance and stability of neural
networks, and also makes more sophisticated deep learning architectures work in practice
(like DCGANs).
The idea is to normalise the inputs of each layer in such a way that they have a mean output
activation of zero and standard deviation of one. This is analogous to how the inputs to
networks are standardised.
How does this help? We know that normalising the inputs to a network helps it learn. But a
network is just a series of layers, where the output of one layer becomes the input to the next.
That means we can think of any layer in a neural network as the first layer of a smaller
subsequent network.
Thought of as a series of neural networks feeding into each other, we normalise the output
of one layer before applying the activation function, and then feed it into the following layer
(sub-network).
In Keras, it is implemented using the following code. Note how the BatchNormalization call
occurs after each fully-connected layer, but before the activation function and dropout.
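The referenced code was not included in the notes; the snippet below is a minimal sketch of the described pattern using the Keras Sequential API, with layer sizes and dropout rate chosen as illustrative assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation, Dropout

# BatchNormalization is placed after each fully-connected layer,
# but before the activation function and dropout.
model = Sequential([
    Dense(64, input_shape=(784,)),
    BatchNormalization(),
    Activation('relu'),
    Dropout(0.5),

    Dense(64),
    BatchNormalization(),
    Activation('relu'),
    Dropout(0.5),

    Dense(10),
    Activation('softmax'),
])
```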
Since by now we have a clear idea of why we need batch normalization, let's understand
how it works. It is a two-step process: first the input is normalized, and then rescaling and
offsetting are applied.
Normalization is the process of transforming the data to have zero mean and unit standard
deviation. In this step we take our batch input from layer h and first calculate the mean of
the hidden activations.
Once we have the mean, the next step is to calculate the variance (and hence the standard
deviation) of the hidden activations.
With the mean and the standard deviation ready, we normalize the hidden activations using
these values: we subtract the mean from each input and divide the result by the square root
of the variance plus a small smoothing term (ε).
The smoothing term (ε) ensures numerical stability within the operation by preventing
division by zero.
Rescaling and Offsetting
In the final operation, the rescaling and offsetting of the input take place. Here two
components of the BN algorithm come into the picture: γ (gamma) and β (beta). These
parameters are used for rescaling (γ) and shifting (β) the vector containing the values from
the previous operation.
These two are learnable parameters: during training, the neural network finds the optimal
values of γ and β, which enables accurate normalization of each batch.
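Putting the two steps together, here is a small NumPy sketch of batch normalization at training time; the array shapes, the value of ε, and the initialisation of γ and β are illustrative assumptions.

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize a batch of hidden activations h (shape: batch_size x features),
    then rescale with gamma and shift with beta."""
    # Step 1: normalize to zero mean and unit variance per feature.
    mu = h.mean(axis=0)                      # mean of the hidden activations
    var = h.var(axis=0)                      # variance of the hidden activations
    h_hat = (h - mu) / np.sqrt(var + eps)    # eps prevents division by zero
    # Step 2: rescale (gamma) and shift (beta) -- both are learnable parameters.
    return gamma * h_hat + beta

# Usage: gamma and beta are typically initialised to ones and zeros.
h = np.random.randn(32, 64)                  # a batch of 32 activations, 64 features
out = batch_norm(h, gamma=np.ones(64), beta=np.zeros(64))
```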
In short, "shallow" neural networks is a term used to describe NN that usually have only
one hidden layer as opposed to deep NN which have several hidden layers, often of
various types. ... In lay man terms : Shallow means : NOT DEEP that is no of hidden
layer = 1
A shallow network has fewer hidden layers. While there are studies showing that a shallow
network can fit any function, it would need to be very wide, which causes the number of
parameters to increase greatly.
There are quite conclusive results that a deep network can fit functions better, with fewer
parameters, than a shallow network.
Besides an input layer and an output layer, a neural network has intermediate layers,
which might also be called hidden layers. They might also be called encoders.
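As a rough illustration of the parameter-count argument, the sketch below compares a single very wide hidden layer with a stack of narrower ones; the specific layer widths are arbitrary assumptions chosen only to make the comparison concrete.

```python
def dense_params(layer_sizes):
    """Number of weights and biases in a fully-connected network
    with the given layer sizes (input, hidden..., output)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# A shallow but very wide network vs. a deeper, narrower one.
shallow = dense_params([784, 4096, 10])          # one fat hidden layer
deep = dense_params([784, 256, 128, 64, 10])     # several smaller hidden layers
print(shallow, deep)   # the shallow network uses far more parameters
```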
5. Convolutional Networks
An RGB image can be separated into its three colour planes: Red, Green, and Blue. There are
a number of such colour spaces in which images exist: Grayscale, RGB, HSV, CMYK, etc.
You can imagine how computationally intensive things would get once the images reach
dimensions, say 8K (7680×4320). The role of the ConvNet is to reduce the images into a
form which is easier to process, without losing features which are critical for getting a
good prediction. This is important when we want to design an architecture that is not only
good at learning features but is also scalable to massive datasets.
Convolution Layer — The Kernel
Convoluting a 5x5x1 image with a 3x3x1 kernel to get a 3x3x1 convolved feature
Image Dimensions = 5 (Height) x 5 (Breadth) x 1 (Number of channels, eg. RGB)
In the above demonstration, the green section represents our 5x5x1 input image, I. The
element involved in carrying out the convolution operation in the first part of a
Convolutional Layer is called the Kernel/Filter, K, represented in yellow. We have selected
K as a 3x3x1 matrix:
Kernel/Filter, K =
1 0 1
0 1 0
1 0 1
The Kernel shifts 9 times because of Stride Length = 1 (Non-Strided), every time
performing an element-wise multiplication between K and the portion P of the image over
which the kernel is hovering, and summing the results.
In the case of images with multiple channels (e.g. RGB), the Kernel has the same depth as
that of the input image. The element-wise multiplication is performed between each kernel
channel and the corresponding image channel ([K1, I1]; [K2, I2]; [K3, I3]), and all the
results are summed with the bias to give us a squashed one-depth-channel Convolved
Feature Output.
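To make the sliding-window operation concrete, here is a small NumPy sketch that convolves a 5x5x1 image with the 3x3x1 kernel above at stride 1; the example image values are arbitrary assumptions.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image, taking an element-wise product
    and sum at each position (valid padding, single channel)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # the kernel "hovers" over patch P
    return out

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
image = np.arange(25).reshape(5, 5)              # an arbitrary 5x5x1 example image
print(convolve2d(image, kernel).shape)           # (3, 3): the 3x3x1 convolved feature
```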
The objective of the Convolution Operation is to extract features such as edges from the
input image. ConvNets need not be limited to only one Convolutional Layer.
Conventionally, the first ConvLayer is responsible for capturing Low-Level features such as
edges, color, and gradient orientation. With added layers, the architecture adapts to the
High-Level features as well, giving us a network which has a wholesome understanding of
the images in the dataset, similar to how we would.
There are two types of results to the operation — one in which the convolved feature is
reduced in dimensionality as compared to the input, and the other in which the
dimensionality is either increased or remains the same. This is done by applying Valid
Padding in case of the former, or Same Padding in the case of the latter.
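A quick way to see the difference is the output-size formula out = (n - k + 2p)/s + 1 for input size n, kernel size k, padding p, and stride s. The sketch below checks it for the 5x5 example; the parameter values are illustrative.

```python
def conv_output_size(n, k, padding, stride=1):
    """Spatial output size for input n, kernel k, padding p, stride s."""
    return (n - k + 2 * padding) // stride + 1

n, k = 5, 3
print(conv_output_size(n, k, padding=0))   # Valid padding: 3, dimensionality reduced
print(conv_output_size(n, k, padding=1))   # Same padding: 5, dimensionality preserved
```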
Pooling Layer
Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the
spatial size of the Convolved Feature. This is to decrease the computational power
required to process the data through dimensionality reduction. Furthermore, it is useful
for extracting dominant features which are rotational and positional invariant, thus
maintaining the process of effectively training of the model.
There are two types of Pooling: Max Pooling and Average Pooling. Max Pooling returns
the maximum value from the portion of the image covered by the Kernel. On the other
hand, Average Pooling returns the average of all the values from the portion of the
image covered by the Kernel.
Max Pooling also performs as a Noise Suppressant. It discards the noisy activations
altogether, performing de-noising along with dimensionality reduction. Average Pooling,
on the other hand, simply performs dimensionality reduction as a noise-suppressing
mechanism. Hence, Max Pooling often performs considerably better than Average Pooling.
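For comparison, here is a small NumPy sketch of 2x2 max and average pooling on a single-channel feature map; the feature-map values and window size are illustrative assumptions.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: take the max (or mean) of each size x size window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 2.],
               [7., 2., 9., 8.],
               [1., 5., 4., 3.]])
print(pool2d(fm, mode="max"))       # keeps the strongest activation in each window
print(pool2d(fm, mode="average"))   # averages, preserving overall magnitude
```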
The Convolutional Layer and the Pooling Layer, together form the i-th layer of a
Convolutional Neural Network. Depending on the complexities in the images, the number
of such layers may be increased to capture low-level details even further, but at the
cost of more computational power.
After going through the above process, we have successfully enabled the model to
understand the features. Moving on, we are going to flatten the final output and feed it to a
regular Neural Network for classification purposes.
Classification — Fully Connected Layer (FC Layer)
6. Generative Adversarial Networks
Generative adversarial networks (GANs) are algorithmic architectures that use two neural
networks, pitting one against the other (thus the “adversarial”) in order to generate new,
synthetic instances of data that can pass for real data. They are used widely in image
generation, video generation and voice generation.
GANs were introduced in a paper by Ian Goodfellow and other researchers at the
University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs,
Facebook’s AI research director Yann LeCun called adversarial training “the most
interesting idea in the last 10 years in ML.”
GANs’ potential for both good and evil is huge, because they can learn to mimic any
distribution of data. That is, GANs can be taught to create worlds eerily similar to our
own in any domain: images, music, speech, prose. They are robot artists in a sense, and
their output is impressive – poignant even. But they can also be used to generate fake
media content, and are the technology underpinning Deepfakes.
The discriminator is in a feedback loop with the ground truth of the images, which we know.
The generator is in a feedback loop with the discriminator.
You can think of a GAN as the opposition of a counterfeiter and a cop in a
game of cat and mouse, where the counterfeiter is learning to pass false
notes, and the cop is learning to detect them. Both are dynamic; i.e. the
cop is in training, too (to extend the analogy, maybe the central bank is
flagging bills that slipped through), and each side comes to learn the
other’s methods in a constant escalation.
For MNIST, the discriminator network is a standard convolutional network
that can categorize the images fed to it, a binomial classifier labeling
images as real or fake. The generator is an inverse convolutional network,
in a sense: While a standard convolutional classifier takes an image and
downsamples it to produce a probability, the generator takes a vector of
random noise and upsamples it to an image. The first throws away data
through downsampling techniques like maxpooling, and the second
generates new data.
Both nets are trying to optimize a different and opposing objective function,
or loss function, in a zero-sum game. This is essentially an actor-critic
model. As the discriminator changes its behavior, so does the generator,
and vice versa. Their losses push against each other.
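As a rough sketch of the architectures described above, here is a minimal Keras version of an MNIST-style discriminator and generator; the layer sizes, noise dimension, and activation choices are illustrative assumptions rather than a reference implementation.

```python
from tensorflow.keras import layers, models

NOISE_DIM = 100  # length of the random noise vector fed to the generator

# Discriminator: a standard convolutional network that downsamples an
# image to a single probability of "real" vs "fake".
discriminator = models.Sequential([
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu",
                  input_shape=(28, 28, 1)),
    layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])

# Generator: roughly an "inverse" convolutional network that upsamples
# a noise vector into a 28x28 image.
generator = models.Sequential([
    layers.Dense(7 * 7 * 128, activation="relu", input_shape=(NOISE_DIM,)),
    layers.Reshape((7, 7, 128)),
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),
])
```

In a training loop (not shown), the discriminator would be fitted on batches of real and generated images while the generator is trained through the combined model to fool it, reflecting the opposing loss functions described above.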
7. Semi-supervised Learning
Semi-supervised learning uses a small amount of labeled data and a large amount of unlabeled
data, which provides the benefits of both unsupervised and supervised learning while avoiding
the challenge of finding a large amount of labeled data. That means you can train a model to
label data without having to collect massive amounts of labeled training data first.
The way that semi-supervised learning manages to train the model with less labeled training
data than supervised learning is by using pseudo labeling. This can combine many neural
network models and training methods. The procedure works as follows (a code sketch follows
these steps):
1. Train the model on the small amount of labeled training data, as in ordinary supervised
learning.
2. Then use it with the unlabeled training dataset to predict the outputs, which are the
pseudo labels.
3. Link the labels from the labeled training data with the pseudo labels created in the
previous step.
4. Link the data inputs in the labeled training data with the inputs in the unlabeled data.
5. Then, train the model the same way as you did with the labeled set in the beginning, to
reduce the error and improve the model's accuracy.
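A minimal sketch of this pseudo-labeling loop, assuming a generic scikit-learn-style classifier with fit/predict methods; the choice of model and the absence of any confidence filtering are illustrative simplifications.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_train(X_labeled, y_labeled, X_unlabeled):
    # 1. Train on the small labeled set, as in ordinary supervised learning.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)
    # 2. Predict outputs for the unlabeled data: these are the pseudo labels.
    pseudo_labels = model.predict(X_unlabeled)
    # 3./4. Link the true labels with the pseudo labels and the labeled
    #       inputs with the unlabeled inputs into one training set.
    X_all = np.vstack([X_labeled, X_unlabeled])
    y_all = np.concatenate([y_labeled, pseudo_labels])
    # 5. Retrain the model the same way as before on the combined data.
    model.fit(X_all, y_all)
    return model
```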
A common example is classifying a large collection of text documents. This is the type of
situation where semi-supervised learning is ideal, because it would be nearly impossible to
find a large amount of labeled text documents: it is simply not time-efficient to have a person
read through entire text documents just to assign them a simple classification.
So, semi-supervised learning allows for the algorithm to learn from a small amount of labeled
text documents while still classifying a large amount of unlabeled text documents in the training
data.