Convolutional Neural Networks
Na Lu
Xi’an Jiaotong University
Intuition of CNN
• In the previous section, we dealt with images that were relatively low in resolution, such
as small image patches and small images of hand-written digits.
• In the sparse autoencoder, full connection of all the hidden units to all the input
units was employed.
– On the relatively small images that we were working with (e.g., 8x8
patches, 28x28 images for the MNIST dataset), it was computationally
feasible to learn features on the entire image.
– However, with larger images (e.g., 96x96 images), learning features that
span the entire image (fully connected networks) is very computationally
expensive: about 10^4 input units, and assuming you want to learn 100
features, you would have on the order of 10^6 parameters to learn. The
feedforward and backpropagation computations would also be about
10^2 times slower, compared to 28x28 images. (A rough count is sketched below.)
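As a rough check of these orders of magnitude (my arithmetic, not part of the original slides):

```latex
\underbrace{96 \times 96}_{\text{input units}} = 9216 \approx 10^{4},
\qquad
\underbrace{9216 \times 100}_{\text{weights for 100 learned features}} \approx 10^{6}
```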
Intuition of CNN
• In this section, we will develop methods which
will allow us to scale up these methods to more
realistic datasets that have larger images.
Convolutional Neural Networks
• Key ingredients
– Locally connected networks
– Convolutions
– Pooling
– Local receptive field
– Weight sharing
Locally Connected Networks
• One simple solution to this problem is to restrict the connections between
the hidden units and the input units, allowing each hidden unit to connect to
only a small subset of the input units.
• Specifically, each hidden unit will connect to only a small contiguous region
of pixels in the input.
• For input modalities other than images, there is often also a natural way
to select "contiguous groups" of input units to connect to a single hidden
unit; for example, for audio, a hidden unit might be connected to only
the input units corresponding to a certain time span of the input audio clip.
Locally Connected Networks
• The idea of having locally connected networks also
draws inspiration from how the early visual system is
wired up in biology. Specifically, neurons in the visual
cortex have localized receptive fields (i.e., they respond
only to stimuli in a certain location).
Convolutions
• Natural images have the property of being stationary,
meaning that the statistics of one part of the image are
the same as any other part.
• This suggests that the features that we learn at one part
of the image can also be applied to other parts of the
image, and we can use the same features at all locations.
Convolutions
• Example:
– Suppose we have learned features over small (say 8x8) patches
sampled randomly from the larger image. We can then apply this
learned 8x8 feature detector anywhere in the image.
– Specifically, we can take the learned 8x8 features and
convolve them with the larger image, thus obtaining a different
feature activation value at each location in the image.
Convolutions
• Suppose you have learned features on 8x8 patches sampled from a
96x96 image.
• Suppose further this was done with an autoencoder that has 100
hidden units.
• To get the convolved features, for every 8x8 region of the 96x96
image, run it through the trained sparse autoencoder to get the
feature activations. This results in 100 sets of 89x89 convolved
features.
Convolutions
• Formal illustration
– Given some large r×c images x_large.
– We first train a sparse autoencoder on
small m×n patches x_small sampled from
these images.
– This learns k features f = σ(W^(1) x_small + b^(1))
(where σ is the sigmoid function), given
by the weights W^(1) and biases b^(1) from
the visible units to the hidden units.
– For every m×n patch x_s in the large
image, we compute f_s = σ(W^(1) x_s + b^(1)),
giving us f_convolved, a k×(r−m+1)×(c−n+1)
array of convolved features (see the sketch below).
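A minimal sketch of this step in NumPy, under the assumption that the learned weights are stored as a k×(m·n) matrix W1 with bias vector b1 (names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convolve_features(x_large, W1, b1, m, n):
    """Apply k learned features to every m x n patch of an r x c image.
    W1: (k, m*n) visible-to-hidden weights, b1: (k,) hidden biases.
    Returns an array of shape (k, r - m + 1, c - n + 1)."""
    r, c = x_large.shape
    k = W1.shape[0]
    f = np.zeros((k, r - m + 1, c - n + 1))
    for i in range(r - m + 1):
        for j in range(c - n + 1):
            patch = x_large[i:i + m, j:j + n].reshape(-1)  # flatten the m x n patch
            f[:, i, j] = sigmoid(W1 @ patch + b1)          # k feature activations
    return f

# Example: 100 features learned on 8x8 patches, applied to a 96x96 image
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(100, 64)), np.zeros(100)
feats = convolve_features(rng.random((96, 96)), W1, b1, 8, 8)
print(feats.shape)  # (100, 89, 89)
```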
Pooling
• Problem
– In theory, one could use all the extracted features with a
classifier such as a softmax classifier, but this can be
computationally challenging.
– Consider images of size 96x96 pixels, and suppose we have
learned 400 features over 8x8 inputs.
– Each convolution results in an output of size (96−8+1)×(96−8+1)
= 7,921, and since we have 400 features, this results in a vector
of 89×89×400 = 3,168,400 features per example.
– Learning a classifier with inputs having 3+ million features can
be unwieldy, and can also be prone to over-fitting.
Pooling
• Recall that we decided to obtain convolved features because
images have the "stationarity" property, which implies that features
that are useful in one region are also likely to be useful for other
regions.
• To describe a large image, one natural approach is to aggregate
statistics of these features at various locations. For example, one
could compute the mean (or max) value of a particular feature over
a region of the image.
• The aggregation operation is called pooling, or sometimes mean
pooling or max pooling (depending on the pooling operation
applied).
Pooling
• Formal description
– After obtaining the convolved features, we decide the
size of the region to pool the convolved features over.
– Then, divide the convolved features into disjoint
regions, and take the mean (or maximum) feature
activation over these regions to obtain the pooled
convolved features.
– These pooled features can then be used for
classification (a minimal pooling sketch follows below).
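A minimal mean/max-pooling sketch over disjoint regions, continuing the hypothetical arrays from the convolution sketch above:

```python
import numpy as np

def pool_features(f_convolved, pool_size, op=np.mean):
    """Pool a (k, H, W) array of convolved features over disjoint
    pool_size x pool_size regions with `op` (np.mean or np.max)."""
    k, H, W = f_convolved.shape
    ph, pw = H // pool_size, W // pool_size
    pooled = np.zeros((k, ph, pw))
    for i in range(ph):
        for j in range(pw):
            region = f_convolved[:,
                                 i * pool_size:(i + 1) * pool_size,
                                 j * pool_size:(j + 1) * pool_size]
            pooled[:, i, j] = op(region.reshape(k, -1), axis=1)
    return pooled

# e.g. pool_features(feats, 11, op=np.max) turns (100, 89, 89) into (100, 8, 8)
```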
Pooling
• Pooling for Invariance
– If one chooses the pooling regions to be contiguous areas in the
image and only pools features generated from the same
(replicated) hidden units, then these pooling units will be translation
invariant.
– This means that the same (pooled) feature will be active even
when the image undergoes (small) translations.
Weight Sharing
• A constraint on the network weights, closely tied to convolution.
• For different local regions, use the same weights to detect
the same feature.
The replicated feature approach
(currently the dominant approach for neural networks)
• Use many different copies of the same
feature detector with different positions.
(In the figure, the red connections all have the same weight.)
– Could also replicate across scale and
orientation (tricky and expensive)
– Replication greatly reduces the number of free
parameters to be learned.
• Use several different feature types, each
with its own map of replicated detectors.
– Allows each patch of image to be represented
in several ways.
Backpropagation with weight
constraints
• It’s easy to modify the
backpropagation algorithm to
incorporate linear constraints
between the weights.
• We compute the gradients as usual,
and then modify the gradients so
that they satisfy the constraints.
– So if the weights started off
satisfying the constraints, they
will continue to satisfy them (see the worked example below).
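As a worked statement of this recipe for two tied weights (ε denotes the learning rate; the notation is mine, the recipe is the one described above):

```latex
% To keep w_1 = w_2, we need \Delta w_1 = \Delta w_2.
% Compute both gradients as usual, then apply their sum to both weights:
\Delta w_1 = \Delta w_2
  = -\varepsilon \left( \frac{\partial E}{\partial w_1}
                      + \frac{\partial E}{\partial w_2} \right)
```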
What does replicating the feature detectors
achieve?
• Equivariant activities: Replicated features do not make the neural activities
invariant to translation. The activities are equivariant.
[Figure: a translated image produces a correspondingly translated
representation by the active neurons.]
• Invariant knowledge: If a feature is useful in some locations during training,
detectors for that feature will be available in all locations during testing.
Pooling the outputs of replicated
feature detectors
• Get a small amount of translational invariance at each level by averaging
four neighboring replicated detectors to give a single output to the next
level.
– This reduces the number of inputs to the next layer of feature
extraction, thus allowing us to have many more different feature maps.
– Taking the maximum of the four works slightly better.
• Problem: After several levels of pooling, we have lost information about
the precise positions of things.
– This makes it impossible to use the precise spatial relationships
between high-level parts for recognition.
Question
Convolutional Neural Networks
• Compared to standard feedforward neural
networks with similarly sized layers:
– CNNs have far fewer connections and parameters
– So they are easier to train
– While their theoretically best performance is likely to
be only slightly worse.
Convolutional Neural Networks
• One successful application of CNNs:
LeNet
LeNet
• Yann LeCun and his collaborators developed a really good
recognizer for handwritten digits by using backpropagation in a
feedforward net with:
– Many hidden layers
– Many maps of replicated units in each layer.
– Pooling of the outputs of nearby replicated units.
– A wide net that can cope with several characters at once even if
they overlap.
– A clever way of training a complete system, not just a recognizer.
• This net was used for reading ~10% of the checks in North America.
• See the impressive demos of LeNet at [Link]
The architecture of LeNet5
[Figure: LeNet-5 architecture, with layers C1, S2, C3, S4, F5, F6, F7]
LeNet
• LeNet-5 has seven layers.
• Input: 32×32 pixel image. The largest character is 20×20. (Note that all
important information should be in the center of the receptive field of the
highest level feature detectors.)
• C1 and C3 are convolutional layers
• S2 and S4 are subsampling layers
• F5, F6 and F7 (the output layer) are fully connected layers
LeNet
• C1: Convolutional layer with 6 feature maps of size 28×28: C1_k (k = 1…6)
• Each unit of C1 has a 5×5 receptive field in the input layer
• Properties: local connection, convolution, and shared weights
• C1 layer has (5×5+1)×6 = 156 parameters to learn
• C1 layer has 28×28×(5×5+1)×6 = 122,304 connections
• If it were fully connected, there would be (32×32+1)×(28×28)×6 = 4,821,600 parameters
LeNet
• S2: Subsampling layer with 6 feature maps of size 14×14
• Each unit in S2 has a 2×2 non-overlapping receptive field in C1 as input
• Operation: add the 2×2 inputs, then apply a weight and a bias, followed
by a sigmoid function
• Layer S2 has 6×2 = 12 trainable parameters
• Number of connections: 14×14×(2×2+1)×6 = 5,880
LeNet
• C3: Convolutional layer with 16 feature maps of size 10×10
• Each unit in each feature map is connected to several 5×5 neighborhoods
at identical locations in a subset of S2’s feature maps
• Layer C3 has 1516 trainable parameters
• 151,600 connections
LeNet
• S4: Subsampling layer with 16 feature maps of size 5×5
• Each unit in S4 is connected to the corresponding 2×2 receptive field in C3
• Layer S4 has 16×2=32 trainable parameters
• 5×5×(2×2+1) ×16=2000 connections
LeNet
• F5: Convolutional layer with 120 feature maps of size 1×1
• Each unit in F5 is connected to all 16 5×5 receptive fields in S4
• Layer F5 has 120×(16×25+1)=48120 trainable parameters and
connections
• Note that Layer F5 is fully connected
LeNet
• F6: Fully connected layer with 84 units
• Each unit in F6 is fully connected to all units in F5
• Layer F6 has 84×(120+1) = 10,164 trainable parameters and connections
• Output layer: 10 RBF (radial basis function) units, one for each digit
• Weight update: backpropagation (the quoted parameter counts are re-derived in the sketch below)
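A quick sketch that re-derives the trainable-parameter counts quoted on the preceding slides. The C3 figure assumes the partial S2→C3 connection table from LeCun et al. (1998), where six maps see 3 of the S2 maps, nine see 4, and one sees all 6:

```python
# Trainable-parameter counts per layer, matching the numbers quoted above
c1 = (5 * 5 + 1) * 6                                                   # 156
s2 = 6 * 2                                                             # 12 (one coefficient + one bias per map)
c3 = 6 * (3 * 5 * 5 + 1) + 9 * (4 * 5 * 5 + 1) + 1 * (6 * 5 * 5 + 1)   # 1516
s4 = 16 * 2                                                            # 32
f5 = 120 * (16 * 5 * 5 + 1)                                            # 48120
f6 = 84 * (120 + 1)                                                    # 10164
print(c1, s2, c3, s4, f5, f6)
```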
Performance
• 60,000 original training examples:
Test error: 0.95%
• 540,000 artificially distorted examples
+ 60,000 original training examples:
Test error: 0.8%
The 82 errors
made by LeNet5
Notice that most of the
errors are cases that
people find quite easy.
The human error rate is
probably 20 to 30 errors
but nobody has had the
patience to measure it.
Priors and Prejudice
• We can put our prior knowledge about the task into
the network by designing appropriate:
– Connectivity.
– Weight constraints.
– Neuron activation functions.
• This is less intrusive than hand-designing the features.
– But it still prejudices the network towards the particular
way of solving the problem that we had in mind.
• Alternatively, we can use our prior knowledge to create a
whole lot more training data.
– This may require a lot of work (Hofman & Tresp, 1993).
– It may make learning take much longer.
• This allows optimization to discover clever ways of using
the multi-layer network that we did not think of.
– And we may never fully understand how it does it.
The brute force approach
• LeNet uses knowledge about the invariances to design:
– the local connectivity
– the weight-sharing
– the pooling.
• This achieves about 80 errors.
– This can be reduced to about 40 errors by using many
different transformations of the input and other tricks
(Ranzato 2008).
• Ciresan et al. (2010) inject knowledge of invariances by
creating a huge amount of carefully designed extra
training data:
– For each training image, they produce many new training
examples by applying many different transformations.
– They can then train a large, deep, dumb net on a GPU
without much overfitting.
• They achieve about 35 errors.
The errors made by the Ciresan et al.
net
The top printed digit is the
right answer. The bottom two
printed digits are the
network’s best two guesses.
The right answer is almost
always in the top 2 guesses.
With model averaging they
can now get about 25 errors.
How to detect a significant drop in the
error rate
• Is 30 errors in 10,000 test cases significantly better than 40 errors?
– It all depends on the particular errors!
– The McNemar test uses the particular errors and can be much more
powerful than a test that just uses the number of errors.
Example A (significant difference):
                  model 1 wrong   model 1 right
  model 2 wrong        29               1
  model 2 right        11             9959

Example B (not convincing, same totals):
                  model 1 wrong   model 1 right
  model 2 wrong        15              15
  model 2 right        25             9945
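A sketch of the exact (binomial) McNemar test applied to the discordant cells of tables like those above; SciPy is assumed, and the function name is mine:

```python
from scipy.stats import binom

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar test on the discordant pairs.
    b: cases model 1 got right but model 2 got wrong
    c: cases model 1 got wrong but model 2 got right"""
    n, k = b + c, min(b, c)
    return min(1.0, 2 * binom.cdf(k, n, 0.5))

print(mcnemar_exact_p(1, 11))   # Example A: p ~ 0.006, clearly significant
print(mcnemar_exact_p(15, 25))  # Example B: p ~ 0.15, not convincing
```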
From hand-written digits to 3-D objects
• Recognizing real objects in color photographs
downloaded from the web is much more complicated than
recognizing hand-written digits:
– A hundred times as many classes (1000 vs 10)
– A hundred times as many pixels (256 x 256 color vs 28
x 28 gray)
– Two-dimensional images of three-dimensional scenes.
– Cluttered scenes requiring segmentation.
– Multiple objects in each image.
• Will the same type of convolutional neural network work?
ImageNet
• 15M images
• 22K categories
• Images collected from Web
• Human labelers (Amazon’s Mechanical Turk crowd
sourcing)
• RGB images
• Variable resolution
ImageNet
• Classification goals:
– Make 1 guess about the label (Top-1 error)
– Make 5 guesses about the label (Top-5 error)
The ILSVRC-2012 competition on
ImageNet
• The dataset has 1.2 million high-resolution training images.
• The classification task:
– Get the “correct” class in your top 5 bets. There are 1000
classes.
• The localization task:
– For each bet, put a box around the object. Your box
must have at least 50% overlap with the correct box.
• Some of the best existing computer vision methods were
tried on this dataset by leading computer vision groups from
Oxford, INRIA, XRCE, …
– These computer vision systems use complicated multi-stage
pipelines.
– The early stages are typically hand-tuned by
optimizing a few parameters.
ImageNet Large Scale Visual Recognition Challenge
Examples from the test set (with the network’s guesses)
[Figure: sample test images with the network’s top guesses]
Error rates on the ILSVRC-2012 competition
                                                   classification   classification & localization
• University of Toronto (Alex Krizhevsky)               16.4%             34.1%
• University of Tokyo                                   26.1%             53.6%
• Oxford University Computer Vision Group               26.9%             50.0%
• INRIA (French national research institute in CS)
  + XRCE (Xerox Research Center Europe)                 27.0%
• University of Amsterdam                               29.5%
A neural network for ImageNet
• Alex Krizhevsky (NIPS 2012) developed a very deep
convolutional neural net of the type pioneered by Yann
LeCun. Its architecture was:
– 7 hidden layers not counting some max pooling layers.
– The early layers were convolutional.
– The last two layers were globally connected.
• The activation functions were:
– Rectified linear units in every hidden layer. These train much
faster and are more expressive than logistic units.
– Competitive normalization to suppress hidden activities when
nearby units have stronger activities. This helps with
variations in intensity.
The Architecture
• Typical nonlinearities: f(x) = tanh(x) or f(x) = (1 + e^(-x))^(-1)
• However, Rectified Linear Units (ReLU) are used: f(x) = max(0, x)
• Empirical observation: Deep convolutional neural networks with
ReLUs train several times faster than their equivalents with tanh
units.
• [Figure: a four-layer convolutional neural network with ReLUs (solid line)
reaches a 25% training error rate on CIFAR-10 six times faster than
an equivalent network with tanh neurons (dashed line).]
• Note: The CIFAR-10 dataset consists of 60,000 32x32 colour
images in 10 classes, with 6,000 images per class. There
are 50,000 training images and 10,000 test images.
The Architecture
• The first convolutional layer filters the 224×224×3 input image with 96
kernels of size 11×11×3 with a stride of 4 pixels, which is the distance
between the receptive field centers of neighboring neurons in the kernel
map.
• The pooling layer performs a form of non-linear down-sampling. Max-pooling partitions
the input image into a set of rectangles and, for each such sub-region,
outputs the maximum value.
The Architecture
• Trained with stochastic gradient descent
• On two NVIDIA GTX 580 3GB GPUs
• For about a week
• 650,000 neurons
• 60,000,000 parameters
• 630,000,000 connections
• 5 convolutional layers, 3 fully connected layers
• Final feature layer is 4096 dimensional
Data Augmentation
• The easiest and most common method to reduce
overfitting on image data is to artificially enlarge the
dataset using label-preserving transformations.
• Two forms of image transformation have been
employed (a crop/flip sketch follows the list):
– Image translations and horizontal reflections
– Changing RGB intensities
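A minimal sketch of the first form (random crops as translations plus horizontal reflections), assuming images stored as (H, W, 3) NumPy arrays; the RGB-intensity (PCA) perturbation is left out:

```python
import numpy as np

def augment(image, crop=224, rng=None):
    """image: (H, W, 3) array with H, W >= crop.
    Returns a randomly translated (cropped) and possibly mirrored patch."""
    rng = rng or np.random.default_rng()
    H, W, _ = image.shape
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:          # horizontal reflection
        patch = patch[:, ::-1]
    return patch

# e.g. augment a 256x256x3 image into a 224x224x3 training example
```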
Dropout
• Combining different models can be very useful (mixture of experts,
majority voting, boosting, etc.)
• However, training many different models could be very time
consuming.
• The solution:
– Dropout: set the output of each hidden neuron to zero with probability 0.5
Dropout
• Dropout: set the output of each hidden neuron to zero with probability 0.5 (a minimal sketch follows after this list)
• The neurons which are dropped out do not contribute to the forward pass and do
not participate in backpropagation.
• So every time an input is presented, the neural network samples a different
architecture, but all these architectures share weights.
• This technique reduces complex co-adaptations of neurons, since a neuron
cannot rely on the presence of particular other neurons.
• It is forced to learn more robust features that are useful in conjunction with many
different random subsets of the other neurons.
• Without dropout, this network exhibits substantial overfitting.
• Dropout roughly doubles the number of iterations required to converge.
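A minimal sketch of dropout in the forward pass, assuming NumPy activations; the test-time branch applies the “halve the outgoing weights” rule discussed later (scaling the activations by 1 − p is equivalent). The names are illustrative:

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True, rng=None):
    """h: hidden activations. Training: zero each unit with probability p.
    Test: scale by (1 - p), i.e. the 'halve the outgoing weights' rule."""
    if train:
        rng = rng or np.random.default_rng()
        mask = rng.random(h.shape) >= p   # keep each unit with probability 1 - p
        return h * mask
    return h * (1.0 - p)
```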
Tricks that significantly improve
generalization
• Train on random 224x224 patches from the 256x256
images to get more data. Also use left-right reflections of the
images.
• At test time, combine the opinions from ten different
patches: the four 224x224 corner patches plus the central
224x224 patch, plus the reflections of those five patches.
• Use “dropout” to regularize the weights in the globally
connected layers (which contain most of the parameters).
– Dropout means that half of the hidden units in a layer
are randomly removed for each training example.
– This stops hidden units from relying too much on other
hidden units.
Some more examples
of how well the deep
net works for object
recognition.
Results on the test data:
Top-1 error rate: 37.5%
Top-5 error rate: 17.0%
ILSVRC-2012 competition (Top-5 error rates):
This net: 15.3%
2nd best team: 26.2%
The first convolutional layer
• 96 convolutional kernels of size 11×11×3 learned by the first
convolutional layer on the 224×224×3 input images.
• The top 48 kernels were learned on GPU 1 while the bottom 48
kernels were learned on GPU 2.
• They look like Gabor wavelets, ICA filters, …
The hardware required for Alex’s net
• He uses a very efficient implementation of convolutional nets
on two Nvidia GTX 580 Graphics Processor Units (over 1000
fast little cores)
– GPUs are very good for matrix-matrix multiplies.
– GPUs have very high bandwidth to memory.
– This allows him to train the network in a week.
– It also makes it quick to combine results from 10 patches at test
time.
• We can spread a network over many cores if we can
communicate the states fast enough.
• As cores get cheaper and datasets get bigger, big neural nets
will improve faster than old-fashioned (i.e. pre Oct 2012)
computer vision systems.
Finding roads in high-resolution images
• Vlad Mnih (ICML 2012) used a non-convolutional
net with local fields and multiple layers of rectified
linear units to find roads in cluttered aerial images.
– It takes a large image patch and predicts a binary road
label for the central 16x16 pixels.
– There is lots of labeled training data available for this task.
• The task is hard for many reasons:
– Occlusion by buildings, trees and cars.
– Shadows, lighting changes
– Minor viewpoint changes
• The worst problems are incorrect labels:
– Badly registered maps
– Arbitrary decisions about what counts as a road.
• Big neural nets trained on big image patches with millions
of examples are the only hope.
The best road-finder
on the planet?
Two ways to average models
• MIXTURE: We can combine models by averaging their
output probabilities:
Model A:   .3   .2   .5
Model B:   .1   .8   .1
Combined:  .2   .5   .3
• PRODUCT: We can combine models by taking the geometric
means of their output probabilities:
Model A:   .3   .2   .5
Model B:   .1   .8   .1
Combined: (.03  .16  .05) / sum
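A quick numerical check of the two rules above (NumPy, my own variable names; the product rule multiplies the probabilities and renormalizes):

```python
import numpy as np

a = np.array([0.3, 0.2, 0.5])   # Model A
b = np.array([0.1, 0.8, 0.1])   # Model B

mixture = (a + b) / 2           # [0.2, 0.5, 0.3]
product = a * b                 # [0.03, 0.16, 0.05]
product /= product.sum()        # renormalize: ~[0.125, 0.667, 0.208]
print(mixture, product)
```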
Dropout: An efficient way to average
many large neural nets
([Link])
• Consider a neural net with one
hidden layer.
• Each time we present a training
example, we randomly omit
each hidden unit with
probability 0.5.
• So we are randomly sampling
from 2^H different architectures.
– All architectures share
weights.
Dropout as a form of model averaging
• We sample from 2^H models. So only a few of
the models ever get trained, and they only get
one training example.
– This is as extreme as bagging can get.
• The sharing of the weights means that every
model is very strongly regularized.
– It’s a much better regularizer than L2 or L1
penalties that pull the weights towards zero.
But what do we do at test time?
• We could sample many different architectures
and take the geometric mean of their output
distributions.
• It is better to use all of the hidden units, but to
halve their outgoing weights.
– This exactly computes the geometric mean of
the predictions of all 2^H models.
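A sketch of why this works for a softmax output layer on top of one dropped-out hidden layer (my notation): each hidden unit is present in exactly half of the 2^H masks, so the average of the models’ logits equals the halved-weight logit, and the renormalized geometric mean of the softmax outputs is the softmax of that average.

```latex
z^{(m)} = \sum_i w_i h_i \, \delta_i^{(m)}, \quad \delta_i^{(m)} \in \{0, 1\},
\qquad
\frac{1}{2^{H}} \sum_{m=1}^{2^{H}} z^{(m)} \;=\; \frac{1}{2} \sum_i w_i h_i
```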
What if we have more hidden layers?
• Use dropout of 0.5 in every layer.
• At test time, use the “mean net” that has all the outgoing
weights halved.
– This is not exactly the same as averaging all the
separate dropped-out models, but it’s a pretty good
approximation, and it’s fast.
• Alternatively, run the stochastic model several times on
the same input.
– This gives us an idea of the uncertainty in the answer.
What about the input layer?
• It helps to use dropout there too, but with a
higher probability of keeping an input unit.
– This trick is already used by the “denoising
autoencoders” developed by Pascal Vincent,
Hugo Larochelle and Yoshua Bengio.
How well does dropout work?
• The record breaking object recognition net developed by
Alex Krizhevsky uses dropout and it helps a lot.
• If your deep neural net is significantly overfitting, dropout
will usually reduce the number of errors by a lot.
– Any net that uses “early stopping” can do better by
using dropout (at the cost of taking quite a lot longer
to train).
• If your deep neural net is not overfitting you should be
using a bigger one!
Another way to think about dropout
• If a hidden unit knows which other hidden units
are present, it can co-adapt to them on the training data.
– But complex co-adaptations are likely to go wrong on new
test data.
– Big, complex conspiracies are not robust.
• If a hidden unit has to work well with
combinatorially many sets of co-workers, it is more
likely to do something that is individually useful.
– But it will also tend to do something that is
marginally useful given what its co-workers achieve.
Convolutional Neural Networks
END