Deep Learning
Yann LeCun
Center for Data Science & Courant Institute, NYU
& Facebook AI Research
yann@cs.nyu.edu
http://yann.lecun.com
Deep Learning = Learning Representations/Features
[Diagram: Trainable Feature Extractor → Trainable Classifier]
This Basic Model has not evolved much since the 50s
[Photo: the Perceptron, built at Cornell in 1960]
The Perceptron was a linear classifier on
top of a simple feature extractor
The vast majority of practical applications
of ML today use glorified linear classifiers
$y = \operatorname{sign}\big(\sum_i W_i F_i(X) + b\big)$
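To make the split between the fixed feature extractor and the trainable linear part concrete, here is a minimal numpy sketch of a linear machine in the spirit of this slide: a fixed (here random) feature extractor F, and a linear classifier trained with the classic perceptron update. The random projection, the sizes, and the toy data are illustrative assumptions, not the slide's own example.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((20, 5))                  # fixed feature extractor (random projection)

def features(x):                                  # F(X): never trained
    return np.sign(P @ x)

def perceptron_train(X, y, epochs=20):
    """Learn only the linear part: y_hat = sign(sum_i W_i F_i(X) + b)."""
    w, b = np.zeros(P.shape[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):                    # t in {-1, +1}
            if t * (w @ features(x) + b) <= 0:    # misclassified -> perceptron update
                w += t * features(x)
                b += t
    return w, b

# Toy data, linearly separable in the input space.
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.1)
w, b = perceptron_train(X, y)
preds = np.array([np.sign(w @ features(x) + b) for x in X])
print((preds == y).mean())                        # training accuracy
```

Only w and b are learned here; the feature extractor never changes, which is exactly the limitation that trainable feature hierarchies remove.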
Linear Machines
And their limitations
Architecture of Mainstream Pattern Recognition Systems
[Diagram: the traditional pipeline]
Low-level features (fixed: SIFT, HoG) → Mid-level features (unsupervised: K-means, Sparse Coding) → Pooling → Classifier (supervised)
Deep Learning = Learning Hierarchical Representations
It's deep if it has more than one stage of non-linear feature transformation
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Trainable Feature Hierarchy
The ventral (recognition) pathway in the visual cortex has multiple stages
Retina - LGN - V1 - V2 - V4 - PIT - AIT ....
Lots of intermediate representations
How can we make all the modules trainable and get them to learn
appropriate representations?
Three Types of Deep Architectures
Purely Supervised
Initialize parameters randomly
Train in supervised mode
typically with SGD, using backprop to compute gradients
Used in most practical systems for speech and image
recognition
Unsupervised, layerwise + supervised classifier on top
Train each layer unsupervised, one after the other
Train a supervised classifier on top, keeping the other layers
fixed
Good when very few labeled samples are available
Unsupervised, layerwise + global supervised fine-tuning
Train each layer unsupervised, one after the other
Add a classifier layer, and retrain the whole thing supervised
Good when label set is poor (e.g. pedestrian detection)
Unsupervised pre-training often uses regularized auto-encoders
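As a concrete illustration of the third recipe, here is a minimal sketch of greedy layer-wise unsupervised pre-training with auto-encoders, followed by global supervised fine-tuning. PyTorch, the layer sizes, the tanh non-linearity, the learning rates, and the toy data are all illustrative assumptions, not the settings of any system mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_layer(layer, data, epochs=10, lr=1e-2):
    """Train `layer` as the encoder of a small auto-encoder, then return its codes."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        recon = decoder(torch.tanh(layer(data)))        # encode, then reconstruct
        loss = F.mse_loss(recon, data)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.tanh(layer(data))                  # codes fed to the next layer

# Toy data: plenty of unlabeled samples, few labeled ones (dimension 256, 10 classes).
x_unlabeled = torch.randn(1000, 256)
x_labeled, y_labeled = torch.randn(100, 256), torch.randint(0, 10, (100,))

# 1) Unsupervised, layer-wise pre-training of each stage, one after the other.
layers = [nn.Linear(256, 128), nn.Linear(128, 64)]
codes = x_unlabeled
for layer in layers:
    codes = pretrain_layer(layer, codes)

# 2) Add a classifier layer and retrain the whole thing supervised (fine-tuning).
model = nn.Sequential(layers[0], nn.Tanh(), layers[1], nn.Tanh(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(50):
    loss = F.cross_entropy(model(x_labeled), y_labeled)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Freezing `layers` after step 1 and training only the final linear classifier would give the second recipe (unsupervised layer-wise training plus a supervised classifier on top); skipping step 1 entirely gives the purely supervised recipe.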
Do we really need deep architectures?
A deep architecture trades space for time (or breadth for depth)
more layers (more sequential computation),
but less hardware (less parallel computation).
Example 1: N-bit parity
requires N-1 XOR gates in a tree of depth log(N).
Even easier if we use threshold gates
requires an exponential number of gates if we restrict ourselves
to 2 layers (DNF formula with exponential number of minterms).
Example 2: circuit for addition of 2 N-bit binary numbers
Requires O(N) gates, and O(N) layers using N one-bit adders with
ripple carry propagation.
Requires lots of gates (some polynomial in N) if we restrict
ourselves to two layers (e.g. Disjunctive Normal Form).
Bad news: almost all boolean functions have a DNF formula with
an exponential number of minterms O(2^N).....
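A quick way to see the space/time trade-off in Example 1 is to count gates explicitly. The sketch below (my illustration, not from the slides) computes N-bit parity with a balanced XOR tree, counts gates and depth, and compares with the 2^(N-1) minterms a two-layer DNF realization would need.

```python
# N-bit parity: a balanced XOR tree uses N-1 gates at depth ~log2(N);
# a two-layer (DNF) circuit needs one AND term per odd-parity pattern, i.e. 2^(N-1).
def parity_tree(bits):
    """XOR the bits pairwise in a balanced tree; returns (parity, gates, depth)."""
    gates, depth, level = 0, 0, list(bits)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        if len(level) % 2:                  # odd leftover bit passes through
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], gates, depth

bits = [1, 0, 1, 1, 0, 1, 0, 0]             # N = 8
p, gates, depth = parity_tree(bits)
print(p, gates, depth)                      # 0, 7 gates (= N-1), depth 3 (= log2 N)
print(2 ** (len(bits) - 1))                 # 128 minterms for the 2-layer DNF
```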
Which Models are Deep?
$-\log P(X, Y, Z \mid W) \propto E(X, Y, Z, W) = \sum_i E_i(X, Y, Z, W_i)$
[Diagram: a factor graph with factors E1(X1, Y1), E2(X2, Z1, Z2), E3(Z2, Y1), E4 over input variables X1, X2, latent variables Z1, Z2, Z3, and output variables Y1, Y2]
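To make the factorized-energy notation concrete, here is a toy Python sketch: the total energy is the sum of per-factor energies, probabilities follow a Gibbs distribution P ∝ exp(−E), and for a tiny discrete model the latent variables Z can be marginalized by brute force. The specific variables and the quadratic/absolute-value energies are invented for illustration; they are not taken from the slides.

```python
import itertools
import math

def E1(x1, y1): return (x1 - y1) ** 2
def E2(x2, z1, z2): return (x2 - z1 * z2) ** 2
def E3(z2, y1): return abs(z2 - y1)

def total_energy(x1, x2, z1, z2, y1):
    """E(X, Y, Z, W) as a sum of per-factor energies E_i."""
    return E1(x1, y1) + E2(x2, z1, z2) + E3(z2, y1)

# Marginalize the binary latent variables Z and normalize over binary Y1
# to get P(Y1 | X); exact enumeration is feasible for a toy this small.
x1, x2 = 1, 0
scores = {}
for y1 in (0, 1):
    scores[y1] = sum(math.exp(-total_energy(x1, x2, z1, z2, y1))
                     for z1, z2 in itertools.product((0, 1), repeat=2))
Z = sum(scores.values())
print({y: s / Z for y, s in scores.items()})   # posterior over Y1
```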
No generalization bounds?
Actually, the usual VC bounds apply: most deep learning
systems have a finite VC dimension
We don't have tighter bounds than that.
But then again, how many bounds are tight enough to be
useful for model selection?
Deep Learning has been the hottest topic in speech recognition in the last 2 years
A few long-standing performance records were broken with deep
learning methods
Microsoft and Google have both deployed DL-based speech
recognition systems in their products
Microsoft, Google, IBM, Nuance, AT&T, and all the major academic
and industrial players in speech recognition have projects on deep
learning
Deep Learning is the hottest topic in Computer Vision
Feature engineering is the bread-and-butter of a large portion of
the CV community, which creates some resistance to feature
learning
But the record holders on ImageNet and Semantic Segmentation
are convolutional nets
Deep Learning is becoming hot in Natural Language Processing
Deep Learning/Feature Learning in Applied Mathematics
The connection with Applied Math is through sparse coding,
non-convex optimization, stochastic gradient algorithms, etc...
In Many Fields, Feature Learning Has Caused a Revolution
(methods used in commercially deployed systems)
[Chart: learning methods arranged along a SHALLOW → DEEP axis, split into supervised vs. unsupervised and neural vs. probabilistic models: GMM, SVM, Decision Tree, Sparse Coding, BayesNP (shallow); RBM, DBN, DBM, Conv. Net (deep)]
What Are Good Features?
Discovering the Hidden Structure in High-Dimensional Data
The manifold hypothesis
[Diagram: an image of a face fed to an Ideal Feature Extractor, which outputs a low-dimensional vector of explanatory factors: face/not face, pose, lighting, expression, ...]
Disentangling factors of variation
[Diagram: the Ideal Feature Extractor maps points from pixel space (Pixel 1, Pixel 2, ..., Pixel n) to a space whose axes are the underlying factors of variation, e.g. view and expression]
Data Manifold & Invariance:
Some variations must be eliminated
[Diagram: Input → Non-Linear Function → unstable/non-smooth high-dimensional features → Pooling / Aggregation → stable/invariant features]
Non-Linear Expansion → Pooling
[Diagram: non-linear expansion and disentangling into high-dimensional features, followed by pooling / aggregation]
Sparse Non-Linear Expansion → Pooling
[Diagram: clustering, quantization, or sparse coding as the expansion stage, followed by pooling / aggregation]
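Here is a toy numpy sketch of the expansion-then-pooling idea on these slides: a sparse non-linear expansion (thresholded random projections standing in for a learned dictionary) applied to overlapping patches of a 1-D signal, followed by max-pooling across positions. The dictionary, the threshold, and the signal are illustrative assumptions, not the slides' own example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 16))        # random "dictionary": 16-sample patches -> 64 features

def expand(patch, threshold=2.0):
    """Sparse non-linear expansion of one patch (rectification with a threshold)."""
    return np.maximum(W @ patch - threshold, 0.0)

def pooled_code(signal, patch=16):
    """Expand every overlapping patch, then max-pool each feature across positions."""
    codes = [expand(signal[i:i + patch]) for i in range(len(signal) - patch + 1)]
    return np.max(codes, axis=0)

x = rng.standard_normal(64)
x_shift = np.roll(x, 2)                  # a small translation of the input

rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(a)
print(rel(pooled_code(x), pooled_code(x_shift)))   # typically small: pooled code is ~stable
print(rel(expand(x[:16]), expand(x_shift[:16])))   # typically much larger: un-pooled code is not
```

The pooled code moves far less under a small shift of the input than any single un-pooled code does, which is the kind of invariance the pooling/aggregation stage is meant to provide.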
Overall Architecture:
Normalization → Filter Bank → Non-Linearity → Pooling
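Below is a minimal sketch of one such stage in the order listed above. PyTorch and the particular choices (local response normalization, 7x7 learned filters, ReLU, 2x2 max pooling) are my illustrative assumptions rather than the exact components used in the referenced papers.

```python
import torch
import torch.nn as nn

stage = nn.Sequential(
    nn.LocalResponseNorm(size=3),                            # normalization
    nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3),    # filter bank (learned convolutions)
    nn.ReLU(),                                               # pointwise non-linearity
    nn.MaxPool2d(kernel_size=2, stride=2),                   # spatial pooling / subsampling
)

x = torch.randn(1, 3, 96, 96)      # a dummy batch with one 96x96 RGB image
y = stage(x)
print(y.shape)                     # torch.Size([1, 64, 48, 48])
```

Stacking several such stages and putting a classifier on top yields the multi-stage convolutional architectures covered in the references below.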
REFERENCES
Convolutional Nets
LeCun, Bottou, Bengio and Haffner: Gradient-Based Learning Applied to Document
Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
Jarrett, Kavukcuoglu, Ranzato, LeCun: What is the Best Multi-Stage Architecture for
Object Recognition?, Proc. International Conference on Computer Vision (ICCV'09),
IEEE, 2009
Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala and Yann LeCun: Pedestrian
Detection with Unsupervised Multi-Stage Feature Learning, CVPR 2013
Raia Hadsell, Pierre Sermanet, Marco Scoffier, Ayse Erkan, Koray Kavukcuoglu, Urs
Muller and Yann LeCun: Learning Long-Range Vision for Autonomous Off-Road Driving,
Journal of Field Robotics, 26(2):120-144, February 2009
Burger, Schuler, Harmeling: Image Denoising: Can Plain Neural Networks Compete
with BM3D?, Computer Vision and Pattern Recognition (CVPR), 2012
REFERENCES
Applications of RNNs
Mikolov, Statistical language models based on neural networks, PhD thesis, 2012
Boden, A guide to RNNs and backpropagation, Tech Report, 2002
Hochreiter & Schmidhuber, Long short-term memory, Neural Computation, 1997
Graves, Offline Arabic handwriting recognition with multidimensional neural networks,
Springer, 2012
Graves, Speech recognition with deep recurrent neural networks, ICASSP 2013
REFERENCES
Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine
Learning, 2(1), pp.1-127, 2009.
Practical guide
Y. LeCun et al. Efficient BackProp, Neural Networks: Tricks of the Trade, 1998
L. Bottou, Stochastic gradient descent tricks, Neural Networks, Tricks of the Trade
Reloaded, LNCS 2012.