Unit 5e - Autoencoders
Chapter 14 of the Deep Learning book (Goodfellow et al.)
Autoencoders
Autoencoders are artificial neural networks
capable of learning efficient representations of
the input data, called codings (or latent
representation), without any supervision.
These codings typically have a much lower
dimensionality than the input data, making
autoencoders useful for dimensionality reduction.
Some autoencoders are generative models: they
are capable of randomly generating new data
that looks very similar to the training data.
However, the generated images are usually fuzzy and
not entirely realistic.
Autoencoders
Which of the following number sequences
do you find the easiest to memorize?
40, 27, 25, 36, 81, 57, 10, 73, 19, 68
50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34,
17, 52, 26, 13, 40, 20
Autoencoders
At first glance, it would seem that the first
sequence should be easier, since it is much
shorter.
However, if you look carefully at the second
sequence, you may notice that it follows two
simple rules:
Even numbers are followed by their half.
Odd numbers are followed by their triple plus one.
This is a famous sequence known as the hailstone sequence.
Autoencoders
Once you notice this pattern, the second sequence becomes much easier to memorize than the first, because you only need to memorize the two rules, the first number, and the length of the sequence.
Autoencoders formalized
An autoencoder consists of two parts: an encoder and a decoder.
The encoder transforms the input x into a set of “factors” (the code), i.e. h = f(x); the decoder maps the code back to a reconstruction r = g(h).
Autoencoders formalized
[Figure: input x → encoder f → code h → decoder g → reconstruction r]
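To connect the notation to code, here is a minimal sketch of an undercomplete autoencoder in Keras; the 28×28 input shape, layer sizes, and 30-dimensional code are illustrative assumptions, not something specified on these slides.

# Minimal undercomplete autoencoder sketch (assumed 28x28 grayscale inputs).
import tensorflow as tf
from tensorflow import keras

encoder = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),   # x: 784-dimensional input
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(30),                       # h = f(x): 30-dimensional code
])
decoder = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(30,)),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape((28, 28)),               # r = g(h): reconstruction
])
autoencoder = keras.Sequential([encoder, decoder])

# The target is the input itself: minimize L(x, g(f(x))).
autoencoder.compile(loss="mse", optimizer="adam")
# autoencoder.fit(X_train, X_train, epochs=10)    # X_train is both input and target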
Autoencoders
A question that comes to the mind of every beginner with autoencoders is: “isn’t it just copying the data?”
In practice, there are constraints on the encoder to make sure that it will NOT lead to a solution that simply copies the data.
For example, the code h may be required to have a smaller dimension than the input x (an undercomplete autoencoder).
Autoencoders
[Figure: an undercomplete autoencoder (code smaller than the input) vs. an overcomplete autoencoder (code at least as large as the input)]
Autoencoders
Autoencoders may be thought of as being a special
case of feedforward networks and may be trained
with all the same techniques, typically minibatch SGD.
Unlike general feedforward networks, autoencoders
may also be trained using recirculation (Hinton and
McClelland, 1988), a learning algorithm based on
comparing the activations of the network on the
original input to the activations on the reconstructed
input.
Recirculation is regarded as more biologically
plausible than back-propagation but is rarely used for
machine learning applications.
Undercomplete Autoencoders
Learning an undercomplete representation forces
the autoencoder to capture the most salient
features of the training data.
When the decoder is linear and L is the mean
squared error, an undercomplete autoencoder
learns to span the same subspace as PCA.
Autoencoders with nonlinear encoder function f and
nonlinear decoder function g can thus learn a more
powerful nonlinear generalization of PCA.
But, …
Undercomplete Autoencoders
Unfortunately, if the encoder and decoder are
allowed too much capacity, the autoencoder can
learn to perform the copying task without
extracting useful information about the
distribution of the data.
E.g., an autoencoder with a one-dimensional code and a very powerful nonlinear encoder can learn to map each training example x(i) to the code i.
The decoder can then learn to map these integer indices back to the values of specific training examples.
Regularized Autoencoders
Ideally, choose code size (dimension of h) small
and capacity of encoder f and decoder g based
on complexity of distribution modeled.
Regularized autoencoders: rather than limiting model capacity by keeping the encoder and decoder shallow and the code size small, we can use a loss function that encourages the model to have properties other than simply copying its input to its output, such as:
Sparsity of the representation
Smoothness of the derivatives
Robustness to noise and errors in the data
Sparse Autoencoders
A sparse autoencoder has a training criterion that adds a sparsity penalty Ω(h) on the code h to the reconstruction loss:
L(x, g(f(x))) + Ω(h)
A typical choice is Ω(h) = λ Σ_i |h_i|, which pushes many code units toward zero.
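A sketch of how such a sparsity penalty might be added in Keras, using an L1 activity regularizer on the code layer; the overcomplete 300-unit code and the value of λ are illustrative assumptions.

# Sparse autoencoder sketch: L1 activity regularization on the code layer.
import tensorflow as tf
from tensorflow import keras

sparse_encoder = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(100, activation="relu"),
    # The activity regularizer adds Omega(h) = lambda * sum_i |h_i| to the loss.
    keras.layers.Dense(300, activation="sigmoid",
                       activity_regularizer=keras.regularizers.l1(1e-4)),
])
sparse_decoder = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(300,)),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape((28, 28)),
])
sparse_ae = keras.Sequential([sparse_encoder, sparse_decoder])
sparse_ae.compile(loss="mse", optimizer="adam")   # total loss = MSE + sparsity penalty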
Denoising Autoencoders (DAEs)
In addition to adding penalty terms, there are other tricks to keep autoencoders from simply copying the data.
One trick is to add some noise to the input data; x̃ is used to denote a noisy (corrupted) version of the input x.
The denoising autoencoder (DAE) seeks to minimize L(x, g(f(x̃))).
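A minimal Keras sketch of this noise trick, using a GaussianNoise layer as the corruption; the architecture and the noise level 0.2 are illustrative assumptions.

# Denoising autoencoder sketch: corrupt the input, reconstruct the clean target.
import tensorflow as tf
from tensorflow import keras

denoising_ae = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.GaussianNoise(0.2),       # corruption: x_tilde = x + noise (training only)
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(30),                # code h = f(x_tilde)
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape((28, 28)),        # reconstruction g(f(x_tilde))
])
# The target is the CLEAN input x, so the loss is L(x, g(f(x_tilde))).
denoising_ae.compile(loss="mse", optimizer="adam")
# denoising_ae.fit(X_train, X_train, epochs=10)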
Stochastic Encoders and Decoders
The general strategy for designing the output units and loss function of a feedforward network is to define an output distribution p(y | x) and minimize the negative log-likelihood −log p(y | x), where y is a vector of targets, such as class labels.
In an autoencoder, x is the target as well as the input.
Yet we can apply the same machinery as before.
Stochastic Encoders and Decoders
[Figure: input x → p_encoder(h | x) → code h → p_decoder(x | h) → reconstruction]
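To make this concrete (a standard maximum-likelihood fact rather than something stated on these slides): the choice of p_decoder(x | h) fixes the reconstruction loss, because training minimizes −log p_decoder(x | h). For a Gaussian decoder with fixed variance the loss reduces to squared error, and for a Bernoulli decoder over binary pixels it reduces to cross-entropy:

-\log p_{\mathrm{decoder}}(x \mid h) = \tfrac{1}{2\sigma^{2}} \lVert x - g(h) \rVert^{2} + \mathrm{const},
\qquad p_{\mathrm{decoder}}(x \mid h) = \mathcal{N}\bigl(x;\, g(h),\, \sigma^{2} I\bigr)

-\log p_{\mathrm{decoder}}(x \mid h) = -\textstyle\sum_{i} \bigl[ x_{i} \log \hat{x}_{i} + (1 - x_{i}) \log(1 - \hat{x}_{i}) \bigr],
\qquad \hat{x} = g(h)

This is why squared error (or binary cross-entropy for binary pixels) is the usual reconstruction loss in practice.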
Denoising Autoencoders (DAEs)
A DAE is defined as an autoencoder that receives a corrupted data point x̃ as input and is trained to predict the original, uncorrupted data point x as its output.
Traditional autoencoders minimize L(x, g(f(x))).
The DAE instead seeks to minimize L(x, g(f(x̃))).
• The autoencoder must undo this corruption rather than simply copying its input.
[Figure: noisy input → encoder → latent-space representation → decoder → denoised output]
Denoising Autoencoders (DAEs)
The DAE is trained to reconstruct the clean data point x from its corrupted version x̃ by minimizing the loss
L = −log p_decoder(x | h = f(x̃))
The autoencoder learns a reconstruction distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃) as follows:
1. Sample a training example x from the training data.
2. Sample a corrupted version x̃ from the corruption process C(x̃ | x).
3. Use (x, x̃) as a training pair for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of the encoder f(x̃) and p_decoder typically defined by a decoder g(h).
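The same procedure written out as a plain NumPy sampling loop; the Gaussian choice of C(x̃ | x) and the helper names (corrupt, dae_training_pairs, train_step) are assumptions for illustration.

import numpy as np

def corrupt(x, sigma=0.2):
    """Corruption process C(x_tilde | x): add isotropic Gaussian noise (assumed choice)."""
    return x + sigma * np.random.randn(*x.shape)

def dae_training_pairs(X_train, batch_size=32):
    """Yield (corrupted, clean) pairs following steps 1-3 above."""
    while True:
        idx = np.random.randint(0, len(X_train), size=batch_size)
        x = X_train[idx]            # step 1: sample clean x from the training data
        x_tilde = corrupt(x)        # step 2: sample x_tilde from C(x_tilde | x)
        yield x_tilde, x            # step 3: train to reconstruct x from x_tilde

# for x_tilde, x in dae_training_pairs(X_train):
#     train_step(x_tilde, x)       # minimize -log p_decoder(x | h = f(x_tilde))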
Denoising Autoencoders (DAEs)
Score matching is often employed to train
DAEs.
Score Matching encourages the model to
have the same score as the data
distribution at every training point x.
The score is a particular gradient field: ∇_x log p(x).
The DAE learns to estimate this score as g(f(x)) − x.
See the picture on the next slide.
Denoising Autoencoders (DAEs)
[Figure] Training examples x are shown as red crosses.
The gray circles represent sets of equiprobable corruptions.
The vector field g(f(x)) − x, indicated by green arrows, estimates the score ∇_x log p(x), which is the slope (gradient) of the data density.
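The precise form of this estimate (a known result of Alain and Bengio, 2014, discussed in Chapter 14 of the book, with the σ² scaling that the slides leave implicit): for a DAE trained with squared error and small Gaussian corruption noise of variance σ², the optimal reconstruction satisfies

g(f(x)) - x \;\approx\; \sigma^{2} \, \nabla_{x} \log p(x) \qquad \text{as } \sigma \to 0

so the arrows in the figure point toward regions of higher data density.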
Contractive Autoencoder (CAE)
The contractive autoencoder has an explicit regularizer on h = f(x), encouraging the derivatives of f to be as small as possible, by minimizing
L(x, g(f(x))) + Ω(h), with Ω(h) = λ ‖∂f(x)/∂x‖_F².
The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function.
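A sketch of how this penalty could be computed with TensorFlow automatic differentiation, where tf.GradientTape.batch_jacobian yields the per-example Jacobian ∂f(x)/∂x; the architecture and the value of λ are illustrative assumptions.

# Contractive autoencoder loss sketch: reconstruction error + Jacobian penalty.
import tensorflow as tf
from tensorflow import keras

cae_encoder = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(784,)),
    keras.layers.Dense(30, activation="sigmoid"),     # code h = f(x)
])
cae_decoder = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(30,)),
    keras.layers.Dense(784, activation="sigmoid"),
])

def contractive_loss(x, lam=1e-3):
    # x: tensor of shape (batch, 784)
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = cae_encoder(x)                             # h = f(x)
    jacobian = tape.batch_jacobian(h, x)               # shape (batch, 30, 784): df/dx per example
    penalty = tf.reduce_sum(tf.square(jacobian))       # squared Frobenius norm of the Jacobian
    reconstruction = cae_decoder(h)
    mse = tf.reduce_sum(tf.square(x - reconstruction))
    return mse + lam * penalty                         # L(x, g(f(x))) + lambda * ||df/dx||_F^2

In a full training step this loss would itself be computed under an outer GradientTape so that gradients with respect to the encoder and decoder weights can be taken.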
DAEs vs. CAEs
The DAE makes the reconstruction function resist small, finite-sized perturbations of the input.
The CAE makes the feature-encoding function resist small, infinitesimal perturbations of the input.
Both denoising AEs and contractive AEs perform well!
Both are overcomplete.
DAEs vs. CAEs
Advantage of DAE: simpler to implement.
Requires adding only one or two lines of code to a regular AE.
No need to compute the Jacobian of the hidden layer.
Advantage of CAE: the gradient is deterministic.
Might be more stable than the DAE, which uses a sampled gradient.
One less hyperparameter to tune (the noise factor).
Recurrent Autoencoders
In a recurrent autoencoder, the encoder is
typically a sequence-to-vector RNN which
compresses the input sequence down to a
single vector.
The decoder is a vector-to-sequence RNN
that does the reverse.
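A sketch of such a recurrent autoencoder in Keras; the sequence length, feature dimension, and layer sizes are illustrative assumptions.

# Recurrent autoencoder sketch: sequence-to-vector encoder, vector-to-sequence decoder.
import tensorflow as tf
from tensorflow import keras

n_steps, n_features = 50, 1   # assumed: sequences of 50 time steps, 1 feature each

recurrent_encoder = keras.Sequential([
    # sequence-to-vector RNN: only the final hidden state is returned
    keras.layers.LSTM(100, return_sequences=False, input_shape=(n_steps, n_features)),
    keras.layers.Dense(30),                          # fixed-size code for the whole sequence
])
recurrent_decoder = keras.Sequential([
    # vector-to-sequence RNN: repeat the code at every time step, then unroll
    keras.layers.RepeatVector(n_steps, input_shape=(30,)),
    keras.layers.LSTM(100, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(n_features)),
])
recurrent_ae = keras.Sequential([recurrent_encoder, recurrent_decoder])
recurrent_ae.compile(loss="mse", optimizer="adam")
# recurrent_ae.fit(X_seq, X_seq, epochs=10)   # X_seq shape: (batch, n_steps, n_features)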
Convolutional autoencoders
Convolutional neural networks are far better
suited than dense networks to work with images.
Convolutional autoencoder: The encoder is a
regular CNN composed of convolutional layers
and pooling layers.
It typically reduces the spatial dimensionality of the
inputs (i.e., height and width) while increasing the
depth (i.e., the number of feature maps).
The decoder does the reverse using transpose
convolutional layers.
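A sketch of a convolutional autoencoder along these lines in Keras; the filter counts and the 28×28×1 input shape are illustrative assumptions.

# Convolutional autoencoder sketch: conv/pool encoder, transpose-conv decoder.
import tensorflow as tf
from tensorflow import keras

conv_encoder = keras.Sequential([
    # spatial size shrinks (28x28 -> 14x14 -> 7x7) while depth grows (16 -> 32 feature maps)
    keras.layers.Conv2D(16, 3, padding="same", activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    keras.layers.MaxPool2D(pool_size=2),
])
conv_decoder = keras.Sequential([
    # transpose convolutions with stride 2 upsample back to the input size (7x7 -> 14x14 -> 28x28)
    keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu",
                                 input_shape=(7, 7, 32)),
    keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),
])
conv_ae = keras.Sequential([conv_encoder, conv_decoder])
conv_ae.compile(loss="mse", optimizer="adam")
# conv_ae.fit(X_img, X_img, epochs=10)   # X_img shape: (batch, 28, 28, 1)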
Applications of Autoencoders
Data compression
Dimensionality reduction
Information retrieval
Image denoising
Feature extraction
Removing watermarks from images
Applications of Autoencoders
Autoencoders have been successfully applied to
dimensionality reduction and information retrieval
tasks.
Dimensionality reduction is one of the early motivations for studying autoencoders.
For example, the deep autoencoder of Hinton and Salakhutdinov (2006) yielded less reconstruction error than PCA.
Chapter Summary
Autoencoders motivated.
Sparse autoencoders
Denoising autoencoders
Contractive autoencoder
Recurrent/Convolutional autoencoders
Applications of Autoencoders