Autoencoder
Autoencoders
• Autoencoders (AE) are a specific type of feedforward neural
network trained so that the output reproduces the input.
Latent representation, h
https://www.edureka.co/blog/autoencoders-tutorial/
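The idea above can be sketched as a tiny linear autoencoder in NumPy (an illustrative sketch, not from the slides; the toy data, layer sizes, and learning rate are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8 dimensions that really live on a 3-D subspace.
basis = rng.normal(size=(3, 8))
X = rng.normal(size=(200, 3)) @ basis

n_in, n_hidden = 8, 3
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder weights
lr = 0.01

def forward(X):
    h = X @ W1          # latent representation h (linear encoder)
    X_hat = h @ W2      # reconstruction (linear decoder)
    return h, X_hat

losses = []
for step in range(500):
    h, X_hat = forward(X)
    err = X_hat - X
    losses.append(np.mean(err ** 2))
    # Gradient direction for the squared reconstruction error
    # (constant factors folded into the learning rate).
    gW2 = h.T @ err / len(X)
    gW1 = X.T @ (err @ W2.T) / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2

print(losses[0], losses[-1])  # reconstruction error shrinks as training proceeds
```

Because the target is the input itself, no labels are needed: the network is forced to squeeze the data through the small latent layer h and back out.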
Denoising Autoencoders
• Keeping the code layer small forces the autoencoder to learn
a compressed, intelligent representation of the data.
• A denoising autoencoder goes further: it corrupts the input
(e.g., with random noise) and is trained to reconstruct the
original, clean input.
https://ift6266h17.files.wordpress.com/2017/03/14_autoencoders.pdf
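The corruption step can be sketched as follows (a hypothetical setup, not from the slides; the noise level and mask probability are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X_clean = rng.uniform(size=(100, 16))

def corrupt(X, noise_std=0.3, rng=rng):
    # Additive Gaussian corruption of the inputs.
    return X + rng.normal(scale=noise_std, size=X.shape)

def masking_corrupt(X, drop_prob=0.25, rng=rng):
    # Alternative corruption: randomly zero out a fraction of the inputs.
    mask = rng.uniform(size=X.shape) > drop_prob
    return X * mask

X_noisy = corrupt(X_clean)
# Training pairs for a denoising autoencoder:
# input  = X_noisy (corrupted),  target = X_clean (original).
```

The key point is that the loss compares the reconstruction against the clean data, so the network cannot simply copy its (noisy) input.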
Sparse Autoencoder
• The sparse autoencoder learning algorithm automatically learns
features from unlabeled data.
• This method works even if the code size is large, since only a
small subset of the nodes will be active at any time.
• a_j^(2) denotes the activation of hidden unit j in the hidden
layer (i.e., the 2nd layer) of the autoencoder.
Sparsity parameter
• Let ρ̂_j be the average activation of hidden unit j (averaged
over the training set).
• Collecting these averages gives a column vector with one row per
hidden neuron.
• We would like to enforce the constraint ρ̂_j = ρ, where ρ is the
sparsity parameter, typically a small value close to zero.
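Computing that column vector of average activations can be sketched like this (a made-up network and data, just to show the shape bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n_visible, n_hidden = 50, 10, 4
X = rng.normal(size=(m, n_visible))
W1 = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b1 = np.zeros(n_hidden)

# Activations of the hidden (2nd) layer for every training example.
A2 = sigmoid(X @ W1 + b1)                    # shape (m, n_hidden)

# rho_hat: average activation of each hidden unit over the training
# set, kept as a column vector with one row per hidden neuron.
rho_hat = A2.mean(axis=0, keepdims=True).T   # shape (n_hidden, 1)
print(rho_hat.shape)
```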
Penalty Term
• To satisfy this constraint, the hidden unit's activations must
mostly be near 0.
• To achieve this, we add an extra penalty term to our
optimization objective that penalizes ρ̂_j deviating significantly
from ρ.
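Assuming the standard sparse-autoencoder formulation, the penalty is the KL divergence between Bernoulli random variables with means ρ and ρ̂_j, summed over the s₂ hidden units:

```latex
\sum_{j=1}^{s_2} \mathrm{KL}\!\left(\rho \,\middle\|\, \hat\rho_j\right)
  = \sum_{j=1}^{s_2} \left[ \rho \log\frac{\rho}{\hat\rho_j}
  + (1-\rho) \log\frac{1-\rho}{1-\hat\rho_j} \right]
```

This term is zero when ρ̂_j = ρ and grows as the average activation drifts away from the target sparsity.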
• (In an alternative L1 formulation, the third term of the cost
penalizes the absolute value of the vector of activations a in
layer h for sample i.)
Cost function: KL Divergence
• Now compute the overall cost: the usual reconstruction cost plus
the KL-divergence penalty over the hidden units, weighted by a
coefficient β.
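A minimal numerical sketch of the KL penalty (assuming the standard Bernoulli form; the ρ and ρ̂ values below are made up):

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """Sum over hidden units of KL(rho || rho_hat_j) for Bernoulli means."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))))

rho = 0.05
# No deviation from the target sparsity -> no penalty.
print(kl_sparsity_penalty(rho, [0.05, 0.05]))
# Penalty grows as the average activations drift away from rho.
print(kl_sparsity_penalty(rho, [0.2, 0.5]))
```

In the full objective this penalty would be scaled by β and added to the reconstruction cost.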
Gradient Calculation
• To compute ρ̂_j, first run a forward pass over all the training
examples to obtain the average activations on the training set,
before computing backpropagation on any example.
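In the standard derivation, the sparsity penalty contributes an extra term to each hidden unit's backpropagation delta. A sketch of that term (illustrative values; β and ρ̂ are made up):

```python
import numpy as np

def sparsity_delta(rho, rho_hat, beta):
    # Extra term added to each hidden unit's delta during backprop
    # (before multiplying by the activation derivative).
    rho_hat = np.asarray(rho_hat, dtype=float)
    return beta * (-(rho / rho_hat) + (1 - rho) / (1 - rho_hat))

term = sparsity_delta(0.05, np.array([0.05, 0.5]), beta=3.0)
print(term)
```

The term vanishes for units whose average activation already equals ρ and pushes down units that are active too often, which is why ρ̂_j must be known before any backward pass.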
• When decoding from the latent state of a variational
autoencoder, we randomly sample from each latent-state
distribution to generate a vector as input for the decoder model.
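The sampling step above can be sketched with the reparameterization trick (the means and log-stddevs below stand in for hypothetical encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(3)

# The encoder of a variational autoencoder outputs a mean and a
# (log-)standard deviation per latent dimension; the decoder input
# is a random sample from each of those distributions.
mu = np.array([0.0, 1.0, -2.0])         # hypothetical encoder means
log_sigma = np.array([0.0, -1.0, 0.5])  # hypothetical encoder log-stddevs

def sample_latent(mu, log_sigma, rng=rng):
    eps = rng.normal(size=mu.shape)      # noise ~ N(0, I)
    return mu + np.exp(log_sigma) * eps  # reparameterization trick

z = sample_latent(mu, log_sigma)
print(z.shape)  # one latent vector, ready for the decoder
```

Writing the sample as mu + sigma * eps keeps the randomness in eps, so gradients can flow back through mu and log_sigma during training.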
Variational Inference
• Suppose that there exists some hidden variable z which
generates an observation x.
• We can only observe x, but we would like to infer the
characteristics of z, i.e., compute p(z|x).
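The posterior the slides ask for is given by Bayes' rule (the standard setup that variational inference approximates):

```latex
p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)},
\qquad
p(x) = \int p(x \mid z)\, p(z)\, dz
```

Because the integral defining p(x) is typically intractable, variational inference approximates p(z|x) with a simpler distribution q(z|x).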