
Deep Density Estimation

10716: Advanced Machine Learning


Pradeep Ravikumar

1 Introduction

Consider the density estimation problem where we wish to estimate the density p of some distribution P, and where we are given samples {x_i}_{i=1}^n drawn i.i.d. from that distribution. Suppose we wish to do parametric density estimation: we then start with a parametric class of densities {p_θ}_{θ∈Θ}, and estimate the density p_θ̂, for some θ̂ ∈ Θ, with the best fit to the data {x_i}_{i=1}^n. There are two technical facets to this: (a) how to specify a parametric family of densities, and (b) how to determine the goodness of fit of any member of the family to the data.

2 Multivariate Exponential Families (MEFs)

A very classical and popular class of distributions is the exponential family class:

p(X) = exp(⟨θ, ϕ(X)⟩ + B(X) − A(θ)),

which are specified by sufficient statistics ϕ : X → R^d and a log base measure B(X). A(θ) is the log-normalization constant, also known as the log-partition function:

A(θ) = log ∫_{X} exp(⟨θ, ϕ(X)⟩ + B(X)) dX.

Most of the “named” distributions you have heard of are members of the exponential family class: normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, Poisson, categorical, Wishart, inverse Wishart, and geometric, to name a few. Each of these makes some specific choice of ϕ(X) and B(X), usually depending also on the domain X (e.g. Poisson for count data). However, all of these are examples of univariate exponential families, where X ⊆ R. While the exponential family form above is well-defined for multivariate data as well, a key question is how we should specify the functions ϕ(X) and B(X) in that case.

There is, however, one class of exponential family distributions whose multivariate counterpart is equally popular: the multivariate categorical distribution. This is a popular class of distributions for discrete or categorical data, and is simply the probability table whose rows are the different configurations of all the discrete variables, together with one probability column that holds the corresponding probabilities. If the probabilities are all strictly positive, this corresponds to an exponential family with indicator sufficient statistics. If there are d variables, and each takes k values, then there are k^d possible configurations, so that there are too many sufficient statistics. A natural approach might be to truncate the set of sufficient statistics to only include up to r-order interactions. Such approaches form the rich subject of categorical data analysis and discrete graphical models.

The other popular multivariate exponential family is for continuous data and is the multivari-
ate Gaussian distribution. One of the defining characteristics of the multivariate Gaussian
is that it is the unique distribution where the conditional distributions of a variable con-
ditioned on the other variables P (Xi |X−i ) are univariate Gaussian for any fixed values of
X−i . This observation thus naturally allows us to answer the question: how do we take
any of the beloved univariate exponential family distributions (e.g. univariate Poisson) to a
corresponding canonical multivariate distribution?

[Yang et al., 2015] showed the following. Suppose that for all i ∈ [d]:

P(X_i | X_{−i}) ∝ exp(⟨θ_i(X_{−i}), Φ_i(X_i)⟩ + B_i(X_i)).

Then the only joint distribution P(X) that is consistent with these conditional distributions has the form:

P(X) ∝ exp( Σ_{S⊆[d]} θ_S Π_{i∈S} Φ_i(X_i) + Σ_{i∈[d]} B_i(X_i) ).

In other words, the set of sufficient statistics for the multivariate exponential family is specified by tensor products of the univariate exponential family sufficient statistics. (Exercise: verify that this holds for the multivariate Gaussian.) [Yang et al., 2015, Inouye et al., 2017] develop parametric exponential family distributions for multivariate data using the above recipe, and further reduce the parameterization above via some deeper connections to probabilistic graphical models. These parametric classes could be enriched further via mixtures of such exponential family distributions. Nonetheless, they might not always be a good fit for data such as images, where the raw features are low-level.
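
As a sketch of the exercise (assuming unit conditional variances purely for simplicity; this worked case is an added aside, not part of the original notes): if each conditional P(X_i | X_{−i}) is univariate Gaussian, then we may take Φ_i(X_i) = X_i and B_i(X_i) = −X_i²/2, and the recipe, with the interaction terms beyond pairwise set to zero, gives

P(X) ∝ exp( Σ_i θ_i X_i + Σ_{i<j} θ_{ij} X_i X_j − Σ_i X_i²/2 ),

which is exactly the multivariate Gaussian in its information form, with the θ_{ij} determining the off-diagonal entries of the precision matrix (i.e. a Gaussian graphical model).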

3 Energy Based Models

Now, instead of such “classical” parametric families, suppose we have a very expressive class of parametric functions {f_θ} (e.g. deep neural networks) that can approximate very complex functions very well. How do we use these for density estimation? One caveat to directly using these as a class of parametric densities is the constraint that the densities be non-negative and integrate to one. One approach to enforce that is to parameterize the logistic transform η(x) instead, so that

p_θ(x) = exp(η_θ(x)) / ∫_{x'∈X} exp(η_θ(x')) dx'

is non-negative and normalizable by construction, with no further constraints on η(x) (other than for identifiability, such as ∫_{x∈X} η(x) dx = 0, or η(x_0) = 0 for some x_0 ∈ X). In some of the literature, these are referred to as “energy based models”, where η_θ(x) = −E_θ(x) is referred to as the negative energy, so that higher energy is associated with lower probability (and vice versa), as a nod to statistical physics. Before we discuss approaches to train these models, it is worthwhile to briefly tour some of the classical “energy based models”, which were not fully nonparametric.
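
To make the normalization issue concrete, here is a minimal sketch (my own toy example, not from the notes) of a one-dimensional energy based model whose log-partition function is computed by brute-force quadrature on a grid; the tiny tanh “network” for η_θ is arbitrary.

```python
import numpy as np

def eta(x, theta):
    """Negative energy eta_theta(x): a small, arbitrary tanh network."""
    w1, b1, w2 = theta                       # shapes: (H,), (H,), (H,)
    h = np.tanh(np.outer(x, w1) + b1)        # (n, H)
    return h @ w2                            # (n,)

def log_density(x, theta, grid):
    """log p_theta(x) = eta_theta(x) - log Z(theta), with Z(theta) approximated
    by a Riemann sum over `grid` (only feasible in very low dimensions)."""
    dx = grid[1] - grid[0]
    log_z = np.log(np.sum(np.exp(eta(grid, theta))) * dx)
    return eta(x, theta) - log_z

rng = np.random.default_rng(0)
theta = (rng.normal(size=8), rng.normal(size=8), rng.normal(size=8))
grid = np.linspace(-5.0, 5.0, 2001)
print(log_density(np.array([0.0, 1.0, -2.0]), theta, grid))
```

The quadrature step is exactly what stops scaling: in high dimensions there is no cheap analogue of this grid sum, which is the intractability the rest of this section works around.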

3.1 Early Energy Based Models

Hopfield Network/Ising Model


P(X) ∝ exp( Σ_{i,j} w_{ij} X_i X_j ),

where w_{ij} = w_{ji}, and X_i ∈ {−1, +1} for all i ∈ [d]. Note that the most probable assignment to “neuron” X_i given X_{−i} is given by the neuronal computation

X̂_i = sign( Σ_{j≠i} w_{ij} X_j ),

and hence these were one of the first neural networks; they were also energy based models, with energy E_θ(x) = −Σ_{i≠j} w_{ij} x_i x_j.

These could also be viewed as an instance of a categorical or discrete MEF/graphical model with binary variables and pairwise factors (i.e. allowing for terms with interactions of at most two variables).

Boltzmann Machines The limited expressiveness of such models led to Boltzmann machines by Hinton et al. [1986], which were again over binary vectors, but also had hidden units Z (also binary, taking values in {−1, +1}), with joint distribution:

P(X, Z) ∝ exp( Σ_{i,j} w^{XX}_{ij} X_i X_j + Σ_{i,j} w^{ZZ}_{ij} Z_i Z_j + Σ_{i,j} w^{XZ}_{ij} X_i Z_j ).

Even though this is also a graphical model with at most pairwise factors, the hidden variables provide considerably more flexibility. The caveat, however, was that these were difficult to train, since the likelihood (or its gradient with respect to parameters) of the observed variables was not tractable, and required sampling based approximations.

Restricted Boltzmann Machines One simplification of the above was to only allow a bipartite graph between the observed and hidden units, so that only w^{XZ} ≠ 0: these were called restricted Boltzmann machines [Salakhutdinov et al., 2007]. These are a bit easier to train, and also have better semantics for the hidden units, since:

P(Z_j = 1 | X) = σ( Σ_i w^{XZ}_{ij} X_i ),

where σ(·) is the sigmoid function, so that the hidden units could be thought of as a stochastic neural layer on top of the observed variables (and accordingly they were stacked to build deeper neural networks).

3.2 Approximations for Tractable Learning

The main caveat with energy based models is the normalization constant, which involves a multi-dimensional integral. For instance, the MLE estimate of the parameters would solve:

inf_θ { −(1/n) Σ_{i=1}^n η_θ(x_i) + log ∫_{x∈X} exp(η_θ(x)) dx },

which is in general intractable due to the multi-dimensional integral. There has been a wide range of approaches, from multiple communities (AI, non-parametric statistics, statistical physics, theoretical computer science), to address this intractability of the normalization factor. We only briefly tour these; discussing them at length would comprise its own course.

3.2.1 Sampling Based Approaches

Most of these are used to approximate the gradient of the learning objective. Computing the gradient of the MLE objective above yields:

g_θ = −Ê[∇_θ η_θ(X)] + E_{P_θ}[∇_θ η_θ(X)],

where Ê denotes the empirical expectation over the samples. The second term is where the intractability comes from, since it requires computing an expectation with respect to the intractable energy based model given the current parameters.

The ideal sampling based approach would be to use MCMC to sample from P_θ and use those samples to approximate the expectation with respect to P_θ. Such chains, however, could take very long to generate samples with guarantees, and hence in practice one might run only a truncated set of MCMC steps and use the resulting samples. Carreira-Perpinan and Hinton [2005] suggest using just a few MCMC steps initialized at the data, which they termed contrastive divergence.
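
A minimal sketch of the idea (my own toy example, not the paper's algorithm): a one-dimensional energy model η_θ(x) = θ_1 x − θ_2 x²/2, where the model expectation in the gradient is replaced by a few random-walk Metropolis steps started from the data.

```python
import numpy as np

def eta(x, theta):
    return theta[0] * x - 0.5 * theta[1] * x**2

def grad_eta(x, theta):
    # d eta_theta(x) / d theta, evaluated pointwise
    return np.stack([x, -0.5 * x**2], axis=-1)

def metropolis_step(x, theta, rng, step=0.5):
    prop = x + step * rng.standard_normal(x.shape)
    accept = np.log(rng.random(x.shape)) < eta(prop, theta) - eta(x, theta)
    return np.where(accept, prop, x)

def cd_k_gradient(data, theta, k=1, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    samples = data.copy()
    for _ in range(k):                       # CD-k: only k MCMC steps from the data
        samples = metropolis_step(samples, theta, rng)
    # -E_data[grad eta] + E_model[grad eta], model term approximated by `samples`
    return -grad_eta(data, theta).mean(axis=0) + grad_eta(samples, theta).mean(axis=0)

rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=1.0, size=500)      # data from the model with theta = (1, 1)
theta = np.array([0.0, 2.0])
for _ in range(200):                                  # crude gradient descent
    theta -= 0.1 * cd_k_gradient(data, theta, k=5, rng=rng)
print(theta)   # should drift toward roughly (1, 1)
```

The short chains make the gradient estimate biased, but cheap; that trade-off is the essence of contrastive divergence.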

3.2.2 Variational Likelihood Approximations

Variational surrogate likelihoods are another approach. Jeon and Lin [2006], for instance (in the context of more general non-parametric density estimation), suggest the following estimator:

inf_θ { (1/n) Σ_{i=1}^n exp(−η_θ(x_i)) + ∫_{x∈X} η_θ(x) ρ(x) dx },

where ρ(x) is any simpler known density with the same support as the true density p(x). As they show, the M-estimator above is consistent, and moreover is much more tractable than the MLE. But overall, such surrogate likelihood approaches coupled with the logistic transform seem less popular, since their theoretical properties are less well-understood, and perhaps empirically they have not performed as well.

We will study more variational approximations in the sequel.

3.2.3 Contrastive Approaches

Gutmann and Hyvärinen [2010] propose to learn density ratios with respect to a known noise distribution instead of the density itself, which allows us to use contrastive or discriminative approaches to learn the models. Let Q be a given noise distribution (e.g. standard Gaussian). Suppose we learn a discriminant function h : X → [0, 1] to contrast samples from the data distribution P vs samples from Q, via the cross-entropy objective:

max_h  E_{X∼P}[ln h(X)] + E_{X∼Q}[ln(1 − h(X))].

The optimal discriminant function can be seen to be

h*(x) = P(x) / (P(x) + Q(x)),

which we can thus use to construct our density estimate

P̂(x) = Q(x) · ĥ(x) / (1 − ĥ(x)),

given the learnt discriminant function ĥ.

We can also use this noise contrastive approach to fit an explicit parameterized energy based model P_θ(x) ∝ exp(η_θ(x)). Assuming the true data distribution also follows this energy based model for some θ*, the optimal discriminant function would have the form:

h*(x) = exp(η_{θ*}(x) + c*) / (exp(η_{θ*}(x) + c*) + Q(x)).

We can thus fit a discriminant function from the following class of functions:

h_{θ,c}(x) = exp(η_θ(x) + c) / (exp(η_θ(x) + c) + Q(x)).

Gutmann and Hyvärinen [2010] show that doing so leads to a well-defined discriminant learning problem; that is, optimizing over the constant c that represents the log-normalization constant does not lead to an unbounded objective. As they show, there is a slight statistical inefficiency in using such a noise contrastive approach to learning energy based models. Liu et al. [2021] further show that when the noise distribution Q is far from the true data distribution P_{θ*}, the landscape of the classification objective becomes very flat far away from the optimum (so that the gap in the classification objective could be very small even when the distance in parameter space is very large). They suggest that normalized gradients over an appropriately chosen surrogate classification objective could ameliorate some of these landscape challenges.
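
As a small self-contained illustration (my own toy example, with an assumed noise distribution Q = N(0, 4)): fitting the one-dimensional energy model η_θ(x) = θ_1 x − θ_2 x²/2, together with an explicit constant c standing in for the negative log-partition function, by maximizing the discrimination objective above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x_data = rng.normal(1.0, 1.0, size=2000)          # "true" model: theta = (1, 1)
x_noise = rng.normal(0.0, 2.0, size=2000)         # noise distribution Q = N(0, 4)
log_q = lambda x: norm.logpdf(x, 0.0, 2.0)

def neg_objective(params):
    t1, t2, c = params
    # unnormalized log-model, with c playing the role of -A(theta)
    log_p = lambda x: t1 * x - 0.5 * t2 * x**2 + c
    # h(x) = exp(log_p) / (exp(log_p) + Q(x)) = sigmoid(log_p - log_q)
    logit_data = log_p(x_data) - log_q(x_data)
    logit_noise = log_p(x_noise) - log_q(x_noise)
    # maximize E_P[log h] + E_Q[log(1 - h)], so return the negative (stable log-sigmoid)
    return -(np.mean(-np.logaddexp(0.0, -logit_data))
             + np.mean(-np.logaddexp(0.0, logit_noise)))

res = minimize(neg_objective, x0=np.array([0.0, 0.5, 0.0]))
print(res.x)   # roughly (1, 1, c) with c ≈ -log(sqrt(2*pi)) - 0.5
```

Note that, as discussed above, the constant c is estimated jointly with θ rather than computed by integration.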

3.2.4 Score Matching

Consider the score function:

s(x) = ∂/∂x log p(x).

It can be seen that for an energy based model p_θ(x), the model score simplifies to:

s_θ(x) = ∂/∂x η_θ(x),

and in particular does not need to grapple with the log-partition function. Consider the “score matching” objective:

J(θ) = (1/2) E‖s_θ(X) − s(X)‖²,

which aims to match the score of the energy based model to the true score function.

Hyvärinen and Dayan [2005] show that the above can be re-written as:

J(θ) = E[ Σ_{i=1}^d ( ∂/∂X_i s_{θ;i}(X) + (1/2) s_{θ;i}(X)² ) ] + const.,

so that one could estimate the above via:

Ĵ(θ) = Ê[ Σ_{i=1}^d ( ∂/∂X_i s_{θ;i}(X) + (1/2) s_{θ;i}(X)² ) ] + const.,

where Ê is the empirical expectation given samples {x_i}_{i∈[n]} drawn from P.

As score matching (and contrastive learning) find wider usage, their statistical and optimization landscape caveats are increasingly being analyzed. Koehler et al. [2022], for instance, show that for energy based models with a large isoperimetric constant (loosely: the worst case over sets of the ratio of the probability mass of a set's boundary to the probability mass of the set itself), score matching can be very statistically inefficient.
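
A small worked example (mine, not from the notes): for a Gaussian model N(µ, σ²), the model score and its derivative are available in closed form, so the empirical score matching objective can be written down and minimized directly.

```python
import numpy as np
from scipy.optimize import minimize

# Model score for N(mu, sigma^2):  s_theta(x) = -(x - mu) / sigma^2,
# with derivative  d s_theta / dx = -1 / sigma^2.  The empirical objective is
#   J_hat(theta) = (1/n) * sum_i [ ds/dx (x_i) + 0.5 * s(x_i)^2 ].

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5000)

def j_hat(params):
    mu, log_var = params
    var = np.exp(log_var)                  # parameterize sigma^2 > 0
    score = -(x - mu) / var
    dscore_dx = -1.0 / var
    return np.mean(dscore_dx + 0.5 * score**2)

res = minimize(j_hat, x0=np.array([0.0, 0.0]))
mu_hat, var_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, np.sqrt(var_hat))   # should be close to (2.0, 1.5)
# Note: the normalization constant never appears in the objective.
```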

4 Neural Generative Models

Over the last decade there has been a slew of alternative approaches that sidestep the energy model/logistic transform route, with its normalization difficulties, altogether, and instead specify the random vector X as a transformation of some other latent variable Z with a known distribution. These transformations can in general be relatively unconstrained, so that we sidestep issues of normalizability. They typically involve deep neural network based parametric functions, and hence are loosely called deep density estimators. Let us consider various classes of these generative models in the sequel.

5 Normalizing Flows

Suppose that we have a latent representation Z ∼ N (0, I), and that we have a deterministic
transformation from Z to the data X as:

X = gθ (Z),

for some flexible parametric function g_θ. Suppose g_θ is invertible (which is a big if). Then by the change of variables formula:

p_{X;θ}(x) = p_Z(g_θ^{-1}(x)) |det J g_θ^{-1}(x)|,

where [J h(x)]_{ij} = ∂h_i(x)/∂x_j, so that the density has a nice closed form expression. Thus, given samples {x_i}_{i=1}^n, we could directly solve for the MLE:

inf_θ − Σ_{i=1}^n log p_{X;θ}(x_i).

Note that these can be stacked, so that we could obtain a stacked transformation Z_K = g_K ∘ ⋯ ∘ g_1(Z_0), which in turn will have the log-density:

log p_K(z_K) = log p_0(z_0) − Σ_{k=1}^K log |det J g_k(z_{k−1})|.

The random variables Z_k are called “flows,” and the distributions P_k are called “normalizing flows” [Rezende and Mohamed, 2015].

Note that by the so-called reparameterization trick (discussed further in the variational auto-encoder section below), E_{p_K}[h(Z_K)] = E_{p_0}[h(g_K ∘ ⋯ ∘ g_1(Z_0))], which does not involve any Jacobian calculations.

5.1 Invertible Maps

The key caveats with normalizing flows are two-fold: (a) the transformation gθ has to be
invertible, and (b) the density involves the Jacobian of the transformation, which could be
expensive for general invertible maps.

Some simple classes of invertible transformations (which as noted above can be stacked to get “deep” flow transforms) include planar flows:

g(z) = z + u h(wᵀz + b),

which are invertible for specific settings of (h, u, w), e.g. h = tanh(·) and wᵀu ≥ −1 [Rezende and Mohamed, 2015].

An alternative approach, called NICE [Dinh et al., 2014], is to split X = (X1 , X2 ) as well as
Z = (Z1 , Z2 ) into two blocks of variables with the blocked transform:

X1 = Z1
X2 = Z2 + m(Z1 ),

for an arbitrary, potentially non-invertible function m(·). It can be seen that the transfor-
mation from Z to X is trivially invertible:

Z1 = X1
Z2 = X2 − m(X1 ).

Moreover, the Jacobian of the transformation is triangular, so that its determinant is simply
the product of diagonal entries, and hence easy to compute. A related triangular Jacobian
transformation [Dinh et al., 2016] is given by:

X1 = Z1
X2 = Z2 ⊙ exp(m1 (Z1 )) + m2 (Z1 ),

which can again be trivially inverted for arbitrary m1 (·), m2 (·), via:

Z1 = X1
Z2 = (X2 − m2 (X1 )) ⊙ exp(−m1 (X1 )).
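
A minimal sketch of one such coupling layer (illustrative code only; the tiny shift and log-scale networks m1, m2 are arbitrary choices, not the papers' architectures): the forward map, its exact inverse, the triangular-Jacobian log-determinant, and the resulting density under a standard normal base distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # total dimension, split into 2 + 2
W1, W2 = 0.1 * rng.normal(size=(2, 2)), 0.1 * rng.normal(size=(2, 2))
m1 = lambda z1: np.tanh(z1 @ W1)        # log-scale network (arbitrary)
m2 = lambda z1: z1 @ W2                 # shift network (arbitrary)

def forward(z):
    z1, z2 = z[:, :2], z[:, 2:]
    x2 = z2 * np.exp(m1(z1)) + m2(z1)
    logdet = np.sum(m1(z1), axis=1)     # log|det J| of Z -> X: sum of log-scales
    return np.concatenate([z1, x2], axis=1), logdet

def inverse(x):
    x1, x2 = x[:, :2], x[:, 2:]
    z2 = (x2 - m2(x1)) * np.exp(-m1(x1))
    return np.concatenate([x1, z2], axis=1)

def log_density(x):
    # change of variables with a standard normal base density on Z
    z = inverse(x)
    log_pz = -0.5 * np.sum(z**2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    logdet_inv = -np.sum(m1(x[:, :2]), axis=1)   # log|det J| of X -> Z
    return log_pz + logdet_inv

z = rng.normal(size=(5, d))
x, _ = forward(z)
print(np.allclose(inverse(x), z))       # True: the coupling is exactly invertible
print(log_density(x))
```

Note that m1 and m2 never need to be inverted, which is what lets them be arbitrary neural networks in practice.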

6 Autoregressive Flows

The simple triangular Jacobian examples had an implicit auto-regressive character: we could specify the joint distribution via the marginal distribution of a subset of variables X1, and the conditional distribution of the remaining subset X2 conditioned on X1. Autoregressive flows generalize this to allow for more general auto-regressive transformations.

In a so-called Masked Autoregressive Flow (MAF) [Papamakarios et al., 2017], this is given as:

X_i = µ_i + Z_i exp(α_i),

where Z_i ∼ N(0, 1), µ_i = g_{µ_i}(X_{<i}), and α_i = g_{α_i}(X_{<i}), so that X is a transformation of the standard Gaussian vector Z, and where the transformation is specified in an auto-regressive manner. It can be seen that the inverse is easily computed:

Z_i = (X_i − µ_i) exp(−α_i),

so that Z can be recovered given X, and that moreover the determinant of the Jacobian of the inverse of the transformation X = g(Z) is easily computed as |det J g^{-1}(x)| = exp(−Σ_i α_i). MAF can transform X to Z in one (parallelized) iteration, since the information to specify each Z_j is fully available in X, and we do not need to wait to compute Z_{<j}. But we then require p iterations to transform Z to X, since to specify X_j it does not suffice to provide Z; we also need X_{<j}.

A variant of MAF is Inverse Autoregressive Flow (IAF) [Kingma et al., 2016], where we
have:
Xi = µi + Zi exp(αi ),
where µi = gµi (Z<i ), and αi = gαi (Z<i ). Its inverse is again given as:

Zi = (Xi − µi ) exp(−αi ),

but note that in this case, IAF can transform Z to X in one vectorized iteration, but requires
p iterations to transform X to Z. Thus the slight difference in choices between IAF and MAF
can result in vastly different computational times for specific tasks. Note that transforming X to Z is required for calculating the density of a point X, while transforming Z to X is required to generate new samples.
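
The computational asymmetry can be seen in a small sketch (my own toy conditioners g_µ, g_α that read only X_{<i}; a real MAF would use a masked network): the X → Z direction is a single pass that only reads X, while the Z → X direction must fill in X sequentially.

```python
import numpy as np

d = 3
g_mu = lambda x_prev: 0.5 * np.sum(x_prev)       # toy autoregressive conditioners
g_alpha = lambda x_prev: 0.1 * np.sum(x_prev)

def x_to_z(x):
    # Each mu_i, alpha_i depends only on x_{<i}, which is already known,
    # so all coordinates can be computed independently (parallelizable).
    return np.array([(x[i] - g_mu(x[:i])) * np.exp(-g_alpha(x[:i])) for i in range(d)])

def z_to_x(z):
    # Sampling direction: x_i needs x_{<i}, so we must loop sequentially.
    x = np.zeros(d)
    for i in range(d):
        x[i] = g_mu(x[:i]) + z[i] * np.exp(g_alpha(x[:i]))
    return x

def log_density(x):
    # log p_X(x) = log N(z; 0, I) + log|det dz/dx| = ... - sum_i alpha_i(x_{<i})
    z = x_to_z(x)
    alphas = np.array([g_alpha(x[:i]) for i in range(d)])
    return -0.5 * np.sum(z**2) - 0.5 * d * np.log(2 * np.pi) - np.sum(alphas)

z = np.array([0.3, -1.0, 0.5])
x = z_to_x(z)
print(np.allclose(x_to_z(x), z))   # True
print(log_density(x))
# For IAF the roles are swapped: mu_i, alpha_i depend on z_{<i}, so Z -> X is the
# parallel direction and X -> Z is the sequential one.
```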

Stacking such auto-regressive transformations, X = g_K ∘ ⋯ ∘ g_1(Z), is called an “autoregressive flow”, a special instance of normalizing flows.

6.1 General Auto-regressive Distributions

Classically, auto-regressive models were used to directly parameterize joint distributions (rather than simply transformations), by parameterizing the conditional distributions specified by the standard chain rule:

p_θ(x) = Π_{i=1}^p p_θ(x_i | x_{<i}).
A classical approach to model pθ (xi |x<i ) is to make a Markov assumption that pθ (xi |x<i ) =
pθ (xi |xi−1 , . . . , xi−k ) so that the conditional distribution of Xi conditioned on all previous
variables only depends on the k most recent variables before Xi . Another approach is to use
sequence model based recurrences, such as:

h_i = f_{θ;h}(h_{i−1})
x_i = f_{θ;x}(x_{i−1}, h_i),

for some parametric functions f_{θ;h}(·) and f_{θ;x}(·, ·). Such recurrence based sequence models,
such as recurrent neural networks (RNNs), were popular parametric models for sequence
based data where the sequence order is very naturally specified, for instance, via time. But
they are far less popular when there is no such natural sequence order, in large part because
the performance of such models is very sensitive to such ordering. Papamakarios et al. [2017],
in their Figure 1, provide an example with two variables, where an auto-regressive model
with the ordering (x1 , x2 ) is able to model the data, but an auto-regressive model with the
ordering (x2 , x1 ) is not able to.

One approach to address this is to use different orderings, and use an ensemble or mixture of
the resulting distributions. Another approach is to use different orderings in each layer of
an “autoregressive flow” X = gK ◦ . . . ◦ g1 (Z), where we use a different ordering for each
auto-regressive transformation gi , for i = 1, . . . , K. This provides another rationale for the
use of auto-regressive flows, rather than sequence based auto-regressive recurrence models,
in addition to other benefits of normalizing flows, such as the ease of computing the density,
and sampling.

7 Generative Adversarial Networks (GANs)

Suppose we have a parametric transformation X = g_θ(Z) of some base random vector Z with a known distribution. Provided the transformation g_θ is invertible, as with normalizing flows, we can compute the density p_θ(x) of the random vector X, and consequently use the MLE:

θ̂ ∈ arg inf_θ −(1/n) Σ_{i=1}^n log p_θ(x_i),

to estimate the parameters given samples {x_i}_{i=1}^n. There are two caveats here. The first
is that this is not feasible when gθ is not invertible, which would be the case for instance,
for most modern architectures of deep neural networks. The second caveat is more subtle,
and is due to the very nature of the MLE as minimizing the empirical variant of the KL
divergence between the true data distribution P and the distribution Pθ over X with density
pθ :
inf_θ KL(P, P_θ).

Note that KL(P, P_θ) = ∫ p(x) log [p(x)/p_θ(x)] dx, so that this would be large if there are P-likely regions where p_θ(x) is small and p(x) is large: minimizing it thus encourages p_θ(x) to have support in the P-likely regions of the input space. But this does not ensure that p_θ(x) be small where p(x) is small: such a property would be required to ensure that samples from P_θ be P-realistic (i.e. do not have small density with respect to the true data distribution P). How do we encourage the latter property? By minimizing KL(P_θ, P) = ∫ p_θ(x) log [p_θ(x)/p(x)] dx, which would be large if there are P_θ-likely regions where p(x) is small and p_θ(x) is large. A caveat with KL(P_θ, P), on the other hand, is practical: it is not decomposable, so that it is not clear how to optimize it given just samples {x_i}_{i=1}^n from P. Combining both these asymmetric KL divergences yields the Jensen-Shannon divergence:

JSD(P, P_θ) = (1/2) KL(P, (P + P_θ)/2) + (1/2) KL(P_θ, (P + P_θ)/2),

which has the additional advantage of being symmetric in its arguments. This loss again is not decomposable, so that it is not clear how to optimize it given just samples {x_i}_{i=1}^n from P. In a seminal paper, Goodfellow et al. [2014] showed that one can indeed minimize the Jensen-Shannon divergence given samples, by considering a variational form using “generators” and “discriminators”.

Let D : X ↦ [0, 1] be a classifier (ideally probabilistic, but more generally with an output of classifier scores between 0 and 1). Given the parameterized density q_θ, and the density p of the true data distribution, consider the following variational form:

V(p, q_θ, D) = E_{x∼p}[log D(x)] + E_{x∼q_θ}[log(1 − D(x))].

Goodfellow et al. [2014] then showed the following useful result:

max_D V(p, q_θ, D) = −log(4) + 2 JSD(p, q_θ),

so that

arg min_θ max_D V(p, q_θ, D) = arg min_θ JSD(p, q_θ).

The interesting facet of V(p, q_θ, D) is that it is decomposable, so that it can be approximated well via samples (from both p as well as q_θ), thus facilitating learning the parameters of the density q_θ by minimizing the Jensen-Shannon divergence itself with respect to the true data distribution.

The variational objective V (p, qθ , D) can also be motivated as specifying a min-max adversar-
ial game between the “generative” density qθ , and a discriminator D that aims to discriminate
between samples from Qθ and P , while the generator Qθ aims to fool the discriminator D.
Specifically, consider the following classification task, where Y = 1 indicates the true data
distribution and Y = 0 indicates Q_θ, so that X | (Y = 1) ∼ P and X | (Y = 0) ∼ Q_θ. The expected cross-entropy loss of a probabilistic discriminator D : X ↦ [0, 1] is then given by

E[Y log D(X) + (1 − Y) log(1 − D(X))]
  = E[log D(X) | Y = 1] P(Y = 1) + E[log(1 − D(X)) | Y = 0] P(Y = 0)
  = 0.5 · E_{X∼P}[log D(X)] + 0.5 · E_{X∼Q_θ}[log(1 − D(X))]
  = 0.5 · V(p, q_θ, D),

using P(Y = 1) = P(Y = 0) = 1/2.

Thus, the variational objective V (p, qθ , D) is simply twice the expected cross entropy loss of
the discriminator D(·) in the classification task of discriminating between the true distribu-
tion P and the generative model Qθ .
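
As a concrete, if toy, illustration of this min-max game (my own sketch, not the paper's setup): a one-parameter “generator” that shifts a standard normal, against a logistic discriminator, with alternating analytic-gradient updates that ascend V in the discriminator parameters and descend it in the generator parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta, a, b = 0.0, 0.0, 0.0          # generator shift; discriminator D(x) = sigmoid(a*x + b)
lr_d, lr_g, batch = 0.05, 0.05, 256

for step in range(2000):
    x_real = rng.normal(2.0, 1.0, size=batch)          # true data: N(2, 1)
    x_fake = rng.normal(0.0, 1.0, size=batch) + theta  # generator samples g_theta(z) = z + theta

    # Discriminator ascent on V = E_P[log D] + E_Q[log(1 - D)]
    d_real, d_fake = sigmoid(a * x_real + b), sigmoid(a * x_fake + b)
    grad_a = np.mean((1 - d_real) * x_real) + np.mean(-d_fake * x_fake)
    grad_b = np.mean(1 - d_real) + np.mean(-d_fake)
    a, b = a + lr_d * grad_a, b + lr_d * grad_b

    # Generator descent on E_Q[log(1 - D(g_theta(z)))]
    d_fake = sigmoid(a * x_fake + b)
    theta -= lr_g * np.mean(-d_fake * a)

print(theta, a, b)   # theta should end up near 2; at equilibrium D is close to 1/2 on both
```

The point of the sketch is only the structure of the alternating updates; real GANs replace both players with neural networks trained by automatic differentiation.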

8 Variational Auto-Encoders

A key caveat to the above approaches is that they might not be able to capture the multi-modality of the inputs given the latent representation if X is a deterministic function of Z. This prevents the latent variables from representing only “coarser” information about the input. One way to accommodate this is to allow X to be a stochastic function of Z.

A very natural approach [Kingma and Welling, 2013] along these lines is to have:

Z ∼ N(0, I_d)
X | Z = z ∼ N(µ_θ(z), σ_θ²(z) I),

or alternatively:

X = µ_θ(Z) + σ_θ(Z) ⊙ W,

where Z, W ∼ N(0, I). Thus, X has a well-defined density even when the mean and variance functions {µ_θ(z), σ_θ(z)}_{θ∈Θ} are relatively unconstrained, with no normalization terms. For instance, a highly expressive parametric family of choice for these mean and variance functions is deep neural networks. The density of X is then given as:

p(x; θ) = ∫_{z∈R^d} p_N(x; µ_θ(z), σ_θ²(z) I) p_N(z; 0, I) dz,

where p_N(·; µ, Σ) denotes the multivariate Gaussian density with mean µ and covariance Σ. It can be seen that the density does not have an explicit tractable form. To fit the parameters θ to the data, we could maximize the likelihood of the observed data:

max_θ (1/n) Σ_{i=1}^n log p(x_i; θ),

but this has all of the difficulties of fitting a latent generative model.

As before, we could optimize a surrogate likelihood instead. In so-called variational inference, we compute parameterized lower bounds of the likelihood and optimize this lower bound instead. Thus, if p_θ(x) ≥ g_{θ;γ}(x), for γ ∈ Γ, then we solve for:

max_{θ∈Θ, γ∈Γ} (1/n) Σ_{i=1}^n log g_{θ;γ}(x_i).

With the above latent variable model, we have the following classical variational bound, also called the Evidence Lower Bound or ELBO:

log p_θ(x) = log ∫_z p_θ(x, z) dz
           = log ∫_z q_ϕ(z|x) [ p_θ(x|z) p(z) / q_ϕ(z|x) ] dz
           ≥ ∫_z q_ϕ(z|x) log [ p(z) / q_ϕ(z|x) ] dz + ∫_z q_ϕ(z|x) log p_θ(x|z) dz   (by Jensen's inequality)
           = L(θ, ϕ, x) := −KL(q_ϕ(z|x) ∥ p(z)) + E_{q_ϕ(z|x)}[log p_θ(x|z)],

so that instead of maximizing the empirical expectation of log p_θ(x), we maximize the empirical expectation of the lower bound L(θ, ϕ, x). We can moreover show that:

log pθ (x) − L(θ, ϕ, x) = KL(qϕ (z|x)||pθ (z|x)),

so that the ELBO bound gets tighter as the variational approximation qϕ (z|x) gets closer to
the intractable true posterior pθ (z|x).

A natural flexible parameterization is simply:

q_ϕ(z|x) = N(µ_ϕ(x), σ_ϕ²(x) I),

where the mean and variance functions µ_ϕ(x), σ_ϕ(x) can again be parameterized by flexible families such as deep neural networks. Note that when taking an expectation E_{q_ϕ(z|x)}[f(z)], we can “reparameterize” z = h(x, w) := µ_ϕ(x) + σ_ϕ(x) ⊙ w in terms of w ∼ N(0, I), so that

E_{q_ϕ(z|x)}[f(z)] = E_{w∼N(0,I)}[f(h(x, w))],

which can be approximated via Monte Carlo samples of w, with no explicit density calculations of q_ϕ(z|x). This is called the “reparameterization trick”.
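
The pieces above fit together as in the following minimal sketch (my own illustration; the linear “encoder” and “decoder” and the fixed observation variance are arbitrary choices, not the paper's architecture): a single-sample Monte Carlo estimate of the ELBO, with the KL term computed analytically and z written via the reparameterization z = µ_ϕ(x) + σ_ϕ(x) ⊙ w.

```python
import numpy as np

rng = np.random.default_rng(0)
p_dim, d_latent = 6, 2
W_enc = 0.1 * rng.normal(size=(p_dim, 2 * d_latent))      # encoder weights (toy)
W_dec = 0.1 * rng.normal(size=(d_latent, p_dim))          # decoder mean weights (toy)
obs_var = 0.5                                             # fixed decoder variance

def elbo(x):
    # Encoder q_phi(z|x) = N(mu, diag(sig^2))
    enc = x @ W_enc
    mu, log_sig = enc[:d_latent], enc[d_latent:]
    sig = np.exp(log_sig)
    # Reparameterization: z = mu + sig * w, with w ~ N(0, I)
    w = rng.standard_normal(d_latent)
    z = mu + sig * w
    # Reconstruction term E_q[log p_theta(x|z)], one Monte Carlo sample
    x_mean = z @ W_dec
    log_px_z = -0.5 * np.sum((x - x_mean) ** 2) / obs_var \
               - 0.5 * p_dim * np.log(2 * np.pi * obs_var)
    # Analytic KL(q_phi(z|x) || N(0, I)) for diagonal Gaussians
    kl = 0.5 * np.sum(sig**2 + mu**2 - 1.0 - 2.0 * log_sig)
    return log_px_z - kl

x = rng.normal(size=p_dim)
print(elbo(x))
# Because z is a deterministic function of (x, w), this same expression is
# differentiable in the encoder/decoder parameters, which is what makes
# gradient-based training of the ELBO possible.
```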

The above approach is also called the Variational Auto-encoder, since it is reminiscent of
auto-encoders that were used to learn compact representations of the input x. As an instance,
suppose we wish to get a representation z ∈ Rd of the input x ∈ Rp for some d < p, via the
following “encoder” model:
z = g(b + W x),

which is then coupled with a “decoder” model:

x̂ = g(c + V z),

for some point-wise non-linearity g(·), some vectors b, c, and matrices W, V. We could learn these parameters by minimizing the reconstruction error:

inf_θ Σ_i ‖x̂_i − x_i‖.

Here the “encoder” transformation from x to z, as well as the “decoder” transformation from z to x̂, are both deterministic, and hence this does not specify a density model for x. With the variational auto-encoder, both of these transformations are stochastic, and moreover there is an explicit distribution imposed on the latent representations z, which thus specifies a distribution over the inputs x. Additionally, traditional auto-encoders, in order to avoid learning a trivial identity mapping (which could be implemented via arbitrary encoders and decoders that are inverses of each other), either use a bottleneck (where the dimensionality of the encoding z is much smaller than that of x), or add noise to the input x and aim to predict the denoised input (with the idea that the encoding z is then forced to capture the salient information about x). The variational auto-encoder could thus be viewed as a principled Bayesian approach to denoising auto-encoders.

While in the original variational auto-encoder q_ϕ(z|x) was set to be a Gaussian with parameterized mean and variance, one could also use other flexible parameterizations, including invertible neural networks or the normalizing flows discussed earlier.

One could also use a stacked set of latent Gaussians as:

z_L ∼ N(0, I)
z_l | z_{l+1} ∼ N(µ_l(z_{l+1}), σ_l²(z_{l+1}) I),   for l = L−1, …, 1
x | z_1 ∼ N(µ_0(z_1), σ_0²(z_1) I),

in what are called Deep Latent Gaussian Models [Rezende et al., 2014], though these seem
less popular, perhaps due to the added complexity.

9 Destructive Distribution Learning

So far we have considered a largely “constructive” learning approach where we learn a trans-
formation gθ (Z) of a random vector Z with known simple distribution (such as independent
Gaussian) and fit the parameters so that the transformed distribution Pgθ (Z) is as close to the
true data distribution PX as possible, for instance by solving for the following (population)
objective:
inf_θ KL(P_X, P_{g_θ(Z)}).

An alternative approach is to consider a “destructive” learning approach, where we learn a transformation h_θ(X) of the data random vector X, and fit the parameters so that the transformed distribution P_{h_θ(X)} is as close as possible to that of a random vector Z with a known simple distribution (such as independent Gaussian). Such a transformation is called destructive learning since we aim to “destroy” the structure in X, reducing it to, say, an independent Gaussian distribution.

But while (imperfectly) transforming Z to X seems useful from a density estimation perspective, why would we want to (imperfectly) transform X to Z? There are two reasons to do so. The first is that of representation learning, as we discuss further below. The second is that we could also use it as a density estimation procedure.

9.1 Representation Learning

Transforming X to a Z with a known or simple distribution could be cast as “encoding” the data X into a representation that is “simple”. A variant of this is Independent Component Analysis (ICA), where we (typically) only assume that Z has independent components. This is not, however, sufficient to make the transformation identifiable. For instance, if Z_1 and Z_2 are independent, then so are the component-wise transformations f_1(Z_1) and f_2(Z_2). Even if we restrict the distribution of the independent vector Z, the indeterminacy remains. Suppose we have a transformation h : X ↦ [0, 1]^d that maps X to a uniform random vector Z ∈ [0, 1]^d. Then, given any measure-preserving automorphism g : [0, 1]^d ↦ [0, 1]^d, it is clear that g ∘ h will be another solution to the ICA problem of transforming X to a uniform random vector.

So, such a destructive mapping, if it exists, is not unique. But what about the existence of such a mapping? One can show this via the following constructive mapping.

Suppose we set Z_1 = F_1(X_1) ∼ Unif[0, 1]. And for j = 2, …, d, denote by F_j(x; z_1, …, z_{j−1}) = P(X_j ≤ x | Z_1 = z_1, …, Z_{j−1} = z_{j−1}) the conditional CDF of X_j conditioned on the previous j−1 uniform random variables {Z_ℓ}_{ℓ=1}^{j−1}, and set Z_j = F_j(X_j; Z_1, …, Z_{j−1}). It can be seen that (Z_1, …, Z_d) ∼ Unif[0, 1]^d. This is simply the multivariate extension of the classical univariate CDF transformation result that for any real-valued random variable V ∈ R, if F_V is its CDF, then F_V(V) ∼ Unif[0, 1]. Thus, stitching these conditional CDF transformations together, we get the mapping Z = h(X). The main caveat with this constructive mapping is that such conditional CDFs are difficult to estimate for multivariate data.

9.2 Density Estimation

The other reason we might want to learn an imperfect mapping h_θ(·) from X to Z is that, in the limit where we are able to truly convert X to a Z with known distribution, we can recover the density of X by the change of variables formula applied to the transformation h_θ^{-1}(Z), so that

p_{h_θ^{-1}(Z)}(x) = p_Z(h_θ(x)) |det J h_θ(x)|,

as with normalizing flows. Since h_θ is an imperfect transformation of X to Z, similarly h_θ^{-1} will be an imperfect transformation from Z to X, which is the case with constructive approaches such as normalizing flows as well. A more critical concern with the destructive learning approach is computational/practical.

Consider the objective:

inf_θ KL(P_{h_θ(X)}, P_Z) = inf_θ ∫_z p_{h_θ(X)}(z) log [ p_{h_θ(X)}(z) / p_Z(z) ] dz.

It is not clear how to optimize this objective given just samples {x_i}_{i=1}^n drawn from P_X, since without access to the true density p_X we might not be able to evaluate the transformed density p_{h_θ(X)}(z) (note that in the case of normalizing flows, we had access to the base density p_Z, and so could evaluate the density of transformations of this base density). But the following simple identity essentially reduces it to constructive learning:

Theorem 1 (Destructive-Constructive Identity)

KL(P_{h_θ(X)}, P_Z) = KL(P_X, P_{h_θ^{-1}(Z)}).

This theorem has (re-)appeared in many recent generative model papers; see for instance [Ballé et al., 2015, Papamakarios et al., 2017]. The proof just follows from some applications of the change of variables formula:

KL(P_{h_θ(X)}, P_Z)
  = ∫_z p_{h_θ(X)}(z) [ log p_{h_θ(X)}(z) − log p_Z(z) ] dz
  = ∫_x p_X(x) |det(∂x/∂z)| [ log( p_X(x) |det(∂x/∂z)| ) − log p_Z(h_θ(x)) ] |det(∂z/∂x)| dx
  = ∫_x p_X(x) [ log p_X(x) − log( p_Z(h_θ(x)) |det(∂z/∂x)| ) ] dx
  = ∫_x p_X(x) [ log p_X(x) − log p_{h_θ^{-1}(Z)}(x) ] dx
  = KL(P_X, P_{h_θ^{-1}(Z)}),

where ∂x/∂z = J h_θ^{-1}(h_θ(x)) and ∂z/∂x = J h_θ(x), and where we used the property of Jacobians that ∂z/∂x = (∂x/∂z)^{-1}, and the property of determinants that det(A^{-1}) = 1/det(A).

Thus, destructive learning is equivalent to constructive learning with invertible transformations, so that it is not clear why we should not simply use constructive learning approaches if we care about density estimation. One methodological advantage could be that we could use insights from other fields to obtain invertible “destructive” transformations. For instance, [Ballé et al., 2015] suggest the following “divisive normalization” transformation from X to Z:

U = H X
Z_i = U_i / ( β_i + Σ_{j=1}^d γ_{ij} |U_j|^{α_{ij}} )^{ε_i},

for some parameter matrices H, α, γ ∈ R^{d×d} and parameter vectors β, ε ∈ R^d, which they motivate from neuroscience considerations.

There are also particular classes of transformations where the solution of the destructive learning problem is easier. Consider the class of transformations of the following form:

H = { h : X ⊆ R^d ↦ R^d | h(x) = Ψ(Ax) },

where A ∈ R^{d×d}, and Ψ(u) = (Ψ_1(u_1), …, Ψ_d(u_d)), where {Ψ_j}_{j=1}^d are pointwise invertible univariate transformations. Thus, the transformations h(x) = Ψ(Ax) consist of a linear transformation, followed by coordinate-wise transformations.

We then wish to solve for:

inf_{Ψ,A} KL(P_{h_{Ψ,A}(X)}, P_Z).

The solution (Ψ*, A*) to this can be characterized simply:

A* = arg inf_A I(AX),
Ψ*_j = Φ^{-1} ∘ F_{⟨A*_j, X⟩},

where F_{⟨A*_j, X⟩} is the CDF of ⟨A*_j, X⟩ (the j-th coordinate of A*X), I(·) is the mutual information of a random vector, and Φ(·) is the standard Gaussian CDF. Minimizing I(AX) over matrices A is essentially the linear ICA problem, which aims to find a linear transformation of X that reduces the dependence among the transformed variables as much as possible. Ψ*_j, in turn, is simply a univariate Gaussianization transform: a composition of the univariate CDF of the linearly transformed variable with the inverse of the standard Gaussian CDF. Both of these have practical, if approximate, implementations.

To see why the solution has such a nice closed form, let us first define the marginal KL divergence, marginal-KL(P_U, P_V) := Σ_{j=1}^d KL(P_{U_j}, P_{V_j}), as the sum of the KL divergences between the corresponding d marginal distributions. From some algebraic calculations, we can then write

KL(P_{Ψ∘AX}, P_Z) = marginal-KL(P_{Ψ∘AX}, P_Z) + I(Ψ ∘ AX).

But for coordinate-wise invertible Ψ, I(Ψ ∘ AX) = I(AX), so that Ψ can be obtained by minimizing just the first term, which yields that it is the pointwise Gaussianization of the optimal linear transformation of X. Given this optimal Ψ*, the first term becomes zero, so that the objective then reduces to the second term, which is precisely the linear ICA objective I(AX). Thus, one could simply solve for linear ICA to obtain the matrix A.

One can also use the decomposition above to suggest a linear ICA algorithm. Note that if we restrict A to be orthogonal, we then have that:

KL(P_X, P_Z) = KL(P_{AX}, P_Z) = marginal-KL(P_{AX}, P_Z) + I(AX).

Since the LHS does not depend on A, we then get that:

arg inf_A I(AX) = arg sup_A marginal-KL(P_{AX}, P_Z),

so that we aim to find a linear transformation A that makes the coordinates of AX as non-Gaussian as possible. The overall optimal solution then seems very intuitive: A aims to find the directions under which the projections of X are most non-Gaussian, and Ψ then marginally Gaussianizes these transformed variables.

Both of the destructive transforms above might not seem that flexible. But one advantage of destructive transformations such as these is that we can iterate over them, destroying a bit of the structure in X in every iteration. So, starting with X_1 = X, in iteration t = 1, 2, … we find h_t = arg inf_{h∈H} KL(P_{h(X_t)}, P_Z), and then destroy X_t as X_{t+1} = h_t(X_t). A consistency property that would be good to have is that KL(P_{X_t}, P_Z) → 0. This was shown to indeed be the case with the Gaussianization transformation above [Chen and Gopinath, 2001].
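
A rough sketch of such an iteration (my own simplified illustration, with PCA standing in for the linear ICA step and smoothed empirical CDFs for the marginal Gaussianization; this is not the paper's algorithm):

```python
import numpy as np
from scipy.stats import norm

def marginal_gaussianize(x):
    # Psi_j = Phi^{-1} o F_j, with F_j a smoothed empirical CDF of column j
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    u = ranks / (n + 1.0)                    # in (0, 1), avoids infinities
    return norm.ppf(u)

def rotate(x):
    # PCA rotation, a crude stand-in for the linear ICA step
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt.T

rng = np.random.default_rng(0)
# Structured, non-Gaussian data: a squashed 2-D "banana" shape
z = rng.normal(size=(5000, 2))
x = np.stack([z[:, 0], 0.3 * z[:, 1] + z[:, 0] ** 2], axis=1)

for t in range(10):                          # destroy a bit of structure per step
    x = marginal_gaussianize(rotate(x))
    corr = np.corrcoef(x, rowvar=False)[0, 1]
    print(f"iteration {t}: off-diagonal correlation {corr:.3f}")
# The marginals are exactly Gaussianized at each step; the rotation is what lets
# successive iterations chip away at the remaining dependence between coordinates
# (a proper ICA rotation, or random rotations, would be used in practice).
```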

One could also view the iterates above as greedily learnt stacked destructors: X_{t+1} = h_t ∘ ⋯ ∘ h_1(X). Given the equivalence at the beginning of the section, Inouye and Ravikumar [2018] thus suggested the following general algorithm for destructive learning:

g_t = arg inf_{g∈G} KL(P_{X_t}, P_{g(Z)}),

for some simple class of invertible functions G, and then use X_{t+1} = g_t^{-1}(X_t). This generalizes the Gaussianization transforms above to a much larger class of generative models: we simply fit a generative model over the current data iterate X_t, and then use this generative model to extract a destructive transform with which to further transform the data, and iterate. Note that such a destructive iterative algorithm is much more computationally feasible than a constructive iterative algorithm that would aim to solve:

g_t = arg inf_{g∈G} KL(P_X, P_{g(Z_t)}),

where Z_t = g_{t−1} ∘ ⋯ ∘ g_1(Z), and where the main computational concern is the computation of the densities of Z_t.

Inouye and Ravikumar [2018] also show that even if we simply solve for simple or shallow density estimation via:

Q_t = arg inf_{Q∈Q} KL(P_{X_t}, P_Q),

one can in most cases extract a destructive transform h(·) from P_Q such that P_Q = P_{h(Z)}. This further increases the ease of each destructive iteration: one performs shallow density estimation using one's method of choice, extracts the corresponding destructive transform, uses it to further transform the data, and iterates. So, by a series of shallow density estimation procedures, we are able to fit a “deep” density destructively.

One simple approach to extract the invertible destructor h(·), for a given Q (so that Ph(Z) ≡
PQ ) is to use the conditional univariate CDF transformations discussed earlier. But in
most cases such destructive transforms can be obtained even more simply. See [Inouye and
Ravikumar, 2018] for examples with many commonly used shallow densities.

References
Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. The Journal of Machine Learning Research, 16(1):3813–3847, 2015.

David I. Inouye, Eunho Yang, Genevera I. Allen, and Pradeep Ravikumar. A review of multivariate distributions for count data derived from the Poisson distribution. Wiley Interdisciplinary Reviews: Computational Statistics, 9(3):e1398, 2017.

Geoffrey E. Hinton, Terrence J. Sejnowski, et al. Learning and relearning in Boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1(282-317):2, 1986.

Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791–798, 2007.

Miguel A. Carreira-Perpinan and Geoffrey Hinton. On contrastive divergence learning. In International Workshop on Artificial Intelligence and Statistics, pages 33–40. PMLR, 2005.

Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density ANOVA estimation, with application to nonparametric graphical model building. Statistica Sinica, pages 353–374, 2006.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.

Bingbin Liu, Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. Analyzing and improving the optimization landscape of noise-contrastive estimation. arXiv preprint arXiv:2110.11271, 2021.

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Frederic Koehler, Alexander Heckett, and Andrej Risteski. Statistical efficiency of score matching: The view from isoperimetry. arXiv preprint arXiv:2210.00726, 2022.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and variational inference in deep latent Gaussian models. In International Conference on Machine Learning, volume 2, 2014.

Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281, 2015.

Scott Saobing Chen and Ramesh A. Gopinath. Gaussianization. In Advances in Neural Information Processing Systems, pages 423–429, 2001.

David Inouye and Pradeep Ravikumar. Deep density destructors. In International Conference on Machine Learning, pages 2167–2175, 2018.
