Exercises for the course Machine Learning 2
Summer semester 2021
Email: [email protected]

Abteilung Maschinelles Lernen
Institut für Softwaretechnik und theoretische Informatik
Fakultät IV, Technische Universität Berlin
Prof. Dr. Klaus-Robert Müller

Exercise Sheet 1
Exercise 1: Symmetries in LLE (25 P)
The Locally Linear Embedding (LLE) method takes as input a collection of data points $\vec{x}_1, \dots, \vec{x}_N \in \mathbb{R}^d$ and embeds them in some low-dimensional space. LLE operates in two steps, with the first step consisting of minimizing the objective
$$
E(w) = \sum_{i=1}^{N} \Big\| \vec{x}_i - \sum_j w_{ij} \vec{x}_j \Big\|^2
$$
where $w$ is a collection of reconstruction weights subject to the constraint $\forall i: \sum_j w_{ij} = 1$, and where $\sum_j$ sums over the $K$ nearest neighbors of the data point $\vec{x}_i$. The solution that minimizes the LLE objective can be shown to be invariant to various transformations of the data.
Show that invariance holds in particular for the following transformations:
(a) Replacement of all $\vec{x}_i$ with $\alpha \vec{x}_i$, for an $\alpha \in \mathbb{R}^+ \setminus \{0\}$,
(b) Replacement of all $\vec{x}_i$ with $\vec{x}_i + \vec{v}$, for a vector $\vec{v} \in \mathbb{R}^d$,
(c) Replacement of all $\vec{x}_i$ with $U \vec{x}_i$, where $U$ is an orthogonal $d \times d$ matrix.
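Note: the objective above is easy to evaluate numerically, which can help build intuition for the invariances before proving them. The following is a minimal Python/NumPy sketch (not part of the exercise); the function name lle_objective and the assumption that the neighbor indices and weights are already given are illustrative choices.

    import numpy as np

    def lle_objective(X, W, neighbors):
        # X: (N, d) data matrix, W: (N, K) reconstruction weights with rows summing to 1,
        # neighbors: (N, K) integer indices of the K nearest neighbors of each point.
        E = 0.0
        for i in range(len(X)):
            reconstruction = W[i] @ X[neighbors[i]]    # sum_j w_ij * x_j
            E += np.sum((X[i] - reconstruction) ** 2)  # || x_i - sum_j w_ij x_j ||^2
        return E

For a fixed W, comparing lle_objective(X, W, neighbors) with, e.g., lle_objective(2 * X, W, neighbors) or lle_objective(X + v, W, neighbors) illustrates how the objective behaves under the transformations in (a) and (b).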
Exercise 2: Closed form for LLE (25 P)
In the following, we would like to show that the optimal weights w have an explicit analytic solution. For
this, we first observe that the objective function can be decomposed as a sum of as many subobjectives as
there are data points:
$$
E(w) = \sum_{i=1}^{N} E_i(w) \qquad \text{with} \qquad E_i(w) = \Big\| \vec{x}_i - \sum_j w_{ij} \vec{x}_j \Big\|^2
$$
Furthermore, because each subobjective depends on different parameters, they can be optimized independently. We consider one such subobjective and, for simplicity of notation, rewrite it as:
$$
E_i(w) = \Big\| \vec{x} - \sum_{j=1}^{K} w_j \vec{\eta}_j \Big\|^2
$$
where $\vec{x}$ is the current data point (we have dropped the index $i$), where $\eta = (\vec{\eta}_1, \dots, \vec{\eta}_K)$ is a matrix of size $K \times d$ containing the $K$ nearest neighbors of $\vec{x}$, and $w$ is the vector of size $K$ containing the weights to optimize, subject to the constraint $\sum_{j=1}^{K} w_j = 1$.
(a) Prove that the optimal weights for $\vec{x}$ are found by solving the following optimization problem:
$$
\min_{w} \; w^\top C w \quad \text{subject to} \quad w^\top \mathbf{1} = 1,
$$
where $C = (\mathbf{1}\vec{x}^\top - \eta)(\mathbf{1}\vec{x}^\top - \eta)^\top$ is the covariance matrix associated to the data point $\vec{x}$ and $\mathbf{1}$ is a vector of ones of size $K$.
(b) Show using the method of Lagrange multipliers that the solution of the optimization problem found in (a) is given analytically as:
$$
w = \frac{C^{-1} \mathbf{1}}{\mathbf{1}^\top C^{-1} \mathbf{1}}.
$$
(c) Show that the optimal $w$ can be equivalently found by solving the equation $Cw = \mathbf{1}$ and then rescaling $w$ such that $w^\top \mathbf{1} = 1$.
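Note: the closed form in (b) and the procedure in (c) can be compared numerically. Below is a small NumPy sketch (not part of the exercise), assuming the neighbors of a point are stacked as rows of a $K \times d$ array eta; the small regularization term added to C is a common practical safeguard rather than something required here.

    import numpy as np

    def lle_weights(x, eta, reg=1e-6):
        # x: (d,) data point, eta: (K, d) matrix whose rows are its K nearest neighbors.
        K = eta.shape[0]
        G = x[None, :] - eta                     # rows of (1 x^T - eta), shape (K, d)
        C = G @ G.T                              # local covariance matrix, shape (K, K)
        C = C + reg * np.trace(C) * np.eye(K)    # regularize in case C is (near-)singular
        w = np.linalg.solve(C, np.ones(K))       # solve C w = 1
        return w / w.sum()                       # rescale so that w^T 1 = 1

Up to the regularization, the returned vector coincides with the expression $C^{-1}\mathbf{1} / (\mathbf{1}^\top C^{-1}\mathbf{1})$ from (b).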
Exercise 3: SNE and Kullback-Leibler Divergence (25 P)
SNE is an embedding algorithm that operates by minimizing the Kullback-Leibler divergence between two
discrete probability distributions p and q representing the input space and the embedding space respectively.
In ‘symmetric SNE’, these discrete distributions assign to each pair of data points (i, j) in the dataset the
probability scores pij and qij respectively, corresponding to how close the two data points are in the input
and embedding spaces. Once the exact probability functions are defined, the embedding algorithm proceeds
by optimizing the function:
$$
C = D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i=1}^{N} \sum_{j=1}^{N} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$
where $p$ and $q$ are subject to the constraints $\sum_{i=1}^{N} \sum_{j=1}^{N} p_{ij} = 1$ and $\sum_{i=1}^{N} \sum_{j=1}^{N} q_{ij} = 1$. Specifically, the algorithm minimizes $C$ with respect to $q$, which is itself a function of the coordinates in the embedded space. Optimization is typically performed using gradient descent.
In this exercise, we derive the gradient of the Kullback-Leibler divergence, first with respect to the probability scores $q_{ij}$, and then with respect to the embedding coordinates of which $q_{ij}$ is a function.
(a) Show that
$$
\frac{\partial C}{\partial q_{ij}} = -\frac{p_{ij}}{q_{ij}}. \tag{1}
$$
(b) The probability matrix $q$ is now reparameterized using a ‘softargmax’ function:
$$
q_{ij} = \frac{\exp(z_{ij})}{\sum_{k=1}^{N} \sum_{l=1}^{N} \exp(z_{kl})}
$$
The new variables $z_{ij}$ can be interpreted as unnormalized log-probabilities. Show that
$$
\frac{\partial C}{\partial z_{ij}} = -p_{ij} + q_{ij}. \tag{2}
$$
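Note: identity (2) lends itself to a quick finite-difference check. The sketch below (not part of the exercise) compares the analytic gradient with numerical differentiation of the objective, using NumPy and illustrative variable names.

    import numpy as np

    def kl_from_logits(P, Z):
        Q = np.exp(Z) / np.exp(Z).sum()      # softargmax over all N*N entries
        return np.sum(P * np.log(P / Q))     # D_KL(p || q)

    N = 5
    rng = np.random.default_rng(0)
    P = rng.random((N, N)); P /= P.sum()     # an arbitrary valid distribution p
    Z = rng.standard_normal((N, N))          # unnormalized log-probabilities
    Q = np.exp(Z) / np.exp(Z).sum()

    grad_analytic = -P + Q                   # equation (2)
    grad_numeric = np.zeros_like(Z)
    eps = 1e-6
    for i in range(N):
        for j in range(N):
            Zp, Zm = Z.copy(), Z.copy()
            Zp[i, j] += eps; Zm[i, j] -= eps
            grad_numeric[i, j] = (kl_from_logits(P, Zp) - kl_from_logits(P, Zm)) / (2 * eps)

    print(np.abs(grad_analytic - grad_numeric).max())   # should be close to zero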
(c) Explain which of the two gradients, (1) or (2), is more appropriate for practical use in a gradient descent
algorithm. Motivate your choice, first in terms of the stability or boundedness of the gradient, and second
in terms of the ability to maintain a valid probability distribution during training.
(d) The scores $z_{ij}$ are now reparameterized as
$$
z_{ij} = -\|\vec{y}_i - \vec{y}_j\|^2
$$
where the coordinates $\vec{y}_i, \vec{y}_j \in \mathbb{R}^h$ of data points in embedded space now appear explicitly. Show using the chain rule for derivatives that
$$
\frac{\partial C}{\partial \vec{y}_i} = 4 \sum_{j=1}^{N} (p_{ij} - q_{ij}) \cdot (\vec{y}_i - \vec{y}_j).
$$
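Note: the gradient in (d) can be checked numerically in the same way, provided $p$ is chosen symmetric as in symmetric SNE. A brief NumPy sketch (not part of the exercise), with illustrative names:

    import numpy as np

    def sne_objective(P, Y):
        D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)   # squared distances ||y_i - y_j||^2
        Q = np.exp(-D) / np.exp(-D).sum()                           # q_ij induced by z_ij = -||y_i - y_j||^2
        return np.sum(P * np.log(P / Q)), Q

    N, h = 6, 2
    rng = np.random.default_rng(0)
    P = rng.random((N, N)); P = P + P.T; P /= P.sum()    # symmetric p summing to one
    Y = rng.standard_normal((N, h))                      # embedding coordinates

    _, Q = sne_objective(P, Y)
    diff = Y[:, None, :] - Y[None, :, :]                              # (y_i - y_j), shape (N, N, h)
    grad_analytic = 4 * np.sum((P - Q)[:, :, None] * diff, axis=1)    # formula from (d)

    eps = 1e-6
    grad_numeric = np.zeros_like(Y)
    for i in range(N):
        for a in range(h):
            Yp, Ym = Y.copy(), Y.copy()
            Yp[i, a] += eps; Ym[i, a] -= eps
            grad_numeric[i, a] = (sne_objective(P, Yp)[0] - sne_objective(P, Ym)[0]) / (2 * eps)

    print(np.abs(grad_analytic - grad_numeric).max())   # should be close to zero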
Exercise 4: Programming (25 P)
Download the programming files on ISIS and follow the instructions.