Lecture Maths
Marc Toussaint
April, 2022
This script is primarily based on a lecture I gave from 2015 to 2019 in Stuttgart. The current
form also integrates notes and exercises from the Optimization lecture, and a little
material from my Robotics and ML lectures. The first full version was from 2019; since
then I have occasionally updated it.
Contents
1 Speaking Maths
1.1 Describing systems
1.2 Should maths be taught with many application examples? Or abstractly?
1.3 Notation: Some seeming trivialities
3 Linear Algebra
3.1 Vector Spaces
3.1.1 Why should we care for vector spaces in intelligent systems research?; 3.1.2 What is a vector?; 3.1.3 What is a vector space?
3.7 Eigendecomposition
3.7.1 Power Method; 3.7.2 Power Method including the smallest eigenvalue; 3.7.3 Why should I care about Eigenvalues and Eigenvectors?
3.8 Beyond this script: Numerics to compute these things
3.9 Derivatives as 1-forms, steepest descent, and the covariant gradient
3.9.1 The coordinate-free view: A derivative takes a change-of-input vector as input, and returns a change of output; 3.9.2 Contra- and co-variance; 3.9.3 Steepest descent and the covariant gradient vector
4 Optimization
4.1 Downhill algorithms for unconstrained optimization
4.1.1 Why you shouldn't trust the magnitude of the gradient; 4.1.2 Ensuring monotone and sufficient decrease: Backtracking line search, Wolfe conditions, & convergence; 4.1.3 The Newton direction; 4.1.4 Least Squares & Gauss-Newton: a very important special case; 4.1.5 Quasi-Newton & BFGS: approximating the Hessian from gradient observations; 4.1.6 Conjugate Gradient; 4.1.7 Rprop*
4.2 The general optimization problem – a mathematical program
4.3 The KKT conditions
4.4 Unconstrained problems to tackle a constrained problem
4.4.1 Augmented Lagrangian*
A Gaussian identities
B Further
Index
1 Speaking Maths
1.1 Describing systems
Systems can be described in many ways. Biologists describe their systems often using
text, and lots and lots of data. Architects describe buildings using drawings. Physicists
describe nature using differential equations, or optimality principles, or differential geometry
and group theory. The whole point of science is to find descriptions of systems—in
the natural sciences, descriptions that allow prediction; in the engineering sciences,
descriptions that enable the design of good, problem-solving systems.
And how should we describe intelligent systems? Robots, perception systems, machine
learning systems? I think there are two main categories: the imperative way in terms of
literal algorithms (code), or the declarative way in terms of formulating the problem. I
prefer the latter.
The point of this lecture is to teach you to speak maths, to use maths to describe
systems or problems. I feel that most maths courses rather teach to consume maths, or
solve mathematical problems, or prove things. Clearly, this is also important. But for the
purpose of intelligent systems research, it is essential to be skilled in expressing problems
mathematically, before even thinking about solving them and deriving algorithms.
If you happen to attend a Machine Learning or Robotics course you’ll see that every
problem is addressed the same way: You have an “intuitively formulated” problem; the
first step is to find a mathematical formulation; the second step to solve it. The second
step is often technical. The first step is really the interesting and creative part. This is
where you have to nail down the problem, i.e., nail down what it means to be successful
or performant – and thereby describe “intelligence”, or at least a tiny aspect of it.
The “Maths for Intelligent Systems” course will recap essentials of multi-variate func-
tions, linear algebra, optimization, and probabilities. These fields are essential to formu-
late problems in intelligent systems research and hopefully will equip you with the basics
of speaking maths.
1.2 Should maths be taught with many application examples? Or abstractly?
Maybe this is the wrong question and implies a view on maths I don't agree with. I think
(but this is arguable) maths is nothing but abstractions of real-world things. At least I
aim to teach maths as abstractions of real-world things. It is misleading to think that
there is “pure maths” and then “applications”. Instead mathematical concepts, such as
a vector, are abstractions of real-world things, such as faces, scenes, images, documents;
and theorems, methods and algorithms that apply on vectors of course also apply to all
the real-world things—subject to the limitations of this abstraction. So, the goal is not
to teach you a lookup table of which method can be used in which application, but
rather to teach which concepts maths offers to abstract real-world things—so that you
find such abstractions yourself once you’ll have to solve a real-world problem.
But yes, I believe that maths – in our context – should ideally be taught with many
exercises relating to AI problems. Perhaps the ideal would be:
– Teach Maths using AI exercises (where AI problems are formulated and treated analytically).
– Teach AI using coding exercises.
– Teach coding using maths-implementation exercises.
1.3 Notation: Some seeming trivialities
Equations and mathematical expressions have a syntax. This is hardly ever made explicit
and might seem trivial. But it is surprising how buggy mathematical statements can be
in scientific papers (and oral exams). I don't want to write much text about this, just
some bullet points:
– Decorations are ok, but really not necessary. It is much more important to declare
all things. E.g., there are all kinds of decorations used for vectors, $v$, $\boldsymbol{v}$, $\vec{v}$, $|v\rangle$,
and matrices. But these are not necessary. Properly declaring all symbols is much
more important.
– When declaring sets of indexed elements, I use the notation $\{x_i\}_{i=1}^n$. Similarly for
tuples: $(x_i)_{i=1}^n$, $(x_1, .., x_n)$, $x_{1:n}$.
– I usually use brackets $[a=b] \in \{0,1\}$ for the boolean indicator function of some
expression. An alternative notation is $\mathbb{I}(a=b)$, or the Kronecker symbol $\delta_{ab}$.
– One should distinguish between the infimum $\inf_x f(x)$ and supremum $\sup_x f(x)$
and the min/max: the inf/sup refer to limits, while the min/max to values actually
acquired by the function. I must admit I am sloppy in this regard and usually only
write min/max.
– Never use multiple letters for one thing. E.g., $length = 3$ means $l$ times $e$ times $n$
times $g$ times $t$ times $h$ equals 3.
$$f : \mathbb{R} \to \mathbb{R}, \quad x \mapsto \cos(x) \qquad (1)$$
The dot $\cdot$ is used to help define functions with only some arguments fixed, e.g., $f(x, \cdot): y \mapsto f(x,y)$.
Also, $e_i = (0, .., 0, 1, 0, .., 0)^\top \in \mathbb{R}^n$ often denotes the $i$th column of the identity
matrix, which are of course the coordinates of a basis vector $e_i \in V$ in a basis $(e_i)_{i=1}^n$.
We super quickly recap the basics of functions R → R, which the reader might already
know.
A polynomial of degree $p$ is of the form $f(x) = \sum_{i=0}^p a_i x^i$, which is a weighted
sum of monomials $1, x, x^2, \ldots$. Note that for multi-variate functions, the number
of monomials grows combinatorially with the degree and dimension. E.g., the
monomials of degree 2 in 3D space are $x_1^2, x_1 x_2, x_1 x_3, x_2^2, x_2 x_3, x_3^2$. In general,
we have $O(d^p)$ monomials of degree $p$ in $d$-dimensional space.
I assume it is clear what piece-wise means (we have different polynomials in disjoint
intervals covering the input domain R).
The above functions are all examples of parametric functions, which means
that they can be specified by a finite number of parameters $a_i$. E.g., when we
have a finite set of basis functions, the functions can all be described by the finite
set of weights in the linear combination.
However, in general a function f : R → R is an "infinite-dimensional object",
i.e., it has infinitely many degrees-of-freedom f(x), i.e., values at infinitely many
points x. In fact, sometimes it is useful to think of f as a "vector" of elements
$f_x$ with continuous index x. Therefore, the space of all possible functions,
and also the space of all continuous functions $C^0$ and smooth functions $C^\infty$, is
infinite-dimensional. General functions cannot be specified by a finite number of
parameters, and they are called non-parametric.
Core examples are functions used for regression or classification in Machine
Learning, which are a linear combination of an infinite set of basis functions. E.g.,
an infinite set of radial basis functions $\varphi(|x-c|)$ for all centers $c \in \mathbb{R}$. This
infinite set of basis functions spans a function space called Hilbert space (in the
ML context, "Reproducing Kernel Hilbert Space (RKHS)"), which is an infinite-dimensional
vector space. Elements in that space are called non-parametric.
As a final note, splines are parametric functions that are often used in robotics
and engineering in general. Splines usually are piece-wise polynomials that are
continuously joined. Namely, a spline of degree $p$ is in $C^{p-1}$, i.e., $(p-1)$-fold
continuously differentiable. A spline is not fully smooth, as the $p$-th derivative is
discontinuous at the knots.
Definition 2.1. The partial derivative of a function of multiple arguments $f(x_1, .., x_n)$
is the standard derivative w.r.t. only one of its arguments,
$$\frac{\partial}{\partial x_i} f(x_1, .., x_n) = \lim_{h \to 0} \frac{f(x_1, .., x_i + h, .., x_n) - f(x_1, .., x_n)}{h} .$$
2.2.2 Total derivative, computation graphs, forward and backward chain rules
Let me start with an example: We have three real-valued quantities x, g and f which
depend on each other. Specifically,
f (x, g) = 3x + 2g and g(x) = 2x . (5)
Question: What is the “derivative of f w.r.t. x”?
The correct answer is: Which one do you mean? The partial or total?
The partial derivative defined above really thinks of f (x, g) as a function of two argu-
ments, and does not at all care about whether there might be dependencies of these
arguments. It only looks at f (x, g) alone and takes the partial derivative (=derivative
w.r.t. one function argument):
$$\frac{\partial}{\partial x} f(x, g) = 3 \qquad (6)$$
However, if you suddenly talk about h(x) = f (x, g(x)) as a function of the argument x
only, that’s a totally different story, and
$$\frac{\partial}{\partial x} h(x) = \frac{\partial}{\partial x}\big[3x + 2(2x)\big] = 7 \qquad (7)$$
Bottom line, the definition of the partial derivative really depends on what you explicitly
defined as the arguments of the function.
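To make the distinction concrete, here is a minimal NumPy sketch (the functions and the finite-difference step are my own illustrative choices, not part of the script) that numerically recovers both numbers for the example above:

```python
import numpy as np

def f(x, g):            # f as a function of two arguments
    return 3*x + 2*g

def g_of(x):            # the dependency g(x)
    return 2*x

def h(x):               # h(x) = f(x, g(x)), a function of x only
    return f(x, g_of(x))

x0, eps = 1.0, 1e-6
# partial derivative: vary x while keeping the argument g fixed
partial = (f(x0 + eps, g_of(x0)) - f(x0, g_of(x0))) / eps   # -> 3
# total derivative: vary x and let g(x) change along with it
total = (h(x0 + eps) - h(x0)) / eps                          # -> 7
print(partial, total)
```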
More generally, consider a function network: each value $x_i$ is computed as a function of its parent values,
$$x_i = f_i(x_{\pi(i)}) \qquad (8)$$
where $x_{\pi(i)} = (x_j)_{j \in \pi(i)}$ is the tuple of parent values. This could also be called a
deterministic Bayes net.
In a function network all values can be computed deterministically if the input values
(which do have no parents) are given. Concerning differentiation, we may now ask:
Assume we have a variation dx of some input value, how do all other values vary? The
chain rules give the answer. It turns out there are two chain rules in function networks:
$$\frac{df}{dx} = \sum_{g \in \pi(f)} \frac{\partial f}{\partial g}\,\frac{dg}{dx} \qquad \Big(\text{with } \frac{dx}{dx} \equiv 1, \text{ in case } x \in \pi(f)\Big) \qquad (9)$$
Read this as follows: "The change of f with x is the sum of changes that come from
its direct dependence on g ∈ π(f), each multiplied by the change of g with x."
This rule defines the total derivative $\frac{df}{dx}$ of f w.r.t. x. Note how different these two
notions of derivatives are by definition: a partial derivative only looks at a function itself
and takes a limit of differences w.r.t. one argument—no notion of further dependencies.
The total derivative asks how, in a function network, one value changes with a change
of another.
The second version of the chain rule is:
$$\frac{df}{dx} = \sum_{g:\, x \in \pi(g)} \frac{df}{dg}\,\frac{\partial g}{\partial x} \qquad \Big(\text{with } \frac{df}{df} \equiv 1, \text{ in case } x \in \pi(f)\Big) \qquad (10)$$
Read this as follows: “The change of f with x is the sum of changes that arise from all
changes of g which directly depend on x.”
Figure 1 illustrates the fwd and bwd versions of the chain rule. The bwd version allows
you to propagate back, given gradients $\frac{df}{dg}$ from top to g, one step further down, from
top to x. The fwd version allows you to propagate forward, given gradients $\frac{dg}{dx}$ from g to
bottom, one step further up, from f to bottom. Both versions are recursive equations.
Figure 1: General Chain Rule. Left: Forward-Version, Right: Backward-Version.
The gray arc denotes the direct dependence $\frac{\partial f}{\partial x}$, which appears in the summations
via $dx/dx \equiv 1$, $df/df \equiv 1$.
If you recursively plug in the definitions for a given function network, both of
them yield the same expression of $\frac{df}{dx}$ in terms of partial derivatives only.
Let's compare to the chain rule as it is commonly found in other texts (written more
precisely):
$$\frac{d}{dx} f(g(x)) = \frac{\partial f(g)}{\partial g}\bigg|_{g = g(x)} \frac{dg(x)}{dx} \qquad (11)$$
Note that we here very explicitly notated that $\frac{\partial f(g)}{\partial g}$ considers f to be a function of
the argument g, which is evaluated at g = g(x). Written like this, the rule is fine. But
the above discussion and explicitly distinguishing between partial and total derivative is,
when things get complicated, less prone to confusion.
Let’s take the next step and consider functions f : Rn → Rd that map from n numbers
to a d-dimensional output. In this case, we can take the partial derivative of each output
w.r.t. each input argument, leading to a matrix of partial derivatives:
We define the Jacobian (the partial derivative matrix) as
$$\frac{\partial}{\partial x} f(x) = \begin{pmatrix}
\frac{\partial}{\partial x_1} f_1(x) & \frac{\partial}{\partial x_2} f_1(x) & \cdots & \frac{\partial}{\partial x_n} f_1(x) \\
\frac{\partial}{\partial x_1} f_2(x) & \frac{\partial}{\partial x_2} f_2(x) & \cdots & \frac{\partial}{\partial x_n} f_2(x) \\
\vdots & & & \vdots \\
\frac{\partial}{\partial x_1} f_d(x) & \frac{\partial}{\partial x_2} f_d(x) & \cdots & \frac{\partial}{\partial x_n} f_d(x)
\end{pmatrix} \qquad (12)$$
When the function has only one output dimension, $f : \mathbb{R}^n \to \mathbb{R}^1$, the partial derivative
can be written as a vector. Unlike many other texts, I advocate for consistency with the
Jacobian matrix (and contra-variance, see below) and define this to be a row vector,
$\frac{\partial}{\partial x} f(x) \in \mathbb{R}^{1 \times n}$, while the gradient $\nabla f(x) = \big[\frac{\partial}{\partial x} f(x)\big]^\top \in \mathbb{R}^n$ denotes the corresponding
column vector. In first order we have
$$f(x+\delta) - f(x) \;\dot{=}\; \partial f(x)\, \delta , \qquad (15)$$
where $\dot{=}$ denotes "in first order approximation". This equation holds, no matter if the
output space is $\mathbb{R}^d$ or $\mathbb{R}$, or the input space and variation is $\delta \in \mathbb{R}^n$ or $\delta \in \mathbb{R}$. In the
gradient notation we have
$$f(x+\delta) - f(x) \;\dot{=}\; \nabla f(x)^\top \delta . \qquad (16)$$
Jumping ahead to our later discussion of linear algebra: The above two equations are
written in coordinates. But note that the equations are truly independent of the choice
of vector space basis and independent of an optional metric or scalar product in V . The
transpose should not be understood as a scalar product between two vectors, but rather
as undoing the transpose in the definition of ∇f . All this is consistent to understanding
the derivatives as coordinates of a 1-form, as we will introduce it later.
Given a certain direction d (with |d| = 1) we define the directional derivative as
$\nabla f(x)^\top d$, and it holds
$$\nabla f(x)^\top d = \lim_{\epsilon \to 0} \frac{f(x + \epsilon d) - f(x)}{\epsilon} . \qquad (17)$$
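The first-order approximation and the directional derivative are easy to sanity-check numerically. The following NumPy sketch does this for a small test function of my own choosing (nothing here is prescribed by the script):

```python
import numpy as np

def f(x):                      # a smooth test function (hypothetical example)
    return np.sin(x[0]) + x[0] * x[1]**2

def grad_f(x):                 # its analytic gradient (as a 1D array)
    return np.array([np.cos(x[0]) + x[1]**2, 2*x[0]*x[1]])

x = np.array([0.7, -1.3])
d = np.random.randn(2); d /= np.linalg.norm(d)   # random unit direction

eps = 1e-6
fd = (f(x + eps*d) - f(x)) / eps                 # finite-difference directional derivative
print(fd, grad_f(x) @ d)                          # both agree up to O(eps)
```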
2.3.2 Hessian
Definition 2.5. We define the Hessian of a scalar function f : Rn → R as the
symmetric matrix
$$\nabla^2 f(x) = \frac{\partial}{\partial x} \nabla f(x) = \begin{pmatrix}
\frac{\partial^2}{\partial x_1 \partial x_1} f & \frac{\partial^2}{\partial x_1 \partial x_2} f & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f \\
\frac{\partial^2}{\partial x_2 \partial x_1} f & \frac{\partial^2}{\partial x_2 \partial x_2} f & \cdots & \frac{\partial^2}{\partial x_2 \partial x_n} f \\
\vdots & & & \vdots \\
\frac{\partial^2}{\partial x_n \partial x_1} f & \frac{\partial^2}{\partial x_n \partial x_2} f & \cdots & \frac{\partial^2}{\partial x_n \partial x_n} f
\end{pmatrix} \qquad (18)$$
The Hessian can be thought of as the Jacobian of ∇f . Using the Hessian, we can express
the 2nd order approximation of f as:
$$f(x+\delta) \;\ddot{=}\; f(x) + \partial f(x)\, \delta + \tfrac{1}{2} \delta^\top \nabla^2 f(x)\, \delta . \qquad (19)$$
For a uni-variate function f : R → R, the Hessian is just a single number, namely the
second derivative f''(x). In this section, let's call this the "curvature" of the function
(not to be confused with the Riemannian curvature of a manifold). In the uni-variate
case, we have the obvious cases:
If f''(x) > 0, the function is locally "curved upward" and convex (see also the
formal Definition ??).
In the multi-variate case $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix H is symmetric and we can
decompose it as $H = \sum_i \lambda_i h_i h_i^\top$ with eigenvalues $\lambda_i$ and eigenvectors $h_i$ (which we will
learn about in detail later). Importantly, all $h_i$ will be orthogonal to each other, forming
a nice orthonormal basis.
This insight gives us a very strong intuition on how the Hessian H describes the local
curvature of the function f : λi gives the directional curvature, i.e., the curvature in
the direction of eigenvector hi . If λi > 0, f is curved upward along hi ; if λi < 0, f is
curved downward along hi . Therefore, the eigenvalues λi tell us whether the function is
locally curved upward, downward, or flat in each of the orthogonal directions hi .
This becomes particularly intuitive if $\frac{\partial}{\partial x} f = 0$, i.e., the derivative (slope) of the
function is zero in all directions. When the curvatures λi are positive in all directions,
the function is locally convex (upward parabolic) and x is a local minimum; if the
curvatures λi are all negative, the function is concave (downward parabolic) and x is a
local maximum; if some curvatures are positive and some are negative along different
directions hi , then the function curves down in some directions, and up in others, and
x is a saddle point.
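This classification is easy to carry out numerically. Below is a small NumPy sketch (the function name, tolerance, and test Hessian are my own illustrative choices) that classifies a stationary point from the eigenvalues of its Hessian:

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point (where the gradient vanishes) from its Hessian H."""
    lam = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric Hessian
    if np.all(lam > tol):
        return "local minimum (curved upward in all directions)"
    if np.all(lam < -tol):
        return "local maximum (curved downward in all directions)"
    if np.any(lam > tol) and np.any(lam < -tol):
        return "saddle point (curved up in some directions, down in others)"
    return "degenerate (some directions are locally flat)"

print(classify_stationary_point(np.array([[2., 0.], [0., -1.]])))   # -> saddle point
```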
Again jumping ahead, in the coordinate-free notation, the second derivative would be a 2-form, as we discuss later.
In 1D, we have
$$f(x+v) \approx f(x) + f'(x)v + \tfrac{1}{2} f''(x)v^2 + \cdots + \tfrac{1}{k!} f^{(k)}(x) v^k \qquad (23)$$
For $f: \mathbb{R}^n \to \mathbb{R}$, we have
$$f(x+v) \approx f(x) + \nabla f(x)^\top v + \tfrac{1}{2} v^\top \nabla^2 f(x)\, v + \cdots \qquad (24)$$
which is equivalent to
$$f(x+v) \approx f(x) + \sum_j \frac{\partial}{\partial x_j} f(x)\, v_j + \frac{1}{2} \sum_{jk} \frac{\partial^2}{\partial x_j \partial x_k} f(x)\, v_j v_k + \cdots \qquad (25)$$
The next section will introduce linear algebra from scratch – here we first want to learn
how to practically deal with derivatives in matrix expressions. We think of matrices and
vectors simply as arrays of numbers ∈ Rn×m and Rn . As a warmup, try to solve the
following exercises:
For problem (v), we want to find a minimum for a matrix expression. We find this by
setting the derivative equal to zero. Here is the solution, and below details will become
clear:
$$0 = \frac{\partial}{\partial \beta}\Big( ||y - X\beta||^2 + \lambda ||\beta||^2 \Big) \qquad (32)$$
$$= 2(y - X\beta)^\top(-X) + 2\lambda\beta^\top \qquad (33)$$
$$0 = -X^\top(y - X\beta) + \lambda\beta \qquad (34)$$
$$0 = -X^\top y + (X^\top X + \lambda I)\beta \qquad (35)$$
$$\beta = (X^\top X + \lambda I)^{-1} X^\top y \qquad (36)$$
Line 2 uses a standard rule for the derivative (see below) and gives a row vector equation.
Line 3 transposes this to become a column vector equation.
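As a quick numerical check of this closed-form solution, here is a minimal NumPy sketch (data sizes, regularization, and random seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# closed-form ridge solution, eq. (36)
beta = np.linalg.solve(X.T @ X + lam*np.eye(d), X.T @ y)

# check: the derivative (34) vanishes at the solution
residual = -X.T @ (y - X @ beta) + lam*beta
print(np.allclose(residual, 0))          # True (up to numerical precision)
```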
As 2nd order terms are very common in AI methods, this is a very useful identity to
learn:
Identities 2.3.
$$\frac{\partial}{\partial x}\, f(x)^\top A\, g(x) = f(x)^\top A\, \frac{\partial}{\partial x} g(x) + g(x)^\top A^\top \frac{\partial}{\partial x} f(x) \qquad (37)$$
$$\nabla_x\big[ f(x)^\top A\, g(x)\big] = \Big[\frac{\partial}{\partial x} g(x)\Big]^\top A^\top f(x) + \Big[\frac{\partial}{\partial x} f(x)\Big]^\top A\, g(x) \qquad (38)$$
which I find impossible to remember, and mixes gradients-in-columns (∇) with gradients-
in-rows (the Jacobian) notation.
Special cases and variants of this identity are:
$$\frac{\partial}{\partial x}\,[\text{whatever}]\, x = [\text{whatever}] , \quad \text{if whatever is indep.\ of } x \qquad (39)$$
$$\frac{\partial}{\partial x}\, a^\top x = a^\top \qquad (40)$$
$$\frac{\partial}{\partial x}\, Ax = A \qquad (41)$$
$$\frac{\partial}{\partial x}\, (Ax - b)^\top(Cx - d) = (Ax - b)^\top C + (Cx - d)^\top A \qquad (42)$$
$$\frac{\partial}{\partial x}\, x^\top A x = x^\top A + x^\top A^\top \qquad (43)$$
$$\frac{\partial}{\partial x}\, ||x|| = \frac{\partial}{\partial x} (x^\top x)^{\frac12} = \frac12 (x^\top x)^{-\frac12}\, 2x^\top = \frac{x^\top}{||x||} \qquad (44)$$
$$\frac{\partial^2}{\partial x^2}\, (Ax + a)^\top C (Bx + b) = A^\top C B + B^\top C^\top A \qquad (45)$$
$$\frac{\partial}{\partial \theta}\, |A| = |A|\; \mathrm{tr}\Big(A^{-1} \frac{\partial}{\partial \theta} A\Big) \qquad (46)$$
$$\frac{\partial}{\partial \theta}\, A^{-1} = -A^{-1}\, \Big(\frac{\partial}{\partial \theta} A\Big)\, A^{-1} \qquad (47)$$
$$\frac{\partial}{\partial \theta}\, \mathrm{tr}(A) = \sum_i \frac{\partial}{\partial \theta} A_{ii} \qquad (48)$$
(54)=Woodbury; (56,57) holds for pos def A and B. See also the matrix cookbook.
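Identities like these are easy to verify empirically. As an example, the following NumPy sketch (sizes and tolerances are arbitrary choices of mine) checks identity (43) against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

# finite-difference row-vector derivative of f(x) = x^T A x
fd = np.array([((x + eps*e) @ A @ (x + eps*e) - x @ A @ x) / eps
               for e in np.eye(n)])
analytic = x @ A + x @ A.T        # identity (43): x^T A + x^T A^T
print(np.allclose(fd, analytic, atol=1e-4))
```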
An example from logistic regression: We have the loss gradient and want the Hessian:
This is your typical work procedure when implementing a Machine Learning or AI'ish or
Optimization kind of method:
– You first mathematically (on paper/LaTeX) formalize the problem domain, including the objective function.
– You derive analytically (on paper) the gradients/Hessian of your objective function.
– You implement the objective function and these analytic gradient equations in Matlab/Python/C++, using linear algebra packages.
– Only if that works, you put everything together, interfacing the objective & gradient equations with some optimization algorithm.
a) In 3D, note that $a \times b = \mathrm{skew}(a)\, b = -\mathrm{skew}(b)\, a$, where $\mathrm{skew}(v)$ is the skew matrix
of v. What is the gradient of $(a \times b)^2$ w.r.t. a and b?
$$J_{i,j,k,\ldots,\,l,m,n,\ldots} = \frac{\partial}{\partial x_{l,m,n,\ldots}}\, y_{i,j,k,\ldots}$$
The following exercises will require you to code basic functions and derivatives. You can
code in your preferred language (Matlab, NumPy, Julia, whatever).
(i) Implement the following pseudo code for empirical gradient checking in the pro-
gramming language of your choice:
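The original pseudo code is not reproduced here; the following is one possible realization of such an empirical gradient check in NumPy (function names, the symmetric-difference scheme, and the tolerances are my own choices):

```python
import numpy as np

def check_gradient(f, grad_f, x, eps=1e-6, tol=1e-4):
    """Compare the analytic Jacobian/gradient grad_f(x) against finite differences."""
    g_analytic = np.atleast_2d(grad_f(x))              # shape (d_out, d_in)
    g_empirical = np.zeros_like(g_analytic)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g_empirical[:, i] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2*eps)
    err = np.max(np.abs(g_analytic - g_empirical))
    return err < tol, err

# usage: check the gradient of f(x) = ||x||^2
ok, err = check_gradient(lambda x: x @ x, lambda x: 2*x, np.random.randn(5))
print(ok, err)
```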
Derive pseudo code to efficiently compute $\frac{df}{dx_0}$. (Ideally also for deeper networks.)
where $W_l \in \mathbb{R}^{h_{l+1} \times h_l}$ and $\sigma(z) = 1/(e^{-z} + 1)$ is the sigmoid function which is applied
element-wise. We established last time that
$$\frac{df}{dx_0} = \frac{\partial f}{\partial x_2}\,\frac{\partial x_2}{\partial z_2}\,\frac{\partial z_2}{\partial x_1}\,\frac{\partial x_1}{\partial z_1}\,\frac{\partial z_1}{\partial x_0}$$
with:
$$\frac{\partial x_l}{\partial z_l} = \mathrm{diag}\big(x_l \circ (1 - x_l)\big) , \qquad \frac{\partial z_{l+1}}{\partial x_l} = W_l , \qquad \frac{\partial f}{\partial x_2} = W_2$$
Note: In the following we still let f be a h3 -dimensional vector. For those that are
confused with the resulting tensors, simplify to f being a single scalar output.
(i) Derive also the necessary equations to get the derivative w.r.t. the weight matrices
Wl , that is the Jacobian tensor
$$\frac{df}{dW_l}$$
(ii) Write code to implement $f(x)$, $\frac{df}{dx_0}$, and $\frac{df}{dW_l}$.
To test this, choose layer sizes (h0 , h1 , h2 , h3 ) = (2, 10, 10, 2), i.e., 2 input and 2
output dimensions, and hidden layers of dimension 10.
For testing, choose random inputs sampled from x ∼ randn(2,1).
And choose random weight matrices $W_l \sim \frac{1}{\sqrt{h_{l+1}}}$ rand(h[l+1],h[l]).
(iii) Bonus: Try to train the network to become the identity mapping. In the sim-
plest case, use “stochastic gradient descent”, meaning that you sample an input,
compute the gradients $w_l = \frac{d\,(f(x)-x)^2}{dW_l}$, and make tiny updates $W_l \leftarrow W_l - \alpha w_l$.
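For orientation, here is a minimal NumPy sketch of the forward pass and the Jacobian $\frac{df}{dx_0}$, reconstructing the network as $f(x_0) = W_2\,\sigma(W_1\,\sigma(W_0 x_0))$ from the partial derivatives given above (this reconstruction and all details of the code are my own assumptions, not the exercise's reference solution):

```python
import numpy as np

def sigma(z):
    return 1.0 / (np.exp(-z) + 1.0)

h = (2, 10, 10, 2)                                # layer sizes (h0, h1, h2, h3)
rng = np.random.default_rng(0)
W = [rng.random((h[l+1], h[l])) / np.sqrt(h[l+1]) for l in range(3)]   # W0, W1, W2

def forward(x0):
    x1 = sigma(W[0] @ x0)
    x2 = sigma(W[1] @ x1)
    return W[2] @ x2, (x1, x2)

def df_dx0(cache):
    x1, x2 = cache
    # backward chain rule: df/dx0 = W2 diag(x2*(1-x2)) W1 diag(x1*(1-x1)) W0
    J = W[2] @ np.diag(x2*(1 - x2)) @ W[1]
    J = J @ np.diag(x1*(1 - x1)) @ W[0]
    return J                                      # shape (h3, h0)

x0 = rng.standard_normal(h[0])
f, cache = forward(x0)
J = df_dx0(cache)

# empirical check of the Jacobian
eps = 1e-6
J_fd = np.stack([(forward(x0 + eps*e)[0] - f) / eps for e in np.eye(h[0])], axis=1)
print(np.allclose(J, J_fd, atol=1e-4))
```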
where $x_i \in \mathbb{R}^d$ is the $i$th row of a matrix $X \in \mathbb{R}^{n \times d}$, and $y \in \{0,1\}^n$ is a vector of 0s and
1s only. Here, $\sigma(z) = 1/(e^{-z} + 1)$ is the sigmoid function, with $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
Derive the gradient $\frac{\partial}{\partial \beta} L(\beta)$, as well as the Hessian
$$\nabla^2 L(\beta) = \frac{\partial^2}{\partial \beta^2} L(\beta) .$$
3 Linear Algebra
3.1 Vector Spaces
3.1.1 Why should we care for vector spaces in intelligent systems research?
Actually, some of these spaces are not vector spaces at all. E.g. the configuration space
of a robot might have 'holes', be a manifold with complex topology, or not even that
(switch dimensionality at some places). But to do computations in these spaces one
always either introduces (local) parameterizations that make them a vector space,² or
one focuses on local tangent spaces (local linearizations) of these spaces, which are
vector spaces.
Perhaps the most important computation we want to do in these spaces is taking
derivatives—to set them equal to zero, or do gradient descent, or Newton steps for
optimization. But taking derivatives essentially requires the input space to (locally) be
a vector space.³ So, we also need vector spaces because we need derivatives, and Linear
Algebra to deal with the resulting equations.

² E.g., by definition an n-dimensional manifold X is locally isomorphic to $\mathbb{R}^n$.
Definition 3.1 (vector space). A vector space V is a space (=set) on which two
operations, addition and multiplication, are defined as follows
Roughly, this definition says that a vector space is “closed under linear operations”,
meaning that we can add and scale vectors and they remain vectors.
In this section we explain what might be obvious: that once we have a basis, we can
write vectors as (column) coordinate vectors, 1-forms as (row) coordinate vectors, and
linear transformations as matrices. Only the last subsection becomes more practical,
refers to concrete exercises, and explains how in practice not to get confused about bases.

³ Also when the space is actually a manifold; the differential is defined as a 1-form on the local
tangent.
For simplicity we consider only functions involving a single vector space V . But all that
is said transfers to the case when multiple vector spaces V, W, ... were involved.
Many names are used for special linear functions—let’s make some explicit:
– f : V → R, called linear functional4 , or 1-form, or dual vector.
– f : V → V , called linear function, or linear transform, or vector-valued 1-form
– f : V × V → R, called bilinear functional, or 2-form
– f : V × V × V → R, called 3-form (or unspecifically ’multi-linear functional’)
– f : V × V → V , called vector-valued 2-form (or unspecifically ’multi-linear map’)
– f : V × V × V → V × V , called bivector-valued 3-form
– f : V k → V m , called m-vector-valued k-form
This gives us a simple taxonomy of linear functions based on how many vectors a function
eats, and how many it outputs. To give examples, consider some space X of systems
(examples above), which might itself not be a vector space. But locally, around a specific
x ∈ X, its tangent V is a vector space. Then
– f : X → R could be a cost function over the system space.
– The differential df |x : V → R is a 1-form, telling us how f changes when ‘making a tangent
step’ v ∈ V .
– The 2nd derivative d2 f |x : V × V → R is a 2-form, telling us how df |x (v) changes when
‘making a tangent step’ w ∈ V .
– The inner product h·, ·i : V × V → R is a 2-form.
Another example:
– f : Ri → Ro is a neural network that maps i input signals to o output signals.
– Its derivative df |x : Ri → Ro is a vector-valued 1-form, telling us how each output changes
with a step v ∈ Ri in the input.
– Its 2nd derivative d2 f |x : Ri × Ri → Ro is a vector-valued 2-form.
4 The word ’functional’ instead of ’function’ is especially used when V is a space of functions.
This is simply to show that vector-valued functions, 1-forms, and 2-forms are common.
Instead of being a neural network, f could also be a mapping from one parameterization
of a system to another, or the mapping from the joint angles of a robot to its hand
position.
We need to define some notions. I’m not commenting on these definitions—train yourself
in reading maths...
Definition 3.5. $\{v_i\}_{i=1}^n$ linearly independent $\;\Leftrightarrow\; \Big[\sum_i \alpha_i v_i = 0 \;\Rightarrow\; \forall i:\ \alpha_i = 0\Big]$
Definition 3.8. The tuple $(v_1, v_2, .., v_n) \in \mathbb{R}^n$ is called coordinates of $v \in V$ in
the basis $(e_i)_{i=1}^n$ iff $v = \sum_i v_i e_i$.
Note that Rn is also a vector space, and therefore coordinates v1:n ∈ Rn are also vectors,
but in Rn , not V . So coordinates are vectors, but vectors in general not coordinates.
Given a basis $(e_i)_{i=1}^n$, we can describe every vector v as a linear combination $v = \sum_i v_i e_i$
of basic elements—the basis vectors ei . This general idea, that “linear things” can be
described as linear combinations of “basic elements” carries over also to functions. In
fact, to all the types of functions we described above: 1-forms, 2-forms, bi-vector-valued
k-forms, whatever. And if we describe all these as linear combinations of basic elements
we automatically also introduce coordinates for these things. To get there, we first have
to introduce a second type of “basic elements”: 1-forms.
First, it is easy to see that V ∗ is also a vector space: We can add two linear functionals,
f = f1 + f2 , and scale them, and it remains a linear functional.
Second, given a basis $(e_i)_{i=1}^n$ of V, we define a corresponding dual basis $(\hat e_i)_{i=1}^n$ of $V^*$
simply by
$$\hat e_i(v) = v_i .$$
That is, $\hat e_i$ is the 1-form that simply maps a vector to its $i$th coordinate. It can be shown
that $(\hat e_i)_{i=1}^n$ is in fact a basis of $V^*$. (Omitted.) That tells us a lot!
dim(V ∗ ) = dim(V ). That is, the space of 1-forms has the same dimension as V . At
this place, geometric intuition should kick in: indeed, every linear function over V could
be envisioned as a “plane” over V . Such a plane can be illustrated by its iso-lines and
these can be uniquely determined by their orientation and distance (same dimensionality
as V itself). Also, (assuming we’d know already what a transpose or scalar product is)
every 1-form must be of the form f (v) = c>v for some c ∈ V —so every f is uniquely
described by a c ∈ V . Showing that the vector space V and its dual V ∗ are really twins.
The dual basis $(\hat e_i)_{i=1}^n$ introduces coordinates in the dual space: Every 1-form f can be
described as a linear combination of basis 1-forms,
$$f = \sum_i f_i\, \hat e_i \qquad (70)$$
$$\mathrm{span}\big(\{\hat e_i\}_{i=1}^n\big) = V^* . \qquad (71)$$
We now have the basic elements: the basis vectors $(e_i)_{i=1}^n$ of V, and basis 1-forms
$(\hat e_i)_{i=1}^n$ of $V^*$. From these, we can describe, for instance, any bivector-valued 3-form as
a linear combination as follows:
$$f : V \times V \times V \to V \times V \qquad (72)$$
$$f = \sum_{ijklm} f_{ijklm}\; e_i \otimes e_j \otimes \hat e_k \otimes \hat e_l \otimes \hat e_m \qquad (73)$$
columns                    | rows
vector                     | co-vector, 1-form, derivative
output space               | input space
co-variant                 | contra-variant
contra-variant coordinates | co-variant coordinates
The above was rather abstract. The exercises demonstrate representing vectors and
transformations with coordinates and matrices in different input and output bases. We
just summarize here:
– We have two bases $A = (a_1, .., a_n)$ and $B = (b_1, .., b_n)$, and the transformation
T that maps each $a_i$ to $b_i$, i.e., $B = T A$.
– Given a vector x we denote its coordinates in A by $[x]_A$ or briefly as $x^A$. And
we denote its coordinates in B as $[x]_B$ or $x^B$. E.g., $x^A_i$ is the $i$th coordinate in
basis A.
– $[b_i]_A$ are the coordinates of the new basis vectors in the old basis. The coordinate
transformation matrix B is given with elements $B_{ij} = ([b_j]_A)_i$. Note that
$$[x]_A = B\, [x]_B ,$$
i.e., while the basis transform T carries old basis vectors $a_i$ to new basis vectors $b_i$,
the matrix B carries coordinates $[x]_B$ in the new basis to coordinates $[x]_A$ in the
old basis! This is the origin of understanding that coordinates are contra-variant.
– Given a linear transform f in the vector space, we can represent it as a matrix
in four ways, using basis A or B in the input and output spaces, respectively. If
$[f]_{AA} = F$ is the matrix in old coordinates (using A for input and output), then
$[f]_{BB} = B^{-1} F B$ is its matrix in new coordinates, $[f]_{AB} = F B$ is its matrix using
B for the input and A for the output space, and $[f]_{BA} = B^{-1} F$ is the matrix using
A for input and B for output space.
– T itself is also a linear transform. $[T]_{AA} = B$ is its matrix in old coordinates.
And the same $[T]_{BB} = B$ is also its matrix in new coordinates! $[T]_{BA} = I$ is its
matrix when using A for input and B for output space. And $[T]_{AB} = B^2$ is its
matrix using B for input and A for output space.
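These relations are easy to verify numerically. The following NumPy sketch uses an example basis and transform of my own choosing (none of the concrete numbers come from the script):

```python
import numpy as np

# new basis vectors b_i expressed in the old basis A -> columns of B
B = np.array([[1., 2.],
              [1., 1.]])
F = np.array([[2., 0.],          # some linear transform: matrix [f]_AA in old coordinates
              [1., 3.]])

x_B = np.array([1., -2.])        # coordinates of a vector in the new basis
x_A = B @ x_B                    # B carries new coordinates to old ones

# the four matrix representations of f
F_BB = np.linalg.inv(B) @ F @ B  # input B, output B
F_AB = F @ B                     # input B, output A
F_BA = np.linalg.inv(B) @ F      # input A, output B

# consistency: applying f in either coordinate system describes the same vector
print(np.allclose(F @ x_A, B @ (F_BB @ x_B)))   # True
print(np.allclose(F_AB @ x_B, F @ x_A))          # True
```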
Please note that so far we have not in any way referred to a scalar product or a transpose.
All the concepts above, dual vector space, bases, coordinates, matrix-vector multiplica-
tion, are fully independent of the notion of a scalar product or transpose. Columns
and rows naturally appear as coordinates of vectors and 1-forms. But now we need to
introduce scalar products.
$$\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R} . \qquad (84)$$
Definition 3.11. Given a scalar product, we define for every v ∈ V its dual v ∗ ∈ V ∗
as
$$v^* = \langle v, \cdot \rangle = \sum_i v_i \langle e_i, \cdot \rangle = \sum_i v_i\, e_i^* . \qquad (85)$$
Note that $\hat e_i$ and $e_i^*$ are in general different 1-forms! The canonical dual basis $(\hat e_i)_{i=1}^n$
is independent of an introduction of a scalar product; they were the basis to introduce
coordinates for linear functions, including matrices. And while such coordinates do
depend on a choice of basis $(e_i)_{i=1}^n$, they do not depend on a choice of scalar product.
The 1-forms $(e_i^*)_{i=1}^n$ also form a basis for $V^*$, but a different one to the canonical
basis, and one that depends on the notion of a scalar product. You can see this: the
coordinates $v_i$ of $v^*$ in the basis $(e_i^*)_{i=1}^n$ are identical to the coordinates $v_i$ of v in the
basis $(e_i)_{i=1}^n$, but different to the coordinates $(v^*)_i$ of $v^*$ in the basis $(\hat e_i)_{i=1}^n$.
Definition 3.12. Given a scalar product, a set of vectors {vi }ni=1 is called orthonor-
mal iff hvi , vj i = δij .
Definition 3.13. Given a scalar product and basis $(e_i)_{i=1}^n$, we define the metric
tensor $g_{ij} = \langle e_i, e_j \rangle$, which are the coordinates of the 2-form $\langle \cdot, \cdot \rangle$, that is
$$\langle \cdot, \cdot \rangle = \sum_{ij} g_{ij}\, \hat e_i \otimes \hat e_j . \qquad (86)$$
Although related, do not confuse gij with the usual definition of a metric d(·, ·) in a
metric space.
If we have an orthonormal basis $(e_i)_{i=1}^n$, many things simplify a lot. Throughout this
subsection, we assume $\{e_i\}$ orthonormal.
The metric tensor $g_{ij} = \langle e_i, e_j \rangle = \delta_{ij}$ is the identity matrix. Such a metric is
also called Euclidean. The norm $||e_i|| = 1$. The canonical dual basis $(\hat e_i)_{i=1}^n$ and
the one defined via the scalar product $(e_i^*)_{i=1}^n$ become identical, $\hat e_i = e_i^* = \langle e_i, \cdot \rangle$.
Consequently, v and $v^*$ have the same coordinates $v_i = (v^*)_i$ w.r.t. $(e_i)_{i=1}^n$ and
$(\hat e_i)_{i=1}^n$, respectively.
The coordinates of vectors can now easily be computed:
$$v = \sum_i v_i e_i \quad\Rightarrow\quad \langle e_i, v \rangle = \Big\langle e_i, \sum_j v_j e_j \Big\rangle = \sum_j \langle e_i, e_j \rangle\, v_j = v_i \qquad (88)$$
The coordinates of a linear transform can equally easily be computed: Given
a linear transform $f : V \to U$, an arbitrary (e.g. non-orthonormal) input basis
$(v_i)_{i=1}^n$ of V, but an orthonormal basis $(u_i)_i$ of U, then
$$f = \sum_{ij} f_{ij}\, u_i \otimes \hat v_j \quad\Rightarrow \qquad (89)$$
$$\langle u_i, f v_j \rangle = \Big\langle u_i, \sum_{kl} f_{kl}\, u_k \otimes \hat v_l\,(v_j) \Big\rangle = \sum_{kl} f_{kl}\, \langle u_i, u_k \rangle\, \hat v_l(v_j) = \sum_{kl} f_{kl}\, \delta_{ik}\delta_{lj} = f_{ij} \qquad (90)$$
The projection onto the span of several basis vectors $(e_1, .., e_k)$ is given by $\sum_{i=1}^k e_i \langle e_i, \cdot \rangle$.
The identity mapping $I : V \to V$ is given by $I = \sum_{i=1}^{\dim(V)} e_i \langle e_i, \cdot \rangle$.
The scalar product with an orthonormal basis is
$$\langle v, w \rangle = \sum_{ij} v_i w_j \delta_{ij} = \sum_i v_i w_i \qquad (91)$$
$$\langle v, w \rangle = (v_1\ v_2\ \cdots\ v_n) \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} = v^\top w = \sum_i v_i w_i , \qquad (92)$$
where for the first time we introduced the transpose which, in the matrix con-
vention, swaps columns to rows and rows to columns.
As a general note, a row vector “eats a vector and outputs a scalar”. That is v> : V → R
should be thought of as a 1-form! Due to the matrix conventions, it generally is the
case that “rows eat columns”, that is, every row index should always be thought of as
relating to a 1-form (dual vector), and every column index as relating to a vector. That
is totally consistent to our definition of coordinates.
For an orthonormal basis, $v^\top$ is the coordinate representation of the 1-form $v^*$. (Which
also says that the coordinates of the 1-form $v^*$ in the special basis $(e_i^*)_{i=1}^n \subset V^*$
coincide with the coordinates of the vector v.)
We focus here on linear transforms (or “linear maps”) f : V → U from one vector space
to another (or the same). It turns out that such transforms have a very specific and
intuitive structure, which is captured by the singular value decomposition.
Figure 2: A linear transformation $f = \sum_{i=1}^k \sigma_i u_i v_i^\top$ can be described as: take the input
x, project it onto the first input fundamental vector $v_1$ to yield a scalar, stretch/squeeze
it by $\sigma_1$, and "unproject" this into the first output fundamental vector $u_1$; repeat this
for all $i = 1, .., k$, and add up the results.
Theorem 3.1 (Singular Value Decomposition). Given two vector spaces V and U
with scalar products, dim(V) = n and dim(U) = m, for every linear transform f :
V → U there exist a k ≤ n, m and orthonormal vectors {vi }ki=1 ⊂ V, orthonormal
vectors {ui }ki=1 ⊂ U, and positive scalars σi > 0, i = 1, .., k, such that
$$f = \sum_{i=1}^k \sigma_i\, u_i\, v_i^* \qquad (94)$$
As above, vi∗ = hvi , ·i is the basis 1-form that picks the ith coordinate of a vector
in the basis (vi )ki=1 ⊂ V.a
a Note that {vi }ki=1 may not be a full basis of V if k < n. But because {vi } is orthonormal,
hvi , ·i uniquely picks the ith coordinate no matter how {vi }ki=1 is completed with further n − k
vectors to become a full basis.
Theorem 3.2 (Singular Value Decomposition). For every matrix A ∈ Rm×n there
exists a k ≤ n, m and orthonormal vectors {vi }ki=1 ⊂ Rn , orthonormal vectors
{ui } ⊂ Rm , and positive scalars σi > 0, i = 1, .., k, such that
$$A = \sum_{i=1}^k \sigma_i\, u_i v_i^\top = U S V^\top \qquad (95)$$
where $V = (v_1, .., v_k) \in \mathbb{R}^{n \times k}$, $U = (u_1, .., u_k) \in \mathbb{R}^{m \times k}$ contain the orthonormal
basis vectors as columns and $S = \mathrm{diag}(\sigma_1, .., \sigma_k)$.
Let me rephrase this in a sentence: Every matrix A can be expressed as a linear combination
of only k rank-1 matrices. Rank-1 matrices are the most minimalistic kinds of
matrices and they are always of the form $uv^\top$ for some u and v. The rank-1 matrix $uv^\top$
takes an input x, projects it on v (measures its alignment with v), and "unprojects" into
u (multiplies the scalar $v^\top x$ onto the output vector u).
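This "sum of rank-1 matrices" picture can be checked directly with a numerical SVD. A minimal NumPy sketch (the test matrix is an arbitrary choice of mine):

```python
import numpy as np

A = np.random.default_rng(2).standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U diag(s) V^T

# rebuild A as a sum of k rank-1 matrices sigma_i * u_i v_i^T
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))                     # True
```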
Just to explicitly show the transition from the coordinate-free to the coordinate-based theorem,
consider arbitrary orthonormal bases $\{e_i\}_{i=1}^n \subset V$ and $\{\hat e_i\}_{i=1}^m \subset U$. For $x \in V$ we
have
$$f(x) = \sum_{i=1}^k \sigma_i\, u_i\, \langle v_i, x \rangle = \sum_{i=1}^k \sigma_i \Big(\sum_l u_{li}\, \hat e_l\Big) \Big\langle \sum_j v_{ji}\, e_j,\ \sum_k x_k\, e_k \Big\rangle \qquad (96)$$
$$= \sum_{i=1}^k \sigma_i \Big(\sum_l u_{li}\, \hat e_l\Big) \sum_{jk} v_{ji}\, x_k\, \delta_{jk} = \sum_l \Big[\sum_{i=1}^k u_{li}\, \sigma_i \sum_j v_{ji}\, x_j\Big]\, \hat e_l \qquad (97)$$
$$= \sum_l \big[U S V^\top x\big]_l\, \hat e_l \qquad (98)$$
where $v_{ji}$ are the coordinates of $v_i$, $u_{li}$ the coordinates of $u_i$, $U = (u_1, .., u_k)$ is the matrix
containing $\{u_i\}$ as columns, $V = (v_1, .., v_k)$ the matrix containing $\{v_i\}$ as columns,
and $S = \mathrm{diag}(\sigma_1, .., \sigma_k)$ the diagonal matrix with elements $\sigma_i$.
We add some definitions based on this:
Definition 3.14. The rank rank(f ) = rank(A) of a transform f or its matrix A is
the unique minimal k.
The last definition is a bit flaky, as the ± is not properly defined. If, alternatively, in the
above theorems we required V and U to be rotations, that is, elements of SO(n)
(the special orthogonal group), then negative σ's would indicate such a reflection and
$\det(A) = \prod_{i=1}^n \sigma_i$. But above we required the σ's to be strictly positive and V and U only
orthogonal. Fundamental space vectors $v_i$ and $u_i$ could flip sign. The ± above indicates
how many flip sign.
Definition 3.16. a) The row space (also called right or input fundamental space)
of a transform f is $\mathrm{span}\{v_i\}_{i=1}^{\mathrm{rank}(f)}$. The input null space (or right null space) $V_\perp$
is the subspace orthogonal to the row space, such that $v \in V_\perp \Rightarrow f(v) = 0$.
b) The column space (also called left or output fundamental space) of a transform
f is $\mathrm{span}\{u_i\}_{i=1}^{\mathrm{rank}(f)}$. The output null space (or left null space) $U_\perp$ is the
subspace orthogonal to the column space.
In the following we list some statements—all of them relate to the SVD theorem and
together they're meant to give a more intuitive understanding of the equation
$A = \sum_{i=1}^k \sigma_i u_i v_i^\top = U S V^\top$.
The projection $x_\|$ of a vector x onto a single vector v is
$$x_\| = \frac{1}{v^2}\, v\, \langle v, x \rangle \quad\text{or}\quad \frac{v v^\top}{v^2}\, x . \qquad (100)$$
Here, the $\frac{1}{v^2}$ is normalizing in case v does not have length |v| = 1.
The projection of a vector $x \in V$ onto a subvector space $\mathrm{span}\{v_i\}_{i=1}^k$ for orthonormal
$\{v_i\}$ is $x_\| = V V^\top x$,
where $V = (v_1, .., v_k) \in \mathbb{R}^{n \times k}$. The projection matrix $V V^\top$ for orthonormal V is
symmetric, semi-pos-def, and has $\mathrm{rank}(V V^\top) = k$.
A symmetric matrix A can always be written as $A = V \Lambda V^\top = \sum_i \lambda_i v_i v_i^\top$
for orthonormal $V = (v_1, .., v_k)$. Here $\lambda_i = \pm\sigma_i$ and $\Lambda = \mathrm{diag}(\lambda)$ is the diagonal
matrix of λ's. This describes nothing but a stretching/squeezing along orthogonal
projections.
The λi and vi are also the eigenvalues and eigenvectors of A, that is, for all
i = 1, .., k:
Avi = λi vi . (103)
If A has full rank, then the SVD A = V SV > = V SV -1 is therefore also the
eigendecomposition of A.
The inverse is then $A^{-1} = V \Lambda^{-1} V^\top$, which simply does the reverse stretching/squeezing along the same orthogonal
projections. Note that
$$A^\dagger A = V V^\top$$
is the projection on $\{v_i\}_{i=1}^k$. For full $\mathrm{rank}(A) = n$ we have $V V^\top = I$ and
$A^\dagger = A^{-1}$. For $\mathrm{rank}(A) < n$, we have that $A^\dagger y$ minimizes $\min_x ||Ax - y||^2$, but
there are infinitely many x's that minimize this, spanned by the null space of A.
$A^\dagger y$ is the minimizer closest to zero (with smallest norm).
Consider a data set $D = \{x_i\}_{i=1}^m$, $x_i \in \mathbb{R}^n$. For simplicity assume it has zero
mean, $\sum_{i=1}^m x_i = 0$. The covariance matrix is defined as
$$C = \frac{1}{n} \sum_i x_i x_i^\top = \frac{1}{n} X^\top X \qquad (106)$$
It follows that the volume of a set is being multiplied with det(A) when mapped through A, which is consistent with our
intuition of transforms as stretchings/squeezings along orthonormal projections.
The pseudo-inverse of a general matrix is
$$A^\dagger = \sum_i \sigma_i^{-1}\, v_i u_i^\top = V S^{-1} U^\top . \qquad (110)$$
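As a numerical illustration of eq. (110), the following NumPy sketch builds the pseudo-inverse from the SVD and compares it against numpy's built-in pinv (the test matrix, cutoff, and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # a rank-3 matrix in R^{5x4}
y = rng.standard_normal(5)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = np.sum(s > 1e-10)                                            # numerical rank
A_pinv = Vt[:k].T @ np.diag(1.0/s[:k]) @ U[:, :k].T              # V S^{-1} U^T, eq. (110)

print(np.allclose(A_pinv, np.linalg.pinv(A)))                    # matches numpy's pinv
x = A_pinv @ y                                                   # least-squares solution with smallest norm
print(np.linalg.norm(A @ x - y) <= np.linalg.norm(A @ rng.standard_normal(4) - y))
```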
3.7 Eigendecomposition
Definition 3.18. The eigendecomposition or diagonalization of a square matrix
A ∈ Rn×n is (if it exists!)
A = QΛQ-1 (115)
First note that, unlike SVD, this is not a Theorem but just a definition: If such a
decomposition exists, it is called eigendecomposition. But it exists for almost any
square matrix.
A matrix fails to be diagonalizable when some eigenvalue λ is a multiple root of the
characteristic polynomial (with multiplicity k), but $n - \mathrm{rank}(A - \lambda I)$ (the dimensionality of the span of the
eigenvectors of λ!) is less than k. Therefore, there are fewer than k linearly
independent eigenvectors for λ; they do not span the necessary k dimensions.
So, only very "special" matrices are not diagonalizable. Random matrices are
diagonalizable (with prob. 1).
Symmetric matrices? → SVD
Rotations? Not real. But complex! Think of oscillating projection onto eigenvec-
tor. If φ is the rotation angle, e±iφ are eigenvalues.
A trick, hard to find in the literature, to also compute the smallest eigenvalue and -vector
is the following. We assume all eigenvalues to be positive. Initialize x and y randomly,
iterate
$$x \leftarrow Ax ,\quad \lambda \leftarrow ||x|| ,\quad x \leftarrow x/\lambda ,\quad y \leftarrow (\lambda I - A)\, y ,\quad y \leftarrow y/||y|| \qquad (117)$$
Then y will converge to the eigenvector with the smallest eigenvalue, and $\lambda - ||y||$ (with
$||y||$ evaluated before the normalization, analogous to λ) will be its eigenvalue.
Note that (in the limit) $A - \lambda I$ only has non-positive eigenvalues, therefore this norm is
non-negative. Finding smallest eigenvalues is a common problem in model fitting.
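A minimal NumPy sketch of this iteration, assuming A is symmetric positive definite as stated above (the test matrix, iteration count, and seed are my own choices):

```python
import numpy as np

def power_method_extremes(A, iters=2000, seed=0):
    """Estimate the largest and smallest eigenvalue/-vector of a pos.-def. symmetric A,
    following iteration (117)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    y = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        x = A @ x
        lam = np.linalg.norm(x)              # converges to the largest eigenvalue
        x = x / lam
        y = (lam*np.eye(A.shape[0]) - A) @ y
        mu = np.linalg.norm(y)               # growth factor: converges to lam - lambda_min
        y = y / mu
    return lam, x, lam - mu, y

A = np.diag([5., 2., 0.5]) + 0.1*np.ones((3, 3))
lam_max, _, lam_min, _ = power_method_extremes(A)
print(lam_max, lam_min, np.linalg.eigvalsh(A))   # compare against a direct solver
```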
We will not go into details of numerics. Nathan's script gives a really nice explanation
of the QR-method. I just mention two things:
(i) The most important forms of matrices for numerics are diagonal matrices, orthogonal
matrices, and upper triangular matrices. One reason is that all three types can very easily
be inverted. A lot of numerics is about finding decompositions of general matrices into
products of these special-form matrices, e.g.:
– QR-decomposition: A = QR with Q orthogonal and R upper triangular.
– LU-decomposition: A = LU with U and L> upper triangular.
– Cholesky decomposition: (symmetric) A = C>C with C upper triangular
– Eigen- & singular value decompositions
3.9 Derivatives as 1-forms, steepest descent, and the covariant gradient
3.9.1 The coordinate-free view: A derivative takes a change-of-input vector as input, and returns a change of output
For a function $f : V \to G$ from a vector space V into some output space G, the differential at x is defined as
$$df\big|_x : V \to G, \qquad v \mapsto \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h} \qquad (118)$$
This definition holds whenever G is a continuous space that allows the definition of this
limit and the limit exists (f is differentiable). The notation df |x reads “the differential
at location x”, i.e., evaluating this derivative at location x.
Note that df |x is a mapping from a “tangent vector” v (a change-of-input vector) to
an output-change. Further, by this definition df |x is linear. df |x (v) is the directional
derivative we mentioned before. Therefore df |x is a G-valued 1-form. As discussed
earlier, we can introduce coordinates for 1-forms; these coordinates are what typically
is called the “gradient” or “Jacobian”. But here we explicitly see that we speak of
coordinates of a 1-form.
3.9.2 Contra- and co-variance
Consider again two bases A and B with coordinate transformation matrix B, such that $[x]_A = B\,[x]_B$;
that is, the matrix B carries new coordinates to old ones. These coordinates are
said to be contra-variant: they transform ‘against’ the transformation of the basis
vectors.
We require that the value
$$df\big|_x(\delta) = [\nabla f]_A^\top [\delta]_A = [\nabla f]_B^\top [\delta]_B$$
must be invariant, i.e., the change of function value for δ should not depend on
whether we compute it using A or B coordinates. It follows that
$$[\nabla f]_B = B^\top [\nabla f]_A ,$$
that is, derivative (1-form) coordinates transform with $B^\top$ and are called co-variant.
What we just wrote for the derivative df |x (δ) we could equally write and argue for
any 1-form v ∗ ∈ V ∗ ; we always require that the value v ∗ (δ) is invariant.
Similarly, for a 2-form (e.g. a scalar product) we require the value
$$\langle \delta, \delta \rangle = [\delta]_A^\top [G]_A [\delta]_A = [\delta]_B^\top [G]_B [\delta]_B$$
to be invariant, where $[G]_A$ and $[G]_B$ are the 2-form-coordinates (metric tensor) in the old and
new basis. It follows
$$[\delta]_A^\top [G]_A [\delta]_A = [\delta]_B^\top B^\top [G]_A B\, [\delta]_B \qquad (124)$$
$$[G]_B = B^\top [G]_A B \qquad (125)$$
that is, the matrix carries the old 2-form-coordinates to new ones. These coordinates
are called twice co-variant.
Consider the following example: We have the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x) = x_1 + x_2$. The
function's partial derivative is of course $\frac{\partial f}{\partial x} = (1\ 1)$. Now let's transform the coordinates
of the space: we introduce new coordinates $(z_1, z_2) = (2x_1, x_2)$, or $z = B^{-1} x$ with
$B = \begin{pmatrix} \frac12 & 0 \\ 0 & 1 \end{pmatrix}$. The same function, written in the new coordinates, is $f(z) = \frac12 z_1 + z_2$. The
partial derivative w.r.t. the new coordinates is $(\frac12\ 1)$, which is exactly $B^\top$ applied to the
old derivative: the derivative coordinates transform co-variantly.
Generally, consider we have two kinds of mathematical objects and when we multiply
them together this gives us a scalar. The scalar shouldn’t depend on any choice of
coordinate system and is therefore invariant against coordinate transforms. Then, if
one of the objects transforms in a covariant (“transforming with the transformation”)
manner, the other object must transform in a contra-variant (“transforming contrary
to the transformation”) manner to ensure that the resulting scalar is invariant. This is a
general principle: whenever two things multiply together to give an invariant thing, one
should transform co- the other contra-variant.
Let’s also check Wikipedia:
– “For a vector to be basis-independent, the components [=coordinates] of the vector must
contra-vary with a change of basis to compensate. That is, the matrix that transforms
the vector of components must be the inverse of the matrix that transforms the basis
vectors. The components of vectors (as opposed to those of dual vectors) are said to be
contravariant.
– For a dual vector (also called a covector) to be basis-independent, the components of the
dual vector must co-vary with a change of basis to remain representing the same covector.
That is, the components must be transformed by the same matrix as the change of basis
matrix. The components of dual vectors (as opposed to those of vectors) are said to be
covariant.”
Ordinary gradient descent of the form $x \leftarrow x - \alpha \nabla f$ adds objects of different types:
contra-variant coordinates x with co-variant partial derivatives ∇f. Clearly, adding two
such different types leads to an object whose transformation under coordinate transforms
is strange—and indeed ordinary gradient descent is not invariant under transformations.
3.9.3 Steepest descent and the covariant gradient vector
Let's define the steepest descent direction to be the one where, when you make a step of
length 1, you get the largest decrease of f in its linear (=1st order Taylor) approximation.
Definition 3.20. Given $f : V \to \mathbb{R}$ and a norm $||x||^2 = \langle x, x \rangle$ (or scalar product)
defined on V, we define the steepest descent vector $\delta^* \in V$ as the vector
$$\delta^* = \operatorname*{argmin}_{\delta:\, ||\delta|| = 1} df\big|_x(\delta) .$$
Note that for this definition we need to assume we have a scalar product, otherwise the
length-1 constraint is not defined. Also recall that $df|_x(\delta) = \partial_x f(x)\, \delta = \nabla f(x)^\top \delta$ are
equivalent notations.
Clearly, if we have coordinates in which the norm is Euclidean then
$$||\delta||^2 = \delta^\top \delta \quad\Rightarrow\quad \delta^* \propto -\nabla f(x) \qquad (127)$$
In general, for a metric with tensor G, the steepest descent vector is $\delta^* \propto -G^{-1} \nabla f(x)$, the
covariant gradient. For a coordinate transformation B, recall that the new metric becomes $\tilde G = B^\top G B$, and
the new gradient $\widetilde{\nabla f} = B^\top \nabla f$. Therefore, the new steepest descent is
$$\tilde\delta^* = -[\tilde G]^{-1}\, \widetilde{\nabla f} = -B^{-1} G^{-1} B^{-\top} B^\top \nabla f = -B^{-1} G^{-1} \nabla f \qquad (132)$$
and therefore transforms like normal contra-variant coordinates of a vector.
There is an important special case of this, when f is a function over the space of
probability distributions and G is the Fisher metric, which we’ll discuss later.
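The transformation behaviour in eq. (132) can be verified numerically. The following NumPy sketch uses a random positive definite metric and a random coordinate transform, all of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
G = rng.standard_normal((n, n)); G = G @ G.T + n*np.eye(n)   # a pos.-def. metric tensor
grad = rng.standard_normal(n)                                 # gradient (1-form) coordinates
B = rng.standard_normal((n, n)) + 2*np.eye(n)                 # an invertible coordinate transform

delta = -np.linalg.solve(G, grad)                 # steepest descent: -G^{-1} grad

# in the new coordinates: metric B^T G B and gradient B^T grad (both co-variant)
G_new = B.T @ G @ B
grad_new = B.T @ grad
delta_new = -np.linalg.solve(G_new, grad_new)

# eq. (132): the step transforms contra-variantly, i.e. delta_new = B^{-1} delta
print(np.allclose(delta_new, np.linalg.solve(B, delta)))      # True
```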
3.10.1 Basis
$$f(x) = Ax = \begin{pmatrix} \;\; & 7 \\ 5 & -8 \end{pmatrix} x .$$
Consider the basis $B = \Big(\begin{pmatrix}1\\1\end{pmatrix}, \begin{pmatrix}2\\1\end{pmatrix}\Big)$, which we also simply refer to by the matrix
$B = \begin{pmatrix} 1 & 2 \\ 1 & 1 \end{pmatrix}$. Given a vector x in the vector space $\mathbb{R}^2$, we denote its coordinates in basis B
with $x^B$.
You have a book lying on the table. The edges of the book define the basis B, the edges
of the table define basis A. Initially A and B are identical (also their origins align). Now
we rotate the book by 45◦ counter-clock-wise about its origin.
(i) Given a dot p marked on the book at position pB = (1, 1) in the book coordinate
frame, what are the coordinates pA of that dot with respect to the table frame?
(ii) Given a point x with coordinates xA = (0, 1) in table frame, what are its coordi-
nates xB in the book frame?
(iii) What is the coordinate transformation matrix from book frame to table frame,
and from table frame to book frame?
$$A = \{1, x, x^2, \ldots, x^n\}$$
and
$$B = \{1,\ 1+x,\ 1+x+x^2,\ \ldots,\ 1+x+\cdots+x^n\} .$$
Consider the linear transformation t defined by
$$1 \mapsto 1, \quad x \mapsto 1+x, \quad x^2 \mapsto 1+x+x^2, \quad \ldots, \quad x^n \mapsto 1+x+x^2+\cdots+x^n .$$
What is the matrix T A for the linear transform t in the basis A, i.e., such
that [t(f )]A = T A [f ]A ? (Basis A is used for both, input and output spaces.)
What is the matrix T B for the linear transform t in the basis B, i.e., such
that [t(f )]B = T B [f ]B ? (Basis B is used for both, input and output spaces.)
What is the matrix T BA if we use A as input space basis, and B as output
space basis, i.e., such that [t(f )]B = T BA [f ]A ?
What is the matrix T AB if we use B as input space basis, and A as output
space basis, i.e., such that [t(f )]A = T AB [f ]B ?
Show that T B = I BA T A I AB (cp. Exercise 1(b)). Also note that T AB =
T A I AB and T BA = I BA T A .
3.10.4 Projections
(i) In Rn , a plane (through the origin) is typically described by the linear equation
c>x = 0 , (133)
where c ∈ Rn parameterizes the plane. Provide the matrix that describes the
orthogonal projection onto this plane. (Tip: Think of the projection as I minus a
rank-1 matrix.)
(ii) In $\mathbb{R}^n$, let's have k linearly independent $\{v_i\}_{i=1}^k$, which form the matrix $V =
(v_1, .., v_k) \in \mathbb{R}^{n \times k}$. Let's formulate a projection using an optimality principle,
namely,
$$\alpha^*(x) = \operatorname*{argmin}_\alpha ||x - V\alpha||^2 .$$
Derive the equation for the optimal $\alpha^*(x)$ from the optimality principle.
(For information only: Note that $V\alpha = \sum_{i=1}^k \alpha_i v_i$ is just the linear combination
of $v_i$'s with coefficients α. The projection of a vector x is then $x_\| = V\alpha^*(x)$.)
3.10.5 SVD
(iii) Given an arbitrary input vector x ∈ R3 , provide the linear transformation matrices
PA and PB that project x into the input null space of matrix A and B, respectively.
(iii) What property does a matrix M have to satisfy in order to be a valid metric tensor,
i.e. such that x>M y is a valid scalar product?
3.10.7 Eigenvectors
Prove that (under certain conditions) these iterations converge to the eigenvector
x with a largest (in absolute terms |λi |) eigenvalue of A. How fast does this
converge? In what sense does it converge if the largest eigenvalue is negative?
What if eigenvalues are not different? Other convergence conditions?
(vi) Let A be a positive definite matrix with λmax its largest eigenvalue (in absolute
terms |λi |). What do we get when we apply power iteration method to the matrix
B = A − λmax I? How can we get the smallest eigenvalue of A?
$$z \leftarrow Ax ,\quad \lambda \leftarrow x^\top z ,\quad y \leftarrow (\lambda I - A)\, y ,\quad x \leftarrow \frac{1}{||z||}\, z ,\quad y \leftarrow \frac{1}{||y||}\, y .$$
If A is a positive definite matrix, show that the algorithm can give an estimate of
the smallest eigenvalue of A.
Consider a zero-mean data set $D = \{x_i\}_{i=1}^N$ and its covariance matrix
$$C = \frac{1}{n} \sum_{i=1}^N x_i x_i^\top = \frac{1}{n} X^\top X$$
where (consistent to ML lecture convention) the data matrix X contains each $x_i^\top$ as a
row, i.e., $X^\top = (x_1, .., x_N)$.
If we project D onto some unit vector v ∈ Rn , then the variance of the projected data
points is v>Cv. Show that the direction that maximizes this variance is the largest
eigenvector of C. (Hint: Expand v in terms of the eigenvector basis of C and exploit
the constraint v>v = 1.)
$$\langle f, k_x \rangle = f(x)$$
(ii) Assume we only have a finite set of points $D = \{x_i\}_{i=1}^n$, which defines a finite
basis $\{k_{x_i}\}_{i=1}^n \subset B$. This finite function basis spans a subspace $F_D = \mathrm{span}\{k_{x_i} :
x_i \in D\}$ of the space of all functions.
For a general function f , we decompose it f = fs + f⊥ with fs ∈ FD and
∀g ∈ FD : hf⊥ , gi = 0, i.e., f⊥ is orthogonal to FD . Show that for every xi ∈ D:
f (xi ) = fs (xi )
(Note: This shows that the function values of any function f at the data points
D only depend on the part fs which is inside the span of {kxi : xi ∈ D}.
This implies the so-called representer theorem, which is fundamental in kernel
machines: A loss can only depend on function values f (xi ) at data points, and
therefore on fs . The part f⊥ can only increase the complexity (norm) of a function.
Therefore, the simplest function to optimize any loss will have f⊥ = 0 and be
within span{kxi : xi ∈ D}.)
(iii) Within span{kxi : xi ∈ D}, what is the coordinate representation of the scalar
product?
4 Optimization
We discuss here algorithms that have one goal: walk downhill as quickly as possible.
These aim at efficiently finding local optima—in contrast to global optimization
methods, which try to find the global optimum.
For such downhill walkers, there are two essential things to discuss: the stepsize and
the step direction. When discussing the stepsize we’ll hit on topics like backtracking
line search, the Wolfe conditions and its implications in a basic convergence proof.
The discussion of the step direction will very much circle around Newton’s method and
thereby also cover topics like quasi-Newton methods (BFGS), Gauss-Newton, covariant
and conjugate gradients.
4.1 Downhill algorithms for unconstrained optimization
4.1.1 Why you shouldn't trust the magnitude of the gradient
Consider the following 1D function and naive gradient descent x ← x − α∇f for some
fixed and small α.
(Figure: on a plateau the gradient is small, suggesting a small step; on a steep slope the gradient is large, suggesting a large step.)
In plateaus we’d make small steps, at steep slopes (here close to the optimum) we
make huge steps, very likely overstepping the optimum. In fact, for some α the
algorithm might indefinitely loop a non-sensical sequence of very slowly walking
left on the plateau, then accellerating, eventually overstepping the optimum, then
being thrown back far to the right again because of the huge negative gradient on
the left.
As a conclusion, the gradient ∇f gives a reasonable descent direction, but its magnitude
is really arbitrary and no good indication of a good stepsize. Therefore, it often makes
sense to just compute the step direction
$$\delta = -\frac{1}{|\nabla f(x)|}\, \nabla f(x) \qquad (136)$$
and iterate x ← x + αδ for some appropriate stepsize.
4.1.2 Ensuring monotone and sufficient decrease: Backtracking line search, Wolfe conditions, & convergence
The first idea is simple: If a step would increase the objective value, reduce the stepsize.
We typically use multiplicative stepsize adaptations: reduce $\alpha \leftarrow \varrho_\alpha^- \alpha$ with $\varrho_\alpha^- \approx 0.5$,
and increase $\alpha \leftarrow \varrho_\alpha^+ \alpha$ with $\varrho_\alpha^+ \approx 1.2$. A simple monotone gradient descent algorithm
reads as follows (the blue part is explained later). Here, the step vector δ is always
normalized and α is adapted on the fly; decreasing when f(x + αδ) is not sufficiently
smaller than f(x).
This sufficiently smaller is described by the blue part and is called the (1st) Wolfe
condition
$$f(x + \alpha\delta) > f(x) + \varrho_{\rm ls}\, \nabla f(x)^\top (\alpha\delta) . \qquad (137)$$
Figure 3 illustrates this. Note that $\nabla f(x)^\top (\alpha\delta)$ is a negative value and describes how
much the objective would decrease if f was linear. But f is of course not linear; we
cannot expect that a step would really decrease f that much. Instead we require that
it decreases by a fraction of this expectation. $\varrho_{\rm ls}$ describes this fraction and is typically
chosen very moderate, e.g. $\varrho_{\rm ls} \in [0.01, 0.1]$. So, the Wolfe condition requires that f
decreases by the $\varrho_{\rm ls}$-fraction of what it would decrease if f was linear. Note that for
α → 0 the Wolfe condition will always be fulfilled for smooth functions, because f
"becomes locally linear".
Figure 3: The 1st Wolfe condition: $f(x) + \nabla f(x)^\top(\alpha\delta)$ is the tangent, which describes
the expected decrease of $f(x + \alpha\delta)$ if f was linear. We cannot expect this to be the
case; so $f(x + \alpha\delta) > f(x) + \varrho_{\rm ls} \nabla f(x)^\top(\alpha\delta)$ weakens this condition.
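The referenced algorithm listing is not reproduced here; the following NumPy sketch is one possible realization of such a monotone gradient descent with backtracking and sufficient-decrease test (all parameter defaults and the test problem are my own choices):

```python
import numpy as np

def gradient_descent_backtracking(f, grad_f, x, alpha=1.0, rho_ls=0.01,
                                  rho_minus=0.5, rho_plus=1.2, tol=1e-6, max_iters=1000):
    """Gradient descent with normalized step direction and backtracking line search
    enforcing the (1st) Wolfe / sufficient-decrease condition."""
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        delta = -g / np.linalg.norm(g)                       # normalized descent direction
        # backtrack while the decrease is not sufficient
        while f(x + alpha*delta) > f(x) + rho_ls * alpha * (g @ delta):
            alpha *= rho_minus
        x = x + alpha*delta
        alpha *= rho_plus                                     # try a larger step next time
    return x

# usage on a simple convex quadratic
f = lambda x: 0.5 * x @ np.diag([1., 10.]) @ x
grad_f = lambda x: np.diag([1., 10.]) @ x
print(gradient_descent_backtracking(f, grad_f, np.array([3., -2.])))   # ~ [0, 0]
```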
You'll prove the following theorem in the exercises. It is fundamental to convex optimization
and proves that Alg 2 is efficient for convex objective functions:
But even if your objective function is not globally convex, Alg 2 is an efficient downhill
walker, and once it reaches a convex region it will efficiently walk into its local minimum.
For completeness, there is a second Wolfe condition,
which states that the gradient magnitude should have decreased sufficiently. We do not
use it much.
4.1.3 The Newton direction
We already discussed the steepest descent direction $-G^{-1} \nabla f(x)$ if G is a metric tensor.
Let's keep this in mind!
The original Newton method is a method to find the root (that is, zero point) of a
function f(x). In 1D it iterates $x \leftarrow x - \frac{f(x)}{f'(x)}$, that is, it uses the gradient $f'$ to
estimate where the function might cross the x-axis. To find an optimum (minimum
or maximum) of f we want to find the root of its gradient. For $x \in \mathbb{R}^n$ the Newton
method iterates
$$x \leftarrow x - \nabla^2 f(x)^{-1}\, \nabla f(x) .$$
Note that the Newton step $\delta = -\nabla^2 f(x)^{-1} \nabla f(x)$ is the solution to
$$\min_\delta \Big[ f(x) + \nabla f(x)^\top \delta + \frac12\, \delta^\top \nabla^2 f(x)\, \delta \Big] . \qquad (140)$$
So the Newton method can also be viewed as 1) compute the 2nd-order Taylor approx-
imation to f at x, and 2) jump to the optimum of this approximation.
Note:
If f is just a 2nd-order polynomial, the Newton method will jump to the optimum
in just one step.
Unlike the gradient magnitude |∇f (x)|, the magnitude of the Newton step δ is
very meaningful. It is scale invariant! If you’d rescale f (trade cents by Euros), δ
is unchanged. |δ| is the distance to the optimum of the 2nd-order Taylor.
Unlike the gradient ∇f (x), the Newton step δ is truely a vector! The vector itself
is invariant under coordinate transformations; the coordinates of δ transforms
contra-variant, as it is supposed to for vector coordinates.
– The hessian as metric, and the Newton step as steepest descent: Assume that the hessian H = ∇2f(x) is pos-def. Then it fulfils all necessary conditions to define a scalar product ⟨v, w⟩ = Σij vi wj Hij, where H plays the role of the metric tensor. If H was the space's metric, then the steepest descent direction is −H-1∇f(x), which is the Newton direction.
Another way to understand the same: In the 2nd-order Taylor approximation f(x+δ) ≈ f(x) + ∇f(x)>δ + 1/2 δ>Hδ the Hessian plays the role of a metric tensor. Or: we may think of the function f as being an isotropic parabola f(x+δ) ∝ ⟨δ, δ⟩, but we've chosen coordinates where ⟨v, v⟩ = v>Hv and the parabola seems squeezed. Note that this discussion only holds for a pos-def hessian.
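The one-step property on quadratics is easy to verify numerically. A minimal sketch of the plain (undamped) Newton iteration—my own illustration, not a robust solver:

import numpy as np

def newton_minimize(grad, hess, x, iters=20):
    """Plain Newton iteration x <- x - hess(x)^-1 grad(x); no damping, no line search."""
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# on a quadratic f(x) = 1/2 x^T A x - b^T x it converges in a single step
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
print(newton_minimize(grad, hess, np.zeros(2)))   # equals np.linalg.solve(A, b)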
A robust Newton method is the core of many solvers, see Algorithm 3. We do back-
tracking line search along the Newton direction, but with maximal step size α = 1 (the
full Newton step).
We can additionally add and adapt damping to gain more robustness. Some notes on the λ:
In Alg 3, the first line chooses λ to ensure that (∇2f(x) + λI) is indeed pos-def—so that a Newton step actually decreases f instead of seeking a maximum. There would be other options: instead of adding to all eigenvalues we could only set the negative ones to some λ > 0.
Trust region method: Let's consider a different mathematical program over the step:
    minδ ∇f(x)>δ + 1/2 δ>∇2f(x) δ   s.t.   δ>δ ≤ β   (142)
This problem wants to find the minimum of the 2nd-order Taylor (like the Newton step), but constrained to a stepsize no larger than β. This β defines the trust region: the region in which we trust the 2nd-order Taylor to be a reasonable enough approximation.
Let's solve this using Lagrange parameters (as we will learn later): Let's assume the inequality constraint is active. Then we have
    L(δ, λ) = ∇f(x)>δ + 1/2 δ>∇2f(x) δ + λ(δ>δ − β)   (143)
    ∇δ L(δ, λ) = ∇f(x)> + δ>(∇2f(x) + 2λI)   (144)
Setting this to zero gives the step δ = −(∇2 f (x) + 2λI)-1 ∇f (x).
Therefore, the λ can be viewed as the dual variable of a trust region method. There is no analytic relation between β and λ; we cannot determine λ directly from β. We could use a constrained optimization method, like primal-dual Newton, to solve the trust region problem (142) exactly.
A special case that appears a lot in intelligent systems is the Least Squares case: Consider an objective function of the form
    f(x) = φ(x)>φ(x) = Σi φi(x)²   (145)
where we call φ(x) the cost features. This is also called a sum-of-squares problem.
We have
    ∇f(x) = 2 ∂/∂x φ(x)> φ(x)   (146)
    ∇2f(x) = 2 ∂/∂x φ(x)> ∂/∂x φ(x) + 2 φ(x)> ∂²/∂x² φ(x)   (147)
The Gauss-Newton method is the Newton method for f(x) = φ(x)>φ(x) while approximating ∇2φ(x) ≈ 0. That is, it computes approximate Newton steps
    δ = −( ∂/∂x φ(x)> ∂/∂x φ(x) + λI )-1 ∂/∂x φ(x)> φ(x) .   (148)
Algorithm 3 Robust Newton method
1: initialize stepsize α = 1, λ = λ0
2: repeat
3:   compute smallest eigenvalue σmin of ∇2f(x) + λI
4:   if σmin is sufficiently positive then
5:     compute δ to solve (∇2f(x) + λI) δ = −∇f(x)
6:   else
7:     Option 1: λ ← 2λ − σmin and goto line 3   // increase regularization
8:     Option 2: δ ← −δmax ∇f(x)/|∇f(x)|   // gradient step of length δmax
9:   end if
10:  if ||δ||∞ > δmax : δ ← (δmax/||δ||∞) δ   // cap δ length
11:  y ← BoundClip(x + αδ, x_lo, x_hi)   // project onto the box bounds (x_lo, x_hi denote the lower/upper bounds)
12:  while f(y) > f(x) + %ls ∇f(x)>(y − x) do   // bound-projected line search
13:    α ← %−α α   // decrease stepsize
14:    (unusual option: λ ← %+λ λ and goto line 3)   // increase damping
15:    y ← BoundClip(x + αδ, x_lo, x_hi)
16:  end while
17:  xold ← x
18:  x ← y   // step is accepted
19:  α ← min{%+α α, 1}   // increase stepsize
20:  (unusual option: λ ← %−λ λ)   // decrease damping
21: until ||xold − x||∞ < θx repeatedly, or f(xold) − f(x) < θf repeatedly

Note:
– The objective f(x) can be interpreted as the Euclidean norm f(φ) = φ>φ but pulled back into the x-space. More precisely: Consider a mapping φ : Rn → Rm and a general scalar product ⟨·, ·⟩φ in the output space. In differential geometry there is the notion of a pull-back of a metric, that is, we define a scalar product ⟨·, ·⟩x in the input space as
    ⟨v, w⟩x = ⟨dφ(v), dφ(w)⟩φ = v> ∂/∂x φ(x)> ∂/∂x φ(x) w   (150)
(here for a Euclidean output metric). Therefore, the approximate Hessian is the pull-back of a Euclidean cost feature metric, and ⟨δ, δ⟩x approximates the 2nd-order polynomial term of f(x+δ), with the non-constant (i.e., Riemannian) pull-back metric ⟨·, ·⟩x.
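As a concrete illustration of eq. (148), here is a minimal Gauss-Newton sketch (my own; the damping λ is kept constant and there is no line search, so it is not a robust solver). The exponential-fit example data are made up for illustration:

import numpy as np

def gauss_newton(phi, jac_phi, x, lam=1e-3, iters=50):
    """Gauss-Newton with constant damping lam, following the step formula (148)."""
    for _ in range(iters):
        J = jac_phi(x)                          # d/dx phi(x), shape (m, n)
        p = phi(x)                              # cost features, shape (m,)
        delta = -np.linalg.solve(J.T @ J + lam * np.eye(x.size), J.T @ p)
        x = x + delta                           # a real solver would add line search here
    return x

# usage: fit y ~ a*exp(b*t) by least squares on residuals phi_i = a*exp(b*t_i) - y_i
t = np.linspace(0, 1, 20); y = 2.0 * np.exp(-1.5 * t)
phi = lambda x: x[0] * np.exp(x[1] * t) - y
jac_phi = lambda x: np.stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)], axis=1)
print(gauss_newton(phi, jac_phi, np.array([1.0, 0.0])))   # approx [2.0, -1.5]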
4.1.5 Quasi-Newton & BFGS: approximating the hessian from gradient obser-
vations
To apply full Newton methods we need to be able to compute f (x), ∇f (x), and ∇2 f (x)
for any x. However, sometimes, computing ∇2 f (x) is not possible, e.g., because we
cannot derive an analytic expression for ∇2 f (x), or it would be too expensive to compute
the hessian exactly, or even to store it in memory—especially in very high-dimensional
spaces. In such cases it makes sense to approximate ∇2 f (x) or ∇2 f (x)-1 with a low-rank
approximation.
Assume we have computed ∇f(x1) and ∇f(x2) at two different points x1, x2 ∈ Rn. We define
    δ = x2 − x1 ,   y = ∇f(x2) − ∇f(x1) .   (151)
From this we may wish to find some approximate Hessian matrix H or H-1 that fulfils
    H δ = y   or   δ = H-1 y .   (152)
The first equation is called the secant equation. Here are guesses of H and H-1:
    H = y y>/(y>δ)   or   H-1 = δ δ>/(δ>y)   (153)
Convince yourself that these choices fulfil the respective desired relation above. However, these choices are under-determined. There exist many alternative H or H-1 that would be consistent with the observed change in gradient. However, given our understanding of the structure of matrices it is clear that these choices are the lowest rank solutions, namely rank 1.
The classical BFGS update constructs H-1 directly from such rank-1 terms,
    H-1 ← (I − δy>/(δ>y)) H-1 (I − yδ>/(δ>y)) + δδ>/(δ>y) ,   (154)
and correspondingly updates H as
    H ← H − (H δ δ>H>)/(δ>H δ) + (y y>)/(y>δ) .   (155)
Note:
– If H-1 is initially zero, this update will assign H-1 ← δδ>/(δ>y), which is the minimal rank 1 update we discussed above.
– If H-1 is previously non-zero, the first term (I − δy>/(δ>y)) H-1 (I − yδ>/(δ>y)) “deletes certain dimensions” from H-1. More precisely, note that (I − yδ>/(δ>y)) y = 0, that is, this rank n−1 construction deletes span{y} from its input space. Therefore, this term gives zero when multiplied with y; and it is guaranteed that the resulting H-1 fulfils H-1 y = δ.
The BFGS algorithm uses this H-1 instead of a precise ∇2f(x)-1 to compute the steps in a Newton method. All we said about line search and Levenberg-Marquardt damping is unchanged.6
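A quick numerical check that the BFGS inverse update indeed enforces the secant relation H-1 y = δ (this is the textbook BFGS formula, which the structure described above matches; a sketch, not a full quasi-Newton solver):

import numpy as np

def bfgs_inverse_update(Hinv, delta, y):
    """Standard BFGS update of the inverse Hessian approximation from one
    gradient observation pair (delta = x2 - x1, y = grad f(x2) - grad f(x1))."""
    rho = 1.0 / (y @ delta)
    I = np.eye(len(delta))
    return (I - rho * np.outer(delta, y)) @ Hinv @ (I - rho * np.outer(y, delta)) \
           + rho * np.outer(delta, delta)

# check the secant equation Hinv_new @ y == delta
rng = np.random.default_rng(0)
Hinv = np.eye(3)
delta, y = rng.normal(size=3), rng.normal(size=3)
if y @ delta < 0:                      # curvature condition; flip sign for the toy check
    y = -y
Hinv_new = bfgs_inverse_update(Hinv, delta, y)
print(np.allclose(Hinv_new @ y, delta))   # True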
In very high-dimensional spaces we do not want to store H-1 densely. Instead we use a compressed storage for low-rank matrices, e.g., storing vectors {vi} such that H-1 = Σi vi vi>. Limited memory BFGS (L-BFGS) makes this more memory efficient: it limits the rank of H-1 and thereby the used memory. I do not know the details myself, but I assume that with every update it might aim to delete the lowest eigenvalue to keep the rank constant.
The Conjugate Gradient Method is a method for solving large linear eqn. systems Ax +
b = 0. We only mention its extension for optimizing nonlinear functions f (x).
As above, assume that we evaluated ∇f(x1) and ∇f(x2) at two different points x1, x2 ∈ Rn. But now we make one more assumption: The point x2 is the minimum of a line search from x1 along the direction δ1. This latter assumption is quite optimistic: it assumes we did perfect line search. But it gives great information: The iso-lines of f(x) at x2 are tangential to δ1.
6 Taken from Peter Blomgren's lecture slides: terminus.sdsu.edu/SDSU/Math693a_f2013/Lectures/18/lecture.pdf This is the original Davidon-Fletcher-Powell (DFP) method suggested by W.C. Davidon in 1959. The original paper describing this revolutionary idea – the first quasi-Newton method – was not accepted for publication. It later appeared in 1991 in the first issue of the SIAM Journal on Optimization.
In this setting, convince yourself of the following: Ideally each search direction should be
orthogonal to the previous one—but not orthogonal in the conventional Euclidean sense,
but orthogonal w.r.t. the Hessian H. Two vectors d and d′ are called conjugate w.r.t. a metric H iff d′>Hd = 0. Therefore, subsequent search directions should be conjugate to each other.
Conjugate gradient descent does the following:
Intuitively, β > 0 implies that the new descent direction always adds a bit of the
old direction. This essentially provides 2nd order information.
For arbitrary quadratic functions CG converges in n iterations. But this only works
with perfect line search.
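For concreteness, here is a sketch of nonlinear CG with one standard choice of β (Polak-Ribière with restarts—the script does not commit to a particular variant, so this is an assumption) and a crude backtracking line search standing in for the exact line search assumed above:

import numpy as np

def nonlinear_cg(f, grad, x, iters=100, tol=1e-8):
    """Nonlinear conjugate gradient with Polak-Ribiere(+) beta and backtracking line search."""
    g = grad(x)
    d = -g
    for _ in range(iters):
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d) and alpha > 1e-12:
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        if np.linalg.norm(g_new) < tol:
            return x_new
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # PR+: restart when beta < 0
        d = -g_new + beta * d                            # new direction adds a bit of the old one
        if g_new @ d >= 0:                               # safeguard: fall back to steepest descent
            d = -g_new
        x, g = x_new, g_new
    return x

# usage on a quadratic (with perfect line search, CG would need only n = 2 iterations)
A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(nonlinear_cg(f, grad, np.zeros(2)))   # close to np.linalg.solve(A, b)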
4.1.7 Rprop*
The algorithm not only ignores |∇f | but also its exact direction! Only the gradient
signs in each coordinate are relevant. Therefore, the step directions may differ up
to < 90◦ from −∇f .
Algorithm 6 Rprop
Input: initial x ∈ Rn , function f (x), ∇f (x), initial stepsize α, tolerance θ
Output: x
1: initialize x = x0 , all αi = α, all gi = 0
2: repeat
3: g ← ∇f (x)
4: x0 ← x
5: for i = 1 : n do
6: if gi gi0 > 0 then // same direction as last time
7: αi ← 1.2αi
8: xi ← xi − αi sign(gi )
9: gi0 ← gi
10: else if gi gi0 < 0 then // change of direction
11: αi ← 0.5αi
12: xi ← xi − αi sign(gi )
13: gi0 ← 0 // force last case next time
14: else
15: xi ← xi − αi sign(gi )
16: gi0 ← gi
17: end if
18: optionally: cap αi ∈ [αmin xi , αmax xi ]
19: end for
20: until |x0 − x| < θ for 10 iterations in sequence
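A direct implementation of Algorithm 6 (slightly simplified: the optional cap on αi is dropped and termination checks a single small step rather than 10 in sequence):

import numpy as np

def rprop(grad, x, alpha0=0.1, theta=1e-6, max_iters=10000):
    """Rprop following Algorithm 6: per-coordinate stepsizes, only gradient signs are used."""
    n = len(x)
    alpha = np.full(n, alpha0)
    g_old = np.zeros(n)
    for _ in range(max_iters):
        g = grad(x)
        x_old = x.copy()
        for i in range(n):
            if g[i] * g_old[i] > 0:        # same direction as last time: accelerate
                alpha[i] *= 1.2
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = g[i]
            elif g[i] * g_old[i] < 0:      # change of direction: we overshot, decelerate
                alpha[i] *= 0.5
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = 0.0             # force the last case next time
            else:
                x[i] -= alpha[i] * np.sign(g[i])
                g_old[i] = g[i]
        if np.max(np.abs(x_old - x)) < theta:
            return x
    return x

# usage on an ill-conditioned quadratic
grad = lambda x: np.array([1.0, 100.0]) * x
print(rprop(grad, np.array([1.0, 1.0])))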
    minx f(x)   s.t.   g(x) ≤ 0 ,  h(x) = 0
Figure 4: 2D example: f (x, y) = −x, pulling constantly to the right; three inequality
constraints, two active, one inactive. The “pull/push” vectors fulfil the stationarity
condition ∇f + λ1 ∇g1 + λ2 ∇g2 = 0.
For the following examples, draw the situation and guess, without much maths, where
the optimum is:
x optimal ⇒ ∃λ ∈ Rm, κ ∈ Rl s.t.
    ∇f(x) + Σi=1..m λi ∇gi(x) + Σj=1..l κj ∇hj(x) = 0   (stationarity)
    ∀j : hj(x) = 0 ,  ∀i : gi(x) ≤ 0   (primal feasibility)
    ∀i : λi ≥ 0   (dual feasibility)
    ∀i : λi gi(x) = 0   (complementarity)
Note that these are, in general, only necessary conditions. Only in special cases, e.g. for convex problems, are these also sufficient.
These conditions should be intuitive in the previous examples:
The first condition describes the “force balance” of the objective pulling and the active constraints pushing back. The existence of dual parameters λ, κ could implicitly be expressed by stating
The specific values of λ and κ tell us how strongly the constraints push against the objective, e.g., λi|∇gi| is the force exerted by the ith inequality.
The fourth condition very elegantly describes the logic of inequality constraints
being either active (λi > 0, gi = 0) or inactive (λi = 0, gi ≤ 0). Intuitively it says:
An inequality can only push at the boundary, where gi = 0, but not inside the
feasible region, where gi < 0. The trick of using the equation λi gi = 0 to express
this logic is beautiful, especially when later we discuss a case which relaxes this
strict logic to λi gi = −µ for some small µ—which roughly means that inequalities
may push a little also inside the feasible region.
Special case m = l = 0 (no constraints). The first condition is just the usual
∇f (x) = 0.
Discuss the previous examples as special cases; and how the force balance is met.
Assume you’d know about basic unconstrained optimization methods (like standard gra-
dient descent or the Newton method) but nothing about constrained optimization meth-
ods. How would you solve a constrained problem? Well, I think you’d very quickly have
the idea to introduce extra cost terms for the violation of constraints—a million people
have had this idea and successfully applied it in practice.
In the following we define a new cost function F (x), which includes the objective f (x)
and some extra terms.
Definition 4.2 (Log barrier, squared penalty, Lagrangian, Augmented Lagrangian).
    Fsp(x; ν, µ) = f(x) + ν Σj hj(x)² + µ Σi [gi(x) > 0] gi(x)²   (sqr. penalty)
    Flb(x; µ) = f(x) − µ Σi log(−gi(x))   (log barrier)
    L(x, λ, κ) = f(x) + Σj κj hj(x) + Σi λi gi(x)   (Lagrangian)
    L̂(x) = f(x) + Σj κj hj(x) + Σi λi gi(x) + ν Σj hj(x)² + µ Σi [gi(x) > 0] gi(x)²   (Aug. Lag.)

Figure 5: The function −µ log(−g) (with g on the “x-axis”) for various µ. This is always undefined (“∞”) for g > 0. For µ → 0 this becomes the hard step function.
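The squared penalty method repeatedly minimizes Fsp and then increases the penalty parameters; a minimal sketch for equality constraints only (the increase schedule, the use of an off-the-shelf inner solver, and the toy problem are my own choices):

import numpy as np
from scipy.optimize import minimize

def squared_penalty(f, h, x0, nu=1.0, rho=10.0, outer=6):
    """Squared penalty method: minimize F(x) = f(x) + nu * sum_j h_j(x)^2 with an
    unconstrained solver, then increase nu and repeat. Inequalities would be handled
    analogously with the [g_i > 0] g_i^2 terms of F_sp."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer):
        F = lambda x, nu=nu: f(x) + nu * np.sum(h(x) ** 2)
        x = minimize(F, x, method="BFGS").x      # inner unconstrained minimization
        nu *= rho                                # tighten the penalty
    return x

# usage: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0   (solution (0.5, 0.5))
f = lambda x: x @ x
h = lambda x: np.array([x[0] + x[1] - 1.0])
print(squared_penalty(f, h, np.zeros(2)))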
The log barrier method (see Fig. 5) does exactly the same, except that we decrease µ towards zero (multiply with a number < 1 in each iteration). Note that we need a feasible initialization x0, because otherwise the barriers are ill-defined! The whole algorithm will keep the temporary solutions always inside the feasible region (because the barriers push away from the constraints). That's why it is also called an interior point method.
That is, ∇L(x, λ, κ) = 0 is our first KKT condition! In that sense, the additional
terms in the Lagrangian generate the push forces of the constraints. If we knew
the correct λ’s and κ’s beforehand, then we could find the optimal x by the
unconstrained problem minx L(x, λ, κ) (if this has a unique solution).
This is not a main-stream algorithm, but I like it. See Toussaint (2014).
In the Augmented Lagrangian L̂, the solver has two types of knobs to tune: the strengths of the penalties ν, µ and the strengths of the Lagrangian forces λ, κ. The trick is conceptually easy:
Note that 2µ hj(x′) is the force (gradient) of the equality penalty at x′; and max(λi + 2µ gi(x′), 0) is the force of the inequality constraint at x′. What this update does is: it analyzes the forces exerted by the penalties, and translates them to forces exerted by the Lagrange terms in the next iteration. It tries to trade the penalizations for the Lagrange terms.
More rigorously, observe that, if f, g, h are linear and the same constraints are active in two consecutive iterations, then this update will guarantee that all penalty terms are zero in the second iteration, and therefore the solution fulfils the first KKT condition (Toussaint, 2014). See also the respective exercise.
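A minimal sketch of this scheme for inequality constraints only (my own choices for the inner solver and the toy problem; the dual update is the one described above):

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, g, x0, mu=1.0, outer=10):
    """Augmented Lagrangian for g(x) <= 0: inner unconstrained minimization of L_hat,
    then the dual update lambda_i <- max(lambda_i + 2*mu*g_i(x'), 0)."""
    x = np.asarray(x0, dtype=float)
    lam = np.zeros(len(g(x)))
    for _ in range(outer):
        def L_hat(x, lam=lam, mu=mu):
            gx = g(x)
            return f(x) + lam @ gx + mu * np.sum((gx > 0) * gx ** 2)
        x = minimize(L_hat, x, method="BFGS").x
        lam = np.maximum(lam + 2 * mu * g(x), 0.0)   # trade penalties for Lagrange terms
    return x, lam

# usage: min (x1-1)^2 + (x2-1)^2  s.t.  x1 + x2 <= 1   (solution (0.5, 0.5), lambda = 1)
f = lambda x: (x[0] - 1) ** 2 + (x[1] - 1) ** 2
g = lambda x: np.array([x[0] + x[1] - 1.0])
print(augmented_lagrangian(f, g, np.zeros(2)))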
The Lagrangian L(x, κ, λ) = f + κ>h + λ>g has a number of properties that relate it to the KKT conditions:
(iii) Requiring that L is maximized w.r.t. λ ≥ 0 is related to the remaining 2nd and 4th KKT conditions:
    maxλ≥0 L(x, λ) = { f(x) if g(x) ≤ 0 ;  ∞ otherwise }   (159)
    λ = argmaxλ≥0 L(x, λ)  ⇒  for each i: { λi = 0 if gi(x) < 0 ;  0 = ∇λi L(x, λ) = gi(x) otherwise }   (160)
This implies either (λi = 0 ∧ gi(x) < 0) or gi(x) = 0, which is equivalent to the complementarity and primal feasibility for inequalities.
These three facts show how tightly the Lagrangian is related to the KKT conditions. To simplify the discussion let us assume only inequality constraints from now on. Fact (i) tells us that if we minx L(x, λ), we reproduce the 1st KKT condition. Fact (iii) tells us that if we maxλ≥0 L(x, λ), we reproduce the remaining KKT conditions. Therefore, the optimal primal-dual solution (x∗, λ∗) can be characterized as a saddle point of the Lagrangian. Finding the saddle point can be written in two ways:
Convince yourself, using (159), that the first expression is indeed the original primal problem minx [ f(x) s.t. g(x) ≤ 0 ].
What can we learn from this? The KKT conditions state that, at an optimum, there exist some λ, κ. This existence statement is not very helpful to actually find them. In contrast, the Lagrangian tells us directly how the dual parameters can be found: by maximizing w.r.t. them. This can be exploited in several ways:
Here we first formulate the Lagrangian. In this context, κ is often called a Lagrange multiplier, but I prefer the term dual variable. Then we find a saddle point of L by requiring 0 = ∇x L(x, κ), 0 = ∇κ L(x, κ). If we want to solve a problem with an inequality constraint, we do the same calculus for both cases: 1) the constraint is active (handled like an equality constraint), and 2) the constraint is inactive. Then we check if the inactive case solution is feasible, or the active case is dual-feasible (λ ≥ 0). Note that if we have m inequality constraints we have to analytically evaluate every combination of constraints being active/inactive—which are 2^m cases. This already hints at the fact that a real difficulty in solving mathematical programs is to find out which inequality constraints are active or inactive. In fact, if we knew this a priori, everything would reduce to an equality constrained problem, which is much easier to solve.
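As a tiny worked illustration of this case enumeration (my own example, not from the script): consider minx x² s.t. g(x) = 1 − x ≤ 0, with Lagrangian L(x, λ) = x² + λ(1 − x). Inactive case (λ = 0): 0 = ∇x L = 2x gives x = 0, but g(0) = 1 > 0 is infeasible—discard. Active case (g(x) = 0): x = 1, and 0 = ∇x L = 2x − λ gives λ = 2 ≥ 0, which is dual feasible. Hence x∗ = 1, λ∗ = 2.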
In some cases the dual function l(λ) = minx L(x, λ) can be derived analytically. In this case it makes a lot of sense to try solving the dual problem instead of the primal.
First, the dual problem maxλ≥0 l(λ) is guaranteed to be convex even if the primal is
non-convex. (The dual function l(λ) is concave, and the constraint λ ≥ 0 convex.)
But note that l(λ) is itself defined as the result of a generally non-convex optimization
problem minx L(x, λ). Second, the inequality constraints of the dual problem are very
simple: just λ ≥ 0. Such inequality constraints are called bound constraints and can
be handled with specialized methods.
However, in general minx maxy f(x, y) ≠ maxy minx f(x, y). For example, in a discrete domain x, y ∈ {1, 2}, let f(1, 1) = 1, f(1, 2) = 3, f(2, 1) = 4, f(2, 2) = 2. Then minx f(x, y) = (1, 2) and maxy f(x, y) = (3, 4), so maxy minx f = 2 while minx maxy f = 3. Therefore, the dual problem is in general not equivalent to the primal.
The dual function is, for λ ≥ 0, a lower bound
    l(λ) = minx L(x, λ) ≤ minx [ f(x) s.t. g(x) ≤ 0 ] .   (166)
And consequently, maxλ≥0 l(λ) ≤ minx [ f(x) s.t. g(x) ≤ 0 ] = p∗ (weak duality).
4.5.4 Finding the “saddle point” directly with a primal-dual Newton method.
Note that the first equation is the 1st KKT condition, the 2nd is the 2nd KKT condition w.r.t. equalities, and the third is the approximate 4th KKT condition with log barrier parameter µ (see below). These three equations reflect the saddle point properties (facts (i), (ii), and (iii) above). We define
    r(x, λ, κ) = [ ∇f(x) + λ>∂g(x) + κ>∂h(x) ;  h(x) ;  diag(λ) g(x) + µ1m ]   (173)–(175)
and find a root r(x, λ, κ) = 0 with the Newton method, i.e., we iterate steps that solve ∂r(x, λ, κ) δ = −r(x, λ, κ), where ∂r(x, λ, κ) ∈ R(n+m+l)×(n+m+l). Note that this method uses the hessians ∇2f, ∇2g and ∇2h.
The above formulation allows for a duality gap µ. One could choose µ = 0, but often that is not robust. The beauty is that we can adapt µ on the fly, before each Newton step, so that we do not need a separate outer loop to adapt µ.
Before computing a Newton step, we compute the current duality measure µ̃ = −(1/m) Σi=1..m λi gi(x). Then we set µ = µ̃/2, half of this. In this way, the Newton step will compute a direction that aims to halve the current duality gap. In practice, this leads to good convergence in a single-loop Newton method. (See also Boyd sec 11.7.3.)
The dual feasibility λi ≥ 0 needs to be handled explicitly by the root finder – the line
search can simply clip steps to stay within the bound constraints.
Typically, the method is called “interior primal-dual Newton”, in which case also the
primal feasibility gi ≤ 0 has to be ensured. But I found there are tweaks to make the
method also handle infeasible x, including infeasible initializations.
Finally, let's revisit the log barrier method. In principle it is very simple: For a given µ, we use an unconstrained solver to find the minimum x∗(µ) of
    F(x; µ) = f(x) − µ Σi log(−gi(x)) .   (176)
(This process is also called “centering”.) We then gradually decrease µ to zero, always calling the inner loop to recenter. The generated path of x∗(µ) is called the central path.
The method is simple and has very insightful relations to the KKT conditions and the dual problem. For given µ, the optimality condition is
    ∇F(x; µ) = 0  ⇒  ∇f(x) − Σi (µ/gi(x)) ∇gi(x) = 0   (177)
    ⇔  ∇f(x) + Σi λi ∇gi(x) = 0 ,   λi gi(x) = −µ   (178)
Therefore, x∗ is actually the solution to minx L(x, λ), which defines the dual function. We have
    l(λ) = minx L(x, λ) = f(x∗) + Σi=1..m λi gi(x∗) = f(x∗) − mµ .   (180)
(m is simply the count of inequalities.) That is, mµ is the duality gap between the (suboptimal) f(x∗) and l(λ). Further, given that the dual function is a lower bound, l(λ) ≤ p∗, where p∗ = minx f(x) s.t. g(x) ≤ 0 is the optimal primal value, we have
    f(x∗) − p∗ ≤ mµ .   (181)
This gives the interpretation of µ as an upper bound on the suboptimality of f(x∗).
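A minimal sketch of the log barrier loop, following the centering/decrease scheme above; the inner loop uses plain gradient descent with backtracking that rejects infeasible steps. The toy problem and all parameters are my own choices:

import numpy as np

def center(f, grad_f, g, jac_g, x, mu, iters=200, lr0=0.1):
    """Inner loop: minimize F(x;mu) = f(x) - mu*sum_i log(-g_i(x)) by gradient
    descent with backtracking that rejects steps leaving the feasible region."""
    for _ in range(iters):
        gx = g(x)
        gradF = grad_f(x) - mu * jac_g(x).T @ (1.0 / gx)
        alpha = lr0
        while True:
            y = x - alpha * gradF
            if np.all(g(y) < 0) and f(y) - mu * np.sum(np.log(-g(y))) \
                                  <= f(x) - mu * np.sum(np.log(-gx)):
                break
            alpha *= 0.5
            if alpha < 1e-12:
                y = x
                break
        x = y
    return x

def log_barrier(f, grad_f, g, jac_g, x, mu=1.0, outer=15):
    """Outer loop: decrease mu; m*mu bounds the suboptimality as in eq. (181)."""
    for _ in range(outer):
        x = center(f, grad_f, g, jac_g, x, mu)
        mu *= 0.5
    return x

# usage: min x1 + x2  s.t.  x >= 0  (optimum 0 at (0,0)); start strictly feasible
f = lambda x: x[0] + x[1];      grad_f = lambda x: np.ones(2)
g = lambda x: -x;               jac_g = lambda x: -np.eye(2)
print(log_barrier(f, grad_f, g, jac_g, np.array([0.5, 0.5])))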
We do not put much emphasis on discussing convex problems in this lecture. The algorithms we discussed so far apply equally to general non-linear programs and to convex problems—of course, only for convex problems do we have convergence guarantees, as we can see from the convergence rate analysis of Wolfe steps based on the assumption of positive upper and lower bounds on the Hessian's eigenvalues.
Nevertheless, we at least define standard LPs, QPs, etc. Perhaps the most interesting part is the discussion of the Simplex algorithm—not because the algorithm is nice or particularly efficient, but rather because one gains a lot of insight into what actually makes (inequality) constrained problems hard.
A function is defined
convex ⇔ ∀x, y ∈ Rn , a ∈ [0, 1] : f (ax + (1−a)y) ≤ a f (x) + (1−a) f (y) (183)
quasiconvex ⇔ ∀x, y ∈ Rn , a ∈ [0, 1] : f (ax + (1−a)y) ≤ max{f (x), f (y)} (184)
Note: quasiconvex ⇔ for any α ∈ R the sublevel set {x|f (x) ≤ α} is convex.
Further, I call a function unimodal if it has only one local minimum, which is the global
minimum.
Variant 2 is the stronger and usual definition. Concerning variant 1: if the feasible set is convex, the zero-sublevel sets of all g's need to be convex and the zero-level sets of the h's need to be linear. Above these zero levels the g's and h's could in principle be arbitrarily non-linear, but these non-linearities are irrelevant for the mathematical program itself. We could replace such g's and h's by convex and linear functions and get the same problem.
Definition 4.6 (Linear program (LP), Quadratic program (QP)). Special cases of mathematical programs are
which includes Travelling Salesman, MaxSAT or MAP inference problems. Relaxing such a problem means to instead solve the continuous LP
    minx c>x   s.t.   Ax = b ,  xi ∈ [0, 1] .   (186)
If one is lucky and the continuous LP problem converges to a fully integer solution, where all xi ∈ {0, 1}, this is also the solution to the integer problem. Typically, the solution of the continuous LP will be partially integer (some values converge to the extremes xi ∈ {0, 1}, while others are in between, xi ∈ (0, 1)). This continuous-valued solution gives a lower bound on the integer problem, and provides very efficient heuristics for backtracking or branch-and-bound search for a fully integer solution.
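For illustration, such a relaxed LP can be handed to any LP solver; a minimal sketch using scipy.optimize.linprog with made-up data:

import numpy as np
from scipy.optimize import linprog

# A tiny relaxed integer LP as in eq. (186):  min c^T x  s.t.  A x = b,  x_i in [0,1].
c = np.array([1.0, 2.0, -1.0])
A = np.array([[1.0, 1.0, 1.0]]); b = np.array([2.0])

res = linprog(c, A_eq=A, b_eq=b, bounds=[(0.0, 1.0)] * 3)
print(res.x)   # for this toy data the relaxation happens to be integral: [1., 0., 1.]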
The standard example for a QP is the Support Vector Machine. The primal problem is
    minβ,ξ ||β||² + C Σi=1..n ξi   s.t.   yi(x>i β) ≥ 1 − ξi ,  ξi ≥ 0   (187)
the dual
    l(α, µ) = minβ,ξ L(β, ξ, α, µ) = −(1/4) Σi=1..n Σi′=1..n αi αi′ yi yi′ x̂>i x̂i′ + Σi=1..n αi   (188)
    maxα,µ l(α, µ)   s.t.   0 ≤ αi ≤ C   (189)
If the solution is bounded, there need to be some inequality constraints that keep the solution from travelling to ∞ in the −c direction.
It follows: The solution will always be located at a vertex, that is, an intersection point of several zero-hyperplanes of inequality constraints.
An idea for finding the solution is to walk on the edges of the polytope until an optimal vertex is found. This is the simplex algorithm of George Dantzig, 1947. In practice this procedure is done by “pivoting on the simplex tableaux”—but we fully skip such details here.
The simplex algorithm is often efficient, but in the worst case it is exponential in both n and m! This is hard to make intuitive, because the effects of high dimensions are not intuitive. But roughly, consider that in high dimensions there is a combinatorial number of ways in which constraints may intersect and form edges and vertices.
Here is a view that relates much more to our discussion of the log barrier method: Sitting on an edge/face/vertex is equivalent to temporarily deciding which constraints are active. If we knew which constraints are eventually active, the problem would be solved: all inequalities become equalities or void. (And linear equalities can directly be solved for.) So, jumping along vertices of the polytope is equivalent to sequentially making decisions on which constraints might be active. Note though that there are 2^m configurations of active/non-active constraints. The simplex algorithm therefore walks through this combinatorial space.
Interior point methods do exactly the opposite: Recall that the 4th KKT condition is λi gi(x) = 0. The log barrier method (for instance) instead relaxes this hard logic of active/non-active constraints and finds in each iteration a solution to the relaxed 4th KKT condition λi gi(x) = −µ, which intuitively means that every constraint may be “somewhat active”. In fact, every constraint contributes somewhat to the stationarity condition via the log barrier's gradients. Thereby interior point methods approach the optimal vertex from the inside of the polytope, avoiding the polytope surface (and its hard decisions).
Historically, penalty and barrier methods were standard before the Simplex Algorithm. When the simplex algorithm was discovered in the 1950s, it was quickly considered great. But then, later in the 1970s-80s, a lot more theory was developed for interior point methods, which now again have become somewhat more popular than the simplex algorithm.
Just for reference, SQP is another standard approach to solving non-linear mathematical programs. In each iteration we compute all coefficients of the 2nd-order Taylor f(x+δ) ≈ f(x) + ∇f(x)>δ + 1/2 δ>Hδ and the 1st-order Taylor g(x+δ) ≈ g(x) + ∇g(x)>δ and then solve the QP
    minδ f(x) + ∇f(x)>δ + 1/2 δ>∇2f(x) δ   s.t.   g(x) + ∇g(x)>δ ≤ 0   (190)
The optimal δ∗ of this problem should be seen as analogous to the optimal Newton step: If f were a 2nd-order polynomial and g linear, then δ∗ would jump directly to the optimum. However, as this is generally not the case, δ∗ only gives us a very good direction for line search. In SQP, we need to backtrack until we find a feasible point and f decreases sufficiently.
Solving each QP in the sub-routine requires a constrained solver, which itself might have
two nested loops (e.g. using log-barrier or AugLag). In that case, SQP has three nested
loops.
Even if f, g, h are smooth, the solver might not have access to analytic equations or
efficient numeric methods to evaluate the gradients or hessians of these. Therefore we
distinguish (here neglecting the constraint functions g and h):
Definition 4.7.
Quasi-Newton optimization: Only f (x) and ∇f (x) can be evaluated, but the
solver does tricks to estimate ∇2 f (x). (So this is a special case of 1st-order
optimization.)
Gauss-Newton type optimization: f is of the special form f(x) = φ(x)>φ(x) and ∂/∂x φ(x) can be evaluated.
In this lecture I very briefly want to add comments on global blackbox optimization.
Global means that we now, for the first time, aim to find the global optimum (within
some pre-specified bounded range). In essence, to address such a problem we need to
explicitly know what we know about f 7 , and an obvious way to do this is to use Bayesian
learning.
From now on, let’s neglect constraints and focus on the mathematical program
for a blackbox function f . The optimization process can be viewed as a Markov De-
cision Process that describes the interaction of the solver (agent) with the function
(environment):
The above defined what an optimal solver is! Something we haven't touched at all before. The transition dynamics of this MDP is deterministic, given f. However, from the perspective of the solver, we do not know f a priori. But we can always compute a posterior belief P(f |Dt) = P(Dt|f) P(f)/P(Dt). This posterior belief defines a belief MDP with stochastic transitions
    P(Dt+1) = ∫Dt ∫f ∫xt [Dt+1 = Dt ∪ {(xt, f(xt))}] π(xt|Dt) P(f |Dt) P(Dt) .   (193)
The belief MDP's state space is P(Dt) (or equivalently, P(f |Dt), the current belief over f). This belief MDP is something that the solver can, in principle, forward simulate—it has all information about it. One can prove that, if the solver could solve its own belief MDP (find an optimal policy for its belief MDP), then this policy is the optimal solver policy for the original problem given a prior distribution P(f)! So, in principle we not only defined what an optimal solver policy is, but can also provide an algorithm to compute it (dynamic programming in the belief MDP)! However, this is so expensive to compute that heuristics need to be used in practice.
One aspect we should learn from this discussion: The solver’s optimal decision is based
on its current belief P (f |Dt ) over the function. This belief is the Bayesian representation
of everything one could possibly have learned about f from the data Dt collected so far.
Bayesian Global Optimization methods compute P (f |Dt ) in every step and, based on
this, use a heuristic to choose the next decision.
In practice one typically uses a Gaussian Process representation of P(f |Dt). This means that in every iteration we have an estimate fˆ(x) of the function mean and a variance estimate σ̂(x)² that describes our uncertainty about the mean estimate. Based on this we may define the following acquisition functions
    αt(x) = H[p(x∗|Dt)] − E{p(y|Dt,x)} H[p(x∗|Dt ∪ {(x, y)})]   (197)
          = I(x∗; y|Dt) = H[p(y|Dt, x)] − E{p(x∗|Dt)} H[p(y|Dt, x, x∗)]
MPI is hardly being used in practice anymore. EI is classical, originating way back in the 1950s or earlier; Jones et al. (1998) gives an overview. UCB received a lot of attention recently due to the underlying bandit theory and bounded-regret theorems based on submodularity. But I think that in practice EI and UCB perform about equally, and UCB is somewhat easier to implement and more intuitive.
In all cases, note that the solver policy xt = argmaxx αt(x) requires internally solving another non-linear optimization problem. However, αt is an analytic function for which we can compute gradients and hessians, which ensures very efficient local convergence. But again, xt = argmaxx αt(x) needs to be solved globally—otherwise the solver will also not solve the original problem properly and globally. As a consequence, the optimization of the acquisition function needs to be restarted from many potential start points close to potential local minima; typically from grid(!) points over the full domain range. The number of grid points is exponential in the problem dimension n. Therefore, this inner loop can be very expensive.
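A minimal GP-UCB sketch using scikit-learn (the test function, the kernel parameters, and the exploration weight β are my own choices; the acquisition is maximized on a grid, which is exactly the expensive inner loop just discussed and only viable in low dimensions):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)    # hypothetical 1D blackbox to maximize
grid = np.linspace(0, 1, 200).reshape(-1, 1)
X, y = [np.array([[0.5]])], [f(0.5)]

for t in range(15):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6)
    gp.fit(np.vstack(X), np.array(y).ravel())
    mean, std = gp.predict(grid, return_std=True)
    beta = 2.0                                            # exploration weight (a free choice)
    ucb = mean + beta * std
    x_next = grid[np.argmax(ucb)]                         # maximize the acquisition function
    X.append(x_next.reshape(1, -1)); y.append(f(x_next[0]))

print(grid[np.argmax(gp.predict(grid))])                  # approx location of the maximum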
And a subjective note: This all sounds great, but be aware that Gaussian Processes with
standard squared-exponential kernels do not generalize much in high dimensions: one
roughly needs exponentially many data points to fully cover the domain and reduce belief
uncertainty globally, almost as if we were sampling from a grid with grid size equal to the
kernel width. So, the whole approach is not magic. It just does what is possible given a
belief P (f ). It would be interesting to have much more structured (and heteroscedastic)
beliefs specifically for optimization.
The last acquisition function is called Predictive Entropy Search. This formulation
is beautiful: We sample at places x where the (expected) observed value y informs us
as much as possible about the optimum x∗ of the function! Formally, this means to
maximize the mutual information between y and x∗ , in expectation over y|x.
Line 4 implements an explicit trust region approach, with a hard bound α on the step size.
Line 5 is like the Wolfe condition. But here, the expected decrease is [f(x̂) − fˆ(x̂ + δ)] instead of −α δ>∇f(x).
Line 10 uses the data determinant as a measure of quality! This is meant in the sense of linear regression on polynomial features. Note that, with data matrix X ∈ Rn×dim(β), β̂ls = (X>X)-1X>y is the optimal regression. The determinant det(X>X) or det(X) = det(D) is a measure for how well the data supports the regression. If the determinant is zero, the regression problem is ill-defined. The larger the determinant, the lower the variance of the regression estimator.
Line 11 is an explicit exploration approach: We add a data point solely for the purpose of increasing the data determinant (increasing the data spread).
There are interesting and theoretically well-grounded evolutionary algorithms for opti-
mization, such as Estimation-of-Distribution Algorithms (EDAs). But generally, don’t
use them as first choice.
a) Given a function f : Rn → R with fmin = minx f(x). Assume that its Hessian—that is, the eigenvalues of ∇2f—are lower bounded by m > 0 and upper bounded by M > m, with m, M ∈ R. Prove that for any x ∈ Rn it holds
    f(x) − 1/(2m) |∇f(x)|²  ≤  fmin  ≤  f(x) − 1/(2M) |∇f(x)|² .
Tip: Start with bounding f(x) between the functions with maximal and minimal curvature. Then consider the minima of these bounds. Note, it also follows:
    |∇f(x)|² ≥ 2m (f(x) − fmin) .
b) Consider backtracking line search with Wolfe parameter %ls ≤ 1/2, and step decrease factor %−α. First prove that the line search terminates at the latest when %−α/M ≤ α ≤ 1/M, and that it then found a new point y for which
    f(y) ≤ f(x) − (%ls %−α / M) |∇f(x)|² .
From this, using the result from a), prove the convergence equation
    f(y) − fmin ≤ [ 1 − 2m %ls %−α / M ] (f(x) − fmin) .
with diagonal matrix C and entries C(i, i) = c^((i−1)/(n−1)), where n is the dimensionality of x. We choose a conditioning8 c = 10. To plot the function for n = 2, you can use gnuplot, calling
8 The word “conditioning” generally denotes the ratio of the largest and smallest eigenvalue of the Hessian.
b) Play around with parameters. How does the performance change for higher dimen-
sions, e.g., n = 100? How does the performance change with ρls (the Wolfe stop
criterion)? How does the alternative in step 3 work?
c) Newton step: Modify the algorithm simply by multiplying C -1 to the step. How does
that work?
(The Newton direction diverges (is undefined) in the concave part of fhole (x). We’re
cheating here when always multiplying with C -1 to get a good direction.)
4.8.3 Gauss-Newton
Consider the function (the same Rastrigin-type sum-of-squares function that also appears in a later exercise)
    f(x) = φ(x)>φ(x) ,   φ(x) = ( sin(a x1), sin(a c x2), 2 x1, 2 c x2 )> ,   x ∈ R².
The function is plotted above for a = 4 (left) and a = 5 (right, having local minima), and conditioning c = 1. The function is non-convex.
Extend your backtracking method implemented in the last week's exercise to a Gauss-Newton method (with constant λ) to solve the unconstrained minimization problem minx f(x) for a random start point in x ∈ [−1, 1]². Compare the algorithm for a = 4 and a = 5 and conditioning c = 3 with gradient descent.
    ∂x1 f(x) = (1/g(x)) [ −4(x2 − x1²) x1 ]   (202)
    ∂x2 f(x) = (1/g(x)) [ 2(x2 − x1²) − (2/100)(1 − x2) ]   (203)
    ∂²x1 f(x) = −[∂x1 f(x)]² + (1/g(x)) [ 8 x1² − 4(x2 − x1²) ]   (204)
    ∂²x2 f(x) = −[∂x2 f(x)]² + (1/g(x)) [ 2 + 2/100 ]   (205)
    ∂x1 ∂x2 f(x) = −[∂x1 f(x)][∂x2 f(x)] + (1/g(x)) [ −4 x1 ]   (206)
(The ’-0.01’ ensures that you can see the contour at the optimum.) List and discuss at least three properties of the function (at different locations) that may pose problems for naive optimizers.
b) Use x = (−3, 3) as starting point for an optimization algorithm. Try to code an
optimization method that uses all ideas mentioned in the lecture. Try to tune it to be
efficient on this problem (without cheating, e.g. by choosing a perfect initial stepsize.)
In a previous exercise we defined the “hole function” f^c_hole(x). Assume conditioning c = 10 and use the Lagrangian Method of Multipliers to solve on paper the following constrained optimization problem in 2D:
    minx f^c_hole(x)   s.t.   h(x) = 0   (207)
    h(x) = v>x − 1   (208)
Near the very end, you won't be able to proceed until you have special values for v. Go as far as you can without the need for these values.
with variable x ∈ R.
a) Derive the optimal solution x∗ and the optimal value p∗ = f (x∗ ) by hand.
b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for
various values of λ ≥ 0. Verify the lower bound property minx L(x, λ) ≤ p∗ , where p∗
is the optimum value of the primal problem.
c) Derive the dual function l(λ) = minx L(x, λ) and plot it (for λ ≥ 0). Derive the dual
optimal solution λ∗ = argmaxλ l(λ). Is maxλ l(λ) = p∗ (strong duality)?
g(x) = (212)
−x1
a) First, assume x ∈ R2 is 2-dimensional, and draw on paper what the problem looks
like and where you expect the optimum.
b) Find the optimum analytically using the Lagrangian. Here, assume that you know
apriori that all constraints are active! What are the dual parameters λ = (λ1 , λ2 )?
Note: Assuming that you know a priori which constraints are active is a huge assumption!
In real problems, this is the actual hard (and combinatorial) problem. More on this later
in the lecture.
c) Implement a simple Log Barrier Method. Tips:
– Initialize x = (1/2, 1/2) and µ = 1
– First code an inner loop:
– In each iteration, first compute the gradient of the log-barrier function. Recall that
    F(x; µ) = f(x) − µ Σi log(−gi(x))   (213)
    ∇F(x; µ) = ∇f − µ Σi (1/gi(x)) ∇gi(x)   (214)
– Then perform a backtracking line search along −∇F(x, µ). In particular, backtrack if a step goes beyond the barrier (where g(x) ≰ 0 and F(x, µ) = ∞).
– Iterate until convergence; let's call the result x∗(µ). Further, compute λ∗(µ) = −(µ/g1(x), µ/g2(x)) at convergence.
– Decrease µ ← µ/2, recompute x∗(µ) (with the previous x∗ as initialization) and iterate this.
These exercises focus on the first type, which is just as important as the second, as it enables the use of a wider range of solvers. Exercises from Boyd et al. http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf:
Solve Exercise 4.12 (pdf page 193) from Boyd & Vandenberghe, Convex Optimization.
Solve Exercise 4.16 (pdf page 194) from Boyd & Vandenberghe, Convex Optimization.
(This is a subset of Exercise 4.11 (pdf page 193) from Boyd & Vandenberghe.)
Let x ∈ Rn. The optimization problem is minx ||x − b||1, where the ℓ1-norm is defined as ||z||1 = Σi=1..n |zi|. Reformulate this optimization problem as a Linear Program.
The following function is essentially the Rastrigin function, but written slightly differently. It can be tuned to become uni-modal and is a sum-of-squares problem. For x ∈ R² we define
    f(x) = φ(x)>φ(x) ,   φ(x) = ( sin(a x1), sin(a c x2), 2 x1, 2 c x2 )>
The function is plotted above for a = 4 (left) and a = 5 (right, having local minima), and conditioning c = 1. The function is non-convex.
Choose a = 6 or larger and implement a random restart method: Repeat initializing x ∼ U([−2, 2]²) uniformly, followed by a gradient descent (with backtracking line search and monotone convergence).
Restart the method at least 100 times. Count how often the method converges to which local optimum.
Find an implementation of Gaussian Processes for your language of choice (e.g. python: scikit-learn, or Sheffield/GPy; octave/matlab: gpml) and implement GP-UCB global optimization. Test your implementation with different hyperparameters (find the best combination of kernel and its parameters in the GP) on the 2D function defined above.
On the webpage you find a starting code to use GP regression in scikit-learn. To install
scikit-learn: https://scikit-learn.org/stable/install.html
Instead, we first recap the very basics of probability theory, which I assume the reader has already seen before. The next section will cover this. Then we focus on specific topics that, in my opinion, deepen the understanding of the basics, such as the relation between optimization and probabilities, log-probabilities & energies, maxEntropy and maxLikelihood, minimal description length and learning.
5.1 Basics
First, in case you wonder about justifications of the use of (Bayesian) probabilities versus fuzzy sets or alike, here are some pointers to look up: 1) Cox's theorem, which derives the standard probability axioms from basic assumptions about “rationality and consistency”; 2) t-norms, which generalize probability and fuzzy calculus; and 3) read about objective vs. subjective Bayesian probability.
– P (S) = 1 (normalization)
Implications are:
– 0 ≤ P (A) ≤ 1
– P (∅) = 0
– A ⊆ B ⇒ P (A) ≤ P (B)
– P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
– P (S \ A) = 1 − P (A)
    ∀x∈Ω : 0 ≤ P(X = x) ≤ 1
    Σx∈Ω P(X = x) = 1
The conditional is defined as P(X|Y) = P(X,Y)/P(Y), which fulfils ∀Y : ΣX P(X|Y) = 1.
    P(X|Y) = P(Y|X) P(X) / P(Y)   (217)
RV — parameter — distribution:
– Bernoulli: x ∈ {0, 1}; µ ∈ [0, 1]; Bern(x | µ) = µ^x (1 − µ)^(1−x)
– Beta: µ ∈ [0, 1]; a, b ∈ R+; Beta(µ | a, b) = 1/B(a,b) µ^(a−1) (1 − µ)^(b−1)
– Multinomial: x ∈ {1, .., K}; µ ∈ [0, 1]^K with ||µ||1 = 1; P(x = k | µ) = µk
– Dirichlet: µ ∈ [0, 1]^K with ||µ||1 = 1; α1, .., αK ∈ R+; Dir(µ | α) ∝ Πk=1..K µk^(αk−1)
Clearly, the Multinomial is a generalization of the Bernoulli, as the Dirichlet is of the Beta. The mean of the Dirichlet is ⟨µi⟩ = αi / Σj αj, its mode is µ∗i = (αi − 1)/(Σj αj − K). The mode of a distribution p(x) is defined as argmaxx p(x).
    p(x) ∈ C  ⇒  p(x|D) = p(D|x) p(x) / p(D) ∈ C .   (219)
RV likelihood conjugate
µ Binomial Bin(D | µ) Beta Beta(µ | a, b)
µ Multinomial Mult(D | µ) Dirichlet Dir(µ | α)
µ Gauss N(x | µ, Σ) Gauss N(µ | µ0 , A)
λ 1D Gauss N(x | µ, λ-1 ) Gamma Gam(λ | a, b)
Λ nD Gauss N(x | µ, Λ-1 ) Wishart Wish(Λ | W, ν)
(µ, Λ) nD Gauss N(x | µ, Λ-1 ) Gauss-Wishart
N(µ | µ0 , (βΛ)-1 ) Wish(Λ | W, ν)
One comment about integrals. If p(x) is a probability density function and f (x) some
arbitrary function, typically one writes
    ∫x f(x) p(x) dx ,   (221)
where dx denotes the (Borel) measure we integrate over. However, some authors (cor-
rectly) think of a distribution p(x) as being a measure over the space dom(x) (instead
of just a function). So the above notation is actually “double” w.r.t. the measures. So
they might (also correctly) write
    ∫x p(x) f(x) ,   (222)
and take care that there is exactly one measure to the right of the integral.
5.1.5 Gaussian
Definition 5.6. We define an n-dim Gaussian in normal form as
    N(x | µ, Σ) = 1/|2πΣ|^(1/2) exp{−1/2 (x − µ)> Σ-1 (x − µ)}   (223)
and in canonical form as
    N[x | a, A] = exp{−1/2 a>A-1 a}/|2πA-1|^(1/2) exp{−1/2 x>A x + x>a}   (224)
with precision matrix A = Σ-1 and coefficient a = Σ-1 µ (and mean µ = A-1 a).
Gaussians are used all over—below we explain in what sense they are the probabilistic analogue to a parabola (or a 2nd-order Taylor expansion). The most important properties are:
Transformation:
    N(Fx + f | a, A) = (1/|F|) N(x | F-1(a − f), F-1 A F->)
Marginal & conditional:
    N( [x; y] | [a; b], [[A, C], [C>, B]] ) = N(x | a, A) · N(y | b + C>A-1(x − a), B − C>A-1C)
Example 5.1 (ML estimator of the mean of a Gaussian). Assume we have data D = {x1, .., xn}, each xi ∈ Rn, with likelihood
    P(D | µ, Σ) = Πi N(xi | µ, Σ)   (225)
    argmaxµ P(D | µ, Σ) = 1/n Σi=1..n xi   (226)
    argmaxΣ P(D | µ, Σ) = 1/n Σi=1..n (xi − µ)(xi − µ)>   (227)
Assume we are initially uncertain about µ (but know Σ). We can express this
uncertainty using again a Gaussian N[µ | a, A]. Given data we have
Usually, “particles” are not listed as a standard continuous distribution. However, I think they should be. They're heavily used in several contexts, especially for approximating other distributions in Monte Carlo methods and particle filters.
    δ(x) = ∂/∂x H(x) ,   H(x) = [x ≥ 0] .   (231)
It is awkward to think of δ(x) as a normal function, as it'd be “infinite” at zero. But at least we understand that it has the properties
    δ(x) = 0 everywhere except at x = 0 ,   ∫ δ(x) dx = 1 .   (232)
I sometimes call the Dirac distribution also a point particle: it has all its unit “mass” concentrated at zero.
We say that a particle distribution q(x) approximates another distribution p(x) iff for any (smooth) f
    ⟨f(x)⟩p = ∫x f(x) p(x) dx ≈ Σi=1..N wi f(xi)   (234)
Note the generality of this statement! f could be anything, it could be any features of the variable x, like coordinates of x, or squares, or anything. So basically this statement says: whatever you might like to estimate about p, you can approximate it based on the particles q.
Computing particle approximations of complex (non-analytical, non-tractable) distributions p is a core challenge in many fields. The true p could for instance be a distribution over games (action sequences). The approximation q could for instance be samples generated with Monte Carlo Tree Search (MCTS). The tutorial An Introduction to MCMC for Machine Learning www.cs.ubc.ca/~nando/papers/mlintro.pdf gives an excellent introduction. It contains several illustrations of what it means to approximate some p by particles q: the black line is p, and histograms illustrate the particles q by showing how many of the (uniformly weighted) particles fall into a bin.
There is a natural relation between probabilities and “energy” (or “error”). Namely, if p(x) denotes a probability for every possible value of x, and E(x) denotes an energy for state x—or an error one assigns to choosing x—then a natural relation is p(x) = e^(−E(x)).
Why is that? First, outside the context of physics it is perfectly fair to just define axiomatically an energy E(x) as neg-log-probability. But let me try to give some more arguments for why this is a useful definition.
Let's assume we have p(x). We want to find a quantity, let's call it error E(x), which is a function of p(x). Intuitively, if a certain value x1 is more likely than another, p(x1) > p(x2), then picking x1 should imply less error, E(x1) < E(x2) (Axiom 1). Further, when we have two independent random variables x and y, probabilities are multiplicative, p(x, y) = p(x)p(y). We require axiomatically that error is additive, E(x, y) = E(x) + E(y). From both follows that E needs to be some logarithm of p!
The same argument, now talking more about energy: Assume we have two independent (physical) systems x and y. p(x, y) = p(x)p(y) is the probability to find them in certain states. We axiomatically require that energy is additive, E(x, y) = E(x) + E(y). Again, E needs to be some logarithm of p. In the context of physics, what could be questioned is “why is p(x) a function of E(x) in the first place?”. Well, that is much harder to explain and really is a question about statistical physics. Wikipedia under the keywords “Maxwell-Boltzmann statistics” and “Derivation from microcanonical ensemble” gives an answer. Essentially the argument is as follows: Given many, many molecules in a gas, each of which can have a different energy ei. The total energy E = Σi=1..n ei must be conserved. What is the distribution over energy levels that has the most microstates? The answer is the Boltzmann distribution. (And why do we, in nature, find energy distributions that have the most microstates? Because these are most likely.)
Bottom line is: p(x) = e−E(x) , probabilities are multiplicative, energies or errors additive.
Let me state some facts just to underline how useful this way of thinking is:
In machine learning, when data D is given and we have some model β, we typically try to maximize the likelihood p(D|β). This is equivalent to minimizing the neg-log-likelihood −log p(D|β).
This neg-log-likelihood is a typical measure for the error of the model. And this error is additive w.r.t. the data, whereas the likelihood is multiplicative, fitting perfectly to the above discussion.
The Gaussian distribution p(x) ∝ exp{−1/2 ||x − µ||²/σ²} is related to the error E(x) = 1/2 ||x − µ||²/σ², which is nothing but the squared error with the precision matrix as metric. That's why squared error measures (classical regression) and Gaussian distributions (e.g., Bayesian Ridge regression) are directly related. A Gaussian is the probabilistic analogue to a parabola.
Often h(x) = 1, so let's neglect this for now. The key point is that the energy is linear in the features φ(x). This is exactly how discriminative functions (for classification in Machine Learning) are typically formulated.
In the continuous case, the features φ(x) are often chosen as basis polynomials—just as in polynomial regression. Then, β are the coefficients of the energy polynomial and the exponential family is just the probabilistic analogue to the space of polynomials.
When we have many variables x1, .., xn, the structure of a cost function over these variables can often be expressed as being additive in terms: f(x1, .., xn) = Σi φi(x∂i), where ∂i denotes the ith group of variables. The respective Boltzmann distribution is a factor graph p(x1, .., xn) ∝ Πi fi(x∂i) = exp{Σi βi φi(x∂i)}.
So, factor graphs are the probabilistic analogue to additive functions.
−log p(x) is also the “optimal” coding length you should assign to a symbol x.
Entropy is expected error: H[p] = Σx −p(x) log p(x) = ⟨−log p(x)⟩p(x), where p itself is used to take the expectation.
Assume you use a “wrong” distribution q(x) to decide on the coding length of symbols drawn from p(x). The expected length of an encoding is ∫x p(x)[−log q(x)] ≥ H(p).
The Kullback-Leibler divergence is the difference:
    D(p ‖ q) = ∫x p(x) log (p(x)/q(x)) ≥ 0   (239)
So, my message is that probabilities and error measures are naturally related. However,
in the first case we typically do inference, in the second we optimize. Let’s discuss the
relation between inference and optimization a bit more. For instance, given data D and
parameters β, we may define
Definition 5.9 (ML, MAP, and Bayes estimate). Given data D and a parameteric
model p(D|β), we define
Both the MAP and the ML estimates are really just optimization problems.
The Bayesian parameter estimate P(β|D), which can then be used to do fully Bayesian prediction, is in principle different. However, in practice also here optimization is a core tool for estimating such distributions if they cannot be given analytically. This is described next.
Consider the following problem. We have data drawn i.i.d. from p(x) where x ∈ X for some discrete space X. Let's call every x a word. The problem is to find a mapping from words to codes, e.g. binary codes c : X → {0, 1}∗. The optimal solution is in principle simple: Sort all possible words in a list, ordered by p(x) with more likely words going first; write all possible binary codes in another list, with increasing code lengths. Match the two lists, and this is the optimal encoding.
Let's try to get a more analytical grip on this: Let l(x) = |c(x)| be the actual code length assigned to word x, which is an integer value. Let's define
    q(x) = (1/Z) 2^(−l(x))   (241)
with the normalization constant Z = Σx 2^(−l(x)). Then we have
    Σx∈X p(x)[−log2 q(x)] = −Σx p(x) log2 2^(−l(x)) + Σx p(x) log2 Z   (242)
                          = Σx p(x) l(x) + log2 Z .   (243)
What about log2 Z? Let l-1(s) be the set of words that have been assigned codes of length s. There can only be a limited number of words encoded with a given length. For instance, |l-1(1)| must not be greater than 2, |l-1(2)| must not be greater than 4, and |l-1(s)| must not be greater than 2^s. We have
    ∀s : Σx∈X [l(x) = s] ≤ 2^s   (244)
    ∀s : Σx∈X [l(x) = s] 2^(−s) ≤ 1   (245)
    ∀s : Σx∈X 2^(−l(x)) ≤ 1   (246)
However, this way of thinking is ok for separated codes. If such codes were in a continuous stream of bits you'd never know where a code starts or ends. Prefix codes fix this problem by defining a code tree with leaves that clearly define when a code ends. For prefix codes it similarly holds
    Z = Σx∈X 2^(−l(x)) ≤ 1 ,   (247)
Assume we want to estimate some q(x) we cannot express analytically. E.g., q(x) = p(x|D) ∝ P(D|x) p(x) for some awkward likelihood function P(D|x). An example from robotics is: x is a stochastically controlled path of a robot. p(x) is a prior distribution over paths that includes how the robot can actually move and some Gaussian prior (squared costs!) over controls. If the robot is “linear”, p(x) can be expressed nicely and analytically; if it is non-linear, expressing p(x) is already hard. However, P(D|x) might indicate that we do not see collisions on the path—but collisions are a horrible function, usually computed by some black-box collision detection package that computes distances between convex meshes, perhaps giving gradients but certainly not some analytic function. So q(x) can clearly not be expressed analytically.
One way to approximate q(x) is the Laplace approximation: writing q(x) ∝ exp{−E(x)} with energy E(x) = −log q(x), we Taylor-expand E to 2nd order at its mode x∗ and obtain the Gaussian approximation q(x) ≈ N(x | x∗, ∇2E(x∗)-1).
First, we observe that the Laplace approximation is a Gaussian, because its energy is a parabola. Further, notice that in the Taylor expansion we skipped the linear term. That's because we are at the mode x∗ where ∇E(x∗) = 0.
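A minimal 1D sketch of the Laplace approximation (the energy below is a made-up example, a Gaussian prior times a sigmoid likelihood, similar in spirit to the GP classification exercise later; the Hessian is taken by finite differences):

import numpy as np
from scipy.optimize import minimize

# Laplace approximation of an (unnormalized) density q(x) ∝ exp(-E(x)):
# find the mode x*, take the Hessian of the energy there, and use N(x | x*, H^-1).
E = lambda x: 0.5 * x ** 2 + np.log(1.0 + np.exp(-x))    # energy = -log q + const

x_star = minimize(lambda v: E(v[0]), x0=[0.0]).x[0]      # mode of q = minimum of E
eps = 1e-4
H = (E(x_star + eps) - 2 * E(x_star) + E(x_star - eps)) / eps ** 2   # d^2E/dx^2 at x*
print("Laplace approximation: N(x |", x_star, ",", 1.0 / H, ")")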
Recall our notion of steepest descent—it depends on the metric in the space!
Consider the space of probability distributions p(x; β) with parameters β. We think of
every p(x; β) as a point in the space and wonder what metric is useful to compare two
points p(x; β1 ) and p(x; β2 ). Let’s take the KLD
TODO
: Let p ∈ ΛX , that is, p is a probability distribution over the space X. Further, let
θ ∈ Rn and θ 7→ p(θ) is some parameterization of the probability distribution. Then
the derivative dθ p(θ) ∈ Tp ΛX is a vector in the tangent space of ΛX . Now, for such
vectors, for tangent vectors of the space of probability distributions, there is a generic
metric, the Fisher metric: [TODO: move to ’probabilities’ section]
Note: These exercises are for ’extra credits’. We'll discuss them on Thu, 21st Jan.
(These are taken from MacKay's book Information Theory..., Exercise 22.12 & .13)
a) Assume that a random variable x with discrete domain dom(x) = X comes from a probability distribution of the form
    P(x | w) = 1/Z(w) exp[ Σk=1..d wk fk(x) ] ,
where the functions fk(x) are given, and the parameters w ∈ Rd are not known. A data set D = {xi}i=1..n of n points x is supplied. Show by differentiating the log likelihood log P(D|w) = Σi=1..n log P(xi|w) that the maximum-likelihood parameters w∗ = argmaxw log P(D|w) satisfy
    Σx∈X P(x | w∗) fk(x) = 1/n Σi=1..n fk(xi)
where the left-hand sum is over all x, and the right-hand sum is over the data points. A shorthand for this result is that each function-average under the fitted model must equal the function-average found in the data:
    ⟨fk⟩P(x | w∗) = ⟨fk⟩D
b) When confronted by a probability distribution P(x) about which only a few facts are known, the maximum entropy principle (MaxEnt) offers a rule for choosing a distribution that satisfies those constraints. According to MaxEnt, you should select the P(x) that maximizes the entropy
    H(P) = −Σx P(x) log P(x)
subject to the constraints. Assuming the constraints assert that the averages of certain functions fk(x) are known, i.e.,
    ⟨fk⟩P(x) = Fk ,
show, by introducing Lagrange multipliers (one for each constraint, including normalization), that the maximum-entropy distribution has the form
    PMaxEnt(x) = 1/Z exp[ Σk wk fk(x) ]
where the parameters Z and wk are set such that the constraints are satisfied. And hence the maximum entropy method gives identical results to maximum likelihood fitting of an exponential-family model.
Note: The exercise will take place on Tue, 2nd Feb. Hung will also prepare how many ‘votes’ you collected in the exercises.
Assume we have a very large data set D = {xi}i=1..n of samples xi ∼ q(x) from some data distribution q(x). Using this data set we can approximate any expectation
    ⟨f⟩q = Σx q(x) f(x) ≈ 1/n Σi=1..n f(xi) .
Assume we have a parametric family of distributions p(x|β) and want to find the Maximum Likelihood (ML) parameter β∗ = argmaxβ p(D|β). Express this ML problem as a KL-divergence minimization.
In the context of so-called “Gaussian Process Classification” the following problem arises (we neglect the dependence on x here): We have a real-valued RV f ∈ R with prior P(f) = N(f | µ, σ²). Further we have a Boolean RV y ∈ {0, 1} with conditional probability
    P(y = 1 | f) = σ(f) = e^f / (1 + e^f) .
The function σ is called the sigmoid function, and f is a discriminative value which predicts y = 1 if it is very positive, and y = 0 if it is very negative. The sigmoid function has the property
    ∂/∂f σ(f) = σ(f) (1 − σ(f)) .
Given that we observed y = 1 we want to compute the posterior P(f | y = 1), which cannot be expressed analytically. Provide the Laplace approximation of this posterior.
(Bonus) As an alternative to the sigmoid function σ(f), we can use the probit function φ(z) = ∫−∞..z N(x|0, 1) dx to define the likelihood P(y = 1 | f) = φ(f). Now how can the posterior P(f | y = 1) be approximated?
In a very abstract sense, learning means to model the distribution p(x) for given data D = {xi}i=1..n. This is literally the case for unsupervised learning; regression, classification and graphical model learning could be viewed as specific instances of this where x factors into several random variables, like input and output.
Show in which sense the problem of learning is equivalent to the problem of compression.
Get three text files from the Web, of approximately equal length and consisting mostly of
plain text (no equations or the like). Two of them should be in English, the third in French.
(Alternatively, though I am not sure it would work: two of them on a very similar topic, the
third on a very different one.) How can you use gzip (or some other compression tool) to
estimate the mutual information between every pair of files? How can you ensure some
"normalized" measure which does not depend too much on the absolute lengths of the texts?
Do it and check whether you in fact find that two texts are similar while the third is different.
(Extra) Lempel-Ziv algorithms (like gzip) need to build a codebook on the fly. How does
that fit into the picture?
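One possible way to carry this out (a sketch; the file names are placeholders, and the normalized compression distance is just one reasonable normalization):

```python
import gzip

# Estimate how much two files share by comparing gzip sizes.
# C(x) = compressed size of x; C(xy) = compressed size of the concatenation.
# NCD(x,y) = (C(xy) - min(C(x),C(y))) / max(C(x),C(y)) is small for similar texts.
def csize(data: bytes) -> int:
    return len(gzip.compress(data))

def ncd(a: bytes, b: bytes) -> float:
    ca, cb, cab = csize(a), csize(b), csize(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

# placeholder file names: replace with the three downloaded texts
files = ["english1.txt", "english2.txt", "french.txt"]
texts = [open(f, "rb").read() for f in files]

for i in range(len(files)):
    for j in range(i + 1, len(files)):
        print(files[i], files[j], "NCD =", round(ncd(texts[i], texts[j]), 3))
```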
A Gaussian identities
Definitions
We define a Gaussian over x with mean a and covariance matrix A as the function
$$ N(x\mid a, A) \;=\; \frac{1}{|2\pi A|^{1/2}}\, \exp\{-\tfrac{1}{2} (x-a)^\top A^{-1} (x-a)\} \quad (259) $$
with property $N(x\mid a, A) = N(a\mid x, A)$. We also define the canonical form with precision matrix A as
$$ N[x\mid a, A] \;=\; \frac{\exp\{-\tfrac{1}{2} a^\top A^{-1} a\}}{|2\pi A^{-1}|^{1/2}}\, \exp\{-\tfrac{1}{2} x^\top A\, x + x^\top a\} \quad (260) $$
with properties $N[x\mid a, A] = N(x\mid A^{-1} a,\ A^{-1})$ and $N(x\mid a, A) = N[x\mid A^{-1} a,\ A^{-1}]$.
Non-normalized Gaussian
Derivatives
Product
The product of two Gaussians can be expressed as
\begin{align*}
N(x\mid a, A)\; N(x\mid b, B) &= N[x \mid A^{-1} a + B^{-1} b,\ A^{-1} + B^{-1}]\ \, N(a\mid b,\ A + B) && (277)\\
&= N\big(x \mid B(A{+}B)^{-1} a + A(A{+}B)^{-1} b,\ A(A{+}B)^{-1} B\big)\ \, N(a\mid b,\ A + B) && (278)\\
N[x\mid a, A]\; N[x\mid b, B] &= N[x \mid a + b,\ A + B]\ \, N\big(A^{-1} a \mid B^{-1} b,\ A^{-1} + B^{-1}\big) && (279)\\
&= N[x\mid \ldots]\ \, N\big[A^{-1} a \mid A(A{+}B)^{-1} b,\ A(A{+}B)^{-1} B\big] && (280)\\
&= N[x\mid \ldots]\ \, N\big[A^{-1} a \mid (1 - B(A{+}B)^{-1})\, b,\ (1 - B(A{+}B)^{-1})\, B\big] && (281)
\end{align*}
$$ N(x\mid a, A)\; N[x\mid b, B] \;=\; N[x \mid A^{-1} a + b,\ A^{-1} + B]\ \, N\big(a \mid B^{-1} b,\ A + B^{-1}\big) $$
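As a quick numerical sanity check of identity (277) (a sketch, not part of the original notes; dimensions and parameters are arbitrary):

```python
import numpy as np

# Evaluate N(x|a,A), the canonical N[x|s,S], and check the product identity (277)
#   N(x|a,A) N(x|b,B) = N[x | A^-1 a + B^-1 b, A^-1 + B^-1] N(a | b, A+B)
# pointwise at a random x. All numbers below are arbitrary test values.

def N(x, a, A):                     # Gaussian density, covariance form (259)
    d = x - a
    return np.exp(-0.5 * d @ np.linalg.solve(A, d)) / np.sqrt(np.linalg.det(2 * np.pi * A))

def Ncan(x, s, S):                  # canonical form N[x|s,S] with precision S, cf. (260)
    norm = np.exp(-0.5 * s @ np.linalg.solve(S, s)) / np.sqrt(np.linalg.det(2 * np.pi * np.linalg.inv(S)))
    return norm * np.exp(-0.5 * x @ S @ x + x @ s)

rng = np.random.default_rng(0)
n = 3
a, b, x = rng.normal(size=(3, n))
A = np.eye(n) + 0.5 * np.diag(rng.random(n))     # two SPD covariance matrices
B = 2 * np.eye(n) + 0.5 * np.diag(rng.random(n))

lhs = N(x, a, A) * N(x, b, B)
rhs = Ncan(x, np.linalg.solve(A, a) + np.linalg.solve(B, b),
           np.linalg.inv(A) + np.linalg.inv(B)) * N(a, b, A + B)
print(lhs, rhs)                      # should agree up to numerical precision
```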
Convolution
Division
For $N(x\mid a, A)\,/\,N(x\mid b, B) \propto N(x\mid c, C)$:
$$ C^{-1} c \;=\; A^{-1} a - B^{-1} b \,,\qquad C^{-1} \;=\; A^{-1} - B^{-1} \quad (285) $$
$$ N[x\mid a, A]\,/\,N[x\mid b, B] \;\propto\; N[x\mid a - b,\ A - B] \quad (286) $$
Expectations
Let $x \sim N(x\mid a, A)$. Then
$$ E_x\{g(x)\} \;:=\; \int_x N(x\mid a, A)\, g(x)\, dx \quad (287) $$
$$ E_x\{x\} = a \,,\qquad E_x\{x x^\top\} = A + a a^\top \quad (288) $$
$$ E_x\{f + F x\} = f + F a \quad (289) $$
$$ E_x\{x^\top x\} = a^\top a + \operatorname{tr}(A) \quad (290) $$
$$ E_x\{(x-m)^\top R\, (x-m)\} = (a-m)^\top R\, (a-m) + \operatorname{tr}(R A) \quad (291) $$
“Propagation” (propagating a message along a coupling, using eqs (277) and (283),
respectively)
$$ N(x\mid a, A)\; N(y\mid b + Fx,\, B) \;=\; N\!\left(\begin{pmatrix}x\\ y\end{pmatrix} \,\middle|\, \begin{pmatrix}a\\ b + Fa\end{pmatrix},\ \begin{pmatrix}A & A^\top\! F^\top\\ FA & B + F A^\top\! F^\top\end{pmatrix}\right) \quad (299) $$
$$ N\!\left(\begin{pmatrix}x\\ y\end{pmatrix} \,\middle|\, \begin{pmatrix}a\\ b\end{pmatrix},\ \begin{pmatrix}A & C\\ C^\top & B\end{pmatrix}\right) \;=\; N(x\mid a, A)\cdot N\big(y\mid b + C^\top\! A^{-1}(x-a),\ B - C^\top\! A^{-1} C\big) \quad (300) $$
$$ N[x\mid a, A]\; N(y\mid b + Fx,\, B) \;=\; N\!\left[\begin{pmatrix}x\\ y\end{pmatrix} \,\middle|\, \begin{pmatrix}a - F^\top\! B^{-1} b\\ B^{-1} b\end{pmatrix},\ \begin{pmatrix}A + F^\top\! B^{-1} F & -F^\top\! B^{-1}\\ -B^{-1} F & B^{-1}\end{pmatrix}\right] \quad (301) $$
$$ N[x\mid a, A]\; N[y\mid b + Fx,\, B] \;=\; N\!\left[\begin{pmatrix}x\\ y\end{pmatrix} \,\middle|\, \begin{pmatrix}a - F^\top\! B^{-1} b\\ b\end{pmatrix},\ \begin{pmatrix}A + F^\top\! B^{-1} F & -F^\top\\ -F & B\end{pmatrix}\right] \quad (302) $$
$$ N\!\left[\begin{pmatrix}x\\ y\end{pmatrix} \,\middle|\, \begin{pmatrix}a\\ b\end{pmatrix},\ \begin{pmatrix}A & C\\ C^\top & B\end{pmatrix}\right] \;=\; N[x \mid a - C B^{-1} b,\ A - C B^{-1} C^\top]\ \cdot\ N[y \mid b - C^\top\! x,\ B] \quad (303) $$
With $\hat A = A - C B^{-1} D$ and $\hat B = B - D A^{-1} C$:
$$ \begin{vmatrix} A & C\\ D & B \end{vmatrix} \;=\; |\hat A|\,|B| \;=\; |A|\,|\hat B| \quad (304) $$
$$ \begin{pmatrix} A & C\\ D & B \end{pmatrix}^{-1} \;=\; \begin{pmatrix} \hat A^{-1} & -A^{-1} C \hat B^{-1}\\ -\hat B^{-1} D A^{-1} & \hat B^{-1} \end{pmatrix} \quad (305) $$
$$ \phantom{\begin{pmatrix} A & C\\ D & B \end{pmatrix}^{-1}} \;=\; \begin{pmatrix} \hat A^{-1} & -\hat A^{-1} C B^{-1}\\ -B^{-1} D \hat A^{-1} & \hat B^{-1} \end{pmatrix} \quad (306) $$
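A quick numerical check of the block-inverse identity (305) (a sketch; the blocks below are arbitrary random matrices):

```python
import numpy as np

# Verify (305): the inverse of [[A, C], [D, B]] equals
# [[Ahat^-1, -A^-1 C Bhat^-1], [-Bhat^-1 D A^-1, Bhat^-1]]
# with Ahat = A - C B^-1 D and Bhat = B - D A^-1 C.
rng = np.random.default_rng(1)
n, m = 3, 2
A = rng.normal(size=(n, n)) + 5 * np.eye(n)     # keep the blocks well conditioned
B = rng.normal(size=(m, m)) + 5 * np.eye(m)
C = rng.normal(size=(n, m))
D = rng.normal(size=(m, n))

M = np.block([[A, C], [D, B]])
Ahat = A - C @ np.linalg.solve(B, D)
Bhat = B - D @ np.linalg.solve(A, C)
Minv = np.block([
    [np.linalg.inv(Ahat), -np.linalg.inv(A) @ C @ np.linalg.inv(Bhat)],
    [-np.linalg.inv(Bhat) @ D @ np.linalg.inv(A), np.linalg.inv(Bhat)],
])
print(np.allclose(Minv, np.linalg.inv(M)))       # should print True
```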
Pairwise belief. We have a message $\alpha(x) = N[x\mid s, S]$, a transition $P(y\mid x) = N(y\mid Ax + a,\, Q)$, and a message $\beta(y) = N[y\mid v, V]$. What is the belief $b(y, x) = \alpha(x)\, P(y\mid x)\, \beta(y)$?
Entropy
$$ H(N(a, A)) \;=\; \tfrac{1}{2} \log |2\pi e A| \quad (310) $$
Kullback-Leibler divergence
$$ p = N(x\mid a, A) \,,\quad q = N(x\mid b, B) \,,\quad n = \dim(x) \,,\qquad D\big(p\,\big\|\,q\big) = \sum_x p(x) \log\frac{p(x)}{q(x)} \quad (311) $$
$$ 2\, D\big(p\,\big\|\,q\big) \;=\; \log\frac{|B|}{|A|} + \operatorname{tr}(B^{-1} A) + (b-a)^\top B^{-1} (b-a) - n \quad (312) $$
$$ 4\, D_{\text{sym}}\big(p\,\big\|\,q\big) \;=\; \operatorname{tr}(B^{-1} A) + \operatorname{tr}(A^{-1} B) + (b-a)^\top (A^{-1} + B^{-1}) (b-a) - 2n \quad (313) $$
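A small sketch checking the closed form (312) against a Monte-Carlo estimate (arbitrary example parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2
a, b = np.array([0.0, 0.0]), np.array([1.0, -0.5])
A = np.array([[1.0, 0.3], [0.3, 0.8]])           # covariance of p
B = np.array([[1.5, 0.0], [0.0, 0.6]])           # covariance of q

# closed form (312): 2 D(p||q) = log|B|/|A| + tr(B^-1 A) + (b-a)^T B^-1 (b-a) - n
d = b - a
kl = 0.5 * (np.log(np.linalg.det(B) / np.linalg.det(A))
            + np.trace(np.linalg.solve(B, A))
            + d @ np.linalg.solve(B, d) - n)

# Monte-Carlo estimate of E_p[log p(x) - log q(x)]
def logN(x, m, S):
    diff = x - m
    return -0.5 * (np.log(np.linalg.det(2 * np.pi * S)) + diff @ np.linalg.solve(S, diff))

xs = rng.multivariate_normal(a, A, size=20000)
kl_mc = np.mean([logN(x, a, A) - logN(x, b, B) for x in xs])
print(kl, kl_mc)                                  # the two estimates should be close
```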
λ-divergence
$$ 2\, D_\lambda\big(p\,\big\|\,q\big) \;=\; \lambda\, D\big(p \,\big\|\, \lambda p + (1{-}\lambda) q\big) + (1{-}\lambda)\, D\big(p \,\big\|\, (1{-}\lambda) p + \lambda q\big) \quad (314) $$
Log-likelihoods
$$ \log N(x\mid a, A) \;=\; -\tfrac{1}{2}\Big[\log|2\pi A| + (x-a)^\top A^{-1} (x-a)\Big] \quad (315) $$
$$ \log N[x\mid a, A] \;=\; -\tfrac{1}{2}\Big[\log|2\pi A^{-1}| + a^\top A^{-1} a + x^\top A\, x - 2 x^\top a\Big] \quad (316) $$
$$ \sum_x N(x\mid b, B) \log N(x\mid a, A) \;=\; -D\big(N(b, B)\,\big\|\,N(a, A)\big) - H(N(b, B)) \quad (317) $$
B Further
Differential Geometry
Emphasize strong relation between a Riemannian metric (and respective geodesic)
and cost (in an optimization formulation). Pullbacks and costs. Only super brief,
connections.
Manifolds
Local tangent spaces, connection. Example: kinematics.
Lie groups
exp and log
Information Geometry
[Integrate notes on information geometry]
random variable, 82
rank, 32
relaxations, 67
root, 49