
Maths for Intelligent Systems

Marc Toussaint

April, 2022

This script is primarily based on a lecture I gave 2015-2019 in Stuttgart. The current
form also integrates notes and exercises from the Optimization lecture, and a little
material from my Robotics and ML lectures. The first full version was from 2019, since
then I occasionally update it.

Contents
1 Speaking Maths 4
1.1 Describing systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Should maths be taught with many application examples? Or abstractly? . . . 5
1.3 Notation: Some seeming trivialities . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Functions & Derivatives 7


2.1 Basics on uni-variate functions . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Continuous, differentiable, & smooth functions; 2.1.2 Polynomials, piece-
wise, basis functions, splines

2.2 Partial vs. total derivative, and chain rules . . . . . . . . . . . . . . . . . . . . 9


2.2.1 Partial Derivative; 2.2.2 Total derivative, computation graphs, forward
and backward chain rules

2.3 Gradient, Jacobian, Hessian, Taylor Expansion . . . . . . . . . . . . . . . . . . 11


2.3.1 Gradient & Jacobian; 2.3.2 Hessian; 2.3.3 Taylor expansion

2.4 Derivatives with matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


2.4.1 Derivative Rules; 2.4.2 Example: GP regression; 2.4.3 Example: Lo-
gistic regression

2.5 Check your gradients numerically! . . . . . . . . . . . . . . . . . . . . . . . . 18


2.6 Examples and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.6.1 More derivatives; 2.6.2 Multivariate Calculus; 2.6.3 Finite Difference


Gradient Checking; 2.6.4 Backprop in a Neural Net; 2.6.5 Backprop in a
Neural Net; 2.6.6 Logistic Regression Gradient & Hessian

3 Linear Algebra 21
3.1 Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Why should we care for vector spaces in intelligent systems research?;
3.1.2 What is a vector?; 3.1.3 What is a vector space?

3.2 Vectors, dual vectors, coordinates, matrices, tensors . . . . . . . . . . . . . . . 22


3.2.1 A taxonomy of linear functions; 3.2.2 Bases and coordinates; 3.2.3 The
dual vector space – and its coordinates; 3.2.4 Coordinates for every linear
thing: tensors; 3.2.5 Finally: Matrices; 3.2.6 Coordinate transformations

3.3 Scalar product and orthonormal basis . . . . . . . . . . . . . . . . . . . . . . . 28


3.3.1 Properties of orthonormal bases

3.4 The Structure of Transforms & Singular Value Decomposition . . . . . . . . . 30


3.4.1 The Singular Value Decomposition Theorem

3.5 Point of departure from the coordinate-free notation . . . . . . . . . . . . . . 33


3.6 Filling SVD with life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6.1 Understand vv>as a projection; 3.6.2 SVD for symmetric matrices;
3.6.3 SVD for general matrices

3.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7.1 Power Method; 3.7.2 Power Method including the smallest eigenvalue
; 3.7.3 Why should I care about Eigenvalues and Eigenvectors?
3.8 Beyond this script: Numerics to compute these things . . . . . . . . . . . . . . 38
3.9 Derivatives as 1-forms, steepest descent, and the covariant gradient . . . . . . 38
3.9.1 The coordinate-free view: A derivative takes a change-of-input vector
as input, and returns a change of output; 3.9.2 Contra- and co-variance;
3.9.3 Steepest descent and the covariant gradient vector

3.10 Examples and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


3.10.1 Basis; 3.10.2 From the Robotics Course; 3.10.3 Bases for Polynomials;
3.10.4 Projections; 3.10.5 SVD; 3.10.6 Bonus: Scalar product and Orthogonal-
ity; 3.10.7 Eigenvectors; 3.10.8 Covariance and PCA; 3.10.9 Bonus: RKHS

4 Optimization 47
4.1 Downhill algorithms for unconstrained optimization . . . . . . . . . . . . . . . 47
4.1.1 Why you shouldn’t trust the magnitude of the gradient; 4.1.2 Ensuring
monotone and sufficient decrease: Backtracking line search, Wolfe conditions,
& convergence; 4.1.3 The Newton direction; 4.1.4 Least Squares & Gauss-
Newton: a very important special case; 4.1.5 Quasi-Newton & BFGS: ap-
proximating the hessian from gradient observations; 4.1.6 Conjugate Gradient
; 4.1.7 Rprop*
4.2 The general optimization problem – a mathematical program . . . . . . . . . . 57
4.3 The KKT conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Unconstrained problems to tackle a constrained problem . . . . . . . . . . . . 59
4.4.1 Augmented Lagrangian*

4.5 The Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


4.5.1 How the Lagrangian relates to the KKT conditions; 4.5.2 Solving math-
ematical programs analytically, on paper.; 4.5.3 Solving the dual problem, in-
stead of the primal.; 4.5.4 Finding the “saddle point” directly with a primal-
dual Newton method.; 4.5.5 Log Barriers and the Lagrangian

4.6 Convex Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


4.6.1 Convex sets, functions, problems; 4.6.2 Linear and quadratic programs
; 4.6.3 The Simplex Algorithm; 4.6.4 Sequential Quadratic Programming
4.7 Blackbox & Global Optimization: It’s all about learning . . . . . . . . . . . . . 70
4.7.1 A sequential decision problem formulation; 4.7.2 Acquisition Functions
for Bayesian Global Optimization*; 4.7.3 Classical model-based blackbox op-
timization (non-global)*; 4.7.4 Evolutionary Algorithms*

4.8 Examples and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74


4.8.1 Convergence proof; 4.8.2 Backtracking Line Search; 4.8.3 Gauss-Newton
; 4.8.4 Robust unconstrained optimization; 4.8.5 Lagrangian Method of Mul-
tipliers; 4.8.6 Equality Constraint Penalties and Augmented Lagrangian; 4.8.7 Lagrangian
and dual function; 4.8.8 Optimize a constrained problem; 4.8.9 Network flow
problem; 4.8.10 Minimum fuel optimal control; 4.8.11 Reformulating an `1 -
norm; 4.8.12 Restarts of Local Optima; 4.8.13 GP-UCB Bayesian Optimization

5 Probabilities & Information 81


5.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1.1 Axioms, definitions, Bayes rule; 5.1.2 Standard discrete distributions
; 5.1.3 Conjugate distributions; 5.1.4 Distributions over continuous domain;
5.1.5 Gaussian; 5.1.6 “Particle distribution”

5.2 Between probabilities and optimization: neg-log-probabilities, exp-neg-energies,


exponential family, Gibbs and Boltzmann . . . . . . . . . . . . . . . . . . . . . 87
5.3 Information, Entropy & Kullback-Leibler . . . . . . . . . . . . . . . . . . . . . 90
5.4 The Laplace approximation: A 2nd-order Taylor of log p . . . . . . . . . . . . . 91
5.5 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.6 The Fisher information metric: 2nd-order Taylor of the KLD . . . . . . . . . . 92


5.7 Examples and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.7.1 Maximum Entropy and Maximum Likelihood; 5.7.2 Maximum likeli-
hood and KL-divergence; 5.7.3 Laplace Approximation; 5.7.4 Learning =
Compression; 5.7.5 A gzip experiment; 5.7.6 Maximum Entropy and ML

A Gaussian identities 96

B Further 100

Index 102

1 Speaking Maths

1.1 Describing systems

Systems can be described in many ways. Biologists describe their systems often using
text, and lots and lots of data. Architects describe buildings using drawings. Physicists
describe nature using differential equations, or optimality principles, or differential geom-
etry and group theory. The whole point of science is to find descriptions of systems—in
the natural science descriptions that allow prediction, in the engineering sciences de-
scriptions that enable the design of good systems, problem-solving systems.
And how should we describe intelligent systems? Robots, perception systems, machine
learning systems? I think there are two main categories: the imperative way in terms of
literal algorithms (code), or the declarative way in terms of formulating the problem. I
prefer the latter.
The point of this lecture is to teach you to speak maths, to use maths to describe
systems or problems. I feel that most maths courses rather teach to consume maths, or
solve mathematical problems, or prove things. Clearly, this is also important. But for the
purpose of intelligent systems research, it is essential to be skilled in expressing problems
mathematically, before even thinking about solving them and deriving algorithms.
If you happen to attend a Machine Learning or Robotics course you’ll see that every
problem is addressed the same way: You have an “intuitively formulated” problem; the
first step is to find a mathematical formulation; the second step to solve it. The second
step is often technical. The first step is really the interesting and creative part. This is
where you have to nail down the problem, i.e., nail down what it means to be successful
or performant – and thereby describe “intelligence”, or at least a tiny aspect of it.
The “Maths for Intelligent Systems” course will recap essentials of multi-variate func-
tions, linear algebra, optimization, and probabilities. These fields are essential to formu-
late problems in intelligent systems research and hopefully will equip you with the basics
of speaking maths.

1.2 Should maths be taught with many application examples? Or


abstractly?

Maybe this is the wrong question and implies a view on maths I don’t agree with. I think
(but this is arguable) maths is nothing but abstractions of real-world things. At least I
aim to teach maths as abstractions of real-world things. It is misleading to think that
there is “pure maths” and then “applications”. Instead mathematical concepts, such as
a vector, are abstractions of real-world things, such as faces, scenes, images, documents;
and theorems, methods and algorithms that apply on vectors of course also apply to all
the real-world things—subject to the limitations of this abstraction. So, the goal is not
to teach you a lookup table of which method can be used in which application, but
rather to teach which concepts maths offers to abstract real-world things—so that you
find such abstractions yourself once you’ll have to solve a real-world problem.
But yes, I believe that maths – in our context – should ideally be taught with many
exercises relating to AI problems. Perhaps the ideal would be:

ˆ Teach Maths using AI exercises (where AI problems are formulated and treated
analytically).
ˆ Teach AI using coding exercises.
ˆ Teach coding using maths-implementation exercises.

But I haven’t yet adopted this myself in my teaching.

1.3 Notation: Some seeming trivialities

Equations and mathematical expressions have a syntax. This is hardly ever made explicit1
and might seem trivial. But it is surprising how buggy mathematical statements can be
in scientific papers (and oral exams). I don’t want to write much text about this, just
some bullet points:

ˆ Always declare mathematical objects.


ˆ Be aware of variable and index scoping. For instance, if you have an equation,
and one side includes a variable i, the other side doesn’t, this often is a notational
bug. (Unless this equation actually makes a statement about independence on i.)
ˆ Type checking. Within an equation, be sure to know exactly of what type each
term is: vector? matrix? scalar? tensor? Is the type and dimension of both sides
of the equation consistent?
1 Except perhaps by Gödel’s incompleteness theorems and areas like automated theorem proving.

ˆ Decorations are ok, but really not necessary. It is much more important to declare
all things. E.g., there are all kinds of decorations used for vectors, v, v, →

v , |vi
and matrices. But these are not necessary. Properly declaring all symbols is much
more important.

ˆ When declaring sets of indexed elements, I use the notation {xi }ni=1 . Similarly for
tuples: (xi )ni=1 , (x1 , .., xn ), x1:n .

ˆ When defining sets, we write something like {f (x) : x ∈ R}, or {n ∈ N :


∃ {vi }ni=1 linearly independent, vi ∈ V }

ˆ I usually use brackets [a = b] ∈ {0, 1} for the boolean indicator function of some
expression. An alternative notation is I(a = b), or the Kronecker symbol δab .

ˆ A tuple (a, b) ∈ A × B is an element of the product space.

ˆ direct sum A ⊕ B is same as A × B in finite-dimensional spaces

ˆ If f : X → Y , then minx f (x) ∈ Y is minimal function value (output); whereas


argminx f (x) ∈ X is the input (“argument”) that minimizes the function. E.g.,
minx f (x) = f (argminx f (x)).

ˆ One should distinguish between the infimum inf x f (x) and supremum supx f (x)
from the min/max: the inf/sup refer to limits, while the min/max to values actually
acquired by the function. I must admit I am sloppy in this regard and usually only
write min/max.

ˆ Never use multiple letters for one thing. E.g. length = 3 means l times e times n
times g times t times h equals 3.

ˆ There is a difference between → and 7→:

f : R → R, x 7→ cos(x) (1)

ˆ The dot is used to help defining functions with only some arguments fixed:

f : A×B →C , f (a, ·) : B → C, b 7→ f (a, b) (2)

Another example is the typical declaration of an inner product: h·, ·i : V × V → R.

ˆ The p-norm or Lp-norm is defined as ||x||p = [ Σi xi^p ]^(1/p). By default p = 2, that
is ||x|| = ||x||2 = |x|, which is the L2-norm or Euclidean length of a vector. Also
the L1-norm ||x||1 = Σi |xi| (aka Manhattan distance) plays an important role.

ˆ I use

    diag(a1 , .., an ) = [ a1        0
                               ...
                           0        an ]

for a diagonal matrix. And overload this so that diag(A) = (A11 , .., Ann ) is the diagonal vector of the matrix A ∈ Rn×n .

ˆ A typical convention is

0n = (0, .., 0) ∈ Rn , 1n = (1, .., 1) ∈ Rn , In = diag(1n ) (3)

Also, ei = (0, .., 0, 1, 0, .., 0) ∈ Rn often denotes the ith column of the identity
matrix, which of course are the coordinates of a basis vector ei ∈ V in a basis
(ei )ni=1 .

ˆ The element-wise product of two matrices A and B is also called Hadamard


product and notated A ◦ B (which has nothing to do with the concatenation of
two operations). If there is need, perhaps also use this to notate the element-wise
product of two vectors.

2 Functions & Derivatives

2.1 Basics on uni-variate functions

We super quickly recap the basics of functions R → R, which the reader might already
know.

2.1.1 Continuous, differentiable, & smooth functions

ˆ A function f : R → R is continuous at x if the limit limh→0 f (x + h) = f (x)


exists and equals f (x) (from both sides).
f (x+h)−f (x)
ˆ A function f : R → R is differentiable at x if f 0 (x) = limh→0 h exists.
Note, differentiable ⇒ continuous.

ˆ A function is continuous/differentiable if it is continuous/differentiable at any


x ∈ R.

ˆ A function f is an element of Ck if it is k-fold continuously differentiable, i.e., if


its k-th derivative f (k) is continuous. For example, C0 is the space of continuous
functions, and C2 the space of twice continuously differentiable functions.

ˆ A function f ∈ C∞ is called smooth (infinitely often differentiable).

2.1.2 Polynomials, piece-wise, basis functions, splines

Let’s recap some basic functions that often appear in AI research.



ˆ A polynomial of degree p is of the form f (x) = Σi=0..p ai x^i , which is a weighted
sum of monomials 1, x, x², ... Note that for multi-variate functions, the number
of monomials grows combinatorially with the degree and dimension. E.g., the
monomials of degree 2 in 3D space are x1², x1x2, x1x3, x2², x2x3, x3². In general,
we have (d+p−1 choose p) monomials of degree p in d-dimensional space (6 in this example).

ˆ Polynomials are smooth.

ˆ I assume it is clear what piece-wise means (we have different polynomials in disjoint
intervals covering the input domain R).

ˆ A set of basis functions {e1 , ..., en } defines the space f = { Σi ai ei : ai ∈ R }
of functions that are linear combinations of the basis functions (a small numerical
sketch follows at the end of this subsection). More details on
bases are discussed below for vector spaces in general. Here we note typical basis
functions:
– Monomials, which form the basis of polynomials.
– Trigonometric functions sin(2πax), cos(2πax), with integers a, which form the Fourier
basis of functions.
– Radial basis functions ϕ(|x − ci |), where we have n centers ci ∈ R, i = 1, .., n,
and ϕ is typically some bell-shaped or local support function around zero. A very
common one is the squared exponential ϕ(d) = exp{−ld2 } (which you might call a
non-normalized Gaussian with variance 1/l).

ˆ The above functions are all examples of parametric functions, which means
that they can be specified by a finite number of parameters ai . E.g., when we
have a finite set of basis functions, the functions can all be described by the finite
set of weights in the linear combination.
However, in general a function f : R → R is an “infinite-dimensional object”,
i.e., it has infinitely many degrees-of-freedom f (x), i.e., values at infinitely many
points x. In fact, sometimes it is useful to think of f as a “vector” of ele-
ments fx with continuous index x. Therefore, the space of all possible functions,
and also the space of all continuous function C0 and smooth functions C∞ , is
infinite-dimensional. General functions cannot be specified by a finite number of
parameters, and they are called non-parametric.
The core example is the class of functions used for regression or classification in Machine
Learning, which are a linear combination of an infinite set of basis functions. E.g.,
an infinite set of radial basis functions ϕ(|x − c|) for all centers c ∈ R. This
infinite set of basis functions spans a function space called Hilbert space (in the
ML context, “Reproducing Kernel Hilbert Space (RKHS)”), which is an infinite-
dimensional vector space. Elements in that space are called non-parametric.

ˆ As a final note, splines are parametric functions that are often used in robotics
and engineering in general. Splines usually are piece-wise polynomials that are
continuously joined. Namely, a spline of degree p is in Cp−1 , i.e., p − 1-fold con-
tinuously differentiable. A spline is not fully smooth, as the p-th derivative is
discontinuous. E.g., a cubic spline has a piece-wise constant (“bang-bang”) jerk


(3rd derivative). Cubic splines and B-splines (which are also piece-wise polynomi-
als) are commonly used in robotics to describe motions. In computational design
and graphics, B-splines are equally common to describe surfaces (in the form of
Non-uniform rational basis spline, NURBS). Appendix ?? is a brief reference for
splines.
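To make the idea of a function space spanned by basis functions concrete, here is a minimal NumPy sketch (not from the script; the target function, the number of centers, and the width parameter l are arbitrary choices for illustration): it builds radial basis functions and fits the weights ai of a linear combination by least squares.

    import numpy as np

    l = 10.0                                            # width parameter of the RBFs
    centers = np.linspace(0, 1, 10)                     # the centers c_i
    xs = np.linspace(0, 1, 100)
    target = np.sin(2*np.pi*xs)                         # an arbitrary target function

    Phi = np.exp(-l * (xs[:, None] - centers[None, :])**2)   # basis function matrix
    a, *_ = np.linalg.lstsq(Phi, target, rcond=None)         # least-squares weights a_i
    print(np.max(np.abs(Phi @ a - target)))             # approximation error of sum_i a_i e_i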

2.2 Partial vs. total derivative, and chain rules

2.2.1 Partial Derivative

A multi-variate function f : Rn → R can be thought of as a function of n arguments,


f (x1 , .., xn ).

Definition 2.1. The partial derivative of a function of multiple arguments f (x1 , .., xn )
is the standard derivative w.r.t. only one of its arguments,

∂/∂xi f (x1 , .., xn ) = lim_{h→0} [ f (x1 , .., xi + h, .., xn ) − f (x) ] / h .     (4)

2.2.2 Total derivative, computation graphs, forward and backward chain rules

Let me start with an example: We have three real-valued quantities x, g and f which
depend on each other. Specifically,
f (x, g) = 3x + 2g and g(x) = 2x . (5)
Question: What is the “derivative of f w.r.t. x”?
The correct answer is: Which one do you mean? The partial or total?
The partial derivative defined above really thinks of f (x, g) as a function of two argu-
ments, and does not at all care about whether there might be dependencies of these
arguments. It only looks at f (x, g) alone and takes the partial derivative (=derivative
w.r.t. one function argument):

∂/∂x f (x, g) = 3     (6)

However, if you suddenly talk about h(x) = f (x, g(x)) as a function of the argument x
only, that's a totally different story, and

∂/∂x h(x) = ∂/∂x [3x + 2(2x)] = 7     (7)

Bottom line, the definition of the partial derivative really depends on what you explicitly
defined as the arguments of the function.
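A quick numerical illustration of this distinction (a minimal sketch in plain Python; the function names are only for this example): finite differences recover 3 for the partial and 7 for the total derivative.

    # Finite-difference check of the example above.
    def f(x, g): return 3*x + 2*g
    def g(x): return 2*x
    def h(x): return f(x, g(x))      # h treats x as its only argument

    x0, eps = 1.0, 1e-6
    partial = (f(x0 + eps, g(x0)) - f(x0, g(x0))) / eps   # g held fixed  -> approx 3
    total   = (h(x0 + eps) - h(x0)) / eps                 # g follows x   -> approx 7
    print(partial, total)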

To allow for a general treatment of differentiation with dependences we need to define


a very useful concept:

Definition 2.2. A function network or computation graph is a directed acyclic


graph (DAG) of n quantities xi where each quantity is a deterministic function of
a set of parents π(i) ⊂ {1, .., n}, that is

xi = fi (xπ(i) ) (8)

where xπ(i) = (xj )j∈π(i) is the tuple of parent values. This could also be called a
deterministic Bayes net.

In a function network all values can be computed deterministically if the input values
(which do have no parents) are given. Concerning differentiation, we may now ask:
Assume we have a variation dx of some input value, how do all other values vary? The
chain rules give the answer. It turns our there are two chain rules in function networks:

Identities 2.1 (Chain rule (Forward-Version)).

df/dx = Σ_{g∈π(f)} (∂f/∂g) (dg/dx)     (with dx/dx ≡ 1, in case x ∈ π(f))     (9)

Read this as follows: “The change of f with x is the sum of changes that come from
its direct dependence on g ∈ π(f ), each multiplied the change of g with x.”
df
This rule defines the total derivative of dx w.r.t. x. Note how different these two
notions of derivatives are by definition: a partial derivative only looks at a function itself
and takes a limit of differences w.r.t. one argument—no notion of further dependencies.
The total derivative asks how, in a function network, one value changes with a change
of another.
The second version of the chain rule is:

Identities 2.2 (Chain rule (Backward-Version)).

df/dx = Σ_{g: x∈π(g)} (df/dg) (∂g/∂x)     (with df/df ≡ 1, in case x ∈ π(f))     (10)

Read this as follows: “The change of f with x is the sum of changes that arise from all
changes of g which directly depend on x.”
Figure 1 illustrates the fwd and bwd versions of the chain rule. The bwd version allows
you to propagate back, given gradients df/dg from top to g, one step further down, from
top to x. The fwd version allows you to propagate forward, given gradients dg/dx from g
to bottom, one step further up, from f to bottom. Both versions are recursive equations.
Figure 1: General Chain Rule. Left: Forward-Version, Right: Backward-Version.
The gray arc denotes the direct dependence ∂f/∂x, which appears in the summations
via dx/dx ≡ 1, df/df ≡ 1. (The figure shows a computation graph with input x at the
bottom and output f at the top.)

If you would recursively plug in the definitions for a given function network, both of
them would yield the same expression of df/dx in terms of partial derivatives only.
Let’s compare to the chain rule as it is commonly found in other texts (written more
precisely):

∂f (g(x))/∂x = [ ∂f (g)/∂g ]_{g=g(x)} · ∂g(x)/∂x     (11)

Note that we here very explicitly notated that ∂f (g)/∂g considers f to be a function of
the argument g, which is evaluated at g = g(x). Written like this, the rule is fine. But
the above discussion and explicitly distinguishing between partial and total derivative is,
when things get complicated, less prone to confusion.

2.3 Gradient, Jacobian, Hessian, Taylor Expansion

2.3.1 Gradient & Jacobian

Let’s take the next step and consider functions f : Rn → Rd that map from n numbers
to a d-dimensional output. In this case, we can take the partial derivative of each output
w.r.t. each input argument, leading to a matrix of partial derivatives:

Definition 2.3. Given f : Rn → Rd , we define the derivative (also called Jacobian
matrix) as

    ∂/∂x f (x) = [ ∂f1/∂x1   ∂f1/∂x2   ...   ∂f1/∂xn
                   ∂f2/∂x1   ∂f2/∂x2   ...   ∂f2/∂xn
                      ...                       ...
                   ∂fd/∂x1   ∂fd/∂x2   ...   ∂fd/∂xn ]        (12)

When the function has only one output dimension, f : Rn → R1 , the partial derivative
can be written as a vector. Unlike many other texts, I advocate for consistency with the
Jacobian matrix (and contra-variance, see below) and define this to be a row vector:

Definition 2.4. We define the derivative of f : Rn → R as the row vector of


partial derivatives
∂/∂x f (x) = ( ∂f/∂x1 , .., ∂f/∂xn ) .     (13)

Further, we define the gradient as the corresponding column "vector"

∇f (x) = [ ∂/∂x f (x) ]> .     (14)

The “purpose” of a derivative is to output a change of function value when being


multiplied to a change of input δ. That is, in first order approximation, we have

f (x + δ) − f (x) =˙ ∂f (x) δ ,     (15)

where =˙ denotes “in first order approximation”. This equation holds, no matter if the
output space is Rd or R, or the input space and variation is δ ∈ Rn or δ ∈ R. In the
gradient notation we have

f (x + δ) − f (x) =˙ ∇f (x)>δ .     (16)

Jumping ahead to our later discussion of linear algebra: The above two equations are
written in coordinates. But note that the equations are truly independent of the choice
of vector space basis and independent of an optional metric or scalar product in V . The
transpose should not be understood as a scalar product between two vectors, but rather
as undoing the transpose in the definition of ∇f . All this is consistent to understanding
the derivatives as coordinates of a 1-form, as we will introduce it later.
Given a certain direction d (with |d| = 1) we define the directional derivative as
∇f (x)>d, and it holds

∇f (x)>d = lim_{ε→0} [ f (x + ε d) − f (x) ] / ε .     (17)
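A small NumPy sanity check of the first-order approximation (15) (a sketch; the example function and test point are arbitrary, not from the text):

    import numpy as np

    def f(x):                         # an arbitrary example f : R^3 -> R^2
        return np.array([x[0]*x[1], np.sin(x[2]) + x[0]**2])

    def J(x):                         # its Jacobian: rows = outputs, columns = inputs
        return np.array([[x[1],   x[0], 0.0],
                         [2*x[0], 0.0,  np.cos(x[2])]])

    x = np.array([0.5, -1.2, 0.3])
    delta = 1e-4 * np.random.randn(3)
    print(np.allclose(f(x + delta) - f(x), J(x) @ delta, atol=1e-6))   # True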

2.3.2 Hessian
Definition 2.5. We define the Hessian of a scalar function f : Rn → R as the
symmetric matrix

    ∇2 f (x) = ∂/∂x ∇f (x) = [ ∂²f/∂x1∂x1   ∂²f/∂x1∂x2   ...   ∂²f/∂x1∂xn
                               ∂²f/∂x2∂x1   ∂²f/∂x2∂x2   ...   ∂²f/∂x2∂xn
                                  ...                             ...
                               ∂²f/∂xn∂x1   ∂²f/∂xn∂x2   ...   ∂²f/∂xn∂xn ]     (18)

The Hessian can be thought of as the Jacobian of ∇f . Using the Hessian, we can express
the 2nd order approximation of f as:

f (x + δ) =¨ f (x) + ∂f (x) δ + 1/2 δ> ∇2 f (x) δ .     (19)

For a uni-variate function f : R → R, the Hessian is just a single number, namely the
second derivative f''(x). In this section, let's call this the "curvature" of the function
(not to be confused with the Riemannian curvature of a manifold). In the uni-variate
case, we have the obvious cases:

ˆ If f''(x) > 0, the function is locally "curved upward" and convex (see also the
formal Definition ??).

ˆ If f''(x) < 0, the function is locally "curved downward" and concave.

In the multi-variate case f : Rn → R, the Hessian matrix H is symmetric and we can
decompose it as H = Σi λi hi hi> with eigenvalues λi and eigenvectors hi (which we will
learn about in detail later). Importantly, all hi will be orthogonal to each other, forming
a nice orthonormal basis.
This insight gives us a very strong intuition on how the Hessian H describes the local
curvature of the function f : λi gives the directional curvature, i.e., the curvature in
the direction of eigenvector hi . If λi > 0, f is curved upward along hi ; if λi < 0, f is
curved downward along hi . Therefore, the eigenvalues λi tell us whether the function is
locally curved upward, downward, or flat in each of the orthogonal directions hi .

This becomes particularly intuitive if ∂/∂x f = 0, i.e., the derivative (slope) of the
function is zero in all directions. When the curvatures λi are positive in all directions,
the function is locally convex (upward parabolic) and x is a local minimum; if the
curvatures λi are all negative, the function is concave (downward parabolic) and x is a
local maximum; if some curvatures are positive and some are negative along different
directions hi , then the function curves down in some directions, and up in others, and
x is a saddle point.
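A minimal sketch of this classification (the example function is arbitrary; at a critical point one inspects the signs of the Hessian eigenvalues):

    import numpy as np

    # f(x, y) = x^2 - y^2 has a critical point at the origin; its Hessian is constant.
    H = np.array([[2.0,  0.0],
                  [0.0, -2.0]])
    lam = np.linalg.eigvalsh(H)       # eigenvalues of the symmetric Hessian
    if np.all(lam > 0):   print("local minimum")
    elif np.all(lam < 0): print("local maximum")
    else:                 print("saddle point")   # mixed signs here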
Again jumping ahead, in the coordinate-free notation, the second derivative would be
defined as the 2-form

d²f|x : V × V → G ,     (20)
(v, w) 7→ lim_{h→0} [ df|_{x+hw}(v) − df|_x(v) ] / h     (21)
        = lim_{h,l→0} [ f (x + hw + lv) − f (x + hw) − f (x + lv) + f (x) ] / (hl) .     (22)

The Hessian matrix contains the coordinates of this 2-form (which would actually be a row
vector of row vectors).

2.3.3 Taylor expansion

In 1D, we have
1 1
f (x + v) ≈ f (x) + f 0 (x)v + f 00 (x)v 2 + · · · + f (k) (x)v k (23)
2 k!

For f : Rn → R, we have
1
f (x + v) ≈ f (x) + ∇f (x)>v + v>∇2 f (x)v + · · · (24)
2
which is equivalent to
X ∂ 1 X ∂2
f (x + v) ≈ f (x) + f (x)vj + f (x)vj vk + · · · (25)
j
∂xj 2 ∂xj ∂xk
jk

2.4 Derivatives with matrices

The next section will introduce linear algebra from scratch – here we first want to learn
how to practically deal with derivatives in matrix expressions. We think of matrices and
vectors simply as arrays of numbers ∈ Rn×m and Rn . As a warmup, try to solve the
following exercises:

(i) Let X, A be arbitrary matrices, A invertible. Solve for X:


XA + A> = I (26)

(ii) Let X, A, B be arbitrary matrices, (C − 2A>) invertible. Solve for X:


X>C = [2A(X + B)]> (27)

(iii) Let x ∈ Rn , y ∈ Rd , A ∈ Rd×n . A obviously not invertible, but let A>A be


invertible. Solve for x:
(Ax − y)>A = 0>n (28)

(iv) As above, additionally B ∈ Rn×n , B positive-definite. Solve for x:

(Ax − y)>A + x>B = 0>n (29)

(v) A core problem in Machine Learning: For β ∈ Rd , y ∈ Rn , X ∈ Rn×d , compute

argmin_β ||y − Xβ||² + λ||β||² .     (30)

(vi) A core problem in Robotics: For q, q0 ∈ Rn , φ : Rn → Rd , y ∗ ∈ Rd non-linear


but smooth, compute

argmin_q ||φ(q) − y∗||²_C + ||q − q0||²_W .     (31)

Use a local linearization of φ to solve this.

For problem (v), we want to find a minimum for a matrix expression. We find this by
setting the derivative equal to zero. Here is the solution, and below details will become
clear:

0= ||y − Xβ||2 + λ||β||2 (32)
∂β
= 2(y − Xβ)>(−X) + 2λβ> (33)
>
0 = −X (y − Xβ) + λβ (34)
> >
0 = −X y + (X X + λI)β (35)
> -1 >
β = −(X X + λI) X y (36)

Line 2 uses a standard rule for the derivative (see below) and gives a row vector equation.
Line 3 transposes this to become a column vector equation.
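A minimal numerical check of this solution (a sketch; the data here is random and only for illustration): the closed form of Eq. (36) should make the gradient of Eq. (34)/(35) vanish.

    import numpy as np

    np.random.seed(0)
    n, d, lam = 50, 5, 0.1
    X, y = np.random.randn(n, d), np.random.randn(n)

    # closed-form minimizer from Eq. (36)
    beta = np.linalg.solve(X.T @ X + lam*np.eye(d), X.T @ y)

    # the gradient from Eq. (34)/(35) should vanish at the minimizer
    grad = -X.T @ (y - X @ beta) + lam*beta
    print(np.max(np.abs(grad)))       # numerically zero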

2.4.1 Derivative Rules

As 2nd order terms are very common in AI methods, this is a very useful identity to
learn:

Identities 2.3.
∂/∂x [ f (x)>A g(x) ] = f (x)>A ∂/∂x g(x) + g(x)>A> ∂/∂x f (x)     (37)

Note that using the ’gradient column’ convention this reads

∇x [ f (x)>A g(x) ] = [ ∂/∂x g(x) ]>A>f (x) + [ ∂/∂x f (x) ]>A g(x)     (38)

which I find impossible to remember, and mixes gradients-in-columns (∇) with gradients-
in-rows (the Jacobian) notation.
Special cases and variants of this identity are:

[whatever]x = [whatever] , if whatever is indep. of x (39)
∂x
∂ >
a x = a> (40)
∂x

Ax = A (41)
∂x

(Ax − b)>(Cx − d) = (Ax − b)>C + (Cx − d)>A (42)
∂x
∂ >
x Ax = x>A + x>A> (43)
∂x
∂ ∂ > 1 1 1 1 >
||x|| = (x x) 2 = (x>x)− 2 2x> = x (44)
∂x ∂x 2 ||x||
∂2
(Ax + a)>C(Bx + b) = A>CB + B>C>A (45)
∂x2

Further useful identities are:

Identities 2.4 (Derivative Rules).

∂/∂θ |A| = |A| tr(A^-1 ∂/∂θ A)     (46)
∂/∂θ A^-1 = −A^-1 (∂/∂θ A) A^-1     (47)
∂/∂θ tr(A) = Σi ∂/∂θ Aii     (48)

We can also directly take a derivative of a scalar value w.r.t. a matrix:

∂/∂X a>Xb = ab>     (49)
∂/∂X (a>X>CXb) = C>Xab> + CXba>     (50)
∂/∂X tr(X) = I     (51)
But if this leads to confusion I would recommend to never take a derivative w.r.t. a
matrix. Instead, perhaps take the derivative w.r.t. a matrix element: ∂/∂Xij a>Xb = ai bj .
For completeness, here are the most important matrix identities (the appendix lists
more):

Identities 2.5 (Matrix Identities).

(A^-1 + B^-1)^-1 = A (A + B)^-1 B = B (A + B)^-1 A     (52)
(A^-1 − B^-1)^-1 = A (B − A)^-1 B     (53)
(A + U BV )^-1 = A^-1 − A^-1 U (B^-1 + V A^-1 U )^-1 V A^-1     (54)
(A^-1 + B^-1)^-1 = A − A(B + A)^-1 A     (55)
(A + J>BJ)^-1 J>B = A^-1 J>(B^-1 + JA^-1 J>)^-1     (56)
(A + J>BJ)^-1 A = I − (A + J>BJ)^-1 J>BJ     (57)

(54) is the Woodbury identity; (56) and (57) hold for positive definite A and B. See also the matrix cookbook.
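Such identities are easy to verify numerically on random matrices; a minimal sketch for the Woodbury identity (54), with arbitrarily chosen sizes and a diagonal boost only to keep the matrices comfortably invertible:

    import numpy as np

    np.random.seed(1)
    n, k = 6, 3
    A = np.random.randn(n, n) + n*np.eye(n)     # generic invertible matrices
    B = np.random.randn(k, k) + k*np.eye(k)
    U, V = np.random.randn(n, k), np.random.randn(k, n)

    Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)
    lhs = np.linalg.inv(A + U @ B @ V)
    rhs = Ai - Ai @ U @ np.linalg.inv(Bi + V @ Ai @ U) @ V @ Ai
    print(np.allclose(lhs, rhs))                # True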

2.4.2 Example: GP regression

An example from GP regression: The log-likelihood gradient w.r.t. a kernel hyperparam-


eter:
log P (y|X, b) = − 1/2 y>K^-1 y − 1/2 log |K| − n/2 log 2π     (58)
where Kij = e^(−b(xi −xj)²) + σ² δij     (59)
∂/∂b y>K^-1 y = y>(−K^-1 (∂/∂b K) K^-1) y = y>(−K^-1 A K^-1) y ,
with Aij = −(xi − xj)² e^(−b(xi −xj)²)     (60)
∂/∂b log |K| = 1/|K| ∂/∂b |K| = 1/|K| |K| tr(K^-1 ∂/∂b K) = tr(K^-1 A)     (61)
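A finite-difference check of the log-determinant derivative (61) (a sketch; the inputs, σ, and b are arbitrary illustrative choices):

    import numpy as np

    np.random.seed(2)
    x = np.random.randn(10)                     # 1D inputs
    sigma, b = 0.1, 0.7
    D2 = (x[:, None] - x[None, :])**2

    K_of = lambda b: np.exp(-b*D2) + sigma**2 * np.eye(len(x))
    K, A = K_of(b), -D2 * np.exp(-b*D2)         # A = dK/db, as in Eq. (60)

    analytic = np.trace(np.linalg.solve(K, A))  # tr(K^-1 A), Eq. (61)
    eps = 1e-6
    numeric = (np.linalg.slogdet(K_of(b+eps))[1] - np.linalg.slogdet(K_of(b-eps))[1]) / (2*eps)
    print(analytic, numeric)                    # should agree closely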

2.4.3 Example: Logistic regression

An example from logistic regression: We have the loss gradient and want the Hessian:

∇β L = X>(p − y) + 2λIβ     (62)
where pi = σ(x>i β) ,   σ(z) = e^z / (1 + e^z) ,   σ'(z) = σ(z) (1 − σ(z))     (63)
∇²β L = ∂/∂β ∇β L = X> ∂/∂β p + 2λI     (64)
∂/∂β pi = pi (1 − pi) x>i     (65)
∂/∂β p = diag([pi (1 − pi)]ni=1) X = diag(p ◦ (1 − p)) X     (66)
∇²β L = X>diag(p ◦ (1 − p)) X + 2λI     (67)

(Where ◦ is the element-wise product.)
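A sketch that checks these expressions against finite differences (random data, purely illustrative; the loss L is the one from Exercise 2.6.6):

    import numpy as np

    np.random.seed(3)
    n, d, lam = 40, 3, 0.5
    X = np.random.randn(n, d)
    y = (np.random.rand(n) < 0.5).astype(float)
    sig = lambda z: 1.0/(1.0 + np.exp(-z))

    def L(b):                                      # neg-log-likelihood + ridge term
        p = sig(X @ b)
        return -np.sum(y*np.log(p) + (1-y)*np.log(1-p)) + lam*b @ b

    def grad(b):                                   # Eq. (62)
        return X.T @ (sig(X @ b) - y) + 2*lam*b

    def hess(b):                                   # Eq. (67)
        p = sig(X @ b)
        return X.T @ (X * (p*(1-p))[:, None]) + 2*lam*np.eye(d)

    beta, eps, E = np.random.randn(d), 1e-5, np.eye(d)
    g_fd = np.array([(L(beta+eps*e) - L(beta-eps*e))/(2*eps) for e in E])
    H_fd = np.array([(grad(beta+eps*e) - grad(beta-eps*e))/(2*eps) for e in E]).T
    print(np.allclose(grad(beta), g_fd, atol=1e-4), np.allclose(hess(beta), H_fd, atol=1e-4))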



2.5 Check your gradients numerically!

This is your typical work procedure when implementing a Machine Learning, AI'ish, or
Optimization kind of method:

ˆ You first mathematically (on paper/LaTeX) formalize the problem domain, includ-
ing the objective function.

ˆ You derive analytically (on paper) the gradients/Hessian of your objective function.

ˆ You implement the objective function and these analytic gradient equations in
Matlab/Python/C++, using linear algebra packages.

ˆ You test the implemented gradient equations by comparing them to a finite


difference estimate of the gradients!

ˆ Only if that works, you put everything together, interfacing the objective & gra-
dient equations with some optimization algorithm

Algorithm 1 Finite Difference Jacobian Check


Input: x ∈ Rn , function f : Rn → Rd , function df : Rn → Rd×n
1: initialize Ĵ ∈ Rd×n , and ε = 10−6
2: for i = 1 : n do
3:    Ĵ·i = [f (x + ε ei ) − f (x − ε ei )]/(2ε)   // assigns the ith column of Ĵ
4: end for
5: if ||Ĵ − df (x)||∞ < 10−4 return true; else false

Here ei is the ith standard basis vector in Rn .
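For reference, a direct NumPy transcription of Algorithm 1 (a sketch; the function names and the usage example are my own, not from the script):

    import numpy as np

    def check_jacobian(f, df, x, eps=1e-6, tol=1e-4):
        """Compare the analytic Jacobian df(x) against central finite differences."""
        J = df(x)
        d, n = J.shape
        J_hat = np.zeros((d, n))
        for i in range(n):
            e = np.zeros(n); e[i] = 1.0
            J_hat[:, i] = (f(x + eps*e) - f(x - eps*e)) / (2*eps)
        return np.max(np.abs(J_hat - J)) < tol

    # usage on a linear map f(x) = A x, whose Jacobian is A
    A = np.random.randn(4, 3)
    print(check_jacobian(lambda x: A @ x, lambda x: A, np.random.randn(3)))   # True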

2.6 Examples and Exercises

2.6.1 More derivatives

a) In 3D, note that a × b = skew(a)b = −skew(b)a, where skew(v) is the skew matrix
of v. What is the gradient of (a × b)2 w.r.t. a and b?

2.6.2 Multivariate Calculus

Given tensors y ∈ Ra×...×z and x ∈ Rα×...×ω where y is a function of x, the Jacobian


tensor J = ∂x y is in Ra×...×z×α×...×ω and has coefficients


Ji,j,k,...,l,m,n... = yi,j,k,...
∂xl,m,n,...
Maths for Intelligent Systems, Marc Toussaint 19

(All “output” indices come before all “input” indices.)


Compute the following Jacobian tensors

(i) ∂/∂x x, where x is a vector
(ii) ∂/∂x x>Ax, where A is a matrix
(iii) ∂/∂A y>Ax, where x and y are vectors (note, we take the derivative w.r.t. A)
(iv) ∂/∂A Ax
(v) ∂/∂x f (x)>Ag(x), where f and g are vector-valued functions

2.6.3 Finite Difference Gradient Checking

The following exercises will require you to code basic functions and derivatives. You can
code in your preferred language (Matlab, NumPy, Julia, whatever).

(i) Implement the following pseudo code for empirical gradient checking in the pro-
gramming language of your choice:

Input: x ∈ Rn , function f : Rn → Rd , function df : Rn → Rd×n


1: initialize Ĵ ∈ Rd×n , and ε = 10−6
2: for i = 1 : n do
3:    Ĵ·i = [f (x + ε ei ) − f (x − ε ei )]/(2ε)   // assigns the ith column of Ĵ
4: end for
5: if ||Ĵ − df (x)||∞ < 10−4 return true; else false

Here ei is the ith standard basis vector in Rn .


(ii) Test this for

ˆ f : x 7→ Ax, df : x 7→ A, where you sample x ∼randn(n,1) (N(0, 1) in


Matlab) and A ∼randn(m,n)
ˆ f : x 7→ x>x, df : x 7→ 2x>, where x ∼randn(n,1)

2.6.4 Backprop in a Neural Net

Consider the function


f : Rh0 → Rh3 , f (x0 ) = W2 σ(W1 σ(W0 x0 ))
where Wl ∈ Rhl+1 ×hl and σ : R → R is a differentiable activation function which is
applied element-wise. We also describe the function as the computation graph:
x0 7→ z1 = W0 x0 7→ x1 = σ(z1 ) 7→ z2 = W1 x1 7→ x2 = σ(z2 ) 7→ f = W2 x2
Derive pseudo code to efficiently compute df/dx0 . (Ideally also for deeper networks.)

2.6.5 Backprop in a Neural Net

We consider again the function

f : Rh0 → Rh3 , f (x0 ) = W2 σ(W1 σ(W0 x0 )) ,

where Wl ∈ Rhl+1 ×hl and σ(z) = 1/(e−z + 1) is the sigmoid function which is applied
element-wise. We established last time that
df/dx0 = (∂f/∂x2) (∂x2/∂z2) (∂z2/∂x1) (∂x1/∂z1) (∂z1/∂x0)

with:

∂xl/∂zl = diag(xl ◦ (1 − xl)) ,   ∂zl+1/∂xl = Wl ,   ∂f/∂x2 = W2

Note: In the following we still let f be a h3 -dimensional vector. For those that are
confused with the resulting tensors, simplify to f being a single scalar output.

(i) Derive also the necessary equations to get the derivative w.r.t. the weight matrices
Wl , that is the Jacobian tensor
df/dWl

(ii) Write code to implement f (x) and df/dx0 and df/dWl .

To test this, choose layer sizes (h0 , h1 , h2 , h3 ) = (2, 10, 10, 2), i.e., 2 input and 2
output dimensions, and hidden layers of dimension 10.
For testing, choose random inputs sampled from x ∼randn(2,1)
And choose random weight matrices Wl ∼ (1/√hl+1) rand(h[l+1],h[l]).

Check the implemented Jacobian by comparing to the finite difference approxima-


tion.
Debugging Tip: If your first try does not work right away, the typical approach to
debug is to “comment out” parts of your function f and df . For instance, start
with testing f (x) = W0 x0 ; then test f (x) = σ(W0 x0 ); then f (x) = W1 σ(W0 x0 );
then I’m sure all bugs are found.

(iii) Bonus: Try to train the network to become the identity mapping. In the sim-
plest case, use “stochastic gradient descent”, meaning that you sample an input,
compute the gradients wl = d(f (x) − x)²/dWl , and make tiny updates Wl ← Wl − αwl .

2.6.6 Logistic Regression Gradient & Hessian

Consider the function


L : Rd → R :   L(β) = − Σi=1..n [ yi log σ(x>i β) + (1 − yi ) log[1 − σ(x>i β)] ] + λβ>β ,

where xi ∈ Rd is the ith row of a matrix X ∈ Rn×d , and y ∈ {0, 1}n is a vector of 0s and
1s only. Here, σ(z) = 1/(e−z + 1) is the sigmoid function, with σ'(z) = σ(z)(1 − σ(z)).

Derive the gradient ∂/∂β L(β), as well as the Hessian

∇2 L(β) = ∂²/∂β² L(β) .

3 Linear Algebra

3.1 Vector Spaces

3.1.1 Why should we care for vector spaces in intelligent systems research?

We want to describe intelligent systems. For this we describe systems, or aspects of


systems, as elements of a space:
– The input space X, output space Y in ML
– The space of functions (or classifiers) f : X → Y in ML
– The space of world states S and actions A in Reinforcement Learning
– The space of policies π : S → A in RL
– The space of feedback controllers π : x 7→ u in robot control
– The configuration space Q of a robot
– The space of paths x : [0, 1] → Q in robotics
– The space of image segmentations s : I → {0, 1} in computer vision

Actually, some of these spaces are not vector spaces at all. E.g. the configuration space
of a robot might have ‘holes’, be a manifold with complex topology, or not even that
(switch dimensionality at some places). But to do computations in these spaces one
always either introduces (local) parameterizations that make them a vector space,2 or
one focusses on local tangent spaces (local linearizations) of these spaces, which are
vector spaces.
Perhaps the most important computation we want to do in these spaces is taking
derivatives—to set them equal to zero, or do gradient descent, or Newton steps for
optimization. But taking derivatives essentially requires the input space to (locally) be
a vector space.3 So, we also need vector spaces because we need derivatives, and Linear
Algebra to deal with the resulting equations.

2 E.g. by definition an n-dimensional manifold X is locally isomorphic to Rn .

3.1.2 What is a vector?

A vector is nothing but an element of a vector space.


It is in general not a column, array, or tuple of numbers. (But tuples of numbers are a
special case of a vector space.)

3.1.3 What is a vector space?

Definition 3.1 (vector space). A vector spacea V is a space (=set) on which two
operations, addition and multiplication, are defined as follows

ˆ addition + : V × V → V is an abelian group, i.e.,


– a, b ∈ V ⇒ a + b ∈ V (closed under +)
– a + (b + c) = (a + b) + c (association)
– a+b=b+a (commutation)
– ∃ unique 0 ∈ V s.t. ∀v ∈ V : 0 + v = v (identity)
– ∀v ∈ V : ∃ unique − v s.t. v + (−v) = 0 (inverse)

ˆ multiplication · : R × V → V fulfils, for α, β ∈ R,


– α(βv) = (αβ)v (association)
– 1v = v (identity)
– α(v + w) = αv + αw (distribution)
a We only consider vector spaces over R.

Roughly, this definition says that a vector space is “closed under linear operations”,
meaning that we can add and scale vectors and they remain vectors.

3.2 Vectors, dual vectors, coordinates, matrices, tensors

In this section we explain what might be obvious: that once we have a basis, we can
write vectors as (column) coordinate vectors, 1-forms as (row) coordinate vectors, and
linear transformations as matrices. Only the last subsection becomes more practical,
refers to concrete exercises, and explains how in practice not to get confused about basis
transforms and coordinate representations in different bases. So a practically oriented
reader might want to skip to the last subsection.

3 Also when the space is actually a manifold; the differential is defined as a 1-form on the local tangent.

3.2.1 A taxonomy of linear functions

For simplicity we consider only functions involving a single vector space V . But all that
is said transfers to the case when multiple vector spaces V, W, ... were involved.

Definition 3.2. f : V → X linear ⇔ f (αv + βw) = αf (v) + βf (w), where X


is any other vector space (e.g. X = R, or X = V × V ).

Definition 3.3. f : V × V × · · · × V → X multi-linear ⇔ f is linear in each


input.

Many names are used for special linear functions—let’s make some explicit:
– f : V → R, called linear functional4 , or 1-form, or dual vector.
– f : V → V , called linear function, or linear transform, or vector-valued 1-form
– f : V × V → R, called bilinear functional, or 2-form
– f : V × V × V → R, called 3-form (or unspecifically ’multi-linear functional’)
– f : V × V → V , called vector-valued 2-form (or unspecifically ’multi-linear map’)
– f : V × V × V → V × V , called bivector-valued 3-form
– f : V k → V m , called m-vector-valued k-form

This gives us a simple taxonomy of linear functions based on how many vectors a function
eats, and how many it outputs. To give examples, consider some space X of systems
(examples above), which might itself not be a vector space. But locally, around a specific
x ∈ X, its tangent V is a vector space. Then
– f : X → R could be a cost function over the system space.
– The differential df |x : V → R is a 1-form, telling us how f changes when ‘making a tangent
step’ v ∈ V .
– The 2nd derivative d2 f |x : V × V → R is a 2-form, telling us how df |x (v) changes when
‘making a tangent step’ w ∈ V .
– The inner product h·, ·i : V × V → R is a 2-form.

Another example:
– f : Ri → Ro is a neural network that maps i input signals to o output signals.
– Its derivative df |x : Ri → Ro is a vector-valued 1-form, telling us how each output changes
with a step v ∈ Ri in the input.
– Its 2nd derivative d2 f |x : Ri × Ri → Ro is a vector-valued 2-form.
4 The word ’functional’ instead of ’function’ is especially used when V is a space of functions.

This is simply to show that vector-valued functions, 1-forms, and 2-forms are common.
Instead of being a neural network, f could also be a mapping from one parameterization
of a system to another, or the mapping from the joint angles of a robot to its hand
position.

3.2.2 Bases and coordinates

We need to define some notions. I’m not commenting on these definitions—train yourself
in reading maths...

Definition 3.4. span({vi }ki=1 ) = { Σi αi vi : αi ∈ R }

Definition 3.5. {vi }ni=1 linearly independent ⇔ [ Σi αi vi = 0 ⇒ ∀i αi = 0 ]

Definition 3.6. dim(V ) = maxn {n ∈ N : ∃ {vi }ni=1 lin.indep., vi ∈ V }

Definition 3.7. B = (ei )ni=1 is a basis of V ⇔ span(B) = V and B lin.indep.

Definition 3.8. The tuple (v1 , v2 , .., vn ) ∈ Rn is called coordinates of v ∈ V in
the basis (ei )ni=1 iff v = Σi vi ei

Note that Rn is also a vector space, and therefore coordinates v1:n ∈ Rn are also vectors,
but in Rn , not V . So coordinates are vectors, but vectors in general not coordinates.
Given a basis (ei )ni=1 , we can describe every vector v as a linear combination v = Σi vi ei
of basic elements—the basis vectors ei . This general idea, that “linear things” can be
described as linear combinations of “basic elements” carries over also to functions. In
fact, to all the types of functions we described above: 1-forms, 2-forms, bi-vector-valued
k-forms, whatever. And if we describe all these as linear combinations of basic elements
we automatically also introduce coordinates for these things. To get there, we first have
to introduce a second type of “basic elements”: 1-forms.

3.2.3 The dual vector space – and its coordinates

Definition 3.9. Given V , its dual space is V ∗ = {f : V → R linear} (the space of


1-forms). Every v ∗ ∈ V ∗ is called 1-form or dual vector (sometimes also covector ).

First, it is easy to see that V ∗ is also a vector space: We can add two linear functionals,
f = f1 + f2 , and scale them, and it remains a linear functional.
Second, given a basis (ei )ni=1 of V , we define a corresponding dual basis (bei )ni=1 of V ∗
simply by

∀i,j : ebi (ej ) = δij     (68)

where δij = [i = j] is the Kronecker delta. Note that

∀v ∈ V : ebi (v) = vi     (69)

That is, ebi is the 1-form that simply maps a vector to its ith coordinate. It can be shown
that (bei )ni=1 is in fact a basis of V ∗ . (Omitted.) That tells us a lot!
dim(V ∗ ) = dim(V ). That is, the space of 1-forms has the same dimension as V . At
this place, geometric intuition should kick in: indeed, every linear function over V could
be envisioned as a “plane” over V . Such a plane can be illustrated by its iso-lines and
these can be uniquely determined by their orientation and distance (same dimensionality
as V itself). Also, (assuming we’d know already what a transpose or scalar product is)
every 1-form must be of the form f (v) = c>v for some c ∈ V —so every f is uniquely
described by a c ∈ V . Showing that the vector space V and its dual V ∗ are really twins.
The dual basis (bei )ni=1 introduces coordinates in the dual space: Every 1-form f can be
described as a linear combination of basis 1-forms,
f = Σi fi ebi     (70)

where the tuple (f1 , f2 , .., fn ) are the coordinates of f . And

span({bei }ni=1 ) = V ∗ .     (71)

3.2.4 Coordinates for every linear thing: tensors

We now have the basic elements: the basis vectors (ei )ni=1 of V , and basis 1-forms
(bei )ni=1 of V ∗ . From these, we can describe, for instance, any bivector-valued 3-form as
a linear combination as follows:

f : V × V × V → V × V     (72)
f = Σijklm fijklm ei ⊗ ej ⊗ ebk ⊗ ebl ⊗ ebm     (73)

The ⊗ is called outer product (or tensor product), and v ⊗ w ∈ V × W if V and W


are finite vector spaces. For our purposes, we may think of v ⊗ w = (v, w) simply as
the tuple of both. Therefore ei ⊗ ej ⊗ ebk ⊗ ebl ⊗ ebm is a 5-tuple and we have in total
n5 such basis objects—and fijklm denotes the corresponding n5 coordinates. The first
two indices are contra-variant, the last three covariant—these notions are explained in
detail later.

3.2.5 Finally: Matrices

As a special case of the above, every f : V → U can be described as a linear combination

f = Σij fij ei ⊗ ebj ,     (74)

where (bej )nj=1 is a basis of V ∗ and (ei ) a basis of U .
Let's see how this fits with some easier view, without all this fuss about 1-forms. We
already understood that the operator ebj (v) = vj simply picks the jth coordinate of a
vector. Therefore

f (v) = [ Σij fij ei ⊗ ebj ] (v) = Σij fij ei vj .     (75)

In case it helps, we can 'derive' this more slowly as

f (v) = f ( Σk vk ek ) = Σk vk f (ek ) = Σk vk [ Σij fij ei ⊗ ebj ] ek     (76)
      = Σijk fij vk ei ebj (ek ) = Σijk fij vk ei δjk = Σi [ Σj fij vj ] ei .     (77)

As a result, this tells us that the vector u = f (v) ∈ U has the coordinates ui = Σj fij vj .
And the vector f (ej ) ∈ U has the coordinates fij , that is, f (ej ) = Σi fij ei .
So there are n2 coordinates fij for a linear function f . The first index is contra-variant,
the second covariant (explained later). As it so happens, the whole world has agreed
on a convention on how to write such coordinate numbers on sheets of 2-dimensional
paper: as a matrix!
    [ f11   f12   ···   f1n
      f21   f22   ···   f2n
       ..                 ..
      fn1   fn2   ···   fnn ]        (78)
The first (contra-variant) index spans columns; the second (covariant) spans rows. We
call this and the respective definition of a matrix multiplication as the matrix conven-
tion.
Note that the identity map I : V → V can be written as
I = Σi ei ⊗ ebi ,   Iij = δij .     (79)

Equally, the (contra-variant) coordinates of a vector are written as columns

    [ v1
      v2
      ..
      vn ]        (80)

and the (covariant) coordinates of a 1-form h : V → R as a row

(h1 h2 ··· hn ) (81)

u = f (v) is itself a vector, and its coordinates written as a column are

    [ u1          [ f11   f12   ···   f1n     [ v1
      u2      =     f21   f22   ···   f2n       v2
      ..             ..                ..       ..
      un ]          fn1   fn2   ···   fnn ]     vn ]        (82)

where this matrix multiplication is defined by ui = Σj fij vj , consistent to the above.

columns                          rows
vector                           co-vector / 1-form / derivative
output space                     input space
contra-variant coordinates       co-variant coordinates

3.2.6 Coordinate transformations

The above was rather abstract. The exercises demonstrate representing vectors and
transformations with coordinates and matrices in different input and output bases. We
just summarize here:

ˆ We have two bases A = (a1 , .., an ) and B = (b1 , .., bn ), and the transformation
T that maps each ai to bi , i.e., B = T A.
ˆ Given a vector x we denote its coordinates in A by [x]A or briefly as xA . And
we denote its coordinates in B as [x]B or xB . E.g., xA_i is the ith coordinate in
basis A.
ˆ [bi ]A are the coordinates of the new basis vectors in the old basis. The coordinate
transformation matrix B is given with elements Bij = [bj ]A i . Note that

[x]A = B[x]B , (83)

i.e., while the basis transform T carries old basis ai vectors to new basis vectors bi ,
the matrix B carries coordinates [x]B in the new basis to coordinates [x]A in the
old basis! This is the origin of understanding that coordinates are contra-variant.
ˆ Given a linear transform f in the vector space, we can represent it as a matrix
in four ways, using basis A or B in the input and output spaces, respectively. If

[f ]AA = F is the matrix in old coordinates (using A for input and output), then
[f ]BB = B -1 F B is its matrix in new coordinates, [f ]AB = F B is its matrix using
B for the input and A for the output space, and [f ]BA = B -1 F is the matrix using
A for input and B for output space.
ˆ T itself is also a linear transform. [T ]AA = B is its matrix in old coordinates.
And the same [T ]BB = B is also its matrix in new coordinates! [T ]BA = I is its
matrix when using A for input and B for output space. And [T ]AB = B 2 is its
matrix using B for input and A for output space.
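The following sketch illustrates these coordinate transformation rules numerically in R² (a minimal example; the bases and the transform are arbitrary choices):

    import numpy as np

    # columns of B are the coordinates [b_i]_A of the new basis vectors in the old basis
    B = np.array([[1.0, 1.0],
                  [0.0, 2.0]])

    x_B = np.array([2.0, -1.0])       # coordinates of some x in the new basis
    x_A = B @ x_B                     # Eq. (83): [x]_A = B [x]_B

    F_AA = np.array([[0.0, 1.0],      # some linear transform f, as matrix [f]_AA
                     [1.0, 0.0]])
    F_BB = np.linalg.inv(B) @ F_AA @ B    # [f]_BB = B^-1 F B

    # applying f in either coordinate system describes the same vector f(x):
    print(np.allclose(F_AA @ x_A, B @ (F_BB @ x_B)))   # True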

3.3 Scalar product and orthonormal basis

Please note that so far we have not in any way referred to a scalar product or a transpose.
All the concepts above, dual vector space, bases, coordinates, matrix-vector multiplica-
tion, are fully independent of the notion of a scalar product or transpose. Columns
and rows naturally appear as coordinates of vectors and 1-forms. But now we need to
introduce scalar products.

Definition 3.10. A scalar product (also called inner product) of V is a symmetric


positive definite 2-form

h·, ·i : V × V → R . (84)

with hv, wi = hw, vi and hv, vi > 0 for all v 6= 0 ∈ V .

Definition 3.11. Given a scalar product, we define for every v ∈ V its dual v ∗ ∈ V ∗
as
v ∗ = hv, ·i = Σi vi hei , ·i = Σi vi e∗i .     (85)

Note that ebi and e∗i are in general different 1-forms! The canonical dual basis (bei )ni=1
is independent of an introduction of a scalar product, they were the basis to introduce
coordinates for linear functions, including matrices. And while such coordinates do
depend on a choice of basis (ei )ni=1 , they do not depend on a choice of scalar product.
The 1-forms (e∗i )ni=1 also form a basis for V ∗ , but a different one to the canonical
basis, and one that depends on the notion of a scalar product. You can see this: the
coordinates vi of v ∗ in the basis (e∗i )ni=1 are identical to the coordinates vi of v in the
basis (ei )ni=1 , but different to the coordinates (v ∗ )i of v ∗ in the basis (bei )ni=1 .

Definition 3.12. Given a scalar product, a set of vectors {vi }ni=1 is called orthonor-
mal iff hvi , vj i = δij .

Definition 3.13. Given a scalar product and basis (ei )ni=1 , we define the metric
tensor gij = hei , ej i, which are the coordinates of the 2-form h·, ·i, that is
h·, ·i = Σij gij ebi ⊗ ebj .     (86)

This also implies that


hv, wi = Σij vi wj hei , ej i = Σij vi wj gij = v>Gw .     (87)

Although related, do not confuse gij with the usual definition of a metric d(·, ·) in a
metric space.

3.3.1 Properties of orthonormal bases

If we have an orthonormal basis (ei )ni=1 , many things simplify a lot. Throughout this
subsection, we assume {ei } orthonormal.

ˆ The metric tensor gij = hei , ej i = δij is the identity matrix.5 Such a metric is
also called Euclidean. The norm ||ei || = 1. The canonical dual basis (bei )ni=1 and
the one defined via the scalar product (e∗i )ni=1 become identical, ebi = e∗i = hei , ·i.
Consequently, v and v ∗ have the same coordinates vi = (v ∗ )i w.r.t. (ei )ni=1 and
(bei )ni=1 , respectively.
ˆ The coordinates of vectors can now easily be computed:

v = Σi vi ei   ⇒   hei , vi = hei , Σj vj ej i = Σj hei , ej i vj = vi     (88)

ˆ The coordinates of a linear transform can equally easily be computed: Given
a linear transform f : V → U , an arbitrary (e.g. non-orthonormal) input basis
(vi )ni=1 of V , but an orthonormal basis (ui )ni=1 of U , then

f = Σij fij ui ⊗ vbj   ⇒     (89)
hui , f vj i = hui , Σkl fkl uk ⊗ vbl (vj )i = Σkl fkl hui , uk i vbl (vj ) = Σkl fkl δik δlj = fij     (90)

ˆ The projection onto a basis vector is given by ei hei , ·i.


5 Being picky, a metric is not a matrix but a twice covariant tensor (a row of rows). That’s why it is
correctly called metric tensor.

ˆ The projection onto the span of several basis vectors (e1 , .., ek ) is given by Σi=1..k ei hei , ·i.

ˆ The identity mapping I : V → V is given by I = Σi=1..dim(V) ei hei , ·i.
ˆ The scalar product with an orthonormal basis is
hv, wi = Σij vi wj δij = Σi vi wi     (91)

which, using matrix convention, can also be written as

    hv, wi = (v1 v2 .. vn ) [ w1
                              w2
                              ..
                              wn ]  = v>w = Σi vi wi ,     (92)

where for the first time we introduced the transpose which, in the matrix con-
vention, swaps columns to rows and rows to columns.

As a general note, a row vector “eats a vector and outputs a scalar”. That is v> : V → R
should be thought of as a 1-form! Due to the matrix conventions, it generally is the
case that “rows eat columns”, that is, every row index should always be thought of as
relating to a 1-form (dual vector), and every column index as relating to a vector. That
is totally consistent to our definition of coordinates.
For an orthonormal basis we also have

v ∗ (w) = hv, wi = v>w . (93)

That is, v> is the coordinate representation of the 1-form v ∗ . (Which also says, that
the coordinates of the 1-form v ∗ in the special basis (e∗i )ni=1 ⊂ V ∗ coincide with the
coordinates of the vector v.)

3.4 The Structure of Transforms & Singular Value Decomposition

We focus here on linear transforms (or “linear maps”) f : V → U from one vector space
to another (or the same). It turns out that such transforms have a very specific and
intuitive structure, which is captured by the singular value decomposition.

3.4.1 The Singular Value Decomposition Theorem

We state the following theorem:


Figure 2 (figure omitted; it depicts the input (row) space span(V), the input null space, the output (column) space span(U), and the output null space): A linear transformation $f = \sum_{i=1}^k \sigma_i u_i v_i^\top$ can be described as: take the input x, project it onto the first input fundamental vector $v_1$ to yield a scalar, stretch/squeeze it by $\sigma_1$, and “unproject” this into the first output fundamental vector $u_1$; repeat this for all $i = 1, .., k$, and add up the results.

Theorem 3.1 (Singular Value Decomposition). Given two vector spaces V and U
with scalar products, dim(V) = n and dim(U) = m, for every linear transform f :
V → U there exist a k ≤ n, m and orthonormal vectors $\{v_i\}_{i=1}^k \subset V$, orthonormal
vectors $\{u_i\}_{i=1}^k \subset U$, and positive scalars $\sigma_i > 0$, $i = 1, .., k$, such that
$$f = \sum_{i=1}^k \sigma_i\, u_i\, v_i^* \qquad (94)$$
As above, $v_i^* = \langle v_i, \cdot\rangle$ is the basis 1-form that picks the ith coordinate of a vector
in the basis $(v_i)_{i=1}^k \subset V$.a
a Note that $\{v_i\}_{i=1}^k$ may not be a full basis of V if k < n. But because $\{v_i\}$ is orthonormal,
$\langle v_i, \cdot\rangle$ uniquely picks the ith coordinate no matter how $\{v_i\}_{i=1}^k$ is completed with further n − k
vectors to become a full basis.

We first restate this theorem equivalently in coordinates.

Theorem 3.2 (Singular Value Decomposition). For every matrix $A \in \mathbb{R}^{m\times n}$ there
exists a k ≤ n, m and orthonormal vectors $\{v_i\}_{i=1}^k \subset \mathbb{R}^n$, orthonormal vectors
$\{u_i\}_{i=1}^k \subset \mathbb{R}^m$, and positive scalars $\sigma_i > 0$, $i = 1, .., k$, such that
$$A = \sum_{i=1}^k \sigma_i\, u_i v_i^\top = U S V^\top \qquad (95)$$
where $V = (v_1, .., v_k) \in \mathbb{R}^{n\times k}$, $U = (u_1, .., u_k) \in \mathbb{R}^{m\times k}$ contain the orthonormal basis vectors as columns and $S = \mathrm{diag}(\sigma_1, .., \sigma_k)$.

Let me rephrase this in a sentence: Every matrix A can be expressed as a linear combination of only k rank-1 matrices. Rank-1 matrices are the most minimalistic kinds of matrices and they are always of the form $uv^\top$ for some u and v. The rank-1 matrix $uv^\top$ takes an input x, projects it on v (measures its alignment with v), and “unprojects” into u (multiplies the output vector u by the scalar $v^\top x$).
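A minimal NumPy sketch of Theorem 3.2 (the matrix A below is an arbitrary example): `np.linalg.svd` returns U, the σ’s, and $V^\top$, and summing the rank-1 terms $\sigma_i u_i v_i^\top$ reconstructs A.

import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, -3.0, 1.0]])                       # arbitrary example matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(sigma) V^T

# reconstruct A as a sum of rank-1 matrices sigma_i u_i v_i^T
A_rec = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(sigma)))
print(np.allclose(A, A_rec))                           # True
print(np.allclose(U.T @ U, np.eye(U.shape[1])))        # the u_i are orthonormal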
Just to explicitly show the transition from the coordinate-free to the coordinate-based theorem, consider arbitrary orthonormal bases $\{e_i\}_{i=1}^n \subset V$ and $\{\hat e_i\}_{i=1}^m \subset U$. For $x \in V$ we have
$$f(x) = \sum_{i=1}^k \sigma_i u_i \langle v_i, x\rangle = \sum_{i=1}^k \sigma_i \Big(\sum_l u_{li} \hat e_l\Big)\Big\langle \sum_j v_{ji} e_j,\ \sum_k x_k e_k\Big\rangle \qquad (96)$$
$$= \sum_{i=1}^k \sigma_i \Big(\sum_l u_{li} \hat e_l\Big) \sum_{jk} v_{ji} x_k \delta_{jk} = \sum_l \Big[\sum_{i=1}^k \sum_j u_{li}\, \sigma_i\, v_{ji}\, x_j\Big] \hat e_l \qquad (97)$$
$$= \sum_l \big[U S V^\top x\big]_l\, \hat e_l \qquad (98)$$
where $v_{ji}$ are the coordinates of $v_i$, $u_{li}$ the coordinates of $u_i$, $U = (u_1, .., u_k)$ is the matrix containing $\{u_i\}$ as columns, $V = (v_1, .., v_k)$ the matrix containing $\{v_i\}$ as columns, and $S = \mathrm{diag}(\sigma_1, .., \sigma_k)$ the diagonal matrix with elements $\sigma_i$.
We add some definitions based on this:
Definition 3.14. The rank rank(f ) = rank(A) of a transform f or its matrix A is
the unique minimal k.

Definition 3.15. The determinant of a transform f or its matrix A is
$$\det(f) = \det(A) = \det(S) = \begin{cases} \pm \prod_{i=1}^n \sigma_i & \text{for } \mathrm{rank}(f) = n = m \\ 0 & \text{otherwise} \end{cases} , \qquad (99)$$
where ± depends on whether the transform is a reflection or not.

The last definition is a bit flaky, as the ± is not properly defined. If, alternatively, in the above theorems we would require V and U to be rotations, that is, elements of SO(n) (of the special orthogonal group), then negative σ’s would indicate such a reflection and $\det(A) = \prod_{i=1}^n \sigma_i$. But above we required the σ’s to be strictly positive and V and U only orthogonal. Fundamental space vectors $v_i$ and $u_i$ could flip sign. The ± above indicates how many flip sign.

Definition 3.16. a) The row space (also called right or input fundamental space)
of a transform f is $\mathrm{span}\{v_i\}_{i=1}^{\mathrm{rank}(f)}$. The input null space (or right null space) $V^\perp$
is the subspace orthogonal to the row space, such that $v \in V^\perp \Rightarrow f(v) = 0$.
b) The column space (also called left or output fundamental space) of a transform f is $\mathrm{span}\{u_i\}_{i=1}^{\mathrm{rank}(f)}$. The output null space (or left null space) $U^\perp$ is the
subspace orthogonal to the column space, such that $u \in U^\perp \Rightarrow \langle f(\cdot), u\rangle = 0$.

3.5 Point of departure from the coordinate-free notation

The coordinate-free introduction of vectors and transforms helps a lot to understand


what these fundamentally are. Namely, that coordinate vectors and matrices are ’just’
coordinates and rely on a choice of basis; what a metric gij really is; that only for a
Euclidean metric the inner product satisfies hv, wi = v>w. Further, the coordinate-
free view is essential to understand that vector coordinates behave differently to 1-form
coordinates (e.g., “gradients”!) under a transformation of the basis. We discuss contra-
versus covariance of gradients at the end of this chapter.
However, we now understood that columns correspond to vectors, rows to 1-forms, and that in the Euclidean case the 1-form ⟨v, ·⟩ directly corresponds to $v^\top$, in the non-Euclidean case to $v^\top G$. In applications we typically represent things from the start in orthonormal bases (including perhaps non-Euclidean metrics), so there is not much gain in sticking to the coordinate-free notation in most cases. Only when the matrix notation gets confusing (and this happens, e.g. when trying to compute something like the “Jacobian of a Jacobian”, or applying the chain and product rule for a matrix expression $\partial_x f(x)^\top A(x) b(x)$) it is always a safe harbour to recall what we are actually talking about.
Therefore, in the rest of the notes we rely on the normal coordinate-based view. Only in some explanations we recall the coordinate-free view when helpful.

3.6 Filling SVD with life

In the following we list some statements—all of them relate to the SVD theorem and together they’re meant to give a more intuitive understanding of the equation $A = \sum_{i=1}^k \sigma_i u_i v_i^\top = U S V^\top$.

3.6.1 Understand vv> as a projection

ˆ The projection of a vector x ∈ V onto a vector v ∈ V is given by
$$x_\parallel = \frac{1}{v^2}\, v \langle v, x\rangle \quad\text{or}\quad \frac{vv^\top}{v^2}\, x . \qquad (100)$$
Here, the $\frac{1}{v^2}$ is normalizing in case v does not have length |v| = 1.
ˆ The projection-on-v-matrix $vv^\top$ is symmetric, semi-pos-def, and has $\mathrm{rank}(vv^\top) = 1$.

ˆ The projection of a vector x ∈ V onto a subvector space $\mathrm{span}\{v_i\}_{i=1}^k$ for orthonormal $\{v_i\}_{i=1}^k$ is given by
$$x_\parallel = \sum_i v_i v_i^\top x = V V^\top x \qquad (101)$$
where $V = (v_1, .., v_k) \in \mathbb{R}^{n\times k}$. The projection matrix $VV^\top$ for orthonormal V is symmetric, semi-pos-def, and has $\mathrm{rank}(VV^\top) = k$.
ˆ The expression $\sum_i v_i v_i^\top$ is quite related to an SVD. Conversely, the SVD represents every matrix as a linear combination of kind-of-projections; but these kind-of-projections $uv^\top$ first project onto v, and then unproject along u.

3.6.2 SVD for symmetric matrices

ˆ Thm 3.2 ⇒ Every symmetric matrix A is of the form
$$A = \sum_i \lambda_i v_i v_i^\top = V \Lambda V^\top \qquad (102)$$
for orthonormal $V = (v_1, .., v_k)$. Here $\lambda_i = \pm\sigma_i$ and $\Lambda = \mathrm{diag}(\lambda)$ is the diagonal matrix of λ’s. This describes nothing but a stretching/squeezing along orthogonal projections.

ˆ The λi and vi are also the eigenvalues and eigenvectors of A, that is, for all
i = 1, .., k:

Avi = λi vi . (103)

If A has full rank, then the SVD A = V SV > = V SV -1 is therefore also the
eigendecomposition of A.

ˆ The pseudo-inverse of a symmetric matrix is
$$A^\dagger = \sum_i \lambda_i^{-1} v_i v_i^\top = V \Lambda^{-1} V^\top \qquad (104)$$
which simply does the reverse stretching/squeezing along the same orthogonal projections. Note that
$$A A^\dagger = A^\dagger A = V V^\top \qquad (105)$$
is the projection on $\{v_i\}_{i=1}^k$. For full rank(A) = n we have $VV^\top = I$ and $A^\dagger = A^{-1}$. For rank(A) < n, we have that $A^\dagger y$ minimizes $\min_x ||Ax - y||^2$, but there are infinitely many x’s that minimize this, spanned by the null space of A. $A^\dagger y$ is the minimizer closest to zero (with smallest norm).
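A small NumPy sketch of these statements (the rank-2 symmetric matrix below is a random example of my own):

import numpy as np

# build a rank-2 symmetric 3x3 matrix A = sum_i lambda_i v_i v_i^T
V = np.linalg.qr(np.random.randn(3, 3))[0][:, :2]    # two orthonormal columns
lam = np.array([3.0, 0.5])
A = V @ np.diag(lam) @ V.T

A_pinv = V @ np.diag(1.0 / lam) @ V.T                # reverse stretching/squeezing
print(np.allclose(A_pinv, np.linalg.pinv(A)))        # matches NumPy's pseudo-inverse
print(np.allclose(A @ A_pinv, V @ V.T))              # A A^+ is the projection onto span{v_i}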

ˆ Consider a data set $D = \{x_i\}_{i=1}^m$, $x_i \in \mathbb{R}^n$. For simplicity assume it has zero mean, $\sum_{i=1}^m x_i = 0$. The covariance matrix is defined as
$$C = \frac{1}{n} \sum_i x_i x_i^\top = \frac{1}{n} X^\top X \qquad (106)$$
where (consistent to the ML lecture convention) the data matrix X contains $x_i^\top$ in the ith row. Each $x_i x_i^\top$ is a projection. C is symmetric and semi-pos-def. Using the SVD we can write
$$C = \sum_i \lambda_i v_i v_i^\top \qquad (107)$$
and $\lambda_i$ is the data variance along the eigenvector $v_i$; $\sqrt{\lambda_i}$ the standard deviation along $v_i$; and $\sqrt{\lambda_i}\, v_i$ the principal axis vectors that make up the ellipsoid we typically illustrate covariances with.
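A short NumPy sketch of this (random example data of my own; here the covariance is normalized by the number of data points):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0],
                                          [0.0, 0.5]])   # correlated example data
X = X - X.mean(axis=0)                                   # zero-center

C = X.T @ X / X.shape[0]                                 # covariance matrix
lam, V = np.linalg.eigh(C)                               # eigenvalues ascending, eigenvectors as columns

v = V[:, -1]                                             # largest eigenvector
print(np.var(X @ v), lam[-1])                            # data variance along v equals the eigenvalue
axes = np.sqrt(lam) * V                                  # principal axes of the covariance ellipsoid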

3.6.3 SVD for general matrices

ˆ For full rank(A) = n, the determinant of a matrix is $\det(A) = \pm\prod_i \sigma_i$. We may define the volume spanned by any $\{b_i\}_{i=1}^n$ as
$$\mathrm{vol}(\{b_i\}_{i=1}^n) = \det(B) , \quad B = (b_1, .., b_n) \in \mathbb{R}^{n\times n} . \qquad (108)$$
It follows that
$$\mathrm{vol}(\{A b_i\}_{i=1}^n) = \det(A) \det(B) \qquad (109)$$
that is, the volume is being multiplied with det(A), which is consistent with our intuition of transforms as stretchings/squeezings along orthonormal projections.
ˆ The pseudo-inverse of a general matrix is
$$A^\dagger = \sum_i \sigma_i^{-1} v_i u_i^\top = V S^{-1} U^\top . \qquad (110)$$
If k = n (full input rank), $\mathrm{rank}(A^\top A) = n$ and
$$(A^\top A)^{-1} A^\top = (V S U^\top U S V^\top)^{-1} V S U^\top = V^{-\top} S^{-2} V^{-1} V S U^\top = V S^{-1} U^\top = A^\dagger \qquad (111)$$
and $A^\dagger$ is also called left pseudoinverse because $A^\dagger A = I_n$.
If k = m (full output rank), $\mathrm{rank}(A A^\top) = m$ and
$$A^\top (A A^\top)^{-1} = V S U^\top (U S V^\top V S U^\top)^{-1} = V S U^\top U^{-\top} S^{-2} U^{-1} = V S^{-1} U^\top = A^\dagger \qquad (112)$$
and $A^\dagger$ is also called right pseudoinverse because $A A^\dagger = I_m$.



ˆ Assume m = n (same input/output dimension, or V = U), but k < n. Then there exist orthogonal $V, U \in \mathbb{R}^{n\times n}$ such that
$$A = U D V^\top , \quad D = \mathrm{diag}(\sigma_1, .., \sigma_k, 0, .., 0) = \begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix} . \qquad (113)$$
Here, V and U contain a full orthonormal basis instead of only k orthonormal vectors. But the diagonal matrix D projects all but k of those to zero. Every square matrix $A \in \mathbb{R}^{n\times n}$ can be written like this.

Definition 3.17 (Rotation). Given a scalar-product h·, ·i on V, a linear transform


f : V → V is called rotation iff it preserves the scalar product, that is,

∀v, w ∈ V : hf (v), f (w)i = hv, wi . (114)

ˆ Every rotation matrix is orthogonal, i.e., composed of columns of orthonormal


vectors.

ˆ Every rotation has rank n and σ1,..,n = 1. (No stretching/squeezing.)

ˆ Every square matrix can be written as
   rotation U · scaling D · rotation $V^{-1}$

3.7 Eigendecomposition
Definition 3.18. The eigendecomposition or diagonalization of a square matrix
A ∈ Rn×n is (if it exists!)

A = QΛQ-1 (115)

where Λ = diag(λ1 , .., λn ) is a diagonal matrix of eigenvalues. Each column qi of


Q is an eigenvector.

ˆ First note that, unlike SVD, this is not a Theorem but just a definition: If such a
decomposition exists, it is called eigendecomposition. But it exists for almost any
square matrix.

ˆ The set of eigenvalues is the set of roots of the characteristic polynomial


pA (λ) = det(λI − A). Why? Because then A − λI has ’volume’ zero (or
rank < n), showing that there exists a vector that is mapped to zero, that is
0 = (A − λI)x = Ax − λx.

ˆ Is every square matrix diagonalizable? No, but only an (n²−1)-dimensional subset of matrices are not diagonalizable; most matrices are. A matrix is not diagonalizable if an eigenvalue λ has multiplicity k (more precisely, λ is a root of $p_A(\lambda)$ with multiplicity k), but $n - \mathrm{rank}(A - \lambda I)$ (the dimensionality of the span of the eigenvectors of λ!) is less than k. Then there are not enough linearly independent eigenvectors for λ; they do not span the necessary k dimensions.
So, only very “special” matrices are not diagonalizable. Random matrices are (with prob 1).
ˆ Symmetric matrices? → SVD
ˆ Rotations? Not real. But complex! Think of oscillating projection onto eigenvec-
tor. If φ is the rotation angle, e±iφ are eigenvalues.

3.7.1 Power Method

To find the largest eigenvector of A, initialize x randomly and iterate
$$x \leftarrow A x , \quad x \leftarrow \frac{1}{||x||}\, x \qquad (116)$$
– If this converges, x must be an eigenvector and $\lambda = x^\top A x$ the eigenvalue.
– If A is diagonalizable, and x is initially a non-zero linear combination of all eigenvectors, then it is obvious that x will converge to the “largest” (in absolute terms $|\lambda_i|$) eigenvector (=eigenvector with largest eigenvalue). Actually, if the largest (by norm) eigenvalue is negative, then x doesn’t really converge but flips sign at every iteration.

3.7.2 Power Method including the smallest eigenvalue

A trick, hard to find in the literature, to also compute the smallest eigenvalue and -vector
is the following. We assume all eigenvalues to be positive. Initialize x and y randomly,
iterate
x ← Ax , λ ← ||x|| , x ← x/λ , y ← (λI − A)y , y ← y/||y|| (117)
Then y will converge to the smallest eigenvector, and λ − ||y|| will be its eigenvalue.
Note that (in the limit) A − λI only has negative eigenvalues, therefore ||y|| should be
positive. Finding smallest eigenvalues is a common problem in model fitting.
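A minimal NumPy sketch of both iterations (fixed iteration counts instead of convergence tests; A is a positive definite example matrix of my own):

import numpy as np

def power_method(A, iters=1000):
    # largest (in |lambda|) eigenvalue/-vector by plain power iteration
    x = np.random.randn(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x @ A @ x, x

def power_method_smallest(A, iters=1000):
    # the trick above: run a second power iteration on (lambda*I - A)
    n = A.shape[0]
    x, y = np.random.randn(n), np.random.randn(n)
    for _ in range(iters):
        x = A @ x
        lam = np.linalg.norm(x)
        x /= lam
        y = (lam * np.eye(n) - A) @ y
        mu = np.linalg.norm(y)          # norm before normalization
        y /= mu
    return lam - mu, y

A = np.diag([4.0, 2.0, 0.3])            # positive definite example
print(power_method(A)[0])               # approx 4.0
print(power_method_smallest(A)[0])      # approx 0.3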

3.7.3 Why should I care about Eigenvalues and Eigenvectors?


– Least squares problems (finding smallest eigenvalue/-vector); (e.g. Camera Calibration,
Bundle Adjustment, Plane Fitting)
– PCA
– stationary distributions of Markov Processes
– Page Rank
– Spectral Clustering
– Spectral Learning (e.g., as approach to training HMMs)

3.8 Beyond this script: Numerics to compute these things

We will not go into details of numerics. Nathan’s script gives a really nice explanation
of the QR-method. I just mention two things:
(i) The most important forms of matrices for numerics are diagonal matrices, orthogonal
matrices, and upper triangular matrices. One reason is that all three types can very easily
be inverted. A lot of numerics is about finding decompositions of general matrices into
products of these special-form matrices, e.g.:
– QR-decomposition: A = QR with Q orthogonal and R upper triangular.
– LU-decomposition: A = LU with U and L> upper triangular.
– Cholesky decomposition: (symmetric) A = C>C with C upper triangular
– Eigen- & singular value decompositions

Often, these decompositions are intermediate steps to compute eigenvalue or singular


value decompositions.
(ii) Use linear algebra packages. At the origin of all is LAPACK; browse through http:
//www.netlib.org/lapack/lug/ to get an impression of what really has been one of
the most important algorithms in all technical areas of the last half century. Modern
wrappers are: Matlab (Octave), which originated as just a console interface to LAPACK;
the C++-library Eigen; or the Python NumPy.
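For example, in NumPy (a quick sketch; note that np.linalg.cholesky returns a lower triangular factor, whereas above the Cholesky factor was written as upper triangular):

import numpy as np

A = np.random.randn(4, 4)
S = A @ A.T + 1e-3 * np.eye(4)      # a symmetric positive definite example

Q, R = np.linalg.qr(A)              # QR: Q orthogonal, R upper triangular
C = np.linalg.cholesky(S)           # Cholesky: S = C C^T with C lower triangular
U, sig, Vt = np.linalg.svd(A)       # singular value decomposition
lam, V = np.linalg.eig(A)           # eigendecomposition (possibly complex)

print(np.allclose(Q @ R, A), np.allclose(C @ C.T, S))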

3.9 Derivatives as 1-forms, steepest descent, and the covariant


gradient

3.9.1 The coordinate-free view: A derivative takes a change-of-input vector as


input, and returns a change of output

We previously defined the Jacobian of a function f : Rn → Rd from vector to vector


space. We repeat defining a derivative more generally in coordinate-free form:

Definition 3.19. Given a function f : V → G we define the differential
$$df\big|_x : V \to G , \quad v \mapsto \lim_{h\to 0} \frac{f(x + hv) - f(x)}{h} \qquad (118)$$

This definition holds whenever G is a continuous space that allows the definition of this
limit and the limit exists (f is differentiable). The notation df |x reads “the differential
at location x”, i.e., evaluating this derivative at location x.
Note that df |x is a mapping from a “tangent vector” v (a change-of-input vector) to
an output-change. Further, by this definition df |x is linear. df |x (v) is the directional
derivative we mentioned before. Therefore df |x is a G-valued 1-form. As discussed
earlier, we can introduce coordinates for 1-forms; these coordinates are what typically is called the “gradient” or “Jacobian”. But here we explicitly see that we speak of coordinates of a 1-form.

3.9.2 Contra- and co-variance

In Section 3.2.6 we summarized the effects of a coordinate transformation. We recap


the same here again also for derivatives and scalar products.
We have a vector space V , and a function f : V → R. We’ll be interested in the change
of function value df |x (δ) for change of input δ ∈ V , as well as the value of the scalar
product hδ, δi. All these quantities are defined without any reference to coordinates;
we’ll check now how their coordinate representations change with a change of basis.
As in Section 3.2.6 we have two bases A = (a1 , .., an ) and B = (b1 , .., bn ), and the
transformation T that maps each ai to bi , i.e., B = T A. Given a vector δ we denote
its coordinates in A by [δ]A , and its coordinates in B by [δ]B . Let T AA = B be the
matrix representation of T in the old A coordinates (B contains the new basis vectors
b as columns).

ˆ We previously learned that

[δ]A = B[δ]B (119)

that is, the matrix B carries new coordinates to old ones. These coordinates are
said to be contra-variant: they transform ‘against’ the transformation of the basis
vectors.

ˆ We require that

df |x (δ) = [∂x f ]A [δ]A = [∂x f ]B [δ]B (120)

must be invariant, i.e., the change of function value for δ should not depend on
whether we compute it using A or B coordinates. It follows

[∂x f ]A [δ]A = [∂x f ]A B[δ]B = [∂x f ]B [δ]B (121)


$$[\partial_x f]^A B = [\partial_x f]^B \qquad (122)$$

that is, the matrix B carries old 1-form-coordinates to new 1-form-coordinates.


Therefore, such 1-form-coordinates are called co-variant: they transform ‘with’ the
transformation of basis vectors.

ˆ What we just wrote for the derivative df |x (δ) we could equally write and argue for
any 1-form v ∗ ∈ V ∗ ; we always require that the value v ∗ (δ) is invariant.

ˆ We also require that the scalar product ⟨δ, δ⟩ is invariant. Let
$$\langle\delta, \delta\rangle = [\delta]^{A\top} [G]^A [\delta]^A = [\delta]^{B\top} [G]^B [\delta]^B \qquad (123)$$
where $[G]^A$ and $[G]^B$ are the 2-form-coordinates (metric tensor) in the old and new basis. It follows
$$[\delta]^{A\top} [G]^A [\delta]^A = [\delta]^{B\top} B^\top [G]^A B\, [\delta]^B \qquad (124)$$
$$[G]^B = B^\top [G]^A B \qquad (125)$$
that is, the matrix B carries the old 2-form-coordinates to new ones. These coordinates are called twice co-variant.

Consider the following example: We have the function $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x) = x_1 + x_2$. The function’s partial derivative is of course $\frac{\partial f}{\partial x} = (1\ 1)$. Now let’s transform the coordinates of the space: we introduce new coordinates $(z_1, z_2) = (2x_1, x_2)$ or $z = B^{-1} x$ with $B = \begin{pmatrix} \frac12 & 0 \\ 0 & 1 \end{pmatrix}$. The same function, written in the new coordinates, is $f(z) = \frac12 z_1 + z_2$. The partial derivative of that same function, written in these new coordinates, is $\frac{\partial f}{\partial z} = (\frac12\ 1)$.

Generally, consider we have two kinds of mathematical objects and when we multiply
them together this gives us a scalar. The scalar shouldn’t depend on any choice of
coordinate system and is therefore invariant against coordinate transforms. Then, if
one of the objects transforms in a covariant (“transforming with the transformation”)
manner, the other object must transform in a contra-variant (“transforming contrary
to the transformation”) manner to ensure that the resulting scalar is invariant. This is a
general principle: whenever two things multiply together to give an invariant thing, one
should transform co- the other contra-variant.
Let’s also check Wikipedia:
– “For a vector to be basis-independent, the components [=coordinates] of the vector must
contra-vary with a change of basis to compensate. That is, the matrix that transforms
the vector of components must be the inverse of the matrix that transforms the basis
vectors. The components of vectors (as opposed to those of dual vectors) are said to be
contravariant.
– For a dual vector (also called a covector) to be basis-independent, the components of the
dual vector must co-vary with a change of basis to remain representing the same covector.
That is, the components must be transformed by the same matrix as the change of basis
matrix. The components of dual vectors (as opposed to those of vectors) are said to be
covariant.”

Ordinary gradient descent of the form x ← x + α∇f adds objects of different types:
contra-variant coordinates x with co-variant partial derivatives ∇f . Clearly, adding two
such different types leads to an object who’s transformation under coordinate transforms
is strange—and indeed the ordinary gradient descent is not invariant under transforma-
tions.

3.9.3 Steepest descent and the covariant gradient vector

Let’s define the steepest descent direction to be the one where, when you make a step of
length 1, you get the largest decrease of f in its linear (=1st order Taylor) approximation.

Definition 3.20. Given f : V → R and a norm ||x||2 = hx, xi (or scalar product)
defined on V , we define the steepest descent vector δ ∗ ∈ V as the vector:

$$\delta^* = \mathop{\mathrm{argmin}}_\delta\; df\big|_x(\delta) \quad\text{s.t.}\quad ||\delta||^2 = 1 \qquad (126)$$

Note that for this definition we need to assume we have a scalar product, otherwise the
length=1 constraint is not defined. Also recall that df |x (δ) = ∂x f (x)δ = ∇f (x)>δ are
equivalent notations.
Clearly, if we have coordinates in which the norm is Euclidean then
||δ||2 = δ>δ ⇒ δ ∗ ∝ −∇f (x) (127)

However, if we have coordinates in which the metric is non-Euclidean, we have:


Theorem 3.3 (Steepest Descent Direction (Covariant gradient)). For a general
scalar product hv, wi = v>Gw (with metric tensor G), the steepest descent direction
is

δ ∗ ∝ −G-1 ∇f (x) (128)

Proof: Let $G = B^\top B$ (Cholesky decomposition) and $z = B\delta$:
$$\delta^* = \mathop{\mathrm{argmin}}_\delta\; \nabla f^\top \delta \quad\text{s.t.}\quad \delta^\top G \delta = 1 \qquad (129)$$
$$= B^{-1} \mathop{\mathrm{argmin}}_z\; \nabla f^\top B^{-1} z \quad\text{s.t.}\quad z^\top z = 1 \qquad (130)$$
$$\propto B^{-1} \big[-B^{-\top} \nabla f\big] = -G^{-1} \nabla f \qquad (131)$$

For a coordinate transformation B, recall that the new metric becomes $\tilde G = B^\top G B$, and the new gradient $\tilde\nabla f = B^\top \nabla f$. Therefore, the new steepest descent is
$$\tilde\delta^* = -[\tilde G]^{-1} \tilde\nabla f = -B^{-1} G^{-1} B^{-\top} B^\top \nabla f = -B^{-1} G^{-1} \nabla f \qquad (132)$$
and therefore transforms like normal contra-variant coordinates of a vector.
There is an important special case of this, when f is a function over the space of
probability distributions and G is the Fisher metric, which we’ll discuss later.
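A tiny NumPy sketch of Theorem 3.3 (the gradient and metric tensor below are arbitrary example values, chosen only for illustration):

import numpy as np

def steepest_descent_direction(grad, G):
    # steepest descent w.r.t. the scalar product <v,w> = v^T G w
    d = -np.linalg.solve(G, grad)          # -G^{-1} grad
    return d / np.sqrt(d @ G @ d)          # normalize to ||d||_G = 1

grad = np.array([1.0, 1.0])                # e.g. the gradient of f(x) = x_1 + x_2
G = np.diag([4.0, 1.0])                    # example (non-Euclidean) metric tensor
print(steepest_descent_direction(grad, G)) # not parallel to -grad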

3.10 Examples and Exercises

3.10.1 Basis

Given a linear transform $f : \mathbb{R}^2 \to \mathbb{R}^2$,
$$f(x) = A x = \begin{pmatrix} 7 & -10 \\ 5 & -8 \end{pmatrix} x .$$
Consider the basis $B = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \end{pmatrix} \right\}$, which we also simply refer to by the matrix $B = \begin{pmatrix} 1 & 2 \\ 1 & 1 \end{pmatrix}$. Given a vector x in the vector space $\mathbb{R}^2$, we denote its coordinates in basis B with $x_B$.

(i) Show that x = BxB .


(ii) What is the matrix F B of f in the basis B, i.e., such that [f (x)]B = F B xB ?
Prove the general equation F B = B -1 AB.
(iii) Provide $F_B$ numerically.
Note that for a matrix $M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, $M^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$.

3.10.2 From the Robotics Course

You have a book lying on the table. The edges of the book define the basis B, the edges
of the table define basis A. Initially A and B are identical (also their origins align). Now
we rotate the book by 45◦ counter-clock-wise about its origin.

(i) Given a dot p marked on the book at position pB = (1, 1) in the book coordinate
frame, what are the coordinates pA of that dot with respect to the table frame?
(ii) Given a point x with coordinates xA = (0, 1) in table frame, what are its coordi-
nates xB in the book frame?
(iii) What is the coordinate transformation matrix from book frame to table frame,
and from table frame to book frame?

3.10.3 Bases for Polynomials


Consider the set V of all polynomials $\sum_{i=0}^n \alpha_i x^i$ of degree n, where $x \in \mathbb{R}$ and $\alpha_i \in \mathbb{R}$ for each $i = 0, \ldots, n$.

(i) Is this set of functions a vector space? Why?


(ii) Consider two different bases of this vector space:

A = {1, x, x2 , . . . , xn }

and

B = {1, 1 + x, 1 + x + x2 , . . . , 1 + x + . . . + xn }.

Let f (x) = 1 + x + x2 + x3 be one element in V . (This function f is a vector


in the vector space V , so from here on we refer to it as a vector rather than a
function.)
What are the coordinates [f ]A of this vector in basis A?
What are the coordinates [f ]B of this vector in basis B?
(iii) What matrix I BA allows you to convert between coordinates [f ]A and [f ]B , i.e.
[f ]B = I BA [f ]A ? Which matrix I AB does the same in the opposite direction, i.e.
[f ]A = I AB [f ]B ? What is the relationship between I AB and I BA ?
(iv) What does the difference between coordinates [f ]A − [f ]B represent?
(v) Consider the linear transform t : V → V that maps basis elements of A directly
to basis elements of B:

1→1
x→1+x
x2 → 1 + x + x2
..
.
xn → 1 + x + x2 + · · · + xn

ˆ What is the matrix T A for the linear transform t in the basis A, i.e., such
that [t(f )]A = T A [f ]A ? (Basis A is used for both, input and output spaces.)
ˆ What is the matrix T B for the linear transform t in the basis B, i.e., such
that [t(f )]B = T B [f ]B ? (Basis B is used for both, input and output spaces.)
ˆ What is the matrix T BA if we use A as input space basis, and B as output
space basis, i.e., such that [t(f )]B = T BA [f ]A ?
ˆ What is the matrix T AB if we use B as input space basis, and A as output
space basis, i.e., such that [t(f )]A = T AB [f ]B ?
ˆ Show that T B = I BA T A I AB (cp. Exercise 1(b)). Also note that T AB =
T A I AB and T BA = I BA T A .

3.10.4 Projections

(i) In Rn , a plane (through the origin) is typically described by the linear equation

c>x = 0 , (133)

where c ∈ Rn parameterizes the plane. Provide the matrix that describes the
orthogonal projection onto this plane. (Tip: Think of the projection as I minus a
rank-1 matrix.)

(ii) In Rn , let’s have k linearly independent {vi }ki=1 , which form the matrix V =
(v1 , .., vk ) ∈ Rn×k . Let’s formulate a projection using an optimality principle,
namely,

$$\alpha^*(x) = \mathop{\mathrm{argmin}}_{\alpha\in\mathbb{R}^k} ||x - V\alpha||^2 . \qquad (134)$$

Derive the equation for the optimal α∗ (x) from the optimality principle.
(For information only: Note that $V\alpha = \sum_{i=1}^k \alpha_i v_i$ is just the linear combination of $v_i$’s with coefficients α. The projection of a vector x is then $x_\parallel = V \alpha^*(x)$.)

3.10.5 SVD

Consider the matrices
$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -2 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} , \qquad B = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & -1 \\ 2 & 1 & 0 \\ 0 & 1 & 2 \end{pmatrix} \qquad (135)$$

(i) Describe their 4 fundamental spaces (dimensionality, possible basis vectors).

(ii) Find the SVD of A (using pen and paper only!)

(iii) Given an arbitrary input vector x ∈ R3 , provide the linear transformation matrices
PA and PB that project x into the input null space of matrix A and B, respectively.

(iv) Compute the pseudo inverse A† .

(v) Determine all solutions to the linear equations Ax = y and Bx = y with y =


(2, 3, 0, 0). What is the more general expression for an arbitrary y?

3.10.6 Bonus: Scalar product and Orthogonality

(i) Show that $f(x, y) = 2x_1 y_1 - x_1 y_2 - x_2 y_1 + 5 x_2 y_2$ is a scalar product on $\mathbb{R}^2$.


(ii) In the space of functions with the scalar product $\langle f, g\rangle = \int_0^{2\pi} f(x) g(x)\, dx$, what is the scalar product of sin(x) with sin(2x)? (A graphical argument is ok.)

(iii) What property does a matrix M have to satisfy in order to be a valid metric tensor,
i.e. such that x>M y is a valid scalar product?

3.10.7 Eigenvectors

(i) A symmetric matrix A ∈ Rn×n is called positive semidefinite (PSD) if x>Ax ≥


0, ∀x ∈ Rn . (PSD is usually only used with symmetric matrices.) Show that all
eigenvalues of a PSD matrix are non-negative.

(ii) Show that if v is an eigenvector of A with eigenvalue λ, then v is also an eigenvector


of Ak for any positive integer k. What is the corresponding eigenvalue?

(iii) Let v be an eigenvector of A with eigenvalue λ and w an eigenvector of A> with


a different eigenvalue µ 6= λ. Show that v and w are orthogonal.

(iv) Suppose A ∈ Rn×n has eigenvalues λ1 , . . . , λn ∈ R. What are the eigenvalues of


A + αI for α ∈ R and I an identity matrix?

(v) Assume A ∈ Rn×n is diagonalizable, i.e., it has n linearly independent eigenvec-


tors, each with a different eigenvalue. Initialize x ∈ Rn as a random normalized
vector and iterate the two steps
$$x \leftarrow A x , \quad x \leftarrow \frac{1}{||x||}\, x$$

Prove that (under certain conditions) these iterations converge to the eigenvector
x with a largest (in absolute terms |λi |) eigenvalue of A. How fast does this
converge? In what sense does it converge if the largest eigenvalue is negative?
What if eigenvalues are not different? Other convergence conditions?

(vi) Let A be a positive definite matrix with λmax its largest eigenvalue (in absolute
terms |λi |). What do we get when we apply power iteration method to the matrix
B = A − λmax I? How can we get the smallest eigenvalue of A?

(vii) Consider the following variant of the previous power iteration:

$$z \leftarrow A x , \quad \lambda \leftarrow x^\top z , \quad y \leftarrow (\lambda I - A)\, y , \quad x \leftarrow \frac{1}{||z||}\, z , \quad y \leftarrow \frac{1}{||y||}\, y .$$

If A is a positive definite matrix, show that the algorithm can give an estimate of
the smallest eigenvalue of A.

3.10.8 Covariance and PCA

Suppose we’re given a collection of zero-centered data points $D = \{x_i\}_{i=1}^N$, with each $x_i \in \mathbb{R}^n$. The covariance matrix is defined as
$$C = \frac{1}{n} \sum_{i=1}^N x_i x_i^\top = \frac{1}{n} X^\top X$$
where (consistent to the ML lecture convention) the data matrix X contains each $x_i^\top$ as a row, i.e., $X^\top = (x_1, .., x_N)$.
If we project D onto some unit vector v ∈ Rn , then the variance of the projected data
points is v>Cv. Show that the direction that maximizes this variance is the largest
eigenvector of C. (Hint: Expand v in terms of the eigenvector basis of C and exploit
the constraint v>v = 1.)

3.10.9 Bonus: RKHS

In machine learning we often work in spaces of functions called Reproducing Kernel


Hilbert Spaces. These spaces are constructed from a certain type of function called the
kernel. The kernel k : Rd × Rd → R takes two d-dimensional inputs k(x, x0 ), and from
the kernel we construct a basis for the space of function, namely B = {k(x, ·)}x∈Rd .
Note that this is a set of infinite element: each x ∈ Rd adds a basis function k(x, ·) to the
basis B. The scalar product between two basis functions kx = k(x, ·) and kx0 = k(x0 , ·)
is defined to be the kernel evaluation itself: hkx , kx0 i = k(x, x0 ). The kernel function
is therefore required to be a positive definite function so that it defines a viable scalar
product.

(i) Show that for any function f ∈ span B it holds

hf, kx i = f (x)

(ii) Assume we only have a finite set of points $D = \{x_i\}_{i=1}^n$, which defines a finite
basis {kxi }ni=1 ⊂ B. This finite function basis spans a subspace FD = span{kxi :
xi ∈ D} of the space of all functions.
For a general function f , we decompose it f = fs + f⊥ with fs ∈ FD and
∀g ∈ FD : hf⊥ , gi = 0, i.e., f⊥ is orthogonal to FD . Show that for every xi ∈ D:

f (xi ) = fs (xi )

(Note: This shows that the function values of any function f at the data points
D only depend on the part fs which is inside the span of {kxi : xi ∈ D}.
This implies the so-called representer theorem, which is fundamental in kernel
machines: A loss can only depend on function values f (xi ) at data points, and
therefore on fs . The part f⊥ can only increase the complexity (norm) of a function.
Therefore, the simplest function to optimize any loss will have f⊥ = 0 and be
within span{kxi : xi ∈ D}.)

(iii) Within span{kxi : xi ∈ D}, what is the coordinate representation of the scalar
product?

4 Optimization

4.1 Downhill algorithms for unconstrained optimization

We discuss here algorithms that have one goal: walk downhill as quickly as possible.
These aim at efficiently finding local optima—in contrast to global optimization
methods, which try to find the global optimum.
For such downhill walkers, there are two essential things to discuss: the stepsize and
the step direction. When discussing the stepsize we’ll hit on topics like backtracking
line search, the Wolfe conditions and their implications in a basic convergence proof.
The discussion of the step direction will very much circle around Newton’s method and
thereby also cover topics like quasi-Newton methods (BFGS), Gauss-Newton, covariant
and conjugate gradients.

4.1.1 Why you shouldn’t trust the magnitude of the gradient

Consider the following 1D function and naive gradient descent x ← x − α∇f for some
fixed and small α.

(Figure: a 1D function with a flat plateau (small gradient: small step?) and a steep slope near the optimum (large gradient: large step?).)

ˆ In plateaus we’d make small steps, at steep slopes (here close to the optimum) we make huge steps, very likely overstepping the optimum. In fact, for some α the algorithm might indefinitely loop a non-sensical sequence of very slowly walking left on the plateau, then accelerating, eventually overstepping the optimum, then being thrown back far to the right again because of the huge negative gradient on the left.

ˆ Generally, never trust an algorithm that depends on the scaling—or choice of


units—of a function! An optimization algorithm should be invariant on whether
you measure costs in cents or Euros! Naive gradient descent x ← x − α∇f is not!

As a conclusion, the gradient ∇f gives a reasonable descent direction, but its magnitude
is really arbitrary and no good indication of a good stepsize. Therefore, it often makes
sense to just compute the step direction
$$\delta = -\frac{1}{|\nabla f(x)|}\, \nabla f(x) \qquad (136)$$
and iterate x ← x + αδ for some appropriate stepsize.

4.1.2 Ensuring monotone and sufficient decrease: Backtracking line search,


Wolfe conditions, & convergence

The first idea is simple: If a step would increase the objective value, reduce the stepsize. We typically use multiplicative stepsize adaptations: Reduce α ← %−α α with %−α ≈ 0.5; and increase α ← %+α α with %+α ≈ 1.2. A simple monotone gradient descent algorithm reads as follows (the blue part is explained later). Here, the step vector δ is always
reads as follows (the blue part is explained later). Here, the step vector δ is always

Algorithm 2 Plain gradient descent with backtracking line search


Input: initial x ∈ Rn, functions f(x) and ∇f(x), tolerance θ, parameters (defaults: %+α = 1.2, %−α = 0.5, δmax = ∞, %ls = 0.01)
1: initialize stepsize α = 1
2: repeat
3:   δ ← −∇f(x)/|∇f(x)|   // (alternative: δ = −∇f(x))
4:   while f(x + αδ) > f(x) + %ls ∇f(x)>(αδ) do   // backtracking line search
5:     α ← %−α α   // decrease stepsize
6:   end while
7:   x ← x + αδ
8:   α ← min{%+α α, δmax}   // increase stepsize
9: until |αδ| < θ   // perhaps for 10 iterations in sequence

normalized and α is adapted on the fly; decreasing when f (x + αδ) is not sufficiently
smaller than f (x).
This “sufficiently smaller” is described by the blue part and is called the (1st) Wolfe
condition
f (x + αδ) > f (x) + %ls ∇f (x)>(αδ) . (137)
Figure 3 illustrates this. Note that ∇f(x)>(αδ) is a negative value and describes how much the objective would decrease if f was linear. But f is of course not linear; we cannot expect that a step would really decrease f that much. Instead we require that it decreases by a fraction of this expectation. %ls describes this fraction and is typically chosen very moderately, e.g. %ls ∈ [0.01, 0.1]. So, the Wolfe condition requires that f decreases by the %ls-fraction of what it would decrease if f was linear. Note that for α → 0 the Wolfe condition will always be fulfilled for smooth functions, because f “becomes locally linear”.

Figure 3: The 1st Wolfe condition: f (x) + ∇f (x)>(αδ) is the tangent, which describes
the expected decrease of f (x + αδ) if f was linear. We cannot expect this to be the
case; so f (x + αδ) > f (x) + %ls ∇f (x)>(αδ) weakens this condition.

You’ll proove the following theorem in the exercises. It is fundamental to convex opti-
mization and proves that Alg 2 is efficient for convex objective functions:

Theorem 4.1 (Exponential convergence on convex functions). Let $f : \mathbb{R}^n \to \mathbb{R}$ be an objective function whose Hessian eigenvalues λ of $\nabla^2 f(x)$ are lower bounded by m > 0 and upper bounded by M > m, with $m, M \in \mathbb{R}$, at any location $x \in \mathbb{R}^n$. (f is convex.) Then Algorithm 2 converges exponentially with convergence rate $(1 - 2\frac{m}{M}\, \%_{ls}\, \%^-_\alpha)$ to the optimum.

But even if your objective function is not globally convex, Alg 2 is an efficient downhill
walker, and once it reaches a convex region it will efficiently walk into its local minimum.
For completeness, there is a second Wolfe condition,

|∇f (x + αδ)>δ| ≤ b|∇f (x)>δ| , (138)

which states that the gradient magnitude should have decreased sufficiently. We do not
use it much.
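A minimal Python sketch of Algorithm 2 (simplified: a single stopping test instead of “10 iterations in sequence”, and no δmax cap; the quadratic example at the end is my own):

import numpy as np

def gradient_descent(f, df, x, theta=1e-5, rho_plus=1.2, rho_minus=0.5, rho_ls=0.01,
                     max_iters=10000):
    alpha = 1.0
    for _ in range(max_iters):
        g = df(x)
        delta = -g / np.linalg.norm(g)                     # normalized descent direction
        # backtracking line search: shrink alpha until the Wolfe condition holds
        while f(x + alpha * delta) > f(x) + rho_ls * g @ (alpha * delta):
            alpha *= rho_minus
        x = x + alpha * delta
        alpha *= rho_plus                                  # increase stepsize again
        if np.linalg.norm(alpha * delta) < theta:
            break
    return x

# example usage on a convex quadratic
f  = lambda x: 0.5 * x @ np.diag([1.0, 10.0]) @ x
df = lambda x: np.diag([1.0, 10.0]) @ x
print(gradient_descent(f, df, np.array([3.0, -2.0])))      # approx (0, 0)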

4.1.3 The Newton direction

We already discussed the steepest descent direction −G-1 ∇f (x) if G is a metric tensor.
Let’s keep this in mind!
The original Newton method is a method to find the root (that is, zero point) of a function f(x). In 1D it iterates $x \leftarrow x - \frac{f(x)}{f'(x)}$, that is, it uses the gradient f′ to estimate where the function might cross the x-axis. To find an optimum (minimum or maximum) of f we want to find the root of its gradient. For $x \in \mathbb{R}^n$ the Newton method iterates

x ← x − ∇2 f (x)-1 ∇f (x) . (139)

Note that the Newton step δ = −∇2 f (x)-1 ∇f (x) is the solution to
$$\min_\delta \Big[ f(x) + \nabla f(x)^\top \delta + \frac12 \delta^\top \nabla^2 f(x)\, \delta \Big] . \qquad (140)$$
So the Newton method can also be viewed as 1) compute the 2nd-order Taylor approx-
imation to f at x, and 2) jump to the optimum of this approximation.
Note:

ˆ If f is just a 2nd-order polynomial, the Newton method will jump to the optimum
in just one step.
ˆ Unlike the gradient magnitude |∇f (x)|, the magnitude of the Newton step δ is
very meaningful. It is scale invariant! If you’d rescale f (trade cents by Euros), δ
is unchanged. |δ| is the distance to the optimum of the 2nd-order Taylor.
ˆ Unlike the gradient ∇f(x), the Newton step δ is truly a vector! The vector itself is invariant under coordinate transformations; the coordinates of δ transform contra-variantly, as vector coordinates are supposed to.
ˆ The Hessian as metric, and the Newton step as steepest descent: Assume that the Hessian $H = \nabla^2 f(x)$ is pos-def. Then it fulfils all necessary conditions to define a scalar product $\langle v, w\rangle = \sum_{ij} v_i w_j H_{ij}$, where H plays the role of the metric tensor. If H was the space’s metric, then the steepest descent direction is $-H^{-1}\nabla f(x)$, which is the Newton direction.
Another way to understand the same: In the 2nd-order Taylor approximation $f(x+\delta) \approx f(x) + \nabla f(x)^\top\delta + \frac12 \delta^\top H \delta$ the Hessian plays the role of a metric tensor. Or: we may think of the function f as being an isometric parabola $f(x+\delta) \propto \langle\delta, \delta\rangle$, but we’ve chosen coordinates where $\langle v, v\rangle = v^\top H v$ and the parabola seems squeezed.
Note that this discussion only holds for a pos-def Hessian.

A robust Newton method is the core of many solvers, see Algorithm 3. We do back-
tracking line search along the Newton direction, but with maximal step size α = 1 (the
full Newton step).
We can additionally add and adapt damping to gain more robustness. Some notes on
the λ:

ˆ In Alg 3, line 3 chooses λ to ensure that $(\nabla^2 f(x) + \lambda I)$ is indeed pos-def—and a Newton step actually decreases f instead of seeking for a maximum. There would be other options: instead of adding to all eigenvalues we could only set the negative ones to some λ > 0.

Algorithm 3 Newton method


Input: initial x ∈ Rn, functions f(x), ∇f(x), ∇2f(x), tolerance θ, parameters (defaults: %+α = 1.2, %−α = 0.5, %+λ = %−λ = 1, %ls = 0.01)
1: initialize stepsize α = 1
2: repeat
3:   choose λ > −(minimal eigenvalue of ∇2f(x))
4:   compute δ to solve (∇2f(x) + λI) δ = −∇f(x)
5:   while f(x + αδ) > f(x) + %ls ∇f(x)>(αδ) do   // line search
6:     α ← %−α α   // decrease stepsize
7:     optionally: λ ← %+λ λ and recompute δ   // increase damping
8:   end while
9:   x ← x + αδ   // step is accepted
10:   α ← min{%+α α, 1}   // increase stepsize
11:   optionally: λ ← %−λ λ   // decrease damping
12: until ||αδ||∞ < θ

ˆ δ solves the problem
$$\min_\delta \Big[ \nabla f(x)^\top \delta + \frac12 \delta^\top \nabla^2 f(x)\, \delta + \frac12 \lambda \delta^2 \Big] . \qquad (141)$$
So, we added a squared potential $\lambda\delta^2$ to the local 2nd-order Taylor approximation. This is like introducing a squared penalty for large steps!

ˆ Trust region method: Let’s consider a different mathematical program over the step:
$$\min_\delta\; \nabla f(x)^\top \delta + \frac12 \delta^\top \nabla^2 f(x)\, \delta \quad\text{s.t.}\quad \delta^2 \le \beta \qquad (142)$$
This problem wants to find the minimum of the 2nd-order Taylor (like the Newton step), but constrained to a stepsize no larger than β. This β defines the trust region: The region in which we trust the 2nd-order Taylor to be a reasonable enough approximation.
Let’s solve this using Lagrange parameters (as we will learn it later): Let’s assume the inequality constraint is active. Then we have
$$L(\delta, \lambda) = \nabla f(x)^\top \delta + \frac12 \delta^\top \nabla^2 f(x)\, \delta + \lambda(\delta^2 - \beta) \qquad (143)$$
$$\nabla_\delta L(\delta, \lambda) = \nabla f(x)^\top + \delta^\top(\nabla^2 f(x) + 2\lambda I) \qquad (144)$$
Setting this to zero gives the step $\delta = -(\nabla^2 f(x) + 2\lambda I)^{-1} \nabla f(x)$.
Therefore, the λ can be viewed as the dual variable of a trust region method. There is no analytic relation between β and λ; we cannot determine λ directly from β. We could use a constrained optimization method, like primal-dual Newton, or the Augmented Lagrangian approach, to solve for λ and δ. Alternatively, we can always increase λ when the computed steps are too large, and decrease it if they are smaller than β—the Augmented Lagrangian update equations could be used to do such an update of λ.
ˆ The λδ 2 term can be interpreted as penalizing the velocity (stepsizes) of the
algorithm. This is in analogy to a damping, such as “honey pot damping”, in
physical dynamic systems. The parameter λ is therefore also called damping
parameter. Such a damped (or regularized) least-squares method is also called
Levenberg-Marquardt method.
ˆ For λ → ∞ the step direction δ becomes aligned with the plain gradient direction
−∇f (x). This shows that for λ → ∞ the Hessian (and metric deformation of the
space) becomes less important and instead we’ll walk orthogonal to the iso-lines
of the function.
ˆ The λ term makes the δ non-scale-invariant! δ is not anymore a proper vector!
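Putting the pieces above together, here is a minimal Python sketch of a damped Newton step with backtracking line search (not the full Algorithm 3; the example function at the end is my own):

import numpy as np

def newton(f, df, ddf, x, theta=1e-8, lam=1e-2, rho_ls=0.01, rho_minus=0.5, max_iters=100):
    alpha = 1.0
    for _ in range(max_iters):
        H = ddf(x)
        # damping: shift eigenvalues so that (H + lam_eff*I) is positive definite
        lam_eff = max(lam, -np.min(np.linalg.eigvalsh(H)) + 1e-6)
        delta = np.linalg.solve(H + lam_eff * np.eye(len(x)), -df(x))
        while f(x + alpha * delta) > f(x) + rho_ls * df(x) @ (alpha * delta):
            alpha *= rho_minus                         # backtracking line search
        x = x + alpha * delta
        alpha = min(1.2 * alpha, 1.0)                  # the full Newton step is the maximum
        if np.linalg.norm(alpha * delta) < theta:
            break
    return x

# example: f(x) = (x1-1)^2 + 10 (x2 - x1^2)^2
f   = lambda x: (x[0] - 1)**2 + 10 * (x[1] - x[0]**2)**2
df  = lambda x: np.array([2*(x[0] - 1) - 40*x[0]*(x[1] - x[0]**2),
                          20*(x[1] - x[0]**2)])
ddf = lambda x: np.array([[2 - 40*(x[1] - x[0]**2) + 80*x[0]**2, -40*x[0]],
                          [-40*x[0], 20.0]])
print(newton(f, df, ddf, np.array([-1.0, 2.0])))       # approx (1, 1)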

4.1.4 Least Squares & Gauss-Newton: a very important special case

A special case that appears a lot in intelligent systems is the Least Squares case: Consider an objective function of the form
$$f(x) = \phi(x)^\top \phi(x) = \sum_i \phi_i(x)^2 \qquad (145)$$
where we call φ(x) the cost features. This is also called a sum-of-squares problem. We have
$$\nabla f(x) = 2\, \frac{\partial}{\partial x}\phi(x)^\top \phi(x) \qquad (146)$$
$$\nabla^2 f(x) = 2\, \frac{\partial}{\partial x}\phi(x)^\top \frac{\partial}{\partial x}\phi(x) + 2\, \phi(x)^\top \frac{\partial^2}{\partial x^2}\phi(x) \qquad (147)$$
The Gauss-Newton method is the Newton method for $f(x) = \phi(x)^\top\phi(x)$ while approximating $\nabla^2 \phi(x) \approx 0$. That is, it computes approximate Newton steps
$$\delta = -\Big( \frac{\partial}{\partial x}\phi(x)^\top \frac{\partial}{\partial x}\phi(x) + \lambda I \Big)^{-1} \frac{\partial}{\partial x}\phi(x)^\top \phi(x) . \qquad (148)$$
Note:

ˆ The approximate Hessian $2\,\frac{\partial}{\partial x}\phi(x)^\top \frac{\partial}{\partial x}\phi(x)$ is always semi-pos-def! Therefore, no problems arise with negative Hessian eigenvalues.
ˆ The approximate Hessian only requires the first-order derivatives of the cost fea-
tures. There is no need for computationally expensive hessians of φ.

Algorithm 4 Newton method – practical version realized in rai


Input: initial x ∈ Rn , functions f (x), ∇f (x), ∇2 f (x)
Input: stopping tolerances θx , θf
Input: parameters %+α = 1.2, %−α = 0.5, λ0 = 0.01, %+λ = %−λ = 1, %ls = 0.01
Input: maximal stepsize δmax ∈ R, optionally lower/upper bounds x, x ∈ Rn

1: initialize stepsize α = 1, λ = λ0
2: repeat
3: compute smallest eigenvalue σmin of ∇2 f (x) + λI
4: if σmin > 0 is sufficiently posititive then
5: compute δ to solve (∇2 f (x) + λI) δ = −∇f (x)
6: else
7: Option 1: λ ← 2λ − σmin and goto line 3 // increase regularization
8: Option 2: δ ← −δmax ∇f (x)/|∇f (x)| // gradient step of length δmax
9: end if
10: if ||δ||∞ > δmax : δ ← (δmax /||δ||∞ ) δ // cap δ length
11: y ← BoundClip(x + αδ, x, x)
12: while f (y) > f (x) + %ls ∇f (x)>(y − x) do // bound-projected line search
13: α ← %− αα // decrease stepsize
14: (unusual option: λ ← %+ λ λ and goto line 3) // increase damping
15: y ← BoundClip(x + αδ, x, x)
16: end while
17: xold ← x
18: x←y // step is accepted
19: α ← min{%+ α α, 1} // increase stepsize
20: (unusual option: λ ← %− λ λ) // decrease damping
21: until ||xold − x||∞ < θx repeatedly, or f (xold ) − f (x) < θf repeatedly

22: procedure BoundClip(x, x, x)


23: ∀i : xi ← min{xi , xi } , xi ← max{xi , xi }
24: end procedure

ˆ The objective f(x) can be interpreted as the Euclidean norm $f(\phi) = \phi^\top\phi$ but pulled back into the x-space. More precisely: Consider a mapping $\phi : \mathbb{R}^n \to \mathbb{R}^m$ and a general scalar product $\langle\cdot,\cdot\rangle_\phi$ in the output space. In differential geometry there is the notion of a pull-back of a metric, that is, we define a scalar product $\langle\cdot,\cdot\rangle_x$ in the input space as
$$\langle x, y\rangle_x = \langle d\phi(x), d\phi(y)\rangle_\phi \qquad (149)$$
where dφ is the differential of φ (an $\mathbb{R}^m$-valued 1-form). Assuming φ-coordinates such that the metric tensor of $\langle\cdot,\cdot\rangle_\phi$ is Euclidean, we have
$$\langle x, x\rangle_x = \langle d\phi(x), d\phi(x)\rangle_\phi = \frac{\partial}{\partial x}\phi(x)^\top \frac{\partial}{\partial x}\phi(x) \qquad (150)$$
and therefore the approximate Hessian is the pullback of a Euclidean cost feature metric, and $\langle x, x\rangle_x$ approximates the 2nd-order polynomial term of f(x), with the non-constant (i.e., Riemannian) pull-back metric $\langle\cdot,\cdot\rangle_x$.
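A small Python sketch of Gauss-Newton steps following eq. (148) (no line search or damping adaptation; the exponential-curve-fitting example and its Jacobian are my own illustration):

import numpy as np

def gauss_newton(phi, J, x, lam=1e-3, iters=100):
    for _ in range(iters):
        r, Jx = phi(x), J(x)                             # cost features and their Jacobian
        delta = -np.linalg.solve(Jx.T @ Jx + lam * np.eye(len(x)), Jx.T @ r)  # eq. (148)
        x = x + delta
    return x

# example: fit a*exp(b*t) to data, cost features phi_i(x) = a*exp(b*t_i) - y_i
t = np.linspace(0, 1, 20)
y = 2.0 * np.exp(-1.5 * t)
phi = lambda x: x[0] * np.exp(x[1] * t) - y
J   = lambda x: np.stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)], axis=1)
print(gauss_newton(phi, J, np.array([1.0, 0.0])))        # approx (2, -1.5)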

4.1.5 Quasi-Newton & BFGS: approximating the hessian from gradient obser-
vations

To apply full Newton methods we need to be able to compute f (x), ∇f (x), and ∇2 f (x)
for any x. However, sometimes, computing ∇2 f (x) is not possible, e.g., because we
cannot derive an analytic expression for ∇2 f (x), or it would be too expensive to compute
the hessian exactly, or even to store it in memory—especially in very high-dimensional
spaces. In such cases it makes sense to approximate ∇2 f (x) or ∇2 f (x)-1 with a low-rank
approximation.
Assume we have computed ∇f (x1 ) and ∇f (x2 ) at two different points x1 , x2 ∈ Rn . We
define

y = ∇f (x2 ) − ∇f (x1 ) , δ = x2 − x1 . (151)

From this we may wish to find some approximate Hessian matrix H or $H^{-1}$ that fulfils
$$H\,\delta \overset{!}{=} y \quad\text{or}\quad \delta \overset{!}{=} H^{-1} y \qquad (152)$$
The first equation is called secant equation. Here are guesses of H and $H^{-1}$:
$$H = \frac{y\, y^\top}{y^\top\delta} \quad\text{or}\quad H^{-1} = \frac{\delta\,\delta^\top}{\delta^\top y} \qquad (153)$$
Convince yourself that these choices fulfil the respective desired relation above. However,
these choices are under-determined. There exist many alternative H or H -1 that would
be consistent with the observed change in gradient. However, given our understanding
of the structure of matrices it is clear that these choices are the lowest rank solutions,
namely rank 1.

Broyden-Fletcher-Goldfarb-Shanno (BFGS): An optimization algorithm computes


∇f (xk ) at a series of points x0:K . We incrementally update our guess of H -1 with an
update equation
$$H^{-1} \leftarrow \Big(I - \frac{y\delta^\top}{\delta^\top y}\Big)^{\!\top} H^{-1} \Big(I - \frac{y\delta^\top}{\delta^\top y}\Big) + \frac{\delta\delta^\top}{\delta^\top y} , \qquad (154)$$
which is equivalent to (using the Sherman-Morrison formula)
$$H \leftarrow H - \frac{H\delta\delta^\top H^\top}{\delta^\top H \delta} + \frac{y\, y^\top}{y^\top\delta} . \qquad (155)$$
Note:

ˆ If $H^{-1}$ is initially zero, this update will assign $H^{-1} \leftarrow \frac{\delta\delta^\top}{\delta^\top y}$, which is the minimal rank-1 update we discussed above.
ˆ If $H^{-1}$ is previously non-zero, the first term of (154) “deletes certain dimensions” from $H^{-1}$. More precisely, note that $\big(I - \frac{y\delta^\top}{\delta^\top y}\big)\, y = 0$, that is, this rank-(n−1) construction deletes span{y} from its input space. Therefore, the first term gives zero when multiplied with y; and it is guaranteed that the resulting $H^{-1}$ fulfils $H^{-1} y = \delta$.

The BFGS algorithm uses this $H^{-1}$ instead of a precise $\nabla^2 f(x)^{-1}$ to compute the steps in a Newton method. All we said about line search and Levenberg-Marquardt damping is unchanged.6
In very high-dimensional spaces we do not want to store $H^{-1}$ densely. Instead we use a compressed storage for low-rank matrices, e.g., storing vectors $\{v_i\}$ such that $H^{-1} = \sum_i v_i v_i^\top$. Limited memory BFGS (L-BFGS) makes this more memory efficient: it limits the rank of the $H^{-1}$ and thereby the used memory. I do not know the details myself, but I assume that with every update it might aim to delete the lowest eigenvalue to keep the rank constant.
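A tiny NumPy sketch of the update (154), checking that the updated $H^{-1}$ fulfils the secant equation (the vectors are constructed so that $\delta^\top y > 0$; the "true" Hessian is an arbitrary example):

import numpy as np

def bfgs_update(Hinv, delta, y):
    # BFGS update of the inverse Hessian approximation, eq. (154)
    rho = 1.0 / (delta @ y)
    I = np.eye(len(delta))
    M = I - rho * np.outer(y, delta)
    return M.T @ Hinv @ M + rho * np.outer(delta, delta)

n = 4
delta = np.random.randn(n)
y = np.diag([1.0, 2.0, 3.0, 4.0]) @ delta       # y = H_true delta for a pos. def. H_true
Hinv = bfgs_update(np.eye(n), delta, y)
print(np.allclose(Hinv @ y, delta))             # the secant equation H^{-1} y = delta holds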

4.1.6 Conjugate Gradient

The Conjugate Gradient Method is a method for solving large linear equation systems Ax + b = 0. We only mention its extension for optimizing nonlinear functions f(x).
As above, assume that we evaluated ∇f(x1) and ∇f(x2) at two different points x1, x2 ∈ Rn. But now we make one more assumption: The point x2 is the minimum of a line search from x1 along the direction δ1. This latter assumption is quite optimistic: it assumes we did perfect line search. But it gives great information: The iso-lines of f(x) at x2 are tangential to δ1.
6 Taken from Peter Blomgren’s lecture slides: terminus.sdsu.edu/SDSU/Math693a_f2013/Lectures/18/lecture.pdf This is the original Davidon-Fletcher-Powell (DFP) method suggested by W.C. Davidon in 1959. The original paper describing this revolutionary idea – the first quasi-Newton method – was not accepted for publication. It later appeared in 1991 in the first issue of the SIAM Journal on Optimization.
In this setting, convince yourself of the following: Ideally each search direction should be
orthogonal to the previous one—but not orthogonal in the conventional Euclidean sense,
but orthogonal w.r.t. the Hessian H. Two vectors d and d0 are called conjugate w.r.t.
a metric H iff d0>Hd = 0. Therefore, subsequent search directions should be conjugate
to each other.
Conjugate gradient descent does the following:

Algorithm 5 Conjugate gradient descent


Input: initial x ∈ Rn , functions f (x), ∇f (x), tolerance θ
Output: x
1: initialize descent direction d = g = −∇f (x)
2: repeat
3: α ← argminα f (x + αd) // line search
4: x ← x + αd
5:   g′ ← g, g ← −∇f(x)   // store and compute grad
6:   β ← max{ g>(g − g′) / (g′>g′) , 0 }
,0
7: d ← g + βd // conjugate descent direction
8: until |∆x| < θ

ˆ The equation for β is by Polak-Ribière: On a quadratic function f (x) = x>Hx


this leads to conjugate search directions, d0>Hd = 0.

ˆ Intuitively, β > 0 implies that the new descent direction always adds a bit of the
old direction. This essentially provides 2nd order information.

ˆ For arbitrary quadratic functions CG converges in n iterations. But this only works
with perfect line search.
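A Python sketch of Algorithm 5, using SciPy's 1-D minimizer as a stand-in for the line search (my own illustration; on the quadratic example it converges after a few line searches):

import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_descent(f, df, x, theta=1e-8, max_iters=200):
    g = -df(x)
    d = g.copy()
    for _ in range(max_iters):
        alpha = minimize_scalar(lambda a: f(x + a * d)).x    # (approximate) line search
        x_new = x + alpha * d
        g_new = -df(x_new)
        beta = max(g_new @ (g_new - g) / (g @ g), 0.0)       # Polak-Ribiere
        d = g_new + beta * d                                 # conjugate descent direction
        if np.linalg.norm(x_new - x) < theta:
            return x_new
        x, g = x_new, g_new
    return x

A = np.diag([1.0, 10.0, 100.0])
f  = lambda x: 0.5 * x @ A @ x
df = lambda x: A @ x
print(conjugate_gradient_descent(f, df, np.array([1.0, 1.0, 1.0])))   # approx (0, 0, 0)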

4.1.7 Rprop*

Read through Algorithm 6. Notes on this:

ˆ Stepsize adaptation is done in each coordinate separately !

ˆ The algorithm not only ignores |∇f | but also its exact direction! Only the gradient
signs in each coordinate are relevant. Therefore, the step directions may differ up
to < 90◦ from −∇f .

ˆ It often works surprisingly efficiently and robustly.



Algorithm 6 Rprop
Input: initial x ∈ Rn , function f (x), ∇f (x), initial stepsize α, tolerance θ
Output: x
1: initialize x = x0 , all αi = α, all gi = 0
2: repeat
3: g ← ∇f (x)
4: x0 ← x
5: for i = 1 : n do
6: if gi gi0 > 0 then // same direction as last time
7: αi ← 1.2αi
8: xi ← xi − αi sign(gi )
9: gi0 ← gi
10: else if gi gi0 < 0 then // change of direction
11: αi ← 0.5αi
12: xi ← xi − αi sign(gi )
13: gi0 ← 0 // force last case next time
14: else
15: xi ← xi − αi sign(gi )
16: gi0 ← gi
17: end if
18: optionally: cap αi ∈ [αmin xi , αmax xi ]
19: end for
0
20: until |x − x| < θ for 10 iterations in sequence

ˆ If you like, have a look at:


Christian Igel, Marc Toussaint, W. Weishui (2005): Rprop using the natural gradient
compared to Levenberg-Marquardt optimization. In Trends and Applications in Con-
structive Approximation. International Series of Numerical Mathematics, volume 151,
259-272.

4.2 The general optimization problem – a mathematical program


Definition 4.1. Let x ∈ Rn , f : Rn → R, g : Rn → Rm , h : Rn → Rl . An
optimization problem, or mathematical program, is

min f (x)
x
s.t. g(x) ≤ 0 , h(x) = 0

We typically at least assume f, g, h to be differentiable or smooth.


Get an intuition about this problem formulation by considering the following examples. Always discuss where the optimum is, and, at the optimum, how the objective f pulls at the point while the constraints g or h push against it.

Figure 4: 2D example: f (x, y) = −x, pulling constantly to the right; three inequality
constraints, two active, one inactive. The “pull/push” vectors fulfil the stationarity
condition ∇f + λ1 ∇g1 + λ2 ∇g2 = 0.

For the following examples, draw the situation and guess, without much maths, where
the optimum is:

ˆ A 1D example: x ∈ R, h(x) = sin(x), g(x) = x2 /4 − 1, f some non-linear


function.
ˆ 2D example: f (x, y) = x (intuition: the objective is constantly pulling to the left),
h(x, y) = 0 is some non-linear path in the plane → the optimum is at the left-
most tangent-point of h. Tangent-point means that the tangent of h is vertical.
h pushes to the right (always orthogonal to the zero-line of h).
ˆ 2D example: f (x, y) = x, g(x, y) = y 2 − x − 1. The zero-line of g is a parabola
towards the right. The objective f pulls into this parabola; the optimum is in the
’bottom’ of the parabola, at (−1, 0).
ˆ 2D example: f (x, y) = x, g(x, y) = x2 + y 2 − 1. The zero-line of g is a circle.
The objective f pulls to the left; the optimum is at the left tangent-point of the
circle, at (−1, 0).
ˆ Figure 4

4.3 The KKT conditions


Theorem 4.2 (Karush-Kuhn-Tucker conditions). Given a mathematical program,
$$x \text{ optimal} \;\Rightarrow\; \exists\, \lambda \in \mathbb{R}^m, \kappa \in \mathbb{R}^l \text{ s.t.}$$
$$\nabla f(x) + \sum_{i=1}^m \lambda_i \nabla g_i(x) + \sum_{j=1}^l \kappa_j \nabla h_j(x) = 0 \qquad \text{(stationarity)}$$
$$\forall j : h_j(x) = 0 , \quad \forall i : g_i(x) \le 0 \qquad \text{(primal feasibility)}$$
$$\forall i : \lambda_i \ge 0 \qquad \text{(dual feasibility)}$$
$$\forall i : \lambda_i g_i(x) = 0 \qquad \text{(complementarity)}$$

Note that these are, in general, only necessary conditions. Only in special cases, e.g.
convex, these are also sufficient.
These conditions should be intuitive in the previous examples:

ˆ The first condition describes the “force balance” of the objective pulling and the active constraints pushing back. The existence of dual parameters λ, κ could implicitly be expressed by stating
$$\nabla f(x) \in \mathrm{span}(\{\nabla g_{1..m}, \nabla h_{1..l}\}) \qquad (156)$$
The specific values of λ and κ tell us how strongly the constraints push against the objective, e.g., $\lambda_i |\nabla g_i|$ is the force exerted by the ith inequality.
ˆ The fourth condition very elegantly describes the logic of inequality constraints
being either active (λi > 0, gi = 0) or inactive (λi = 0, gi ≤ 0). Intuitively it says:
An inequality can only push at the boundary, where gi = 0, but not inside the
feasible region, where gi < 0. The trick of using the equation λi gi = 0 to express
this logic is beautiful, especially when later we discuss a case which relaxes this
strict logic to λi gi = −µ for some small µ—which roughly means that inequalities
may push a little also inside the feasible region.
ˆ Special case m = l = 0 (no constraints). The first condition is just the usual
∇f (x) = 0.
ˆ Discuss the previous examples as special cases; and how the force balance is met.

4.4 Unconstrained problems to tackle a constrained problem

Assume you’d know about basic unconstrained optimization methods (like standard gra-
dient descent or the Newton method) but nothing about constrained optimization meth-
ods. How would you solve a constrained problem? Well, I think you’d very quickly have
the idea to introduce extra cost terms for the violation of constraints—a million people
have had this idea and successfully applied it in practice.
In the following we define a new cost function F (x), which includes the objective f (x)
and some extra terms.
Definition 4.2 (Log barrier, squared penalty, Lagrangian, Augmented Lagrangian).
$$F_{sp}(x; \nu, \mu) = f(x) + \nu \sum_j h_j(x)^2 + \mu \sum_i [g_i(x) > 0]\, g_i(x)^2 \qquad \text{(sqr. penalty)}$$
$$F_{lb}(x; \mu) = f(x) - \mu \sum_i \log(-g_i(x)) \qquad \text{(log barrier)}$$
$$L(x, \lambda, \kappa) = f(x) + \sum_j \kappa_j h_j(x) + \sum_i \lambda_i g_i(x) \qquad \text{(Lagrangian)}$$
$$\hat L(x) = f(x) + \sum_j \kappa_j h_j(x) + \sum_i \lambda_i g_i(x) + \nu \sum_j h_j(x)^2 + \mu \sum_i [g_i(x) > 0]\, g_i(x)^2 \qquad \text{(Aug. Lag.)}$$

Figure 5: The function −µ log(−g) (with g on the “x-axis”) for various µ. This is always undefined (“∞”) for g > 0. For µ → 0 this becomes the hard step function.

ˆ The squared penalty method is straight-forward if we have an algorithm to mini-


mize F (x). We initialize ν = µ = 1, minimize F (x), then increase ν, µ (multiply
with a number > 1) and iterate. For ν, µ → ∞ we retrieve the optimum.

ˆ The log barrier method (see Fig. 5) does exactly the same, except that we decrease µ towards zero (multiply with a number < 1 in each iteration). Note that we need a feasible initialization x0, because otherwise the barriers are ill-defined! The whole algorithm will keep the temporary solutions always inside the feasible region (because the barriers push away from the constraints). That’s why it is also called an interior point method.

ˆ The Lagrangian is a function L(x, λ, κ) which has the gradient


X X
∇L(x, λ, κ) = ∇f (x) + λi ∇gi (x) + κj ∇hj (x) . (157)
i j

That is, ∇L(x, λ, κ) = 0 is our first KKT condition! In that sense, the additional
terms in the Lagrangian generate the push forces of the constraints. If we knew
the correct λ’s and κ’s beforehand, then we could find the optimal x by the
unconstrained problem minx L(x, λ, κ) (if this has a unique solution).
Maths for Intelligent Systems, Marc Toussaint 61

ˆ The Augmented Lagrangian L̂ is a function that includes both, squared penalties,


and Lagrangian terms that push proportional to λ, κ. The Augmented Lagrangian
method is an iterative algorithm that, while running, figures out how strongly we
need to push to ensure that the final solution is exactly on the constraints, where
all squared penalties will anyway be zero. It does not need to increase ν, µ and
still converges to the correct solution.
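As an illustration, here is a minimal Python sketch of the squared penalty loop just described. The toy problem (min x1 + x2 s.t. x>x − 1 = 0) and the use of scipy's generic unconstrained minimizer are my own choices for the example:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]
h = lambda x: x[0]**2 + x[1]**2 - 1.0

def F_sp(x, nu):
    return f(x) + nu * h(x)**2           # squared penalty on the equality constraint

x, nu = np.zeros(2), 1.0
for _ in range(6):
    x = minimize(F_sp, x, args=(nu,)).x  # inner unconstrained minimization (warm started)
    nu *= 10.0                           # increase the penalty and iterate
print(x, h(x))   # x approaches (-0.707, -0.707), the constraint violation h(x) -> 0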

4.4.1 Augmented Lagrangian*

This is not a main-stream algorithm, but I like it. See Toussaint (2014).
In the Augmented Lagrangian L̂, the solver has two types of knobs to tune: the strengths
of the penalties ν, µ and the strengths of the Lagrangian forces λ, κ. The trick is
conceptually easy:

• Initially we set λ, κ = 0 and ν, µ = 1 (or some other constant). In the first
  iteration, the unconstrained solver will find x0 = argminx L̂(x); the objective f will
  typically pull into the penalizations.

• For the second iteration we then choose parameters λ, κ that try to avoid that we
  will be pulled into penalizations the next time. Let’s update

κj ← κj + 2µhj (x0 ) , λi ← max(λi + 2µgi (x0 ), 0). (158)

  Note that 2µhj(x0) is the force (gradient) of the equality penalty at x0 ; and
  max(λi + 2µgi(x0), 0) is the force of the inequality constraint at x0 . What this
  update does is: it analyzes the forces exerted by the penalties, and translates
  them to forces exerted by the Lagrange terms in the next iteration. It tries to
  trade the penalizations for the Lagrange terms.
More rigorously, observe that, if f, g, h are linear and the same constraints are
active in two consecutive iterations, then this update will guarantee that all penalty
terms are zero in the second iteration, and therefore the solution fulfils the first
KKT condition (Toussaint, 2014). See also the respective exercise.
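Here is a minimal Python sketch of this update rule (158) for a single equality constraint, with ν, µ kept fixed; the toy problem is the same assumed one as in the squared penalty sketch above:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]
h = lambda x: x[0]**2 + x[1]**2 - 1.0

def L_hat(x, kappa, mu):
    return f(x) + kappa * h(x) + mu * h(x)**2    # Lagrangian term + squared penalty

x, kappa, mu = np.zeros(2), 0.0, 1.0
for _ in range(10):
    x = minimize(L_hat, x, args=(kappa, mu)).x
    kappa = kappa + 2.0 * mu * h(x)   # translate the penalty force into the Lagrange term
print(x, h(x), kappa)   # x -> (-0.707, -0.707), h -> 0, kappa -> ~0.707, with mu fixed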

4.5 The Lagrangian

4.5.1 How the Lagrangian relates to the KKT conditions

The Lagrangian L(x, κ, λ) = f + κ>h + λ>g has a number of properties that relate it
to the KKT conditions:

(i) Requiring a zero-x-gradient, ∇x L = 0, implies the 1st KKT condition.


(ii) Requiring a zero-κ-gradient, 0 = ∇κ L = h, implies primal feasibility (the 2nd KKT
condition) w.r.t. the equality constraints.
(iii) Requiring that L is maximized w.r.t. λ ≥ 0 is related to the remaining 2nd and
      4th KKT conditions:

          max_{λ≥0} L(x, λ) = { f(x)  if g(x) ≤ 0 ;   ∞  otherwise }                        (159)

          λ = argmax_{λ≥0} L(x, λ)  ⇒  { λi = 0  if gi(x) < 0 ;   0 = ∇λi L(x, λ) = gi(x)  otherwise }   (160)

This implies either (λi = 0 ∧ gi (x) < 0) or gi (x) = 0, which is equivalent to the
complementarity and primal feasibility for inequalities.

These three facts show how tightly the Lagrangian is related to the KKT conditions.
To simplify the discussion let us assume only inequality constraints from now on. Fact
(i) tells us that if we minx L(x, λ), we reproduce the 1st KKT condition. Fact (iii) tells
us that if we maxλ≥0 L(x, λ), we reproduce the remaining KKT conditions. Therefore,
the optimal primal-dual solution (x∗ , λ∗ ) can be characterized as a saddle point of the
Lagrangian. Finding the saddle point can be written in two ways:

Definition 4.3 (primal and dual problem).

    min_x max_{λ≥0} L(x, λ)        (primal problem)

    max_{λ≥0} min_x L(x, λ)        (dual problem),

where l(λ) := min_x L(x, λ) is called the dual function.

Convince yourself, using (159), that the first expression is indeed the original primal
problem minx f(x) s.t. g(x) ≤ 0.

What can we learn from this? The KKT conditions state that, at an optimum, there
exist some λ, κ. This existence statement is not very helpful to actually find them. In
contrast, the Lagrangian tells us directly how the dual parameters can be found: by
maximizing w.r.t. them. This can be exploited in several ways:

4.5.2 Solving mathematical programs analytically, on paper.

Consider the problem

    min_{x∈R²} x>x   s.t.   x1 + x2 = 1 .                                    (161)

We can find the solution analytically via the Lagrangian:

    L(x, κ) = x>x + κ(x1 + x2 − 1)                                           (162)
    0 = ∇x L(x, κ) = 2x + κ (1, 1)>   ⇒   x1 = x2 = −κ/2                     (163)
    0 = ∇κ L(x, κ) = x1 + x2 − 1 = −κ/2 − κ/2 − 1   ⇒   κ = −1               (164)
    ⇒  x1 = x2 = 1/2                                                         (165)

Here we first formulated the Lagrangian. In this context, κ is often called Lagrange
multiplier, but I prefer the term dual variable. Then we find a saddle point of L by
requiring 0 = ∇x L(x, κ), 0 = ∇κ L(x, κ). If we want to solve a problem with an inequality
constraint, we do the same calculus for both cases: 1) the constraint is active (handled
like an equality constraint), and 2) the constraint is inactive. Then we check if the
inactive case solution is feasible, or the active case is dual-feasible (λ ≥ 0). Note that
if we have m inequality constraints we have to analytically evaluate every combination
of constraints being active/inactive—which are 2^m cases. This already hints at the fact
that a real difficulty in solving mathematical programs is to find out which inequality
constraints are active or inactive. In fact, if we knew this a priori, everything would
reduce to an equality constrained problem, which is much easier to solve.

4.5.3 Solving the dual problem, instead of the primal.

In some cases the dual function l(λ) = minx L(x, λ) can analytically be derived. In this
case it makes very much sense to try solving the dual problem instead of the primal.
First, the dual problem maxλ≥0 l(λ) is guaranteed to be convex even if the primal is
non-convex. (The dual function l(λ) is concave, and the constraint λ ≥ 0 convex.)
But note that l(λ) is itself defined as the result of a generally non-convex optimization
problem minx L(x, λ). Second, the inequality constraints of the dual problem are very
simple: just λ ≥ 0. Such inequality constraints are called bound constraints and can
be handled with specialized methods.
However, in general minx maxy f(x, y) ≠ maxy minx f(x, y). For example, in a discrete
domain x, y ∈ {1, 2}, let f(1, 1) = 1, f(1, 2) = 3, f(2, 1) = 4, f(2, 2) = 2. Then
minx f(x, y) = (1, 2) and maxy f(x, y) = (3, 4), so maxy minx f = 2 < 3 = minx maxy f.
Therefore, the dual problem is in general not equivalent to the primal.
The dual function is, for λ ≥ 0, a lower bound

    l(λ) = min_x L(x, λ)  ≤  min_x [ f(x) s.t. g(x) ≤ 0 ] .                   (166)

And consequently

    (dual)   max_{λ≥0} min_x L(x, λ)   ≤   min_x max_{λ≥0} L(x, λ)   (primal)          (167)

We say strong duality holds iff

    max_{λ≥0} min_x L(x, λ) = min_x max_{λ≥0} L(x, λ)                                  (168)

If the primal is convex, and there exists an interior point

    ∃x : ∀i : gi(x) < 0                                                      (169)

(which is called the Slater condition), then we have strong duality.

4.5.4 Finding the “saddle point” directly with a primal-dual Newton method.

In basic unconstrained optimization an efficient way to find an optimum (minimum or
maximum) is to find a point where the gradient is zero with a Newton method. At
saddle points all gradients are also zero. So, to find a saddle point of the Lagrangian we
can equally use a Newton method that seeks roots of the gradient. Note that such
a Newton method optimizes in the joint primal-dual space of (x, λ, κ).
In the case of inequalities, the zero-gradients view is over-simplified: While facts (i) and
(ii) characterize a saddle point in terms of zero gradients, fact (iii) makes this more
precise to handle the inequality case. For this reason it is actually easier to describe the
primal-dual Newton method directly in terms of the KKT conditions: We seek a point
(x, λ, κ), with λ ≥ 0, that solves the equation system

    ∇x f(x) + λ>∂x g + κ>∂x h = 0                                            (170)
    h(x) = 0                                                                 (171)
    diag(λ) g(x) + µ 1m = 0                                                  (172)

Note that the first equation is the 1st KKT, the 2nd is the 2nd KKT w.r.t. equalities,
and the third is the approximate 4th KKT with log barrier parameter µ (see below).
These three equations reflect the saddle point properties (facts (i), (ii), and (iii) above).
We define

    r(x, λ, κ) = ( ∇f(x) + λ>∂g(x) + κ>∂h(x) ,   h(x) ,   diag(λ) g(x) + µ 1m )>        (173)
and use the Newton method

    (x, λ, κ)  ←  (x, λ, κ) − α [∂r(x, λ, κ)]⁻¹ r(x, λ, κ)                               (174)

to find the root r(x, λ, κ) = 0 (α is the stepsize). We have


    ∂r(x, λ, κ) = [ ∇²f(x) + ∑_i λi ∇²gi(x) + ∑_j κj ∇²hj(x) ,   ∂g(x)> ,      ∂h(x)> ;
                    ∂h(x) ,                                      0 ,           0 ;
                    diag(λ) ∂g(x) ,                              diag(g(x)) ,  0 ]      (175)

where ∂r(x, λ, κ) ∈ R^{(n+m+l)×(n+m+l)}. Note that this method uses the Hessians
∇²f, ∇²g and ∇²h.

The above formulation allows for a duality gap µ. One could choose µ = 0, but often
that is not robust. The beauty is that we can adapt µ on the fly, before each Newton
step, so that we do not need a separate outer loop to adapt µ.
Before computing a Newton step, we compute the current duality measure
µ̃ = −(1/m) ∑_{i=1}^m λi gi(x). Then we set µ = µ̃/2, half of this. In this way, the Newton
step will compute a direction that aims to halve the current duality gap. In practice,
this leads to good convergence in a single-loop Newton method. (See also Boyd sec 11.7.3.)
The dual feasibility λi ≥ 0 needs to be handled explicitly by the root finder – the line
search can simply clip steps to stay within the bound constraints.
Typically, the method is called “interior primal-dual Newton”, in which case also the
primal feasibility gi ≤ 0 has to be ensured. But I found there are tweaks to make the
method also handle infeasible x, including infeasible initializations.
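The following is a rough Python sketch of such a primal-dual Newton iteration for a single inequality constraint and no equalities; the toy problem, the fixed damping, and the absence of a proper line search and feasibility handling are simplifications of mine:

import numpy as np

# assumed toy problem: min x_1 + x_2  s.t.  g(x) = x^T x - 1 <= 0   (n=2, m=1, l=0)
def r(x, lam, mu):
    grad_f = np.array([1.0, 1.0])
    g, grad_g = x @ x - 1.0, 2.0 * x
    return np.concatenate([grad_f + lam * grad_g,        # (170) stationarity
                           [lam * g + mu]])              # (172) relaxed complementarity

def dr(x, lam, mu):
    g, grad_g = x @ x - 1.0, 2.0 * x
    hess = lam * 2.0 * np.eye(2)                          # nabla^2 f + lam * nabla^2 g
    top = np.hstack([hess, grad_g.reshape(2, 1)])
    bottom = np.hstack([lam * grad_g, [g]])
    return np.vstack([top, bottom])                       # the (n+m) x (n+m) Jacobian (175)

x, lam = np.array([-0.5, -0.5]), 1.0
for _ in range(50):
    mu = -0.5 * lam * (x @ x - 1.0)                       # half the current duality measure
    step = np.linalg.solve(dr(x, lam, mu), -r(x, lam, mu))
    x = x + 0.5 * step[:2]                                # damped Newton step on (x, lam)
    lam = max(lam + 0.5 * step[2], 1e-6)                  # clip to keep lam > 0
print(x, lam)   # approaches (-0.707, -0.707) with lam ~ 0.707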

4.5.5 Log Barriers and the Lagrangian

Finally, let’s revisit the log barrier method. In principle it is very simple: For a given µ,
we use an unconstrained solver to find the minimum x∗ (µ) of
    F(x; µ) = f(x) − µ ∑_i log(−gi(x)) .                                     (176)

(This process is also called “centering”.) We then gradually decrease µ to zero, always
calling the inner loop to recenter. The generated path of x∗ (µ) is called central path.
The method is simple and has very insightful relations to the KKT conditions and the
dual problem. For given µ, the optimality condition is
    ∇F(x; µ) = 0   ⇒   ∇f(x) − ∑_i (µ / gi(x)) ∇gi(x) = 0                    (177)
                   ⇔   ∇f(x) + ∑_i λi ∇gi(x) = 0 ,   λi gi(x) = −µ           (178)

where we defined(!) λi = −µ/gi(x), which guarantees λi ≥ 0 as long as we are in the
interior (gi < 0).
So, ∇F(x; µ) = 0 is equivalent to the modified (=approximate) KKT conditions,
where the complementarity is relaxed: inequalities may push also inside the feasible
region. For µ → 0 we converge to the exact KKT conditions with strict complementarity.
So µ has the interpretation of a relaxation of complementarity. We can derive another
interpretation of µ in terms of suboptimality or the duality gap:
Let x∗(µ) = argminx F(x; µ) be the central path. At each x∗ we may define, as above,
λi = −µ/gi(x∗). We note that λ ≥ 0 (dual feasible), as well as that x∗(µ) minimizes the
Lagrangian L(x, λ) w.r.t. x! This is because


    0 = ∇F(x, µ) = ∇f(x) + ∑_{i=1}^m λi ∇gi(x) = ∇x L(x, λ) .                (179)

Therefore, x∗ is actually the solution to minx L(x, λ), which defines the dual function.
We have

    l(λ) = min_x L(x, λ) = f(x∗) + ∑_{i=1}^m λi gi(x∗) = f(x∗) − mµ .        (180)

(m is simply the count of inequalities.) That is, mµ is the duality gap between the
(suboptimal) f (x∗ ) and l(λ). Further, given that the dual function is a lower bound,
l(λ) ≤ p∗ , where p∗ = minx f (x) s.t. g(x) ≤ 0 is the optimal primal value, we have
f (x∗ ) − p∗ ≤ mµ . (181)

This gives the interpretation of mµ as an upper bound on the suboptimality of f(x∗).
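A minimal Python sketch of the log barrier method, including the λi = −µ/gi(x) dual estimates and the mµ duality gap, might look as follows (the toy problem and the use of a generic derivative-free inner solver are my choices for illustration):

import numpy as np
from scipy.optimize import minimize

# assumed toy problem: min x_1 + x_2  s.t.  g(x) = x^T x - 1 <= 0   (m = 1 inequality)
f = lambda x: x[0] + x[1]
g = lambda x: x @ x - 1.0

def F(x, mu):
    if g(x) >= 0:                         # outside the feasible region the barrier is undefined
        return np.inf
    return f(x) - mu * np.log(-g(x))

x, mu = np.zeros(2), 1.0
for _ in range(15):
    x = minimize(F, x, args=(mu,), method='Nelder-Mead').x   # "centering" step
    lam = -mu / g(x)                                         # dual variable on the central path
    mu *= 0.5                                                # shrink mu towards zero
print(x, lam, -lam * g(x))   # x* ~ (-0.707, -0.707); the duality gap m*mu equals -lam*g(x)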

4.6 Convex Problems

We do not put much emphasis on discussing convex problems in this lecture. The
algorithms we discussed so far apply equally to general non-linear programs and to
convex problems—of course, only for convex problems do we have convergence guarantees,
as we can see from the convergence rate analysis of Wolfe steps based on the assumption
of positive upper and lower bounds on the Hessian’s eigenvalues.
Nevertheless, we at least define standard LPs, QPs, etc. Perhaps the most interesting
part is the discussion of the Simplex algorithm—not because the algorithm is nice or
particularly efficient, but rather because one gains a lot of insights in what actually
makes (inequality) constrained problems hard.

4.6.1 Convex sets, functions, problems

Definition 4.4 (Convex set, convex function). A set X ⊆ V (a subset of some
vector space V ) is convex iff

    ∀x, y ∈ X, a ∈ [0, 1] :  ax + (1−a)y ∈ X                                 (182)

A function f is defined

    convex       ⇔  ∀x, y ∈ Rn , a ∈ [0, 1] :  f(ax + (1−a)y) ≤ a f(x) + (1−a) f(y)      (183)
    quasiconvex  ⇔  ∀x, y ∈ Rn , a ∈ [0, 1] :  f(ax + (1−a)y) ≤ max{f(x), f(y)}          (184)

Note: quasiconvex ⇔ for any α ∈ R the sublevel set {x|f (x) ≤ α} is convex.
Further, I call a function unimodal if it has only one local minimum, which is the global
minimum.

Definition 4.5 (Convex program).


Variant 1: A mathematical program minx f (x) s.t. g(x) ≤ 0, h(x) = 0 is convex
if f is convex and the feasible set is convex.
Variant 2: A mathematical program minx f (x) s.t. g(x) ≤ 0, h(x) = 0 is convex
if f and every gi are convex and h is linear.

Variant 2 is the stronger and usual definition. Concerning variant 1: if the feasible set is
convex, the zero-sublevel sets of all g’s need to be convex and the zero-level sets of the h’s
need to be linear. Above these zero levels the g’s and h’s could in principle be arbitrarily
non-linear, but these non-linearities are irrelevant for the mathematical program itself.
We could replace such g’s and h’s by convex and linear functions and get the same
problem.

4.6.2 Linear and quadratic programs

Definition 4.6 (Linear program (LP), Quadratic program (QP)). Special case mathe-
matical programs are

    Linear Program (LP):     min_x c>x             s.t.  Gx ≤ h, Ax = b
    LP in standard form:     min_x c>x             s.t.  x ≥ 0, Ax = b
    Quadratic Program (QP):  min_x ½ x>Qx + c>x    s.t.  Gx ≤ h, Ax = b ,   Q pos-def

Rarely, also a Quadratically Constrained QP (QCQP) is considered.

An important example for LPs are relaxations of integer linear programs,

    min_x c>x   s.t.  Ax = b,  xi ∈ {0, 1} ,                                 (185)

which includes Travelling Salesman, MaxSAT or MAP inference problems. Relaxing such
a problem means to instead solve the continuous LP

    min_x c>x   s.t.  Ax = b,  xi ∈ [0, 1] .                                 (186)

If one is lucky and the continuous LP problem converges to a fully integer solution, where
all xi ∈ {0, 1}, this is also the solution to the integer problem. Typically, the solution
of the continuous LP will be partially integer (some values converge to the extremes
xi ∈ {0, 1}, while others are in between, xi ∈ (0, 1)). This continuous valued solution
gives a lower bound on the integer problem, and provides very efficient heuristics for
backtracking or branch-and-bound search for a fully integer solution.
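As a small illustration, the following sketch solves the LP relaxation of a tiny 2×2 assignment problem with scipy's linprog; the instance is my own toy example, and here the relaxation happens to return an integer solution:

import numpy as np
from scipy.optimize import linprog

# variables x = (x11, x12, x21, x22), relaxed to xi in [0,1] instead of {0,1}
c = np.array([1.0, 2.0, 3.0, 1.0])            # assignment costs
A_eq = np.array([[1, 1, 0, 0],                # worker 1 does exactly one job
                 [0, 0, 1, 1],                # worker 2 does exactly one job
                 [1, 0, 1, 0],                # job 1 done by exactly one worker
                 [0, 1, 0, 1]])               # job 2 done by exactly one worker
b_eq = np.ones(4)
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 4)
print(res.x, res.fun)    # (1, 0, 0, 1) with cost 2: the relaxation is integer here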
The standard example for a QP are Support Vector Machines. The primal problem is

    min_{β,ξ}  ||β||² + C ∑_{i=1}^n ξi    s.t.   yi (x>i β) ≥ 1 − ξi ,  ξi ≥ 0           (187)

the dual

    l(α, µ) = min_{β,ξ} L(β, ξ, α, µ) = −¼ ∑_{i=1}^n ∑_{i'=1}^n αi αi' yi yi' x̂>i x̂i' + ∑_{i=1}^n αi    (188)

    max_{α,µ}  l(α, µ)   s.t.  0 ≤ αi ≤ C                                                (189)

(See ML lecture 5:13 for a derivation.)


4.6.3 The Simplex Algorithm

Consider an LP. We make the following observations:

• First, in LPs the equality constraints could be resolved simply by introducing
  new coordinates along the zero-hyperplane of h. Therefore, for the conceptual
  discussion we neglect equality constraints.

• The objective constantly pulls in the direction −c = −∇f(x).

• If the solution is bounded there need to be some inequality constraints that keep
  the solution from travelling to ∞ in the −c direction.

• It follows: The solution will always be located at a vertex, that is, an intersection
  point of several zero-hyperplanes of inequality constraints.

• In fact, we should think of the feasible region as a polytope that is defined by
  all the zero-hyperplanes of the inequalities. The inside of the polytope is the feasible
  region. The polytope has edges (intersections of two constraint planes), faces, etc.
  A solution will always be located at a vertex of the polytope; more precisely, there
  could be a whole set of optimal points (on a face orthogonal to c), but at least
  one vertex is also optimal.

• An idea for finding the solution is to walk on the edges of the polytope until
  an optimal vertex is found. This is the simplex algorithm of Georg Dantzig, 1947.
  In practice this procedure is done by “pivoting on the simplex tableau”—but we
  fully skip such details here.
• The simplex algorithm is often efficient, but in the worst case it is exponential in both
  n and m! This is hard to make intuitive, because the effects of high dimensions are
  not intuitive. But roughly, consider that in high dimensions there is a combinatorial
  number of ways of how constraints may intersect and form edges and vertices.

Here is a view that relates much more to our discussion of the log barrier method:
Sitting on an edge/face/vertex is equivalent to temporarily deciding which constraints
are active. If we knew which constraints are eventually active, the problem would be
solved: all inequalities become equalities or void. (And linear equalities can directly
be solved for.) So, jumping along vertices of the polytope is equivalent to sequentially
making decisions on which constraints might be active. Note though that there are 2^m
configurations of active/non-active constraints. The simplex algorithm therefore walks
through this combinatorial space.
Interior point methods do exactly the opposite: Recall that the 4th KKT condition
is λi gi(x) = 0. The log barrier method (for instance) instead relaxes this hard logic
of active/non-active constraints and finds in each iteration a solution to the relaxed 4th
KKT condition λi gi(x) = −µ, which intuitively means that every constraint may be
“somewhat active”. In fact, every constraint contributes somewhat to the stationarity
condition via the log barrier’s gradients. Thereby interior point methods

• postpone the hard decisions about active/non-active constraints

• approach the optimal vertex from the inside of the polytope, avoiding the polytope
  surface (and its hard decisions)

• thereby avoid the need to search through a combinatorial space of constraint
  activities and instead continuously converge to a decision

• have polynomial worst-case guarantees

Historically, penalty and barrier methods were standard before the Simplex Algorithm.
When the simplex algorithm was discovered in the 1950s, it was quickly considered great. But
then, later in the 1970s-80s, a lot more theory was developed for interior point methods,
which now again have become somewhat more popular than the simplex algorithm.

4.6.4 Sequential Quadratic Programming

Just for reference, SQP is another standard approach to solving non-linear mathematical
programs. In each iteration we compute all coefficients of the 2nd-order Taylor
f(x+δ) ≈ f(x) + ∇f(x)>δ + ½ δ>Hδ and the 1st-order Taylor g(x+δ) ≈ g(x) + ∇g(x)>δ,
and then solve the QP

    min_δ  f(x) + ∇f(x)>δ + ½ δ>∇²f(x)δ    s.t.   g(x) + ∇g(x)>δ ≤ 0         (190)

The optimal δ∗ of this problem should be seen as analogous to the optimal Newton step: If
f were a 2nd-order polynomial and g linear, then δ∗ would jump directly to the optimum.
However, as this is generally not the case, δ∗ only gives us a very good direction for line
search. In SQP, we need to backtrack until we find a feasible point at which f decreases
sufficiently.
Solving each QP in the sub-routine requires a constrained solver, which itself might have
two nested loops (e.g. using log-barrier or AugLag). In that case, SQP has three nested
loops.
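Rather than implementing the QP sub-routine, the following sketch simply calls scipy's SLSQP solver, which is an SQP-type method, on an assumed toy problem, to illustrate what such a solver expects as input:

import numpy as np
from scipy.optimize import minimize

# assumed toy problem: min x_1 + x_2  s.t.  x^T x <= 1
res = minimize(fun=lambda x: x[0] + x[1],
               x0=np.zeros(2),
               jac=lambda x: np.array([1.0, 1.0]),
               constraints=[{'type': 'ineq',                  # scipy convention: c(x) >= 0
                             'fun': lambda x: 1.0 - x @ x,
                             'jac': lambda x: -2.0 * x}],
               method='SLSQP')
print(res.x)   # close to (-0.707, -0.707)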

4.7 Blackbox & Global Optimization: It’s all about learning

Even if f, g, h are smooth, the solver might not have access to analytic equations or
efficient numeric methods to evaluate the gradients or hessians of these. Therefore we
distinguish (here neglecting the constraint functions g and h):

Definition 4.7.

ˆ Blackbox optimization: Only f (x) can be evaluated.

ˆ 1st-order/gradient optimization: Only f (x) and ∇f (x) can be evaluated.

ˆ Quasi-Newton optimization: Only f (x) and ∇f (x) can be evaluated, but the
solver does tricks to estimate ∇2 f (x). (So this is a special case of 1st-order
optimization.)
ˆ Gauss-Newton type optimization: f is of the special form f (x) = φ(x)>φ(x)

and ∂x φ(x) can be evaluated.

ˆ 2nd order optimization: f (x), ∇f (x) and ∇2 f (x) can be evaluated.

In this lecture I very briefly want to add comments on global blackbox optimization.
Global means that we now, for the first time, aim to find the global optimum (within
some pre-specified bounded range). In essence, to address such a problem we need to
explicitly know what we know about f 7 , and an obvious way to do this is to use Bayesian
learning.

4.7.1 A sequential decision problem formulation

From now on, let’s neglect constraints and focus on the mathematical program

    min_x f(x)                                                               (191)

7 Cf. the KWIK (knows what it knows) framework.



for a blackbox function f . The optimization process can be viewed as a Markov De-
cision Process that describes the interaction of the solver (agent) with the function
(environment):

• At step t, Dt = {(xi, yi)}_{i=1}^{t-1} is the data that the solver has collected from previous
  samples. This Dt is the state of the MDP.

• At step t, the solver may choose a new decision xt about where to sample next.

• Given state Dt and decision xt, the next state is Dt+1 = Dt ∪ {(xt, f(xt))}, which
  is a deterministic transition given the function f.

• A solver policy is a mapping π : Dt ↦ xt that maps any state (of knowledge) to
  a new decision.

• We may define an optimal solver policy as

      π∗ = argmin_π ⟨yT⟩ = argmin_π ∫_f P(f) P(DT | π, f) yT                  (192)

  where P(DT | π, f) is deterministic, and P(f) is a prior over functions.
  This objective function cares only about the last value yT sampled by the solver for
  a fixed time horizon (budget) T. Alternatively, we may choose objectives ∑_{t=1}^T yt
  or ∑_{t=1}^T γ^t yt for some discounting γ ∈ [0, 1].

The above defined what is an optimal solver! Something we haven’t touched at all
before. The transition dynamics of this MDP is deterministic, given f. However, from
the perspective of the solver, we do not know f a priori. But we can always compute
a posterior belief P(f | Dt) = P(Dt | f) P(f)/P(Dt). This posterior belief defines a
belief MDP with stochastic transitions

    P(Dt+1) = ∫_{Dt} ∫_f ∫_{xt} [Dt+1 = Dt ∪ {(xt, f(xt))}] π(xt | Dt) P(f | Dt) P(Dt) .   (193)

The belief MDP’s state space is P (Dt ) (or equivalently, P (f |Dt ), the current belief over
f ). This belief MDP is something that the solver can, in principle, forward simulate—
it has all information about it. One can prove that, if the solver could solve its own
belief MDP (find an optimal policy for its belief MDP), then this policy is the optimal
solver policy for the original problem given a prior distribution P (f )! So, in principle we
not only defined what is an optimal solver policy, but can also provide an algorithm to
compute it (Dynamic Programming in the belief MDP)! However, this is so expensive to
compute that heuristics need to be used in practice.
One aspect we should learn from this discussion: The solver’s optimal decision is based
on its current belief P (f |Dt ) over the function. This belief is the Bayesian representation
of everything one could possibly have learned about f from the data Dt collected so far.
Bayesian Global Optimization methods compute P (f |Dt ) in every step and, based on
this, use a heuristic to choose the next decision.

4.7.2 Acquisition Functions for Bayesian Global Optimization*

In practice one typically uses a Gaussian Process representation of P(f | Dt). This means
that in every iteration we have an estimate f̂(x) of the function mean and a variance
estimate σ̂(x)² that describes our uncertainty about the mean estimate. Based on this
we may define the following acquisition functions

Definition 4.8. Probability of Improvement (MPI)

    αt(x) = ∫_{−∞}^{y∗} N(y | f̂(x), σ̂(x)) dy                                 (194)

Expected Improvement (EI)

    αt(x) = ∫_{−∞}^{y∗} N(y | f̂(x), σ̂(x)) (y∗ − y) dy                        (195)

Upper Confidence Bound (UCB)

    αt(x) = −f̂(x) + βt σ̂(x)                                                  (196)

Predictive Entropy Search (Hernández-Lobato et al., 2014)

    αt(x) = H[p(x∗ | Dt)] − E_{p(y|Dt;x)} H[p(x∗ | Dt ∪ {(x, y)})]            (197)
          = I(x∗, y | Dt) = H[p(y | Dt, x)] − E_{p(x∗|Dt)} H[p(y | Dt, x, x∗)]

The last one is special; we’ll discuss it below.


These acquisition functions are heuristics that define how valuable it is to acquire data
from the site x. The solver then makes the decision

    xt = argmax_x αt(x) .                                                    (198)

MPI is hardly used in practice anymore. EI is classical, originating back in the 1950s
or earlier; Jones et al. (1998) gives an overview. UCB received a lot of attention
recently due to the underlying bandit theory and bounded-regret theorems based on
submodularity. But I think that in practice EI and UCB perform about equally, with UCB
being somewhat easier to implement and more intuitive.
In all cases, note that the solver policy xt = argmaxx αt(x) requires to internally solve
another non-linear optimization problem. However, αt is an analytic function for which
we can compute gradients and Hessians, which ensures very efficient local convergence.
But again, xt = argmaxx αt(x) needs to be solved globally—otherwise the solver will
also not solve the original problem properly and globally. As a consequence, the opti-
mization of the acquisition function needs to be restarted from many potential
start points close to potential local minima; typically from grid(!) points over the full
domain range. The number of grid points is exponential in the problem dimension n.
Therefore, this inner loop can be very expensive.
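For concreteness, here is a minimal GP-UCB sketch in one dimension using scikit-learn's GP regression; the test function, the fixed kernel width, the constant βt, and the naive grid maximization of the acquisition function are all simplifying choices of mine:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: np.sin(3.0 * x) + 0.5 * x           # unknown blackbox, to be minimized
grid = np.linspace(-2.0, 2.0, 400).reshape(-1, 1)
X, y = [np.array([0.0])], [f(0.0)]                # initial sample

for t in range(15):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), optimizer=None)
    gp.fit(np.array(X), np.array(y))
    mean, std = gp.predict(grid, return_std=True)
    ucb = -mean + 2.0 * std                       # alpha_t(x) = -f_hat(x) + beta_t * sigma_hat(x)
    x_next = grid[np.argmax(ucb)]                 # next decision x_t, found on a grid
    X.append(x_next)
    y.append(f(x_next[0]))

best = int(np.argmin(y))
print(X[best], y[best])                           # best sample found (global min is near x = -0.58)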

And a subjective note: This all sounds great, but be aware that Gaussian Processes with
standard squared-exponential kernels do not generalize much in high dimensions: one
roughly needs exponentially many data points to fully cover the domain and reduce belief
uncertainty globally, almost as if we were sampling from a grid with grid size equal to the
kernel width. So, the whole approach is not magic. It just does what is possible given a
belief P (f ). It would be interesting to have much more structured (and heteroscedastic)
beliefs specifically for optimization.
The last acquisition function is called Predictive Entropy Search. This formulation
is beautiful: We sample at places x where the (expected) observed value y informs us
as much as possible about the optimum x∗ of the function! Formally, this means to
maximize the mutual information between y and x∗ , in expectation over y|x.

4.7.3 Classical model-based blackbox optimization (non-global)*

A last method very worth mentioning: Classical model-based blackbox optimization
simply fits a local polynomial model to the recent data and takes this as a basis for search.
This is similar to BFGS, but now for the blackbox case where we do not even observe
gradients. See Algorithm 7.
The local fitting of a polynomial model is again a Machine Learning method. Whether
this gives a good function approximation for optimization depends on the quality of the data Dt
used for this approximation. Classical model-based optimization has interesting heuris-
tics to evaluate the data quality as well as to sample new points to improve the data
quality. Here is a rough algorithm (following Nocedal et al.’s section on “Derivative-free
optimization”):
Algorithm 7 Classical model-based optimization
 1: Initialize D with at least ½ (n + 1)(n + 2) data points
 2: repeat
 3:   Compute a regression f̂(x) = φ2(x)>β on D
 4:   Compute δ = argmin_δ f̂(x̂ + δ) s.t. |δ| < α
 5:   if f(x̂ + δ) < f(x̂) − ϱls [f(x̂) − f̂(x̂ + δ)] then           // test sufficient decrease
 6:     Increase the stepsize α
 7:     Accept x̂ ← x̂ + δ
 8:     Add to data, D ← D ∪ {(x̂, f(x̂))}
 9:   else                                                         // no sufficient decrease
10:     if det(D) is too small then                                // blame the data quality
11:       Compute x+ = argmax_{x'} det(D ∪ {x'}) s.t. |x − x'| < α
12:       Add to data, D ← D ∪ {(x+, f(x+))}
13:     else                                                       // blame the stepsize
14:       Decrease the stepsize α
15:     end if
16:   end if
17:   Perhaps prune the data, e.g., remove argmax_{x∈D} det(D \ {x})
18: until x converges

Some notes:

• Line 4 implements an explicit trust region approach, with hard bound α on the
  step size.

• Line 5 is like the Wolfe condition. But here, the expected decrease is [f(x̂) −
  f̂(x̂ + δ)] instead of −αδ∇f(x).

• If there is no sufficient decrease we may blame it on two reasons: bad data or a
  too large stepsize.

• Line 10 uses the data determinant as a measure of quality! This is meant in the
  sense of linear regression on polynomial features. Note that, with data matrix
  X ∈ R^{n×dim(β)}, β̂ls = (X>X)⁻¹X>y is the optimal regression. The determinant
  det(X>X) or det(X) = det(D) is a measure for how well the data supports the
  regression. If the determinant is zero, the regression problem is ill-defined. The
  larger the determinant, the lower the variance of the regression estimator.

• Line 11 is an explicit exploration approach: We add a data point solely for the
  purpose of increasing the data determinant (increasing the data spread). Interest-
  ing. Nocedal describes in more detail a geometry-improving procedure to update
  D.

4.7.4 Evolutionary Algorithms*

There are interesting and theoretically well-grounded evolutionary algorithms for opti-
mization, such as Estimation-of-Distribution Algorithms (EDAs). But generally, don’t
use them as first choice.

4.8 Examples and Exercises

4.8.1 Convergence proof

a) Given a function f : Rn → R with fmin = minx f(x). Assume that its Hessian—
that is, the eigenvalues of ∇²f—are lower bounded by m > 0 and upper bounded by
M > m, with m, M ∈ R. Prove that for any x ∈ Rn it holds

    f(x) − (1/2m) |∇f(x)|²  ≤  fmin  ≤  f(x) − (1/2M) |∇f(x)|² .

Tip: Start with bounding f(x) between the functions with maximal and minimal curva-
ture. Then consider the minima of these bounds. Note, it also follows:

    |∇f(x)|² ≥ 2m (f(x) − fmin) .
b) Consider backtracking line search with Wolfe parameter ϱls ≤ ½, and step decrease
factor ϱα⁻. First prove that line search terminates the latest when ϱα⁻/M ≤ α ≤ 1/M, and
that it then found a new point y for which

    f(y) ≤ f(x) − (ϱls ϱα⁻ / M) |∇f(x)|² .

From this, using the result from a), prove the convergence equation

    f(y) − fmin ≤ [ 1 − 2m ϱls ϱα⁻ / M ] (f(x) − fmin) .
M

4.8.2 Backtracking Line Search

Consider the functions

    fsq(x) = x>Cx ,                                                          (199)
    fhole(x) = 1 − exp(−x>Cx) .                                              (200)

with diagonal matrix C and entries C(i, i) = c^{(i−1)/(n−1)}, where n is the dimensionality of x.
We choose a conditioning⁸ c = 10. To plot the function for n = 2, you can use gnuplot,
calling

set isosamples 50,50


set contour
f(x,y) = x*x+10*y*y
#f(x,y) = 1 - exp(-x*x-10*y*y)
splot [-1:1][-1:1] f(x,y)

a) Implement gradient descent with backtracking, as described on page 42 (Algorithm
2, Plain gradient descent). Test the algorithm on fsq(x) and fhole(x) with start point
x0 = (1, 1). To judge the performance, create the following plots:
– The function value over the number of function evaluations.
– For n = 2, the function surface including algorithm’s search trajectory. If using gnuplot,
store every evaluated point x and function value f (x) in a line (with n + 1 entries) in a file
’path.dat’, and plot using
unset contour
splot [-3:3][-3:3] f(x,y), ’path.dat’ with lines

8 The word “conditioning” generally denotes the ratio of the largest and smallest Eigenvalue of the

Hessian.

b) Play around with parameters. How does the performance change for higher dimen-
sions, e.g., n = 100? How does the performance change with ρls (the Wolfe stop
criterion)? How does the alternative in step 3 work?
c) Newton step: Modify the algorithm simply by multiplying C -1 to the step. How does
that work?
(The Newton direction diverges (is undefined) in the concave part of fhole (x). We’re
cheating here when always multiplying with C -1 to get a good direction.)

4.8.3 Gauss-Newton

In x ∈ R² consider the function

    f(x) = φ(x)>φ(x) ,   φ(x) = ( sin(a x1), sin(a c x2), 2 x1, 2 c x2 )>
The function is plotted above for a = 4 (left) and a = 5 (right, having local minima),
and conditioning c = 1. The function is non-convex.
Extend your backtracking method implemented in the last week’s exercise to a Gauss-
Newton method (with constant λ) to solve the unconstrained minimization problem
minx f (x) for a random start point in x ∈ [−1, 1]2 . Compare the algorithm for a = 4
and a = 5 and conditioning c = 3 with gradient descent.

4.8.4 Robust unconstrained optimization

A ’flattened’ variant of the Rosenbrock function is defined as

    f(x) = log[ 1 + (x2 − x1²)² + (1/100)(1 − x2)² ]
and has the minimum at x∗ = (1, 1). For reference, the gradient and hessian are
    g(x) := 1 + (x2 − x1²)² + (1/100)(1 − x2)²                               (201)
    ∂x1 f(x) = (1/g(x)) [ −4 (x2 − x1²) x1 ]                                 (202)
    ∂x2 f(x) = (1/g(x)) [ 2 (x2 − x1²) − (2/100)(1 − x2) ]                   (203)
    ∂²x1 f(x) = −[∂x1 f(x)]² + (1/g(x)) [ 8 x1² − 4 (x2 − x1²) ]             (204)
    ∂²x2 f(x) = −[∂x2 f(x)]² + (1/g(x)) [ 2 + 2/100 ]                        (205)
    ∂x1 ∂x2 f(x) = −[∂x1 f(x)][∂x2 f(x)] + (1/g(x)) [ −4 x1 ]                (206)

a) Use gnuplot to display the function copy-and-pasting the following lines:

set isosamples 50,50


set contour
f(x,y) = log(1+(y-(x**2))**2 + .01*(1-x)**2 ) - 0.01
splot [-3:3][-3:4] f(x,y)

(The ’-0.01’ ensures that you can see the contour at the optimum.) List and discuss at
least three properties of the function (at different locations) that may raise problems to
naive optimizers.
b) Use x = (−3, 3) as starting point for an optimization algorithm. Try to code an
optimization method that uses all ideas mentioned in the lecture. Try to tune it to be
efficient on this problem (without cheating, e.g. by choosing a perfect initial stepsize.)

4.8.5 Lagrangian Method of Multipliers

In a previous exercise we defined the “hole function” f_hole^c(x). Assume conditioning
c = 10 and use the Lagrangian Method of Multipliers to solve on paper the following
constrained optimization problem in 2D:

    min_x f_hole^c(x)   s.t.   h(x) = 0                                      (207)
    h(x) = v>x − 1                                                           (208)

Near the very end, you won’t be able to proceed until you have special values for v. Go
as far as you can without the need for these values.

4.8.6 Equality Constraint Penalties and Augmented Lagrangian

The squared penalty approach to solving a constrained optimization problem minimizes

    min_x  f(x) + µ ∑_{i=1}^m hi(x)² .                                       (209)

The Augmented Lagrangian method adds a Lagrangian term and minimizes

    min_x  f(x) + µ ∑_{i=1}^m hi(x)² + ∑_{i=1}^m λi hi(x) .                  (210)

Assume that we first minimize (209) and end up at a minimum x̂.
Now prove that setting λi = 2µ hi(x̂) will, if we assume that the gradients ∇f(x) and
∇h(x) are (locally) constant, ensure that the minimum of (210) fulfills the constraints
h(x) = 0.

4.8.7 Lagrangian and dual function

(Taken roughly from ‘Convex Optimization’, Ex. 5.1)


Consider the optimization problem

    min_x  x² + 1   s.t.   (x − 2)(x − 4) ≤ 0

with variable x ∈ R.
a) Derive the optimal solution x∗ and the optimal value p∗ = f (x∗ ) by hand.
b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for
various values of λ ≥ 0. Verify the lower bound property minx L(x, λ) ≤ p∗ , where p∗
is the optimum value of the primal problem.
c) Derive the dual function l(λ) = minx L(x, λ) and plot it (for λ ≥ 0). Derive the dual
optimal solution λ∗ = argmaxλ l(λ). Is maxλ l(λ) = p∗ (strong duality)?

4.8.8 Optimize a constrained problem

Consider the following constrained problem

    min_x  ∑_{i=1}^n xi   s.t.   g(x) ≤ 0                                    (211)

    g(x) = ( x>x − 1 ,  −x1 )>                                               (212)

a) First, assume x ∈ R2 is 2-dimensional, and draw on paper what the problem looks
like and where you expect the optimum.
b) Find the optimum analytically using the Lagrangian. Here, assume that you know
apriori that all constraints are active! What are the dual parameters λ = (λ1 , λ2 )?
Note: Assuming that you know a priori which constraints are active is a huge assumption!
In real problems, this is the actual hard (and combinatorial) problem. More on this later
in the lecture.
c) Implement a simple the Log Barrier Method. Tips:
– Initialize x = (½, ½) and µ = 1
– First code an inner loop:
  – In each iteration, first compute the gradient of the log-barrier function. Recall that

        F(x; µ) = f(x) − µ ∑_i log(−gi(x))                                   (213)
        ∇F(x; µ) = ∇f − µ ∑_i (1/gi(x)) ∇gi(x)                               (214)

  – Then perform a backtracking line search along −∇F(x, µ). In particular, backtrack if
    a step goes beyond the barrier (where g(x) ≰ 0 and F(x, µ) = ∞).
  – Iterate until convergence; let’s call the result x∗(µ). Further, compute λ∗(µ) =
    −(µ/g1(x), µ/g2(x)) at convergence.
– Decrease µ ← µ/2, recompute x∗(µ) (with the previous x∗ as initialization) and iterate
  this.

Does x∗ and λ∗ converge to the expected solution?


Note: The path x∗ (µ) = argminx F (x; µ) (the optimum in dependence of µ) is called
central path.

Comment: Solving problems in the real world involves 2 parts:

1) formulating the problem as an optimization problem (conform to a standard opti-
   mization problem category) (→ human)

2) the actual optimization problem (→ algorithm)

These exercises focus on the first type, which is just as important as the second, as it
enables the use of a wider range of solvers. Exercises from Boyd et al.
http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf:

4.8.9 Network flow problem

Solve Exercise 4.12 (pdf page 193) from Boyd & Vandenberghe, Convex Optimization.

4.8.10 Minimum fuel optimal control

Solve Exercise 4.16 (pdf page 194) from Boyd & Vandenberghe, Convex Optimization.

4.8.11 Reformulating an `1 -norm

(This is a subset of Exercise 4.11 (pdf page 193) from Boyd & Vandenberghe.)
n
Let x ∈ RP . The optimization problem is minx ||x − b||1 , where the `1 -norm is defined
n
as ||z||1 = i=1 |zi |. Reformulate this optimization problem as a Linear Program.

4.8.12 Restarts of Local Optima

The following function is essentially the Rastrigin function, but written slightly differently.
It can be tuned to become uni-modal and is a sum-of-squares problem. For x ∈ R² we
define

    f(x) = φ(x)>φ(x) ,   φ(x) = ( sin(a x1), sin(a c x2), 2 x1, 2 c x2 )>

The function is plotted above for a = 4 (left) and a = 5 (right, having local minima),
and conditioning c = 1. The function is non-convex.
Choose a = 6 or larger and implement a random restart method: Repeat initializing
x ∼ U([−2, 2]²) uniformly, followed by a gradient descent (with backtracking line search
and monotone convergence).
Restart the method at least 100 times. Count how often the method converges to which
local optimum.

4.8.13 GP-UCB Bayesian Optimization

Find an implementation of Gaussian Processes for your language of choice (e.g. python:
scikit-learn, or Sheffield/Gpy; octave/matlab: gpml) and implement GP-UCB global
optimization. Test your implementation with different hyperparameters (Find the best
combination of kernel and its parameters in the GP) on the 2D function defined above.
On the webpage you find a starting code to use GP regression in scikit-learn. To install
scikit-learn: https://scikit-learn.org/stable/install.html

5 Probabilities & Information


It is beyond the scope of these notes to give a detailed introduction to probability theory.
There are excellent books:

• Thomas & Cover
• Bishop
• MacKay

Instead, we first recap very basics of probability theory, that I assume the reader has
already seen before. The next section will cover this. Then we focus on specific top-
ics that, in my opinion, deepen the understanding of the basics, such as the relation
between optimization and probabilities, log-probabilities & energies, maxEntropy and
maxLikelihood, minimal description length and learning.

5.1 Basics

First, in case you wonder about justifications of the use of (Bayesian) probabilities versus
fuzzy sets or alike, here are some pointers to look up: 1) Cox’s theorem, which derives from
basic assumptions about “rationality and consistency” the standard probability axioms;
2) t-norms, which generalize probability and fuzzy calculus; and 3) read about objective
vs. subjective Bayesian probability.

5.1.1 Axioms, definitions, Bayes rule

Definition 5.1 (set-theoretic axioms of probabilities).

• An experiment can have multiple outcomes; we call the set of possible outcomes
  sample space or domain S

• A mapping P : A ⊆ S ↦ [0, 1], that maps any subset A ⊆ S to a real
  number, is called probability measure on S iff
  – P(A) ≥ 0 for any A ⊆ S (non-negativity)
  – P(∪i Ai) = ∑i P(Ai) if Ai ∩ Aj = ∅ (additivity)
  – P(S) = 1 (normalization)

• Implications are:
  – 0 ≤ P(A) ≤ 1
  – P(∅) = 0
  – A ⊆ B ⇒ P(A) ≤ P(B)
  – P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  – P(S \ A) = 1 − P(A)

Formally, a random variable X is a mapping X : S → Ω from a measurable space S
(that is, a sample space S that has a probability measure P) to another sample space
Ω, which I typically call the domain dom(X) of the random variable. Thereby, the
mapping X : S → Ω now also defines a probability measure over the domain Ω:

    P(B ⊆ Ω) = P({s : X(s) ∈ B})                                             (215)

In practice we just use the following notations:

Definition 5.2 (Random Variable).

• Let X be a random variable with discrete domain dom(X) = Ω

• P(X = x) ∈ R denotes the specific probability that X = x for some x ∈ Ω

• P(X) denotes the probability distribution (function over Ω)

• ∀x∈Ω : 0 ≤ P(X = x) ≤ 1

• ∑_{x∈Ω} P(X = x) = 1

• We often use the shorthand ∑_X P(X) ··· = ∑_{x∈dom(X)} P(X = x) ···
  when summing over possible values of a RV

If we have two or more random variables, we have

Definition 5.3 (Joint, marginal, conditional, independence, Bayes’ Theorem).

• We denote the joint distribution of two RVs as P(X, Y)

• The marginal is defined as P(X) = ∑_Y P(X, Y)

• The conditional is defined as P(X|Y) = P(X, Y) / P(Y), which fulfils
  ∀Y : ∑_X P(X|Y) = 1.

• X is independent of Y iff P(X, Y) = P(X) P(Y), or equivalently,
  P(X|Y) = P(X).

• The definition of a conditional implies the product rule

      P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)                                    (216)

  and Bayes’ Theorem

      P(X|Y) = P(Y|X) P(X) / P(Y)                                            (217)

  The individual terms in Bayes’ Theorem are typically given names:

      posterior = likelihood · prior / normalization                         (218)

  (Sometimes, the normalization is also called evidence.)

• X is conditionally independent of Y given Z iff P(X|Y, Z) = P(X|Z) or
  P(X, Y|Z) = P(X|Z) P(Y|Z)

5.1.2 Standard discrete distributions

RV                                        parameter                        distribution
Bernoulli    x ∈ {0, 1}                   µ ∈ [0, 1]                       Bern(x | µ) = µ^x (1 − µ)^{1−x}
Beta         µ ∈ [0, 1]                   a, b ∈ R+                        Beta(µ | a, b) = (1/B(a,b)) µ^{a−1} (1 − µ)^{b−1}
Multinomial  x ∈ {1, .., K}               µ ∈ [0, 1]^K, ||µ||₁ = 1         P(x = k | µ) = µk
Dirichlet    µ ∈ [0, 1]^K, ||µ||₁ = 1     α1, .., αK ∈ R+                  Dir(µ | α) ∝ ∏_{k=1}^K µk^{αk−1}

Clearly, the Multinomial is a generalization of the Bernoulli, as the Dirichlet is of the
Beta. The mean of the Dirichlet is ⟨µi⟩ = αi / ∑_j αj, its mode is µ∗i = (αi − 1) / (∑_j αj − K).
The mode of a distribution p(x) is defined as argmax_x p(x).

5.1.3 Conjugate distributions

Definition 5.4 (Conjugacy). Let p(D|x) be a likelihood conditional on a RV x. A


family C of distributions (i.e., C is a space of distributions, like the space of all Beta
distributions) is called conjugate to the likelihood function p(D|x) iff

p(D|x) p(x)
p(x) ∈ C ⇒ p(x|D) = ∈C. (219)
p(D)

The standard conjugates you should know:


RV        likelihood                      conjugate
µ         Binomial Bin(D | µ)             Beta Beta(µ | a, b)
µ         Multinomial Mult(D | µ)         Dirichlet Dir(µ | α)
µ         Gauss N(x | µ, Σ)               Gauss N(µ | µ0, A)
λ         1D Gauss N(x | µ, λ⁻¹)          Gamma Gam(λ | a, b)
Λ         nD Gauss N(x | µ, Λ⁻¹)          Wishart Wish(Λ | W, ν)
(µ, Λ)    nD Gauss N(x | µ, Λ⁻¹)          Gauss-Wishart N(µ | µ0, (βΛ)⁻¹) Wish(Λ | W, ν)

5.1.4 Distributions over continuous domain


Definition 5.5. Let x be a continuous RV. The probability density function (pdf)
p(x) ∈ [0, ∞) defines the probability

    P(a ≤ x ≤ b) = ∫_a^b p(x) dx  ∈ [0, 1]                                   (220)

The cumulative probability distribution F(y) = P(x ≤ y) = ∫_{−∞}^y dx p(x) ∈ [0, 1]
is the cumulative integral with lim_{y→∞} F(y) = 1.
However, I and most others say probability distribution to refer to the probability
density function.

One comment about integrals. If p(x) is a probability density function and f(x) some
arbitrary function, typically one writes

    ∫_x f(x) p(x) dx ,                                                       (221)

where dx denotes the (Borel) measure we integrate over. However, some authors (cor-
rectly) think of a distribution p(x) as being a measure over the space dom(x) (instead
of just a function). So the above notation is actually “double” w.r.t. the measures. So
they might (also correctly) write

    ∫_x p(x) f(x) ,                                                          (222)

and take care that there is exactly one measure to the right of the integral.

5.1.5 Gaussian

Definition 5.6. We define an n-dim Gaussian in normal form as

    N(x | µ, Σ) = (1 / |2πΣ|^{1/2}) exp{ −½ (x − µ)> Σ⁻¹ (x − µ) }           (223)

with mean µ and covariance matrix Σ. In canonical form we define

    N[x | a, A] = (exp{−½ a>A⁻¹a} / |2πA⁻¹|^{1/2}) exp{ −½ x> A x + x>a }    (224)

with precision matrix A = Σ⁻¹ and coefficient a = Σ⁻¹µ (and mean µ = A⁻¹a).

Gaussians are used all over—below we explain in what sense they are the probabilistic
analogue to a parabola (or a 2nd-order Taylor expansion). The most important
properties are:

• Symmetry:  N(x | a, A) = N(a | x, A) = N(x − a | 0, A)

• Product:
  N(x | a, A) N(x | b, B) = N[x | A⁻¹a + B⁻¹b, A⁻¹ + B⁻¹] N(a | b, A + B)
  N[x | a, A] N[x | b, B] = N[x | a + b, A + B] N(A⁻¹a | B⁻¹b, A⁻¹ + B⁻¹)

• “Propagation”:
  ∫_y N(x | a + F y, A) N(y | b, B) dy = N(x | a + F b, A + F B F>)

• Transformation:
  N(F x + f | a, A) = (1/|F|) N(x | F⁻¹(a − f), F⁻¹ A F⁻>)

• Marginal & conditional:
  N( (x, y) | (a, b), [[A, C], [C>, B]] ) = N(x | a, A) · N(y | b + C>A⁻¹(x − a), B − C>A⁻¹C)

More Gaussian identities are found at
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gaussians.pdf
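The product identity above can be checked numerically; the following sketch evaluates both sides at a random point for arbitrarily chosen means and (diagonal) covariances:

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
a, b = rng.normal(size=2), rng.normal(size=2)
A, B = np.diag([1.0, 2.0]), np.diag([0.5, 1.5])      # covariance matrices
x = rng.normal(size=2)

lhs = mvn.pdf(x, a, A) * mvn.pdf(x, b, B)

# convert the canonical form N[x | c, C] back to normal form N(x | C^-1 c, C^-1)
C = np.linalg.inv(A) + np.linalg.inv(B)
c = np.linalg.inv(A) @ a + np.linalg.inv(B) @ b
rhs = mvn.pdf(x, np.linalg.inv(C) @ c, np.linalg.inv(C)) * mvn.pdf(a, b, A + B)

print(lhs, rhs)   # the two numbers agree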

Example 5.1 (ML estimator of the mean of a Gaussian). Assume we have data
D = {x1, .., xn}, each xi ∈ Rn, with likelihood

    P(D | µ, Σ) = ∏_i N(xi | µ, Σ)                                           (225)

    argmax_µ P(D | µ, Σ) = (1/n) ∑_{i=1}^n xi                                (226)

    argmax_Σ P(D | µ, Σ) = (1/n) ∑_{i=1}^n (xi − µ)(xi − µ)>                 (227)

Assume we are initially uncertain about µ (but know Σ). We can express this
uncertainty using again a Gaussian N[µ | a, A]. Given data we have

    P(µ | D) ∝ P(D | µ, Σ) P(µ) = ∏_i N(xi | µ, Σ) N[µ | a, A]               (228)
             = ∏_i N[µ | Σ⁻¹xi, Σ⁻¹] N[µ | a, A] ∝ N[µ | Σ⁻¹ ∑_i xi, nΣ⁻¹ + A]    (229)

Note: in the limit A → 0 (uninformative prior) this becomes

    P(µ | D) = N(µ | (1/n) ∑_i xi, (1/n) Σ)                                  (230)

which is consistent with the Maximum Likelihood estimator.

5.1.6 “Particle distribution”

Usually, “particles” are not listed as standard continuous distribution. However I think
they should be. They’re heavily used in several contexts, especially as approximating
other distributions in Monte Carlo methods and particle filters.

Definition 5.7 (Dirac or δ-distribution). In distribution theory it is proper to define
a distribution δ(x) that is the derivative of the Heaviside step function H(x),

    δ(x) = ∂/∂x H(x) ,   H(x) = [x ≥ 0] .                                    (231)

It is awkward to think of δ(x) as a normal function, as it’d be “infinite” at zero. But
at least we understand that it has the properties

    δ(x) = 0 everywhere except at x = 0 ,   ∫ δ(x) dx = 1 .                  (232)

I sometimes call the Dirac distribution also a point particle: it has all its unit “mass”
concentrated at zero.

Definition 5.8 (Particle Distribution). We define a particle distribution q(x) as
a mixture of Diracs,

    q(x) := ∑_{i=1}^N wi δ(x − xi) ,                                         (233)

which is parameterized by the number N, the locations {xi}_{i=1}^N, xi ∈ Rn, and the
normalized weights {wi}_{i=1}^N, wi ∈ R, ||w||₁ = 1 of the N particles.

We say that a particle distribution q(x) approximates another distribution p(x) iff for
any (smooth) f

    ⟨f(x)⟩_p = ∫_x f(x) p(x) dx ≈ ∑_{i=1}^N wi f(xi)                          (234)

Note the generality of this statement! f could be anything, it could be any features of
the variable x, like coordinates of x, or squares, or anything. So basically this statement
says, whatever you might like to estimate about p, you can approximate it based on the
particles q.
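As a tiny numeric illustration of (234), the following sketch builds an importance-weighted particle approximation of p = N(1, 1) from samples of a broader proposal (the choice of p, proposal and test functions is mine):

import numpy as np

rng = np.random.default_rng(0)
N = 100000
xi = rng.normal(0.0, 2.0, size=N)                         # particle locations from a proposal N(0, 2^2)
logw = -0.5 * (xi - 1.0)**2 - (-0.5 * (xi / 2.0)**2 - np.log(2.0))   # log p(x) - log q(x), up to constants
w = np.exp(logw); w /= w.sum()                            # normalized weights, ||w||_1 = 1

for f in (lambda x: x, lambda x: x**2):
    print(np.sum(w * f(xi)))    # approximates <x>_p = 1 and <x^2>_p = 2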
Computing particle approximations of complex (non-analytical, non-tractable) distribu-
tions p is a core challenge in many fields. The true p could for instance be a distribution
over games (action sequences). The approximation q could for instance be samples gen-
erated with Monte Carlo Tree Search (MCTS). The tutorial An Introduction to MCMC
for Machine Learning www.cs.ubc.ca/~nando/papers/mlintro.pdf gives an excel-
lent introduction. Here are some illustrations of what it means to approximate some
p by particles q, taken from this tutorial. The black line is p, histograms illustrate the
particles q by showing how many of the (uniformly weighted) particles fall into a bin:

(from de Freitas et al.)

5.2 Between probabilities and optimization: neg-log-probabilities,
    exp-neg-energies, exponential family, Gibbs and Boltzmann

There is a natural relation between probabilities and “energy” (or “error”). Namely, if
p(x) denotes a probability for every possible value of x, and E(x) denotes an energy for
state x—or an error one assigns to choosing x—then a natural relation is

p(x) = e−E(x) , E(x) = − log p(x) . (235)

Why is that? First, outside the context of physics it is perfectly fair to just define
axiomatically an energy E(x) as neg-log-probability. But let me try to give some more
arguments for why this is a useful definition.
Let’s assume we have p(x). We want to find a quantity, let’s call it error E(x), which
is a function of p(x). Intuitively, if a certain value x1 is more likely than another,
p(x1) > p(x2), then picking x1 should imply less error, E(x1) < E(x2) (Axiom 1).
Further, when we have two independent random variables x and y, probabilities are
multiplicative, p(x, y) = p(x)p(y). We require axiomatically that error is additive,
E(x, y) = E(x) + E(y). From both follows that E needs to be some logarithm of p!
The same argument, now more talking about energy : Assume we have two independent
(physical) systems x and y. p(x, y) = p(x)p(y) is the probability to find them in certain
states. We axiomatically require that energy is additive, E(x, y) = E(x) + E(y).
Again, E needs to be some logarithm of p. In the context of physics, what could
be questioned is “why is p(x) a function of E(x) in the first place?”. Well, that is
much harder to explain and really is a question about statistical physics. Wikipedia
under keywords “Maxwell-Boltzmann statistics” and “Derivation from microcanonical
ensemble” gives an answer. Essentially the argument is as follows: Given many
molecules in a gas, each of which can have a different energy ei. The total energy
E = ∑_{i=1}^n ei must be conserved. What is the distribution over energy levels that has
the most microstates? The answer is the Boltzmann distribution. (And why do we,
in nature, find energy distributions that have the most microstates? Because these are
most likely.)
Bottom line is: p(x) = e−E(x) , probabilities are multiplicative, energies or errors additive.
Let me state some fact just to underline how useful this way of thinking is:

ˆ Given an energy function E(x), its Boltzmann distribution is defined as

p(x) = e−E(x) . (236)

This is sometimes also called Gibbs distribution.

ˆ In machine learning, when data D is given and we have some model β, we typically
try to maximize the likelihood p(D|β). This is equivalent to minimizing the neg-
log-likelihood

L(β) = − log p(D|β) . (237)

This neg-log-likelihood is a typical measure for error of the model. And this error
is additive w.r.t. the data, whereas the likelihood is multiplicative, fitting perfectly
to the above discussion.

• The Gaussian distribution p(x) ∝ exp{−½ ||x − µ||²/σ²} is related to the error
  E(x) = ½ ||x − µ||²/σ², which is nothing but the squared error with the precision
  matrix as metric. That’s why squared error measures (classical regression) and
  Gaussian distributions (e.g., Bayesian Ridge regression) are directly related.
  A Gaussian is the probabilistic analogue to a parabola.

ˆ The exponential family is defined as

p(x|β) = h(x)g(β) exp{β>φ(x)} (238)



Often h(x) = 1, so let’s neglect this for now. The key point is that the energy
is linear in the features φ(x). This is exactly how discriminative functions (for
classification in Machine learning) are typically formulated.
In the continuous case, the features φ(x) are often chosen as basis polynomials—
just as in polynomial regression. Then, β are the coefficients of the energy poly-
nomial and the exponential family is just the probabilistic analogue to the space
of polynomials.
• When we have many variables x1, .., xn, the structure of a cost function over
  these variables can often be expressed as being additive in terms: f(x1, .., xn) =
  ∑_i φi(x∂i), where ∂i denotes the ith group of variables. The respective Boltzmann
  distribution is a factor graph p(x1, .., xn) ∝ ∏_i fi(x∂i) = exp{∑_i βi φi(x∂i)},
  where again ∂i denotes the ith group of variables.
  So, factor graphs are the probabilistic analogue to additive functions.
• −log p(x) is also the “optimal” coding length you should assign to a symbol x.
  Entropy is expected error: H[p] = ∑_x −p(x) log p(x) = ⟨−log p(x)⟩_{p(x)}, where
  p itself is used to take the expectation.
  Assume you use a “wrong” distribution q(x) to decide on the coding length of sym-
  bols drawn from p(x). The expected length of an encoding is ∫_x p(x)[−log q(x)] ≥
  H(p).
  The Kullback-Leibler divergence is the difference:

      D(p || q) = ∫_x p(x) log (p(x)/q(x)) ≥ 0                               (239)

  Proof of the inequality, using the Jensen inequality:

      −∫_x p(x) log (q(x)/p(x)) ≥ −log ∫_x p(x) (q(x)/p(x)) = 0              (240)

So, my message is that probabilities and error measures are naturally related. However,
in the first case we typically do inference, in the second we optimize. Let’s discuss the
relation between inference and optimization a bit more. For instance, given data D and
parameters β, we may define

Definition 5.9 (ML, MAP, and Bayes estimate). Given data D and a parametric
model p(D|β), we define

• Maximum likelihood (ML) parameter estimate:
  β^ML := argmax_β P(D|β)

• Maximum a posteriori (MAP) parameter estimate:
  β^MAP = argmax_β P(β|D)

• Bayesian parameter estimate:
  P(β|D) ∝ P(D|β) P(β)
  used for Bayesian prediction: P(prediction|D) = ∫_β P(prediction|β) P(β|D)

Both the MAP and the ML estimates are really just optimization problems.
The Bayesian parameter estimate P(β|D), which can then be used to do fully Bayesian
prediction, is in principle different. However, in practice also here optimization is a
core tool for estimating such distributions if they cannot be given analytically. This is
described next.

5.3 Information, Entropy & Kullback-Leibler

Consider the following problem. We have data drawn i.i.d. from p(x) where x ∈ X in
some discrete space X. Let’s call every x a word. The problem is to find a mapping
from words to codes, e.g. binary codes c : X → {0, 1}∗ . The optimal solution is in
principle simple: Sort all possible words in a list, ordered by p(x) with more likely words
going first; write all possible binary codes in another list, with increasing code lengths.
Match the two lists, and this is the optimal encoding.
Let's try to get a more analytical grip on this: Let l(x) = |c(x)| be the actual code
length assigned to word x, which is an integer value. Let's define

    q(x) = (1/Z) 2^{−l(x)}    (241)
with the normalization constant Z = Σ_x 2^{−l(x)}. Then we have

    Σ_{x∈X} p(x)[− log_2 q(x)] = − Σ_x p(x) log_2 2^{−l(x)} + Σ_x p(x) log_2 Z    (242)
                               = Σ_x p(x) l(x) + log_2 Z .    (243)

What about log Z? Let l^{-1}(s) be the set of words that have been assigned codes of
length s. There can only be a limited number of words encoded with a given length:
|l^{-1}(1)| must not be greater than 2, |l^{-1}(2)| must not be greater than 4, and in general
|l^{-1}(s)| must not be greater than 2^s. We have
    ∀s : Σ_{x∈X} [l(x) = s] ≤ 2^s    (244)
    ∀s : Σ_{x∈X} [l(x) = s] 2^{−s} ≤ 1    (245)
    ∀s : Σ_{x∈X} 2^{−l(x)} ≤ 1    (246)

However, this way of thinking is only ok for separated codes. If such codes were part of a
continuous stream of bits, you would never know where a code starts or ends. Prefix codes
fix this problem by defining a code tree whose leaves clearly define when a code ends.
For prefix codes it similarly holds that
    Z = Σ_{x∈X} 2^{−l(x)} ≤ 1 ,    (247)

which is called Kraft’s inequality. That finally gives


    Σ_{x∈X} p(x)[− log_2 q(x)] ≤ Σ_x p(x) l(x) .    (248)
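The following sketch (my own illustration; the probabilities are arbitrary) assigns the Shannon code lengths l(x) = ⌈−log₂ p(x)⌉ that a prefix code can realize, checks Kraft's inequality (247), and compares the expected code length with the entropy, in line with (248).

import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])          # word probabilities
l = np.ceil(-np.log2(p)).astype(int)         # Shannon code lengths l(x) = ceil(-log2 p(x))

Z = np.sum(2.0 ** (-l))                      # Kraft sum, eq. (247)
expected_length = np.sum(p * l)              # expected code length
H = -np.sum(p * np.log2(p))                  # entropy in bits

print("Kraft sum Z =", Z, "<= 1:", Z <= 1)
print("entropy H =", H)
print("expected length =", expected_length)  # H <= E[l] < H + 1 for these lengths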

5.4 The Laplace approximation: A 2nd-order Taylor of log p

Assume we want to estimate some q(x) that we cannot express analytically. E.g., q(x) =
p(x|D) ∝ P(D|x) p(x) for some awkward likelihood function p(D|x). An example from
robotics: x is a stochastically controlled path of a robot. p(x) is a prior distribution over
paths that includes how the robot can actually move and some Gaussian prior (squared
costs!) over controls. If the robot is “linear”, p(x) can be expressed nicely and analytically;
if it is non-linear, expressing p(x) is already hard. Moreover, p(D|x) might indicate
that we do not see collisions on the path. But collisions are a horrible function, usually
computed by some black-box collision detection package that computes distances between
convex meshes, perhaps giving gradients but certainly not an analytic function.
So q(x) clearly cannot be expressed analytically.
One way to approximate q(x) is the Laplace approximation:

Definition 5.10 (Laplace approximation). Given a smooth distribution q(x), we


define its Laplace approximation as

q̃(x) = exp{−Ẽ(x)} , (249)

where Ẽ(x) is the 2nd-order Taylor expansion


    Ẽ(x) = E(x^*) + ½ (x − x^*)^⊤ ∇²E(x^*) (x − x^*)    (250)

of the energy E(x) = − log q(x) at the mode

    x^* = argmin_x E(x) = argmax_x q(x) .    (251)

First, we observe that the Laplace approximation is a Gaussian, because its energy is
a parabola. Further, notice that in the Taylor expansion we skipped the linear term.
That’s because we are at the mode x∗ where ∇E(x∗ ) = 0.

The Laplace approximation is the probabilistic analogue of a local second-order approximation
of a function, just as we used it in Newton methods. However, it is defined to
be taken specifically at the mode of the distribution.
Now, computing x^* is a classical optimization problem x^* = argmin_x E(x), which one
would ideally solve using Newton methods. These Newton methods anyway compute
the local Hessian of E(x) in every step. At the optimum we therefore have the Hessian
already, which is then the precision matrix of our Gaussian.
The Laplace approximation is nice and very efficient to use, e.g., in the context of optimal
control and robotics. While we can use the expressive power of probability theory to
formalize the problem, the Laplace approximation brings us computationally back to
efficient optimization methods.
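Here is a minimal numerical sketch of the Laplace approximation (a toy example of my own, echoing the exercise further below): the unnormalized posterior combines a Gaussian prior N(f|0,1) with a sigmoid likelihood σ(f). Newton steps on the energy E(f) = −log q(f) find the mode f^*, and the Hessian at f^* becomes the precision of the approximating Gaussian.

import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# toy unnormalized posterior: q(f) ∝ N(f|0,1) * sigmoid(f)
def energy(f):          # E(f) = -log q(f), up to a constant
    return 0.5 * f**2 - np.log(sigmoid(f))

def grad(f):            # dE/df = f - (1 - sigmoid(f))
    return f - (1.0 - sigmoid(f))

def hess(f):            # d2E/df2 = 1 + sigmoid(f)(1 - sigmoid(f))
    s = sigmoid(f)
    return 1.0 + s * (1.0 - s)

# Newton iterations to find the mode f*
f = 0.0
for _ in range(20):
    f -= grad(f) / hess(f)

mode, precision = f, hess(f)
print("Laplace approximation: N(f | %.4f, %.4f)" % (mode, 1.0 / precision))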

5.5 Variational Inference

Another reduction of inference to optimization is variational inference.

Definition 5.11 (variational inference). Given a distribution p(x), and a parameter-


ized family of distributions q(x|β), the variational approximation of p(x) is defined
as

    argmin_q D(q ‖ p)    (252)
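As a minimal concrete instance (my own toy example): let the target p be a joint distribution over two binary variables and let the variational family be the mean-field family of products of independent Bernoullis; we then minimize D(q ‖ p) over the two Bernoulli parameters, here simply by grid search.

import numpy as np

# target p(x1,x2) over {0,1}^2, as a 2x2 table (toy example with correlated variables)
p = np.array([[0.40, 0.10],
              [0.10, 0.40]])

def q_table(b1, b2):
    # mean-field family: q(x1,x2) = Bern(x1|b1) Bern(x2|b2)
    q1 = np.array([1 - b1, b1])
    q2 = np.array([1 - b2, b2])
    return np.outer(q1, q2)

def kl(q, p):
    # D(q||p) = sum_x q(x) log q(x)/p(x)
    return np.sum(q * np.log(q / p))

# minimize D(q||p) over (b1,b2) by a simple grid search
grid = np.linspace(0.01, 0.99, 99)
best = min((kl(q_table(b1, b2), p), b1, b2) for b1 in grid for b2 in grid)
print("min KL(q||p) = %.4f at b1=%.2f, b2=%.2f" % best)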

5.6 The Fisher information metric: 2nd-order Taylor of the KLD

Recall our notion of steepest descent—it depends on the metric in the space!
Consider the space of probability distributions p(x; β) with parameters β. We think of
every p(x; β) as a point in the space and wonder what metric is useful to compare two
points p(x; β_1) and p(x; β_2). Let's take the KLD. [TODO]

Let p ∈ Λ_X, that is, p is a probability distribution over the space X. Further, let
θ ∈ R^n and let θ ↦ p(θ) be some parameterization of the probability distribution. Then
the derivative d_θ p(θ) ∈ T_p Λ_X is a vector in the tangent space of Λ_X. For such
tangent vectors of the space of probability distributions there is a generic metric, the
Fisher metric: it arises as the 2nd-order Taylor expansion of the KLD, D(p(θ) ‖ p(θ+δ)) ≈
½ δ^⊤ F(θ) δ, with the Fisher information matrix F(θ) = E_{p(x;θ)}{∇_θ log p(x;θ) ∇_θ log p(x;θ)^⊤}.
[TODO: move to 'probabilities' section]

5.7 Examples and Exercises

Note: These exercises are for 'extra credits'. We'll discuss them on Thu, 21st Jan.

5.7.1 Maximum Entropy and Maximum Likelihood

(These are taken from MacKay’s book Information Theory..., Exercise 22.12 & .13)
a) Assume that a random variable x with discrete domain dom(x) = X comes from a
probability distribution of the form
    P(x | w) = (1/Z(w)) exp[ Σ_{k=1}^d w_k f_k(x) ] ,

where the functions f_k(x) are given, and the parameters w ∈ R^d are not known. A
data set D = {x_i}_{i=1}^n of n points x is supplied. Show by differentiating the log
likelihood log P(D|w) = Σ_{i=1}^n log P(x_i|w) that the maximum-likelihood parameters
w^* = argmax_w log P(D|w) satisfy

    Σ_{x∈X} P(x | w^*) f_k(x) = (1/n) Σ_{i=1}^n f_k(x_i) ,

where the left-hand sum is over all x, and the right-hand sum is over the data points.
A shorthand for this result is that each function-average under the fitted model must
equal the function-average found in the data:

    ⟨f_k⟩_{P(x | w^*)} = ⟨f_k⟩_D .

b) When confronted by a probability distribution P (x) about which only a few facts are
known, the maximum entropy principle (MaxEnt) offers a rule for choosing a distribution
that satisfies those constraints. According to MaxEnt, you should select the P (x) that
maximizes the entropy

    H(P) = − Σ_x P(x) log P(x)
subject to the constraints. Assuming the constraints assert that the averages of certain
functions fk (x) are known, i.e.,

    ⟨f_k⟩_{P(x)} = F_k ,

show, by introducing Lagrange multipliers (one for each constraint, including normaliza-
tion), that the maximum-entropy distribution has the form
    P_{MaxEnt}(x) = (1/Z) exp[ Σ_k w_k f_k(x) ]

where the parameters Z and w_k are set such that the constraints are satisfied. Hence,
the maximum entropy method gives identical results to maximum likelihood fitting of
an exponential-family model.

Note: The exercise will take place on Tue, 2nd Feb. Hung will also tally how many
'votes' you collected in the exercises.

5.7.2 Maximum likelihood and KL-divergence

Assume we have a very large data set D = {x_i}_{i=1}^n of samples x_i ∼ q(x) from some
data distribution q(x). Using this data set we can approximate any expectation

    ⟨f⟩_q = Σ_x q(x) f(x) ≈ (1/n) Σ_{i=1}^n f(x_i) .

Assume we have a parametric family of distributions p(x|β) and want to find the Maximum
Likelihood (ML) parameter β^* = argmax_β p(D|β). Express this ML problem as a
KL-divergence minimization.

5.7.3 Laplace Approximation

In the context of so-called “Gaussian Process Classification” the following problem arises
(we neglect dependence on x here): We have a real-valued RV f ∈ R with prior P (f ) =
N(f | µ, σ 2 ). Further we have a Boolean RV y ∈ {0, 1} with conditional probability
    P(y = 1 | f) = σ(f) = e^f / (1 + e^f) .
The function σ is called sigmoid function, and f is a discriminative value which predicts
y = 1 if it is very positive, and y = 0 if it is very negative. The sigmoid function has the
property

    ∂σ(f)/∂f = σ(f) (1 − σ(f)) .
Given that we observed y = 1 we want to compute the posterior P (f | y = 1), which
cannot be expressed analytically. Provide the Laplace approximation of this posterior.

(Bonus) As an alternative to the sigmoid function σ(f), we can use the probit function
φ(z) = ∫_{−∞}^z N(x|0, 1) dx to define the likelihood P(y = 1 | f) = φ(f). How can
the posterior P(f | y = 1) be approximated in this case?

5.7.4 Learning = Compression

In a very abstract sense, learning means to model the distribution p(x) for given data
D = {x_i}_{i=1}^n. This is literally the case for unsupervised learning; regression, classification

and graphical model learning could be viewed as specific instances of this where x factors
into several random variables, like input and output.
Show in which sense the problem of learning is equivalent to the problem of compression.

5.7.5 A gzip experiment

Get three text files from the Web, of approximately equal length and consisting mostly of
plain text (no equations or similar). Two of them should be in English, the third in French.
(Alternatively, though I am not sure whether it would work: two of them on a very similar
topic, the third on a very different one.)
How can you use gzip (or some other compression tool) to estimate the mutual infor-
mation between every pair of files? How can you ensure some “normalized” measures
which do not depend too much on the absolute lengths of the text? Do it and check
whether in fact you find that two texts are similar while the third is different.

(Extra) Lempel-Ziv algorithms (like gzip) need to build a codebook on the fly. How does
that fit into the picture?
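Not a full solution, but a sketch of how one might set this up (the file names below are placeholders): using the compressed size C(·) as a crude proxy for description length, C(x) + C(y) − C(xy) estimates the shared information of two files, and the normalized compression distance (C(xy) − min(C(x), C(y))) / max(C(x), C(y)) gives a measure that depends less on the absolute text lengths.

import gzip

def C(data: bytes) -> int:
    # compressed size as a crude proxy for description length
    return len(gzip.compress(data, compresslevel=9))

def shared_bits(x: bytes, y: bytes) -> int:
    # rough analogue of mutual information: what we save by compressing jointly
    return C(x) + C(y) - C(x + y)

def ncd(x: bytes, y: bytes) -> float:
    # normalized compression distance (smaller = more similar)
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

files = ["english1.txt", "english2.txt", "french.txt"]   # placeholder file names
texts = [open(f, "rb").read() for f in files]

for i in range(3):
    for j in range(i + 1, 3):
        print(files[i], files[j],
              "shared:", shared_bits(texts[i], texts[j]),
              "NCD: %.3f" % ncd(texts[i], texts[j]))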


A Gaussian identities
Definitions
We define a Gaussian over x with mean a and covariance matrix A as the function

    N(x | a, A) = 1/|2πA|^{1/2} exp{−½ (x−a)^⊤ A^{-1} (x−a)}    (259)

with property N(x | a, A) = N(a | x, A). We also define the canonical form with precision
matrix A as

    N[x | a, A] = exp{−½ a^⊤A^{-1}a} / |2πA^{-1}|^{1/2} · exp{−½ x^⊤ A x + x^⊤ a}    (260)
with properties

    N[x | a, A] = N(x | A^{-1} a, A^{-1})    (261)
    N(x | a, A) = N[x | A^{-1} a, A^{-1}] .    (262)

Non-normalized Gaussian

    N(x, a, A) = |2πA|^{1/2} N(x | a, A)    (263)
               = exp{−½ (x−a)^⊤ A^{-1} (x−a)}    (264)

Matrices  [matrix cookbook: http://www.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf]

    (A^{-1} + B^{-1})^{-1} = A (A+B)^{-1} B = B (A+B)^{-1} A    (265)
    (A^{-1} − B^{-1})^{-1} = A (B−A)^{-1} B    (266)
    ∂_x |A_x| = |A_x| tr(A_x^{-1} ∂_x A_x)    (267)
    ∂_x A_x^{-1} = −A_x^{-1} (∂_x A_x) A_x^{-1}    (268)
    (A + U B V)^{-1} = A^{-1} − A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}    (269)
    (A^{-1} + B^{-1})^{-1} = A − A (B + A)^{-1} A    (270)
    (A + J^⊤ B J)^{-1} J^⊤ B = A^{-1} J^⊤ (B^{-1} + J A^{-1} J^⊤)^{-1}    (271)
    (A + J^⊤ B J)^{-1} A = I − (A + J^⊤ B J)^{-1} J^⊤ B J    (272)

(269) is the Woodbury identity; (271) and (272) hold for positive definite A and B.

Derivatives

    ∂_x N(x | a, A) = N(x | a, A) (−h^⊤) ,   h := A^{-1}(x − a)    (273)
    ∂_θ N(x | a, A) = N(x | a, A) · [ −h^⊤(∂_θ x) + h^⊤(∂_θ a) − ½ tr(A^{-1} ∂_θ A) + ½ h^⊤(∂_θ A) h ]    (274)
    ∂_θ N[x | a, A] = N[x | a, A] · [ −½ x^⊤(∂_θ A) x + ½ a^⊤ A^{-1} (∂_θ A) A^{-1} a + x^⊤(∂_θ a) − a^⊤ A^{-1} (∂_θ a) + ½ tr((∂_θ A) A^{-1}) ]    (275)
    ∂_θ N(x, a, A) = N(x, a, A) · [ −h^⊤(∂_θ x) + h^⊤(∂_θ a) + ½ h^⊤(∂_θ A) h ]    (276)

Product
The product of two Gaussians can be expressed as

    N(x | a, A) N(x | b, B)
        = N[x | A^{-1}a + B^{-1}b, A^{-1} + B^{-1}] N(a | b, A + B)    (277)
        = N(x | B(A+B)^{-1}a + A(A+B)^{-1}b, A(A+B)^{-1}B) N(a | b, A + B) ,    (278)
    N[x | a, A] N[x | b, B]
        = N[x | a + b, A + B] N(A^{-1}a | B^{-1}b, A^{-1} + B^{-1})    (279)
        = N[x | ...] N[A^{-1}a | A(A+B)^{-1}b, A(A+B)^{-1}B]    (280)
        = N[x | ...] N[A^{-1}a | (1 − B(A+B)^{-1}) b, (1 − B(A+B)^{-1}) B] ,    (281)
    N(x | a, A) N[x | b, B]
        = N[x | A^{-1}a + b, A^{-1} + B] N(a | B^{-1}b, A + B^{-1})    (282)
        = N[x | ...] N[a | (1 − B(A^{-1}+B)^{-1}) b, (1 − B(A^{-1}+B)^{-1}) B]    (283)
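As a quick numerical sanity check of (277)/(278) (my own sketch, in one dimension, so A and B are variances): the pointwise product of the two densities is compared against the right-hand side of (278).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
a, b = rng.normal(size=2)             # means
A, B = rng.uniform(0.5, 2.0, size=2)  # variances

def N(x, mean, var):                  # N(x | mean, var), 1D
    return norm.pdf(x, loc=mean, scale=np.sqrt(var))

x = np.linspace(-5, 5, 11)

lhs = N(x, a, A) * N(x, b, B)
# eq. (278): mean and variance of the product Gaussian, times the "overlap" term N(a | b, A+B)
c = (B * a + A * b) / (A + B)
Cvar = A * B / (A + B)
rhs = N(x, c, Cvar) * N(a, b, A + B)

print(np.allclose(lhs, rhs))          # True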

Convolution

    ∫_x N(x | a, A) N(y − x | b, B) dx = N(y | a + b, A + B)    (284)

Division

    N(x | a, A) / N(x | b, B) = N(x | c, C) / N(c | b, C + B) ,   where   C^{-1} c = A^{-1} a − B^{-1} b ,   C^{-1} = A^{-1} − B^{-1}    (285)
    N[x | a, A] / N[x | b, B] ∝ N[x | a − b, A − B]    (286)

Expectations  Let x ∼ N(x | a, A). Then

    E_x{g(x)} := ∫_x N(x | a, A) g(x) dx    (287)
    E_x{x} = a ,    E_x{x x^⊤} = A + a a^⊤    (288)
    E_x{f + F x} = f + F a    (289)
    E_x{x^⊤ x} = a^⊤ a + tr(A)    (290)
    E_x{(x−m)^⊤ R (x−m)} = (a−m)^⊤ R (a−m) + tr(R A)    (291)

Transformation  Linear transformations imply the following identities:

    N(x | a, A) = N(x + f | a + f, A) ,    N(x | a, A) = |F| N(Fx | Fa, F A F^⊤)    (292)
    N(Fx + f | a, A) = (1/|F|) N(x | F^{-1}(a − f), F^{-1} A F^{-⊤})    (293)
                     = (1/|F|) N[x | F^⊤ A^{-1} (a − f), F^⊤ A^{-1} F] ,    (294)
    N[Fx + f | a, A] = (1/|F|) N[x | F^⊤(a − A f), F^⊤ A F] .    (295)

“Propagation” (propagating a message along a coupling, using eqs. (277) and (283), respectively):

    ∫_y N(x | a + Fy, A) N(y | b, B) dy = N(x | a + Fb, A + F B F^⊤)    (296)
    ∫_y N(x | a + Fy, A) N[y | b, B] dy = N[x | (F^{-⊤} − K)(b + B F^{-1} a), (F^{-⊤} − K) B F^{-1}] ,    (297)
        K = F^{-⊤} B (F^{-⊤} A^{-1} F^{-1} + B)^{-1}    (298)

marginal & conditional (writing (x; y) for the stacked vector and (·, · ; ·, ·) for the corresponding 2×2 block matrix, rows separated by ';'):

    N(x | a, A) · N(y | b + Fx, B) = N( (x; y) | (a; b + Fa), (A, A^⊤F^⊤; FA, B + FA^⊤F^⊤) )    (299)

    N( (x; y) | (a; b), (A, C; C^⊤, B) ) = N(x | a, A) · N(y | b + C^⊤A^{-1}(x−a), B − C^⊤A^{-1}C)    (300)

    N[x | a, A] · N(y | b + Fx, B) = N[ (x; y) | (a + F^⊤B^{-1}b; B^{-1}b), (A + F^⊤B^{-1}F, −F^⊤B^{-1}; −B^{-1}F, B^{-1}) ]    (301)

    N[x | a, A] · N[y | b + Fx, B] = N[ (x; y) | (a + F^⊤B^{-1}b; b), (A + F^⊤B^{-1}F, −F^⊤; −F, B) ]    (302)

    N[ (x; y) | (a; b), (A, C; C^⊤, B) ] = N[x | a − CB^{-1}b, A − CB^{-1}C^⊤] · N[y | b − C^⊤x, B]    (303)
    |(A, C; D, B)| = |Â| |B| = |A| |B̂| ,   where   Â = A − C B^{-1} D ,   B̂ = B − D A^{-1} C    (304)
    (A, C; D, B)^{-1} = (Â^{-1}, −A^{-1} C B̂^{-1}; −B̂^{-1} D A^{-1}, B̂^{-1})    (305)
                      = (Â^{-1}, −Â^{-1} C B^{-1}; −B^{-1} D Â^{-1}, B̂^{-1})    (306)
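And a similar check of the conditioning formula (300) above (my own sketch, for a 2D Gaussian with scalar blocks A, B, C and arbitrary numbers): the ratio of the joint density and the marginal N(x | a, A) should equal the conditional of y given x stated in (300).

import numpy as np
from scipy.stats import norm, multivariate_normal

a, b = 0.3, -1.0
A, B, C = 1.5, 2.0, 0.8           # joint covariance [[A, C], [C, B]], scalar blocks

joint = multivariate_normal(mean=[a, b], cov=[[A, C], [C, B]])

x, y = 0.7, -0.2                  # an arbitrary test point

# left: conditional density p(y|x) obtained as joint / marginal
p_cond = joint.pdf([x, y]) / norm.pdf(x, loc=a, scale=np.sqrt(A))

# right: eq. (300): N(y | b + C A^-1 (x-a), B - C A^-1 C)
mean_cond = b + C / A * (x - a)
var_cond = B - C**2 / A
p_formula = norm.pdf(y, loc=mean_cond, scale=np.sqrt(var_cond))

print(np.isclose(p_cond, p_formula))   # True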

pair-wise belief: We have a message α(x) = N[x|s, S], a transition P(y|x) = N(y|Ax + a, Q),
and a message β(y) = N[y|v, V]; what is the belief b(y, x) = α(x) P(y|x) β(y)?

    b(y, x) = N[x | s, S] · N(y | Ax + a, Q) · N[y | v, V]    (307)
            = N[ (x; y) | (s; 0), (S, 0; 0, 0) ]
              · N[ (x; y) | (A^⊤Q^{-1}a; Q^{-1}a), (A^⊤Q^{-1}A, −A^⊤Q^{-1}; −Q^{-1}A, Q^{-1}) ]
              · N[ (x; y) | (0; v), (0, 0; 0, V) ]    (308)
            ∝ N[ (x; y) | (s + A^⊤Q^{-1}a; v + Q^{-1}a), (S + A^⊤Q^{-1}A, −A^⊤Q^{-1}; −Q^{-1}A, V + Q^{-1}) ]    (309)

Entropy

    H(N(a, A)) = ½ log |2πeA|    (310)

Kullback-Leibler divergence
For p = N(x|a, A), q = N(x|b, B), and n = dim(x):

    D(p ‖ q) = Σ_x p(x) log p(x)/q(x)    (311)
    2 D(p ‖ q) = log |B|/|A| + tr(B^{-1}A) + (b − a)^⊤ B^{-1} (b − a) − n    (312)
    4 D_sym(p ‖ q) = tr(B^{-1}A) + tr(A^{-1}B) + (b − a)^⊤ (A^{-1} + B^{-1}) (b − a) − 2n    (313)

λ-divergence

    2 D_λ(p ‖ q) = λ D(p ‖ λp + (1−λ)q) + (1−λ) D(q ‖ (1−λ)p + λq)    (314)

For λ = .5 this is the Jensen-Shannon divergence.

Log-likelihoods

    log N(x | a, A) = −½ [ log|2πA| + (x−a)^⊤ A^{-1} (x−a) ]    (315)
    log N[x | a, A] = −½ [ log|2πA^{-1}| + a^⊤A^{-1}a + x^⊤Ax − 2x^⊤a ]    (316)
    Σ_x N(x | b, B) log N(x | a, A) = −D( N(b, B) ‖ N(a, A) ) − H(N(b, B))    (317)

Mixture of Gaussians  Collapsing a MoG into a single Gaussian:

    argmin_{b,B} D( Σ_i p_i N(a_i, A_i) ‖ N(b, B) )    (318)
    ⟹   b = Σ_i p_i a_i ,   B = Σ_i p_i (A_i + a_i a_i^⊤ − b b^⊤)    (319)
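The moment matching in (319) is straightforward to implement; a minimal sketch (my own, with 1D components and arbitrary numbers):

import numpy as np

# mixture components: weights p_i, means a_i, variances A_i (1D toy example)
p = np.array([0.3, 0.5, 0.2])
a = np.array([-1.0, 0.5, 3.0])
A = np.array([0.5, 1.0, 0.2])

# eq. (319): collapse to the moment-matched single Gaussian N(b, B)
b = np.sum(p * a)
B = np.sum(p * (A + a**2 - b**2))

print("b =", b, "B =", B)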

B Further
ˆ Differential Geometry
Emphasize strong relation between a Riemannian metric (and respective geodesic)
and cost (in an optimization formulation). Pullbacks and costs. Only super brief,
connections.

ˆ Manifolds
Local tangent spaces, connection. example of kinematics
ˆ Lie groups
exp and log
ˆ Information Geometry
[Integrate notes on information geometry]
