Notes
Jeffrey Liu
A quick warning...
These notes are incomplete and subject to mass reorganization and editing. However,
chapters 4-6 are fairly readable in their current state, and they are the most important.
When I'm done, chapter 7 will probably be three times the length of the other chapters,
but by the time I finish, it will be the most interesting.
Also, all credit to Fabrizio Conti for the brilliant cover photo, retrieved from Unsplash.
Hope it helps the reader visualize descent algorithms :).
Foreword
Hi, Jeffrey here. Today is St Patrick’s Day, 2020, but this year I won’t be drinking
my problems away on Ezra. I’m not sure when in the future that you, if anyone, will
be reading this, so I’ll forgive you if you don’t remember that this was the year of the
COVID-19 pandemic. My classes were suspended last Friday, and I’ve spent the better
part of the break so far playing video games and watching YouTube [1].
To keep my sanity over the next couple months, I’ve also decided to start a collection
of notes for some of my more favoured subjects.
Convexity is a beautiful property that lends itself to many powerful results in algebra,
analysis, and geometry. Applications have been found in many branches of
pure mathematics, engineering, finance, computer science, physics, and of course
optimization. I'll be basing these notes on the extensive literature in the subject, in
particular Boyd and Vandenberghe's Convex Optimization and Wolkowicz's CO 463
course notes [2], among others.
In contrast to these sources, however, I'll be exploring the subject from the perspective
of a (not exceptionally bright :p) undergraduate, so I'll try my best to motivate the
material. I'll also look to introduce non-convex optimization, especially towards
numerical methods and their applications in machine learning.
I hope that these notes will form a concise yet broad overview of the subject,
perhaps as a companion for a first graduate-level course in convex or nonlinear
optimization. Of course, I'm still a student (read: I'm not an expert), so expect room
for improvement. A quick note: some elementary linear algebra and calculus is expected,
although I'll try to have an appendix with some prerequisite theorems.
Enjoy! [3]
[1] If my parents or any future professors or employers ever read this, I was also reading lots of textbooks.
[2] I'll be following the course notes the closest, but with my own commentary and editorials.
[3] I'd like to thank Fabrizio Conti again for the brilliant cover photo, retrieved from Unsplash. Also, special thanks to my CO 463 professor Henry Wolkowicz, as well as my mom and dad for their continued belief in me.
Contents

1 Convex Geometry
1.1 Convex Sets
1.1.1 Definitions
1.1.2 Operations Preserving Convexity
1.1.3 Convex Hulls and Carathéodory's Theorem
1.2 Affine Sets
1.3 Geometry with Convex Sets
1.3.1 Extreme Points and Faces
1.3.2 Projections
1.3.3 Separation
1.4 Cones
1.4.1 Definitions
1.4.2 Partial Orders
1.4.3 The Dual Cone
2 Convex Functions
2.1 Preliminary Definitions
2.1.1 As a vector space
2.1.2 Elementary Properties
2.2 Other Types of Convexity
2.2.1 Quasiconvexity
2.2.2 Strong Convexity
2.2.3 Strict Convexity
2.2.4 K-Convexity
2.3 Calculus with Convex Functions
2.3.1 Derivatives
2.3.2 Subdifferentials
3 Convex Programs
3.1 The Framework
3.2 Optimality Conditions
3.3 Duality
3.3.1 Minimax Theorem
3.3.2 Lagrangian Duality
3.3.3 Strong Duality
3.3.4 Optimality Conditions
3.3.5 Conjugate Duality
4.2 Second Order Methods
4.2.1 Newton's Method
4.2.2 Trust Region Methods
7 Applications
7.1 Semidefinite Programming
7.1.1 Preliminaries
7.1.2 Semidefinite Programming
7.2 Least Squares
7.3 Quadratic Programming (and Support Vector Machines)
7.4 Quadratic Assignment Problem
7.5 Max Cut
7.6 Sensor Network Localization
7.7 ¿Neural Networks?
8 Additional Enrichment
8.1 Convex Optimization on Manifolds
A Appendix I: Prerequisites
A.1 Linear Algebra
A.2 Calculus
A.2.1 The first derivative
A.2.2 The other derivatives
A.3 Linear Programming
Notation
I try to use standard notation; anything that may be controversial will be listed here.
If you are unfamiliar with any of these definitions, please read the appendix.
Abuse of Notation
$\min_{x \in \Omega} f(x)$ denotes the infimum of $f(x)$ as $x$ ranges over $\Omega$.
$\max_{x \in \Omega} f(x)$ denotes the supremum of $f(x)$ as $x$ ranges over $\Omega$.

[4] An astute reader may note that all $n$-dimensional Euclidean vector spaces are isomorphic to $\mathbb{R}^n$, and may ask why we bother distinguishing them. Honestly, I think the only reason is so that we can be lazy and always assume $\mathbb{R}^n$ refers to that specific vector space with the standard inner product.
Introduction
Note: I’ll put some pictures here eventually.
Historically, mathematicians have preferred structural results over numerical ones.
Greats such as Euler, Gauss, and Riemann devoted themselves to the creation of
beautiful theories in algebra, analysis, geometry, and number theory. It wasn't until
the 20th century, with the invention and popularization of the computer, that the study
of algorithms and computation took off.
Combinatorics and Optimization are two fields which blossomed in this new age.
Many computational subfields of combinatorics, including graph theory, matroid
theory, and polyhedral theory, have only been developed recently. The simplex
method, published by Dantzig in 1947, was the first of many algorithms with promising
practicality, and with it came the field of operations research.
Convexity, as a geometric property, has been known since antiquity. Its properties were
investigated by the likes of Euler and Cauchy; however, its true potential wasn't
realized until the late 1800s, when the German mathematician Minkowski was able to apply
it to number theory. He and fellow German Brunn developed much of the theory in
two and three dimensions. Carathéodory, Krein, Milman, and Fenchel, among many
others, developed and generalized much of the theory from the turn of the century until
about the Second World War. By 1970 or so, all of the convexity theory we will require
had been discovered.
The simplex method was a huge landmark in optimization with remarkable practical
efficiency; however, no polynomial-time variation of it has ever been found. Efforts to
find a provably polynomial-time algorithm led to the discovery of the ellipsoid method
in the 1970s. In 1984, Narendra Karmarkar proposed the first polynomial-time interior-point
method for linear programming while working for Bell Labs, and non-linear adaptations
have been an active field of research since the late 1980s.
Since the 90’s, applications have been found in the traditional domains of opera-
tions research, and also in engineering (robotics, signal processing, circuit design, . . . );
computer science (machine learning), the physical sciences, and finance.
We'll first discuss convex geometry and its natural extension to functions. With
this, we'll be able to analyze convex programs and develop optimality conditions.
These conditions, when violated, give rise to algorithms. Finally, we'll conclude with
as many interesting applications as I could find.
For better flow, I will not prove many of the propositions and theorems unless they
are particularly insightful. My primary intention is for these notes to be a reference for
myself, but I'll try my best to be a good teacher (plus I enjoy teaching and hope to do
it more often). I'll have an appendix with hints and solutions for some of the trickier
proofs.
Oh, one last thing: I’ll try to maintain a prerequisite DAG. So please don’t be
discouraged like I was and think that you have to read two chapters of geometry before
you even see an optimization problem. In fact, you may be able to understand many
of the algorithms by simply reading their description and googling any definitions you
haven’t heard of before.
But still, it’s really important to know the fundamentals. I’m really sorry that
sections 1 and 2 may get a little boring, but we need to learn to walk before we can
run.
1 Convex Geometry
We shall explore the main geometric structures which arise in convex optimization,
beginning with, of course, sets.
1.1.2 Operations Preserving Convexity
There are several common operations which preserve convexity of a set. We’ll be able
to use these operations to build increasingly sophisticated sets from the basic examples
given above. These propositions all follow from definitions, and the reader is encouraged
to prove them on their own.
Proposition 1.6. Let $\{C_i\}_{i \in I}$ be any family of convex sets. Then the intersection $\bigcap_{i \in I} C_i$ is convex.
In particular, the convex hull of a set S, possibly defined as the intersection of all
convex sets containing S, is convex. We’ll talk more about this later.
Proposition 1.7. Let $C \subseteq \mathbb{E}$ be a convex set and $A : \mathbb{E} \to \mathbb{F}$ an affine transformation. Then the image $A(C)$ is a convex set.
Proposition 1.8. Let $C_i \subseteq \mathbb{E}^{n_i}$ for $i \in [m]$. Then the Cartesian product
$$C_1 \times \cdots \times C_m = \left\{ (x^{(1)}, \dots, x^{(m)}) : x^{(i)} \in C_i \right\}$$
is a convex set. Conversely, if $C \subseteq \mathbb{E}^{n_1} \times \cdots \times \mathbb{E}^{n_m}$ is a convex set, then so is each of its projections onto $\mathbb{E}^{n_i}$.
Proposition 1.10. Let C ⊆ E be a convex set. Then the interior int C and closure
cl C are convex sets as well.
Actually, we shall see that we can strengthen Proposition 1.10 when we define the
relative interior, in case C has a “lower dimension” than E. But this will come later.
1.1.3 Convex Hulls and Carathéodory’s Theorem
Many problems encountered in nature will not be convex (and usually will be difficult
to solve), so we'll formulate strategies to relax these problems so that they become convex.
The simplest of these relaxations is the convex hull.
Definition 1.11. Let S ⊆ E be any set. Then the convex hull of S, denoted conv S,
is the smallest convex set containing S. That is, if C is any other convex set containing
S, then conv S ⊆ C.
It’s easy to show that the convex hull of S is the intersection of all convex sets
containing S. In fact, this also serves as a decent definition of the convex hull, as
by Proposition 1.6 we know that this intersection is convex. Some texts prefer the
following equivalent formulation of the convex hull:
$$\operatorname{conv} S = \left\{ \sum_{i=1}^{k} \lambda_i x^{(i)} : k \in \mathbb{N},\ x^{(i)} \in S,\ \lambda \in \mathbb{R}^k_+,\ \sum_{i=1}^{k} \lambda_i = 1 \right\}.$$
That is, the convex hull of S is precisely the set of all convex combinations of points
from S. Now, despite the simplicity of the definition of convexity, we can build a very
rich theory to describe these objects. We can immediately state a beautiful result of
the subject.
Theorem 1.13 (Carathéodory). Let $S \subseteq \mathbb{E}^n$. Then every point of $\operatorname{conv} S$ can be written as a convex combination of at most $n + 1$ points of $S$.
Proof. We'll show that Carathéodory's Theorem follows from linear independence [6].
Insert proof here.
In other words, no matter how misbehaved S is as a set, any point in the convex
hull of S can be written as a convex combination of just n + 1 points from S. This
is remarkable, as it will often allow us to simplify arbitrarily many variables into just
n + 1. In fact, if S is sufficiently well behaved, we get an even stronger statement.
Theorem 1.14 (Fenchel and Bunt, YYYY?). Let $S \subseteq \mathbb{E}^n$, and suppose that $S$ has no
more than $n$ connected components. Then only $n$ points are needed in Theorem 1.13.
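To make Theorem 1.13 concrete, here is a small numerical illustration (a sketch in Python, assuming NumPy and SciPy are available; the test data and all names are mine). The trick: for a generic objective, an LP solver returns a basic (vertex) solution of the system $\sum_i w_i x^{(i)} = p$, $\sum_i w_i = 1$, $w \ge 0$, and a basic solution automatically has at most $n + 1$ nonzero weights.

# Caratheodory via LP: express p in conv(X) using at most n + 1 of 50 points.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, k = 2, 50
X = rng.standard_normal((k, n))            # 50 points in R^2
w = rng.random(k); w /= w.sum()
p = w @ X                                  # p lies in conv(X) by construction

A_eq = np.vstack([X.T, np.ones(k)])        # sum_i w_i x_i = p and sum_i w_i = 1
b_eq = np.append(p, 1.0)
res = linprog(rng.random(k), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * k, method="highs")
support = np.flatnonzero(res.x > 1e-9)
print(len(support), "<=", n + 1)           # a basic solution uses at most n + 1 points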
There are several related theorems that are worth knowing, although they don't
appear too often [7].
[6] Although there is a very nice proof using linear programming.
[7] Some of the proofs of these theorems have appeared as challenge problems on some of my exams, so beware.
Corollary 1.15 (A consequence of Carathéodory’s Theorem). The convex hull of a
compact set is compact.
Exercise 1.16. Beware! The compactness condition cannot be weakened: the convex
hull of a closed set may not be closed. Try to find a counterexample!
Theorem 1.17 (Helly, YYYY). Let $C^{(i)} \subseteq \mathbb{E}^n$, $i \in I$ be a (potentially uncountable?)
collection of compact convex sets. If every subcollection of $n + 1$ sets has a
non-empty intersection, then
$$\bigcap_{i \in I} C^{(i)} \neq \emptyset. \tag{1.11}$$
Theorem 1.18 (Radon, YYYY). Let $\{x^{(1)}, \dots, x^{(n+2)}\} \subseteq \mathbb{E}^n$. Then there is a partition
$I_1, I_2$ of the indices $[n + 2]$ such that the convex hulls
$$C_1 = \operatorname{conv}\left\{x^{(i)} : i \in I_1\right\}, \qquad C_2 = \operatorname{conv}\left\{x^{(i)} : i \in I_2\right\} \tag{1.12}$$
intersect: $C_1 \cap C_2 \neq \emptyset$.
Theorem 1.19 (Shapley-Folkman, YYYY). GOES HERE
1.2 Affine Sets
Definition 1.20. A set $S \subseteq \mathbb{E}$ is affine if
$$x, y \in S,\ \lambda \in \mathbb{R} \implies (1 - \lambda)x + \lambda y \in S. \tag{1.13}$$
Intuitively, just as convex sets contain all of their convex combinations, we'd like
an affine set to be one containing all of its affine combinations. However, unlike their
convex counterparts, affine sets can be characterized extremely simply.
Proposition 1.21 (Different formulations of affine sets). Let $S \subseteq \mathbb{E}$. Then the following
are equivalent:
(i) $S$ is an affine set, in the sense of Definition 1.20.
(ii) $S$ is a linear manifold, i.e., there exists a linear transformation $A : \mathbb{E} \to \mathbb{F}$ and a
vector $b \in \mathbb{F}$ so that
$$S = \{x \in \mathbb{E} : Ax = b\}. \tag{1.14}$$
(iii) There exist some $d \in \mathbb{E}$ and a linear transformation $B : \mathbb{F} \to \mathbb{E}$ so that
$$S = \{x \in \mathbb{E} : x = By + d,\ y \in \mathbb{F}\}. \tag{1.15}$$
(iv) $S$ is the translation of a subspace, i.e., for $x \in S$ the set $S - x$ is a subspace of $\mathbb{E}$.
Among these formulations, (1.14) is the most useful, i.e., affine sets are the solution
sets of systems of linear equations, and the terms affine set and linear manifold
will be used interchangeably.
However, (iv) inspires us to adapt linear independence to the affine case.
Affine independence definition goes here, and some basic theorems.
These tools form the backbone of what’s really happening in our proof of Theorem
1.13.
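Until that definition lands, here is one way the standard notion can be checked numerically (a sketch; the function name is mine): points $x^{(0)}, \dots, x^{(k)}$ are affinely independent iff the differences $x^{(i)} - x^{(0)}$ are linearly independent, which reduces to a rank computation.

# Affine independence test via the rank of the difference vectors.
import numpy as np

def affinely_independent(points):
    """points: array of shape (k+1, n). True iff the points are affinely independent."""
    diffs = points[1:] - points[0]            # k vectors in R^n
    return np.linalg.matrix_rank(diffs) == len(diffs)

square = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(affinely_independent(square[:3]))  # True: a triangle's vertices
print(affinely_independent(square))      # False: 4 points in R^2 never are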
Dimension, Relative interior
Proposition 1.22. Let $C \subseteq \mathbb{E}$ be a closed convex set. Then $\operatorname{ext} C$ is non-empty if and only
if $C$ is pointed, i.e., $C$ does not contain any lines.
In other words, the set of extreme points of C is the smallest set which is “good
enough” to determine C via its convex hull; it is the “shortest worker’s instruction for
building the set”.
1.3.2 Projections
Projections onto convex sets will be our main tools in proving hyperplane separation
theorems. They are a natural extension of projections onto subspaces from linear
algebra.
Recall that we can define an orthogonal projection $P$ onto a subspace $V \subseteq \mathbb{E}$ as a
linear transformation satisfying $P^2 = P^\top = P$. AND THEN $w = v + v^\perp$, etc.
We’d like to define the projection onto a convex set similarly.
It takes a little work to see why $P_C(x)$ always exists, and if it does, why it's unique.
Proposition 1.25 (Kolmogorov's Criterion, YYYY). Let $C \subseteq \mathbb{E}$ be a non-empty
closed convex set, and $x \in \mathbb{E}$. Then $P_C(x)$, as defined in Definition 1.24, is well
defined. In particular, $y_x = P_C(x)$ if and only if
$$\langle x - y_x,\ y - y_x \rangle \le 0, \quad \forall y \in C. \tag{1.18}$$
Proof. ??
Moreau decomp
Here are a few nice exercises with projections.
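In that spirit, here is a quick numerical sanity check of Kolmogorov's criterion (a sketch, assuming NumPy; for the box $C = [0,1]^n$, the projection $P_C$ is just componentwise clipping):

# Verify <x - y_x, y - y_x> <= 0 for the box C = [0,1]^n and random y in C.
import numpy as np

rng = np.random.default_rng(1)
n = 5
x = rng.standard_normal(n) * 3
y_x = np.clip(x, 0.0, 1.0)               # P_C(x) for the box

for _ in range(1000):
    y = rng.random(n)                     # uniform sample from C
    assert np.dot(x - y_x, y - y_x) <= 1e-12
print("Kolmogorov's criterion verified on 1000 random points of C")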
1.3.3 Separation
Preliminary version:
Theorem (Basic separation). Let $C \subseteq \mathbb{E}$ be a non-empty closed convex set and $x \notin C$. Then there exist $\phi \in \mathbb{E}$, $\phi \neq 0$, and $\alpha \in \mathbb{R}$ such that $\langle \phi, x \rangle > \alpha \ge \langle \phi, c \rangle$
for all $c \in C$.
Theorem (Separation of two sets). Let $C_1, C_2 \subseteq \mathbb{E}$ be disjoint non-empty closed convex sets, at least one of them compact. Then there exist $\phi \in \mathbb{E}$, $\phi \neq 0$, and $\alpha \in \mathbb{R}$ such that $\langle \phi, c_1 \rangle > \alpha \ge \langle \phi, c_2 \rangle$
for all $c_1 \in C_1$, $c_2 \in C_2$.
1.4 Cones
1.4.1 Definitions
The cone is another very important geometric object, although as with the convex set,
we define it algebraically.
Definition. A set $K \subseteq \mathbb{E}$ is a cone if
$$x \in K,\ \lambda \in \mathbb{R}_+ \implies \lambda x \in K. \tag{1.22}$$
Oftentimes, we may find ourselves focusing on the closed convex cones (c.c.c. for
short), which are closer to our primary-school intuition of a "pylon-shaped" cone. As
we shall see, these specific cones have many nice properties; however, it is important
to note that c.c.c.s are not the only types of cones. Convex cones have a nice
characterization: a cone $K$ is convex if and only if it is closed under addition,
$$x, y \in K \implies x + y \in K. \tag{1.23}$$
A convex cone contains all of its conic combinations, which are sums of the form
$$\sum_{i=1}^{k} \lambda_i x^{(i)} \quad \text{for } \lambda \in \mathbb{R}^k_+ \text{ and } x^{(i)} \in K. \tag{1.24}$$
Just as with convex sets, the intersection of any family of convex cones is itself a convex
cone. Similar to the convex hull, we can define a conical relaxation of a set, or the conic
hull:
Definition 1.31. Let S ⊆ E be any set. Then the conical hull of S, denoted cone S,
is the smallest convex cone containing S.
The reason why we care about closed convex cones so much (besides the fact that
they are just so cool ) is because they correspond with partial orders on E. Un-
derstanding cone geometry leads to conic optimization (duh), most notably including
second-order cone programming and semidefinite programming. We have efficient al-
gorithms for both these problems.
1.4.2 Partial Orders
Many things cannot be compared to each other [8]. In linear algebra, there really isn't
a good way to compare two arbitrary vectors. For example, in $\mathbb{R}^2$, the vectors $(0, 1)$
and $(1, 0)$ are indistinguishable (especially before the choice of a basis). This appears
to be a huge problem in optimization, where questions of the form "minimize . . . "
inherently require us to be able to compare things to each other.
However, this is a non-issue, as we introduce the concept of a partial order.
PARTIAL ORDER DEFINITION
Unlike total orders, we are allowed to have pairs of objects $a, b$ which are incomparable,
meaning neither $a \preceq b$ nor $b \preceq a$.
Example 1.33. We are already familiar with a partial order: the less-than-or-equal-to
relation $\le$ over the real numbers. We can generalize this to obtain a natural partial
order on $\mathbb{R}^n$, where
$$x \preceq y \iff x_i \le y_i,\ \forall i \in [n]. \tag{1.26}$$
Essentially, we say $x \preceq y$ if every entry of $x$ is less than the corresponding entry of $y$.
Usually we won't be too picky with the notation and just write $x \le y$. After messing
around with the definition for a bit, a cool alternate formulation is
$$x \le y \iff y - x \in \mathbb{R}^n_+. \tag{1.27}$$
As it turns out, this is actually a more natural definition for the $\le$ relation [9]. In
particular, it is (at least on the surface) coordinate-free, and it extends easily.
Proposition 1.34. Indeed, if $K \subset \mathbb{E}$ is any pointed convex cone, then the relation
$x \preceq y$ if and only if $y - x \in K$ is a partial order. Conversely, if $\preceq$ is any partial order
then
$$\{x \in \mathbb{E} : 0 \preceq x\} \tag{1.28}$$
is a pointed convex cone.
Definition 1.35. Let $S \subseteq \mathbb{E}$. Then we can define the positive polar cone of $S$,
denoted $S^+$, as
$$S^+ := \{\phi \in \mathbb{E} : \langle x, \phi \rangle \ge 0,\ \forall x \in S\}. \tag{1.29}$$
[8] Apples and oranges is a canonical example.
[9] And it may be familiar if you've done anything with equivalence relations.
We can define the negative polar cone $S^\circ$ similarly, and $S^\circ = -S^+$. Note that
the positive polar cone is sometimes referred to as the dual cone, denoted $S^*$, and
the negative polar cone is simply called the polar cone.
Insert graphic representing dual cone
The dual cone comes up surprisingly often, as it represents I DON’T KNOW...
FIGURE THIS OUT. One fact that comes up a lot is the following: if $K$ is a closed
convex cone, then
$$K = (K^+)^+. \tag{1.30}$$
Corollary 1.37 (Farkas' Lemma). Let $A \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^n$. Then exactly one of the
following is true:
(i) The system $Ax = b$, $x \ge 0$ has a solution.
(ii) The system $A^\top y \ge 0$, $b^\top y < 0$ has a solution.
Proof. Hi
Farkas’ lemma and other such theorems of the alternative are the foundation of a
beautiful duality theory in linear programming. It is strongly recommended that the
reader be familiar with these ideas.
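As a hedged illustration (the helper below is my own, assuming SciPy; it simply uses an LP solver as the feasibility oracle), we can decide which Farkas alternative holds for a given $A$ and $b$:

# Deciding which Farkas alternative holds, using an LP solver as the oracle.
import numpy as np
from scipy.optimize import linprog

def farkas(A, b):
    m, n = A.shape
    # alternative (i): is Ax = b, x >= 0 feasible?
    res = linprog(np.zeros(n), A_eq=A, b_eq=b,
                  bounds=[(0, None)] * n, method="highs")
    if res.status == 0:
        return "(i)", res.x
    # alternative (ii): find y with A^T y >= 0 and b^T y < 0 by minimizing
    # b^T y over a box (the box only normalizes the certificate's scale)
    res = linprog(b, A_ub=-A.T, b_ub=np.zeros(n),
                  bounds=[(-1, 1)] * m, method="highs")
    return "(ii)", res.x

A = np.array([[1., 1.]])
b = np.array([-1.])          # x1 + x2 = -1 with x >= 0 is infeasible
print(farkas(A, b))          # certificate y = [1]: A^T y >= 0, b^T y = -1 < 0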
2 Convex Functions
2.1 Preliminary Definitions
We’ve seen many examples of the importance and elegance of convex sets. As we shall
see, there is a very natural correspondence between convex sets and convex functions,
which will allow us to transfer much of the theory over. In particular, we will be able
to derive several properties which are crucial to the success of convex optimization.
The general definition of a convex function is usually introduced in freshman calculus,
and is given by Jensen's inequality.
Definition 2.1. Let $C \subseteq \mathbb{E}$ be a convex set. A function $f : C \to \mathbb{R}$ is convex if for all $x, y \in C$ and $\lambda \in [0, 1]$,
$$f((1 - \lambda)x + \lambda y) \le (1 - \lambda)f(x) + \lambda f(y). \tag{2.1}$$
We see that it really is necessary for $C$ to be a convex set, as otherwise the LHS of
(2.1) may not be defined. Pictorially, we define $f$ to be convex if it lies below all of its
secant lines.
MAYBE INSERT PICTURES AND EXAMPLES
If we have a function $f$ such that $-f$ is convex, then we call $f$ concave. Concave
functions satisfy the reverse inequality,
$$f((1 - \lambda)x + \lambda y) \ge (1 - \lambda)f(x) + \lambda f(y). \tag{2.2}$$
We can immediately uncover the relationship between convex sets and convex functions.
Recall the following definition:
Definition 2.2. Let $f : S \subseteq \mathbb{E} \to \mathbb{R}$ be any function. We define the epigraph of $f$,
denoted $\operatorname{epi} f$, by
$$\operatorname{epi} f = \{(x, r) \in S \times \mathbb{R} : f(x) \le r\}. \tag{2.3}$$
PICTURE? The epigraph represents the region “above” the graph of f , and is
a subset of a Euclidean space of one dimension higher. Then we have the following
equivalence:
Proposition 2.3. Let f : S → R. Then the following are equivalent:
(i) f is a convex function
(ii) epi f is a convex set
Examples: lines, quadratics (introduce hessian is psd iff convex), other examples,
norms,
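For the quadratic example in particular: $f(x) = \frac{1}{2}x^\top Q x + c^\top x$ is convex iff the Hessian $Q$ is positive semidefinite, which is easy to check numerically (a sketch, assuming a symmetric $Q$ and NumPy):

# Convexity of a quadratic via the eigenvalues of its Hessian.
import numpy as np

def is_convex_quadratic(Q, tol=1e-10):
    return np.all(np.linalg.eigvalsh(Q) >= -tol)

print(is_convex_quadratic(np.array([[2., 0.], [0., 1.]])))   # True
print(is_convex_quadratic(np.array([[1., 0.], [0., -1.]])))  # False: a saddle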
As an aside, we’d also like to note that we can recover the classical Jensen’s in-
equality by induction.
Theorem 2.4 (Jensen's Inequality). Let $f : \mathbb{E} \to \mathbb{R}$ be a convex function, $x^{(1)}, \dots, x^{(k)} \in \mathbb{E}$ be points in the domain, and $\lambda \in \mathbb{R}^k_+$ satisfying $\sum_{i=1}^{k} \lambda_i = 1$ be weights. Then
$$f\left(\sum_{i=1}^{k} \lambda_i x^{(i)}\right) \le \sum_{i=1}^{k} \lambda_i f(x^{(i)}). \tag{2.4}$$
Continuous + bounded sublevel sets? Lower/upper semicontinuous. To understand
what these mean geometrically, perhaps it's helpful to picture them in terms of the epigraph.
Closed functions.
2.3.2 Subdifferentials
some more I bet
3 Convex Programs
minimum vs minimal
Theorem 3.1 (Fermat, 1600s). Let $f : A \to \mathbb{R}$ be some function, and suppose that
$x^*$ is a local extremum for $f$. If $f$ is differentiable at $x^*$, then
$$\nabla f(x^*) = 0. \tag{3.1}$$
When condition (3.1) is violated, we have a non-zero gradient and thus a direction
of descent/ascent. As we'll see, this leads to gradient descent.
For convex functions we have even stronger conditions. The most important property
is this very simple one:
Theorem 3.2. Let $f$ be a convex function. Then every local minimizer of $f$ is a global minimizer.
Most algorithms are only capable of finding local extrema, but in convex optimiza-
tion, this turns out to be equivalent to finding global extrema. We can use this theorem,
as well as some facts about convex functions, to strengthen Fermat’s theorem:
Theorem 3.3. A point $x^* \in \operatorname{dom} f$ is a global minimizer for $f$ if and only if $0 \in \partial f(x^*)$.
Of course, if $x^* \in \operatorname{int} \operatorname{dom} f$ and $f$ is differentiable at $x^*$, then $\partial f(x^*) = \{\nabla f(x^*)\}$
and we recover Fermat's original theorem. So what happens if $x^* \notin \operatorname{int} \operatorname{dom} f$?
Tangent cones, that one condition from homework? Rockafellar–Pshenichnyi
Weierstrass Theorem
Lagrange multipliers: introduce it MATH 247 style and say we'll talk more about it
with duality
Maximizing convex functions
3.3 Duality
Duality, as a concept, is loosely defined as looking at an object in two ways. For
example, we can analyze a signal with respect to either the frequency domain or the
time domain. A compact convex set can be regarded as the union of a bunch of points
(REFERENCE ABOVE), or as the intersection of a bunch of halfspaces (REFERENCE
ABOVE). Here, we'll define several different notions of duality to help us understand
convex functions and convex programs.
Picture a zero-sum game played between two opponents X (Xavier) and Y (say,
Yvette), where g is the profit function for Y . That is, if X plays the move x and Y
plays the move y, then g(x, y) is the (possibly negative) money paid out to Y from
X. Player Y seeks to maximize her profit, while player X seeks to minimize his losses.
There is a first player disadvantage in this game: the first player to commit is worse
off, as the second player can adapt their strategy in response.
Proposition 3.4 (Weak minimax). Let $g : M \times N \to \mathbb{R}$ be any function. Then
$$\min_{x \in M} \max_{y \in N} g(x, y) \ge \max_{y \in N} \min_{x \in M} g(x, y).$$
Example 3.5 (Rock Paper Scissors). Alphonse and Beryl are playing a game of rock
paper scissors, in which the loser pays the winner one Canadian Peso. From Beryl’s
perspective (Beryl takes the role of Y above), she has the following payoff matrix,
with rows indexed by Beryl's move and columns by Alphonse's:
$$G = \begin{array}{c|ccc} & \text{Rock} & \text{Paper} & \text{Scissors} \\ \hline \text{Rock} & 0 & -1 & 1 \\ \text{Paper} & 1 & 0 & -1 \\ \text{Scissors} & -1 & 1 & 0 \end{array} \tag{3.5}$$
Alphonse and Beryl must both choose a vector from the set
$$M = N = \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\}, \tag{3.6}$$
and if Alphonse chooses $x$ and Beryl chooses $y$, the payout for Beryl is
$$g(x, y) = y^\top G x. \tag{3.7}$$
By Proposition 3.4 (as well as common sense), the first person to reveal their choice is
significantly disadvantaged (since the other person would just pick a winning match-up),
and we can compute
$$\max_{y \in N} \min_{x \in M}\ y^\top G x = -1 < 1 = \min_{x \in M} \max_{y \in N}\ y^\top G x. \tag{3.8}$$
We can interpret this game-theoretic perspective as some sort of duality. By defining
$F(x) = \max_{y \in N} g(x, y)$ as a sort of "dual objective function" to $f(y) = \min_{x \in M} g(x, y)$
(i.e., your opponent's objective function is the dual of your own), we can see that
Proposition 3.4 resembles some sort of weak duality statement à la linear programming.
Remarkably, however, sometimes it doesn’t matter who goes first: with optimal
play, the disadvantage of revealing your plan early is nonexistent. John von Neumann
was the first to publish a strong duality minimax theorem, which many regard as the
start of game theory.
Theorem 3.6 (von Neumann, 1928, slightly modified). Let $M$ and $N$ be compact
convex sets, and $g : M \times N \to \mathbb{R}$ be a continuous function satisfying
(i) $g(\cdot, y) : M \to \mathbb{R}$ is convex for each fixed $y \in N$;
(ii) $g(x, \cdot) : N \to \mathbb{R}$ is concave for each fixed $x \in M$.
Then
$$\min_{x \in M} \max_{y \in N} g(x, y) = \max_{y \in N} \min_{x \in M} g(x, y). \tag{3.9}$$
Theorem 3.7 (Sion, 1958). Let $M$ and $N$ be convex sets, with at least one of them
compact, and $g : M \times N \to \mathbb{R}$ be an l.s.c. quasi-convex function in $x \in M$ and a u.s.c.
quasi-concave function in $y \in N$. Then
$$\min_{x \in M} \max_{y \in N} g(x, y) = \max_{y \in N} \min_{x \in M} g(x, y). \tag{3.10}$$
Before we prove this theorem, it's important to note that (3.10) may no longer hold if
any of the preconditions are false. Let's look at a few case studies with $g(x, y) = x + y$.
Example 3.8. Sometimes, the second to play gains a lot! Consider
$$\min_{x \in \mathbb{R}} \max_{y \in \mathbb{R}}\ x + y = +\infty, \tag{3.11}$$
while
$$\max_{y \in \mathbb{R}} \min_{x \in \mathbb{R}}\ x + y = -\infty. \tag{3.12}$$
This example also illustrates the necessity of the compactness condition in Theorem 3.7.
Example 3.9. On the other hand, if we do enforce compactness, then Sion's Theorem
holds as we'd expect:
$$\min_{x \in \mathbb{R}} \max_{0 \le y \le 1}\ x + y = -\infty = \max_{0 \le y \le 1} \min_{x \in \mathbb{R}}\ x + y. \tag{3.13}$$
Example 3.10. Some commentary on how Sion's Theorem is not necessary GOES
HERE
$$\min_{x \in \mathbb{R}} \max_{y \le 0}\ x + y = -\infty = \max_{y \le 0} \min_{x \in \mathbb{R}}\ x + y. \tag{3.14}$$
We don't have compactness in either set, but the conclusion of Sion's Theorem still holds.
Example 3.11 (von Neumann's Zero-Sum Game). Alphonse and Beryl realize that
they should play according to a probability distribution, etc., etc., generalize. Let $\Delta_n$
be the standard simplex in $\mathbb{R}^n$; then $x, y \in \Delta_n$ represent probability distributions over
the pure moves, something called a mixed strategy. (A computational sketch follows below.)
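As a sketch of the computation (assuming SciPy; the formulation is the standard LP for the row player's optimal mixed strategy, with $G$ as in (3.5)): Beryl solves $\max v$ subject to $(G^\top y)_j \ge v$ for all $j$, $\sum_i y_i = 1$, $y \ge 0$.

# Optimal mixed strategy for rock-paper-scissors via linear programming.
import numpy as np
from scipy.optimize import linprog

G = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])
m, n = G.shape

# variables: (y_1, ..., y_m, v); linprog minimizes, so we minimize -v
c = np.append(np.zeros(m), -1.0)
A_ub = np.hstack([-G.T, np.ones((n, 1))])   # encodes v - (G^T y)_j <= 0
b_ub = np.zeros(n)
A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
b_eq = np.array([1.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * m + [(None, None)], method="highs")
y, v = res.x[:m], res.x[m]
print(y, v)   # the uniform strategy (1/3, 1/3, 1/3) with game value 0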
3.3.2 Lagrangian Duality
Consider a general non-linear program (NLP) of the form
$$\begin{aligned} p^* = \min\ & f(x) \\ \text{s.t.}\ & g(x) \preceq_K 0 \in \mathbb{E}^m \\ & h(x) = 0 \in \mathbb{E}^p \\ & x \in \Omega. \end{aligned} \tag{3.16}$$
Definition 3.12. In this framework, we can define the Lagrangian function by
$$L(x, \lambda, \mu) := f(x) + \langle \lambda, g(x) \rangle + \langle \mu, h(x) \rangle. \tag{3.17}$$
Here, we introduce two new parameters λ and µ, often called the dual variables
(as they will become the variables in our dual program), or sometimes simply the
Lagrange multipliers. If we are working in Rn , we usually write the more familiar form
$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{i=1}^{p} \mu_i h_i(x), \tag{3.18}$$
although it is slightly less revealing. You really should think of the Lagrangian as an
affine functional of $\lambda$ and $\mu$. We can recover our original NLP by solving the following
unconstrained problem:
$$p^* = \min_{x \in \Omega}\ \max_{\lambda \succeq_{K^+} 0,\ \mu}\ L(x, \lambda, \mu). \tag{3.19}$$
You should convince yourself why these problems are equivalent, and in particular, why
the Lagrange multipliers guarantee feasibility in (3.16). By reversing the order of play
(by Proposition 3.4) we immediately get a statement of weak duality:
$$p^* = \min_{x \in \Omega}\ \max_{\lambda \succeq_{K^+} 0,\ \mu}\ L(x, \lambda, \mu)\ \ge\ \max_{\lambda \succeq_{K^+} 0,\ \mu}\ \min_{x \in \Omega}\ L(x, \lambda, \mu). \tag{3.20}$$
Let's rewrite this a bit more succinctly. Define the dual functional of the NLP by
$$\phi(\lambda, \mu) := \min_{x \in \Omega} L(x, \lambda, \mu). \tag{3.21}$$
We rewrite (3.20) and define the dual problem, which will form the basis of Lagrangian
relaxation.
Definition 3.13. Given a program in the form (3.16), we can define the dual program
$$d^* = \max_{\lambda \succeq_{K^+} 0,\ \mu}\ \phi(\lambda, \mu). \tag{3.22}$$
3.3.3 Strong Duality
Consider the abstract convex program (ACP):
$$p^* = \min\ \{ f(x) : g(x) \preceq_K 0,\ x \in \Omega \}. \tag{3.23}$$
Here, $K$ is a closed convex cone, $\Omega \subseteq \mathbb{E}$ is a convex set (the domain), $f : \Omega \to \mathbb{R}$ is a
convex function, and $g : \Omega \to \mathbb{F}$ is a $K$-convex function on $\Omega$. Then the Lagrangian of
(3.23) is
$$L(x, \lambda) := f(x) + \langle \lambda, g(x) \rangle, \tag{3.24}$$
with dual functional
$$\phi(\lambda) = \min_{x \in \Omega} L(x, \lambda), \tag{3.25}$$
and weak duality
$$p^* \ge d^* := \max_{\lambda \succeq_{K^+} 0} \phi(\lambda). \tag{3.26}$$
We say that Slater's constraint qualification (CQ) holds for (ACP) if there exists a
strictly feasible point $\hat{x} \in \Omega$, i.e.,
$$g(\hat{x}) \prec_K 0. \tag{3.27}$$
Remark 3.17. This is also a reason why we include the domain $\Omega$ in the convex
problem. If Slater's CQ fails to hold, then sometimes we can modify the constraints
and the set $\Omega$ appropriately to introduce strict feasibility [13].
And, as alluded to,
Theorem 3.18 (Strong Duality). Suppose that $p^*$ is finite for (ACP), and that Slater's
CQ holds. Then there exists a $\lambda^* \in K^+$ such that
$$p^* = \phi(\lambda^*) = \min_{x \in \Omega} L(x, \lambda^*). \tag{3.28}$$
In other words, we have strong duality in the sense that $p^* = d^*$; there is no duality
gap. Moreover, if $x^*$ is optimal in (ACP), then it is optimal in (3.28) as well, and
some commentary about how this strong duality doesn't guarantee attainment, and
also of feasibility.
Theorem 3.19. Theorem 5.1.12 in Henry's notes
Proof. Rockafellar–Pshenichnyi
[13] This is the idea behind facial reduction. INSERT SOURCES HERE
3.3.4 Optimality Conditions
Existence of KT vectors, Slater’s Condition, KT vector implies compactness which
allows us to use Sion’s theorem.. kkt conditions, idk
The idea is simple. Locally, around $x^{(n)}$, we know that $f$ is quite well approximated
by a linear function. If we picked a direction $d$, then we could write
$$g(t) := f(x + td) = f(x) + t\,\nabla f(x)^\top d + o(\|td\|). \tag{4.2}$$
There are two immediate questions we need to answer: what choice of t (called the
step size) and what choice of d (called the descent direction) do we want?
Cauchy suggested that we should choose the direction on which f decreases the
fastest, the steepest direction, so to speak. In other words, we would like to minimize
the directional derivative $\nabla f(x)^\top d$. To do this, we'd like to solve the subproblem:
$$\begin{aligned} \min\ & \nabla f(x)^\top d \\ \text{s.t.}\ & \|d\|_2 = 1. \end{aligned} \tag{4.3}$$
Constraint qualification holds here, so we can use Lagrange multipliers. The Lagrangian
is
$$L(d, \lambda) = \nabla f(x)^\top d + \lambda(1 - \|d\|^2), \tag{4.4}$$
with derivative (with respect to $d$)
$$0 = \nabla_d L(d, \lambda) = \nabla f(x) - 2\lambda d. \tag{4.5}$$
If $\nabla f(x) = 0$, then by Theorem 3.3 $x$ would be a local optimum, and we'd be done.
Otherwise $\nabla f(x) \neq 0$, and we can solve the system of equations to conclude
$$\lambda = \pm \tfrac{1}{2}\|\nabla f(x)\|, \qquad d = \pm \frac{\nabla f(x)}{\|\nabla f(x)\|}. \tag{4.6}$$
We are looking to minimize, so we conclude that $d = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$ is the direction of steepest
descent. Now we must decide on a suitable step length $t$ to guarantee convergence.
Beware of skiing: show what happens if the step size is too small or too large; have an
example which shows why it's not always the best choice to go to the minimum of $g$.
(A backtracking sketch follows below.)
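Here is a minimal sketch of steepest descent with a backtracking (Armijo) line search, which guards against both failure modes (assumptions: NumPy, a differentiable $f$, a user-supplied gradient; the constants are common defaults, not canonical):

# Steepest descent with Armijo backtracking line search.
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, beta=0.5, sigma=1e-4,
                     tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # Fermat: stationary point found
            break
        d = -g / np.linalg.norm(g)           # direction of steepest descent
        t = alpha0
        # backtrack until the Armijo sufficient-decrease condition holds
        while f(x + t * d) > f(x) + sigma * t * (g @ d):
            t *= beta
        x = x + t * d
    return x

# usage: minimize f(x) = ||x - (1, 2)||^2
f = lambda x: np.sum((x - np.array([1., 2.])) ** 2)
grad = lambda x: 2 * (x - np.array([1., 2.]))
print(gradient_descent(f, grad, np.zeros(2)))   # approaches (1, 2)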
5.2 Newton’s Method Revisited
5.3 Infeasible Start Newton’s Method
5.4 Penalty Function
5.5 Implementation Details
$$\begin{aligned} p^* = \min\ & f(x) \\ \text{s.t.}\ & g(x) \ge 0 \\ & h(x) = 0 \\ & x \in \Omega. \end{aligned} \tag{6.2}$$
Recall that, for a general set $S \subseteq \mathbb{E}$, we can define an indicator function $I_S : \mathbb{E} \to \mathbb{R} \cup \{+\infty\}$ with
$$I_S(x) = \begin{cases} 0 & x \in S \\ +\infty & x \notin S. \end{cases} \tag{6.3}$$
Then if we let
$$F = \{x \in \mathbb{R}^n : g(x) \ge 0,\ h(x) = 0\} \tag{6.4}$$
be the feasible region, the unconstrained problem
$$\min_x\ f(x) + I_F(x) \tag{6.5}$$
is equivalent to our original problem. This seems to work out nicely, but unfortunately
the gravy train stops here. The indicator function is not continuous, meaning we won’t
be able to use most of our developed algorithms (subgradient method doesn’t work
well either, as we are not convex). We’ll have to introduce soft approximations of these
indicator functions instead.
INTRODUCE PENALTY BARRIER with pictures. SPLIT UP EQUALITY AND
INEQUALITY CONSTRAINTS?
Then we can define the joint penalty-barrier function
$$P_\mu(x) := f(x) + \frac{1}{2\mu}\|h(x)\|^2 - \mu \sum_i \log g_i(x). \tag{6.6}$$
Here, the second term is called the quadratic penalty term, and the third term is
called the log barrier term. We notice that Pµ (x) is only defined on the interior
of the feasible set (points satisfying g(x) > 0). So if we start at an interior point,
then successive iterations will also be at interior points (hence the name interior point
method). Consider the corresponding optimization problem
$$\min_x\ P_\mu(x); \tag{6.7}$$
we can see that the quadratic penalty term encourages feasibility of the equality
constraints, while the log barrier term encourages us to stay away from the boundary of
the set. As $\mu$ decreases to 0, the penalty increases and forces $h(x) = 0$, and the barrier
decreases and allows $g(x)$ to get closer to the boundary, while increasing the
influence of the objective function $f$. Formally,
Theorem 6.1 (Penalty-Barrier Global Convergence). Let $\{\mu_k\}_{k \ge 1}$ be a sequence
approaching 0 from above, and $x_{\mu_k}$ be the corresponding optimal solution to (6.7). Then
every limit point $x^*$ of the sequence $\{x_{\mu_k}\}_{k \ge 1}$ is a solution to (6.2).
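A minimal sketch of the resulting scheme (assuming SciPy; the toy problem and all names are mine): solve $\min_x P_\mu(x)$ for a decreasing sequence $\mu_k$, warm-starting each solve at the previous solution.

# Penalty-barrier path following on a toy problem:
#   min x1 + x2  s.t.  g(x) = x1 >= 0,  h(x) = x2 - 1 = 0   (optimum (0, 1))
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]
g = lambda x: np.array([x[0]])
h = lambda x: np.array([x[1] - 1.0])

def P(x, mu):
    gx = g(x)
    if np.any(gx <= 0):                       # the barrier is +inf outside int F
        return np.inf
    return f(x) + np.sum(h(x) ** 2) / (2 * mu) - mu * np.sum(np.log(gx))

x = np.array([1.0, 0.0])                      # strictly feasible in g
mu = 1.0
for _ in range(12):
    x = minimize(lambda z: P(z, mu), x, method="Nelder-Mead").x
    mu *= 0.5                                 # push the iterates toward optimality
print(x)                                      # approaches the solution (0, 1)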
Exercise 6.2. The log barrier is just one of many barrier functions we could have
chosen in REFERENCE. Consider blah
$$f(x)?? - \sum_{j=1}^{p} \frac{1}{g_j(x)}. \tag{6.8}$$
6.2 Barrier Methods
We know how to convert constrained optimization problems to unconstrained ones.
Now, we shall adapt the algorithms we've developed for unconstrained optimization to
these new barrier problems. Given our convergence result above, it may be tempting to just
pick a very small $\mu$ and solve the corresponding barrier subproblem,
$$\min_x\ P_\mu(x) = f(x) + \frac{1}{2\mu}\|h(x)\|^2 - \mu \sum_i \log g_i(x). \tag{6.9}$$
$$\begin{aligned} \min\ & c^\top x \\ \text{s.t.}\ & Ax = b \\ & x \ge 0, \end{aligned} \qquad\qquad \begin{aligned} \max\ & b^\top y \\ \text{s.t.}\ & A^\top y \le c, \end{aligned} \tag{6.10}$$
although we'll usually write the dual in terms of a slack variable $z$:
$$\max\ b^\top y \quad \text{s.t.}\ A^\top y + z = c,\ z \ge 0. \tag{6.11}$$
From a high-level perspective: the interior-point method will generate a sequence of
strictly feasible points $x^{(i)}, y^{(i)}, z^{(i)}$ with $x^{(i)} > 0$, $z^{(i)} > 0$, converging to the optimal
solution (these points are in the interior [18] of the feasible region, hence the name of
the algorithm). In practice, we can get within $10^{-8}$ of the optimal solution after 10-50
(expensive) iterations. In fact, we shall show that just $O(n \log \frac{1}{\varepsilon})$ iterations are enough
to get within a $(1 + \varepsilon)$ factor of the optimal value (actually, interior-point algorithms
needing just $O(\sqrt{n} \log \frac{1}{\varepsilon})$ iterations exist, but they are slower in practice).
For now, suppose that we only have the primal problem (we shall derive the dual
and slack variables along the way):
$$\begin{aligned} \min\ & c^\top x \\ \text{s.t.}\ & Ax = b \\ & x \ge 0. \end{aligned} \tag{6.12}$$
$$A^\top y + \mu X^{-1} e = c \tag{6.15}$$
$$Ax = b. \tag{6.16}$$
Here, $X = \operatorname{diag}(x)$, and so on. We notice that by slightly rearranging (6.15) we obtain the
perturbed complementary slackness conditions,
$$x_i (c - A^\top y)_i = \mu \tag{6.17}$$
for all $i \in [n]$ (resembling the complementary slackness conditions we are used to with
linear programs, except with $\mu$ on the RHS instead of 0).
[18] Well, actually the relative interior.
Let's call $z := c - A^\top y$. Rearranging the KKT equations a bit, we get the perturbed
optimality equations
$$\begin{pmatrix} 0 & A^\top & I \\ A & 0 & 0 \\ Z & 0 & X \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta z \end{pmatrix} = - \begin{pmatrix} A^\top y + z - c \\ Ax - b \\ Zx - \mu e \end{pmatrix}. \tag{6.21}$$
By solving this system we obtain directions ∆x, ∆y, ∆z of descent. After taking an
appropriate step size to remain strictly feasible, we update our points x, y, z. Finally,
we update µ, usually by setting µ ← σµ for some fixed σ ∈ (0, 1), and push the solution
towards optimality.
INSERT PSEUDOCODE?
Insert picture!!
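In lieu of pseudocode for now, here is a dense, unoptimized sketch of the loop just described (assumptions: NumPy, strictly feasible starting points, and a direct solve of the full system (6.21) rather than the reduced one; the parameter values are illustrative):

# Primal-dual interior-point sketch for min c^T x s.t. Ax = b, x >= 0.
import numpy as np

def ipm_lp(A, b, c, x, y, z, sigma=0.1, tol=1e-8, max_iter=50):
    m, n = A.shape
    mu = (x @ z) / n
    for _ in range(max_iter):
        if mu < tol:
            break
        # assemble and solve the perturbed KKT (Newton) system (6.21)
        K = np.block([[np.zeros((n, n)), A.T,              np.eye(n)],
                      [A,                np.zeros((m, m)), np.zeros((m, n))],
                      [np.diag(z),       np.zeros((n, m)), np.diag(x)]])
        rhs = -np.concatenate([A.T @ y + z - c, A @ x - b, x * z - mu])
        d = np.linalg.solve(K, rhs)
        dx, dy, dz = d[:n], d[n:n + m], d[n + m:]
        # damped step size keeping x and z strictly positive
        ratios = [1.0] + list(-x[dx < 0] / dx[dx < 0]) + list(-z[dz < 0] / dz[dz < 0])
        alpha = 0.9995 * min(ratios)
        x, y, z = x + alpha * dx, y + alpha * dy, z + alpha * dz
        mu = sigma * (x @ z) / n              # push the solution toward optimality
    return x, y, z

# usage: min x1 + 2 x2  s.t.  x1 + x2 = 1, x >= 0  (optimum (1, 0))
A = np.array([[1., 1.]]); b = np.array([1.]); c = np.array([1., 2.])
x0, y0 = np.array([.5, .5]), np.array([0.])
z0 = c - A.T @ y0                             # z0 = c > 0: strictly feasible start
print(ipm_lp(A, b, c, x0, y0, z0)[0])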
This is the general idea of the algorithm. There are a few details that we have yet to
take care of. Over the next few paragraphs, we’ll briefly comment on: what to choose
for initial values for x, y, z, µ; what step size to take when updating x, y, z; and how to
solve (6.21) relatively efficiently in practice. More detailed analysis will be attached in
following sections.
While in theory we could just pick any strictly feasible $x$ and $z$ (that is, $x, z > 0$),
in practice there are heuristics to follow when choosing initial feasible points (not too
close to the boundary, etc.); for LPs, however, most reasonable choices tend to work okay.
Choosing good initial values for $x, y, z$ is very important, and in general is a hard problem.
In non-linear programs, it is difficult to even find strictly feasible solutions. We'll discuss
this more when we talk about two-phase interior-point methods.
When possible, the best possible choice for $x$ is at (or near) the optimal solution.
Picking appropriate step sizes $\alpha$ and $\sigma$ is much easier. Again, there are lots of
heuristics to follow, and careful tuning of these parameters leads to faster algorithms;
however, the most important thing is to make sure you remain feasible at each step.
Finally, I'd like to devote a bit of time to exploring how we can efficiently solve
(6.21). Letting $r_d = A^\top y + z - c$, $r_p = Ax - b$, and $r_c = Zx - \mu e$ be the RHS, we can
perform some block Gaussian elimination. The first row yields
$$\Delta z = -r_d - A^\top \Delta y,$$
and substituting this into the third row gives $\Delta x = -Z^{-1}(r_c + X \Delta z)$. Finally,
substituting this into the second equation will give us a system
$$\left(A Z^{-1} X A^\top\right) \Delta y = -r_p + A Z^{-1}\left(r_c - X r_d\right).$$
Hence we've reduced our $2n + m$ variable system to one with just $m$ unknowns (and
usually $m \ll n$) and a symmetric LHS, leading to faster numerical methods. We can
recover $\Delta x, \Delta z$ by back substitution. (When these algorithms are actually implemented
there are more tricks we can do.)
In comparison to the simplex method, which takes many cheap iterations to converge,
primal-dual methods take fewer, more expensive steps. In particular, we need to solve
the perturbed KKT equations at each iteration, which is quite computationally
expensive. We'll talk more about the time complexity after we choose specifics for the
hyperparameters.
Exercise 6.3. We can also formulate the barrier subproblem in terms of the dual:
FROM CO 463 notes. Can you derive the same perturbed KKT conditions?
Exercise 6.4. The point of this exercise, as well as the three (COUNT?) that follow,
will be to derive a primal-dual interior point method for various quadratic optimization
problems, starting with
$$\begin{aligned} \min\ & q(x) = \tfrac{1}{2} x^\top Q x + c^\top x \\ \text{s.t.}\ & Ax = b \in \mathbb{R}^m \\ & x \ge 0,\ x \in \mathbb{R}^n. \end{aligned} \tag{6.25}$$
Here we are optimizing over an affine manifold, with an additional non-negativity
constraint. Derive a primal-dual interior-point algorithm to solve (6.25), by first adding
an appropriate log-barrier term, writing down the perturbed optimality conditions, and
computing a Newton direction and suitable step length towards optimality.
Implement your solution in your favourite scientific programming language. How
fast can you make it?
7 Applications
Reorder, maybe pick some of the better ones
7.1 Semidefinite Programming
7.1.1 Preliminaries
Semidefinite programming has been a speciality of the Waterloo C&O Department.
Semidefinite programs (SDPs) resemble linear programs, except that the variable is
taken in the space of positive semidefinite matrices, and the non-negativity constraints
are replaced by semidefiniteness constraints.
Although SDPs have been studied since at least the 1940s (under different names),
it wasn't until the late 1900s and early 2000s that we had efficient algorithms for
solving them. There are many diverse applications of SDPs, and hopefully I'll be able
to show you many of them [19].
We begin with the many equivalent formulations of semidefiniteness:
Proposition 7.4 (Schur Complement). The following are equivalent (for appropriately
sized $A$, $B$, $C$):
(i) $\begin{pmatrix} A & B \\ B^\top & C \end{pmatrix} \succ 0$
(ii) $A \succ 0$ and $C - B^\top A^{-1} B \succ 0$
(iii) $C \succ 0$ and $A - B C^{-1} B^\top \succ 0$
[19] Semidefinite programming is usually its own course at Waterloo, although some other schools (Stanford) cover it in a second convex optimization course. As a result this section may be way more in depth than some of the others.
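A quick numerical spot-check of Proposition 7.4 (a sketch, assuming NumPy; the particular matrices are arbitrary test data):

# Verify that the three Schur complement conditions agree on an example.
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2., .5], [.5, 1.]])       # A positive definite
B = rng.standard_normal((2, 2)) * 0.1
C = np.array([[3., 0.], [0., 2.]])       # C positive definite

M = np.block([[A, B], [B.T, C]])
pd = lambda S: np.all(np.linalg.eigvalsh(S) > 0)
print(pd(M),
      pd(C - B.T @ np.linalg.inv(A) @ B),
      pd(A - B @ np.linalg.inv(C) @ B.T))   # all three agree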
The primal SDP, in analogy with the standard-form LP, is
$$\begin{aligned} \min\ & \langle C, X \rangle \\ \text{s.t.}\ & \mathcal{A}X = b \\ & X \succeq 0, \end{aligned} \tag{7.1}$$
where $\mathcal{A}$ is determined by the covectors $\langle A_i, \cdot \rangle$. We can derive the dual SDP by
considering the Lagrangian primal:
$$p^* = \min_{X \succeq 0}\ \max_{y \in \mathbb{R}^m}\ \langle C, X \rangle + \langle y, b - \mathcal{A}X \rangle. \tag{7.2}$$
By weak duality (Proposition 3.4) we know that $p^* \ge d^*$. How can we determine when
strong duality holds?
interior point method
7.2 Least Squares
7.3 Quadratic Programming (and Support Vector Machines)
7.4 Quadratic Assignment Problem
7.5 Max Cut
7.6 Sensor Network Localization
7.7 ¿Neural Networks?
Unfortunately, the techniques here are not very sophisticated, but deep learning is a meme... Deep
learning as a field is somewhat like alchemy was in the 16th century: there is a lot of
stuff that seems to work, but we really have no idea why. We begin with momentum,
an interesting idea inspired by physics, even if it isn't mathematically supported.
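A minimal sketch of the heavy-ball update (assuming NumPy; the hyperparameters are typical defaults, not canonical): the velocity accumulates past gradients, like a ball rolling downhill with inertia.

# Gradient descent with (heavy-ball) momentum.
import numpy as np

def momentum_gd(grad, x0, lr=0.1, beta=0.9, iters=200):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        v = beta * v - lr * grad(x)   # inertia plus a fresh gradient push
        x = x + v
    return x

# usage: a badly conditioned quadratic where momentum helps
grad = lambda x: np.array([1.0 * x[0], 25.0 * x[1]])
print(momentum_gd(grad, [1.0, 1.0]))  # approaches the minimizer (0, 0)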
8 Additional Enrichment
8.1 Convex Optimization on Manifolds
A Appendix I: Prerequisites
A.1 Linear Algebra
Linear algebra is the only field of mathematics which is understood [20]. It is a beautiful
theory which forms the foundation of mathematics, including of course optimization.
As a consequence, it is crucial that you understand it.
I won't go over the basic definitions; I think it's fair to assume that you've taken
a first course in linear algebra, and so you know about vector spaces, bases, linear
transformations, rank and nullity, range and nullspace (or kernel, as I'll call it), and basic
properties of eigenvalues and eigenvectors. In particular, it is of vital importance
that you think of vectors as abstract coordinate-free objects rather than n-tuples of
scalars or pointed arrows.
In this appendix, $V$ will be an arbitrary finite-dimensional vector space over the
real numbers $\mathbb{R}$; capital letters will represent linear transformations; and lower-case letters
from the earlier part of the alphabet will represent scalars, while those from the latter
part will represent vectors. I'll pick and choose the relevant parts of linear algebra for
optimization. In particular, I will not be covering dual spaces, geometry, scalar fields
other than $\mathbb{R}$, and so on. This stuff should really be review; you should definitely take
more linear algebra courses if it isn't.
[20] In the extremely crude sense that we have answers to most of the questions.
A.2 Calculus
The world is not linear [21]. However, the nonlinear stuff is usually pretty well
approximated by the linear stuff, and even better by higher-order approximations. We'll
formalize this notion and more with ideas from differential [22] calculus, the study of
change.
I’ll assume that the reader is familiar with many fundamental ideas covered between
pre-calculus and freshman differential calculus. You should know and be familiar with
the many definitions and characterizations of the real numbers; the epsilon-delta def-
inition of a limit; the definition and properties of the derivative of a function in one
variable; the intermediate value and extreme value theorems; the mean value theorem
and Taylor’s theorem.
I'll cover multivariate differential calculus [23], as well as bits and pieces of real analysis
which we will need.
[21] Most unfortunately.
[22] Integral calculus doesn't really appear that much, at least at the level of these notes. I do want to write more one day on differential geometry.
[23] Usually taught in a third calculus class.