
Convex Optimization and Analysis

Jeffrey Liu
A quick warning...
These notes are incomplete and subject to mass reorganization and editing. However, chapters 4-6 are fairly readable in their current state, and they are the most important. When I'm done, chapter 7 will probably be three times the length of the other chapters, but by the time I finish it will be the most interesting.

Also, all credit to Fabrizio Conti for the brilliant cover photo, retrieved from Unsplash.
Hope it helps the reader visualize descent algorithms :).

Foreword
Hi, Jeffrey here. Today is St Patrick's Day, 2020, but this year I won't be drinking my problems away on Ezra. I'm not sure when in the future you, if anyone, will be reading this, so I'll forgive you if you don't remember that this was the year of the COVID-19 pandemic. My classes were suspended last Friday, and I've spent the better part of the break so far playing video games and watching YouTube [1].
To keep my sanity over the next couple months, I’ve also decided to start a collection
of notes for some of my more favoured subjects.
Convexity is a beautiful property that lends itself to many powerful results in algebra, analysis, and geometry. Applications have been found in many branches of pure mathematics, engineering, finance, computer science, physics, and of course optimization. I'll be basing these notes on the extensive literature in the subject, in particular Boyd and Vandenberghe's Convex Optimization and Wolkowicz's CO 463 course notes [2], among others.
In contrast to these sources, however, I'll be exploring the subject from the perspective of a (not exceptionally bright :p) undergraduate, so I'll try my best to motivate the material. I'll also look to introduce non-convex optimization, especially towards numerical methods and their applications in machine learning.
I hope that these notes will form a brief yet extensive overview of the subject, perhaps as a companion for a first graduate-level course in convex or nonlinear optimization. Of course, I'm still a student (read: I'm not an expert), so expect room for improvement. A quick note: some elementary linear algebra and calculus is expected, although I'll try to have an appendix with some prerequisite theorems.
Enjoy [3]!

[1] If my parents or any future professors or employers ever read this, I was also reading lots of textbooks.
[2] I'll be following the course notes the closest, but with my own commentary and editorials.
[3] I'd like to thank Fabrizio Conti again for the brilliant cover photo, retrieved from Unsplash. Also, special thanks to my CO 463 professor Henry Wolkowicz, as well as my mom and dad for their continued belief in me.

Contents
1 Convex Geometry 9
1.1 Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.2 Operations Preserving Convexity . . . . . . . . . . . . . . . . . . 10
1.1.3 Convex Hulls and Carathéodory’s Theorem . . . . . . . . . . . . 11
1.2 Affine sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Geometry with Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.1 Extreme Points and Faces . . . . . . . . . . . . . . . . . . . . . . 13
1.3.2 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.3 Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2 Partial Orders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.3 The Dual Cone . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Convex Functions 17
2.1 Preliminary Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 As a vector space . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.2 Elementary Properties . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Other Types of Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Quasiconvexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Strong Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Strict Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4 K-Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Calculus with Convex Functions . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Subdifferentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Convex Programs 20
3.1 The Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Minimax Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Lagrangian Duality . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.3 Strong duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.4 Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.5 Conjugate Duality . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4 Algorithms for Unconstrained Problems 26


4.1 First Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.2 Subgradient Methods . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2 Second Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 Trust Region Methods . . . . . . . . . . . . . . . . . . . . . . . . 27

5 Equality constrained minimization 27


5.1 Equality Constrained Quadratic Programs . . . . . . . . . . . . . . . . . 27
5.2 Newton’s Method Revisited . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Infeasible Start Newton’s Method . . . . . . . . . . . . . . . . . . . . . . 28
5.4 Penalty Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Interior Point Methods 28


6.1 Penalty and Barrier problems . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Barrier Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.3 Primal-dual interior-point Methods . . . . . . . . . . . . . . . . . . . . . 30
6.4 Linear and Semidefinite Programs . . . . . . . . . . . . . . . . . . . . . 30
6.5 Feasibility and Two Phase Methods . . . . . . . . . . . . . . . . . . . . 33
6.6 Step Size and Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . 33

7 Applications 33
7.1 Semidefinite Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.1.2 Semidefinite Programming . . . . . . . . . . . . . . . . . . . . . . 35
7.2 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3 Quadratic Programming (and Support Vector Machines) . . . . . . . . . 36
7.4 Quadratic Assignment Problem . . . . . . . . . . . . . . . . . . . . . . . 36
7.5 Max Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.6 Sensor Network Localization . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.7 ¿Neural Networks? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

8 Additional Enrichment 36
8.1 Convex Optimization on Manifolds . . . . . . . . . . . . . . . . . . . . . 36

A Appendix I: Prerequisites 36
A.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
A.2 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.2.1 The first derivative . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.2.2 The other derivatives . . . . . . . . . . . . . . . . . . . . . . . . . 37
A.3 Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

B Appendix II: Proofs and Sketches 37

Notation
I try to use standard notation; anything that may be controversial will be listed here.
If you are unfamiliar with any of these definitions, please read the appendix.

Sets and Vector Spaces


N                The set of non-negative integers {0, 1, 2, ...}
[n]              The set of integers {1, 2, ..., n}
Z, Z_+, Z_++     The set of integers, resp. the non-negative and positive integers
R, R_+, R_++     The set of real numbers, resp. the non-negative and positive reals
R^n              The vector space of column vectors with n real entries, together with the standard inner product
E, E^n           Any Euclidean vector space [4] (finite dimensional real inner product space), resp. of dimension n if specified
M^n              The vector space of n × n square matrices with real entries, together with the Frobenius inner product (⟨A, B⟩ = tr(A^T B))
S^n_+, S^n_++    The subset of M^n consisting of positive semidefinite (resp. positive definite) matrices
ETC TODO

Vectors and Matrices


x^T              The transpose of x
I, I_n           The identity matrix (resp. of dimension n if specified)
x_k              The kth entry of x
x^(k)            The kth element of a sequence of vectors {x^(k)}_{k≥0}
diag(x), X       The square matrix with x along the main diagonal and zeros everywhere else
‖x‖_p            The ℓ_p-norm of x
‖x‖              The induced norm of x (the square root of the inner product)
e                The vector of all 1's, with dimension implied by context
e_i              The ith vector in the standard basis of R^n

Abuse of Notation
min_{x∈Ω} f(x)   The infimum of f(x) as x ranges over Ω
max_{x∈Ω} f(x)   The supremum of f(x) as x ranges over Ω

[4] An astute reader may note that all n-dimensional Euclidean vector spaces are isomorphic to R^n, and may ask why we bother distinguishing them. Honestly, I think the only reason is so that we can be lazy and always assume R^n refers to that specific vector space with the standard inner product.

Introduction
Note: I’ll put some pictures here eventually.
Historically, mathematicians have preferred structural results over numerical ones. Greats such as Euler, Gauss, and Riemann devoted themselves to the creation of beautiful theories in algebra, analysis, geometry, and number theory. It wasn't until the 20th century, with the invention and popularization of the computer, that the study of algorithms and computation took off.
Combinatorics and Optimization are two fields which blossomed in this new age. Many computational subfields of combinatorics, including those of graph theory, matroid theory, and polyhedral theory, have only been developed recently. The simplex method, published by Dantzig in 1947, was the first of many algorithms with promising practicality, and with it came the field of operations research.
Convexity, as a geometric property, has been known since antiquity. Its properties were investigated by the likes of Euler and Cauchy; however, its true potential wasn't realized until the late 1800's, when the German mathematician Minkowski was able to apply it to number theory. He and fellow German Brunn developed much of the theory in two and three dimensions. Carathéodory, Krein, Milman, and Fenchel, among many others, developed and generalized much of the theory from the turn of the century until about the second world war. By 1970 or so, all of the convexity theory we will require had been discovered.
The simplex method was a huge landmark in optimization with remarkable practical efficiency; however, no polynomial-time variation has ever been found. Efforts to find a provably polynomial-time algorithm led to the discovery of the ellipsoid method in the 70's. In 1984, the Indian mathematician Narendra Karmarkar proposed the first poly-time interior-point method for linear programming while working for Bell Labs, and non-linear adaptations have been an active field of research since the late 1980's.
Since the 90's, applications have been found in the traditional domains of operations research, and also in engineering (robotics, signal processing, circuit design, ...); computer science (machine learning); the physical sciences; and finance.
We'll first discuss convex geometry and its natural extension to functions. With this, we'll be able to analyze convex programs and develop optimality conditions. These conditions, when violated, give rise to algorithms. Finally, we'll conclude with as many interesting applications as I could find.
For better flow, I will not prove many of the propositions and theorems unless they are particularly insightful. My primary intention is for these notes to be a reference for myself, but I'll try my best to be a good teacher (plus I enjoy teaching and hope to do it more often). I'll have an appendix with hints and solutions for some of the trickier proofs.
Oh, one last thing: I’ll try to maintain a prerequisite DAG. So please don’t be
discouraged like I was and think that you have to read two chapters of geometry before
you even see an optimization problem. In fact, you may be able to understand many
of the algorithms by simply reading their description and googling any definitions you

haven’t heard of before.
But still, it’s really important to know the fundamentals. I’m really sorry that
sections 1 and 2 may get a little boring, but we need to learn to walk before we can
run.

1 Convex Geometry
We shall explore the main geometric structures which arise in convex optimization,
beginning with, of course, sets.

1.1 Convex Sets


1.1.1 Definitions
Although some notion of convexity has existed geometrically since at least Archimedes,
the modern definition of a convex set is actually an algebraic one:
Definition 1.1. A subset C ⊆ E is convex if
x, y ∈ C, λ ∈ [0, 1] =⇒ (1 − λ)x + λy ∈ C. (1.1)
We can see that, for fixed x and y, the quantity (1 − λ)x + λy traces out the closed line segment between x and y as λ ranges over the unit interval. Then Equation (1.1) is equivalent to the statement that the set C contains all of its line segments. There are a couple of alternate, equivalent formulations of convexity which may (or may not) help you visualize the definition:
Proposition 1.2. Let C ⊆ E. Then the following are equivalent:
(i) C is convex;
(ii) C contains all of its convex combinations; that is, if x^(1), ..., x^(k) ∈ C and λ_1, ..., λ_k ∈ R_+ with Σ_{i=1}^k λ_i = 1, then the convex combination [5]
Σ_{i=1}^k λ_i x^(i) (1.2)
belongs to C as well;
(iii) the intersection of C with any line is either the empty set or a connected interval.
Convexity is an extraordinarily simple condition; many everyday sets can easily be shown to be convex. We'll list a few examples:
Example 1.3. The empty set is convex (vacuously).
Example 1.4. All subspaces S of E are convex. In addition, translations of subspaces (S + x for some x ∈ E) are convex. These translations are called affine sets (more on this later!).
Example 1.5. All polyhedra, sets of the form (where A ∈ R^{m×n}, b ∈ R^m)
P = {x ∈ E : Ax ≤ b}, (1.3)
are convex. In particular, if m = 1 then we conclude that the closed halfspace
H = {x ∈ E : ⟨φ, x⟩ ≤ α} (1.4)
is convex as well. The open halfspace is also convex, although it is not a polyhedron.
[5] This is also called the weighted mean (from physics), and a generalization of barycentric coordinates. You can also think of the λ_i as forming a probability distribution.

1.1.2 Operations Preserving Convexity
There are several common operations which preserve convexity of a set. We’ll be able
to use these operations to build increasingly sophisticated sets from the basic examples
given above. These propositions all follow from definitions, and the reader is encouraged
to prove them on their own.

Proposition 1.6. Let C_i ⊆ E, i ∈ I, be a collection of convex sets, where I is a (potentially uncountably large) index set. Then the intersection
∩_{i∈I} C_i (1.5)
is convex.

In particular, the convex hull of a set S, possibly defined as the intersection of all
convex sets containing S, is convex. We’ll talk more about this later.

Proposition 1.7. Let C_1, C_2 ⊆ E be two convex sets, and α, β ∈ R_+. Then the Minkowski sum, defined by
αC_1 + βC_2 := {αx + βy : x ∈ C_1, y ∈ C_2}, (1.6)
is a convex set.

Proposition 1.8. Let C_i ⊆ E^{n_i} for i ∈ [m]. Then the Cartesian product defined by
C_1 × ··· × C_m = {(x_1, ..., x_m) ∈ E^{n_1} × ··· × E^{n_m} : x_i ∈ C_i, ∀i ∈ [m]} (1.7)
is a convex set. Conversely, if C ⊆ E^{n_1} × ··· × E^{n_m} is a convex set, then each projection
{x_i ∈ E^{n_i} : (x_1, ..., x_m) ∈ C} (1.8)
is a convex set as well, for i ∈ [m].

Proposition 1.9. Let A : E^n → E^m be an affine mapping. Then if C ⊆ E^n and D ⊆ E^m are convex sets, both the image A(C) and the pre-image A^{-1}(D) are convex sets.

Proposition 1.10. Let C ⊆ E be a convex set. Then the interior int C and closure
cl C are convex sets as well.

Actually, we shall see that we can strengthen Proposition 1.10 when we define the
relative interior, in case C has a “lower dimension” than E. But this will come later.

1.1.3 Convex Hulls and Carathéodory’s Theorem
Many problems encountered in nature will not be convex (and will usually be difficult to solve), so we'll formulate strategies for relaxing these problems so that they become convex. The simplest of these is the convex hull.

Definition 1.11. Let S ⊆ E be any set. Then the convex hull of S, denoted conv S,
is the smallest convex set containing S. That is, if C is any other convex set containing
S, then conv S ⊆ C.

It’s easy to show that the convex hull of S is the intersection of all convex sets
containing S. In fact, this also serves as a decent definition of the convex hull, as
by Proposition 1.6 we know that this intersection is convex. Some texts prefer the
following equivalent formulation of convex hull:

Proposition 1.12. Let S ⊆ E. Then
conv S = { Σ_{i=1}^k λ_i x^(i) : k ∈ Z_++, Σ_{i=1}^k λ_i = 1, λ ∈ R^k_+, x^(i) ∈ S }. (1.9)

That is, the convex hull of S is precisely the set of all convex combinations of points
from S. Now, despite the simplicity of the definition of convexity, we can build a very
rich theory to describe these objects. We can immediately state a beautiful result of
the subject.

Theorem 1.13 (Carathéodory, 1911). Let S ⊆ E^n. Then the convex hull of S is
{ Σ_{i=1}^{n+1} λ_i x^(i) : Σ_{i=1}^{n+1} λ_i = 1, λ ∈ R^{n+1}_+, x^(i) ∈ S }. (1.10)

Proof. We'll show that Carathéodory's Theorem follows from linear independence [6]. Insert proof here.
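As a quick numerical sanity check (a sketch of my own, assuming NumPy and SciPy are available; the point counts and data here are made up), we can ask a linear programming solver for hull weights. There are only n + 1 equality constraints, so a vertex of the feasibility polytope has at most n + 1 nonzero weights, which is exactly Carathéodory's bound:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, k = 2, 50                        # dimension n, with k >> n + 1 points
    S = rng.standard_normal((k, n))     # 50 arbitrary points in R^2
    lam = rng.random(k); lam /= lam.sum()
    p = lam @ S                         # p is a convex combination of all 50 points

    # Feasibility LP: find w >= 0 with sum(w) = 1 and S^T w = p. Simplex-type
    # solvers (HiGHS here) typically return a vertex, hence a sparse w.
    res = linprog(c=np.zeros(k), A_eq=np.vstack([S.T, np.ones(k)]),
                  b_eq=np.append(p, 1.0), bounds=(0, None))
    print(np.count_nonzero(res.x > 1e-9), "points used; bound is", n + 1)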

In other words, no matter how misbehaved S is as a set, any point in the convex
hull of S can be written as a convex combination of just n + 1 points from S. This
is remarkable, as it will often allow us to simplify arbitrarily many variables into just
n + 1. In fact, if S is sufficiently well behaved, we get an even stronger statement.

Theorem 1.14 (Fenchel and Bunt, YYYY?). Let S ⊆ E^n, and suppose that S has no more than n connected components. Then only n points are needed in Theorem 1.13.

There are several related theorems that are worth knowing, although they don't appear too often [7].
[6] Although there is a very nice proof using linear programming.
[7] Some of the proofs of these theorems have appeared as challenge problems on some of my exams, so beware.

Corollary 1.15 (A consequence of Carathéodory’s Theorem). The convex hull of a
compact set is compact.
Exercise 1.16. Beware! The compactness condition cannot be weakened: the convex
hull of a closed set may not be closed. Try to find a counterexample!
Theorem 1.17 (Helly, YYYY). Let C^(i) ⊆ E^n, i ∈ I, be a (potentially uncountable?) collection of compact convex sets. If every subcollection of n + 1 sets has a non-empty intersection, then
∩_{i∈I} C^(i) ≠ ∅. (1.11)

Theorem 1.18 (Radon, YYYY). Let {x^(1), ..., x^(n+2)} ⊆ E^n. Then there is a partition I_1, I_2 of the indices [n + 2] such that the convex hulls
C_1 = conv {x^(i) : i ∈ I_1},  C_2 = conv {x^(i) : i ∈ I_2} (1.12)
intersect: C_1 ∩ C_2 ≠ ∅.
Theorem 1.19 (Shapley-Folkman, YYYY). GOES HERE

1.2 Affine sets


Earlier, we alluded to the notion of an affine set, as well as the dimension and relative
interior of a set in E. We make these definitions now.
If we do not restrict λ to the interval [0, 1] in the definition of convexity, then we
get another definition:
Definition 1.20. A subset S ⊆ E is affine if

x, y ∈ S, λ ∈ R =⇒ (1 − λ)x + λy ∈ S. (1.13)

Intuitively, just as convex sets contain all of their convex combinations, we'd like an affine set to be one containing all of its affine combinations. However, unlike their convex counterparts, affine sets can be characterized extremely simply.
Proposition 1.21 (Different formulations of affine sets). Let S ⊆ E. Then the following are equivalent:
(i) S is an affine set, in the sense of Definition 1.20.
(ii) S is a linear manifold, i.e., there exist a linear transformation A : E → F and a vector b ∈ F so that
S = {x ∈ E : Ax = b}. (1.14)
(iii) There exist some d ∈ E and a linear transformation B : F → E so that
S = {x ∈ E : x = By + d, y ∈ F}. (1.15)
This is sometimes called the parametric form or nullspace representation of S.
(iv) S is the translation of a subspace, i.e., for x ∈ S the set S − x is a subspace of E.

Among these formulations, (1.14) is the most useful, i.e., affine sets are the solution sets of systems of linear equations, and the terms affine set and linear manifold will be used interchangeably.
However, (iv) inspires us to adapt linear independence to the affine case.
affine independence definition goes here, and some basic theorems
These tools form the backbone of what’s really happening in our proof of Theorem
1.13.
Dimension, Relative interior

1.3 Geometry with Convex Sets


1.3.1 Extreme Points and Faces
What is this used for? facial reduction and stuff... to be added...

Proposition 1.22. Let C ⊆ E be a non-empty closed convex set. Then ext C is non-empty if and only if C is pointed, i.e., C does not contain any lines.

We conclude this section with the interior description of a convex set:

Theorem 1.23 (Minkowski). Let C ⊆ E be a compact convex set. Then
C = conv ext C. (1.16)

In other words, the set of extreme points of C is the smallest set which is “good
enough” to determine C via its convex hull; it is the “shortest worker’s instruction for
building the set”.

1.3.2 Projections
Projections onto convex sets will be our main tools in proving hyperplane separation
theorems. They are a natural extension of projections onto subspaces from linear
algebra.
Recall that we can define an orthogonal projection P onto a subspace V ⊆ E as a linear transformation satisfying P^2 = P^T = P. AND THEN w = v + v^⊥, etc.
We’d like to define the projection onto a convex set similarly.

Definition 1.24. Let C ⊆ E be a non-empty closed convex set, and x ∈ E. Then define the projection of x onto C, denoted P_C(x), as the unique solution to
P_C(x) = arg min_{y∈C} ‖y − x‖. (1.17)
It takes a little work to see why P_C(x) always exists, and if it does, why it's unique.

Proposition 1.25 (Kolmogorov's Criterion, YYYY). Let C ⊆ E be a non-empty closed convex set, and x ∈ E. Then P_C(x), as defined in Definition 1.24, is well defined. In particular, y_x = P_C(x) if and only if
⟨x − y_x, y − y_x⟩ ≤ 0, ∀y ∈ C. (1.18)
Proof. ??

Moreau decomp
Here are a few nice exercises with projections.

Proposition 1.26. Let x, y ∈ E and let C ⊆ E be a non-empty closed convex set. Then
‖P_C(x) − P_C(y)‖ ≤ ‖x − y‖. (1.19)

Insert some more...
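In the meantime, here's a quick numerical check of these facts (a sketch of my own, assuming NumPy; the box C = [0, 1]^3 is chosen because its projection is just coordinate-wise clipping):

    import numpy as np

    def project_box(x):
        # Projection onto C = [0, 1]^n is coordinate-wise clipping.
        return np.clip(x, 0.0, 1.0)

    rng = np.random.default_rng(1)
    x, y = 3 * rng.normal(size=3), 3 * rng.normal(size=3)
    px, py = project_box(x), project_box(y)

    # Nonexpansiveness (Proposition 1.26): ||P(x) - P(y)|| <= ||x - y||.
    assert np.linalg.norm(px - py) <= np.linalg.norm(x - y) + 1e-12

    # Kolmogorov's criterion (1.18): <x - P(x), c - P(x)> <= 0 for all c in C.
    for c in rng.uniform(0, 1, size=(1000, 3)):
        assert np.dot(x - px, c - px) <= 1e-12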

1.3.3 Separation
Preliminary version:

Theorem 1.27 (Hyperplane Separation). Let C ⊆ E be a closed convex set, and suppose that x ∈ E \ C. Then there exists a hyperplane separating x from C, i.e., there exist φ ∈ E, α ∈ R such that
⟨c, φ⟩ ≤ α < ⟨x, φ⟩ (1.20)
for all c ∈ C.

We can think of hyperplane separation as another Theorem of the Alternative, in the sense that either x is contained inside C, or it isn't and we have a hyperplane certificate to prove it. Spoiler alert: this theorem will prove essential to developing strong duality for convex optimization.
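The certificate is worth seeing in coordinates: the separating normal is φ = x − P_C(x). A tiny sketch of my own (assuming NumPy), with C the closed unit ball so that P_C(x) = x/‖x‖ for x outside C:

    import numpy as np

    x = np.array([3.0, 4.0])       # a point outside the unit ball C
    p = x / np.linalg.norm(x)      # P_C(x): radial projection onto C
    phi = x - p                    # normal of the separating hyperplane
    alpha = phi @ p                # level: <c, phi> <= alpha for all c in C
    print(alpha, phi @ x)          # 4.0 < 20.0, as in (1.20)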
We can immediately generalize Theorem 1.27: instead of separating just a point from a convex set, we can separate two convex sets from each other:
Corollary 1.28. Let C_1, C_2 ⊆ E be two non-empty closed convex sets, where C_2 is compact and C_1 ∩ C_2 = ∅. Then there exists a strictly separating hyperplane between C_1 and C_2, i.e., there exist φ ∈ E, α ∈ R such that
⟨φ, c_1⟩ < α < ⟨φ, c_2⟩ (1.21)
for all c_1 ∈ C_1, c_2 ∈ C_2.

other stuff outer description

1.4 Cones
1.4.1 Definitions
The cone is another very important geometric object, although as with the convex set,
we define it algebraically.

Definition 1.29. A subset K ⊆ E is called a cone if

x ∈ K, λ ∈ R+ =⇒ λx ∈ K. (1.22)

Oftentimes, we may find ourselves focusing on the closed convex cones (c.c.c. for short), which are closer to our primary-school intuition of a "pylon-shaped" cone. As we shall see, these specific cones have many nice properties; however, it is important to note that these c.c.c.s are not the only types of cones. Convex cones have a nice characterization:

Proposition 1.30. Let K ⊆ E be a cone. Then K is a convex cone if and only if

x, y ∈ K =⇒ x + y ∈ K. (1.23)

A convex cone contains all of its conic combinations, which are sums of the form
Σ_{i=1}^k λ_i x^(i), for λ ∈ R^k_+ and x^(i) ∈ K. (1.24)

Just as with convex sets, the intersection of any family of convex cones is itself a convex
cone. Similar to the convex hull, we can define a conical relaxation of a set, or the conic
hull:

Definition 1.31. Let S ⊆ E be any set. Then the conical hull of S, denoted cone S,
is the smallest convex cone containing S.

As with before, we have an elegant theorem for conical hulls:

Theorem 1.32 (Carathéodory, 1911). Let S ⊆ E^n. Then the conic hull of S is
{ Σ_{i=1}^n λ_i x^(i) : λ ∈ R^n_+, x^(i) ∈ S }. (1.25)
(Note that, in contrast with Theorem 1.13, n points suffice and there is no constraint that the weights sum to one.)

The reason why we care about closed convex cones so much (besides the fact that they are just so cool) is because they correspond with partial orders on E. Understanding cone geometry leads to conic optimization (duh), most notably including second-order cone programming and semidefinite programming. We have efficient algorithms for both of these problems.

1.4.2 Partial Orders
Many things cannot be compared to each other [8]. In linear algebra, there really isn't a good way to compare two arbitrary vectors. For example, in R^2, the vectors (0, 1) and (1, 0) are indistinguishable (especially before the choice of a basis). This appears to be a huge problem in optimization, where questions of the form "minimize ..." inherently require us to be able to compare stuff to each other.
However, this is a non-issue, as we introduce the concept of a partial order.
PARTIAL ORDER DEFINITION
Unlike total orders, we are allowed to have pairs of objects a, b which are incomparable, meaning neither a ⪯ b nor b ⪯ a.

Example 1.33. We are already familiar with a partial order: the less-than-or-equal-to relation ≤ over the real numbers. We can generalize this to obtain a natural partial order on R^n, where
x ⪯ y ⇐⇒ x_i ≤ y_i, ∀i ∈ [n]. (1.26)
Essentially, we say x ⪯ y if every entry of x is less than or equal to the corresponding entry of y. Usually we won't be too picky with the notation and will just write x ≤ y. After messing around with the definition for a bit, a cool alternate formulation is
x ≤ y ⇐⇒ y − x ∈ R^n_+. (1.27)
As it turns out, this is actually a more natural definition of the ≤ relation [9]. In particular, it is (at least on the surface) coordinate-free, and it extends easily.

Proposition 1.34. Indeed, if K ⊆ E is any pointed convex cone, then the relation x ⪯ y if and only if y − x ∈ K is a partial order. Conversely, if ⪯ is any partial order compatible with the vector space structure (invariant under translation and under scaling by positive reals), then
{x ∈ E : 0 ⪯ x} (1.28)
is a pointed convex cone.

EXAMPLE: SEMI DEFINITE CONE
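Until that example is written up, here is a small numerical illustration (a sketch of my own, assuming NumPy): taking K = S^n_+, the induced partial order is the Loewner order, and checking A ⪯ B amounts to checking that B − A lies in the cone:

    import numpy as np

    def loewner_leq(A, B, tol=1e-10):
        # A <= B in the Loewner order iff B - A is in the cone S^n_+,
        # i.e. iff the smallest eigenvalue of B - A is non-negative.
        return np.linalg.eigvalsh(B - A).min() >= -tol

    A = np.eye(2)
    B = np.array([[2.0, 1.0], [1.0, 2.0]])
    print(loewner_leq(A, B))   # True: B - A has eigenvalues 0 and 2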


RECESSION/ASYMPTOTIC CONE <- move this somewhere?
TANGENT CONE <- and this
moreau decompositions

1.4.3 The Dual Cone


Blah blah

Definition 1.35. Let S ⊆ E. Then we define the positive polar cone of S, denoted S^+, as
S^+ := {φ ∈ E : ⟨x, φ⟩ ≥ 0, ∀x ∈ S}. (1.29)
[8] Apples and oranges is the canonical example.
[9] And it may be familiar if you've done anything with equivalence relations.

We can define the negative polar cone S° similarly, and S° = −S^+. Note that the positive polar cone is sometimes referred to as the dual cone, denoted S^*, and the negative polar cone is simply called the polar cone.
Insert graphic representing dual cone
The dual cone comes up surprisingly often, as it represents I DON’T KNOW...
FIGURE THIS OUT. One fact that comes up a lot is the following:

Proposition 1.36. Let K ⊆ E. Then K is a closed convex cone if and only if
K = (K^+)^+. (1.30)

Proposition 1.36 can be used to provide a geometric interpretation (and proof) of the famous Farkas' Lemma from linear programming:

Corollary 1.37 (Farkas' Lemma). Let A ∈ R^{n×m}, b ∈ R^n. Then exactly one of the following is true:
(i) the system Ax = b, x ≥ 0 has a solution;
(ii) the system A^T y ≥ 0, b^T y < 0 has a solution.

Proof. Hi

Farkas’ lemma and other such theorems of the alternative are the foundation of a
beautiful duality theory in linear programming. It is strongly recommended that the
reader be familiar with these ideas.
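To make the alternative concrete, here's a toy instance of my own (assuming NumPy and SciPy): system (i) is infeasible (the only solution of Ax = b has a negative entry), and a certificate y for (ii) can be exhibited directly:

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    b = np.array([1.0, 1.0])   # Ax = b forces x = (-1, 1), so (i) has no solution

    feas = linprog(np.zeros(2), A_eq=A, b_eq=b, bounds=(0, None))
    print("(i) feasible:", feas.status == 0)       # False

    y = np.array([-1.5, 1.0])                      # a certificate for (ii)
    print(A.T @ y >= 0, b @ y < 0)                 # [ True  True ] True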

2 Convex Functions
2.1 Preliminary Definitions
We’ve seen many examples of the importance and elegance of convex sets. As we shall
see, there is a very natural correspondence between convex sets and convex functions,
which will allow us to transfer much of the theory over. In particular, we will be able
to derive several properties which are crucial to the success of convex optimization.
The general definition of a convex function is usually introduced in freshman calculus, and is defined by Jensen's Inequality.

Definition 2.1. Let f : C → R. Then f is a convex function if C is a convex set and
x, y ∈ C, λ ∈ [0, 1] =⇒ f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y). (2.1)
We see that it really is necessary for C to be a convex set, as otherwise the LHS of (2.1) may not be defined. Pictorially, f is convex if its graph lies below all of its secant lines.
MAYBE INSERT PICTURES AND EXAMPLES
If we have a function f such that −f is convex, then we call f concave. Concave
functions satisfy the inequality

x, y ∈ C, λ ∈ [0, 1] =⇒ f ((1 − λ)x + λy) ≥ (1 − λ)f (x) + λf (y). (2.2)

We can immediately uncover the relationship between convex sets and convex functions.
Recall the following definition:
Definition 2.2. Let f : S ⊆ E → R be any function. We define the epigraph of f ,
denoted epi f , by
epi f = {(x, r) ∈ S × R : f (x) ≤ r} . (2.3)
PICTURE? The epigraph represents the region “above” the graph of f , and is
a subset of a Euclidean space of one dimension higher. Then we have the following
equivalence:
Proposition 2.3. Let f : S → R. Then the following are equivalent:
(i) f is a convex function
(ii) epi f is a convex set
Examples: lines, quadratics (introduce hessian is psd iff convex), other examples,
norms,
As an aside, we’d also like to note that we can recover the classical Jensen’s in-
equality by induction.
Theorem 2.4 (Jensen's Inequality). Let f : E → R be a convex function, x^(1), ..., x^(k) ∈ E be points in the domain, and λ ∈ R^k_+ satisfying Σ_{i=1}^k λ_i = 1 be weights. Then
f(Σ_{i=1}^k λ_i x^(i)) ≤ Σ_{i=1}^k λ_i f(x^(i)). (2.4)
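A quick numerical check of (2.4), a sketch of my own assuming NumPy, using the convex function f(x) = |x| (a norm):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=5)                     # five points in the domain
    lam = rng.random(5); lam /= lam.sum()      # non-negative weights summing to 1
    assert np.abs(lam @ x) <= lam @ np.abs(x)  # Jensen: f of the mean <= mean of f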

We also need several definitions about functions in general. Depending on your


calculus/analysis background, you may have seen many of these definitions before.
Personally I knew none of these concepts before I took CO 255/463, and if you’re in a
similar situation I recommend that you spend some time to draw some examples and
get a visual intuition.
Many of these definitions don't show up too often, and a surprising amount of this material is unnecessary if you're simply looking to apply algorithms [10]. Nevertheless, these definitions form important tools for us to develop much of the deeper theory in convex analysis.
Definition 2.5. Let f : E → R and r ∈ R. Then we can define the rth level set,
denoted Lr (f ), as
Lr (f ) = {x ∈ E : f (x) = r} . (2.5)
Similarly, we can define the rth sublevel set (sometimes called lower level set), denoted
as Sr (f ), as
Sr (f ) = {x ∈ E : f (x) ≤ r} . (2.6)
[10] It may even be tempting to skip this section and hope that it never comes up.

continuous + bounded sublevel sets? lower/upper semicontinuous; to understand what these mean geometrically, perhaps it's helpful to picture in terms of the epigraph; closed function

2.1.1 As a vector space


REDO THIS A BIT Before we begin, we'd like to modify our definition of function to include a few objects which will simplify future propositions. In particular, we extend the real line by adjoining a new point which we call positive infinity [11]. This way, the infimum of any subset of the extended real line will be an extended real number.
Definition 2.6. For any function f̃ : C ⊆ E → R we can define the extended value function f : E → (−∞, +∞] by
f(x) = f̃(x) if x ∈ C, and f(x) = +∞ if x ∉ C. (2.7)
For our purposes, we shall always assume that the domain C in Definition 2.6 is a
convex set. We’ll often want to recover this original domain from our extended function,
so we’ll adopt the following notation:
Definition 2.7. Let f : E → (−∞, +∞] be an extended value function. Then the
domain of f , denoted dom f , is the set
dom f := {x ∈ E : f (x) < ∞} . (2.8)

2.1.2 Elementary Properties


maxes are at extreme points, local mins are global mins, tangent line theorems, three
slopes,

2.2 Other Types of Convexity


2.2.1 Quasiconvexity
2.2.2 Strong Convexity
2.2.3 Strict Convexity
2.2.4 K-Convexity
2.3 Calculus with Convex Functions
2.3.1 Derivatives
locally Lipschitz, differentiable almost everywhere
[11] I'll denote this extended real line as (−∞, +∞], but other authors may use R ⊔ {+∞} or other more concise notation. We'll try to avoid doing arithmetic with positive infinity; it's enough to just assume that c + ∞ and λ · ∞ are both equal to ∞ (for c ∈ (−∞, +∞] and λ ∈ (0, +∞]) and that all other expressions are undefined.

2.3.2 Subdifferentials
some more I bet

3 Convex Programs
minimum vs minimal

3.1 The Framework


abstract convex program
kkt style convex program

3.2 Optimality Conditions


Picture this: you are at a bar on a Friday night. You're sitting with a few friends who're tryna vibe to crappy mumble rap, while you yourself are working very hard on a convex optimization problem. A hooded stranger [12] notices you, walks up, and hands you a napkin with some numbers written on it. He claims that he has found an optimal solution for your problem. How could you quickly verify that he is correct?
Understanding optimality constraints is crucial to designing good algorithms. For
one, it is important to know when to stop. Also, by analyzing when and why optimality
constraints fail to hold, we can discover algorithms. As a stupid example, consider
Fermat’s Theorem:

Theorem 3.1 (Fermat, 1600’s). Let f : A → R be some function, and suppose that
x∗ is a local extremum for f . If f is differentiable at x∗ , then

∇f (x∗ ) = 0. (3.1)

When condition (3.1) is violated, we have a non-zero gradient and thus a direction for descent/ascent. As we'll see, this leads to gradient descent.
For convex functions we have even stronger conditions. The most important prop-
erty is this very simple one:

Theorem 3.2 ("Unimodality"). Suppose that f : E → R is a convex function, and that M ⊆ E is a convex set. Suppose that x∗ ∈ M ∩ dom f is a local minimizer on M, i.e., there exists some δ > 0 such that
x ∈ M ∩ B(x∗; δ) =⇒ f(x) ≥ f(x∗). (3.2)
Then x∗ is actually a global minimizer of f on M, i.e.,
x ∈ M =⇒ f(x) ≥ f(x∗). (3.3)


[12] Perhaps a UofT student.

Most algorithms are only capable of finding local extrema, but in convex optimization, this turns out to be equivalent to finding global extrema. We can use this theorem, as well as some facts about convex functions, to strengthen Fermat's theorem:
Theorem 3.3. A point x∗ ∈ dom f is a global minimizer for f if and only if 0 ∈ ∂f (x∗ ).
Of course, if x∗ ∈ int dom f and f is differentiable at x∗, then ∂f(x∗) = {∇f(x∗)} and we recover Fermat's original theorem. So what happens if x∗ ∉ int dom f?
Tangent cones, that one condition from homework? Rockafellar-Pshenichnyi
Weierstrass Theorem
Lagrange Multipliers - introduce it math 247 style and say we’ll talk more about it
with duality
Maximizing convex functions

3.3 Duality
Duality, as a concept, is loosely defined as looking at an object in two ways. For example, we can analyze a signal with respect to either the frequency domain or the time domain. A compact convex set can be regarded as the union of a bunch of points (REFERENCE ABOVE), or as the intersection of a bunch of halfspaces (REFERENCE ABOVE). Here, we'll define several different notions of duality to help us understand convex functions and convex programs.

3.3.1 Minimax Theorem


We begin by borrowing a theorem from game theory, which formalizes the first move
disadvantage in many zero-sum games.
Proposition 3.4. For any g : M × N → R,
min_{x∈M} max_{y∈N} g(x, y) ≥ max_{y∈N} min_{x∈M} g(x, y). (3.4)

Picture a zero-sum game played between two opponents X (Xavier) and Y (say,
Yvette), where g is the profit function for Y . That is, if X plays the move x and Y
plays the move y, then g(x, y) is the (possibly negative) money paid out to Y from
X. Player Y seeks to maximize her profit, while player X seeks to minimize his losses.
There is a first player disadvantage in this game: the first player to commit is worse
off, as the second player can adapt their strategy in response.
Example 3.5 (Rock Paper Scissors). Alphonse and Beryl are playing a game of rock paper scissors, in which the loser pays the winner one Canadian Peso. From Beryl's perspective (Beryl takes the role of Y above), she has the following payoff matrix G, with Alphonse's move indexing the rows and Beryl's move indexing the columns:

                       B
               Rock   Paper   Scissors
  A  Rock        0      1       -1
     Paper      -1      0        1        (3.5)
     Scissors    1     -1        0

Alphonse and Beryl must both choose a vector from the set of standard basis vectors:
M = N = {(1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T}, (3.6)
and if Alphonse chooses x and Beryl chooses y, the payout for Beryl is
g(x, y) = x^T G y. (3.7)

By Proposition 3.4 (as well as common sense), the first person to reveal their choice is significantly disadvantaged (since the other person would just pick a winning match-up), and we can compute
1 = min_{x∈M} max_{y∈N} x^T G y ≥ max_{y∈N} min_{x∈M} x^T G y = −1, (3.8)
so we can verify that Inequality (3.4) holds.
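We can also verify this with brute force over the pure strategies (a quick sketch of my own, assuming NumPy):

    import numpy as np

    # Beryl's payoff matrix G from (3.5): rows are Alphonse's moves, columns Beryl's.
    G = np.array([[ 0,  1, -1],
                  [-1,  0,  1],
                  [ 1, -1,  0]])

    minimax = G.max(axis=1).min()   # Alphonse commits first; Beryl best-responds
    maximin = G.min(axis=0).max()   # Beryl commits first; Alphonse best-responds
    print(minimax, maximin)         # 1 and -1: Inequality (3.4) holds strictly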

We can interpret this game-theoretic perspective as some sort of duality. By defining F(x) = max_{y∈N} g(x, y) as a sort of "dual objective function" to f(y) = min_{x∈M} g(x, y) (i.e., your opponent's objective function is the dual of your own), we can see that Proposition 3.4 resembles some sort of weak duality statement à la linear programming.
Remarkably, however, sometimes it doesn’t matter who goes first: with optimal
play, the disadvantage of revealing your plan early is nonexistent. John von Neumann
was the first to publish a strong duality minimax theorem, which many regard as the
start of game theory.

Theorem 3.6 (von Neumann, 1928, slightly modified). Let M and N be compact convex sets, and g : M × N → R be a continuous function satisfying
(i) g(·, y) : M → R is convex for each fixed y ∈ N,
(ii) g(x, ·) : N → R is concave for each fixed x ∈ M.
Then
min_{x∈M} max_{y∈N} g(x, y) = max_{y∈N} min_{x∈M} g(x, y). (3.9)

There have been many generalizations of von Neumann's minimax theorem. We shall prove one of the same flavour, by the American/Canadian mathematician Maurice Sion:
Theorem 3.7 (Sion, 1958). Let M and N be convex sets, with at least one of them compact, and let g : M × N → R be l.s.c. and quasi-convex in x for each fixed y ∈ N, and u.s.c. and quasi-concave in y for each fixed x ∈ M. Then
min_{x∈M} max_{y∈N} g(x, y) = max_{y∈N} min_{x∈M} g(x, y). (3.10)

Before we prove this theorem, it's important to note that (3.10) may no longer hold if any of the preconditions fail. Let's look at a few case studies with g(x, y) = x + y.

Example 3.8. Sometimes, the second to play gains a lot! Consider
min_{x∈R} max_{y∈R} x + y = +∞, (3.11)
while
max_{y∈R} min_{x∈R} x + y = −∞. (3.12)
This example also illustrates the necessity of the compactness condition in Theorem 3.7.
Example 3.9. On the other hand, if we do enforce compactness, then Sion's Theorem holds as we'd expect:
min_{x∈R} max_{0≤y≤1} x + y = −∞ = max_{0≤y≤1} min_{x∈R} x + y. (3.13)

Example 3.10. The hypotheses of Sion's Theorem are sufficient but not necessary (more commentary GOES HERE):
min_{x∈R} max_{y≤0} x + y = −∞ = max_{y≤0} min_{x∈R} x + y. (3.14)
We don't have compactness in either set, but the conclusion of Sion's Theorem still holds.
Example 3.11 (von Neumann's Zero Sum Game). Alphonse and Beryl realize that they should play according to probability distributions, etc., etc., generalize. Let ∆_n be the standard simplex in R^n; then x and y represent probability distributions, which are something called mixed strategies, and
min_{x∈∆_m} max_{y∈∆_n} x^T A y = max_{y∈∆_n} min_{x∈∆_m} x^T A y. (3.15)

3.3.2 Lagrangian Duality


We can use some of the convexity theory to generalize the theory of Lagrange multipliers. (INTRODUCE LAGRANGE MULTIPLIERS somewhere)
Now consider a non-linear program (NLP) given in standard form:

p∗ = min  f(x)
     s.t. g(x) ⪯_K 0 ∈ E^m
          h(x) = 0 ∈ E^p                (3.16)
          x ∈ Ω.

Definition 3.12. In this framework, we can define the Lagrangian function by
L(x, λ, µ) := f(x) + ⟨λ, g(x)⟩ + ⟨µ, h(x)⟩. (3.17)

Here, we introduce two new parameters λ and µ, often called the dual variables (as they will become the variables in our dual program), or sometimes simply the Lagrange multipliers. If we are working in R^n, we usually write the more familiar form
L(x, λ, µ) = f(x) + Σ_{i=1}^m λ_i g_i(x) + Σ_{i=1}^p µ_i h_i(x), (3.18)

although it is slightly less revealing. You really should think of the Lagrangian as an affine functional of λ and µ. We can recover our original NLP by solving the following unconstrained problem:
p∗ = min_{x∈Ω} max_{λ∈K^+, µ∈E^p} L(x, λ, µ). (3.19)
You should convince yourself why these problems are equivalent, and in particular, why the Lagrange multipliers guarantee feasibility in (3.16). By reversing the order of play (by Proposition 3.4) we immediately get a statement of weak duality:

p∗ ≥ d∗ = max_{λ∈K^+, µ∈E^p} min_{x∈Ω} L(x, λ, µ). (3.20)

Let's rewrite this a bit more succinctly. Define the dual functional of the NLP by
φ(λ, µ) = min_{x∈Ω} L(x, λ, µ). (3.21)

We rewrite (3.20) and define the dual problem, which will form the basis of Lagrange relaxation.
Definition 3.13. Given a program in the form (3.16), we can define the dual program
d∗ = max_{λ∈K^+, µ∈E^p} φ(λ, µ). (3.22)
Moreover, we have weak duality: p∗ ≥ d∗.


What's particularly remarkable is that the function φ is concave, as it is the pointwise minimum (infimum) of affine functionals. Hence the dual program is always a convex problem, even if the primal was not. This relaxation of a non-convex problem to a convex one is what we call Lagrange relaxation.
Remark 3.14. Every choice of λ ∈ K^+, µ ∈ E^p in φ(λ, µ) gives us a lower bound for p∗. Even if we cannot solve (3.22) exactly, approximate solutions will give us better and better bounds for the primal.
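As a quick worked example (a toy problem of my own, not from the sources): minimize f(x) = x^2 over x ∈ R subject to 1 − x ≤ 0 (so K = R_+, and p∗ = 1, attained at x∗ = 1). The Lagrangian is L(x, λ) = x^2 + λ(1 − x); minimizing over x (set 2x − λ = 0, i.e., x = λ/2) gives the dual functional φ(λ) = λ − λ^2/4, which is concave as promised. Maximizing over λ ≥ 0 gives λ∗ = 2 and d∗ = φ(2) = 1 = p∗: here the relaxation is tight, and every other choice of λ ≥ 0 yields a valid lower bound (e.g., φ(1) = 3/4 ≤ p∗).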
Of course, our lower bounds could be quite terrible. We'll devote the next section to understanding when we have strong duality, and talk more about constraint qualification and optimality conditions in the one after. MAYBE CHANGE THIS

3.3.3 Strong duality


Now, we need to restrict the NLP to the case of a convex program. Recall the abstract convex program (ACP):

p∗ = min  f(x)
     s.t. g(x) ⪯_K 0                    (3.23)
          x ∈ Ω.

Here, K is a closed convex cone, Ω ⊆ E is a convex set (the domain), f : Ω → R is a convex function, and g : Ω → F is a K-convex function on Ω. Then the Lagrangian of (3.23) is
L(x, λ) := f(x) + ⟨λ, g(x)⟩ (3.24)
with dual functional
φ(λ) = min_{x∈Ω} L(x, λ), (3.25)
and weak duality
p∗ ≥ d∗ := max_{λ∈K^+} φ(λ). (3.26)

Definition 3.15. A constraint qualification (CQ) on (ACP) is a condition on the constraints which guarantees the existence of a Lagrange multiplier λ∗ ∈ K^+ such that we have strong duality. That is, p∗ = d∗, and the supremum d∗ is attained.
We’ll explore constraint qualifications for non-convex problems later. For now, with
a convex program, we have the very simple Slater’s condition.
Definition 3.16 (Slater's condition). Slater's constraint qualification holds when there exists a strictly feasible point, i.e., there exists x̂ ∈ Ω so that
g(x̂) ≺_K 0. (3.27)

Remark 3.17. This is also a reason why we include the domain Ω in the convex problem. If Slater's CQ fails to hold, then sometimes we can modify the constraints and the set Ω appropriately to introduce strict feasibility [13].
And, as alluded to:
Theorem 3.18 (Strong Duality). Suppose that p∗ is finite for (ACP), and that Slater's CQ holds. Then there exists a λ∗ ∈ K^+ such that
p∗ = min_{x∈Ω} L(x, λ∗). (3.28)
In other words, we have strong duality in the sense that p∗ = d∗; there is no duality gap. Moreover, if x∗ is optimal in (ACP), then it is optimal in (3.28) as well, and
⟨λ∗, g(x∗)⟩ = 0, (3.29)
i.e., the complementary slackness conditions hold.


Proof. ...

some commentary about how this strong duality doesn't guarantee attainment, nor feasibility.
Theorem 3.19. Theorem 5.1.12 in Henry's notes
Proof. Rockafellar-Pshenichnyi
[13] This is the idea behind facial reduction. INSERT SOURCES HERE

3.3.4 Optimality Conditions
Existence of KT vectors, Slater’s Condition, KT vector implies compactness which
allows us to use Sion’s theorem.. kkt conditions, idk

3.3.5 Conjugate Duality


Also called Fenchel duality?

4 Algorithms for Unconstrained Problems


Sometimes, as with the equality constrained quadratic problem, it is possible to determine the minimum of a function analytically. When we can't, however, we turn to iterative algorithms. Throughout the next few sections, we shall be trying to solve the unconstrained problem
x∗ = arg min_{x∈R^n} f(x), (4.1)
where f is real-valued and sufficiently smooth [14].


We note that many of these algorithms still converge even if f is not convex; however, you may find that they do not converge to a global minimum. For now, we'll assume that f is convex as well, but the reader should be aware of the non-convex case.
The following algorithms all follow the same framework. Suppose that we are currently at a point x^(n) ∈ E, and we would like to iterate towards a new point x^(n+1) with the hope that, with enough iterations, x^(n) → x∗. For notation's sake, we write x := x^(n) for our current point.

4.1 First Order Methods


The simplest [15] methods are the first order methods: the methods where we move from point to point by considering only information from the first derivative. However, there has been a fair bit of research into improving these methods, especially with the recent popularity of deep learning: due to the size of some of the problems there, the more sophisticated methods are infeasible [16]. Still, first order methods kind of suck and converge very slowly.

4.1.1 Steepest Descent


The first first-order method was originally proposed by Cauchy in 1847. Despite its age, Cauchy's steepest descent illustrates many of the design considerations to be aware of today.
[14] The definition of smooth may change from section to section.
[15] And slowest to converge, although still very useful!
[16] For now! Hopefully this changes in the future.

The idea is simple. Locally, around x^(n), we know that f is quite well approximated by a linear function. If we pick a direction d, then we can write
g(t) := f(x + td) = f(x) + t∇f(x)^T d + o(‖td‖). (4.2)
There are two immediate questions we need to answer: what choice of t (called the step size) and what choice of d (called the descent direction) do we want?
Cauchy suggested that we should choose the direction in which f decreases the fastest, the steepest direction, so to speak. In other words, we would like to minimize the directional derivative ∇f(x)^T d. To do this, we'd like to solve the subproblem:
min  ∇f(x)^T d
s.t. ‖d‖_2 = 1. (4.3)
We pass constraint qualification, so we can use Lagrange multipliers. The Lagrangian is
L(d, λ) = ∇f(x)^T d + λ(1 − ‖d‖^2), (4.4)
with derivative (with respect to d)
0 = ∇_d L(d, λ) = ∇f(x) − 2λd. (4.5)
If ∇f(x) = 0 then by Theorem 3.3, x would be a global minimizer, and we'd be done. Otherwise ∇f(x) ≠ 0, and we can solve the system of equations to conclude
λ = ±(1/2)‖∇f(x)‖,  d = ±∇f(x)/‖∇f(x)‖. (4.6)
We are looking to minimize, so we conclude that d = −∇f(x)/‖∇f(x)‖ is the direction of steepest descent. Now we must decide on a suitable step length t to guarantee convergence.
Beware of skiing - show what happens if the step size is too small or too large; have an example which shows why it's not always the best choice to go to the minimum of g.
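Here is a minimal sketch of the method with a fixed step size (my own illustration, assuming NumPy; the test function and constant step are assumptions, and a proper line search would replace them):

    import numpy as np

    def steepest_descent(grad_f, x0, t=0.1, tol=1e-8, max_iter=10_000):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:   # (near-)zero gradient: stop, by Theorem 3.3
                break
            x = x - t * g                 # move along the steepest direction -grad f(x)
        return x

    # f(x) = ||x||^2 / 2 has grad f(x) = x and unique minimizer x* = 0.
    print(steepest_descent(lambda x: x, [3.0, -4.0]))   # ~ [0, 0]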

4.1.2 Subgradient Methods


Your function is not differentiable :(

4.2 Second Order Methods


4.2.1 Newton’s Method
4.2.2 Trust Region Methods
Probably can/should make this its own section below

5 Equality constrained minimization


5.1 Equality Constrained Quadratic Programs
We have already developed enough blah to solve quadratic programs analytically

5.2 Newton’s Method Revisited
5.3 Infeasible Start Newton’s Method
5.4 Penalty Function
5.5 Implementation Details

6 Interior Point Methods


Technically speaking, the algorithms we've covered above (steepest descent, second order, etc.) are interior point methods, as they travel through the interior of a feasible set rather than along the boundary. However, the term interior point method typically refers to a class of algorithms called primal-dual interior point methods [17]. We shall develop these algorithms in this section, first in the general context of non-linear programs; then, after looking at the specific cases of linear and quadratic programs, we shall try to extend to the case of an abstract convex problem with inequality constraints over an affine manifold,

p∗ = min  f(x)
     s.t. g(x) ≤ 0
          Ax = b                        (6.1)
          x ∈ Ω,

and develop the relevant ideas along the way.
We'll begin by discussing methods to convert a constrained optimization problem into a series of unconstrained problems. We'll do so by modifying our objective function to reward feasibility, adding a high cost either to infeasibility itself or to approaching the boundary of the feasible region.

6.1 Penalty and Barrier problems


We present a simple and surprisingly powerful technique for solving general nonlinear problems. Suppose we are given an NLP of the form:

p∗ = min  f(x)
     s.t. g(x) ≥ 0
          h(x) = 0                      (6.2)
          x ∈ Ω.

Here, f : R^n → R, g : R^n → R^{m_i}, and h : R^n → R^{m_e} are any functions (well, sufficiently smooth; twice differentiable is usually enough), and Ω is any simple enough constraint set (for example an affine manifold, or perhaps a polyhedron). Note that we make no convexity assumptions. We'd like to reformulate this problem into an equivalent unconstrained optimization problem in the hopes of applying Newton's method. Our first try will be to throw a brick at it.
[17] Which again is a misnomer, because some methods are primal-only or dual-only.

Recall that, for a general set S ⊆ E, we can define an indicator function I_S : E → R ∪ {+∞} with
I_S(x) = 0 if x ∈ S, and I_S(x) = +∞ if x ∉ S. (6.3)
If we let
F = {x ∈ R^n : g(x) ≥ 0, h(x) = 0} (6.4)
be the feasible region, then the unconstrained problem
p∗ = min_{x∈Ω} f(x) + I_F(x) (6.5)

is equivalent to our original problem. This seems to work out nicely, but unfortunately the gravy train stops here. The indicator function is not continuous, meaning we won't be able to use most of our developed algorithms (the subgradient method doesn't work well either, as we are not convex). We'll have to introduce soft approximations of these indicator functions instead.
INTRODUCE PENALTY BARRIER with pictures. SPLIT UP EQUALITY AND INEQUALITY CONSTRAINTS?
Then we can define the joint penalty-barrier function
P_µ(x) := f(x) + (1/2µ)‖h(x)‖^2 − µ Σ_i log g_i(x). (6.6)

Here, the second term is called the quadratic penalty term, and the third term is called the log barrier term. We notice that P_µ(x) is only defined on the interior of the feasible set (points satisfying g(x) > 0). So if we start at an interior point, then successive iterations will also be at interior points (hence the name interior point method). Considering the corresponding optimization problem
x_µ = arg min_{g(x)>0, x∈Ω} P_µ(x), (6.7)

we can see that the quadratic penalty term encourages feasibility of the equality constraints, while the log barrier term encourages us to stay away from the boundary of the set. As µ decreases to 0, the penalty increases and forces h(x) = 0, and the barrier decreases and allows g(x) to approach the boundary, all while increasing the influence of the objective function f. Formally:
Theorem 6.1 (Penalty Barrier Global Convergence). Let {µ_k}_{k≥1} be a sequence approaching 0 from above, and let x_{µ_k} be the corresponding optimal solutions to (6.7). Then every limit point x∗ of the sequence {x_{µ_k}}_{k≥1} is a solution to (6.2).
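A one-dimensional sketch of this convergence (my own toy example, assuming SciPy is available): minimize f(x) = x subject to g(x) = x − 1 ≥ 0, with no equality constraints. The barrier subproblem min_x x − µ log(x − 1) has the closed-form solution x_µ = 1 + µ, which tends to the true optimum x∗ = 1:

    import numpy as np
    from scipy.optimize import minimize_scalar

    for mu in [1.0, 0.1, 0.01, 0.001]:
        res = minimize_scalar(lambda x: x - mu * np.log(x - 1.0),
                              bounds=(1.0 + 1e-12, 10.0), method="bounded")
        print(mu, res.x)    # x_mu = 1 + mu, approaching x* = 1 as mu -> 0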
Exercise 6.2. The log barrier is just one of many barrier functions we could have chosen in (6.6). Consider, for example, the inverse barrier
f(x) + µ Σ_{j=1}^p 1/g_j(x). (6.8)

6.2 Barrier Methods
We know how to convert constrained optimization problems to unconstrained ones. Now, we shall adapt the algorithms we've developed for unconstrained optimization to these new barrier problems. Given our convergence result above, it may be tempting to just pick a very small µ and solve the corresponding barrier subproblem,
min_x P_µ(x) = f(x) + (1/2µ)‖h(x)‖^2 − µ Σ_i log g_i(x), (6.9)
as an unconstrained optimization problem. While in theory this will converge to within an ε of the optimal value, in practice, due to numerical issues, it does not work well except for small and well-behaved problems, and only to a moderate accuracy.
So we will need to extend our unconstrained optimization algorithms in a more intelligent manner. These adaptations are generally called barrier methods.

6.3 Primal-dual interior-point Methods


Now let's try to apply our convexity theory to the barrier methods. By using information from both the primal and dual problems to simultaneously update the primal and dual variables at every iteration, we can observe faster convergence than with barrier methods. The search directions we obtain are very similar to those obtained in the barrier method, but not identical.
For many basic classes of problems, including linear, quadratic, semidefinite, etc., primal-dual interior-point methods outperform barrier methods. For more complicated/general convex problems, primal-dual algorithms are still under active research, but we have high hopes.

6.4 Linear and Semidefinite Programs


Let's do a few examples, beginning with linear programming. The first primal-dual interior-point methods are usually credited to Karmarkar (1984), with their application to linear programming; however, barrier methods have been known since the 1950's.
In contrast with the ellipsoid method (ADD SECTION ON THIS?), interior point methods lead to efficient polynomial-time algorithms competitive with (and in some cases faster than) the celebrated simplex method. In some sense they combine the best of both algorithms: the theoretical guarantees of the ellipsoid method and the blazing fast real-world performance of the simplex method.
Consider the standard equality form linear program and its dual:

min  c^T x             max  b^T y
s.t. Ax = b            s.t. A^T y ≤ c,            (6.10)
     x ≥ 0,
although we'll usually write the dual in terms of a slack variable z:

min  c^T x             max  b^T y
s.t. Ax = b            s.t. A^T y + z = c         (6.11)
     x ≥ 0,                 z ≥ 0.

From a high level perspective: the interior point method will generate a sequence of strictly feasible points x^(i), y^(i), z^(i) with x^(i) > 0, z^(i) > 0, converging to the optimal solution (these points are in the interior [18] of the feasible region, hence the name of the algorithm). In practice, we can get within 10^{−8} of the optimal solution after 10-50 (expensive) iterations. In fact, we shall show that just O(n log(1/ε)) iterations are enough to come within a factor (1 + ε) of the optimal value (actually, interior-point algorithms needing just O(√n log(1/ε)) iterations exist, but they are slower in practice).
For now, suppose that we only have the primal problem (we shall derive the dual and slack variables along the way):

min  c^T x
s.t. Ax = b                             (6.12)
     x ≥ 0.

We eliminate the non-negativity constraints by introducing a barrier function. Letting
µ > 0 be the barrier parameter, we can formulate the barrier subproblem,
\[
\min_{x \in \mathbb{R}^n_+} \; c^\top x - \mu \sum_{i=1}^{n} \log x_i \quad \text{subject to } Ax = b, \tag{6.13}
\]
and the corresponding Lagrangian function
\[
L(x, y) = c^\top x - \mu \sum_{i=1}^{n} \log x_i - y^\top (Ax - b). \tag{6.14}
\]
Since the remaining constraints are affine, constraint qualification holds, and so at the
optimal solution x there exists a y satisfying the following KKT conditions:
\[
A^\top y + \mu X^{-1} e = c \tag{6.15}
\]
\[
Ax = b. \tag{6.16}
\]
Here, X = diag(x) and e is the vector of all ones. By slightly rearranging (6.15) we obtain the
perturbed complementary slackness conditions,
\[
(c - A^\top y)_i \, x_i = \mu, \tag{6.17}
\]
for all i ∈ [n] (resembling the complementary slackness conditions we are used to with
linear programs, except with µ on the RHS instead of 0).
18
well, actually the relative interior
Let’s call z := c − A^⊤y. Rearranging the KKT equations a bit, we get the perturbed
optimality equations
\[
\begin{aligned}
A^\top y + z - c &= 0, \quad z > 0 && \text{(dual feasibility)} && \text{(6.18)} \\
Ax - b &= 0, \quad x > 0 && \text{(primal feasibility)} && \text{(6.19)} \\
Zx - \mu e &= 0. && \text{(perturbed complementary slackness)} && \text{(6.20)}
\end{aligned}
\]
We denote these conditions as F_µ(x, y, z) = 0. Of course, if our current point is not
optimal then F_µ(x, y, z) ≠ 0, and we’d like to take a Newton step towards the root.
We can find this Newton search direction by solving the equation
∇F_µ(x, y, z)(∆x, ∆y, ∆z) = −F_µ(x, y, z), or in block form,
\[
\begin{bmatrix} 0 & A^\top & I \\ A & 0 & 0 \\ Z & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \\ \Delta z \end{bmatrix}
= - \begin{bmatrix} A^\top y + z - c \\ Ax - b \\ Zx - \mu e \end{bmatrix}. \tag{6.21}
\]
By solving this system we obtain directions ∆x, ∆y, ∆z of descent. After taking an
appropriate step size to remain strictly feasible, we update our points x, y, z. Finally,
we update µ, usually by setting µ ← σµ for some fixed σ ∈ (0, 1), and push the solution
towards optimality.
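To make this concrete, here’s a minimal sketch of the full iteration in Python with NumPy. To be clear, this is an illustration rather than a serious implementation: the function name, the fixed centering factor σ, the dense solve of (6.21), and the 0.99 fraction-to-the-boundary step rule are all choices of mine, and real solvers do quite a bit more.

\begin{verbatim}
import numpy as np

def primal_dual_lp(A, b, c, x, y, z, sigma=0.1, tol=1e-8, max_iter=100):
    """Sketch of a primal-dual interior-point iteration for (6.11).

    Assumes a strictly feasible start: x > 0 and z > 0 componentwise.
    """
    m, n = A.shape
    for _ in range(max_iter):
        mu = sigma * (x @ z) / n              # target on the central path
        F = np.concatenate([A.T @ y + z - c,  # dual residual
                            A @ x - b,        # primal residual
                            x * z - mu])      # perturbed complementarity
        if np.linalg.norm(F) < tol:           # (crude) stopping rule
            break
        # Assemble and solve the block Newton system (6.21) densely.
        J = np.block([
            [np.zeros((n, n)), A.T,              np.eye(n)],
            [A,                np.zeros((m, m)), np.zeros((m, n))],
            [np.diag(z),       np.zeros((n, m)), np.diag(x)],
        ])
        d = np.linalg.solve(J, -F)
        dx, dy, dz = d[:n], d[n:n + m], d[n + m:]
        # Damp the step so that x and z stay strictly positive.
        alpha = 0.99 * min(1.0,
                           *(-x[dx < 0] / dx[dx < 0]),
                           *(-z[dz < 0] / dz[dz < 0]))
        x, y, z = x + alpha * dx, y + alpha * dy, z + alpha * dz
    return x, y, z
\end{verbatim}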
This is the general idea of the algorithm. There are a few details that we have yet to
take care of. Over the next few paragraphs, we’ll briefly comment on: what to choose
for the initial values of x, y, z, µ; what step size to take when updating x, y, z; and how to
solve (6.21) relatively efficiently in practice. More detailed analysis will follow in
later sections.
While in theory we could just pick any strictly feasible x and z (that is, x, z > 0),
in practice there are heuristics to follow when choosing initial points (not too
close to the boundary, etc.); for LPs almost any reasonable choice tends to work okay.
In general, though, choosing good initial values for x, y, z is very important and is a
hard problem; for non-linear programs, it is difficult to even find strictly feasible
solutions. We’ll discuss this more when we talk about two phase interior point methods.
(When possible, of course, the best choice of starting point would be the optimal solution itself.)
Picking an appropriate step size α and centering parameter σ is much easier. Again, there are lots of
heuristics to follow, and careful tuning of these parameters leads to faster algorithms;
the most important thing, however, is to make sure you remain strictly feasible at each step.
Finally, I’d like to devote a bit of time to exploring how we can efficiently solve
(6.21). Letting r_d = A^⊤y + z − c, r_p = Ax − b, and r_c = Zx − µe denote the residuals on the RHS,
we can perform some block Gaussian elimination. The first row yields
\[
\Delta z = -r_d - A^\top \Delta y, \tag{6.22}
\]
which when substituted into the third row gives us
\[
\Delta x = -Z^{-1} r_c + Z^{-1} X r_d + Z^{-1} X A^\top \Delta y. \tag{6.23}
\]
Finally, substituting this into the second equation gives us the system
\[
A Z^{-1} X A^\top \Delta y = -r_p + A Z^{-1}\left(r_c - X r_d\right). \tag{6.24}
\]
Hence we’ve reduced our (2n + m)-variable system to one with just m unknowns (and
usually m ≪ n) and a symmetric coefficient matrix (positive definite whenever A has full
row rank), leading to faster numerical methods. We can recover ∆x and ∆z by back
substitution. (When these algorithms are actually implemented there are more tricks we can do.)
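In code, the whole reduction is only a few lines. The fragment below is a hypothetical drop-in replacement for the dense solve in the earlier sketch (it reuses that sketch’s variable names), trading the (2n + m)-dimensional solve for a single m × m one.

\begin{verbatim}
r_d, r_p, r_c = A.T @ y + z - c, A @ x - b, x * z - mu
M = (A * (x / z)) @ A.T          # A Z^{-1} X A^T: m x m and symmetric
dy = np.linalg.solve(M, -r_p + A @ ((r_c - x * r_d) / z))   # (6.24)
dz = -r_d - A.T @ dy             # back-substitute (6.22)
dx = (-r_c + x * r_d + x * (A.T @ dy)) / z                  # back-substitute (6.23)
\end{verbatim}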
In comparison to the simplex method, which takes many cheap iterations to converge,
primal-dual methods take fewer, more expensive steps. In particular, we need to solve
the perturbed KKT equations at each iteration, which is quite computationally
expensive. We’ll talk more about the time complexity once we fix the specifics of the
hyperparameters.
Exercise 6.3 (from the CO 463 notes). We can also formulate the barrier subproblem in
terms of the dual (6.11). Can you derive the same perturbed KKT conditions?
Exercise 6.4. The point of this exercise, as well as the one that follows, is to derive
primal-dual interior point methods for various quadratic optimization problems, starting with
\[
\begin{array}{ll}
\min & q(x) = \frac{1}{2} x^\top Q x + c^\top x \\
\text{s.t.} & Ax = b \in \mathbb{R}^m \\
& x \ge 0, \; x \in \mathbb{R}^n.
\end{array} \tag{6.25}
\]
Here we are optimizing over an affine manifold, with an additional non-negativity
constraint. Derive a primal-dual interior-point algorithm to solve (6.25) by first adding
an appropriate log-barrier term, writing down the perturbed optimality conditions, and
computing a Newton direction and suitable step length towards optimality.
Implement your solution in your favourite scientific programming language. How
fast can you make it?
Exercise 6.5. Repeat the derivation for the generalized trust region subproblem.
6.5 Feasibility and Two Phase Methods

How do we find a strictly feasible starting point in the first place? This is the job of the
first phase of a two phase method, which solves an auxiliary optimization problem whose
solution is a strictly feasible point for the original problem.
6.6 Step Size and Time Analysis

Here we will compare the short-step, long-step, and predictor-corrector variants, and
analyze their iteration complexity.
7 Applications
7.1 Semidefinite Programming
7.1.1 Preliminaries
Semidefinite programming has been a speciality of the Waterloo C&O Department.
Semidefinite programs (SDPs) resemble linear programs, except that the variable is
taken in the space of positive semidefinite matrices, and the non-negativity constraints
are replaced by semidefinite ones.
Although SDPs have been studied since at least the 1940’s (under different names),
it wasn’t until the late 1990’s and early 2000’s that we had efficient algorithms for
solving them. There are many diverse applications of SDPs, and hopefully I’ll be able
to show you many of them^19.
We begin with the many equivalent formulations of semidefiniteness:

Proposition 7.1. Let A ∈ S^n be an n × n real symmetric matrix. The following are equivalent:
(i) A is positive semidefinite (p.s.d.), written as A ⪰ 0 or A ∈ S^n_+
(ii) For all x ∈ R^n, ⟨x, Ax⟩ ≥ 0
(iii) All the eigenvalues of A are real and nonnegative
(iv) All 2^n − 1 principal minors of A are nonnegative
(v) There exists a (Cholesky) factorization A = LL^⊤
(vi) There exists a (unique) positive semidefinite square root S ∈ S^n_+ with A = SS.
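Characterizations (iii) and (vi) are also the easiest ones to compute with. Here’s a small NumPy sketch along those lines; the function names and the tolerance are my own choices.

\begin{verbatim}
import numpy as np

def is_psd(A, tol=1e-10):
    # (iii): all eigenvalues nonnegative (eigvalsh assumes a symmetric input)
    return np.linalg.eigvalsh((A + A.T) / 2).min() >= -tol

def psd_sqrt(A):
    # (vi): the p.s.d. square root S with SS = A, via the spectral decomposition
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

A = np.array([[2.0, 1.0], [1.0, 2.0]])
S = psd_sqrt(A)
print(is_psd(A), np.allclose(S @ S, A))   # True True
\end{verbatim}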
Of course, a very similar statement holds if we restrict A to be a positive definite matrix:
Proposition 7.2. Let A ∈ S^n be an n × n real symmetric matrix. The following are equivalent:
(i) A is positive definite, written as A ≻ 0 or A ∈ S^n_{++}
(ii) For all x ∈ R^n, x ≠ 0 =⇒ ⟨x, Ax⟩ > 0
(iii) All the eigenvalues of A are real and positive
(iv) All the leading principal minors of A are positive
(v) There exists a (Cholesky) factorization A = LL^⊤, where L is square and nonsingular
(vi) There exists a (unique) positive definite square root S ∈ S^n_{++} with A = SS.
Some other linear algebra facts we may use are:

Proposition 7.3 (Sylvester’s Law of Inertia). Let A ∈ S^n and T ∈ M^n be non-singular. Then
the signatures (numbers of positive, negative, and zero eigenvalues) of A and T^⊤AT are the same.
Proposition 7.4 (Schur Complement). The following are equivalent (for appropriately
sized A, B, C):
19
Semidefinite programming is usually its own course at Waterloo, although some other schools
(e.g. Stanford) cover it in a second convex optimization course. As a result this section may be way more
in depth than some of the others.
(i) \(\begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \succ 0\)
(ii) A ≻ 0, C − B^⊤A^{−1}B ≻ 0
(iii) C ≻ 0, A − BC^{−1}B^⊤ ≻ 0
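A quick numerical sanity check of the proposition never hurts; the following toy example (the blocks and sizes are arbitrary choices of mine) verifies that (i), (ii), and (iii) agree on a random positive definite matrix.

\begin{verbatim}
import numpy as np

def is_pd(X):
    return np.linalg.eigvalsh((X + X.T) / 2).min() > 0

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
M = G @ G.T + 5 * np.eye(5)          # a random 5 x 5 positive definite matrix
A, B, C = M[:3, :3], M[:3, 3:], M[3:, 3:]

print(is_pd(M))                                             # (i)
print(is_pd(A) and is_pd(C - B.T @ np.linalg.solve(A, B)))  # (ii)
print(is_pd(C) and is_pd(A - B @ np.linalg.solve(C, B.T)))  # (iii)
\end{verbatim}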
7.1.2 Semidefinite Programming
Recall the primal linear program in standard equality form:
\[
\begin{array}{ll}
\min & c^\top x \\
\text{s.t.} & Ax = b \\
& x \ge 0.
\end{array} \tag{7.1}
\]
Let A : S^n → E^m be a linear transformation, b ∈ E^m, and C ∈ S^n. Then we can write
the primal SDP similarly:
\[
\begin{array}{rl}
p^* = \min & \langle C, X \rangle \\
\text{s.t.} & AX = b \\
& X \succeq 0.
\end{array} \tag{7.2}
\]
Note that we can write the linear transformation a bit more explicitly, if we’d like:
\[
AX = \begin{bmatrix} \langle A_1, X \rangle \\ \vdots \\ \langle A_m, X \rangle \end{bmatrix}, \qquad A_i \in \mathbb{S}^n, \tag{7.3}
\]
where A is determined by the covectors ⟨A_i, ·⟩. We can derive the dual SDP by
considering the Lagrangian primal:
\[
p^* = \min_{X \succeq 0} \max_{y} \; L(X, y) := \langle C, X \rangle + y^\top (b - AX) \tag{7.4}
\]
and rewriting to obtain the Lagrangian dual:
\[
d^* = \max_{y} \min_{X \succeq 0} \; L(X, y) = \max_{y} \min_{X \succeq 0} \; b^\top y + \langle X, C - A^* y \rangle. \tag{7.5}
\]
By weak duality (Proposition 3.4) we know that p^* ≥ d^*. How can we determine when
strong duality holds? And how do our interior point methods adapt to this setting?
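As an aside: if you just want to solve a small SDP numerically, modeling languages make it painless. Here’s a toy instance of (7.2) in Python, assuming the CVXPY package is available (the instance itself, minimizing ⟨C, X⟩ over trace-one p.s.d. matrices, is my own choice; its optimal value is the smallest eigenvalue of C, which gives us something to check against).

\begin{verbatim}
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
C = (G + G.T) / 2                     # a random symmetric cost matrix

X = cp.Variable((4, 4), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)),
                  [cp.trace(X) == 1, X >> 0])   # X >> 0 constrains X to be p.s.d.
prob.solve()
print(prob.value, np.linalg.eigvalsh(C).min())  # these should (nearly) agree
\end{verbatim}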
7.2 Least Squares
7.3 Quadratic Programming (and Support Vector Machines)
7.4 Quadratic Assignment Problem
7.5 Max Cut
7.6 Sensor Network Localization
7.7 ¿Neural Networks?
The techniques here are unfortunately not very sophisticated, but deep learning is a meme
too big to ignore. Deep learning as a field is somewhat like alchemy was in the 16th century:
there is a lot of stuff that seems to work, but we really have no idea why. We begin with momentum,
an interesting idea inspired by physics, even if its use in deep learning isn’t fully supported by theory.
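For reference, here’s a minimal sketch of the heavy-ball momentum update (the function name and the default hyperparameters are choices of mine):

\begin{verbatim}
import numpy as np

def gd_momentum(grad, x0, lr=0.01, beta=0.9, iters=1000):
    """Gradient descent with heavy-ball momentum."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        v = beta * v - lr * grad(x)   # a damped running sum of past gradients
        x = x + v                     # step along the accumulated velocity
    return x

# Example: minimize f(x) = (x_1^2 + 10 x_2^2) / 2, whose gradient is below.
print(gd_momentum(lambda x: np.array([1.0, 10.0]) * x, [5.0, 5.0]))
\end{verbatim}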
8 Additional Enrichment
8.1 Convex Optimization on Manifolds
A Appendix I: Prerequisites
A.1 Linear Algebra
Linear algebra is the only field of mathematics which is understood20 . It is a beautiful
theory which forms the foundation of mathematics, including of course optimization.
As a consequence, it is crucial that you understand it.
I won’t go over the basic definitions; I think it’s fair to assume that you’ve taken
a first course in linear algebra, and so you know about vector spaces, bases, linear
transformations, rank and nullity, range and nullspace (or kernel as I’ll call it), and basic
properties of eigenvalues and eigenvectors. In particular, it is of vital importance
that you think of vectors as abstract coordinate-free objects rather than n-tuples of
scalars or pointed arrows.
In this appendix, V will be an arbitrary finite dimensional vector space over the
real numbers R, capital letters will represent linear transformations, and lower case letters
from the earlier part of the alphabet will represent scalars while those from the latter
part will represent vectors. I’ll pick and choose the relevant parts of linear algebra for
optimization. In particular, I will not be covering dual spaces, geometry, scalar fields
other than R, and so on. This stuff should really be review; if it isn’t, you should definitely
take more linear algebra courses.
20
In the extremely crude sense that we have answers to most of the questions.
A.2 Calculus
The world is not linear21 . However, the nonlinear stuff is usually pretty well approx-
imated by the linear stuff, and even better by higher order approximations. We’ll
formalize this notion and more with ideas from differential22 calculus, the study of
change.
I’ll assume that the reader is familiar with the fundamental ideas covered between
pre-calculus and freshman differential calculus. In particular, you should know the many
definitions and characterizations of the real numbers; the epsilon-delta definition of a limit;
the definition and properties of the derivative of a function of one variable; the intermediate
value and extreme value theorems; and the mean value theorem and Taylor’s theorem.
I’ll cover multivariate differential calculus^23, as well as the bits and pieces of real analysis
which we will need.
A.2.1 The first derivative

Local linear approximations, Fréchet differentiation.

A.2.2 The other derivatives

The Hessian, Taylor series.
A.3 Linear Programming

Linear programming is not strictly a prerequisite for much of convex optimization.
However, the theory of linear programs is a fantastic introduction to the field of
optimization, and is beautiful in its own right. I sincerely encourage you to gain
familiarity with the subject, both out of interest and for its many, many applications.
B Appendix II: Proofs and Sketches
Many of the proofs were omitted in the notes, for two reasons. First, a lot of the proofs
are easy yet unrevealing, and I feel that including them would not have added much to
the subject. Second, I wish to use these notes myself as a theorem reference, and the
proofs add a lot of clutter.
However, proofs are still very important. I’m not going to prove all of my claims (that
is your job as a student), but I’ll leave lots of hints and sketches.
21
Most unfortunately.
22
Integral Calculus doesn’t really appear that much, at least at the level of these notes. I do want
to write more one day on differential geometry.
23
Usually taught in a third calculus class.