Optimisation Theory Lecture Notes
January 5, 2018
Contents

1 Basic definitions
  1.1 The real numbers and their order
  1.2 Infimum and supremum
  1.3 Constructing the real numbers
  1.4 Maximization and minimization
  1.5 Sequences, convergence and limits

2 Graph algorithms
  2.1 Graphs, digraphs, networks
  2.2 Walks, paths, tours, cycles, strong connectivity
  2.3 Shortest walks in networks
  2.4 Introduction to algorithms
  2.5 Single-source shortest paths: Bellman–Ford
  2.6 O-notation and running time analysis
  2.7 Single-source shortest paths: Dijkstra’s algorithm

3 Continuous optimization
  3.1 Euclidean norm and maximum norm
  3.2 Sequences and convergence in Rn
  3.3 Open and closed sets
  3.4 Bounded and compact sets
  3.5 Continuity
  3.6 Proving continuity
  3.7 The Theorem of Weierstrass
  3.8 Using the Theorem of Weierstrass

4 First-order conditions
  4.1 Introductory example
  4.2 Differentiability in Rn
  4.3 Partial derivatives and C1 functions
  4.4 Taylor’s theorem
  4.5 Unconstrained optimization
  4.6 Equality constraints and the Theorem of Lagrange
  4.7 Inequality constraints and the KKT conditions

5 Linear optimization
  5.1 Linear functions, hyperplanes, and halfspaces
  5.2 Linear programming: introduction
  5.3 Linear programs and duality
  5.4 Lemma of Farkas and proof of strong LP duality
  5.5 Boundedness and dual feasibility
  5.6 General LP duality
  5.7 Complementary slackness
  5.8 Convex sets and functions
  5.9 LP duality and the KKT theorem
  5.10 The simplex algorithm: example
  5.11 The simplex algorithm: general description
Chapter 1
Basic definitions
1.1 The real numbers and their order
that the distance between 0 and 1 is 1. The set of reals R is thought to be exactly
the set of these points on the line.
The fact that real numbers are ordered makes them one of the most useful
mathematical objects in practical applications. For example, the complex num-
bers cannot be ordered in such a useful way. The complex numbers allow us to
solve arbitrary polynomial equations such as x2 = −1, for which no real solution
x exists, because x2 ≥ 0 holds for every real number x. This, as we show shortly,
is a property of the order relation ≥, and so we cannot have a system of numbers
that we can order and thus minimize and maximize, and that at the same time
allows us to solve arbitrary polynomial equations.
We state a few properties of the order relation ≥ that imply the inequality
x2 ≥ 0 for all x. Most importantly, the order relation x ≥ y should be compatible
with addition in the sense that we can add any number z to both sides and pre-
serve the property (which is obvious from our picture of the real line). That is, for
any reals x, y, z
x ≤ y ⇒ x + z ≤ y + z , (1.1)
which for z = −x − y implies
x ≤ y ⇒ −y ≤ −x . (1.2)
Because −x = (−1) · x, the implication (1.2) states the well-known rule that multiplication with −1 reverses an inequality. If you forget why this is the case, simply subtract the sum of both sides from each side of the inequality to get this result, as we have done here with z = −x − y. Another condition concerning multiplication of real numbers is
x, y ≥ 0 ⇒ x · y ≥ 0 . (1.3)
In terms of the real number line, this means that y is “stretched” (if x > 1) or “shrunk” (if 0 ≤ x < 1) or stays the same (if x = 1) when multiplied with the nonnegative number x, but stays on the same side as 0 (the “stretching” picture applies to any real number y; in (1.3), y is assumed to be nonnegative). Condition (1.3) holds for a positive integer x as a consequence of (1.1), because y ≥ 0 implies y + y ≥ y and hence 2y = y + y ≥ y ≥ 0, and similarly for any repeated addition of y. Extending this from positive integers x to real numbers x gives (1.3).
We now show that x · x ≥ 0 for any real number x. This holds if x ≥ 0 by (1.3).
If x ≤ 0 then −x ≥ 0 by (1.2), and so x · x = (−1) · (−1) · x · x = (−x)(−x) ≥ 0,
this time again by (1.3), where we have used that (−1) · (−1) = 1. This, in turn, follows
from something we have already used, namely that (−1) · y = −y for any y,
because −y is the unique negative of y so that y + (−y) = 0: Namely, we also
have y + (−1) · y = 1 · y + (−1) · y = (1 − 1) · y = 0 · y = 0, so (−1) · y is indeed
−y. Similarly, −(−y) = y (because (−y) + y = 0) and in particular −(−1) = 1,
which means (−1) · (−1) = 1 as claimed.
A systematic derivation of all properties of the order ≤ of the reals in com-
bination with the arithmetic operations addition and multiplication is laborious,
and so we appeal to the intuition of the real number line. Here we note the following “axioms” for ≤, each of which is worth understanding in its own right: transitivity (1.4), antisymmetry (1.5), reflexivity (1.6), and totality (1.8). For all x, y, z ∈ R:
x ≤ y, y ≤ z ⇒ x ≤ z. (1.4)
x ≤ y, y ≤ x ⇒ x = y. (1.5)
x ≤ x. (1.6)
x ≤ y or y ≤ x . (1.8)
Definition 1.1 An ordered set is a set S together with a binary relation ≤ that is
transitive, antisymmetric, and reflexive, that is, (1.4), (1.5), (1.6) hold for all x, y, z
in S. The order is called total and the set totally ordered if (1.8) holds for all x, y
in S.
upper bound in Q (defined shortly), but it does in R. Intuitively, the parabola that
consists of all pairs ( x, y) so that y = x2 − 2 is a “continuous curve” in R2 that
should intersect the “x-axis” where y = 0 at two points ( x, y) where x = ±√2
and y = 0, in agreement with the intuition that x can take all values on the real
number line.
The following definition of an upper or lower bound of a set applies to any
ordered set; the order need not be total. The applications we have in mind occur
when the ordered set S is R or Q, but there are other interesting cases as well that
will be considered in the exercises.
Definition 1.3 Let S be an ordered set and let A ⊆ S. We say A is bounded from
above if A has an upper bound, and bounded from below if A has a lower bound,
and just bounded if A is bounded from above and below. The least upper bound
or supremum of a set A, denoted sup A, is the least element of the set of upper
bounds of A (if it exists). The greatest lower bound or infimum of a set A, denoted
inf A, is the greatest element of the set of lower bounds of A (if it exists).
Proposition 1.4 Let S be an ordered set and let A ⊆ S. Then A has a maximum if and
only if sup A exists and belongs to A. Similarly, A has a minimum if and only if inf A
exists and belongs to A.
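For example, let A = [0, 1) = { x ∈ R | 0 ≤ x < 1}. Then inf A = 0, which belongs to A, so A has the minimum 0. On the other hand, sup A = 1, which does not belong to A, and indeed A has no maximum: for any x ∈ A, the element ( x + 1)/2 of A is larger than x.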
The mentioned order completeness of R asserts that any nonempty set of real
numbers with an upper bound has a least upper bound (supremum), and any
nonempty set of real numbers with a lower bound has a greatest lower bound
(infimum). It is a basic property of the real numbers.
Axiom 1.5 Let A be a nonempty set of real numbers. Then if A is bounded from above
then A has a supremum sup A, and if A is bounded from below then A has an infimum
inf A.
We have stated condition 1.5 as an axiom about the real numbers rather than
a theorem. That is, we assume this condition for all sets of real numbers, according
to our intuition about the real number line.
In an exercise, you will be asked to prove that one of the conditions in Ax-
iom 1.5 (existence of a supremum for any nonempty set that is bounded from
above) implies the other (existence of an infimum for any nonempty set that is
bounded from below). Interestingly, this holds already for any ordered set and
does not require considering −x for x ∈ A as in (1.2).
1.3 Constructing the real numbers

A rational number is given by a fraction p/q of integers p and q with q ≠ 0, where ap/( aq) (for any positive integer a) represents the same fraction as p/q. The set of all fractions defines the set Q of rational numbers.
The most familiar way to define the real numbers is as infinite decimal frac-
tions. A decimal fraction starts with the representation of an integer in decimal
notation, followed by a decimal point, followed by an infinite sequence of deci-
mal digits. The decimal fraction that represents a real number is unique except
when the fraction is finite, that is, after some time all digits are 0, for example 1.25
(which represents the fraction 5/4 in decimal). As an infinite sequence of decimal
digits, this can be written as either 1.25000 . . . or 1.24999 . . .. Here one typically
chooses the finite sequence that ends in all 0’s rather than the sequence that ends
in all 9’s. For example, 1/3 is represented as 0.333 . . .. Multiplied by 3, this gives
1 or the equivalent representation 0.999 . . .. It can be shown that any rational
number is represented by a decimal fraction that is eventually periodic, that is, it
becomes an infinitely repeated finite sequence of digits. For example, 1/7 has the
decimal fraction 0.142857142857142857 . . . which is written as 0.142857. Another
example is 1/12 = 0.08333 . . . = 0.083. That is, any rational number, either as
p/q with a pair of integers p and q, or as an eventually periodic decimal fraction,
has a finite description.
In contrast, an arbitrary real number (which can be irrational, that is, it is not
an element of Q) is a general decimal fraction. It requires an infinite description
with an infinite sequence of digits after the decimal point which in general has no
predictable pattern. For example, the ratio π of the circumference of a circle to its
diameter starts with 3.1415926 . . . but has no discernible pattern in its digits (of
which billions have been computed). A single real number is therefore already
“described” by an infinite object. In practice, we are typically content to assume
that a finite prefix of the sequence of digits suffices to describe the real num-
ber “sufficiently accurately”, where we can extend this accuracy as much as we
like. The intuition is that finite (truncated) decimal fractions (which are rational
numbers) approximate the represented real number more and more accurately,
depending on the considered length of the truncated sequence.
One complication of infinite decimal fractions is that the arithmetic opera-
tions, such as addition and multiplication, are hard to describe using these in-
finite descriptions. Essentially, they are performed on the finite approximations
themselves. A way to do this more generally is to define real numbers as “Cauchy
sequences” of rational numbers.
A sequence of numbers (which themselves can be rationals or reals) is written
as x1 , x2 , . . . with elements xk of the sequence for each k ∈ N (where xk ∈ Q for a
sequence of rational numbers, and xk ∈ R for a sequence of real numbers). The
entire sequence is denoted by { xk }k∈N or just { xk } with the understanding that k
goes through all natural numbers.
A Cauchy sequence { xk } has the property that any two of its elements are eventually arbitrarily close together, that is,
∀ε > 0 ∃K ∈ N ∀i, j ∈ N : i, j ≥ K ⇒ | xi − x j | < ε . (1.9)
That is, for any positive ε, which can be as small as one likes, there is a subscript
K so that for all i and j that are at least K, the sequence elements xi and x j differ
by less than ε. Note that, in particular, we could choose i = K and j arbitrarily
larger than i, and yet xi and x j would differ by less than ε.
In the Cauchy condition (1.9), all elements x1 , x2 , . . . and ε can be rational
numbers. An example of such a Cauchy sequence is the sequence of finite decimal
fractions xk obtained from an infinite decimal fraction up to the kth place past the
decimal point. For example, if the infinite decimal fraction is 3.1415926 . . ., then
this sequence of rational numbers is given by x1 = 3.1, x2 = 3.14, x3 = 3.141,
x4 = 3.1415, x5 = 3.14159, x6 = 3.141592, x7 = 3.1415926, and so on, which is
easily seen to be a Cauchy sequence.
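As a concrete illustration in Python (a sketch; the function truncation and the use of √2 are ad-hoc choices for this example), the decimal truncations of √2 form a Cauchy sequence of rational numbers, and condition (1.9) can be checked with exact arithmetic:

    from fractions import Fraction
    from math import isqrt

    def truncation(k: int) -> Fraction:
        # the decimal fraction for sqrt(2), cut off after the kth place:
        # floor(sqrt(2) * 10^k) / 10^k, computed exactly via isqrt
        return Fraction(isqrt(2 * 10 ** (2 * k)), 10 ** k)

    eps = Fraction(1, 10 ** 5)   # epsilon = 10^(-5), itself rational
    K = 5                        # i, j >= K gives |x_i - x_j| < eps
    for i in range(K, K + 20):
        for j in range(K, K + 20):
            assert abs(truncation(i) - truncation(j)) < eps

    print(truncation(7))         # 14142135/10000000, i.e. 1.4142135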
So, more generally, we can define a real number to be a Cauchy sequence
of rational numbers. Two sequences { xk } and {yk } are equivalent if | xk − yk | is
arbitrarily small for sufficiently large k, and if one of these two sequences is a
Cauchy sequence then so is the other (which is easily seen). Any two equivalent
Cauchy sequences define the same real number. Note that a real number is an
infinite object (in fact an entire equivalence class of Cauchy sequences of rational
numbers), similar to an infinite decimal fraction.
With real numbers defined as Cauchy sequences of rational numbers, it is
possible to prove Axiom 1.5 as a theorem. This requires showing the existence
of limits of sequences of real numbers, and the construction of a supremum as a
suitable limit; see R. K. Sundaram (1996), A First Course in Optimization Theory
(Cambridge University Press), Appendix B and Section 1.2.4.
We mention a second possible construction of the real numbers where Axiom
1.5 is much easier to prove. A Dedekind cut is a partition of Q into two nonempty
sets L and U so that a < b for every a ∈ L and b ∈ U, and so that L has no
maximal element. The idea is that each real number x defines uniquely such a cut
of the rational numbers into L and U given by
L = { a ∈ Q | a < x }, U = {b ∈ Q | b ≥ x } . (1.10)
If x is itself a rational number, then x belongs to the upper set U for the Dedekind
cut L, U for x (which is why we require that L has no maximal element, to make
this a unique choice). If x is irrational, then x belongs to neither L nor U and is
“between” L and U. Hence, we can see this cut as a definition of x. The Dedekind
cut L, U in (1.10) that represents x is unique. This holds because any two different
real numbers x and y define different cuts (because a suitable rational number c
with x < c < y will belong to the “upper” set of the cut for x but to the “lower”
set of the cut for y).
In the construction of R via Dedekind cuts, that is, the described partitions L, U
of Q, each real number has a unique description as such a cut. Each such cut is uniquely
determined by its “lower” set L, by taking the “upper” set U as the set of upper
bounds of L (if we start just with the set L, then we have to require that L is
nonempty, is bounded from above, contains with each a any rational number
smaller than a, and has no maximal element). So, similar to the representation as
a Cauchy sequence, a real number x has an infinite description as a set of rational
numbers. If x and x′ have in this description lower cut sets L and L′, respectively,
then we can define x ≤ x′ by the inclusion relation L ⊆ L′ (as seen from (1.10)).
Now Axiom 1.5 is very easy to prove: Given a nonempty set A, bounded
above, of real numbers x represented by their lower cut sets L in (1.10), the supre-
mum of A is represented by the union of these sets L. This union is a set of rational
numbers, which can be easily shown to fulfill the properties of a lower cut set of
a Dedekind cut, and thus defines a real number, which can be shown to be the
supremum sup A.
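This construction of the supremum can be mimicked in a small Python sketch (with ad-hoc names, purely for illustration): a lower cut set L is represented by its membership test on rational numbers, and the union of lower cut sets is the pointwise “or” of these tests.

    from fractions import Fraction

    def cut_of(x):
        # the lower cut set L of x from (1.10), as a membership test
        return lambda a: a < x

    def sup_cut(cuts):
        # the union of the lower cut sets represents the supremum
        return lambda a: any(L(a) for L in cuts)

    L = sup_cut([cut_of(Fraction(1, 3)), cut_of(Fraction(1, 2))])
    print(L(Fraction(2, 5)), L(Fraction(3, 5)))   # True False (sup is 1/2)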
Dedekind cuts are an elegant construction of the real numbers from the ratio-
nal numbers. It is slightly more complicated to define arithmetic operations of ad-
dition and, in particular, multiplication of real numbers via the rational numbers
in the respective cut sets than using Cauchy sequences, but the order property is
very accessible.
However, Dedekind cuts are an abstraction that “defines” a point x on the
real line via all the rational numbers a to the left of x, which defines the lower cut
set L in (1.10). This infinite set L is mathematically “simpler” than x because it
contains only rational numbers a. We “know” these rational numbers via their fi-
nite descriptions as fractions, but as points on the line they do not provide a good
intuition about the reals. In our reasoning about the real numbers, we therefore
refer usually to our intuition of the real number line.
Useful literature on the material of this section:
https://en.wikipedia.org/wiki/Construction_of_the_real_numbers
https://en.wikipedia.org/wiki/Dedekind_cut
1.4 Maximization and minimization

The following is an easy but useful observation, proved with the help of (1.2).
We use it to consider only maximization problems, rather than repeating very
similar considerations for minimization problems.
Theorem 1.10 Suppose X = X1 ∪ X2 (the two sets X1 and X2 need not be disjoint), so
that there exists an element y in X1 so that f (y) ≥ f ( x ) for all x ∈ X2 , and f attains a
maximum on X1 . Then f attains a maximum on X.
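As an example of how Theorem 1.10 is used, let f ( x ) = − x2 on X = R, with X1 = [−1, 1] and X2 = { x ∈ R | | x | > 1}. The element y = 0 of X1 fulfills f (y) = 0 > −1 ≥ f ( x ) for all x ∈ X2 , and f attains a maximum on X1 (here at y itself), so f attains a maximum on all of X.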
1.5 Sequences, convergence and limits

Analysis and the study of continuity require the use of sequences and limits. For
the moment we limit ourselves to sequences of real numbers. Recall that a se-
quence x1 , x2 , . . . is denoted by { xk }k∈N or just { xk }. The limit of such a sequence,
if it exists, is a real number L so that the elements xk of the sequence are eventually arbitrarily close to L. This closeness is described by a maximum distance from L, often called ε, that can be an arbitrarily small positive real number:
∀ε > 0 ∃K ∈ N ∀k ∈ N : k ≥ K ⇒ | xk − L| < ε . (1.11)
In words, (1.11) says that for every (arbitrarily small) positive ε there is some index K so that from K onwards (k ≥ K) all sequence elements xk differ in absolute value by less than ε from L.
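For example, the sequence { xk } given by xk = 1/k has the limit L = 0: given any ε > 0, choose K ∈ N with K > 1/ε; then k ≥ K implies | xk − L| = 1/k ≤ 1/K < ε.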
The next proposition asserts that if a sequence has a limit, that limit is unique
(something you should remember from Real Analysis – try proving it yourself
before you read the proof).
Proof. Suppose there are two limits L and L′ of the sequence { xk }k∈N with L ≠ L′.
We arrive at a contradiction as follows. Let ε = | L − L′ |/2 and consider K and K′
so that k ≥ K implies | xk − L| = | L − xk | < ε and k ≥ K′ implies | xk − L′ | < ε.
Consider some k so that k ≥ K and k ≥ K′. Then
| L − L′ | ≤ | L − xk | + | xk − L′ | < ε + ε = | L − L′ | ,
which is a contradiction.
A sequence { xk } tends to infinity if its elements eventually exceed every bound, that is,
∀ M ∈ R ∃K ∈ N ∀k ∈ N : k ≥ K ⇒ xk > M . (1.13)
Proof. The following argument has a nice visualization in terms of “hotels that
have a view of the sea”. Suppose the real numbers x1 , x2 , . . . are the heights of
hotels. From the top of each hotel with height xk you can look beyond the subse-
quent hotels with heights xk+1 , xk+2 , . . . if they have lower height, and see the sea
at infinity if these are all lower. In other words, a hotel has “seaview” if it belongs
to the set S given by
S = { k ∈ N | xk > x j for all j > k }
(presumably, these are very expensive hotels). If S is infinite, then we take the
elements of S, in ascending order as the subscripts k1 , k2 , k3 , . . . that give our sub-
sequence xk1 , xk2 , xk3 , . . ., which is clearly decreasing. If, however, S is finite with
maximal element K (take K = 0 if S is empty), then for each k > K we have
k 6∈ S and hence for xk there exists some j > k with xk ≤ x j . Starting with xk1
for k1 = K + 1 we let k2 = j > k1 with xk1 ≤ xk2 . Then find another k3 > k2
so that xk2 ≤ xk3 , and so on, which gives a nondecreasing subsequence { xkn }n∈N
with xk1 ≤ xk2 ≤ xk3 ≤ · · · . In either case, the original sequence has a monotonic
subsequence.
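The “seaview” construction has a finite analogue that is easy to experiment with in Python (a sketch; the proof concerns infinite sequences, so here a hotel only needs to overlook the finitely many hotels listed after it):

    def seaview_indices(x):
        # indices k so that x[k] is strictly greater than all later entries
        return [k for k in range(len(x))
                if all(x[k] > x[j] for j in range(k + 1, len(x)))]

    print(seaview_indices([5, 3, 4, 2, 1]))   # [0, 2, 3, 4]: heights 5, 4, 2, 1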
The last two propositions together give a famous theorem known as the “Bol-
zano–Weierstrass” theorem.
Chapter 2

Graph algorithms
In an abstract setting, the women and men in this example define the nodes
of a graph, with the possible couples called edges. The graph in the marriage
problem has the special property of being bipartite (meaning that edges always
connect two nodes a and b that come from two disjoint sets). We first define the
concept of a general graph (or undirected graph), and for reasons of space will in
fact not consider further the marriage problem or bipartite graphs.
Definition 2.1 A graph is given by (V, E) with a finite set V of vertices or nodes
and a set E of edges which are unordered pairs {u, v} of nodes u, v.
(2.1) [picture of a graph with nodes a, b, c, d]
We will normally not assume that connections between two nodes u and v are
symmetric (even though this may apply in many cases). The concept of a directed
graph allows us to distinguish between getting from u to v and getting from v to u.
Definition 2.2 A digraph (directed graph) is given by (V, A) with a finite set V of
vertices or nodes and a set A of arcs which are ordered pairs (u, v) of nodes u, v.
(2.2) [picture of a digraph with nodes a, b, c, d; its arcs include ( a, b), (b, c), (c, d), (b, d), and (d, b)]
An arc (u, v) is also called an arc from u to v. In a digraph, we do not allow arcs
(u, u) from a node u to itself (such arcs are called “loops”). We do allow arcs in
reverse directions such as (u, v) and (v, u), as in the example (2.2) for u, v = b, d.
Because A is a set, it cannot record multiple or “parallel” arcs between the same
nodes, that is, two or more arcs of the form (u, v), so these are automatically
excluded.
The following is an example of a network with the digraph from (2.2). Next
to each arc is written its weight (in many examples we use integers as weights,
but this need not be the case, like here where w( a, b) = 1.2).
[picture of a network: the digraph (2.2) with the weight of each arc written next to it; w( a, b) = 1.2, and the other weights shown are −9, 3, −7, and 2]
The underlying structure in our study will always be a digraph, typically with a
weight function so that this becomes a network. The arcs in the digraph represent
connections between nodes that can be followed. A sequence of such connections
defines a walk in the network, as well as special cases of walks according to the
following terminology.
The visited nodes on a walk are just the nodes of the walk. In a walk, but not
in a path, a node may be revisited. A tour starts and ends at the same node. A
cycle also has the same startpoint and endpoint but otherwise does not revisit
any node.
In (2.2), the sequence a, b, c, d, b is a walk but not a path, a, b, c, d is a path,
d, b, d is a tour and a cycle, and c, d, b, d, b, c is a tour but not a cycle.
A walk, path, tour, or cycle is sometimes called a directed walk, directed path, directed tour, or directed cycle to emphasize that arcs cannot be traversed backwards. Because we only consider a digraph (and not, say, the graph that results when ignoring the direction of each arc), we do not need to add the adjective “directed”.
One calls a graph “connected” if any two nodes are connected by a path; for
example, the graph (2.1) is connected. For a digraph, strong connectivity requires
the existence of a walk between any two nodes u, v, where “strong” emphasizes
that the digraph has a walk from u to v and a walk from v to u (which is implied
by the following definition where v, u is just another pair of nodes).
Definition 2.5 A digraph D = (V, A) is called strongly connected if for every two
nodes u, v there is a walk from u to v.
The digraph in (2.2) is not strongly connected because there is no walk from
b to a.
Paths are of particular interest in digraphs because there are only finitely
many of them, because each visited node on a path must be new. (As discussed
at the beginning of this chapter, the number of possible paths may still be huge.)
The following is a simple but crucial observation.
Proposition 2.6 Consider a digraph and two nodes u, v. If there is a walk from u to v,
then there is a path from u to v.
In a similar way, one can show the following (the qualifier “positive length”
is added to exclude the trivial cycle of length zero).
Proposition 2.7 Consider a digraph and a node u. If there is a tour of positive length
that starts and ends at u, then there is a cycle of positive length that starts and ends at u.
2.3 Shortest walks in networks
In a network, the weights typically represent costs of some sort associated with
the respective arcs. Weights for walks (and similarly of paths, tours, cycles) are
defined by summing the weights of their arcs.
Note the difference between length and weight of a walk: length counts the
number of arcs in the walk, whereas weight is the sum of the weights of these
arcs (only if every arc has weight one, then length is the same as weight).
Given a network and two nodes u and v, we are interested in the u, v-walk
of minimum weight, often called shortest walk from u to v (remember throughout
that “shortest” means “least weighty”). Because there may be infinitely many
walks between two nodes if there is a possibility to revisit some nodes on the
way, “minimum weight” may not be a real number, but be equal to plus or minus
infinity.
Writing Y (u, v) for the set of all walks from u to v, the distance dist(u, v) is defined as +∞ if Y (u, v) = ∅, and as inf{ w(W ) | W ∈ Y (u, v) } otherwise.
Theorem 2.10 Let u, v be two nodes in a network ( D, w). Then dist(u, v) = −∞ if and
only if there is a cycle C with w(C ) < 0 that starts and ends at some node on a walk from
u to v. If dist(u, v) ≠ ±∞, then
dist(u, v) = min{ w( P) | P is a path from u to v } . (2.3)
Proof. We make repeated use of the proof of Proposition 2.6. Suppose there is a
walk P = u0 , u1 , . . . , uk with u0 = u and uk = v and a cycle C = ui , v1 , . . . , v`−1 , ui
that starts and ends at some node ui on that walk, 0 ≤ i ≤ k, with w(C ) < 0. Let
n ∈ N. We insert n repetitions of C into P to obtain a walk W that we write (in an
obvious notation) as
W = u0 , . . . , ui , C, . . . , C, ui+1 , . . . , uk .
The first i arcs together with the last k − i arcs of W are those of P, with n copies of
C in the middle, so W has weight w( P) + n · w(C ). For larger n this is arbitrarily
negative because w(C ) < 0, and W belongs to the set Y (u, v). Hence dist(u, v) =
−∞.
Conversely, let dist(u, v) = −∞. Consider a path P from u to v of minimum
weight w( P) as given by the minimum in (2.3). Suppose there is a u, v-walk W
with w(W ) < w( P), which exists because dist(u, v) = −∞ (otherwise w( P) would
be a lower bound of the weights of the walks in Y (u, v)). Because W is clearly not
a path, it contains a tour T as in the proof of Proposition 2.6. If w( T ) ≥ 0 then we
could remove T from W and obtain a walk W′ with weight w(W′ ) = w(W ) − w( T ) ≤ w(W ) and thus
eventually a path of weight less than w( P), in contrast to the definition of P. So
W contains a tour T with w( T ) < 0, which starts and ends at some node x, say.
We now claim that T contains a cycle C with w(C ) < 0. If T is itself a cycle, that
is clearly the case. Otherwise, T either contains a “subtour” T′ with w( T′ ) < 0
(and in general some other startpoint y) which we can consider instead of T, or
else every subtour T′ of T fulfills w( T′ ) ≥ 0, in which case we can remove T′ from
T without increasing w( T ); an example of these two possibilities is shown in the
following picture with T = x, y, z, y, x and T′ = y, z, y.
(2.4) [two pictures of networks on the nodes u, x, y, z, v, illustrating the two possibilities: on the left, T = x, y, z, y, x contains the subtour T′ = y, z, y of negative weight; on the right, every subtour of T has nonnegative weight, and removing T′ from T leaves a negative cycle]
In either case, T is eventually reduced to a cycle C with w(C ) < 0 which is part of
W (where W is modified alongside T when removing subtours T 0 of nonnegative
weight). This shows the first claim of the theorem.
This implies that if dist(u, v) ≠ ±∞, then there is a walk and hence a path
from u to v, and no u, v-walk contains a cycle or tour of negative weight, and hence
(2.3) holds according to the preceding reasoning.
Note that the left picture in (2.4) shows that we can have dist(u, v) = −∞
even though no cycle of negative weight can be inserted into a path from u to v.
In this example it is only possible to insert a tour of negative weight into a path
from u to v, or to insert a cycle of negative weight into a walk from u to v.
Theorem 2.10 would be simpler to prove if it just stated the existence of a
negative-weight tour that can be inserted into a walk from u to v as an equivalent
statement to dist(u, v) = −∞. However, it has been common to call this condition
the existence of negative-weight (or just “negative”) cycles. For that reason, we
proved the stronger statement.
Consider again the network in the left picture in (2.4) but with the arc (y, x )
removed:
[picture: the left network of (2.4) without the arc (y, x ), where w(u, x ) = w( x, v) = 1 and the negative cycle y, z, y is reachable from u via x]
In that case dist(u, v) = 2 because the only walk from u to v is the path u, x, v, and
the negative (-weight) cycle y, z, y can be reached from u but cannot be extended
to a walk to v. Nevertheless, we will in the following consider all negative cycles
that can be reached from a given node u as “bad” for the computation of distances
dist(u, v) for nodes v.
2.4 Introduction to algorithms

Algorithm 2.11 (Finding the minimum in a set of numbers)

Input: a nonempty finite set S of real numbers.
Output: the minimum m of S.

1. m ← some element of S
2. remove m from S
3. while S ≠ ∅ :
4. x ← some element of S
5. m ← min{m, x }
6. remove x from S
In this algorithm, we first specify its behaviour in terms of its input and out-
put. Here the input is a nonempty finite set S of real numbers, and the output is
the minimum of that set, denoted by m. The algorithm is described by a sequence
of instructions. These instructions are numbered, but solely for reference; the line
numbers are irrelevant for the operation of the algorithm (some descriptions of
algorithms make use of them, such as “go to step 3”, but we do not).
Line 1 says that the variable m (which here stores a real number) will be set
to the result of the right-hand side of the arrow “ ← ”. Such a statement is also
called an assignment. Here, this right-hand side is a function that produces some
element of S (which for the moment we assume is implemented somehow). Line 2
states to remove m from S. Line 3 is the beginning of a repeated sequence of in-
structions, and says that as long as the condition S ≠ ∅ is true, the instructions
in the subsequent lines 4–6 will be executed. The fact that the repeated set of
instructions are these three lines (and not just line 4, say) is indicated in this no-
tation by the indentation of lines 4–6, that is, they all start further to the right with
a fixed blank space inserted on the left (see also the discussion that follows Al-
gorithm 2.12 below). This convention makes such a “computer program” very
readable, and is in fact adopted by the programming language Python. Other
programming languages, such as C, C++, or Java, require that several instruc-
tions that are to be executed together are put between delimiters { and } (so the
opening brace { would appear between lines 3 and 4 and the corresponding clos-
ing brace } after line 6); we use indentation instead.
Consider now what happens in lines 3–6 which are executed repeatedly. First,
if S = ∅, then the computation in these lines finishes, and because there are no
further instructions, the algorithm terminates altogether. This happens immedi-
ately if the original set S contains only a single element. Otherwise, S is not empty
and in line 4 another element x is found in S. In line 5, the right-hand side is the
minimum of two numbers, here the current values of m and x, and the result will
be assigned again to m. The effect is that if x < m, then m will assume the new
smaller value x, otherwise m is unchanged. In line 6, the element x is removed
from S. Because S loses one element in each iteration and is originally finite, S
will eventually become empty, and the algorithm terminates. It can be seen that
m will then be the smallest of all the elements in S, as required.
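For comparison, Algorithm 2.11 can be transcribed directly into Python (a sketch, not part of the original pseudocode; Python’s built-in set plays the role of S, and pop returns and removes some element):

    def minimum_of_set(S):
        S = set(S)        # work on a copy, so the caller's set survives
        m = S.pop()       # m <- some element of S, removed from S
        while S:          # while S is not empty
            x = S.pop()   # x <- some element of S, removed from S
            m = min(m, x)
        return m

    print(minimum_of_set({3.5, -1.0, 2.0}))   # -1.0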
Several observations are in order. First, S will in practice not contain a set of
arbitrary real numbers but only of rational numbers, which have a finite represen-
tation; even that representation is typically limited to a certain number of digits.
Nevertheless, the algorithm works also in an idealized setting where S contains
arbitrary reals. Second, the instruction in line 5 seems circuitous because it asks
to compute the minimum of two numbers m and x, but we are meant to define
an algorithm that computes a minimum of a finite set. However, computing the
minimum of two numbers is a simpler task, and in fact one of the numbers, m,
will become the result. That is, line 5 can be replaced by the more basic conditional
instruction that uses an if statement:
5. if x < m : m ← x
where the assignment m ← x will not happen if x < m is false, that is, if x ≥ m,
in which case m is unchanged. We have chosen the description in Algorithm 2.11
because it is more readable.
A further observation is that the algorithm can be made more “elegant” by
avoiding the repetition of the similar instructions in lines 2 and 6. Namely, we
omit line 2 altogether and replace line 1 with the assignment
1. m ← ∞
under the assumption that an element ∞ that is larger than all real numbers exists
and can be stored in the computer. In that case, the first element that is found in
the set S is x in line 4, which when compared in line 5 with m (which currently
has value ∞) will certainly fulfill x < m and thus m takes in the first iteration
the value of x, which is then removed from S in line 6. So then the first iteration
of the “loop” in lines 3–6 performs what happened in lines 1–2 in the original
Algorithm 2.11. This variant of the algorithm is not only shorter but also more
general because it can also be applied to an empty set S. It is reasonable to define
min ∅ = ∞ because ∞ is the neutral element of min in the sense that min{ x, ∞} =
x for all reals x, just as an empty sum is 0 (the neutral element of addition) or an
empty product is 1 (the neutral element of multiplication). For example, this
would apply to the case that dist(u, v) = ∞ in (2.3) when there is no path from u
to v.
When Algorithm 2.11 terminates, the set S will be empty and therefore no
longer be the original set. If this is undesired, one may instead create a copy of
S on which the algorithm operates that can be “destroyed” in this way while the
original S is preserved.
This raises the question of how a set S is represented in a computer. The best
way to think of this is as a table of a fixed length n, say, that stores the elements of S
which will be denoted as S[1], S[2], . . . , S[n]. Each table element S[i ] for 1 ≤ i ≤ n
is a “real” number in a given limited precision just as the variables m and x. In
programming terminology, S is then also called an array of numbers, with a given
array index i in a specified range (here 1 ≤ i ≤ n) to access the array element S[i ].
In the computer, the array corresponds to a consecutive sequence of memory
cells, each of which stores an array element. The only difference to a set S is that
in that way, repetitions of elements may occur if the numbers S[1], S[2], . . . , S[n]
are not all distinct. Computing the minimum of these n (not necessarily distinct)
numbers is possible just as before. Algorithm 2.12, shown below, is close to an
actual implementation in a programming language such as Python. We just say
“numbers” which are real (in fact, rational) numbers as they can be represented
in a computer.
In Algorithm 2.12, the indentation (white space at the left) in lines 6–7 means
that these two statements are executed if the condition S[k] < m of the if state-
ment in line 5 is true. Line 8 has the same indentation as line 5, so the statement
k ← k + 1 in line 8 is executed in every iteration of the while loop, regardless of
the outcome of the test in line 5.

Algorithm 2.12 (Finding the minimum in an array of numbers, and its position)

Input: an array S[1], . . . , S[n] of numbers, n ≥ 1.
Output: the minimum m of these numbers, and a position i with S[i ] = m.

1. m ← S [1]
2. i ← 1
3. k ← 2
4. while k ≤ n :
5. if S[k ] < m :
6. m ← S[k]
7. i ← k
8. k ← k+1
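Algorithm 2.12 is indeed nearly a Python program. A direct transcription (a sketch; Python lists are 0-indexed, so the table S[1], . . . , S[n] becomes S[0], . . . , S[n − 1], and positions shift accordingly):

    def minimum_with_position(S):
        m = S[0]              # line 1: m <- S[1]
        i = 0                 # line 2: i <- 1
        k = 1                 # line 3: k <- 2
        while k < len(S):     # line 4: while k <= n
            if S[k] < m:      # line 5
                m = S[k]      # line 6
                i = k         # line 7
            k = k + 1         # line 8
        return m, i

    print(minimum_with_position([5.0, 2.0, 8.0, 2.0]))   # (2.0, 1)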
Algorithm 2.13 (Finding the minimum in a set of numbers using “for all”)
1. m ← ∞
2. for all x ∈ S :
3. m ← min{m, x }
2.5 Single-source shortest paths: Bellman–Ford

[picture: a network with source s and further nodes x, y, z; as discussed below, its arcs are (s, x ), ( x, s), ( x, y), ( x, z), (z, y), with w(s, x ) = 1, w( x, s) = 2, w( x, y) = 2, w( x, z) = −1, w(z, y) = 1] (2.6) together with the table (updated entries, shown boxed in the original, are written here in square brackets):

v        s   x    y    z
d[v, 0]  0   ∞    ∞    ∞
d[v, 1]  0   [1]  ∞    ∞
d[v, 2]  0   1    [3]  [0]
d[v, 3]  0   1    [1]  0
28 CHAPTER 2. GRAPH ALGORITHMS
Algorithm 2.14 (Bellman–Ford, first version)

1. d[s, 0] ← 0
2. for all v ∈ V − {s} : d[v, 0] ← ∞
3. i ← 0
4. while i < |V | − 1 :
5. for all v ∈ V : d[v, i + 1] ← d[v, i ]
6. for all (u, v) ∈ A :
7. d[v, i + 1] ← min{ d[v, i + 1], d[u, i ] + w(u, v) }
8. i ← i+1
9. for all (u, v) ∈ A :
10. if d[u, |V | − 1] + w(u, v) < d[v, |V | − 1] :
11. print “Negative cycle!” and stop immediately
12. for all v ∈ V : dist(s, v) ← d[v, |V | − 1]
The right side of (2.6) shows d[v, i ] as rows of a table for i = 0, 1, 2, 3, with the
vertices v as columns. In lines 1–2 of Algorithm 2.14, these values are initialized
(initially set) to d[s, 0] = 0 and d[v, 0] = ∞ for v ≠ s. Lines 4–8 represent the
main loop of the algorithm, where i takes successively the values 0, 1, . . . , |V | − 2,
and the entries in row d[v, i + 1] are computed from those in row d[v, i ]. The
important property of these numbers, which we prove shortly, is the following.
Theorem 2.15 In Algorithm 2.14, at the beginning of each iteration of the main loop
(lines 4–8), d[v, i ] is the smallest weight of any walk from s to v that has at most i arcs.
The main loop begins with i = 0 after line 3. In line 5, the entries d[v, i + 1]
are copied from d[v, i ], and will subsequently be updated. In the example (2.6),
d[v, 1] first contains 0, ∞, ∞, ∞. Lines 6–7 describe a second “inner” loop that
considers all arcs (u, v). Whenever d[u, i ] + w(u, v) is smaller than d[v, i + 1],
the assignment d[v, i + 1] ← d[u, i ] + w(u, v) takes place. This will not happen
if d[u, i ] = ∞ because then also d[u, i ] + w(u, v) = ∞. For i = 0, the only arc
(u, v) where this is not the case is (u, v) = (s, x ), in which case d[u, i ] + w(u, v) =
d[s, 0] + 1 = 1, which is less than ∞, resulting in the assignment d[ x, 1] ← 1. In
(2.6), this assignment is shown by the new entry 1 for d[ x, 1] surrounded by a box.
This is the only assignment of this sort. After all arcs have been considered, it can
be verified that the entries 0, 1, ∞, ∞ in row d[v, 1] represent indeed the shortest
weights of walks from s to v that use at most one arc, as asserted by Theorem 2.15.
After i is increased from 0 to 1 in line 8, the second iteration of the main loop
starts with i = 1. Then arcs (u, v) where d[u, i ] < ∞ are those where u = s or
u = x, which are the arcs (s, x ), ( x, s), ( x, y), and ( x, z). The last two produce the
updates d[y, 2] ← d[ x, 1] + w( x, y) = 1 + 2 = 3 and d[z, 2] ← d[ x, 1] + w( x, z) =
1 − 1 = 0, shown by the boxed entries in row d[v, 2] of the table. Again, it can be
verified that these are the weights of shortest walks from s to v with at most two
arcs.
The last iteration of the main loop is for i = 2, which produces only a single
update, namely when the arc (z, y) is considered in line 7, where d[y, 3] is updated
from its current value 3 to d[y, 3] ← d[z, 2] + w(z, y) = 0 + 1 = 1. Row d[v, 3]
then has the weights of shortest walks from s to v that use at most three arcs. The
main loop terminates when i = 3 (in general, when i = |V | − 1).
Because the network in (2.6) has only four nodes, any walk with more than
three arcs (in general, more than |V | − 1 arcs) cannot be a path. In fact, if there
is a walk with |V | arcs that is shorter than found so far, it must contain a tour
of negative weight, as will be proved in Theorem 2.17 below. In Algorithm 2.14,
lines 9–11 test for a possible improvement of the current values in d[v, |V | − 1] (the
last row in the table), much in the same way as in the previous updates in lines 6–
7, by considering all arcs (u, v). However, unlike the assignment in line 7, such a
possible improvement is now taken to terminate the algorithm immediately with
the notification that there must be a negative cycle that can be reached from s.
The normal case is that no such improvement is possible. In that case, line 12
produces the desired output of the distances d(s, v). In the example (2.6), these
are the entries 0, 1, 1, 0 in the last row d[v, 3].
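The computation of the table in (2.6) can be reproduced with the following Python sketch of Algorithm 2.14 (not part of the original notes; the network and its arc weights are those of (2.6)):

    INF = float('inf')

    def bellman_ford(V, A, w, s):
        d = {v: INF for v in V}              # row d[., 0]
        d[s] = 0
        for i in range(len(V) - 1):          # main loop, i = 0, ..., |V|-2
            d_next = dict(d)                 # line 5: copy row i into row i+1
            for (u, v) in A:                 # lines 6-7
                d_next[v] = min(d_next[v], d[u] + w[u, v])
            d = d_next
            print('d[v,', i + 1, ']:', d)
        for (u, v) in A:                     # lines 9-11: negative-cycle test
            if d[u] + w[u, v] < d[v]:
                raise ValueError('Negative cycle!')
        return d                             # line 12: distances dist(s, v)

    w = {('s','x'): 1, ('x','s'): 2, ('x','y'): 2, ('x','z'): -1, ('z','y'): 1}
    bellman_ford(['s','x','y','z'], list(w), w, 's')   # final row: 0, 1, 1, 0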
Before we prove Theorem 2.15, we note that any prefix of a shortest walk is a
shortest walk.
Lemma 2.16 Let W = u0 , u1 , . . . , uk be a walk of minimum weight from s = u0 to u = uk . Then for each i with 0 ≤ i ≤ k, its prefix u0 , u1 , . . . , ui is a walk of minimum weight from s to ui .

Proof. Suppose there were a shorter walk W′ from s to ui than the prefix u0 , u1 , . . . , ui
of W. Then W′ followed by ui+1 , . . . , uk is a shorter walk from s to u than W, which
contradicts the definition of W.
Proof of Theorem 2.15. This holds by induction over i: for i = 0 by the initialization in lines 1–2, and for the step from i to i + 1 because a walk W of minimum weight from s to v with at most i + 1 arcs either has at most i arcs, or consists of some walk of i arcs from s to some node u, followed by an arc (u, v), where the
walk from s to u has minimum weight d[u, i ] by Lemma 2.16. In either case,
w(W ) = d[v, i + 1] = min{d[v, i ], d[u, i ] + w(u, v)} as computed (via line 5) in
line 7, which was to be shown.
Theorem 2.15 presents one part of the correctness of Algorithm 2.14. A second
part is the correct detection of negative (-weight) cycles. We first consider an
example, which is the same network as in (2.6) with an additional arc (y, s) of
weight −5, which creates two negative cycles, namely s, x, y, s and s, x, y, z, s.
[picture: the network from (2.6) with the additional arc (y, s) of weight −5] (2.7)

v            s     x     y    z
d[v, 0]      0     ∞     ∞    ∞
d[v, 1]      0     1     ∞    ∞
d[v, 2]      0     1     3    0
d[v, 3]      [−2]  1     [1]  0
neg. cycle?  [−4]  [−1]
In this network, any walk from s to y has two or more arcs, so by Theorem 2.15 the
rows d[v, 0], d[v, 1], d[v, 2] are the same as in (2.6) and only d[v, 3] has the different
entries −2, 1, 1, 0, when the main loop terminates. In the additional row in the
table in (2.7), there are two possible improvements of d[v, 3], namely of d[s, 3],
indicated by −4 , when the arc (y, s) is considered as (u, v) in line 10, or of d[ x, 3],
indicated by −1 , when the arc (s, x ) is considered. Whichever improvement
is discovered first (depending on the order of arcs (u, v) in line 9), it leads to
the immediate stop of the algorithm in line 11. Both improvements reveal the
existence of a walk with four arcs that is shorter than the current shortest walk
with at most three arcs. For the first improvement, this four-arc walk is s, x, z, y, s,
for the second it is s, x, y, s, x.
Theorem 2.17 Consider a network (V, A, w) with a source node s. Then there is a
negative cycle that starts at some node which can be reached from s if and only if Algo-
rithm 2.14 stops in line 11.
[picture: the network from (2.6) extended by two further nodes a and b and additional arcs, among them ( x, a) with w( x, a) = 2, ( a, b) with w( a, b) = 2, and (z, b) with w(z, b) = 1] (2.10)

v           s    x   y   z   a   b
dist(s, v)  0    1   1   0   3   1
pred[v]     NIL  s   z   x   x   z
The following algorithm is an extension of Algorithm 2.14 that also computes
the shortest-path predecessors.
Algorithm 2.18 (Bellman–Ford with shortest-path predecessors)

1. for all v ∈ V : d[v, 0] ← ∞ ; pred[v] ← NIL
2. d[s, 0] ← 0
3. i ← 0
4. while i < |V | − 1 :
5. for all v ∈ V : d[v, i + 1] ← d[v, i ]
6. for all (u, v) ∈ A :
7. if d[u, i ] + w(u, v) < d[v, i + 1] :
7a. d[v, i + 1] ← d[u, i ] + w(u, v)
7b. pred[v] ← u
8. i ← i + 1
9. for all (u, v) ∈ A :
10. if d[u, |V | − 1] + w(u, v) < d[v, |V | − 1] :
11. print “Negative cycle!” and stop immediately
12. for all v ∈ V : dist(s, v) ← d[v, |V | − 1]
Line 1 of this algorithm not only initializes d[v, 0] to ∞ but also pred[v] to
NIL, for all nodes v. Line 2 then sets d[s, 0] to 0. In fact, without the assign-
ment pred[v] ← NIL, lines 1–2 would be faster than the initialization lines 1–2
of Algorithm 2.14, because the latter has a loop in line 2 where every node v has
to be checked if it is not equal to s, assuming V is represented in an array and
s is not known to be the first element of that array (which in general it is not).
This (not very important) speedup compensates for the double assignment of
first d[s, 0] ← ∞ and then d[s, 0] ← 0 in Algorithm 2.18.
Lines 7 and 7a represent the previous update of d[v, i + 1] in line 7 of Algo-
rithm 2.14, and line 7b the new assignment of the predecessor pred[v] on the new
shortest walk to v from s. This is the predecessor as computed by the algorithm. In
general, a shortest path is not necessarily unique. For example, in (2.11) a short-
est path from s to z could also go via node a. Algorithm 2.18 will only compute
pred[z] as x, as shown in (2.11), irrespective of the order in which the arcs in A are
traversed in line 6 (why? – hint: Theorem 2.15).
The following demonstrates the updating of the pred array in Algorithm 2.18
for our familiar example (2.6). Its main purpose is a notation to record the progress
of the algorithm with the additional information of the assignment pred[v] ← u
in line 7b, by simply writing u as a subscript next to the box which indicates the
update of d[v, i + 1] in line 7a. There are four such updates, and the most recent
one gives the final value of pred[v] as shown in the last row of the table.
[picture: the network from (2.6)] (2.11)

v        s    x     y     z
d[v, 0]  0    ∞     ∞     ∞
d[v, 1]  0    [1]s  ∞     ∞
d[v, 2]  0    1     [3]x  [0]x
d[v, 3]  0    1     [1]z  0
pred[v]  NIL  s     z     x
Note that the notation with updated values of d[v, i + 1] shown by boxes with
a subscript u for the corresponding arc (u, v) is completely ad hoc, and chosen
to document the progress of the algorithm in a compact and unambiguous way.
Furthermore, a case that has not yet occurred is that d[v, i + 1] is updated more
than once for the same value of i, in case there are several arcs (u, v) where this
occurs, depending on the order in which these arcs are traversed in line 6. An
example is (2.10) above for i = 2 and v = b, where the update of d[b, 3] occurs
from ∞ to 5 via the arc ( a, b), and then to 1 via the arc (z, b). One may record
these updates of d[b, 3] in the table as [5]a [1]z , or by only listing the last update
[1]z (which is the only update in case arc (z, b) is considered before ( a, b)).
Algorithm 2.19 (Bellman–Ford, second version)
1. d[s] ← 0
2. for all v ∈ V − {s} : d[v] ← ∞
4. repeat |V | − 1 times :
6. for all (u, v) ∈ A :
7. d[v] ← min{ d[v], d[u] + w(u, v) }
9. for all (u, v) ∈ A :
10. if d[u] + w(u, v) < d[v] :
11. print “Negative cycle!” and stop immediately
12. for all v ∈ V : dist(s, v) ← d[v]
Algorithm 2.14 is the first version of the Bellman–Ford algorithm. The prog-
ress of the algorithm is nicely described by Theorem 2.15. Algorithm 2.19 is a
second, simpler version of the algorithm. Instead of storing the current distances
from s for walks that use at most i arcs in a separate table row d[v, i ], the second
version of the algorithm uses just a single array with entries d[v]. The new algo-
rithm has fewer instructions, which we have numbered with some line numbers
omitted for easier comparison with the first version in Algorithm 2.14.
The main difference between this algorithm and the first version is the update
rule in line 7. The first version compared d[v, i + 1] with d[u, i ] + w(u, v) where
d[u, i ] was always the value of the previous iteration, whereas the second version
compares d[v] with d[u] + w(u, v) where d[u] may already have improved in the
current iteration. The following simple example illustrates the difference.
[picture: the path network with arcs (s, x ), ( x, y), (y, z), each of weight 1] (2.12)

v        s  x  y  z
d[v, 0]  0  ∞  ∞  ∞
d[v, 1]  0  1  ∞  ∞
d[v, 2]  0  1  2  ∞
d[v, 3]  0  1  2  3
The table on the right in (2.12) shows the progress of the first version of the algo-
rithm. Suppose that in the second version in line 6, the arcs are considered in the
order (s, x ), ( x, y), and (y, z). Then the assignments in the inner loop in line 7 are
d[ x ] ← 1, d[y] ← 2, d[z] ← 3, so the complete array is already found in the first
iteration of the main loop in lines 4–7, without any further improvements in the
second and third iteration of the main loop. However, if the order of arcs in line 6
is (y, z), ( x, y), and (s, x ), then the only update in the main loop in line 7 in the
first iteration is d[ x ] ← 1, with d[y] ← 2 in the second iteration, and d[z] ← 3
in the last iteration. In general, the main loop does need |V | − 1 iterations for the
algorithm to work correctly, as asserted by the following theorem.
Theorem 2.20 In Algorithm 2.19, at the beginning of the ith iteration of the main loop
(lines 4–7), 1 ≤ i ≤ |V | − 1, we have d[v] ≤ w(W ) for any node v and any s, v-walk W
that has at most i − 1 arcs. Moreover, if d[v] < ∞, then there is some s, v-walk of weight
d[v]. If the algorithm terminates without stopping in line 11, then d[v] = dist(s, v) as
claimed.
Proof. The algorithm performs at least the updates of Algorithm 2.14 (possibly
more quickly), which shows that d[v] ≤ w(W ) for any s, v-walk W with at most
i − 1 arcs, as in Theorem 2.15. If d[v] < ∞, then d[v] = w(W′ ) for some s, v-
walk W′ because of the way d[v] is computed in line 7. Furthermore, dist(s, v) =
d[v] ≠ −∞ if and only if d[u] + w(u, v) ≥ d[v] for all arcs (u, v), as proved for
Theorem 2.17.
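A Python sketch of the second version shows how little bookkeeping it needs (again not part of the original notes; d is a single dictionary that is updated in place):

    def bellman_ford_v2(V, A, w, s):
        INF = float('inf')
        d = {v: INF for v in V}
        d[s] = 0
        for _ in range(len(V) - 1):                    # repeat |V| - 1 times
            for (u, v) in A:
                d[v] = min(d[v], d[u] + w[u, v])       # in-place update
        for (u, v) in A:                               # negative-cycle test
            if d[u] + w[u, v] < d[v]:
                raise ValueError('Negative cycle!')
        return d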
2.6 O-notation and running time analysis

In this section, we introduce the O-notation used to describe the running time of
algorithms, and apply it to the analysis of the Bellman–Ford algorithm.
The time needed to execute an algorithm depends on the size of its input,
and on the machine that performs the instructions of the algorithm. The size of
the input can be very accurately measured in terms of the number of bits (binary
digits) to represent the input. If the input is a network, then the input size is
normally measured more coarsely by the number of nodes and arcs, assuming
that each piece of associated information (such as the endpoints of an arc, and
the weight of an arc) can be stored in some fixed number of bits (which is realistic
in practice).
The execution time of an instruction depends on the computer, and on the
way that the instruction is represented in terms of more primitive instructions,
for example how an assigment translates to the evaluation of the right-hand side
and to the storage of the computed value in memory (and in which location) for
the assigned variable. Because computing technology is constantly improving, it
is normally assumed that a basic instruction, such as an assignment or a test of a
condition like x < m, takes a certain constant amount of time, without specifying
what that constant is.
The functions to measure running times take nonnegative values. Let
R≥ = { x ∈ R | x ≥ 0 } . (2.13)
Definition 2.21 Let g : N → R≥ . Then O( g(n)) is the set of all functions f : N → R≥ so that there exist constants C > 0 and K ∈ N with f (n) ≤ C · g(n) for all n ≥ K.

As an example, 1000 + 10n + 3n2 ∈ O(n2 ), for the following reason: choose
K = 10. Then n ≥ K implies 1000 ≤ 10n2 , 10n ≤ n2 , and thus 1000 + 10n + 3n2 ≤
(10 + 1 + 3)n2 so that the claim is true for C = 14. If we are interested in a
smaller constant C when n is large, we can choose K = 100 and C = 3.2. It is
clear that asymptotically (for large n) the quadratic term dominates the growth
of the function, which is captured by the notation O(n2 ). As a running time,
this particular function may, for example, represent 1000 time units to load the
program, 10n time units for initialization, and 3n2 time units to run the main
algorithm.
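Such constants are easy to check numerically, for example in Python (a sanity check over a finite range of n, not a proof):

    f = lambda n: 1000 + 10 * n + 3 * n * n
    assert all(f(n) <= 14 * n * n for n in range(10, 10000))     # C = 14, K = 10
    assert all(f(n) <= 3.2 * n * n for n in range(100, 10000))   # C = 3.2, K = 100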
The notation f (n) ∈ O( g(n)) is very commonly written as “ f (n) = O( g(n))”.
However, this is imprecise, because f (n) represents a function of n, whereas
O( g(n)) is a set of functions, as stated in Definition 2.21.
O-notation is transitive:
f (n) ∈ O( g(n)), g(n) ∈ O(h(n)) ⇒ f (n) ∈ O(h(n)) , (2.15)
because if there are C, D > 0 with f (n) ≤ C · g(n) for n ≥ K and g(n) ≤ D ·
h(n) for n ≥ L, then f (n) ≤ C · D · h(n) for n ≥ max{K, L}. Note that (2.15) is
equivalent to the statement
g(n) ∈ O(h(n)) ⇔ O( g(n)) ⊆ O(h(n)) , (2.16)
for the following reason: Suppose g(n) ∈ O(h(n)). Then (2.15) says that any
function f (n) in O( g(n)) is also in O(h(n)), which shows O( g(n)) ⊆ O(h(n))
and thus “⇒” in (2.16). Conversely, if O( g(n)) ⊆ O(h(n)), then we have clearly
g(n) ∈ O( g(n)) and thus g(n) ∈ O(h(n)), which shows “⇐” in (2.16).
What is O(1)? This is the set of functions f (n) that fulfill f (n) ≤ C for all
n ≥ K, for some constants C and K. Because the finitely many numbers n with
n < K are bounded, we can if necessary increase C to obtain that f (n) ≤ C for all
n ∈ N. In other words, O(1) is the set of functions that are bounded by a constant.
In addition to (2.15), the following rules are useful and easy to prove:
f (n) ∈ O(h(n)), g(n) ∈ O(h(n)) ⇒ f (n) + g(n) ∈ O(h(n)) , (2.17)
which shows that a sum of two functions can be “absorbed” into the function
with higher growth rate. With the definition
O( g(n)) + O(h(n)) = { f (n) + f ′(n) | f (n) ∈ O( g(n)), f ′(n) ∈ O(h(n)) } , (2.18)
this also shows
O( g(n)) + O(h(n)) ⊆ O( g(n) + h(n)) . (2.19)
In addition,
f (n) ∈ O( g(n)) ⇒ n · f (n) ∈ O(n · g(n)) . (2.20)
We now apply this notation to analyze the running time of the Bellman–Ford
algorithm, where we consider Algorithm 2.18 because it is slightly more detailed
than Algorithm 2.14. Suppose the input to the algorithm is a network (V, A, w)
with n = |V | and m = | A|. Line 1 takes running time O(n) (we say “time O(n)”
rather than “time in O(n)”) because in that line all nodes are considered, each
with two assignments that take constant time. Lines 2 and 3 take constant time
O(1). The main loop in lines 4–8 is executed n − 1 times. Testing the condition
i < |V | − 1 in line 4 takes time O(1). Line 5 takes time O(n). The “inner loop”
in lines 6–7b takes time O(m) because the evaluation of the if condition in line 7
and the assignments in lines 7a–7b take constant time (which is shorter when they
are not executed because the if condition is false, but bounded in either case).
Line 8 takes time O(1). So the time to perform one iteration of the main loop
in lines 4–8 is O(1) + O(n) + O(m) + O(1), which by (2.17) we can shorten to
O(n + m) because we can assume n > 0. The main loop is performed n − 1 times,
where in view of the constants this can be simplified to multiplication with n,
that is, it takes together time n · O(n + m) = O(n2 + nm). The test for negative
cycles in lines 9–11 takes time O(m), and the final assignment of distances in line 12
time O(n). So the overall running time from lines 1–3, 4–8, 9–11, 12 is O(n) +
O(n2 + nm) + O(m) + O(n) where the second term absorbs the others according
to (2.17). So the overall running time is O(n2 + nm).
The number m of arcs of a digraph with n nodes is at most n · (n − 1), that is,
m ∈ O(n2 ), so that O(n2 + nm) ⊆ O(n2 + n3 ) = O(n3 ). That is, for a network with
n nodes, the running time of the Bellman–Ford algorithm is O(n3 ). (It is therefore also
called a cubic algorithm.)
The above analysis shows a more accurate running time of O(n2 + nm) that
depends on the number m of arcs in the network. The algorithm works for any
number of arcs (even if m = 0). Normally the number of arcs is at least n − 1
because otherwise some nodes cannot be reached from the source node s (this
can be seen by induction on n by adding nodes one at a time to the network,
starting with s: every new node v requires at least one new arc (u, v) in order to
be reachable from the nodes u that are currently reachable from s). In that case
n ∈ O(m) and thus O(n2 + nm) = O(nm), so that the running time is O(nm).
(We have to be careful here: when we say the digraph has at least n − 1 arcs, we
cannot write this as m ∈ O(n), because this would mean an upper bound on m;
the correct way to say this is n ∈ O(m), which translates to n ≤ Cm and thus to
m ≥ n/C, meaning that the number of arcs is at least proportional to the number
of nodes. An upper bound for m is given by m ∈ O(n2 ).) In short, for a network
with n nodes and m arcs, where n ∈ O(m), the running time of the Bellman–Ford
algorithm is O(nm).
The second version of the Bellman–Ford algorithm has the same running time
O(n3 ). Algorithm 2.19 is faster than the first version but, in the worst case, only
by a constant factor, because the main loop in lines 4–7 is still performed n − 1
times, and the algorithm would in general be incorrect with fewer iterations, as
the example in (2.12) shows (which can be generalized to an arbitrary path).
2.7 Single-source shortest paths: Dijkstra’s algorithm

Algorithm 2.22 (Dijkstra)

1. for all v ∈ V : colour[v] ← black ; d[v] ← ∞
2. colour[s] ← grey ; d[s] ← 0
3. while there is a grey node :
4. u ← a grey node with smallest d[u]
5. colour[u] ← white
6. for all v with (u, v) ∈ A and colour[v] ≠ white :
7. colour[v] ← grey
8. d[v] ← min{ d[v], d[u] + w(u, v) }
9. for all v ∈ V : dist(s, v) ← d[v]

We use the example in Figure 2.1 to explain the algorithm, with a suitable table that demonstrates its progress.
Figure 2.1: [picture of a network with nodes s, a, b, c, x, y; from the discussion below, w(s, a) = 1, w(s, b) = 4, w( a, b) = 2, w( a, c) = 5, w(b, c) = 2, w(b, x ) = 1, w( x, y) = 1] and the table that demonstrates the progress of Dijkstra’s algorithm on it (updated entries, shown boxed in the original, are written here in square brackets):

u           s    a      b      c      x      y
            0G   ∞B     ∞B     ∞B     ∞B     ∞B
s           0W   [1]Gs  [4]Gs  ∞      ∞      ∞
a           0    1W     [3]Ga  [6]Ga  ∞      ∞
b           0    1      3W     [5]Gb  [4]Gb  ∞
x           0    1      3      [4]x   4W     [5]Gx
c           0    1      3      4W     4      5G
y           0    1      3      4      4      5W
dist(s, u)  0    1      3      4      4      5
pred[v]     NIL  s      a      x      b      x
The algorithm uses an array that defines for each node v an array entry
colour[v] which is either black, grey, or white (so this “colour” may internally be
stored by one of three distinct integers such as 0, 1, 2). In the course of the com-
putation, each node will change its colour from black to grey to white, unless there
is no path from the source s to the node, in which case the node will stay black. In
addition, the array entries d[v] represent preliminary distances from s, according
to the following theorem that we prove later and which will be used to show that
the algorithm is correct.
Theorem 2.23 In Algorithm 2.22, at the beginning of each iteration of the main loop
(lines 3–8), d[v] is the smallest weight of a path u0 , u1 , . . . , uk from s to v (so u0 = s and uk = v)
that with the exception of v consists exclusively of white nodes, that is, colour[ui ] = white
for 0 ≤ i < k. When u is made white in line 5, we have d[u] = dist(s, u).
In line 1, all nodes v are initially black with d[v] ← ∞. In line 2, the source
s becomes grey and d[s] ← 0. This is also shown as the first row in the table in
Figure 2.1, where we use superscripts B, G, W for a newly assigned colour black,
grey, or white. Because grey nodes are of special interest, we will indicate their
colour all the time, even if it has not been updated in that particular iteration.
The main loop in lines 3–8 operates as long as the set of grey nodes is not
empty, in which case it selects in line 4 a particular grey node u with smallest
value d[u]. Because of the initialization in line 2, the only grey node is s, which is
therefore chosen in the first iteration. Each row in the table in Figure 2.1 repre-
sents one iteration of the main loop, where the node u that is chosen in line 4 is
displayed on the left of that row. The row entries are the values d[v] for all nodes
v, as they become updated or stay unchanged in that iteration. The chosen node
u changes its colour to white in line 5, indicated by the superscript W in the table,
where that node is also underlined.
Lines 6–8 are a second, inner loop of the algorithm. It traverses all non-white
neighbours v of u, that is, all nodes v so that colour[v] is not white and so that
(u, v) is an arc. All these non-white neighbours v of u are set to grey in line 7
(indicated by a superscript G), and their distance d[v] is updated to d[u] + w(u, v)
in case this is smaller than the previous value of d[v] (which always happens if
v is black and therefore d[v] = ∞). If such an update happens, this means that
there is an all-white path from s to u followed by an arc (u, v) that connects u
to the grey node v, and in that case we can also set pred[v] ← u to indicate that
u is the predecessor of v on the current path from s to v (we have omitted that
update of pred[v] to keep Algorithm 2.22 short; it is the same as in lines 7–7b of
Algorithm 2.18). As in example (2.11) for the Bellman–Ford algorithm, the update
of pred[v] with u is shown with the subscript u, and the update of d[v] is shown
by surrounding the new value with a box.
In the first iteration in Figure 2.1, where u = s, the updated neighbours
v of u are a and b. These are also the grey nodes in the next iteration of the main
loop, where node a is selected because d[ a] < d[b], and a is made white. The two
neighbours of a are b and c. Both are non-white and become grey (b is already
grey). The value of d[b] is updated from 4 to d[ a] + w( a, b) = 3, with pred[b] ← a.
The value of d[c] is updated from ∞ to d[ a] + w( a, c) = 6, with pred[c] ← a. The
current row shows that the grey nodes are b and c, where d[b] < d[c].
In the next iteration, u is therefore chosen to be b, which gives the next row of
the table. The neighbours of b are a, c, x. Here a is white and is ignored, c is non-
white and gets the update d[c] ← d[b] + w(b, c) = 5 because this is smaller than
the current value 6, and pred[c] ← b. Node x changes colour from black to grey
and d[ x ] ← d[b] + w(b, x ) = 4. In the next iteration, x is the grey node u with
smallest d[u], creating updates for c and y. The next and penultimate iteration
chooses c among two remaining grey nodes, where the neighbour y of c creates
no update (other than being set to grey, which is already its colour). The final iter-
ation chooses y. Because all nodes are now white, the algorithm terminates with
the output of distances in line 9, as shown in the table in Figure 2.1 in addition to
the predecessors in the shortest-path tree with root s.
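The notes give Algorithm 2.22 in pseudocode; the following Python sketch (our own rendering, with comments referring to the corresponding lines of Algorithm 2.22 as described above) implements the colouring scheme just explained:

import math

BLACK, GREY, WHITE = 0, 1, 2

def dijkstra(n, adj, s):
    """adj[u] is a list of (v, w) pairs with w >= 0; nodes are 0, ..., n - 1."""
    colour = [BLACK] * n                       # line 1: all nodes black,
    d = [math.inf] * n                         #         d[v] = infinity
    pred = [None] * n
    colour[s], d[s] = GREY, 0                  # line 2: the source becomes grey
    while GREY in colour:                      # line 3: while some node is grey
        grey = [v for v in range(n) if colour[v] == GREY]
        u = min(grey, key=lambda v: d[v])      # line 4: grey u with smallest d[u]
        colour[u] = WHITE                      # line 5: u is made white
        for (v, w) in adj[u]:                  # line 6: non-white neighbours of u
            if colour[v] != WHITE:
                colour[v] = GREY               # line 7: v becomes grey
                if d[u] + w < d[v]:            # line 8: update preliminary distance
                    d[v] = d[u] + w
                    pred[v] = u
    return d, pred                             # line 9: output the distances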
Proof of Theorem 2.23. First, we note that because all weights are nonnegative,
the shortest walk from s to any node v can always be chosen as a path: by Proposition
2.6, any tour that the walk contains has nonnegative weight and can be removed,
which does not increase the weight of the walk.
We prove the theorem by induction. Before the main loop in lines 3–8 is
executed for the first time, there are no white nodes. Hence, the only path from
s where all but the last node are white is a path with no arcs that consists of the
single node s, and its weight is zero, where d[s] = 0 as claimed. Furthermore, this
is the only (and shortest) path from s to s, so dist(s, s) = d[s] = 0.
Suppose now that at the beginning of the main loop the condition is true for
the current set of white nodes. If there are no grey nodes, then the main loop will no
longer be performed and the algorithm proceeds to line 9. If there are grey nodes,
then the main loop will be executed, and we will show that the condition holds
again afterwards. Let u be the node that is chosen in line 4, which is made white
in line 5. We prove, as claimed in the theorem, that just before this assignment
we have d[u] = dist(s, u). This has already been shown when u = s. There is
a path from s to u with weight d[u], namely the assumed path (by the induction
hypothesis) where all nodes except u are white. Consider any shortest path P from
s to u; we will show d[u] ≤ w(P), which implies d[u] = w(P) = dist(s, u). Let y
be the first node on the path P that is not white. Let P′ be the prefix of P
given by the path from s to y, which is a shortest path from s to y by Lemma 2.16.
Moreover, y is grey and not black: there are no arcs (x, y) where x is white and y
is black, because after x has been made white in line 5, all its black neighbours y
are made grey in line 7; here the node x that precedes y on P is white. So P′ is a
shortest path from s to y, and certainly a shortest path among those where all but
the last node are white, so by the induction hypothesis, d[y] = w(P′). By the choice
of u in line 4 we have d[u] ≤ d[y] = w(P′) ≤ w(P), where the latter inequality holds
because all weights are nonnegative. That is, d[u] ≤ w(P) as claimed.
We now show that updating the non-white neighbours v of u in lines 7–8
will complete the induction step, that is, any shortest path P from s to v where
all nodes but v are white has weight d[v]. If the last arc of such a shortest path
is not (u, v), then this is true by the induction hypothesis. If the last arc of P is
(u, v), then P without its last node v defines a shortest path from s to u (where
all nodes are white), where we just proved d[u] = dist(s, u), and hence w(P) =
d[u] + w(u, v) = d[v] because that is how d[v] has been updated in line 8. This
completes the induction.
Proof. When the algorithm terminates, every node is either white or black. As
shown in the preceding proof, at the end of each iteration of the main loop there
are no arcs ( x, y) where x is white and y is black. Hence, the white nodes are exactly
the nodes u that can be reached from s by a path, with dist(s, u) = d[u] < ∞ by
Theorem 2.23. The black nodes v are those that cannot be reached from s, where
dist(s, v) = ∞ = d[v] as set at initialization in line 1.
In Dijkstra's algorithm, a grey node u with minimal d[u] already has its final
distance dist(s, u) given by d[u], so that u can be made white. There can be no shorter
"detour" to reach u via nodes that at that time are grey or black, because the first grey
node y on such a path from s to u would fulfill d[u] ≤ d[y] (see the proof of
Theorem 2.23), and the remaining part of that path from y to u has nonnegative
weight by assumption. This argument fails for negative weights. In the following
network,
[Network with nodes s, y, u and arcs (s, u) of weight 1, (s, y) of weight 4, and
(y, u) of weight −5.]
the next node made white after s is u and is recorded with distance 1, and after
that node y with distance 4. However, the path s, y, u has weight −1 which is less
than the computed weight 1 of the path s, u. So the output of Dijkstra’s algorithm
is incorrect, here because of the negative weight of the arc (y, u). It may happen
that the output of Dijkstra’s algorithm is correct (as in the preceding example if
w(s, u) = 5), but in general this is not guaranteed.
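With the dijkstra sketch from above, this failure can be reproduced directly (our own encoding of the nodes as s = 0, y = 1, u = 2):

adj = {0: [(2, 1), (1, 4)], 1: [(2, -5)], 2: []}
d, pred = dijkstra(3, adj, 0)
print(d)   # [0, 4, 1], although the path s, y, u has weight -1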
We now analyse the running time of Dijkstra’s algorithm. Let n = |V | and
m = | A|. The initialization in lines 1–2 takes time O(n), and so does the final
output (if d[v] is not taken directly as the output) in line 9. In each iteration of the
main loop in lines 3–8, exactly one node becomes (and stays) white. Hence, the
loop is performed n times, assuming (which in general is the case) that all nodes
are eventually white, that is, are reachable by a path from the source node s. We
assume that the colour of a node v is represented by the array entry colour[v],
and that nodes themselves are just represented by the numbers 1, . . . , n. By iterating
through the colour array, identifying the grey nodes in line 3 and finding the node
u with minimal d[u] in line 4 takes time O(n). (Even if the grey nodes
were somehow stored in an array of shrinking size, it is possible that they
are at least a constant fraction, if not all, of the nodes that are not white, and their
number is initially n, then n − 1, and so on, so that the number of nodes checked in
Chapter 3
Continuous optimization
R^n = { (x_1, . . . , x_n) | x_i ∈ R for 1 ≤ i ≤ n }  (3.2)
d(x, y) = ‖x − y‖ ,  (3.5)
d(x, z) ≤ d(x, y) + d(y, z) .  (3.9)
It can be shown that the maximum-norm and the Euclidean distance fulfill these
axioms (see Exercises 5.1 and 5.2).
The triangle inequality is then often stated as
‖x + y‖ ≤ ‖x‖ + ‖y‖  (3.10)
which implies (3.9) using x − y and y − z instead of x and y. For an arbitrary set,
a trivially defined distance function that also fulfills axioms (3.7)–(3.9) is given by
d( x, x ) = 0 and d( x, y) = 1 for x 6= y.
Let ε > 0 and x ∈ Rn . The set of all points y that have distance less than ε is
called the ε-ball around x, defined as
B(x, ε) = { y ∈ R^n | ‖y − x‖ < ε } .  (3.11)
It is also called the open ball because the inequality in (3.11) is strict. That is, B( x, ε)
does not include its “boundary”, called a sphere, which consists of all points y
whose distance to x is equal to ε.
The following picture shows the ε-ball and the maximum-norm ε-ball for
ε = 1 around the origin 0 in R2 . The latter, Bmax (0, 1), is the set of all points
( x1 , x2 ) so that −1 < x1 < 1 and −1 < x2 < 1, which is the open square shown
on the right.
[Picture: on the left, the Euclidean ball B(0, 1) (an open disc through the point
(1, 0)); on the right, the maximum-norm ball Bmax(0, 1) (the open square with
corners (±1, ±1)).]
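The containment relations between the two kinds of balls, B(x, ε) ⊆ Bmax(x, ε) and Bmax(x, ε/√n) ⊆ B(x, ε) (referred to as (3.14) and (3.15) below), can be tested numerically; the following small random experiment is our own and not part of the notes:

import math, random

def in_ball(y, x, eps):        # the Euclidean ball B(x, eps)
    return math.dist(x, y) < eps

def in_maxball(y, x, eps):     # the maximum-norm ball Bmax(x, eps)
    return max(abs(a - b) for a, b in zip(y, x)) < eps

n, eps, x = 2, 1.0, (0.0, 0.0)
for _ in range(100000):
    y = tuple(random.uniform(-2, 2) for _ in range(n))
    # B(x, eps) is contained in Bmax(x, eps)
    assert not in_ball(y, x, eps) or in_maxball(y, x, eps)
    # Bmax(x, eps/sqrt(n)) is contained in B(x, eps)
    assert not in_maxball(y, x, eps / math.sqrt(n)) or in_ball(y, x, eps)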
∀ k ∈ N : ‖x^(k)‖ ≤ M .  (3.17)
Analogously to Proposition 1.12, a sequence can have at most one limit. This is
proved in the same way, where the contradiction (1.12) is proved with the help of
the triangle inequality (3.10), using the norm instead of the absolute value.
In the definitions (3.16) and (3.17), we have used the Euclidean norm, but we
could have used in the same way the maximum norm as defined in (3.4) instead,
as asserted by the following lemma.
Lemma 3.2 The sequence {x^(k)}_{k∈N} in R^n has limit x ∈ R^n if and only if
∀ ε > 0 ∃ K ∈ N ∀ k ≥ K : x^(k) ∈ Bmax(x, ε) .  (3.18)
Proof. Suppose {x^(k)} converges to x in the Euclidean norm. Let ε > 0, and
choose K in (3.16) so that k ≥ K implies ‖x^(k) − x‖ < ε, that is, x^(k) ∈ B(x, ε).
Because B(x, ε) ⊆ Bmax(x, ε) by (3.14), this also means x^(k) ∈ Bmax(x, ε), which
shows (3.18).
Conversely, assume (3.18) holds and let ε > 0. Choose K so that k ≥ K implies
x^(k) ∈ Bmax(x, ε/√n). Then Bmax(x, ε/√n) ⊆ B(x, ε) by (3.15), so x^(k) ∈ B(x, ε),
which shows (3.16).
|x_i^(k) − x_i| < ε  (3.20)
We are concerned with the behaviour of a function f “near a point a”, that is,
how the function value f ( x ) behaves when x is near a, where x and a are points
in some subset S of R^n. For that purpose, it is of interest whether a can be approached
with x from “all sides”, which is the case if there is an ε-ball around a that is fully
contained in S. If that is the case, then the set S will be called open according to
the following definition.
∀ a ∈ S ∃ε > 0 : B( a, ε) ⊆ S . (3.21)
By (3.14) and (3.15), we could use the maximum-norm ball instead of the
Euclidean-norm ball in (3.21), that is, S is open if and only if
It is a useful exercise to prove that the open balls B( a, ε) and Bmax ( a, ε) are them-
selves open subsets of Rn .
Definition 3.5 Let S ⊆ Rn . Then S is called closed if for all a ∈ Rn and all se-
quences { x (k) } in S (that is, x (k) ∈ S for all k ∈ N) with limit a we have a ∈ S.
Another common term for limit point is accumulation point. Clearly, a set is
closed if and only if it contains all its limit points. Trivially, every element a
of S is a limit point of S, by taking the constant sequence given by x (k) = a in
Definition 3.6.
The next lemma is important to show the connection between open and closed
sets.
∀ ε > 0 : B(a, ε) ∩ S ≠ ∅ .  (3.23)
The next theorem states the connection between open and closed sets: A set
is closed if and only if its set-theoretic complement is open.
Proof. Suppose S is closed, so it contains all its limit points. We want to show that
T is open, so let a ∈ T. We want to show that B(a, ε) ⊆ T for some ε > 0. If that
were not the case, then for every ε > 0 there would be some element of B(a, ε) that
does not belong to T and hence belongs to S, so that B(a, ε) ∩ S ≠ ∅. But then a is
a limit point of S according to Lemma 3.7, hence a ∈ S because S is closed, contrary
to the assumption that a ∈ T.
Conversely, assume T is open, so for all a ∈ T we have B(a, ε) ⊆ T for some
ε > 0. But then B(a, ε) ∩ S = ∅, and thus a is not a limit point of S. Hence
S contains all its limit points (if not, such a point would belong to T), so S is
closed.
It is possible that a set is both open and closed, which applies to the full set
Rn and to the empty set ∅. (For any “connected space” such as Rn , these are
the only possibilities.) A set may also be neither open nor closed, such as the
half-open interval [0, 1) as a subset of R1 . This set does not contain its limit point
1 and is therefore not closed. It is also not open, because its element 0 does not
have a ball B(0, ε) = (−ε, ε) around it that is fully contained in [0, 1). Another
example of a set which is neither open nor closed is the set {1/n | n ∈ N} which
is missing its limit point 0.
The following theorem states that the intersection of any two open sets S and
S′ is open, and the arbitrary union ∪_{i∈I} S_i of any open sets S_i is open. Similarly,
the union of any two closed sets S and S′ is closed, and the arbitrary intersection
∩_{i∈I} S_i of any closed sets S_i is closed. Here I is any (possibly infinite)
nonempty set of subscripts i for the sets S_i, and
∪_{i∈I} S_i = { x | ∃ i ∈ I : x ∈ S_i }   and   ∩_{i∈I} S_i = { x | ∀ i ∈ I : x ∈ S_i } .  (3.24)
Theorem 3.9 Let S, S′ ⊆ R^n, and let S_i ⊆ R^n for i ∈ I for some arbitrary nonempty
set I. Then
(a) If S and S′ are both open, then S ∩ S′ is open.
(b) If S and S′ are both closed, then S ∪ S′ is closed.
(c) If S_i is open for i ∈ I, then ∪_{i∈I} S_i is open.
(d) If S_i is closed for i ∈ I, then ∩_{i∈I} S_i is closed.
Proof. Assume both S and S′ are open, and let a ∈ S ∩ S′. Then B(a, ε) ⊆ S and
B(a, ε′) ⊆ S′ for suitable positive ε and ε′. The smaller of the two balls B(a, ε) and
B(a, ε′) is therefore a subset of both sets S and S′ and therefore of their intersection.
So S ∩ S′ is open, which shows (a).
Condition (b) holds because if S and S′ are closed, then T = R^n \ S and
T′ = R^n \ S′ are open, and so is T ∩ T′ by (a), and hence S ∪ S′ = R^n \ (T ∩ T′) is
closed by Theorem 3.8.
To see (c), let S_i be open for all i ∈ I, and let a ∈ ∪_{i∈I} S_i, that is, a ∈ S_j for some
j ∈ I. Then there is some ε > 0 so that B(a, ε) is a subset of S_j, and hence of the set
∪_{i∈I} S_i, which is therefore open.
We obtain (d) from (c) because the intersection of complements of sets
is the complement of their union, that is, ∩_{i∈I} (R^n \ S_i) = R^n \ ∪_{i∈I} S_i,
which we apply to the open complements R^n \ S_i of the closed sets S_i, using Theorem 3.8.
Note that (a) and (b) do not extend to arbitrary intersections and unions: the
intersection of infinitely many open sets need not be open, for example the open
intervals (−1/n, 1/n) for n ∈ N, whose intersection is {0}, which is not
an open set. Similarly, arbitrary unions of closed sets are not necessarily closed,
for example the closed intervals [1/n, 1] for n ∈ N, whose union is the half-open
interval (0, 1] which is not closed. However, (c) and (d) do allow arbitrary unions
of open sets and arbitrary intersections of closed sets.
That is, S is bounded if and only if the components xi of the points x in S are
bounded.
Theorem 1.16 states that a bounded sequence in R has a convergent subsequence.
The same holds for R^n instead of R.
Definition 3.12 Let S ⊆ Rn . Then S is called compact if and only if every sequence
of points in S has a convergent subsequence whose limit belongs to S.
Theorem 3.13 Let S ⊆ Rn . Then S is compact if and only if S is closed and bounded.
Proof. Assume first that S is closed and bounded, and consider an arbitrary se-
quence of points in S. Then by Theorem 3.11, this sequence has a convergent
subsequence with limit x, say, which belongs to S because S is closed. So S is
compact according to Definition 3.12.
Conversely, assume S is compact. Consider any convergent sequence of points
in S with limit x. Because S is compact, that sequence has a convergent subse-
quence, whose limit is also x, and which belongs to S. So every limit point of S
belongs to S, which means that S is closed.
In order to show that S is bounded, assume this is not the case, so that for every
k ∈ N there is a point x^(k) in S with ‖x^(k)‖ ≥ k. This defines an unbounded
sequence {x^(k)}_{k∈N} in S, where clearly every subsequence is also unbounded and
therefore cannot converge (every convergent sequence is bounded), in contradiction
to the assumption that S is compact. This proves that S is bounded, as
claimed.
3.5 Continuity
We now consider functions that are defined on a subset S of Rn . The concepts
of being open, closed, or bounded apply to such sets S. These are "topological"
properties of S, which means that they refer to the way points in S can be
"approached", for example by sequences. If S is open, then any point in S has an
ε-ball around it that belongs to S, for sufficiently small ε. If S is closed, then S
contains all its limit points. If S is bounded, then every sequence in S is bounded,
and moreover boundedness is necessary for compactness.
The central notion of “topology” is continuity, which refers to a function f
and means that f preserves “nearness”. That is, a function f is continuous if
it maps nearby points to nearby points. Here, “nearby” means “arbitrarily close”.
“Closeness” is defined in terms of the distance between two points, according to
the Euclidean norm or the maximum norm, as discussed in the first two sections
of this chapter. Basically, we can say that a function is continuous if it preserves
limits in the sense that
lim_{x_k → x} f(x_k) = f(x) ,  (3.26)
that is,
lim_{k→∞} f(x_k) = f( lim_{k→∞} x_k ) .
[Plot of the function g(x, y) from (3.29), which is discontinuous at (0, 0).]
With the help of this proposition, we see that g as defined in (3.29) is not continuous
at (0, 0) by considering the sequence (x, y)^(k) = (1/k, 1/k), which converges
to (0, 0), but where g((x, y)^(k)) = g(1/k, 1/k) = 1/2, which are function values of g
that do not converge to g(0, 0) = 0.
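The formula for (3.29) is not reproduced in this excerpt; from the partial derivatives computed in (4.14) below, it is g(x, y) = xy/(x² + y²) with g(0, 0) = 0. A quick numeric check of our own shows the constant value 1/2 along the diagonal:

def g(x, y):
    return x * y / (x**2 + y**2) if (x, y) != (0.0, 0.0) else 0.0

for k in (1, 10, 100, 1000):
    print(g(1/k, 1/k))   # always 0.5, although (1/k, 1/k) converges to (0, 0)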
‖f(x) − f(x̄)‖ < ε
⇔ |1/x − 1/x̄| < ε
⇔ |x − x̄| / |x x̄| < ε  (3.31)
⇔ |x − x̄| < ε x x̄
where the last equivalence holds because x and x̄ have the same sign, which we
assure by choosing δ small enough so that (x̄ − δ, x̄ + δ) ⊆ S as described (any δ so
that δ ≤ |x̄| will do), and thus x x̄ > 0. The last inequality in (3.31) is a condition
on |x − x̄|, but we cannot choose δ = ε x x̄ because this expression does not depend
solely on ε and x̄ but also on x. However, all we want is that |x − x̄| < δ implies
|x − x̄| < ε x x̄, so it suffices that δ is anything smaller than ε x x̄ (but still δ > 0).
Also, recall that we can force x to be close to x̄ with a sufficiently small δ. So if δ
is |x̄|/2 or less, then |x − x̄| < δ, or equivalently x ∈ (x̄ − δ, x̄ + δ), clearly implies
|x| ∈ (|x̄|/2, 3|x̄|/2) and thus in particular |x| > |x̄|/2 (as well as x ∈ S). With that
consideration, we let
δ = min{ |x̄|/2, ε|x̄|²/2 } .  (3.32)
Then |x − x̄| < δ implies |x̄|/2 < |x| and thus |x − x̄| < δ ≤ ε|x̄|²/2 < ε|x̄| |x| = ε x x̄,
and therefore ‖f(x) − f(x̄)‖ < ε according to (3.31), as intended. (These considerations
had the additional challenge of writing them neatly so as to simultaneously
cover the cases x̄ > 0 and x̄ < 0.)
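As a sanity check of this choice of δ, the following small Python experiment (our own, not part of the notes) samples points x with |x − x̄| < δ and confirms |1/x − 1/x̄| < ε:

import random

def delta(eps, xbar):
    return min(abs(xbar) / 2, eps * xbar**2 / 2)   # the choice (3.32)

for xbar in (0.3, -2.0, 5.0):
    for eps in (1.0, 0.1, 0.01):
        d = delta(eps, xbar)
        for _ in range(1000):
            x = xbar + random.uniform(-d, d)
            assert abs(1/x - 1/xbar) < eps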
In the preceding proof, the function f : x ↦ 1/x was shown to be continuous
at x̄ by choosing δ = ε x̄²/2 (which for small ε also implies δ ≤ |x̄|/2 as required in
(3.32)). We see here that we have to choose δ as a function not only of ε but also
of the point x̄ at which we want to prove continuity.
As an aside, the concept of uniform continuity means that δ can be chosen as a
function of ε only. That is, a function f : S → R is called uniformly continuous if
∀ ε > 0 ∃ δ > 0 ∀ x̄ ∈ S ∀ x ∈ S : |x − x̄| < δ ⇒ |f(x) − f(x̄)| < ε .
In contrast, the function f is just continuous if (3.27) holds prefixed with the
quantification ∀ x̄ ∈ S, so that δ can be chosen depending on ε and x̄, as in (3.32). It can
be shown that a continuous function on a compact domain is uniformly continuous.
This is not the case for the function R \ {0} → R, x ↦ 1/x, whose domain is
not compact.
We consider as a second example for which we prove continuity the function
f : [0, ∞) → R, x ↦ √x. For x > 0, the function f has the derivative f′(x) =
(1/2) x^(−1/2). At x = 0, the function f has no derivative because that derivative would
have to be arbitrarily steep. The graph of f is a flipped parabola arc and f is
clearly continuous. We prove this using the definition of continuity (3.27), similar
to the equivalences in (3.31):
‖f(x) − f(x̄)‖ < ε
⇔ |√x − √x̄| < ε
⇔ (√x − √x̄)² < ε²  (3.34)
⇔ x + x̄ < ε² + 2√(x x̄) .
Proof. Suppose first f is continuous at x̄. Let ε > 0 and choose δ > 0 so that (3.27)
holds, and let δ′ = δ/√n. Then x ∈ Bmax(x̄, δ′) ⊆ B(x̄, δ) by (3.15), which implies
f(x) ∈ B(f(x̄), ε) by choice of δ, and thus f(x) ∈ Bmax(f(x̄), ε) by (3.14), which
implies (3.37) (with δ′ instead of δ) as claimed.
Conversely, given (3.37) and ε > 0, we choose δ > 0 so that x ∈ Bmax(x̄, δ)
implies f(x) ∈ Bmax(f(x̄), ε/√m). Then x ∈ B(x̄, δ) implies x ∈ Bmax(x̄, δ) by
(3.14) and thus f(x) ∈ Bmax(f(x̄), ε/√m) ⊆ B(f(x̄), ε) by (3.15), which proves
(3.27).
|xy − x̄ȳ| < ε .  (3.38)
We have
|xy − x̄ȳ| = |xy − x̄y + x̄y − x̄ȳ| ≤ |xy − x̄y| + |x̄y − x̄ȳ| = |x − x̄| |y| + |x̄| |y − ȳ|  (3.39)
so that we have proved (3.38) if we can prove
|x − x̄| |y| < ε/2   and   |x̄| |y − ȳ| < ε/2 .  (3.40)
Define
δ_x = ε / (2(|ȳ| + 1)) .  (3.43)
Then |x − x̄| < δ_x and |y − ȳ| < δ_y imply |x − x̄| |y| < ε/2, that is, the first
inequality in (3.40). Now let δ = min{δ_x, δ_y}. Then ‖(x, y) − (x̄, ȳ)‖max < δ implies
|x − x̄| < δ_x and |y − ȳ| < δ_y, which in turn imply (3.40) and therefore (3.38). With
Lemma 3.17, this shows the continuity of the function (x, y) ↦ xy.
This is an important observation: the arithmetic operation of multiplication is
continuous, and it is also easy to prove that addition, that is, the function ( x, y) 7→
x + y, is continuous. Similarly, the function x 7→ − x is continuous, which is
nearly trivial compared to proving that x 7→ 1/x (for x 6= 0) is continuous.
The following lemma exploits that we have defined continuity for functions
that take values in Rm and not just in R1 . It states that the composition of con-
tinuous functions is continuous. Recall that f (S) is the image of f as defined
in (3.1).
Proof. Assume that f and g are continuous. Let x̄ ∈ S and ε > 0. We want to
show that there is some δ > 0 so that ‖x − x̄‖ < δ and x ∈ S imply ‖g(f(x)) −
g(f(x̄))‖ < ε. Because g is continuous at f(x̄), there is some γ > 0 so that for any
y ∈ T with ‖y − f(x̄)‖ < γ we have ‖g(y) − g(f(x̄))‖ < ε. Now choose δ > 0
so that, by continuity of f at x̄, we have for any x ∈ S that ‖x − x̄‖ < δ implies
‖f(x) − f(x̄)‖ < γ. Then (for y = f(x)) this implies ‖g(f(x)) − g(f(x̄))‖ < ε as
required.
Proof. Let x̄ ∈ S and ε > 0. Suppose f and g are continuous at x̄. Then according
to Lemma 3.17 there is some δ > 0 so that ‖x − x̄‖max < δ and x ∈ S
imply ‖f(x) − f(x̄)‖max < ε and ‖g(x) − g(x̄)‖max < ε. But then also ‖h(x) −
h(x̄)‖max < ε because each of the m + ℓ components of h(x) − h(x̄) is either the
corresponding component of f(x) − f(x̄) or of g(x) − g(x̄).
Conversely, if h is continuous at x̄, there is some δ > 0 so that ‖x − x̄‖max < δ
and x ∈ S imply ‖h(x) − h(x̄)‖max < ε. Because ‖f(x) − f(x̄)‖max ≤ ‖h(x) −
which is (trivially) the case if (x, y) = (0, 0), so assume (x, y) ≠ (0, 0). Then
|x| = √(x²) ≤ √(x² + y²) and |y| = √(y²) ≤ √(x² + y²), and therefore
|h(x, y) − h(0, 0)| = |h(x, y)| = |x| |y| / √(x² + y²) ≤ √(x² + y²) = ‖(x, y)‖ .  (3.46)
So if we choose δ = ε then ‖(x, y)‖ < δ implies |h(x, y)| ≤ ‖(x, y)‖ < ε and thus
(3.45) as required.
In this section, we have seen how continuity can be proved for functions that
are defined on Rn . The maximum norm is particularly useful for these proofs.
Lemma 3.22 Let A be a nonempty compact subset of R. Then A has a maximum and a
minimum.
Proof. We only show that A has a maximum. By Theorem 3.13, A is closed and
bounded, so sup A exists. We show that sup A is a limit point of A. Otherwise,
B(sup A, ε) ∩ A = ∅ for some ε > 0 by Lemma 3.7. But then there is no t ∈ A
with t > sup A − ε, so sup A − ε is an upper bound of A, but sup A is the least
upper bound, a contradiction. So sup A is a limit point of A and therefore belongs
to A because A is closed, and hence sup A is also the maximum of A.
The second lemma says that the image of a compact set is compact.
Proof. Let {y_k}_{k∈N} be any sequence in f(X). We show that there exists a subsequence
{y_{k_n}}_{n∈N} and a y ∈ f(X) such that lim_{n→∞} y_{k_n} = y, which will show that
f(X) is compact. For that purpose, for each k choose x^(k) ∈ X with f(x^(k)) = y_k,

S = f^{−1}(T) = { x ∈ R^n | f(x) ∈ T } .  (3.47)
Proof. We first prove that if T is open, then S is also open. Let x ∈ S, so that
f ( x ) ∈ T by (3.47). Because T is open, there is some ε > 0 so that B( f ( x ), ε) ⊆ T.
Because f is continuous, there is some δ > 0 so that for all y ∈ B( x, δ) we have
f (y) ∈ B( f ( x ), ε) and therefore f (y) ∈ T, that is, y ∈ S. This shows B( x, δ) ⊆ S,
which proves that S is open, as required.
In order to observe the same property for closed sets note that a set is closed
if and only if its set-theoretic complement is open, and that the pre-image of the
For our purposes, we only need the part of Theorem 3.25 that is stated in
Lemma 3.24. It implies the following observation, which is most useful to identify
certain subsets of Rn as closed or open.
X = { (x, y) ∈ R^2 | x ≥ 0, y ≥ 0, xy ≥ 1 } .  (3.48)
[Picture: the set X from (3.48), bounded below by the hyperbola xy = 1, split
into the part X1 with x + y ≤ 3 and the unbounded part X2 with x + y ≥ 3.]
Weierstrass. We also need that X1 is not empty: it contains for example the point
(2, 1). Now, the maximum of f on X1 is also the maximum of f on X. Namely, for
(x, y) ∈ X2 we have x + y ≥ 3 and therefore f(x, y) = 1/(x + y) ≤ 1/3 = f(2, 1),
where (2, 1) ∈ X1, so Theorem 1.10 applies.
In this example of maximizing the function f in (3.49) on the domain X in
(3.48), X is closed but not compact. However, we have applied Theorem 1.10
with X = X1 ∪ X2 as in (3.50) in order to obtain a compact domain X1 where we
know the maximization of f has a solution, which then applies to all of X. This is
an important example which we will consider further in some exercises.
The Theorem of Weierstrass only gives us the existence of a maximum of
f on X1 (and thereby on X), but it does not show how to find it. It seems rather
clear that the maximum of f(x, y) on X is obtained for (x, y) = (1, 1), but proving
(and finding) this maximum is the topic of the next chapter.
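A crude numeric sweep along the binding constraint xy = 1 (our own check, using that decreasing y towards 1/x can only increase f) supports this claim:

def f(x, y):
    return 1 / (x + y)

# on X, the maximum lies on the hyperbola xy = 1, so sweep x and set y = 1/x
best = max((f(x, 1 / x), x) for x in (0.01 * i for i in range(1, 401)))
print(best)   # approximately (0.5, 1.0): the maximum value 1/2 at (x, y) = (1, 1)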
Chapter 4
First-order conditions
f ( x, y) = x + 4y (4.2)
Figure 4.1 Plot of the function f ( x, y) = x + 4y for 0 ≤ x ≤ 5 and 0 ≤ y ≤ 3,
with the blue curve showing the restriction xy = 1.
Figure 4.2 Contour lines and gradient (1, 4) of the function f ( x, y) = x + 4y for
x ≥ 0, y ≥ 0.
A more instructive picture that can be drawn in two dimensions uses contour
lines of the function f in (4.2), shown as the dashed lines in Figure 4.2. Such a
contour line for f ( x, y) is the set of points ( x, y) where f ( x, y) = c for some con-
stant c, that is, where f ( x, y) takes a fixed value. One could also say that a contour
line is the pre-image f −1 ({c}) under f of one of its possible values c. Clearly, for
different values of c any two such contour lines are disjoint. Here, because f
is linear, these contour lines are parallel lines. For (x, y) ∈ R^2, such a contour
line corresponds to the equation x + 4y = c or equivalently y = c/4 − x/4 (we also
only consider nonnegative values for x and y). Contour lines are known from
topographical maps of, say, mountain regions, where each line corresponds to a
particular height above sea level; the two-dimensional picture of these lines con-
veys information about the three-dimensional terrain. Here, they indicate how
the function should be minimized, by choosing the smallest function value c that
is possible.
Figure 4.2 also shows the gradient of the function f . We will define this gradi-
ent, called D f, later formally. It is given by the derivatives of f with respect to x
and to y, that is, the pair (d/dx f(x, y), d/dy f(x, y)), which is here (1, 4) for every (x, y)
because f is the linear function (4.2). This vector (1, 4) is drawn in Figure 4.2.
The gradient (1, 4) shows in which direction the function increases (which is dis-
cussed in further detail in the introductory Section 5.1 of the next chapter), and
can be interpreted as the direction of “steepest ascent”. Correspondingly, the op-
posite direction of the gradient is the direction of “steepest descent”. In addition,
the gradient is orthogonal to the contour line, because along the contour line the
function neither increases nor decreases.
[Figure 4.3: the contour lines of f and the hyperbola xy = 1, which touch at the
point (2, 1/2).]
Moving along any direction which is not orthogonal to the gradient means ei-
ther moving partly in the same direction as the gradient (increasing the function
value), or away from it (decreasing the function value). Consider now Figure 4.3
where we have drawn the hyperbola which represents the constraint xy = 1
in (4.1). A point on this hyperbola is, for example, (1, 1). At that point, the contour
lines show that the function value can still be lowered by moving towards
(1 + ε, 1/(1 + ε)). But at the point (x, y) = (2, 1/2) the contour line just touches the
hyperbola and so the function value of f(x, y) cannot be reduced further.
The way to compute this point is the method of so-called Lagrange multipli-
ers, here a single multiplier that corresponds to the single constraint xy = 1. We
write this constraint in the form g( x, y) = 0 where g( x, y) = xy − 1. This con-
straint function g(x, y) has itself a gradient, which depends on (x, y) and is given
by Dg(x, y) = (d/dx g(x, y), d/dy g(x, y)) = (y, x). The Lagrange multiplier method
says that only when there is a scalar λ so that D f ( x, y) = λDg( x, y), that is, when
the gradients of the objective function f and of the constraint function g are co-
linear, then no further improvement of f ( x, y) is possible. The reason is that only
in this case the contour lines of f and of g, which are orthogonal to the gradients
D f and Dg, touch as required, so moving along the contour line of g (that is,
maintaining the constraint) also neither increases nor decreases the value of f .
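This condition can be solved symbolically; the following sympy sketch (our own, not part of the notes) solves D f(x, y) = λ Dg(x, y) together with the constraint and recovers the touching point (2, 1/2) from Figure 4.3:

from sympy import symbols, solve

x, y, lam = symbols('x y lam', real=True)
# D f(x, y) = (1, 4) and Dg(x, y) = (y, x) for g(x, y) = xy - 1
eqs = [1 - lam * y, 4 - lam * x, x * y - 1]
print(solve(eqs, [x, y, lam]))
# the two solutions are (-2, -1/2, -2) and (2, 1/2, 2);
# with x, y > 0 the touching point is (2, 1/2)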
4.2 Differentiability in Rn
The idea of differentiability is to approximate a function locally with a linear,
more precisely affine, function. If f : Rn → R, then f is called affine if there are
reals c0 , c1 , . . . , cn so that for all x = ( x1 , . . . , xn ) ∈ Rn we have
f ( x1 , . . . , x n ) = c0 + c1 x1 + · · · + c n x n . (4.4)
[Picture: the graph of f near x̄ together with the affine function f(x̄) + G · (x − x̄),
whose slope G is the "tangent" at (x̄, f(x̄)).]
understand to mean that near a point x̄, the function f(x) (by "zooming in" to
the function graph near x̄) becomes less and less distinguishable from an affine
function which has a "gradient" G that defines the slope of the "tangent" at f(x̄),
as in
f(x) ≈ f(x̄) + G · (x − x̄) .  (4.5)
But this is an imprecise statement, which should clearly mean more than that the left
and right side in (4.5) approach the same value as x tends to x̄, because this would
be true for any G as long as f is continuous. What we mean is that G should
represent the "rate of change" of f(x) near f(x̄). (Then G will be the gradient or
derivative of f at x̄.)
[Picture: the graph of f and the secant line c0 + c1 x through the two points
(x̄, f(x̄)) and (x, f(x)).]
If c1 as defined in (4.7) has a limit as x → x̄, then this limit will take the role
of G in (4.5) and will be called the gradient or derivative of f at x̄ and denoted
by D f(x̄). Geometrically, the "secant" line defined by the two points (x̄, f(x̄)) and
(x, f(x)) on the curve in Figure 4.5 becomes the tangent line when x → x̄.
For f : Rn → R, the aim will be to replicate (4.8) where the single coefficient
c1 will be replaced by an n-vector of coefficients (c1 , . . . , cn ). One problem is that
this vector can no longer be represented as a quotient as in (4.7). The following
is the central definition of this section, which we explain in detail afterwards.
Interior points are defined in Definition 3.4.
( f(x) − f(x̄) ) / ‖x − x̄‖  ≈  G · (x − x̄) · 1/‖x − x̄‖  (4.10)
where "≈" is here meant to say "holds as equality in the limit as x → x̄". The
right-hand side in (4.10) is a product of three terms: a row vector G in R^{1×n}, a
column vector x − x̄ in R^n = R^{n×1}, and a scalar 1/‖x − x̄‖ in R. Written in
this way, each such product is the familiar matrix product. Recall that the matrix
product of two matrices A and B is defined if and only if A is of dimension m × k
and B is of dimension k × n; the result is an m × n matrix that in row i and column
j has entry ∑ks=1 ais bsj where ais and bsj are the respective entries of A and B. We
normally consider elements of Rn as column vectors, that is, as n × 1 matrices.
A scalar is a 1 × 1 matrix, and so we multiply a vector z in Rn with a scalar α
as the matrix product zα. In contrast, a row vector such as G in R1×n has to be
multiplied with a scalar α from the left as in αG. Otherwise these products would
not be defined as matrix products. It is very useful to keep all such products
as matrix products, not least in order to check that the product is written down
correctly. If a and b are two (column) vectors in R^n, then the matrix product a^⊤ b
denotes their scalar product ∑_{i=1}^{n} a_i b_i in R, where a^⊤ is the row vector in R^{1×n} that
is obtained from a by matrix transposition. In (4.9) and (4.10), the vector G is
directly given as a row vector so such a transposition is not needed.
Both sides of (4.10) are real numbers, because f is real-valued and the term
f(x) − f(x̄) is divided by the Euclidean norm ‖x − x̄‖ of the vector x − x̄. Because
we have ‖zα‖ = ‖z‖ |α| for any z ∈ R^n and α ∈ R, the vector y = z · 1/‖z‖ (for
z ≠ 0) has unit length, that is, ‖y‖ = 1. Therefore, on the right-hand side of (4.10)
the vector (x − x̄) · 1/‖x − x̄‖ has unit length. This vector is a scalar multiple of
x − x̄ and can thus be interpreted as the direction of x − x̄ (a vector normalized
to have length one). If (4.10) holds, then the scalar product of G with this vector,
given by G · (x − x̄) · 1/‖x − x̄‖, shows the "growth rate" of f(x) − f(x̄) in the
direction x − x̄.
Consider now the case n = 1, where ‖z‖ = |z| for any z ∈ R^1, so that z · 1/‖z‖
is either 1 (if z > 0) or −1 (if z < 0). Then the right-hand side in (4.10) is G if x > x̄
and −G if x < x̄. Similarly, the left-hand side of (4.10) is (f(x) − f(x̄))/(x − x̄) if
x > x̄ and −(f(x) − f(x̄))/(x − x̄) if x < x̄. Hence for x ≠ x̄ these two conditions
state
lim_{x → x̄} ( f(x) − f(x̄) ) / (x − x̄) = G ,  (4.11)
which is exactly the familiar notion of differentiability of a function defined on
R^1. The case distinction x > x̄ and x < x̄ that we just made emphasizes that the
limit of the quotient in (4.11) has to exist for any possible approach of x to x̄, which
is also stated in Definition 4.1. For example, consider the function f(x) = |x|,
which is well known not to be differentiable at 0. Namely, if we restrict x to be
positive (that is, x > x̄ = 0), then (f(x) − f(x̄))/(x − x̄) = |x|/x = 1, whereas for
x < x̄ = 0 we have (f(x) − f(x̄))/(x − x̄) = |x|/x = −1. Therefore, there is no
common limit to these fractions as x → x̄, for example if we approach x̄ with the
sequence {x_k} defined by x_k = (−1/2)^k, which converges to 0 but with alternating
signs of x_k. In Definition 4.1, the limit has to exist for any possible approach of x
to x̄ (for example, by letting x be the points of a sequence that converges to x̄).
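This failure is easy to observe numerically with the alternating sequence x_k = (−1/2)^k (a small check of our own):

def quotient(f, xbar, x):
    return (f(x) - f(xbar)) / (x - xbar)

for k in range(1, 9):
    xk = (-0.5) ** k
    print(k, quotient(abs, 0.0, xk))   # the quotients alternate between -1 and 1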
Next, we show that the gradient G of a differentiable function is unique and
nicely described by the row vector of “partial derivatives” of the function, and
then describe again how the derivative represents a local linear approximation of
the function in “Taylor’s theorem”, in (4.15) below.
lim_{t→0} ( f(x + e_j t) − f(x) ) / t  =  ∂f(x)/∂x_j .  (4.12)
We have earlier (in our introductory Section 4.1) used the notation d/dx_j f(x)
rather than ∂/∂x_j f(x), which means the same, namely differentiating f(x_1, . . . , x_n)
as a function of x_j only, while keeping the values of all other variables x_1, . . . , x_{j−1},
x_{j+1}, . . . , x_n fixed. For example, if f(x_1, x_2) = x_1 x_2 + x_1, then ∂/∂x_1 f(x_1, x_2) = x_2 + 1
and ∂/∂x_2 f(x_1, x_2) = x_1.
Next we show that the gradient of a differentiable function is the vector of
partial derivatives.
∂f(x)/∂x_j = lim_{t→0} ( f(x + e_j t) − f(x) ) / t = G · e_j
∂/∂x g(x, y) = ( y(x² + y²) − 2x(xy) ) / (x² + y²)² = (y³ − y x²) / (x² + y²)²  (4.14)
and for (x, y) ≠ (0, 0) we have ∂/∂y g(x, y) = (x³ − x y²) / (x² + y²)² because g(x, y)
is symmetric in x and y. So the partial derivatives of g exist everywhere. However,
g(x, y) is not even continuous, let alone differentiable.
It can be shown that the continuous function h(x, y) defined in (3.44) is not
differentiable at (0, 0).
Nevertheless, the partial derivatives of a function are very useful if they are
continuous, which is often the case.
Theorem 4.5 is not a trivial observation, because the existence of partial deriva-
tives does not imply differentiability. However, if the partial derivatives exist
and are continuous, then the function is differentiable and continuously differen-
tiable. As with many further results in this chapter, we will not prove Theorem 4.5
for reasons of space. A proof is given in W. Rudin (1976), Principles of Mathematical
Analysis (3rd edition, McGraw-Hill, New York), Theorem 9.21, page 219. What is
important is that the partial derivatives have to be jointly continuous in all vari-
ables, not just separately continuous. For example, the function g( x, y) defined
in (3.29) is separately continuous in x and y, respectively (when the other vari-
∂ y3 −yx2
able is fixed), and so is in fact each partial derivative, such as ∂x g( x, y) = ( x2 +y2 )2
in (4.14). However, this function is not jointly continuous at (0, 0), because for
3 −2x3
y = 2x, say, we have (for x 6= 0) ∂x ∂
g( x, y) = (8x
x +4x2 )2
2
6
= 5x which does not tend
to ∂
∂x g (0, 0) = 0 when x → 0.
By definition, a C1 function is differentiable. Many functions (in particular
if they are not defined by case distinctions) are easily seen to have continuous
partial derivatives, so this is often assumed for simplicity rather than the more
general differentiability.
The following theorem expresses, once more, that differentiability means local
approximation by a linear function. It will also be used to prove (sometimes only
with a heuristic argument) first-order conditions for optimality that we consider
later.
f(x) = f(x̄) + G · (x − x̄) + R(x) · ‖x − x̄‖  (4.15)
where lim_{x→x̄} R(x) = R(x̄) = 0.
R(x) = ( f(x) − f(x̄) − G · (x − x̄) ) / ‖x − x̄‖  (4.16)
The important part in (4.15) is that the remainder term R(x) tends to zero as
x → x̄. The norm ‖x − x̄‖ also tends to zero as x → x̄, so the product R(x) ·
‖x − x̄‖ becomes negligible in comparison to G · (x − x̄), which is therefore the
dominant linear term. Condition (4.15) is perhaps the best way to understand
differentiability as "local approximation by a linear function".
f(x) = f(x̄) + f′(x̄)(x − x̄) + ( f″(x̄)/2 ) (x − x̄)² + R̂(x) · |x − x̄|²  (4.17)
where lim_{x→x̄} R̂(x) = 0. By iterating this process for functions that are differentiable
sufficiently many times, one obtains a "Taylor expansion" that approximates
the function not just linearly but by a higher-degree polynomial. Secondly,
the expression (4.17) for a function that is twice differentiable is more informative
than the expression (4.15), with the following additional observation: one
can show that it allows us to represent the original remainder term R(x) in the
form f″(z)/2 · (x − x̄) for some "intermediate value" z that is between x̄ and x;
hence bounds on f″(z) translate to bounds on R(x). These variations of Taylor's
theorem are often stated in the literature. We do not consider them here, only the
simple version of Theorem 4.6.
f(x̄ + ∆x, ȳ + ∆y) = f(x̄, ȳ) + D f(x̄, ȳ) · (∆x, ∆y)^⊤ + R(∆x, ∆y) · ‖(∆x, ∆y)‖ .  (4.18)
f(x, y) = f(x̄ + ∆x, ȳ + ∆y) = (x̄ + ∆x) · (ȳ + ∆y)
        = x̄ ȳ + ȳ ∆x + x̄ ∆y + ∆x ∆y  (4.19)
        = f(x̄, ȳ) + D f(x̄, ȳ) · (∆x, ∆y)^⊤ + ∆x ∆y
which is of the form (4.18) if we can find a remainder term R(∆x, ∆y) so that
R(∆x, ∆y) · ‖(∆x, ∆y)‖ = ∆x ∆y. This holds if R(∆x, ∆y) = ∆x ∆y / ‖(∆x, ∆y)‖, and
then
|R(∆x, ∆y)| = |∆x ∆y| / √(∆x² + ∆y²) = √( ∆x² ∆y² / (∆x² + ∆y²) ) = 1 / √( 1/∆y² + 1/∆x² ) ,  (4.20)
which tends to 0 as (∆x, ∆y) → (0, 0), as required.
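That the remainder indeed vanishes can also be checked numerically (our own check):

import math

def R(dx, dy):
    # the remainder term R(dx, dy) = dx*dy / ||(dx, dy)|| from (4.20)
    return dx * dy / math.hypot(dx, dy)

for t in (1.0, 0.1, 0.01, 0.001):
    print(R(t, t))   # equals t / sqrt(2), which tends to 0 with t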
In Figure 4.6, a, c, and e are local maximizers and b and d are local minimizers
of the function shown, where b, c, d are unconstrained. The function attains its
global minimum at b and its global maximum at e.
Proof. The direction "⇒" is immediate from Definition 4.7. To see the converse
direction "⇐", if x̄ is a local maximizer of f on X then there is some ε₁ > 0 so that
f(x) ≤ f(x̄) for all x ∈ X ∩ B(x̄, ε₁), and x̄ is an interior point of X if B(x̄, ε₂) ⊆ X
for some ε₂ > 0. With ε = min{ε₁, ε₂} we obtain B(x̄, ε) ⊆ X and f(x) ≤ f(x̄) for
all x ∈ B(x̄, ε), that is, x̄ is an unconstrained local maximizer of f on X.
Figure 4.6 Illustration of Definition 4.7 for a function defined on the interval
[ a, e].
f(x̄ + ∆x) = f(x̄) + D f(x̄) · ∆x + R(∆x) · ‖∆x‖  (4.21)
With ∆x = G^⊤ t for t > 0 this gives
f(x̄ + ∆x) = f(x̄) + D f(x̄) · ∆x + R(∆x) · ‖∆x‖
           = f(x̄) + G · G^⊤ t + R(G^⊤ t) · ‖G‖ t
           = f(x̄) + ‖G‖ t · ( ‖G‖ + R(G^⊤ t) )
We show how to use Lemma 4.9 with two examples. First, consider the function
f : R² → R,
f(x, y) = (x − y) / (2 + x² + y²) ,  (4.22)
where D f(x, y) = (0, 0) means
( 2 + x² + y² − (x − y) 2x ) / (2 + x² + y²)² = 0 ,   ( −2 − x² − y² − (x − y) 2y ) / (2 + x² + y²)² = 0 ,
or equivalently
2 − x² + y² + 2xy = 0 ,
−2 − x² + y² − 2xy = 0 .  (4.23)
Adding these two equations gives 2y² − 2x² = 0, and subtracting them gives
4 + 4xy = 0, so x² = y² and xy = −1, with the two solutions (x, y) = (1, −1) and
(x, y) = (−1, 1). We claim that for all (x, y) we have
f(x, y) ≤ f(1, −1) = 2 / (2 + 1 + 1) = 1/2 ,
which holds by the following chain of equivalences:
(x − y) / (2 + x² + y²) ≤ 1/2
⇔ 2x − 2y ≤ 2 + x² + y²  (4.24)
⇔ 0 ≤ 1 − 2x + x² + 1 + 2y + y²
⇔ 0 ≤ (1 − x)² + (1 + y)²
which is true (with equality for (x, y) = (1, −1), a useful check). The inequality
f(x, y) ≥ f(−1, 1) = −1/2 is shown very similarly, which shows that f(−1, 1) is
the global minimum of f.
In the following example, the first-order condition of a zero derivative also
gives useful information, although of a different kind. Consider the function g :
R² → R,
g(x, y) = xy / (1 + x² + y²) ,  (4.25)
where Dg(x, y) = (0, 0) means that both partial derivatives vanish, or equivalently
(after multiplying with the square of the denominator)
y − y x² + y³ = 0 ,
x + x³ − x y² = 0 .  (4.26)
An obvious solution to (4.26) is ( x, y) = (0, 0), but this is only a stationary point
of g and neither maximum nor minimum (not even locally), because g(0, 0) = 0
but g( x, y) takes positive as well as negative values (also near (0, 0)). Similarly,
when x = 0 or y = 0, then g( x, y) = 0 but this is not a maximum or minimum, so
that we can assume x 6= 0 and y 6= 0. Then the equations (4.26) are equivalent to
1 − x 2 + y2 = 0
(4.27)
1 + x 2 − y2 = 0
which when added give 2 = 0 which is a contradiction. This shows that there
is no solution to (4.26) where x 6= 0 and y 6= 0 and thus g( x, y) has no local
and therefore also no global maximum or minimum. This is possible because the
domain R² of g is not compact. For x = y and large x, for example, we have
g(x, x) = x² / (1 + 2x²) = 1 / (1/x² + 2)
which gets arbitrarily close to 1/2 but never reaches it, because for all (x, y)
g(x, y) = xy / (1 + x² + y²) < 1/2
⇔ 2xy < 1 + x² + y²
⇔ 0 < 1 + x² − 2xy + y²
⇔ 0 < 1 + (x − y)²
which is true. We can show similarly that g(x, y) > −1/2 and that g(x, −x) gets
arbitrarily close to −1/2. This shows that the image of g is the interval (−1/2, 1/2).
To understand this theorem, consider first the case k = 1, that is, a single
constraint g( x ) = 0. Then (4.28) states D f ( x ) = λ Dg( x ), which means that the
gradient of f (a row vector) is a scalar multiple of the gradient of g. The two gra-
dients have the n partial derivatives of f and g as components, and each partial
derivative of g is multiplied with the same λ to equal the respective partial deriva-
tive of f . These are n equations for the n components of x and λ as unknowns,
and an additional equation is g( x ) = 0, so these are n + 1 equations for n + 1 un-
knowns in total. If there are k constraints g_i(x) = 0 for 1 ≤ i ≤ k, then (4.28) and
these constraints are n + k equations for n + k unknowns x and λ1 , . . . , λk . Often
these equations have only finitely many solutions that can then be investigated
further.
As an example with a single constraint, consider the functions f , g : R2 → R,
f ( x, y) = x · y, g( x, y) = x2 + y2 − 2, (4.29)
Figure 4.7 Illustration of the Theorem of Lagrange for f and g in (4.29). The
arrows indicate the gradients of f and g, which are orthogonal to
the contour lines. These gradients have to be co-linear in order to
find a local maximum or minimum of f ( x, y) subject to the constraint
g( x, y) = 0.
For (4.29), D f ( x, y) = (y, x ) and Dg( x, y) = (2x, 2y). Here Dg( x, y) is linearly
dependent only if ( x, y) = (0, 0), which however does not fulfill g( x, y) = 0, so
the constraint qualification holds always. The Lagrange multiplier λ has to fulfill
D f ( x, y) = λ Dg( x, y), that is, (y, x ) = λ(2x, 2y). Here x = 0 would imply y = 0
and vice versa, so we have x 6= 0 and y 6= 0, and the first equation y = λ2x
implies λ = y/2x, which when substituted into the second equation gives x =
λ2y = 2y2 /2x, and thus x2 = y2 or | x | = |y|. The constraint g( x, y) = 0 then
implies x2 + y2 − 2 = 2x2 − 2 = 0 and therefore | x | = 1, which gives the four
solutions (1, 1), (−1, −1), (−1, 1), and (1, −1). For the first two solutions, f takes
the value 1, and for the last two the value −1, so these are the local and in fact
global maxima and minima of f on the circle X.
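The same computation can be handed to sympy (our own check), which recovers the four solutions and their function values:

from sympy import symbols, solve

x, y, lam = symbols('x y lam', real=True)
f, g = x * y, x**2 + y**2 - 2
eqs = [f.diff(x) - lam * g.diff(x), f.diff(y) - lam * g.diff(y), g]
for (a, b, l) in solve(eqs, [x, y, lam]):
    print((a, b), a * b)
# (1, 1) and (-1, -1) with value 1; (1, -1) and (-1, 1) with value -1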
The following functions illustrate why the constraint qualification is needed
in Theorem 4.10. Let
f ( x, y) = −y, g( x, y) = x2 − y3 , (4.30)
Here D f(x, y) = (0, −1) and Dg(x, y) = (2x, −3y²). However, the equation
D f(x, y) = λ Dg(x, y), that is, (0, −1) = λ(2x, −3y²), has no solution at all on X:
0 = λ2x implies λ = 0 or x = 0; λ = 0 contradicts −1 = −3λy², and x = 0 implies
y = 0 by the constraint x² − y³ = 0, which again contradicts −1 = −3λy². However,
the unique maximizer of f(x, y) on X is clearly (0, 0). The equation D f(x, y) = λ Dg(x, y)
fails to hold there because the constraint qualification is not fulfilled: at (0, 0) the
gradient Dg(0, 0) equals (0, 0), which is not a linearly independent vector.
An example that gives a geometric justification for Theorem 4.10 was given
in Figure 4.3. We do not prove Theorem 4.10, but give a more general plausibility
argument for the case k = 1, with the help of Taylor's Theorem 4.6. Consider x̄ so
that f(x̄) is a local maximum of f on X = {x ∈ U | g(x) = 0}, and thus g(x̄) = 0.
Any variation ∆x around x̄ so that x̄ + ∆x ∈ X requires
0 = g(x̄) = g(x̄ + ∆x) ≈ g(x̄) + Dg(x̄) · ∆x  (4.31)
where "≈" means that we neglect the remainder term because we assume ∆x to
be sufficiently small. By (4.31), Dg(x̄) · ∆x = 0, and the set of these ∆x's is a subspace
of R^n of dimension n − 1 provided Dg(x̄) ≠ 0, which holds by the constraint
qualification (this just says that the gradient of g at the point x̄ is orthogonal to
the "contour set" {x ∈ R^n | g(x) = 0}). Similarly, a local maximum f(x̄) requires
∂/∂x_j F(x, λ) = ∂/∂x_j f(x) − ∑_{i=1}^{k} λ_i ∂/∂x_j g_i(x) = 0   (1 ≤ j ≤ n) .  (4.36)
D f(x) − ∑_{i=1}^{k} λ_i Dg_i(x) = 0  (4.37)
which is equivalent to (4.28). The last k equations in (4.35) are for the k partial
derivatives with respect to λi of F, that is,
∂/∂λ_i F(x, λ) = −g_i(x) = 0   (1 ≤ i ≤ k)  (4.38)
ε = g_j(x̄ + ∆x) ≈ g_j(x̄) + Dg_j(x̄) · ∆x = Dg_j(x̄) · ∆x ,  (4.40)
and thus
f(x̄(ε)) = f(x̄ + ∆x) ≈ f(x̄) + D f(x̄) · ∆x
        = f(x̄) + ∑_{i=1}^{k} λ_i Dg_i(x̄) · ∆x  (4.41)
        = f(x̄) + λ_j Dg_j(x̄) · ∆x
        = f(x̄) + λ_j ε
which shows (4.39). The interpretation of (4.39) is that adding ε more manpower
(amount of resource j) so that the constraint g_j(x) = 0 is changed to g_j(x) = ε
increases the firm's profit by λ_j ε. Hence, λ_j is the price per extra unit of manpower
that the firm should be willing to pay, given the current maximizer x̄ and
associated Lagrange multipliers λ1, . . . , λk in (4.28).
The following is a typical problem that can be solved with the help of La-
grange’s Theorem 4.10. A manufacturer of rectangular milkcartons wants to min-
imize the material used to obtain a carton of a given volume. A carton is x cm
high, y cm wide and z cm deep, and is folded according to the layout shown on
the right in Figure 4.9 (which is used twice, for front and back). Each of the four
squares in a corner of the layout with length z/2 is (together with its counterpart
on the back) folded into a triangle as shown on the left (the triangles at the bot-
tom are folded underneath the carton). We ignore any overlapping material used
for gluing. What are the optimal dimensions x, y, z for a carton with volume 500
cm3 ?
The layout on the right shows that the area f ( x, y, z) of the material used is
( x + z)(y + z) times two (for front and back, but the factor 2 can be ignored in the
[Figure 4.9: on the left, the folded carton with dimensions x, y, z; on the right,
the layout for one side, a rectangle of width y + z and height x + z with a square
of side length z/2 in each corner.]
Because clearly x, y, z > 0, the derivative Dg( x, y, z) is never zero and therefore
linearly independent. By Lagrange’s theorem, there is some λ so that
y + z = λyz ,
x + z = λxz , (4.42)
x + y + 2z = λxy .
These equations are nonlinear, but simpler equations can be found by exploiting
their symmetry. Multiplying the first, second, and third equation in (4.42) by
x, y, z, respectively (all of which are nonzero), these equations are equivalent to
x (y + z) = λxyz ,
y( x + z) = λxyz , (4.43)
z( x + y + 2z) = λxyz ,
that is, they all have the same right-hand side. The first two equations in (4.43)
imply xz = yz and thus x = y. With x = y, the second and third equation give
x(x + z) = z(2x + 2z), that is, x² − xz − 2z² = (x − 2z)(x + z) = 0,
and thus x = 2z (because x + z > 0). That is, the only optimal solution is of the form (2z, 2z, z).
Applied to the volume equation this gives 4z³ = 500 or z³ = 125, that is, x = y =
10 cm and z = 5 cm.
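The whole Lagrange system can also be handed to sympy (our own check; the positivity assumptions select the relevant solution):

from sympy import symbols, solve

x, y, z, lam = symbols('x y z lam', positive=True)
f = (x + z) * (y + z)     # material for one side; the factor 2 does not matter
g = x * y * z - 500       # the volume constraint
eqs = [f.diff(v) - lam * g.diff(v) for v in (x, y, z)] + [g]
print(solve(eqs, [x, y, z, lam]))   # [(10, 10, 5, 3/10)]: x = y = 10, z = 5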
[Two pictures: a contour line f(x) = c and the region h(x) < 0 with its boundary
h(x) = 0, showing the gradients D f and Dh; in the second picture the two
gradients are co-linear and point in the same direction at the point x̄.]
where the contour line of f touches the contour line of h. This is exactly the same
situation as in the Lagrange multiplier problem, meaning D f ( x ) = µDh( x ) for
some Lagrange multiplier µ, with the additional constraint that the gradients of f
and h have to be not only co-linear, but point in the same direction, that is, µ ≥ 0.
(We use a different Greek letter µ instead of the usual λ to emphasize this.) The
reason is that at the point x both h( x ) and f ( x ) are maximized, by “getting out of
the lake”, and by maximizing f , in the direction of the gradients.
The following theorem is also known as the Kuhn–Tucker Theorem, published
in 1951. Later Kuhn found out that it had already been shown in 1939 in
an unpublished Master's thesis by Karush, and it is now also called the KKT or
Karush–Kuhn–Tucker Theorem, or simply the "first-order conditions" for maximization
under inequality constraints.
X = U ∩ { x ∈ R^n | h_i(x) ≤ 0, 1 ≤ i ≤ ℓ } .  (4.45)
E = { i | 1 ≤ i ≤ ℓ, h_i(x̄) = 0 } .  (4.48)
Proof of Theorem 4.11. We prove the KKT theorem with the help of the Theorem
4.10 of Lagrange. Let x̄ be a local maximizer of f on X, and let E be the set of
tight constraints at x̄ as in (4.48). Consider the open set
V = U ∩ { x ∈ R^n | h_i(x) < 0, 1 ≤ i ≤ ℓ, i ∉ E }  (4.50)
V ∩ { x ∈ R^n | h_i(x) = 0, i ∈ E }  (4.51)
where it has the local maximizer x̄. Because the constraint qualification holds for
the gradients Dh_i(x̄) for i ∈ E, there are Lagrange multipliers µ_i for i ∈ E so that
(4.49) holds. It remains to show that they are nonnegative.
Suppose µ_j < 0 for some j ∈ E. Because x̄ is in the interior of V, for sufficiently
small ε > 0 we can find ∆x ∈ R^n so that x̄ + ∆x ∈ V and, as in (4.40),
h_j(x̄ + ∆x) = −ε  (4.52)
and h_i(x̄ + ∆x) = 0 for i ∈ E − {j}, so that x̄ + ∆x ∈ X in (4.45), that is, all
inequality constraints are fulfilled. Then, as in (4.41),
[Figure 4.12: the graphs of f(x) and h(x) over an interval, where h(x) ≤ 0 holds
exactly on the two intervals [a, b] and [c, d].]
The sign conditions in the KKT theorem are most easily remembered (or re-
constructed) for a single constraint in dimension n = 1, as shown in Figure 4.12.
There the condition h( x ) ≤ 0 holds on the two intervals [ a, b] and [c, d] and is tight
at any end of either interval. For x = a both f and h have a negative derivative,
and hence D f ( x ) = µDh( x ) for some µ ≥ 0, and indeed x is a local maximizer
of f . For x ∈ {b, c} the derivatives of f and h have opposite sign, and in each
case D f ( x ) = λDh( x ) for some λ but λ < 0, so these are not maximizers of f .
However, in that case − D f ( x ) = −λDh( x ) and hence both b and c are local max-
imizers of − f and hence local minimizers of f , in agreement with the picture. For
x = d we have a local maximum f ( x ) with D f ( x ) > 0 but Dh( x ) = 0 and hence
no µ with D f ( x ) = µDh( x ) because the constraint qualification fails. Moreover,
there are two points x in the interior of [c, d] where f has zero derivative, which
is a necessary condition for a local maximum of f because h( x ) ≤ 0 is not tight.
One of these points is indeed a local maximum.
Method 4.12 The following is a “cookbook procedure” to use the KKT Theorem
4.11 in order to find the optimum of a function f : Rn → R.
1. Write all inequality constraints in the form h_i(x) ≤ 0 for 1 ≤ i ≤ ℓ. In
particular, write a constraint such as g(x) ≥ 0 in the form −g(x) ≤ 0.
2. Assert that the functions f, h1, . . . , hℓ are C1 functions on R^n. If the function f
is to be minimized, replace it by −f to obtain a maximization problem.
3. Check if the set
S = { x ∈ R^n | h_i(x) ≤ 0, 1 ≤ i ≤ ℓ }  (4.54)
is bounded and hence compact, which ensures the existence of a (global) maximum
of f on S by the Theorem of Weierstrass.
3a. If not, check if the set
T = S ∩ { x ∈ Rn | f ( x ) ≥ c } (4.55)
5. Compare the function values of f ( x ) found in 4b, and of f ( x ) for the critical
points x in 4a, to determine the global maximum (which may occur for more
than one maximizer).
The main step in this method is Step 4. As an example, we apply Method 4.12
to the problem
maximize x + y subject to y/2 ≤ x/2, y ≤ 5/4 − x²/4, y ≥ 0 .  (4.56)
Here
f(x, y) = x + y
h1(x, y) = −x/2 + y/2
h2(x, y) = x²/4 + y − 5/4  (4.57)
h3(x, y) = −y
with gradients
D f(x, y) = (1, 1)
Dh1(x, y) = (−1/2, 1/2)
Dh2(x, y) = (x/2, 1)  (4.58)
Dh3(x, y) = (0, −1) .
There are eight possible subsets E of {1, 2, 3}. If E = ∅, then (4.49) holds if
D f ( x, y) = (0, 0), which is never the case. Next, consider the three “corners”
of the set S which are defined when two inequalities are tight, where E has two
elements.
If E = {1, 2} then h1(x, y) = 0 and h2(x, y) = 0 hold if x = y and x²/4 + x −
5/4 = 0, or x² + 4x − 5 = 0, that is, (x − 1)(x + 5) = 0 or x ∈ {1, −5}, where only
x = y = 1 fulfills y ≥ 0, so this is the point a = (1, 1) shown in Figure 4.13. In
this case h3 (1, 1) = −1 < 0, so the third inequality is indeed not tight (if it was
then this would correspond to the case E = {1, 2, 3}). Then the two gradients
are Dh1(1, 1) = (−1/2, 1/2) and Dh2(1, 1) = (1/2, 1), which are not scalar multiples
of each other and therefore linearly independent, so the constraint qualification
Figure 4.13 The set S defined by the constraints in (4.57). The triple short lines
next to each line defined by hi ( x, y) = 0 (for i = 1, 2, 3) show the side
where hi ( x, y) ≤ 0 holds, abbreviated as hi ≤ 0. The (infinite) set
C is the cone of all nonnegative linear combinations of the gradients
Dh1 ( x, y) and Dh2 ( x, y) for the point a = ( x, y) = (1, 1), which does
not contain D f ( x, y), so f ( a) is not a local maximum. At the point
b = (2, 1/4) we have D f(b) = µ2 Dh2(b) for the (only) tight constraint
h2(b) = 0, and µ2 ≥ 0, so f(b) is a local maximum.
holds. Because these are two linearly independent vectors in R2 , any vector, in
particular D f (1, 1), can be represented as a linear combination of them. That is,
there are µ1 and µ2 with
D f(1, 1) = (1, 1) = µ1 Dh1(1, 1) + µ2 Dh2(1, 1) = µ1 (−1/2, 1/2) + µ2 (1/2, 1),  (4.59)
which are uniquely given by µ1 = −2/3, µ2 = 4/3. Because µ1 < 0, we do not have
a local maximum of f. We can also see this in the picture: By allowing the constraint
h1 ( x, y) ≤ 0 to be non-tight and keeping h2 ( x, y) ≤ 0 tight, ( x, y) can move along
the line h2 ( x, y) = 0 and increase f ( x, y) in the direction D f ( x, y) (exactly as in
(4.53) in the proof of Theorem 4.11 for a negative µ j ). Figure 4.13 shows the cone C
spanned by Dh1 ( a) and Dh2 ( a), that is, C = {µ1 Dh1 ( a) + µ2 Dh2 ( a) | µ1 , µ2 ≥ 0 }.
Only gradients D f ( a) in that cone “push” the function values of f in such a way
that f is maximized at a. This is not the case here.
If E = {1, 3} then h1 ( x, y) = 0 and h3 ( x, y) = 0 require x = y and y = 0,
which is the point (0, 0). The two gradients Dh1 (0, 0) and Dh3 (0, 0) in (4.58) are
linearly independent. The unique solution to D f (0, 0) = (1, 1) = µ1 Dh1 (0, 0) +
µ3 Dh3 (0, 0) = µ1 (−1/2, 1/2) + µ3 (0, −1) is µ1 = −2, µ3 = −2. Here both multi-
pliers are negative, so then − D f (0, 0) is a positive combination of Dh1 (0, 0) and
Dh3 (0, 0), which shows that f (0, 0) is a local minimum of f (it is easily seen to be the
global minimum).
If E = {2, 3} then h2 ( x, y) = 0 and h3 ( x, y) = 0 require y = 0 and (1/4) x2 − 5/4 = 0,
that is, x = √5 (for x = −√5 we would have h1 ( x, 0) > 0). Then Dh2 (√5, 0) =
(√5/2, 1) and Dh3 (√5, 0) = (0, −1), which are also linearly independent. The
unique solution to D f (√5, 0) = µ2 Dh2 (√5, 0) + µ3 Dh3 (√5, 0), that is, to (1, 1) =
µ2 (√5/2, 1) + µ3 (0, −1), is µ2 = 2/√5 and µ3 = 2/√5 − 1 < 0, so this point is
also not a local maximum, even though it has the highest value √5 of the three
“corners” of the set S considered so far.
If E = {1, 2, 3} then there is no common solution to the three equations
hi ( x, y) = 0 for i ∈ E because any two of them already have different unique
solutions.
When E is a singleton {i }, the gradient Dhi ( x, y) is always a nonzero vec-
tor and so the constraint qualification holds. If E = {1} then (4.49) requires
D f ( x, y) = (1, 1) = µ1 Dh1 ( x, y) = µ1 (−1/2, 1/2), which has no solution µ1 because
the two gradients are not scalar multiples of each other. The same applies when
E = {3} where D f ( x, y) = (1, 1) = µ3 Dh3 ( x, y) = µ3 (0, −1) has no solution µ3 .
However, for E = {2} we do have a solution to Step 4b in Method 4.12. The
equation D f ( x, y) = (1, 1) = µ2 Dh2 ( x, y) = µ2 ((1/2) x, 1) has the unique solution
µ2 = 1 ≥ 0 and x = 2, and h2 ( x, y) = 0 is solved for 4/4 + y − 5/4 = 0 or y = 1/4. This
is the point b = (2, 1/4) shown in Figure 4.13. Hence, f (b) is a local maximum of f .
Because it is the only local maximum found, it is also the global maximum of f ,
which exists as confirmed in Step 3.
A rather laborious part of Method 4.12 in this example has been to compute
the multipliers µi for i ∈ E in Step 4b and to check their signs. We have done
this in detail to explain the interpretation of these signs, with the cone C shown
in Figure 4.13 at the candidate point a which fulfills all conditions except for the
sign of µ1 . However, there is a shortcut which avoids computing the multipliers,
by going directly to Step 5, and simply comparing the function values at these
points. In our example, the candidate points were (1, 1), (0, 0), (√5, 0), and
(2, 1/4), with corresponding function values 2, 0, √5, and 9/4, of which the latter is
the largest (we have 9/4 > √5 because (9/4)2 = 81/16 > 5).
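The bookkeeping in Step 4b is easy to mechanise. The following Python sketch (an illustration, not part of Method 4.12 itself; it assumes numpy is available and hardcodes the data of (4.56) together with the candidate points and tight sets E found above) solves the linear system (4.49) for each candidate and checks the signs of the multipliers:

    import numpy as np

    # Data of example (4.56): f(x, y) = x + y with three constraints h_i <= 0.
    def Df(x, y):
        return np.array([1.0, 1.0])

    def Dh(x, y):
        # Gradients Dh1, Dh2, Dh3 from (4.58), as rows of a matrix.
        return np.array([[-0.5, 0.5], [0.5 * x, 1.0], [0.0, -1.0]])

    # Candidate points with their sets E of tight constraints (0-based indices).
    candidates = [((1.0, 1.0), [0, 1]),
                  ((0.0, 0.0), [0, 2]),
                  ((np.sqrt(5.0), 0.0), [1, 2]),
                  ((2.0, 0.25), [1])]

    for (x, y), E in candidates:
        G = Dh(x, y)[E]                    # gradients of the tight constraints
        # Solve Df = sum of mu_i * Dh_i over i in E, that is, G^T mu = Df.
        mu, *_ = np.linalg.lstsq(G.T, Df(x, y), rcond=None)
        ok = np.allclose(G.T @ mu, Df(x, y)) and bool(np.all(mu >= 0))
        print((x, y), "mu =", np.round(mu, 4), "passes 4b:", ok)

Only the point b = (2, 1/4) passes, in agreement with the analysis above.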
We consider a second example that has two local maxima, where one local
maximum has a tight constraint that has a zero multiplier. The problem is again
about a function f : R2 → R and says: maximize
f ( x, y) = x2 − y (4.60)
subject to
h1 ( x, y) = − x ≤ 0 , h2 ( x, y) = x2 + y2 − 1 ≤ 0 . (4.61)
Figure 4.14 The set S defined by the constraints in (4.61), with a level set
f ( x, y) = c, the points a and b, and the gradients D f , Dh1 , Dh2 .
The corresponding set S is shown in Figure 4.14 and compact, so f has a maxi-
mum. We have D f ( x, y) = (2x, −1), Dh1 ( x, y) = (−1, 0), and Dh2 ( x, y) = (2x, 2y).
At the point a = (0, −1), both constraints are tight, and D f (0, −1) = (0, −1) =
µ1 Dh1 (0, −1) + µ2 Dh2 (0, −1) = µ1 (−1, 0) + µ2 (0, −2) holds with µ1 = 0 and
µ2 = 1/2 ≥ 0, so f ( a) is a local maximum where the tight constraint h1 has the
zero multiplier µ1 . If only the second constraint is tight, the equation D f ( x, y) =
(2x, −1) = µ2 Dh2 ( x, y) = µ2 (2x, 2y) has the following solution: We have x > 0
because the first constraint h1 ( x, y) ≤ 0 is not tight, which requires µ2 = 1 and
hence 2y = −1, that is, y = −1/2. Then h2 ( x, y) = 0 means x2 + 1/4 − 1 = 0 or
x = √3/2. This is the point b = (√3/2, −1/2) with f (b) = 3/4 + 1/2 = 5/4, which is
larger than f ( a) = f (0, −1) = 1, so f (b) is the global maximum of f .
Finally, we consider an example of the KKT theorem where the constraint
qualification fails, where unusually the set over which f is optimized does not
have a cusp. This example provides also a motivation for the next chapter on
linear optimization. The problem says:
f ( x, y) = 2x + y, D f ( x, y) = (2, 1)
h1 ( x, y) = − x ≤ 0, Dh1 ( x, y) = (−1, 0)
h2 ( x, y) = −y ≤ 0, Dh2 ( x, y) = (0, −1) (4.64)
h3 ( x, y) = y · ( x + y − 1) ≤ 0, Dh3 ( x, y) = (y, x + 2y − 1).
We first consider the corners of the triangle as possible solutions to the KKT
conditions. (It is obviously much easier to just evaluate the function on those
Figure 4.16 The feasible set and the objective function for the example (4.63),
and the optimal point (1, 0).
corners as in Step 5 of Method 4.12, but we want to check if the KKT theorem
can be applied.) Let ( x, y) = (0, 0). Then all three constraints are tight, with E =
{1, 2, 3} in (4.48). The three vectors of derivatives Dh1 (0, 0), Dh2 (0, 0), Dh3 (0, 0)
in R2 are necessarily linearly dependent, so the constraint qualification fails
and we have to investigate this critical point as a possible maximum.
For ( x, y) = (1, 0), we have h2 (1, 0) = 0 and h3 (1, 0) = 0 but h1 (1, 0) < 0, so
E = {2, 3} in (4.48). By (4.64), Dh2 (1, 0) = (0, −1) and Dh3 (1, 0) = (0, 0), which
are linearly dependent vectors, so this is another critical point where the constraint
qualification fails.
For ( x, y) = (0, 1), the tight constraints are given by h1 (0, 1) = h3 (0, 1) = 0
whereas h2 (0, 1) < 0, so E = {1, 3} in (4.48). By (4.64), Dh1 (0, 1) = (−1, 0) and
Dh3 (0, 1) = (1, 1), which are linearly independent vectors. We want to find µ1
and µ3 that are nonnegative so that D f (0, 1) = µ1 Dh1 (0, 1) + µ3 Dh3 (0, 1),
that is,
(2, 1) = µ1 (−1, 0) + µ3 (1, 1) ,
which has the unique solution µ3 = 1 and µ1 = −1. Because µ1 < 0, the KKT
conditions (4.46) fail and ( x, y) = (0, 1) cannot be a local maximum of f (nor, for
that matter, a minimum, because for a minimum of f , that is, a maximum of − f ,
we would need µ1 ≤ 0 and µ3 ≤ 0, but µ3 > 0).
For completeness, we consider the cases of fewer tight constraints. E = ∅
would require D f ( x, y) = (0, 0), which is never the case. If E = {1} then D f ( x, y)
would have to be a scalar multiple of Dh1 ( x, y), but it is not, and neither is it a scalar
multiple of Dh2 ( x, y) when E = {2}. Consider E = {3}, so h3 ( x, y) = 0 is the
only tight constraint, that is, x > 0 and y > 0. Then h3 ( x, y) = 0 is equivalent to
x + y − 1 = 0. Then we need µ3 ≥ 0 so that D f ( x, y) = µ3 Dh3 ( x, y), that is,
(2, 1) = µ3 (y, x + 2y − 1) = µ3 (y, y) (using x + y − 1 = 0), which has no solution µ3
because the two components of (2, 1) differ. So this case yields no candidate, and
comparing the remaining critical points gives the maximum at (1, 0) with f (1, 0) = 2,
as shown in Figure 4.16.
Chapter 5
Linear optimization
In this form, the problem is in the standard inequality form of a linear optimiza-
tion problem, also called linear programming problem, or just linear program. (The
term “programming” was popular in the middle of the 20th century when opti-
mization problems started to be solved with computer programs, with electronic
computers also being developed around the same time.)
A linear function f : Rn → R is of the form
f ( x1 , . . . , x n ) = c1 x1 + · · · + c n x n (5.2)
With c = (c1 , . . . , cn ), we write (5.2) as f ( x ) = c>x, using the
matrix product, which in this case produces a 1 × 1 matrix, which is a real num-
ber that represents the normal scalar product of the vectors c and x in (5.2). For
that reason, we write the multiplication of a vector x with a scalar λ in the form
xλ (rather than as λx) because it is the product of an n × 1 with a 1 × 1 matrix.
This consistency is very helpful in re-grouping products of several matrices and
vectors.
Recall that we write the derivative Dh( x ) of a function h as a row vector, so
that we multiply it with a scalar like λ from the left as in λDh( x ). Also, when we
write x = ( x1 , . . . , xn ), say, then this is just meant to define x as an n-tuple of real
numbers and not as a row vector, because otherwise we would always have to
introduce x tediously as x = ( x1 , . . . , xn )>. The thing to remember is that when
we use matrix multiplication, then a vector like x is always a column vector and
x > is a row vector.
Let c 6= 0 (where 0 is the vector with all components zero, in any dimension),
and let f be the linear function defined by f ( x ) = c>x as in (5.2). The set { x |
f ( x ) = 0} where f takes value 0 is a linear subspace of Rn . By definition, it
consists of all vectors x that are orthogonal to c, that is, have scalar product 0
with c. If n = 2, then this “nullspace” of f is a line, but in general it will be a
“hyperplane” in Rn , a space of dimension n − 1.
More generally, let u ∈ R and consider the set
H = { x ∈ Rn | f ( x ) = u} = { x ∈ Rn | c>x = u} (5.3)
where f takes value u, which we have earlier called a contour set or level set for f .
Then for any two x and x̂ on this level set H, that is, so that f ( x ) = f ( x̂ ) = u,
we have c>( x − x̂ ) = 0, so that the vector x − x̂ is orthogonal to c. Then H is also
called a hyperplane through the point x (which does not contain the origin 0 unless
u = 0) with normal vector c. To repeat, such a hyperplane H is of the form (5.3)
for some c ∈ Rn , c 6= 0, and u ∈ R. The different contour sets for f are therefore
parallel hyperplanes, all with the same normal vector c.
Figure 5.1 shows an example of such level sets, where these “hyperplanes”
are contour lines because n = 2. The vector c, here c = (2, −1), is orthogonal to
any such level set. Moreover, c points in the direction in which the function value
of f ( x ) increases, because if we replace x by x + c then f ( x ) changes from c>x to
f ( x + c) = c>x + c>c, which is larger than c>x because c>c = c12 + · · · + cn2 > 0
for c 6= 0. Note that c may have negative components (as in the figure). Only the
direction of c matters to find out where f ( x ) gets larger.
Similar to a hyperplane H defined by c and u in (5.3), a halfspace S is defined
by an inequality according to
S = { x ∈ Rn | c>x ≤ u} (5.4)
which consists of all points x that are on the hyperplane H or “below” it, that
is, with smaller values of c>x than the points on H. Figure 5.1 shows such a
Figure 5.1 Left: Contour lines (level sets) of the function f : R2 → R defined
by f ( x ) = c>x for c = (2, −1). Right: Halfspace S in (5.4) given by
c>x ≤ 5.
halfspace S for c = (2, −1) and u = 5, which contains, for example, the point
x = (2.5, 0). It is customary to “shade” the side of the hyperplane H that defines
S with a few small parallel strokes as shown in the picture, and then it is not
needed to indicate c which is the orthogonal vector to H that points away from S.
x2
c
(0,1)
(1,0)
(0,0)
x1
Figure 5.2 The feasible set and the objective function for the example (5.1), and
the optimal point (1, 0).
With these conventions, Figure 5.2 gives a graphical description of the prob-
lem in (5.1), where the feasible set where all inequalities hold is the intersection of
the three halfspaces defined by x1 ≥ 0, x2 ≥ 0, and x1 + x2 ≤ 1. This is the shaded
triangle. In this graphical way, the optimal solution is nearly obvious.
The example in this section is taken from J. Matoušek and B. Gärtner (2007), Un-
derstanding and Using Linear Programming (Springer Verlag, Berlin).
Consider as a further example the LP
maximize x1 + x2
subject to x1 ≥ 0
x2 ≥ 0
− x1 + x2 ≤ 1 (5.5)
x1 + 6x2 ≤ 15
4x1 − x2 ≤ 10 .
The set of points ( x1 , x2 ) in R2 that fulfill these inequalities is called the feasible set
and shown in Figure 5.3.
x2
D f = (1, 1)
− x1 + x2 ≤ 1
x1 + 6x2 ≤ 15
(3, 2)
x1 + x2 = 5
x2 ≥ 0
x1
x1 ≥ 0
4x1 − x2 ≤ 10
Figure 5.3 Feasible set and objective function vector (1, 1) for the LP (5.5), with
optimum at (3, 2) and objective function value 5.
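Such a small LP can also be solved numerically to confirm the picture. A minimal sketch, assuming scipy is installed (its function linprog minimizes, so we negate the objective; the default variable bounds are already x ≥ 0):

    from scipy.optimize import linprog

    # LP (5.5): maximize x1 + x2 subject to the three inequality constraints.
    res = linprog(c=[-1, -1],                       # minimize -(x1 + x2)
                  A_ub=[[-1, 1], [1, 6], [4, -1]],  # -x1+x2<=1, x1+6x2<=15, 4x1-x2<=10
                  b_ub=[1, 15, 10])
    print(res.x, -res.fun)                          # expected: [3. 2.] and 5.0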
Figure 5.4 Feasible set and objective function vector ( 61 , 1) with non-unique
maximum along the side where the constraint x1 + 6x2 ≤ 15 is tight.
A linear program may also be infeasible, that is, have no point that fulfills all
its constraints. This is the case for the constraints
x1 ≥ 0
x2 ≥ 0
− x1 + x2 ≥ 1 (5.6)
x1 + 6x2 ≤ 15
4x1 − x2 ≥ 10 .
Finally, an optimal solution need not exist even when there are feasible solu-
tions. This happens when the objective function can attain arbitrarily large val-
ues; such a linear program is called unbounded. This is the case when we remove
the constraints 4x1 − x2 ≤ 10 and x1 + 6x2 ≤ 15 from the initial example (5.5), as
shown in Figure 5.6.
Figure 5.5 Example of an infeasible set, for the constraints (5.6). Recall that the
little strokes indicate the side where the inequality is valid, and here
there is no point ( x1 , x2 ) where all inequalities are valid. This would
be the case even without the constraints x1 ≥ 0 and x2 ≥ 0.
Figure 5.6 The unbounded feasible set and objective function vector (1, 1) that
result from (5.5) when the constraints 4x1 − x2 ≤ 10 and x1 + 6x2 ≤ 15
are removed.
The pictures shown in this section provide a good intuition of how linear pro-
grams look in principle. However, this graphical method hardly extends beyond
R2 or R3 . Our development of the theory of linear programming will proceed
largely algebraically, with some geometric intuition for the important Lemma 5.5
of Farkas.
We use the following notation. For positive integers m, n, the set of m × n matrices
is denoted by Rm×n . An n-vector is an element of Rn . Unless stated otherwise, all
vectors are column vectors, so a vector x in Rn is considered as an n × 1 matrix.
Its transpose x > is the corresponding row vector in R1×n . The components of an
n-vector x are x1 , . . . , xn . The vectors 0 and 1 have all components equal to zero
and one, respectively, and have suitable dimension, which may vary with each
use of 0 or 1. An inequality between vectors like x ≥ 0 holds for all components.
The identity matrix, of any dimension, is denoted by I.
A linear optimization problem or linear program (LP) says: optimize (max-
imize or minimize) a linear objective function subject to linear constraints (in-
equalities or equalities).
The standard inequality form of an LP is given by an m × n matrix A, an m-
vector b and an n-vector c and says:
maximize c>x
subject to Ax ≤ b , (5.7)
x ≥ 0.
The horizontal line is often written to separate the objective function from the
constraints.
Example 5.1 Consider the LP
maximize 8x1 + 10x2 + 5x3
subject to 3x1 + 4x2 + 2x3 ≤ 7
x1 + x2 + x3 ≤ 2 (5.8)
x1 , x2 , x3 ≥ 0 .
A feasible solution is x1 = 1, x2 = 1, x3 = 0 with objective function value 18.
How good is this solution? Multiplying the first primal inequality by y1 = 1 and
the second by y2 = 6 and adding them gives
(3 + 6) x1 + (4 + 6) x2 + (2 + 6) x3 ≤ 7 + 6 · 2
or
9x1 + 10x2 + 8x3 ≤ 19.
In this inequality, which holds for any feasible solution, all coefficients of the non-
negative variables x j are at least as large as in the primal objective function, so the
right-hand side 19 is certainly an upper bound for this objective function. In fact,
we can obtain an even better bound by multiplying the two primal inequalities
by y1 = 2 and y2 = 2, getting
(3 · 2 + 2) x1 + (4 · 2 + 2) x2 + (2 · 2 + 2) x3 ≤ 2 · 7 + 2 · 2
or
8x1 + 10x2 + 6x3 ≤ 18.
Again, all coefficients are at least as large as in the primal objective function. Thus,
it cannot be larger than 18, which was achieved by the above solution x1 = 1,
x2 = 1, x3 = 0, which is therefore optimal.
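These bounding computations are plain matrix-vector products, as the following numpy sketch with the data of (5.8) shows (an illustration only):

    import numpy as np

    A = np.array([[3, 4, 2], [1, 1, 1]])   # constraint matrix of (5.8)
    b = np.array([7, 2])
    c = np.array([8, 10, 5])

    for y in ([1, 6], [2, 2]):             # the two multiplier choices above
        y = np.array(y)
        print("y =", y, " y^T A =", y @ A, " bound y^T b =", y @ b)

    x = np.array([1, 1, 0])                # the feasible solution with value 18
    print("c^T x =", c @ x, " feasible:", bool(np.all(A @ x <= b)))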
In general, the dual LP for the primal LP (5.7) is obtained as follows:
• Multiply each of the m inequalities in Ax ≤ b by a nonnegative variable yi
(which preserves the inequality) and add up the resulting inequalities.
• Sum the resulting entries of each of the n columns and require that the re-
sulting coefficient of x j for j = 1, . . . , n is at least as large as the coefficient
c j of the objective function. (Because x j ≥ 0, this will at most increase the
objective function.)
• Minimize the resulting right-hand side y1 b1 + · · · + ym bm (because it is an
upper bound for the primal objective function).
This defines the dual LP of the given primal LP (5.7), which says:
minimize y>b
subject to y>A ≥ c> , (5.9)
y ≥ 0 .
Primal and dual LP can be remembered with the following “Tucker diagram”:
x≥0
y≥0 A ≤ b (5.10)
∨ ,→ min
c> → max
The diagram (5.10) shows the m × n matrix A with the m-vector b on the right and
the row vector c> at the bottom. The top shows the primal variables x with their
constraints x ≥ 0. The left-hand side shows the dual variables y with their con-
straints y ≥ 0. The primal LP is to be read horizontally, with constraints Ax ≤ b,
and the objective function c>x that is to be maximized. The dual LP is to be read
vertically, with constraints y>A ≥ c> (where in the diagram (5.10) ≥ is written
vertically as ∨ ), and the objective function y>b that is to be minimized. A way
to remember the direction of the inequalities is to see that one inequality Ax ≤ b
points “towards” A and the other, y>A ≥ c>, “away from” A, where maximiza-
tion is subject to upper bounds and minimization subject to lower bounds, apart
from the nonnegativity constraints for x and y.
The fact that the primal and dual objective functions are mutual bounds is
known as the “weak duality” theorem, which is very easy to prove – essentially
in the way we have motivated the dual LP above.
Theorem 5.2 (Weak LP duality) For a pair x, y of feasible solutions of the primal LP
(5.7) and its dual LP (5.9), the objective functions are mutual bounds:
c>x ≤ y>b .
If thereby c>x = y>b (equality holds), then these two solutions are optimal for both LPs.
The following “strong duality” theorem is the central theorem of linear pro-
gramming.
Theorem 5.3 (Strong LP duality) Whenever both the primal LP (5.7) and its dual LP
(5.9) are feasible, they have optimal solutions with equal value of their objective functions.
We will prove this theorem in Section 5.4. Its proof is not trivial. In fact, many
theorems in economics have a hidden LP duality so that they can be proved by
writing down a suitable LP and interpreting its dual LP. For that reason, Theo-
rem 5.3 is extremely useful.
∑ j∈ J A j x j = b . (5.13)
where we can assume that the set S = { j ∈ J | z j > 0} is not empty (otherwise
replace z by −z). Then
∑ j∈ J A j ( x j − z j α) = b
Figure 5.7 Left: Vectors A1 , A2 , A3 , A4 , the cone C generated by them (which
extends to infinity between the two “rays” that extend A3 and A2 ),
and a vector b not in C. Right: A separating hyperplane H for b with
normal vector y = v − b.
The right diagram in Figure 5.7 shows a vector y so that y>A j ≥ 0 for all j with
1 ≤ j ≤ n, and y>b < 0. The set H = {z ∈ Rm | y>z = 0} is called a separating
hyperplane with normal vector y because all vectors A j are on one side of H (they
fulfill y>A j ≥ 0, which includes the case y>A j = 0 where A j belongs to H, like
A2 in Figure 5.7), whereas b is strictly on the other side of H because y>b < 0.
Farkas’s Lemma asserts that such a separating hyperplane exists for any b that
does not belong to C.
Lemma 5.5 (Farkas) Let A ∈ Rm×n and b ∈ Rm . Then exactly one of the following
statements holds:
(a) ∃ x ∈ Rn : x ≥ 0, Ax = b, or
(b) ∃y ∈ Rm : y>A ≥ 0>, y>b < 0.
In Lemma 5.5, it is clear that (a) and (b) cannot both hold because if (a) holds,
then y>A ≥ 0> implies y>b = y>( Ax ) = (y>A) x ≥ 0.
If (a) is false, that is, b does not belong to the cone C in (5.12), then y can be
constructed by the following intuitive geometric argument: Take a vector v in C
that is closest to b (see Figure 5.7), and let y = v − b. We will show that y fulfills
the conditions in (b).
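This construction can be carried out numerically with nonnegative least squares, which computes exactly the point of C nearest to b. A sketch with a small hypothetical instance (two generating columns in R2 and a vector b outside their cone), assuming scipy is available:

    import numpy as np
    from scipy.optimize import nnls

    A = np.array([[1.0, 2.0],
                  [2.0, 1.0]])      # columns A1 = (1,2), A2 = (2,1) generate C
    b = np.array([2.0, -1.0])       # b does not belong to the cone C

    # nnls finds x >= 0 minimizing ||Ax - b||, so v = Ax is the point of C
    # closest to b, and y = v - b is a normal vector of a separating hyperplane.
    x, dist = nnls(A, b)
    v = A @ x
    y = v - b
    print("v =", v, " y =", y)
    print("y^T A =", y @ A, "(componentwise >= 0),  y^T b =", y @ b, "(< 0)")

In this instance v lies on the ray spanned by A2 , so the corresponding entry of y>A is zero, that is, A2 lies on the hyperplane H, just like in Figure 5.7.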
Apart from this geometric argument, an important part of the proof is to show
that the cone C is closed, that is, it contains any point nearby. Otherwise, b could be
a point near C but not in C, which would mean that the distance kv − bk for any v
in C can be arbitrarily small, where kzk denotes the Euclidean norm, kzk = √(z>z).
In that case, one could not define y as described. We first show, as a separate
property, that C is closed.
Lemma 5.6 The cone C in (5.12) generated by the vectors A1 , . . . , An is closed.
Figure 5.8 Illustration of the proof of Lemma 5.6 where J = {2} since v(k) for
large k is a positive linear combination of A2 only.
Proof. Let b be a point in Rm near C, that is, for all ε > 0 there is a v in C so
that kv − bk < ε. Consider a sequence v(k) (for k = 1, 2, . . .) of elements of C that
converges to b. By Lemma 5.4, there exists for each k a subset J (k) of {1, . . . , n} and
unique positive real numbers x j (k) for j ∈ J (k) so that the columns A j for j ∈ J (k)
are linearly independent and
v(k) = ∑ j∈ J (k) A j x j (k) .
There are only finitely many different sets J (k) , so there is a set J that appears
infinitely often among them (see Figure 5.8 for an example). We consider the
subsequence of the vectors v(k) that use this set, that is,
v(k) = ∑ j∈ J A j x j (k) = A J x J (k) (5.15)
where A J is the matrix with columns A j for j ∈ J and x J (k) is the vector with
components x j (k) for j ∈ J. Now, x J (k) in (5.15) is a continuous function of v(k) : In
order to see this, consider a set I of | J | linearly independent rows of A J , let A I J be
the square submatrix of A J with these rows, and let v I (k) be the subvector of v(k)
with these rows, so that x J (k) = A I J −1 v I (k) in (5.15). Hence, as v(k) converges to b,
the | J |-vector x J (k) converges to some x ∗J with b = A J x ∗J , where x J (k) > 0 implies
x ∗J ≥ 0, which shows that b ∈ C. So C is closed.
Remark 5.7 In Lemma 5.6, it is important that C is the cone generated by a finite set
A1 , . . . , An of vectors. The cone generated from an infinite set may not be closed. For
example, let C be the set of nonnegative linear combinations of the vectors (n, 1) in R2 ,
for n = 0, 1, 2, . . .. Then (1, 0) is a vector near C that does not belong to C.
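A quick numerical illustration of this remark (a sketch using numpy): the combinations (1/n)(n, 1) = (1, 1/n) belong to C and converge to (1, 0), but no nonnegative combination of the vectors (n, 1) other than the zero vector has second component zero.

    import numpy as np

    target = np.array([1.0, 0.0])
    for n in (1, 10, 100, 1000):
        v = np.array([n, 1.0]) / n      # nonnegative multiple of (n, 1), so v is in C
        print(n, v, "distance to (1,0):", np.linalg.norm(v - target))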
Figure 5.9 Proof why a point c with y>(c − v) < 0, that is, (5.18), creates a point
cε closer to b than v, where cε is a convex combination of v and c. This
implies c 6∈ C.
which by (5.18) is less than kv − bk2 for sufficiently small positive ε, which con-
tradicts the minimality of kv − bk2 for v ∈ C (see also Figure 5.9). So (5.17) holds.
In particular, for c = A j + v we have y>( A j + v) ≥ y>v and thus y>A j ≥ 0,
for 1 ≤ j ≤ n, that is, y>A ≥ 0>.
Equations (5.17) and (5.16) imply (v − b)>c > (v − b)>b, that is, y>c > y>b,
for all c ∈ C. For c = 0 this shows 0 > y>b.
Lemma 5.8 (Farkas with inequalities) Let A ∈ Rm×n and b ∈ Rm . Then exactly
one of the following statements holds:
(a) ∃ x ∈ Rn : x ≥ 0, Ax ≤ b, or
(b) ∃y ∈ Rm : y ≥ 0, y>A ≥ 0>, y>b < 0 .
Proof. Clearly, there is a vector x so that Ax ≤ b and x ≥ 0 if and only if there are
x ∈ Rn and s ∈ Rm with
Ax + s = b, x ≥ 0, s ≥ 0. (5.19)
The system (5.19) is a system of equations as in Lemma 5.5 with the matrix [ A I ]
instead of A, where I is the m × m identity matrix, and vector ( x, s) instead of x.
The condition y>[ A I ] ≥ 0> in Lemma 5.5(b) is then simply y>A ≥ 0>, y ≥ 0 as
stated here in (b).
− A>y ≤ − c
Ax ≤ b (5.21)
−c>x + b>y ≤ 0
but
−u>c + v>b < 0 . (5.23)
in contradiction to (5.23).
If t > 0, then u and v are essentially primal and dual feasible solutions that
violate weak LP duality, because then by (5.22) bt ≥ Au and v>A ≥ tc>, and
therefore
v>bt ≥ v>Au ≥ tc>u ,
that is, v>b ≥ c>u after dividing by t > 0, again in contradiction to (5.23).
So far, the strong duality Theorem 5.3 makes only a statement when both primal
and dual LP are feasible. In principle, it could be the case that the primal LP has
an optimal solution while its dual is not feasible. The following theorem excludes
this possibility. Its proof is a typical application of Theorem 5.3 itself.
Theorem 5.9 (Boundedness implies dual feasibility) Suppose the primal LP (5.7)
is feasible. Then its objective function is bounded if and only if the dual LP (5.9) is
feasible.
Proof. By weak duality (Theorem 5.2), if the dual LP has a feasible solution y, then
its objective function y>b provides an upper bound for the primal objective func-
tion c>x. Conversely, suppose that the dual LP (5.9) is infeasible, and consider
the following LP which uses an additional real variable t and the vector 1 which
has all components equal to 1:
minimize t
subject to y>A + t1> ≥ c> , y ≥ 0 , t ≥ 0 . (5.24)
The LP (5.24) is feasible (take y = 0 and t large enough), and its objective function
is bounded from below by zero, so it has an optimal solution, with optimal value
t > 0: if t = 0 were feasible then y would be a feasible solution of the dual LP (5.9),
which we assumed to be infeasible. The dual LP of (5.24) says:
maximize c>z
subject to Az ≤ 0 ,
1>z ≤ 1 , (5.25)
z ≥ 0 .
This LP is also feasible with z = 0. By strong duality, it has the same value as its
dual LP (5.24), which is positive, given by c>z = t > 0 for some z that fulfills the
constraints in (5.25). Consider now a feasible solution x to the original primal LP,
that is, Ax ≤ b, x ≥ 0, and let α ∈ R, α ≥ 0. Then A( x + zα) = Ax + Azα ≤
b + 0α = b and x + zα ≥ 0, so x + zα is also a feasible solution to (5.7) with
objective function value c>( x + zα) = c>x + (c>z)α which gets arbitrarily large
with growing α. So the original LP is unbounded. This proves the theorem.
An alternative way of stating the preceding theorem, for the dual LP, is as
follows.
Corollary 5.10 Suppose the dual LP (5.9) is feasible. Then the primal LP (5.7) is infea-
sible if and only if the objective function of the dual LP (5.9) is unbounded.
Proof. This is just an application of Theorem 5.9 with dual and primal exchanged:
Rewrite (5.9) as a primal LP in the form: maximize −b>y subject to − A>y ≤ −c,
y ≥ 0, so that its dual is: minimize − x >c subject to − x >A> ≥ −b>, x ≥ 0, which
is the same as (5.7), and apply Theorem 5.9.
On the other hand, the fact that one LP is infeasible does not imply that its
dual LP is unbounded, because both could be infeasible.
Remark 5.11 It is possible that both the primal LP (5.7) and its dual LP (5.9) are infea-
sible.
An example is the primal LP
maximize x2
subject to x1 ≤ −1
x1 − x2 ≤ 1
x1 , x2 ≥ 0,
with its dual LP
minimize − y1 + y2
subject to y1 + y2 ≥ 0
− y2 ≥ 1
y1 , y2 ≥ 0 .
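Both LPs of this example can be fed to a solver to confirm their infeasibility. A sketch assuming scipy (linprog expects ≤ constraints, so the ≥ constraints are negated; a status of 2 means infeasible):

    from scipy.optimize import linprog

    # Primal: maximize x2 (minimize -x2) s.t. x1 <= -1, x1 - x2 <= 1, x >= 0.
    primal = linprog(c=[0, -1], A_ub=[[1, 0], [1, -1]], b_ub=[-1, 1])
    # Dual: minimize -y1 + y2 s.t. -(y1 + y2) <= 0, y2 <= -1, y >= 0.
    dual = linprog(c=[-1, 1], A_ub=[[-1, -1], [0, 1]], b_ub=[0, -1])
    print(primal.status, dual.status)   # expected: 2 2 (both infeasible)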
                  primal:
dual:             optimal     unbounded    infeasible
optimal           yes         no           no
unbounded         no          no           yes
infeasible        no          yes          yes
Table 5.1 The possibilities for primal and dual LP, where “optimal” means the
LP is feasible and bounded and then has an optimal solution, and
“unbounded” means the LP is feasible but its objective function is
unbounded.
Table 5.1 shows the four possibilities that can occur for the primal LP and its
dual: both have optimal solutions, one is infeasible and the other unbounded, or
both are infeasible. If one LP is feasible, its dual cannot be unbounded by weak
duality (Theorem 5.2), and if it has an optimal solution then its dual cannot be
infeasible by Theorem 5.9.
Table 5.1 does not state the equality of primal and dual objective functions
when both have optimal solutions, but it does state Corollary 5.10. We show that
this implies Farkas’s Lemma 5.8 for inequalities that we have used to prove the
strong duality Theorem 5.3. Consider the LP
maximize 0
subject to Ax ≤ b , (5.26)
x ≥ 0.
with its Tucker diagram
x≥0
y≥0 A ≤ b (5.27)
∨ ,→ min
0> → max
Its dual LP: minimize y>b subject to y>A ≥ 0>, y ≥ 0, is feasible with y = 0.
The LP (5.26) is feasible if and only if there is a solution x to the inequalities
Ax ≤ b, x ≥ 0. By Corollary 5.10, there is no such solution if and only if the
dual is unbounded, that is, assumes an arbitrarily negative value of its objective
function y>b. This is equivalent to the existence of some y ≥ 0 with y>A ≥ 0>
and y>b < 0 which can then be made arbitrarily negative by replacing y with
yα for any α > 0. This proves Lemma 5.8, which can therefore be remembered
with the Tucker diagram (5.27) and Corollary 5.10. So the possibilities described
in Table 5.1 capture the important theorems of LP duality.
In the following, the matrix A and vectors b and c will always have dimen-
sions A ∈ Rm×n , b ∈ Rm , and c ∈ Rn . These data A, b, c will simultaneously
define a primal LP with variables x in Rn , and a dual LP with variables y in
Rm . In the primal LP, we compare Ax with the right-hand side b, and maximize
the objective function c>x. In the dual LP, we compare y>A with the right-hand
side c>, and minimize the objective function y>b. In a general LP, “compare” can
mean inequalities or equations (or a mixture of the two). The corresponding dual
or primal variable will be nonnegative or unconstrained.
We first consider a primal LP with nonnegative variables and equality con-
straints, which is often called an LP in equality form:
maximize c>x
subject to Ax = b , (5.28)
x ≥ 0.
Its dual LP has unconstrained variables y ∈ Rm and says:
minimize y>b
subject to y>A ≥ c> . (5.29)
The motivation for this dual LP is again the weak duality theorem, which is immediate: it says that
for feasible solutions x and y to (5.28) and (5.29) we have x ≥ 0 and y>A − c> ≥ 0
and thus
c>x ≤ y>A x = y>b. (5.30)
Next, consider a primal LP with unrestricted variables x ∈ Rn and only inequality
constraints:
maximize c>x
subject to Ax ≤ b . (5.31)
To find the dual LP to the primal LP (5.31), we can again multiply any of the in-
equalities in Ax ≤ b with a variable yi , with the aim of finding an upper bound
to the primal objective function c>x. The inequality is preserved when yi is non-
negative, but in order to obtain an upper bound on the primal objective function
c>x we have to require that y>A = c>, because the sign of any variable x j is not
known. That is, the dual to (5.31) is
minimize y>b
subject to y>A = c> , y ≥ 0 . (5.32)
Observe that compared to (5.7) the LP (5.31) is missing the nonnegativity con-
straints x ≥ 0, and that compared to (5.9) the dual LP (5.32) states n equations
y>A = c> rather than inequalities.
Again, the choice of primal and dual LP is motivated by weak duality, which
states that for feasible solutions x to (5.31) and y to (5.32) the corresponding ob-
jective functions are mutual bounds. Including proof, it says
c>x = (y>A) x = y>( Ax ) ≤ y>b . (5.33)
Hence, we have the following types of pairs of a primal LP and its dual LP,
including the original more symmetric situation of LPs in inequality form:
• a primal LP (5.28) in equality form with nonnegative variables, and its dual
LP (5.29) with unconstrained variables subject to inequality constraints,
• a primal LP (5.31) with unconstrained variables subject to inequality con-
straints, and its dual LP (5.32) in equality form with nonnegative variables,
• a primal LP (5.7) and its dual LP (5.9), both with nonnegative variables sub-
ject to inequality constraints.
In all cases, by changing signs and transposing the matrix, we see that the dual
of the dual is again the primal.
Remark 5.12 The seemingly missing case of a primal and dual problem with unre-
stricted variables subject to equality constraints, which has no inequality constraints of
any kind, would have as the primal problem: maximize c>x subject to Ax = b, and as its
dual: minimize y>b subject to y>A = c>. Then if and only if both problems are feasible,
any x with Ax = b is primal optimal and any y with y>A = c> is dual optimal. Finding
such solutions is an easy problem of linear algebra and therefore not considered as a linear
optimization problem.
Proof. If Ax = b and y>A = c>, then c>x = y>Ax = y>b, which proves primal
optimality of x (because, given y, it holds for any x) and similarly dual optimality
of y. Such x and y exist if and only if b is in the “column space” of A and c>
is in the “row space” of A, both of which can be determined using Gaussian
elimination. This fails, for example, if Az = 0 but c>z 6= 0 for some z ∈ Rn ;
then the primal problem is unbounded (similar to (5.25) above) and the dual is
infeasible.
An LP in inequality form (5.7) with constraints Ax ≤ b can be converted to equality
form by introducing a slack variable si for each inequality. These slack variables
define a nonnegative vector s in Rm . Then (5.7) is
equivalent to:
maximize c>x
subject to Ax + s = b , (5.34)
x, s ≥ 0 .
This amounts to extending the original constraint matrix A to the right by an
m × m identity matrix and adding coefficients 0 in the objective function for the
slack variables, as shown in the following Tucker diagram:
x≥0 s≥0
y ∈ Rm A I = b (5.35)
∨ ∨ ,→ min
c> 0 0 · · · 0 → max
Note that converting the inequality form (5.7) to an LP in equality form (5.34)
defines a new dual LP with unrestricted variables y1 , . . . , ym , but the former in-
equalities yi ≥ 0 reappear now explicitly via the identity matrix and objective
function zeros introduced with the slack variables, as shown in (5.35). So the
resulting dual LP is exactly the same as in (5.9).
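The conversion itself is mechanical, as the following numpy sketch shows (an illustration with the data of (5.8)):

    import numpy as np

    def to_equality_form(A, b, c):
        # Convert: maximize c^T x s.t. Ax <= b, x >= 0 into the equality
        # form (5.34) by appending the m x m identity for the slacks s.
        m = A.shape[0]
        return np.hstack([A, np.eye(m)]), b, np.concatenate([c, np.zeros(m)])

    A_eq, b_eq, c_eq = to_equality_form(np.array([[3.0, 4, 2], [1, 1, 1]]),
                                        np.array([7.0, 2]),
                                        np.array([8.0, 10, 5]))
    print(A_eq)   # the matrix [A I] of the Tucker diagram (5.35)
    print(c_eq)   # objective extended by zeros for the slack variables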
Even simpler, an LP in inequality form (5.7) can also be seen as the special
case of an LP with unrestricted variables x j as in (5.31) since the condition x ≥ 0
can be written in the form Ax ≤ b by explicitly listing the n inequalities − x j ≤ 0.
That is, Ax ≤ b and x ≥ 0 become, with unrestricted x ∈ Rn , the m + n inequalities
[ A ]        [ b ]
[ −I ] x ≤  [ 0 ]
with an n × n identity matrix I. The corresponding
dual LP according to (5.32) (easily seen with a suitable Tucker diagram) has an
additional n-vector of slack variables r, say, with the dual constraints y>A − r > =
c>, y ≥ 0, r ≥ 0, which are equivalent to the inequalities y>A ≥ c>, y ≥ 0, again
exactly as in (5.9).
It is useful to consider all these forms of linear programs as a special case of
an LP in general form. Such an LP has inequalities and equalities as constraints, as
well as nonnegative and unrestricted variables. Let J be the subset of the column
set {1, . . . , n} where j ∈ J means that x j ≥ 0, and let K be the subset of the row set
{1, . . . , m} where i ∈ K means that row i is an inequality in the primal constraint
with corresponding dual variable yi ≥ 0 (the letter I denotes an identity matrix so
we use K instead). Let J̄ = {1, . . . , n} \ J and K̄ = {1, . . . , m} \ K be the comple-
mentary sets of unconstrained primal and dual variables, with corresponding dual
and primal equality constraints.
To define the LP in general form, we first draw the Tucker diagram, shown in
Figure 5.10. The diagram assumes that columns and rows are arranged so that
those in J and K come first. The big boxes contain the respective parts of the
constraint matrix A, the vertical boxes on the right the parts of the right-hand
side b, and the horizontal box at the bottom the parts of the primal objective
function c>.
                 x j ≥ 0 ( j ∈ J)    x j ∈ R ( j ∈ J̄)
yi ≥ 0 (i ∈ K)                                          ≤ b
                            A
yi ∈ R (i ∈ K̄)                                          = b
                      ∨                  =
                            c>
Figure 5.10 Tucker diagram for an LP in general form.
In order to state the duality theorem concisely, we define the feasible sets X
and Y for the primal and dual LP. The entries of the m × n matrix A are aij in
row i and column j. Let
X = { x ∈ Rn | ∑nj=1 aij x j ≤ bi for i ∈ K,
∑nj=1 aij x j = bi for i ∈ K̄, (5.37)
x j ≥ 0 for j ∈ J } .
Any x belonging to X is called primal feasible, and the primal LP is called feasible
if X is not the empty set ∅. The primal LP is the problem
maximize c>x subject to x ∈ X . (5.38)
(This results when reading the Tucker diagram in Figure 5.10 horizontally.) The
corresponding dual LP has the feasible set
Y = { y ∈ Rm | ∑im=1 yi aij ≥ c j for j ∈ J,
∑im=1 yi aij = c j for j ∈ J̄, (5.39)
yi ≥ 0 for i ∈ K }
and is the problem
minimize y>b subject to y ∈ Y . (5.40)
primal LP                                          dual LP

constraint                                         variable
row i ∈ K :   inequality   ∑nj=1 aij x j ≤ bi      nonnegative    yi ≥ 0
row i ∈ K̄ :   equation     ∑nj=1 aij x j = bi      unconstrained  yi ∈ R

variable                                           constraint
column j ∈ J :   nonnegative    x j ≥ 0            inequality   ∑im=1 yi aij ≥ c j
column j ∈ J̄ :   unconstrained  x j ∈ R            equation     ∑im=1 yi aij = c j

objective function                                 objective function
maximize ∑nj=1 c j x j                             minimize ∑im=1 yi bi
(This results when reading the Tucker diagram in Figure 5.10 vertically.) By re-
versing signs, one can verify that the dual of the dual LP is again the primal.
Table 5.2 shows the roles of the sets K, K̄, J, J̄.
For an LP in general form, the strong duality theorem states that (a) for any
primal and dual feasible solutions, the corresponding objective functions are mu-
tual bounds, (b) if the primal and the dual LP both have feasible solutions, then
they have optimal solutions with the same value of their objective functions, (c)
if the primal or dual LP is bounded, the other LP is feasible. This implies the
possibilities shown in Table 5.1.
Theorem 5.13 (General LP duality) For the primal of LP (5.38) and its dual LP (5.40),
(a) (Weak duality) c>x ≤ y>b for all x ∈ X and y ∈ Y.
(b) (Strong duality) If X 6= ∅ and Y 6= ∅ then c>x = y>b for some x ∈ X and y ∈ Y,
so that both x and y are optimal.
or equivalently
minimize (ŷ − ȳ)>b
subject to (ŷ − ȳ)>A ≥ c> , (5.43)
ŷ, ȳ ≥ 0 .
Any solution y to the dual LP (5.29) with unconstrained dual variables y can be
written in the form (5.43), where ŷ represents the “positive” part of y and ȳ the
negated “negative” part of y, according to ŷi = max{yi , 0} and ȳi = max{−yi , 0}
for 1 ≤ i ≤ m, so that y = ŷ − ȳ.
Suppose this is done so that all rows are inequalities. In a similar way, we
then write an unrestricted primal variable x j for j ∈ J̄ as the difference x̂ j − x̄ j
of two new primal variables that are nonnegative. The jth column A j of A and
the jth component c j of c> are then replaced by a pair A j , − A j and c j , −c j with
coefficients x̂ j , x̄ j , so that aij x j is written as aij x̂ j − aij x̄ j and c j x j as c j x̂ j − c j x̄ j .
For these columns, the resulting pair of inequalities in the dual LP
∑im=1 yi aij ≥ c j ,     ∑im=1 (−yi ) aij ≥ −c j
is then equivalent to a dual equation for j ∈ J̄, as stated in (5.39) and Table 5.2.
The claim then follows as before for the known statements for an LP in inequality
form.
The optimality condition c>x = y>b, already stated in the weak duality Theo-
rem 5.2, is equivalent to a combinatorial condition known as “complementary
slackness”. It states that in each column j and row i at least one of the associated
inequalities in the dual or primal LP has to be tight, that is, hold as an equality. In
a general LP, this is only relevant for the inequality constraints, that is, for j ∈ J
and i ∈ K (see the Tucker diagram in Figure 5.10).
Theorem 5.14 (Complementary slackness) Let x and y be feasible solutions of the
primal LP (5.7) and its dual LP (5.9). Then x and y are both optimal if and only if
(y>A − c>) x = 0 ,     y>(b − Ax ) = 0 , (5.45)
or equivalently
(∑im=1 yi aij − c j ) x j = 0 for all j ,     yi (bi − ∑nj=1 aij x j ) = 0 for all i , (5.46)
that is,
x j > 0 ⇒ ∑im=1 yi aij = c j ,     yi > 0 ⇒ ∑nj=1 aij x j = bi . (5.47)
For an LP in general form (5.38) and its dual (5.40), a feasible pair x ∈ X, y ∈ Y is also
optimal if and only if (5.45) holds, or equivalently (5.46) or (5.47).
Proof. Suppose x and y are feasible for (5.7) and (5.9), so Ax ≤ b, x ≥ 0, y>A ≥ c>,
y ≥ 0. They are both optimal if and only if their objective functions are equal,
c>x = y>b. This means that the two inequalities c>x ≤ y>A x ≤ y>b used to
prove weak duality hold as equalities c>x = y>A x and y>A x = y>b, which are
equivalent to (5.45).
The left equation in (5.45) says
0 = (y>A − c>) x = ∑nj=1 ( ∑im=1 yi aij − c j ) x j . (5.48)
Then y>A ≥ c> and x ≥ 0 imply that the sum over j on the right-hand side of
(5.48) is a sum of nonnegative terms, which is zero only if each of them is zero,
as stated on the left in (5.46). Similarly, the second equation y>(b − Ax ) = 0 in
(5.45) holds only if the equations on the right of (5.46) hold for all i. Clearly, (5.47)
is equivalent to (5.46).
For an LP in general form, the feasibility conditions y ∈ Y and x ∈ X with
(5.39) and (5.37) imply
∑im=1 yi aij = c j for j ∈ J̄ ,     ∑nj=1 aij x j = bi for i ∈ K̄ , (5.49)
so that (5.46) holds for j ∈ J̄ and i ∈ K̄. Hence, the respective terms in (5.46) are
zero in the scalar products (y>A − c>) x and y>(b − Ax ). These scalar products
are nonnegative because ∑im=1 yi aij ≥ c j and x j ≥ 0 for j ∈ J, and yi ≥ 0 and
bi ≥ ∑nj=1 aij x j for i ∈ K. So the weak duality proof c>x ≤ y>A x ≤ y>b applies
as well. As before, optimality c>x = y>b is equivalent to (5.45) and thus to (5.46)
and (5.47), which for j ∈ J̄ and i ∈ K̄ hold trivially by (5.49) irrespective of the
sign of x j or yi .
Consider the standard LP in inequality form (5.7) and its dual LP (5.9). The
dual feasibility constraints imply nonnegativity of y>A − c>, which is the n-vector
of “slacks”, that is, of differences in the inequalities y>A ≥ c>; such a slack is
zero in some column if the inequality is tight. The condition (y>A − c>) x = 0
in (5.45) says that this nonnegative slack vector is orthogonal to the nonnegative
vector x, because the scalar product of these two vectors is zero. The conditions
(5.46) and (5.47) state that this orthogonality can hold only if the two vectors are
complementary in the sense that in each component at least one of them is zero.
Similarly, the nonnegative m-vector y and the m-vector of primal slacks b − Ax
are orthogonal in the second equation y>(b − Ax ) = 0 in (5.45). In a compact
way, we can write
y>A ≥ c> ⊥ x≥0
(5.50)
y≥0 ⊥ Ax ≤ b
to state the following:
• all the inequalities in (5.50) have to hold, where those on the left state dual
feasibility and those on the right state primal feasibility, and
• the orthogonality signs ⊥ in the two rows in (5.50) say that the n- and m-
vectors of slacks (differences in these inequalities) have to be orthogonal as
in (5.45) for x and y to be optimal.
In the Tucker diagram (5.10), the first orthogonality in (5.50) refers to the n columns
and the second orthogonality to the m rows. By (5.46) or (5.47), orthogonality
means complementarity in the sense that for each column j or row i at least one
inequality is tight.
We demonstrate with Example 5.1 how complementary slackness is useful in
the search for optimal solutions. The dual to (5.8) (which we normally see directly
from the Tucker diagram) says explicitly: minimize 7y1 + 2y2 subject to y1 , y2 ≥ 0 and
3y1 + y2 ≥ 8
4y1 + y2 ≥ 10 (5.51)
2y1 + y2 ≥ 5 .
One feasible primal solution is x = ( x1 , x2 , x3 ) = (0, 1, 1). Then the first inequality
in (5.8) is not tight so by (5.47) we need y1 = 0 in an optimal solution. Because
x2 > 0 and x3 > 0 the second and third inequality in (5.51) have to be tight, which
implies y2 = 10 and y2 = 5 which is impossible. So this x is not optimal.
Another feasible primal solution is ( x1 , x2 , x3 ) = (0, 1.75, 0), where y2 = 0
because the second primal inequality is not tight. Only the second inequality in
(5.51) has to be tight, that is, 4y1 = 10 or y1 = 2.5. However, this violates the first
dual inequality in (5.51).
Finally, for the primal solution x = ( x1 , x2 , x3 ) = (1, 1, 0) both primal in-
equalities are tight, which allows for y1 > 0 and y2 > 0. Then the first two dual
inequalities in (5.51) have to be tight, which determines y as (y1 , y2 ) = (2, 2),
which also fulfills the third dual inequality (which is allowed to have positive
slack because x3 = 0). So here x and y are optimal.
The complementary slackness condition is a good way to verify that a con-
jectured primal solution is optimal, because the resulting equations for the dual
variables typically determine the values of the dual variables which can then be
checked for dual feasibility (or for equality of primal and dual objective function).
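This procedure is easily automated. The following sketch (with the data of (5.8); the helper dual_from_primal is ours for illustration, not part of the notes) derives the dual values forced by (5.47) from a conjectured primal solution and then tests dual feasibility; if the forced equations are inconsistent, the least-squares compromise simply fails the final check:

    import numpy as np

    A = np.array([[3.0, 4, 2], [1, 1, 1]])
    b = np.array([7.0, 2])
    c = np.array([8.0, 10, 5])

    def dual_from_primal(x):
        # By (5.47): y_i = 0 for non-tight primal rows; the dual inequalities
        # for columns j with x_j > 0 must hold as equations, which we solve.
        tight = np.isclose(A @ x, b)
        pos = x > 1e-9
        y_tight, *_ = np.linalg.lstsq(A.T[pos][:, tight], c[pos], rcond=None)
        y = np.zeros(len(b))
        y[tight] = y_tight
        ok = bool(np.all(y @ A >= c - 1e-9) and np.all(y >= -1e-9))
        return y, ok

    for x in ([0, 1, 1], [0, 1.75, 0], [1, 1, 0]):
        y, ok = dual_from_primal(np.array(x, dtype=float))
        print(x, "-> y =", np.round(y, 4), " certifies optimality:", ok)

Only x = (1, 1, 0) produces a dual feasible y, namely y = (2, 2), certifying optimality as derived above.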
As stated in Theorem 5.14, the complementary slackness conditions charac-
terize optimality of a primal-dual pair x, y also for an LP in general form. How-
ever, in such an LP they only impose constraints for the primal or dual inequali-
ties, that is, for the columns j ∈ J and i ∈ K in the Tucker diagram in Figure 5.10.
The other columns and rows already define dual or primal equations which by
definition have zero slack. This is also the case if such an equality is converted to
a pair of inequalities. For example, for i ∈ K̄, the primal equation ∑nj=1 aij x j = bi
with unrestricted dual variable yi can be rewritten as a pair of two inequalities
∑nj=1 aij x j ≤ bi and − ∑nj=1 aij x j ≤ −bi with associated nonnegative dual vari-
ables ŷi and ȳi so that yi = ŷi − ȳi . As noted in the proof of Theorem 5.13, we
can add a constant zi to the two variables ŷi and ȳi in any dual feasible solution,
so that they are both positive when zi > 0. By complementary slackness, the two
primal inequalities then have to be tight, but they can anyhow only be fulfilled if
they both hold as an equation. This confirms that for a general LP complementary
slackness is informative only for the inequality constraints.
The purpose of this and the next section is to connect linear programming with
what we know about constrained optimization (which because of its greater gen-
erality is sometimes called “nonlinear programming”) and the Karush-Kuhn-
Tucker (KKT) conditions. Because the KKT conditions concern local optima, we
show in this section that for so-called convex optimization problems, which in-
clude linear programs, a local minimum is a global minimum.
Figure 5.11 The line through the points x and y consists of points written as
x + (y − x ) p where p ∈ R. Examples are point a for p = 0.6, point
b for p = 1.5, and point c when p = −0.4. The line segment that
connects x and y (drawn as a solid line) results when p is restricted
to 0 ≤ p ≤ 1.
Let x and y be two vectors in Rm . Figure 5.11 shows two points x and y in
the plane, but the picture may also be regarded as a suitable view of the situation
in a higher-dimensional space. The line that goes through the points x and y is
obtained by adding to the point x, regarded as a vector, any scalar multiple of the
difference y − x. The resulting vector x + (y − x ) p, for p ∈ R, gives x when p = 0,
and y when p = 1. Figure 5.11 gives some examples a, b, c of other points. When
0 ≤ p ≤ 1, as for point a, the resulting points define the line segment that joins x
and y. If p > 1, then one obtains points on the line through x and y on the other
side of y relative to x, like the point b in Figure 5.11. For p < 0, the corresponding
point, like c in Figure 5.11, is on that line but on the other side of x relative to y.
Figure 5.12 Examples of sets that are convex (left) and not convex (right).
Figure 5.13 Illustration of Theorem 5.15 for m = 2. Any point in the pentagon
belongs to one of the three shown triangles (which are not unique;
there are other ways to “triangulate” the pentagon). A triangle is
the set of convex combinations of its corners.
those points (see Figure 5.13). This is the case m = 2 of the following theorem,
which we mention as an easy consequence of Lemma 5.4.
Theorem 5.15 (Carathéodory) Any convex combination of finitely many points in Rm
can be written as a convex combination of at most m + 1 of these points.
Definition 5.16 Let X ⊆ Rn be a convex set. The epigraph of a function f : X → R is
the set
{ ( x, u) ∈ Rn × R | x ∈ X, f ( x ) ≤ u } (5.53)
of points on or above the graph of f . The function f is called convex if for all x, y ∈ X
and all p ∈ [0, 1]
f ( x (1 − p) + yp) ≤ f ( x )(1 − p) + f (y) p . (5.54)
Figure 5.14 illustrates condition (5.54). This condition says that if one con-
nects x and y in the domain X of f with a line segment, the function value should
be on or below the line segment that connects the endpoints f ( x ) and f (y). That
is, let p ∈ [0, 1] and z = x (1 − p) + yp. The white dot in Figure 5.14 shows the
convex combination in Rn+1 of ( x, f ( x )) and (y, f (y)) with 1 − p and p given by
( x, f ( x ))(1 − p) + (y, f (y)) p, which is at least as high as the point (z, f (z)), ac-
cording to the condition f (z) ≤ f ( x )(1 − p) + f (y) p in (5.54). Note that z ∈ X
requires that X itself is convex. It is easy to see that then (5.54) is equivalent to
the convexity of the epigraph of f .
The important property of convex functions is the following theorem. Its
proof (by way of contradiction) is illustrated in Figure 5.15.
Theorem 5.17 Let X ⊆ Rn and let f : X → R be a convex function. Then any local
minimum of f is a global minimum of f .
Figure 5.14 Illustration of Definition 5.16. The shaded set is the epigraph of f
in (5.53), that is, the set of points ( x, u) above the graph of f , with u
(unbounded above) drawn in vertical direction.
Figure 5.15 Illustration of the proof of Theorem 5.17: if f (y) < f ( x ), then points z
near x on the line segment that joins x and y have f (z) < f ( x ).
Proof. Let x be a local minimum of f , so there is some ε > 0 so that
f ( x ) ≤ f (z) for all z in X with kz − x k < ε. Suppose x is not a global minimum
of f , that is, f ( x ) > f (y) for some y ∈ X. Let p ∈ (0, 1] so that ky − x k p < ε, and
let z = x (1 − p) + yp, so that kz − x k = ky − x k p < ε. Then (5.54) implies
f (z) ≤ f ( x )(1 − p) + f (y) p < f ( x ), which contradicts the local minimality of x
(see Figure 5.15).
Corollary 5.18 Any local optimum (maximum or minimum) of an LP (in general form)
is a global optimum.
Proof. The inequality and equality constraints of an LP in general form are pre-
served under taking convex combinations, so the primal feasible set X in (5.37) is
convex. Any linear function f , such as x 7→ −c>x or x 7→ c>x, fulfills (5.54) (even
with “=” instead of “≤”) and is therefore convex, so that Theorem 5.17 applies to
minimizing −c>x (that is, maximizing c>x) for x ∈ X. (It also applies to the dual
LP.)
In this section we show that the KKT Theorem 4.11 applied to a linear program
is essentially the strong LP duality theorem, applied to an LP (5.31) in inequality
form where any inequalities such as x ≥ 0 would have to be written as part of
Ax ≤ b, so the variables x ∈ Rn are unrestricted. In order to match the notation
in Theorem 4.11, let the number of rows of A be ℓ. That is, (5.31) states: maximize
f ( x ) = c>x subject to hi ( x ) = ∑nj=1 aij x j − bi ≤ 0 for 1 ≤ i ≤ ℓ. The functions f
and hi are affine functions that have constant derivatives, with D f ( x ) = c> and
Dhi ( x ) = ( ai1 , . . . , ain ) for 1 ≤ i ≤ ℓ. The open set U in Theorem 4.11 is Rn .
Suppose that this LP is feasible and that c>x has a local maximum at x = x̄,
which by Corollary 5.18 is also a global maximum. By the duality Theorem 5.13
for an LP in general form, there exists an optimal dual vector y ∈ Rℓ with y ≥ 0
and y>A = c> (see also (5.32)), which is equivalent to D f ( x̄ ) = c> = y>A =
∑ℓi=1 yi Dhi ( x̄ ), which is the last equation in (4.46) with µi = yi for 1 ≤ i ≤ ℓ.
Moreover, the optimality condition c>x̄ = y>b is equivalent to the complemen-
tary slackness conditions (5.46). In (5.46), the first set of equations holds automat-
ically because y>A = c>, and the second equations yi (bi − ∑nj=1 aij x̄ j ) = 0 are
equivalent to yi (−hi ( x̄ )) = 0 and therefore to µi hi ( x̄ ) = 0 as stated in (4.46). So
Theorem 4.11 is a consequence of the strong duality theorem, in fact in a stronger
form because it does not require the constraint qualification that the gradients in
(4.46) for the tight constraints are linearly independent.
We illustrate the simplex algorithm with the LP (5.8) from Example 5.1. In
equality form (5.34), its constraints with slack variables s1 , s2 say
3x1 + 4x2 + 2x3 + s1 = 7
x1 + x2 + x3 + s2 = 2 (5.56)
with x1 , x2 , x3 , s1 , s2 ≥ 0. Solving these equations for the basic variables s1 , s2 and
writing the objective function as z gives the first dictionary
s1 = 7 − 3x1 − 4x2 − 2x3
s2 = 2 − x1 − x2 − x3 (5.57)
z = 0 + 8x1 + 10x2 + 5x3
with the basic feasible solution s1 = 7, s2 = 2 and nonbasic variables x1 = x2 = x3 = 0.
For a given dictionary, the entering variable is chosen so as to improve the value
of the objective function when that variable is increased from zero in the current ba-
sic feasible solution. In (5.57), this will happen by increasing any of x1 , x2 , x3
because they all have a positive coefficient in the linear equation for z. Suppose
x2 is chosen as the entering variable (for example, because it has the largest co-
efficient). Suppose the other nonbasic variables x1 and x3 stay at zero and x2
increases. Then z = 0 + 10x2 (the desired increase), s1 = 7 − 4x2 , and s2 = 2 − x2 .
In order to maintain feasibility, we need s1 = 7 − 4x2 ≥ 0 and s2 = 2 − x2 ≥ 0,
where these two constraints are equivalent to 7/4 = 1.75 ≥ x2 and 2 ≥ x2 . The first
of these is the stronger constraint: when x2 is increased from 0 to 1.75, then s1 = 0
and s2 = 0.25 > 0. For that reason, s1 is chosen as the leaving variable, and we
rewrite the first equation in (5.57) so that x2 is on the left and s1 is on the right,
giving
4x2 = 7 − 3x1 − s1 − 2x3
s2 = 2 − x1 − x2 − x3 (5.58)
z = 0 + 8x1 + 10x2 + 5x3
However, this is not a dictionary because x2 is still on the right-hand side of the
second and third equation, but should appear only on the left. To remedy this,
we first rewrite the first equation so that x2 has coefficient 1, and then substitute
this equation into the other two equations:
x2 = 1.75 − 0.75x1 − 0.25s1 − 0.5x3
s2 = 2 − x1 − x3 − (1.75 − 0.75x1 − 0.25s1 − 0.5x3 ) (5.59)
z = 0 + 8x1 + 5x3 + 10 (1.75 − 0.75x1 − 0.25s1 − 0.5x3 )
which gives the new dictionary with basic variables x2 and s2 and nonbasic vari-
ables x1 , s1 , x3 :
x2 = 1.75 − 0.75x1 − 0.25s1 − 0.5x3
s2 = 0.25 − 0.25x1 + 0.25s1 − 0.5x3 (5.60)
z = 17.5 + 0.5x1 − 2.5s1 + 0x3
The basic feasible solution corresponding to (5.60) is x2 = 1.75, s2 = 0.25 and has
objective function value z = 17.5. The latter can still be improved by increasing
x1 , which is now the unique choice for entering variable because neither s1 nor
x3 have a positive coefficient in this representation of z. Increasing (only) x1 from
zero imposes the constraints x2 = 1.75 − 0.75x1 ≥ 0 and s2 = 0.25 − 0.25x1 ≥ 0,
where the second is stronger, since s2 becomes zero when x1 = 1 while x2 is still
positive. So x1 enters and s2 leaves the basis. Similar to the step from (5.57) to
132 CHAPTER 5. LINEAR OPTIMIZATION
(5.58), we bring x1 to the left and s2 to the right side of the equation,
and substitute the resulting equation x1 = 1 − 4s2 + s1 − 2x3 for x1 into the other
two equations:
x2 = 1.75 − 0.25s1 − 0.5x3
− 0.75(1 − 4s2 + s1 − 2x3 )
x1 = 1 − 4s2 + s1 − 2x3 (5.62)
z= 17.5 − 2.5s1 + 0x3
+ 0.5(1 − 4s2 + s1 − 2x3 )
which gives the next dictionary with x2 and x1 as basic variables and s2 , s1 , x3 as
nonbasic variables:
x2 = 1 + 3s2 − s1 + x3
x1 = 1 − 4s2 + s1 − 2x3 (5.63)
z = 18 − 2s2 − 2s1 − x3
As always, this dictionary is equivalent to the original system of equations (5.56),
with basic feasible solution x1 = 1, x2 = 1, and corresponding objective func-
tion value z = 18. In the last line in (5.63), no nonbasic variable has a positive
coefficient. This means that no increase from zero of a nonbasic variable can im-
prove the objective function. Hence this basic feasible solution is optimal, and the
algorithm terminates.
Converting one dictionary to another by exchanging a nonbasic (entering)
variable with a basic (leaving) variable is commonly referred to as pivoting. The
column of the entering variable and the row of the leaving variable define a
nonzero coefficient of the entering variable known as a pivot element. Pivoting
amounts to a manipulation of the matrix of coefficients of all variables and of the
right-hand side (often called a tableau). This manipulation involves row operations
that represent the substitutions as performed in (5.59) and (5.62). In such a row
operation, the pivot row is divided by the pivot element, and suitable multiples
of the resulting new row are subtracted from the other rows.
That is, we can express the variable substitution that leads to the new dictio-
nary in terms of suitable row operations of the system of equations. This is easiest
seen by keeping all variables on one side similar to (5.56). We rewrite (5.57) as
3x1 + 4x2 + 2x3 + s1 = 7
x1 + x2 + x3 + s2 = 2 (5.64)
z − 8x1 − 10x2 − 5x3 = 0
where we now have to remember that in the expression for z a potential entering
variable is identified by a negative coefficient. In (5.64) the basic variables are s1
and s2 which have a unit vector as a column of coefficients, which has entry 1 in
the row of the basic variable and entry 0 elsewhere.
With x2 as the entering and s1 as the leaving variable in (5.64), pivoting
amounts to creating a unit vector in the column for x2 . This means to divide the
first (pivot) row by 4 so that x2 has coefficient 1 in that row. The new first row is
then subtracted from the second row, and 10 times the new first row is added to
the third row, so that the coefficient of x2 in those rows becomes zero:
0.75x1 + x2 + 0.5x3 + 0.25s1 = 1.75
0.25x1 + 0.5x3 − 0.25s1 + s2 = 0.25 (5.65)
z − 0.5x1 + 0x3 + 2.5s1 = 17.5
These row operations have the same effect as the substitutions in (5.59). The
system (5.65) is equivalent to the second dictionary (5.60). The basic variables x2
and s2 are identified by their unit-vector columns. Note that z is expressed only
in terms of the nonbasic variables.
The entering variable in (5.65) is x1 and the leaving variable is s2 , so that the
unit vector for that second row should now appear in the column for x1 rather
than s2 . The second row is divided by the pivot element 0.25 (i.e., multiplied
by 4) to give x1 coefficient 1, and the coefficients of x1 in the other rows give the
suitable multiplier to subtract the new second row from the other rows, namely
0.75 for the first and −0.5 for the third row. This gives
x2 − x3 + s1 − 3s2 = 1
x1 + 2x3 − s1 + 4s2 = 1 (5.66)
z + x3 + 2s1 + 2s2 = 18
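These row operations can be reproduced with a few lines of numpy. The sketch below (an illustration only) stores (5.64) as a tableau with columns for x1, x2, x3, s1, s2, z and the right-hand side, and performs the two pivots:

    import numpy as np

    def pivot(T, r, c):
        # Make column c of tableau T the unit vector with a 1 in row r:
        # divide the pivot row and subtract suitable multiples from the rest.
        T = T.copy()
        T[r] /= T[r, c]
        for i in range(len(T)):
            if i != r:
                T[i] -= T[i, c] * T[r]
        return T

    # Tableau (5.64): columns x1, x2, x3, s1, s2, z | right-hand side.
    T = np.array([[ 3.0,   4,  2, 1, 0, 0,  7],
                  [ 1.0,   1,  1, 0, 1, 0,  2],
                  [-8.0, -10, -5, 0, 0, 1,  0]])

    T = pivot(T, 0, 1)   # x2 enters, s1 leaves: reproduces (5.65)
    T = pivot(T, 1, 0)   # x1 enters, s2 leaves: reproduces (5.66)
    print(T)             # last row reads: z + x3 + 2 s1 + 2 s2 = 18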
This section describes the simplex algorithm in general. We also define in gen-
erality the relevant terms, many of which have already been introduced in the
previous section.
The algorithm needs an initial basic feasible solution to start with, which exists unless
the LP is infeasible, a case that is discovered at that point. We will describe this
initializing “first phase” later.
Consider a basic feasible solution with basis B, and let N denote the index
set of the nonbasic columns as above. The following equations are equivalent for
any x ∈ Rn :
Ax = b
AB xB + AN xN = b
AB−1 AB xB + AB−1 AN xN = AB−1 b
xB = AB−1 b − AB−1 AN xN (5.67)
xB = AB−1 b − ∑ j∈ N AB−1 A j x j
xB = b̄ − ∑ j∈ N Ā j x j ,
where the last equation uses the abbreviations b̄ = AB−1 b and Ā j = AB−1 A j for
j ∈ N. Substituting the last equation of (5.67) into c>x = cB>xB + cN>xN gives
c>x = cB> AB−1 b + ∑ j∈ N (c j − cB> AB−1 A j ) x j , (5.68)
which expresses the objective function c>x in terms of the nonbasic variables, as
in the equation below the horizontal line in the examples (5.57), (5.60), (5.63). In
(5.68), cB> AB−1 b is the value of the objective function for the basic feasible solution
where x N = 0. This is an optimal solution if
c j − cB> AB−1 A j ≤ 0 for all j ∈ N , (5.69)
because then no increase of a nonbasic variable from zero can improve the objective
function. To see optimality, consider the vector y given by
y> = cB> AB−1 , (5.70)
which is feasible for the dual LP (5.29) because y>AB = cB> by (5.70) and y>A j ≥
c j for j ∈ N by (5.69), that is, y>AN ≥ cN>, so altogether y>A ≥ c>. It is optimal
because y>b = cB> AB−1 b = cB>xB = c>x when x N = 0, that is, dual and primal
objective function have the same value.
The optimality criterion (5.69) fails if
c j − cB> AB−1 A j > 0 for some j ∈ N. (5.71)
In that case, the value of the objective function will be increased if x j can assume
a positive value. The simplex algorithm therefore looks for such a j in (5.71) and
makes x j a new basic variable, called the entering variable. The index j is said to
enter the basis. This has to be done while preserving feasibility, and so that there
are again m basic variables. Thereby, some element i of B leaves the basis, where
xi is called the leaving variable.
To demonstrate this change of basis, consider the last equation in (5.67) that
expresses the variables x B in terms of the nonbasic variables x N . Assume that all
components of x N are kept zero except x j . Then (5.67) has the form
xB = b̄ − Ā j x j . (5.72)
By (5.68), increasing x j changes the objective function by the positive amount
(c j − cB> AB−1 A j ) x j . (5.73)
As x j increases, a component xi of xB with āij > 0 (where āij is the ith component
of Ā j ) decreases, and stays nonnegative as long as x j ≤ b̄i / āij ; if āij ≤ 0 for all
i ∈ B, then x j can be increased indefinitely and the LP is unbounded. Otherwise,
the largest value of x j that preserves feasibility is given by the minimum-ratio test
x j = min { b̄i / āij | i ∈ B, āij > 0 } . (5.74)
For at least one i ∈ B, the minimum ratio is achieved as stated in (5.74). The
corresponding variable x_i is made the leaving variable and becomes nonbasic.
This defines the pivoting step: the entering variable x_j is made basic and the
leaving variable x_i is made nonbasic, and the basis B is replaced by B − {i} ∪ {j}.
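Putting these steps together gives the following Python sketch of the whole
iteration (a sketch only: it assumes a feasible starting basis B, picks as entering
index the first j satisfying (5.71), and does not handle degeneracy; on the example
data of (5.64) it reaches the optimum x1 = x2 = 1 with value 18, possibly through
a different sequence of bases than in the previous section):

import numpy as np

def simplex(A, b, c, B):
    # maximize c^T x subject to Ax = b, x >= 0, starting from a basis B
    # (a list of m column indices) whose basic solution is feasible
    m, n = A.shape
    while True:
        AB_inv = np.linalg.inv(A[:, B])
        b_bar = AB_inv @ b                    # current values of x_B
        y = c[B] @ AB_inv                     # y^T = c_B^T A_B^{-1}, see (5.70)
        N = [j for j in range(n) if j not in B]
        # entering variable: first j with positive reduced cost, see (5.71)
        j = next((j for j in N if c[j] - y @ A[:, j] > 1e-9), None)
        if j is None:                         # criterion (5.69) holds: optimal
            x = np.zeros(n)
            x[B] = b_bar
            return x, y
        A_bar_j = AB_inv @ A[:, j]            # the column Ā_j in (5.72)
        if np.all(A_bar_j <= 1e-9):           # no leaving variable: unbounded
            raise ValueError("LP is unbounded")
        # minimum-ratio test (5.74): k is the row of the leaving variable
        ratio, k = min((b_bar[i] / A_bar_j[i], i)
                       for i in range(m) if A_bar_j[i] > 1e-9)
        B[k] = j                              # pivot: j enters, B[k] leaves

A = np.array([[3.0, 4.0, 2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0, 0.0, 1.0]])
b = np.array([7.0, 2.0])
c = np.array([8.0, 10.0, 5.0, 0.0, 0.0])
x, y = simplex(A, b, c, B=[3, 4])             # start with slack basis {s1, s2}
print(x, y, c @ x)                            # x1 = x2 = 1, value 18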
In summary, the simplex algorithm proceeds as follows:

1. Find an initial basic feasible solution with basis B (using the "first phase"
described below); if none exists, the LP is infeasible.
2. If the optimality criterion (5.69) holds, then the current basic feasible
solution is optimal: stop.
3. Otherwise, choose an entering variable x_j with j as in (5.71), and determine
the leaving variable x_i by the minimum-ratio test (5.74).
4. Pivot, replacing B by B − {i} ∪ {j}, and go to Step 2.

Assume first that the LP is nondegenerate, that is, in every basic feasible solution
all basic variables have positive values. This implies b̄_i > 0 in (5.74), so that the
entering variable x_j takes on a positive value and the objective function for the
basic feasible solution increases with each iteration by (5.73). Hence, no basis is
revisited, and the simplex algorithm
terminates because there are only finitely many bases. Furthermore, Step 3 of the
above summary shows that in the absence of degeneracy the leaving variable x_i,
and thus the minimum in the minimum-ratio test (5.74), is unique: if two basic
variables became zero at the same time, then only one of them would leave the
basis, and the other would remain basic with value zero in the new basic feasible
solution, which would therefore be degenerate.
If there are degenerate basic feasible solutions, then the minimum in (5.74)
may be zero because b̄_i = 0 for some i with ā_ij > 0. Then the entering variable
x_j, which was zero as a nonbasic variable, enters the basis but stays at zero in
the new basic feasible solution. In that case, only the basis has changed, but not
the feasible solution and also not the value of the objective function. In fact, it is
possible that this results in a cycle of the simplex algorithm (when the same basis
is revisited) and thus a failure to terminate. This behaviour is rare, and degeneracy
itself is an "accident" that only occurs when there are special relationships between
the entries of A and b. Nevertheless, degeneracy can be dealt with in
a systematic manner, which we do not treat in these notes, which are already long
enough.
We also need to find an initial feasible solution to start the simplex algorithm.
For that purpose, we use a “first phase” with a different objective function that
establishes whether the LP (5.28) is feasible, similar to the approach in (5.24).
First, choose an arbitrary basis B and let b̄ = A_B^{-1} b. If b̄ ≥ 0, then x_B = b̄ is
already a basic feasible solution and nothing needs to be done. Otherwise, b̄ has
at least one negative component. Define the m-vector h = A_B 1 where 1 is the all-
one vector. That is, h is just the sum of the columns of A_B. We add −h as an extra
column to the system Ax = b with a new variable t and consider the following
LP:
maximize −t
subject to Ax − ht = b (5.75)
x, t ≥ 0
We find a basic feasible solution to this LP with a single pivoting step from the
(infeasible) basis B. Namely, the following are equivalent, similar to (5.67):
Ax − ht = b
A_B x_B + A_N x_N − ht = b
x_B = A_B^{-1} b − A_B^{-1} A_N x_N + A_B^{-1} h t (5.76)
x_B = b̄ − A_B^{-1} A_N x_N + 1 t
where we now let t enter the basis (note that A_B^{-1} h = A_B^{-1} A_B 1 = 1) and increase t
so that b̄ + 1t ≥ 0. For the smallest such value of t, namely t = −min_{i∈B} b̄_i, at
least one component x_i of x_B is zero and becomes
the leaving variable. After the pivot with x_i leaving and t entering the basis, one
obtains a basic feasible solution to (5.75).
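The following numpy sketch illustrates this initial pivot on small hypothetical
data (the matrix A, vector b and basis B below are made up for the illustration;
here b̄ has the negative component −1, and the pivot brings t = 1 into the basis):

import numpy as np

A = np.array([[1.0, 0.0, 2.0],     # hypothetical data: the basis B = [0, 1]
              [0.0, 1.0, 1.0]])    # gives A_B = I, but b̄ = b is not >= 0
b = np.array([-1.0, 3.0])
B = [0, 1]

AB = A[:, B]
b_bar = np.linalg.inv(AB) @ b      # here (-1, 3)
h = AB @ np.ones(2)                # h = A_B 1, the sum of the columns of A_B

# the auxiliary LP (5.75): maximize -t subject to Ax - ht = b, x, t >= 0,
# with t as the extra last column
A_aux = np.hstack([A, -h.reshape(-1, 1)])

i = int(np.argmin(b_bar))          # most negative component: leaving row
B[i] = A_aux.shape[1] - 1          # t enters the basis
x_B = np.linalg.inv(A_aux[:, B]) @ b
assert np.all(x_B >= 0)            # a basic feasible solution of (5.75)
print(B, x_B)                      # here t = 1 = -min(b̄)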
The LP (5.75) is therefore feasible, and its objective function is bounded from
above by zero. The original system Ax = b, x ≥ 0 is feasible if and only if the
optimum of (5.75) is zero. Suppose this is the case, which is found out by
solving the LP (5.75) with the simplex algorithm. Then this "first phase" terminates
with a basic feasible solution to (5.75) where t = 0, which is then also a
feasible solution to Ax = b, x ≥ 0. The simplex algorithm can then proceed with
maximizing the original objective function c^T x as described earlier.