Convex Functions: See P. 10 of The Handout On Preliminary Material
Our next topic is that of convex functions. Again, we will concentrate on the context of a
map $f : \mathbb{R}^n \to \mathbb{R}$, although the situation can be generalized immediately by replacing $\mathbb{R}^n$
with any real vector space $V$. We will state many of the definitions below in this more
general setting.
We will also find it useful, and in fact modern algorithms reflect this usefulness, to consider
functions $f : \mathbb{R}^n \to \mathbb{R}^*$, where $\mathbb{R}^*$ is the set of extended real numbers introduced earlier.¹
Before beginning with the main part of the discussion, we want to keep a couple of
examples in mind.
The primary example of a convex function is $x \mapsto x^2$, $x \in \mathbb{R}$. As we learn in elementary
calculus, this function is infinitely often differentiable and has a single critical point at
which the function in fact takes on, not just a relative minimum, but an absolute minimum.
A critical point is, by definition, a solution of the equation $\frac{d}{dx}\, x^2 = 0$, that is, $2x = 0$. We
can apply the second derivative test at the point $x = 0$ to determine the nature of the
critical point and we find that, since $\frac{d^2}{dx^2}\, x^2 = 2 > 0$, the function is "concave up" and
the critical point is indeed a point of relative minimum. To see that this point gives an absolute
minimum of the function, we need only remark that the function values are bounded
below by zero, since $x^2 > 0$ for all $x \neq 0$.

¹ See p. 10 of the handout on Preliminary Material.
We can give a similar example in $\mathbb{R}^2$.
[Figure: surface plot of $z = f(x, y)$ over the $(x, y)$-plane.]
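This is an elliptic paraboloid. In this case we expect that, once again, the minimum will
occur at the origin of coordinates and, setting $f(x, y) = z$, we can compute
$$\operatorname{grad}(f)(x, y) = \begin{pmatrix} x \\ \tfrac{2}{3}\, y \end{pmatrix}, \quad \text{and} \quad H(f(x, y)) = \begin{pmatrix} 1 & 0 \\ 0 & \tfrac{2}{3} \end{pmatrix}.$$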
Notice that, in our terminology, the Hessian matrix $H(f)$ is positive definite at all points
$(x, y) \in \mathbb{R}^2$. Here the critical points are exactly those for which $\operatorname{grad}[f(x, y)] = 0$, whose
only solution is $x = 0$, $y = 0$. The second derivative test is just that
$$\det H(f(x, y)) = \frac{\partial^2 f}{\partial x^2}\,\frac{\partial^2 f}{\partial y^2} - \left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 > 0,$$
which is clearly satisfied.
Again, since $z > 0$ for all $(x, y) \neq (0, 0)$, the origin is a point where $f$ has an absolute
minimum.
As the idea of convex set lies at the foundation of our analysis, we want to describe the
notion of convex functions in terms of convex sets. We recall that, if A and B are two
non-empty sets, then the Cartesian product of these two sets A × B is defined as the set
of ordered pairs $\{(a, b) : a \in A,\ b \in B\}$. Notice that order does matter here and that,
in general, $A \times B \neq B \times A$. Simple examples are
2. $\mathbb{R}^2$ itself can be identified (and we usually do so) with the Cartesian product $\mathbb{R} \times \mathbb{R}$.
We emphasize that this definition has the advantage of directly relating the theory of
convex sets to the theory of convex functions. However, a more traditional definition is
that a function is convex provided that, for any x, y ∈ C and any λ ∈ [0, 1]
f ( (1 − λ) x + λ y) ≤ (1 − λ) f (x) + λ f (y) ,
which is sometimes referred to as Jensen’s inequality.
In fact, these definitions turn out to be equivalent. Indeed, we have the following result.
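Theorem 1.5 Let $C \subset \mathbb{R}^n$ be convex and $f : C \to \mathbb{R}^*$. Then the following are equivalent:
(a) $\operatorname{epi}(f)$ is convex.
(b) For all $\lambda_1, \ldots, \lambda_n$ with $\lambda_i \geq 0$ and $\sum_{i=1}^n \lambda_i = 1$, and all points $x^{(i)} \in C$, $i = 1, \ldots, n$,
$$f\left(\sum_{i=1}^n \lambda_i x^{(i)}\right) \leq \sum_{i=1}^n \lambda_i f(x^{(i)}).$$
(c) For any $x, y \in C$ and $\lambda \in [0, 1]$,
$$f((1 - \lambda) x + \lambda y) \leq (1 - \lambda) f(x) + \lambda f(y).$$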
Proof: To see that (a) implies (b) we note that, if for all $i = 1, 2, \ldots, n$, $(x^{(i)}, f(x^{(i)})) \in \operatorname{epi}(f)$,
then, since this latter set is convex, we have
$$\sum_{i=1}^n \lambda_i \left(x^{(i)}, f(x^{(i)})\right) = \left(\sum_{i=1}^n \lambda_i x^{(i)},\ \sum_{i=1}^n \lambda_i f(x^{(i)})\right) \in \operatorname{epi}(f),$$
that is,
$$f\left(\sum_{i=1}^n \lambda_i x^{(i)}\right) \leq \sum_{i=1}^n \lambda_i f(x^{(i)}).$$
This establishes (b). It is obvious that (b) implies (c). So it remains only to show that
(c) implies (a) in order to establish the equivalence.
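To this end, suppose that $(x^{(1)}, z_1), (x^{(2)}, z_2) \in \operatorname{epi}(f)$ and take $0 \leq \lambda \leq 1$. Then
$$(1 - \lambda)(x^{(1)}, z_1) + \lambda (x^{(2)}, z_2) = \left((1 - \lambda) x^{(1)} + \lambda x^{(2)},\ (1 - \lambda) z_1 + \lambda z_2\right),$$
and since $f(x^{(1)}) \leq z_1$ and $f(x^{(2)}) \leq z_2$ we have, since $(1 - \lambda) \geq 0$ and $\lambda \geq 0$, that
$$(1 - \lambda) f(x^{(1)}) + \lambda f(x^{(2)}) \leq (1 - \lambda) z_1 + \lambda z_2.$$
Hence, by the assumption (c), $f((1 - \lambda) x^{(1)} + \lambda x^{(2)}) \leq (1 - \lambda) z_1 + \lambda z_2$, which shows that
the point $\left((1 - \lambda) x^{(1)} + \lambda x^{(2)},\ (1 - \lambda) z_1 + \lambda z_2\right)$ is in $\operatorname{epi}(f)$. □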
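We can see another connection between convex sets and convex functions if we introduce
the indicator function $\psi_K$ of a set $K \subset \mathbb{R}^n$. Indeed, $\psi_K : \mathbb{R}^n \to \mathbb{R}^*$ is defined by
$$\psi_K(x) = \begin{cases} 0 & \text{if } x \in K, \\ +\infty & \text{if } x \notin K. \end{cases}$$
Proof: The result follows immediately from the fact that $\operatorname{epi}(\psi_D) = D \times \mathbb{R}_{\geq 0}$.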
Certain simple properties follow immediately from the analytic form of the definition (part
(c) of the equivalence theorem above). Indeed, it is easy to see, and we leave it as an
exercise for the reader, that if f and g are convex functions defined on a convex set C,
then f + g is likewise convex on C provided there is no point for which f (x) = ∞ and
g(x) = −∞. The same is true if β ∈ R , β > 0 and we consider βf .
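Proposition 1.7 Let $f : \mathbb{R}^n \to \mathbb{R}$ be given and, for fixed $x^{(1)}, x^{(2)} \in \mathbb{R}^n$, define a function
$\varphi : [0, 1] \to \mathbb{R}$ by $\varphi(\lambda) := f((1 - \lambda) x^{(1)} + \lambda x^{(2)})$. Then the function $f$ is convex on $\mathbb{R}^n$ if
and only if, for every choice of $x^{(1)}$ and $x^{(2)}$, the function $\varphi$ is convex on $[0, 1]$.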
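Proof: Suppose, first, that $f$ is convex on $\mathbb{R}^n$. Then it is sufficient to show that $\operatorname{epi}(\varphi)$
is a convex subset of $\mathbb{R}^2$. To see this, let $(\lambda_1, z_1), (\lambda_2, z_2) \in \operatorname{epi}(\varphi)$ and let
$$y^{(1)} = (1 - \lambda_1)\, x^{(1)} + \lambda_1 x^{(2)}, \qquad y^{(2)} = (1 - \lambda_2)\, x^{(1)} + \lambda_2 x^{(2)}.$$
Then
$$f(y^{(1)}) = \varphi(\lambda_1) \leq z_1 \quad \text{and} \quad f(y^{(2)}) = \varphi(\lambda_2) \leq z_2.$$
Hence $(y^{(1)}, z_1) \in \operatorname{epi}(f)$ and $(y^{(2)}, z_2) \in \operatorname{epi}(f)$. Since $\operatorname{epi}(f)$ is a convex set, we also
have $(\mu y^{(1)} + (1 - \mu) y^{(2)},\ \mu z_1 + (1 - \mu) z_2) \in \operatorname{epi}(f)$ for every $\mu \in [0, 1]$. It follows that
$$f(\mu y^{(1)} + (1 - \mu) y^{(2)}) \leq \mu z_1 + (1 - \mu) z_2.$$
Now
$$\mu y^{(1)} + (1 - \mu) y^{(2)} = \left(1 - [\mu \lambda_1 + (1 - \mu) \lambda_2]\right) x^{(1)} + [\mu \lambda_1 + (1 - \mu) \lambda_2]\, x^{(2)},$$
and since
$$0 \leq \mu \lambda_1 + (1 - \mu) \lambda_2 \leq 1,$$
we have from the definition of $\varphi$ that $f(\mu y^{(1)} + (1 - \mu) y^{(2)}) = \varphi(\mu \lambda_1 + (1 - \mu) \lambda_2)$ and so
$(\mu \lambda_1 + (1 - \mu) \lambda_2,\ \mu z_1 + (1 - \mu) z_2) \in \operatorname{epi}(\varphi)$, i.e., $\varphi$ is convex.

² This will also be true of quasi-convex and quasi-concave functions, which we will define below.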
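$$\begin{aligned} f((1 - \lambda) x + \lambda y) &= \langle a, (1 - \lambda) x + \lambda y \rangle + b \\ &= (1 - \lambda) \langle a, x \rangle + \lambda \langle a, y \rangle + (1 - \lambda) b + \lambda b \\ &= (1 - \lambda) (\langle a, x \rangle + b) + \lambda (\langle a, y \rangle + b) = (1 - \lambda) f(x) + \lambda f(y), \end{aligned}$$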
In the case that $f$ is linear, that is, $f(x) = \langle a, x \rangle$ for some $a \in \mathbb{R}^n$, it is easy to see
that the map $\varphi : x \mapsto [f(x)]^2$ is also convex. Indeed, if $x, y \in \mathbb{R}^n$ then, setting $\alpha = f(x)$
and $\beta = f(y)$, and taking $0 < \lambda < 1$, we have
$$\begin{aligned} (1 - \lambda) \varphi(x) + \lambda \varphi(y) - \varphi((1 - \lambda) x + \lambda y) &= (1 - \lambda) \alpha^2 + \lambda \beta^2 - ((1 - \lambda) \alpha + \lambda \beta)^2 \\ &= (1 - \lambda) \lambda (\alpha - \beta)^2 \geq 0. \end{aligned}$$
Note, in particular, that the function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) = x$ is linear and that
$[f(x)]^2 = x^2$, so that we have a proof that the function that we usually write $y = x^2$ is a
convex function.
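The algebraic identity used in this argument can be spot-checked numerically. The following is only a sketch: the helper names `convexity_gap` and `closed_form` are hypothetical, and the sample range and seed are arbitrary choices made for illustration.

```python
import random

def convexity_gap(alpha, beta, lam):
    """(1-lam)*alpha^2 + lam*beta^2 - ((1-lam)*alpha + lam*beta)^2."""
    return (1 - lam) * alpha**2 + lam * beta**2 - ((1 - lam) * alpha + lam * beta) ** 2

def closed_form(alpha, beta, lam):
    """The claimed closed form (1-lam)*lam*(alpha-beta)^2, which is never negative."""
    return (1 - lam) * lam * (alpha - beta) ** 2

random.seed(0)  # arbitrary seed, only for reproducibility
max_err = 0.0
for _ in range(1000):
    a = random.uniform(-10.0, 10.0)
    b = random.uniform(-10.0, 10.0)
    lam = random.uniform(0.0, 1.0)
    max_err = max(max_err, abs(convexity_gap(a, b, lam) - closed_form(a, b, lam)))

# The two expressions should agree up to floating-point rounding.
print(max_err)
```

A random check of this kind does not replace the algebra, but it is a quick way to catch a sign or index slip when expanding the square by hand.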
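(c) Let $\{f_\alpha\}_{\alpha \in A}$ be a family of convex functions $f_\alpha : \mathbb{R}^n \to \mathbb{R}^*$. Then its upper envelope $\sup_{\alpha \in A} f_\alpha$
is convex.
$$\begin{aligned} (\varphi \circ f)[(1 - \lambda) x + \lambda y] &\leq \varphi[(1 - \lambda) f(x) + \lambda f(y)] \\ &\leq (1 - \lambda) \varphi(f(x)) + \lambda \varphi(f(y)) = (1 - \lambda) (\varphi \circ f)(x) + \lambda (\varphi \circ f)(y), \end{aligned}$$
where the first inequality comes from the convexity of $f$ and the monotonicity of $\varphi$ and
the second from the convexity of this latter function. This proves part (b).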
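To establish part (c) we note that, since the arbitrary intersection of convex sets is convex,
it suffices to show that
$$\operatorname{epi}\left(\sup_{\alpha \in A} f_\alpha\right) = \bigcap_{\alpha \in A} \operatorname{epi}(f_\alpha).$$
To this end, suppose that $(x, z) \in \operatorname{epi}(\sup_{\alpha \in A} f_\alpha)$.
Then $z \geq \sup_{\alpha \in A} f_\alpha(x)$ and so, for all $\beta \in A$, $z \geq f_\beta(x)$. Hence, by definition, $(x, z) \in \operatorname{epi}(f_\beta)$
for all $\beta$ and therefore
$$(x, z) \in \bigcap_{\alpha \in A} \operatorname{epi}(f_\alpha).$$
Conversely, suppose $(x, z) \in \operatorname{epi}(f_\alpha)$ for all $\alpha \in A$. Then $z \geq f_\alpha(x)$ for all $\alpha \in A$ and
hence $z \geq \sup_{\alpha \in A} f_\alpha(x)$. But this, by definition, implies $(x, z) \in \operatorname{epi}(\sup_{\alpha \in A} f_\alpha)$. This
completes the proof of part (c) and the proposition. □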
Proposition 1.10 If f : R → R∗ is convex, then its lower sections are likewise convex.
The converse of this last proposition is false, as can be easily seen from the function
$x \mapsto x^{1/2}$ from $\mathbb{R}_{>}$ to $\mathbb{R}$. However, the class of functions whose lower level sets $S(f, \alpha)$ (or
equivalently the sets $S(f, \alpha)$) are all convex is likewise an important class of functions, and
these functions are called quasi-convex. They appear in game theory, nonlinear programming
(optimization) problems, and mathematical economics. For example, quasi-convex utility
functions imply that consumers have convex preferences. They are obviously generalizations
of convex functions, since every convex function is clearly quasi-convex. However,
they are not as easy to work with. In particular, while the sum of two convex functions
is convex, the same is not true of quasi-convex functions, as the following example shows.
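$$f(x) = \begin{cases} 0 & x \leq -2 \\ -(x + 2) & -2 < x \leq -1 \\ x & -1 < x \leq 0 \\ 0 & x > 0 \end{cases} \quad \text{and} \quad g(x) = \begin{cases} 0 & x \leq 0 \\ -x & 0 < x \leq 1 \\ x - 2 & 1 < x \leq 2 \\ 0 & x > 2. \end{cases}$$
Here, neither function is convex, but the level sections are convex for each function, so
that each is quasi-convex, and yet the level section corresponding to $\alpha = -1/2$ for the sum
$f + g$ is $[-3/2, -1/2] \cup [1/2, 3/2]$, which is not convex. Hence the sum is not quasi-convex.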
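The counterexample can be explored on a grid. This is a sketch under stated assumptions: the piecewise definitions follow the reading of $f$ and $g$ given above, the helper `lower_section_is_interval` is a hypothetical name, and a finite grid can only suggest, not prove, that a lower section is an interval.

```python
def f(x):
    if x <= -2:
        return 0.0
    if x <= -1:
        return -(x + 2)
    if x <= 0:
        return x
    return 0.0

def g(x):
    if x <= 0:
        return 0.0
    if x <= 1:
        return -x
    if x <= 2:
        return x - 2
    return 0.0

def f_plus_g(x):
    return f(x) + g(x)

def lower_section_is_interval(h, alpha, grid):
    """True if the grid points with h(x) <= alpha are consecutive (or there are none)."""
    idx = [i for i, x in enumerate(grid) if h(x) <= alpha]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)

grid = [i / 100 for i in range(-300, 301)]   # sample points in [-3, 3]
levels = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.5]

f_ok = all(lower_section_is_interval(f, a, grid) for a in levels)
g_ok = all(lower_section_is_interval(g, a, grid) for a in levels)
sum_ok = lower_section_is_interval(f_plus_g, -0.5, grid)
print(f_ok, g_ok, sum_ok)
```

On this grid the lower sections of `f` and `g` are intervals at every sampled level, while the lower section of the sum at level $-1/2$ splits into two pieces separated by the point $0$.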
It is useful for applications to have an analytic criterion for quasi-convexity. This is the
content of the next result.
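Proof: Suppose that the sets $S(f, \alpha)$ are convex for every $\alpha$. Let $x, y \in \mathbb{R}^n$ and let
$\tilde{\alpha} := \max\{f(x), f(y)\}$. Then $S(f, \tilde{\alpha})$ is convex and, since both $f(x) \leq \tilde{\alpha}$ and $f(y) \leq \tilde{\alpha}$,
we have that both $x$ and $y$ belong to $S(f, \tilde{\alpha})$. Since this latter set is convex, we have
$(1 - \lambda) x + \lambda y \in S(f, \tilde{\alpha})$ for every $\lambda \in [0, 1]$, that is,
$$f((1 - \lambda) x + \lambda y) \leq \max\{f(x), f(y)\}.$$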
As we have seen above, the sum of two quasi-convex functions may well not be quasi-
convex. With this analytic test for quasi-convexity, we can check that there are certain
operations which preserve quasi-convexity. We leave the proof of the following result to
the reader.
(b) If $\varphi : \mathbb{R} \to \mathbb{R}$ is a non-decreasing function and $f : \mathbb{R}^n \to \mathbb{R}$ is quasi-convex, then the
composition $\varphi \circ f$ is a quasi-convex function.
A simple sketch of the parabola $y = x^2$ and any horizontal chord (which necessarily lies
above the graph) will convince the reader that all points in the domain corresponding to
the values of the function which lie below that horizontal line form a convex set in the
domain. Indeed, this is a property of convex functions which is often useful.
Notice that, since the intersection of convex sets is convex, the set of points simultaneously
satisfying $m$ inequalities $f_1(x) \leq c_1, f_2(x) \leq c_2, \ldots, f_m(x) \leq c_m$, where each $f_i$ is a convex
function, defines a convex set. In particular, the polygonal region defined by a set of such
inequalities when the $f_i$ are affine is convex.
From this result, we can obtain an important fact about points at which a convex function
attains a minimum.
Proof: If the function does not attain its minimum at any point of $C$, then the set of
such points is empty, which is a convex set. So, suppose that the set $M$ of points at which
the function attains its minimum is non-empty and let $m$ be the minimal value attained
by $f$. If $x, y \in M$ and $\lambda \in [0, 1]$ then certainly $(1 - \lambda) x + \lambda y \in C$ and so
$$m \leq f((1 - \lambda) x + \lambda y) \leq (1 - \lambda) f(x) + \lambda f(y) = m,$$
and so the point $(1 - \lambda) x + \lambda y \in M$. Hence $M$, the set of minimal points, is convex.
Now, suppose that $x^\star \in C$ is a relative minimum point of $f$, but that there is another
point $\hat{x} \in C$ such that $f(\hat{x}) < f(x^\star)$. On the line segment $(1 - \lambda) \hat{x} + \lambda x^\star$, $0 < \lambda < 1$, we have
$$f((1 - \lambda) \hat{x} + \lambda x^\star) \leq (1 - \lambda) f(\hat{x}) + \lambda f(x^\star) < f(x^\star).$$
Since these points lie arbitrarily close to $x^\star$ as $\lambda \to 1$, this contradicts
the fact that $x^\star$ is a relative minimum point. □
Again, the example of the simple parabola, shows that the set M may well contain only
a single point, i.e., it may well be that the minimum point is unique. We can guarantee
that this is the case for an important class of convex functions.
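Proof: Suppose that the set of minimal points $M$ is not empty and contains two distinct
points $x$ and $y$. Then, for any $0 < \lambda < 1$, since $M$ is convex, we have $(1 - \lambda) x + \lambda y \in M$.
But $f$ is strictly convex. Hence, writing $m$ for the minimal value of $f$,
$$f((1 - \lambda) x + \lambda y) < (1 - \lambda) f(x) + \lambda f(y) = m,$$
which contradicts the minimality of $m$.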
$$\langle \nabla f(y) - \nabla f(x),\ y - x \rangle \geq 0,$$
for all x, y ∈ C.
[Figure: plot of $y$ against $x$ for $0 \leq x \leq 2$.]
We can now characterize convexity in terms of the function E and the monotonicity
concept just introduced. However, before stating and proving the next theorem, we need
a lemma.
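Proof: Choose $x, y \in I$ with $x < y$ and, for any $\lambda \in [0, 1]$, define $z_\lambda := (1 - \lambda) x + \lambda y$. By
the Mean Value Theorem, there exist $u, v \in \mathbb{R}$, $x \leq v \leq z_\lambda \leq u \leq y$, such that
$$f(z_\lambda) = f(x) + \lambda (y - x) f'(v) \quad \text{and} \quad f(y) = f(z_\lambda) + (1 - \lambda)(y - x) f'(u).$$
Since, by choice, $v \leq u$, and since $f'$ is non-decreasing, the first of these equations yields
$$f(z_\lambda) \leq f(x) + \lambda (y - x) f'(u).$$
Hence, multiplying this last inequality by $(1 - \lambda)$ and the expression for $f(y)$ by $-\lambda$ and
adding, we get
$$(1 - \lambda) f(z_\lambda) - \lambda f(y) \leq (1 - \lambda) f(x) + \lambda (1 - \lambda)(y - x) f'(u) - \lambda f(z_\lambda) - \lambda (1 - \lambda)(y - x) f'(u),$$
that is, $f(z_\lambda) \leq (1 - \lambda) f(x) + \lambda f(y)$.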
We can now prove a theorem which gives three different characterizations of convexity for
continuously differentiable functions.
Proof: Suppose that (a) holds, i.e. E(x, y) ≥ 0 on C × C. Then we have both
$$\varphi((1 - \lambda) u + \lambda v) = f(x + [(1 - \lambda) u + \lambda v](y - x)) = f\left((1 - [(1 - \lambda) u + \lambda v])\, x + ((1 - \lambda) u + \lambda v)\, y\right),$$
and so, by hypothesis,
The inequality $E(x, y) \geq 0$ shows that local information about a convex function (given
in terms of the derivative at a point) gives us global information in terms of a global
underestimator of the function $f$. In a way, this is the key property of convex functions.
For example, suppose that $\nabla f(x) = 0$. Then, for all $y \in \operatorname{dom}(f)$, $f(y) \geq f(x)$, so that
$x$ is a global minimizer of the convex function $f$.
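The underestimator property can be illustrated in one variable. The sketch below assumes that $E(x, y)$ stands for the gap $f(y) - f(x) - f'(x)(y - x)$ between the function and its tangent line at $x$, which is the usual form; the test function $x^4 + x^2$ and the grid are arbitrary choices made for illustration.

```python
def f(x):
    return x**4 + x**2          # a convex function on the real line

def df(x):
    return 4 * x**3 + 2 * x     # its derivative

def excess(x, y):
    """Assumed form of E(x, y): f(y) minus the tangent line at x, evaluated at y."""
    return f(y) - f(x) - df(x) * (y - x)

grid = [i / 4 for i in range(-12, 13)]   # sample points in [-3, 3]
smallest_gap = min(excess(x, y) for x in grid for y in grid)

# A nonnegative smallest gap means every sampled tangent line stays below the graph.
print(smallest_gap)

# At the critical point x = 0 the tangent line is horizontal, so E(0, y) = f(y) - f(0).
print(excess(0.0, 2.0), f(2.0) - f(0.0))
```

The second print shows the special case discussed above: where the derivative vanishes, the inequality reduces to $f(y) \geq f(x)$.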
It is also important to remark that the hypothesis that the convex function $f$ is defined
on a convex set is crucial, both for the first order conditions as well as for the second order
conditions. Indeed, consider the function $f(x) = 1/x^2$ with domain $\{x \in \mathbb{R} \mid x \neq 0\}$.
The usual second order condition $f''(x) > 0$ holds for all $x \in \operatorname{dom}(f)$, yet $f$ is not convex there,
so that the second order test fails.
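A concrete pair of points makes the failure visible. In this sketch the points $x = -1$, $y = 2$ and the weight $\lambda = 0.4$ are arbitrary illustrative choices, picked so that the convex combination lands near the excluded point $0$.

```python
def f(x):
    return 1.0 / x**2

def second_derivative(x):
    return 6.0 / x**4           # positive wherever f is defined

x, y, lam = -1.0, 2.0, 0.4      # points on opposite sides of the excluded point 0
z = (1 - lam) * x + lam * y     # convex combination, approximately 0.2
chord = (1 - lam) * f(x) + lam * f(y)

# Jensen's inequality would require f(z) <= chord; here it fails by a wide margin.
print(f(z), chord)
print(all(second_derivative(t) > 0 for t in (x, y, z)))
```

The second derivative is positive at all three points, yet the value at the intermediate point lies far above the chord, because the segment joining $x$ and $y$ crosses the gap in the domain.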
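$$\begin{pmatrix} \nabla f(x) \\ -1 \end{pmatrix}^{\top} \left[ \begin{pmatrix} y \\ z \end{pmatrix} - \begin{pmatrix} x \\ f(x) \end{pmatrix} \right] \leq 0.$$
This shows that the hyperplane defined by $(\nabla f(x), -1)^\top$ supports $\operatorname{epi}(f)$ at the boundary
point $(x, f(x))$.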
We now turn to so-called second order criteria for convexity. The discussion involves
the Hessian matrix of a twice continuously differentiable function, and depends on the
question of whether this matrix is positive semi-definite or even positive definite (for strict
convexity).⁴ Let us recall some definitions.
(b) Negative definite provided $x^\top A x < 0$ for all $x \in \mathbb{R}^n$, $x \neq 0$.
$$x^\top A x = \lambda\, x^\top x = \lambda \|x\|^2.$$
Hence
$$\lambda = \frac{x^\top A x}{\|x\|^2} > 0.$$
Conversely, suppose that all the eigenvalues of $A$ are positive. Let $\{x_1, \ldots, x_n\}$ be an
orthonormal set of eigenvectors of $A$.⁵ Hence any $x \in \mathbb{R}^n$, $x \neq 0$, can be written as
$$x = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n$$
with
$$\alpha_i = x^\top x_i \ \text{ for } i = 1, 2, \ldots, n, \quad \text{and} \quad \sum_{i=1}^n \alpha_i^2 = \|x\|^2 > 0.$$
It follows that
$$x^\top A x = (\alpha_1 x_1 + \cdots + \alpha_n x_n)^\top (\alpha_1 \lambda_1 x_1 + \cdots + \alpha_n \lambda_n x_n) = \sum_{i=1}^n \alpha_i^2 \lambda_i \geq \left(\min_i \lambda_i\right) \|x\|^2 > 0.$$

⁵ The so-called Spectral Theorem for real symmetric matrices states that such a matrix can be diagonalized,
and hence has $n$ linearly independent eigenvectors. These can be replaced by a set of $n$ orthonormal
eigenvectors.
In simple cases where we can compute the eigenvalues easily, this is a useful criterion.
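$$\det(A - \lambda I) = (2 - \lambda)(5 - \lambda) - 4 = (\lambda - 1)(\lambda - 6).$$
Hence the eigenvalues are both positive and hence the matrix is positive definite. In this
particular case it is easy to check directly that $A$ is positive definite. Indeed
$$(x_1, x_2) \begin{pmatrix} 2 & -2 \\ -2 & 5 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (x_1, x_2) \begin{pmatrix} 2 x_1 - 2 x_2 \\ -2 x_1 + 5 x_2 \end{pmatrix} = 2 x_1^2 - 4 x_1 x_2 + 5 x_2^2 = 2 (x_1 - x_2)^2 + 3 x_2^2,$$
which is positive unless $x_1 = x_2 = 0$.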
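The eigenvalue computation for this $2 \times 2$ example can be reproduced with the quadratic formula. This is a sketch only: the function names are hypothetical, and the closed form applies to symmetric $2 \times 2$ matrices, not to the general case.

```python
import math

def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] as roots of the characteristic polynomial."""
    trace = a + d
    det = a * d - b * b
    # trace^2 - 4*det equals (a - d)^2 + 4*b^2, so the square root is always real.
    disc = math.sqrt(trace * trace - 4 * det)
    return (trace - disc) / 2, (trace + disc) / 2

def quadratic_form(x1, x2):
    """x^T A x for A = [[2, -2], [-2, 5]]."""
    return 2 * x1**2 - 4 * x1 * x2 + 5 * x2**2

eigs = symmetric_2x2_eigenvalues(2, -2, 5)
print(eigs)

# The completed square 2*(x1 - x2)^2 + 3*x2^2 should match the quadratic form.
x1, x2 = 1.5, -0.5
print(quadratic_form(x1, x2), 2 * (x1 - x2) ** 2 + 3 * x2**2)
```

Both eigenvalues come out positive, in agreement with the factorization of the characteristic polynomial given above.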
This last theorem has some immediate useful consequences. First, if $A$ is positive definite,
then $A$ must be nonsingular, since singular matrices have $\lambda = 0$ as an eigenvalue.
Moreover, since we know that $\det(A)$ is the product of the eigenvalues, and since
each eigenvalue is positive, $\det(A) > 0$. Finally, we have the following result, which
depends on the notion of leading principal submatrices.
Definition 1.25 Given any $n \times n$ matrix $A$, let $A_r$ denote the matrix formed by deleting
the last $n - r$ rows and columns of $A$. Then $A_r$ is called the $r$-th leading principal submatrix of
$A$.
Proposition 1.26 If $A$ is a symmetric positive definite matrix then the leading principal
submatrices $A_1, A_2, \ldots, A_n$ of $A$ are all positive definite. In particular, $\det(A_r) > 0$.
Proof: Let $x_r = (x_1, x_2, \ldots, x_r)^\top \in \mathbb{R}^r$, $x_r \neq 0$, and set
$$x = (x_1, x_2, \ldots, x_r, 0, \ldots, 0)^\top.$$
Since $x_r^\top A_r x_r = x^\top A x > 0$, it follows that $A_r$ is positive definite, by definition. □
This proposition is half of the famous criterion of Sylvester for positive definite matrices.
Theorem 1.27 A real, symmetric matrix $A$ is positive definite if and only if all of its
leading principal minors are positive.
We will not prove this theorem here but refer the reader to his or her favorite treatise on
linear algebra.
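$$A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}.$$
Then
$$A_1 = (2), \quad A_2 = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}, \quad A_3 = A.$$
Then
$$\det A_1 = 2, \quad \det A_2 = 4 - 1 = 3, \quad \text{and} \quad \det A = 4.$$
Hence, according to Sylvester's criterion, the matrix $A$ is positive definite.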
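Sylvester's criterion is easy to mechanize for small matrices. The following sketch uses cofactor expansion for the determinant, which is a simple choice suited only to small sizes and is not the method a numerical library would use; the function names are hypothetical, and the input is assumed to be real and symmetric.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # Delete row 0 and column j to form the minor.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def leading_principal_minors(a):
    """Determinants of the top-left r x r blocks, r = 1, ..., n."""
    return [det([row[:r] for row in a[:r]]) for r in range(1, len(a) + 1)]

def is_positive_definite(a):
    """Sylvester's criterion; assumes a is real and symmetric."""
    return all(d > 0 for d in leading_principal_minors(a))

A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
print(leading_principal_minors(A))   # [2, 3, 4]
print(is_positive_definite(A))       # True
```

For a symmetric matrix that is not positive definite, such as the one with rows $(1, 2)$ and $(2, 1)$, the second minor is $-3$ and the check returns `False`.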
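Proof: By Taylor's Theorem we have
$$f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} \left\langle y - x,\ \nabla^2 f(x + \lambda (y - x))(y - x) \right\rangle,$$
for some $\lambda \in [0, 1]$. Clearly, if the Hessian is positive semi-definite, we have
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle.$$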
Conversely, suppose that the Hessian is not positive semi-definite at some point x ∈ D.
Then, by the continuity of the Hessian, there is a y ∈ D so that, for all λ ∈ [0, 1],
$$f(x) = \frac{1}{2}\, x^\top Q x + q^\top x + r,$$
with $Q$ an $n \times n$ symmetric matrix, $q \in \mathbb{R}^n$ and $r \in \mathbb{R}$. Then since, as we have seen
previously, $\nabla^2 f(x) = Q$, the function $f$ is convex if and only if $Q$ is positive semidefinite.
Strict convexity of $f$ is likewise characterized by the positive definiteness of $Q$.
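For a quadratic function the gap in Jensen's inequality has an exact closed form, namely $\tfrac{1}{2} \lambda (1 - \lambda)(x - y)^\top Q (x - y)$, which makes the role of $Q$ explicit. The sketch below checks this identity numerically; the particular $Q$, $q$ and $r$ are hypothetical sample data, with $Q$ chosen positive definite.

```python
import random

Q = [[2.0, -2.0], [-2.0, 5.0]]   # hypothetical symmetric positive definite matrix
q = [1.0, -1.0]                  # hypothetical linear term
r = 3.0                          # hypothetical constant term

def quad_form(u):
    """u^T Q u."""
    return sum(u[i] * Q[i][j] * u[j] for i in range(2) for j in range(2))

def f(x):
    return 0.5 * quad_form(x) + sum(qi * xi for qi, xi in zip(q, x)) + r

def jensen_gap(x, y, lam):
    """(1-lam) f(x) + lam f(y) - f((1-lam) x + lam y)."""
    z = [(1 - lam) * xi + lam * yi for xi, yi in zip(x, y)]
    return (1 - lam) * f(x) + lam * f(y) - f(z)

def predicted_gap(x, y, lam):
    """Closed form 0.5 * lam * (1-lam) * (x-y)^T Q (x-y)."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return 0.5 * lam * (1 - lam) * quad_form(d)

random.seed(2)  # arbitrary seed, only for reproducibility
worst = 0.0
for _ in range(500):
    x = [random.uniform(-5, 5) for _ in range(2)]
    y = [random.uniform(-5, 5) for _ in range(2)]
    lam = random.uniform(0, 1)
    worst = max(worst, abs(jensen_gap(x, y, lam) - predicted_gap(x, y, lam)))
print(worst)
```

Since the linear and constant terms cancel in the gap, its sign is governed entirely by the quadratic form of $Q$, which is why semidefiniteness of $Q$ decides convexity.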
These first and second-order conditions give us methods of showing that a
given function is convex. Thus, we either check the definition, Jensen's inequality, using
the equivalence that is given by Theorem 2.1.3, or show that the Hessian is positive
semi-definite. Let us look at some simple examples.
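(b) The max function $f(x) = \max\{x_1, \ldots, x_n\}$ is convex on $\mathbb{R}^n$. Here we can use Jensen's
inequality. Let $\lambda \in [0, 1]$; then
$$\begin{aligned} f((1 - \lambda) x + \lambda y) &= \max_{1 \leq i \leq n} \left((1 - \lambda) x_i + \lambda y_i\right) \leq (1 - \lambda) \max_{1 \leq i \leq n} x_i + \lambda \max_{1 \leq i \leq n} y_i \\ &= (1 - \lambda) f(x) + \lambda f(y). \end{aligned}$$

⁶ Recall that by $\mathbb{R}_+$ we mean the set $\{x \in \mathbb{R} \mid x > 0\}$.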
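The inequality for the max function can be spot-checked on random vectors. This is a sketch only: the helper `jensen_holds`, the small tolerance for rounding, and the sampling ranges are illustrative assumptions.

```python
import random

def f(x):
    return max(x)

def jensen_holds(x, y, lam, tol=1e-12):
    """Check f((1-lam) x + lam y) <= (1-lam) f(x) + lam f(y), up to rounding."""
    z = [(1 - lam) * xi + lam * yi for xi, yi in zip(x, y)]
    return f(z) <= (1 - lam) * f(x) + lam * f(y) + tol

random.seed(1)  # arbitrary seed, only for reproducibility
violations = 0
for _ in range(1000):
    n = random.randint(1, 6)
    x = [random.uniform(-10, 10) for _ in range(n)]
    y = [random.uniform(-10, 10) for _ in range(n)]
    lam = random.uniform(0, 1)
    if not jensen_holds(x, y, lam):
        violations += 1
print(violations)
```

The inequality is typically strict when the maxima of `x` and `y` occur at different indices, as in `x = [1, 5]`, `y = [4, 0]`, `lam = 0.5`, where the left side is 2.5 and the right side is 4.5.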
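$$\nabla^2 q(x, y) = \frac{2}{y^3} \begin{pmatrix} y^2 & -xy \\ -xy & x^2 \end{pmatrix}.$$
Since $y > 0$ and
$$(u_1, u_2) \begin{pmatrix} y^2 & -xy \\ -xy & x^2 \end{pmatrix} (u_1, u_2)^\top = (u_1 y - u_2 x)^2 \geq 0,$$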