Nonlinear Systems
by Peter J. Olver
University of Minnesota
1. Introduction.
Nonlinearity is ubiquitous in physical phenomena. Fluid and plasma mechanics, gas
dynamics, elasticity, relativity, chemical reactions, combustion, ecology, biomechanics, and
many, many other phenomena are all governed by inherently nonlinear equations. (The one
notable exception is quantum mechanics, which is a fundamentally linear theory, although
recent attempts at grand unification of all fundamental physical theories, such as string
theory and conformal field theory, [8], are nonlinear.) For this reason, an ever increasing
proportion of modern mathematical research is devoted to the analysis of nonlinear systems
and nonlinear phenomena.
Why, then, does one devote so much time to studying linear mathematics? The facile
answer is that nonlinear systems are vastly more difficult to analyze. In the nonlinear
regime, many of the most basic questions remain unanswered: existence and uniqueness of
solutions are not guaranteed; explicit formulae are difficult to come by; linear superposition
is no longer available; numerical approximations are not always sufficiently accurate; etc.,
etc. A more intelligent answer is that a thorough understanding of linear phenomena
and linear mathematics is an essential prerequisite for progress in the nonlinear arena.
Therefore, we must first develop the proper linear foundations in sufficient depth before
we can realistically confront the untamed nonlinear wilderness. Moreover, many important
physical systems are “weakly nonlinear”, in the sense that, while nonlinear effects do
play an essential role, the linear terms tend to dominate the physics, and so, to a first
approximation, the system is essentially linear. As a result, such nonlinear phenomena are
best understood as some form of perturbation of their linear approximations. The truly
nonlinear regime is, even today, only sporadically modeled and even less well understood.
The advent of powerful computers has fomented a veritable revolution in our under-
standing of nonlinear mathematics. Indeed, many of the most important modern analytical
techniques drew their inspiration from early computer-aided investigations of nonlinear sys-
tems. However, despite dramatic advances in both hardware capabilities and sophisticated
mathematical algorithms, many nonlinear systems — for instance, fully general Einsteinian
gravitation, or the Navier–Stokes equations of fluid mechanics at high Reynolds numbers
— still remain beyond the capabilities of today’s computers.
The goal of these lecture notes is to provide a brief overview of some of the most
important ideas, mathematical techniques, and new physical phenomena in the nonlinear
realm. We start with iteration of nonlinear functions, also known as discrete dynamical
systems. Building on our experience with iterative linear systems, as developed in Chap-
ter 10 of [15], we will discover that functional iteration, when it converges, provides a
powerful mechanism for solving equations and for optimization.
2. Iteration of Functions.
Iteration, meaning repeated application of a function, can be viewed as a discrete
dynamical system in which the continuous time variable has been “quantized” to assume
integer values. Even iterating a very simple quadratic scalar function can lead to an amaz-
ing variety of dynamical phenomena, including multiply-periodic solutions and genuine
chaos. Nonlinear iterative systems arise not just in mathematics, but also underlie the
growth and decay of biological populations, predator–prey interactions, the spread of
communicable diseases such as AIDS, and a host of other natural phenomena. Moreover, many
numerical solution methods — for systems of algebraic equations, ordinary differential
equations, partial differential equations, and so on — rely on iteration, and so the the-
ory underlies the analysis of convergence and efficiency of such numerical approximation
schemes.
In general, an iterative system has the form
u(k+1) = g(u(k)), k = 0, 1, 2, . . . , (2.1)
where g is a prescribed function. If we specify the initial iterate,
u(0) = c, (2.2)
then the resulting solution to the discrete dynamical system (2.1) is easily computed:
u(k) = g ∘ g ∘ · · · ∘ g (c) is obtained by composing the function g with itself k times.
† The superscripts on u(k) refer to the iteration number, and do not denote derivatives.
† In view of the equivalence of norms on finite-dimensional vector spaces, cf. [15], any norm
will do here.
converging to the unique solution of the equation
cos u = u.
Later we will see how to rigorously prove this observed behavior.
Of course, not every solution to a discrete dynamical system will necessarily converge,
but Proposition 2.2 says that if it does, then it must converge to a fixed point. Thus, a
key goal is to understand when a solution converges, and, if so, to which fixed point —
if there is more than one. (In the linear case, only the actual convergence is a significant
issue, since most linear systems admit exactly one fixed point, namely u⋆ = 0.)
Fixed points are roughly divided into three classes:
• asymptotically stable, with the property that all nearby solutions converge to it,
• stable, with the property that all nearby solutions stay nearby, and
• unstable, almost all of whose nearby solutions diverge away from the fixed point.
Thus, from a practical standpoint, convergence of the iterates of a discrete dynamical
system requires asymptotic stability of the fixed point. Examples will appear in abundance
in the following sections.
Scalar Functions
As always, the first step is to thoroughly understand the scalar case, and so we begin
with a discrete dynamical system
g(u) = a u + b, (2.7)
leading to an affine discrete dynamical system
u(k+1) = a u(k) + b. (2.8)
The fixed point is the solution to u⋆ = a u⋆ + b, namely u⋆ = b/(1 − a).
The formula for u⋆ requires that a ≠ 1, and, indeed, when a = 1 and b ≠ 0 there is no
fixed point, as the reader can easily confirm.
Since we already know the value of u⋆ , we can readily analyze the differences
e(k) = u(k) − u⋆ between the iterates and the fixed point: since u(k+1) = a u(k) + b and
u⋆ = a u⋆ + b, subtraction gives e(k+1) = a e(k), and hence e(k) = a^k e(0). For example,
consider the iteration u(k+1) = (1/4) u(k) + 2, with fixed point u⋆ = 2/(1 − 1/4) = 8/3.
Starting with the initial condition u(0) = 0, the ensuing values are
k       1      2      3       4        5        6        7        8
u(k)    2.0    2.5    2.625   2.6562   2.6641   2.6660   2.6665   2.6666

Thus, after 8 iterations, the iterates have produced the fixed point u⋆ = 8/3 to 4 decimal
places. The rate of convergence is 1/4, and indeed
| e(k) | = | u(k) − u⋆ | = (1/4)^k | u(0) − u⋆ | = (8/3) (1/4)^k −→ 0 as k −→ ∞.
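This geometric error decay is easy to confirm on a computer. The following short Python
program (the language choice here is ours) reruns the iteration above and prints the errors,
each one quarter the size of its predecessor:

    # fixed point iteration for the affine map g(u) = u/4 + 2 of the example above;
    # its fixed point is u* = b/(1 - a) = 2/(3/4) = 8/3
    a, b = 0.25, 2.0
    u_star = b / (1 - a)                      # = 8/3
    u = 0.0                                   # initial condition u(0) = 0
    for k in range(1, 9):
        u = a * u + b
        print(k, round(u, 4), abs(u - u_star))    # error shrinks by the factor 1/4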
Let us now turn to the fully nonlinear case. First note that the fixed points of g(u)
correspond to the intersections of its graph with the graph of the function i(u) = u. For
instance Figure 1 shows the graph of a function that has 3 fixed points, labeled u⋆1 , u⋆2 , u⋆3 .
In general, near any point in its domain, a (smooth) nonlinear function can be well
approximated by its tangent line, which represents the graph of an affine function; see
Figure 2. Therefore, if we are close to a fixed point u⋆ , then we might expect the iterative
system based on the nonlinear function g(u) to behave very much like that of its affine
tangent line approximation. And, indeed, this intuition turns out to be essentially correct.
This result forms our first concrete example of linearization, in which the analysis of a
nonlinear system is based on its linear (or, more precisely, affine) approximation.
The explicit formula for the tangent line to g(u) near the fixed point u = u⋆ = g(u⋆ )
is
g(u) ≈ g(u⋆ ) + g ′ (u⋆ )(u − u⋆ ) ≡ a u + b, (2.12)
where
a = g′(u⋆), b = g(u⋆) − g′(u⋆) u⋆ = (1 − g′(u⋆)) u⋆.
Note that u⋆ = b /(1 − a) remains a fixed point for the affine approximation: a u⋆ + b =
u⋆ . According to the preceding discussion, the convergence of the iterates for the affine
approximation is governed by the size of the coefficient a = g ′ (u⋆ ). This observation
inspires the basic stability criterion for fixed points of scalar iterative systems.
Example 2.5. The equation
u = m + ǫ sin u (2.16)
is known as Kepler’s equation. It arises in the study of planetary motion, in which 0 < ǫ < 1
represents the eccentricity of an elliptical planetary orbit, u is the eccentric anomaly,
defined as the angle formed at the center of the ellipse by the planet and the major axis,
and m = 2 π t / T is its mean anomaly, namely the elapsed time since perihelion — the
planet's point of closest approach to the sun — measured in units of T /(2 π), where T is
the period of the orbit, i.e., the length of the planet's year; see Figure 3.
The solutions to Kepler’s equation are the fixed points of the discrete dynamical
system based on the function
g(u) = m + ǫ sin u.
Note that
| g ′ (u) | = | ǫ cos u | ≤ | ǫ | < 1, (2.17)
which automatically implies that the as yet unknown fixed point is stable. Indeed, condi-
tion (2.17) is enough to prove the existence of a unique stable fixed point; see Theorem 2.18
below. In the particular case m = ǫ = 1/2, the result of iterating u(k+1) = 1/2 + 1/2 sin u(k)
starting with u(0) = 1/2 is
k       0     1       2       3       4       5       6       7       8
u(k)    .5    .7397   .8370   .8713   .8826   .8862   .8873   .8877   .8878
After 12 iterations, we have converged sufficiently close to the solution (fixed point) u⋆ =
.887862 to have computed its value to 6 decimal places.
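The iteration is straightforward to reproduce on a computer; here is a minimal Python
sketch that regenerates the table above:

    import math

    # fixed point iteration u(k+1) = 1/2 + 1/2 sin u(k) for Kepler's equation
    # with m = eps = 1/2; by (2.17) the map is a contraction, so convergence is assured
    m, eps = 0.5, 0.5
    u = 0.5                           # initial guess u(0) = 1/2
    for k in range(1, 13):
        u = m + eps * math.sin(u)
        print(k, f"{u:.6f}")          # approaches the fixed point u* = .887862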
Figure 4. Graph of a function g(u) confined between the lines L±(u) near the fixed point u⋆.
Inspection of the proof of Theorem 2.4 reveals that we never really used the differen-
tiability of g, except to verify the inequality
| g(u) − g(u⋆ ) | ≤ σ | u − u⋆ | for some fixed σ < 1. (2.18)
A function that satisfies (2.18) for all nearby u is called a contraction at the point u⋆ . Any
function g(u) whose graph lies between the two lines
L± (u) = g(u⋆ ) ± σ (u − u⋆ ) for some σ < 1,
for all u sufficiently close to u⋆ , i.e., such that | u − u⋆ | < δ for some δ > 0, defines a
contraction, and hence fixed point iteration starting with | u(0) − u⋆ | < δ will converge to
u⋆ ; see Figure 4. In particular, any function that is differentiable at u⋆ with | g ′ (u⋆ ) | < 1
defines a contraction at u⋆ .
Example 2.6. The simplest truly nonlinear example is a quadratic polynomial. The
most important case is the so-called logistic map
g(u) = λ u(1 − u), (2.19)
where λ ≠ 0 is a fixed non-zero parameter. (The case λ = 0 is completely trivial. Why?)
In fact, an elementary change of variables can make any quadratic iterative system into
one involving a logistic map.
The fixed points of the logistic map are the solutions to the quadratic equation
u = λ u(1 − u), or λ u² + (1 − λ) u = 0.
Using the quadratic formula, we conclude that g(u) has two fixed points:
u⋆1 = 0, u⋆2 = 1 − 1/λ = (λ − 1)/λ.
† The term "chaotic" does have a precise mathematical definition, [5], but the reader can take
it more figuratively for the purposes of this elementary exposition.
That, however, is not the full story. Embedded within this chaotic regime are certain small ranges of λ where the system
settles down to a stable orbit, whose period is no longer necessarily a power of 2. In fact,
there exist values of λ for which the iterates settle down to a stable orbit of period k for any
positive integer k. For instance, as λ increases past λ3,⋆ ≈ 3.83, a period 3 orbit appears
over a small range of values, after which, as λ increases slightly further, there is a period
doubling cascade where period 6, 12, 24, . . . orbits successively appear, each persisting on
a shorter and shorter range of parameter values, until λ passes yet another critical value
where chaos breaks out yet again. There is a well-prescribed order in which the periodic
orbits make their successive appearance, and each odd period k orbit is followed by a very
closely spaced sequence of period doubling bifurcations, of periods 2n k for n = 1, 2, 3, . . . ,
after which the iterates revert to completely chaotic behavior until the next periodic case
emerges. The ratios of distances between bifurcation points always have the same Feigen-
baum limit (2.20). Finally, these periodic and chaotic windows all pile up on the ultimate
parameter value λ⋆⋆ = 4. And then, when λ > 4, all the iterates go off to ∞, and the
system ceases to be interesting.
The reader is encouraged to write a simple computer program and perform some
numerical experiments. In particular, Figure 6 shows the asymptotic behavior of the
iterates for values of the parameter in the interesting range 2 < λ < 4. The horizontal
axis is λ, and the marked points show the ultimate fate of the iteration for the given
value of λ. For instance, each point on the single curve lying above the smaller values of
λ represents a stable fixed point; this bifurcates into a pair of curves representing stable
period 2 orbits, which then bifurcates into 4 curves representing period 4 orbits, and so
on. Chaotic behavior is indicated by a somewhat random pattern of points lying above the
value of λ. To plot this figure, we ran the logistic iteration u(n) for 0 ≤ n ≤ 100, discarded
the first 50 points, and then plotted the next 50 iterates u(51) , . . . , u(100) . Investigation of
the fine detailed structure of the logistic map requires yet more iterations with increased
numerical accuracy. In addition, one should discard more of the initial iterates so as to give
the system adequate time to settle down onto its eventual asymptotic behavior.
† The degree of precision is to be specified by the user and the application.
‡ Note that since σ < 1, the logarithm log₁₀ σ⁻¹ = − log₁₀ σ > 0 is positive.
Theorem 2.8. Suppose that† g ∈ C2 , and u⋆ = g(u⋆ ) is a fixed point such that
g ′ (u⋆ ) = 0. Then, for all iterates u(k) sufficiently close to u⋆ , the errors e(k) = u(k) − u⋆
satisfy the quadratic convergence estimate
| e(k+1) | ≤ τ | e(k) |² for some constant τ ≥ 0. (2.21)
Proof : Just as that of the linear convergence estimate (2.15), the proof relies on
approximating g(u) by a simpler function near the fixed point. For linear convergence, an
affine approximation sufficed, but here we require a higher order approximation. Thus, we
replace the mean value formula (2.13) by the first order Taylor expansion
g(u) = g(u⋆) + g′(u⋆) (u − u⋆) + 1/2 g′′(w) (u − u⋆)², (2.22)
where the final error term depends on an (unknown) point w that lies between u and u⋆ .
At a fixed point, the constant term is g(u⋆ ) = u⋆ . Furthermore, under our hypothesis
g ′ (u⋆ ) = 0, and so (2.22) reduces to
g(u) − u⋆ = 1/2 g′′(w) (u − u⋆)².
Therefore,
| g(u) − u⋆ | ≤ τ | u − u⋆ |², (2.23)
where τ bounds 1/2 | g′′(w) | for all w sufficiently close to u⋆ . Thus, the magnitude of τ is
governed by the size of the second derivative of the iterative function g(u) near the fixed
point. We use the inequality (2.23) to estimate the error
| e(k+1) | = | u(k+1) − u⋆ | = | g(u(k)) − u⋆ | ≤ τ | u(k) − u⋆ |² = τ | e(k) |²,
which establishes the quadratic convergence estimate (2.21). Q.E.D.
† The notation g ∈ C² means that g(u) is twice continuously differentiable, i.e., g, g ′ , g ′′ are all
defined and continuous near the fixed point u⋆ .
Consider, for example, the iterative scheme based on the function
g(u) = (2 u³ + 3)/(3 u² + 3).
There is a unique (real) fixed point u⋆ = g(u⋆ ), which is the real solution to the cubic
equation
(1/3) u³ + u − 1 = 0.
Note that
g′(u) = (2 u⁴ + 6 u² − 6 u)/(3 (u² + 1)²) = 6 u ((1/3) u³ + u − 1)/(3 (u² + 1)²),
and hence g′ vanishes at the fixed point: g′(u⋆) = 0. Theorem 2.8 implies that the iterations
will exhibit quadratic convergence to the root. Indeed, we find, starting with u(0) = 0, the
following values:
k       1                    2                   3
u(k)    1.000000000000000    .833333333333333    .817850637522769

k       4                    5                   6
u(k)    .817731680821982     .817731673886824    .817731673886824
The convergence rate is dramatic: after only 5 iterations, we have produced the first 15
decimal places of the fixed point. In contrast, the linearly convergent scheme based on
g̃(u) = 1 − (1/3) u³ takes 29 iterations just to produce the first 5 decimal places of the same
solution.
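The contrast between the two rates is striking when the schemes are run side by side; here
is a quick Python comparison using the two iterative functions just discussed:

    # quadratically convergent g versus linearly convergent g~ for the cubic
    # (1/3) u^3 + u - 1 = 0; both share the fixed point u* = .817731673886824...
    g  = lambda u: (2*u**3 + 3) / (3*u**2 + 3)   # g'(u*) = 0: quadratic convergence
    gt = lambda u: 1 - u**3 / 3                  # |g~'(u*)| < 1: linear convergence

    u, v = 0.0, 0.0
    for k in range(1, 7):
        u, v = g(u), gt(v)
        print(k, f"{u:.15f}", f"{v:.15f}")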
Vector–Valued Iteration
Extending the scalar theory, a vector-valued function g: R n → R n is said to be a contraction
at the point u⋆ if
‖ g(u) − g(u⋆) ‖ ≤ σ ‖ u − u⋆ ‖ for some fixed σ < 1, (2.25)
for all u sufficiently close to u⋆ , i.e., ‖ u − u⋆ ‖ < δ for some fixed δ > 0.
Theorem 2.11. If u⋆ = g(u⋆ ) is a fixed point for the discrete dynamical system
(2.1) and g is a contraction at u⋆ , then u⋆ is an asymptotically stable fixed point.
Proof : The proof is a copy of the last part of the proof of Theorem 2.4. We write
‖ u(k+1) − u⋆ ‖ = ‖ g(u(k)) − g(u⋆) ‖ ≤ σ ‖ u(k) − u⋆ ‖,
using the assumed estimate (2.25). Iterating this basic inequality immediately demon-
strates that
‖ u(k) − u⋆ ‖ ≤ σ^k ‖ u(0) − u⋆ ‖.
Since σ < 1, the right hand side tends to 0 as k → ∞, and hence u(k) → u⋆ . Q.E.D.
As in the scalar case, near the fixed point we can approximate g by its affine Taylor
approximation
g(u) ≈ g(u⋆) + g′(u⋆) (u − u⋆), (2.27)
where g′(u) denotes the n × n Jacobian matrix of the vector-valued function g, whose entries are the
partial derivatives of its individual components. Since u⋆ is fixed, the right hand
side of (2.27) is an affine function of u. Moreover, u⋆ remains a fixed point of the affine
approximation. Proposition 10.44 of [15] tells us that iteration of the affine function
will converge to the fixed point if and only if its coefficient matrix, namely g ′ (u⋆ ), is
a convergent matrix, meaning that its spectral radius ρ(g ′ (u⋆ )) < 1. This observation
motivates the following theorem and corollary.
Theorem 2.12. Let u⋆ be a fixed point for the discrete dynamical system u(k+1) =
g(u(k)). If the Jacobian matrix norm ‖ g′(u⋆) ‖ < 1, then g is a contraction at u⋆ , and
hence u⋆ is an asymptotically stable fixed point.
There are four (real) fixed points; stability is determined by the size of the eigenvalues of
the Jacobian matrix
g′(u, v) = ( 9/8 − (3/4) u²   (3/4) v² ; −(1/2) v   3/4 − (1/2) u )
† For linear iterative systems, the stable manifold of the origin coincides with the stable sub-
space spanned by the real and imaginary parts of eigenvectors corresponding to the stable eigen-
values of modulus | λ | < 1. In the nonlinear case, the stable manifold at u⋆ is tangent to the
stable subspace of its Jacobian matrix. Practically detecting such convergent points whose iter-
ates eventually lie on the stable manifold is rather challenging, since (almost) any small numerical
error can dislodge the iterate off the stable submanifold, causing it to eventually go away from
the fixed point again.
at each of the fixed points. The results are summarized in the following table:
fixed point            Jacobian matrix
u⋆1 = ( 0, 0 )ᵀ        ( 9/8   0 ; 0   3/4 )
u⋆2 = ( 1/√2, 0 )ᵀ     ( 3/4   0 ; 0   3/4 − 1/(2√2) )
u⋆3 = ( −1/√2, 0 )ᵀ    ( 3/4   0 ; 0   3/4 + 1/(2√2) )
u⋆4 = ( −1/2, 1/2 )ᵀ   ( 15/16   3/16 ; −1/4   1 )
Thus, u⋆2 and u⋆4 are stable fixed points, whereas u⋆1 and u⋆3 are both unstable. Indeed,
starting with u(0) = ( .5, .5 )ᵀ, it takes 24 iterates to converge to u⋆2 with 4 significant
decimal digits, whereas starting with u(0) = ( −.7, .7 )ᵀ, it takes 1049 iterates to converge
to within 4 digits of u⋆4 ; the slower convergence rate is predicted by the larger Jacobian
spectral radius. The two basins of attraction are plotted in Figure 7. The stable fixed
points are indicated by black dots. The light gray region contains u⋆2 and indicates all the
points that converge to it; the darker gray indicates points converging, more slowly, to u⋆4 .
All other initial points, except u⋆1 and u⋆3 , have rapidly unbounded iterates: ‖ u(k) ‖ → ∞.
The smaller the spectral radius or matrix norm of the Jacobian matrix at the fixed
point, the faster the nearby iterates will converge to it. As in the scalar case, quadratic
convergence will occur when the Jacobian matrix g ′ (u⋆ ) = O is the zero matrix, i.e.,
all first order partial derivatives of the components of g vanish at the fixed point. The
quadratic convergence estimate
‖ u(k+1) − u⋆ ‖ ≤ τ ‖ u(k) − u⋆ ‖² (2.30)
is a consequence of the second order Taylor expansion at the fixed point. Details of the
proof are left as an exercise.
Of course, in practice we don’t know the norm or spectral radius of the Jacobian
matrix g ′ (u⋆ ) because we don’t know where the fixed point is. This apparent difficulty
can be easily circumvented by requiring that ‖ g′(u) ‖ < 1 for all u — or, at least, for all
u in a domain Ω containing the fixed point. In fact, this hypothesis can be used to prove
the existence and uniqueness of asymptotically stable fixed points. Rather than work with
the Jacobian matrix, let us return to the contraction condition (2.25), but now imposed
uniformly on an entire domain.
Definition 2.16. A function g: R n → R n is called a contraction mapping on a
domain Ω ⊂ R n if
(a) it maps Ω to itself, so g(u) ∈ Ω whenever u ∈ Ω, and
(b) there exists a constant 0 ≤ σ < 1 such that
‖ g(u) − g(v) ‖ ≤ σ ‖ u − v ‖ for all u, v ∈ Ω.
In particular, if ũ⋆ = g(ũ⋆) ∈ Ω is any other fixed point, then
0 ≤ ‖ ũ⋆ − u⋆ ‖ = ‖ g(ũ⋆) − g(u⋆) ‖ ≤ σ ‖ ũ⋆ − u⋆ ‖ < ‖ ũ⋆ − u⋆ ‖,
which implies ‖ ũ⋆ − u⋆ ‖ = 0 and hence ũ⋆ = u⋆ , proving the result. Q.E.D.
Example 2.19. The function
g(u) = u + (1/2) π − tan⁻¹ u satisfies | g′(u) | = 1 − 1/(1 + u²) < 1
for all u ∈ R, and hence defines a contraction mapping. However, g(u) has no fixed point.
Why does this not contradict Theorem 2.18?
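Numerically, the iterates of this map creep off to infinity, ever more slowly, instead of
converging; a quick check:

    import math

    # g(u) = u + pi/2 - arctan u satisfies |g'(u)| < 1 everywhere, yet
    # g(u) - u = pi/2 - arctan u > 0 for all u, so there is no fixed point
    g = lambda u: u + math.pi/2 - math.atan(u)
    u = 0.0
    for k in range(1, 1001):
        u = g(u)
        if k in (1, 10, 100, 1000):
            print(k, round(u, 4))         # the iterates grow without bound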
3. Solution of Equations.
Consider, as a specific example, the quintic equation
u⁵ + u + 1 = 0. (3.2)
Figure 9. Graph of u5 + u + 1.
Graphing the left hand side of the equation, as in Figure 9, convinces us that there is just
one real root, lying somewhere between −1 and −.5. While there are explicit algebraic
formulas for the roots of quadratic, cubic, and quartic polynomials, a famous theorem† due
to the Norwegian mathematician Niels Henrik Abel in the early 1800's states that there is
no such formula for generic fifth order polynomial equations.
(b) Any fixed point equation u = g(u) has the form (3.4) where f (u) = u − g(u). For
example, the trigonometric Kepler equation
u − ǫ sin u = m
arises in the study of planetary motion, cf. Example 2.5. Here ǫ, m are fixed constants,
and we seek a corresponding solution u.
(c) Suppose we are given chemical compounds A, B, C that react to produce a fourth
compound D according to
2 A + B ←→ D, A + 3 C ←→ D.
Let a, b, c be the initial concentrations of the reagents A, B, C injected into the reaction
chamber. If u denotes the concentration of D produced by the first reaction, and v that
by the second reaction, then the final equilibrium concentrations
a⋆ = a − 2 u − v, b⋆ = b − u, c⋆ = c − 3 v, d⋆ = u + v,
of the reagents will be determined by solving the nonlinear system
† A modern proof of this fact relies on Galois theory, [7].
In all of the cases just described, then, we are required to solve an equation of the basic
form
f (u) = 0. (3.4)
Our immediate goal is to develop numerical algorithms for solving such nonlinear scalar
equations. The most primitive algorithm, and the only one that is guaranteed to work in
all cases, is the Bisection Method. While it has an iterative flavor, it cannot be properly
classed as a method governed by functional iteration as defined in the preceding section,
and so must be studied directly in its own right.
The starting point is the Intermediate Value Theorem, which we state in simplified
form. See Figure 10 for an illustration, and [14] for a proof.
Lemma 3.1. Let f (u) be a continuous scalar function. Suppose we can find two
points a < b where the values of f (a) and f (b) take opposite signs, so either f (a) < 0 and
f (b) > 0, or f (a) > 0 and f (b) < 0. Then there exists at least one point a < u⋆ < b where
f (u⋆ ) = 0.
Note that if f (a) = 0 or f (b) = 0, then finding a root is trivial. If f (a) and f (b)
have the same sign, then there may or may not be a root in between. Figure 11 plots
the functions u2 + 1, u2 and u2 − 1, on the interval −2 ≤ u ≤ 2. The first has two simple
roots; the second has a single double root, while the third has no root. We also note
† Complex roots to complex equations will be discussed later.
Figure 11. Graphs of u² + 1, u², and u² − 1.
that continuity of the function on the entire interval [ a, b ] is an essential hypothesis. For
example, the function f (u) = 1/u satisfies f (−1) = −1 and f (1) = 1, but there is no root
to the equation 1/u = 0.
Note carefully that Lemma 3.1 does not say there is a unique root between a and
b. There may be many roots, or even, in pathological examples, infinitely many. All the
theorem guarantees is that, under the stated hypotheses, there is at least one root.
Once we are assured that a root exists, bisection relies on a “divide and conquer”
strategy. The goal is to locate a root a < u⋆ < b between the endpoints. Lacking any
additional evidence, one tactic would be to try the midpoint c = 1/2 (a + b) as a first guess
for the root. If, by some miracle, f (c) = 0, then we are done, since we have found a
solution! Otherwise (and typically) we look at the sign of f (c). There are two possibilities.
If f (a) and f (c) are of opposite signs, then the Intermediate Value Theorem tells us that
there is a root u⋆ lying between a < u⋆ < c. Otherwise, f (c) and f (b) must have opposite
signs, and so there is a root c < u⋆ < b. In either event, we apply the same method to
the interval in which we are assured a root lies, and repeat the procedure. Each iteration
halves the length of the interval, and chooses the half in which a root is sure to lie. (There
may, of course, be a root in the other half interval, but as we cannot be sure, we discard
it from further consideration.) The root we home in on lies trapped in intervals of smaller
and smaller width, and so convergence of the method is guaranteed.
Example 3.2. The roots of the quadratic equation
f (u) = u2 + u − 3 = 0
can be computed exactly by the quadratic formula:
u⋆1 = (−1 + √13)/2 ≈ 1.302775 . . . , u⋆2 = (−1 − √13)/2 ≈ −2.302775 . . . .
Let us see how one might approximate them by applying the Bisection Algorithm. We start
the procedure by choosing the points a = u(0) = 1, b = v (0) = 2, noting that f (1) = −1
and f (2) = 3 have opposite signs and hence we are guaranteed that there is at least one
root between 1 and 2. In the first step we look at the midpoint of the interval [ 1, 2 ],
which is 1.5, and evaluate f (1.5) = .75. Since f (1) = −1 and f (1.5) = .75 have opposite
signs, we know that there is a root lying between 1 and 1.5. Thus, we take u(1) = 1 and
v (1) = 1.5 as the endpoints of the next interval, and continue. The next midpoint is at
1.25, where f (1.25) = −.1875 has the opposite sign to f (1.5) = .75, and so a root lies
between u(2) = 1.25 and v (2) = 1.5. The process is then iterated as long as desired — or,
more practically, as long as your computer's precision does not become an issue.

k      u(k)      v(k)      w(k) = midpoint    f(w(k))
0      1         2         1.5                 .75
1      1         1.5       1.25               −.1875
2      1.25      1.5       1.375               .2656
3      1.25      1.375     1.3125              .0352
4      1.25      1.3125    1.2813             −.0771
5      1.2813    1.3125    1.2969             −.0212
6      1.2969    1.3125    1.3047              .0069
7      1.2969    1.3047    1.3008             −.0072
8      1.3008    1.3047    1.3027             −.0002
9      1.3027    1.3047    1.3037              .0034
10     1.3027    1.3037    1.3032              .0016
11     1.3027    1.3032    1.3030              .0007
12     1.3027    1.3030    1.3029              .0003
13     1.3027    1.3029    1.3028              .0001
14     1.3027    1.3028    1.3028             −.0000
The table displays the result of the algorithm, rounded off to four decimal places.
After 14 iterations, the Bisection Method has correctly computed the first four decimal
digits of the positive root u⋆1 . A similar bisection starting with the interval from u(1) = − 3
to v (1) = − 2 will produce the negative root.
A formal implementation of the Bisection Algorithm appears in the accompanying
pseudocode program. The endpoints of the k th interval are denoted by u(k) and v (k) . The
midpoint is w(k) = 1/2 (u(k) + v (k) ), and the key decision is whether w(k) should be the
right or left hand endpoint of the next interval. The integer n, governing the number of
iterations, is to be prescribed in accordance with how accurately we wish to approximate
the root u⋆ .
The algorithm produces two sequences of approximations u(k) and v (k) that both
converge monotonically to u⋆ , one from below and the other from above:
v (k) − u(k) = 1/2 (v (k−1) − u(k−1) ).
start
    if f (a) f (b) < 0 set u(0) = a, v (0) = b
    else print "Bisection Method not applicable"
    for k = 0 to n − 1
        set w(k) = 1/2 (u(k) + v (k) )
        if f (w(k) ) = 0, stop; print u⋆ = w(k)
        if f (u(k) ) f (w(k) ) < 0, set u(k+1) = u(k) , v (k+1) = w(k)
        else set u(k+1) = w(k) , v (k+1) = v (k)
    next k
    print u⋆ = w(n) = 1/2 (u(n) + v (n) )
end
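For concreteness, here is one possible rendering of the pseudocode as a Python function;
the function name and argument conventions are ours:

    def bisection(f, a, b, n):
        # n bisection steps on [a, b]; requires f(a), f(b) of opposite signs
        if f(a) * f(b) >= 0:
            raise ValueError("Bisection Method not applicable")
        u, v = a, b
        for k in range(n):
            w = (u + v) / 2               # midpoint of the current interval
            if f(w) == 0:                 # lucky hit: exact root found
                return w
            if f(u) * f(w) < 0:           # root lies in the left half
                v = w
            else:                         # root lies in the right half
                u = w
        return (u + v) / 2                # final midpoint approximation

    # Example 3.2: f(u) = u^2 + u - 3 on [1, 2]
    print(bisection(lambda u: u*u + u - 3, 1.0, 2.0, 14))    # approx. 1.3028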
The midpoint
w(n) = 1/2 (u(n) + v (n) )
of the final interval lies within a distance 2^(−n−1) (b − a) of the root. Consequently, if we
desire to approximate the root within a prescribed tolerance
ε, we should choose the number of iterations n so that
2^(−n−1) (b − a) < ε, or n > log₂((b − a)/ε) − 1. (3.5)
Summarizing:
Theorem 3.3. If f (u) is a continuous function, with f (a) f (b) < 0, then the Bisec-
tion Method starting with u(0) = a, v (0) = b, will converge to a solution u⋆ to the equation
f (u) = 0 lying between a and b. After n steps, the midpoint w(n) = 1/2 (u(n) + v (n) ) will be
within a distance of ε = 2^(−n−1) (b − a) from the solution.
For example, in the case of the quadratic equation in Example 3.2, after 14 iterations,
we have approximated the positive root to within
ε = 2⁻¹⁵ (2 − 1) ≈ 3.052 × 10⁻⁵, in accordance with the four displayed decimal digits.
Similarly, the quintic equation (3.2),
f (u) = u⁵ + u + 1 = 0,
has one real root, whose value can be readily computed by bisection. We start the algorithm
with the initial points u(0) = − 1, v (0) = 0, noting that f (− 1) = − 1 < 0 while f (0) = 1 > 0
are of opposite signs. In order to compute the root to 6 decimal places, we set ε = .5 × 10⁻⁶
in (3.5), and so need to perform n = 20 > 19.93 ≈ log₂(2 × 10⁶) − 1 bisections. Indeed,
the algorithm produces the approximation u⋆ ≈ − .754878 to the root, and the displayed
digits are guaranteed to be accurate.
For the cubic equation
f (u) = u³ − u − 1 = 0 (3.6)
we note that f (1) = −1 while f (2) = 5, and so there is a root between 1 and 2. Indeed,
the Bisection Method leads to the approximate value u⋆ ≈ 1.3247 after 17 iterations.
† This assumes we have sufficient precision on the computer to avoid round-off errors.
One option is to rewrite the equation (3.6) in the fixed point form
u = u³ − 1 = g̃(u).
Starting with the initial guess u(0) = 1.5, successive approximations to the solution are
found by iterating
u(k+1) = g̃(u(k) ) = (u(k) )³ − 1, k = 0, 1, 2, . . . .
However, their values rapidly become unbounded, and the scheme fails to converge — a
reflection of the fact that | g̃′(u⋆) | = 3 (u⋆)² ≈ 5.26 > 1, so the fixed point is unstable.
Newton’s Method
Our immediate goal is to design an efficient iterative scheme u(k+1) = g(u(k) ) whose
iterates converge rapidly to the solution of the given scalar equation f (u) = 0. As we
learned in Section 2, the convergence of the iteration is governed by the magnitude of its
derivative at the fixed point. At the very least, we should impose the stability criterion
| g ′ (u⋆ ) | < 1, and the smaller this quantity can be made, the faster the iterative scheme
converges. If we are able to arrange that g ′ (u⋆ ) = 0, then the iterates will converge
quadratically fast, leading, as noted in the discussion following Theorem 2.8, to a dramatic
improvement in speed and efficiency.
Now, the first condition requires that g(u) = u whenever f (u) = 0. A little thought
will convince you that the iterative function should take the form
g(u) = u − h(u) f (u),
where h(u) is a suitably chosen, nonzero function, so that the fixed points of g coincide
with the roots of f . To also achieve the quadratic convergence condition g ′ (u⋆ ) = 0 at a
simple root, we compute g ′ = 1 − h′ f − h f ′ , which, since f (u⋆ ) = 0, requires
h(u) = 1/f ′ (u), (3.9)
which certainly guarantees that it holds at the solution u⋆ . The result is the function
g(u) = u − f (u)/f ′ (u), (3.10)
and the resulting iteration scheme is known as Newton’s Method , which, as the name
suggests, dates back to the founder of the calculus. To this day, Newton’s Method remains
the most important general purpose algorithm for solving equations. It starts with an
initial guess u(0) to be supplied by the user, and then successively computes
u(k+1) = u(k) − f (u(k) )/f ′ (u(k) ). (3.11)
As long as the initial guess is sufficiently close, the iterates u(k) are guaranteed to converge,
quadratically fast, to the (simple) root u⋆ of the equation f (u) = 0.
† This assumes we are working in a sufficiently high precision arithmetic so as to avoid round-off
errors.
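A bare-bones Python implementation of the scheme (3.11), applied here to the cubic
equation (3.6), might read as follows (the names are ours):

    def newton(f, fprime, u, n):
        # n steps of Newton's Method u <- u - f(u)/f'(u) from the initial guess u
        for k in range(n):
            u = u - f(u) / fprime(u)      # assumes f'(u(k)) stays away from zero
        return u

    # equation (3.6): u^3 - u - 1 = 0, with root u* = 1.3247...
    root = newton(lambda u: u**3 - u - 1, lambda u: 3*u**2 - 1, 1.5, 6)
    print(root)                           # 1.324717957244746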
Incidentally, if we start with the interval [ 0, 1 ] and apply bisection, we converge (perhaps
surprisingly) to the largest root u⋆3 in 17 iterations.
Fixed point iteration based on the formulation
u = g(u) = − u³ + (3/2) u² + (4/9) u + 1/27
can be used to find the first and third roots, but not the second root. For instance, starting
with u(0) = 0 produces u⋆1 to 5 decimal places after 23 iterations, whereas starting with
u(0) = 1 produces u⋆3 to 5 decimal places after 14 iterations. The reason we cannot produce
u⋆2 is due to the magnitude of the derivative
g ′ (u) = − 3 u² + 3 u + 4/9
at the fixed points: one finds | g ′ (u⋆2 ) | > 1, and so u⋆2 is an unstable fixed point that
repels all nearby iterates.
On the other hand, Newton's Method is based on the iterative function
g(u) = u − f (u)/f ′ (u) = u − (u³ − (3/2) u² + (5/9) u − 1/27) / (3 u² − 3 u + 5/9).
Starting with an initial guess of u(0) = 0, the method computes u⋆1 to 6 decimal places
after only 4 iterations; starting with u(0) = .5, it produces u⋆2 to similar accuracy after 2
iterations; while starting with u(0) = 1 produces u⋆3 after 3 iterations — a dramatic speed
up over the other two methods.
Newton's Method has a very pretty graphical interpretation that helps us understand
what is going on and why it converges so fast. Given the equation f (u) = 0, suppose we
know an approximate value u = u(k) for a solution. Nearby u(k) , we can approximate the
nonlinear function f (u) by its tangent line
y = f (u(k) ) + f ′ (u(k) ) (u − u(k) ).
As long as the tangent line is not horizontal — which requires f ′ (u(k) ) ≠ 0 — it crosses
the axis at
u(k+1) = u(k) − f (u(k) )/f ′ (u(k) ),
which represents a new, and presumably more accurate, approximation to the desired
root. The procedure is illustrated pictorially in Figure 13. Note that the passage from
u(k) to u(k+1) is exactly the Newton iteration step (3.11). Thus, Newtonian iteration is
the same as the approximation of the function's root by those of its successive tangent lines.
Given a sufficiently accurate initial guess, Newton’s Method will rapidly produce
highly accurate values for the simple roots to the equation in question. In practice, barring
some kind of special exploitable structure, Newton’s Method is the root-finding algorithm
of choice. The one caveat is that we need to start the process reasonably close to the
root we are seeking. Otherwise, there is no guarantee that a particular set of iterates will
converge, although if they do, the limiting value is necessarily a root of our equation. The
behavior of Newton's Method as we change parameters and vary the initial guess is very
similar to the simpler logistic map that we studied in Section 2, including period doubling
bifurcations and chaotic behavior. The reader is invited to experiment with simple
examples; further details can be found in [16].
u − ǫ sin u = m (3.13)
Figure 14. The Solution to the Kepler Equation for Eccentricity ǫ = .5.
Systems of Equations
Let us now turn our attention to nonlinear systems of equations. We shall only
consider the case when there are the same number of equations as unknowns:
f1 (u1 , . . . , un ) = 0, ... fn (u1 , . . . , un ) = 0. (3.14)
We shall rewrite the system in vector form
f (u) = 0, (3.15)
where f (u) = ( f1 (u), . . . , fn (u) )ᵀ. A solution u⋆ is said to be isolated if there are no
other solutions arbitrarily close by. For example, consider the equation
x² + y² = (x² + y²)².
Rewriting the equation in polar coordinates as
r = r2 or r(r − 1) = 0,
we immediately see that the solutions consist of the origin x = y = 0 and all points on the
unit circle r 2 = x2 + y 2 = 1. Only the origin is an isolated solution, since every solution
lying on the circle has plenty of other points on the circle that lie arbitrarily close to it.
Typically, solutions to a system of n equations in n unknowns are isolated, although
this is not always true. For example, if A is a singular n × n matrix, then the solutions to
the homogeneous linear system A u = 0 form a nontrivial subspace, and so are not isolated.
Nonlinear systems with non-isolated solutions can similarly be viewed as exhibiting some
form of degeneracy. In general, the numerical computation of non-isolated solutions, e.g.,
solving the implicit equations for a curve or surface, is a much more difficult problem, and
we will not attempt to discuss these issues in this introductory presentation. (However,
our continuation approach to the Kepler equation in Example 3.9 indicates how one might
proceed in such situations.)
In the case of a single scalar equation, the simple roots, meaning those for which
f ′ (u⋆ ) ≠ 0, are the easiest to compute. In higher dimensions, the role of the derivative
of the function is played by the Jacobian matrix (2.28), and this motivates the following
definition.
of the function is played by the Jacobian matrix (2.28), and this motivates the following
definition.
Definition 3.12. A solution u⋆ to a system f (u) = 0 is called nonsingular if the
associated Jacobian matrix is nonsingular there: det f ′ (u⋆ ) ≠ 0.
Note that the Jacobian matrix is square if and only if the system has the same number
of equations as unknowns, which is thus one of the requirements for a solution to be
nonsingular in our sense. Moreover, the Inverse Function Theorem from multivariable
calculus, [2, 12], implies that a nonsingular solution is necessarily isolated.
As in the scalar case, we seek a fixed point scheme of the form g(u) = u − L(u) f (u),
where L(u) denotes an n × n matrix-valued function. The quadratic convergence criterion
g ′ (u⋆ ) = O requires that
L(u⋆ ) = f ′ (u⋆ )⁻¹, (3.21)
i.e., that L, evaluated at the solution, should be the inverse of the Jacobian matrix of f at
the solution, which, fortuitously, was already assumed to be nonsingular.
As in the scalar case, we don’t know the solution u⋆ , but we can arrange that condition
(3.21) holds by setting
L(u) = f ′ (u)⁻¹
everywhere — or at least everywhere that f has a nonsingular Jacobian matrix. The
resulting fixed point system
u = g(u) = u − f ′ (u)−1 f (u), (3.22)
leads to the quadratically convergent Newton iteration scheme
u(k+1) = u(k) − f ′ (u(k) )−1 f (u(k) ). (3.23)
All it requires is that we guess an initial value u(0) that is sufficiently close to the desired
solution u⋆ . We are then guaranteed that the iterates u(k) will converge quadratically fast
to u⋆ .
Theorem 3.15. Let u⋆ be a nonsingular solution to the system f (u) = 0. Then,
provided u(0) is sufficiently close to u⋆ , the Newton iteration scheme (3.23) converges at a
quadratic rate to the solution: u(k) → u⋆ .
Here, the function f (α, β) = ( ℓ cos α + m cos β − a, ℓ sin α + m sin β − b )ᵀ has Jacobian
matrix
f ′ (α, β) = ( −ℓ sin α   −m sin β ; ℓ cos α   m cos β ),
with determinant det f ′ (α, β) = ℓ m sin(β − α), and inverse
f ′ (α, β)⁻¹ = 1/(ℓ m sin(β − α)) ( m cos β   m sin β ; −ℓ cos α   −ℓ sin α ).
As a result, the Newton iteration equation (3.23) has the explicit form
( α(k+1) ; β (k+1) ) = ( α(k) ; β (k) ) −
1/(ℓ m sin(β (k) − α(k) )) ( m cos β (k)   m sin β (k) ; −ℓ cos α(k)   −ℓ sin α(k) )
( ℓ cos α(k) + m cos β (k) − a ; ℓ sin α(k) + m sin β (k) − b ).
When running the iteration, one must be careful to avoid points at which α(k) − β (k) = 0
or π, i.e., where the robot arm has straightened out.
As an example, let us assume that the rods have lengths ℓ = 2, m = 1, and the
desired location of the hand is at a = ( 1, 1 )ᵀ. We start with an initial guess of α(0) = 0,
β (0) = (1/2) π, so the first rod lies along the x–axis and the second is perpendicular. The first
few Newton iterates are given in the accompanying table. The first column is the iterate
number k; the second and third columns indicate the angles α(k) , β (k) of the rods. The
fourth and fifth give the position (x(k) , y (k) )ᵀ of the joint or elbow, while the final two
indicate the position (z (k) , w(k) )ᵀ of the robot's hand.
Observe that the robot has rapidly converged to one of the two possible configurations.
(Can you figure out what the second equilibrium is?) In general, convergence depends on
the choice of initial configuration, and the Newton iterates do not always settle down to
a fixed point. For instance, if ‖ a ‖ > ℓ + m, there is no possible solution, since the arms
are too short for the hand to reach the desired location; thus, no choice of initial conditions
will lead to a convergent scheme and the robot arm flaps around in a chaotic manner.
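To experiment with the robot arm on a computer, one can implement the Newton step
(3.23) with a linear solve in place of the explicit matrix inverse — anticipating the remark
below. This sketch uses numpy, with the hand-position equations as derived above:

    import numpy as np

    l, m = 2.0, 1.0                       # rod lengths
    a = np.array([1.0, 1.0])              # desired hand position

    def f(t):                             # t = (alpha, beta)
        al, be = t
        return np.array([l*np.cos(al) + m*np.cos(be),
                         l*np.sin(al) + m*np.sin(be)]) - a

    def jac(t):                           # Jacobian matrix f'(alpha, beta)
        al, be = t
        return np.array([[-l*np.sin(al), -m*np.sin(be)],
                         [ l*np.cos(al),  m*np.cos(be)]])

    t = np.array([0.0, np.pi/2])          # initial guess: alpha = 0, beta = pi/2
    for k in range(8):
        t = t - np.linalg.solve(jac(t), f(t))    # Newton step (3.23)
        print(k + 1, t)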
Now that we have gained a little experience with Newton’s Method for systems of
equations, some supplementary remarks are in order. As we know, [15], except perhaps
in very low-dimensional situations, one should not directly invert a matrix, but rather use
Gaussian elimination, or, in favorable situations, a linear iterative scheme, e.g., Jacobi,
Gauss–Seidel or even SOR. So a better strategy is to leave the Newton system (3.23) in
unsolved, implicit form
f ′ (u(k) ) v(k) = − f (u(k) ), where v(k) = u(k+1) − u(k) ,
and solve the linear system for the increment v(k) by Gaussian elimination, setting
u(k+1) = u(k) + v(k) .
4. Optimization.
We have already noted the importance of quadratic minimization principles for char-
acterizing the equilibrium solutions of linear systems of physical significance. In nonlinear
systems, optimization — either maximization or minimization — retains its centrality, and
the wealth of practical applications has spawned an entire sub-discipline of applied mathe-
matics. Physical systems naturally seek to minimize the potential energy function, and so
determination of the possible equilibrium configurations requires solving a nonlinear min-
imization principle. Engineering design is guided by a variety of optimization constraints,
such as performance, longevity, safety, and cost. Non-quadratic minimization principles
also arise in the fitting of data by schemes that go beyond the simple linear least squares
approximation method discussed in [15; Section 4.3]. Additional applications naturally
appear in economics and financial mathematics — one often wishes to minimize expenses
or maximize profits, in biological and ecological systems, in pattern recognition and signal
processing, in statistics, and so on. In this section, we will describe the basic mathematics
underlying simple nonlinear optimization problems along with basic numerical techniques.
The Objective Function
Throughout this section, the real-valued function F (u) = F (u1 , . . . , un ) to be op-
timized — the energy, cost, entropy, performance, etc. — will be called the objective
function. As such, it depends upon one or more variables u = ( u1 , u2 , . . . , un )ᵀ that
belong to a prescribed subset Ω ⊂ R n .
Definition 4.1. A point u⋆ ∈ Ω is a global minimum of the objective function F (u)
on the domain Ω if
F (u⋆ ) ≤ F (u) for all u ∈ Ω. (4.1)
The minimum is called strict if
F (u⋆ ) < F (u) for all u ≠ u⋆ in Ω.
Remark : In fact, any system of equations can be readily converted into a minimization
principle. Given a system f (u) = 0, we introduce the objective function
F (u) = ‖ f (u) ‖,
where ‖ · ‖ is any convenient norm on R n . By the basic properties of the norm, the
minimum value is F (u) = 0, and this is achieved if and only if f (u) = 0, i.e., at a solution
to the system. More generally, if there is no solution to the system, the minimizer(s) of
F (u) play the role of a least squares solution, at least for an inner product-based norm,
along with the extensions to more general norms.
Although Theorem 4.2 assures us of the existence of a global minimum of any contin-
uous function on a bounded domain, it does not guarantee uniqueness, nor does it indicate
how to go about finding it. Just as with the solution of nonlinear systems of equations, it
is quite rare that one can extract explicit formulae for the minima of non-quadratic func-
tions. Our goal, then, is to formulate practical algorithms that can accurately compute
the minima of general nonlinear functions.
The most naïve algorithm, but one that is often successful in small scale problems,
[17], is to select a reasonably dense set of sample points u(k) in the domain and choose
the one that provides the smallest value for F (u(k) ). If the points are sufficiently densely
distributed and the function is not too wild, this will give a reasonable approximation to
the minimum. The algorithm can be sped up by appealing to more sophisticated means
of selecting the sample points.
In the rest of this section, we will discuss optimization strategies that exploit the
differential calculus. Let us first review the basic procedure for optimizing functions that
you learned in first and second year calculus. As you no doubt remember, there are two
different possible types of minima. An interior minimum occurs at an interior point of the
domain of definition of the function, whereas a boundary minimum occurs on its boundary
∂Ω. Interior local minima are easier to find, and, to keep the presentation simple, we shall
focus our efforts on them. Let us begin with a simple scalar example.
Example 4.3. Let us optimize the scalar objective function
F (u) = 8 u3 + 5 u2 − 6 u
on the domain −1 ≤ u ≤ 1. To locate the minimum, the first step is to look at the critical
points where the derivative vanishes:
F ′ (u) = 24 u² + 10 u − 6 = 0, so that u = 1/3 or u = −3/4.
Comparing the values F (1/3) = −31/27 and F (−3/4) = 63/16 with the boundary values
F (−1) = 3 and F (1) = 7, we conclude that the global minimum on the interval occurs at
the critical point u⋆ = 1/3, with F (u⋆ ) = −31/27 ≈ −1.148.
The Gradient
As you first learn in multi-variable calculus, [12], the interior extrema — minima
and maxima — of a smooth function F (u) = F (u1 , . . . , un ) are necessarily critical points,
meaning places where the gradient of F vanishes. The standard gradient is the vector field
whose entries are its first order partial derivatives:
∇F (u) = ( ∂F/∂u1 , . . . , ∂F/∂un )ᵀ. (4.4)
⟨ ∇F (u) , v ⟩ = (d/dt)|_{t=0} F (u + t v) for all v ∈ V. (4.5)
Remark : The function F does not have to be defined on all of the space V in order
for this definition to make sense.
The quantity displayed in the preceding formula is known as the directional derivative
of F with respect to v ∈ V , and typically denoted by ∂F/∂v. The directional derivative
measures the rate of change of F in the direction of the vector v, scaled in proportion to
its length.
In the Euclidean case, when F (u) = F (u1 , . . . , un ) is a function of n variables, defined
for u = ( u1 , u2 , . . . , un )ᵀ ∈ R n , we can use the chain rule to compute
d/dt F (u + t v) = d/dt F (u1 + t v1 , . . . , un + t vn )
= ∂F/∂u1 (u + t v) v1 + · · · + ∂F/∂un (u + t v) vn . (4.6)
Setting t = 0, the right hand side of (4.6) reduces to
(d/dt)|_{t=0} F (u + t v) = ∂F/∂u1 (u) v1 + · · · + ∂F/∂un (u) vn = ∇F (u) · v.
Therefore, the directional derivative equals the Euclidean dot product between the usual
gradient of the function (4.4) and the direction vector v, justifying (4.5) in the Euclidean
case.
Remark : In this chapter, we will only deal with the standard Euclidean dot product,
which results in the usual gradient formula (4.4). If we introduce an alternative inner
product on R n , then the notion of gradient, as defined in (4.5) will change.
Theorem 4.5. The gradient ∇F (u) of a scalar function F (u) points in the direction
of its steepest increase at the point u. The negative gradient, − ∇F (u), which points in
the opposite direction, indicates the direction of steepest decrease.
Thus, when F represents elevation, ∇F tells us the direction that is steepest uphill,
while − ∇F points directly downhill — the direction water will flow. Similarly, if F
represents the temperature of a solid body, then ∇F tells us the direction in which it
is heating up the quickest. Heat energy (like water) will flow in the opposite, coldest
direction, namely that of the negative gradient vector − ∇F .
But you need to be careful in how you interpret Theorem 4.5. Clearly, the faster you
move along a curve, the faster the function F (u) will vary, and one needs to take this
into account when comparing the rates of change along different curves. The easiest way
to effect the comparison is to assume that the tangent vector a = u has unit norm, so
k a k = 1, which means that we are passing through the point u(t) with unit speed. Once
this is done, Theorem 4.5 is an immediate consequence of the Cauchy–Schwarz inequality,
cf. [15]. Indeed,
| ∂F/∂a | = | a · ∇F | ≤ ‖ a ‖ ‖ ∇F ‖ = ‖ ∇F ‖, when ‖ a ‖ = 1,
with equality if and only if a points in the same direction as the gradient. Therefore,
the maximum rate of change is when a = ∇F/‖ ∇F ‖ is the unit vector in the gradient
direction, while the minimum is achieved when a = − ∇F/‖ ∇F ‖ points in the opposite
direction.
Critical Points
Thus, the only points at which the gradient fails to indicate directions of increase/de-
crease of the objective function are where it vanishes. Such points play a critical role in
the analysis, whence the following definition.
Definition 4.6. A point u⋆ is called a critical point of the objective function F (u)
if
∇F (u⋆ ) = 0. (4.8)
We conclude that the gradient vector ∇F (u⋆ ) at the critical point must be orthogonal to
every vector v ∈ R n , which is only possible if ∇F (u⋆ ) = 0. Q.E.D.
Thus, provided the objective function is continuously differentiable, every interior
minimum, both local and global, is necessarily a critical point. The converse is not true;
critical points can be maxima; they can also be saddle points or of some degenerate form.
The basic analytical method† for determining the (interior) minima of a given function is
to first find all its critical points by solving the system of equations (4.8). Each critical
point then needs to be examined more closely — as it could be either a minimum, or a
maximum, or neither.
Example 4.8. Consider the function
F (u, v) = u4 − 2 u2 + v 2 ,
which is defined and continuously differentiable on all of R². Since ∇F = ( 4 u³ − 4 u, 2 v )ᵀ,
its critical points are obtained by solving the pair of equations
4 u3 − 4 u = 0, 2 v = 0.
†
Numerical methods are discussed below.
The solutions to the first equation are u = 0, ± 1, while the second equation requires v = 0.
Therefore, F has three critical points:
u⋆1 = ( 0, 0 )ᵀ, u⋆2 = ( 1, 0 )ᵀ, u⋆3 = ( −1, 0 )ᵀ. (4.10)
Inspecting its graph in Figure 18, we suspect that the first critical point u⋆1 is a saddle
point, whereas the other two appear to be local minima, having the same value F (u⋆2 ) =
F (u⋆3 ) = − 1. This will be confirmed once we learn how to rigorously distinguish critical
points.
The student should also pay attention to the distinction between local minima and
global minima. In the absence of theoretical justification, the only practical way to deter-
mine whether or not a minimum is global is to find all the different local minima, including
those on the boundary, and see which one gives the smallest value. If the domain is un-
bounded, one must also worry about the asymptotic behavior of the objective function for
large u.
The status of a critical point — minimum, maximum, or neither — can often be resolved
by analyzing the second derivative of the objective function at the critical point. Let us
first review the one variable second derivative test you learned in first year calculus.
Proposition 4.9. Let g(t) ∈ C2 be a scalar function, and suppose that t⋆ is a critical
point: g ′ (t⋆ ) = 0. If t⋆ is a local minimum, then g ′′ (t⋆ ) ≥ 0. Conversely, if g ′′ (t⋆ ) > 0,
then t⋆ is a strict local minimum. Similarly, g ′′ (t⋆ ) ≤ 0 is required at a local maximum,
while g ′′ (t⋆ ) < 0 implies that t⋆ is a strict local maximum.
Indeed, the second order Taylor expansion of g at the critical point is
g(t) ≈ g(t⋆ ) + 1/2 (t − t⋆ )² g ′′ (t⋆ ),
since g ′ (t⋆ ) = 0, and so the linear terms in the Taylor polynomial vanish. If g ′′ (t⋆ ) ≠ 0,
then the quadratic Taylor polynomial has a minimum or maximum at t⋆ according to the
sign of the second derivative, and this provides the key to the proof. In the borderline
case, when g ′′ (t⋆ ) = 0, the second derivative test is inconclusive, and the point could be
either maximum or minimum or neither. One must analyze the higher order terms in the
Taylor expansion to resolve the status of the critical point.
In multi-variate calculus, the “second derivative” of a function F (u) = F (u1 , . . . , un )
is represented by the n × n Hessian † matrix , whose entries are its second order partial
derivatives:
             ( ∂²F/∂u1²      ∂²F/∂u1∂u2    · · ·   ∂²F/∂u1∂un )
             ( ∂²F/∂u2∂u1    ∂²F/∂u2²      · · ·   ∂²F/∂u2∂un )
∇²F (u) =    (    ...            ...        ...       ...     )      (4.11)
             ( ∂²F/∂un∂u1    ∂²F/∂un∂u2    · · ·   ∂²F/∂un²   )
We will always assume that F (u) ∈ C2 has continuous second order partial derivatives. In
this case, its mixed partial derivatives are equal: ∂ 2 F/∂ui ∂uj = ∂ 2 F/∂uj ∂ui , cf. [2, 12].
As a result, the Hessian is a symmetric matrix: ∇2 F (u) = ∇2 F (u)T .
The second derivative test for a local minimum of scalar function relies on the positiv-
ity of its second derivative. For a function of several variables, the corresponding condition
is that the Hessian matrix be positive definite. See [15] for a detailed discussion of positive
definite matrices. More specifically:
Theorem 4.10. Let F (u) = F (u1 , . . . , un ) ∈ C2 (Ω) be a real-valued, twice contin-
uously differentiable function defined on an open domain Ω ⊂ R n . If u⋆ ∈ Ω is a (local,
interior) minimum for F , then it is necessarily a critical point, so ∇F (u⋆ ) = 0. Moreover,
the Hessian matrix (4.11) must be positive semi-definite at the minimum, so ∇2 F (u⋆ ) ≥ 0.
Conversely, if u⋆ is a critical point with positive definite Hessian matrix ∇2 F (u⋆ ) > 0,
then u⋆ is a strict local minimum of F .
A maximum requires a negative semi-definite Hessian matrix. If, moreover, the Hes-
sian at the critical point is negative definite, then the critical point is a strict local maxi-
mum. If the Hessian matrix is indefinite, then the critical point is a saddle point — neither
a minimum nor a maximum.
† Named after the nineteenth century German mathematician Ludwig Otto Hesse.
Example 4.11. Consider the function
F (u, v) = u² − 2 u v + 3 v².
To minimize F , we begin by computing its gradient
∇F (u, v) = ( 2 u − 2 v, −2 u + 6 v )ᵀ.
Solving the pair of equations ∇F = 0, namely
2 u − 2 v = 0, − 2 u + 6 v = 0,
we see that the only critical point is the origin u = v = 0. To test whether the origin is a
maximum or minimum, we further compute the Hessian matrix
H = ∇²F (u, v) = ( Fuu   Fuv ; Fuv   Fvv ) = ( 2   −2 ; −2   6 ).
Using the methods of [15; Section 3.5], we easily prove that the Hessian matrix is positive
definite. Therefore, by Theorem 4.10, u⋆ = 0 is a strict local minimum of F .
Indeed, we recognize F (u, v) to be, in fact, a homogeneous positive definite quadratic
form, which can be written in the form
F (u, v) = uᵀ K u, where K = ( 1   −1 ; −1   3 ) = 1/2 H, u = ( u, v )ᵀ.
Positive definiteness of the coefficient matrix K implies that F (u, v) > 0 for all u =
( u, v )ᵀ ≠ 0, and hence 0 is, in fact, a global minimum.
In general, any quadratic function Q(u) = Q(u1 , . . . , un ) can be written in the form
Q(u) = uᵀ K u − 2 bᵀ u + c = Σ_{i,j=1}^n k_ij u_i u_j − 2 Σ_{i=1}^n b_i u_i + c, (4.12)
where K = Kᵀ is a symmetric n × n matrix. The algebraic analysis of [15; Chapter 4] tells
us that, in the positive definite case, the critical point u⋆ = K⁻¹ b is a strict global minimum
for Q(u). Thus, the algebraic approach of [15; Chapter 4] provides additional, global
information that cannot be gleaned directly from the local, multivariable calculus Theorem
4.10. But algebra is only able to handle quadratic minimization problems with ease. The
analytical classification of minima and maxima of more complicated objective functions
necessarily relies on the gradient and Hessian criteria of Theorem 4.10.
Example 4.12. The function
F (u, v) = u² + v² − v³ has gradient ∇F (u, v) = ( 2 u, 2 v − 3 v² )ᵀ.
The critical point equation ∇F = 0 has two solutions: u⋆1 = ( 0, 0 )ᵀ and u⋆2 = ( 0, 2/3 )ᵀ. The
Hessian matrix of the objective function is
∇²F (u, v) = ( 2   0 ; 0   2 − 6 v ).
At the first critical point, the Hessian ∇²F (0, 0) = ( 2   0 ; 0   2 ) is positive definite. Therefore,
the origin is a strict local minimum. On the other hand, ∇²F (0, 2/3) = ( 2   0 ; 0   −2 ) is
indefinite, and hence u⋆2 = ( 0, 2/3 )ᵀ is a saddle point. The function is graphed in Figure 19,
with the critical points indicated by the small solid balls. The origin is, in fact, only a
local minimum, since F (0, 0) = 0, whereas F (0, v) < 0 for all v > 1. Thus, this particular
function has no global minimum or maximum on R².
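The classification can be double-checked numerically by inspecting the eigenvalues of the
Hessian at each critical point; a brief sketch using numpy:

    import numpy as np

    # Hessian of F(u, v) = u^2 + v^2 - v^3 at the point (u, v)
    hessian = lambda u, v: np.array([[2.0, 0.0], [0.0, 2.0 - 6.0*v]])

    for (u, v) in [(0.0, 0.0), (0.0, 2.0/3.0)]:
        eigs = np.linalg.eigvalsh(hessian(u, v))
        # all eigenvalues positive -> strict local minimum; mixed signs -> saddle
        print((u, v), eigs)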
Next, consider the function
F (u, v) = u² + v⁴, with gradient ∇F (u, v) = ( 2 u, 4 v³ )ᵀ.
The only critical point is the origin u = v = 0. The origin is a strict global minimum
because F (u, v) > 0 = F (0, 0) for all (u, v) ≠ ( 0, 0 )ᵀ. However, its Hessian matrix
∇²F (u, v) = ( 2   0 ; 0   12 v² )
is only positive semi-definite at the origin, and so the converse statement in the second
derivative test, Theorem 4.10, does not apply here.
We now turn to the proof of Theorem 4.10. At a local minimum, for every direction
v ∈ R n , the scalar function g(t) = F (u⋆ + t v) has a local minimum at t = 0, and so, by
Proposition 4.9,
g ′′ (0) = vᵀ ∇²F (u⋆ ) v ≥ 0.
Since this condition is required for every direction v ∈ R n , the Hessian matrix ∇2 F (u⋆ ) ≥ 0
satisfies the criterion for positive semi-definiteness, proving the first part of the theorem.
The proof of the converse relies† on the second order Taylor expansion of the function:
F (u) = F (u⋆ ) + 1/2 vᵀ ∇²F (u⋆ ) v + S(v, u⋆ ), v = u − u⋆ , (4.15)
in which the remainder term satisfies S(v, u⋆ )/‖ v ‖² → 0 as v → 0. Let C > 0 denote the
smallest eigenvalue of the positive definite Hessian ∇²F (u⋆ ), so that vᵀ ∇²F (u⋆ ) v ≥ C ‖ v ‖².
Then, choosing δ > 0 sufficiently small,
| S(v, u⋆ ) | < 1/2 C ‖ v ‖² whenever 0 < ‖ v ‖ = ‖ u − u⋆ ‖ < δ.
But then the Taylor formula (4.15) implies that, for all u satisfying the preceding inequality,
0 < 1/2 vᵀ ∇²F (u⋆ ) v + S(v, u⋆ ) = F (u) − F (u⋆ ),
which implies u⋆ is a strict local minimum of F (u). Q.E.D.
† Actually, it is not hard to prove the first part using the first order Taylor expansion without
resorting to the scalar function g. On the other hand, when we look at infinite-dimensional
minimization problems arising in the calculus of variations, we will no longer have the luxury of
appealing to the finite-dimensional Taylor expansion, whereas the previous argument continues
to apply in general contexts.
0 = g ′ (0) = (d/dt)|_{t=0} F (u(t)) = ∇F (u(0)) · u̇(0) = ∇F (u⋆ ) · u̇(0). (4.17)
Thus, the gradient of the objective function at the surface minimum must be orthogonal
to the tangent vector to the curve. Since the curve was constrained to lie entirely in S, its
tangent vector u̇(0) is tangent to the surface at the point u⋆ . Since every tangent vector to
the surface is tangent to some curve contained in the surface, ∇F (u⋆ ) must be orthogonal
to every tangent vector, and hence point in the normal direction to the surface. Thus, a
constrained critical point u⋆ ∈ S of a function on a surface is defined so that
∇F (u⋆ ) = λ n, (4.18)
where n denotes the normal to the surface at the point u⋆ . The scalar factor λ is known
as the Lagrange multiplier in honor of Joseph-Louis Lagrange, one of the pioneers of
constrained optimization. The value of the Lagrange multiplier is not fixed a priori, but
must be determined by solving the critical point system (4.18). The same reasoning applies
to local maxima, which are also constrained critical points. The nature of a constrained
critical point — local minimum, local maximum, local saddle point, etc. — is fixed by a
constrained second derivative test.
Example 4.13. Our problem is to find the minimum value of the objective function
F (u, v, w) = u² − 2 w³ when u, v, w are restricted to the unit sphere S = { u² + v² + w² = 1 }.
The radial vector n = ( u, v, w )ᵀ is normal to the sphere, and so the critical point condition
(4.18) reads
∇F = ( 2 u, 0, −6 w² )ᵀ = λ ( u, v, w )ᵀ, or, in components,
2 u = λ u, 0 = λ v, − 6 w2 = λ w, subject to u2 + v 2 + w2 = 1,
for the unknowns u, v, w and λ. This needs to be done carefully to avoid missing any cases.
First, if u ≠ 0, then λ = 2, v = 0, and either w = 0, whence u = ±1, or w = −1/3 and so
u = ±√(1 − w²) = ±(2√2)/3. On the other hand, if u = 0, then either λ = 0, w = 0, and so
v = ±1, or v = 0, w = ±1, and λ = ∓6. Collecting these together, we discover that there
are eight constrained critical points on the sphere in all.
More generally, if the surface S is given as a level set of a function,
G(u, v, w) = c, (4.19)
then at any point u⋆ ∈ S, the gradient vector ∇G(u⋆ ) points in the normal direction to
the surface, and hence, provided n = ∇G(u⋆ ) ≠ 0, the surface critical point condition can
be rewritten as
be rewritten as
∇F (u⋆ ) = λ ∇G(u⋆ ), (4.20)
or, in full detail, the critical point ( u⋆ , v ⋆ , w⋆ )ᵀ must satisfy
∂F/∂u (u, v, w) = λ ∂G/∂u (u, v, w),
∂F/∂v (u, v, w) = λ ∂G/∂v (u, v, w), (4.21)
∂F/∂w (u, v, w) = λ ∂G/∂w (u, v, w).
Thus, to find the constrained critical points, one needs to solve the combined system (4.19,
21) of 4 equations for the four unknowns u, v, w and the Lagrange multiplier λ.
Formally, one can reformulate the problem as an unconstrained optimization problem
by introducing the augmented objective function
E(u, v, w, λ) = F (u, v, w) − λ (G(u, v, w) − c). (4.22)
The critical points of the augmented function are where its gradient, with respect to all four
arguments, vanishes. Setting the partial derivatives with respect to u, v, w to 0 reproduces
the critical point system (4.21), while the partial derivative with respect to λ recovers the
constraint (4.19). The same device works for several constraints G1 (u) = c1 , . . . , Gk (u) = ck ,
with one multiplier introduced per constraint.
The gradient with respect to u reproduces the critical point system (4.24), while its gradient
with respect to λ = (λ1 , . . . , λk ) recovers the constraints (4.23).
Example 4.14. Let us find the points on the intersection of the two cylinders defined
by the constraints
G(u, v, w) = u² + 4 v² = 1, H(u, v, w) = u² + 9 w² = 4, (4.26)
that are closest to the origin, i.e., that minimize the squared distance†
F (u, v, w) = u² + v² + w².
The augmented objective function (4.22) is
E(u, v, w, λ, µ) = u² + v² + w² − λ (u² + 4 v² − 1) − µ (u² + 9 w² − 4).
To find its critical points, we set all its partial derivatives to zero:
∂E/∂u = 2 u − 2 λ u − 2 µ u = 0, ∂E/∂v = 2 v − 8 λ v = 0, ∂E/∂w = 2 w − 18 µ w = 0,
while the partial derivatives with respect to the Lagrange multipliers λ, µ reproduce the
two constraints (4.26). Thus,
either u = 0 or λ + µ = 1, either v = 0 or λ = 1/4, and either w = 0 or µ = 1/9.
Consequently, at least one of u, v, w must be zero. If u = 0, then v = ±1/2, w = ±2/3; if
v = 0, then u = ±1, w = ±1/√3; while there are no real solutions to the constraints when
w = 0.
The first four critical points, ( 0, ±1/2, ±2/3 )ᵀ, all lie a distance 5/6 ≈ .8333 from the origin,
while the second four, ( ±1, 0, ±1/√3 )ᵀ, are further away, at distance 2/√3 ≈ 1.1547. Thus,
the closest points on the intersection of the cylinders are the first four, while the furthest
points from the origin are the last four. (The latter comment relies on the fact that the
intersection is a compact subset of R³.)
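The complete critical point system can also be handed over to a computer algebra package;
the following sketch uses sympy (an assumption about the available software):

    import sympy as sp

    u, v, w, lam, mu = sp.symbols('u v w lam mu', real=True)
    F = u**2 + v**2 + w**2
    G = u**2 + 4*v**2 - 1
    H = u**2 + 9*w**2 - 4
    E = F - lam*G - mu*H                  # augmented objective, as in (4.22)

    # gradient of E in u, v, w, together with the two constraints
    eqs = [sp.diff(E, x) for x in (u, v, w)] + [G, H]
    for s in sp.solve(eqs, [u, v, w, lam, mu], dict=True):
        print(s[u], s[v], s[w])           # the eight constrained critical points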
The second derivative test for regular constrained minima and maxima can be found
as follows. Let u(t) be a parametrized curve satisfying the constraints (4.23) for all t, with
u(0) = u⋆ , u̇(0) = v, ü(0) = w.
The second order Taylor expansions of the functions evaluated on the curve take the form
F (u(t)) = F (u⋆ ) + t ∇F (u⋆ ) · v + 1/2 t² (∇F (u⋆ ) · w + vᵀ ∇²F (u⋆ ) v) + · · · ,
Gi (u(t)) = Gi (u⋆ ) + t ∇Gi (u⋆ ) · v + 1/2 t² (∇Gi (u⋆ ) · w + vᵀ ∇²Gi (u⋆ ) v) + · · · . (4.27)
† Any distance minimizer also minimizes the squared distance; we work with the latter in order
to avoid square roots in the computation.
At a regular point, this implies that v must be orthogonal to each of the gradient vectors
∇G1 (u⋆ ), . . . , ∇Gk (u⋆ ), and hence must belong to the tangent space to the constraint
set (4.23), which can be identified as the orthogonal complement to the normal subspace
spanned by the gradient vectors. Indeed, the Implicit Function Theorem, [2, 12], implies
that the constraint set is, locally, a submanifold of dimension n − k, and hence every
tangent vector v can be realized as the tangent to some curve u(t) contained therein.
Furthermore, F (u(t)) has a minimum at t = 0, which implies the necessary conditions
∇F (u⋆ ) · v = 0, ∇F (u⋆ ) · w + vᵀ ∇²F (u⋆ ) v ≥ 0, (4.29)
which must hold whenever the vectors v, w satisfy (4.28). Thus, ∇F (u⋆ ) must be orthog-
onal to the tangent space to the constraint set, and hence a linear combination of the
normal gradient vectors, which implies the Lagrange multiplier equation (4.24). Thus, in
view of (4.28),
∇F (u⋆ ) · w = Σ_{i=1}^k λi ∇Gi (u⋆ ) · w = − Σ_{i=1}^k λi vᵀ ∇²Gi (u⋆ ) v.
Substituting into the second condition of (4.29), we conclude that
vᵀ ( ∇²F (u⋆ ) − Σ_{i=1}^k λi ∇²Gi (u⋆ ) ) v ≥ 0, (4.30)
which must hold whenever v satisfies the constraints in (4.28). In other words, the Hessian
matrix of the augmented objective function (4.25) with respect to u, evaluated at the
critical point u⋆ , must be positive semi-definite when restricted to the tangent space of the
constraint set. As in the unconstrained minimization problem, if this restricted augmented
Hessian matrix is positive definite, then the critical point is a strict local minimum to the
constrained optimization problem.
To convert these conditions into a practical criterion, let v1 , . . . , v_{n−k} be a basis for
the tangent space, i.e., a basis for the solution space to the system of linear equations
∇Gi (u⋆ ) · v = 0, i = 1, . . . , k. Let V = (v1 , . . . , v_{n−k} ) be the corresponding n × (n − k)
matrix with the indicated columns. Then a necessary condition for minimality of u⋆ is
that the restricted augmented Hessian matrix
H = Vᵀ ( ∇²F (u⋆ ) − Σ_{i=1}^k λi ∇²Gi (u⋆ ) ) V (4.31)
be positive semi-definite; if, moreover, H is positive definite, then u⋆ is a strict local
minimum of the constrained optimization problem.
[1] Alligood, K.T., Sauer, T.D., and Yorke, J.A., Chaos. An Introduction to Dynamical
Systems, Springer-Verlag, New York, 1997.
[2] Apostol, T.M., Mathematical Analysis, 2nd ed., Addison–Wesley Publ. Co.,
Reading, Mass., 1974.
[3] Bradie, B., A Friendly Introduction to Numerical Analysis, Prentice–Hall, Inc.,
Upper Saddle River, N.J., 2006.
[4] Burden, R.L., and Faires, J.D., Numerical Analysis, Seventh Edition, Brooks/Cole,
Pacific Grove, CA, 2001.
[5] Devaney, R.L., An Introduction to Chaotic Dynamical Systems, Addison–Wesley,
Redwood City, Calif., 1989.
[6] Feigenbaum, M.J., Qualitative universality for a class of nonlinear transformations,
J. Stat. Phys. 19 (1978), 25–52.
[7] Gaal, L., Classical Galois theory, 4th ed., Chelsea Publ. Co., New York, 1988.
[8] Greene, B., The Elegant Universe: Superstrings, Hidden Dimensions, and the Quest
for the Ultimate Theory, W.W. Norton, New York, 1999.
[9] Henry, D., Geometric Theory of Semilinear Parabolic Equations, Lecture Notes in
Math., vol. 840, Springer–Verlag, Berlin, 1981.
[10] Lanford, O., A computer-assisted proof of the Feigenbaum conjecture, Bull. Amer.
Math. Soc. 6 (1982), 427–434.
[11] Mandelbrot, B.B., The Fractal Geometry of Nature, W.H. Freeman, New York,
1983.
[12] Marsden, J.E., and Tromba, A.J., Vector Calculus, 4th ed., W.H. Freeman, New
York, 1996.
[13] Moon, F.C., Chaotic Vibrations, John Wiley & Sons, New York, 1987.
[14] Olver, P.J., Continuous Calculus, Lecture Notes, University of Minnesota, 2020.
[15] Olver, P.J., and Shakiban, C., Applied Linear Algebra, Prentice–Hall, Inc., Upper
Saddle River, N.J., 2005.
[16] Peitgen, H.-O., and Richter, P.H., The Beauty of Fractals: Images of Complex
Dynamical Systems, Springer–Verlag, New York, 1986.
[17] Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P., Numerical
Recipes: The Art of Scientific Computing, 3rd ed., Cambridge University Press,
Cambridge, 2007.
[18] Robinson, R.C., An Introduction to Dynamical Systems: Continuous and Discrete,
2nd ed., Pure and Applied Undergraduate Texts, vol. 19, Amer. Math. Soc.,
Providence, R.I., 2012.
[19] Royden, H.L., Real Analysis, Macmillan Co., New York, 1988.