MAT321 Lecture Notes Boumal 2019
Fall 2019
Contents
5 Eigenproblems
5.1 The power method
5.2 Inverse iteration
5.3 Rayleigh quotient iteration
5.4 Sturm sequences
5.5 Gerschgorin disks
5.6 Householder tridiagonalization

9 Integration
9.1 Computing the weights
9.2 Bounding the error
9.3 Composite rules
9.4 Gaussian quadratures
9.4.1 Computing roots of orthogonal polynomials
9.4.2 Getting the weights, too: Golub–Welsch
9.4.3 Examples
9.4.4 Error bounds
These notes contain some of the material covered in MAT 321 / APC 321 – Numerical Methods, taught at Princeton University during the Fall semesters of 2016–2019. They are extensively based on the two reference books of the course, namely:

Lloyd N. Trefethen and David Bau III. Numerical linear algebra, volume 50. SIAM, 1997.

Endre Süli and David F. Mayers. An Introduction to Numerical Analysis. Cambridge University Press, 2003.
Nicolas Boumal
Chapter 1

Solving one nonlinear equation
Figure 1.1: The function f(x) = e^x − 2x − 1.
1.1 Bisection
The main theorem we need to describe our first algorithm is a consequence
of the Intermediate Value Theorem (IVT). It offers a sufficient (but not
necessary) criterion to decide whether Problem 1.1 has a solution at all.
In both of the last two cases, we identified an interval which (i) contains a
root, and (ii) is half the length of our original interval. By iterating this procedure,
we can repeatedly halve the length of our interval with a single function
evaluation, always with the certainty that this interval contains a solution
to our problem. After k iterations, the interval has length |b − a|2−k . The
midpoint of that interval is at a distance at most |b − a|2−k−1 of a solution
ξ. We formalize this in Algorithm 1.1, called the bisection algorithm.
Theorem 1.3. When Algorithm 1.1 returns c, there exists ξ ∈ [a0 , b0 ] such
that f (ξ) = 0 and |ξ − c| ≤ |b0 − a0 |2−1−K . Assuming f (a0 ), f (b0 ) were
already computed, this is achieved in at most K function evaluations.
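A minimal sketch of the loop (in Python; the course codes are in Matlab) makes the guarantee easy to check on f(x) = e^x − 2x − 1 over [1, 2]:

```python
import math

def bisect(f, a, b, K):
    """Halve a sign-change interval K times; return the midpoint."""
    fa = f(a)                      # assumed precomputed, as in Theorem 1.3
    for _ in range(K):
        c = (a + b) / 2
        fc = f(c)
        if fc == 0:
            return c               # stumbled on an exact root
        if (fc < 0) != (fa < 0):   # sign change in [a, c]
            b = c
        else:                      # sign change in [c, b]
            a, fa = c, fc
    return (a + b) / 2

f = lambda x: math.exp(x) - 2 * x - 1
K = 40
c = bisect(f, 1.0, 2.0, K)
root = 1.2564312086261697          # value quoted later in these notes
```

The final midpoint is within |b0 − a0| 2^(−1−K) of the root, as the theorem promises.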
Figure 1.3: Bisection on f(x) = (5 − x)e^x − 5 with [a0, b0] = [4, 5] and K = 60.
The interval length ℓ_k = b_k − a_k decreases only to 8.9 · 10^−16. Furthermore,
the function value stagnates instead of converging to 0. The culprit: inexact
arithmetic. We will see that this is actually as accurate as one can hope.
3: for k = 0, 1, 2, . . . , K − 1 do
4:   Compute f(c_k)   ▷ We really only need the sign
5:   if f(c_k) has sign opposite to f(a_k) then
6:     Let (a_{k+1}, b_{k+1}) = (a_k, c_k)
7:   else if f(c_k) has sign opposite to f(b_k) then
8:     Let (a_{k+1}, b_{k+1}) = (c_k, b_k)
9:   else
10:    return c = c_k   ▷ f(c_k) = 0
11:  end if
12:  Let c_{k+1} = (a_{k+1} + b_{k+1})/2
13: end for
14: return c = c_K
function c = my_bisection(f, a, b, K)
% Example: c = my_bisection(@(x) exp(x) - 2*x - 1, 1, 2, 60);
    fa = f(a);
    c = (a+b)/2;
    for k = 1 : K
        fc = f(c);
        if fc == 0
            return;                 % f(c) = 0: done
        elseif sign(fc) ~= sign(fa)
            b = c;                  % root in [a, c]
        else
            a = c; fa = fc;         % root in [c, b]
        end
        c = (a+b)/2;
    end
end
The bisection algorithm relies heavily on Theorem 1.2. For its many
qualities (not the least of which is its simplicity), this approach has three
main drawbacks:
1. The user needs to find a sign change interval [a0 , b0 ] as initialization;
2. Convergence is fast, but we can do better;
3. Theorem 1.2 is fundamentally a one-dimensional thing: it won’t gener-
alize when we aim to solve several nonlinear equations simultaneously.
In the next section, we discuss simple iterations: a family of iterative
algorithms designed to solve Problem 1.1, and which will (try to) address these
shortcomings.
Recall the performance of fzero reported at the beginning of this chapter.
Based on our initial guess x0 = 1, Matlab's fzero used 17 function
evaluations to find a sign-change interval of length 0.64. After that, it needed only
6 additional function evaluations to find the same root our bisection found
in 48 iterations (starting from that interval, bisection would reach an error
bound of 0.005: only two digits after the decimal point are correct.) If we
give fzero the same interval we gave bisection, then it needs only 10 function
evaluations to do its job. This confirms Problem 1.1 can be solved faster.
We won't discuss in much detail how fzero finds a sign-change interval (you will
think about it during precept). We do note in Figure 1.4 that this can be
a difficult task. The methods we discuss next do not require a sign-change
interval.
1.2 Simple iteration
Figure 1.4: A function for which finding a sign-change interval is a difficult task.
Simple iteration recasts root-finding as a fixed-point problem. Given f, define

g(x) = x − f(x).
Clearly, f (ξ) = 0 if and only if g(ξ) = ξ, that is, if ξ is a fixed point of g. Given
f , there are many ways to construct a function g whose fixed points coincide
with the roots of f , so that Problem 1.1 is equivalent to the following.
This theorem has a big “if”. The main concern for this section will be:
Given f as in Problem 1.1, how do we pick an appropriate function g so that
(i) simple iteration on g converges, and (ii) it converges fast.
Let's do an example, with f(x) = e^x − 2x − 1, as in Figure 1.1. Here are
three possible functions g_i which all satisfy f(ξ) = 0 ⇐⇒ g_i(ξ) = ξ:

g_1(x) = log(2x + 1),
g_2(x) = (e^x − 1)/2,
g_3(x) = e^x − x − 1.   (1.1)
(The domain of g1 is restricted to (−1/2, ∞).) See Figure 1.5. Notice how
g1 ([1, 2]) ⊂ [1, 2] and g2,3 ([−1/2, 1/2]) ⊂ [−1/2, 1/2]: fixed points exist.
Let’s run simple iteration with these functions and see what happens.
First, initialize all three sequences with x0 = 0.5 and run 20 iterations.
fprintf(' g1 g2 g3\n');
fprintf('%10.8e\t%10.8e\t%10.8e\n', x');
Figure 1.5: Functions gi intersect the line y = x (that is, gi (x) = x) exactly
when f (x) = 0.
g1 g2 g3
5.00000000e-01 5.00000000e-01 5.00000000e-01
6.93147181e-01 3.24360635e-01 1.48721271e-01
8.69741686e-01 1.91573014e-01 1.16282500e-02
1.00776935e+00 1.05576631e-01 6.78709175e-05
1.10377849e+00 5.56756324e-02 2.30328290e-09
1.16550958e+00 2.86273445e-02 0.00000000e+00
1.20327831e+00 1.45205226e-02 0.00000000e+00
1.22570199e+00 7.31322876e-03 0.00000000e+00
1.23878110e+00 3.67001786e-03 0.00000000e+00
1.24633153e+00 1.83838031e-03 0.00000000e+00
1.25066450e+00 9.20035584e-04 0.00000000e+00
1.25314261e+00 4.60229473e-04 0.00000000e+00
1.25455714e+00 2.30167698e-04 0.00000000e+00
1.25536366e+00 1.15097094e-04 0.00000000e+00
1.25582323e+00 5.75518590e-05 0.00000000e+00
1.25608500e+00 2.87767576e-05 0.00000000e+00
1.25623408e+00 1.43885858e-05 0.00000000e+00
1.25631897e+00 7.19434467e-06 0.00000000e+00
1.25636731e+00 3.59718527e-06 0.00000000e+00
1.25639483e+00 1.79859587e-06 0.00000000e+00
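The three columns above can be reproduced with a short fixed-point loop; a sketch in Python (the course scripts are in Matlab):

```python
import math

def simple_iteration(g, x0, iters):
    """Fixed-point iteration x_{k+1} = g(x_k); returns the whole trajectory."""
    xs = [x0]
    for _ in range(iters):
        xs.append(g(xs[-1]))
    return xs

g1 = lambda x: math.log(2 * x + 1)    # domain restricted to x > -1/2
g2 = lambda x: (math.exp(x) - 1) / 2
g3 = lambda x: math.exp(x) - x - 1

t1 = simple_iteration(g1, 0.5, 20)    # creeps up toward 1.2564...
t2 = simple_iteration(g2, 0.5, 20)    # decays toward the root 0
t3 = simple_iteration(g3, 0.5, 20)    # collapses to 0 almost immediately
```

From x0 = 0.5, the iterates of g1 approach the positive root slowly, while those of g3 reach 0 after just a handful of steps, exactly as in the table.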
Recall that the larger root is about 1.2564312086261697. Let's try again with
initialization x0 = 1.5.
g1 g2 g3
1.50000000e+00 1.50000000e+00 1.50000000e+00
1.38629436e+00 1.74084454e+00 1.98168907e+00
1.32776143e+00 2.35107853e+00 4.27329775e+00
1.29623914e+00 4.74844242e+00 6.64845880e+01
1.27884229e+00 5.72021964e+01 7.47979509e+28
1.26910993e+00 3.47991180e+24 Inf
1.26362374e+00 Inf NaN
1.26051782e+00 Inf NaN
1.25875516e+00 Inf NaN
1.25775344e+00 Inf NaN
1.25718372e+00 Inf NaN
1.25685955e+00 Inf NaN
1.25667505e+00 Inf NaN
1.25657003e+00 Inf NaN
1.25651024e+00 Inf NaN
1.25647620e+00 Inf NaN
1.25645683e+00 Inf NaN
1.25644579e+00 Inf NaN
1.25643951e+00 Inf NaN
1.25643594e+00 Inf NaN
Now x0 = 10.
g1 g2 g3
1.00000000e+01 1.00000000e+01 1.00000000e+01
3.04452244e+00 1.10127329e+04 2.20154658e+04
1.95855062e+00 Inf Inf
1.59271918e+00 Inf NaN
1.43161144e+00 Inf NaN
1.35150178e+00 Inf NaN
1.30914426e+00 Inf NaN
1.28600113e+00 Inf NaN
1.27312630e+00 Inf NaN
1.26589144e+00 Inf NaN
If the quantity L = sup_{η∈(a,b)} |g'(η)| is in (0, 1), then g is a contraction on [a, b]. (Note that the
sup over (a, b) is equivalent to a max over [a, b] if g' is continuous on [a, b].)

Are the functions g_i contractions? Yes, on some intervals. Consider
Figure 1.6, which depicts |g_i'(x)|. We have:

g_1([1, 2]) ⊂ [1, 2] and L_1 = max_{η∈[1,2]} |g_1'(η)| ≤ 0.667;
g_2([−1/2, 1/2]) ⊂ [−1/2, 1/2] and L_2 = max_{η∈[−1/2,1/2]} |g_2'(η)| ≤ 0.825;
g_3([−1/2, 1/2]) ⊂ [−1/2, 1/2] and L_3 = max_{η∈[−1/2,1/2]} |g_3'(η)| ≤ 0.649.
Theorem 1.9 (the contraction mapping theorem) guarantees convergence
to a unique fixed point for these g_i's, given appropriate initialization. What
can be said about the speed of convergence? Consider the proof of Theorem 1.9.
In the last step, we established |x_k − ξ| ≤ L^k |x0 − ξ|. What does it
take to ensure |x_k − ξ| ≤ ε? Certainly, if L^k |x0 − ξ| ≤ ε, we are in the clear.
Taking logarithms, this is the case if and only if:
k log(L) + log|x0 − ξ| ≤ log(ε)    (multiply by −1)
k log(1/L) ≥ log(|x0 − ξ|) + log(1/ε)
k ≥ (1 / log(1/L)) · log(|x0 − ξ| / ε).
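In other words, the number of iterations needed grows like log(1/ε). A hedged sketch in Python (the function name and example values are our choices):

```python
import math

def iterations_needed(L, dist0, eps):
    """Smallest integer k with L**k * dist0 <= eps, for a contraction
    factor L in (0, 1)."""
    return max(0, math.ceil(math.log(dist0 / eps) / math.log(1 / L)))

# Example: L = 0.667 (the bound for g1 on [1, 2]), |x0 - xi| = 0.5, eps = 1e-10.
k = iterations_needed(0.667, 0.5, 1e-10)
```

By construction, k iterations suffice while k − 1 do not.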
Figure 1.6: The functions |g_1'(x)|, |g_2'(x)| and |g_3'(x)|.
Theorem 1.11. Under the assumptions and with the notations of the contraction mapping theorem, with x0 ∈ [a, b], we have |x_k − ξ| ≤ ε for all k ≥ k(ε), where

k(ε) = (1 / log(1/L)) · log( |x0 − x1| / ((1 − L) ε) ).
This last theorem only gives an upper bound on how many iterations
might be necessary to reach a desired accuracy. In practice, convergence
may be much faster. Take for example g3, which converged to the root 0
(exactly) in only 5 iterations when initialized with x0 = 0.5. Meanwhile, the
bound with L3 = 0.649 only guarantees an accuracy of L3^5 |x0 − ξ| ≈ 0.058
for 5 iterations. Why is that?
1: In the first line, we use the triangle inequality: https://en.wikipedia.org/wiki/Triangle_inequality#Example_norms.
One important reason is that the constant L is valid for a whole interval
[a, b]. Yet, this choice of interval is somewhat arbitrary. If x_k → ξ, eventually,
it is really only g' close to ξ which matters. For g1, the derivative g1' evaluated
at the positive root is about 0.57: not a big difference from 0.667. But for
g3, we have g3'(0) = 0: as we get closer and closer to 0, the convergence gets
faster and faster!
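Both derivative values are easy to check numerically; a small Python sketch (the root value is the one quoted from the notes):

```python
import math

xi = 1.2564312086261697                  # positive root of f, from these notes
g1_prime = lambda x: 2 / (2 * x + 1)     # derivative of g1(x) = log(2x + 1)
g3_prime = lambda x: math.exp(x) - 1     # derivative of g3(x) = exp(x) - x - 1

rate_g1 = abs(g1_prime(xi))    # about 0.57: linear convergence at that rate
rate_g3 = abs(g3_prime(0.0))   # exactly 0: faster-than-linear convergence
```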
Thus, informally, if g is continuously differentiable at ξ and x_k → ξ,
asymptotically, the rate depends on g'(ξ). In fact, much of the behavior of
simple iteration is linked to g'(ξ). Consider the following definition.
Definition 1.12. Let g : [a, b] → [a, b] have a fixed point ξ, and let x0 , x1 . . .
be a sequence generated by xk+1 = g(xk ) for some x0 ∈ [a, b].
If there exists a neighborhood I of ξ such that x0 ∈ I implies xk → ξ,
we say ξ is a stable fixed point.
If there exists a neighborhood I of ξ such that x0 ∈ I\{ξ} implies we
do not have x_k → ξ, we say ξ is an unstable fixed point.

(ξ can be either or neither. Consider f(x) = x/2 if x ≤ 0, and f(x) = 2x otherwise.)

For a continuously differentiable function g with fixed point ξ, we can
make the following statements (note that their "if" parts are quite different
in nature):
If |g'(ξ)| > 1, then ξ is unstable. Indeed, if x_k is very close to ξ (but
not equal!), then, by the MVT,
Figure 1.7: Even though the sequence generated by g2 converges only sub-
linearly, it takes over 9000 iterations for the linearly convergent sequence
generated by g1 to take over.
If Newton's method converges (that's a big if!), then the rate is superlinear
provided f'(ξ) ≠ 0. Indeed, Newton's method is simple iteration with:

g(x) = x − f(x)/f'(x).

Thus, g'(ξ) = 0. How fast exactly is this superlinear convergence? Let's look
at an example on f(x) = e^x − 2x − 1:
f = @(x) exp(x) - 2*x - 1;
df = @(x) exp(x) - 2;
x = 0.5;                       % initial guess
for k = 1 : 12
    x = x - f(x) / df(x);
    fprintf('x = %+.16e, \t f(x) = %+.16e\n', x, f(x));
end
We get fast convergence to the root ξ = 0. After a couple iterations, the error
|x_k − ξ| appears to be squared at every iteration, until we run into an error
of about 10^−17, which is an issue of numerical accuracy. (As a practical concern, it
is nice to observe that, if we keep iterating, we do not move away from this
excellent approximation of the root.) Let’s give a name to this kind of fast
convergence.
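The same experiment can be sketched in Python (x0 = 0.5 is our arbitrary choice of initial guess): the error is roughly squared at every step.

```python
import math

f  = lambda x: math.exp(x) - 2 * x - 1
df = lambda x: math.exp(x) - 2

x = 0.5                      # initial guess near the root 0
errs = []
for k in range(12):
    x = x - f(x) / df(x)     # Newton step
    errs.append(abs(x))      # error |x_k - 0|
```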
While Newton's method is a great algorithm, bear in mind that the
theorem we just established does not provide a practical way of initializing the
sequence. This remains a practical issue, which can only be resolved on a
case by case basis.
As a remark, note that the convergence guarantees given here are of the
form: if initialization is close enough to a root, then we get convergence to
that root. It is a common misconception to infer that if there is convergence
from a given initialization, then convergence is to the closest root. That
is simply not true. See [SM03, §1.7] for illustrations of just how complicated
the behavior of Newton's method (and others) can be as a function of
initialization.
Definition 1.22. For given x0, x1, the secant method generates the sequence

x_{k+1} = x_k − f(x_k) · (x_k − x_{k−1}) / (f(x_k) − f(x_{k−1})).
for k = 1 : 12
    x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0));
    x0 = x1; x1 = x2;
end
Convergence to the positive root is very fast indeed, though after getting
there things go out of control. Why is that? Propose an appropriate stopping
criterion to avoid this situation.
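One possible guard, sketched in Python (the tolerance and the exact tests are our choices): stop as soon as |f(x_k)| is tiny, or the denominator f(x_k) − f(x_{k−1}) vanishes and the next step would divide by zero.

```python
import math

def secant(f, x0, x1, iters=20, tol=1e-14):
    """Secant iteration with a simple stopping criterion."""
    for _ in range(iters):
        f0, f1 = f(x0), f(x1)
        if abs(f1) <= tol or f1 == f0:
            break                # converged, or the update is undefined
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
    return x1

f = lambda x: math.exp(x) - 2 * x - 1
root = secant(f, 1.0, 2.0)       # heads to the positive root near 1.2564312
```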
We state a convergence result with a proof sketch here (see [SM03,
Thm. 1.10] for details). In [SM03, Ex. 1.10], you are guided to establish
superlinear convergence. (This theorem is only for your information.)

Proof sketch. Assume f'(ξ) = α > 0 (the argument is similar for α < 0). In
a subinterval I_δ of I, by continuity of f', we have f'(x) ∈ [3α/4, 5α/4]. Following
the first part of the proof for the convergence rate of Newton's method, this
is sufficient to conclude that |x_{k+1} − ξ| ≤ (2/3)|x_k − ξ|, leading to at least linear
convergence.
As a closing remark to this chapter, we note that it is a good strategy to
use bisection to zoom in on a root at first, thus exploiting the linear conver-
gence rate of bisection and its robustness; then to switch to a superlinearly
convergent method such as Newton’s or the secant method to “finish the
job.” This two-stage procedure is part of the strategy implemented in Mat-
lab’s fzero, as described in a series of blog posts by Matlab creator Cleve
Moler.4
4: https://blogs.mathworks.com/cleve/2015/10/12/zeroin-part-1-dekkers-algorithm/,
https://blogs.mathworks.com/cleve/2015/10/26/zeroin-part-2-brents-version/,
https://blogs.mathworks.com/cleve/2015/11/09/zeroin-part-3-matlab-zero-finder-fzero/
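The two-stage strategy can be sketched as follows (Python; an illustration of the idea only, not of fzero's actual Dekker/Brent logic):

```python
import math

def hybrid_root(f, df, a, b, bisect_steps=10, newton_steps=8):
    """A few robust bisection steps to zoom in on the root,
    then Newton steps to finish the job."""
    fa = f(a)
    for _ in range(bisect_steps):
        c = (a + b) / 2
        fc = f(c)
        if fc == 0:
            return c
        if (fc < 0) != (fa < 0):
            b = c
        else:
            a, fa = c, fc
    x = (a + b) / 2                  # start Newton from the refined midpoint
    for _ in range(newton_steps):
        x = x - f(x) / df(x)
    return x

f  = lambda x: math.exp(x) - 2 * x - 1
df = lambda x: math.exp(x) - 2
root = hybrid_root(f, df, 1.0, 2.0)
```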
1.5 A quick note about Taylor's theorem
Cauchy's mean value theorem states there exists η (strictly) between a and
x such that

F'(η)/G'(η) = (F(x) − F(a)) / (G(x) − G(a)).
5: https://en.wikipedia.org/wiki/Taylor%27s_theorem#Explicit_formulas_for_the_remainder
6: https://en.wikipedia.org/wiki/Mean_value_theorem#Cauchy’s_mean_value_theorem
Chapter 2

Floating point arithmetic
As often, we start with a Taylor expansion. For any 0 < h < h̄, there
exist η1 and η2, both in [x − h̄, x + h̄], such that

f(x + h) = f(x) + hf'(x) + (h^2/2) f''(x) + (h^3/6) f'''(η1),
f(x − h) = f(x) − hf'(x) + (h^2/2) f''(x) − (h^3/6) f'''(η2).
Our goal is to obtain a formula for f'(x). Thus, it is tempting to compute
the difference between the two formulas above:

f(x + h) − f(x − h) = 2hf'(x) + (h^3/6) (f'''(η1) + f'''(η2)).

Solving for f'(x), we get:

f'(x) = (f(x + h) − f(x − h)) / (2h) − (h^2/12) (f'''(η1) + f'''(η2)).   (2.1)
Since f''' is continuous in [x − h̄, x + h̄], it is also bounded in that interval. Let
M3 be such that |f'''(η)| ≤ M3 for all η in the interval. The approximation

f'(x) ≈ ∆f(x; h) = (f(x + h) − f(x − h)) / (2h)

is called a finite difference approximation. From (2.1), we deduce it incurs
an error bounded as:

|f'(x) − ∆f(x; h)| ≤ (M3/6) h^2.   (2.2)
This formula suggests that the smaller h, the smaller the approximation
error, which is certainly in line with our intuition about derivatives. Let's
verify this on a computer with f(x) = sin(x + π/3), approximating f'(0).
f  = @(x) sin(x + pi/3);
df = @(x) cos(x + pi/3);
hh = logspace(-16, 0, 200);
err = zeros(size(hh));
for k = 1 : numel(hh)
    h = hh(k);
    err(k) = abs(df(0) - (f(h) - f(-h)) / (2*h));
end
This code generates Figure 2.1. It is pretty clear that when h is “too”
small, something breaks. Specifically, it is the fact that our computations
are inexact. Fortunately, we will be able to give a precise description of what
happens, also allowing us to pick an appropriate value for h in practice.
Figure 2.1: Mathematically, we predicted |f'(x) − ∆f(x; h)| (the blue curve)
should stay below (M3/6) h^2 (the red line of slope 2). Clearly, something is wrong.
From the plot, it seems h = 10^−5 is a good value. In practice though, we
cannot draw this plot for we do not know f'(x): we need to predict what a
good h is via other means.
a ⊕ b = fl(a + b) = (a + b)(1 + ε1),
a ⊖ b = fl(a − b) = (a − b)(1 + ε2),
a ⊗ b = fl(ab) = ab(1 + ε3),
2: https://en.wikipedia.org/wiki/IEEE_floating_point
3: Overflow occurs when one attempts to work with a number larger than the biggest number which can be stored (about 10^308); underflow occurs when one attempts to store a number which is closer to zero than the closest nonzero number which can be represented.
4: Denormalized numbers fill the gap around zero to improve accuracy there, but the relative accuracy is not as good as everywhere else.
5: Remark that ε_mach is half of ε above, since if a is in the interval [x, x(1 + ε)] whose limits are exactly represented, its distance to either limit is at most half of the interval length, that is, (ε/2)x. Then, the relative error upon rounding a to its closest representable number is |a − fl(a)|/|a| ≤ (ε/2)|x|/|a| ≤ ε/2 =: ε_mach.
a ⊘ b = fl(a/b) = (a/b)(1 + ε4),
sqrt(a) = √a (1 + ε5),
This follows from the fact that computers typically represent numbers in
binary, so that multiplying and dividing by 2 can be done exactly. Consequently,
powers of 2 (and of 1/2) are exactly representable, and multiplying
or dividing by them is done exactly. Similarly, since we typically use one bit
to encode the sign, computing −a is exact.
a ⊕ (b ⊕ c) = a ⊕ (b + c)(1 + ε1)
            = [a + (b + c)(1 + ε1)] (1 + ε2)
            = (a + b + c) + ε2 (a + b + c) + ε1 (b + c) + ε1 ε2 (b + c)
            = (a + b + c) + ε2 (a + b + c) + ε1 (b + c) + O(ε_mach^2).

In the last equation, we made a simplification which we will always do: terms
proportional to ε_mach^2 (or ε_mach^3, ε_mach^4, . . .) are so small that we do not care;
so we hide them in the notation O(ε_mach^2). Notice how the formula tells us
the result of the addition (after both round-offs) is equal to the correct sum
a + b + c, plus some extra terms. It is useful to bound the error:
The first term is fine: it is the usual form for a relative error. The last term
is also fine: we consider O(ε^2) very small. It is the middle term which is
the culprit. To understand why it is harmful, recall that ε1 and ε2 can be
both positive and negative (corresponding to rounding up or down when the
operation was computed). Thus, if the signs of ε1 f(h) and ε2 f(−h) happen
to be opposite (which might very well be the case), then the numerator is
quite large. Using f(±h) = f(0) ± hf'(0) + O(h^2):

Clearly, if h is small, this is bad. Overall then, we find the following round-off
error (where we also used |∆f(h) − f'(0)| = O(h^2)):

|fl(∆f(h)) − ∆f(h)| ≤ 3ε|f'(0)| + ε |f(0)|/h + O(ε^2) + O(εh).
Finally, we have the following formula to bound the error; at this point, we
re-integrate x in our notation:

|fl(∆f(x; h)) − f'(x)| ≤ (M3/6) h^2 + 3ε|f'(x)| + ε |f(x)|/h + O(ε^2) + O(εh).   (2.3)
Good. This should help us pick a suitable value of h. The goal is to minimize
the parts of the bound that depend on h. This is pretty much the case when
both error terms are equal:8

(M3/6) h^2 ≈ ε |f(x)|/h,

thus, h = (6|f(x)| ε / M3)^(1/3) is an appropriate choice. If the constant is not too
different from 1, the magic value to remember is h ≈ ε^(1/3) ≈ 10^−5.
8: More precisely, you could observe that (M3/6) h^2 + ε |f(x)|/h attains its minimum when the derivative with respect to h is zero; this happens for h = (3|f(x)| ε / M3)^(1/3).
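A quick numerical check of the rule of thumb, in Python, with the running example f(x) = sin(x + π/3):

```python
import math

f  = lambda x: math.sin(x + math.pi / 3)
df = lambda x: math.cos(x + math.pi / 3)   # exact derivative, for reference

def fd_error(h):
    """Error of the central difference approximation of f'(0) for step h."""
    return abs((f(h) - f(-h)) / (2 * h) - df(0.0))

err_good = fd_error(1e-5)     # near the magic value h = eps**(1/3)
err_tiny = fd_error(1e-12)    # far smaller h: round-off dominates
```

The step h = 10^−5 gives an error near 10^−11, while the much smaller step is markedly worse.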
Question 2.5. Can you track down what happens if the computation of f (h)
and f (−h) only has a relative accuracy of 100εmach instead of εmach ? (Follow
ε1 and ε2 .)
The following bit of code adds the IEEE-aware error bound to the mix,
as depicted in Figure 2.2.
hold all;
loglog(hh, (M3/6)*hh.^2 + 3*eps(1)*abs(df(0)) + ...
    (eps(1)./hh)*abs(f(0)));
legend('Actual error', 'Theoretical bound', 'IEEE-aware bound');
Question 2.6. Notice how, in this log-log plot, the right bit has a slope of 2,
whereas the left bit has a slope of −1. Can you tell why? The precise value
of the optimal h depends on M3 which is typically unknown. Is it better to
overestimate or underestimate?
2.4 Bisection
Recall the bisection method: for a certain continuous function f, let a0 < b0
(represented exactly in memory) be the end points of an interval such that
f(a0) f(b0) < 0 (so the interval contains a root). The bisection method
computes c0 = (a0 + b0)/2, the mid-point of the interval [a0, b0], and decides to
make the next interval either [a1, b1] = [a0, c0] or [a1, b1] = [c0, b0]. Then, it
iterates. If this is done exactly, the interval length

ℓ_k = b_k − a_k

is reduced by a factor of two at each iteration, so that ℓ_k = ℓ0 / 2^k → 0. Ah,
but computations are not exact. . .
Figure 2.2: Factoring in round-off error in our analysis, we get a precise un-
derstanding of how accurately the finite difference formula can approximate
derivatives in practice. The rule of thumb h ≈ 10−5 is fine in this instance.
9: The bound is only approximate because the interval is not quite exactly halved at each iteration, also because of round-off errors. That effect is negligible, but it is a good exercise to try and account for it.
a = 4.9651142317442760e+00
b = 4.9651142317442769e+00
c = 4.9651142317442769e+00
b - a = 8.8817841970012523e-16
eps(a) = 8.8817841970012523e-16
Indeed, the distance between a and the next representable number as given
by eps(a) is exactly the distance between a and b. As a result, c is rounded
to either a or b (in this case, to b.)
A final remark: the story above suggests we can get an approximation of
a root ξ basically up to machine precision. If you feel courageous, you could
challenge this statement and ask: how about errors in computing f (ck )?
When ck is close to a root, f (ck ) is close to zero, hence we might get the sign
wrong and move to the wrong half interval. . . We won’t go there today.
Figure 2.3: Figure 1.3 with an extra curve: the red curve shows the bound
ℓ0/2^k + ε_mach |ξ|, which better predicts the behavior of the interval length during
bisection under inexact arithmetic.
where we have to specify the order of summation. Let's say we sum x1 with
x2, then the result with x3, then the result with x4, etc., that is, we sum the
big numbers first. Using fl(x_i) = x_i (1 + ε_i), establish the following identity,
where ε_(i) is the relative error incurred by the ith addition (there are n − 1
of them):

⊕_{i=1}^n fl(x_i) = Σ_{i=1}^n x_i + Σ_{i=1}^n x_i ( ε_i + Σ_{j=i−1}^{n−1} ε_(j) ) + O(ε_mach^2).   (2.4)
Since Σ_{i=1}^∞ 1/i^2 = π^2/6 and log(n) ≤ Σ_{i=1}^n 1/i ≤ 1 + log(n), we further get

| Σ_{i=1}^n 1/i^2 − ⊕_{i=1}^n fl(1/i/i) | ≤ (π^2/6) (3 + n − log(n)) ε_mach + O(ε_mach^2).

This bounds the error to roughly (π^2/6) n ε_mach, which, for large n, is a relative
error of about n ε_mach. This is not so good: if n = 10^9, then we only expect
6 accurate digits after the decimal point.
On the other hand, if we had summed with i ranging from n to 1 rather
than from 1 to n (small numbers first), then in (2.4) we would have summed
i/i^2 = 1/i rather than (n − i)/i^2, so that

| Σ_{i=1}^n 1/i^2 − ⊕_{i=1}^n fl(1/i/i) | ≤ 4 ε_mach Σ_{i=1}^n 1/i^2 + ε_mach Σ_{i=1}^n 1/i + O(ε_mach^2)
                                        ≤ ( 4 π^2/6 + log(n) + 1 ) ε_mach + O(ε_mach^2).

This is much better! For n = 10^9, the error is smaller than 29 ε_mach < (1/2) 10^−14,
which means the result is accurate up to 14 digits after the decimal point!
One final point: in experimenting with this, be careful that even though
Σ_{i=1}^n 1/i^2 → π^2/6 as n → ∞, for finite n, the difference may be bigger than
10^−14. In particular, if we let n = 10^9, then

π^2/6 − Σ_{i=1}^n 1/i^2 = Σ_{i=n+1}^∞ 1/i^2 ≥ Σ_{i=10^9+1}^{2·10^9} 1/i^2 ≥ 10^9 · 1/(2 · 10^9)^2 = .25 · 10^−9.

Thus, when comparing the finite sum with π^2/6, at best, only 9 digits after the
decimal point will coincide (and it could be fewer); that is not a mistake: it
only has to do with the convergence of that particular sum.
total1 = 0;
for ii = 1:1:n
    total1 = total1 + 1/ii^2;
end
fprintf('Sum large first: %.16f\n', total1);

total2 = 0;
for ii = n:-1:1
    total2 = total2 + 1/ii^2;
end
fprintf('Sum small first: %.16f\n', total2);
Question 2.7. At a “big picture level”, why does it make sense to sum the
small numbers first?
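A stripped-down illustration of the same effect (Python; the numbers are our choices): add 1.0 and one hundred thousand copies of 10^−16.

```python
# Big number first: each tiny term is below eps/2 relative to the running
# total, so every single addition rounds back down to 1.0.
tiny = [1e-16] * 100000

big_first = 1.0
for t in tiny:
    big_first += t            # rounds to 1.0 at every step

# Small numbers first: the tiny terms accumulate to about 1e-11
# before ever meeting 1.0, and that total survives the final addition.
small_first = sum(tiny) + 1.0
```

Summing big-first loses every tiny term; summing small-first keeps their collective contribution.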
Chapter 3

Linear systems of equations
In this chapter, we solve linear systems of equations (Ax = b), we discuss the
sensitivity of the solution x to perturbations in b, we consider the implica-
tions for solving least-squares problems, and get into the important problem
of computing QR factorizations. As we do so, we encounter a number of
algorithms for which we ask the questions: What is its complexity in flops
(floating point operations)? And also: What could break the algorithm?
We follow mostly the contents of [TBI97]: specific lectures are referenced
below. See blackboard for Matlab codes used in class.
3.1 Solving Ax = b
We aim to solve the following problem, where we assume A is invertible to
ensure existence and uniqueness of the solution.
Question 3.4. What is the complexity of this algorithm in flops, that is:
how many floating point operations (or arithmetic operations) are required to
execute it, as a function of n?
3.2 Conditioning of Ax = b
See Lecture 12 in [TBI97] for general context. We discuss mainly the
following points in class.
When solving Ax = b with A ∈ R^{n×n} invertible and b nonzero, how
sensitive is the solution x to perturbations in the right hand side b? If b is
perturbed and becomes b + δb for some δb ∈ R^n, then surely the solution of
the linear system changes as well, and becomes x + δx, where

A(x + δx) = b + δb,    that is,    δx = A^{−1} δb.
How large could the deviation δx be? We phrase this question in relative
terms, that is, we want to bound the relative deviation in terms of the relative
perturbation:

‖δx‖/‖x‖ ≤ [ something to determine ] · ‖δb‖/‖b‖.
In asking this question, we want to consider the worst case over all possible
perturbations δb. It is also important to stress that this is about the sensitivity
of the problem, not about the sensitivity (or rather, stability) of any given
algorithm. If a problem is sensitive to perturbations, this affects all possible
algorithms to solve that problem. We must keep this in mind when assessing
candidate algorithms (“à l’impossible, nul n’est tenu.”)
One concept is particularly important to our discussion: the notion of
matrix norm; see Lecture 3 in [TBI97]. Given a vector norm ‖·‖ (for
example, ‖x‖_2 = √(x^T x)), we can define a subordinate matrix norm as

‖A‖ = max_{x ≠ 0} ‖Ax‖ / ‖x‖.
given vector.
Question 3.6. Show that for the 2-norm, ‖A‖_2 = σ_max(A), the largest singular value of A.

Question 3.7. Show that for the 2-norm, κ(A) = σ_max(A) / σ_min(A).
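A tiny worked example of the worst case (Python; the matrix and the perturbation are our choices): for a diagonal A, sending b along the most amplified direction and δb along the least amplified one attains the factor κ(A).

```python
import math

# A = diag(100, 0.01): sigma_max = 100, sigma_min = 0.01.
sigma_max, sigma_min = 100.0, 0.01
kappa = sigma_max / sigma_min                  # condition number 10^4

b       = (1.0, 0.0)     # right hand side along the most amplified direction
delta_b = (0.0, 1e-6)    # perturbation along the least amplified direction

x       = (b[0] / sigma_max, b[1] / sigma_min)              # x = A^{-1} b
delta_x = (delta_b[0] / sigma_max, delta_b[1] / sigma_min)  # A^{-1} delta_b

norm = lambda v: math.hypot(v[0], v[1])
rel_db = norm(delta_b) / norm(b)
rel_dx = norm(delta_x) / norm(x)
amplification = rel_dx / rel_db    # attains kappa for this b and delta_b
```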
We assume m > d + 1, so that there are strictly more data than there are
unknowns. Notice that

p(t_i) − b_i = [ t_i^d  t_i^{d−1}  · · ·  t_i^1  t_i^0 ] (a_d, a_{d−1}, . . . , a_1, a_0)^T − b_i.
Question 3.9. Show that if A does not have full column rank, then the least
squares problem cannot have a unique solution.
If b is in the image of A (also called the range of A), that is, if b belongs to
the subspace of linear combinations of the columns of A, then, by definition,
there exists a solution x to the over-determined system of equations Ax = b.
If that happens, x is a solution to the least squares problem (why?).
That is typically not the case. In general, for an over-determined system,
we do not expect b to be in the image of A. Instead, the solution x to the
least squares problem is such that Ax is the vector in the image of A which
is closest to b, in the 2-norm. In your introductory linear algebra course, you
probably argued that this implies b − Ax (the residue) is orthogonal to the
image of A. Equivalently, b − Ax is orthogonal to all columns of A (since
they form a basis of the image of A). Algebraically:
A^T (b − Ax) = 0.

That is:

A^T A x = A^T b.
Question 3.10. Show that AT A is invertible iff A has full column rank.
Problem 3.11. In the 2-norm, show that κ(A^T A) = κ(A)^2, where we extend
the definition κ(A) = σ_max(A) / σ_min(A) to rectangular matrices.
Substituting a reduced QR factorization A = Q̂R̂ into the system
A^T A x = A^T b, we get

R̂^T Q̂^T Q̂ R̂ x = R̂^T Q̂^T b.
Question 3.12. Show that R̂ is invertible iff A has full column rank.
Thus, using both invertibility of R̂T (under our assumption that A has
full column rank) and the fact that Q̂T Q̂ = I (since Q̂ has orthonormal
columns), we find that the system simplifies to:
R̂x = Q̂T b.
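A minimal end-to-end sketch in pure Python (fitting b ≈ x0 + x1 t; the data are made up so the fit is exact): Gram–Schmidt gives Q̂ and R̂, then one back-substitution solves R̂x = Q̂^T b.

```python
import math

t = [0.0, 1.0, 2.0, 3.0]
b = [1.0, 3.0, 5.0, 7.0]                 # exactly b = 1 + 2 t

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

# Columns of A: the all-ones vector and t. Orthonormalize them:
a1 = [1.0] * 4
a2 = t
r11 = math.sqrt(dot(a1, a1))
q1  = [v / r11 for v in a1]
r12 = dot(q1, a2)
a2p = [a2[i] - r12 * q1[i] for i in range(4)]  # component orthogonal to q1
r22 = math.sqrt(dot(a2p, a2p))
q2  = [v / r22 for v in a2p]

# Solve the 2x2 triangular system R x = Q^T b by back-substitution:
c1, c2 = dot(q1, b), dot(q2, b)
x1 = c2 / r22
x0 = (c1 - r12 * x1) / r11
```

Since the data lie exactly on a line, the computed coefficients are x0 = 1 and x1 = 2 up to round-off.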
x = (A^T A)^{−1} A^T b,
x̃ = (A^T A)^{−1} A^T b̃.

The matrix (A^T A)^{−1} A^T plays a special role: it is called the pseudo-inverse
of A. We denote it by A⁺:
Notice that this is an SVD of A+ (up to the fact that one would normally
order the singular values from largest to smallest, and they are here ordered
from smallest to largest.) In particular, the largest singular value of A+
is 1/σ_n. This gives the following meaning to the definition we gave earlier
without justification for the condition number of a full-rank, rectangular
matrix:

‖A‖ = σ_max(A),   ‖A⁺‖ = 1/σ_min(A),   κ(A) = σ_max(A)/σ_min(A) = ‖A‖ ‖A⁺‖.
The first factor is at most 1: it can only help (in that its effect on sensitivity
can only be to lower it.) It indicates that it is preferable (from a sensitivity
point of view) to have x in a subspace that is damped rather than amplified
by A. The second factor has a good geometric interpretation. Consider the
following:
Ax = AA⁺b is the orthogonal projection of b onto the range of A, and

‖Ax‖/‖b‖ = ‖AA⁺b‖/‖b‖,

the ratio between the norm of that orthogonal projection and the norm of b,
is given by the cosine of the angle between b and that subspace.
Let θ denote this angle and let η = ‖A‖‖x‖/‖Ax‖ ≥ 1. Then, we can summarize our
findings as:

‖δx‖/‖x‖ ≤ (κ(A)/(η cos θ)) ‖δb‖/‖b‖ ≤ (κ(A)/cos θ) ‖δb‖/‖b‖.
Notice the role of θ: if b is close to the range of A, then Ax ≈ b “almost”
has a consistent solution: this is rewarded by having cos θ close to 1. On the
other hand, if b is almost orthogonal to the range of A, then we are very far
from having a consistent solution and sensitivity is exacerbated, as indicated
by cos θ close to 0.
3.5 Computing QR factorizations, A = Q̂R̂

Gram–Schmidt orthonormalization expresses the columns of A as:

a1 = r11 q1,   (3.5)
a2 = r12 q1 + r22 q2,
a3 = r13 q1 + r23 q2 + r33 q3,
. . .
a2⊥ = a2 − (q1ᵀ a2) q1,
a3⊥ = a3 − (q1ᵀ a3) q1 − (q2ᵀ a3) q2,
The procedure is continued until n vectors have been produced, with general formulas for j ranging from 1 to n:

aj⊥ = aj − Σ_{i=1}^{j−1} (qiᵀ aj) qi,        qj = aj⊥ / ‖aj⊥‖.  (3.6)
a1 = ‖a1‖ q1,                                   with r11 = ‖a1‖,
a2 = (q1ᵀ a2) q1 + ‖a2⊥‖ q2,                    with r12 = q1ᵀ a2 and r22 = ‖a2⊥‖,
a3 = (q1ᵀ a3) q1 + (q2ᵀ a3) q2 + ‖a3⊥‖ q3,      with r13 = q1ᵀ a3, r23 = q2ᵀ a3 and r33 = ‖a3⊥‖,
…
³ This ensures a2⊥ is orthogonal to q1, hence q2 (which is just a scaled version of a2⊥) is also orthogonal to q1, as desired. To get the formulas, we use the fact that these are equivalent: (a) project to the orthogonal complement of a space; and (b) project to the space, then subtract that from the original vector.
⁴ We could also set q2 = −(1/‖a2⊥‖) a2⊥: taking the positive sign is a convention; it will lead to a positive diagonal for R.
function [Q, R] = CGS(A)
    [m, n] = size(A);
    Q = zeros(m, n);
    R = zeros(n, n);
    for j = 1 : n
        v = A(:, j);
        % Subtract from a_j its components along the previous q_i's,
        % computing each coefficient from the ORIGINAL column: r_ij = q_i' * a_j.
        for i = 1 : j-1
            R(i, j) = Q(:, i)' * A(:, j);
            v = v - R(i, j) * Q(:, i);
        end
        R(j, j) = norm(v);
        Q(:, j) = v / R(j, j);
    end
end
Numerically, CGS is seen to behave rather poorly. (See codes lecture LSQ first contact.m and lecture QR comparison.m, shown in class.) What could possibly have gone wrong? Here is a list of suspects:

Columns of Q̂ are not really orthonormal?
aj⊥ = aj − (q1ᵀ aj) q1 − · · · − (qj−1ᵀ aj) qj−1.

a3⊥ = a3 − (q1ᵀ a3) q1 − (q2ᵀ a3) q2.
v3 ← a3
v3 ← v3 − (q1ᵀ v3) q1 = (I − q1 q1ᵀ) v3
v3 ← v3 − (q2ᵀ v3) q2 = (I − q2 q2ᵀ) v3.

This procedure reads as follows: initialize v3 to a3; project v3 to the orthogonal complement of q1; and project the result to the orthogonal complement of q2. More generally, the rule to obtain aj⊥ is:

aj⊥ = (I − qj−1 qj−1ᵀ) · · · (I − q1 q1ᵀ) aj.
This small modification turns out to be beneficial. Why? Follow the numerical errors: each projection I − qi qiᵀ introduces some round-off error. Importantly, much of that round-off error will be eliminated by the next projector, I − qi+1 qi+1ᵀ, because components of the error that happen to be aligned with qi+1 will be mathematically zeroed out (numerically, they are greatly reduced). One concrete effect, for example, is that qj is as orthogonal to qj−1 as numerically possible, because the last step in obtaining aj⊥ is to apply the projector I − qj−1 qj−1ᵀ. This is not so for CGS.
This modification of CGS, called MGS, is presented in a neatly organized
procedure as Algorithm 3.5. The algorithm is organized somewhat differ-
ently from Algorithm 3.4 (specifically, it applies the projector I − qi qiT to
all subsequent vj ’s as soon as qi becomes available), but other than that it
is equivalent to taking Algorithm 3.4 and only replacing rij ← qiT aj with
rij ← qiT vj . Here is Matlab code for MGS.
function [Q, R] = MGS(A)
    [m, n] = size(A);
    Q = zeros(m, n);
    R = zeros(n, n);
    V = A;  % working copy: columns are progressively orthogonalized
    for j = 1 : n
        R(j, j) = norm(V(:, j));
        Q(:, j) = V(:, j) / R(j, j);
        % As soon as q_j is available, project it out of all subsequent
        % columns, using the UPDATED vectors: r_ji = q_j' * v_i.
        for i = j+1 : n
            R(j, i) = Q(:, j)' * V(:, i);
            V(:, i) = V(:, i) - R(j, i) * Q(:, j);
        end
    end
end
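To see the difference concretely, here is a hedged NumPy translation of the two algorithms (the Vandermonde-type test matrix is an illustrative choice, not from the notes); on an ill-conditioned input, MGS leaves QᵀQ far closer to the identity than CGS does:

```python
import numpy as np

def cgs(A):
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # coefficient from the ORIGINAL column
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

def mgs(A):
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    V = A.astype(float).copy()
    for j in range(n):
        R[j, j] = np.linalg.norm(V[:, j])
        Q[:, j] = V[:, j] / R[j, j]
        for i in range(j + 1, n):
            R[j, i] = Q[:, j] @ V[:, i]   # coefficient from the UPDATED column
            V[:, i] -= R[j, i] * Q[:, j]
    return Q, R

# Ill-conditioned test matrix (columns are monomials sampled on [0, 1]).
t = np.linspace(0.0, 1.0, 80)
A = np.vander(t, 12, increasing=True)

Qc, Rc = cgs(A)
Qm, Rm = mgs(A)
err_c = np.linalg.norm(Qc.T @ Qc - np.eye(12))
err_m = np.linalg.norm(Qm.T @ Qm - np.eye(12))
print(err_c, err_m)   # MGS is orders of magnitude closer to orthonormal
```

Note that the reconstruction A ≈ QR is excellent for both algorithms; it is only the orthonormality of Q that degrades, and much faster for CGS.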
MGS provides much better stability than CGS, but it is still not perfect.
In particular, while it is still true that A ≈ Q̂R̂, and indeed Q̂ is closer to
having orthonormal columns than when we use CGS, it may still be far from
orthonormality when A is poorly conditioned. We discuss fixes below. Before
we move on, ponder the following questions:
All the answers (with developments) are in [TBI97, Lec. 7–9]. In a nutshell:
3. If A has full column rank, we have existence of course (since the al-
gorithm produces such a factorization). As for uniqueness, the only
freedom we have (as a result of R̂ being upper triangular) is in the sign
of rjj . By forcing rjj > 0, the factorization is unique.
As a side note: if A is not full column rank, there still exists a Q̂R̂ factor-
ization, but it is not unique and its properties are different from the above:
see [TBI97, Thm. 7.1]. Furthermore, the reason we write the factorization as
A = Q̂R̂ instead of simply A = QR is because the latter notation is tradition-
ally reserved for the “full” QR decomposition (as opposed to the “economy
size” or “thin” or “partial” QR we have discussed here), in which Q ∈ Rm×m
is orthogonal (square matrix) and R ∈ Rm×n has an n × n upper triangular
block at the top, followed by rows of zeros. Matlab’s qr(A) produces the
full QR decomposition by default. To get the economy size, call qr(A, 0).
2. Plug this into the normal equations and cancel terms using orthonormality of Û, V̂ and invertibility of Σ̂:

AᵀA x = Aᵀ b
V̂ Σ̂ᵀ Ûᵀ Û Σ̂ V̂ᵀ x = V̂ Σ̂ᵀ Ûᵀ b
V̂ Σ̂ᵀ Σ̂ V̂ᵀ x = V̂ Σ̂ᵀ Ûᵀ b        (using Ûᵀ Û = I)
Σ̂ V̂ᵀ x = Ûᵀ b                  (canceling the invertible factor V̂ Σ̂ᵀ)
x = V̂ Σ̂⁻¹ Ûᵀ b.
This algorithm is the most stable of all the ones we discuss in these notes.
The drawback is that it requires computing the SVD of A, which is typically
more expensive than computing a QR factorization: more on this later.
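As a sketch of this SVD-based solve in NumPy (function name and test data are mine, for illustration):

```python
import numpy as np

def lstsq_svd(A, b):
    # Thin SVD: A = U diag(s) Vt, with U of size m-by-n.
    # Assumes A has full column rank, so all s > 0.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ b) / s)   # x = V Sigma^{-1} U^T b

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x = lstsq_svd(A, b)
print(x)
```

This agrees with the minimum-residual solution returned by np.linalg.lstsq.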
3.7 Regularization
Poor conditioning is often fought with regularization (on top of using stable
algorithms as above). We do not discuss this much in this course; keywords
are: Tikhonov regularization, or regularized least squares.
The general idea is to minimize kAx − bk2 + λkxk2 , with some chosen
λ > 0, instead of merely minimizing kAx − bk2 . This intuitively promotes
smaller solutions x by penalizing (we say, regularizing) the norm of x. In
practice, this is done as follows: given a least squares problem Ax = b,
replace it with the larger least squares problem Ã x = b̃, where

Ã = [ A ; √λ · In ]  (A stacked on top of √λ · In)    and    b̃ = [ b ; 0 ].  (3.7)

Indeed, ‖Ãx − b̃‖² = ‖Ax − b‖² + λ‖x‖².
Clearly, for λ = 0, κ(Ã) = κ(A), while for larger values of λ, the condition
number of à decreases, thus reducing numerical difficulties. On the other
hand, larger values of λ change the problem (and its solution): which value
of λ is appropriate depends on the application.
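A minimal NumPy sketch of the augmented problem (3.7); the value of λ below is for illustration only:

```python
import numpy as np

def tikhonov(A, b, lam):
    # Solve min ||Ax - b||^2 + lam ||x||^2 via the augmented problem (3.7).
    m, n = A.shape
    A_tilde = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_tilde = np.concatenate([b, np.zeros(n)])
    return np.linalg.lstsq(A_tilde, b_tilde, rcond=None)[0]

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x0 = tikhonov(A, b, 0.0)    # plain least squares
x1 = tikhonov(A, b, 10.0)   # regularized: smaller solution norm
print(np.linalg.norm(x0), np.linalg.norm(x1))
```

For λ > 0 the augmented solution coincides with the closed form (AᵀA + λI)⁻¹Aᵀb, and its norm shrinks as λ grows.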
Using MGS results in A ≈ Q̂1 R̂1, yet the columns of Q̂1 are not quite orthonormal.
Importantly though, Q̂1 is much better conditioned than A: after all, even
though it is not quite orthogonal, it is much closer to being orthogonal than
A itself, and orthogonal matrices have condition number 1 in the 2-norm.
The trick is to apply MGS a second time, this time to Q̂1 :
Q̂1  −→  Q̂2 R̂2    (by MGS).
Defining Q̂ = Q̂2 and R̂ = R̂2 R̂1 (which is indeed upper triangular because it
is a product of two upper triangular matrices) typically results in an excellent
QR factorization of A. In the experiments on Blackboard, this QR is as
good as the one built into Matlab as the function qr. (The latter is based
on Householder triangularization.) It is however slower, since Householder
triangularization is cheaper than two calls to MGS (by some constant factor.)
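The "MGS twice" trick can be sketched as follows (NumPy; the ill-conditioned test matrix is an arbitrary illustrative choice):

```python
import numpy as np

def mgs(A):
    # Modified Gram-Schmidt (compact sketch).
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(A[:, j])
        Q[:, j] = A[:, j] / R[j, j]
        for i in range(j + 1, n):
            R[j, i] = Q[:, j] @ A[:, i]
            A[:, i] -= R[j, i] * Q[:, j]
    return Q, R

t = np.linspace(0.0, 1.0, 60)
M = np.vander(t, 10, increasing=True)   # poorly conditioned

Q1, R1 = mgs(M)        # M ~ Q1 R1, but Q1 is only roughly orthonormal
Q2, R2 = mgs(Q1)       # re-orthogonalize Q1
Q, R = Q2, R2 @ R1     # R is upper triangular (product of triangular factors)

err1 = np.linalg.norm(Q1.T @ Q1 - np.eye(10))
err2 = np.linalg.norm(Q.T @ Q - np.eye(10))
print(err1, err2)      # the second pass restores orthonormality
```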
At this point, instead of canceling Q̂T Q̂, we could solve these systems:
Q̂T Q̂y = Q̂T b
R̂x = y.
We would expect this to fare better, because Q̂T Q̂ ought to be better con-
ditioned than AT A. However, this would be fairly expensive, as the system
Q̂T Q̂ does not have particularly favorable structure (such as triangularity).
Instead, we do the following: augment the matrix A with the vector b,
and compute a QR factorization of that matrix (for example, using simple
MGS):
Ã = [ A  b ]  −→  Q̃ R̃    (by MGS).

Write R̃ in block form as R̃ = [ R̂  r ; 0  ρ ], where R̂ is n × n upper triangular, r is a column of length n, and ρ is a scalar. The least-squares solution then solves the triangular system

R̂x = r.

In summary: compute the QR factorization Q̃R̃ of [ A  b ], extract the top-left block of R̃, namely R̂, and extract the top-right column of R̃, namely r; then solve the triangular system R̂x = r. This turns out to yield a numerically well-behaved solution to the least-squares problem.
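A sketch of this recipe in NumPy; here the built-in QR factorization stands in for MGS:

```python
import numpy as np

def lstsq_augmented_qr(A, b):
    # QR of the augmented matrix [A b]; NumPy's qr stands in for MGS here.
    m, n = A.shape
    Qt, Rt = np.linalg.qr(np.column_stack([A, b]))
    R_hat = Rt[:n, :n]               # top-left n-by-n upper triangular block
    r = Rt[:n, n]                    # top-right column
    return np.linalg.solve(R_hat, r) # R_hat x = r (R_hat is triangular)

rng = np.random.default_rng(6)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x = lstsq_augmented_qr(A, b)
print(x)
```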
Chapter 4

Systems of nonlinear equations
where

f1(x1, x2) = (1/9)(x1² + x2² − 9),
f2(x1, x2) = −x1 + x2 − 1.
The locus f1 = 0, that is, the set of points x ∈ R2 such that f1 (x) = 0 is the
circle of radius 3 centered at the origin. The locus f2 = 0 is the line passing
through the points (0, 1) and (−1, 0). The two loci intersect at two points:
these are the solutions to the system f(x) = 0. See Figure 4.1.
We could solve this particular system analytically, but in general this task
may prove difficult. Thus, we set out to study general algorithms to compute
[Figure 4.1: the circle f1 = 0 and the line f2 = 0 in the (x1, x2)-plane, intersecting at the two solutions of f(x) = 0.]
x(k+1) = g(x(k) ).
g(x) = x − λf (x)
for some nonzero λ ∈ R. Let’s see how this fares on our example.
f = @(x) [(x(1)^2 + x(2)^2 - 9)/9 ; -x(1) + x(2) - 1];
lambda = 0.05;   % (f, lambda, g and x0 reconstructed from the text and the table below)
g = @(x) x - lambda*f(x);
x = [1 ; 0];
xlim([-5, 5]);
ylim([-5, 5]);
for k = 1 : 100
    disp(x');    % print the iterates shown in the table below
    x = g(x);
end
x1(k)        x2(k)
1            0
1.0444444 0.1
1.0883285 0.19722222
1.1315321 0.29177754
1.173946 0.38376527
1.2154714 0.4732743
1.2560194 0.56038416
1.2955105 0.64516592
1.3338739 0.72768315
1.3710475 0.80799268
1.4069774 0.88614543
1.4416172 0.96218702
1.4749279 1.0361585
% ...
1.6138751 2.6717119
1.6097494 2.66882
1.6057833 2.6658665
1.6019756 2.6628623
1.5983247 2.659818
1.5948288 2.6567433
1.5914856 2.6536476
1.588293 2.6505395
1.5852484 2.6474272
1.582349 2.6443182
1.5795921 2.6412198
1.5769746 2.6381384
1.5744933 2.6350802
1.5721451 2.6320509
Play with the parameters. You should find that the iteration easily diverges. I never saw it converge to the other root.
Thus, there is at least hope for simultaneous iteration. To understand
convergence, we turn toward the contraction mapping theorem we had in
one dimension, and try to generalize it to Rn .
4.2 Contractions in Rn
We shall prove Theorem 4.1 in [SM03].
Here are some typical vector norms that are often useful, called the 2-norm, 1-norm and ∞-norm respectively:

‖x‖2 = (x1² + · · · + xn²)^{1/2},    ‖x‖1 = |x1| + · · · + |xn|,    ‖x‖∞ = max_{1≤i≤n} |xi|.
See [TBI97, Lec. 3] for further details, including the associated subordinate
matrix norms: you should be comfortable with these notions. A couple of
remarks are in order.
2. All norms in Rⁿ are equivalent, that is, if ‖ · ‖a, ‖ · ‖b are any two vector norms, there exist constants c, C > 0 such that for any vector x, we have

c ‖x‖a ≤ ‖x‖b ≤ C ‖x‖a.

Here is one consequence: if ‖x(k) − ξ‖ → 0 in some norm, then the same is true in any other norm.
If we assume existence for now, then we can easily show parts 2 and 3. The technical part is proving existence, which we will do later. (The key point here is to notice that in the proofs of uniqueness and convergence, we assume existence of a fixed point. Thus, these proofs do not achieve much on their own: we need to prove existence separately.)

Uniqueness. Assume there exists a fixed point ξ ∈ D, that is, g(ξ) = ξ. For contradiction, assume there exists another fixed point η ∈ D. Then,

‖ξ − η‖ = ‖g(ξ) − g(η)‖ ≤ L ‖ξ − η‖.

Since L < 1, this forces ‖ξ − η‖ = 0, that is, η = ξ: a contradiction.
To prove existence of a fixed point ξ, we use the notion of Cauchy sequence. (For a single equation, we proved existence using the intermediate value theorem: that does not generalize to higher dimensions, which is why we need the extra work.)

Definition 4.4. A sequence x(0), x(1), x(2), . . . in Rⁿ is a Cauchy sequence in Rⁿ if for any ε > 0 there exists k0 (which usually depends on ε) such that:

∀ k, m ≥ k0,   ‖x(m) − x(k)‖ ≤ ε.
Here are a few facts related to this notion, which we shall not prove.
All of these terms are of the same form. Let’s look at one of them:
By induction,
where E(x, h) “goes to zero faster than h when h goes to zero”; that is:

lim_{h→0} ‖E(x, h)‖ / ‖h‖ = 0,

for any vector norm ‖ · ‖.
3. What this limit means is: for any tolerance ε > 0 (of our own choosing), there exists a bound δ > 0 (perhaps very small) such that the fraction is smaller than ε provided ‖h‖ ≤ δ. Explicitly: ‖E(x, h)‖ ≤ ε ‖h‖ whenever ‖h‖ ≤ δ.
We can ensure x(k+1) is closer to ξ than x(k) is, in the chosen norm, if we can make ‖Jg(ξ)‖ + ε < 1. Surely, if ‖Jg(ξ)‖ ≥ 1, then we lose. But as soon as ‖Jg(ξ)‖ < 1, there exists ε > 0 small enough such that ‖Jg(ξ)‖ + ε is still strictly less than 1. And since we get to pick ε, we can do that. The only downside is that it may require us to force ‖h‖ to be very small, that is: this may only be useful if x(k) is very close to ξ. Here is the take-away:
% Get a root with Matlab's fsolve (we will develop our own algorithm below)
x0 = [1 ; 2];
xi = fsolve(f, x0);
Figure 4.3: The root of f in the positive orthant is indeed a stable fixed
point for g with λ between 0 and about 0.6, as indicated by the 2-norm of
the Jacobian. After the fact, we find that λ = 0.33 would have been a better
choice than λ = 0.05, but this is hard to know ahead of time.
A word about different norms. Notice that the contraction mapping the-
orem asserts convergence to a unique fixed point provided g is a contraction
in some norm. It is important to remark that it is sufficient to have the
contraction property in one norm to ascertain convergence. Indeed, as the
example below indicates, it may be that g is not a contraction in some norm,
but is a contraction in another norm. The other way around: observing
kJg (ξ)k > 1 in any number of norms is not enough to rule out convergence
(contrary to the one-dimensional case where we only had to check |g 0 (ξ)|.)
lambda = 0.05;
g = @(x) x - lambda*f(x);
Jg = @(x) eye(2) - lambda*Jf(x);
% Get a root with Matlab's fsolve (we will develop our own algorithm below)
x0 = [1 ; 2];
xi = fsolve(f, x0);
We have kJg (ξ)k > 1 in all three norms. We cannot conclude from this
observation, but it does decrease our hope to see simple iteration with g
converge to that root.
g(x) = x − M f (x),
At any point x, for any choice of λ, the norm of the Jacobian is at least 1
in the 2-norm (and also in the 1-norm and in the ∞-norm).
Yet, if we allow ourselves to use a matrix M, then setting M = [ 1  0 ; 0  −1 ] yields:

g(x) = x − M f(x),    Jg(x) = I2 − M Jf(x) = (1/2) I2.

The 2-norm (and 1-norm, and ∞-norm) of this matrix is 1/2 < 1, hence, local convergence is guaranteed.
Based on our understanding of the critical role of a small Jacobian at ξ, ideally, we would pick M such that

M = (Jf(ξ))⁻¹,

assuming the inverse exists. Of course, we do not know ξ, let alone (Jf(ξ))⁻¹. Reasoning that at iteration k our best approximation for ξ is x(k) (as far as we know at that point), we are led to Newton's method:

x(k+1) = x(k) − (Jf(x(k)))⁻¹ f(x(k)).
One important remark is that one should not compute the inverse of Jf(x(k)). Instead, solve the linear system implicitly defined by the iteration (using LU with pivoting for example, possibly via Matlab's backslash):

Jf(x(k)) s(k) = −f(x(k)).

Add the solution s(k) to x(k) to obtain x(k+1). This is faster and numerically more accurate than computing the inverse of the matrix.
Importantly, with Newton’s method, we can get any root of f where the
Jacobian is nonsingular. On our original example (intersection of circle and
line), initializing Newton’s method at various points can yield convergence
to either root.
x = [1 ; 2]; % initialize
for k = 1 : 8
    fx = f(x);
    Jfx = Jf(x);
    x = x - Jfx \ fx;   % Newton step: solve Jfx * s = -fx, then x <- x + s
end
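For comparison, here is a self-contained Python version of the same loop on the circle-and-line example; the starting point and iteration count mirror the Matlab snippet:

```python
import numpy as np

def f(x):
    return np.array([(x[0]**2 + x[1]**2 - 9.0) / 9.0, -x[0] + x[1] - 1.0])

def Jf(x):
    return np.array([[2.0 * x[0] / 9.0, 2.0 * x[1] / 9.0],
                     [-1.0, 1.0]])

x = np.array([1.0, 2.0])                  # initialize
for k in range(8):
    s = np.linalg.solve(Jf(x), -f(x))     # Newton step: Jf(x) s = -f(x)
    x = x + s

# Converges to the root in the positive orthant:
# x1 = (-1 + sqrt(17))/2, x2 = x1 + 1.
print(x)
```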
The main theorem about Newton’s method follows [SM03, Thm. 4.4].
where we are not too precise about the term in the bracket for now. From this and the fact that f(ξ) = 0, we deduce that Jg(ξ) = 0. We now give further details about this computation. Define A(x) = (Jf(x))⁻¹. This is a differentiable function of x at ξ.² (This is similar to the observation that x ↦ 1/a(x) is differentiable at x provided a(x) ≠ 0 and a is differentiable.) Since g(x) = x − A(x) f(x), we have:

gi(x) = xi − Σ_{k=1}^{n} aik(x) fk(x),

so that

(∂gi/∂xj)(x) = δij − Σ_{k=1}^{n} [ (∂aik/∂xj)(x) fk(x) + aik(x) (∂fk/∂xj)(x) ].

At x = ξ, every fk(ξ) = 0, so only the second term in each bracket survives, and

Σ_{k=1}^{n} aik(ξ) (∂fk/∂xj)(ξ) = [A(ξ) Jf(ξ)]ij = [In]ij = δij.

Hence (∂gi/∂xj)(ξ) = δij − δij = 0, that is, Jg(ξ) = 0.
³ We omitted to verify that Newton's equation is well defined, that is, that Jf(x(k)) is
invertible for all k under some conditions on x(0) . The argument goes as follows: Jf (x) is
assumed to be a continuous function of x, hence its determinant is a continuous function
of x. Since the determinant is nonzero at ξ by assumption, it must be nonzero in a neigh-
borhood of ξ by continuity, hence: Newton’s method is well defined in that neighborhood.
Details are in [SM03, Thm. 4.4]: we omit them for simplicity.
‖x(k+1) − ξ‖ / ‖x(k) − ξ‖² ≤ c.

lim_{k→∞} ‖x(k+1) − ξ‖ / ‖x(k) − ξ‖² ≤ c.
Consider Figure 4.4, known as a Newton fractal, to anchor this remark. Despite this warning, Newton's method is one of the most useful algorithms in numerical analysis, as it allows us to refine crude approximations to high accuracy.

⁴ That is, such that the Jacobian of f at that root is nonsingular.
Chapter 5

Eigenproblems
Example 5.3. For example, consider A = [ 2  1 ; −3  3 ]. Its characteristic polynomial is

pA(λ) = det(A − λI2) = det [ 2−λ  1 ; −3  3−λ ] = λ² − 5λ + 9.
Its two complex roots are 2.5 ± i1.658 . . .: these are the eigenvalues of A.
Remark 5.4. Matlab provides the function poly which returns the coeffi-
cients of pA in the monomial basis. Notice that Matlab defines the charac-
teristic polynomial as det(λIn − A), which is equivalent to (−1)n pA (λ). Of
course, this does not change the roots.
Writing

pA(λ) = an λⁿ + · · · + a1 λ + a0,

a tempting approach is to compute these coefficients, then to compute the roots of pA (for example, using the iterative algorithms we covered to solve nonlinear equations.) This turns out to be a terrible idea.
% Setup (reconstructed; this part of the script is missing from these notes):
n = 50;                         % also try n = 400
A = randn(n) / sqrt(n);         % random matrix, eigenvalues roughly in the unit disk
p = poly(A);                    % coefficients of the characteristic polynomial
d1 = roots(p);                  % roots recovered from the coefficients
d2 = eig(A);                    % eigenvalues computed directly

subplot(1,2,1);
plot(real(d1), imag(d1), 'o', 'MarkerSize', 8); hold all;
plot(real(d2), imag(d2), 'x', 'MarkerSize', 8); hold off;
xlim([-1.3, 1.3]);
ylim([-1.3, 1.3]);
pbaspect([1,1,1]);
title('Eigenvalues of A and computed roots of p A');
xlabel('Real part');
ylabel('Imaginary part');
legend('Computed roots', '"True" eigenvalues');
subplot(1,2,2);
stem(n:-1:0, abs(p));
pbaspect([1.6,1,1]);
xlim([0, n]);
set(gca, 'YScale', 'log');
title('Coefficients of p A(\lambda)');
xlabel('Absolute coefficient of \lambdaˆk for k = 0...n');
ylabel('Absolute value of coefficient');
Running the code with n = 50 and 400 produces Figures 5.1 and 5.2. As can be seen from the figures, the coefficients of pA in the monomial basis (1, λ, λ², . . .) grow out of proportion already for small values of n: there are many orders of magnitude between them, far more than 16. Even if we find
an excellent algorithm to compute the coefficients in IEEE arithmetic and to
compute the roots of the polynomial from there, the round-off error on the
coefficients alone is already too much to preserve a semblance of accuracy.
This is because the conditioning of the problem “given the coefficients of
a polynomial in the monomial basis, find the roots of that polynomial” is
very bad. In other words: small perturbations of the coefficients may lead to
large perturbations of the roots. We claim this here without proof. We will
develop other ways.
[Figure 5.1: for n = 50, the left panel compares the computed roots with the “true” eigenvalues in the complex plane; the right panel shows the absolute values of the coefficients of pA(λ) on a logarithmic scale.]
Figure 5.2: Same as Figure 5.1 with n = 400: the computed roots are com-
pletely off. The strange pattern of the roots of the computed characteristic
polynomial is not too relevant for us. As a side note on that topic, notice that
the roots of a polynomial with random Gaussian coefficients follow the same
kind of pattern: x = roots(randn(401, 1)); plot(real(x), imag(x),
’.’); axis equal;.
A = V DV −1 .
In other words,
AV = V D,
Assumption 0: A is diagonalizable. Let the eigenvalues be ordered in such a way that

|λ1| > |λ2| ≥ · · · ≥ |λn|.
Ak = (V DV −1 )(V DV −1 ) · · · (V DV −1 )
= V D(V −1 V )D(V −1 V )D · · · (V −1 V )DV −1
= V Dk V −1 . (5.4)
Equivalently:
Ak V = V Dk , (5.5)
x(0) = c1 v1 + · · · + cn vn = V c. (5.6)
Thus,

A^k x(0) = A^k V c = V D^k c = c1 λ1^k v1 + · · · + cn λn^k vn.  (5.7)
Let us factor out the (complex) number λ1^k from this expression:

A^k x(0) = λ1^k ( c1 v1 + Σ_{j=2}^{n} cj (λj/λ1)^k vj ).  (5.8)

If λ1 is a positive real number, then λ1^k / |λ1^k| = 1 for all k, so that the limit exists:

lim_{k→∞} x(k) = lim_{k→∞} c1 v1 / ‖c1 v1‖2 = (c1/|c1|) v1.  (5.9)

(Here, we use ‖v1‖2 = 1.)
¹ If A is Hermitian (A = A∗), this is equivalent to saying: x(0) is not orthogonal to v1. For a general matrix, it is equivalent to saying: x(0) does not lie in the subspace spanned by v2, . . . , vn. If x(0) is taken as a (complex) Gaussian random vector, this is satisfied with probability 1.
The complex number c1/|c1| has modulus one: its presence is indicative of the fact that unit-norm eigenvectors are defined only up to phase: if v1 is an eigenvector, then so are −v1, iv1, −iv1 and all other vectors of the form e^{iθ} v1 for any θ ∈ R. The important point is that x(k) asymptotically aligns with v1, up to an unimportant complex phase. More generally, if λ1 is not a positive real number, then the limit of x(k) does not exist because the phase may keep changing due to the term λ1^k/|λ1^k|; this is inconsequential as it does not affect the direction spanned by x(k).²

The convergence rate is dictated by the ratio |λ2|/|λ1|. If this is close to 1, convergence is slow; if this is close to 0, convergence is fast. In all cases, unless this ratio is zero, the convergence is linear since the error decays as (|λ2|/|λ1|)^k times a constant.
If after k iterations v = x(k) is deemed a reasonable approximate dominant eigenvector of A, we can extract an approximate corresponding eigenvalue using the Rayleigh quotient:

Av ≈ λ1 v,   thus   v∗ A v ≈ λ1 v∗ v,   so   λ1 ≈ v∗ A v / ‖v‖2².  (5.10)
For A real and symmetric, see Theorem 5.13 in [SM03] or Theorem 27.1 in
[TBI97] for guarantees on the approximation quality.
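The whole procedure fits in a few lines of NumPy. This sketch builds a symmetric matrix with known eigenvalues (an arbitrary illustrative spectrum) so the answer can be checked:

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([10.0, 4.0, 2.0, 1.0, -0.5, -3.0]) @ Q.T  # known eigenvalues

x = rng.standard_normal(6)
x /= np.linalg.norm(x)
for k in range(100):
    x = A @ x
    x /= np.linalg.norm(x)    # renormalize to avoid over/underflow

lam = x @ A @ x               # Rayleigh quotient (||x||_2 = 1)
print(lam)                    # close to the dominant eigenvalue 10
```

Since |λ2|/|λ1| = 0.4 here, 100 iterations drive the error far below machine precision.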
The central idea behind inverse iteration is to use the power method on a
different matrix to address both these issues.
For a diagonalizable matrix A = V DV −1 and a given µ ∈ C, observe the
following:
A − µIn = V DV −1 − µV V −1 = V (D − µIn )V −1 .
² In comparison, observe that lim_{k→∞} x(k) (x(k))∗ = v1 v1∗: this limit exists even if λ1 is not real positive.
2. Once per iteration, solve the system M y (k+1) = x(k) using the LU
factorization for only O(n2 ) flops each time.
Question 5.9. Compare the costs of inverse iteration and the power method.
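A sketch of inverse iteration in NumPy follows (a symmetric test matrix with known eigenvalues, chosen for illustration; in production code one would LU-factor A − µIn once, as described above, rather than calling a dense solver each iteration):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([1.0, 2.0, 5.0, 8.0, 9.0]) @ Q.T   # known eigenvalues

mu = 4.6                      # shift: the eigenvalue closest to mu is 5
M = A - mu * np.eye(5)        # in practice: LU-factor M once, reuse the factors

x = rng.standard_normal(5)
x /= np.linalg.norm(x)
for k in range(50):
    x = np.linalg.solve(M, x) # each step amplifies the component along the
    x /= np.linalg.norm(x)    # eigenvector whose eigenvalue is nearest mu

lam = x @ A @ x               # Rayleigh quotient
print(lam)                    # close to 5
```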
(A − µIn )y = b.
δy = (A − µIn)⁻¹ δb
   = V (D − µIn)⁻¹ V⁻¹ δb
   = V (D − µIn)⁻¹ u
   = Σ_{i=1}^{n} (ui / (λi − µ)) vi.
The main idea behind Rayleigh quotient iteration (RQI) is to use the
Rayleigh quotient at each iteration to redefine µ. Given A and x(0) (with the
latter often chosen at random), RQI iterates as follows:
µk = ((x(k))ᵀ A x(k)) / ((x(k))ᵀ x(k)),  (5.14)
y(k+1) = (A − µk In)⁻¹ x(k),  (5.15)
x(k+1) = y(k+1) / ‖y(k+1)‖2.  (5.16)
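A NumPy sketch of RQI on a symmetric matrix with a known (illustrative) spectrum; the stopping test guards against solving with a numerically singular shift:

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
d = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 9.0])
A = Q @ np.diag(d) @ Q.T              # symmetric with known eigenvalues

x = rng.standard_normal(6)
x /= np.linalg.norm(x)
for k in range(12):
    mu = x @ A @ x                    # (5.14): Rayleigh quotient (||x|| = 1)
    if np.linalg.norm(A @ x - mu * x) < 1e-12:
        break                         # mu is an eigenvalue to machine precision
    y = np.linalg.solve(A - mu * np.eye(6), x)   # (5.15)
    x = y / np.linalg.norm(y)         # (5.16)

lam = x @ A @ x
print(lam)   # converges to one of the eigenvalues (which one depends on x(0))
```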
A = P HP T .
A =V DV T , (5.17)
⁴ A matrix is in Hessenberg form if Aij ≠ 0 =⇒ j ≥ i − 1 (at most one nonzero subdiagonal).
Question 5.14. Let M = [ A  0 ; 0  B ] be a block-diagonal matrix with square diagonal blocks A and B. Show that the eigenvalues of A are eigenvalues of M. Similarly, show that
the eigenvalues of B are eigenvalues of M . Deduce that the set of eigenvalues
of M is exactly the union of the sets of eigenvalues of A and B.
p0(λ) = 1,
p1(λ) = a1 − λ,
p2(λ) = (a2 − λ)(a1 − λ) − b2²,
pk+1(λ) = (ak+1 − λ) pk(λ) − bk+1² pk−1(λ),   for k = 2, . . . , n − 1.  (5.19)
Question 5.16. Verify that the recurrence (5.19) for k = 1 is valid with the
definition p0 (λ) = 1.
This is important, since we know from the beginning of this chapter that the
coefficients of pn may be badly behaved. On the other hand, evaluating the
polynomials using the recurrence is fairly safe.
The following is the main theorem of this section. It tells us how we
can use the recurrence relation to determine where the eigenvalues of T are
located. Remember we assume b2 , . . . , bn are nonzero.
To count agreements in sign in the Sturm sequence, see Figure 5.6.6 The
following code illustrates this theorem; see also Figure 5.3.
The proof of the Sturm sequence property relies crucially on the following
theorem. (Remember that, by the spectral theorem, the k roots of pk are
real since Tk is symmetric.) We still assume all bi ’s are nonzero.
⁶ The rule given in [SM03] is incorrect for Sturm sequences ending with a 0.
Question 5.19. Write your own code to generate Figure 5.4, based on
the three-term recurrence. Notice that to evaluate pn (θ), you will evaluate
p0 (θ), p1 (θ), . . . , pn (θ), so that you will be able to plot all polynomials right
away. See if you can write your code to evaluate the three-term recurrence
at several values of θ simultaneously (using matrix and vector notations).
The Sturm sequence theorem allows us to find any desired eigenvalue of T through bisection. Indeed, define

](x) = the number of eigenvalues of T strictly larger than x.  (5.20)
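The count ](x) is cheap to evaluate through the three-term recurrence. Here is a hedged NumPy sketch, valid at generic x where no pk(x) vanishes (which sidesteps the zero-sign conventions; the test matrix is an arbitrary illustration):

```python
import numpy as np

def count_larger(a, b, x):
    # ](x): number of eigenvalues of the symmetric tridiagonal T (diagonal
    # a[0..n-1], off-diagonal b[0..n-2], all b's nonzero) strictly larger
    # than x, counted as sign agreements in (p0(x), p1(x), ..., pn(x)).
    # Sketch only: assumes x is generic, i.e. no p_k(x) is exactly zero.
    p_prev, p_cur = 1.0, a[0] - x          # p0 = 1 and p1
    count = 1 if p_cur > 0 else 0          # agreement between p0 (= +) and p1?
    for k in range(1, len(a)):
        p_next = (a[k] - x) * p_cur - b[k - 1]**2 * p_prev
        if (p_next > 0) == (p_cur > 0):    # same strict sign: one agreement
            count += 1
        p_prev, p_cur = p_cur, p_next
    return count

a = np.array([2.0, 1.0, 3.0, -1.0, 0.5])
b = np.array([1.0, 0.5, 2.0, 1.0])
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
w = np.linalg.eigvalsh(T)
for x in [-3.0, 0.1, 1.7, 4.2]:
    print(x, count_larger(a, b, x), int(np.sum(w > x)))
```

Bisection for λk then only needs repeated calls to this counter on shrinking intervals.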
[Figure 5.4: the characteristic polynomials of T1, . . . , T6 plotted on one axis; their roots interlace.]
Say we want to compute λk. Start with a0 < b0 such that λk ∈ (a0, b0]—we will see how to compute such initial bounds later. Consider the middle point c0 = (a0 + b0)/2 and compute ](c0). Two things can happen: either ](c0) ≥ k, or ](c0) < k.
In the first case, we determined that λk ∈ (c0, b0], while in the second case we found λk ∈ (a0, c0]. In both cases, we found an interval half the size of the original one, still with the guarantee that it contains λk. Upon iterating this procedure, we produce an interval of length |b0 − a0|/2^K in K iterations, each of which involves a manageable number of operations.
pk+1(α) = (ak+1 − α) pk(α) − bk+1² pk−1(α) = −bk+1² pk−1(α),
pk+1(β) = (ak+1 − β) pk(β) − bk+1² pk−1(β) = −bk+1² pk−1(β).
By the intermediate value theorem, if pk+1 (α) and pk+1 (β) have opposite
signs, then pk+1 admits a root in (α, β). According to the above (using that
bk+1 6= 0), this is the case exactly if pk−1 (α) and pk−1 (β) have opposite signs.
This in turn follows from the induction hypothesis: we assume here that the
roots of pk−1 and those of pk interlace, which implies that pk−1 has exactly
one root in (α, β). Hence, pk−1 (α) and pk−1 (β) have opposite signs as desired.
Apply this whole reasoning to each consecutive pair of roots of pk to locate
k − 1 roots of pk : we only have two more to locate.
Second, we want to show that pk+1 has a root strictly smaller than the smallest root of pk; let us call the latter γ. The key observation is that all the characteristic polynomials considered here go to positive infinity on the left side, that is, for all r:

lim_{λ→−∞} pr(λ) = +∞.  (5.22)
pk+1(γ) = (ak+1 − γ) pk(γ) − bk+1² pk−1(γ) = −bk+1² pk−1(γ).
If this is negative, then pk+1 must have a root strictly smaller than γ since
it must obey (5.22): pk+1 (λ) has to become positive eventually as λ → −∞.
According to the above, this happens exactly if pk−1 (γ) is positive (using
again that bk+1 6= 0). By the induction hypothesis, all the roots of pk−1 are
strictly larger than γ, and since pk−1 itself obeys (5.22), it follows that indeed
pk−1 (γ) > 0, as desired. This shows that pk+1 has at least one root strictly
smaller than γ.
To conclude, we need only show that pk+1 has a root strictly larger than
the largest root of pk . The argument is similar to the one above. As
only significant difference, we here use the fact that limλ→+∞ pk−1 (λ) and
limλ→+∞ pk+1 (λ) are both infinite of the same sign.
Before we get into the proof of the Sturm sequence property, let us make
three observations about how zeros may appear in such sequences:
1. The first entry of the sequence is always + (in particular, it is not 0).
2. There can never be two subsequent zeros. Indeed, assume for con-
tradiction that pk (θ) = pk−1 (θ) = 0. Then, the recurrence implies
0 = −b2k pk−2 (θ). Thus, pk−2 (θ) = 0 (under our assumption that
bk 6= 0.) Applying this same argument to pk−1 (θ) = pk−2 (θ) = 0
implies pk−3 (θ) = 0, etc. Eventually, we conclude p0 (θ) = 0, which is a
contradiction since p0 (θ) = 1 for all θ.
⁷ You can also see it from the recurrence relation (5.19), which shows the sign of the highest order term changes at every step.
Our goal is to show that sn (θ) = gn (θ). By induction, we prove that sk (θ) =
gk (θ) for all k, which implies the result. The first step is to secure the base
case, k = 1. By definition,
s1(θ) = number of sign agreements in (1, a1 − θ) = 1 if a1 − θ > 0, and 0 otherwise.
Similarly, owing to the Cauchy interlacing theorem, for any θ, the difference gk(θ) − gk−1(θ) can be either 0 or 1. Specifically, gk(θ) = gk−1(θ) + 1 if, compared to pk−1, pk has one more root > θ, and gk(θ) = gk−1(θ) otherwise.
Thus, using the induction hypothesis sk−1 (θ) = gk−1 (θ), in order to show
that sk (θ) = gk (θ), we only need to verify that the following conditions are
in fact equivalent:
This is best checked on a drawing: see Figure 5.5. There are four cases to verify, using Cauchy interlacing and the fact that limλ→−∞ pr(λ) = +∞ for r = k − 1, k.
2. θ < the smallest root of pk : then both pk−1 (θ) and pk (θ) are positive.
We have a sign agreement, and indeed pk has k roots strictly larger
than θ, which is one more than pk−1 .
3. θ > the largest root of pk : then pk−1 (θ) and pk (θ) (nonzero) have
opposite sign. We have no sign agreement, and indeed pk and pk−1
have the same number of roots strictly larger than θ, namely, zero.
(a) θ ∈ (α, γ): there is no sign agreement since pk−1 (θ) and pk (θ)
(both nonzero) have opposite sign (check this by following the
sign patterns on a drawing). And indeed, since θ passed one more
root of pk than it passed roots of pk−1 , there are an equal number
of roots of pk and pk−1 strictly larger than θ.
(b) θ = γ: the sequence is (0, pk (θ)). There is a sign agreement, and
indeed θ passed as many roots of pk as it passed roots of pk−1 , so
that pk has one more strictly larger than θ;
(c) θ ∈ (γ, β): there is sign agreement (for the same reason that there
was no sign agreement in the first case), and there is one more
root of pk to the right of θ than there are roots of pk−1 .
where ‖A‖ is the norm of A, subordinate to the vector norm ‖ · ‖. For symmetric matrices, this means each eigenvalue is in the interval [−‖A‖, +‖A‖]. For this observation to be practical, it only remains to pick a vector norm whose subordinate matrix norm is straightforward to compute. A poor choice is the 2-norm (since the subordinate norm is the largest singular value: this is as hard to compute as an eigenvalue); good choices are the 1-norm and ∞-norm:

Any eigenvalue λ of A obeys |λ| ≤ ‖A‖1 and |λ| ≤ ‖A‖∞.
[Figure 5.6: how to count sign agreements in a Sturm sequence, including the convention for entries equal to 0.]
Thus, we can start the Sturm bisection with the interval [−‖A‖1, ‖A‖1] or [−‖A‖∞, ‖A‖∞].⁸
Technique 2. The above strategy determines one large disk in the com-
plex plane which contains all eigenvalues. If the eigenvalues are spread out,
this can only give a very rough estimate of their location. A more refined
approach consists in determining a collection of (usually) smaller disks whose
union contains all eigenvalues. These disks are called Gerschgorin disks. The
code below produces Figure 5.7; then we will see how they are constructed.
⁸ The Sturm bisection technically works with intervals of the form (a, b], not [a, b]. Hence, if the smallest eigenvalue is targeted, one should check whether −‖A‖1 or −‖A‖∞ is a root of pn, simply by evaluating it there, or one can make the interval slightly larger.
%% Locating eigenvalues
A = [7 2 0 ; -1 8 1 ; 2 2 0];
e = eig(A);
plot(real(e), imag(e), '.', 'MarkerSize', 25);
xlabel('Real');
ylabel('Imaginary');
title('Eigenvalues of A');
xlim([-15, 15]);
ylim([-15, 15]);
axis equal;
hold on;
%%
% subordinate norms
circles(0, 0, norm(A, 1), 'FaceAlpha', .1);
circles(0, 0, norm(A, inf), 'FaceAlpha', .1);
% The code for 'circles' is here:
% https://www.mathworks.com/matlabcentral/ ...
% fileexchange/45952-circle-plotter
%%
% Gerschgorin disks
for k = 1 : size(A, 1)
radius = sum(abs(A(k, [(1:k-1), (k+1:end)])));
circles(real(A(k, k)), imag(A(k, k)), radius, ...
'FaceAlpha', .1);
end
hold off;
Reorganizing:

(λ − akk) xk = Σ_{i≠k} aki xi.  (5.26)
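The resulting disk statement is easy to verify numerically; a sketch for the 3 × 3 example used above:

```python
import numpy as np

A = np.array([[ 7.0, 2.0, 0.0],
              [-1.0, 8.0, 1.0],
              [ 2.0, 2.0, 0.0]])

centers = np.diag(A)                              # disk centers: a_kk
radii = np.abs(A).sum(axis=1) - np.abs(centers)   # off-diagonal row sums

# Every eigenvalue must lie in at least one Gerschgorin disk.
for lam in np.linalg.eigvals(A):
    print(lam, bool(np.any(np.abs(lam - centers) <= radii + 1e-12)))
```

Here the disks have centers 7, 8, 0 and radii 2, 2, 4, matching the disks drawn in Figure 5.7.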
[Figure 5.7: eigenvalues of A in the complex plane, together with the large disks of radius ‖A‖1 and ‖A‖∞ centered at the origin and the three Gerschgorin disks.]
= det( M⁻¹ A M − λ M⁻¹ In M )
= det( M⁻¹ (A − λIn) M )
From here, it is only a matter of iterating the same idea. Specifically, let us
design Q2ᵀ such that applying it to the left will make zeros appear in the right
places in the second column. Upon doing this, we must make sure not to
affect the first two rows, so that when we apply Q2 to the right, our work in
columns 1 and 2 will be unaffected. Specifically, build an orthogonal matrix
H2 of size n − 2 such that
Q2ᵀ Q1ᵀ A Q1 =
    [ × × 0 0 0
      × × × × ×
      0 × × × ×
      0 0 × × ×
      0 0 × × × ] ,
where

Q2ᵀ = [ I2  0
        0   H2 ] .
where

Q3ᵀ = [ I3  0
        0   H3 ] .
In general, define
Q = Q1 Q2 · · · Qn−2 .
QᵀAQ = T,    that is,    A = QTQᵀ,
Hx = ±‖x‖e1, (5.29)

where e1 is the first canonical vector in R^k, that is, e1ᵀ = [1 0 ⋯ 0].
Either sign is good: we will commit later. How do we construct a matrix H
satisfying these requirements? One good way is by reflection.10 Specifically,
consider
u = (x − Hx) / ‖x − Hx‖. (5.30)
¹⁰See Figure 10.1 in [TBI97], where their F is our H and their normalized v is our u.
This unit-norm vector defines a plane normal to it. We will reflect x across
that plane. To this end, consider the orthogonal projection of x to the span
of u: it is (uᵀx)u. Hence, the orthogonal projection of x to the plane itself is

Px = x − (uᵀx)u.
We can reflect x across the plane by moving from x to the plane, orthogonally,
then traveling the exact same distance once more (along the same direction).
Thus,
Hx = x + 2(Px − x) = 2Px − x = x − 2(uᵀx)u = (Ik − 2uuᵀ)x.

In other words, a good candidate is

H = Ik − 2uuᵀ.

Since ‖u‖ = 1, it is clear that H is orthogonal. Indeed,

HᵀH = (Ik − 2uuᵀ)ᵀ(Ik − 2uuᵀ) = Ik − 4uuᵀ + 4u(uᵀu)uᵀ = Ik,
so that Hᵀ = H⁻¹. If we can find an efficient way to compute u, then we
have an efficient way to find H. To this end, combine (5.29) and (5.30):
u ∝ x ∓ ‖x‖e1,
where by ∝ we mean that the vector on the left hand side is proportional to
the vector on the right hand side. The vector on the right hand side is readily
computed, and furthermore we know that u has unit norm, so that it is clear
how to compute u in practice: compute the right hand side, then normalize.
How do we pick the sign? Floating-point arithmetic considerations suggest
to pick the sign such that sign(x1) and sign(∓‖x‖) = sign(∓) agree (think
about it). Overall, given x ∈ R^k, we can compute u as follows:

1. u ← x + sign(x1)‖x‖e1,
2. u ← u/‖u‖.
This costs ∼ 3k flops. The corresponding reflector is H = Ik − 2uuᵀ. In
practice, we would not compute H explicitly: that would be quite expensive.
Instead, observe that applying H to a matrix can be done efficiently by
exploiting its structure, namely:

HM = M − 2u(uᵀM).
Computing the vector uᵀM costs ∼ 2k² flops. With uᵀM computed, computing
HM costs another ∼ 2k² flops. This is much cheaper than if we form
H explicitly (∼ k² flops), then compute a matrix-matrix product, at ∼ 2k³ flops.
Of course, we can also apply H on the right, as in MH, for ∼ 4k² flops as well.

(Marginal note: to compute 2uvᵀ, first multiply u by 2 for k flops, then do the
vector product for k² flops, as opposed to doing the vector product first, then
multiplying each entry by 2, which costs 2k² flops.)
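The recipe for u and the implicit application of H can be sketched as follows, in Python/NumPy rather than the MATLAB of the lecture code (function names are illustrative):

```python
import numpy as np

def householder_vector(x):
    """Unit vector u such that (I - 2 u u^T) x = -sign(x[0]) * ||x|| * e1.

    The sign is chosen as in the notes to avoid cancellation when x is
    close to a multiple of e1 (convention here: sign(0) = +1).
    """
    u = np.asarray(x, dtype=float).copy()
    sign = 1.0 if u[0] >= 0 else -1.0
    u[0] += sign * np.linalg.norm(u)
    return u / np.linalg.norm(u)

def apply_reflector(u, M):
    """Compute H M = M - 2 u (u^T M) without ever forming H."""
    return M - 2.0 * np.outer(u, u @ M)

x = np.array([3.0, 4.0, 0.0])
u = householder_vector(x)
Hx = apply_reflector(u, x.reshape(-1, 1)).ravel()
# x is mapped onto a multiple of e1 and its norm is preserved.
assert np.allclose(Hx, [-5.0, 0.0, 0.0])
```
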
Question 5.27. Using the above considerations, how many flops do you need
to compute the matrix T, given the matrix A? You should find that, without
exploiting the symmetry of A, you can do this in ∼ (10/3) n³ flops.
Question 5.28. The vectors u1 , . . . , un−2 produced along the way (to design
the reflectors H1 , . . . , Hn−2 ) can be saved in case it becomes necessary to apply
Q to a vector later on. How many flops does it take to apply Q to a vector
y ∈ Rn in that scenario? How does that compare to the cost of a matrix-vector
product Qy as if we had Q available as a matrix directly?
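Putting the pieces together, here is a sketch of the full tridiagonalization T = QᵀAQ (Python/NumPy; not the course's reference implementation). Each step builds u from the part of column k below the diagonal and applies H on both sides without ever forming H or Q:

```python
import numpy as np

def tridiagonalize(A):
    """Householder tridiagonalization of a symmetric matrix: T = Q^T A Q."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k+1:, k].copy()
        s = 1.0 if x[0] >= 0 else -1.0
        u = x
        u[0] += s * np.linalg.norm(x)
        nu = np.linalg.norm(u)
        if nu == 0:
            continue                   # column already has the right zeros
        u /= nu
        # Apply H = I - 2 u u^T on the left, then on the right.
        T[k+1:, k:] -= 2.0 * np.outer(u, u @ T[k+1:, k:])
        T[:, k+1:] -= 2.0 * np.outer(T[:, k+1:] @ u, u)
    return T

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B + B.T
T = tridiagonalize(A)
# T is tridiagonal and, being similar to A, has the same eigenvalues.
assert np.allclose(np.triu(T, 2), 0, atol=1e-10)
assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A))
```
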
Chapter 6

Polynomial interpolation
1, x, x², . . . , xⁿ.
Figure 6.1: For equispaced and Chebyshev interpolation points, the Vander-
monde matrix has exponentially growing condition number. Beyond degree
20 or so, one cannot hope to compute the coefficients of the interpolation
polynomial in the monomial basis with any reasonable accuracy in double
precision arithmetic. This does not mean, however, that the interpolation
polynomial cannot be evaluated accurately: only that methods based on
computing the coefficients in the monomial basis first are liable to incur a
large error.
“better” means that the matrix appearing in the linear system should have a
better condition number than the Vandermonde matrix. In doing so, we
might as well be greedy and aim for the best conditioned matrix of all: the
identity matrix.
We are looking for a basis of Pn made of n + 1 polynomials of degree at
most n,

L0, L1, . . . , Ln,
such that the interpolation problem reduces to a linear system with an iden-
tity matrix. Using this (for now unknown) basis, the solution p can be written
as
p(x) = ∑_{k=0}^{n} ak Lk(x).
For the matrix to be identity, we need each Lk to satisfy the following (consider
each column separately):

Lk(xk) = 1, and Lk(xi) = 0 for all i ≠ k.
The latter specifies n roots for each Lk , only leaving a scaling indeterminacy;
that scaling is determined by the former condition. There is only one possible
choice:
Lk(x) = ∏_{i≠k} (x − xi) / ∏_{i≠k} (xk − xi). (6.5)
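Formula (6.5) is easy to evaluate directly; the sketch below (Python/NumPy, names illustrative) checks the Kronecker-delta property Lk(xi) = δik that makes the system matrix the identity:

```python
import numpy as np

def lagrange_basis(xs, k, x):
    """Evaluate L_k(x) = prod_{i != k} (x - x_i) / (x_k - x_i), as in (6.5)."""
    terms = [(x - xi) / (xs[k] - xi) for i, xi in enumerate(xs) if i != k]
    return np.prod(terms, axis=0)

xs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
# L_k is 1 at x_k and 0 at every other interpolation point, which is
# exactly what turns the interpolation system into the identity matrix.
for k in range(len(xs)):
    assert np.allclose(lagrange_basis(xs, k, xs), np.eye(len(xs))[k])
```
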
subplot(2, 3, k);
hold all;
stem(x, ek, '.', 'MarkerSize', 15, 'LineWidth', 1.5);
hold off;
ylim([-1, 1.5]);
set(gca, 'YTick', [-1, 0, 1]);
set(gca, 'XTick', [-1, 0, 1]);
end
[Figure: six panels showing the Lagrange (cardinal) basis polynomials on [−1, 1]; each equals 1 at its own node and 0 at the other interpolation points.]
It is clear from above that the solution p exists and is unique. Neverthe-
less, we give here a uniqueness proof that does not involve any specific basis,
primarily because it uses an argument we will use frequently.
Theorem 6.4. The solution of the interpolation problem is unique.
Proof. For contradiction, assume there exist two distinct polynomials p, q ∈
Pn verifying p(xi) = yi and q(xi) = yi for i = 0, . . . , n. Then, the polynomial
h = p − q is also in Pn and it has n + 1 roots:

h(xi) = p(xi) − q(xi) = yi − yi = 0, for i = 0, . . . , n.

Yet, the only polynomial of degree at most n which has strictly more than n
roots is the zero polynomial. Thus, h = 0, and it follows that p = q.
If the data points (xi , yi ) are obtained by sampling a function f , that is,
yi = f (xi ), and x0 , . . . , xn are distinct points in [a, b], then, a natural question
is:
How large can the error f (x) − pn (x) be for x ∈ [a, b], where pn
is the interpolation polynomial for n + 1 points?
f1(x) = (cos(x) + 1)/2 over [0, 2π], and (6.7)
f2(x) = 1/(1 + x²) over [−5, 5]. (6.8)
Both are infinitely continuously differentiable. Yet, one will prove much
harder to approximate than the other. Run the lecture code to observe the
following:
In both cases, the errors seem largest close to the boundaries of [a, b],
which suggests sampling more points there. This is the idea behind
Chebyshev points, which we will justify later.
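The contrast between equispaced and Chebyshev points can be reproduced with a short experiment; this Python sketch is a stand-in for the lecture's MATLAB code (the threshold values in the assertions are only indicative):

```python
import numpy as np

def interpolate_max_error(f, xs, grid):
    """Max |f - p_n| over grid, where p_n interpolates f at the nodes xs
    (evaluated in Lagrange form, which is numerically benign here)."""
    p = np.zeros_like(grid)
    for k, xk in enumerate(xs):
        Lk = np.ones_like(grid)
        for i, xi in enumerate(xs):
            if i != k:
                Lk *= (grid - xi) / (xk - xi)
        p += f(xk) * Lk
    return np.max(np.abs(f(grid) - p))

f2 = lambda x: 1.0 / (1.0 + x**2)     # the hard-to-approximate function (6.8)
n = 14
grid = np.linspace(-5, 5, 2001)
equi = np.linspace(-5, 5, n + 1)
cheb = 5 * np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))
# Equispaced nodes: the error blows up near the boundary; Chebyshev nodes,
# clustered near the edges, keep it small.
assert interpolate_max_error(f2, equi, grid) > 1.0
assert interpolate_max_error(f2, cheb, grid) < 0.5
```
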
f(x) − pn(x) = (f^(n+1)(ξ) / (n + 1)!) πn+1(x), (6.9)
(This is clearer if you simply make a drawing of the situation on the real
line.) Combine all inequalities to control πn+1 :
|πn+1(x)| = |x − x0| ⋯ |x − xi−1| · |(x − xi)(x − xi+1)| · |x − xi+2| ⋯ |x − xn|
≤ (i + 1) ⋯ 2 · (1/4) · 2 ⋯ (n − i) h^(n+1).
The product of integers attains its largest value if x lies in one of the extreme
intervals: (x0 , x1 ) or (xn−1 , xn ), that is, i = 0 or i = n − 1. Hence,
|πn+1(x)| ≤ (n!/4) h^(n+1)

for all x ∈ [a, b].
Combining with Theorem 6.5, a direct corollary is that for equispaced
points the approximation error is bounded by (h^(n+1) / (4(n + 1))) Mn+1. As n increases, the
fraction decreases exponentially fast. Importantly, the function-dependent
Mn+1 can increase with n, possibly fast enough to still push the bound to
infinity. This in itself does not imply the actual error will go to infinity, but
it is indicative that large errors are possible.
Question 6.7. For f1 (6.7), what is a good value for Mn+1 ? Based on this,
what is your best explanation for the experimental behavior of the approxi-
mation error?
We now give a proof of the main theorem, following [SM03, Thm 6.2].
Proof of Theorem 6.5. It is only necessary to establish equation (6.9). If x
is equal to any of the interpolation points xi , that equation clearly holds.
Thus, it remains to prove (6.9) holds for x ∈ [a, b] distinct from any of the
interpolation points. To this end, consider the following function on [a, b]:
φ(t) = f(t) − pn(t) − ((f(x) − pn(x)) / πn+1(x)) πn+1(t).
(Note that x is fixed in this definition: φ is only a function of t.) Here are
the two crucial observations:
1. φ has at least n + 2 distinct roots in [a, b]. Indeed, φ(xi ) = 0 for
i = 0, . . . , n, and also φ(x) = 0, and
2. φ is n + 1 times continuously differentiable.
Using both facts, by the mean value theorem (or Rolle’s theorem), the deriva-
tive of φ has at least n+1 distinct roots in [a, b]. In turn, the second derivative
of φ has at least n distinct roots in [a, b]. Continuing this argument, we con-
clude that the (n + 1)st derivative of φ (which we denote by φ(n+1) ) has at
least one root in [a, b]: let us call it ξ. (Of course, ξ may depend on x in a
complicated way, but that is not important: we only need to know that such
a ξ exists.) Verify the two following claims:
6.2 Hermite interpolation
For example, Kk must have a double root at each xi with i ≠ k and a simple root at
xk: this accounts for all the 2n + 1 possible roots of Kk, thus it must be that

Kk(x) = (x − xk)(Lk(x))².

The scale is set by enforcing Kk′(xk) = 1. By chance, the above is already
properly scaled: it is an equality. To construct Hk, one can work similarly
(though it’s a tad less easy to guess), and one obtains:

Hk(x) = (1 − 2(x − xk)Lk′(xk))(Lk(x))².
f(x) − p2n+1(x) = (f^(2n+2)(ξ) / (2n + 2)!) (πn+1(x))², (6.12)

where πn+1(x) = (x − x0) ⋯ (x − xn). Defining M2n+2 = max_{ξ∈[a,b]} |f^(2n+2)(ξ)|,

|f(x) − p2n+1(x)| ≤ (M2n+2 / (2n + 2)!) (πn+1(x))². (6.13)
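As a sanity check of the construction, one can assemble the Hermite interpolant p2n+1 = ∑k f(xk)Hk + f′(xk)Kk explicitly. The sketch below (Python/NumPy) uses the basis Kk(x) = (x − xk)Lk(x)² and Hk(x) = (1 − 2(x − xk)Lk′(xk))Lk(x)², stated here as an assumption consistent with the root and scaling constraints described above:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def hermite_interpolant(xs, ys, dys):
    """Hermite interpolant p in P_{2n+1} with p(x_i) = y_i, p'(x_i) = dy_i."""
    p = P([0.0])
    for k, xk in enumerate(xs):
        Lk = P([1.0])
        for i, xi in enumerate(xs):
            if i != k:
                Lk *= P([-xi, 1.0]) / (xk - xi)   # factor (x - x_i)/(x_k - x_i)
        Lk2 = Lk * Lk
        dLk = Lk.deriv()(xk)
        Hk = (P([1.0]) - 2.0 * dLk * P([-xk, 1.0])) * Lk2
        Kk = P([-xk, 1.0]) * Lk2
        p = p + ys[k] * Hk + dys[k] * Kk
    return p

xs = np.array([0.0, 1.0, 2.0])
f, df = np.sin, np.cos
p = hermite_interpolant(xs, f(xs), df(xs))
# p matches both values and derivatives at the interpolation points.
assert np.allclose(p(xs), f(xs))
assert np.allclose(p.deriv()(xs), df(xs))
```
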
Chapter 7

Minimax approximation
This is well defined since |g| is continuous on the compact set [a, b] (Weier-
strass). It is an exercise to verify this is indeed a norm.
With this notion of norm, we can express (a slightly relaxed version of)1
the error bound above as:
‖f − pn‖∞ ≤ (Mn+1 / (n + 1)!) ‖πn+1‖∞. (7.2)

Furthermore, we have the convenient notation Mn+1 = ‖f^(n+1)‖∞.
This formalism leads to three natural questions:
¹The error bound (7.1) specifies it is the same x on both sides of the inequality,
whereas (7.2) takes the worst case on both sides independently.
min_{pn∈Pn} ‖f − pn‖∞.
That is: minimize the actual error rather than a bound on the error.
This is the central question of this chapter. We will characterize the
solutions (show existence, uniqueness and more), but we won’t do much
in the way of computations.
‖f − pn‖∞ = min_{q∈Pn} ‖f − q‖∞
The solution to this problem is called the minimax polynomial for f on [a, b],
because of the min-max combination.
Question 7.4. Show that the opposite cannot happen. Specifically, show
there exists a constant c such that ‖f‖₂ ≤ c‖f‖∞.
The catch is: the theorem does not specify how large the degree of p may
need to be as a function of f and ε. Admittedly, it could be impractically
large in general. In what follows, we maintain some control over the degree,
using results with the same flavor as Theorem 6.5. Note that the above result
extends to ‖·‖₂ directly using Question 7.4.
Theorem 7.6 (Existence, Theorem 8.2 in [SM03]). Given f ∈ C[a, b], there
exists pn ∈ Pn such that

‖f − pn‖∞ ≤ ‖f − q‖∞ for all q ∈ Pn.
Before we get to the proof, you may wonder: why does this necessitate
a proof at all? It all hinges upon the distinction between infimum and
minimum.4 The infimum of a set of reals is always well defined: it is the
largest lower-bound on the elements of the set. In contrast, the minimum is
only defined if the infimum is equal to some element in the set; we then say
the infimum or minimum is attained. In our scenario, each polynomial q ∈ Pn
maps to a real number ‖f − q‖∞. The set of those numbers necessarily has
an infimum. The question is whether some number in that set is equal to
the infimum, that is, whether there exists a polynomial pn in Pn such that
‖f − pn‖∞ is equal to the infimum. If that is the case, the infimum is called
the minimum, and pn is called a minimizer. This existence theorem is all
about that particular point.
Proof. Minimax approximation is an optimization problem:

min_{q∈Pn} E(q),
³https://en.wikipedia.org/wiki/Remez_algorithm
⁴https://math.stackexchange.com/questions/342749/what-is-the-difference-between-minimum-and-infimum
If we prove the first equality, then the above states the infimum over Pn is
equivalent to a min over S, thus showing existence of an optimal q.
Define S as follows:

S = {q ∈ Pn : E(q) ≤ ‖f‖∞}.

Notice that S is nonempty, as it contains the zero polynomial: E(0) = ‖f‖∞ ≤ ‖f‖∞.
⁵https://en.wikipedia.org/wiki/Extreme_value_theorem#Generalization_to_metric_and_topological_spaces
⁶That’s another Weierstrass theorem: Bolzano–Weierstrass, https://en.wikipedia.org/wiki/Bolzano%E2%80%93Weierstrass_theorem.
[Figure: error curve of an approximation, with data cursors marking the local extrema where the error alternates in sign.]
In inequality (7.6), we showed that, in absolute value, the first term is strictly
larger than the second term. Thus,
Sufficiency: We assume x0 < · · · < xn+1 are given and satisfy the
conditions of the theorem; we aim to show r is minimax. Apply De la
Vallée Poussin’s theorem:
‖f − r‖∞ = min_{i=0,...,n+1} |f(xi) − r(xi)| ≤ min_{q∈Pn} ‖f − q‖∞ ≤ ‖f − r‖∞,

where the first inequality is De la Vallée Poussin’s bound.
Necessity: This is the technical part of the proof. We omit it—see the
reference book for a full account.9
1. Let E = ‖f − pn‖∞ = ‖f − qn‖∞. Then,

E ≤ ‖f − (pn + qn)/2‖∞ = (1/2) ‖(f − pn) + (f − qn)‖∞
  ≤ (1/2) (‖f − pn‖∞ + ‖f − qn‖∞) = E.

2. Since (pn + qn)/2 ∈ Pn is minimax, the oscillation theorem gives x0 < ⋯ <
xn+1 in [a, b] such that (first property)

|f(xi) − (pn(xi) + qn(xi))/2| = E for all i.
This implies f (xi ) − pn (xi ) = f (xi ) − qn (xi ) for each i, in turn showing
pn (xi ) = qn (xi ) for each i. Thus, the polynomial pn − qn ∈ Pn has n + 2
distinct roots: this can only happen if pn = qn .
‖πn+1‖∞ = ‖x^(n+1) − qn‖∞.
2. ‖Tn‖∞ = 1,
so that an denotes the coefficient of the leading order term in Tn (and an−1
denotes the coefficient of the leading order term in Tn−1, etc.) Then, define

qn(x) = x^(n+1) − (1/an+1) Tn+1(x).
Crucially, this qn indeed has degree n, even though it was built from two
polynomials of degree n + 1. We argue qn is minimax for xn+1 , as desired.
Indeed, consider the error function:
x^(n+1) − qn(x) = (1/an+1) Tn+1(x).
¹⁰To work in a different interval [a, b], simply execute the affine change of variable
t ↦ (a + b)/2 + ((b − a)/2) t, so that −1 is mapped to a and +1 is mapped to b.
By definition of Tn+1, the error alternates n + 2 times between ±1/an+1, which
coincides with ±‖x^(n+1) − qn‖∞. By the oscillation theorem (the part with the
easy proof), this guarantees we found a minimax. In conclusion, with

(x − x0) ⋯ (x − xn) = πn+1(x) = x^(n+1) − qn(x) = (1/an+1) Tn+1(x), (7.7)

we find that picking the n + 1 interpolation points as the roots of Tn+1 yields
a bound on the approximation error as

‖f − pn‖∞ ≤ Mn+1 / ((n + 1)! |an+1|),
T0 (x) = 1,
T1 (x) = x,
Tn+1 (x) = 2xTn (x) − Tn−1 (x) for n = 1, 2, . . .
(These are also called the Chebyshev polynomials of the first kind.)
We need to
1. Check that these polynomials indeed fulfill our set task; and
a0 = 1,
a1 = 1,
an+1 = 2an for n = 1, 2, . . . .
¹¹See https://en.wikipedia.org/wiki/Chebyshev_polynomials for the tip of the iceberg.
Explicitly,

a0 = 1,
an+1 = 2ⁿ for n = 0, 1, . . . (7.8)
This notably confirms each Tn is of degree exactly n. Yet, the other statements
regarding ‖Tn‖∞, alternation between ±1 and roots are not straightforward
from the recurrence. For this, we need a surprising lemma.
Lemma 7.11. For all n ≥ 0, for all x ∈ [−1, 1],

Tn(x) = cos(n arccos(x)).
Proof. The proof is by induction. Start by verifying that the statement holds
for T0 and T1. Then, as induction hypothesis, assume

Tk(x) = cos(k arccos(x))

for k = 0, . . . , n, for all x ∈ [−1, 1]. We aim to show the same holds for Tn+1.
To this end, recall the trigonometric identity:

cos(u + v) = 2 cos(u) cos(v) − cos(u − v).

Set u = nθ and v = θ, with θ = arccos(x) so that cos(θ) = x:

cos((n + 1)θ) = 2 cos(θ) cos(nθ) − cos((n − 1)θ) = 2x Tn(x) − Tn−1(x) = Tn+1(x),

as desired.
This lemma makes it straightforward to continue our work. In particular,

1. It is clear that ‖Tn‖∞ = max_{x∈[−1,1]} |cos(n arccos(x))| = 1; and
2. Tn(x) = 0 exactly when n arccos(x) is an odd multiple of π/2, that is, at
the n points xk = cos((2k + 1)π/(2n)) for k = 0, . . . , n − 1.

The above roots are the so-called Chebyshev nodes (of the first kind). Plot
them to confirm they are more concentrated near the edges of [−1, 1].
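Lemma 7.11 and the location of the nodes are easy to confirm numerically (Python/NumPy sketch; names illustrative):

```python
import numpy as np

def chebyshev_T(n, x):
    """Evaluate T_n(x) via the recurrence T_{n+1} = 2 x T_n - T_{n-1}."""
    t_prev, t = np.ones_like(x), np.asarray(x, dtype=float)
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

n = 7
x = np.linspace(-1, 1, 101)
# Lemma 7.11: T_n(x) = cos(n arccos(x)) on [-1, 1].
assert np.allclose(chebyshev_T(n, x), np.cos(n * np.arccos(x)))
# The Chebyshev nodes cos((2k+1) pi / (2n)), k = 0..n-1, are the roots of T_n.
nodes = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))
assert np.allclose(chebyshev_T(n, nodes), 0, atol=1e-12)
```
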
We close this investigation here (for now at least.) To state the obvious:
there is a lot more to Chebyshev nodes and Chebyshev polynomials than
meets the eye thus far. Having a look at the Wikipedia page linked above
can give you some hints of the richness of this topic. We will encounter some
of it in the next chapter, as an instance of orthogonal polynomials.
7.3 Codes and figures

The code below relies on Chebfun both for minimax approximation (which is the
main reason for using it here) and for Lagrange interpolation at Chebyshev
nodes (which is used here too, but for which we could have just as easily used
our own codes based on the previous chapter.)
%%
clear;
close all;
clc;
set(0, 'DefaultLineLineWidth', 2);
set(0, 'DefaultLineMarkerSize', 30);
f = @(t) 1./(1+25*t.^2);
% f = @(t) sin(t).*(t.^2-2);
% f = @(t) sqrt(cos((t+1)/2));
% f = @(t) log(cos(t)+2);
% f = @(t) t.^(n+1);
% f = @abs;
% f = @sign; % Not continuous! Our theorems break down.
a = -1;
b = 1;
subplot(1, 2, 1);
plot(fc, 'k-');
subplot(1, 2, 2);
plot([a, b], [0, 0], 'k-');
% Chebyshev interpolation
% Get a polynomial of degree n to approximate f on [a, b]
% through interpolation at the Chebyshev points
grid on;
title(sprintf('Approximation errors for degree %d', n));
pn = minimax(fc, n);
%% Legend
subplot(1, 2, 1);
legend('True f', 'Chebyshev interp', 'Minimax', ...
'Orientation', 'horizontal', 'Location', 'North');
legend('boxoff');
pbaspect([1.6, 1, 1]);
subplot(1, 2, 2);
legend('0', 'Chebyshev interp', 'Minimax', ...
'Orientation', 'horizontal', 'Location', 'North');
legend('boxoff');
pbaspect([1.6, 1, 1]);
for x ∈ [−1, 1], computed with Chebfun (Remez algorithm.) The Chebyshev
approximation is pretty good compared to the optimal minimax, yet is much
simpler to compute. If you only had the Chebyshev polynomial, what could
you deduce about the minimax from De la Vallée Poussin’s theorem?
Figure 7.3: Same as Figure 7.2 but with n = 16. Since f is even, the minimax
polynomial has 19 alternation points rather than only 18 (think about it). If
you only had the Chebyshev polynomial, what could you deduce about the
minimax from De la Vallée Poussin’s theorem?
7.3. CODES AND FIGURES 141
Figure 7.4: The first few Chebyshev polynomials. Picture credit: https://en.wikipedia.org/wiki/Chebyshev_polynomials#/media/File:Chebyshev_Polynomials_of_the_First_Kind.svg.
Figure 7.5: Chebyshev nodes are more concentrated toward the ends of the interval, though they come from evenly spaced points on the upper half of a circle. Picture credit: https://en.wikipedia.org/wiki/Chebyshev_nodes#/media/File:Chebyshev-nodes-by-projection.svg.
Chapter 8

Approximation in the 2-norm
min_{p∈Pn} ‖f − p‖₂,
where ‖·‖₂ is a 2-norm defined below. One key difference with the minimax
approximation problem from the previous chapter is that the 2-norm is
induced by an inner product (this is not so for the ∞-norm): this will make
our life a lot easier.
To begin this chapter, we:
For the first three elements at least, the math looks very similar to things you
have already learned (including in this course) about least squares problems.
The crucial point is: we are now working in an infinite dimensional space.
It is important to go through the steps carefully from first principles, and to
keep our intuitions in check.
Remark 8.1. Take a moment to reflect on this: now and for the last couple
of chapters, functions are often considered as vectors. As strange as this may
4. ⟨f, f⟩ > 0 for all f ∈ V, f ≠ 0.

⟨x, y⟩ = xᵀW y
Example 8.5. For our purpose, the more interesting case is V = C[a, b],
the space of continuous functions on [a, b]. On that space, given a positive,
From these expressions for the inner product and the norm, it becomes clear
that we do not actually need to restrict ourselves to continuous functions f ,
as was the case for the ∞-norm. We only need to make sure all integrals
that occur are well defined. This is the case over the following linear space:
L²w(a, b) = {f : [a, b] → R : w(x)(f(x))² is integrable over (a, b)}. (8.3)
Question 8.7. Verify L2w (a, b) is indeed a linear space, and show that it
contains C[a, b] strictly (that is, it contains strictly more than C[a, b].)
We are finally in a good position to frame the central problem of this
chapter.
Problem 8.8. For a given weight function w on (a, b) and a given function
f ∈ L²w(a, b), find pn ∈ Pn such that ‖f − pn‖₂ ≤ ‖f − q‖₂ for all q ∈ Pn.
We show momentarily that the solution to this problem exists and is
unique. It is called the polynomial of best approximation of degree n to f in
the 2-norm on (a, b). Since polynomials are dense in C[a, b] for the 2-norm as
we discussed in the previous chapter, it is at least the case that for continuous
f , as n → ∞, the approximation becomes arbitrarily good. We won’t discuss
rates of approximation here (that is, how fast the approximation error goes
to zero as n increases.)
Equipped with this basis, the unknowns are simply the coefficients c0 , . . . , cn
forming the vector c ∈ Rn+1 such that
pn = c0 q0 + ⋯ + cn qn = ∑_{k=0}^{n} ck qk. (8.4)
h(c) = ‖f − pn‖₂² = ⟨f − pn, f − pn⟩
= ⟨f, f⟩ + ⟨pn, pn⟩ − 2⟨f, pn⟩
= ‖f‖₂² + ⟨∑_k ck qk, ∑_ℓ cℓ qℓ⟩ − 2⟨f, ∑_k ck qk⟩
= ‖f‖₂² + ∑_k ∑_ℓ ck cℓ ⟨qk, qℓ⟩ − 2 ∑_k ck ⟨f, qk⟩.

Introduce the matrix M ∈ R^((n+1)×(n+1)) and the vector b ∈ R^(n+1) defined by:

Mkℓ = ⟨qk, qℓ⟩ and bk = ⟨f, qk⟩.

Then,

h(c) = ‖f − pn‖₂² = ‖f‖₂² + ∑_k ∑_ℓ ck cℓ Mkℓ − 2 ∑_k ck bk
= ‖f‖₂² + cᵀMc − 2bᵀc.
h(x) = ax² + bx + c.

The recipe is well known: if h″(x) = 2a > 0 (the function is convex rather
than concave, to make sure there indeed is a minimizer), then the unique
minimizer of h is such that h′(x) = 2ax + b = 0, that is, x = −b/(2a). This
recipe generalizes.
We need to get the gradient and Hessian. Formally, using the simple rule

∂ci/∂cj = δij = 1 if i = j, and 0 otherwise,

we get:

(∇h(c))j = (∂h/∂cj)(c) = ∑_k ∑_ℓ Mkℓ (∂/∂cj)(ck cℓ) − 2 ∑_k bk (∂ck/∂cj) + 0
= ∑_k ∑_ℓ Mkℓ (δkj cℓ + ck δℓj) − 2 ∑_k bk δkj
= ∑_ℓ Mjℓ cℓ + ∑_k Mkj ck − 2bj
= (Mc)j + (Mᵀc)j − 2bj
= (2Mc − 2b)j.

In the last equality, we used that M = Mᵀ since Mkℓ = ⟨qk, qℓ⟩ and inner
products are symmetric. Compactly:

∇h(c) = 2(Mc − b).

The Hessian is similarly straightforward to obtain:

(∇²h(c))ij = (∂/∂ci)(∂h/∂cj)(c)
= (∂/∂ci)(∑_ℓ Mjℓ cℓ + ∑_k Mkj ck − 2bj)
= Mji + Mij.

Hence,

∇²h(c) = 2M.

Lemma 8.9 indeed applies in our case since M ≻ 0.
M c = b. (8.5)
1. Pick a basis q0 , . . . , qn of Pn ;
Just as in the finite dimensional case, the solution here also can be interpreted
as the orthogonal projection of f to Pn , where orthogonal is defined with
Reorganizing yields

⟨f, g⟩² ≤ ‖f‖₂² ‖g‖₂².

Take the square root to conclude.
%% Hilbert matrix
n = 11;
M = zeros(n+1, n+1);
for k = 0 : n
    for l = 0 : n
        M(k+1, l+1) = 1/(k+l+1);
    end
end
% Equivalently, M = hilb(n+1);
cond(M)
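For concreteness, here is a Python analogue of the procedure with the monomial basis on [0, 1] and weight w ≡ 1, where M is exactly the Hilbert matrix; the right-hand side for the illustrative choice f = exp is computed by integration by parts (I_k = e − k I_{k−1}, I_0 = e − 1):

```python
import numpy as np

# Best 2-norm approximation of f on [0, 1] (weight w = 1) in the monomial
# basis q_k(x) = x^k: M_kl = <q_k, q_l> = 1/(k + l + 1) is the Hilbert matrix.
n = 5
M = np.array([[1.0 / (k + l + 1) for l in range(n + 1)] for k in range(n + 1)])

# b_k = <exp, q_k> = int_0^1 exp(x) x^k dx, via I_k = e - k * I_{k-1}.
b = np.zeros(n + 1)
I = np.e - 1.0
b[0] = I
for k in range(1, n + 1):
    I = np.e - k * I
    b[k] = I
c = np.linalg.solve(M, b)

# The resulting degree-5 polynomial approximates exp well on [0, 1] ...
x = np.linspace(0, 1, 201)
p = sum(ck * x**k for k, ck in enumerate(c))
assert np.max(np.abs(p - np.exp(x))) < 1e-4
# ... but the Hilbert matrix is already badly conditioned at this small size.
assert np.linalg.cond(M) > 1e6
```
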
2. ⟨φk, φℓ⟩ = 0 if k ≠ ℓ, and
3. ⟨φk, φk⟩ ≠ 0.

ck = bk/Mkk = ⟨f, φk⟩/⟨φk, φk⟩.
8.5.1 Gram–Schmidt
Given a weight function w and a basis of polynomials q0 , q1 , q2 , . . . such that
qk has degree exactly k, we can apply the Gram–Schmidt procedure to figure
out a system of orthogonal polynomials.2 Concretely,
1. Let φ0 = q0 ; then
φn+1 = qn+1 − an φn − ⋯ − a0 φ0,

Hence,

ak = ⟨qn+1, φk⟩ / ⟨φk, φk⟩.
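A sketch of this Gram–Schmidt procedure (Python/NumPy, with the illustrative choice w ≡ 1 on [−1, 1], where the inner products of polynomials can be computed exactly by antidifferentiation); it recovers the monic Legendre polynomials:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def inner(f, g):
    """<f, g> = int_{-1}^{1} f(x) g(x) dx (weight w = 1), exact for polynomials."""
    F = (f * g).integ()
    return F(1.0) - F(-1.0)

def gram_schmidt_polys(n):
    """Orthogonalize the monomials 1, x, ..., x^n as in the text."""
    phis = []
    for k in range(n + 1):
        q = P([0.0] * k + [1.0])                  # q_k(x) = x^k
        phi = q - sum((inner(q, p) / inner(p, p)) * p for p in phis)
        phis.append(phi)
    return phis

phis = gram_schmidt_polys(4)
# Pairwise orthogonality, and e.g. phi_2(x) = x^2 - 1/3 (monic Legendre).
for i in range(5):
    for j in range(i):
        assert abs(inner(phis[i], phis[j])) < 1e-12
assert np.allclose(phis[2].coef, [-1/3, 0, 1])
```
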
with respect to the weight w(x) = 1/√(1 − x²). This weight function puts more
Hence,

pn = c0 T0 + ⋯ + cn Tn

is the best 2-norm approximation to f over [−1, 1] with respect to the weight
w(x) = 1/√(1 − x²) if and only if

c0 = b0/M00 = (1/π) ∫₀^π f(cos θ) dθ,
ck = bk/Mkk = (2/π) ∫₀^π f(cos θ) cos(kθ) dθ, for k = 1, 2, . . .

(This is closely related to taking a Fourier transform of f ∘ cos.)
³It is now clear why we needed to allow w to take on infinite values at the extreme
points of the interval.
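These coefficient formulas are straightforward to evaluate numerically. The sketch below (Python/NumPy) approximates the integrals with a midpoint rule and checks them on f = T3, for which orthogonality forces c3 = 1 and every other coefficient 0:

```python
import numpy as np

def chebyshev_coeffs(f, n, m=20000):
    """Coefficients c_k of the best weighted-2-norm approximation in T_0..T_n,
    via c_0 = (1/pi) int_0^pi f(cos t) dt and
    c_k = (2/pi) int_0^pi f(cos t) cos(k t) dt (midpoint rule on [0, pi])."""
    t = (np.arange(m) + 0.5) * np.pi / m          # midpoints
    ft = f(np.cos(t))
    h = np.pi / m
    c = np.empty(n + 1)
    c[0] = np.sum(ft) * h / np.pi
    for k in range(1, n + 1):
        c[k] = 2.0 * np.sum(ft * np.cos(k * t)) * h / np.pi
    return c

# Sanity check with f = T_3, i.e. f(cos t) = cos(3t).
f = lambda x: 4 * x**3 - 3 * x
c = chebyshev_coeffs(f, 5)
assert np.allclose(c, [0, 0, 0, 1, 0, 0], atol=1e-8)
```
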
φ0(x) = 1,
φ1(x) = x − α0,
φk+1(x) = (x − αk) φk(x) − βk² φk−1(x), for k = 1, 2, . . . (8.6)

and

αk = ⟨xφk, φk⟩ / ‖φk‖₂², for k = 0, 1, 2, . . .
βk² = ‖φk‖₂² / ‖φk−1‖₂², for k = 1, 2, . . .
Proof. φ0(x) = 1 is the only monic polynomial of degree 0, hence this is the
only possibility. The polynomial φ1(x) must be of the form x − α0 for some
α0: imposing 0 = ⟨φ1, φ0⟩ = ⟨x − α0, 1⟩ gives α0 = ⟨x, 1⟩/⟨1, 1⟩ as prescribed.
Our primary goal is to show that only ck and ck−1 are nonzero. To this end,
first take inner products of (8.7) with φℓ for some ℓ:

⟨xφk, φℓ⟩ − ⟨φk+1, φℓ⟩ = ∑_{j=0}^{k} cj ⟨φj, φℓ⟩.

For ℓ = k, we find

ck = ⟨xφk, φk⟩ / ‖φk‖₂² = αk. (8.9)

Similarly, for ℓ = k − 1, we find

ck−1 = ⟨xφk, φk−1⟩ / ‖φk−1‖₂².
Think about this last equation: it’s rather special. We exploit it as follows:
⟨xφk, φk−1⟩ = ⟨φk, xφk−1⟩, and xφk−1 is a monic polynomial of degree k, so that
xφk−1 − φk has degree at most k − 1 and expands as

q = a0 φ0 + ⋯ + ak−1 φk−1.

Since φk is orthogonal to each of φ0, . . . , φk−1, we get ⟨φk, xφk−1⟩ = ⟨φk, φk + q⟩ = ‖φk‖₂².
Thus,

ck−1 = ‖φk‖₂² / ‖φk−1‖₂² = βk². (8.10)
Finally, consider (8.8) for ℓ ≤ k − 2. Then,

c0 = ⋯ = ck−2 = 0. (8.11)
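The three-term recurrence (8.6) is cheap to run; the sketch below (Python/NumPy, with the illustrative choice w ≡ 1 on [−1, 1]) generates the monic Legendre polynomials and verifies their orthogonality:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def inner(f, g):
    """<f, g> = int_{-1}^{1} f(x) g(x) dx (weight w = 1)."""
    F = (f * g).integ()
    return F(1.0) - F(-1.0)

def three_term(n):
    """Monic orthogonal polynomials via (8.6):
    phi_{k+1} = (x - alpha_k) phi_k - beta_k^2 phi_{k-1}."""
    x = P([0.0, 1.0])
    phis = [P([1.0])]
    alpha = inner(x * phis[0], phis[0]) / inner(phis[0], phis[0])
    phis.append(x - alpha)
    for k in range(1, n):
        alpha = inner(x * phis[k], phis[k]) / inner(phis[k], phis[k])
        beta2 = inner(phis[k], phis[k]) / inner(phis[k-1], phis[k-1])
        phis.append((x - alpha) * phis[k] - beta2 * phis[k-1])
    return phis

phis = three_term(4)
for i in range(5):
    for j in range(i):
        assert abs(inner(phis[i], phis[j])) < 1e-12
# These are the monic Legendre polynomials, e.g. phi_4 = x^4 - (6/7)x^2 + 3/35.
assert np.allclose(phis[4].coef, [3/35, 0, -6/7, 0, 1])
```
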
Question 8.17. Using the expression Tn(x) = cos(n arccos(x)) for Chebyshev
polynomials and the fact they are orthogonal with respect to w(x) = 1/√(1 − x²),
use the theorem above to recover the recurrence relation for the Tn’s.
(Be mindful of normalization: Tn is not monic.)
which implies φj changes sign at least once on (a, b). (Here, we used that φj
is not identically zero, that w is positive, and that both are continuous on
(a, b).) Define
πk (x) = (x − ξ1 ) · · · (x − ξk ).
Then, the product φj(x)πk(x) no longer changes sign in (a, b). This implies

0 ≠ ∫ₐᵇ φj(x) πk(x) w(x) dx = ⟨φj, πk⟩.
In the chapters about integration, we will see that the roots of orthogo-
nal polynomials are particularly appropriate to design numerical integration
schemes (Gauss quadrature rules.) How can we compute these roots? As
it turns out, they are the eigenvalues of a tridiagonal matrix.6 This means
any of our fast and reliable algorithms to compute eigenvalues of tridiagonal
matrices can be used here. But more on this in the next chapter, about
integration.
¹⁰https://en.wikipedia.org/wiki/Sturm%E2%80%93Liouville_theory
Chapter 9
Integration
f(x) ≈ pn(x) = ∑_{k=0}^{n} f(xk) Lk(x),    Lk(x) = ∏_{i≠k} (x − xi)/(xk − xi).
Then, informally,

∫ₐᵇ f(x) dx ≈ ∫ₐᵇ pn(x) dx = ∑_{k=0}^{n} f(xk) ∫ₐᵇ Lk(x) dx = ∑_{k=0}^{n} wk f(xk),

where wk = ∫ₐᵇ Lk(x) dx.
Some work goes into computing the quadrature weights w0, . . . , wn, but notice
that this is independent of f: it needs only be done once, and the
weights can be stored to disk. Then, the quadrature rule ∑_{k=0}^{n} wk f(xk) is
easily applied. The points x0, . . . , xn where f needs to be evaluated are called
quadrature points.
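Computing wk = ∫ₐᵇ Lk(x) dx is mechanical once the nodes are fixed; here is a Python sketch (illustrative names) that recovers Simpson's rule from three equispaced nodes on [0, 1]:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def quadrature_weights(xs, a, b):
    """w_k = int_a^b L_k(x) dx, with L_k the Lagrange basis for the nodes xs."""
    ws = []
    for k, xk in enumerate(xs):
        Lk = P([1.0])
        for i, xi in enumerate(xs):
            if i != k:
                Lk *= P([-xi, 1.0]) / (xk - xi)
        F = Lk.integ()
        ws.append(F(b) - F(a))
    return np.array(ws)

# Three equispaced nodes on [0, 1] recover Simpson's rule: weights 1/6, 4/6, 1/6.
xs = np.array([0.0, 0.5, 1.0])
w = quadrature_weights(xs, 0.0, 1.0)
assert np.allclose(w, [1/6, 4/6, 1/6])
# Simpson's rule integrates any cubic exactly: int_0^1 (2x^3 - x + 1) dx = 1.
f = lambda x: 2 * x**3 - x + 1
assert abs(w @ f(xs) - 1.0) < 1e-12
```
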
Similar computations yield w1 and w2. But we can save some work here. For
example, by a symmetry argument similar to the one above, we expect w2 = w0.
You can verify it. What about w1? Let’s think about the (very) special case
f (x) = 1. Since f is a polynomial of degree 0, its interpolation polynomial
in P2 is exact: f = p2 . Thus,
b − a = ∫ₐᵇ f(x) dx = ∫ₐᵇ p2(x) dx = w0 f(x0) + w1 f(x1) + w2 f(x2) = w0 + w1 + w2.

That is: the weights sum to b − a (the length of the interval.) This already
allows us to conclude:
If f ∈ Pn, then f = pn.
Hence, if f ∈ Pn, then ∫ₐᵇ f(x) dx = ∫ₐᵇ pn(x) dx = ∑_{k=0}^{n} wk f(xk).
|f(x) − pn(x)| ≤ (‖f^(n+1)‖∞ / (n + 1)!) |πn+1(x)|,
¹In previous chapters, to address ill-conditioning of the Vandermonde matrix, we
changed bases: instead of using monomials 1, x, x², . . . we used Lagrange polynomials
L0, L1, L2, . . . which turned the Vandermonde matrix into an identity matrix. Notice that
doing this here, that is, imposing that L0, . . . , Ln are integrated exactly, only shifts the
burden to that of computing the right hand side, which contains ∫ₐᵇ Lk(x) dx: this is our
original task. At any rate, later in this chapter we will see that there are better ways to
build quadrature rules, namely, composite rules and Gaussian rules.
whose infinity norm is 1/2ⁿ. What about a general interval [a, b]? Consider
the linear change of variable

t = t(x) = (a + b)/2 + ((b − a)/2) x,
constructed to satisfy t(−1) = a and t(1) = b: it maps [−1, 1] to [a, b]. The
Chebyshev nodes on [a, b] are defined as
tk = t(xk ), for k = 0, . . . , n.
π̃n+1(t) := (t − t0) ⋯ (t − tn)
= (t − (a + b)/2 − ((b − a)/2) x0) ⋯ (t − (a + b)/2 − ((b − a)/2) xn).
Plugging this into the expression for π̃n+1(t) and factoring out (b − a)/2 from each
of the n + 1 terms gives:

π̃n+1(t) = ((b − a)/2)^(n+1) (x − x0) ⋯ (x − xn) = ((b − a)/2)^(n+1) πn+1(x).

As a result,

‖π̃n+1‖∞ = ((b − a)/2)^(n+1) · (1/2ⁿ).
En(f) ≤ ((b − a)^(n+2) / ((n + 1)! 2^(2n+1))) ‖f^(n+1)‖∞. (Chebyshev nodes) (9.4)
In particular,

E1(f) ≤ ((b − a)³ / 16) ‖f″‖∞,
E2(f) ≤ ((b − a)⁴ / 192) ‖f‴‖∞.
q(x) = (x − x0)² ⋯ (x − xn)².

This is a polynomial of degree 2n + 2. Surely, ∫ₐᵇ q(x) dx is positive (in
particular, nonzero). Yet, the quadrature rule yields ∑_{k=0}^{n} wk q(xk) = 0, since
all the quadrature points are roots of q. We conclude that no quadrature rule
based on n + 1 points can integrate q correctly; in other words: no quadrature
φ0, φ1, φ2, . . . ,

as defined in Definition 8.12. Observe that for any polynomial p2n+1 ∈ P2n+1,
there exist two polynomials q and r in Pn such that

p2n+1 = q φn+1 + r.

(That is, q is the quotient and r is the remainder after division of p2n+1
by φn+1.) Furthermore, let x0 < ⋯ < xn in [a, b] and w0, . . . , wn form a
quadrature rule of degree of precision at least n (to be determined.) Then,

∫ₐᵇ p2n+1(x) w(x) dx = ∫ₐᵇ (q(x) φn+1(x) + r(x)) w(x) dx
= ∫ₐᵇ q(x) φn+1(x) w(x) dx + ∫ₐᵇ r(x) w(x) dx
(the first integral is ⟨q, φn+1⟩ = 0 since q ∈ Pn)
= ∑_{k=0}^{n} wk r(xk) (since the rule is exact for r ∈ Pn.)
Thus, using the quadrature rule of degree of precision at least n, one could
conceivably integrate all polynomials of degree up to 2n + 1, if only one could
evaluate r instead of p2n+1 at the quadrature nodes. In general, this is not
an easy task. Here comes the key part:
If we pick the quadrature nodes x0 < . . . < xn to be the n + 1
roots of φn+1 (known to be real, distinct and in [a, b]), then

p2n+1(xk) = q(xk) φn+1(xk) + r(xk) = r(xk) for k = 0, . . . , n.

Consequently,

∫ₐᵇ p2n+1(x) w(x) dx = ∑_{k=0}^{n} wk p2n+1(xk).
The matrix on the left, J, is the Jacobi matrix. The crucial observation
follows:
²The recurrence is set up assuming the polynomials are monic, which doesn’t affect
their roots, so this is inconsequential to our endeavor.
If x is a root of φ5 , then
Jv(x) = xv(x). (9.5)
That is: if x is a root of φ5 , then x is an eigenvalue of the Jacobi
matrix, with eigenvector v(x).3
Let's say this again: all five distinct roots of \phi_5 are eigenvalues of J. Since J is a 5 × 5 matrix, it has five eigenvalues, so the roots of \phi_5 are exactly the eigenvalues of J. This generalizes to all n, of course.
Theorem 9.5. The roots of \phi_{n+1} are the eigenvalues of the Jacobi matrix
J_{n+1} = \begin{pmatrix} \alpha_0 & 1 & & & \\ \beta_1^2 & \alpha_1 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{n-1}^2 & \alpha_{n-1} & 1 \\ & & & \beta_n^2 & \alpha_n \end{pmatrix}.
This matrix is tridiagonal, but it is not symmetric. The eigenvalue com-
putation algorithms we have discussed require a symmetric matrix. Let’s
try to make J symmetric without changing its eigenvalues. Using a diagonal
similarity transformation (coefficients sk 6= 0 to be determined):
S^{-1} J S = \begin{pmatrix} s_0^{-1} & & & & \\ & s_1^{-1} & & & \\ & & s_2^{-1} & & \\ & & & s_3^{-1} & \\ & & & & s_4^{-1} \end{pmatrix} \begin{pmatrix} \alpha_0 & 1 & & & \\ \beta_1^2 & \alpha_1 & 1 & & \\ & \beta_2^2 & \alpha_2 & 1 & \\ & & \beta_3^2 & \alpha_3 & 1 \\ & & & \beta_4^2 & \alpha_4 \end{pmatrix} \begin{pmatrix} s_0 & & & & \\ & s_1 & & & \\ & & s_2 & & \\ & & & s_3 & \\ & & & & s_4 \end{pmatrix}
= \underbrace{\begin{pmatrix} \alpha_0 & \frac{s_1}{s_0} & & & \\ \frac{s_0}{s_1}\beta_1^2 & \alpha_1 & \frac{s_2}{s_1} & & \\ & \frac{s_1}{s_2}\beta_2^2 & \alpha_2 & \frac{s_3}{s_2} & \\ & & \frac{s_2}{s_3}\beta_3^2 & \alpha_3 & \frac{s_4}{s_3} \\ & & & \frac{s_3}{s_4}\beta_4^2 & \alpha_4 \end{pmatrix}}_{\bar J}.
Verify that the matrix \bar J on the right-hand side has the same eigenvalues as J. We wish to choose the s_k's in such a way that \bar J is symmetric. This is the case if, for each k,
\frac{s_{k-1}}{s_k}\,\beta_k^2 = \frac{s_k}{s_{k-1}}.
³Note that v(x) \neq 0 since its first entry is 1.
Since \beta_k^2 = \frac{\|\phi_k\|_2^2}{\|\phi_{k-1}\|_2^2}, a valid choice is:
s_k = \|\phi_k\|_2.
With this choice, both off-diagonal entries in row k of S^{-1}JS become \beta_k, since \frac{s_{k-1}}{s_k}\beta_k^2 = \frac{s_k}{s_{k-1}} = \beta_k. This is indeed tridiagonal and symmetric, and has the same eigenvalues as J, so that its eigenvalues are the roots of \phi_5. We can generalize.
Theorem 9.6. The roots of \phi_{n+1} are the eigenvalues of the symmetric, tridiagonal matrix
J_{n+1} = \begin{pmatrix} \alpha_0 & \beta_1 & & & \\ \beta_1 & \alpha_1 & \beta_2 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{n-1} & \alpha_{n-1} & \beta_n \\ & & & \beta_n & \alpha_n \end{pmatrix}. \tag{9.6}
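The claim that the similarity transformation preserves the eigenvalues can be verified on a small case. The sketch below (plain Python; the Legendre coefficients α_k = 0 and β_k² = k²/(4k²−1) for monic polynomials are assumed, as they are not derived in this excerpt) checks that the nonsymmetric Jacobi matrix of Theorem 9.5 and its symmetrized counterpart of Theorem 9.6 share the same characteristic polynomial, hence the same eigenvalues, for a 3 × 3 example.

```python
import math

# Assumed Legendre (monic) recurrence coefficients for n = 2:
# alpha_k = 0, beta_k^2 = k^2 / (4k^2 - 1).
b1sq = 1.0 / 3.0   # beta_1^2
b2sq = 4.0 / 15.0  # beta_2^2

# Nonsymmetric Jacobi matrix (Theorem 9.5).
J = [[0.0, 1.0, 0.0],
     [b1sq, 0.0, 1.0],
     [0.0, b2sq, 0.0]]

# Symmetrized version (Theorem 9.6).
b1, b2 = math.sqrt(b1sq), math.sqrt(b2sq)
Jsym = [[0.0, b1, 0.0],
        [b1, 0.0, b2],
        [0.0, b2, 0.0]]

def charpoly_coeffs(A):
    """Trace, sum of 2x2 principal minors, and determinant of a 3x3
    matrix: together these determine its characteristic polynomial."""
    tr = A[0][0] + A[1][1] + A[2][2]
    minors = (A[0][0] * A[1][1] - A[0][1] * A[1][0]
              + A[0][0] * A[2][2] - A[0][2] * A[2][0]
              + A[1][1] * A[2][2] - A[1][2] * A[2][1])
    det = (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
           - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
           + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))
    return tr, minors, det

print(charpoly_coeffs(J))  # matches charpoly_coeffs(Jsym) up to round-off
```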
The eigenvalues give us the roots. To get the weights, for now, we only
know one way: solve the ill-conditioned Vandermonde system (9.1). Fortu-
nately, there is a better way. It involves the eigenvectors of Jn+1 .
The resulting rule takes the form \int_a^b f(x)w(x)\,dx \approx \sum_{k=0}^{n} w_k f(x_k), where x_0 < \cdots < x_n are the eigenvalues of J_{n+1}, and the weights can be chosen to ensure exact integration if f \in P_{2n+1}.
\delta_{k\ell} = \langle \varphi_k, \varphi_\ell \rangle = \int_a^b \varphi_k(x)\varphi_\ell(x) w(x)\,dx = \sum_{j=0}^{n} w_j\, \varphi_k(x_j)\varphi_\ell(x_j).
This last equality5 follows from the fact that ϕk (x)ϕ` (x) is a polynomial of
degree at most 2n; thus, it is integrated exactly by the Gauss quadrature
rule. Verify that these equations can be written in matrix form:
I = P W P^T,
where P_{kj} = \varphi_k(x_j) and W = \mathrm{diag}(w_0, \ldots, w_n). Hence,
W = P^{-1}(P^T)^{-1} = (P^T P)^{-1}.
Alternatively,
W^{-1} = P^T P.
In other words:
\frac{1}{w_j} = (P^T P)_{jj} = \sum_{k=0}^{n} P_{kj}^2 = \sum_{k=0}^{n} (\varphi_k(x_j))^2. \tag{9.7}
⁴Unique if we further require the leading coefficient to be positive, as is the case here.
⁵This equality also shows that \varphi_0, \varphi_1, \ldots are not only orthonormal with respect to a continuous inner product, but also with respect to a discrete inner product (see also Remark 9.10). But this is a story for another time.
How do we compute the right-hand side? This is where the eigenvectors come in. Recall Jv(x) = xv(x) from eq. (9.5). Apply S^{-1} on the left and insert SS^{-1} to find:
\underbrace{S^{-1} J S}_{\text{symmetric}}\; \underbrace{S^{-1} v(x)}_{u(x)} = x\, \underbrace{S^{-1} v(x)}_{u(x)}.
Corollary 9.8. The roots of \varphi_{n+1} (equivalently, the roots of \phi_{n+1}) are the eigenvalues of the symmetric, tridiagonal matrix J_{n+1}. Let x_0 < \cdots < x_n in (a, b) denote these roots. Any eigenvector u^{(j)} associated to the eigenvalue x_j is of the form
u^{(j)} = c_j \begin{pmatrix} \varphi_0(x_j) \\ \varphi_1(x_j) \\ \vdots \\ \varphi_n(x_j) \end{pmatrix} \tag{9.8}
for some c_j \neq 0. In particular, if u^{(j)} is normalized so that \|u^{(j)}\|_2 = 1, then (9.7) gives
w_j = c_j^2.
To extract this weight from a computed unit-norm eigenvector, recall that \varphi_0 is a constant polynomial:
\varphi_0(x) = \mu, \quad \forall x.
Then,
u_0^{(j)} = c_j \varphi_0(x_j) = c_j \mu.
By orthonormality, 1 = \langle \varphi_0, \varphi_0 \rangle = \mu^2 \int_a^b w(x)\,dx, so that
\mu^2 = \frac{1}{\int_a^b w(x)\,dx}.
Thus,
w_j = c_j^2 = \frac{(u_0^{(j)})^2}{\mu^2} = (u_0^{(j)})^2 \underbrace{\int_a^b w(x)\,dx}_{\text{compute once}}. \tag{9.9}
After these derivations, the practical implications are clear: this last equation
is all you need to figure out how to implement the Golub–Welsch algorithm.
This algorithm computes the nodes and the weights of a Gaussian quadrature:
see Algorithm 9.1.
Question 9.9. Implement the procedure above, known as the Golub–Welsch
algorithm, to compute the nodes and weights of the Gauss–Legendre quadra-
ture rule (Gauss with Legendre polynomials, that is, w(x) = 1 over [−1, 1].)
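As a starting point for Question 9.9, here is one possible implementation sketched in plain Python, under two assumptions not spelled out in this excerpt: the monic Legendre recurrence coefficients are α_k = 0 and β_k² = k²/(4k²−1), and the symmetric eigenproblem is solved with cyclic Jacobi rotations (any symmetric eigensolver from Chapter 5 would serve equally well). The weights come from eq. (9.9) with ∫₋₁¹ w(x)dx = 2.

```python
import math

def symmetric_eig(A, sweeps=30):
    """Eigen-decomposition of a small symmetric matrix by cyclic Jacobi
    rotations. Returns (eigenvalues, V), eigenvectors as columns of V."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-15:
                    continue
                # Rotation angle that annihilates the (p, q) entry.
                th = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(th), math.sin(th)
                for M in (A, V):  # update columns p and q of A and V
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):  # update rows p and q of A
                    ap, aq = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * ap - s * aq, s * ap + c * aq
    return [A[i][i] for i in range(n)], V

def gauss_legendre(n):
    """Nodes and weights of the (n+1)-point Gauss-Legendre rule on [-1, 1]
    via Golub-Welsch. Assumed Legendre coefficients: alpha_k = 0 and
    beta_k^2 = k^2/(4k^2 - 1); the integral of w(x) = 1 over [-1, 1] is 2."""
    # Symmetric tridiagonal Jacobi matrix of Theorem 9.6.
    J = [[0.0] * (n + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        bk = k / math.sqrt(4 * k * k - 1)
        J[k - 1][k] = J[k][k - 1] = bk
    vals, V = symmetric_eig(J)
    # Nodes are the eigenvalues; by eq. (9.9), each weight equals
    # 2 * (first entry of the normalized eigenvector)^2.
    rule = sorted((vals[j], 2 * V[0][j] ** 2) for j in range(n + 1))
    return [x for x, _ in rule], [w for _, w in rule]

nodes, weights = gauss_legendre(2)
print(nodes)    # approx [-0.7745966, 0.0, 0.7745966], i.e. +-sqrt(3/5) and 0
print(weights)  # approx [0.5555556, 0.8888889, 0.5555556], i.e. 5/9, 8/9, 5/9
```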
Remark 9.10. The procedure above also proves the weights in a Gauss
quadrature rule are positive. This is great numerically. Indeed, assume you
are integrating f (x) which is nonnegative over [a, b]. Then, the quadrature
rule is a sum of nonnegative numbers: there is no risk of catastrophic can-
cellation due to round-off errors in computing the sum. This is not so for
Newton–Cotes rules, which have negative weights already for n ≥ 8: these
may lead to intermediate computations of differences of large numbers.
9.4.3 Examples
Figures 9.1–9.4 compare different quadrature rules, obtained in different
ways, on four different integrals.
Newton–Cotes (Vandermonde) uses n + 1 equispaced points on the
interval, and obtains the weights by solving the Vandermonde system
(ill conditioned).
Chebyshev nodes (Vandermonde) does the same but with n + 1 Cheby-
shev nodes in the interval.
Legendre nodes (Vandermonde) does the same but using the n + 1 roots of the (n + 1)st Legendre polynomial. Mathematically, the latter is the Gauss quadrature with n + 1 points for unit weight w(x) = 1, but because solving the Vandermonde system is unstable, it eventually fails.
Legendre nodes (Golub–Welsch) is the same rule as the previous one,
but it computes the weights with Golub–Welsch, which is far more
trustworthy numerically.
The composite trapezium rule uses n trapezia to approximate the in-
tegral.
For a given n, all rules use exactly n + 1 evaluations of the integrand. They require varying amounts of work to compute the nodes and weights, but notice that these can be precomputed once and for all.
f = p_{2n+1} + e_{2n+1},
so that
\sum_{k=0}^{n} w_k f(x_k) = \sum_{k=0}^{n} w_k p_{2n+1}(x_k) + \sum_{k=0}^{n} w_k e_{2n+1}(x_k) = \int_a^b p_{2n+1}(x)\,dx + \sum_{k=0}^{n} w_k e_{2n+1}(x_k),
where we used the fact that the quadrature rule is exact when applied to p_{2n+1}. On the other hand:
\int_a^b f(x)\,dx = \int_a^b p_{2n+1}(x)\,dx + \int_a^b e_{2n+1}(x)\,dx.
Here is the take-away: the integration error of the Gauss quadrature rule
is determined by the minimax error of approximating f with a polynomial of
degree 2n + 1 (even though we are using only n + 1 evaluations of f !). The
error goes to zero with n → ∞. If f lends itself to good minimax approxima-
tions of low degree, then the results should be quite good already for finite
n (and remember: we do not need to compute the minimax approximation:
we only use the fact that it exists). Furthermore, we can get an upper bound on this bound by plugging in any other polynomial of degree at most 2n + 1
which is a good ∞-norm approximation of f . For example, we know that the
polynomial of degree 2n + 1 which interpolates f at the 2n + 2 Chebyshev
nodes on (a, b) reaches a pretty good ∞-norm error as long as f is many
times continuously differentiable and those derivatives are not catastrophi-
cally crazy. We do at least as well as if we were using that polynomial, even
though finding that polynomial would have required 2n + 2 evaluations of f !
Figure 9.1: Computation of \int_0^1 \cos(2\pi x)\,dx = 0 with various quadratures. [Plot: integration error versus n for each rule, on a logarithmic scale.]
Figure 9.2: Computation of \int_0^1 \sqrt{x}\,dx = \frac{2}{3} with various quadratures. [Plot: integration error versus n; legend for Figures 9.1–9.4: Newton–Cotes (Vandermonde), Chebyshev nodes (Vandermonde), Legendre nodes (Vandermonde), Legendre nodes (Golub–Welsch), composite trapezium.]
Figure 9.3: Computation of \int_{-1}^1 \frac{1}{1+25x^2}\,dx = \frac{2}{5}\arctan(5) with various quadratures. [Plot: integration error versus n for each rule.]
Figure 9.4: Computation of \int_0^1 x^8\,dx = \frac{1}{9} with various quadratures. [Plot: integration error versus n for each rule.]
Chapter 10
Unconstrained optimization
that is, compute the minimal value that f can take for any choice of x.
In general, this value may not exist and may not be attainable, in which
case the problem is less interesting. We make the following blanket assump-
tion throughout the chapter to avoid this issue.
Assumption 10.2. The function f is twice continuously differentiable2 and
attains the minimal value f ∗ .
When the minimum exists and is attainable, one is often also interested
in determining an x∗ ∈ Rn for which this minimal value is attained. Such an
x∗ is called an optimum.3 The set of all optima is denoted
In general, this set can contain any number of elements. Because the variable x is free to take any value, we say the problem is unconstrained. It is important to make a distinction between globally optimal points and points that appear optimal only locally (that is, when compared only to their immediate surroundings).
¹It is traditional to talk about minimization. To maximize, consider −f(x) instead.
²Many statements hold without demanding this much smoothness. Given the limited time at our disposal, we will keep technicality low to focus on ideas.
³Or also: an optimizer, a minimum, a minimizer.
as desired.
We now show that (b) and (c) are equivalent.
If (c) holds, then simply consider a Taylor expansion of f: for any x, y \in \mathbb{R}^n, there exists \alpha \in [0, 1] such that
f(y) = f(x) + (y - x)^T \nabla f(x) + \frac{1}{2}(y - x)^T \nabla^2 f(x + \alpha(y - x))(y - x). \tag{10.8}
Since the Hessian is assumed everywhere positive semidefinite, this yields
Lemma 10.11. If the step-sizes yield sufficient decrease, then gradient descent produces an iterate x_k such that \|\nabla f(x_k)\|_2 \le \varepsilon with k \le \left\lceil \frac{f(x_0) - f^*}{c\,\varepsilon^2} \right\rceil, and \|\nabla f(x_k)\|_2 \to 0. There is no condition on x_0.
Proof. Assume that x_0, \ldots, x_{K-1} all have gradient norm larger than \varepsilon. Then, using both the fact that f is lower bounded and the sufficient decrease property, a classic telescoping sum argument gives
f(x_0) - f^* \ge f(x_0) - f(x_K) = \sum_{k=0}^{K-1} \big(f(x_k) - f(x_{k+1})\big) \ge \sum_{k=0}^{K-1} c\,\|\nabla f(x_k)\|_2^2 \ge K c \varepsilon^2.
⁴Notice the plural: we won't show convergence to a unique critical point, even though this is what typically happens in practice. We only show that all accumulation points are critical points.
Hence, K \le (f(x_0) - f^*)/(c\varepsilon^2), so that if more iterations are computed, it must be that the gradient dropped below \varepsilon at least once. Furthermore, since the sum of squared gradient norms is upper bounded, the gradient norm must converge to 0.
Remark 10.12. Let us stress this: there are no assumptions on x0 . On
the other hand, the theorem only guarantees that all accumulation points of
the sequence of iterates are critical points: it does not guarantee that these
are global optima. Importantly, if f is convex, then critical points and global
optima coincide, which shows all accumulation points are global optima re-
gardless of initialization: this is powerful!
Sufficient decrease can be achieved easily if we assume f has a Lipschitz
continuous gradient with known constant L.
Lemma 10.13. If \|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 for all x, y, then the constant step-size \eta_k = \frac{1}{L} yields sufficient decrease with c = \frac{1}{2L}.
Proof. First, we show the Lipschitz condition implies the following statement: for all x, y,
\left| f(y) - f(x) - (y - x)^T \nabla f(x) \right| \le \frac{L}{2}\|y - x\|_2^2. \tag{10.11}
Indeed, by the fundamental theorem of calculus,
f(y) - f(x) = \int_0^1 (y - x)^T \nabla f(x + s(y - x))\,ds = (y - x)^T \nabla f(x) + \int_0^1 (y - x)^T \big(\nabla f(x + s(y - x)) - \nabla f(x)\big)\,ds.
one can expect: set x = x_k and y = x_{k+1}; then, removing the absolute value on the left-hand side,
f(x_{k+1}) - f(x_k) + \eta\|\nabla f(x_k)\|_2^2 \le \frac{L}{2}\eta^2\|\nabla f(x_k)\|_2^2,
or, equivalently,
f(x_k) - f(x_{k+1}) \ge \eta\left(1 - \frac{L}{2}\eta\right)\|\nabla f(x_k)\|_2^2. \tag{10.12}
With \eta = \frac{1}{L}, the right-hand side is \frac{1}{2L}\|\nabla f(x_k)\|_2^2, as announced. Notice that we only use the Lipschitz property along the piecewise linear curve that joins the iterates x_0, x_1, x_2, \ldots. This can help when analyzing functions f whose gradients are not globally Lipschitz continuous.
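The guarantee of Lemma 10.13 can be observed numerically. The sketch below (plain Python; the quadratic test function and its constants are an assumed example, not from the notes) runs gradient descent with constant step-size 1/L and asserts sufficient decrease with c = 1/(2L) at every iteration.

```python
# Gradient descent on an assumed toy quadratic f(x) = 0.5*(a1*x1^2 + a2*x2^2),
# whose gradient is Lipschitz with constant L = max(a1, a2). With step 1/L,
# Lemma 10.13 promises sufficient decrease with c = 1/(2L).

a1, a2 = 1.0, 10.0
L = max(a1, a2)

def f(x):
    return 0.5 * (a1 * x[0] ** 2 + a2 * x[1] ** 2)

def grad(x):
    return [a1 * x[0], a2 * x[1]]

x = [3.0, -2.0]
for _ in range(200):
    g = grad(x)
    gnorm2 = g[0] ** 2 + g[1] ** 2
    x_next = [x[0] - g[0] / L, x[1] - g[1] / L]
    # Sufficient decrease (Def. 10.10) with c = 1/(2L) holds at every step.
    assert f(x) - f(x_next) >= gnorm2 / (2 * L) - 1e-12
    x = x_next

print(f(x))  # f* = 0 here, and f(x_k) has dropped to nearly 0
```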
in at most
\max\!\left(1,\; 2 + \log_{\tau^{-1}}\!\left(\frac{L\bar\eta}{2(1 - c_1)}\right)\right)
calls to the cost function f. In particular, this means the sufficient decrease condition (Def. 10.10) is met with
c \ge \min\!\left(c_1\bar\eta,\; \frac{2c_1(1 - c_1)\tau}{L}\right).
Proof. When the line-search algorithm tries the step-size \eta, the Lipschitz continuous gradient assumption, via (10.11), guarantees that
f(x_k) - f(x_k - \eta\nabla f(x_k)) \ge \eta\left(1 - \frac{L}{2}\eta\right)\|\nabla f(x_k)\|_2^2.
(This is the same as (10.12).) If the algorithm does not stop, it must be that
\eta\left(1 - \frac{L}{2}\eta\right)\|\nabla f(x_k)\|_2^2 < c_1\eta\|\nabla f(x_k)\|_2^2,
which implies 1 - \frac{L}{2}\eta < c_1, that is,
\eta > \frac{2(1 - c_1)}{L}.
As soon as \eta drops below this bound, we can be sure that the line-search will return. This happens either at the very first guess, when \eta = \bar\eta, or after a step-size reduction by a factor \tau, which cannot have reduced the step-size below \tau times the right-hand side, so that, when the algorithm returns, \eta satisfies:
\eta \ge \min\!\left(\bar\eta,\; \frac{2(1 - c_1)\tau}{L}\right). \tag{10.13}
\ell + 1 \le 2 + \frac{\log\big(2(1 - c_1)/(L\bar\eta)\big)}{\log(\tau)}.
This concludes the proof, since \ell + 1 is here the number of calls to f.
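The backtracking line-search just analyzed can be sketched as follows (plain Python; the parameter defaults and the quadratic test function are assumed for illustration, not taken from the notes).

```python
def backtracking(f, grad_fk, fk, xk, eta_bar=1.0, tau=0.5, c1=1e-4):
    """Backtracking line-search: try eta = eta_bar, tau*eta_bar, ... until
    the sufficient decrease test f(xk) - f(xk - eta*grad) >= c1*eta*||grad||^2
    passes. Returns (eta, number of calls to f). Parameter defaults are
    illustrative; the notes denote the initial guess eta_bar and the
    shrink factor tau."""
    g2 = sum(gi * gi for gi in grad_fk)
    eta, calls = eta_bar, 0
    while True:
        trial = [xi - eta * gi for xi, gi in zip(xk, grad_fk)]
        calls += 1
        if fk - f(trial) >= c1 * eta * g2:
            return eta, calls
        eta *= tau

# Assumed test problem: f(x) = 0.5*(x1^2 + 10*x2^2), so L = 10 and the
# proof predicts acceptance once eta <= 2*(1 - c1)/L, after a few halvings.
f = lambda x: 0.5 * (x[0] ** 2 + 10 * x[1] ** 2)
xk = [1.0, 1.0]
gk = [xk[0], 10 * xk[1]]
eta, calls = backtracking(f, gk, f(xk), xk)
print(eta, calls)
```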
The main advantage of Lemma 10.11 is that it makes no assumptions
about the initial iterate x0 (compare this with our work in Chapters 1 and 4
of [SM03]). Yet, the guaranteed convergence rate is very slow: it is sublinear.
This is because, in the worst case, f may present large, almost flat regions
where progress is slow. Luckily, the convergence rate becomes linear if the
iterates get sufficiently close to a (strict) local optimum. Here is a statement
to that effect, with somewhat overly restrictive assumptions.
Lemma 10.15. Under the Lipschitz assumptions on the gradient of f , if x∗
is a local optimum where 0 ≺ ∇2 f (x∗ ) ≺ LIn , there exists a neighborhood
U of x∗ such that, if the sequence x0 , x1 , x2 . . . generated by gradient descent
with constant step-size ηk = 1/L ever enters the neighborhood U , then the
sequence converges at least linearly to x∗ .
Proof. Proof sketch: it is sufficient to observe that gradient descent in this setting is simultaneous iteration through relaxation:
x_{k+1} = g(x_k) = x_k - \frac{1}{L}\nabla f(x_k). \tag{10.14}
The Jacobian of g at the fixed point x^* is J_g(x^*) = I_n - \frac{1}{L}\nabla^2 f(x^*). Under the assumptions, the eigenvalues of \nabla^2 f(x^*) all lie in (0, L), so that \|J_g(x^*)\|_2 < 1. Thus, by continuity, g is a contraction map in a neighborhood of x^*. From there, one can deduce linear convergence of x_k to x^* (after U has been entered).
x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k),
assuming the inverse exists. This step is computed by solving a linear system where the matrix is the Hessian of f at x_k. Importantly, one does not construct the Hessian matrix to do this. That would be very expensive in most applications. Instead, one resorts to matrix-free solvers, which only require the ability to compute products of the form \nabla^2 f(x_k)u for vectors u.
Another interpretation of Newton's method for optimization is the following: at x_k, approximate f with a second-order Taylor expansion:
f(x) \approx f(x_k) + (x - x_k)^T \nabla f(x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k).
If f is strictly convex,5 then this quadratic approximation of f is itself strictly
convex. Find x which minimizes the quadratic: this coincides with xk+1 ! In
other words: an iteration of Newton’s method for optimization consists in
moving to the critical point of the quadratic approximation of f around the
current iterate.
⁵Strictly convex means the Hessian is positive definite, rather than only positive semidefinite.
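A Newton iteration for optimization can be sketched on a small case. Below, in plain Python, the strictly convex test function f(x, y) = cosh(x) + cosh(y − 1) is an assumed example with minimizer (0, 1); each step solves the 2 × 2 system H(x_k)d = −∇f(x_k) and sets x_{k+1} = x_k + d. (For large problems one would use a matrix-free solver for this system instead of forming the Hessian.)

```python
import math

# Assumed strictly convex test function: f(x, y) = cosh(x) + cosh(y - 1),
# minimized at (0, 1). Gradient and Hessian are hand-coded.

def grad(x, y):
    return [math.sinh(x), math.sinh(y - 1)]

def hess(x, y):
    # Diagonal here, but treated as a generic 2x2 matrix below.
    return [[math.cosh(x), 0.0], [0.0, math.cosh(y - 1)]]

def solve2x2(H, r):
    """Solve the 2x2 linear system H d = r by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * r[0] - H[0][1] * r[1]) / det,
            (H[0][0] * r[1] - H[1][0] * r[0]) / det]

x, y = 0.8, 0.3
for _ in range(6):
    g = grad(x, y)
    d = solve2x2(hess(x, y), [-g[0], -g[1]])  # Newton step
    x, y = x + d[0], y + d[1]

print(x, y)  # converges quadratically to the minimizer (0, 1)
```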
What if f is not strictly convex? Then, the quadratic may not be convex, so that its critical point is not necessarily a minimizer: it could be a maximizer, or a saddle point. If such is the case, then moving to that point is ill-advised. A better strategy consists in recognizing that the Taylor expansion
can only be trusted in a small neighborhood around xk . Thus, the quadratic
should be minimized in that trusted region only (instead of blindly jumping
to the critical point.) This is the starting point of the so-called trust region
method, which is an excellent algorithm widely used in practice.
What if we do not have access to the Hessian? One possibility is to ap-
proximate the Hessian using finite differences of the gradient. Alternatively,
one can resort to the popular BFGS algorithm, which only requires access to
the gradient and works great in practice as well.
Chapter 11
What now?
For most problems, we also assessed, through analysis, which aspects can make them harder or easier. For example, Ax = b is more difficult to solve if A is poorly conditioned, and f is more difficult to approximate with polynomials if its high-order derivatives go wild. We also acknowledged the effects of inexact arithmetic.
Hopefully, you were convinced that mathematical proofs and numerical
experimentation inform each other. Both are necessary to gain confidence in
the algorithms we develop, and eventually use in settings where failure has
consequences.
What now? The problems we studied are fundamental, in that they
appear throughout the sciences and engineering. I am confident that you
will encounter these problems in your own work, in various ways. With even
more certainty, you will encounter problems we did not address at all. Some
of these appear in chapters of [SM03] we did not open. By now, you are well
equipped to study these problems and algorithms on your own.
1. Piecewise polynomial approximation (splines): instead of interpolating
f with a polynomial of high degree, divide [a, b] into smaller intervals,
and approximate f with a low-degree polynomial on each interval sep-
arately. Only, do it in a way that the polynomials “connect”: the
piecewise polynomial function should be continuous, and continuously
differentiable a number of times. These requirements lead to a banded
linear system: you know how to solve this efficiently.
conditions are as prescribed; thus: we are looking for the roots of this
function. It is nonlinear, and to evaluate it we must solve the initial
value problem (for example using a stepping method): it is thus crucial
to use a nonlinear equation solver which requires as few calls to the
function as possible.
4. The finite element method (FEM) for ODEs and PDEs: this method
is used extensively in materials sciences, mechanical engineering, geo-
sciences, climate modeling and many more. For example, a differential
equation dictates the various mechanical constraints on the wings of
an airplane in flight, as a function of shape, wind speed and materials
used. Solving this equation informs us about which specific points of
the wing are at risk of breaking first, that is: which points should be
engineered with particular care. The solution to this PDE is a func-
tion, and it can be cast as the solution to an optimization problem (by
the same principle that states a physical system at rest is in the state
that minimizes energy.) Thus, we must minimize some cost function
(an energy function) over a space of functions. Spaces of functions are
usually infinite dimensional, so we have to do something about that.
The key step in FEM is to mesh the domain of the PDE (the airplane
wing) into small elements (tetrahedrons for example), and to define a
low-dimensional family of functions over this mesh. A trivial example is
to allow the value of the function at each mesh vertex to be a variable,
and to define the function at all other points through piecewise linear
interpolation. If the energy function is quadratic (as is normally the
case), minimizing it as a function of the vertex values boils down to one
(very) large and structured linear system (each element only interacts with its immediate neighbors, so that the corresponding matrix is sparse). Such systems are best solved with matrix-free solvers.
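The banded systems mentioned for splines in item 1 (and, in their simplest 1D form, the sparse systems of item 4) can be solved in O(n) operations. A sketch in plain Python (the Poisson-like test system tridiag(−1, 2, −1) is an assumed example):

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Thomas algorithm: solve a tridiagonal system in O(n) operations.
    sub[i] is the entry left of diag[i] (sub[0] unused), sup[i] the entry
    to its right (sup[-1] unused). Assumes no pivoting is needed, as for
    the diagonally dominant systems arising from splines or 1D FEM."""
    n = len(diag)
    d = diag[:]
    r = rhs[:]
    for i in range(1, n):           # forward elimination
        m = sub[i] / d[i - 1]
        d[i] -= m * sup[i - 1]
        r[i] -= m * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = (r[i] - sup[i] * x[i + 1]) / d[i]
    return x

# Assumed example: the 1D Poisson-like system tridiag(-1, 2, -1) with
# right-hand side [1, 1, 1]; the solution is [1.5, 2, 1.5].
x = solve_tridiagonal([0, -1, -1], [2, 2, 2], [-1, -1, 0], [1, 1, 1])
print(x)  # approximately [1.5, 2.0, 1.5]
```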
This idea of reducing an infinite dimensional optimization problem over a
space of functions to a finite dimensional problem (apparent in our work with
polynomial approximation, and also present in FEMs as described above) is
also the root of all modern applications of (deep) neural networks in machine
learning, which you have certainly heard about. There, the basic problem
is as follows: given examples (x1 , y1 ), . . . , (xn , yn ) (say, xi is an image, and
yi is a number which states whether this image represents a cat, a dog,
...), find a function f such that if we encounter a new image (an image we
have never seen before), then y = f (x) is a good indication of what the
image contains (a cat, a dog, ...) This problem is called learning. At its
heart, it is a function approximation problem if we assume that a “perfect”
function f indeed exists (what else could we do?) Neural networks are a fancy
[SM03] Endre Süli and David F. Mayers. An introduction to numerical analysis. Cambridge University Press, 2003.
[TBI97] Lloyd N. Trefethen and David Bau III. Numerical linear algebra, volume 50. SIAM, 1997.