
MAT321 Numerical Methods

Department of Mathematics and PACM


Princeton University

Instructor: Nicolas Boumal (nboumal)


TAs: Thomas Pumir (tpumir), Eitan Levin (eitanl)

Fall 2019
Contents

1 Solving one nonlinear equation 3


1.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Simple iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Relaxation and Newton’s method . . . . . . . . . . . . . . . . 17
1.4 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5 A quick note about Taylor’s theorem . . . . . . . . . . . . . . 25

2 Floating point arithmetic 27


2.1 Motivating example: finite differentiation . . . . . . . . . . . . 27
2.2 A simplified model for IEEE arithmetic . . . . . . . . . . . . . 29
2.3 Finite differentiation . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5 Computing long sums . . . . . . . . . . . . . . . . . . . . . . . 38

3 Linear systems of equations 41


3.1 Solving Ax = b . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Conditioning of Ax = b . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Least squares problems . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Conditioning of least squares problems . . . . . . . . . . . . . 50
3.5 Computing QR factorizations, A = Q̂R̂ . . . . . . . . . . . . . 53
3.6 Least-squares via SVD . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.8 Fixing MGS: twice is enough . . . . . . . . . . . . . . . . . . . 61
3.9 Solving least-squares with MGS directly . . . . . . . . . . . . 62

4 Systems of nonlinear equations 65


4.1 Simultaneous iteration . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Contractions in Rn . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Jacobians and convergence . . . . . . . . . . . . . . . . . . . . 72
4.4 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . 76


5 Eigenproblems 83
5.1 The power method . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Inverse iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3 Rayleigh quotient iteration . . . . . . . . . . . . . . . . . . . . 91
5.4 Sturm sequences . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5 Gerschgorin disks . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6 Householder tridiagonalization . . . . . . . . . . . . . . . . . . 107

6 Polynomial interpolation 113


6.1 Lagrange interpolation, the Lagrange way . . . . . . . . . . . 116
6.2 Hermite interpolation . . . . . . . . . . . . . . . . . . . . . . . 123

7 Minimax approximation 125


7.1 Characterizing the minimax polynomial . . . . . . . . . . . . . 128
7.2 Interpolation points to minimize the bound . . . . . . . . . . . 133
7.3 Codes and figures . . . . . . . . . . . . . . . . . . . . . . . . . 137

8 Approximation in the 2-norm 143


8.1 Inner products and 2-norms . . . . . . . . . . . . . . . . . . . 144
8.2 Solving the approximation problem . . . . . . . . . . . . . . . 146
8.3 A geometric viewpoint . . . . . . . . . . . . . . . . . . . . . . 148
8.4 What could go wrong? . . . . . . . . . . . . . . . . . . . . . . 150
8.5 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . 150
8.5.1 Gram–Schmidt . . . . . . . . . . . . . . . . . . . . . . 152
8.5.2 A look at the Chebyshev polynomials . . . . . . . . . . 152
8.5.3 Three-term recurrence relations . . . . . . . . . . . . . 154
8.5.4 Roots of orthogonal polynomials . . . . . . . . . . . . . 157
8.5.5 Differential equations & orthogonal polynomials . . . . 158

9 Integration 161
9.1 Computing the weights . . . . . . . . . . . . . . . . . . . . . . 162
9.2 Bounding the error . . . . . . . . . . . . . . . . . . . . . . . . 164
9.3 Composite rules . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.4 Gaussian quadratures . . . . . . . . . . . . . . . . . . . . . . . 167
9.4.1 Computing roots of orthogonal polynomials . . . . . . 169
9.4.2 Getting the weights, too: Golub–Welsch . . . . . . . . 171
9.4.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.4.4 Error bounds . . . . . . . . . . . . . . . . . . . . . . . 175

10 Unconstrained optimization 181


10.1 A first algorithm: gradient descent . . . . . . . . . . . . . . . 184
10.2 More algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 190

11 What now? 193


Introduction

“Numerical analysis is the study of algorithms for the problems


of continuous mathematics.”

–Nick Trefethen, appendix of [TBI97]

These notes contain some of the material covered in MAT 321 / APC
321 – Numerical Methods taught at Princeton University during the Fall
semesters of 2016–2019. They are extensively based on the two reference
books of the course, namely,

• Endre Süli and David F. Mayers. An Introduction to Numerical Analysis. Cambridge University Press, 2003, and

• Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.

I thank Bart Vandereycken and Javier Gómez-Serrano, previous instructors


of this course, and Pierre-Antoine Absil for their help and insight. Special
thanks also to José S.B. Ferreira, who was my TA for the first two years, to
Yuan Liu who took on that role in 2018, and to Thomas Pumir and Eitan
Levin for 2019.
These notes are a work in progress (this is their third year). Please do
let me know about errors, typos, suggestions for improvements… (however
small). Your feedback is immensely welcome, always.

Nicolas Boumal

Chapter 1

Solving one nonlinear equation

The following problem is arguably ubiquitous in science and engineering:

Problem 1.1 (Nonlinear equation). Given f : [a, b] → R, continuous, find


ξ ∈ [a, b] such that f (ξ) = 0.

The assumption that f is continuous is of central importance. Indeed,


without that assumption, evaluating f at x ∈ [a, b] yields no information
whatsoever about the value of f at any other point in the interval: unless
f (x) = 0, we are not in a better position to solve the problem. Compare
this to the situation where f is continuous: then, if we query f at x ∈ [a, b]
we know at least that close to x, the value of f must be close to f (x). In
particular, if |f (x)| is small, there might be a root nearby. This is enough
to get started. Later, we will assume a stronger form of continuity, called
Lipschitz continuity: this will quantify what we mean by “close”.
Consider the following example: f(x) = e^x − 2x − 1, depicted in Figure 1.1.
To get a sense of what is possible, let’s take a look at how Matlab’s built-in
algorithm, fzero, behaves when given the hint to search close to x0 = 1:

f = @(x) exp(x) - 2*x - 1;

x0 = 1;
options = optimset('Display','iter');
xi = fzero(f, x0, options);
fprintf('Root found: xi = %.16e, with value f(xi) = %.6e.\n', xi, f(xi));


This produces the following output:

Search for an interval around 1 containing a sign change:


Func-count a f(a) b f(b) Procedure
1 1 -0.281718 1 -0.281718 initial interval
3 0.971716 -0.300957 1.02828 -0.260304 search
5 0.96 -0.308304 1.04 -0.250783 search
7 0.943431 -0.318082 1.05657 -0.236654 search
9 0.92 -0.33071 1.08 -0.21532 search
11 0.886863 -0.346223 1.11314 -0.182382 search
13 0.84 -0.363633 1.16 -0.130067 search
15 0.773726 -0.379623 1.22627 -0.044042 search
17 0.68 -0.386122 1.32 0.103421 search

Search for a zero in the interval [0.68, 1.32]:


Func-count x f(x) Procedure
17 1.32 0.103421 initial
18 1.18479 -0.099576 interpolation
19 1.25112 -0.00799173 interpolation
20 1.25649 8.62309e-05 interpolation
21 1.25643 -5.34422e-07 interpolation
22 1.25643 -3.53615e-11 interpolation
23 1.25643 0 interpolation

Zero found in the interval [0.68, 1.32]


Root found: xi = 1.2564312086261697e+00, with value f(xi) = 0.000000e+00.

Based on our initial guess x0 = 1, Matlab’s fzero used 23 function


evaluations to zoom in on the positive root of f .


Figure 1.1: The function f(x) = e^x − 2x − 1 has two roots: ξ = 0 and
ξ ≈ 1.2564312086261697.

1.1 Bisection
The main theorem we need to describe our first algorithm is a consequence
of the Intermediate Value Theorem (IVT). It offers a sufficient (but not
necessary) criterion to decide whether Problem 1.1 has a solution at all.

Theorem 1.2. Let f : [a, b] → R be continuous. If f (a)f (b) ≤ 0, then there


exists ξ ∈ [a, b] such that f (ξ) = 0.

Proof. If f (a)f (b) = 0, then either a or b can be taken as ξ. Otherwise,


f (a)f (b) < 0, so that f (a) and f (b) delimit an interval which contains 0.
Apply the IVT to conclude.
Hence, if we find two points in the interval [a, b] such that f changes
sign on those two points, we are assured that f has a root in between these
two points. Without loss of generality, say that a, b are two such points
(alternatively, we can always redefine the domain of f). Say that f(a) < 0
and f(b) > 0. Let’s evaluate f at the midpoint c = (a + b)/2. What could happen?

• f(c) = 0: then we return ξ = c;

• f(c) > 0: then f changes sign on [a, c];

• f(c) < 0: then f changes sign on [c, b].

In the last two cases, we identified an interval which (i) contains a root, and
(ii) is half as long as our original interval. By iterating this procedure,
we can repeatedly halve the length of our interval with a single function
evaluation, always with the certainty that this interval contains a solution
to our problem. After k iterations, the interval has length |b − a| 2^{−k}. The
midpoint of that interval is at a distance at most |b − a| 2^{−k−1} of a solution
ξ. We formalize this in Algorithm 1.1, called the bisection algorithm.

Theorem 1.3. When Algorithm 1.1 returns c, there exists ξ ∈ [a0, b0] such
that f(ξ) = 0 and |ξ − c| ≤ |b0 − a0| 2^{−1−K}. Assuming f(a0), f(b0) were
already computed, this is achieved in at most K function evaluations.
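To make the theorem concrete, here is a small sketch of the same procedure. (The notes use Matlab; this illustration is in Python, and the helper name bisection is ours, not part of the course code.) It runs K halving steps on f(x) = e^x − 2x − 1 over [1, 2] and checks the midpoint against the guaranteed bound |b0 − a0| 2^{−1−K}:

```python
import math

def bisection(f, a, b, K):
    """Run K bisection steps on a sign-change interval; return the midpoint."""
    fa = f(a)
    assert fa * f(b) < 0, "f must change sign on [a, b]"
    c = (a + b) / 2
    for _ in range(K):
        fc = f(c)
        if fc == 0:
            return c
        if (fc < 0) == (fa < 0):   # f(c) has the sign of f(a): root in [c, b]
            a, fa = c, fc
        else:                      # f(c) has the sign of f(b): root in [a, c]
            b = c
        c = (a + b) / 2
    return c

f = lambda x: math.exp(x) - 2 * x - 1
K = 30
c = bisection(f, 1.0, 2.0, K)
root = 1.2564312086261697            # the positive root, see Figure 1.1
bound = (2.0 - 1.0) * 2 ** (-1 - K)  # Theorem 1.3 error bound
print(abs(c - root) <= bound)        # True: the midpoint meets the bound
```

With K = 30 the bound is 2^{−31} ≈ 4.7 · 10^{−10}, and the computed midpoint indeed lands within it.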

In principle, if we iterate the bisection algorithm indefinitely, we should


reach an arbitrarily accurate approximation of a root ξ of f . While this
statement is mathematically correct, it does not square with practice: see
Figures 1.2 and 1.3. Indeed, by default, computers use a form of inexact
arithmetic known as the IEEE Standard for Floating-Point Arithmetic (IEEE
754)—more on that later.


Figure 1.2: Applying the bisection algorithm on f(x) = e^x − 2x − 1 with
[a0, b0] = [1, 2] and K = 60. The interval length ℓk = bk − ak decreases
by a factor of 2 at each iteration, and the function value eventually hits 0
with c48 = 1.2564312086261697 (zero is not represented on the log-scale).
Yet, computing f(c48) with high accuracy shows it is not quite a root. Using
Matlab’s symbolic computation toolbox, syms x; f = exp(x) - 2*x - 1;
vpa(subs(f, x, 1.2564312086261697), 20) gives f(c48) ≈ 1.086 · 10^{−16}:
a small error remains. Figure 1.3 shows a different scenario.


Figure 1.3: Bisection on f(x) = (5 − x)e^x − 5 with [a0, b0] = [4, 5] and K = 60.
The interval length ℓk = bk − ak decreases only to 8.9 · 10^{−16}. Furthermore,
the function value stagnates instead of converging to 0. The culprit: inexact
arithmetic. We will see that this is actually as accurate as one can hope.

Algorithm 1.1 Bisection

 1: Input: f : [a0, b0] → R, continuous, f(a0)f(b0) < 0; iteration budget K
 2: Let c0 = (a0 + b0)/2
 3: for k = 0, 1, 2, . . . , K − 1 do
 4:     Compute f(ck)                        ▷ We really only need the sign
 5:     if f(ck) has sign opposite to f(ak) then
 6:         Let (ak+1, bk+1) = (ak, ck)
 7:     else if f(ck) has sign opposite to f(bk) then
 8:         Let (ak+1, bk+1) = (ck, bk)
 9:     else
10:         return c = ck                    ▷ f(ck) = 0
11:     end if
12:     Let ck+1 = (ak+1 + bk+1)/2
13: end for
14: return c = cK                            ▷ We ran out of iteration budget

function c = my_bisection(f, a, b, K)
% Example: c = my_bisection(@(x) exp(x) - 2*x - 1, 1, 2, 60);

% Make sure a and b are distinct and a < b
assert(a ~= b, 'a and b must be different');
if b < a
    [a, b] = deal(b, a); % switch a and b
end

% Two calls to f here
fa = f(a);
fb = f(b);

% Return immediately if a or b is a root
if fa == 0
    c = a;
    return;
end
if fb == 0
    c = b;
    return;
end

assert(sign(fa) ~= sign(fb), 'f(a) and f(b) must have opposite signs');

c = (a+b)/2;

for k = 1 : K

    % Only one call to f per iteration
    fc = f(c);

    if fc == 0
        return; % f(c) = 0: done
    end

    if sign(fc) ~= sign(fa) % update interval to [a, c]
        b = c;
        fb = fc;
    else % update interval to [c, b]
        a = c;
        fa = fc;
    end

    c = (a+b)/2;

end

The bisection algorithm relies heavily on Theorem 1.2. For its many
qualities (not the least of which is its simplicity), this approach has three
main drawbacks:
1. The user needs to find a sign change interval [a0 , b0 ] as initialization;
2. Convergence is fast, but we can do better;
3. Theorem 1.2 is fundamentally a one-dimensional thing: it won’t gener-
alize when we aim to solve several nonlinear equations simultaneously.
In the next section, we discuss simple iterations: a family of iterative algo-
rithms designed to solve Problem 1.1, and which will (try to) address these
shortcomings.
Recall the performance of fzero reported at the beginning of this chapter.
Based on our initial guess x0 = 1, Matlab’s fzero used 17 function evaluations
to find a sign-change interval of length 0.64. After that, it needed only
6 additional function evaluations to find the same root our bisection found
in 48 iterations (with 6 function evaluations starting from that interval,
bisection would only reach an error bound of 0.64 · 2^{−7} ≈ 0.005: only two
digits after the decimal point are correct). If we give fzero the same interval
we gave bisection, then it needs only 10 function evaluations to do its job.
This confirms Problem 1.1 can be solved faster.
We won’t discuss in much detail how fzero finds a sign-change interval (you
will think about it during precept). We do note in Figure 1.4 that this can be
a difficult task. The methods we discuss next do not require a sign-change
interval.


Figure 1.4: Plot of f = @(x) .5 - 1./(1 + M*abs(x - 1.05)); with
M = 200;. Given an initial guess x0 = 1, Matlab’s fzero aims to
find a sign-change interval: after 4119 function evaluations, it gives up,
with the last considered interval being [−1.6 · 10^{308}, 1.6 · 10^{308}].
On the other hand, Matlab’s fsolve finds an excellent approximation
of the root in 18 function evaluations, from the same initialization:
run x0 = 1; options = optimset('Display','iter');
fzero(f, x0, options); fsolve(f, x0, options);

1.2 Simple iteration


The family of algorithms we describe now relies on a different criterion for
the existence of a solution to Problem 1.1. As an example, consider

g(x) = x − f (x).

Clearly, f (ξ) = 0 if and only if g(ξ) = ξ, that is, if ξ is a fixed point of g. Given
f , there are many ways to construct a function g whose fixed points coincide
with the roots of f , so that Problem 1.1 is equivalent to the following.

Problem 1.4. Given g : [a, b] → R continuous, find ξ ∈ [a, b] s.t. g(ξ) = ξ.

Brouwer’s theorem states a sufficient condition for the existence of a fixed


point. (Note the condition on the image of g.)

Theorem 1.5 (Brouwer’s fixed point theorem). If g : [a, b] → [a, b] is con-


tinuous, then there exists (at least one) ξ ∈ [a, b] such that g(ξ) = ξ.

Proof. We can reduce the statement to that of Theorem 1.2 by defining
f(x) = x − g(x). Indeed, f(a) = a − g(a) ≤ 0 since g(x) ≥ a for all
x. Likewise, f(b) ≥ 0. Thus, f(a)f(b) ≤ 0 and Theorem 1.2 allows us to
conclude.
If one exists, finding a fixed point of g can be rather easy, see Algo-
rithm 1.2. Given an initial guess x0 ∈ [a, b], this algorithm generates a

Algorithm 1.2 Simple iteration


1: Input: g : [a, b] → [a, b], continuous; initial guess x0 ∈ [a, b].
2: for k = 0, 1, 2 . . . do
3: xk+1 = g(xk )
4: end for

sequence x0 , x1 , x2 . . . in [a, b] by iterated application of g: xk+1 = g(xk ).


(Notice here the importance that g maps [a, b] to itself, so that it is always
possible to apply g to the new iterate.) By continuity, we get an easy state-
ment right away.
Theorem 1.6. If the sequence x0, x1, . . . produced by Algorithm 1.2 converges
to a point ξ, then g(ξ) = ξ.

Proof. ξ = lim_{k→∞} xk = lim_{k→∞} xk+1 = lim_{k→∞} g(xk) = g(lim_{k→∞} xk) = g(ξ),
where the last interchange uses continuity of g.

This theorem has a big “if”. The main concern for this section will be:
Given f as in Problem 1.1, how do we pick an appropriate function g so that
(i) simple iteration on g converges, and (ii) it converges fast.
Let’s do an example, with f(x) = e^x − 2x − 1, as in Figure 1.1. Here are
three possible functions gi which all satisfy f(ξ) = 0 ⇐⇒ gi(ξ) = ξ:

g1(x) = log(2x + 1),
g2(x) = (e^x − 1)/2,
g3(x) = e^x − x − 1.    (1.1)

(The domain of g1 is restricted to (−1/2, ∞).) See Figure 1.5. Notice how
g1([1, 2]) ⊂ [1, 2] and g2,3([−1/2, 1/2]) ⊂ [−1/2, 1/2]: fixed points exist.
Let’s run simple iteration with these functions and see what happens.
First, initialize all three sequences with x0 = 0.5 and run 20 iterations.

x = zeros(20, 3); % run 20 iterations for each
x(1, :) = 0.5;    % initialize

for k = 1 : size(x, 1)-1
    x(k+1, 1) = g1(x(k, 1));
    x(k+1, 2) = g2(x(k, 2));
    x(k+1, 3) = g3(x(k, 3));
end

fprintf(' g1 g2 g3\n');
fprintf('%10.8e\t%10.8e\t%10.8e\n', x');


Figure 1.5: Functions gi intersect the line y = x (that is, gi (x) = x) exactly
when f (x) = 0.

This produces the following output.

g1 g2 g3
5.00000000e-01 5.00000000e-01 5.00000000e-01
6.93147181e-01 3.24360635e-01 1.48721271e-01
8.69741686e-01 1.91573014e-01 1.16282500e-02
1.00776935e+00 1.05576631e-01 6.78709175e-05
1.10377849e+00 5.56756324e-02 2.30328290e-09
1.16550958e+00 2.86273445e-02 0.00000000e+00
1.20327831e+00 1.45205226e-02 0.00000000e+00
1.22570199e+00 7.31322876e-03 0.00000000e+00
1.23878110e+00 3.67001786e-03 0.00000000e+00
1.24633153e+00 1.83838031e-03 0.00000000e+00
1.25066450e+00 9.20035584e-04 0.00000000e+00
1.25314261e+00 4.60229473e-04 0.00000000e+00
1.25455714e+00 2.30167698e-04 0.00000000e+00
1.25536366e+00 1.15097094e-04 0.00000000e+00
1.25582323e+00 5.75518590e-05 0.00000000e+00
1.25608500e+00 2.87767576e-05 0.00000000e+00
1.25623408e+00 1.43885858e-05 0.00000000e+00
1.25631897e+00 7.19434467e-06 0.00000000e+00
1.25636731e+00 3.59718527e-06 0.00000000e+00
1.25639483e+00 1.79859587e-06 0.00000000e+00

Recall that the larger root is about 1.2564312086261697. Let’s try again with

initialization x0 = 1.5.

g1 g2 g3
1.50000000e+00 1.50000000e+00 1.50000000e+00
1.38629436e+00 1.74084454e+00 1.98168907e+00
1.32776143e+00 2.35107853e+00 4.27329775e+00
1.29623914e+00 4.74844242e+00 6.64845880e+01
1.27884229e+00 5.72021964e+01 7.47979509e+28
1.26910993e+00 3.47991180e+24 Inf
1.26362374e+00 Inf NaN
1.26051782e+00 Inf NaN
1.25875516e+00 Inf NaN
1.25775344e+00 Inf NaN
1.25718372e+00 Inf NaN
1.25685955e+00 Inf NaN
1.25667505e+00 Inf NaN
1.25657003e+00 Inf NaN
1.25651024e+00 Inf NaN
1.25647620e+00 Inf NaN
1.25645683e+00 Inf NaN
1.25644579e+00 Inf NaN
1.25643951e+00 Inf NaN
1.25643594e+00 Inf NaN

Question 1.7. Explain why the g3 sequence generates NaN’s (Not-a-Number)


after the first Inf (∞).

Now x0 = 10.

g1 g2 g3
1.00000000e+01 1.00000000e+01 1.00000000e+01
3.04452244e+00 1.10127329e+04 2.20154658e+04
1.95855062e+00 Inf Inf
1.59271918e+00 Inf NaN
1.43161144e+00 Inf NaN
1.35150178e+00 Inf NaN
1.30914426e+00 Inf NaN
1.28600113e+00 Inf NaN
1.27312630e+00 Inf NaN
1.26589144e+00 Inf NaN

1.26180281e+00 Inf NaN


1.25948479e+00 Inf NaN
1.25816821e+00 Inf NaN
1.25741966e+00 Inf NaN
1.25699381e+00 Inf NaN
1.25675147e+00 Inf NaN
1.25661353e+00 Inf NaN
1.25653500e+00 Inf NaN
1.25649030e+00 Inf NaN
1.25646485e+00 Inf NaN

The sequence generated by g1 converges reliably to the larger root, slowly.
The sequence generated by g2, if it converges, converges to 0, also slowly. The
sequence generated by g3, when it converges, converges to 0 very fast. Much
of this behavior can be explained with the concept of contractions and the
associated theorem.

Definition 1.8 (contraction). Let g : [a, b] → R be continuous. We say g is


a contraction if there exists L ∈ (0, 1) such that

∀x, y ∈ [a, b], |g(x) − g(y)| ≤ L|x − y|.

In words: g brings x, y closer; this is a type of Lipschitz condition.

If g maps [a, b] to itself and it is a contraction, it is easy to establish


convergence of simple iteration. The role of L and the importance of having
L ∈ (0, 1) become apparent in the proof.

Theorem 1.9 (contraction mapping theorem). Let g : [a, b] → [a, b] be con-


tinuous. If g is a contraction, then it has a unique fixed point ξ and the
simple iteration sequence x0 , x1 . . . generated by xk+1 = g(xk ) converges to ξ
for any x0 ∈ [a, b].

Proof. The proof is in three steps.

1. A fixed point ξ exists, by Theorem 1.5.

2. The fixed point is unique. By contradiction: if ξ′ = g(ξ′) and ξ′ ≠ ξ,
then

|ξ − ξ′| = |g(ξ) − g(ξ′)| ≤ L|ξ − ξ′|,

by Definition 1.8. Since ξ′ ≠ ξ, we get L ≥ 1, which is a contradiction for a contraction.



3. Convergence: |xk+1 − ξ| = |g(xk) − g(ξ)| ≤ L|xk − ξ|. By induction, it
follows that |xk − ξ| ≤ L^k |x0 − ξ|. Since L ∈ (0, 1), this converges to
zero as k goes to infinity, hence lim_{k→∞} xk = ξ.
From the last step of the proof, we also get a sense that having L closer to
zero should translate into faster convergence. Let’s investigate whether func-
tions gi from our example are contractions, and if so, with which constants
Li . First, recall the Mean Value Theorem (MVT).
Theorem 1.10 (MVT). If g : [a, b] → R is continuous and it is differentiable
on (a, b), then there exists η ∈ (a, b) such that g(b) − g(a) = g′(η)(b − a).

Consider the MVT and the definition of contraction. If g : [a, b] → [a, b]
is continuous and it is differentiable on (a, b), then for all x, y ∈ [a, b], we
have |g(x) − g(y)| = |g′(η)||x − y| for some η ∈ (a, b). Thus, replacing |g′(η)|
with a bound independent of η (and independent of x and y), we reach the
conclusion that

∀x, y ∈ [a, b], |g(x) − g(y)| ≤ L|x − y|,

with

L := sup_{η∈(a,b)} |g′(η)|.

If this quantity is in (0, 1), then g is a contraction on [a, b]. (Note that the
sup over (a, b) is equivalent to a max over [a, b] if g′ is continuous on [a, b].)
Are the functions gi contractions? Yes, on some intervals. Consider
Figure 1.6, which depicts |gi′(x)|. We have:

• g1([1, 2]) ⊂ [1, 2] and L1 = max_{η∈[1,2]} |g1′(η)| ≤ 0.667.

• g2([−1/2, 1/2]) ⊂ [−1/2, 1/2] and L2 = max_{η∈[−1/2,1/2]} |g2′(η)| ≤ 0.825.

• g3([−1/2, 1/2]) ⊂ [−1/2, 1/2] and L3 = max_{η∈[−1/2,1/2]} |g3′(η)| ≤ 0.649.
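These bounds are easy to sanity-check numerically. The following sketch (in Python, for illustration; grid_max is a hypothetical helper, and a grid maximum only approximates the true max) samples |gi′| on each interval:

```python
import math

def grid_max(dg, lo, hi, n=100001):
    """Approximate the max of |dg| over [lo, hi] by sampling n grid points."""
    return max(abs(dg(lo + (hi - lo) * i / (n - 1))) for i in range(n))

# Derivatives of g1(x) = log(2x + 1), g2(x) = (e^x - 1)/2, g3(x) = e^x - x - 1:
L1 = grid_max(lambda x: 2 / (2 * x + 1), 1.0, 2.0)    # attained at x = 1: 2/3
L2 = grid_max(lambda x: math.exp(x) / 2, -0.5, 0.5)   # attained at x = 1/2
L3 = grid_max(lambda x: math.exp(x) - 1, -0.5, 0.5)   # attained at x = 1/2

print(L1 <= 0.667, L2 <= 0.825, L3 <= 0.649)          # True True True
```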
Theorem 1.9 (the contraction mapping theorem) guarantees convergence
to a unique fixed point for these gi’s, given appropriate initialization. What
can be said about the speed of convergence? Consider the proof of Theorem
1.9. In the last step, we established |xk − ξ| ≤ L^k |x0 − ξ|. What does it
take to ensure |xk − ξ| ≤ ε? Certainly, if L^k |x0 − ξ| ≤ ε, we are in the clear.
Taking logarithms, this is the case if and only if:

k log(L) + log |x0 − ξ| ≤ log(ε)        (multiply by −1)
k log(1/L) ≥ log |x0 − ξ| + log(1/ε)
k ≥ (1/log(1/L)) · log(|x0 − ξ|/ε).


Figure 1.6: Absolute values of the derivatives of the functions gi. For a
differentiable function gi to be a contraction around x, a necessary condition
is that |gi′(x)| < 1. Black dots mark the roots of f.

Of course, we do not know ξ. Surely, |x0 − ξ| ≤ b − a, but in practice we
also rarely know [a, b]. Luckily, we can get around that by studying the first
iterate:¹

|x0 − ξ| ≤ |x0 − x1| + |x1 − ξ|
         = |x0 − x1| + |g(x0) − g(ξ)|
         ≤ |x0 − x1| + L|x0 − ξ|.

Thus, |x0 − ξ| ≤ |x0 − x1|/(1 − L): assuming we know L, this is a computable
quantity. Combining, we get the following bound on k.

Theorem 1.11. Under the assumptions and with the notations of the contraction
mapping theorem, with x0 ∈ [a, b], for all k ≥ k(ε) where

k(ε) = (1/log(1/L)) · log(|x0 − x1|/((1 − L)ε)),

it holds that |xk − ξ| ≤ ε.
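For instance, take g1(x) = log(2x + 1) on [1, 2] with L1 = 0.667 and the (arbitrary, our choice) starting point x0 = 1.5. A quick evaluation of k(ε), sketched in Python, shows that a few dozen iterations already guarantee 8 correct digits:

```python
import math

def k_eps(g, x0, L, eps):
    """Iteration bound k(eps) from Theorem 1.11, given contraction constant L."""
    x1 = g(x0)
    return math.log(abs(x0 - x1) / ((1 - L) * eps)) / math.log(1 / L)

g1 = lambda x: math.log(2 * x + 1)
k = k_eps(g1, 1.5, 0.667, 1e-8)
print(math.ceil(k))   # 43: about forty iterations guarantee |x_k - xi| <= 1e-8
```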

This last theorem only gives an upper bound on how many iterations
might be necessary to reach a desired accuracy. In practice, convergence
may be much faster. Take for example g3, which converged to the root 0
(exactly) in only 5 iterations when initialized with x0 = 0.5. Meanwhile, the
bound with L3 = 0.649 only guarantees an accuracy of L3^5 |x0 − ξ| ≈ 0.058
after 5 iterations. Why is that?
¹In the first line, we use the triangle inequality: https://en.wikipedia.org/wiki/Triangle_inequality#Example_norms.

One important reason is that the constant L is valid for a whole interval
[a, b]. Yet, this choice of interval is somewhat arbitrary. If xk → ξ, eventually,
it is really only g′ close to ξ which matters. For g1, the derivative g1′ evaluated
at the positive root is about 0.57: not a big difference from 0.667. But for
g3, we have g3′(0) = 0: as we get closer and closer to 0, the convergence gets
faster and faster!
Thus, informally, if g is continuously differentiable at ξ and xk → ξ,
asymptotically, the rate depends on g′(ξ). In fact, much of the behavior of
simple iteration is linked to g′(ξ). Consider the following definition.

Definition 1.12. Let g : [a, b] → [a, b] have a fixed point ξ, and let x0, x1, . . .
be a sequence generated by xk+1 = g(xk) for some x0 ∈ [a, b].

• If there exists a neighborhood I of ξ such that x0 ∈ I implies xk → ξ,
we say ξ is a stable fixed point.

• If there exists a neighborhood I of ξ such that x0 ∈ I \ {ξ} implies we
do not have xk → ξ, we say ξ is an unstable fixed point.

(ξ can be either, or neither. Consider f(x) = x/2 if x ≤ 0, and f(x) = 2x
otherwise: the fixed point ξ = 0 attracts from the left and repels from the right.)

For a continuously differentiable function g with fixed point ξ, we can
make the following statements (note that their “if” parts are quite different
in nature.)
• If |g′(ξ)| > 1, then ξ is unstable. Indeed, if xk is very close to ξ (but
not equal!), then, by the MVT,

|xk+1 − ξ| = |g(xk) − g(ξ)| = |g′(η)||xk − ξ|

for some η between xk and ξ. By continuity of g′, we have |g′(η)| > 1
for η sufficiently close to ξ, hence: we are being pushed away from ξ
by the iteration.
• If xk → ξ, then, by the MVT and by continuity of |g′(x)|,

lim_{k→∞} |xk+1 − ξ|/|xk − ξ| = lim_{k→∞} |g(xk) − g(ξ)|/|xk − ξ| = lim_{k→∞} |g′(ηk)||xk − ξ|/|xk − ξ|
= lim_{k→∞} |g′(ηk)| = |g′(lim_{k→∞} ηk)| = |g′(ξ)|,

where ηk lies between xk and ξ.


This last statement shows explicitly how |g 0 (ξ)| drives the error decrease,
asymptotically. If g 0 (ξ) = 0, convergence is mighty fast, eventually. Let’s
give some names to the convergence speeds we may encounter.

Definition 1.13. Assume xk → ξ. We say:

• xk converges to ξ at least linearly if there exist µ ∈ (0, 1) and
ε0, ε1, . . . > 0 such that εk → 0, |xk − ξ| ≤ εk and lim_{k→∞} εk+1/εk = µ.

• If the conditions hold with µ = 0, the convergence is superlinear.

• If they hold with µ = 1 and |xk − ξ| = εk, the convergence is sublinear.

For the first case, if furthermore |xk − ξ| = εk, we say convergence is
linear, and ρ = −log10(µ) is the asymptotic rate of convergence. This is
because the number of correct digits of xk as an approximation of ξ grows as
kρ asymptotically (think about it), hence the term linear convergence.
It is a good idea to go back to Figures 1.5 and 1.6 to reinterpret the
experiments in light of our understanding of the role of |g′(ξ)|.
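This reinterpretation can also be checked numerically. For g1(x) = log(2x + 1), the limit above predicts error ratios approaching |g1′(ξ)| = 2/(2ξ + 1) ≈ 0.57 at the positive root. A short sketch (in Python, for illustration):

```python
import math

g1 = lambda x: math.log(2 * x + 1)
xi = 1.2564312086261697    # positive root of f(x) = e^x - 2x - 1

x = 0.5                    # initial guess, as in the experiment above
for _ in range(40):        # iterate long enough to reach the asymptotic regime
    x = g1(x)

ratio = abs(g1(x) - xi) / abs(x - xi)   # error contraction over one more step
print(round(ratio, 2), round(2 / (2 * xi + 1), 2))   # both print 0.57
```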
At this point, a word of caution is necessary: these notions of convergence
rates are asymptotic. When does the asymptotic regime kick in? That is
largely unspecified. Consider

g1(x) = 0.99x,
g2(x) = x/(1 + x^{1/10})^{10}.

Running a simple iteration on both functions from x0 = 1 generates the
following sequences. For g1, we have linear convergence to 0:

xk = 0.99^k, with ρ = −log10(0.99) ≈ 0.004,

and for g2 we have sublinear convergence to 0:

xk = 1/(k + 1)^{10}, and lim_{k→∞} |xk+1 − 0|/|xk − 0| = lim_{k→∞} ((k + 1)/(k + 2))^{10} = 1.

Thus, convergence to 0 is eventually faster with g1, yet as Figure 1.7 shows,
this asymptotic behavior only kicks in after many thousands of iterations.
For practical applications, the early convergence to a “good enough” approximation
of the solution may be all that matters. (This is especially true for
optimization algorithms in machine learning applied to very large datasets.)
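The closed form claimed for the g2 sequence is easy to verify by induction: if xk = 1/(k + 1)^{10}, then xk+1 = xk/(1 + xk^{1/10})^{10} = 1/(k + 2)^{10}. A quick numerical check (Python, illustrative):

```python
x = 1.0                                # x0 = 1
for k in range(50):
    x = x / (1 + x ** 0.1) ** 10       # simple iteration with g2

closed_form = 1 / 51 ** 10             # 1/(k+1)^10 with k = 50
print(abs(x - closed_form) / closed_form < 1e-10)   # True
```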

1.3 Relaxation and Newton’s method


Given a function f , we saw that different choices of g such that g(x) = x ⇐⇒
f (x) = 0 lead to different behavior. Is there a systematic approach to pick
g? Here is one.


Figure 1.7: Even though the sequence generated by g2 converges only sub-
linearly, it takes over 9000 iterations for the linearly convergent sequence
generated by g1 to take over.

Definition 1.14. Let f be defined and continuous around ξ. Relaxation
defines the sequence

xk+1 = xk − λf(xk),

where λ ≠ 0 is to be chosen (see below) and x0 is given near ξ.
Thus, relaxation is simple iteration with g(x) = x − λf(x). Since g is
continuous, if relaxation converges to ξ, then f(ξ) = 0.

What about rates of convergence? Assuming differentiability, ideally, we
want |g′(ξ)| = |1 − λf′(ξ)| < 1. This is the case if and only if 1 − λf′(ξ) < 1
and 1 − λf′(ξ) > −1, that is:

• f′(ξ) ≠ 0: the root is simple;

• λ and f′(ξ) have the same sign; and

• |λ| is not too big: using λf′(ξ) = |λ||f′(ξ)|, we need |λ| < 2/|f′(ξ)|.

Thus, if ξ is a simple root of f and f is continuously differentiable around ξ,
there exists some λ ≠ 0 such that relaxation converges at least linearly to ξ
if started close enough: this statement is formalized in [SM03, Thm. 1.7].
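As a small illustration (in Python; the value λ = 0.6 is our ad-hoc choice, not from [SM03]), take f(x) = e^x − 2x − 1 near its positive root ξ ≈ 1.2564, where f′(ξ) = e^ξ − 2 ≈ 1.51. The conditions above allow any λ of the same sign with |λ| < 2/1.51 ≈ 1.32:

```python
import math

f = lambda x: math.exp(x) - 2 * x - 1

lam = 0.6    # same sign as f'(xi) ~ 1.51, and below 2/|f'(xi)| ~ 1.32
x = 1.5      # start near the positive root
for _ in range(100):
    x = x - lam * f(x)   # relaxation: simple iteration with g(x) = x - lam*f(x)

print(abs(x - 1.2564312086261697) < 1e-12)   # True: converged to the root
```

Here |g′(ξ)| = |1 − 0.6 f′(ξ)| ≈ 0.09, so the iteration contracts quickly near ξ.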
The above reduces the construction of g to picking λ. Let’s automate
this as well. If we are optimistic, we may try to pick λ such that g′(ξ) =
1 − λf′(ξ) = 0, hence set λ = 1/f′(ξ). The issue is that we do not know f′(ξ).
One idea that turns out to be particularly powerful is to allow λ to change
with k. At every iteration, our best guess for ξ is xk. So, let’s use that and
define λk = 1/f′(xk).

Definition 1.15. For a given x0, Newton’s method generates the sequence:

xk+1 = xk − f(xk)/f′(xk).

We implicitly assume that f′(xk) ≠ 0 for all k.


Question 1.16. Show that, at every step, xk+1 is the root of the first-order
Taylor approximation of f around xk .


If Newton’s method converges (that’s a big if!), then the rate is superlinear
provided f′(ξ) ≠ 0. Indeed, Newton’s method is simple iteration with:

g(x) = x − f(x)/f′(x),    g′(x) = 1 − (f′(x)² − f(x)f″(x))/f′(x)² = f(x)f″(x)/f′(x)².

Thus, g′(ξ) = 0. How fast exactly is this superlinear convergence? Let’s look
at an example on f(x) = e^x − 2x − 1:

f = @(x) exp(x) - 2*x - 1;
df = @(x) exp(x) - 2; % f' is easy to get here

% Initialization x0: play around with this value: can get
% convergence to either root!
x = .5;
fprintf('x = %+.16e, \t f(x) = %+.16e\n', x, f(x));

for k = 1 : 12
    x = x - f(x) / df(x);
    fprintf('x = %+.16e, \t f(x) = %+.16e\n', x, f(x));
end

This produces the following output:


x = +5.0000000000000000e-01, f(x) = -3.5127872929987181e-01
x = -5.0000000000000000e-01, f(x) = +6.0653065971263342e-01
x = -6.4733401606416163e-02, f(x) = +6.6784120574507444e-02
x = -1.8885640405095216e-03, f(x) = +1.8903462554580308e-03
x = -1.7777391536067094e-06, f(x) = +1.7777407337327134e-06
x = -1.5802248762500354e-12, f(x) = +1.5802914532514478e-12
x = +6.6576998915534905e-17, f(x) = -1.1102230246251565e-16
x = -4.4445303546980749e-17, f(x) = +0.0000000000000000e+00
x = -4.4445303546980749e-17, f(x) = +0.0000000000000000e+00
x = -4.4445303546980749e-17, f(x) = +0.0000000000000000e+00
x = -4.4445303546980749e-17, f(x) = +0.0000000000000000e+00
x = -4.4445303546980749e-17, f(x) = +0.0000000000000000e+00
x = -4.4445303546980749e-17, f(x) = +0.0000000000000000e+00

We get fast convergence to the root ξ = 0. After a couple of iterations, the error |x_k − ξ| appears to be squared at every iteration, until we run into an error of about 10⁻¹⁷, which is an issue of numerical accuracy. (As a practical concern, it is nice to observe that, if we keep iterating, we do not move away from this excellent approximation of the root.) Let's give a name to this kind of fast convergence.

Definition 1.17. Suppose x_k → ξ. We say the sequence x_0, x_1, . . . converges to ξ with at least order q > 1 if there exist µ > 0 and a sequence ε_0, ε_1, . . . > 0 with ε_k → 0 such that

    |x_k − ξ| ≤ ε_k    and    lim_{k→∞} ε_{k+1} / ε_k^q = µ.

If the inequality holds with equality, we say convergence is with order q; if this holds with q = 2, the convergence is quadratic.

A couple remarks are in order:

1. There is no need to require µ < 1 since q > 1 (think about it.)

2. It makes no sense to discuss the rate of convergence of a sequence to ξ if it does not converge to ξ, which is why the definition above requires convergence to ξ as an assumption. Indeed, consider the sequence x_k = 2 for all k, and the limit lim_{k→∞} |x_{k+1} − 0| / |x_k − 0|² = 1/2 > 0. Of course, we cannot conclude from this that x_0, x_1, x_2, . . . converges to 0 quadratically, since it does not even converge to 0 in the first place. In fewer words: always secure convergence to ξ before you discuss the rate of convergence to ξ.
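When ξ is known, the order can also be estimated empirically: if ε_{k+1} ≈ µ ε_k^q with ε_k → 0, then log ε_{k+1} / log ε_k → q (the contribution of µ washes out as ε_k → 0). Here is a Python sketch of this estimator on Newton's method for our running example, using the root ξ = 0:

```python
import math

f = lambda x: math.exp(x) - 2 * x - 1
df = lambda x: math.exp(x) - 2

x = -0.5                       # in the basin of the root xi = 0
errs = [abs(x)]
while errs[-1] > 1e-11:        # stop before round-off takes over
    x = x - f(x) / df(x)       # Newton step
    errs.append(abs(x))

# For convergence of order q, log(e_{k+1}) / log(e_k) tends to q.
qs = [math.log(errs[k + 1]) / math.log(errs[k])
      for k in range(1, len(errs) - 1)]
print(qs)                      # estimates drifting toward 2
```

The estimates approach 2, matching the quadratic convergence we observed in the table above.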

Question 1.18. Show that simple iteration with g3 , if it converges to ξ = 0,


does so quadratically.

Finding such a g3 was rather lucky. Newton’s method, on the other


hand, provides a systematic way of getting quadratic convergence to isolated
roots (when it provides convergence), making it one of the most important
algorithms in numerical analysis.

Theorem 1.19. Let f : R → R be continuous with f(ξ) = 0. Assume f''(x) is continuous in I_δ = [ξ − δ, ξ + δ] for some δ > 0 and f''(ξ) ≠ 0. Further assume there exists A > 0 such that

    |f''(x) / f'(y)| ≤ A    for all x, y ∈ I_δ.

(This implicitly requires f'(ξ) ≠ 0. Conversely, if f'(ξ) ≠ 0, then such A, δ > 0 exist.) If |x_0 − ξ| ≤ h = min(δ, 1/A), then Newton's method converges quadratically to ξ.
Proof. Assume |x_k − ξ| ≤ h (it is true of x_0, and we will show that if it is true of x_k then it is true of x_{k+1}, so that by induction it is true of all x_k's). In particular, x_k ∈ I_δ, so that we can Taylor expand f around x_k:²

    0 = f(ξ) = f(x_k) + (ξ − x_k) f'(x_k) + (ξ − x_k)²/2 · f''(η_k)

for some η_k between ξ and x_k, so that η_k ∈ I_δ. The proof works in two stages. We first show convergence is at least linear; then we show it is actually quadratic.

At least linear convergence. Since x_{k+1} = x_k − f(x_k)/f'(x_k), the (signed) error obeys:

    ξ − x_{k+1} = ξ − x_k + f(x_k)/f'(x_k)
                = [f(x_k) + (ξ − x_k) f'(x_k)] / f'(x_k)        (use Taylor on the numerator)
                = −(ξ − x_k)²/2 · f''(η_k)/f'(x_k).

Using that x_k, η_k ∈ I_δ and our assumptions,

    |ξ − x_{k+1}| ≤ (1/2)(ξ − x_k)² A                           (see footnote³)
                 ≤ (1/2)|ξ − x_k|,    using |ξ − x_k| A ≤ 1 since |ξ − x_k| ≤ h ≤ 1/A.

In particular, |x_{k+1} − ξ| ≤ h. Since |x_0 − ξ| ≤ h by assumption, all x_k satisfy |x_k − ξ| ≤ h by induction. Furthermore, x_k converges to ξ at least linearly. Note: we did not yet use f''(ξ) ≠ 0, but already we get linear convergence.
²This is the Lagrange form of the remainder: see Section 1.5.
³At this point, it is tempting to go for quadratic convergence directly by studying |ξ − x_{k+1}| / (ξ − x_k)², but notice that we did not yet prove that ε_k = |ξ − x_k| converges to zero.

Quadratic convergence. Since x_k → ξ, we also have η_k → ξ. By continuity,

    lim_{k→∞} |x_{k+1} − ξ| / |x_k − ξ|² = lim_{k→∞} |f''(η_k) / (2f'(x_k))| = |f''(ξ) / (2f'(ξ))| = µ > 0.

Carefully compare to the definition of quadratic convergence to conclude.
Question 1.20. What happens if f'(ξ) = 0? Is that good or bad?


Question 1.21. What happens if f''(ξ) = 0? Is that good or bad?




While Newton's method is a great algorithm, bear in mind that the theorem we just established does not provide a practical way of initializing the sequence. This remains a practical issue, which can only be resolved on a case-by-case basis.
As a remark, note that the convergence guarantees given here are of the
form: if initialization is close enough to a root, then we get convergence to
that root. It is a common misconception to infer that if there is convergence
from a given initialization, then convergence is to the closest root. That
is simply not true. See [SM03, §1.7] for illustrations of just how compli-
cated the behavior of Newton’s method (and others) can be as a function of
initialization.

1.4 Secant method


Newton's method is nice, but computing f' can be a pain sometimes. The derivative can even be inaccessible for all intents and purposes, if f is given to us not as a mathematical formula but rather as a computer program whose code is either too complicated to dive into, or not revealed to us (a black box).
Here is an alternative, assuming f is continuously differentiable. We can approximate the derivative at x_k using the current and the previous iterate:

    f'(x_k) ≈ [f(x_k) − f(x_{k−1})] / (x_k − x_{k−1}) = f'(η_k)

for some η_k between x_k and x_{k−1} (by the mean value theorem). If x_k → ξ, then |x_k − x_{k−1}| → 0 and the approximation gets better. Furthermore, this is cheap because it only relies on quantities that are readily available: the evaluation of f at the two most recent iterates. Plug this into Newton's method to get the secant method.

Definition 1.22. For given x_0, x_1, the secant method generates the sequence:

    x_{k+1} = x_k − f(x_k) · (x_k − x_{k−1}) / (f(x_k) − f(x_{k−1})).

We implicitly assume that f(x_k) ≠ f(x_{k−1}) for all k.


With a drawing, convince yourself that xk+1 is the root of the line passing
through (xk , f (xk )) and (xk−1 , f (xk−1 )).
Let’s try this method on our running example.

f = @(x) exp(x) - 2*x - 1; % No need for the derivative of f

x0 = 1.5; % Need two initial points now
x1 = 1.0;

f0 = f(x0); % Evaluate f at both initial points
f1 = f(x1);

fprintf('x = %+.16e, \t f(x) = %+.16e\n', x1, f1);

for k = 1 : 12
    % Compute the next iterate
    x2 = x1 - f1 * (x1-x0) / (f1-f0);
    % Evaluate the function there (single call to f!)
    f2 = f(x2);
    fprintf('x = %+.16e, \t f(x) = %+.16e\n', x2, f2);
    % Slide the window
    % Equivalent code: [x0, f0, x1, f1] = deal(x1, f1, x2, f2);
    x0 = x1;
    f0 = f1;
    x1 = x2;
    f1 = f2;
end

This produces the following output:

x = +1.0000000000000000e+00, f(x) = -2.8171817154095447e-01


x = +1.1845136881643570e+00, f(x) = -9.9930741879998841e-02
x = +1.2859430872139392e+00, f(x) = +4.6192337895330837e-02
x = +1.2538792881164769e+00, f(x) = -3.8492759507384733e-03

x = +1.2563456836075342e+00, f(x) = -1.2937473932783661e-04


x = +1.2564314625719744e+00, f(x) = +3.8418517700478105e-07
x = +1.2564312086009526e+00, f(x) = -3.8149927661379479e-11
x = +1.2564312086261695e+00, f(x) = -4.4408920985006262e-16
x = +1.2564312086261697e+00, f(x) = +0.0000000000000000e+00
x = +1.2564312086261697e+00, f(x) = +0.0000000000000000e+00
x = NaN, f(x) = NaN
x = NaN, f(x) = NaN
x = NaN, f(x) = NaN

Convergence to the positive root is very fast indeed, though after getting
there things go out of control. Why is that? Propose an appropriate stopping
criterion to avoid this situation.
We state a convergence result with a proof sketch here (See [SM03,
Thm. 1.10] for details). In [SM03, Ex. 1.10], you are guided to establish
superlinear convergence. (This theorem is only for your information.)

Theorem 1.23. Let f be continuously differentiable on I = [ξ − h, ξ + h] for some h > 0, with f(ξ) = 0 and f'(ξ) ≠ 0. If x_0, x_1 are sufficiently close to ξ, then the secant method converges to ξ at least linearly.

Proof sketch. Assume f'(ξ) = α > 0 (the argument is similar for α < 0). In a subinterval I_δ of I, by continuity of f', we have f'(x) ∈ [(3/4)α, (5/4)α]. Following the first part of the proof for the convergence rate of Newton's method, this is sufficient to conclude that |x_{k+1} − ξ| ≤ (2/3)|x_k − ξ|, leading to at least linear convergence.
As a closing remark to this chapter, we note that it is a good strategy to
use bisection to zoom in on a root at first, thus exploiting the linear conver-
gence rate of bisection and its robustness; then to switch to a superlinearly
convergent method such as Newton’s or the secant method to “finish the
job.” This two-stage procedure is part of the strategy implemented in Mat-
lab’s fzero, as described in a series of blog posts by Matlab creator Cleve
Moler.4

4
https://blogs.mathworks.com/cleve/2015/10/12/zeroin-part-1-dekkers-algorithm/,
https://blogs.mathworks.com/cleve/2015/10/26/zeroin-part-2-brents-version/,
https://blogs.mathworks.com/cleve/2015/11/09/zeroin-part-3-matlab-zero-finder-fzero/
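The two-stage strategy is easy to sketch. The following Python toy version (the notes' code is otherwise Matlab; the bisection tolerance 1e-2 and the fixed budget of 10 Newton steps are arbitrary choices for illustration, not what fzero does) first bisects to shrink a bracket, then lets Newton finish:

```python
import math

f = lambda x: math.exp(x) - 2 * x - 1
df = lambda x: math.exp(x) - 2

# Stage 1: bisection shrinks a bracketing interval (robust, linear).
a, b = 1.0, 2.0                     # f(1) < 0 < f(2): the bracket holds a root
while b - a > 1e-2:
    c = (a + b) / 2
    if f(a) * f(c) <= 0:
        b = c
    else:
        a = c

# Stage 2: Newton "finishes the job" (fast, quadratic).
x = (a + b) / 2
for _ in range(10):
    x = x - f(x) / df(x)
print(x)                            # the positive root, ~1.2564312086261697
```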

1.5 A quick note about Taylor’s theorem


In this course, we frequently use Taylor’s theorem with remainder in Lagrange
form. The reader is encouraged to consult Wikipedia5 for a refresher of this
useful tool from calculus. For example, at order two the theorem is stated
below. We give the classical proof based on Cauchy’s mean value theorem
(also called the extended mean value theorem).6 This proof extends to Taylor
expansions of any order, provided f is sufficiently many times differentiable.

Theorem 1.24. Let f be twice differentiable on (x, a) with f' continuous on [x, a]. There exists η ∈ (x, a) (which depends on a in general) such that

    f(a) = f(x) + (a − x) f'(x) + (1/2)(a − x)² f''(η).
Proof. Consider these functions of t:

    G(t) = (t − a)²,                 G'(t) = 2(t − a),
    F(t) = f(t) + (a − t) f'(t),     F'(t) = (a − t) f''(t).

Cauchy's mean value theorem states there exists η (strictly) between a and x such that

    F'(η) / G'(η) = [F(x) − F(a)] / [G(x) − G(a)].

On one hand, we compute

    [F(x) − F(a)] / [G(x) − G(a)] = [f(x) + (a − x) f'(x) − f(a)] / (x − a)².

On the other hand, we compute

    F'(η) / G'(η) = (a − η) f''(η) / [2(η − a)] = −(1/2) f''(η).

Combine and re-arrange to finish the proof.

⁵https://en.wikipedia.org/wiki/Taylor%27s_theorem#Explicit_formulas_for_the_remainder
⁶https://en.wikipedia.org/wiki/Mean_value_theorem#Cauchy's_mean_value_theorem
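For f = exp, every derivative is again exp, so we can even solve the Lagrange form for η explicitly and check that it lands strictly between x and a. A small Python sanity check (the points x = 0 and a = 0.5 are arbitrary):

```python
import math

# Check Theorem 1.24 with f = exp (so f'' = exp), x = 0, a = 0.5.
x, a = 0.0, 0.5
remainder = math.exp(a) - (math.exp(x) + (a - x) * math.exp(x))

# Lagrange form: remainder = (1/2)(a - x)^2 * exp(eta); solve for eta.
eta = math.log(remainder / (0.5 * (a - x) ** 2))
print(eta)                       # strictly inside (0, 0.5), as promised
assert x < eta < a
```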
Chapter 2

Inexact arithmetic, IEEE and differentiation

Computers compute inaccurately, in a very precise way.

Vincent Legat, Prof. of Numerical Methods, UCLouvain, 2006

In this chapter, we delve into an important (if frustrating) aspect of computing with real numbers on real computers: it cannot be done exactly.¹ Fortunately, round-off errors, as they are called, are systematic (as opposed to random) and obey precise rules. We already witnessed an example of math clashing with computation when investigating the bisection algorithm (recall Figure 1.3). Let's look at another, more important example of this: approximating derivatives.

2.1 Motivating example: finite differentiation


We want to approximate the derivative of a function f, but we are only allowed to compute the value of f itself at some points of our choice. Certainly, computing f at a single point cannot reveal any information about f' (the rate of change). The next best target is: can we do it with two points?

Problem 2.1. Let f : R → R be three times continuously differentiable in an interval [x − h̄, x + h̄] for some h̄ > 0. Compute an approximation of f'(x) using only two evaluations of f at well-chosen points.
1
Unless we allow an unbounded amount of memory to be used to represent numbers,
which is hopelessly impractical.


As often, we start with a Taylor expansion. For any 0 < h < h̄, there exist η_1 and η_2, both in [x − h̄, x + h̄], such that

    f(x + h) = f(x) + h f'(x) + (h²/2) f''(x) + (h³/6) f'''(η_1),
    f(x − h) = f(x) − h f'(x) + (h²/2) f''(x) − (h³/6) f'''(η_2).

Our goal is to obtain a formula for f'(x). Thus, it is tempting to compute the difference between the two formulas above:

    f(x + h) − f(x − h) = 2h f'(x) + (h³/6) (f'''(η_1) + f'''(η_2)).

Solving for f'(x), we get:

    f'(x) = [f(x + h) − f(x − h)] / (2h) − (h²/12) (f'''(η_1) + f'''(η_2)).    (2.1)
Since f''' is continuous in [x − h̄, x + h̄], it is also bounded in that interval. Let M_3 be such that |f'''(η)| ≤ M_3 for all η in the interval. The approximation

    f'(x) ≈ ∆f(x; h) = [f(x + h) − f(x − h)] / (2h)

is called a finite difference approximation. From (2.1), we deduce it incurs an error bounded as:

    |f'(x) − ∆f(x; h)| ≤ (M_3/6) h².    (2.2)

This formula suggests that the smaller h, the smaller the approximation error, which is certainly in line with our intuition about derivatives. Let's verify this on a computer with f(x) = sin(x + π/3), approximating f'(0).

f = @(x) sin(x + pi/3);
df = @(x) cos(x + pi/3);

% Pick 101 values of h on a log-scale from 1e-16 to 1e0.
hh = logspace(-16, 0, 101);
err = zeros(size(hh));

for k = 1 : numel(hh)
    h = hh(k);
    approx = (f(h) - f(-h))/(2*h);
    actual = df(0);
    err(k) = abs(approx - actual);
end

% Bound on |f'''| over the appropriate interval.
% Here, |f'''(x)| = |-cos(x + pi/3)| <= 1 everywhere.
M3 = 1;

loglog(hh, err, '.-', hh, (M3/6)*hh.^2, '-');
legend('Actual error', 'Theoretical bound', 'Location', ...
       'SouthWest');
xlabel('Step size h');
ylabel('Error |f''(0) - FD(f, 0, h)|');
xlim([1e-16, 1]);
ylim([1e-12, 1]);

This code generates Figure 2.1. It is pretty clear that when h is “too”
small, something breaks. Specifically, it is the fact that our computations
are inexact. Fortunately, we will be able to give a precise description of what
happens, also allowing us to pick an appropriate value for h in practice.

2.2 A simplified model for IEEE arithmetic


We follow Lecture 13 in [TBI97]: read that first. We assume double precision,
which is the default in Matlab.
We are allotted 64 bits to represent real numbers. Let us use one of these bits to code the sign: if the bit is 1, let's agree that the number is nonnegative; if the bit is 0, we agree the number is nonpositive. This concern aside, we have 63 bits left, which allows us to pick 2⁶³ ≈ 10¹⁹ nonnegative numbers of our choosing. For each possible sequence of 63 bits (each 0 or 1), we get to choose which real number it represents. If we wish to represent a certain real number on our computer, chances are it won't be one of the representable numbers, so we will round it to the nearest representable number. By doing so, we incur a round-off error. Clearly, the spacing between representable numbers is crucial here.
One simple (if naive) strategy that comes to mind is as follows: let us pick some large number M > 0, and let us distribute the 2⁶³ representable real numbers evenly between 0 and M. A serious drawback of this approach is that, if we want to represent really large numbers, we are forced to take M large, which in turn increases the spacing between any two representable

[Figure 2.1 plot: error |f'(0) - FD(f, 0, h)| versus step size h on a log-log scale, showing the actual error and the theoretical bound.]

Figure 2.1: Mathematically, we predicted |f'(x) − ∆f(x; h)| (the blue curve) should stay below (M_3/6)h² (the red line of slope 2). Clearly, something is wrong. From the plot, it seems h = 10⁻⁵ is a good value. In practice though, we cannot draw this plot, for we do not know f'(x): we need to predict what a good h is via other means.

numbers. A related concern is that we are giving the same importance to absolute errors anywhere on the interval: this strategy says it is just as bad to round 10⁹ to 10⁹ + 1 as it is to round 1 to 2 (also an absolute error of 1): that is just not true in practice.
A good alternative, close to what modern computers do, is to pick the points on a logarithmic scale. Specifically, for a small value of ε (on the order of 10⁻¹⁶ in practice), we pick the numbers that are exactly representable as 1, 1·(1+ε), 1·(1+ε)², . . ., and likewise 1, 1·(1+ε)⁻¹, 1·(1+ε)⁻², . . . Since ε is very small, at first, the numbers we can represent are very close to 1. But eventually, this being an exponential process, we get to also pick very large and very small numbers. The key is the following: by construction, the relative spacing between two representable numbers is always the same, namely: they are separated by a ratio of 1 + ε (very close to 1). The absolute spacing, on the other hand, can grow quite large (or quite small). Indeed, if x is a representable number, then the next representable number is x·(1+ε). They are separated by an absolute gap of x·(1+ε) − x = xε. For x = 1, this gap is only on the order of 10⁻¹⁶, but for x = 10⁶, the gap is much larger: on the order of 10⁻¹⁰. This is acceptable: an error of 10⁻¹⁰ on a quantity of 10⁶

is not as bad as if we made that error on a quantity on the order of 1. Of


course, we still only have a finite number of numbers we can represent, and
this simplified explanation also doesn’t cover how we represent 0: we leave
such concerns aside, as they are not necessary for our purposes.
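Python exposes exactly this behavior through math.ulp (available since Python 3.9), which returns the gap between a float and the next one of the same magnitude:

```python
import math

# Absolute spacing between adjacent doubles grows with magnitude...
print(math.ulp(1.0))            # 2**-52, about 2.2e-16
print(math.ulp(1e6))            # 2**-33, about 1.2e-10
# ...but the relative spacing is (nearly) constant:
print(math.ulp(1e6) / 1e6)      # again about 1.2e-16
```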
The IEEE 754 standard codifies a system along the lines described above:
this is how (most) computers compute with real numbers, in a setup known as
floating point arithmetic.2 This is designed to offer a certain relative accuracy
over a huge range of numbers. Ignoring overflow and underflow problems3 as
well as denormalized numbers4 (which we will always do), the main points
to remember are:
1. Real numbers are rounded to representable numbers (on 64 bits) with relative accuracy ε_mach = 1.11 · 10⁻¹⁶.

2. For individual, basic operations, such as +, −, ·, /, (but
also, on modern computers, special functions such as trigono-
metric functions, exponentials. . . ) on representable num-
bers, results of one operation are as accurate as can be, in
that the result is the representable number which is closest
to the correct answer.
Thus, for a given real number a, its representation fl(a) in double precision obeys⁵

    fl(a) = a(1 + ε_0),    with |ε_0| ≤ ε_mach.

Furthermore, given two numbers a and b already represented exactly in memory (that is, fl(a) = a, fl(b) = b), we can assume the following about operations with these numbers:
• a ⊕ b = fl(a + b) = (a + b)(1 + ε_1),
• a ⊖ b = fl(a − b) = (a − b)(1 + ε_2),
• a ⊗ b = fl(ab) = ab(1 + ε_3),
²https://en.wikipedia.org/wiki/IEEE_floating_point
³Overflow occurs when one attempts to work with a number larger than the biggest number which can be stored (about 10³⁰⁸); underflow occurs when one attempts to store a number which is closer to zero than the closest nonzero number which can be represented.
⁴Denormalized numbers fill the gap around zero to improve accuracy there, but the relative accuracy is not as good as everywhere else.
⁵Remark that ε_mach is half of ε above, since if a is in the interval [x, x(1 + ε)] whose limits are exactly represented, its distance to either limit is at most half of the interval length, that is, (ε/2)x. Then, the relative error upon rounding a to its closest representable number is |a − fl(a)|/|a| ≤ (ε/2)|x|/|a| ≤ ε/2 =: ε_mach.

• a ⊘ b = fl(a/b) = (a/b)(1 + ε_4),
• fl(√a) = √a (1 + ε_5),
where |ε_i| ≤ ε_mach for all i.


Finally, we add as an extra rule that

    Multiplication and division by 2 are exact.

This follows from the fact that computers typically represent numbers in binary, so that multiplying and dividing by 2 can be done exactly. Consequently, powers of 2 (and of 1/2) are exactly representable, and multiplying or dividing by them is done exactly. Similarly, since we typically use one bit to encode the sign,

    If a can be represented exactly, then the same is true of −a.

Hence, computing −a is error-free.
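These rules are easy to witness in Python; note that sys.float_info.epsilon is the spacing 2⁻⁵² just above 1.0, so it equals 2ε_mach in the notation of these notes:

```python
import sys

eps = sys.float_info.epsilon     # 2**-52: spacing just above 1.0
assert eps == 2.0 ** -52

# Round to nearest: perturbations below eps/2 = eps_mach are absorbed.
assert 1.0 + 1e-16 == 1.0        # 1e-16 < eps/2, so it vanishes
assert 1.0 + 2.0 ** -52 > 1.0

# Multiplication by 2 and negation are exact, even for numbers
# (like 0.1) that are not exactly representable.
x = 0.1
assert 2 * x / 2 == x
assert -(-x) == x
```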

Computations usually involve many simple operations executed in succession, so that round-off errors will combine. Let's see how addition of three numbers works out (notice that we now need to specify the order in which additions are computed):

    a ⊕ (b ⊕ c) = a ⊕ (b + c)(1 + ε_1)
                = [a + (b + c)(1 + ε_1)] (1 + ε_2)
                = (a + b + c) + ε_2(a + b + c) + ε_1(b + c) + ε_1 ε_2 (b + c)
                = (a + b + c) + ε_2(a + b + c) + ε_1(b + c) + O(ε²_mach).

In the last equation, we made a simplification which we will always make: terms proportional to ε²_mach (or ε³_mach, ε⁴_mach, . . .) are so small that we do not care; so we hide them in the notation O(ε²_mach). Notice how the formula tells us the result of the addition (after both round-offs) is equal to the correct sum a + b + c, plus some extra terms. It is useful to bound the error:

    |a ⊕ (b ⊕ c) − (a + b + c)| ≤ ε_mach (|a + b + c| + |b + c|) + O(ε²_mach).

In relative terms, we get

    |a ⊕ (b ⊕ c) − (a + b + c)| / |a + b + c| ≤ ε_mach (1 + |b + c| / |a + b + c|) + O(ε²_mach).

Question 2.2. Is there a preferred order in which to sum a, b, c to reduce the error? Think of a ≫ b ≫ c.

Question 2.3. Can you establish a formula for the sum of a1 , . . . , an ?




Matlab’s eps function gives the spacing between a representable number


and the next representable number:
>> help eps
eps Spacing of floating point numbers.
D = eps(X), is the positive distance from ABS(X) to the next
larger in magnitude floating point number of the same precision
as X.
Essentially, eps(x) is 2|x|ε_mach. Notice the phrase "of the same precision as X": this is because Matlab allows computations in both double precision (as described above) and in single precision. In single precision, only 32 bits are used to represent real numbers, and ε_mach ≈ 6 · 10⁻⁸: eps(single(1))/2. Sometimes, the reduced computation time is more important than the resulting loss of accuracy.
Below, we are going to find out that the main issue with the finite differentiation example is the computation of a − b where |a|, |b| are large yet a ≈ b. To see why that is, consider computing fl(a − b) in a situation where a, b are not exactly represented. Then:

    fl(a − b) = fl(a) ⊖ fl(b) = (a(1 + ε_1) − b(1 + ε_2))(1 + ε_3)
              = (a − b)(1 + ε_3) + ε_1 a − ε_2 b + O(ε²_mach).

In terms of relative accuracy, we get

    |fl(a − b) − (a − b)| / |a − b| ≤ ε_mach (1 + (|a| + |b|) / |a − b|) + O(ε²_mach).

Clearly, if |a − b| ≪ |a| + |b|, we are in trouble. This happens if a, b are large yet close, so that rounding them individually incurs errors that are small relative to themselves, yet large relative to our actual target: their difference.
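This catastrophic cancellation is easy to trigger in Python; computing ((1 + h) − 1)/h, which equals 1 in exact arithmetic, already shows it:

```python
# ((1 + h) - 1) / h should equal 1 in exact arithmetic.
h = 1e-15
r = ((1.0 + h) - 1.0) / h
print(r)                  # about 1.11: an 11% error from one subtraction

h = 1e-16                 # below eps_mach: the perturbation is absorbed
print((1.0 + h) - 1.0)    # 0.0
```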

2.3 Finite differentiation


Formula (2.2) is mathematically correct: it gives the so-called truncation
error of the finite difference approximation. Yet, as illustrated by Figure 2.1,

it is reckless to rely on it in IEEE arithmetic. Why? The short answer is: when h is very small, f(x + h) and f(x − h) are almost equal (to f(x)). Thus, their difference is much smaller in magnitude than their own values. But, following the IEEE system, each of f(x − h) and f(x + h) is stored with a relative accuracy proportional to (essentially) |f(x)|. Thus, the difference is computed with an error proportional to |f(x)| as well, not proportional to the difference. So, for small h, the error might be much larger than the value we are computing!
We effectively compute fl(∆f(x; h)). How much error (overall) do we incur for that? The classic trick is to start with a triangle inequality, so that we can separate on one side the round-off error, and on the other side the truncation error (that is, the mathematical error):

    |fl(∆f(x; h)) − f'(x)| = |fl(∆f(x; h)) − ∆f(x; h) + ∆f(x; h) − f'(x)|
                           ≤ |fl(∆f(x; h)) − ∆f(x; h)| + |∆f(x; h) − f'(x)|.
The truncation error we already understand: it is bounded by (M_3/6)h². Let's focus on the round-off error. For notational convenience, we let x = 0 and omit it in the notation. We also assume f can be evaluated with relative accuracy ε at ±h:

    fl(f(±h)) = f(±h)(1 + ε_{1,2}),    with |ε_{1,2}| ≤ ε_mach.

We could say 10ε_mach or 100ε_mach: that is not what matters here. And since we get to pick h, we might as well pick one that is represented exactly (and 2h will be too).⁶

    fl(∆f(h)) = (fl(f(h)) ⊖ fl(f(−h))) ⊘ (2h)
              = [f(h)(1 + ε_1) − f(−h)(1 + ε_2)] / (2h) · (1 + ε_3)(1 + ε_4).

Let's break it down: ε_1, ε_2 are the relative errors due to computing f(h) and f(−h); ε_3 is the error incurred by computing their difference; and ε_4 is the error due to the division.⁷ Let us reorganize terms to make ∆f(h) appear:

    fl(∆f(h)) = [f(h)(1 + ε_1) − f(−h)(1 + ε_2)] / (2h) · (1 + ε_3)(1 + ε_4)
              = [f(h) − f(−h)] / (2h) · (1 + ε_3)(1 + ε_4) + [ε_1 f(h) − ε_2 f(−h)] / (2h) · (1 + ε_3)(1 + ε_4)
              = ∆f(h) + ∆f(h)(ε_3 + ε_4) + [ε_1 f(h) − ε_2 f(−h)] / (2h) + O(ε²).
⁶If h is not exactly represented, we get an error in the denominator which appears as 1/(1 + ε_5). It is then useful to recall that, since |ε_5| ≤ ε_mach, we have 1/(1 + ε_5) = 1 − ε_5 + O(ε²_mach).
⁷You can even get rid of ε_4 if you restrict yourself to h being a power of 1/2.

A number of terms had a product of two or more ε_i's in them; we hid them all under O(ε²). Do this soon to simplify computations. So, the round-off error is made of three terms:

    |fl(∆f(h)) − ∆f(h)| ≤ 2ε|∆f(h)| + |ε_1 f(h) − ε_2 f(−h)| / (2h) + O(ε²).

The first term is fine: it is the usual form for a relative error. The last term is also fine: we consider O(ε²) very small. It is the middle term which is the culprit. To understand why it is harmful, recall that ε_1 and ε_2 can be both positive and negative (corresponding to rounding up or down when the operation was computed). Thus, if the signs of ε_1 f(h) and ε_2 f(−h) happen to be opposite (which might very well be the case), then the numerator is quite large. Using f(±h) = f(0) ± hf'(0) + O(h²):

    |ε_1 f(h) − ε_2 f(−h)| / (2h) ≤ [ε|f(h)| + ε|f(−h)|] / (2h) ≤ ε|f(0)|/h + ε|f'(0)| + O(εh).

Clearly, if h is small, this is bad. Overall then, we find the following round-off error (where we also used |∆f(h) − f'(0)| = O(h²)):

    |fl(∆f(h)) − ∆f(h)| ≤ 3ε|f'(0)| + ε|f(0)|/h + O(ε²) + O(εh).

Finally, we have the following formula to bound the error; at this point, we re-integrate x in our notation:

    |fl(∆f(x; h)) − f'(x)| ≤ (M_3/6)h² + 3ε|f'(x)| + ε|f(x)|/h + O(ε²) + O(εh).    (2.3)

Good. This should help us pick a suitable value of h. The goal is to minimize the parts of the bound that depend on h. This is pretty much the case when both error terms are equal:⁸

    (M_3/6)h² ≈ ε|f(x)|/h,

thus, h = (6|f(x)|ε/M_3)^{1/3} is an appropriate choice. If the constant is not too different from 1, the magic value to remember is h ≈ ε^{1/3} ≈ 10⁻⁵.
⁸More precisely, you could observe that (M_3/6)h² + ε|f(x)|/h attains its minimum when the derivative with respect to h is zero; this happens for h = (3|f(x)|ε/M_3)^{1/3}.
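A quick Python check of the rule of thumb, taking M_3 = 1 for the running example f(x) = sin(x + π/3) and ε = ε_mach = 2⁻⁵³:

```python
import math

f = lambda x: math.sin(x + math.pi / 3)
exact = math.cos(math.pi / 3)                    # f'(0) = 1/2
fd = lambda h: (f(h) - f(-h)) / (2 * h)          # central difference

eps_mach = 2.0 ** -53                            # about 1.11e-16
M3 = 1.0                                         # |f'''| <= 1 here
h_opt = (3 * abs(f(0.0)) * eps_mach / M3) ** (1.0 / 3.0)
print(h_opt)                                     # on the order of 1e-5
print(abs(fd(h_opt) - exact))                    # close to the best achievable
```

The observed error at h_opt is far below what a much smaller or much larger h would typically give, in line with Figure 2.2.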

Question 2.4. Looking at the culprit in the error bound, ε|f(0)|/h, one might think that it is sufficient to work with g(x) := f(x) − f(0) instead, so that g'(0) = f'(0), and the culprit does not appear when computing g'(0) since g(0) = 0. Why is that a fallacy?


Question 2.5. Can you track down what happens if the computation of f (h)
and f (−h) only has a relative accuracy of 100εmach instead of εmach ? (Follow
ε1 and ε2 .)


The following bit of code adds the IEEE-aware error bound to the mix, as depicted in Figure 2.2.

hold all;
loglog(hh, (M3/6)*hh.^2 + 3*eps(1)*abs(df(0)) + ...
    (eps(1)./hh)*abs(f(0)));
legend('Actual error', 'Theoretical bound', 'IEEE-aware bound');

Question 2.6. Notice how, in this log-log plot, the right bit has a slope of 2,
whereas the left bit has a slope of −1. Can you tell why? The precise value
of the optimal h depends on M3 which is typically unknown. Is it better to
overestimate or underestimate?


2.4 Bisection
Recall the bisection method: for a certain continuous function f, let a_0 < b_0 (represented exactly in memory) be the end points of an interval such that f(a_0)f(b_0) < 0 (so the interval contains a root). The bisection method computes c_0 = (a_0 + b_0)/2, the mid-point of the interval [a_0, b_0], and decides to make the next interval either [a_1, b_1] = [a_0, c_0] or [a_1, b_1] = [c_0, b_0]. Then, it iterates. If this is done exactly, the interval length

    ℓ_k = b_k − a_k

is reduced by a factor of two at each iteration, so that ℓ_k = ℓ_0/2^k → 0. Ah, but computations are not exact. . .

[Figure 2.2 plot: error |f'(0) - FD(f, 0, h)| versus step size h on a log-log scale, showing the actual error, the theoretical bound, and the IEEE-aware bound.]

Figure 2.2: Factoring in round-off error in our analysis, we get a precise understanding of how accurately the finite difference formula can approximate derivatives in practice. The rule of thumb h ≈ 10⁻⁵ is fine in this instance.

What happens if we iterate until ℓ_k becomes very small (think: small enough that machine precision becomes an issue)? We eventually get to the point where a_k and b_k are "next to each other" in the list of exactly representable numbers. Thus, when computing c_k, which should fall in between the two, we instead get fl(c_k) rounded to either a_k or b_k: the interval will no longer change, and the interval length will no longer decrease. How long is the interval at that point? About ε_mach|a_k|. Since at convergence a_k, b_k should have converged to bracket a root ξ, we may expect a good bound to be ℓ_k ≲ ℓ_0/2^k + ε_mach|ξ|.⁹ Indeed, Figure 2.3 shows Figure 1.3 with an added IEEE-aware bound which explains the behavior exactly.
The following piece of code confirms the explanation, by verifying that once bisection gets stuck, a_k, b_k are indeed "neighbors" in the finite set of representable reals.

% run bisection; then:


fprintf('a = %.16e\n', a);
fprintf('b = %.16e\n', b);

⁹The bound is only approximate because the interval is not quite exactly halved at each iteration, also because of round-off errors. That effect is negligible, but it is a good exercise to try and account for it.

fprintf('c = %.16e\n', c);


fprintf('b - a = %.16e\n', b - a);
fprintf('eps(a) = %.16e\n', eps(a));

The output is:

a = 4.9651142317442760e+00
b = 4.9651142317442769e+00
c = 4.9651142317442769e+00
b - a = 8.8817841970012523e-16
eps(a) = 8.8817841970012523e-16

Indeed, the distance between a and the next representable number as given
by eps(a) is exactly the distance between a and b. As a result, c is rounded
to either a or b (in this case, to b.)
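The same experiment is easy to reproduce outside Matlab. The following Python sketch (our own illustration; the function f(x) = x² − 2 and the bracket [1, 2] are arbitrary choices, not the example from class) bisects until the midpoint rounds onto an endpoint, then uses math.nextafter to confirm that a and b are neighboring doubles:

```python
import math

# Bisection on f(x) = x^2 - 2 (root sqrt(2) in [1, 2]), run until the
# computed midpoint rounds onto one of the endpoints.
def f(x):
    return x * x - 2.0

a, b = 1.0, 2.0
iterations = 0
for _ in range(200):  # far more iterations than exact halving would need
    c = (a + b) / 2
    if c == a or c == b:  # fl(c) rounded onto an endpoint: we are stuck
        break
    if f(a) * f(c) <= 0:
        b = c
    else:
        a = c
    iterations += 1

# a and b are now neighbors in the set of representable doubles.
print(iterations, b - a == math.ulp(a), math.nextafter(a, math.inf) == b)
```

Once the loop stalls, b − a equals one ulp of a, matching the Matlab output above where b - a coincides with eps(a).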
A final remark: the story above suggests we can get an approximation of
a root ξ basically up to machine precision. If you feel courageous, you could
challenge this statement and ask: how about errors in computing f (ck )?
When ck is close to a root, f (ck ) is close to zero, hence we might get the sign
wrong and move to the wrong half interval. . . We won’t go there today.

[Figure 2.3 here: interval length |bk − ak| against iteration number k for bisection, together with the IEEE-aware bound.]

Figure 2.3: Figure 1.3 with an extra curve: the red curve shows the bound
(1/2^k) ℓ0 + εmach |ξ|, which better predicts the behavior of the interval length
during bisection under inexact arithmetic.

2.5 Computing long sums


How accurately can we compute Σ_{i=1}^n 1/i²? There is a preferred order. To
understand it, we need a formula controlling the round-off error incurred in

computing a sum of n numbers x1, . . . , xn, where xi = 1/i² in our case. We
will not assume that xi is represented exactly in memory. Thus,

    fl( Σ_{i=1}^n xi ) = fl(x1) ⊕ · · · ⊕ fl(xn) = ⨁_{i=1}^n fl(xi),

where we have to specify the order of summation. Let's say we sum x1 with
x2, then the result with x3, then the result with x4, etc.—that is, we sum the
big numbers first. Using fl(xi) = xi (1 + εi), establish the following identity,
where ε^(i) is the relative error incurred by the ith addition (there are n − 1
of them):

    ⨁_{i=1}^n fl(xi) = Σ_{i=1}^n xi + Σ_{i=1}^n xi ( εi + Σ_{j=i−1}^{n−1} ε^(j) ) + O(εmach²).

(We tacitly defined ε^(0) = 0 for ease of notation.)


For the sum we are interested in, xi can be computed as 1/i/i or 1/(i · i),
thus involving two basic operations. This means that, up to O(εmach²) terms,
we can bound |εi| ≤ 2εmach. Plugging this into the above identity, we get the
following error bound:

    | Σ_{i=1}^n 1/i² − ⨁_{i=1}^n fl(1/i/i) | ≤ Σ_{i=1}^n [2 + (n − 1) − (i − 1) + 1]/i² · εmach + O(εmach²)
                                            = 3εmach Σ_{i=1}^n 1/i² + εmach Σ_{i=1}^n (n − i)/i² + O(εmach²).    (2.4)

Since Σ_{i=1}^∞ 1/i² = π²/6 and log(n) ≤ Σ_{i=1}^n 1/i ≤ 1 + log(n), we further get

    | Σ_{i=1}^n 1/i² − ⨁_{i=1}^n fl(1/i/i) | ≤ ( (3 + n) π²/6 − log(n) ) εmach + O(εmach²).
This bounds the error to roughly (π²/6) n εmach, which, for large n, is a relative
error of about nεmach. This is not so good: if n = 10⁹, then we only expect
6 accurate digits after the decimal point.
On the other hand, if we had summed with i ranging from n to 1 rather
than from 1 to n—small numbers first—then in (2.4) we would have summed
i/i² = 1/i rather than (n − i)/i², so that

    | Σ_{i=1}^n 1/i² − ⨁_{i=1}^n fl(1/i/i) | ≤ 4εmach Σ_{i=1}^n 1/i² + εmach Σ_{i=1}^n 1/i + O(εmach²)
                                            ≤ ( 4π²/6 + log(n) + 1 ) εmach + O(εmach²).

This is much better! For n = 10⁹, the error is smaller than 29εmach < (1/2) · 10⁻¹⁴,
which means the result is accurate up to 14 digits after the decimal point!
One final point: in experimenting with this, be careful that even though
Σ_{i=1}^n 1/i² → π²/6 as n → ∞, for finite n, the difference may be bigger than
10⁻¹⁴. In particular, if we let n = 10⁹, then

    π²/6 − Σ_{i=1}^n 1/i² = Σ_{i=n+1}^∞ 1/i² ≥ Σ_{i=10⁹+1}^{2·10⁹} 1/i² ≥ 10⁹ · 1/(2 · 10⁹)² = .25 · 10⁻⁹.

Thus, when comparing the finite sum with π²/6, at best, only 9 digits after the
decimal point will coincide (and it could be fewer); that is not a mistake: it
only has to do with the convergence of that particular sum.

n = 1e10; % this takes a while to run

total1 = 0;
for ii = 1:1:n
    total1 = total1 + 1/ii^2;
end
fprintf('Sum large first: %.16f\n', total1);

total2 = 0;
for ii = n:-1:1
    total2 = total2 + 1/ii^2;
end
fprintf('Sum small first: %.16f\n', total2);

fprintf('Asymptotic value: %.16f\n', pi^2 / 6);

% That's about 6 accurate digits expected, and we got 7.
fprintf('Without being careful, we expect an error of about %.2e\n', ...
        n*eps(pi^2/6));

Sum large first: 1.6449340578345750 % (8th digit is off.)


Sum small first: 1.6449340667482264 % (10th digit is off.)
Asymptotic value: 1.6449340668482264
Without being careful, we expect an error of about 2.22e-06

Question 2.7. At a “big picture level”, why does it make sense to sum the
small numbers first?
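One way to see the answer experimentally without waiting for n = 10¹⁰ terms is to redo the computation in single precision, where the effect appears at much smaller n. A Python sketch (our own illustration, accumulating in float32 with NumPy):

```python
import numpy as np

n = 10**5
terms = 1.0 / np.arange(1, n + 1, dtype=np.float64) ** 2

# Reference value of the partial sum, computed in double precision.
reference = float(terms.sum())

def float32_sum(values):
    """Accumulate left to right, rounding after every addition (float32)."""
    total = np.float32(0.0)
    for v in values.astype(np.float32):
        total = np.float32(total + v)
    return float(total)

big_first = float32_sum(terms)          # i = 1, 2, ..., n
small_first = float32_sum(terms[::-1])  # i = n, ..., 2, 1

print(abs(big_first - reference), abs(small_first - reference))
```

With the terms accumulated in float32, the small-numbers-first order comes out far closer to the reference value: the running total stays comparable in size to the incoming terms, so fewer low-order bits are discarded at each addition.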


Chapter 3

Linear systems of equations

In this chapter, we solve linear systems of equations (Ax = b), we discuss the
sensitivity of the solution x to perturbations in b, we consider the implica-
tions for solving least-squares problems, and get into the important problem
of computing QR factorizations. As we do so, we encounter a number of
algorithms for which we ask the questions: What is its complexity in flops
(floating point operations)? And also: What could break the algorithm?
We follow mostly the contents of [TBI97]: specific lectures are referenced
below. See Blackboard for Matlab codes used in class.

3.1 Solving Ax = b
We aim to solve the following problem, where we assume A is invertible to
ensure existence and uniqueness of the solution.

Problem 3.1 (System of linear equations). Given an invertible matrix A ∈


Rn×n and a vector b ∈ Rn , find x ∈ Rn such that Ax = b.

At first, we consider a particular class of that problem.

Problem 3.2 (Triangular system of linear equations). Given an invertible


upper triangular matrix A ∈ Rn×n and a vector b ∈ Rn , find x ∈ Rn such
that Ax = b.

An upper triangular matrix A obeys aij = 0 if i > j. For a 6 × 6 matrix,


this is the following pattern, where × indicates an entry which may or may


not be zero, while other entries are zero:

    A = ( × × × × × ×
            × × × × ×
              × × × ×
                × × ×
                  × ×
                    × ) .
Question 3.3. Prove that A upper triangular is invertible if and only if
akk ≠ 0 for all k.


Triangular systems are particularly simple to solve: solve first for xn ,


then work your way up obtaining xn−1 , xn−2 up to x1 . This is called back
substitution: see first two pages of Lecture 17 in [TBI97] (the remainder of
that lecture concerns the stability of the algorithm, which we do not cover
in class.)

Algorithm 3.1 Back substitution (Trefethen and Bau, Alg. 17.1)
1: Given: A ∈ Rn×n upper triangular and b ∈ Rn
2: for k = n, n − 1, . . . , 1 do    ▷ Work backwards
3:     xk = ( bk − Σ_{j=k+1}^n akj xj ) / akk
4: end for

Question 3.4. What is the complexity of this algorithm in flops, that is:
how many floating point operations (or arithmetic operations) are required to
execute it, as a function of n?


Question 3.5. Assuming exact arithmetic, could this algorithm break?
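For readers who prefer to experiment outside Matlab, here is a direct NumPy transcription of Algorithm 3.1 (an illustrative sketch; the test matrix is our own choice):

```python
import numpy as np

def back_substitution(A, b):
    """Solve A x = b for A upper triangular and invertible (Algorithm 3.1)."""
    n = len(b)
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):  # k = n, n-1, ..., 1, here zero-based
        x[k] = (b[k] - A[k, k+1:] @ x[k+1:]) / A[k, k]
    return x

rng = np.random.default_rng(0)
A = np.triu(rng.random((5, 5))) + 5 * np.eye(5)  # safely nonzero diagonal
b = rng.random(5)
x = back_substitution(A, b)
print(np.allclose(A @ x, b))  # → True
```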




Now that we know how to solve triangular systems, a nice observation is
that if A is not necessarily triangular itself but we can somehow factorize it
into a product A = LU, where L is lower triangular and U is upper triangular,
then solving Ax = b is also easy. Indeed:

    Ax = b  ≡  LU x = b  ≡  { Ly = b,  U x = y }.

Algorithm 3.2 LU factorization without pivoting [TBI97, Alg. 20.1]
1: U ← A, L ← I
2: for k in 1 : (n − 1) do    ▷ For each diagonal entry
3:     for j in (k + 1) : n do    ▷ For each row below row k
4:         ℓjk ← ujk / ukk    ▷ 1 divide
5:         uj,k:n ← uj,k:n − ℓjk uk,k:n    ▷ n − k multiply and n − k subtract
6:     end for
7: end for

Algorithm 3.3 LU factorization with partial pivoting [TBI97, Alg. 21.1]
1: U ← A, L ← I, P ← I
2: for k in 1 : (n − 1) do    ▷ For each diagonal entry
3:     Select i ≥ k such that |uik| is maximized    ▷ Pivot selection
4:     uk,k:n ↔ ui,k:n    ▷ Row swap effect
5:     ℓk,1:(k−1) ↔ ℓi,1:(k−1)    ▷ Row swap effect
6:     pk,: ↔ pi,:    ▷ Row swap effect
7:     for j in (k + 1) : n do    ▷ For each row below row k, do elimination
8:         ℓjk ← ujk / ukk
9:         uj,k:n ← uj,k:n − ℓjk uk,k:n
10:     end for
11: end for
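Both algorithms transcribe readily; here is a NumPy sketch of Algorithm 3.3 (our own unoptimized version, for experimentation only), tried on a matrix whose (1,1) entry is zero, so that elimination without pivoting would fail immediately:

```python
import numpy as np

def lu_partial_pivot(A):
    """Compute P, L, U with P A = L U via partial pivoting (Algorithm 3.3 sketch)."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        i = k + np.argmax(np.abs(U[k:, k]))  # pivot: largest |u_ik| with i >= k
        U[[k, i], k:] = U[[i, k], k:]        # row swap in U
        L[[k, i], :k] = L[[i, k], :k]        # row swap in the computed part of L
        P[[k, i], :] = P[[i, k], :]          # record the permutation
        for j in range(k + 1, n):            # eliminate below the pivot
            L[j, k] = U[j, k] / U[k, k]
            U[j, k:] -= L[j, k] * U[k, k:]
    return P, L, U

A = np.array([[0.0, 1.0],
              [1.0, 1.0]])                   # a_11 = 0: no pivot-free LU exists
P, L, U = lu_partial_pivot(A)
print(np.allclose(P @ A, L @ U))  # → True
```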

Thus, we can first solve Ly = b (L is lower triangular: you should be able to


adapt the back substitution algorithm to this case easily), then solve U x = y.
If A is invertible, then L, U are both invertible (why?). Thus, both back
substitutions are valid (at least, in exact arithmetic). In fact, we can make
L unit lower triangular, which means ℓkk = 1 for all k. As a result, there are
no divisions involved in back substitution with L and the complexity of that
part is n2 − n flops instead of n2 .
It turns out you (most likely) already learned the algorithm to factor A =
LU in your introductory linear algebra class, likely under the name Gaussian
Elimination (and the fact it results in an LU factorization was probably not
highlighted.) See Lectures 20 and 21 in [TBI97] for Gaussian elimination,
with and without pivoting: we cover these in class. See Algorithms 3.2
and 3.3.
With pivoting, we get a factorization of the form
P A = LU,
where L is unit lower triangular, U is upper triangular and P is a permutation
matrix. Given this factorization, systems Ax = b can be solved in exactly
2n² − n flops still, since the presence of a permutation matrix requires no
additional arithmetic operations:

    Ax = b  ≡  P Ax = P b  ≡  LU x = P b  ≡  { Ly = P b,  U x = y }.

In these lectures, you find that computing P, L, U requires ∼ (2/3) n³ flops.
The total cost of solving LU x = P b is thus exactly 2n² − n flops, which
is the same as the cost of computing a matrix-vector product Ax (verify
this). In other terms: if we obtain a factorization P A = LU, then solving
linear systems in A becomes as cheap and easy as computing the
product A⁻¹b would be if A⁻¹ were available. Then one may wonder: is there any
advantage to computing an LU factorization as opposed to computing A−1 ?
To answer this, we should first ask: how would we compute A−1 ? It turns
out that a good algorithm for this is simply to use an LU factorization of
A to solve the n linear systems Ax = ei for i = 1, . . . , n, where ei is the
ith column of the identity matrix (this is what Matlab does).1 Indeed, the
solutions of these linear systems are the columns of A⁻¹ (why?). Using this
method, computing A⁻¹ involves an LU factorization (∼ (2/3) n³ flops) and n
system solves (∼ 2n³ flops), for a total of ∼ (8/3) n³ flops. Then, we still need
to compute the product A⁻¹b for an additional ∼ 2n² flops. This is
counterproductive: it would have been much better to use the LU factorization to
solve Ax = b directly. An important take away from this is: even if you
need to “apply A−1 repeatedly”, do not compute A−1 explicitly. Rather,
compute the LU factorization of A and use back-and-forward substitution.
This is cheaper (because there was no need to compute A−1 ) and incurs
less round-off error simply because it involves far fewer computations. This
explains the recommendation in Matlab's documentation: “It is
seldom necessary to form the explicit inverse of a matrix. A frequent misuse
of inv arises when solving the system of linear equations Ax = b. One way to
solve the equation is with x = inv(A)*b. A better way, from the standpoint
of both execution time and numerical accuracy, is to use the matrix backslash
operator x = A\b. This produces the solution using Gaussian elimination,
without explicitly forming the inverse.” Thus, Matlab’s backslash operator
internally forms the LU factorization. If we need to solve many systems
involving the same matrix A and not all the right-hand sides are known at
the same time, then it is likely better to use Matlab’s lu function to save the
factorization, then use that—more in Precept 3.
¹ From Matlab's documentation: “inv performs an LU decomposition of the input
matrix (or an LDL decomposition if the input matrix is Hermitian). It then uses the
results to form a linear system whose solution is the matrix inverse inv(X).”

After studying these lectures, take a moment to engrave the following in


your memory: even if A is invertible and exact arithmetic is used, Gaussian
elimination without pivoting may fail (in fact, the factorization A = LU may
not exist.) Here is an example to that effect:

    A = ( 0  1
          1  1 ) .

Gaussian elimination without pivoting fails at the first step. In contrast,


Gaussian elimination with pivoting produces a factorization P A = LU with-
out trouble in this case. Furthermore, following the general principle that if
something is mathematically impossible at 0, it will be numerically trouble-
some near 0, it is true that LU decomposition without pivoting is possible
for

    A = ( (1/3) εmach  1
          1            1 ) ,

but doing so in inexact arithmetic results in a large error A − LU = ( 0 0 ; 0 1 ).
Here too, pivoting fixes the issue efficiently.
In short: pivoting is not just a good idea; it is necessary in general.
Interestingly, there are certain special cases where pivoting is not necessary;
see for example Lecture 23 (Cholesky factorization).
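The near-breakdown is easy to reproduce. This Python snippet (our own check of the claim above) carries out the single elimination step without pivoting on the matrix with εmach/3 in the corner:

```python
import numpy as np

eps_mach = np.finfo(float).eps
A = np.array([[eps_mach / 3, 1.0],
              [1.0,          1.0]])

# One step of elimination WITHOUT pivoting: the multiplier is enormous.
l21 = A[1, 0] / A[0, 0]                         # about 1.35e16
L = np.array([[1.0, 0.0], [l21, 1.0]])
U = np.array([[A[0, 0], A[0, 1]],
              [0.0, A[1, 1] - l21 * A[0, 1]]])  # 1 - l21 loses the "1" entirely

print(A - L @ U)  # the (2,2) entry is off by about 1
```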

3.2 Conditioning of Ax = b
See Lecture 12 in [TBI97] for general context. We discuss mainly the follow-
ing points in class.
When solving Ax = b with A ∈ Rn×n invertible and b nonzero, how
sensitive is the solution x to perturbations in the right hand side b? If b is
perturbed and becomes b + δb for some δb ∈ Rn , then surely the solution of
the linear system changes as well, and becomes:

A(x + δx) = b + δb. (3.1)

How large could the deviation δx be? We phrase this question in relative
terms, that is, we want to bound the relative deviation in terms of the relative
perturbation:

    ‖δx‖/‖x‖ ≤ [ something to determine ] · ‖δb‖/‖b‖.

In asking this question, we want to consider the worst case over all possible
perturbations δb. It is also important to stress that this is about the sensitivity
of the problem, not about the sensitivity (or rather, stability) of any given
algorithm. If a problem is sensitive to perturbations, this affects all possible
algorithms to solve that problem. We must keep this in mind when assessing
candidate algorithms (“à l’impossible, nul n’est tenu.”)
One concept is particularly important to our discussion: the notion of
matrix norm—see Lecture 3 in [TBI97]. Given a vector norm ‖·‖ (for
example, ‖x‖₂ = √(xᵀx)), we can define a subordinate matrix norm as

    ‖A‖ = max_{x ≠ 0} ‖Ax‖/‖x‖.

Intuitively, this is the largest factor by which A can change the norm of any
given vector.
Question 3.6. Show that for the 2-norm, ‖A‖₂ = σmax(A), the largest
singular value of A.
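As a numerical sanity check of the identity in Question 3.6 (a check, not a proof), the sketch below estimates max ‖Ax‖/‖x‖ by power iteration on AᵀA and compares it with the largest singular value; the test matrix and iteration count are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# Power iteration on A^T A converges (generically) to the top singular direction.
x = rng.standard_normal(5)
for _ in range(500):
    x = A.T @ (A @ x)
    x /= np.linalg.norm(x)

estimate = np.linalg.norm(A @ x) / np.linalg.norm(x)  # ||Ax|| / ||x||
sigma_max = np.linalg.svd(A, compute_uv=False)[0]
print(estimate, sigma_max)  # nearly identical
```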


Going back to (3.1), since Ax = b, we have Aδx = δb. In bounding ‖δx‖/‖x‖,
one task is to upper bound ‖δx‖; using the definition of subordinate matrix
norm:

    ‖δx‖ = ‖A⁻¹δb‖ ≤ ‖A⁻¹‖ ‖δb‖.

On the other hand, we must lower bound ‖x‖; since Ax = b, we have

    ‖b‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖,

hence,

    ‖x‖ ≥ ‖b‖/‖A‖.

Combining, we get:

    ‖δx‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ · ‖δb‖/‖b‖,

which is of the required form. We call the multiplicative factor

    κ(A) = ‖A‖ ‖A⁻¹‖        (3.2)

the condition number of A with respect to the norm ‖·‖. It is always at least
1,² and is infinite if A is not invertible.
² Because ‖AB‖ ≤ ‖A‖ ‖B‖ for subordinate norms [TBI97, eq. (3.14)]; hence, 1 = ‖I‖ =
‖AA⁻¹‖ ≤ ‖A‖ ‖A⁻¹‖.

Question 3.7. Show that for the 2-norm, κ(A) = σmax(A)/σmin(A).


In the 2-norm, the situation is particularly clear: A is invertible if and


only if σmin(A) ≠ 0. The notion of conditioning says: if σmin(A) is close
to zero relative to σmax (A), then even though A is invertible, solving linear
systems in A will be tricky because the solution x could be very sensitive
to perturbations of b (if those perturbations happen to align with the worst
directions.) When κ(A) is much larger than 1, we say A is ill-conditioned,
and we say that solving Ax = b is an ill-conditioned problem.
Material in [TBI97, Lecture 12] shows different matrix computation prob-
lems whose conditioning is also given by κ(A), which is why we often omit to
distinguish between conditioning of the problem at hand and the condition
number of A.
As to the origin of perturbations δb, they can of course come from the
source of the data: after all, b typically is the result of some experiment,
and as such it is usually of finite accuracy. But of course, even if b is known
exactly, the simple act of expressing it as floating point numbers in the IEEE
standard results in a relative error on the order of εmach. Thus, ‖δb‖/‖b‖ should
be expected to be always at least of that order.
A rule of thumb is that in solving Ax = b in inexact arithmetic, assuming
perfect knowledge of A and b, we must account for a possible relative error
of order
O(κ(A)εmach ).
If κ(A) ≈ 106 and εmach ≈ 10−16 , we do not expect more than about 10
accurate digits in the result.
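The rule of thumb is easy to observe. The following Python sketch (the Hilbert matrix is our choice of a notoriously ill-conditioned test problem) solves a system with known solution and compares the relative error with κ(A):

```python
import numpy as np

n = 10
i = np.arange(1, n + 1)
A = 1.0 / (i[:, None] + i[None, :] - 1.0)  # Hilbert matrix, kappa ~ 1e13

x_true = np.ones(n)
b = A @ x_true
x = np.linalg.solve(A, b)

kappa = np.linalg.cond(A)                  # sigma_max / sigma_min
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(f"kappa = {kappa:.1e}, relative error = {rel_err:.1e}")
```

With κ(A) around 10¹³ and εmach around 10⁻¹⁶, we should only trust a few digits of the computed solution, even from a stable algorithm.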
To actually obtain these 10 accurate digits, one must also use an appropri-
ate algorithm. In this course, we call an algorithm stable if it offers as much
accuracy as the condition number of the problem permits—this is a rather
poor definition of stability; the interested reader is directed to Lectures 14
and beyond in [TBI97] for more prudent definitions.
It turns out that the natural question “Is solving a linear system via LU
factorization as described above a stable algorithm?” is particularly subtle.
For practical purposes, the answer is yes, and we will leave it at that in this
course. The interested reader is encouraged to look at [TBI97, Lecture 22].

3.3 Least squares problems


See Lecture 11 in [TBI97].

Let (t1, b1), . . . , (tm, bm) be given data points. We would like to compute a
vector of coefficients x = (ad, . . . , a0)ᵀ for a polynomial p of degree d defined
by p(t) = ad t^d + · · · + a1 t + a0 such that p(ti) ≈ bi. Specifically, we want to
minimize the sum of squared errors:

    min_{x ∈ R^{d+1}}  Σ_{i=1}^m |p(ti) − bi|².

We assume m > d + 1, so that there are strictly more data than there are
unknowns. Notice that

    p(ti) − bi = [ ti^d  ti^{d−1}  · · ·  ti  1 ] (ad, ad−1, . . . , a1, a0)ᵀ − bi.

Collecting all equations for i = 1 . . . m in a vector,

    ( p(t1) − b1, . . . , p(tm) − bm )ᵀ = Ax − b,

where A ∈ R^{m×(d+1)} is the matrix whose ith row is [ ti^d  · · ·  ti^0 ],
x = (ad, . . . , a0)ᵀ, and b = (b1, . . . , bm)ᵀ.

Thus, the least squares problem reduces to:

Problem 3.8. Given A ∈ Rm×n of full column rank and b ∈ Rm, compute
x ∈ Rn solution of

    min_{x ∈ Rn} ‖Ax − b‖²,

where ‖·‖ is the 2-norm: ‖y‖² = y1² + · · · + ym².

Above, n = d + 1. Matrix A as defined earlier indeed has full column


rank provided all ti ’s are distinct—we omit a proof.

Question 3.9. Show that if A does not have full column rank, then the least
squares problem cannot have a unique solution.



If b is in the image of A (also called the range of A), that is, if b belongs to
the subspace of linear combinations of the columns of A, then, by definition,
there exists a solution x to the over-determined system of equations Ax = b.
If that happens, x is a solution to the least squares problem (why?).
That is typically not the case. In general, for an over-determined system,
we do not expect b to be in the image of A. Instead, the solution x to the
least squares problem is such that Ax is the vector in the image of A which
is closest to b, in the 2-norm. In your introductory linear algebra course, you
probably argued that this implies b − Ax (the residual) is orthogonal to the
image of A. Equivalently, b − Ax is orthogonal to all columns of A (since
they form a basis of the image of A). Algebraically:

AT (b − Ax) = 0.

Reorganizing this statement yields the normal equations of Ax = b:

AT Ax = AT b.

Question 3.10. Show that AT A is invertible iff A has full column rank.

As it turns out, solving the normal equations is a terrible way of solving a


least-squares problem, and we can understand it by looking at the condition
number of AT A.

Problem 3.11. In the 2-norm, show that κ(AᵀA) = κ(A)², where we extend
the definition κ(A) = σmax(A)/σmin(A) to rectangular matrices.

Thus, if κ(A) ≈ 108 , the condition number of AT A is already about 1016 ,


which means even if we solve the normal equations with a stable algorithm
we cannot guarantee a single digit to be correct. Matrix A as defined above
tends to be poorly conditioned if some of the ti ’s are too close, so this is an
issue for us.
Alternatively, if we have an algorithm to factor A into the product

A = Q̂R̂,

where Q̂ ∈ Rm×n has orthonormal columns and R̂ ∈ Rn×n is upper triangu-


lar, then we can use that to solve the normal equations while avoiding the

squaring of the condition number. We proceed as follows. First, note that


the following are equivalent:

AT Ax = AT b
R̂T Q̂T Q̂R̂x = R̂T Q̂T b.

Question 3.12. Show that R̂ is invertible iff A has full column rank.


Thus, using both invertibility of R̂T (under our assumption that A has
full column rank) and the fact that Q̂T Q̂ = I (since Q̂ has orthonormal
columns), we find that the system simplifies to:

R̂x = Q̂T b.

This is a triangular system. Importantly, its condition number is fine.


Question 3.13. Show that in the 2-norm we have κ(R̂) = κ(A).


The latter considerations motivate us to discuss algorithms to compute


QR factorizations. You most likely already know one algorithm: the Gram–
Schmidt process.
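To make the contrast concrete, here is a Python sketch (our own example, with a Vandermonde matrix whose columns are nearly dependent) that solves the same least squares problem via the normal equations and via a reduced QR factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
A = np.vander(t, 10)            # m = 50, n = 10: poorly conditioned columns
b = rng.standard_normal(50)

# Route 1: normal equations, condition number kappa(A)^2.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Route 2: A = QR, then solve the triangular system R x = Q^T b.
Q, R = np.linalg.qr(A)          # reduced QR: Q is 50x10, R is 10x10
x_qr = np.linalg.solve(R, Q.T @ b)

print(np.linalg.cond(A), np.linalg.cond(A.T @ A))  # the second is the square
```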

3.4 Conditioning of least squares problems


In the previous section, we swept two important points under the rug: (a)
we did not formally argue that κ(A), extended to rectangular matrices as
κ(A) = σmax(A)/σmin(A), is a meaningful notion of conditioning; and (b) the right
hand side of the normal equations is AT b, which means that if b is perturbed
to b + δb, then the right hand side of the normal equations is perturbed to
AT b+AT δb (plus round-off error). That is: the perturbation of the right hand
side is not in just any direction; at least in exact arithmetic, it is necessarily
in the image of AT , which is orthogonal to the kernel of A. Both of these
points mean the discussion about conditioning we had earlier, while useful
to guide our intuition, is not entirely appropriate to discuss least squares
problems. This section gives further details (it is not discussed in class, but
is useful for homework.)
Lecture 18 in [TBI97] gives a proper treatment of conditioning for least
squares. As a highlight of that lecture, we cover here one important question:

how sensitive is the least squares solution x to perturbations of b? Formally,


let b̃ = b + δb, where δb is some perturbation, and let x̃ = x + δx be the cor-
responding (perturbed) least squares solution. Assuming A has full column
rank (as we always do in this chapter), the following equations hold:

x = (AT A)−1 AT b,
x̃ = (AT A)−1 AT b̃.

Subtracting the first equation from the second, we infer that:

δx = (AT A)−1 AT δb.

The matrix (AT A)−1 AT plays a special role: it is called the pseudo-inverse
of A. We denote it by A+ :

A+ = (AT A)−1 AT . (3.3)

The pseudo-inverse gives the solutions to a least squares problem: x = A+ b.
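A quick numerical check of (3.3) (a random, well-conditioned example of our own; numpy.linalg.pinv computes the pseudo-inverse via the SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))        # full column rank, almost surely
b = rng.standard_normal(6)

A_plus = np.linalg.inv(A.T @ A) @ A.T  # (A^T A)^{-1} A^T, as in (3.3)
x = A_plus @ b                         # least squares solution x = A^+ b

print(np.allclose(A_plus, np.linalg.pinv(A)))                # → True
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # → True
```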


Using the matrix norm subordinate to the vector 2-norm, we have that the
norm of the perturbation on x is bounded as:

    ‖δx‖ ≤ ‖A⁺‖ ‖δb‖.        (3.4)

Thus, it is necessary to understand the norm of A⁺, which is given by its
largest singular value. To determine the singular values of A⁺, first consider
the SVD of A:

    A = U Σ Vᵀ,

where Σ ∈ Rm×n carries σ1, . . . , σn on its main diagonal (and zeros elsewhere),
and U ∈ Rm×m, V ∈ Rn×n are orthogonal matrices. A simple computation
shows that:

    A⁺ = V Σ⁺ Uᵀ,

where Σ⁺ ∈ Rn×m carries 1/σ1, . . . , 1/σn on its main diagonal.
Notice that this is an SVD of A⁺ (up to the fact that one would normally
order the singular values from largest to smallest, and they are here ordered
from smallest to largest.) In particular, the largest singular value of A⁺

is 1/σn. This gives the following meaning to the definition we gave earlier
without justification for the condition number of a full-rank, rectangular
matrix:

    ‖A‖ = σmax(A),    ‖A⁺‖ = 1/σmin(A),    κ(A) = σmax(A)/σmin(A) = ‖A‖ ‖A⁺‖.

(Compare this to the definition κ(A) = ‖A‖ ‖A⁻¹‖ for square, invertible
matrices.) We are now in a good position to reconsider (3.4):

    ‖δx‖/‖x‖ ≤ ‖A⁺‖ ‖δb‖ / ‖x‖ = κ(A) · [ ‖b‖/(‖A‖ ‖x‖) ] · ‖δb‖/‖b‖,

where the left-hand side is the relative sensitivity of x w.r.t. b.

This inequality expresses the relative sensitivity of x with respect to
perturbations of b in a least squares problem, in terms of the condition number
of A and problem-dependent norms ‖A‖, ‖b‖, ‖x‖. We can further interpret
this extra coefficient as follows:

    ‖b‖/(‖A‖ ‖x‖) = [ ‖Ax‖/(‖A‖ ‖x‖) ] · [ ‖b‖/‖Ax‖ ].

The first factor is at most 1: it can only help (in that its effect on sensitivity
can only be to lower it.) It indicates that it is preferable (from a sensitivity
point of view) to have x in a subspace that is damped rather than amplified
by A. The second factor has a good geometric interpretation. Consider the
following:

Question 3.14. Argue AA+ is the orthogonal projector to the range of A.

Based on the latter property of AA+ and x = A+ b,

Ax = AA+ b

is the orthogonal projection of b to the range of A. Hence,

    ‖b‖/‖Ax‖ = ‖b‖/‖AA⁺b‖,

the ratio between the norm of b and that of its orthogonal projection to the
range of A, is given by the cosine of the angle between b and that subspace.

Let θ denote this angle and let η = ‖A‖ ‖x‖ / ‖Ax‖ ≥ 1. Then, we can summarize
our findings as:

    ‖δx‖/‖x‖ ≤ [ κ(A)/(η cos θ) ] · ‖δb‖/‖b‖ ≤ [ κ(A)/cos θ ] · ‖δb‖/‖b‖.
Notice the role of θ: if b is close to the range of A, then Ax ≈ b “almost”
has a consistent solution: this is rewarded by having cos θ close to 1. On the
other hand, if b is almost orthogonal to the range of A, then we are very far
from having a consistent solution and sensitivity is exacerbated, as indicated
by cos θ close to 0.

3.5 Computing QR factorizations, A = Q̂R̂


See Lectures 7, 8 and 9 in [TBI97] in full, except for the part about continuous
functions in Lecture 7 (we will discuss this in detail later in the course.)
See also the Matlab codes on Blackboard (shown in class) for experiments
showing the behavior of various QR algorithms.
Given a full column-rank matrix A ∈ Rm×n , classical Gram-Schmidt
(CGS) computes a matrix Q̂ ∈ Rm×n with orthonormal columns and an
upper triangular matrix R̂ ∈ Rn×n with positive diagonal entries such that
A = Q̂R̂, where

    A = [ a1 · · · an ],    Q̂ = [ q1 · · · qn ]    (columns),

and R̂ is upper triangular with entries r11, . . . , r1n in its first row and rnn
in its bottom-right corner.

This can be expressed equivalently as a collection of n vector equations:

a1 = r11 q1 (3.5)
a2 = r12 q1 + r22 q2
a3 = r13 q1 + r23 q2 + r33 q3
..
.

This structure guarantees that

span(q1 ) = span(a1 ), span(q1 , q2 ) = span(a1 , a2 ), etc.

To compute this factorization, CGS proceeds iteratively. First, it produces


q1 to span the same subspace as a1, and to have unit norm:

    q1 = a1 / ‖a1‖.

Then, it produces q2 by projecting a2 to the orthogonal complement of the
space spanned by q1,³

    a2⊥ = a2 − (q1ᵀ a2) q1,

and by normalizing the result:⁴

    q2 = a2⊥ / ‖a2⊥‖.

Similarly, q3 is obtained by projecting a3 to the orthogonal complement of
the space spanned by q1, q2,

    a3⊥ = a3 − (q1ᵀ a3) q1 − (q2ᵀ a3) q2,

and by normalizing the result:

    q3 = a3⊥ / ‖a3⊥‖.

The procedure is continued until n vectors have been produced, with general
formulas for j ranging from 1 to n:

    aj⊥ = aj − Σ_{i=1}^{j−1} (qiᵀ aj) qi,    qj = aj⊥ / ‖aj⊥‖.        (3.6)

The above equations can be rewritten in the form of (3.5):

    a1 = ‖a1‖ q1,                                  so r11 = ‖a1‖;
    a2 = (q1ᵀ a2) q1 + ‖a2⊥‖ q2,                   so r12 = q1ᵀ a2 and r22 = ‖a2⊥‖;
    a3 = (q1ᵀ a3) q1 + (q2ᵀ a3) q2 + ‖a3⊥‖ q3,     so r13 = q1ᵀ a3, r23 = q2ᵀ a3, r33 = ‖a3⊥‖;
    ...
³ This ensures a2⊥ is orthogonal to q1, hence q2 (which is just a scaled version of a2⊥)
is also orthogonal to q1, as desired. To get the formulas, we use the fact that these are
equivalent: (a) project to the orthogonal complement of a space; and (b) project to the
space, then subtract that from the original vector.
⁴ We could also set q2 = −a2⊥/‖a2⊥‖: taking the positive sign is a convention; it will lead
to a positive diagonal for R.

Algorithm 3.4 Classical Gram-Schmidt (unstable) [TBI97, Alg. 7.1]
1: Given: A ∈ Rm×n with columns a1, . . . , an ∈ Rm
2: for j in 1 . . . n do    ▷ For each column
3:     vj ← aj
4:     for i in 1 . . . j − 1 do    ▷ For each previously treated column
5:         rij ← qiᵀ aj    ▷ Compare with Algorithm 3.5
6:         vj ← vj − rij qi
7:     end for
8:     rjj ← ‖vj‖    ▷ By this point, vj = aj⊥
9:     qj ← vj / rjj
10: end for

The general formula for j ranging from 1 to n is:

    aj = Σ_{i=1}^{j−1} (qiᵀ aj) qi + ‖aj⊥‖ qj,    with rij = qiᵀ aj and rjj = ‖aj⊥‖.

These show explicitly how to populate the matrices Q̂ and R̂ as we proceed


through the Gram-Schmidt algorithm, see Algorithm 3.4. The following is
equivalent Matlab code for this algorithm, organized a bit differently:

function [Q, R] = classical_gram_schmidt(A)


% Classical Gram-Schmidt orthonormalization algorithm (unstable)

[m, n] = size(A);
Q = zeros(m, n);
R = zeros(n, n);

for j = 1 : n

v = A(:, j);

R(1:j-1, j) = Q(:, 1:j-1)' * v;

v = v - Q(:, 1:j-1) * R(1:j-1, j);

R(j, j) = norm(v);
Q(:, j) = v / R(j, j);

end

end

(See codes lecture LSQ first contact.m and lecture QR comparison.m, shown in class.)

Numerically, CGS is seen to behave rather poorly. What could possibly have
gone wrong? Here is a list of suspects:

• Columns of Q̂ are not really orthonormal?

• R̂ is not really upper triangular?

• It is not true that A = Q̂R̂?

• A is not actually full column rank, breaking an important assumption?

Following the experiments in the codes on Blackboard, we find that A has


full column rank, and when computing A = Q̂R̂ using CGS, we indeed satisfy
A ≈ Q̂R̂, and of course R̂ is exactly upper triangular (by construction). The
culprit: round-off errors cause columns of Q̂ to be far from orthonormal in
case A is poorly conditioned.
An improved CGS, called Modified Gram–Schmidt (MGS), partly fixes
the problem—the modification is subtle; spend some time thinking about it.
The main idea goes as follows.
Consider the master equation of CGS, namely (3.6). What does it mean
for A to be poorly conditioned? It means its columns, while linearly inde-
pendent, are close to being dependent, in the sense that aj may be close
to the subspace spanned by a1 , . . . , aj−1 or, equivalently, by q1 , . . . , qj−1 . In
turn, this means aj⊥ may be a (very) small vector: we have no control over
that, it is a property of the given A. Thus, when forming qj = aj⊥/‖aj⊥‖, any
numerical errors accumulated in computing aj⊥ may be amplified by a lot.


And indeed, we accumulate a lot of such errors, in ‘random’ directions, from
the computation

    aj⊥ = aj − (q1ᵀ aj) q1 − · · · − (qj−1ᵀ aj) qj−1.

Each vector (qiT aj )qi is rounded to some representable numbers, causing


round-off errors, and these errors compound (add up) without any serious
cancellations. Since these errors are added on top of the true aj⊥ which (in
this discussion) is very small, the computed aj⊥ points in a direction that
may be very far from being orthogonal to q1, . . . , qj−1.
may be very far from being orthogonal to q1 , . . . , qj−1 .
To alleviate this issue, the key is to notice that we can perform this
computation differently, in such a way that errors introduced so far can be
reduced going forward. As an example, consider

    a3⊥ = a3 − (q1ᵀ a3) q1 − (q2ᵀ a3) q2.

In CGS, this formula is implemented as follows:


    v3 ← a3
    v3 ← v3 − (q1ᵀ a3) q1
    v3 ← v3 − (q2ᵀ a3) q2.
That is: initialize v3 to a3 ; project a3 to span(q1 ), and subtract that from v3 ;
then project a3 to q2 and subtract that from v3 . Mathematically, v3 = a⊥ 3 in
the end. But we could do this computation differently. Indeed, assume we
only executed the first step, so that, at this point, v3 = a3 . Then, surely,
q1T v3 = q1T a3 . Now assume we performed the first two steps, so that at this
point
v3 = a3 − (q1T a3 )q1 .
Then, mathematically,
q2T v3 = q2T a3 − (q1T a3 )(q2T q1 ) = q2T a3 ,
since q1T q2 = 0. Thus, the following procedure also computes a⊥3 :

v3 ← a3
v3 ← v3 − (q1T v3 )q1 = (I − q1 q1T )v3
v3 ← v3 − (q2T v3 )q2 = (I − q2 q2T )v3 .
This procedure reads as follows: initialize v3 to a3 ; project v3 to the orthog-
onal complement of q1 ; and project the result to the orthogonal complement
of q2 . More generally, the rule to obtain a⊥j is:

a⊥j = (I − qj−1 qj−1T ) · · · (I − q1 q1T )aj .

This small modification turns out to be beneficial. Why? Follow the nu-
merical errors: each projection I − qi qiT introduces some round-off error.
Importantly, much of that round-off error will be eliminated by the next
projector, I − qi+1 qi+1T , because components of the error that happen to be
aligned with qi+1 will be mathematically zeroed out (numerically, they are
greatly reduced). One concrete effect, for example, is that qj is as orthogonal
to qj−1 as numerically possible, because the last step in obtaining a⊥j is to
apply the projector I − qj−1 qj−1T . This is not so for CGS.
This modification of CGS, called MGS, is presented in a neatly organized
procedure as Algorithm 3.5. The algorithm is organized somewhat differ-
ently from Algorithm 3.4 (specifically, it applies the projector I − qi qiT to
all subsequent vj ’s as soon as qi becomes available), but other than that it
is equivalent to taking Algorithm 3.4 and only replacing rij ← qiT aj with
rij ← qiT vj . Here is Matlab code for MGS.

Algorithm 3.5 Modified Gram-Schmidt [TBI97, Alg. 8.1]


1: Given: A ∈ Rm×n with columns a1 , . . . , an ∈ Rm
2: for i in 1 . . . n do
3: vi ← ai
4: end for
5: for i in 1 . . . n do
6: rii ← kvi k
7: qi ← vi / rii
8: for j in (i + 1) . . . n do
9: rij ← qiT vj . Crucially, here we compute qiT vj , not qiT aj
10: vj ← vj − rij qi
11: end for
12: end for

function [Q, R] = modified_gram_schmidt(A)


% Modified Gram-Schmidt orthonormalization algorithm (better stability than CGS)

[m, n] = size(A);
Q = zeros(m, n);
R = zeros(n, n);

for j = 1 : n

v = A(:, j);

R(j, j) = norm(v);
Q(:, j) = v / R(j, j);

R(j, (j+1):n) = Q(:, j)' * A(:, (j+1):n);


A(:, (j+1):n) = A(:, (j+1):n) - Q(:, j) * R(j, (j+1):n);

end

end

MGS provides much better stability than CGS, but it is still not perfect.
In particular, while it is still true that A ≈ Q̂R̂, and indeed Q̂ is closer to
having orthonormal columns than when we use CGS, it may still be far from
orthonormality when A is poorly conditioned. We discuss fixes below. Before
we move on, ponder the following questions:

1. What is the computational cost of MGS?



2. What are the failure modes (assuming exact arithmetic)?

3. Given a matrix A ∈ Rm×n , do we always have existence of a factorization
A = Q̂R̂? Do we have uniqueness?

All the answers (with developments) are in [TBI97, Lec. 7–9]. In a nutshell:

1. The cost is dominated by the nested loop (instructions 9 and 10):


Σ_{i=1}^{n} Σ_{j=i+1}^{n} (4m − 1) = (4m − 1) · n(n − 1)/2 ∼ 2mn² flops;

2. Failure can only occur if we divide by 0, meaning rjj = 0 for some j.


This happens if and only if a⊥j = 0, that is, iff aj is in the span of
a1 , . . . , aj−1 , meaning the columns of A are not linearly independent.
Hence, mathematically, if A has full column rank, then MGS (and
CGS) succeed (in exact arithmetic).

3. If A has full column rank, we have existence of course (since the al-
gorithm produces such a factorization). As for uniqueness, the only
freedom we have (as a result of R̂ being upper triangular) is in the sign
of rjj . By forcing rjj > 0, the factorization is unique.

As a side note: if A is not full column rank, there still exists a Q̂R̂ factor-
ization, but it is not unique and its properties are different from the above:
see [TBI97, Thm. 7.1]. Furthermore, the reason we write the factorization as
A = Q̂R̂ instead of simply A = QR is because the latter notation is tradition-
ally reserved for the “full” QR decomposition (as opposed to the “economy
size” or “thin” or “partial” QR we have discussed here), in which Q ∈ Rm×m
is orthogonal (square matrix) and R ∈ Rm×n has an n × n upper triangular
block at the top, followed by rows of zeros. Matlab’s qr(A) produces the
full QR decomposition by default. To get the economy size, call qr(A, 0).

3.6 Least-squares via SVD


As conditioning of A is pushed to terrible values, all algorithms will break.
When facing particularly hard problems, it may pay to use the following
algorithm for least squares:

1. Compute the thin SVD of A: A = Û Σ̂V̂ T , where Û ∈ Rm×n , V̂ ∈ Rn×n


have orthonormal columns and Σ̂ = diag(σ1 , . . . , σn ).
Matlab: [U, S, V] = svd(A, 0);

2. Plug this into the normal equations and cancel terms using orthonor-
mality of Û , V̂ and invertibility of Σ̂:

AT Ax = AT b
V̂ Σ̂T Û T Û Σ̂V̂ T x = V̂ Σ̂T Û T b
Σ̂V̂ T x = Û T b
x = V̂ Σ̂−1 Û T b,

where Σ̂−1 = diag(1/σ1 , . . . , 1/σn ) is trivial to apply.

This algorithm is the most stable of all the ones we discuss in these notes.
The drawback is that it requires computing the SVD of A, which is typically
more expensive than computing a QR factorization: more on this later.
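To make step 2 concrete, here is a small pure-Python sketch (our own illustration, not from the notes; the helper names are made up). For simplicity we build a square 2 × 2 matrix from a known SVD, so Û, Σ̂ and V̂ are available by construction, and check that x = V̂ Σ̂−1 Û T b satisfies the normal equations. In practice you would obtain the SVD from a library routine such as Matlab's svd.

```python
import math

def matmul(A, B):
    # naive product of list-of-lists matrices
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Build A = U * diag(sig) * V^T from a known SVD (U is a rotation, V = I),
# so this tiny 2 x 2 example has its SVD available by construction.
c, s = math.cos(0.3), math.sin(0.3)
U = [[c, -s], [s, c]]
V = [[1.0, 0.0], [0.0, 1.0]]
sig = [2.0, 0.5]
A = matmul(U, matmul([[sig[0], 0.0], [0.0, sig[1]]], transpose(V)))

b = [[1.0], [2.0]]

# x = V * Sigma^{-1} * U^T * b  (Sigma^{-1} is trivial to apply)
Sinv = [[1.0 / sig[0], 0.0], [0.0, 1.0 / sig[1]]]
x = matmul(V, matmul(Sinv, matmul(transpose(U), b)))

# The normal equations A^T A x = A^T b hold up to round-off.
lhs = matmul(transpose(A), matmul(A, x))
rhs = matmul(transpose(A), b)
```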

3.7 Regularization
Poor conditioning is often fought with regularization (on top of using stable
algorithms as above). We do not discuss this much in this course; keywords
are: Tikhonov regularization, or regularized least squares.
The general idea is to minimize kAx − bk2 + λkxk2 , with some chosen
λ > 0, instead of merely minimizing kAx − bk2 . This intuitively promotes
smaller solutions x by penalizing (we say, regularizing) the norm of x. In
practice, this is done as follows: given a least squares problem Ax = b,
replace it with the larger least squares problem Ãx = b̃:
   
[ A ; √λ In ] x = [ b ; 0 ].        (3.7)

Here Ã = [ A ; √λ In ] denotes A with the rows √λ In appended, and b̃ = [ b ; 0 ].

As announced, it corresponds to minimizing kÃx − b̃k2 = kAx − bk2 + λkxk2 .


The normal equations involve ÃT Ã = AT A + λI, and the right hand side is
ÃT b̃ = AT b. Problem (3.7) is solved by applying any of the good methods
described above on Ã, b̃ directly. For example, you can apply MGS (suitably
“fixed” as described below if necessary) to compute a QR factorization of Ã.
What is the condition number of this new problem? It is easy to see that
the eigenvalues λi of ÃT Ã = AT A+λI are simply the squared singular values
of A shifted by λ. Thus, the singular values of à are:
σi (Ã) = √( λi (ÃT Ã) ) = √( σi (A)² + λ ).

Hence, the condition number of à is


κ(Ã) = √( (σmax (A)² + λ) / (σmin (A)² + λ) ).

Clearly, for λ = 0, κ(Ã) = κ(A), while for larger values of λ, the condition
number of à decreases, thus reducing numerical difficulties. On the other
hand, larger values of λ change the problem (and its solution): which value
of λ is appropriate depends on the application.
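As a quick sanity check of the construction (a hypothetical small example, sketched in Python rather than Matlab), one can verify numerically that stacking √λ In under A indeed produces ÃT Ã = AT A + λIn , as claimed:

```python
import math

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
lam = 0.1

# Atil = [A ; sqrt(lam) * I_2], as in (3.7)
Atil = A + [[math.sqrt(lam), 0.0], [0.0, math.sqrt(lam)]]

def gram(M):
    # M^T M for a list-of-rows matrix M
    n = len(M[0])
    return [[sum(M[r][i] * M[r][j] for r in range(len(M)))
             for j in range(n)] for i in range(n)]

G, Gtil = gram(A), gram(Atil)
# Gtil equals G + lam * I_2, up to round-off in sqrt(lam)**2
```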

3.8 Fixing MGS: twice is enough


MGS works significantly better than CGS, but it can still break for poorly
conditioned A. One easy way to improve MGS is known as the twice is
enough trick. The goal is to have Q̂ indeed very close to having orthonormal
columns, while preserving A ≈ Q̂R̂ (which MGS already delivers in practice).
It works as follows.
If A has full column rank but is ill-conditioned, then computing

A −→ Q̂1 R̂1

using MGS results in A ≈ Q̂1 R̂1 yet columns of Q̂1 are not quite orthonormal.
Importantly though, Q̂1 is much better conditioned than A: after all, even
though it is not quite orthogonal, it is much closer to being orthogonal than
A itself, and orthogonal matrices have condition number 1 in the 2-norm.
The trick is to apply MGS a second time, this time to Q̂1 :
Q̂1 −→ Q̂2 R̂2 .

This time, we expect columns of Q̂2 to be very nearly orthonormal. Furthermore, since Q̂1 ≈ Q̂2 R̂2 ,

A ≈ Q̂1 R̂1 ≈ Q̂2 (R̂2 R̂1 ).

Defining Q̂ = Q̂2 and R̂ = R̂2 R̂1 (which is indeed upper triangular because it
is a product of two upper triangular matrices) typically results in an excellent
QR factorization of A. In the experiments on Blackboard, this QR is as
good as the one built into Matlab as the function qr. (The latter is based
on Householder triangularization.) It is, however, slower, since Householder
triangularization is cheaper than two calls to MGS (by some constant factor).
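The whole trick fits in a few lines. Here is a pure-Python sketch (our own illustration; the mgs helper is a minimal implementation in the spirit of Algorithm 3.5—in Matlab you would simply call modified_gram_schmidt twice):

```python
from math import sqrt

def mgs(A):
    # Modified Gram-Schmidt (in the spirit of Algorithm 3.5); A given as rows.
    m, n = len(A), len(A[0])
    V = [row[:] for row in A]
    Q = [[0.0] * n for _ in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = sqrt(sum(V[r][i] ** 2 for r in range(m)))
        for r in range(m):
            Q[r][i] = V[r][i] / R[i][i]
        for j in range(i + 1, n):
            R[i][j] = sum(Q[r][i] * V[r][j] for r in range(m))  # q_i^T v_j
            for r in range(m):
                V[r][j] -= R[i][j] * Q[r][i]
    return Q, R

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1.0, 1.0], [1e-4, 0.0], [0.0, 1e-4]]   # nearly parallel columns

Q1, R1 = mgs(A)          # first pass: A ~ Q1 R1
Q2, R2 = mgs(Q1)         # second pass: Q1 ~ Q2 R2
R = matmul(R2, R1)       # A ~ Q2 (R2 R1), and R is upper triangular

# Q2^T Q2 should now be the identity up to round-off,
QtQ = matmul([list(col) for col in zip(*Q2)], Q2)
# and Q2 * R should reproduce A.
A2 = matmul(Q2, R)
```

On this mildly ill-conditioned A, the columns of Q̂2 come out orthonormal to machine precision, and Q̂2 (R̂2 R̂1 ) reproduces A.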

3.9 Solving least-squares with MGS directly


Another “fix” is described in Lecture 19 of [TBI97]. It consists not in fixing
the QR factorization itself, but rather in using it differently for the purpose
of solving a least squares problem. Specifically, go through the reasoning of
plugging A = Q̂R̂ in the normal equations again, but not using orthonormal-
ity:
AT Ax = AT b
R̂T Q̂T Q̂R̂x = R̂T Q̂T b
Q̂T Q̂R̂x = Q̂T b. (3.8)

At this point, instead of canceling Q̂T Q̂, we could solve these systems:
Q̂T Q̂y = Q̂T b
R̂x = y.

We would expect this to fare better, because Q̂T Q̂ ought to be better con-
ditioned than AT A. However, this would be fairly expensive, as the system
Q̂T Q̂ does not have particularly favorable structure (such as triangularity).
Instead, we do the following: augment the matrix A with the vector b,
and compute a QR factorization of that matrix (for example, using simple
MGS):
Ã = [ A b ] −→ Q̃R̃    (via MGS).

Consider also the QR factorization A = Q̂R̂. Since this factorization is
unique, it must be that the QR factorization of Ã admits the following form:

Ã = [ A b ] = [ Q̂ q ] [ R̂ r ; 0 α ] = [ Q̂R̂ Q̂r + αq ],

with Q̃ = [ Q̂ q ] and R̃ = [ R̂ r ; 0 α ], where q ∈ Rm and α ∈ R. (Side note:
Gram–Schmidt proceeds column by column; for the first n columns, the fact
we appended b to the matrix has no effect, so we get the exact same
factorization at first. This is why Q̂ and R̂ appear in Q̃ and R̃.) Consider
the last column of this matrix equality:

b = Q̂r + αq.

Mathematically, Q̂T q = 0. Numerically, this might not be quite true, but,


intuitively, making that assumption should be safer than making the strong
(and false) assumption Q̂T Q̂ ≈ I. Thus, we expect the following mathemat-
ical identity to hold approximately numerically:
Q̂T b = Q̂T Q̂r.

Plug this into (3.8):

Q̂T Q̂R̂x = Q̂T b = Q̂T Q̂r.

Simply using that Q̂T Q̂ is expected to be invertible (a far safer assumption


than orthonormality), we find:

R̂x = r.
 
In summary: compute the QR factorization Q̃R̃ of [ A b ], extract the top-
left block of R̃, namely, R̂, and extract the top-right column of R̃, namely,
r; then solve the triangular system R̂x = r. This turns out to yield a
numerically well-behaved solution to the least-squares problem.
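Here is a pure-Python sketch of that recipe (our own illustration; the mgs helper mirrors Algorithm 3.5): append b to A, factor the augmented matrix once, then solve R̂x = r by back substitution.

```python
from math import sqrt

def mgs(A):
    # Modified Gram-Schmidt (in the spirit of Algorithm 3.5); A given as rows.
    m, n = len(A), len(A[0])
    V = [row[:] for row in A]
    Q = [[0.0] * n for _ in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = sqrt(sum(V[r][i] ** 2 for r in range(m)))
        for r in range(m):
            Q[r][i] = V[r][i] / R[i][i]
        for j in range(i + 1, n):
            R[i][j] = sum(Q[r][i] * V[r][j] for r in range(m))
            for r in range(m):
                V[r][j] -= R[i][j] * Q[r][i]
    return Q, R

# Least squares: fit c0 + c1*t to the points (0,1), (1,2), (2,2).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]

# QR of the augmented matrix [A b]; we only need Rtil.
Atil = [row + [b[i]] for i, row in enumerate(A)]
_, Rtil = mgs(Atil)

n = 2
Rhat = [Rtil[i][:n] for i in range(n)]   # top-left n x n block of Rtil
r = [Rtil[i][n] for i in range(n)]       # top part of the last column of Rtil

# Back substitution for Rhat x = r.
x = [0.0] * n
for i in range(n - 1, -1, -1):
    x[i] = (r[i] - sum(Rhat[i][j] * x[j] for j in range(i + 1, n))) / Rhat[i][i]
```

For this small line-fitting example, x matches the solution of the normal equations AT Ax = AT b, namely x = (7/6, 1/2).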
Chapter 4

Solving simultaneous nonlinear equations

In this chapter, we are interested in solving the following problem.

Problem 4.1 (Simultaneous nonlinear equations). Given f : Rn → Rn , con-


tinuous on a non-empty, closed domain D ⊆ Rn , find ξ ∈ D such that
f (ξ) = 0.

This is a generalization of the very first problem we encountered in this


course. Notice that solving systems of linear equations as we did in the
previous chapter is also a special case, by considering f (x) = Ax − b. These
notes follow the narrative in class—see Chapter 4 in [SM03] for the full story.
Ponder for a moment why we require f to be continuous.
For example, consider solving f (x) = 0 with f : R2 → R2 defined as:
 
f (x) = [ f1 (x1 , x2 ) ; f2 (x1 , x2 ) ],

where
f1 (x1 , x2 ) = (x1² + x2² − 9)/9 ,
f2 (x1 , x2 ) = −x1 + x2 − 1.

The locus f1 = 0, that is, the set of points x ∈ R2 such that f1 (x) = 0 is the
circle of radius 3 centered at the origin. The locus f2 = 0 is the line passing
through the points (0, 1) and (−1, 0). The two loci intersect at two points:
these are the solutions to the system f (x) = 0. See Figure 4.1.
We could solve this particular system analytically, but in general this task
may prove difficult. Thus, we set out to study general algorithms to compute



Figure 4.1: Loci f1 (x) = 0 and f2 (x) = 0. Intersections solve f (x) = 0.

roots of f in n dimensions. Looking back at our work for single nonlinear


equations, consider the algorithms we covered then and ponder: which of
these have a good chance to generalize? Which seem difficult to generalize?

4.1 Simultaneous iteration


As for simple iteration, the core idea is: given f as above, pick g : Rn →
Rn with the property that f (ξ) = 0 ⇐⇒ g(ξ) = ξ. Then, root finding
becomes equivalent to computing fixed points. If furthermore g(D) ⊆ D
(which means: for any x ∈ D, g(x) ∈ D), given x(0) ∈ D we can set up a
sequence:

x(k+1) = g(x(k) ).

Notice we used superscripts to index elements of a sequence now, so that
subscripts still index entries of a vector, as usual. Hence, xi(k) is the ith entry
of the kth element of the sequence.
One way to create functions g in one dimension was relaxation, which
considers

g(x) = x − λf (x)

for some nonzero λ ∈ R. Let’s see how this fares on our example.

%% Plot the loci


t = linspace(0, 2*pi, 2501);
plot(3*cos(t), 3*sin(t), 'LineWidth', 2);
hold all;
t = linspace(-5, 5, 251);
plot(t, 1+t, 'LineWidth', 2);
axis equal;
xlabel('x_1'); ylabel('x_2');

xlim([-5, 5]);
ylim([-5, 5]);

%% Run simultaneous iteration


lambda = 0.05; % try different values here; what happens?
f = @(x) [ (x(1)^2 + x(2)^2 - 9)/9 ; -x(1) + x(2) - 1 ];
g = @(x) x - lambda*f(x);
x = [1 ; 0];

plot(x(1), x(2), 'k.', 'MarkerSize', 15);


fprintf('%.8g %.8g \n', x(1), x(2));

for k = 1 : 100

x = g(x);

plot(x(1), x(2), 'k.', 'MarkerSize', 15);


fprintf('%0.8g %0.8g \n', x(1), x(2));

end

f(x)

This creates the following (truncated) output, with Figure 4.2.

1 0
1.0444444 0.1
1.0883285 0.19722222
1.1315321 0.29177754
1.173946 0.38376527
1.2154714 0.4732743
1.2560194 0.56038416
1.2955105 0.64516592
1.3338739 0.72768315
1.3710475 0.80799268
1.4069774 0.88614543
1.4416172 0.96218702

1.4749279 1.0361585
% ...
1.6138751 2.6717119
1.6097494 2.66882
1.6057833 2.6658665
1.6019756 2.6628623
1.5983247 2.659818
1.5948288 2.6567433
1.5914856 2.6536476
1.588293 2.6505395
1.5852484 2.6474272
1.582349 2.6443182
1.5795921 2.6412198
1.5769746 2.6381384
1.5744933 2.6350802
1.5721451 2.6320509

ans = % this is f(x) at the last iterate

0.0444
0.0599


Figure 4.2: Relaxation with λ = 0.05, x(0) = (1, 0)T .

It seems to converge to the root in the positive orthant. Play around



with the parameters. You should find that it easily diverges. I never saw it
converge to the other root.
Thus, there is at least hope for simultaneous iteration. To understand
convergence, we turn toward the contraction mapping theorem we had in
one dimension, and try to generalize it to Rn .

4.2 Contractions in Rn
We shall prove Theorem 4.1 in [SM03].

Theorem 4.2 (Contraction mapping theorem in Rn ). Suppose D is a non-


empty, closed subset of Rn , g : Rn → Rn is continuous on D, and g(D) ⊆ D.
Suppose further that g is a contraction on D (in some norm). Then, g has a
unique fixed point ξ in D and simultaneous iteration converges to ξ for any
x(0) in D.

To make sense of this, we need to introduce a notion of contraction in
Rn . This is Definition 4.2 in [SM03].

Definition 4.3. Suppose g : Rn → Rn is defined on a closed subset D of Rn .


If there exists L such that

kg(x) − g(y)k ≤ Lkx − yk

for all x, y ∈ D for some vector norm k · k, we say g satisfies a Lipschitz


condition on D. L is the Lipschitz constant in that norm. In particular, if
L ∈ (0, 1) and g(D) ⊆ D, g is a contraction on D in that norm.

Here are some typical vector norms that are often useful, called the 2-
norm, 1-norm and ∞-norm respectively:

• kxk2 = √(x1² + · · · + xn²),

• kxk1 = |x1 | + · · · + |xn |,

• kxk∞ = maxi∈{1,...,n} |xi |.
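For instance, for x = (3, −4)T these three norms are easy to evaluate (a quick pure-Python check, our own example):

```python
from math import sqrt

x = [3.0, -4.0]

norm2 = sqrt(sum(xi * xi for xi in x))   # 2-norm: 5.0
norm1 = sum(abs(xi) for xi in x)         # 1-norm: 7.0
norm_inf = max(abs(xi) for xi in x)      # inf-norm: 4.0
```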

See [TBI97, Lec. 3] for further details, including the associated subordinate
matrix norms: you should be comfortable with these notions. A couple of
remarks are in order.

1. If g satisfies a Lipschitz condition, then g is continuous on D. Often,


we say g is Lipschitz continuous.

2. All norms in Rn are equivalent, that is, if k · ka , k · kb are any two vector
norms, there exist constants c, C > 0 such that for any vector x, we
have

ckxkb ≤ kxka ≤ Ckxkb .

Here is one consequence: if kx(k) −ξk → 0 in some norm, then the same
is true in any other norm.

3. If g satisfies a Lipschitz condition on D in any norm, then it satisfies it


in all norms, possibly with another constant—can you prove it? Impor-
tantly, g may be a contraction in some norm, yet not be a contraction
in another norm (and vice versa).

Proof of Theorem 4.2. The theorem has three claims:

1. Existence of a fixed point ξ;

2. Uniqueness of the fixed point ξ;

3. Convergence of simultaneous iteration to ξ.

If we assume existence for now, then we can easily show parts 2 and 3.
The technical part is proving existence, which we will do later. (Side note:
the key point here is to notice that in the proofs of uniqueness and convergence,
we assume existence of a fixed point. Thus, these proofs do not achieve much
on their own: we need to prove existence separately.)

Uniqueness. Assume there exists a fixed point ξ ∈ D, that is, g(ξ) = ξ.
For contradiction, assume there exists another fixed point η ∈ D. Then,

kξ − ηk = kg(ξ) − g(η)k ≤ Lkξ − ηk.

Since ξ ≠ η by assumption, we can divide by kξ − ηk and get L ≥ 1, which


contradicts the fact g is a contraction.

Convergence. Still assume there exists a fixed point ξ. Then,

kx(k+1) − ξk = kg(x(k) ) − g(ξ)k ≤ Lkx(k) − ξk ≤ Lk+1 kx(0) − ξk.

Since L ∈ (0, 1), we deduce

limk→∞ kx(k) − ξk ≤ limk→∞ Lk kx(0) − ξk = 0,

so that x(k) → ξ. We even showed the error kx(k) − ξk converges to 0 at least


linearly.

To prove existence of a fixed point ξ, we use the notion of Cauchy sequence.
(Side note: for a single equation, we proved existence using the intermediate
value theorem; that does not generalize to higher dimensions, which is why
we need the extra work.)

Definition 4.4. A sequence x(0) , x(1) , x(2) , . . . in Rn is a Cauchy sequence in
Rn if for any ε > 0 there exists k0 (which usually depends on ε) such that:

∀k, m ≥ k0 , kx(m) − x(k) k ≤ ε.

This is independent of the choice of norm, since norms are equivalent in Rn .

Here are a few facts related to this notion, which we shall not prove.

1. Rn is complete, that is: every Cauchy sequence in Rn converges in Rn .

2. If all iterates of a Cauchy sequence are in a closed set D ⊆ Rn , then


the limit ξ exists and is in D—see [SM03, Lemma 4.1].

3. If g is continuous on D and the sequence x(0) , x(1) , . . . in D converges to
ξ in D, then limk→∞ g(x(k) ) = g( limk→∞ x(k) ) = g(ξ)—this is a direct
consequence of continuity of g, see [SM03, Lemma 4.2].

We can now proceed with the proof.


Proof of Theorem 4.2, continued. Existence. All that is left to do to prove
existence of a fixed point is to show the simultaneous iteration sequence is
Cauchy. Indeed, if it is, then the second point above asserts limk→∞ x(k) =
ξ ∈ D, and the third point yields
 
ξ = limk→∞ x(k+1) = limk→∞ g(x(k) ) = g( limk→∞ x(k) ) = g(ξ),

so that ξ is a fixed point in D. To establish that the sequence is Cauchy,


we must control the distance between any two iterates. Without loss of
generality, let m > k. By the triangle inequality:

kx(m) − x(k) k = kx(m) − x(m−1) + x(m−1) − x(m−2) + · · · + x(k+1) − x(k) k


≤ kx(m) − x(m−1) k + kx(m−1) − x(m−2) k + · · · + kx(k+1) − x(k) k.

All of these terms are of the same form. Let’s look at one of them:

kx(k+1) − x(k) k = kg(x(k) ) − g(x(k−1) )k ≤ Lkx(k) − x(k−1) k.

By induction,

kx(k+1) − x(k) k ≤ Lk kx(1) − x(0) k.



Applying this to each term in the sum above, we get

kx(m) − x(k) k ≤ (Lm−1 + Lm−2 + · · · + Lk ) kx(1) − x(0) k
             = (Lm−1−k + Lm−2−k + · · · + 1) Lk kx(1) − x(0) k.




The geometric sum is easily understood (side note: let S = 1 + L + L2 + · · · ;
notice that LS = S − 1, hence S = 1/(1 − L)):

1 + L + L2 + · · · + Lm−2−k + Lm−1−k ≤ 1 + L + L2 + · · · = 1/(1 − L).

Hence,

kx(m) − x(k) k ≤ ( kx(1) − x(0) k / (1 − L) ) Lk .
The fraction is just some number. Since L ∈ (0, 1), for any ε, we can pick
k0 such that the right hand side is less than ε for all k ≥ k0 (and m > k as
assumed earlier); thus, the sequence is Cauchy.

4.3 Jacobians and convergence


For a single nonlinear equation, we showed that if g : R → R is continuously
differentiable, then the asymptotic behavior of simple iteration near a fixed
point ξ ultimately depends on |g′(ξ)| alone. In particular, if |g′(ξ)| < 1, then
g is locally a contraction around ξ, and |g′(ξ)| dictates the asymptotic rate
of convergence.
Here, we want to develop analogous understanding of contractions in Rn .
The analog of g′(ξ) will be the Jacobian of g at ξ (a matrix), and the analog
of the magnitude (absolute value) will be to take a (subordinate) matrix
norm. First, a multivariable calculus reminder [SM03, Def. 4.3]:
Definition 4.5. Let g : Rn → Rn be continuous on an open neighborhood1
N (x) of x ∈ Rn . Suppose all first partial derivatives ∂gi /∂xj exist at x. The
Jacobian matrix Jg (x) of g at x is the n × n matrix with elements

(Jg (x))ij = ∂gi /∂xj (x),    1 ≤ i, j ≤ n.
A few remarks.
1. Jg : Rn → Rn×n associates a matrix to each vector x (where it is de-
fined).
1 An open neighborhood of x in some set D is any open subset of D which contains x.

2. If g is differentiable at x (for example, if the partial derivatives are


continuous in N (x)), then Jg (x) represents the differential of g at x,
that is:

g(x + h) = g(x) + Jg (x)h + E(x, h),

where E(x, h) “goes to zero faster than h when h goes to zero”; that
is:
limh→0 kE(x, h)k / khk = 0,
for any vector norm k · k.
3. What this limit means is: for any tolerance ε > 0 (of our own choosing),
there exists a bound δ > 0 (perhaps very small) such that the fraction
is smaller than ε provided khk ≤ δ. Explicitly:

∀ε > 0, ∃δ > 0 such that if khk ≤ δ, then kE(x, h)k ≤ εkhk.

4. If g is twice differentiable at x, then by Taylor we have kE(x, h)k ≤


ckhk2 for some constant c, provided h is sufficiently small. We write:

g(x + h) = g(x) + Jg (x)h + O(khk2 ). (4.1)

Recalling the concept of subordinate matrix norm, we can understand


how the Jacobian at a fixed point ξ dictates local behavior of simultaneous
iteration. Assume g is differentiable at ξ. Let x(k) = ξ + h and think of h as
small enough so that kE(ξ, h)k ≤ εkhk—we will pick ε momentarily. Then,

kx(k+1) − ξk = kg(x(k) ) − g(ξ)k = kg(ξ + h) − g(ξ)k


= kg(ξ) + Jg (ξ)h + E(ξ, h) − g(ξ)k
= kJg (ξ)h + E(ξ, h)k
(triangle inequality) ≤ kJg (ξ)hk + kE(ξ, h)k
(subordinate norm) ≤ kJg (ξ)kkhk + εkhk
= ( kJg (ξ)k + ε ) kx(k) − ξk.


We can ensure x(k+1) is closer to ξ than x(k) is, in the chosen norm,
if we can make kJg (ξ)k + ε < 1. Surely, if kJg (ξ)k ≥ 1, then we lose. But, as
soon as kJg (ξ)k < 1, there exists ε > 0 small enough such that kJg (ξ)k + ε
is still strictly less than 1. And since we get to pick ε, we can do that. The
only downside is that it may require us to force khk to be very small, that
is: this may only be useful if x(k) is very close to ξ. Here is the take-away:

If ξ is a fixed point of g and g is differentiable at ξ, and if


kJg (ξ)k < 1 in some subordinate matrix norm, then there ex-
ists a (possibly very small) neighborhood of ξ such that, if x(0)
is in that neighborhood, then simultaneous iteration converges to
ξ. Furthermore, the errors kx(k) − ξk converge to zero at least
linearly. By extension, in that scenario, we say that the sequence
x(0) , x(1) , . . . converges to ξ at least linearly.

The discussion above essentially captures [SM03, Theorem 4.2].


An example to practice the calculus. Consider f as defined above and
the associated relaxation, as well as their Jacobians:
f (x) = [ (x1² + x2² − 9)/9 ; −x1 + x2 − 1 ],    g(x) = x − λf (x),

Jf (x) = [ 2x1 /9, 2x2 /9 ; −1, 1 ],    Jg (x) = I2 − λJf (x) = [ 1 − 2λx1 /9, −2λx2 /9 ; λ, 1 − λ ].

What is the norm of Jg at ξ in the positive orthant? We investigate this


numerically for a range of λ values, showing the results in Figure 4.3.

% Define the problem


f = @(x) [ (x(1)^2 + x(2)^2 - 9)/9 ; -x(1) + x(2) - 1 ];
Jf = @(x) [ 2*x(1)/9, 2*x(2)/9 ; -1, 1 ];

% Get a root with Matlab's fsolve (we will develop our own algorithm below)
x0 = [1 ; 2];
xi = fsolve(f, x0);

% Jacobian of g(x) = x - lambda*f(x) at xi, as a function of lambda
Jgxi = @(lambda) eye(2) - lambda * Jf(xi);

% Plot norm(Jgxi) for a range of values of lambda, and for different norms.
figure;
hold all;
for p = [1, 2, inf]
handle = ezplot(@(lambda) norm(Jgxi(lambda), p), [-.1, .8]);
set(handle, 'LineWidth', 1.5);
end
ylim([.9, 1.3]);
legend('1-norm', '2-norm', '\infty-norm');
title('$\|J_g(\xi)\|_p$ for $g(x) = x - \lambda f(x)$', ...
    'Interpreter', 'Latex');

[Figure: kJg (ξ)kp versus λ for p = 1, 2, ∞; the 2-norm curve dips below 1, reaching about 0.949 at λ ≈ 0.330.]

Figure 4.3: The root of f in the positive orthant is indeed a stable fixed
point for g with λ between 0 and about 0.6, as indicated by the 2-norm of
the Jacobian. After the fact, we find that λ = 0.33 would have been a better
choice than λ = 0.05, but this is hard to know ahead of time.

A word about different norms. Notice that the contraction mapping the-
orem asserts convergence to a unique fixed point provided g is a contraction
in some norm. It is important to remark that it is sufficient to have the
contraction property in one norm to ascertain convergence. Indeed, as the
example below indicates, it may be that g is not a contraction in some norm,
but is a contraction in another norm. The other way around: observing
kJg (ξ)k > 1 in any number of norms is not enough to rule out convergence
(contrary to the one-dimensional case where we only had to check |g′(ξ)|).

f = @(x) [ (x(1)^2 + x(2)^2 - 9)/9 ; -x(1) + x(2) - 1 ];


Jf = @(x) [ (2*x(1))/9, (2*x(2))/9 ; -1, 1 ];

lambda = 0.05;
g = @(x) x - lambda*f(x);
Jg = @(x) eye(2) - lambda*Jf(x);

% Get a root with Matlab's fsolve (we will develop our own algorithm below)
x0 = [1 ; 2];
xi = fsolve(f, x0);

% Display three subordinate matrix norms of the Jacobian at xi.


for p = [1, 2, inf]
fprintf('%g-norm of Jg(xi) = %.4g\n', p, norm(Jg(xi), p));

end

1-norm of Jg(xi) = 1.0330


2-norm of Jg(xi) = 0.9867
Inf-norm of Jg(xi) = 1.0110

Thus, g is locally a contraction around the positive-orthant fixed point in


the 2-norm, but not in the 1-norm or ∞-norm. That is fine: the fact it
is a contraction in the 2-norm is sufficient to guarantee convergence. The
large value 0.987 also supports the observation that convergence is slow in
the 2-norm.
Running the same code with xi = fsolve(f, [-1, -2]) to investi-
gate the other fixed point yields:

1-norm of Jg(xi) = 1.078


2-norm of Jg(xi) = 1.041
Inf-norm of Jg(xi) = 1.046

We have kJg (ξ)k > 1 in all three norms. We cannot conclude from this
observation, but it does decrease our hope to see simple iteration with g
converge to that root.

4.4 Newton’s method


We now vastly broaden the class of relaxation we consider. Instead of con-
sidering relaxations of the form g(x) = x − λf (x) for λ 6= 0 in R, why not
allow ourselves to consider

g(x) = x − M f (x),

where M ∈ Rn×n is a nonsingular matrix? It is still true that g(x) = x ⇐⇒


f (x) = 0 (verify it).
This generalization turns out to be beneficial even for simple examples.
Indeed, consider computing roots of f below with the former scheme:
f (x) = [ x1 /2 ; −x2 /2 ],    g(x) = x − λf (x),

Jf (x) = [ 1/2, 0 ; 0, −1/2 ],    Jg (x) = I2 − λJf (x) = [ 1 − λ/2, 0 ; 0, 1 + λ/2 ].

At any point x, for any choice of λ, the norm of the Jacobian is at least 1
in the 2-norm (and also in the 1-norm and in the ∞-norm). Yet, if we allow
ourselves to use a matrix M , then setting M = [ 1, 0 ; 0, −1 ] yields:

1
g(x) = x − M f (x), Jg (x) = I2 − M Jf (x) = I2 .
2
The 2-norm (and 1-norm, and ∞-norm) of this matrix is 1/2 < 1, hence, local
convergence is guaranteed.
Based on our understanding of the critical role of a small Jacobian at ξ,
ideally, we would pick M such that

0 = Jg (ξ) = In − M Jf (ξ), hence, M = (Jf (ξ))−1 ,

assuming the inverse exists. Of course, we do not know ξ, let alone (Jf (ξ))−1 .
Reasoning that at iteration k our best approximation for ξ is x(k) (as far as
we know at that point), we are led to Newton’s method:

Definition 4.6 (Newton’s method in Rn ). For a differentiable function f ,


given x(0) ∈ Rn , Newton’s method generates the sequence

x(k+1) = x(k) − (Jf (x(k) ))−1 f (x(k) ), k = 0, 1, 2, . . .

This is equivalent to simultaneous iteration with

g(x) = x − (Jf (x))−1 f (x).

It is implicitly assumed all Jacobians encountered are nonsingular.

One important remark is that one should not compute the inverse of
Jf (x(k) ). Instead, solve the linear system implicitly defined by the iteration
(using LU with pivoting for example, possibly via Matlab’s backslash):

Jf (x(k) ) (x(k+1) − x(k) ) = −f (x(k) )

(playing the roles of A, x and b, respectively, in a linear system Ax = b).

Add the solution to x(k) to obtain x(k+1) . This is faster and numerically more
accurate than computing the inverse of the matrix.
Importantly, with Newton’s method, we can get any root of f where the
Jacobian is nonsingular. On our original example (intersection of circle and
line), initializing Newton’s method at various points can yield convergence
to either root.

f = @(x) [ (x(1)^2 + x(2)^2 - 9)/9 ; -x(1) + x(2) - 1 ];


Jf = @(x) [ 2*x(1)/9, 2*x(2)/9 ; -1, 1 ];

x = [1 ; 2]; % initialize

for k = 1 : 8

fx = f(x);
Jfx = Jf(x);

x = x - (Jfx\fx); % Newton's step: solve a linear system

fprintf('x = [%+.2e, %+.2e], f(x) = [%+.2e, %+.2e];\n', ...


x(1), x(2), fx(1), fx(2));

end

x = [+1.67e+00, +2.67e+00], f(x) = [-4.44e-01, +0.00e+00];


x = [+1.56e+00, +2.56e+00], f(x) = [+9.88e-02, +0.00e+00];
x = [+1.56e+00, +2.56e+00], f(x) = [+2.34e-03, +2.22e-16];
x = [+1.56e+00, +2.56e+00], f(x) = [+1.44e-06, -2.22e-16];
x = [+1.56e+00, +2.56e+00], f(x) = [+5.51e-13, -2.22e-16];
x = [+1.56e+00, +2.56e+00], f(x) = [+0.00e+00, +0.00e+00];
x = [+1.56e+00, +2.56e+00], f(x) = [+0.00e+00, +0.00e+00];
x = [+1.56e+00, +2.56e+00], f(x) = [+0.00e+00, +0.00e+00];

Here is the output if we initialize at (−1, −2)T :

x = [-3.00e+00, -2.00e+00], f(x) = [-4.44e-01, -2.00e+00];


x = [-2.60e+00, -1.60e+00], f(x) = [+4.44e-01, +0.00e+00];
x = [-2.56e+00, -1.56e+00], f(x) = [+3.56e-02, +0.00e+00];
x = [-2.56e+00, -1.56e+00], f(x) = [+3.22e-04, +2.22e-16];
x = [-2.56e+00, -1.56e+00], f(x) = [+2.75e-08, -2.22e-16];
x = [-2.56e+00, -1.56e+00], f(x) = [+1.97e-16, -2.22e-16];
x = [-2.56e+00, -1.56e+00], f(x) = [+0.00e+00, +2.22e-16];
x = [-2.56e+00, -1.56e+00], f(x) = [+0.00e+00, +0.00e+00];

Convergence is fast! We will prove quadratic convergence momentarily.


Here is what that means in Rn [SM03, Def. 4.6].

Definition 4.7. Suppose x(0) , x(1) , x(2) , . . . converges to ξ ∈ Rn . We say the


sequence converges with at least order q > 1 if the errors kx(k) − ξk converge

to 0 with at least order q, that is, if there exists a sequence ε0 , ε1 , . . . > 0


converging to zero and µ ≥ 0 such that
∀k, kx(k) − ξk ≤ εk ,    and    limk→∞ εk+1 /(εk )q = µ,

for some arbitrary norm in Rn . We similarly define convergence with order q,


at least quadratic convergence, and quadratic convergence by applying those
definitions to the sequence of error norms kx(k) − ξk.
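To see the definition in action, here is a one-dimensional illustration (our own example, not from the notes): Newton's method for f (x) = x² − 2 converges to √2, and the ratios εk+1 /εk² settle near a constant µ, as (at least) quadratic convergence with q = 2 requires.

```python
from math import sqrt

f = lambda x: x * x - 2.0
fprime = lambda x: 2.0 * x

xi = sqrt(2.0)   # the root we converge to
x = 3.0
errors = []
for _ in range(5):
    x = x - f(x) / fprime(x)       # Newton step
    errors.append(abs(x - xi))

# The ratios e_{k+1} / e_k^2 level off near mu = |f''(xi) / (2 f'(xi))| ~ 0.354.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(len(errors) - 1)]
```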

The main theorem about Newton’s method follows [SM03, Thm. 4.4].

Theorem 4.8. Let f : R^n → R^n be three times continuously differentiable,


and let ξ be a root of f . Assume Jf (ξ) is nonsingular. Then, provided x(0) is
sufficiently close to ξ, Newton’s method converges to ξ at least quadratically.

Proof. The proof is in two parts. First, we establish convergence. Then, we


establish the speed of convergence.

Convergence. At a high level, we really only need to investigate the
norm of Jg(ξ). Since g(x) = x − (Jf(x))^{−1} f(x), the chain rule yields:

    Jg(x) = I_n − [the derivative of (Jf(x))^{−1}] f(x) − (Jf(x))^{−1} Jf(x),

where we are not too precise about the term in the bracket for now. From
this and the fact that f(ξ) = 0, we deduce that Jg(ξ) = 0. We now give
further details about this computation. Define A(x) = (Jf(x))^{−1}. This is a
differentiable function of x at ξ.² (This is similar to the observation that
x ↦ 1/a(x) is differentiable at x provided a(x) ≠ 0 and a is differentiable.)
Since g(x) = x − A(x)f(x), we have:

    g_i(x) = x_i − Σ_{k=1}^n a_{ik}(x) f_k(x),

    ∂g_i/∂x_j (x) = δ_{ij} − Σ_{k=1}^n [ ∂a_{ik}/∂x_j (x) f_k(x) + a_{ik}(x) ∂f_k/∂x_j (x) ],

where we used the Kronecker delta notation: δ_{ij} = 1 if i = j, and δ_{ij} = 0 otherwise.
² To see this, notice first that Jf(x) is a differentiable function of x by assumption. By
Cramer's rule, the inverse of Jf(x) is a matrix whose entries are polynomials (determi-
nants) of the entries of Jf(x), divided by the determinant of Jf(x). Provided Jf(x) is
invertible, this is indeed differentiable in x.

The term Σ_{k=1}^n ∂a_{ik}/∂x_j (x) f_k(x) (thankfully) vanishes at x = ξ since f(ξ) = 0.
Focus on the other term, which is nothing but the inner product between the
ith row of A(x) and the jth column of Jf(x):

    Σ_{k=1}^n a_{ik}(x) ∂f_k/∂x_j (x) = Σ_{k=1}^n [A(x)]_{ik} [Jf(x)]_{kj}
                                      = [A(x) Jf(x)]_{ij}
                                      = [(Jf(x))^{−1} Jf(x)]_{ij}
                                      = [I_n]_{ij}
                                      = δ_{ij}.

Thus, Jg(ξ) = 0, as expected. Regardless of the choice of norm, ‖Jg(ξ)‖ = 0 < 1,
which by our earlier considerations implies there exists a neighborhood of ξ
such that if x^(0) is in that neighborhood, then Newton's method converges
to ξ.
More explicitly, consider (4.1) again: there exist constants c, δ > 0 such
that, provided ‖h‖ ≤ δ,

    g(ξ + h) = g(ξ) + Jg(ξ)h + E(ξ, h) = g(ξ) + E(ξ, h),

where ‖E(ξ, h)‖ ≤ c‖h‖². With h = x^(k) − ξ, if ‖x^(k) − ξ‖ ≤ δ,

    ‖x^(k+1) − ξ‖ = ‖g(x^(k)) − g(ξ)‖
                  = ‖g(ξ + h) − g(ξ)‖
                  = ‖E(ξ, h)‖
                  ≤ c‖h‖²
                  = c‖x^(k) − ξ‖².    (4.2)

(Side note: in [SM03, Thm. 4.4], the proof is organized differently, so that it
is f and not g that one expands in a Taylor series. This has the advantage
that they only need f to be twice continuously differentiable. Here, since we
need g to be twice continuously differentiable, we need f to be three times
continuously differentiable.)

In particular, if c‖x^(k) − ξ‖ ≤ 1/2 (we could also have chosen another constant
strictly less than 1), it follows that ‖x^(k+1) − ξ‖ ≤ (1/2)‖x^(k) − ξ‖. Thus, if
‖x^(0) − ξ‖ ≤ min(δ, 1/(2c)), then the same is true of all x^(k) (why?) and
Newton's method converges to ξ (at least linearly).³

³ We omitted to verify that Newton's equation is well defined, that is, that Jf(x^(k)) is
invertible for all k under some conditions on x^(0). The argument goes as follows: Jf(x) is
assumed to be a continuous function of x, hence its determinant is a continuous function
of x. Since the determinant is nonzero at ξ by assumption, it must be nonzero in a neigh-
borhood of ξ by continuity, hence Newton's method is well defined in that neighborhood.
Details are in [SM03, Thm. 4.4]: we omit them for simplicity.

At least quadratic convergence. Consider (4.2) again. Under the
condition above on x^(0), it holds for all k that

    ‖x^(k+1) − ξ‖ / ‖x^(k) − ξ‖² ≤ c.

Thus, taking the limit,

    lim_{k→∞} ‖x^(k+1) − ξ‖ / ‖x^(k) − ξ‖² ≤ c.

This establishes at least quadratic convergence. (Note: this convergence is


quadratic if the limit equals a positive number; if it equals 0 (which is not
excluded here), then the convergence is faster than quadratic.)
A word to conclude. It is a common misconception that the above theo-
rem allows this conclusion: “Newton’s method converges to the root which
is closest to x(0) .” This is completely false, in so many ways:

1. What does “closest” mean? In what norm?

2. The above theorem does not guarantee convergence in general.

3. Rather, it says: around each nondegenerate4 root ξ, there is a (possibly


very small) neighborhood of ξ such that initializing in that neighbor-
hood guarantees (fast) convergence to that root.

4. Initializing outside that neighborhood, anything can happen: conver-


gence to ξ anyway (possibly slow at first), convergence to another root
(possibly far away), or even divergence.

Consider Figure 4.4, known as a Newton fractal, to anchor this remark. De-
spite this warning, Newton's method is one of the most useful algorithms in
numerical analysis, as it allows one to refine crude approximations to high accuracy.

⁴ That is, such that the Jacobian of f at that root is nonsingular.

Figure 4.4: Figure by Henning Makholm, available at https://en.


wikipedia.org/wiki/Newton_fractal. Newton fractal generated by the
polynomial f (z) = z 3 − 2z + 2, where f : C → C can also be thought of as
a function f : R2 → R2 by separating real and imaginary part. The image
area covers a square with side length 4 centered on 0. Corner pixels repre-
sent ±2 ± 2i exactly. The three roots are given by roots([1 0 -2 2]) as
-1.7693 + 0.0000i, 0.8846 + 0.5897i, 0.8846 - 0.5897i. The three
shades of yellow/green pixel colors indicate which root Newton's method converges
to if initialized there.
takes to get closer than 10−3 (in complex absolute value) to a root. Red
color suggests no convergence (divergence or cycling). Notice how sensitive
the limit point can be to small changes in initialization.
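The sensitivity is easy to reproduce in a few lines (a sketch of ours, in Python): iterating z ← z − f(z)/f′(z) for f(z) = z³ − 2z + 2 converges quickly when started inside a basin, while starting at z = 0 falls into the cycle 0 → 1 → 0 → ⋯ and never converges (a red region of the fractal).

```python
# Newton's iteration for f(z) = z^3 - 2z + 2 over the complex numbers.
f = lambda z: z**3 - 2*z + 2
df = lambda z: 3*z**2 - 2

def newton(z, iters=60):
    for _ in range(iters):
        z = z - f(z) / df(z)
    return z

r1 = newton(-2.0)          # converges to the real root near -1.7693
r2 = newton(0.9 + 0.6j)    # converges to the root near 0.8846 + 0.5897i
zc = newton(0.0, iters=8)  # trapped in the 2-cycle 0 -> 1 -> 0 -> ...
print(r1, r2, zc)
```

Note that the cycle at z = 0 is exact: f(0)/f′(0) = −1 sends 0 to 1, and f(1)/f′(1) = 1 sends 1 back to 0.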
Chapter 5

Computing eigenvectors and eigenvalues

In this chapter, we study algorithms to compute eigenvectors and eigenvalues


of matrices. We follow a mixture of Chapter 5 in [SM03], Lectures 24–27
and 30–31 in [TBI97], and some extra material.
Problem 5.1. Given a matrix A ∈ R^{n×n} (or in C^{n×n}), an eigenpair of A is
a pair (λ, x) such that x ∈ C^n is nonzero, λ ∈ C, and

    Ax = λx.
The eigenproblem is that of computing one or all eigenpairs of A.
Notice that

    λ is an eigenvalue of A
    ⇐⇒ Ax = λx has a solution x ≠ 0
    ⇐⇒ (A − λI_n)x = 0 has a solution x ≠ 0
    ⇐⇒ A − λI_n is not invertible
    ⇐⇒ det(A − λI_n) = 0.
This leads to the notion of characteristic polynomial.
Definition 5.2. Given a square matrix A of size n, the characteristic poly-
nomial of A, defined by

    p_A(λ) = det(A − λI_n),

is a polynomial of degree n whose roots are the eigenvalues of A. Thus, by
the fundamental theorem of algebra, A always has exactly n eigenvalues in
the complex plane (counting multiplicities).



Example 5.3. For example, consider A = [ 2 1 ; −3 3 ]. Its characteristic
polynomial is

    p_A(λ) = det(A − λI₂) = det [ 2−λ, 1 ; −3, 3−λ ] = λ² − 5λ + 9.

Its two complex roots are 2.5 ± i·1.658 . . .: these are the eigenvalues of A.
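This small computation is easy to verify numerically; a quick check in Python/NumPy (ours, mirroring the example):

```python
import numpy as np

# The eigenvalues of A should match the roots of p_A(lambda) = lambda^2 - 5*lambda + 9.
A = np.array([[2.0, 1.0], [-3.0, 3.0]])
eigs = np.sort_complex(np.linalg.eigvals(A))
rts = np.sort_complex(np.roots([1.0, -5.0, 9.0]))  # coefficients of p_A
print(eigs)  # 2.5 +/- 1.6583i
```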

Remark 5.4. Matlab provides the function poly which returns the coeffi-
cients of p_A in the monomial basis. Notice that Matlab defines the charac-
teristic polynomial as det(λI_n − A), which is equal to (−1)^n p_A(λ). Of
course, this does not change the roots.

Remark 5.5. Owing to work by Abel (later complemented by Galois), it


is known since the 19th century that there does not exist any closed-form
formula for the roots of a polynomial
√ of degree five or more using only basic
arithmetic operations (+, −, ×, /, ·). Furthermore, for any polynomial p of
degree n, one can easily construct a matrix A of size n such that pA = p
(a so-called companion matrix). Consequently, no finite-time algorithm can
compute the eigenvalues of a general matrix of size n ≥ 5: in this chapter,
we must use iterative algorithms. This is in stark contrast with other linear
algebra problems we covered such as solving Ax = b and factoring A = QR.

One obvious idea to compute the eigenvalues of A is to obtain the char-
acteristic polynomial, that is, compute a₀, . . . , a_n such that

    p_A(λ) = a_n λ^n + · · · + a₁λ + a₀,

then to compute the roots of p_A (for example, using the iterative algorithms
we covered to solve nonlinear equations). This turns out to be a terrible idea.

n = 50; % try 50, 200, 400
A = randn(n)/sqrt(n); % scaling to (mostly) keep eigs in disk of radius 1.5

p = poly(A); % The problem is already here,
d1 = roots(p); % not so much here.
d2 = eig(A); % We can trust this one.

subplot(1,2,1);
plot(real(d1), imag(d1), 'o', 'MarkerSize', 8); hold all;
plot(real(d2), imag(d2), 'x', 'MarkerSize', 8); hold off;
xlim([-1.3, 1.3]);
ylim([-1.3, 1.3]);
pbaspect([1,1,1]);
title('Eigenvalues of A and computed roots of p_A');
xlabel('Real part');
ylabel('Imaginary part');
legend('Computed roots', '"True" eigenvalues');

subplot(1,2,2);
stem(n:-1:0, abs(p));
pbaspect([1.6,1,1]);
xlim([0, n]);
set(gca, 'YScale', 'log');
title('Coefficients of p_A(\lambda)');
xlabel('Absolute coefficient of \lambda^k for k = 0...n');
ylabel('Absolute value of coefficient');

Running the code with n = 50 and 400 produces Figures 5.1 and 5.2. As
can be seen from the figures, the coefficients of p_A in the monomial basis
(1, λ, λ², . . .) grow out of proportion already for small values of n: there are
many orders of magnitude between them, far more than 16. Even if we find
an excellent algorithm to compute the coefficients in IEEE arithmetic and to
compute the roots of the polynomial from there, the round-off error on the
coefficients alone is already too much to preserve a semblance of accuracy.
This is because the conditioning of the problem “given the coefficients of
a polynomial in the monomial basis, find the roots of that polynomial” is
very bad. In other words: small perturbations of the coefficients may lead to
large perturbations of the roots. We claim this here without proof. We will
develop other ways.
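The same phenomenon is behind Wilkinson's classic example, which we can sketch in Python (this example is ours, not from [SM03]): a relative perturbation of 10⁻¹⁰ in a single coefficient of ∏_{k=1}^{20}(x − k) moves the computed roots visibly away from the integers 1, . . . , 20.

```python
import numpy as np

# Coefficients of the degree-20 polynomial with roots 1, 2, ..., 20.
true_roots = np.arange(1, 21)
p = np.poly(true_roots)          # monomial-basis coefficients

q = p.copy()
q[1] *= 1 + 1e-10                # tiny relative change to the x^19 coefficient

# Largest distance from a perturbed root to the nearest true root:
drift = max(np.min(np.abs(r - true_roots)) for r in np.roots(q))
print(drift)
```

The drift is orders of magnitude larger than the perturbation itself: the coefficients-to-roots map is badly conditioned.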

Remark 5.6. Computing the eigenvalues of A by first computing its charac-


teristic polynomial pA is a terrible idea. Yet, the other way around is a good
idea: given a polynomial p, building a so-called companion matrix A such
that pA and p have the same roots, then using an eigenproblem algorithm to
compute the eigenvalues of A is a good strategy to compute the roots of p.
Type edit roots in Matlab to see that this is the strategy used by Matlab.

5.1 The power method


In this section, we discuss one of the simplest algorithms available to compute
the largest eigenvalue (in magnitude) of a matrix A ∈ Cn×n , and an asso-
ciated eigenvector. For convenience, we assume A is diagonalizable, which
is the case for almost all matrices in Cn×n . That is, we assume there exists


Figure 5.1: Computing the eigenvalues of A as the roots of its character-


istic polynomial pA with n = 50. Left: actual roots and computed roots
in complex plane; Right: coefficients of pA in absolute value (logarithmic
scale). The “true” eigenvalues are those computed by Matlab’s eig: they
are trustworthy.


Figure 5.2: Same as Figure 5.1 with n = 400: the computed roots are com-
pletely off. The strange pattern of the roots of the computed characteristic
polynomial is not too relevant for us. As a side note on that topic, notice that
the roots of a polynomial with random Gaussian coefficients follow the same
kind of pattern: x = roots(randn(401, 1)); plot(real(x), imag(x), '.'); axis equal;

V ∈ C^{n×n} invertible and D ∈ C^{n×n} diagonal such that

    A = V DV^{−1}.

In other words,

    AV = V D,

which implies D = diag(λ₁, . . . , λ_n) contains the eigenvalues of A and v_k
(the kth column of V) is an eigenvector associated to λ_k. Without loss of
generality, we can assume ‖v_k‖₂ = 1 for all k.

Assumption 0 A is diagonalizable.
Let the eigenvalues be ordered in such a way that

|λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |. (5.1)

Thus, λ1 is a largest eigenvalue in magnitude, with eigenvector v1 : these are


our targets.
The power iteration is a type of simple (fixed-point) iteration. For a given
nonzero x^(0) ∈ C^n, it generates a sequence of iterates as follows:

    x^(k+1) = g(x^(k)),  for k = 0, 1, 2, . . . ,  where    (5.2)
    g(x) = Ax / ‖Ax‖₂.    (5.3)
The 2-norm for complex vectors is defined by ‖u‖₂ = √(u*u) = √(Σ_{i=1}^n |u_i|²),
where u* denotes the complex Hermitian conjugate-transpose of a vector (or
matrix).

Question 5.7. Show that the fixed points of g are eigenvectors of A.


Question 5.8. Show by induction that x^(k) = A^k x^(0) / ‖A^k x^(0)‖₂ for k = 1, 2, . . .

Thus, to understand the behavior of the power iteration, we must un-


derstand the behavior of Ak x(0) . Using the relation A = V DV −1 , we can
establish the following:

Ak = (V DV −1 )(V DV −1 ) · · · (V DV −1 )
= V D(V −1 V )D(V −1 V )D · · · (V −1 V )DV −1
= V Dk V −1 . (5.4)

Equivalently:

    A^k V = V D^k,    (5.5)

where the matrix D^k is diagonal, with entries λ₁^k, . . . , λ_n^k.


The initial iterate x(0) can always be expanded in the basis of eigenvectors
v1 , . . . , vn with some coefficients c1 , . . . , cn ∈ C, so that

x(0) = c1 v1 + · · · + cn vn = V c. (5.6)

Thus,

    A^k x^(0) = A^k V c = V D^k c = (c₁λ₁^k)v₁ + · · · + (c_nλ_n^k)v_n.    (5.7)

Let us factor out the (complex) number λ₁^k from this expression:

    A^k x^(0) = λ₁^k ( c₁v₁ + Σ_{j=2}^n c_j (λ_j/λ₁)^k v_j ).    (5.8)

Now come the crucial assumptions for the power method:

Assumption 1 The largest eigenvalue is strictly larger than all others:


|λ1 | > |λ2 |.

Assumption 2 The initial iterate x^(0) aligns at least somewhat with v₁,
that is: c₁ ≠ 0.¹
Under these conditions, it is clear that (λ_j/λ₁)^k → 0 as k → ∞ for
j = 2, . . . , n, so that, asymptotically, A^k x^(0) is aligned with v₁: the dominant
eigenvector. More precisely,

    x^(k) = A^k x^(0) / ‖A^k x^(0)‖₂
          = (λ₁^k / |λ₁^k|) · ( c₁v₁ + Σ_{j=2}^n c_j (λ_j/λ₁)^k v_j ) / ‖ c₁v₁ + Σ_{j=2}^n c_j (λ_j/λ₁)^k v_j ‖₂.

If λ₁ is a positive real number, then λ₁^k / |λ₁^k| = 1 for all k so that the limit exists:

    lim_{k→∞} x^(k) = lim_{k→∞} c₁v₁ / ‖c₁v₁‖₂ = (c₁/|c₁|) v₁.    (5.9)

(Here, we use ‖v₁‖₂ = 1.)
¹ If A is Hermitian (A = A*), this is equivalent to saying: x^(0) is not orthogonal to v₁.
For a general matrix, it is equivalent to saying: x^(0) does not lie in the subspace spanned
by v₂, . . . , v_n. If x^(0) is taken as a (complex) Gaussian random vector, this is satisfied with
probability 1.

The complex number c₁/|c₁| has modulus one: its presence is indicative of the
fact that unit-norm eigenvectors are defined only up to phase: if v₁ is an
eigenvector, then so are −v₁, iv₁, −iv₁ and all other vectors of the form e^{iθ}v₁
for any θ ∈ R. The important point is that x^(k) asymptotically aligns with v₁,
up to an unimportant complex phase. More generally, if λ₁ is not a positive
real number, then the limit of x^(k) does not exist because the phase may keep
changing due to the term λ₁^k/|λ₁^k|; this is inconsequential as it does not affect the
direction spanned by x^(k).²
The convergence rate is dictated by the ratio |λ₂|/|λ₁|. If this is close to 1,
convergence is slow; if this is close to 0, convergence is fast. In all cases,
unless this ratio is zero, the convergence is linear since the error decays as
(|λ₂|/|λ₁|)^k times a constant.
If after k iterations v = x^(k) is deemed a reasonable approximate dominant
eigenvector of A, we can extract an approximate corresponding eigenvalue
using the Rayleigh quotient:

    Av ≈ λ₁v,  thus  v*Av ≈ λ₁ v*v,  so  λ₁ ≈ v*Av / ‖v‖₂².    (5.10)
For A real and symmetric, see Theorem 5.13 in [SM03] or Theorem 27.1 in
[TBI97] for guarantees on the approximation quality.
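The whole discussion fits in a few lines of code; here is a sketch in Python/NumPy (the test matrix, with eigenvalues planted via an orthogonal similarity, is our own):

```python
import numpy as np

# Build a symmetric matrix with known eigenvalues 5, 2, 1.5, ... so the
# ratio |lambda_2| / |lambda_1| = 0.4 guarantees fast linear convergence.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
d = np.array([5.0, 2.0, 1.5, 1.0, 0.5, 0.3, 0.2, 0.1])
A = Q @ np.diag(d) @ Q.T             # the eigenvector for 5.0 is Q[:, 0]

# Power iteration (5.2)-(5.3):
x = rng.standard_normal(8)
x /= np.linalg.norm(x)
for _ in range(100):
    x = A @ x
    x /= np.linalg.norm(x)

lam = x @ A @ x                      # Rayleigh quotient (5.10); x has unit norm
print(lam)
```

Since (0.4)^100 is negligible, the iterate aligns with the dominant eigenvector to machine precision.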

5.2 Inverse iteration


The power method is nice, but it has two main shortcomings:

1. It can only converge to a dominant eigenpair; and

2. Convergence speed is dictated by the ratio |λ₂|/|λ₁|: it could be slow.
: it could be slow.

The central idea behind inverse iteration is to use the power method on a
different matrix to address both these issues.
For a diagonalizable matrix A = V DV^{−1} and a given µ ∈ C, observe the
following:

    A − µI_n = V DV^{−1} − µV V^{−1} = V (D − µI_n)V^{−1}.

² In comparison, observe that lim_{k→∞} x^(k)(x^(k))* = v₁v₁*: this limit exists even if λ₁ is
not real positive.

The right-hand side is a diagonalization of the left-hand side; thus: the
eigenvalues of A − µI_n are λ₁ − µ, . . . , λ_n − µ, and both A and A − µI_n have
the same eigenvectors. Now, take the inverse:

    (A − µI_n)^{−1} = V (D − µI_n)^{−1} V^{−1} = V diag( 1/(λ₁−µ), . . . , 1/(λ_n−µ) ) V^{−1}.    (5.11)

Thus, (A − µI_n)^{−1} shares eigenvectors with A, and the eigenvalues of (A −
µI_n)^{−1} are 1/(λ₁−µ), . . . , 1/(λ_n−µ). Hence, if we know how to pick µ closer to a desired
simple eigenvalue λ_ℓ than to any other, then running the power method on
(A − µI_n)^{−1} will produce convergence to an eigenvector associated to λ_ℓ.
Indeed, 1/(λ_ℓ−µ) is then the largest eigenvalue in absolute value. Furthermore,
if we are really good at picking µ close to λ_ℓ, we can get an excellent linear
convergence rate.
For a given A ∈ Cn×n , µ ∈ C and x(0) ∈ Cn (nonzero, often taken at
random), the inverse power iteration executes the following:

    y^(k+1) = (A − µI_n)^{−1} x^(k),    (5.12)
    x^(k+1) = y^(k+1) / ‖y^(k+1)‖₂.    (5.13)

Each iteration involves solving a system of linear equations in M = A − µIn .


Since this is the same matrix over and over again, it pays to do the following:

1. Compute the LU factorization of M, that is, PM = LU, once for O(n³)
flops; then

2. Once per iteration, solve the system M y^(k+1) = x^(k) using the LU
factorization for only O(n²) flops each time.
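A compact sketch of inverse iteration in Python/NumPy (the matrix is a made-up example with planted eigenvalues; for brevity we call a dense solver each time rather than reusing an LU factorization):

```python
import numpy as np

# Symmetric test matrix with known eigenvalues.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
d = np.array([4.0, 2.5, 1.0, -0.5, -2.0, -3.5])
A = Q @ np.diag(d) @ Q.T

mu = 0.9                   # closest eigenvalue is 1.0: that is our target
x = rng.standard_normal(6)
x /= np.linalg.norm(x)
for _ in range(50):        # iterate (5.12)-(5.13)
    y = np.linalg.solve(A - mu * np.eye(6), x)
    x = y / np.linalg.norm(y)

lam = x @ A @ x            # Rayleigh quotient of the computed eigenvector
print(lam)
```

The dominant eigenvalue of (A − 0.9 I)⁻¹ is 1/0.1 = 10, roughly 14 times larger in magnitude than the next one, so convergence is very fast.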

Question 5.9. Compare the costs of inverse iteration and the power method.

Numerically, one aspect of inverse iteration screams for attention. Indeed,


it appears that we have an incentive to pick µ as close as possible to λ` ; yet,
this will undoubtedly deteriorate the condition number of A − µIn . In fact,
if we manage to set µ = λ` —which seems ideal—then the matrix is not even
invertible. How could one hope to solve the linear system (5.12) with good
accuracy? It turns out that this is not a problem.
Reconsider the linear system to be solved in inverse iteration:

(A − µIn )y = b.

If b is perturbed to become b + δb, then y is perturbed as well to y + δy and


these perturbations are related by:

(A − µIn )δy = δb.

Using A = V DV −1 , remember we have

(A − µIn )−1 = V (D − µIn )−1 V −1 .

Define the vector u = V^{−1}δb with entries u₁, . . . , u_n. Then,

    δy = (A − µI_n)^{−1} δb
       = V (D − µI_n)^{−1} V^{−1} δb
       = V (D − µI_n)^{−1} u
       = Σ_{i=1}^n u_i/(λ_i − µ) v_i.

If µ ≈ λ_ℓ for some particular ℓ (and µ is significantly different from all other
eigenvalues), the corresponding term in this sum dominates and we get

    δy ≈ u_ℓ/(λ_ℓ − µ) v_ℓ.

Crucially, this perturbation δy in the solution y is primarily aligned with v_ℓ;
but v_ℓ is exactly what we intend to compute! As inverse iteration converges, y
aligns with v_ℓ, and so does most of the perturbation δy. Upon normalizing y
to get the next iterate, the effects of ill-conditioning are essentially forgotten.³

³ IEEE arithmetic can still run into trouble here, specifically due to under/overflow.
If this happens, Inf's and NaN's will appear in the solution y: the reader should think
carefully about what to do when that happens.

5.3 Rayleigh quotient iteration

One point in particular can be much improved regarding inverse iteration:
the choice of µ, which is supposed to approximate λ_ℓ. First of all, it is not
clear that one can always know a good value for µ ahead of time. Second,
it makes sense that, as we iterate and x^(k) converges, we should be able to
exploit that to improve µ. The Rayleigh quotient, already defined in (5.10),
helps do exactly that.

The main idea behind Rayleigh quotient iteration (RQI) is to use the
Rayleigh quotient at each iteration to redefine µ. Given A and x^(0) (with the
latter often chosen at random), RQI iterates as follows:

    µ_k = (x^(k))^T A x^(k) / (x^(k))^T x^(k),    (5.14)
    y^(k+1) = (A − µ_k I_n)^{−1} x^(k),    (5.15)
    x^(k+1) = y^(k+1) / ‖y^(k+1)‖₂.    (5.16)

(You can also choose to initialize µ0 independently of x(0) to favor convergence


to an eigenpair with eigenvalue close to µ0 : think about it.)
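A sketch of RQI in Python/NumPy (again on a made-up symmetric matrix with planted eigenvalues); which eigenpair it converges to depends on the initialization, so we only check that it lands on some eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
d = np.array([3.0, 1.0, 0.5, -1.0, -2.0])
A = Q @ np.diag(d) @ Q.T

x = rng.standard_normal(5)
x /= np.linalg.norm(x)
for _ in range(15):
    mu = x @ A @ x                                   # (5.14); x has unit norm
    try:
        y = np.linalg.solve(A - mu * np.eye(5), x)   # (5.15)
    except np.linalg.LinAlgError:
        break                                        # mu hit an eigenvalue exactly
    x = y / np.linalg.norm(y)                        # (5.16)

mu = x @ A @ x
print(mu)
```

Note how near-singularity of A − µ_k I_n is harmless here, exactly as discussed for inverse iteration: the huge solution y is normalized away.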
Note that, contrary to the situation for inverse iteration, it is not clear if
and where RQI converges: we do not analyze it in this course. When RQI
converges, for real symmetric matrices (or complex Hermitian), it often does
so cubically. This is extremely fast. On the other hand, now µ_k changes at
each iteration, which means we can no longer precompute an LU factorization
of the linear system once and for all: it seems as if each iteration must cost
O(n³) flops.
Fortunately, this computational burden can be reduced by transforming A
to a so-called Hessenberg form.⁴ In this course, we only focus on a particular
case: when A is real and symmetric, transforming it to Hessenberg form
means making it tridiagonal. We discuss algorithms for this task later in
this chapter. For O(n³) flops, such algorithms produce an orthogonal matrix
P and a symmetric, tridiagonal matrix H such that

    A = P H P^T.

In Matlab, this is built-in as [P, H] = hess(A).
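In Python, scipy.linalg.hessenberg plays the same role; a quick sketch (the random symmetric input is our own example; symmetry makes the Hessenberg form come out tridiagonal):

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                    # make A symmetric

H, P = hessenberg(A, calc_q=True)    # A = P @ H @ P.T; H is tridiagonal here
print(np.round(H, 3))
```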


To understand how the eigenvectors and eigenvalues of A relate to those
of the simpler matrix H, it is useful to remember the spectral theorem: if A
is real and symmetric,
1. A is diagonalizable with real eigenvalues and real eigenvectors; and

2. The eigenvectors v1 , . . . , vn can be taken to form an orthonormal basis.


In other words, there exists D = diag(λ1 , . . . , λn ) (real) and V ∈ Rn×n
orthogonal such that

    A = V DV^T,    (5.17)

⁴ A matrix is in (upper) Hessenberg form if A_{ij} ≠ 0 ⟹ j ≥ i − 1.

and λ1 ≥ · · · ≥ λn (where the ordering makes sense because the eigenvalues


are real).
Question 5.10. Show that if (λ, x) is an eigenpair of H, then (λ, P x) is an
eigenpair of A.
It is left to you as an exercise to figure out how best to use tridiago-
nalization in conjunction with RQI to reduce the cost of each iteration to a
mere O(n) flops. Whether or not the precomputation is worth the effort may
depend on the total number of RQI iterations you intend to run with A.
Remark 5.11. In treating the power method, inverse iteration and RQI,
we never acknowledged the possibility of the targeted eigenvalue having (geo-
metric) multiplicity greater than 1. Analyzing these methods in that scenario
takes more care, beyond the scope of this introductory course. You are encour-
aged to experiment numerically to get a sense of how these methods behave
in that case.

5.4 Eigenvalues of a symmetric, tridiagonal matrix: Sturm sequences
When we approached solving Ax = b, we considered the simplified problem of
solving a triangular system first, then showed how one can reduce any system
to a pair of triangular systems. In the same spirit, we now consider the
particular problem of computing the eigenvalues of a symmetric, tridiagonal
matrix, and later we will show how to reduce any symmetric matrix to a
tridiagonal one without changing its eigenvalues.
Problem 5.12. Given a symmetric, tridiagonal matrix T ∈ Rn×n , compute
some or all of its eigenvalues λn ≤ · · · ≤ λ1 .
Notice that we do not care about the eigenvectors for now. Using inverse
iteration or RQI, you should be able to recover eigenvectors efficiently after
computing eigenvalues.
Let the matrix T be defined as:

    T = [ a₁   b₂
          b₂   a₂   b₃
               ⋱    ⋱    ⋱
                    b_{n−1}  a_{n−1}  b_n
                             b_n      a_n ] .    (5.18)
Without loss of generality, we can make the following assumption:

Assumption 5.13. For all i = 2, . . . , n, assume b_i ≠ 0.

Indeed, if any bk = 0, the eigenvalue problem separates into smaller ones


with the same structure.

 
Question 5.14. Let M = [ A 0 ; 0 B ] be a block-diagonal matrix with square blocks A and B.
Show that the eigenvalues of A are eigenvalues of M. Similarly, show that
the eigenvalues of B are eigenvalues of M. Deduce that the set of eigenvalues
of M is exactly the union of the sets of eigenvalues of A and B.

Question 5.15. What can you tell about the eigenvalues of T if b2 = 0? If


b3 = 0? If some bi = 0? Essentially, by observing that if any bi is zero the
matrix T takes on a block-diagonal form, you should be able to separate the
problem of computing the eigenvalues of T into smaller eigenvalue problems
with the same symmetric, tridiagonal structure.

Let Tk = T1:k,1:k denote the kth principal submatrix of T , so that T = Tn .


Each Tk is symmetric, tridiagonal of size k × k. This suggests we may be
able to analyze T through recurrence on the Tk ’s. In particular, since we
are interested in the eigenvalues of T which are the roots of its characteristic
polynomial, it makes sense to investigate it.
For each k, define the characteristic polynomial of Tk as:

pk (λ) = det(Tk − λIk ).

This is a polynomial of degree k. For example, for k = 5, expanding the
determinant along its last row (and then the remaining 4 × 4 determinant
along its last column), we find:

    p₅(λ) = det(T₅ − λI₅)
          = (a₅ − λ) det(T₄ − λI₄)
            − b₅ det [ a₁−λ, b₂, 0, 0 ; b₂, a₂−λ, b₃, 0 ; 0, b₃, a₃−λ, 0 ; 0, 0, b₄, b₅ ]
          = (a₅ − λ) p₄(λ) − b₅² p₃(λ).

Thus, p₅ is easily expressed in terms of the two previous polynomials, p₄ and
p₃. This suggests the following three-term recurrence:

    p₁(λ) = a₁ − λ,
    p₂(λ) = (a₂ − λ)(a₁ − λ) − b₂²,
    p_{k+1}(λ) = (a_{k+1} − λ) p_k(λ) − b_{k+1}² p_{k−1}(λ),  for k = 2, . . . , n − 1.    (5.19)

Equivalently, we can define⁵

    p₀(λ) = 1

and allow the recurrence (5.19) to run for k = 1, . . . , n − 1, based on p₀ and p₁.

Question 5.16. Verify that the recurrence (5.19) for k = 1 is valid with the
definition p0 (λ) = 1.

The following is important:

Using the recurrence relation, one can evaluate the characteristic


polynomials directly, without ever needing to figure out the coef-
ficients of the polynomials.
⁵ One way to understand the definition of p₀ is that it is the determinant of a 0 × 0
matrix, which is 1 for the same reason an empty product is 1.

This is important, since we know from the beginning of this chapter that the
coefficients of pn may be badly behaved. On the other hand, evaluating the
polynomials using the recurrence is fairly safe.
The following is the main theorem of this section. It tells us how we
can use the recurrence relation to determine where the eigenvalues of T are
located. Remember we assume b2 , . . . , bn are nonzero.

Theorem 5.17 (The Sturm sequence property, Theorem 5.9 in [SM03]).


For θ ∈ R, consider the Sturm sequence (p0 (θ), . . . , pn (θ)). The number of
agreements in sign between consecutive members of the sequence equals the
number of eigenvalues of T strictly greater than θ.

To count agreements in sign in the Sturm sequence, see Figure 5.6.⁶ The
following code illustrates this theorem; see also Figure 5.3.

%% The Sturm sequence property

%% Generate a symmetric, tridiagonal matrix
n = 6;
e = ones(n, 1);
T = spdiags([-e 2*e -e], -1:1, n, n);

% Using the recurrence relation, we can easily evaluate
% the characteristic polynomials of T and its principal
% submatrices for any given theta.
theta = 1.5;
q = zeros(n+1, 1);
q(1) = 1;              % p_0(x) = 1
q(2) = T(1,1) - theta; % p_1(x) = a_1 - x
for k = 2 : n
    % p_k(x) = (a_k - x)*p_{k-1}(x) - b_k^2 * p_{k-2}(x)
    a_k = T(k, k);
    b_k = T(k-1, k);
    q(k+1) = (a_k - theta)*q(k) - b_k^2*q(k-1);
end
stem(0:n, q);
xlim([-1, n+1]);
title(sprintf('Sturm sequence at \\theta = %g', theta));
set(gca, 'XTick', 0:n);

The proof of the Sturm sequence property relies crucially on the following
theorem. (Remember that, by the spectral theorem, the k roots of p_k are
real since T_k is symmetric.) We still assume all b_i's are nonzero.

⁶ The rule given in [SM03] is incorrect for Sturm sequences ending with a 0.


Figure 5.3: Sturm sequence at θ = 1.5 for the 6 × 6 tridiagonal matrix


with 2’s on the diagonal and −1’s above and below. Its eigenvalues are
0.1981, 0.7530, 1.5550, 2.4450, 3.2470, 3.8019. Exactly 4 are strictly larger
than θ, coinciding with the number of sign agreements in the sequence.

Theorem 5.18 (Cauchy’s interlace theorem, Theorem 5.8 in [SM03]). The


roots of pk−1 are real and distinct, and separate those of pk for k = 2, 3, . . . , n.
That is, (strictly) between any two consecutive roots of pk (also real and
distinct), there is exactly one root of pk−1 .
This theorem is illustrated in Figure 5.4, with a matrix T generated as:

%% Generate a specific symmetric, tridiagonal matrix


n = 6;
e = ones(n, 1);
T = spdiags([-e 2*e -e], -1:1, n, n);

Question 5.19. Write your own code to generate Figure 5.4, based on
the three-term recurrence. Notice that to evaluate pn (θ), you will evaluate
p0 (θ), p1 (θ), . . . , pn (θ), so that you will be able to plot all polynomials right
away. See if you can write your code to evaluate the three-term recurrence
at several values of θ simultaneously (using matrix and vector notations).
The Sturm sequence theorem allows us to find any desired eigenvalue of T
through bisection. Indeed, define

    ♯(x) = the number of eigenvalues of T strictly larger than x.    (5.20)


Figure 5.4: Characteristic polynomials of the six principal submatrices of the


6 × 6 tridiagonal matrix with 2’s on the diagonal and −1’s above and below.
The Cauchy Interlace Theorem explains why the eigenvalues of Tk and Tk+1
interlace.

This function can be computed easily using Sturm sequences. By Cauchy


interlacing, the eigenvalues of T are distinct. As a result, they are separated
on the real line as:

λn < λn−1 < · · · < λ3 < λ2 < λ1 . (5.21)

Say we want to compute λ_k. Start with a₀ < b₀ such that λ_k ∈ (a₀, b₀] (we
will see how to compute such initial bounds later). Consider the middle point
c₀ = (a₀ + b₀)/2 and compute ♯(c₀). Two things can happen:

1. Either ♯(c₀) ≥ k, indicating λ₁, . . . , λ_k > c₀, in particular λ_k > c₀; or

2. ♯(c₀) < k, indicating λ_k ≤ c₀.

In the first case, we determined that λ_k ∈ (c₀, b₀], while in the second case
we found λ_k ∈ (a₀, c₀]. In both cases, we found an interval half the length
of the original interval, still with the guarantee that it contains λ_k. Upon
iterating this procedure, we produce an interval of length |b₀ − a₀|/2^K in K
iterations, each of which involves a manageable number of operations. This

interval provides both an approximation of the eigenvalue and an error bound


for it.
Question 5.20. How many flops are required for one iteration of this bisec-
tion algorithm?

As a side note, observe that ♯(a) − ♯(b) gives the number of eigenvalues of T
in the interval (a, b].
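The whole procedure fits in a short program. Here is a sketch in Python (our own translation of the idea; we assume, as is generically true for the midpoints encountered, that no term of the Sturm sequence is exactly zero):

```python
import numpy as np

# Bisection for the eigenvalues of the n x n tridiagonal matrix with
# a_i = 2 on the diagonal and b_i = -1 off the diagonal (as in Figure 5.3).
n = 6
a = 2.0 * np.ones(n)     # diagonal entries a_1, ..., a_n
b = -1.0 * np.ones(n)    # off-diagonal entries; b[k] plays the role of b_{k+1}

def count_above(theta):
    """Number of eigenvalues strictly greater than theta, counted as the
    sign agreements in the Sturm sequence p_0(theta), ..., p_n(theta)."""
    p_prev, p = 1.0, a[0] - theta          # p_0 and p_1
    agree = 1 if p > 0 else 0              # p_0 = 1 is positive
    for k in range(1, n):                  # recurrence (5.19)
        p_prev, p = p, (a[k] - theta) * p - b[k]**2 * p_prev
        if (p > 0) == (p_prev > 0):
            agree += 1
    return agree

def eig_k(k, lo=-10.0, hi=10.0, iters=60):
    """Bisection for lambda_k, assuming lambda_k lies in (lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if count_above(mid) >= k:
            lo = mid                       # lambda_k > mid
        else:
            hi = mid                       # lambda_k <= mid
    return hi

approx = [eig_k(k) for k in range(1, n + 1)]
print(approx)
```

The computed values match the eigenvalues 0.1981, 0.7530, . . . , 3.8019 quoted in the caption of Figure 5.3.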
Proof of Theorem 5.18 (see also Theorem 5.8 in [SM03]). We prove Cauchy
interlacing by induction. To secure the base case, consider the roots of
p₁(λ) = a₁ − λ and those of p₂(λ) = (a₁ − λ)(a₂ − λ) − b₂². The unique
root of p₁ is a₁, while the two roots of p₂ are:

    [ (a₁ + a₂) ± √((a₁ + a₂)² − 4(a₁a₂ − b₂²)) ] / 2 = a₁ + [ (a₂ − a₁) ± √((a₂ − a₁)² + 4b₂²) ] / 2.

Using that b₂ ≠ 0, we deduce that √((a₂ − a₁)² + 4b₂²) > |a₂ − a₁|, and from
there it is easy to deduce that the two roots of p₂ are real and distinct, and
that a₁ (the unique root of p₁) lies strictly between them. Thus, the base
case of the induction holds.
We now proceed by induction. The induction hypothesis is that the roots
of pk−1 interlace those of pk , and that both have real, distinct roots (k − 1
and k respectively). The goal is to infer that pk+1 has k + 1 real, distinct
roots interlaced with those of pk . We do this in two steps: first, we show that
pk+1 has at least one root strictly between any two consecutive roots of pk :
this readily accounts for at least k − 1 roots. Then, we show that pk+1 has a
root strictly smaller than all roots of pk , and (by a similar argument) another
root strictly larger than all the roots of pk . All told, this locates k + 1 roots
of pk+1 as desired, which is all of them.
First, let us show that pk+1 admits at least one root between any two
consecutive roots of pk . The key argument is the intermediate value theorem:
Let α, β denote two consecutive roots of pk , with α < β. Evaluate pk+1 at
these two points:

pk+1(α) = (ak+1 − α) pk(α) − bk+1² pk−1(α) = −bk+1² pk−1(α),
pk+1(β) = (ak+1 − β) pk(β) − bk+1² pk−1(β) = −bk+1² pk−1(β).

By the intermediate value theorem, if pk+1 (α) and pk+1 (β) have opposite
signs, then pk+1 admits a root in (α, β). According to the above (using that bk+1 ≠ 0), this is the case exactly if pk−1(α) and pk−1(β) have opposite signs.
This in turn follows from the induction hypothesis: we assume here that the

roots of pk−1 and those of pk interlace, which implies that pk−1 has exactly
one root in (α, β). Hence, pk−1 (α) and pk−1 (β) have opposite signs as desired.
Apply this whole reasoning to each consecutive pair of roots of pk to locate k − 1 roots of pk+1: we only have two more to locate.
Second, we want to show that pk+1 has a root strictly smaller than the
smallest root of pk ; let us call the latter γ. The key observation is that all
the characteristic polynomials considered here go to positive infinity on the
left side, that is, for all r:

lim_{λ→−∞} pr(λ) = +∞.    (5.22)

Indeed, pr(λ) = det(Tr − λIr) = (−λ)^r + lower-order terms.7 Consider Figure 5.4 for confirmation. We use this observation as follows: consider the sign of

pk+1(γ) = (ak+1 − γ) pk(γ) − bk+1² pk−1(γ) = −bk+1² pk−1(γ).

If this is negative, then pk+1 must have a root strictly smaller than γ since
it must obey (5.22): pk+1 (λ) has to become positive eventually as λ → −∞.
According to the above, this happens exactly if pk−1(γ) is positive (using again that bk+1 ≠ 0). By the induction hypothesis, all the roots of pk−1 are
strictly larger than γ, and since pk−1 itself obeys (5.22), it follows that indeed
pk−1 (γ) > 0, as desired. This shows that pk+1 has at least one root strictly
smaller than γ.
To conclude, we need only show that pk+1 has a root strictly larger than
the largest root of pk. The argument is similar to the one above; the only significant difference is that we now use the fact that limλ→+∞ pk−1(λ) and limλ→+∞ pk+1(λ) are both infinite of the same sign.
Before we get into the proof of the Sturm sequence property, let us make
three observations about how zeros may appear in such sequences:

1. The first entry of the sequence is always + (in particular, it is not 0).

2. There can never be two consecutive zeros. Indeed, assume for contradiction that pk(θ) = pk−1(θ) = 0. Then, the recurrence implies 0 = −bk² pk−2(θ). Thus, pk−2(θ) = 0 (under our assumption that bk ≠ 0). Applying this same argument to pk−1(θ) = pk−2(θ) = 0 implies pk−3(θ) = 0, etc. Eventually, we conclude p0(θ) = 0, which is a contradiction since p0(θ) = 1 for all θ.
7 You can also see it from the recurrence relation (5.19), which shows the sign of the highest-order term changes at every step.

3. When a zero occurs before the end of the sequence, it is followed by the sign opposite the sign that preceded it. Specifically, the patterns (+, 0, −) and (−, 0, +) can occur, but the patterns (+, 0, +) and (−, 0, −) cannot. Indeed: if pk(θ) = 0 with k < n, then pk−1(θ) and pk+1(θ) are nonzero because of the previous point. Furthermore, the recurrence states pk+1(θ) = −bk+1² pk−1(θ), hence they have opposite signs.
We now give a proof of the main theorem. The proof differs somewhat
from [SM03, Theorem 5.9]. Specifically, we handle the case of zeros in the
Sturm sequence.
Proof of Theorem 5.17. The Sturm sequence property is a theorem about
two quantities. Let us give them a name:

sk(θ) = number of sign agreements in (p0(θ), . . . , pk(θ)), and
gk(θ) = number of roots of pk strictly larger than θ.

Our goal is to show that sn (θ) = gn (θ). By induction, we prove that sk (θ) =
gk (θ) for all k, which implies the result. The first step is to secure the base
case, k = 1. By definition,
s1(θ) = number of sign agreements in (1, a1 − θ) = 1 if a1 − θ > 0, and 0 otherwise.

Furthermore, since the unique root of p1 is a1,

g1(θ) = number of roots of p1 strictly larger than θ = 1 if a1 > θ, and 0 otherwise.

Clearly, g1(θ) = s1(θ) as desired. Now, under the induction hypothesis sk−1(θ) = gk−1(θ), let us prove that sk(θ) = gk(θ). In order to do so, notice that sk(θ) and gk(θ) can be expressed in terms of sk−1(θ) and gk−1(θ). Indeed,

sk(θ) = sk−1(θ) + 1 if (pk−1(θ), pk(θ)) agree in sign, and sk(θ) = sk−1(θ) otherwise.

Similarly, owing to the Cauchy interlacing theorem, for any θ, the difference gk(θ) − gk−1(θ) is either 0 or 1. Specifically,

gk(θ) = gk−1(θ) + 1 if, compared to pk−1, pk has one more root > θ, and gk(θ) = gk−1(θ) otherwise.

Thus, using the induction hypothesis sk−1 (θ) = gk−1 (θ), in order to show
that sk (θ) = gk (θ), we only need to verify that the following conditions are
in fact equivalent:

1. (pk−1 (θ), pk (θ)) agree in sign;

2. Compared to pk−1 , pk has one more root strictly larger than θ.

This is best checked on a drawing: see Figure 5.5. There are four cases to verify, using Cauchy interlacing and the fact that limλ→−∞ pr(λ) = +∞ for r = k − 1, k.

1. θ is a root of pk: then the sequence ends with (pk−1(θ), 0). There is no sign agreement, and indeed pk and pk−1 have the same number of roots strictly larger than θ.

2. θ < the smallest root of pk : then both pk−1 (θ) and pk (θ) are positive.
We have a sign agreement, and indeed pk has k roots strictly larger
than θ, which is one more than pk−1 .

3. θ > the largest root of pk : then pk−1 (θ) and pk (θ) (nonzero) have
opposite sign. We have no sign agreement, and indeed pk and pk−1
have the same number of roots strictly larger than θ, namely, zero.

4. θ ∈ (α, β), where α, β are two consecutive roots of pk. By Cauchy interlacing, pk−1 has exactly one root γ ∈ (α, β). There are three cases:

(a) θ ∈ (α, γ): there is no sign agreement since pk−1 (θ) and pk (θ)
(both nonzero) have opposite sign (check this by following the
sign patterns on a drawing). And indeed, since θ passed one more
root of pk than it passed roots of pk−1 , there are an equal number
of roots of pk and pk−1 strictly larger than θ.
(b) θ = γ: the sequence is (0, pk(θ)). There is a sign agreement, and indeed θ passed as many roots of pk as it passed roots of pk−1, so that pk has one more root strictly larger than θ;
(c) θ ∈ (γ, β): there is sign agreement (for the same reason that there
was no sign agreement in the first case), and there is one more
root of pk to the right of θ than there are roots of pk−1 .

Hence, sk (θ) = gk (θ), and by induction sn (θ) = gn (θ), as desired.



Figure 5.5: The Cauchy interlacing theorem completely characterizes the sign patterns of pk and pk−1, which allows us to relate the sign agreements in the Sturm sequence at θ to the number of roots of pk strictly larger than θ.

5.5 Locating eigenvalues: Gerschgorin


We cover two simple approaches to find rough bounds on the location of the
eigenvalues of matrices in the complex plane. For symmetric matrices, these
yield real intervals which contain the eigenvalues: such intervals can be used
to start a bisection based on Sturm sequences.

Technique 1. The first technique is based on subordinate norms. If (λ, x) is an eigenpair of A ∈ Cn×n, then λx = Ax, which implies (reading from the middle):

|λ| ‖x‖ = ‖λx‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖,    (5.23)

for any choice of vector norm ‖ · ‖. Since x is an eigenvector, it is nonzero and we get

|λ| ≤ ‖A‖,    (5.24)

where ‖A‖ is the norm of A, subordinate to the vector norm ‖ · ‖. For symmetric matrices, this means each eigenvalue is in the interval [−‖A‖, +‖A‖]. For this observation to be practical, it only remains to pick a vector norm whose subordinate matrix norm is straightforward to compute. A poor choice is the 2-norm (since the subordinate norm is the largest singular value: this is as hard to compute as an eigenvalue); good choices are the 1-norm and ∞-norm:

Any eigenvalue λ of A obeys |λ| ≤ ‖A‖1 and |λ| ≤ ‖A‖∞.


Figure 5.6: Diagram to count sign agreements in a Sturm sequence.


For example, the sequence (+, +, −, 0, +, 0, −, −, 0) has 4 sign agreements:
(+, +), (0, +), (0, −), (−, −). As argued in the text, there cannot be two
consecutive zeros. Furthermore, the patterns (+, 0, +) and (−, 0, −) can-
not occur. This suggests one way to handle zeros: replace them with
the sign opposite to the previous entry. With the above example, we get
(+, +, −, +, +, −, −, −, +), which indeed still has 4 sign agreements.
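As a quick sanity check on this zero-replacement rule, the count is easy to do programmatically; this small Python snippet (mine, not from the notes) reproduces the worked example above.

```python
def sign_agreements(seq):
    # Replace each zero by the sign opposite the previous entry (the first
    # entry of a Sturm sequence is never zero), then count adjacent equal signs.
    signs = []
    for v in seq:
        signs.append(v if v != 0 else -signs[-1])
    return sum(s == t for s, t in zip(signs, signs[1:]))

# The example from the caption has 4 sign agreements.
assert sign_agreements([+1, +1, -1, 0, +1, 0, -1, -1, 0]) == 4
```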

These bounds are indeed trivial to compute.


Question 5.21. Prove that ‖A‖∞ = max_{i=1,...,n} Σ_{j=1}^{n} |aij| and that ‖A‖1 = ‖AT‖∞.


Thus, we can start the Sturm bisection with the interval [−‖A‖1, ‖A‖1] or [−‖A‖∞, ‖A‖∞].8

Technique 2. The above strategy determines one large disk in the com-
plex plane which contains all eigenvalues. If the eigenvalues are spread out,
this can only give a very rough estimate of their location. A more refined
approach consists in determining a collection of (usually) smaller disks whose
union contains all eigenvalues. These disks are called Gerschgorin disks. The
code below produces Figure 5.7; then we will see how they are constructed.
8 The Sturm bisection technically works with intervals of the form (a, b], not [a, b]. Hence, if the smallest eigenvalue is targeted, one should check whether −‖A‖1 or −‖A‖∞ is a root of pn, simply by evaluating it there, or one can make the interval slightly larger.

%% Locating eigenvalues
A = [7 2 0 ; -1 8 1 ; 2 2 0];
e = eig(A);
plot(real(e), imag(e), '.', 'MarkerSize', 25);
xlabel('Real');
ylabel('Imaginary');
title('Eigenvalues of A');
xlim([-15, 15]);
ylim([-15, 15]);
axis equal;
hold on;

%%
% subordinate norms
circles(0, 0, norm(A, 1), 'FaceAlpha', .1);
circles(0, 0, norm(A, inf), 'FaceAlpha', .1);
% The code for 'circles' is here:
% https://www.mathworks.com/matlabcentral/ ...
% fileexchange/45952-circle-plotter

%%
% Gerschgorin disks
for k = 1 : size(A, 1)
radius = sum(abs(A(k, [(1:k-1), (k+1:end)])));
circles(real(A(k, k)), imag(A(k, k)), radius, ...
'FaceAlpha', .1);
end

hold off;

Let’s go through the construction, which will amount to a proof of the theorem below. Pick any eigenpair (λ, x) of A. Let k be such that |xi| ≤ |xk| for all i. Note that xk ≠ 0 since x ≠ 0. Since λx = Ax, we have in particular:

λ xk = (Ax)k = Σ_{i=1}^{n} aki xi = akk xk + Σ_{i≠k} aki xi.    (5.25)

Reorganizing:

(λ − akk) xk = Σ_{i≠k} aki xi.    (5.26)

Using the triangle inequality and the defining property of k,

|λ − akk| · |xk| = |Σ_{i≠k} aki xi| ≤ Σ_{i≠k} |aki| · |xi| ≤ |xk| Σ_{i≠k} |aki|.    (5.27)

Figure 5.7: Comparison of subordinate-norm disks and Gerschgorin disks in the complex plane to locate the eigenvalues of A = [7 2 0; −1 8 1; 2 2 0] (blue dots). The largest disk has radius ‖A‖1 = 12; the second largest has radius ‖A‖∞ = 10; both are centered at the origin. The three smaller disks are the Gerschgorin disks, centered at the diagonal entries of A.

Finally, using xk ≠ 0 we have:

|λ − akk| ≤ Σ_{i≠k} |aki|.    (5.28)

This indeed guarantees that any eigenvalue λ is in at least one of n possible disks (corresponding to the possible values of k). Since we do not know a priori which eigenvalue is in which disk, the safest statement we can make at this point is that all eigenvalues lie in the union of all n disks.
Theorem 5.22 (Gerschgorin disks, Theorem 5.4 in [SM03]). Consider a matrix A ∈ Cn×n. All eigenvalues of A lie in the region D = ∪_{k=1}^{n} Dk, where Dk = {z ∈ C : |z − akk| ≤ Σ_{i≠k} |aki|}.
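A short Python counterpart to the Matlab snippet above (a sketch; gerschgorin_disks is my own helper name) checks Theorem 5.22 and the subordinate-norm bound numerically on the same matrix A:

```python
import numpy as np

def gerschgorin_disks(A):
    """Centers (diagonal entries) and radii (off-diagonal absolute row sums)."""
    centers = np.diag(A).astype(complex)
    radii = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
    return centers, radii

A = np.array([[7., 2., 0.], [-1., 8., 1.], [2., 2., 0.]])
centers, radii = gerschgorin_disks(A)
eigs = np.linalg.eigvals(A)

# Theorem 5.22: every eigenvalue lies in at least one disk.
assert all(np.any(np.abs(lam - centers) <= radii + 1e-9) for lam in eigs)
# Technique 1: |lambda| <= min of the subordinate 1-norm and infinity-norm.
assert np.all(np.abs(eigs) <= min(np.linalg.norm(A, 1), np.linalg.norm(A, np.inf)) + 1e-9)
```

For this A, the disk radii are (2, 2, 4), much tighter than the single disk of radius 10 or 12 from the subordinate norms.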
Question 5.23. How do you use Gerschgorin disks to produce an interval
which contains all eigenvalues of a symmetric matrix A?
Question 5.24. A symmetric matrix A is diagonally dominant if, for all k, akk ≥ Σ_{i≠k} |aki|. Show that such matrices are positive semidefinite.

5.6 Tridiagonalizing matrices: Householder


The Sturm sequence property only applies to tridiagonal matrices. In this section, we show how to reduce any symmetric matrix of size n to a tridiagonal, symmetric matrix with the same eigenvalues, in O(n³) flops. This tridiagonalization algorithm is useful in a number of other contexts too (notably for RQI). See Lecture 26 in [TBI97], and a bit of Lecture 10 for Householder reflectors.
Problem 5.25. Given a symmetric matrix A of size n × n, find a symmetric
matrix T which is tridiagonal and has the same eigenvalues as A.
A first step is to identify which operations one can apply to a matrix
without affecting its eigenvalues. The answer is similarity transforms. That
is, for any invertible matrix M of size n, it holds that M −1 AM and A have the
same eigenvalues. Indeed, consider the characteristic polynomial of M −1 AM :
pM⁻¹AM(λ) = det(M⁻¹AM − λIn)
          = det(M⁻¹AM − λM⁻¹In M)
          = det(M⁻¹(A − λIn)M)
          = det(M⁻¹) det(A − λIn) det(M)      (using det(CD) = det(C) det(D))
          = pA(λ) det(M⁻¹M)                   (using the same identity again)
          = pA(λ),
where pA (λ) is the characteristic polynomial of A. We find that A and
M −1 AM have the same characteristic polynomial, hence they have the same
eigenvalues.
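This invariance is easy to confirm numerically; the sketch below (with illustrative random matrices, not from the notes) compares the eigenvalues of a symmetric A with those of M⁻¹AM for a generic invertible M:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                      # symmetric, so its eigenvalues are real
M = rng.standard_normal((5, 5))        # generically invertible
B = np.linalg.solve(M, A @ M)          # M^{-1} A M, without forming M^{-1}

# Same eigenvalues up to roundoff (B is not symmetric, so its computed
# eigenvalues may carry tiny spurious imaginary parts).
assert np.allclose(np.sort(np.linalg.eigvalsh(A)),
                   np.sort(np.linalg.eigvals(B).real), atol=1e-6)
```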
A second step is to restrict our attention to similarity transforms that
preserve the symmetry of A. Thus, we consider only invertible matrices M
such that M −1 AM is symmetric. In other words, using A = AT :
M −1 AM = (M −1 AM )T = M T AT (M −1 )T = M T A(M −1 )T .
This is true in particular if M −1 = M T , that is, if M is orthogonal. Overall,
we decide to restrict our attention to orthogonal transformations of A:
Problem 5.26. Given a symmetric matrix A of size n×n, find an orthogonal
matrix Q of size n × n such that T = QT AQ is tridiagonal.
It is clear that T is symmetric, and as we just argued it has the same
eigenvalues as A. How do we construct such a matrix Q?9 Consider a 5 × 5
9 We give a brief description here: see the textbook for a full description. In particular, see the textbook for a discussion of flop count, and for practical implementation considerations.

matrix for example:

A = [ × × × × ×
      × × × × ×
      × × × × ×
      × × × × ×
      × × × × × ].
The goal is to transform into zeros all the entries outside the three central diagonals. Let us focus on the first column: we want to find an orthogonal matrix Q1 such that QT1 A will have zeros in the last three entries of the first column. Here is a crucial point:
Applying QT1 to A on the left will replace each one of the rows of
A with a linear combination of the rows of A. We are about to
design QT1 such that zeros appear in the first column. Since we
will then have to apply Q1 on the right, it is crucial that right-
multiplication with Q1 should leave the first column untouched,
as otherwise it might scramble our hard-earned zeros. Thus, we
need QT1 applied on the left to leave the first row untouched.
In other words, we need Q1 to assume this form:

QT1 = [ 1  0
        0  H1 ],
where H1 ∈ R(n−1)×(n−1) is orthogonal (so that Q1 is orthogonal: think about
it). Indeed, applying QT1 on the left has no effect on the first row, and
applying Q1 on the right has no effect on the first column. With a proper
choice of H1 , we will have
 
QT1 A = [ × × × × ×
          × × × × ×
          0 × × × ×
          0 × × × ×
          0 × × × × ],
and when we apply Q1 on the right, the zeros will endure. Furthermore, since
QT1 AQ1 is symmetric, and since we know it has zeros in the first column, it
must also have zeros in the first row:
 
QT1 AQ1 = [ × × 0 0 0
            × × × × ×
            0 × × × ×
            0 × × × ×
            0 × × × × ].

From here, it is only a matter of iterating the same idea. Specifically, let us
design QT2 such that applying it to the left will make zeros appear in the right
places in the second column. Upon doing this, we must make sure not to
affect the first two rows, so that when we apply Q2 to the right, our work in
columns 1 and 2 will be unaffected. Specifically, build an orthogonal matrix
H2 of size n − 2 such that
 
QT2 QT1 AQ1 = [ × × 0 0 0
               × × × × ×
               0 × × × ×
               0 0 × × ×
               0 0 × × × ],

where
 
QT2 = [ I2  0
        0   H2 ].

By the same symmetry argument, applying Q2 on the right yields zeros in


the second row, preserving the other zeros produced so far:
 
QT2 QT1 AQ1 Q2 = [ × × 0 0 0
                  × × × 0 0
                  0 × × × ×
                  0 0 × × ×
                  0 0 × × × ].

Similarly, build H3 orthogonal of size n − 3 such that


 
QT3 QT2 QT1 A Q1 Q2 Q3 = [ × × 0 0 0
                           × × × 0 0
                           0 × × × 0
                           0 0 × × ×
                           0 0 0 × × ] = T,

where QT3 QT2 QT1 = QT and Q1 Q2 Q3 = Q, and

QT3 = [ I3  0
        0   H3 ].

In general, define

Q = Q1 Q2 · · · Qn−2 .

This matrix is indeed orthogonal because a product of orthogonal matrices


is orthogonal. Furthermore, as announced, we have

QT AQ = T

where T is tridiagonal. Equivalently, we can also write

A = Q T QT,

which is sometimes more convenient. If only the eigenvalues of A are of


interest, there is no need to build the matrix Q explicitly: it is only necessary
to produce T. If the eigenvectors are desired, there are computationally
efficient ways of building them without direct construction of Q: see below
for a few words on that topic, and the textbook for details.
At this point, the only missing piece is: how do we construct the orthog-
onal transformations H1 , H2 , . . . , Hn−2 ? One practical way is to use House-
holder reflectors: see Lecture 10 in [TBI97]. Specifically, let us consider what
such a matrix H must satisfy:

1. H ∈ Rk×k must be orthogonal, and

2. For a given vector x ∈ Rk, we must have Hx = (×, 0, . . . , 0)T, that is, Hx must be zero below its first entry.

In our setting, the vector x that drives the construction of H is extracted from the column currently under inspection. A simple observation is that, since H is orthogonal, it has no effect on the 2-norm of a vector. Hence, ‖Hx‖ = ‖x‖. As a result, the second requirement can be written as

Hx = ±‖x‖ e1,    (5.29)

where e1 is the first canonical basis vector in Rk, that is, e1T = (1, 0, · · · , 0).
Either sign is good: we will commit later. How do we construct a matrix H satisfying these requirements? One good way is by reflection.10 Specifically, consider

u = (x − Hx) / ‖x − Hx‖.    (5.30)
10 See Figure 10.1 in [TBI97], where their F is our H and their normalized v is our u.

This unit-norm vector defines a plane normal to it. We will reflect x across
that plane. To this end, consider the orthogonal projection of x to the span
of u: it is (uT x)u. Hence, the orthogonal projection of x to the plane itself is
P x = x − (uT x)u.
We can reflect x across the plane by moving from x to the plane, orthogonally,
then traveling the exact same distance once more (along the same direction).
Thus,
Hx = x + 2(P x − x) = 2P x − x = x − 2(uT x)u = (Ik − 2uuT )x.
In other words, a good candidate is
H = Ik − 2uuT .
Since ‖u‖ = 1, it is clear that H is orthogonal. Indeed,

HT H = (Ik − 2uuT)T (Ik − 2uuT) = Ik − 4uuT + 4u(uTu)uT = Ik,
so that HT = H−1. If we can find an efficient way to compute u, then we have an efficient way to find H. To this end, combine (5.29) and (5.30):

u ∝ x ∓ ‖x‖ e1,

where by ∝ we mean that the vector on the left-hand side is proportional to the vector on the right-hand side. The vector on the right-hand side is readily computed, and furthermore we know that u has unit norm, so it is clear how to compute u in practice: compute the right-hand side, then normalize. How do we pick the sign? Floating-point arithmetic considerations suggest picking the sign such that sign(x1) and sign(∓‖x‖) = sign(∓) agree, to avoid cancellation (think about it). Overall, given x ∈ Rk, we can compute u as follows:

1. u ← x + sign(x1) ‖x‖ e1,

2. u ← u/‖u‖.
This costs ∼ 3k flops. The corresponding reflector is H = Ik − 2uuT. In practice, we would not compute H explicitly: that would be quite expensive. Instead, observe that applying H to a matrix can be done efficiently by exploiting its structure, namely:

HM = M − 2u(uT M).

Computing the vector uT M costs ∼ 2k² flops. With uT M computed, computing HM costs another ∼ 2k² flops. This is much cheaper than if we form H explicitly (∼ k² flops), then compute a matrix-matrix product, at ∼ 2k³ flops. Of course, we can also apply H on the right, as in M H, for ∼ 4k² flops as well. (Margin note: to compute 2uvT, first multiply u by 2 for k flops, then do the vector product for k² flops, as opposed to doing the vector product first and then multiplying each entry by 2, which costs 2k² flops.)
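Putting the pieces together, here is a compact Python sketch of Householder tridiagonalization (Problem 5.26). It is for illustration only: unlike the textbook implementation, it applies each reflector to full rows and columns and does not exploit symmetry, so it costs more than the flop count quoted in Question 5.27.

```python
import numpy as np

def householder_tridiag(A):
    """Return T = Q^T A Q tridiagonal, for symmetric A (Q is not formed)."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    for j in range(n - 2):
        x = T[j + 1:, j]
        u = x.copy()
        u[0] += np.copysign(np.linalg.norm(x), x[0])   # u ~ x + sign(x1) ||x|| e1
        nu = np.linalg.norm(u)
        if nu == 0.0:
            continue                                   # column already reduced
        u /= nu
        # Apply H = I - 2 u u^T on the left (rows j+1:) and right (columns j+1:),
        # each as M - 2 u (u^T M), without ever forming H explicitly.
        T[j + 1:, :] -= 2.0 * np.outer(u, u @ T[j + 1:, :])
        T[:, j + 1:] -= 2.0 * np.outer(T[:, j + 1:] @ u, u)
    return T
```

Storing the vectors u produced along the way is enough to apply Q or QT to a vector later, which is the point of Question 5.28.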

Question 5.27. Using the above considerations, how many flops do you need to compute the matrix T, given the matrix A? You should find that, without exploiting the symmetry of A, you can do this in ∼ (10/3) n³ flops.

Question 5.28. The vectors u1 , . . . , un−2 produced along the way (to design
the reflectors H1 , . . . , Hn−2 ) can be saved in case it becomes necessary to apply
Q to a vector later on. How many flops does it take to apply Q to a vector
y ∈ Rn in that scenario? How does that compare to the cost of a matrix-vector
product Qy as if we had Q available as a matrix directly?

Question 5.29. What happens if we apply the exact same algorithm to a


non-symmetric matrix A?

Remark 5.30. Householder reflectors can also be used to compute QR factorizations: this is what Matlab uses for qr; see Lecture 10 in [TBI97]. Furthermore, Householder tridiagonalization is the initial step for the so-called QR algorithm for eigenvalue computations: see Lectures 28, 29 in [TBI97]. Subsequent operations in that algorithm perform QR factorizations iteratively (hence the name QR algorithm: it is not an algorithm to compute QR factorizations; it is an algorithm to compute eigenvalue decompositions, using QR factorizations). This is quite close to the algorithm Matlab uses for eig.
Chapter 6

Polynomial interpolation

This chapter is based on Chapter 6 in [SM03] and focuses on the following


problem.

Problem 6.1. Given x0 , . . . , xn ∈ R distinct and y0 , . . . , yn ∈ R, find a


polynomial p of degree at most n such that p(xi ) = yi for i = 0, . . . , n.

It is clear that the polynomial should be allowed to have degree up to n in


general, since there are n + 1 constraints: we need n + 1 degrees of freedom.
It so happens that degree n is always sufficient as well, as will become clear.
It is good to clarify a few points of vocabulary here:

• Polynomial interpolation is the problem spelled out above;

• Polynomial regression is the problem of finding a polynomial such that p(xi) ≈ yi (where the approximation is best in some sense): this usually allows one to pick a lower-degree polynomial, and makes more sense if the data is noisy;

• Polynomial approximation is the problem of approximating a function f with a polynomial; it comes in at least two flavors:

  – Taylor polynomials approximate f and its derivatives at a specific point;
  – Minimax and least-squares polynomials (which we discuss in later lectures) approximate f over a whole interval.

• Polynomial extrapolation consists in using the obtained polynomial at points x outside the range for which one has data: this is usually risky business.


The key insight to solve the interpolation problem is to notice that

Pn = { polynomials of degree at most n }

is a linear subspace of dimension n + 1 (see also later in this section). Hence,


upon choosing a basis for it, any element in Pn can be identified with n + 1
coefficients. It is all about picking a convenient basis.
The obvious choice is the so-called monomial basis:

1, x, x2 , . . . , xn .

Then, a candidate solution p ∈ Pn can be written as

p(x) = Σ_{k=0}^{n} ak x^k

with coefficients a0, . . . , an to be determined. Each equation is of the form:

yi = p(xi) = Σ_{k=0}^{n} ak xi^k.

This is a linear equation in the coefficients. Collecting them for i = 0, . . . , n yields the system:

[ 1  x0  x0²  · · ·  x0^n ] [ a0 ]   [ y0 ]
[ 1  x1  x1²  · · ·  x1^n ] [ a1 ] = [ y1 ]    (6.1)
[ ⋮   ⋮    ⋮           ⋮  ] [ ⋮  ]   [ ⋮  ]
[ 1  xn  xn²  · · ·  xn^n ] [ an ]   [ yn ]

where the matrix on the left is the Vandermonde matrix.

Thus, it is sufficient to solve this linear system involving the Vandermonde


matrix to obtain the coefficients of the solution p in the monomial basis.
The Vandermonde matrix is invertible provided the xi ’s are distinct1 (as we
assume throughout), but it is unfortunately ill-conditioned in general.2
This means that for most choices of interpolation points xi we cannot
hope to solve this linear system with any reasonable accuracy for n beyond
a small threshold. This is illustrated for equispaced interpolation points and
Chebyshev interpolation points (to be discussed later) in Figure 6.1.
1 We do not prove this explicitly, though it is implied by results to follow.
2 One notable exception is the complex case where xj = e^{2πij/(n+1)} (the n + 1 complex roots of unity): then the Vandermonde matrix is in fact the Fourier matrix, unitary up to a scaling. Its condition number is 1!
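For small degrees, solving (6.1) is a one-liner; here is a Python sketch with illustrative data (NumPy's vander with increasing=True matches the column order in (6.1)):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 7.0])            # samples of 1 + x + x^2
V = np.vander(x, increasing=True)        # columns 1, x, x^2, as in (6.1)
a = np.linalg.solve(V, y)                # monomial coefficients a_0, ..., a_n
# a recovers [1, 1, 1], i.e., p(x) = 1 + x + x^2.
```

For larger degrees this is exactly the computation that the ill-conditioning discussed above renders unreliable.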

% Condition number of Vandermonde matrix


a = 0; b = 1;
kk = 20;
cond1 = zeros(kk, 1);
cond2 = zeros(kk, 1);
for k = 1:kk
x1 = linspace(a, b, k+1); % equispaced
x2 = (a+b)/2 + (b-a)/2 * cos( (2*(0:k) + 1)*pi ./ ...
(2*k+2) ); % Chebyshev
cond1(k) = cond(vander(x1));
cond2(k) = cond(vander(x2));
end
semilogy(1:kk, cond1, '.-', 1:kk, cond2, '.-');


Figure 6.1: For equispaced and Chebyshev interpolation points, the Vander-
monde matrix has exponentially growing condition number. Beyond degree
20 or so, one cannot hope to compute the coefficients of the interpolation
polynomial in the monomial basis with any reasonable accuracy in double
precision arithmetic. This does not mean, however, that the interpolation
polynomial cannot be evaluated accurately: only that methods based on
computing the coefficients in the monomial basis first are liable to incur a
large error.

Sets of polynomials as vector spaces


In your linear algebra course, you may have only considered vectors and
subspaces in Rn . For that important particular case, the properties we use
over and over again are:
1. The zero vector belongs to Rn ,
2. Adding two vectors from Rn yields a vector of Rn , and
3. Multiplying a vector of Rn by a scalar yields a vector of Rn .
In fact, these are the only properties of Rn that we really need in order
to develop the core concepts and results from your linear algebra course,
including the notions of linear independence, bases, dimension, subspaces,
changes of coordinates, linear transformations, etc.
Now, consider the following statements about Pn , the set of polynomials
of degree at most n, where n ≥ 0:
1. The zero polynomial is a polynomial of degree at most n,
2. Adding two polynomials of degree at most n yields a polynomial of
degree at most n, and
3. Multiplying a polynomial of degree at most n by a scalar yields a poly-
nomial of degree at most n.
Thus, Pn satisfies all the same properties as Rn insofar as we are concerned for
linear algebra purposes. This is why we say Pn is a vector space (or a linear
space, or a linear subspace, or simply a subspace), and why we have notions
of linearly independent polynomials, bases for Pn , dimension of Pn and of its
subspaces, changes of coordinates for polynomials, linear transformations on
polynomials, etc.
More advanced concepts of your linear algebra course required an addi-
tional tool on Rn : the notion of an inner product (or dot product). This
allowed us to define such things as orthogonal projections and orthonormal
bases. We will see that inner products can be defined over Pn as well, so
that we can define orthonormal bases of polynomials.
This is an abstract concept: take some time to let it sink in.

6.1 Lagrange interpolation, the Lagrange way


So far, we made only one arbitrary choice: working with the monomial basis.
Clearly, that won’t do. Thus, we must aim to find a better basis. Here,

“better” means that the matrix appearing in the linear system should have a better condition number than the Vandermonde matrix. In doing so, we
might as well be greedy and aim for the best conditioned matrix of all: the
identity matrix.
We are looking for a basis of Pn made of n + 1 polynomials of degree at
most n,

L0 (x), . . . , Ln (x), (6.2)

such that the interpolation problem reduces to a linear system with an identity matrix. Using this (for now unknown) basis, the solution p can be written as

p(x) = Σ_{k=0}^{n} ak Lk(x).

Enforcing p(xi) = yi for each i, we get the following linear system of n + 1 equations in n + 1 unknowns:

[ L0(x0)  L1(x0)  L2(x0)  · · ·  Ln(x0) ] [ a0 ]   [ y0 ]
[ L0(x1)  L1(x1)  L2(x1)  · · ·  Ln(x1) ] [ a1 ] = [ y1 ]    (6.3)
[   ⋮        ⋮        ⋮             ⋮   ] [ ⋮  ]   [ ⋮  ]
[ L0(xn)  L1(xn)  L2(xn)  · · ·  Ln(xn) ] [ an ]   [ yn ]

For the matrix to be the identity, we need each Lk to satisfy the following (consider each column separately):

Lk(xk) = 1, and Lk(xi) = 0 for all i ≠ k.    (6.4)

The latter specifies n roots for each Lk, only leaving a scaling indeterminacy; that scaling is determined by the former condition. There is only one possible choice:

Lk(x) = Π_{i≠k} (x − xi) / Π_{i≠k} (xk − xi).    (6.5)

The numerator is determined by the requirement to have roots at the xi for i ≠ k, and the denominator is determined by fixing the scaling Lk(xk) = 1. These polynomials are called the Lagrange polynomials for the given xi's. Each Lk has degree n.
Question 6.2. Show the Lagrange polynomials form a basis of Pn .

Since the matrix is the identity, the system of equations is easily solved: ak = yk for each k, and the interpolation polynomial is:

p(x) = Σ_{k=0}^{n} yk Lk(x).    (6.6)

It is clear by construction that p(xi) = yi.


It is important to note here that the polynomials Lk are not meant to be expanded into the monomial basis. Consider (6.6) as the final answer. To evaluate p(x), one approach is to simply plug x into the formula (6.5) for each Lk, then form the linear combination (6.6). This requires O(n²) work for each x. A cheaper and numerically better approach is to use the barycentric formula; see for example https://en.wikipedia.org/wiki/Lagrange_polynomial#Barycentric_form. This takes O(n²) preprocessing work (to be done once, and it could be saved to disk), then O(n) for each point: as much work as if one had an expression for p in the monomial basis.
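As a sketch of that approach, here is a Python version of barycentric evaluation (the function name is mine; the course's Matlab interp_lagrange_bary would follow the same structure, with the O(n²) weight computation as the preprocessing step):

```python
import numpy as np

def interp_barycentric(x, y, t):
    """Evaluate the interpolation polynomial through (x, y) at the points t."""
    x, y, t = np.asarray(x, float), np.asarray(y, float), np.asarray(t, float)
    # O(n^2) preprocessing: weights w_k = 1 / prod_{i != k} (x_k - x_i).
    w = np.array([1.0 / np.prod(x[k] - np.delete(x, k)) for k in range(len(x))])
    p = np.empty_like(t)
    for j, tj in enumerate(t):
        d = tj - x
        hit = (d == 0)
        if hit.any():
            p[j] = y[hit][0]             # tj is an interpolation point
        else:
            c = w / d                    # O(n) work per evaluation point
            p[j] = (c @ y) / c.sum()
    return p
```

The guard on d == 0 matters: the second barycentric formula divides by tj − xk, so nodes must be handled separately.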
Question 6.3. Implement a function f = interp_lagrange_bary(x, y, t) which returns the evaluation of the interpolation polynomial through data (x, y) at the points in t; your code should use the barycentric formula mentioned above.

The following program lets us visualize the Lagrange polynomials for equispaced points on an interval; see Figure 6.2.

% Lagrange polynomial basis
n = 5;
x = linspace(-1, 1, n+1);
I = eye(n+1);
t = linspace(min(x), max(x), 251);
for k = 1 : n+1
    subplot(2, 3, k);
    % Evaluate the Lagrange polynomials by interpolating
    % the standard basis vectors.
    ek = I(k, :);
    plot(t, interp_lagrange_bary(x, ek, t), 'LineWidth', 1.5);
    hold all;
    stem(x, ek, '.', 'MarkerSize', 15, 'LineWidth', 1.5);
    hold off;
    ylim([-1, 1.5]);
    set(gca, 'YTick', [-1, 0, 1]);
    set(gca, 'XTick', [-1, 0, 1]);
end


Figure 6.2: Lagrange polynomials for 6 equispaced points in [−1, 1].

It is clear from the above that the solution p exists and is unique. Nevertheless, we give here a uniqueness proof that does not involve any specific basis, primarily because it uses an argument we will use frequently.
Theorem 6.4. The solution of the interpolation problem is unique.
Proof. For contradiction, assume there exist two distinct polynomials p, q ∈
Pn verifying p(xi ) = yi and q(xi ) = yi for i = 0, . . . , n. Then, the polynomial
h = p − q is also in Pn and it has n + 1 roots:

h(xi ) = p(xi ) − q(xi ) = 0 for i = 0, . . . , n.

Yet, the only polynomial of degree at most n which has strictly more than n
roots is the zero polynomial. Thus, h = 0, and it follows that p = q.
If the data points (xi, yi) are obtained by sampling a function f, that is, yi = f(xi), where x0, ..., xn are distinct points in [a, b], then a natural question is:
How large can the error f (x) − pn (x) be for x ∈ [a, b], where pn
is the interpolation polynomial for n + 1 points?

This is actually a polynomial approximation question: something we will


talk about extensively in later lectures. One key observation is that the
approximation error depends of course on f itself, but it also very much
depends on the choice of points xi , as we now illustrate with two functions:

f_1(x) = \frac{\cos(x) + 1}{2} \quad \text{over } [0, 2\pi], \text{ and}    (6.7)

f_2(x) = \frac{1}{1 + x^2} \quad \text{over } [-5, 5].    (6.8)
Both are infinitely continuously differentiable. Yet, one will prove much
harder to approximate than the other. Run the lecture code to observe the
following:

- With equispaced points, at first, increasing n yields better approximation quality for f1, until a turning point is hit and the approximation deteriorates.

- With equispaced points, for f2, increasing n seems to deteriorate the approximation quality right away: none of the polynomials attain small approximation error. This is the Runge phenomenon.

- In both cases, the errors seem largest close to the boundaries of [a, b], which suggests sampling more points there. This is the idea behind Chebyshev points, which we will justify later.

- With Chebyshev points, approximation quality for f1 improves with n until we reach machine precision, at which point the error stabilizes.

- For f2, Chebyshev points again allow the approximation error to decrease with n, albeit more slowly.
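These observations are easy to reproduce even without the lecture code. The following Python sketch (an illustration, not part of the notes: np.polyfit serves as a stand-in for a proper interpolation routine, and the Chebyshev point formula is only derived in the next chapter) compares the worst-case error of degree-10 interpolation of f2 at equispaced versus Chebyshev points.

```python
import numpy as np

f2 = lambda x: 1.0 / (1.0 + x**2)         # the Runge function (6.8)
a, b, n = -5.0, 5.0, 10
t = np.linspace(a, b, 2001)               # fine grid to estimate the sup-norm error

def max_error(nodes):
    # polyfit with degree len(nodes) - 1 interpolates the data exactly.
    c = np.polyfit(nodes, f2(nodes), len(nodes) - 1)
    return np.max(np.abs(f2(t) - np.polyval(c, t)))

equi = np.linspace(a, b, n + 1)
k = np.arange(n + 1)
cheb = (a + b) / 2 + (b - a) / 2 * np.cos((np.pi / 2 + k * np.pi) / (n + 1))

print(f"equispaced: {max_error(equi):.3f}, Chebyshev: {max_error(cheb):.3f}")
```

The equispaced error is larger than 1 (the Runge phenomenon), while the Chebyshev error is already much smaller at the same degree.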

The following theorem bounds the approximation error, allowing us to understand the separate roles of f and of the interpolation points xi. We will use this theorem over and over from now on.

Theorem 6.5 (Theorem 6.2 in [SM03]). Let f : [a, b] → R be n + 1 times continuously differentiable. Let pn ∈ Pn be the interpolation polynomial for f at x0, ..., xn distinct in [a, b]. Then, for any x ∈ [a, b], there exists ξ ∈ (a, b) (which may depend on x) such that

f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \pi_{n+1}(x),    (6.9)

where πn+1(x) = (x − x0) ··· (x − xn). Defining Mn+1 = max_{ξ∈[a,b]} |f^(n+1)(ξ)|,

|f(x) - p_n(x)| \leq \frac{M_{n+1}}{(n+1)!} |\pi_{n+1}(x)|    (6.10)

for any x ∈ [a, b].
Notice that the dependence on f is only through Mn+1, while the dependence on the xi's is only through πn+1. In general, f is forced by the application and we have little or no control over it. On the other hand, we often have control over the interpolation points. Thus, it makes sense to try and find xi's which keep πn+1 small in some appropriate sense. Later, we will show that this is precisely what the Chebyshev points aim to do. For now, we just give a bound on πn+1 for equispaced points, since it is such a natural case to consider.
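As a quick sanity check on how Mn+1 and πn+1 enter the bound (6.10), here is a small Python experiment (an illustration, not part of the notes) with f(x) = e^x on [0, 1]: every derivative of f is e^x, so Mn+1 = e there.

```python
import math
import numpy as np

f = np.exp
n = 4
x = np.linspace(0.0, 1.0, n + 1)     # 5 equispaced interpolation points on [0, 1]
t = np.linspace(0.0, 1.0, 2001)      # fine grid for sup-norm estimates

# Degree-n interpolant through the n+1 points (polyfit interpolates exactly here).
p = np.polyval(np.polyfit(x, f(x), n), t)

# pi_{n+1}(t) = (t - x_0) ... (t - x_n); M_{n+1} = max |f^{(n+1)}| = e on [0, 1].
pi = np.prod([t - xi for xi in x], axis=0)
bound = np.e * np.max(np.abs(pi)) / math.factorial(n + 1)

err = np.max(np.abs(f(t) - p))
print(f"actual error {err:.2e} <= bound {bound:.2e}: {err <= bound}")
```

The measured worst-case error sits below the theoretical bound, as Theorem 6.5 requires.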
Lemma 6.6. For equispaced points x0, ..., xn on [a, b] with x0 = a, xn = b, the spacing between consecutive points is h = (b − a)/n, and

|\pi_{n+1}(x)| \leq \frac{n! \, h^{n+1}}{4}    (6.11)

for all x ∈ [a, b].
Proof. The statement is clear if x is equal to one of the xi's. Thus, consider x ≠ xi for all i. Hence, x lies strictly between two consecutive points, that is, there exists i such that xi < x < xi+1. Considering the two terms of πn+1 pertaining to these two points, since they form a quadratic, we easily obtain:

|(x - x_i)(x - x_{i+1})| \leq \max_{\xi \in [x_i, x_{i+1}]} |(\xi - x_i)(\xi - x_{i+1})| = \frac{h^2}{4}.

(Indeed, the maximum of the quadratic is attained in the middle of the two roots.) We now need to consider all other terms in πn+1. There are two kinds: for k < i, we have xk < xi < x < xi+1; on the other hand, for k > i + 1, we have xk > xi+1 > x > xi. Hence,

|x - x_k| \leq \begin{cases} |x_{i+1} - x_k| = (i + 1 - k)h & \text{if } k < i, \\ |x_k - x_i| = (k - i)h & \text{if } k > i + 1. \end{cases}

(This is clearer if you simply make a drawing of the situation on the real line.) Combine all inequalities to control πn+1:

|\pi_{n+1}(x)| = |x - x_0| \cdots |x - x_{i-1}| \cdot |(x - x_i)(x - x_{i+1})| \cdot |x - x_{i+2}| \cdots |x - x_n|
\leq (i + 1) \cdots 2 \cdot \frac{1}{4} \cdot 2 \cdots (n - i) \, h^{n+1}.

The product of integers attains its largest value if x lies in one of the extreme intervals, (x0, x1) or (xn−1, xn), that is, i = 0 or i = n − 1. Hence,

|\pi_{n+1}(x)| \leq \frac{n!}{4} h^{n+1}

for all x ∈ [a, b].
Combining with Theorem 6.5, a direct corollary is that for equispaced points the approximation error is bounded by \frac{h^{n+1}}{4(n+1)} M_{n+1}. As n increases, the fraction decreases exponentially fast. Importantly, the function-dependent Mn+1 can increase with n, possibly fast enough to still push the bound to infinity. This in itself does not imply the actual error will go to infinity, but it is indicative that large errors are possible.
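Lemma 6.6 itself is easy to check numerically. This hedged Python sketch (again, an illustration rather than lecture code) evaluates πn+1 on a fine grid for equispaced points on [−1, 1] and compares it with n! h^{n+1}/4.

```python
import math
import numpy as np

a, b = -1.0, 1.0
t = np.linspace(a, b, 5001)
for n in range(1, 12):
    x = np.linspace(a, b, n + 1)                  # equispaced, x_0 = a, x_n = b
    h = (b - a) / n
    pi = np.prod([t - xi for xi in x], axis=0)    # pi_{n+1} on the grid
    assert np.max(np.abs(pi)) <= math.factorial(n) * h ** (n + 1) / 4 + 1e-15
print("Lemma 6.6 bound holds on a fine grid for n = 1, ..., 11")
```

For n = 1 the bound is attained exactly (at the midpoint), which is why a tiny slack is added for floating point.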
Question 6.7. For f1 in (6.7), what is a good value for Mn+1? Based on this, what is your best explanation for the experimental behavior of the approximation error?


We now give a proof of the main theorem, following [SM03, Thm 6.2].
Proof of Theorem 6.5. It is only necessary to establish equation (6.9). If x
is equal to any of the interpolation points xi , that equation clearly holds.
Thus, it remains to prove (6.9) holds for x ∈ [a, b] distinct from any of the
interpolation points. To this end, consider the following function of t on [a, b]:

φ(t) = f(t) - p_n(t) - \frac{f(x) - p_n(x)}{\pi_{n+1}(x)} \pi_{n+1}(t).

(Note that x is fixed in this definition: φ is only a function of t.) Here are the two crucial observations:

1. φ has at least n + 2 distinct roots in [a, b]. Indeed, φ(xi) = 0 for i = 0, ..., n, and also φ(x) = 0; and

2. φ is n + 1 times continuously differentiable.

Using both facts, by the mean value theorem (or Rolle's theorem), the derivative of φ has at least n + 1 distinct roots in [a, b]. In turn, the second derivative of φ has at least n distinct roots in [a, b]. Continuing this argument, we conclude that the (n + 1)st derivative of φ (which we denote by φ^(n+1)) has at least one root in [a, b]: let us call it ξ. (Of course, ξ may depend on x in a complicated way, but that is not important: we only need to know that such a ξ exists.) Verify the two following claims:

1. The (n + 1)st derivative of pn(t) with respect to t is identically 0, and

2. The (n + 1)st derivative of πn+1(t) = t^{n+1} + (lower order terms) with respect to t is the constant (n + 1)!.

Then, the (n + 1)st derivative of φ is:

φ^{(n+1)}(t) = f^{(n+1)}(t) - \frac{f(x) - p_n(x)}{\pi_{n+1}(x)} (n+1)!.

Since φ^(n+1)(ξ) = 0, we conclude that

0 = f^{(n+1)}(\xi) - \frac{f(x) - p_n(x)}{\pi_{n+1}(x)} (n+1)!,

hence (6.9) holds for all x ∈ [a, b].

6.2 Hermite interpolation


Hermite interpolation is a variation of the Lagrange interpolation problem.
It is stated as follows.
Problem 6.8. Given x0, ..., xn ∈ R distinct and y0, ..., yn ∈ R, z0, ..., zn ∈ R, find a polynomial p of degree at most 2n + 1 such that p(xi) = yi and p′(xi) = zi for i = 0, ..., n.
This problem is solved following the same construction strategy. With the notation δik = 1 if i = k and δik = 0 if i ≠ k, we must:

1. Find H0, ..., Hn ∈ P2n+1 such that Hk(xi) = δik and H′k(xi) = 0 for all i, k; and

2. Find K0, ..., Kn ∈ P2n+1 such that Kk(xi) = 0 and K′k(xi) = δik for all i, k.

Then, by linearity, the solution p to the Hermite problem is:

p = \sum_{k=0}^{n} \left( y_k H_k + z_k K_k \right).

To construct the basis polynomials Hk, Kk, it is easiest to work our way up from the Lagrange polynomials,

L_k(x) = \prod_{i \neq k} \frac{x - x_i}{x_k - x_i}.

For example, Kk must have a double root at each xi with i ≠ k and a simple root at xk: this accounts for all the 2n + 1 possible roots of Kk, thus it must be that

K_k(x) \propto (L_k(x))^2 (x - x_k).

The scale is set by enforcing K′k(xk) = 1. By chance, the above is already properly scaled: it is an equality. To construct Hk, one can work similarly (though it's a tad less easy to guess), and one obtains:

H_k(x) = (L_k(x))^2 \left( 1 - 2 L_k'(x_k)(x - x_k) \right).

Here too, it is an exercise to verify that Hk satisfies its defining properties.
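Here is a quick numerical check of these defining properties in Python, using numpy's poly1d (the nodes chosen are arbitrary; this is an illustration, not part of the notes).

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0])        # arbitrary distinct nodes
m = len(x)

def lagrange(k):
    # L_k(x) = prod_{i != k} (x - x_i) / (x_k - x_i), as a poly1d.
    others = np.delete(x, k)
    return np.poly1d(others, r=True) / np.prod(x[k] - others)

def hermite_basis(k):
    Lk = lagrange(k)
    c = Lk.deriv()(x[k])                                   # L_k'(x_k)
    Hk = Lk * Lk * np.poly1d([-2 * c, 1 + 2 * c * x[k]])   # L_k^2 (1 - 2 L_k'(x_k)(x - x_k))
    Kk = Lk * Lk * np.poly1d([1.0, -x[k]])                 # L_k^2 (x - x_k)
    return Hk, Kk

for k in range(m):
    Hk, Kk = hermite_basis(k)
    e_k = np.eye(m)[k]
    assert np.allclose(Hk(x), e_k) and np.allclose(Hk.deriv()(x), 0.0)
    assert np.allclose(Kk(x), 0.0) and np.allclose(Kk.deriv()(x), e_k)
print("H_k and K_k satisfy their defining conditions")
```

The four assertions are exactly the conditions Hk(xi) = δik, H′k(xi) = 0, Kk(xi) = 0, K′k(xi) = δik.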


A theorem similar to Theorem 6.5 can be established for Hermite interpolation too. We omit the proof.

Theorem 6.9 (Theorem 6.4 in [SM03]). Let f : [a, b] → R be 2n + 2 times continuously differentiable. Let p2n+1 ∈ P2n+1 be the Hermite interpolation polynomial for f at x0, ..., xn distinct in [a, b]. Then, for any x ∈ [a, b], there exists ξ ∈ (a, b) (which may depend on x) such that

f(x) - p_{2n+1}(x) = \frac{f^{(2n+2)}(\xi)}{(2n+2)!} (\pi_{n+1}(x))^2,    (6.12)

where πn+1(x) = (x − x0) ··· (x − xn). Defining M2n+2 = max_{ξ∈[a,b]} |f^(2n+2)(ξ)|,

|f(x) - p_{2n+1}(x)| \leq \frac{M_{2n+2}}{(2n+2)!} (\pi_{n+1}(x))^2    (6.13)

for any x ∈ [a, b].

Notice that the dependence on the interpolation points is also through πn+1, the same as for Lagrange interpolation. Thus, if we find interpolation points which make |πn+1| small, these will be relevant for both Lagrange and Hermite interpolation. This theorem can be used to understand errors in Gauss quadrature for numerical integration (a problem we discuss later).
Chapter 7

Minimax approximation

In the previous chapter, we studied polynomial interpolation. Theorem 6.5 went a bit beyond the original intent, venturing into polynomial approximation. Indeed, it states that for f sufficiently smooth, the approximation error of pn at all points in the interval [a, b] is bounded as

|f(x) - p_n(x)| \leq \frac{M_{n+1}}{(n+1)!} |\pi_{n+1}(x)|.    (7.1)

A more general way to look at this result is to introduce the ∞-norm, and to see this result as saying that some particular norm of f − pn is bounded. We do this now.
Definition 7.1. Let C[a, b] denote the set of real-valued continuous functions on [a, b]. This is an infinite-dimensional linear space. The ∞-norm of a function g ∈ C[a, b] is defined as:

\|g\|_\infty = \max_{x \in [a,b]} |g(x)|.

This is well defined since |g| is continuous on the compact set [a, b] (Weierstrass). It is an exercise to verify this is indeed a norm.
With this notion of norm, we can express (a slightly relaxed version of)[1] the error bound above as:

\|f - p_n\|_\infty \leq \frac{M_{n+1}}{(n+1)!} \|\pi_{n+1}\|_\infty.    (7.2)

Furthermore, we have the convenient notation Mn+1 = ‖f^(n+1)‖∞.
This formalism leads to three natural questions:

[1] The error bound (7.1) specifies it is the same x on both sides of the inequality, whereas (7.2) takes the worst case on both sides independently.


1. Can we pick x0, ..., xn such that ‖πn+1‖∞ is minimal? Then, interpolating at those points is a good idea, in particular if we do not know much about f. We will solve this question completely and explicitly.

2. If the goal is to approximate f, who says we must interpolate? We could try to solve this directly:

\min_{p_n \in P_n} \|f - p_n\|_\infty.

That is: minimize the actual error rather than a bound on the error. This is the central question of this chapter. We will characterize the solutions (show existence, uniqueness and more), but we won't do much in the way of computations.

3. What about other norms?

Question 2 puts the finger on the central problem of this chapter.

Problem 7.2. Given f ∈ C[a, b], find pn ∈ Pn such that

\|f - p_n\|_\infty = \min_{q \in P_n} \|f - q\|_\infty = \min_{q \in P_n} \max_{x \in [a,b]} |f(x) - q(x)|.

The solution to this problem is called the minimax polynomial for f on [a, b], because of the min-max combination.

Notice that, at this stage, it is unclear whether minimax polynomials exist


and whether they are unique—we will establish this as we go. Furthermore, it
may very well be that the minimax pn interpolates f at some points (indeed,
that will be the case), but this is not required a priori.
Answering Question 2 provides an answer to Question 1. Indeed, consider this expansion of πn+1:

\pi_{n+1}(x) = (x - x_0) \cdots (x - x_n) = x^{n+1} - q_n(x),

where qn ∈ Pn. Minimizing ‖πn+1‖∞ is equivalent to minimizing ‖x^{n+1} − qn‖∞, which is exactly the problem of finding the minimax approximation of f(x) = x^{n+1} in Pn.[2]

[2] It is unclear at this point that picking qn as such leads to x^{n+1} − qn(x) having n + 1 distinct real roots in [a, b], which is required to factor πn+1 into (x − x0) ··· (x − xn); this will be resolved fully.

Last but not least, regarding Question 3: we could choose to minimize the error ‖f − pn‖ for some other norm. In particular, in the next chapter we consider the (weighted) 2-norm over C[a, b]:

\|g\|_2 = \sqrt{ \int_a^b w(x) |g(x)|^2 \, dx },    (7.3)

where w is a proper weight function (to be discussed in the next chapter). Importantly, in infinite-dimensional spaces such as C[a, b], the choice of norm matters. Remember, in finite-dimensional spaces such as R^n, all norms are equivalent in that for any two norms ‖·‖α, ‖·‖β there exist constants c1, c2 > 0 such that

c_1 \|x\|_\beta \leq \|x\|_\alpha \leq c_2 \|x\|_\beta \quad \text{for all } x \in \mathbb{R}^n.

This notably means convergence in one norm implies convergence in all norms. Not so in infinite-dimensional spaces! This is partly why we have two whole, separate chapters about polynomial approximation in two different norms: they are quite different problems, requiring different mathematics.
Question 7.3. Propose a sequence of functions f1, f2, ... in C[−1, 1] such that ‖fn‖2 → 0 (with w(x) ≡ 1) whereas ‖fn‖∞ → ∞.

Question 7.4. Show that the opposite cannot happen. Specifically, show there exists a constant c such that ‖f‖2 ≤ c‖f‖∞.


Questions beget questions. A fourth comes to mind:

4. How low can we go? Can we truly expect any continuous function to be arbitrarily well approximated by some polynomial?

The answer to this question is: yes! This is the message of the following fundamental result (proof omitted).

Theorem 7.5 (Weierstrass approximation theorem, Theorem 8.1 in [SM03]). For any f ∈ C[a, b] and any tolerance ε > 0, there exists a polynomial p such that

\|f - p\|_\infty \leq \varepsilon.

In other words, polynomials are dense in C[a, b] (in the same sense that rational numbers are dense among the reals).

The catch is: the theorem does not specify how large the degree of p may need to be as a function of f and ε. Admittedly, it could be impractically large in general. In what follows, we maintain some control over the degree, using results with the same flavor as Theorem 6.5. Note that the above result extends to ‖·‖2 directly using Question 7.4.

7.1 Characterizing the minimax polynomial


How do we compute the minimax polynomial? This is a tough question. The classical algorithm for this is the Remez exchange algorithm.[3] We won't study this algorithm here, in good part because actual minimax polynomials are harder to compute than interpolants at Chebyshev nodes, while the latter are usually good enough. Yet, we will study minimax approximation sufficiently to characterize solutions. This in turn will allow us to construct the Chebyshev nodes and to understand why they are such a good choice. Let's start with the basics.

Theorem 7.6 (Existence, Theorem 8.2 in [SM03]). Given f ∈ C[a, b], there exists pn ∈ Pn such that

\|f - p_n\|_\infty \leq \|f - q\|_\infty \quad \text{for all } q \in P_n.

Before we get to the proof, you may wonder: why does this necessitate a proof at all? It all hinges upon the distinction between infimum and minimum.[4] The infimum of a set of reals is always well defined: it is the largest lower bound on the elements of the set. In contrast, the minimum is only defined if the infimum is equal to some element of the set; we then say the infimum or minimum is attained. In our scenario, each polynomial q ∈ Pn maps to a real number ‖f − q‖∞. The set of those numbers necessarily has an infimum. The question is whether some number in that set is equal to the infimum, that is, whether there exists a polynomial pn in Pn such that ‖f − pn‖∞ is equal to the infimum. If that is the case, the infimum is called the minimum, and pn is called a minimizer. This existence theorem is all about that particular point.
Proof. Minimax approximation is an optimization problem:

\min_{q \in P_n} E(q),

[3] https://en.wikipedia.org/wiki/Remez_algorithm
[4] https://math.stackexchange.com/questions/342749/what-is-the-difference-between-minimum-and-infimum

where E : Pn → R is the cost function: E(q) = ‖f − q‖∞. Our task is to show that the min is indeed a min and not an inf; that is, we need to show the minimum value of E exists and is attained for some q. The usual theorem one aims to invoke for such matters is Weierstrass' Extreme Value Theorem:[5]

Continuous functions on non-empty compact sets attain their bounds. (EVT)

Thus, we contemplate two questions:

1. Is E continuous?

2. Pn is finite-dimensional. In finite-dimensional spaces, a set is compact if and only if it is closed and bounded.[6] Pn is not bounded, hence it is not compact; can we resolve that?

Continuity of E is clear: continuous perturbations of q lead to continuous perturbations of |f − q|, which in turn lead to continuous perturbations of the maximum value of |f − q| on [a, b], which is E(q) (details omitted).
The key point is to address the non-compactness of Pn. To this end, the strategy is to construct a compact set S ⊂ Pn such that

\inf_{q \in P_n} E(q) = \inf_{q \in S} E(q) = \min_{q \in S} E(q),    (7.4)

where the last equality holds by the EVT (Weierstrass). If we prove the first equality, then the above states the infimum over Pn is equal to a min over S, thus showing existence of an optimal q.

Define S as follows:

S = {q ∈ Pn : E(q) ≤ ‖f‖∞}.

This set is indeed bounded and closed, hence it is compact. Furthermore, 0 ∈ S (that is, the zero polynomial is in S) since 0 ∈ Pn and

E(0) = ‖f‖∞ ≤ ‖f‖∞.

Thus, S is non-empty: the EVT indeed applies to S.

It remains to show inf_{q∈Pn} E(q) = inf_{q∈S} E(q). To this end, let q* ∈ S be a minimizer for E in S. Then, since 0 ∈ S,

E(q^*) = \min_{q \in S} E(q) \leq E(0) = \|f\|_\infty.

[5] https://en.wikipedia.org/wiki/Extreme_value_theorem#Generalization_to_metric_and_topological_spaces
[6] That's another Weierstrass theorem: Bolzano–Weierstrass, https://en.wikipedia.org/wiki/Bolzano%E2%80%93Weierstrass_theorem

As a result, for any q ∉ S we have E(q) > ‖f‖∞ ≥ E(q*). Since inf_{q∈Pn} E(q) is the largest lower bound on all values attained by E in Pn, and since all values attained by E outside of S are strictly larger than the smallest value attained in S, it follows that (7.4) indeed holds.

Given a polynomial r ∈ Pn, it is a priori quite difficult to determine whether this polynomial is minimax, let alone to quantify how far it is from being minimax. The following theorem resolves most[7] of the difficulty.
Theorem 7.7 (De la Vallée Poussin's theorem, Theorem 8.3 in [SM03]). Let f ∈ C[a, b] and r ∈ Pn. Suppose there exist n + 2 points x0 < ··· < xn+1 in [a, b] such that

f(xi) − r(xi) and f(xi+1) − r(xi+1)

have opposite signs for i = 0, ..., n.[8] Then,

\min_{i=0,\dots,n+1} |f(x_i) - r(x_i)| \leq \min_{q \in P_n} \|f - q\|_\infty \leq \|f - r\|_\infty.    (7.5)

The upper bound, min_{q∈Pn} ‖f − q‖∞ ≤ ‖f − r‖∞, is trivial. It is the lower bound which is informative and nontrivial. In particular, if the lower bound and the upper bound approximately match, this tells us r is almost minimax. This is powerful information. Of course, there is no reason a priori to believe we can make the gap small, but we will see later that a polynomial is minimax if and only if the lower and upper bounds match (provided the best possible xi's are picked for the lower bound). See also Figure 7.1.
Proof. Define µ as the left-hand side of our inequality of interest:

\mu = \min_{i=0,\dots,n+1} |f(x_i) - r(x_i)|.

Certainly, if µ = 0, then the theorem is true. Hence, we only need to consider the case µ > 0. Let pn ∈ Pn be minimax for f on [a, b] (we proved its existence above). For contradiction, let us assume that

\|f - p_n\|_\infty < \mu.

Our goal is to derive a contradiction from that statement. To this end, consider the following inequalities, valid for all i = 0, ..., n + 1:

|p_n(x_i) - f(x_i)| \leq \|f - p_n\|_\infty < \mu \leq |r(x_i) - f(x_i)|.    (7.6)
[7] Most but not all, because the condition on r is not satisfied in general.
[8] If f(xi) − r(xi) = 0 for some i, then the theorem is trivially true regardless of the meaning given to the words "opposite signs" in the presence of 0's, so that there is no need to be more specific.

[Figure 7.1 has two panels: on the left, "f = @(t) 1./(1+25*t.^2)" shows the true f together with its Chebyshev interpolant; on the right, "Approximation errors for degree 9" shows the interpolation error, whose extreme values alternate in sign across [−1, 1] and range roughly between −0.088 and 0.269.]

Figure 7.1: Chebyshev interpolation at n + 1 = 10 points of the Runge function f(x) = 1/(1 + 25x^2) for x ∈ [−1, 1]. Consider the oscillations of the error in light of De la Vallée Poussin's theorem. Compare with Figure 7.2.
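De la Vallée Poussin's theorem can be used to certify near-optimality numerically. The Python sketch below (an illustration, with np.polyfit standing in for a proper interpolation routine) reproduces the setting of Figure 7.1: it interpolates the Runge function at 10 Chebyshev nodes, keeps one point of largest error in each of the 11 subintervals delimited by the nodes, checks that the error alternates in sign there, and reports the resulting bracket on the minimax error.

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)
n = 9
k = np.arange(n + 1)
nodes = np.sort(np.cos((np.pi / 2 + k * np.pi) / (n + 1)))  # 10 Chebyshev nodes

t = np.linspace(-1.0, 1.0, 4001)
r = np.polyval(np.polyfit(nodes, f(nodes), n), t)           # degree-9 interpolant
e = f(t) - r

# In each of the 11 subintervals delimited by the nodes, keep a point of max |error|.
edges = np.concatenate(([-1.0], nodes, [1.0]))
vals = []
for lo, hi in zip(edges[:-1], edges[1:]):
    idx = np.where((t >= lo) & (t <= hi))[0]
    vals.append(e[idx[np.argmax(np.abs(e[idx]))]])
vals = np.array(vals)

assert np.all(vals[:-1] * vals[1:] < 0)     # signs alternate, so the theorem applies
lower, upper = np.min(np.abs(vals)), np.max(np.abs(e))
print(f"{lower:.4f} <= minimax error <= {upper:.4f}")
```

The gap between the two bounds shows how far the Chebyshev interpolant can be from the true minimax polynomial of the same degree.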

Now consider the difference r − pn at these special points:

r(xi ) − pn (xi ) = (r(xi ) − f (xi )) − (pn (xi ) − f (xi )) .

In inequality (7.6), we showed that, in absolute value, the first term is strictly
larger than the second term. Thus,

sign(r(xi ) − pn (xi )) = sign(r(xi ) − f (xi )) .

By assumption, the right-hand side changes sign n + 1 times as i goes from 0 to n + 1. But the left-hand side involves r − pn: a polynomial of degree at most n. If that polynomial changes sign n + 1 times, the intermediate value theorem implies it has at least n + 1 distinct roots; since it has degree at most n, this is only possible if r(x) = pn(x) for all x. This is a contradiction since r(xi) − pn(xi) ≠ 0 for all i. Alternatively, we can also notice that if r = pn, then ‖f − r‖∞ = ‖f − pn‖∞ < µ, yet by (7.5) we know ‖f − r‖∞ ≥ µ: this is also a contradiction.

Having De la Vallée Poussin's theorem at our disposal, one direction of the following equivalence theorem becomes easy to prove. An equivalence theorem is a powerful thing: it tells us that two seemingly different sets of properties (being minimax on the one hand, admitting a special sequence {x0, ..., xn+1} on the other hand) are actually equivalent. Pragmatically, this means every time we are searching for or dealing with a minimax polynomial, we can invoke those sequences without loss. The particular form of this theorem also makes it a lot easier to verify whether any given polynomial is or is not minimax: a nontrivial feat.

Theorem 7.8 (Oscillation theorem, Theorem 8.4 in [SM03]). Consider f ∈ C[a, b]. A polynomial r ∈ Pn is minimax for f on [a, b] if and only if there exist n + 2 points x0 < ··· < xn+1 in [a, b] such that

1. |f(xi) − r(xi)| = ‖f − r‖∞ for i = 0, ..., n + 1, and

2. f(xi) − r(xi) = −(f(xi+1) − r(xi+1)) for i = 0, ..., n.

Proof. The proof is in two parts:

- Sufficiency: we assume x0 < ··· < xn+1 are given and satisfy the conditions of the theorem; we aim to show r is minimax. Apply De la Vallée Poussin's theorem:

\|f - r\|_\infty = \min_{i=0,\dots,n+1} |f(x_i) - r(x_i)| \leq \min_{q \in P_n} \|f - q\|_\infty \leq \|f - r\|_\infty,

where the first inequality is De la Vallée Poussin's. The left-most and right-most quantities are the same. We conclude the inequalities must be equalities:

\|f - r\|_\infty = \min_{q \in P_n} \|f - q\|_\infty,

that is, r is minimax.

- Necessity: this is the technical part of the proof. We omit it; see the reference book for a full account.[9]

The oscillation theorem leads to uniqueness of the minimax polynomial (using the necessity part of the theorem).

Theorem 7.9 (Uniqueness, Theorem 8.5 in [SM03]). Each f ∈ C[a, b] admits a unique minimax polynomial in Pn.

Proof. The proof is by contradiction. Assume pn, qn ∈ Pn are two distinct minimax polynomials for f. Then, we argue that:

1. (pn + qn)/2 is also minimax, and

2. this implies pn, qn coincide at n + 2 distinct points, showing pn = qn.

Here are the arguments for both parts:
[9] A PDF version of [SM03] has some errors that were corrected in the print version, specifically for the construction of x0.

1. Let E = ‖f − pn‖∞ = ‖f − qn‖∞. Then,

E \leq \left\| f - \frac{p_n + q_n}{2} \right\|_\infty = \frac{1}{2} \|(f - p_n) + (f - q_n)\|_\infty \leq \frac{1}{2} \left( \|f - p_n\|_\infty + \|f - q_n\|_\infty \right) = E.

2. Since (pn + qn)/2 ∈ Pn is minimax, the oscillation theorem gives x0 < ··· < xn+1 in [a, b] such that (first property)

\left| f(x_i) - \frac{p_n(x_i) + q_n(x_i)}{2} \right| = E \quad \text{for all } i.

Multiply by 2 on each side:

\left| (f(x_i) - p_n(x_i)) + (f(x_i) - q_n(x_i)) \right| = 2E \quad \text{for all } i,

where each of the two terms is at most E in absolute value. This implies f(xi) − pn(xi) = f(xi) − qn(xi) for each i, in turn showing pn(xi) = qn(xi) for each i. Thus, the polynomial pn − qn ∈ Pn has n + 2 distinct roots: this can only happen if pn = qn.

7.2 Interpolation points to minimize the bound


We can leverage all of this new insight to determine the best interpolation points to minimize the upper bound on the approximation error we determined in Theorem 6.5. Indeed, recall that for f ∈ C[a, b] continuously differentiable n + 1 times we established

\|f - p_n\|_\infty \leq \frac{M_{n+1}}{(n+1)!} \|\pi_{n+1}\|_\infty,

where pn ∈ Pn interpolates f at x0 < ··· < xn in [a, b], Mn+1 = ‖f^(n+1)‖∞, and

\pi_{n+1}(x) = (x - x_0) \cdots (x - x_n) = x^{n+1} - q_n(x).

Above, qn ∈ Pn depends on the choice of interpolation points. Our goal is to pick x0, ..., xn so as to minimize

\|\pi_{n+1}\|_\infty = \|x^{n+1} - q_n\|_\infty.

The polynomial qn which minimizes that quantity is the minimax approximation of x^{n+1} over [a, b] in Pn.

Choosing qn to be this minimax polynomial is a valid strategy only if x^{n+1} − qn indeed has n + 1 distinct real roots in [a, b]. Fortunately, the oscillation theorem guarantees this (and more). Indeed, by the oscillation theorem, there exists a sequence of n + 2 points in [a, b] at which x^{n+1} − qn(x) attains the value ±‖x^{n+1} − qn‖∞ with alternating signs. In particular, x^{n+1} − qn(x) changes sign n + 1 times in [a, b], which means it has at least n + 1 distinct roots in [a, b]. Since x^{n+1} − qn(x) has degree n + 1, these are all of its roots.

We claim that solving the following task is sufficient to find qn (and the interpolation points) explicitly. Without loss of generality, fix [a, b] = [−1, 1].[10]

Task. Find a sequence of polynomials T0, T1, T2, ... such that

1. Tn ∈ Pn \ Pn−1 (degree exactly n),

2. ‖Tn‖∞ = 1 (on [−1, 1]),

3. Tn(x) attains the values ±1 at n + 1 points in [−1, 1], with alternating signs.

Why would that work? Write

T_n(x) = a_n x^n + \text{lower order terms},

so that an denotes the coefficient of the leading-order term in Tn (and an−1 denotes the coefficient of the leading-order term in Tn−1, etc.) Then, define

q_n(x) = x^{n+1} - \frac{1}{a_{n+1}} T_{n+1}(x).

Crucially, this qn indeed has degree n, even though it was built from two polynomials of degree n + 1. We argue qn is minimax for x^{n+1}, as desired. Indeed, consider the error function:

x^{n+1} - q_n(x) = \frac{1}{a_{n+1}} T_{n+1}(x).

[10] To work in a different interval [a, b], simply execute the affine change of variable t ↦ (a + b)/2 + ((b − a)/2) t, so that −1 is mapped to a and +1 is mapped to b.

By definition of Tn+1, the error attains the values ±1/an+1, with alternating signs, at n + 2 points; these values coincide with ±‖x^{n+1} − qn‖∞. By the oscillation theorem (the part with the easy proof), this guarantees we found a minimax. In conclusion, with

(x - x_0) \cdots (x - x_n) = \pi_{n+1}(x) = x^{n+1} - q_n(x) = \frac{1}{a_{n+1}} T_{n+1}(x),    (7.7)

we find that picking the n + 1 interpolation points as the roots of Tn+1 yields a bound on the approximation error as

\|f - p_n\|_\infty \leq \frac{M_{n+1}}{(n+1)!} \cdot \frac{1}{|a_{n+1}|},

since ‖πn+1‖∞ = ‖Tn+1/an+1‖∞ = 1/|an+1|. It remains to construct the polynomial Tn+1 and to determine its roots and an+1.

Building the Tn's. The Tn's are remarkable polynomials.[11] Let's give them a name.

Definition 7.10. The Chebyshev polynomials are defined by recurrence as:

T0 (x) = 1,
T1 (x) = x,
Tn+1 (x) = 2xTn (x) − Tn−1 (x) for n = 1, 2, . . .

(These are also called the Chebyshev polynomials of the first kind.)

We need to:

1. Check that these polynomials indeed fulfill our set task; and

2. Investigate an+1 and the roots of Tn+1.

Parts of this are straightforward from the recurrence relation. For instance, it is clear that each Tn is indeed in Pn. Furthermore, it is clear that

a_0 = 1, \quad a_1 = 1, \quad a_{n+1} = 2 a_n \text{ for } n = 1, 2, \dots

[11] See https://en.wikipedia.org/wiki/Chebyshev_polynomials for the tip of the iceberg.

Explicitly,

a_0 = 1, \quad a_{n+1} = 2^n \text{ for } n = 0, 1, \dots    (7.8)

This notably confirms each Tn is of degree exactly n. Yet, the other statements, regarding ‖Tn‖∞, alternation between ±1, and roots, are not straightforward from the recurrence. For this, we need a surprising lemma.
Lemma 7.11. For all n ≥ 0, for all x ∈ [−1, 1],

Tn (x) = cos(n arccos(x)).

Proof. The proof is by induction. Start by verifying that the statement holds
for T0 and T1 . Then, as induction hypothesis, assume

Tk (x) = cos(k arccos(x))

for k = 0, . . . , n, for all x ∈ [−1, 1]. We aim to show the same holds for Tn+1 .
To this end, recall the trigonometric identity:

2 cos u cos v = cos(u + v) + cos(u − v).

Set u = nθ and v = θ:

2 cos(nθ) cos(θ) = cos((n + 1)θ) + cos((n − 1)θ).

Define x = cos(θ) ∈ [−1, 1]. Then, θ = arccos(x) and:

2x \cos(n \arccos(x)) = \cos((n+1) \arccos(x)) + \cos((n-1) \arccos(x)),

where, by the induction hypothesis, cos(n arccos(x)) = Tn(x) and cos((n − 1) arccos(x)) = Tn−1(x). Rearranging, we find by the recurrence relation that, for all x ∈ [−1, 1],

cos((n + 1) arccos(x)) = 2xTn (x) − Tn−1 (x) = Tn+1 (x),

as desired.
This lemma makes it straightforward to continue our work. In particular,

1. It is clear that ‖Tn‖∞ = max_{x∈[−1,1]} |cos(n arccos(x))| = 1; and

2. Alternation is verified since Tn(x) = ±1 if and only if n arccos(x) = kπ for some integer k, which shows alternation at the points cos(kπ/n): for k = 0, ..., n, these are n + 1 distinct points in [−1, 1].

3. The roots are easily determined: Tn+1(x) = 0 if and only if cos((n + 1) arccos(x)) = 0 with x ∈ [−1, 1], that is, for

    x_k = cos( (π/2 + kπ) / (n + 1) ),    k = 0, . . . , n.    (7.9)

The above roots are the so-called Chebyshev nodes (of the first kind). Plot
them to confirm they are more concentrated near the edges of [−1, 1].
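To see this concretely, here is a small sketch (in Python with NumPy, an illustration choice; the course codes are in MATLAB) that builds the nodes from (7.9) and checks, via Lemma 7.11, that they are indeed roots of Tn+1:

```python
import numpy as np

def chebyshev_nodes(n):
    """The n+1 Chebyshev nodes of the first kind on [-1, 1]:
    x_k = cos((pi/2 + k*pi) / (n+1)), k = 0, ..., n  (equation (7.9))."""
    k = np.arange(n + 1)
    return np.cos((np.pi / 2 + k * np.pi) / (n + 1))

def T(n, x):
    """T_n(x) = cos(n arccos x) on [-1, 1]  (Lemma 7.11)."""
    return np.cos(n * np.arccos(np.clip(x, -1.0, 1.0)))

n = 7
x = chebyshev_nodes(n)
print(np.max(np.abs(T(n + 1, x))))  # essentially 0: the nodes are roots of T_{n+1}
print(np.round(np.sort(x), 3))      # the nodes cluster near the edges of [-1, 1]
```

Sorting the printed nodes makes the clustering toward ±1 visible directly.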

Consequences for interpolation. The machinery developed above leads


to the following concrete message regarding polynomial approximation of
sufficiently smooth functions: working with the interval [−1, 1], if pn is the
(unique) polynomial in Pn which interpolates f at the Chebyshev nodes (7.9),
    ‖f − pn‖∞ ≤ Mn+1 / (2^n · (n + 1)!),

where Mn+1 bounds |f^(n+1)| on [−1, 1].
The denominator grows fast. It takes a particularly wild smooth function
for this bound not to go to zero. Of more direct concern though, note that
this bound is only useful if f is indeed sufficiently differentiable. Simple
functions such as f (x) = |x| cause trouble of their own. You should check
it experimentally. For f (x) = sign(x) (which is not even continuous, hence
breaks the most fundamental of our assumptions), you will witness the Gibbs
phenomenon, familiar to those of you who know Fourier transforms.
Question 7.12. What happens to this bound if we use the affine change of
variable to work on [a, b] instead of [−1, 1]?


We close this investigation here (for now at least.) To state the obvious:
there is a lot more to Chebyshev nodes and Chebyshev polynomials than
meets the eye thus far. Having a look at the Wikipedia page linked above
can give you some hints of the richness of this topic. We will encounter some
of it in the next chapter, as an instance of orthogonal polynomials.

7.3 Codes and figures


The following code uses the Chebfun toolbox.12 Among many other things
relevant to this course, Chebfun notably packs ready-made functions for min-
12 http://www.chebfun.org

imax approximation (which is the main reason for using it here) and for La-
grange interpolation at Chebyshev nodes (which is used here too, but for
which we could have just as easily used our own codes based on the previous
chapter.)

%%
clear;
close all;
clc;
set(0, 'DefaultLineLineWidth', 2);
set(0, 'DefaultLineMarkerSize', 30);

%% Pick a polynomial degree: play with this parameter


n = 16;

%% Pick a function to play with and an interval [a, b]

% Notice that many functions are even or odd:


% Check carefully effect of picking n even or odd as well.

f = @(t) 1./(1+25*t.^2);
% f = @(t) sin(t).*(t.^2-2);
% f = @(t) sqrt(cos((t+1)/2));
% f = @(t) log(cos(t)+2);
% f = @(t) t.^(n+1);
% f = @abs;
% f = @sign; % Not continuous! Our theorems break down.

a = -1;
b = 1;

% fc is a representation of f in Chebfun format.
% For all intents and purposes, this is the same
% as f up to machine precision.
%
% You can get the Chebfun toolbox freely at
% http://www.chebfun.org.
fc = chebfun(f, [a, b], 'splitting', 'on');

subplot(1, 2, 1);
plot(fc, 'k-');
subplot(1, 2, 2);
plot([a, b], [0, 0], 'k-');

% Chebyshev interpolation
% Get a polynomial of degree n to approximate f on [a, b]
% through interpolation at the Chebyshev points

% (of the first kind.)

% The interpolant is here obtained through Chebfun. Manually,


% you can also use the explicit expression for the Chebyshev
% nodes together with your codes regarding Lagrange
% interpolation (barycentric formula) to evaluate the
% interpolant: it's the same.
%
% n+1 indicates we want n+1 nodes (polynomial has degree n).
% chebkind = 1 selects Chebyshev nodes of the first kind.
fn = chebfun(f, [a, b], n+1, 'chebkind', 1);

% plot the polynomial


subplot(1, 2, 1); hold all;
plot(fn);
title(['f = ' char(f)]);

% Plot the approximation error


subplot(1, 2, 2); hold all;
plot(fc-fn);

grid on;
title(sprintf('Approximation errors for degree %d', n));

%% Actual minimax (via Remez algorithm)

pn = minimax(fc, n);

% Plot minimax polynomial


subplot(1, 2, 1); hold all;
plot(pn);

% Plot minimax approximation error


subplot(1, 2, 2); hold all;
plot(fc-pn);

%% Legend
subplot(1, 2, 1);
legend('True f', 'Chebyshev interp', 'Minimax', ...
'Orientation', 'horizontal', 'Location', 'North');
legend('boxoff');
pbaspect([1.6, 1, 1]);
subplot(1, 2, 2);
legend('0', 'Chebyshev interp', 'Minimax', ...
'Orientation', 'horizontal', 'Location', 'North');
legend('boxoff');
pbaspect([1.6, 1, 1]);

Figure 7.2: Chebyshev interpolation at n + 1 = 16 points compared to minimax approximation of degree n = 15 of the Runge function f(x) = 1/(1 + 25x²) for x ∈ [−1, 1], computed with Chebfun (Remez algorithm.) The Chebyshev approximation is pretty good compared to the optimal minimax, yet is much simpler to compute. If you only had the Chebyshev polynomial, what could you deduce about the minimax from De la Vallée Poussin's theorem?

Figure 7.3: Same as Figure 7.2 but with n = 16. Since f is even, the minimax
polynomial has 19 alternation points rather than only 18 (think about it). If
you only had the Chebyshev polynomial, what could you deduce about the
minimax from De la Vallée Poussin’s theorem?

Figure 7.4: The first few Chebyshev polynomials. Picture credit: https://en.wikipedia.org/wiki/Chebyshev_polynomials#/media/File:Chebyshev_Polynomials_of_the_First_Kind.svg.

Figure 7.5: Chebyshev nodes are more concentrated toward the ends of the interval, though they come from evenly spaced points on the upper half of a circle. Picture credit: https://en.wikipedia.org/wiki/Chebyshev_nodes#/media/File:Chebyshev-nodes-by-projection.svg.
Chapter 8

Approximation in the 2-norm

Given a function f on [a, b] (conditions to be specified later), we consider the


problem of computing pn , a solution to

    min_{p ∈ Pn} ‖f − p‖2 ,

where ‖ · ‖2 is a 2-norm defined below. One key difference with the minimax
approximation problem from the previous chapter is that the 2-norm is in-
duced by an inner product (this is not so for the ∞-norm): this will make
our life a lot easier.
To begin this chapter, we:

1. Discuss inner products and 2-norms over spaces of functions;

2. Derive the solution to the 2-norm approximation problem, securing


existence and uniqueness as a byproduct;

3. Argue that the optimal polynomial pn really is just the orthogonal


projection of f to Pn ; and

4. Delve into systems of orthogonal polynomials, which ease such projec-


tions.

For the first three elements at least, the math looks very similar to things you
have already learned (including in this course) about least squares problems.
The crucial point is: we are now working in an infinite dimensional space.
It is important to go through the steps carefully from first principles, and to
keep our intuitions in check.

Remark 8.1. Take a moment to reflect on this: now and for the last couple
of chapters, functions are often considered as vectors. As strange as this may


sound, remember this: by definition, a vector is nothing but an element of


a vector space. The most common vector space being Rn , it is only natural
that we tend to forget the more general definition, and conflate in our mind
the notion of vector with that of a “column of numbers”. It is worth it to
take a step back and consider what it means that C[a, b] (for example) is a
vector space, and what it means for a function f ∈ C[a, b] to be a vector in
that space. Also think about how this abstraction allows you to use familiar
diagrams such as planes and arrows to represent subspaces of C[a, b] (Pn for
example) and functions themselves.

8.1 Inner products and 2-norms


Definition 8.2 (Def. 9.1 in [SM03]). Let V be a linear space over R. A real-valued function ⟨·, ·⟩ : V × V → R is an inner product on V if

1. ⟨f, g⟩ = ⟨g, f⟩ ∀f, g ∈ V ;

2. ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩ ∀f, g, h ∈ V ;

3. ⟨λf, g⟩ = λ ⟨f, g⟩ ∀λ ∈ R, ∀f, g ∈ V ; and

4. ⟨f, f⟩ > 0 ∀f ∈ V, f ≠ 0.

A linear space with an inner product is called an inner product space.

Definition 8.3 (Def. 9.2 in [SM03]). For f, g ∈ V , if ⟨f, g⟩ = 0 we say f and g are orthogonal (to one another).

Example 8.4. If V = Rn (finite dimensional), the most common inner product by far is:

    ⟨x, y⟩ = Σ_{k=1}^{n} x_k y_k = x^T y.

Generally, for any symmetric and (strictly) positive definite W ∈ Rn×n ,

    ⟨x, y⟩ = x^T W y

is an inner product (check the properties above.)

Example 8.5. For our purpose, the more interesting case is V = C[a, b],
the space of continuous functions on [a, b]. On that space, given a positive,

continuous and integrable weight function w : (a, b) → R, the following is an inner product (check the properties):

    ⟨f, g⟩ = ∫_a^b w(x) f(x) g(x) dx.    (8.1)
Notice that w is allowed to take on infinite values at a, b, provided it remains
integrable. This will be important later.
Definition 8.6 (Def. 9.3 in [SM03]). An inner product leads to an induced norm (usually called the 2-norm):

    ‖f‖ = √⟨f, f⟩.

Note: it is not trivial that this is indeed a norm (remember, we are in infinite dimensions now; things you learned in finite dimension may not apply.) See [SM03, Lemma 9.1 and Thm. 9.1] for a complete argument based on Cauchy–Schwarz (in infinite dimensions).
Continuing the previous example, we have the following 2-norm over C[a, b]:

    ‖f‖2 = ( ∫_a^b w(x) (f(x))² dx )^{1/2}.    (8.2)

From these expressions for the inner product and the norm, it becomes clear
that we do not actually need to restrict ourselves to continuous functions f ,
as was the case for the ∞-norm. We only need to make sure all integrals
that occur are well defined. This is the case over the following linear space:
    L2w(a, b) = { f : [a, b] → R : w(x)(f(x))² is integrable over (a, b) }.    (8.3)


Question 8.7. Verify L2w (a, b) is indeed a linear space, and show that it
contains C[a, b] strictly (that is, it contains strictly more than C[a, b].)
We are finally in a good position to frame the central problem of this
chapter.
Problem 8.8. For a given weight function w on (a, b) and a given function f ∈ L2w(a, b), find pn ∈ Pn such that ‖f − pn‖2 ≤ ‖f − q‖2 for all q ∈ Pn .
We show momentarily that the solution to this problem exists and is
unique. It is called the polynomial of best approximation of degree n to f in
the 2-norm on (a, b). Since polynomials are dense in C[a, b] for the 2-norm as
we discussed in the previous chapter, it is at least the case that for continuous
f , as n → ∞, the approximation becomes arbitrarily good. We won’t discuss
rates of approximation here (that is, how fast the approximation error goes
to zero as n increases.)

8.2 Solving the approximation problem


Our unknown is a polynomial of degree at most n. As usual, the first step is
to pick a basis:

Let q0 , . . . , qn form a basis for Pn .

Equipped with this basis, the unknowns are simply the coefficients c0 , . . . , cn forming the vector c ∈ Rn+1 such that

    pn = c0 q0 + · · · + cn qn = Σ_{k=0}^{n} ck qk .    (8.4)

We wish to pick c so as to minimize ‖f − pn‖2 . Equivalently, we want to minimize:

    h(c) = ‖f − pn‖2² = ⟨f − pn , f − pn⟩
         = ⟨f, f⟩ + ⟨pn , pn⟩ − 2 ⟨f, pn⟩
         = ‖f‖2² + ⟨Σ_k ck qk , Σ_ℓ cℓ qℓ⟩ − 2 ⟨f, Σ_k ck qk⟩
         = ‖f‖2² + Σ_k Σ_ℓ ck cℓ ⟨qk , qℓ⟩ − 2 Σ_k ck ⟨f, qk⟩ .

Introduce the matrix M ∈ R(n+1)×(n+1) and the vector b ∈ Rn+1 defined by:

    Mkℓ = ⟨qk , qℓ⟩ ,    bk = ⟨f, qk⟩ .

Then,

    h(c) = ‖f − pn‖2² = ‖f‖2² + Σ_k Σ_ℓ ck cℓ Mkℓ − 2 Σ_k ck bk
         = ‖f‖2² + c^T M c − 2 b^T c.

This is a quadratic expression in c. How do we minimize it? Let’s consider a


simpler problem first: how to minimize a quadratic in a single variable? Let

    h(x) = ax² + bx + c.

The recipe is well known: if h''(x) = 2a > 0 (the function is convex rather than concave, to make sure there indeed is a minimizer), then the unique minimizer of h is such that h'(x) = 2ax + b = 0, that is, x = −b/(2a). This
recipe generalizes.
8.2. SOLVING THE APPROXIMATION PROBLEM 147

Lemma 8.9. Let h : Rn → R be a smooth function. If ∇²h(x) ≻ 0 for all x, then the unique minimizer of h is the solution of ∇h(x) = 0, where

    (∇h(x))_k = ∂h/∂x_k (x),    (∇²h(x))_{k,ℓ} = ∂²h/(∂x_k ∂x_ℓ) (x).
Proof. See chapter about optimization.
In our case, h : Rn+1 → R is defined by

    h(c) = ‖f − pn‖2² = Σ_k Σ_ℓ Mkℓ ck cℓ − 2 Σ_k bk ck + ‖f‖2² .

We need to get the gradient and Hessian. Formally, using the simple rule

    ∂ci /∂cj = δij = 1 if i = j, and 0 otherwise,

we get:

    (∇h(c))_j = ∂h/∂cj (c) = Σ_k Σ_ℓ Mkℓ ∂(ck cℓ)/∂cj − 2 Σ_k bk ∂ck /∂cj + 0
              = Σ_k Σ_ℓ Mkℓ (δkj cℓ + ck δℓj ) − 2 Σ_k bk δkj
              = Σ_ℓ Mjℓ cℓ + Σ_k Mkj ck − 2bj
              = (M c)_j + (M^T c)_j − 2bj
              = (2M c − 2b)_j .

In the last equality, we used that M = M^T since Mkℓ = ⟨qk , qℓ⟩ and inner products are symmetric. Compactly:

    ∇h(c) = 2(M c − b).

The Hessian is similarly straightforward to obtain:

    (∇²h(c))_{ij} = ∂/∂ci ( ∂h/∂cj (c) )
                  = ∂/∂ci ( Σ_ℓ Mjℓ cℓ + Σ_k Mkj ck − 2bj )
                  = Mji + Mij .

Hence,

    ∇²h(c) = 2M.

Lemma 8.9 indeed applies in our case since M ≻ 0.

Question 8.10. Show that M ≻ 0.

Consequently, we find that solutions to our problem satisfy ∇h(c) = 0,


that is,

M c = b. (8.5)

Since M is positive definite, it is in particular invertible. This confirms


existence and uniqueness of the solution to Problem 8.8. (This covers the
contents of [SM03, Thm. 9.2] with a different proof.)
What does this mean in practice? To solve a specific problem, one must

1. Pick a basis q0 , . . . , qn of Pn ;

2. Compute M via Mkℓ = ⟨qk , qℓ⟩ : this requires computing ∼ n² integrals,


once and for all for a given w—they can be stored;

3. Compute b via bk = ⟨f, qk⟩ : this requires computing n + 1 integrals, for


each given f ;

4. Solve the linear system M c = b (this can be simplified by preprocessing


M , most likely via Cholesky factorization since M ≻ 0—this can be
stored instead of M .)

The solution is then pn (x) = c0 q0 (x) + · · · + cn qn (x), which is easy to evaluate


if the basis polynomials are easy to evaluate. Computing the integrals is most
often not doable by hand: in future lectures, we discuss how to compute them
numerically. A more pressing problem is that M might be poorly conditioned
(see example below.) We need to address that.
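The four steps above can be sketched as follows (Python with scipy.integrate.quad standing in for the numerical integration discussed in later lectures; the weight w(x) = 1 on [0, 1] and the monomial basis are illustration choices, not from the text):

```python
import numpy as np
from scipy.integrate import quad

# Steps 1-4, sketched for w(x) = 1 on [0, 1] with the monomial basis
# q_k(x) = x^k. (Caution: this basis yields the badly conditioned
# Hilbert matrix discussed below, so keep n tiny.)
lo, hi, n = 0.0, 1.0, 2
f = lambda x: x**2                      # example f; it already lies in P_2
q = [lambda x, k=k: x**k for k in range(n + 1)]

# Step 2: M_{kl} = <q_k, q_l>; Step 3: b_k = <f, q_k>.
M = np.array([[quad(lambda x: q[k](x) * q[l](x), lo, hi)[0]
               for l in range(n + 1)] for k in range(n + 1)])
b = np.array([quad(lambda x: f(x) * q[k](x), lo, hi)[0]
              for k in range(n + 1)])

# Step 4: solve M c = b for the coefficients of p_n in the basis.
c = np.linalg.solve(M, b)
print(np.round(c, 6))  # close to [0, 0, 1]: p_2 = x^2 = f, as expected
```

Since f here is itself in P_2 , the best approximation reproduces it exactly, which makes a convenient correctness check.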

8.3 A geometric viewpoint


Best approximation in the 2-norm is (weighted) least squares approximation:

    Minimize: ‖f − pn‖2² = ∫_a^b w(x) (f(x) − pn(x))² dx.

Just as in the finite dimensional case, the solution here also can be interpreted
as the orthogonal projection of f to Pn , where orthogonal is defined with

respect to the chosen inner product. To confirm equivalence, consider this


chain of if and only if's:

    pn = Σ_{k=0}^{n} ck qk is the orthogonal projection of f to Pn
    ⇐⇒ f − pn is orthogonal to Pn
    ⇐⇒ ⟨f − pn , q⟩ = 0 for all q ∈ Pn
    ⇐⇒ ⟨f − pn , qℓ⟩ = 0 for ℓ = 0, . . . , n
    ⇐⇒ ⟨f, qℓ⟩ = ⟨pn , qℓ⟩ for ℓ = 0, . . . , n
    ⇐⇒ bℓ = Σ_{k=0}^{n} Mℓk ck for ℓ = 0, . . . , n
    ⇐⇒ bℓ = (M c)ℓ for ℓ = 0, . . . , n
    ⇐⇒ b = M c.
This shows that there exists a unique orthogonal projection pn of f to Pn ,
and that the coefficients of pn in the basis q0 , . . . , qn are given by c: the
unique solution to M c = b. In the last section, we saw that M c = b also
characterizes the best 2-norm approximation of f in Pn , so that overall we
find that, indeed, the orthogonal projection of f to Pn is the best 2-norm
approximation of f in Pn .
This last part is established with a different proof in the book. Specifi-
cally, using Cauchy–Schwarz in infinite dimension, it is first proved that the
orthogonal projection is optimal. Then, it is proved that the optimum is
unique by different means. Because it is often useful, we include a proof of
Cauchy–Schwarz here.
Lemma 8.11 (Cauchy–Schwarz). For a given inner product space V , for any two vectors f, g ∈ V , it holds that

    |⟨f, g⟩| ≤ ‖f‖2 ‖g‖2 .

Proof. Since ‖ · ‖2 is a norm, for any t ∈ R we have:

    0 ≤ ‖tf + g‖2² = t² ‖f‖2² + 2t ⟨f, g⟩ + ‖g‖2² .

This holds in particular for t = −⟨f, g⟩/‖f‖2² . (Notice that the right hand side is a convex quadratic in t: here, we choose t to be the minimizer of that quadratic, which yields the strongest possible inequality.) Plugging in this value of t we find:

    0 ≤ (⟨f, g⟩²/‖f‖2⁴) ‖f‖2² − 2 (⟨f, g⟩/‖f‖2²) ⟨f, g⟩ + ‖g‖2² .

Reorganizing yields

    ⟨f, g⟩² ≤ ‖f‖2² ‖g‖2² .

Take the square root to conclude.

8.4 What could go wrong?


To fix ideas, let us pick [a, b] = [0, 1]. The most obvious choice for the weight
function is w(x) = 1 for all x in [0, 1]. The most obvious choice of basis is
the monomial basis: qk(x) = x^k. In that case, the matrix M can be obtained analytically:

    Mkℓ = ⟨qk , qℓ⟩ = ∫_0^1 x^k x^ℓ dx = ∫_0^1 x^{k+ℓ} dx = 1/(k + ℓ + 1).

This matrix is known as the Hilbert matrix, and it is plain terrible. The code below reveals that already for n = 11, the condition number is κ(M) ≈ 10^16. In other words: upon solving M c = b in double precision, we cannot expect any digit to be correct for n ≥ 11.

%% Hilbert matrix

% Matrix M corresponding to best 2-norm approximation with


% respect to the weight w(x) = 1 over the interval [0, 1] with
% the monomial basis 1, x, x^2, ..., x^n.
% Conditioning is terrible.

n = 11;
M = zeros(n+1, n+1);
for k = 0 : n
for l = 0 : n
M(k+1, l+1) = 1/(k+l+1);
end
end
% Equivalently, M = hilb(n+1);

cond(M)
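For reference, the same experiment runs in a couple of lines of Python (an illustration choice; scipy.linalg.hilbert builds exactly the matrix Mkℓ = 1/(k + ℓ + 1) above):

```python
import numpy as np
from scipy.linalg import hilbert

# Condition number of the (n+1) x (n+1) Hilbert matrix for n = 11,
# matching the MATLAB experiment above.
n = 11
M = hilbert(n + 1)
print(f"cond(M) = {np.linalg.cond(M):.2e}")  # on the order of 1e16
```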

8.5 Orthogonal polynomials


The weight function w is typically tied to the application: we must consider
it a given. Same goes for f , of course. Thus, our only leeway is in the choice

of basis q0 , . . . , qn . We can espouse two different viewpoints which lead to


the same conclusion:

Algebraic viewpoint: Solving M c = b should be easy: choose qk ’s such


that M is diagonal;

Geometric viewpoint: Orthogonal projections to Pn should be easy:


choose qk ’s to form an orthogonal basis.

Both considerations lead to the requirement ⟨qk , qℓ⟩ = 0 for all k ≠ ℓ.1


One key point is: we do not want just q0 , . . . , qn to form an orthogonal
basis of Pn : we want an infinite sequence of polynomials such that the first
n + 1 of them form such a basis for Pn , for any n.

Definition 8.12 ([SM03, Def. 9.4]). The sequence of polynomials φ0 , φ1 , φ2 , . . .


is a system of orthogonal polynomials on (a, b) with respect to the weight
function w if

1. Each φk has degree exactly k,

2. ⟨φk , φℓ⟩ = 0 if k ≠ ℓ, and

3. ⟨φk , φk⟩ ≠ 0.

If the basis q0 , . . . , qn is chosen as φ0 , . . . , φn , solving the best approximation problem is straightforward since M is diagonal:

    ck = bk / Mkk = ⟨f, φk⟩ / ⟨φk , φk⟩ .

The denominator can be precomputed and stored. The numerator requires


an integral which can be evaluated numerically (more on that later.) Fur-
thermore, if the best approximation in Pn is unsatisfactory, then computing
the best approximation in Pn+1 only requires computing one additional coef-
ficient, cn+1 . This is in stark contrast to previous approximation algorithms
we have encountered.
This begs the question: how does one construct systems of orthogonal
polynomials?
1 For reasons that will become clear, we do not insist on forming an orthonormal basis for now.

8.5.1 Gram–Schmidt
Given a weight function w and a basis of polynomials q0 , q1 , q2 , . . . such that
qk has degree exactly k, we can apply the Gram–Schmidt procedure to figure
out a system of orthogonal polynomials.2 Concretely,

1. Let φ0 = q0 ; then

2. Assuming φ0 , . . . , φn are constructed, build φn+1 ∈ Pn+1 as:

    φn+1 = qn+1 − an φn − · · · − a0 φ0 ,

where the coefficients a0 , . . . , an must be chosen so that ⟨φn+1 , φk⟩ = 0 for k = 0, . . . , n. That is,

    0 = ⟨φn+1 , φk⟩ = ⟨qn+1 , φk⟩ − ak ⟨φk , φk⟩ .

Hence,

    ak = ⟨qn+1 , φk⟩ / ⟨φk , φk⟩ .

Computing the coefficients ak involves integrals which may or may not be


computable by hand. Even if this is figured out, a more important issue
remains: evaluating the polynomial φn at some point x requires ∼ n² flops
on top of evaluating q0 (x), . . . , qn (x) (think about it.) This is expensive. We
can do better.

Question 8.13. Orthogonalize the monomial basis 1, x, x2 , x3 , . . . with re-


spect to the weight w(x) = 1 over [−1, 1]. The resulting system of polynomials
(upon normalizing such that φk (1) = 1 for all k) is known as the Legendre
polynomials. (Just compute the first few polynomials to get the idea.)
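As a cross-check on this procedure (and a peek at the answer to Question 8.13), here is a Python sketch of classical Gram–Schmidt on the monomials, with polynomials stored as NumPy coefficient arrays and inner products computed by scipy.integrate.quad; NumPy/SciPy are illustration choices:

```python
import numpy as np
from numpy.polynomial import polynomial as P
from scipy.integrate import quad

def inner(p, q):
    """<p, q> with w(x) = 1 on [-1, 1]; p, q are coefficient arrays
    (p[k] multiplies x^k, numpy's convention)."""
    return quad(lambda x: P.polyval(x, p) * P.polyval(x, q), -1, 1)[0]

phis = []
for k in range(4):
    qk = np.zeros(k + 1)
    qk[k] = 1.0                      # the monomial x^k
    for phi in phis:                 # subtract projections onto phi_0, ..., phi_{k-1}
        a = inner(qk, phi) / inner(phi, phi)
        qk = P.polysub(qk, a * phi)
    phis.append(qk)

for phi in phis:
    # normalize so that phi(1) = 1, as in the statement of the question
    print(np.round(phi / P.polyval(1.0, phi), 3))
```

The normalized output matches the first few Legendre polynomials: 1, x, (3x² − 1)/2, (5x³ − 3x)/2.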

8.5.2 A look at the Chebyshev polynomials


We actually already encountered a system of orthogonal polynomials. In-
deed, the Chebyshev polynomials (Definition 7.10) are orthogonal on [−1, 1]
2 We won't use Gram–Schmidt in practice, so there is no need to worry about numerical stability in the face of inexact arithmetic: classical Gram–Schmidt will do.

with respect to the weight w(x) = 1/√(1 − x²). This weight function puts more emphasis on accurate approximation near ±1.3


To verify orthogonality, recall

    Tn(x) = cos(n arccos(x))

for x ∈ [−1, 1], so that

    Mk,ℓ = ⟨Tk , Tℓ⟩ = ∫_{−1}^{1} Tk(x) Tℓ(x) w(x) dx
         = ∫_{−1}^{1} cos(k arccos(x)) cos(ℓ arccos(x)) (1/√(1 − x²)) dx
         = ∫_{π}^{0} cos(kθ) cos(ℓθ) (1/|sin θ|) (− sin θ) dθ    (change of variable: x = cos θ)
         = ∫_{0}^{π} cos(kθ) cos(ℓθ) dθ
         = (1/2) ∫_{0}^{π} cos((k − ℓ)θ) + cos((k + ℓ)θ) dθ
         = 0 if k ≠ ℓ;    π if k = ℓ = 0;    π/2 if k = ℓ ≠ 0.

This specifies M entirely. Furthermore,

    bk = ⟨f, Tk⟩ = ∫_{−1}^{1} f(x) cos(k arccos(x)) dx/√(1 − x²) = ∫_{0}^{π} f(cos θ) cos(kθ) dθ.

Hence,

    pn = c0 T0 + · · · + cn Tn

is the best 2-norm approximation to f over [−1, 1] with respect to the weight w(x) = 1/√(1 − x²) if and only if

    c0 = b0 /M00 = (1/π) ∫_0^π f(cos θ) dθ,
    ck = bk /Mkk = (2/π) ∫_0^π f(cos θ) cos(kθ) dθ,    for k = 1, 2, . . .

(This is closely related to taking a Fourier transform of f ◦ cos.)
3 It is now clear why we needed to allow w to take on infinite values at the extreme points of the interval.
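These coefficient formulas are easy to evaluate numerically. A Python sketch (scipy.integrate.quad is an illustration choice; the example function is hypothetical), using f(x) = x², for which the closed-form answer x² = (T0 + T2)/2 gives a built-in check:

```python
import numpy as np
from scipy.integrate import quad

# Chebyshev coefficients of the best weighted 2-norm approximation,
# computed from the integral formulas above. For f(x) = x^2 we expect
# x^2 = (T_0 + T_2)/2, i.e. c = [1/2, 0, 1/2].
f = lambda x: x**2
n = 2
c = np.empty(n + 1)
c[0] = quad(lambda t: f(np.cos(t)), 0, np.pi)[0] / np.pi
for k in range(1, n + 1):
    c[k] = 2 / np.pi * quad(lambda t, k=k: f(np.cos(t)) * np.cos(k * t),
                            0, np.pi)[0]
print(np.round(c, 6))  # close to [0.5, 0, 0.5]
```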

8.5.3 Three-term recurrence relations


The Chebyshev polynomials are orthogonal, and they obey a three-term
recurrence relation. This is not an accident. We now show that all systems
of orthogonal polynomials obey such a recurrence.
This notably means that it is only necessary to figure out the coefficients
of this recurrence to obtain a cheap and reliable way of computing the poly-
nomials. This is highly preferable to the Gram–Schmidt procedure. The
theorem below gives explicit formulas for these coefficients that can be eval-
uated in practice for a given weight w. For various important weights, the
recurrence coefficients can also be looked up in a book.

Theorem 8.14. Let φ0 , φ1 , φ2 . . . be a system of orthogonal polynomials with


respect to some weight w on (a, b) and such that each polynomial is monic, that is, φk(x) = x^k + lower order terms (the leading coefficient is 1.) Then,

    φ0(x) = 1,
    φ1(x) = x − α0 ,
    φk+1(x) = (x − αk) φk(x) − βk² φk−1(x),    for k = 1, 2, . . .    (8.6)

and

    αk = ⟨xφk , φk⟩ / ‖φk‖2² ,    for k = 0, 1, 2, . . .
    βk² = ‖φk‖2² / ‖φk−1‖2² ,    for k = 1, 2, . . .

Before we get to the proof, a couple remarks:

1. If the coefficients αk , βk² are known (precomputed and stored), then


evaluating φ0 , . . . , φn at a given x requires ∼ 4n flops. (Check it: you
first compute φ0 (x) and φ1 (x) (for 0 and 1 flop respectively), then
unroll the recurrence to successively compute φ2 (x), . . . , φn (x) for 4
flops each.)

2. This theorem also shows the system of orthogonal polynomials with


respect to w is unique up to scaling—the scaling is fixed by requiring
them to be monic.

Proof. φ0(x) = 1 is the only monic polynomial of degree 0, hence this is the only possibility. The polynomial φ1(x) must be of the form x − α0 for some α0 : imposing 0 = ⟨φ1 , φ0⟩ = ⟨x − α0 , 1⟩ gives α0 = ⟨x, 1⟩/⟨1, 1⟩ as prescribed.

Now consider k ≥ 1. Since the φk 's are monic,

    φk(x) = x^k + "some polynomial of degree ≤ k − 1", and
    φk+1(x) = x^{k+1} + "some polynomial of degree ≤ k".

Hence, xφk − φk+1 is a polynomial of degree at most k and we can write:

    xφk − φk+1 = c0 φ0 + · · · + ck φk .    (8.7)

Our primary goal is to show that only ck and ck−1 are nonzero. To this end, first take inner products of (8.7) with φℓ for some ℓ:

    ⟨xφk , φℓ⟩ − ⟨φk+1 , φℓ⟩ = Σ_{j=0}^{k} cj ⟨φj , φℓ⟩ .

For ℓ ≤ k, orthogonality leads to:

    ⟨xφk , φℓ⟩ = cℓ ⟨φℓ , φℓ⟩ .    (8.8)

For ℓ = k, we find

    ck = ⟨xφk , φk⟩ / ‖φk‖2² = αk .    (8.9)

For ℓ = k − 1, equation (8.8) gives:

    ck−1 = ⟨xφk , φk−1⟩ / ‖φk−1‖2² .

We can simplify this further, using a very special property:

    ⟨xφk , φk−1⟩ = ∫_a^b x φk(x) φk−1(x) w(x) dx = ⟨φk , xφk−1⟩ .

Think about this last equation: it's rather special. We exploit it as follows:

    xφk−1 = x^k + "some polynomial of degree ≤ k − 1"
          = φk + q,    where q ∈ Pk−1 .

We can expand q in the basis φ0 , . . . , φk−1 :

    q = a0 φ0 + · · · + ak−1 φk−1 .

Since φk is orthogonal to φ0 , . . . , φk−1 , it follows that φk is orthogonal to q as well4 and

    ⟨xφk , φk−1⟩ = ⟨φk , xφk−1⟩ = ⟨φk , φk + q⟩ = ⟨φk , φk⟩ .

Thus,

    ck−1 = ‖φk‖2² / ‖φk−1‖2² = βk² .    (8.10)

Finally, consider (8.8) for ℓ ≤ k − 2. Then,

    ⟨xφk , φℓ⟩ = ⟨φk , xφℓ⟩ = 0,

since xφℓ is a polynomial of degree ℓ + 1 ≤ k − 1: it is orthogonal to φk by the same argument as above. Hence,

    c0 = · · · = ck−2 = 0.    (8.11)

Collect the numbered equations to conclude.


Remark 8.15. A converse of this theorem exists; see the Shohat–Favard theorem: https://en.wikipedia.org/wiki/Favard%27s_theorem.
Question 8.16. Assume w is an even weight function over the interval
[−1, 1]. In the notation of the above theorem, show φk is even if k is even
and φk is odd if k is odd, and show the coefficients αk are all zero. (The two
are best shown together, as one implies the other, in alternation.)


Question 8.17. Using the expression Tn(x) = cos(n arccos(x)) for Chebyshev polynomials and the fact that they are orthogonal with respect to w(x) = 1/√(1 − x²), use the theorem above to recover the recurrence relation for the Tn 's.
(Be mindful of normalization: Tn is not monic.)


Remark 8.18. For Legendre polynomials (w(x) = 1, x ∈ [−1, 1]), explicit


formulas are known:

    αk = 0,    βk² = k² / (4k² − 1).

4 That is: φk is orthogonal to Pk−1 . This is important. We're going to use this again.

Likewise, for Chebyshev polynomials:

    αk = 0,    βk² = 1/2 if k = 1,    1/4 if k ≥ 2.

In both cases, the recurrence provides the orthogonal polynomials scaled to be


monic. Notice how these coefficients are quite close (consider the asymptotics
for k → ∞.)
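To see remark 1 about the recurrence in action, here is a Python sketch (an illustration choice) evaluating the monic Legendre polynomials via (8.6), with the coefficients αk = 0 and βk² = k²/(4k² − 1) quoted above:

```python
import numpy as np

def monic_legendre(n, x):
    """Evaluate the monic Legendre polynomials phi_0, ..., phi_n at x via
    the three-term recurrence (8.6), using alpha_k = 0 and
    beta_k^2 = k^2 / (4k^2 - 1) from Remark 8.18 (a few flops per step)."""
    phi = [np.ones_like(x), np.asarray(x, dtype=float)]  # phi_0 = 1, phi_1 = x
    for k in range(1, n):
        beta2 = k**2 / (4 * k**2 - 1)
        phi.append(x * phi[k] - beta2 * phi[k - 1])
    return phi[: n + 1]

x = np.linspace(-1, 1, 5)
phi = monic_legendre(3, x)
print(np.round(phi[2], 4))  # matches x^2 - 1/3
print(np.round(phi[3], 4))  # matches x^3 - (3/5) x
```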

8.5.4 Roots of orthogonal polynomials


Polynomials orthogonal for an inner product over (a, b) have distinct roots
in (a, b).

Theorem 8.19 (Thm. 9.4 in [SM03]). Let φ0 , φ1 , φ2 , . . . form a system of


orthogonal polynomials with respect to the weight w on (a, b). For j ≥ 1, the
j roots of φj are real and distinct and lie in (a, b).

Proof. Assume ξ1 , . . . , ξk are the k points in (a, b) where φj changes sign.5 There is at least one such point. Indeed, φj is orthogonal to φ0(x) = 1 so that

    0 = ⟨φj , φ0⟩ = ∫_a^b φj(x) w(x) dx,

which implies φj changes sign at least once on (a, b). (Here, we used that φj
is not identically zero, that w is positive, and that both are continuous on
(a, b).) Define

πk (x) = (x − ξ1 ) · · · (x − ξk ).

Then, the product φj(x) πk(x) no longer changes sign in (a, b). This implies

    0 ≠ ∫_a^b φj(x) πk(x) w(x) dx = ⟨φj , πk⟩ .

Now, πk is a polynomial of degree k, and φj is orthogonal to all polynomials


of degree up to j − 1, so that it must be that πk has degree at least j: k ≥ j.
On the other hand, φj cannot change sign more than j times since it has
degree j. Thus, k = j, as desired: the points ξ1 , . . . , ξj are (all) the roots of
φj , distinct in (a, b).
5 Thus, double roots, quadruple roots etc. do not count.

In the chapters about integration, we will see that the roots of orthogo-
nal polynomials are particularly appropriate to design numerical integration
schemes (Gauss quadrature rules.) How can we compute these roots? As
it turns out, they are the eigenvalues of a tridiagonal matrix.6 This means
any of our fast and reliable algorithms to compute eigenvalues of tridiagonal
matrices can be used here. But more on this in the next chapter, about
integration.
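As a preview, here is a hedged Python sketch for the Legendre case. The text does not specify the tridiagonal matrix; the version below is the symmetric Jacobi matrix of the standard Golub–Welsch construction (an assumption, not from these notes): zero diagonal since αk = 0, and off-diagonal entries βk = k/√(4k² − 1) from Remark 8.18.

```python
import numpy as np

# Symmetric tridiagonal (Jacobi) matrix whose eigenvalues are the
# roots of the degree-n orthogonal polynomial; Legendre coefficients
# alpha_k = 0, beta_k = k / sqrt(4k^2 - 1) assumed from Remark 8.18.
n = 5
beta = np.array([k / np.sqrt(4 * k**2 - 1) for k in range(1, n)])
J = np.diag(beta, 1) + np.diag(beta, -1)
roots = np.sort(np.linalg.eigvalsh(J))
print(np.round(roots, 6))
# Cross-check against numpy's Gauss-Legendre nodes (the same points):
print(np.round(np.polynomial.legendre.leggauss(n)[0], 6))
```

Both print statements produce the same five points, all strictly inside (−1, 1), consistent with Theorem 8.19.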

8.5.5 Differential equations & orthogonal polynomials


If you go to the Wikipedia pages for Legendre polynomials7 and Chebyshev
polynomials,8 you will find that the defining property put forward for both of
these is that they are the solutions to a certain differential equation. We took
a very different route, but evidently that perspective is important enough to
take center stage in other contexts. This side comment aims to give a high
level sense of how these equations are related to our polynomials.
You certainly know the spectral theorem: it says that a symmetric matrix
A ∈ Rn×n admits n real eigenvalues, and that associated eigenvectors form
an orthogonal basis for Rn . We can think of Rn as a vector space (of finite
dimension), and choose the standard inner product ⟨x, y⟩ = x^T y. And we can think of A as a linear operator: A : Rn → Rn . A matrix is symmetric if A = A^T. You can check that this is equivalent to stating that ⟨x, Ay⟩ = ⟨Ax, y⟩ for all x, y ∈ Rn . An operator with the latter property is called self-adjoint.
There exists an equivalent of the spectral theorem for (compact) linear
operators on Hilbert spaces, that is, inner product spaces (possibly infinite
dimensional) that are complete for the metric induced by the chosen inner
product.9 For example, L2w (a, b) is a Hilbert space for the inner products
we have encountered here. A linear operator A : L2w (a, b) → L2w (a, b) maps
functions to functions in that space. For example, A could be a differential
operator (differentiating a function, under some conditions, gives another
function.) There exists a notion of eigenvalue and corresponding eigenfunc-
tion (the equivalent of an eigenvector) for A: solutions of Af = λf . If A
is self-adjoint (in the sense that ⟨Af, g⟩ = ⟨f, Ag⟩, where f, g are two functions) and if it is compact, then the spectral theorem says λ's are real and,
crucially, the eigenfunctions are orthogonal for the chosen inner product.
6 This shouldn't be too surprising; in our work about Sturm sequences, we reduced the problem of computing eigenvalues of tridiagonal matrices to that of computing the roots of polynomials expressed via a three-term recurrence relation.
7 https://en.wikipedia.org/wiki/Legendre_polynomials
8 https://en.wikipedia.org/wiki/Chebyshev_polynomials
9 https://en.wikipedia.org/wiki/Compact_operator_on_Hilbert_space

How is this related to Legendre and Chebyshev polynomials? Take another look at the differential equations that appear at the top of their Wikipedia pages: they are of the form "some differential operator applied to f = some multiple of f". In other words: the orthogonal polynomials are the eigenfunctions of that particular (compact) linear differential operator on a Hilbert space of functions, for the corresponding inner product (that is, the choice of weight function w on (a, b)).
This is all part of a more general theory of Sturm–Liouville operators,10
which provide a way to produce orthogonal systems of eigenfunctions on a
large class of Hilbert spaces. The tip of a beautiful iceberg.

[10] https://en.wikipedia.org/wiki/Sturm%E2%80%93Liouville_theory
Chapter 9

Integration

We consider the problem of computing integrals.

Problem 9.1. Given f ∈ C[a, b], compute ∫_a^b f(x) dx.

For most functions f, this integral cannot be computed analytically. Think for example of f(x) = e^{−x²} (whose integrals come up frequently in probability computations involving Gaussians), of f(x) = log(2 + cos(5e^x − sin(x))), and of cases where f(x) is the result of a complex computation that may not even be known to us (black box model.) Thus, we resort to numerical algorithms to approximate it.
Informed by the previous chapters, our main strategy is to approximate f with a polynomial p_n in P_n, and to integrate p_n instead of f. Presumably, if f ≈ p_n, then ∫_a^b f(x) dx ≈ ∫_a^b p_n(x) dx. Integrating the polynomial should pose no difficulty. To legitimize this intuition, an important aspect is to give a precise, quantified meaning to these "≈" signs.
We discussed three ways to approximate a function with a polynomial:
interpolation, minimax, and best in the 2-norm. For minimax, we did not
discuss concrete algorithms (and existing ones are involved: we don’t want to
work that hard to answer our simple question.) For 2-norm approximations,
one actually needs to compute integrals numerically: that would be circular.
This leaves interpolation.
Recall Lagrange interpolation. We only need to make a choice of interpolation points a ≤ x_0 < ··· < x_n ≤ b. Then,

    f(x) ≈ p_n(x) = Σ_{k=0}^n f(x_k) L_k(x),    where  L_k(x) = Π_{i≠k} (x − x_i)/(x_k − x_i).


Then, informally,

    ∫_a^b f(x) dx ≈ ∫_a^b p_n(x) dx = Σ_{k=0}^n f(x_k) ∫_a^b L_k(x) dx = Σ_{k=0}^n w_k f(x_k),

where w_k := ∫_a^b L_k(x) dx.

Some work goes into computing the quadrature weights w_0, ..., w_n, but notice that this is independent of f: it need only be done once, and the weights can be stored to disk. Then, the quadrature rule Σ_{k=0}^n w_k f(x_k) is easily applied. The points x_0, ..., x_n where f needs to be evaluated are called quadrature points.

9.1 Computing the weights

The weights

    w_k = ∫_a^b L_k(x) dx

depend on the quadrature points x_0, ..., x_n. The formula suggests an obvious way to compute the weights. For the sake of example, let us consider the case where the quadrature points are equispaced on [a, b], giving so-called Newton–Cotes quadrature rules.

Trapezium rule (Newton–Cotes with n = 1). With x_0 = a, x_1 = b, the Lagrange polynomials have degree 1:

    L_0(x) = (x − b)/(a − b),    L_1(x) = (x − a)/(b − a).

These are easily integrated by hand:

    w_0 = ∫_a^b (x − b)/(a − b) dx = (b − a)/2,    w_1 = ∫_a^b (x − a)/(b − a) dx = (b − a)/2.

Overall, we get the quadrature rule

    ∫_a^b f(x) dx ≈ ∫_a^b p_1(x) dx = ((b − a)/2) (f(a) + f(b)).
This is the area of the trapezium defined by the points (a, 0), (a, f (a)),
(b, f (b)), (b, 0). Notice the symmetry here: it makes sense that f (a) and
f (b) get the same weight.
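As a sanity check, the trapezium rule is one line of code. A minimal Python sketch (the function name is ours, not from the notes):

```python
def trapezium(f, a, b):
    """Trapezium rule: approximate the integral of f over [a, b] by the area
    of the trapezium with vertices (a, 0), (a, f(a)), (b, f(b)), (b, 0)."""
    return (b - a) / 2.0 * (f(a) + f(b))

# Exact on polynomials of degree at most 1, e.g. f(x) = 2x + 1 on [0, 1]:
print(trapezium(lambda x: 2.0 * x + 1.0, 0.0, 1.0))  # 2.0, the exact integral
```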

Simpson’s rule (Newton–Cotes with n = 2). With equispaced points


x0 = a, x1 = a+b
2
, x2 = b, Lagrange polynomials have degree 2. We could
obtain the weights exactly as above, but this is rapidly becoming tedious.
Let’s keep an eye out for shortcuts as we proceed. Consider w0 first:
b b
(x − x1 )(x − x2 ) b−a
Z Z
w0 = L0 (x)dx = = ...some work... = .
a a (x0 − x1 )(x0 − x2 ) 6

Similar computations yield w1 , w2 . But we can save some work here. For
example, by a symmetry argument similar as above, we expect w2 = w0 .
You can verify it. What about w1 ? Let’s think about the (very) special case
f (x) = 1. Since f is a polynomial of degree 0, its interpolation polynomial
in P2 is exact: f = p2 . Thus,
    b − a = ∫_a^b f(x) dx = ∫_a^b p_2(x) dx = w_0 f(x_0) + w_1 f(x_1) + w_2 f(x_2) = w_0 + w_1 + w_2.

That is: the weights sum to b − a (the length of the interval). This allows us to conclude already:

    w_0 = (b − a)/6,    w_1 = 4(b − a)/6,    w_2 = (b − a)/6.
This rule interpolates f at three points with a quadratic and approximates
the integral of f with the integral of the quadratic.
Let’s revisit the argument above: f (x) = 1 is integrated exactly upon
approximation by p2 , because p2 = f . More generally,

If f ∈ Pn , then f = pn .R
b Rb
Hence, if f ∈ Pn , then a f (x)dx = a pn (x)dx = nk=0 wk f (xk ).
P

This last point is crucial.


We can use this to our advantage to automate the computation of weights
wk . Indeed, with n + 1 points, the integration rule is exact for polynomials
of degree up to n, which, equivalently, means it is exact for polynomials
1, x, x2 , . . . , xn . Each of these yields a linear equation in the wk ’s. Consider
f(x) = x^i for i = 0, ..., n:

    (b^{i+1} − a^{i+1}) / (i + 1) = ∫_a^b x^i dx = Σ_{k=0}^n w_k x_k^i.

These n + 1 linear equations can be arranged in matrix form:

    ⎡ x_0^0  ···  x_n^0 ⎤ ⎡ w_0 ⎤   ⎡ b − a                        ⎤
    ⎢   ⋮           ⋮   ⎥ ⎢  ⋮  ⎥ = ⎢   ⋮                          ⎥          (9.1)
    ⎣ x_0^n  ···  x_n^n ⎦ ⎣ w_n ⎦   ⎣ (b^{n+1} − a^{n+1})/(n + 1)  ⎦

We recognize an old friend: the matrix is a (transposed) Vandermonde matrix. We know from experience that it is often ill conditioned, even for n below 100. We will circumvent this with more sophisticated methods later.
Still, this system is easily set up for the general case (general n, and general
choice of quadrature points) and can be solved easily for small n. For larger
n, recall that we only need to solve the system once accurately. This could be
done offline, for example using variable precision arithmetic packages (vpa
in Matlab’s symbolic toolbox, for example), then rounding the weights to
double precision. This of course also has its practical limitations. For these
reasons, we put conditioning concerns aside for now and proceed, bearing in
mind that it may be unwise to aim for very large values of n, unless the wk ’s
are computed in a better way.1
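For small n, setting up and solving (9.1) takes only a few lines. A sketch in Python with NumPy (the function name is ours; here the nodes are equispaced, but any nodes work):

```python
import numpy as np

def quadrature_weights(x, a, b):
    """Solve the transposed Vandermonde system (9.1) for the weights of the
    interpolatory quadrature rule with nodes x on [a, b].
    Fine for small n; the system becomes ill conditioned as n grows."""
    n = len(x) - 1
    V = np.vander(x, increasing=True).T                  # V[i, k] = x_k^i
    rhs = np.array([(b**(i + 1) - a**(i + 1)) / (i + 1)  # integral of x^i on [a, b]
                    for i in range(n + 1)])
    return np.linalg.solve(V, rhs)

w = quadrature_weights(np.linspace(0.0, 1.0, 3), 0.0, 1.0)
# Newton-Cotes with n = 2 recovers Simpson's weights [1/6, 4/6, 1/6].
```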

9.2 Bounding the error


Our goal now is to bound the truncation error:

    E_n(f) = | ∫_a^b f(x) dx − Σ_{k=0}^n w_k f(x_k) | = | ∫_a^b f(x) dx − ∫_a^b p_n(x) dx |.

Recall Theorem 6.5: if f ∈ C n+1 [a, b] (meaning: if f is n + 1 times continu-


ously differentiable on [a, b]) and pn ∈ Pn interpolates f at a ≤ x0 < . . . <
xn ≤ b, then

    |f(x) − p_n(x)| ≤ (‖f^{(n+1)}‖_∞ / (n + 1)!) |π_{n+1}(x)|,

where π_{n+1}(x) = (x − x_0) ··· (x − x_n).

[1] In previous chapters, to address ill-conditioning of the Vandermonde matrix, we changed bases: instead of using monomials 1, x, x², ... we used Lagrange polynomials L_0, L_1, L_2, ..., which turned the Vandermonde matrix into an identity matrix. Notice that doing this here, that is, imposing that L_0, ..., L_n are integrated exactly, only shifts the burden to that of computing the right-hand side, which contains ∫_a^b L_k(x) dx: this is our original task. At any rate, later in this chapter we will see that there are better ways to build quadrature rules, namely, composite rules and Gaussian rules.

This leads to a simple bound:


    E_n(f) = | ∫_a^b (f(x) − p_n(x)) dx |
           ≤ ∫_a^b |f(x) − p_n(x)| dx
           ≤ (‖f^{(n+1)}‖_∞ / (n + 1)!) ∫_a^b |π_{n+1}(x)| dx          (9.2)
           ≤ (‖f^{(n+1)}‖_∞ / (n + 1)!) (b − a) ‖π_{n+1}‖_∞.          (9.3)
Both (9.2) and (9.3) can be used to obtain an explicit bound on En (f ). Us-
ing (9.3) is easier given our previous work so we will use this one. See [SM03,
§7] for an explicit use of (9.2), which gives slightly better constants.
As a general observation, unsurprisingly, if the quadrature nodes x0 <
· · · < xn are chosen so as to make kπn+1 k∞ small (a choice independent of
f ), then so is the integration error, provided f (n+1) is defined, continuous
and not too large. This echoes our considerations from the chapter about
interpolation. Thus, we also expect that it is better to use Chebyshev nodes
rather than equispaced nodes as quadrature points.

Equispaced points (Newton–Cotes). Recall:

    ‖π_{n+1}‖_∞ ≤ (n!/4) ((b − a)/n)^{n+1}.

As a result,

    E_n(f) ≤ ((b − a)^{n+2} / (4(n + 1) n^{n+1})) ‖f^{(n+1)}‖_∞.          (Newton–Cotes)

This gives in particular:

    E_1(f) ≤ ((b − a)³/8) ‖f''‖_∞,          (trapezium rule)
    E_2(f) ≤ ((b − a)⁴/96) ‖f'''‖_∞.          (Simpson's rule)

[Margin note: For small n, we can get an exact expression for ‖π_{n+1}‖_∞. For n = 1, it is easy to derive: ‖π_2‖_∞ = (b − a)²/4. For n = 2, it takes a bit of work (with the help of Wolfram for example, as otherwise it is quite tedious), and we get ‖π_3‖_∞ = (b − a)³/(12√3). This is not too important: it only changes the constants somewhat.]

Chebyshev points. Recall equations (7.7) and (7.8): for n ≥ 0, choosing Chebyshev nodes on [−1, 1] as in (7.9) leads to:

    π_{n+1}(x) = (x − x_0) ··· (x − x_n) = (1/2^n) T_{n+1}(x),

whose infinity norm is 1/2^n. What about a general interval [a, b]? Consider the linear change of variable

    t = t(x) = (a + b)/2 + ((b − a)/2) x,

constructed to satisfy t(−1) = a and t(1) = b: it maps [−1, 1] to [a, b]. The Chebyshev nodes on [a, b] are defined as

    t_k = t(x_k),    for k = 0, ..., n.

Interpolation at those points leads to a different π_{n+1} polynomial:

    π̃_{n+1}(t) := (t − t_0) ··· (t − t_n)
               = (t − (a + b)/2 − ((b − a)/2) x_0) ··· (t − (a + b)/2 − ((b − a)/2) x_n).

Consider the inverse of the change of variable:

    x = x(t) = (2/(b − a)) (t − (a + b)/2).

Plugging this into the expression for π̃_{n+1}(t) and factoring out (b − a)/2 from each of the n + 1 terms gives:

    π̃_{n+1}(t) = ((b − a)/2)^{n+1} (x − x_0) ··· (x − x_n) = ((b − a)/2)^{n+1} π_{n+1}(x).

As a result,

    ‖π̃_{n+1}‖_∞ = ((b − a)/2)^{n+1} (1/2^n).

Combining with (9.3) gives:

    E_n(f) ≤ ((b − a)^{n+2} / ((n + 1)! 2^{2n+1})) ‖f^{(n+1)}‖_∞.          (Chebyshev nodes)          (9.4)

In particular,

    E_1(f) ≤ ((b − a)³/16) ‖f''‖_∞,
    E_2(f) ≤ ((b − a)⁴/192) ‖f'''‖_∞.

9.3 Composite rules


We covered composite rules in class. See [SM03, §7.5], including error bounds (you can use the error bounds derived above instead, which gives ∼ 1/m³ for composite Simpson's rule instead of ∼ 1/m⁴).
Notice why it is important to work with Newton–Cotes rules here: they allow us to reuse the extreme points of subintervals, instead of needing to recompute them.
Composite rules are quite important in practice: there are no typeset
notes here because the book is sufficiently explicit; do not take it as a sign
that this is less important :).

9.4 Gaussian quadratures


The quadrature rules discussed above, based on interpolation at n+1 quadra-
ture points, integrate polynomials in Pn exactly.
Definition 9.2. A quadrature rule has degree of precision n if it integrates
all polynomials in Pn exactly, but not so for Pn+1 .
Thus, the rules presented above have degree of precision at least n.
Question 9.3. Show that Newton–Cotes with n = 1 has degree of precision
1, yet Newton–Cotes with n = 2 has degree of precision 3. More generally,
reason that Newton–Cotes rules have degree of precision n if n is odd, and
degree of precision n + 1 if n is even.
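You can probe Question 9.3 numerically before proving it. A small sketch (a numerical check, not a proof) testing Simpson's rule on monomials:

```python
def simpson(f, a, b):
    """Simpson's rule: weights (b-a)/6, 4(b-a)/6, (b-a)/6 at a, (a+b)/2, b."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))

# Simpson's rule (n = 2) is exact on x^3, even though it interpolates with a quadratic...
err3 = abs(simpson(lambda x: x**3, 0.0, 1.0) - 1.0 / 4.0)   # essentially 0: precision >= 3
# ...but it is not exact on x^4, so its degree of precision is exactly 3.
err4 = abs(simpson(lambda x: x**4, 0.0, 1.0) - 1.0 / 5.0)   # about 8.3e-3
```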


The notion of degree of precision of a rule suggests a natural question:


Using n + 1 quadrature points, what is the highest degree of pre-
cision possible for a quadrature rule?
Let’s put a limit on our dreams. For a given quadrature rule with quadrature
nodes x0 , . . . , xn and weights w0 , . . . , wn , consider the following polynomial:

    q(x) = (x − x_0)² ··· (x − x_n)².

This is a polynomial of degree 2n + 2. Surely, ∫_a^b q(x) dx is positive (in particular, nonzero). Yet, the quadrature rule yields Σ_{k=0}^n w_k q(x_k) = 0, since all the quadrature points are roots of q. We conclude that no quadrature rule based on n + 1 points can integrate q correctly; in other words: no quadrature rule based on n + 1 points has degree of precision 2n + 2 or more. Fine, we do not expect to find a method with degree of precision more than 2n + 1. But can we attain this high degree?
The answer, magically to some extent, is yes! It goes by the name of
Gauss quadrature, and involves orthogonal polynomials. Before we get into
it, let us extend the scope of the integration problem somewhat, to allow for
weights w(x) (this is notably useful for best approximation in the 2-norm):
Problem 9.4. Given f ∈ C[a, b] and a weight function w on (a, b), compute ∫_a^b f(x) w(x) dx.
Consider a system of orthogonal polynomials for the inner product with
weight w,

φ0 , φ1 , φ2 , . . . ,

as defined in Definition 8.12. Observe that for any polynomial p2n+1 ∈ P2n+1 ,
there exist two polynomials q and r in Pn such that

p2n+1 (x) = q(x)φn+1 (x) + r(x).

(That is, q is the quotient and r is the remainder after division of p2n+1
by φn+1 .) Furthermore, let x0 < · · · < xn in [a, b] and w0 , . . . , wn form a
quadrature rule of degree of precision at least n (to be determined.) Then,
    ∫_a^b p_{2n+1}(x) w(x) dx = ∫_a^b (q(x) φ_{n+1}(x) + r(x)) w(x) dx
                              = ∫_a^b q(x) φ_{n+1}(x) w(x) dx + ∫_a^b r(x) w(x) dx
                              = Σ_{k=0}^n w_k r(x_k),

where the first term vanishes because ∫_a^b q(x) φ_{n+1}(x) w(x) dx = ⟨q, φ_{n+1}⟩ = 0 since q ∈ P_n, and the last equality holds since the rule is exact for r ∈ P_n.

Thus, using the quadrature rule of degree of precision at least n, one could
conceivably integrate all polynomials of degree up to 2n + 1, if only one could
evaluate r instead of p2n+1 at the quadrature nodes. In general, this is not
an easy task. Here comes the key part:
If we pick the quadrature nodes x0 < . . . < xn to be the n + 1
roots of φn+1 (known to be real, distinct and in [a, b]), then

p2n+1 (xk ) = q(xk )φn+1 (xk ) + r(xk ) = r(xk ).



Consequently,

    ∫_a^b p_{2n+1}(x) w(x) dx = Σ_{k=0}^n w_k p_{2n+1}(x_k).

The weights w_k can be determined as usual from (9.1), but we will do better.
Thus, the roots of orthogonal polynomials tell us where to sample f in
order to maximize the degree of precision of the quadrature rule. How do we
compute those roots?

9.4.1 Computing roots of orthogonal polynomials


Recall Theorem 8.19, which states that roots of orthogonal polynomials are
real and distinct and lie in (a, b). Now, we need to compute them. We will
show they are the eigenvalues of a symmetric, tridiagonal matrix: we know
how to handle that. We only need to find this tridiagonal matrix. First, we
will find a non-symmetric tridiagonal matrix; then, we will show it has the
same eigenvalues as a symmetric one.
It all hinges on the three-term recurrence, Theorem 8.14.² By way of example, let us aim to find the roots of φ_5. The three-term recurrence gives:

    α_0 φ_0(x) + φ_1(x) = x φ_0(x)
    β_1² φ_0(x) + α_1 φ_1(x) + φ_2(x) = x φ_1(x)
    β_2² φ_1(x) + α_2 φ_2(x) + φ_3(x) = x φ_2(x)
    β_3² φ_2(x) + α_3 φ_3(x) + φ_4(x) = x φ_3(x)
    β_4² φ_3(x) + α_4 φ_4(x) + φ_5(x) = x φ_4(x),

where in the last equation the term φ_5(x) is moved to the right-hand side below.

In matrix form, this reads:

    ⎡ α_0   1                      ⎤ ⎡ φ_0(x) ⎤     ⎡ φ_0(x) ⎤   ⎡   0    ⎤
    ⎢ β_1²  α_1   1                ⎥ ⎢ φ_1(x) ⎥     ⎢ φ_1(x) ⎥   ⎢   0    ⎥
    ⎢       β_2²  α_2   1          ⎥ ⎢ φ_2(x) ⎥ = x ⎢ φ_2(x) ⎥ − ⎢   0    ⎥ ,
    ⎢             β_3²  α_3   1    ⎥ ⎢ φ_3(x) ⎥     ⎢ φ_3(x) ⎥   ⎢   0    ⎥
    ⎣                   β_4²  α_4  ⎦ ⎣ φ_4(x) ⎦     ⎣ φ_4(x) ⎦   ⎣ φ_5(x) ⎦
                 J                     v(x)            v(x)

The matrix on the left, J, is the Jacobi matrix. The crucial observation
follows:
[2] The recurrence is set up assuming the polynomials are monic, which doesn't affect their roots, so this is inconsequential to our endeavor.

If x is a root of φ5 , then
Jv(x) = xv(x). (9.5)
That is: if x is a root of φ5 , then x is an eigenvalue of the Jacobi
matrix, with eigenvector v(x).3
Let’s say this again: all five distinct roots of φ5 are eigenvalues of J. Since
J is a 5 × 5 matrix, it has five eigenvalues, so that the roots of φ5 are exactly
the eigenvalues of J. This generalizes for all n of course.
Theorem 9.5. The roots of φ_{n+1} are the eigenvalues of the Jacobi matrix

    J_{n+1} = ⎡ α_0   1                          ⎤
              ⎢ β_1²  α_1   1                    ⎥
              ⎢       ⋱     ⋱         ⋱          ⎥
              ⎢       β_{n−1}²  α_{n−1}   1      ⎥
              ⎣                 β_n²     α_n     ⎦ .
This matrix is tridiagonal, but it is not symmetric. The eigenvalue computation algorithms we have discussed require a symmetric matrix. Let's try to make J symmetric without changing its eigenvalues. Using a diagonal similarity transformation S = diag(s_0, ..., s_4) (coefficients s_k ≠ 0 to be determined), entrywise (S⁻¹ J S)_{ij} = s_i⁻¹ J_{ij} s_j, so that:

    S⁻¹ J S = ⎡ α_0            (s_1/s_0)                                            ⎤
              ⎢ (s_0/s_1) β_1²  α_1         (s_2/s_1)                               ⎥
              ⎢                (s_1/s_2) β_2²  α_2        (s_3/s_2)                 ⎥  =: J̄.
              ⎢                            (s_2/s_3) β_3²  α_3        (s_4/s_3)     ⎥
              ⎣                                        (s_3/s_4) β_4²   α_4         ⎦

Verify that the matrix J̄ on the right-hand side has the same eigenvalues as J. We wish to choose the s_k's in such a way that J̄ is symmetric. This is the case if, for each k,

    (s_{k−1}/s_k) β_k² = s_k/s_{k−1}.
[3] Note that v(x) ≠ 0 since its first entry is 1.

Since β_k² = ‖φ_k‖₂² / ‖φ_{k−1}‖₂², a valid choice is:

    s_k = ‖φ_k‖₂.

The resulting symmetric matrix is:

    J̄ = ⎡ α_0  β_1                      ⎤
         ⎢ β_1  α_1  β_2                 ⎥
         ⎢      β_2  α_2  β_3            ⎥ .
         ⎢           β_3  α_3  β_4       ⎥
         ⎣                β_4  α_4       ⎦

This is indeed tridiagonal and symmetric, and has the same eigenvalues as
J, so that its eigenvalues are the roots of φ5 . We can generalize.

Theorem 9.6. The roots of φ_{n+1} are the eigenvalues of the matrix

    J̄_{n+1} = ⎡ α_0  β_1                            ⎤
               ⎢ β_1  α_1  β_2                       ⎥
               ⎢      ⋱    ⋱        ⋱                ⎥          (9.6)
               ⎢      β_{n−1}  α_{n−1}  β_n          ⎥
               ⎣               β_n      α_n          ⎦ .

The eigenvalues give us the roots. To get the weights, for now, we only know one way: solve the ill-conditioned Vandermonde system (9.1). Fortunately, there is a better way. It involves the eigenvectors of J̄_{n+1}.

Remark 9.7. As a side note, Cauchy’s interlace theorem, Theorem 5.18,


further tells us the roots of a system of orthogonal polynomials interlace since
βk 6= 0 for all k. This is not surprising, given the resemblance between the
recurrences (5.19) and (8.6).

9.4.2 Getting the weights, too: Golub–Welsch


The Gauss quadrature rule now looks like this:

    ∫_a^b f(x) w(x) dx ≈ Σ_{j=0}^n w_j f(x_j),

where x_0 < ··· < x_n are the eigenvalues of J̄_{n+1}, and the weights can be chosen to ensure exact integration if f ∈ P_{2n+1}.

This last point leads to the following observation. Define a system⁴ of orthonormal polynomials:

    ϕ_k(x) = φ_k(x) / ‖φ_k‖₂,    for k = 0, 1, 2, ...
Then, for any k, ℓ in 0, ..., n,

    δ_{kℓ} = ⟨ϕ_k, ϕ_ℓ⟩ = ∫_a^b ϕ_k(x) ϕ_ℓ(x) w(x) dx = Σ_{j=0}^n w_j ϕ_k(x_j) ϕ_ℓ(x_j).

This last equality⁵ follows from the fact that ϕ_k(x) ϕ_ℓ(x) is a polynomial of degree at most 2n; thus, it is integrated exactly by the Gauss quadrature rule. Verify that these equations can be written in matrix form:

    I = P W Pᵀ,

with I the identity matrix of size n + 1 and

    P = ⎡ ϕ_0(x_0)  ···  ϕ_0(x_n) ⎤
        ⎢    ⋮                ⋮    ⎥ ,    W = diag(w_0, ..., w_n).
        ⎣ ϕ_n(x_0)  ···  ϕ_n(x_n) ⎦

The identity I = P W Pᵀ notably implies that both P and W are invertible. Hence,

    W = P⁻¹ (Pᵀ)⁻¹ = (Pᵀ P)⁻¹.

Alternatively,

    W⁻¹ = Pᵀ P.

In other words:

    1/w_j = (Pᵀ P)_{jj} = Σ_{k=0}^n P_{kj}² = Σ_{k=0}^n (ϕ_k(x_j))².          (9.7)

[4] Unique if we further require the leading coefficient to be positive, as is the case here.
[5] This equality also shows that ϕ_0, ϕ_1, ... are not only orthonormal with respect to a continuous inner product, but also with respect to a discrete inner product (see also Remark 9.10). But this is a story for another time.

How do we compute the right-hand side? This is where the eigenvectors come in. Recall Jv(x) = xv(x) from eq. (9.5). Apply S⁻¹ on the left and insert SS⁻¹ to find:

    (S⁻¹ J S)(S⁻¹ v(x)) = x (S⁻¹ v(x)),    that is,    J̄ u(x) = x u(x),    with u(x) := S⁻¹ v(x).

Thus, u(x) is an eigenvector of J̄ associated to the eigenvalue x, and

    u(x) = S⁻¹ v(x) = ⎡ φ_0(x)/‖φ_0‖₂ ⎤   ⎡ ϕ_0(x) ⎤
                      ⎢ φ_1(x)/‖φ_1‖₂ ⎥   ⎢ ϕ_1(x) ⎥
                      ⎢ φ_2(x)/‖φ_2‖₂ ⎥ = ⎢ ϕ_2(x) ⎥ .
                      ⎢ φ_3(x)/‖φ_3‖₂ ⎥   ⎢ ϕ_3(x) ⎥
                      ⎣ φ_4(x)/‖φ_4‖₂ ⎦   ⎣ ϕ_4(x) ⎦

Corollary 9.8. The roots of ϕ_{n+1} (equivalently, the roots of φ_{n+1}) are the eigenvalues of the symmetric, tridiagonal matrix J̄_{n+1}. Let x_0 < ··· < x_n in (a, b) denote these roots. Any eigenvector u^{(j)} associated to the eigenvalue x_j is of the form:

    u^{(j)} = c_j (ϕ_0(x_j), ϕ_1(x_j), ..., ϕ_n(x_j))ᵀ,          (9.8)

for some constant c_j.

Let u^{(j)} be an eigenvector of J̄_{n+1} associated to the eigenvalue x_j as in (9.8), with constant c_j chosen so that u^{(j)} is normalized,⁶ that is,

    1 = ‖u^{(j)}‖₂² = Σ_{k=0}^n (u_k^{(j)})² = c_j² Σ_{k=0}^n (ϕ_k(x_j))².

Then, going back to (9.7), we find

    w_j = c_j².

This brings us to the final question: how do we find c_j? Given an eigenvector, it is easy to normalize it, but that doesn't tell us what c_j is. The trick is to observe that ϕ_0 is a constant. Specifically, it is the constant polynomial whose norm is 1. So, for some fixed µ, we have:

    ϕ_0(x) = µ,    ∀x.

Then,

    u_0^{(j)} = c_j ϕ_0(x_j) = c_j µ.

[6] This is with respect to the vector 2-norm now!

Thus, to find c_j (which gives us w_j), we only need to find µ. To this end, note that

    ‖ϕ_0‖₂ = 1  ⟺  µ² ∫_a^b w(x) dx = 1,

so that

    µ² = 1 / ∫_a^b w(x) dx.

Thus,

    w_j = c_j² = (u_0^{(j)})² / µ² = (u_0^{(j)})² ∫_a^b w(x) dx,          (9.9)

where the integral ∫_a^b w(x) dx needs to be computed only once.

After these derivations, the practical implications are clear: this last equation
is all you need to figure out how to implement the Golub–Welsch algorithm.
This algorithm computes the nodes and the weights of a Gaussian quadrature:
see Algorithm 9.1.
Question 9.9. Implement the procedure above, known as the Golub–Welsch
algorithm, to compute the nodes and weights of the Gauss–Legendre quadra-
ture rule (Gauss with Legendre polynomials, that is, w(x) = 1 over [−1, 1].)


Remark 9.10. The procedure above also proves the weights in a Gauss
quadrature rule are positive. This is great numerically. Indeed, assume you
are integrating f (x) which is nonnegative over [a, b]. Then, the quadrature
rule is a sum of nonnegative numbers: there is no risk of catastrophic can-
cellation due to round-off errors in computing the sum. This is not so for
Newton–Cotes rules, which have negative weights already for n ≥ 8: these
may lead to intermediate computations of differences of large numbers.
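To see Remark 9.10 in action, one can solve (9.1) for equispaced nodes and inspect the signs. A sketch with NumPy (at these small sizes, the Vandermonde system is still solvable accurately enough):

```python
import numpy as np

def equispaced_weights(n):
    """Newton-Cotes weights on [0, 1]: solve the Vandermonde system (9.1)."""
    x = np.linspace(0.0, 1.0, n + 1)
    V = np.vander(x, increasing=True).T        # V[i, k] = x_k^i
    rhs = 1.0 / np.arange(1, n + 2)            # integrals of x^i over [0, 1]
    return np.linalg.solve(V, rhs)

# The weights always sum to b - a = 1, but for n = 8 some are negative,
# so sum(|w_k|) exceeds 1: summing the rule can then lose digits to cancellation.
w4, w8 = equispaced_weights(4), equispaced_weights(8)
```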
9.4. GAUSSIAN QUADRATURES 175

Algorithm 9.1 Golub–Welsch for Gauss quadrature nodes and weights
1: Given: α_0, ..., α_n and β_1, ..., β_n: the three-term recurrence coefficients for the system of polynomials orthogonal with respect to weight w(x) over [a, b]; see Remark 8.18 for example. Furthermore, let t = ∫_a^b w(x) dx.
2: Construct the symmetric, tridiagonal matrix J̄_{n+1} as in (9.6).
3: For the matrix J̄_{n+1}, compute the eigenvalues x_0 < ··· < x_n in (a, b) and the associated unit-norm eigenvectors u^{(0)}, ..., u^{(n)} in R^{n+1}.
4: Following (9.9), for j = 0, ..., n, let w_j = (u_0^{(j)})² · t, where u_0^{(j)} is the first entry of eigenvector u^{(j)}.
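As an illustration of Algorithm 9.1 (and a sketch of an answer to Question 9.9), here is Golub–Welsch for Gauss–Legendre in Python. The recurrence coefficients α_k = 0 and β_k = k/√(4k² − 1) for Legendre polynomials are taken as known (cf. Remark 8.18), and t = ∫_{−1}^1 1 dx = 2:

```python
import numpy as np

def gauss_legendre(n):
    """Nodes and weights of the (n+1)-point Gauss-Legendre rule on [-1, 1],
    via the eigendecomposition of the symmetric Jacobi matrix (9.6)."""
    k = np.arange(1, n + 1)
    beta = k / np.sqrt(4.0 * k**2 - 1.0)      # off-diagonal; alpha_k = 0 for Legendre
    J = np.diag(beta, 1) + np.diag(beta, -1)  # symmetric, tridiagonal Jacobi matrix
    x, U = np.linalg.eigh(J)                  # eigenvalues (sorted) are the nodes
    w = 2.0 * U[0, :]**2                      # weights: t * (first entry)^2, with t = 2
    return x, w

x, w = gauss_legendre(2)   # 3 points: degree of precision 2*2 + 1 = 5
# e.g. it integrates x^4 over [-1, 1] exactly: sum(w * x**4) equals 2/5.
```

The signs of the eigenvectors returned by `eigh` are arbitrary, but that is harmless here since only the squares of their first entries are used.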

9.4.3 Examples
Figures 9.1–9.4 compare different quadrature rules, obtained in different ways, on four different integrals.
• Newton–Cotes (Vandermonde) uses n + 1 equispaced points on the interval, and obtains the weights by solving the Vandermonde system (ill conditioned).
• Chebyshev nodes (Vandermonde) does the same but with n + 1 Chebyshev nodes in the interval.
• Legendre nodes (Vandermonde) does the same but using the n + 1 roots of the (n + 1)st Legendre polynomial. Mathematically, the latter is the Gauss quadrature with n + 1 points for unit weight w(x) = 1, but because solving the Vandermonde system is unstable, it eventually fails.
• Legendre nodes (Golub–Welsch) is the same rule as the previous one, but it computes the weights with Golub–Welsch, which is far more trustworthy numerically.
• The composite trapezium rule uses n trapezia to approximate the integral.
At point n, all rules use exactly n + 1 evaluations of the integrand. They require varying amounts of work to compute the nodes and weights, but notice that these can be precomputed once and for all.

9.4.4 Error bounds


We did not get a chance to discuss error bounds for Gauss quadrature rules
in class. See Theorems 10.1 and 10.2 in [SM03] for results in this direction
176 CHAPTER 9. INTEGRATION

(for your information.) Interestingly, the first one is related to approximation


errors when interpolating à la Hermite—this is actually another way to con-
struct Gauss quadratures, and is the way preferred in [SM03]. In the present
lecture notes, we followed an approach which emphasizes the role of orthog-
onal polynomials instead, because (a) it is practical, (b) it is beautiful, and
(c) this is the way to go to generalize Gauss quadratures to other domains
besides the real line. The following is a side-note similar to the reasoning in
Theorem 10.2 of [SM03].
Say we want to compute ∫_a^b f(x) dx with f continuous (the story generalizes to include weights w(x), but we here consider w(x) = 1 over [a, b] for simplicity). Consider p_{2n+1} ∈ P_{2n+1}: the minimax approximation of f over [a, b] of degree at most 2n + 1. We do not need to compute this polynomial: we only need to know that it exists, as we proved two chapters ago.
We can split f as follows:

f = p2n+1 + e2n+1 ,

where e_{2n+1} is some function (not necessarily a polynomial), and ‖e_{2n+1}‖_∞ becomes arbitrarily close to 0 as n goes to infinity, because polynomials are dense in C[a, b] (Weierstrass' theorem).
Now say x_0, ..., x_n and w_0, ..., w_n form the Gauss quadrature rule of degree of precision 2n + 1. Apply this rule to f:

    Σ_{k=0}^n w_k f(x_k) = Σ_{k=0}^n w_k p_{2n+1}(x_k) + Σ_{k=0}^n w_k e_{2n+1}(x_k)
                         = ∫_a^b p_{2n+1}(x) dx + Σ_{k=0}^n w_k e_{2n+1}(x_k),

where we used the fact that the quadrature rule is exact when applied to
p2n+1 . On the other hand:

    ∫_a^b f(x) dx = ∫_a^b p_{2n+1}(x) dx + ∫_a^b e_{2n+1}(x) dx.

Thus, our integration error is given by:

    | ∫_a^b f(x) dx − Σ_{k=0}^n w_k f(x_k) | = | ∫_a^b e_{2n+1}(x) dx − Σ_{k=0}^n w_k e_{2n+1}(x_k) |
        ≤ | ∫_a^b e_{2n+1}(x) dx | + | Σ_{k=0}^n w_k e_{2n+1}(x_k) |
        ≤ ‖e_{2n+1}‖_∞ ∫_a^b 1 dx + ‖e_{2n+1}‖_∞ Σ_{k=0}^n |w_k|
        = 2(b − a) ‖f − p_{2n+1}‖_∞.

The last step relies on two facts:

1. w0 , . . . , wn ≥ 0 (see Remark 9.10), and

2. w0 + . . . + wn = b − a since the rule is exact for f (x) = 1.

Here is the take-away: the integration error of the Gauss quadrature rule is determined by the minimax error of approximating f with a polynomial of degree 2n + 1 (even though we are using only n + 1 evaluations of f!). The error goes to zero as n → ∞. If f lends itself to good minimax approximations of low degree, then the results should be quite good already for finite n (and remember: we do not need to compute the minimax approximation; we only use the fact that it exists). Furthermore, we can get an upper bound on this bound by plugging in any other polynomial of degree at most 2n + 1 which is a good ∞-norm approximation of f. For example, we know that the polynomial of degree 2n + 1 which interpolates f at the 2n + 2 Chebyshev nodes on (a, b) reaches a pretty good ∞-norm error as long as f is many times continuously differentiable and those derivatives are not catastrophically crazy. We do at least as well as if we were using that polynomial, even though finding that polynomial would have required 2n + 2 evaluations of f!

[Figure: log–log plot of absolute integration error versus n (each rule uses n + 1 points), comparing Newton–Cotes (Vandermonde), Chebyshev nodes (Vandermonde), Legendre nodes (Vandermonde), Legendre nodes (Golub–Welsch), and composite trapezium.]

Figure 9.1: Computation of ∫_0^1 cos(2πx) dx = 0 with various quadratures.

[Figure: absolute integration error versus n for the same five rules.]

Figure 9.2: Computation of ∫_0^1 √x dx = 2/3 with various quadratures.

[Figure: absolute integration error versus n for the same five rules.]

Figure 9.3: Computation of ∫_{−1}^1 1/(1 + 25x²) dx = (2/5) arctan(5) with various quadratures.

[Figure: absolute integration error versus n for the same five rules.]

Figure 9.4: Computation of ∫_0^1 x⁸ dx = 1/9 with various quadratures.
Chapter 10

Unconstrained optimization

One of the central problems in optimization is the following.


Problem 10.1. Given a continuous function f : R^n → R, compute¹

    min_{x ∈ R^n} f(x),          (10.1)

that is, compute the minimal value that f can take for any choice of x.
In general, this value may not exist and may not be attainable, in which
case the problem is less interesting. We make the following blanket assump-
tion throughout the chapter to avoid this issue.
Assumption 10.2. The function f is twice continuously differentiable2 and
attains the minimal value f ∗ .
When the minimum exists and is attainable, one is often also interested
in determining an x∗ ∈ Rn for which this minimal value is attained. Such an
x∗ is called an optimum.³ The set of all optima is denoted

    arg min_{x ∈ R^n} f(x).          (10.2)

In general, this set can contain any number of elements. Because the variable
x is free to take up any value, we say the problem is unconstrained. It is im-
portant to make a distinction between globally optimal points and points that
appear optimal only locally (that is, when compared only to their immediate
surroundings.)
[1] It is traditional to talk about minimization. To maximize, consider −f(x) instead.
[2] Many statements hold without demanding this much smoothness. Given the limited time at our disposal, we will keep technicality low to focus on ideas.
[3] Or also: an optimizer, a minimum, a minimizer.


Definition 10.3 (optimum and local optimum). A point x ∈ Rn is a local


optimum of f if there exists a neighborhood U of x such that f (x) ≤ f (y)
for all y ∈ U . A local optimum is a (global) optimum if f (x) ≤ f (y) for all
y ∈ Rn .
The definition of optimum is exactly what we want, but it is not revealing
in terms of computations. To work our way toward practical algorithms,
we now determine necessary optimality conditions, that is: properties that
optima must satisfy.
Lemma 10.4 (Necessary optimality conditions). If x∗ ∈ R^n is an optimum of f, then ∇f(x∗) = 0 and ∇²f(x∗) ⪰ 0.
Proof. For contradiction, assume ∇f(x∗) ≠ 0 and consider x = x∗ − α∇f(x∗) for some α > 0. By a Taylor expansion at x∗, we find that

    f(x) = f(x∗) + ∇f(x∗)ᵀ(x − x∗) + o(‖x − x∗‖₂)          (10.3)
         = f(x∗) − α‖∇f(x∗)‖₂² + o(α).          (10.4)

By definition, o(α) is a term such that lim_{α→0} o(α)/α = 0. Hence, there exists ᾱ > 0 such that for all α ∈ (0, ᾱ) the term −α‖∇f(x∗)‖₂² + o(α) is negative. This shows that f(x) < f(x∗), which is a contradiction.
Similarly, assume now for contradiction that ∇²f(x∗) ⋡ 0. Thus, there exists u ∈ R^n of unit norm such that uᵀ∇²f(x∗)u = −δ < 0. Using the fact that ∇f(x∗) = 0, as we just established, and a Taylor expansion, we find for x = x∗ + αu that

    f(x) = f(x∗) + (1/2)(x − x∗)ᵀ∇²f(x∗)(x − x∗) + o(‖x − x∗‖₂²)          (10.5)
         = f(x∗) − (α²/2)δ + o(α²).          (10.6)

Again, there is an interval (0, ᾱ) of values of α which show x outperforms x∗, leading to a contradiction.
Notice that local optima are also second-order critical points. It is useful to give a name to all points which satisfy the necessary conditions.

Definition 10.5 (Critical points). A point x ∈ R^n is a (first-order) critical point if ∇f(x) = 0. If furthermore ∇²f(x) ⪰ 0, then x is a second-order critical point.

Critical points which are neither local minima nor local maxima are called saddle points.
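To make these definitions concrete, a small sketch (the example functions are ours): at a critical point, the eigenvalues of the Hessian distinguish the cases. When the Hessian is positive definite, standard second-order sufficient conditions (not derived above) even guarantee a local minimum.

```python
import numpy as np

# Both functions below have a critical point at the origin (the gradient vanishes there):
#   f1(x, y) = x^2 + y^2  has Hessian diag(2,  2): positive definite -> local minimum;
#   f2(x, y) = x^2 - y^2  has Hessian diag(2, -2): indefinite        -> saddle point.

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    ev = np.linalg.eigvalsh(H)
    if ev.min() > tol:
        return "local minimum"
    if ev.max() < -tol:
        return "local maximum"
    if ev.min() < -tol and ev.max() > tol:
        return "saddle point"
    return "inconclusive (semidefinite)"

print(classify_critical_point(np.diag([2.0, 2.0])))    # local minimum
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point
```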
As will become clear, in general, we can only hope to compute first- or
second-order critical points. Since these can be arbitrarily far from optimal,

this is a major obstacle to our endeavor. Fortunately, for certain classes


of functions, it is the case that all critical points are optima, so that the
necessary conditions of Lemma 10.4 are also sufficient.
Definition 10.6 (Convex function). A function g : Rⁿ → R is convex if, for
all x, y ∈ Rⁿ and for all α ∈ [0, 1],

g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y),   (10.7)

that is, if the line segment between any two points on the graph of the function
lies above or on the graph.
Lemma 10.7. Owing to the differentiability properties of f, the following are
equivalent:
(a) f is convex;
(b) f(y) ≥ f(x) + (y − x)ᵀ∇f(x) for all x, y ∈ Rⁿ;
(c) ∇²f(x) ⪰ 0 for all x ∈ Rⁿ.
Proof. We first show (a) and (b) are equivalent.
If f is convex, then for any α ∈ [0, 1] we have

f(x + α(y − x)) ≤ αf(y) + (1 − α)f(x),

or, equivalently,

[f(x + α(y − x)) − f(x)] / α ≤ f(y) − f(x).

Furthermore, by definition of the gradient,

lim_{α↓0} [f(x + α(y − x)) − f(x)] / α = (y − x)ᵀ∇f(x).

Thus, (y − x)ᵀ∇f(x) ≤ f(y) − f(x), as desired.
The other way around, assume now that (b) holds. For any x, y ∈ Rⁿ
and any α ∈ [0, 1], consider z = αx + (1 − α)y. Using (b) twice we get

f(x) ≥ f(z) + (x − z)ᵀ∇f(z),
f(y) ≥ f(z) + (y − z)ᵀ∇f(z).

Multiply the first by α and the second by 1 − α and add:

αf(x) + (1 − α)f(y) ≥ f(z) + (αx + (1 − α)y − z)ᵀ∇f(z)
                    = f(z)
                    = f(αx + (1 − α)y),
184 CHAPTER 10. UNCONSTRAINED OPTIMIZATION

as desired.
We now show that (b) and (c) are equivalent.
If (c) holds, then simply consider a Taylor expansion of f: for any x, y ∈ Rⁿ,
there exists α ∈ [0, 1] such that

f(y) = f(x) + (y − x)ᵀ∇f(x) + ½(y − x)ᵀ∇²f(x + α(y − x))(y − x).   (10.8)

Since the Hessian is assumed everywhere positive semidefinite, this yields

f(y) ≥ f(x) + (y − x)ᵀ∇f(x)

for all x, y, showing (b) holds.
The other way around, if (b) holds, then (c) must hold. Assume otherwise
for contradiction. Then, there exists x ∈ Rⁿ and u ∈ Rⁿ such that
uᵀ∇²f(x)u < 0. By continuity of the Hessian, we can take u small enough
so that uᵀ∇²f(x + αu)u < 0 for all α ∈ [0, 1] as well. Plugging this in the
Taylor expansion (10.8) with y = x + u, we find

f(y) = f(x) + (y − x)ᵀ∇f(x) + ½uᵀ∇²f(x + αu)u
     < f(x) + (y − x)ᵀ∇f(x),

which indeed contradicts (b).
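Characterization (b) of Lemma 10.7 is easy to probe numerically. The sketch below is purely illustrative (the log-sum-exp function, which is known to be convex, and the random test points are choices made here, not part of the notes): it checks the first-order inequality at random pairs of points.

```python
import numpy as np

# Convex test function f(x) = log(sum(exp(x))) ("log-sum-exp"),
# whose gradient is the softmax map exp(x) / sum(exp(x)).
def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    e = np.exp(x)
    return e / np.sum(e)

# Check (b): f(y) >= f(x) + (y - x)^T grad f(x) at 100 random pairs,
# with a tiny slack to absorb floating point rounding.
rng = np.random.default_rng(0)
ok = all(
    f(y) >= f(x) + (y - x) @ grad_f(x) - 1e-12
    for x, y in (rng.standard_normal((2, 5)) for _ in range(100))
)
```

Such a check is no proof, of course, but it is a cheap sanity test when one believes a function is convex.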
Question 10.8. Show that linear functions, vector norms and weighted sums
of convex functions with positive weights are convex. Show that if g : Rⁿ → R
is convex, then g composed with a linear function is convex. Show that if
g1, . . . , gn : Rⁿ → R are convex, then g(x) = sup_{i∈{1,...,n}} gi(x) is convex.
Lemma 10.9. If f is convex, then critical points and minima coincide.
Proof. If ∇f (x) = 0, use part (b) of Lemma 10.7 to conclude that x is
optimal. The other way around, if x is optimal, it is a critical point by
Lemma 10.4, so that ∇f (x) = 0.

10.1 A first algorithm: gradient descent

Gradient descent (or steepest descent) is probably the most famous and most
versatile optimization algorithm—take a look at Algorithm 10.1. The algorithm
is also called steepest descent because, locally, following the negative
gradient induces the steepest decrease in the cost function f (up to a first-order
approximation).
Algorithm 10.1 Gradient descent
1: Input: x0 ∈ Rⁿ
2: for k = 0, 1, 2, . . . do
3:   Pick a step-size ηk
4:   xk+1 = xk − ηk∇f(xk)
5: end for
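To make Algorithm 10.1 concrete, here is a minimal Python sketch (not code from these notes; the quadratic test function, the iteration count, and the constant step-size 1/L are choices made here for illustration):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, num_iters=100):
    """Run Algorithm 10.1 with a constant step-size eta."""
    x = x0
    for _ in range(num_iters):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = 0.5 * x^T A x with A positive definite.
# Its gradient A x is Lipschitz with constant L = largest eigenvalue of A.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x
L = 10.0
x = gradient_descent(grad_f, np.array([1.0, 1.0]), eta=1.0 / L)
# The unique minimizer is 0; the iterates approach it geometrically here.
```

Picking the step-size ηk well is the whole game, as the rest of this section discusses.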

In general, we can only guarantee convergence to critical points,4 but
it is intuitively expected that the algorithm converges to a local minimum
unless the initial point x0 is chosen maliciously—an intuition which can be
formalized.
Besides an initial guess x0 ∈ Rⁿ, gradient descent requires a strategy to
pick the step-size ηk. This amounts to considering the line-search problem

min_{η∈R} φ(η) = f(xk − η∇f(xk)),   (10.9)

which restricts f to a one-dimensional problem along the line generated by
xk and ∇f(xk). At first sight, it might seem like a good idea to pick
η ∈ arg min_η φ(η), but this is impractical for all but the simplest settings.
Fortunately, there is no need to solve the line-search problem exactly to
ensure convergence. It is enough to ensure sufficient decrease.
Definition 10.10. We say the step-sizes in gradient descent yield sufficient
decrease if there exists c > 0 such that, for all k,

f(xk) − f(xk+1) ≥ c‖∇f(xk)‖₂².   (10.10)
Lemma 10.11. Suppose f is lower bounded by f∗. If the step-sizes yield
sufficient decrease, then gradient descent produces an iterate xk such that
‖∇f(xk)‖₂ ≤ ε with k ≤ ⌈(f(x0) − f∗)/(cε²)⌉, and ‖∇f(xk)‖₂ → 0.
There is no condition on x0.
Proof. Assume that x0, . . . , xK−1 all have gradient norm larger than ε. Then,
using both the fact that f is lower bounded and the sufficient decrease property,
a classic telescoping sum argument gives

f(x0) − f∗ ≥ f(x0) − f(xK) = Σ_{ℓ=0}^{K−1} [f(xℓ) − f(xℓ+1)] ≥ Σ_{ℓ=0}^{K−1} c‖∇f(xℓ)‖₂² ≥ cKε².

4 Notice the plural: we won't show convergence to a unique critical point, even though
this is what typically happens in practice. We only show that all accumulation points are
critical points.
Hence, K ≤ (f(x0) − f∗)/(cε²), so that if more iterations are computed, it must
be that the gradient dropped below ε at least once. Furthermore, since the
sum of squared gradient norms is upper bounded, the gradient norm must
converge to 0.
Remark 10.12. Let us stress this: there are no assumptions on x0 . On
the other hand, the theorem only guarantees that all accumulation points of
the sequence of iterates are critical points: it does not guarantee that these
are global optima. Importantly, if f is convex, then critical points and global
optima coincide, which shows all accumulation points are global optima re-
gardless of initialization: this is powerful!
Sufficient decrease can be achieved easily if we assume f has a Lipschitz
continuous gradient with known constant L.
Lemma 10.13. If ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ for all x, y, then the
constant step-size ηk = 1/L yields sufficient decrease with c = 1/(2L).
Proof. First, we show the Lipschitz condition implies the following statement:
for all x, y,

|f(y) − f(x) − (y − x)ᵀ∇f(x)| ≤ (L/2)‖y − x‖₂².   (10.11)

Indeed, by the fundamental theorem of calculus,

f(y) − f(x) = ∫₀¹ (y − x)ᵀ∇f(x + s(y − x)) ds
            = (y − x)ᵀ∇f(x) + ∫₀¹ (y − x)ᵀ[∇f(x + s(y − x)) − ∇f(x)] ds.

The error term is easily bounded:

|∫₀¹ (y − x)ᵀ[∇f(x + s(y − x)) − ∇f(x)] ds|
    ≤ ∫₀¹ ‖y − x‖₂ ‖∇f(x + s(y − x)) − ∇f(x)‖₂ ds
    ≤ ∫₀¹ ‖y − x‖₂ L‖s(y − x)‖₂ ds
    = (L/2)‖y − x‖₂²,

which confirms (10.11). Let η be our tentative step-size, so that xk+1 =
xk − η∇f(xk). Then, equation (10.11) can be used to bound the improvement
one can expect: set x = xk and y = xk+1; then, removing the absolute value
on the left-hand side,

f(xk+1) − f(xk) + η‖∇f(xk)‖₂² ≤ (L/2)η²‖∇f(xk)‖₂²,

or, equivalently,

f(xk) − f(xk+1) ≥ η(1 − (L/2)η)‖∇f(xk)‖₂².   (10.12)

The right-hand side is largest if η = 1/L, at which point the improvement at
every step is at least (1/(2L))‖∇f(xk)‖₂², as announced.

Notice that we only use the Lipschitz property along the piecewise linear
curve that joins the iterates x0, x1, x2, . . . This can help when analyzing
functions f whose gradients are not globally Lipschitz continuous.

Algorithm 10.2 Backtracking Armijo line-search
1: Given: xk ∈ Rⁿ, c1 ∈ (0, 1), τ ∈ (0, 1), η̄ > 0
2: Init: η ← η̄
3: while f(xk) − f(xk − η∇f(xk)) < c1η‖∇f(xk)‖₂² do
4:   η ← τ · η
5: end while
6: return ηk = η.
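Algorithm 10.2 translates almost line for line into Python. The sketch below is illustrative (the quadratic test function and the parameter values c1 = 1/2, τ = 1/2, η̄ = 1 are choices made here, not prescribed by the notes):

```python
import numpy as np

def backtracking(f, grad_f, x, c1=0.5, tau=0.5, eta_bar=1.0):
    """Algorithm 10.2: shrink eta until sufficient decrease holds."""
    g = grad_f(x)
    sq_norm = g @ g
    eta = eta_bar
    while f(x) - f(x - eta * g) < c1 * eta * sq_norm:
        eta *= tau
    return eta

# Example on f(x) = 0.5 * x^T A x, whose gradient A x is L-Lipschitz
# with L = 10 here; by Lemma 10.14 below, the returned step lies in
# [min(eta_bar, 2(1 - c1) tau / L), eta_bar] = [0.05, 1].
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
x = np.array([1.0, 1.0])
eta = backtracking(f, grad_f, x)
```

Each pass through the while loop costs one evaluation of f, which is why the number of shrinking steps matters in practice.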

In practice, one rarely knows the Lipschitz constant L, and it is standard
to use a line-search algorithm, such as Algorithm 10.2. The
reasoning is as follows: based on (10.12), it is clear that any choice of ηk
in the open interval (0, 2/L) guarantees decrease in the cost function. To
further ensure sufficient decrease, it is merely necessary to find a step-size in
that interval which remains bounded away from both ends of the interval.
Algorithm 10.2 achieves this without knowledge of L, in a logarithmic number
of steps. (Again: the constant L appears in the bound of the lemma, but
need not be known to run the algorithm.) Another advantage is adaptivity:
in regions where f is less flat, the method will dare to make larger steps. Of
course, each iteration of Algorithm 10.2 involves one call to f, which may be
costly. In practice, a lot of engineering goes into fine-tuning the line-search
algorithm. For example, a nice touch is to “remember” the previous step-size
and to use it to better pick η̄ at the next iterate.

Lemma 10.14. Assume f has L-Lipschitz continuous gradient. The backtracking
line-search Algorithm 10.2 returns a step-size η which satisfies

η ≥ min(η̄, 2(1 − c1)τ/L)

in at most

max(1, 2 + log_{τ⁻¹}(Lη̄ / (2(1 − c1))))

calls to the cost function f. In particular, this means the sufficient decrease
condition (Def. 10.10) is met with

c ≥ min(c1η̄, 2c1(1 − c1)τ/L),

implying convergence of Algorithm 10.1. (Notice that the right-hand side is
always smaller than 1/(2L).)

Proof. When the line-search algorithm tries the step-size η, the Lipschitz
continuous gradient assumption, via (10.11), guarantees that

f(xk) − f(xk − η∇f(xk)) ≥ η(1 − (L/2)η)‖∇f(xk)‖₂².

(This is the same as (10.12).) If the algorithm does not stop, it must be that

η(1 − (L/2)η)‖∇f(xk)‖₂² < c1η‖∇f(xk)‖₂²,

or, equivalently, that 1 − (L/2)η < c1, so that

η > 2(1 − c1)/L.

As soon as η drops below this bound, we can be sure that the line-search
will return. This happens either at the very first guess, when η = η̄, or after
a step-size reduction by a factor τ, which cannot have reduced the step-size
below τ times the right-hand side, so that, when the algorithm returns, η
satisfies:

η ≥ min(η̄, 2(1 − c1)τ/L).   (10.13)
In other terms: the returned step-size is at least as large as a certain constant
in (0, 2/L).
In the worst case, how many times might we need to call f before the
line-search returns? After ℓ + 1 calls to the function f, the backtracking
line-search tries the step-size η̄τ^ℓ. Either the line-search returns η̄ after the
first call to f, or it returns

η̄τ^ℓ ≥ (2(1 − c1)/L) τ.

Hence, minding the fact that τ ∈ (0, 1) so that log(τ) < 0, we find successively

(ℓ − 1) log(τ) ≥ log(2(1 − c1)/(Lη̄)),
ℓ + 1 ≤ 2 + log(2(1 − c1)/(Lη̄)) / log(τ).

This concludes the proof, since ℓ + 1 is here the number of calls to f.
The main advantage of Lemma 10.11 is that it makes no assumptions
about the initial iterate x0 (compare this with our work in Chapters 1 and 4
of [SM03]). Yet, the guaranteed convergence rate is very slow: it is sublinear.
This is because, in the worst case, f may present large, almost flat regions
where progress is slow. Luckily, the convergence rate becomes linear if the
iterates get sufficiently close to a (strict) local optimum. Here is a statement
to that effect, with somewhat overly restrictive assumptions.
Lemma 10.15. Under the Lipschitz assumption on the gradient of f, if x∗
is a local optimum where 0 ≺ ∇²f(x∗) ≺ LIn, there exists a neighborhood
U of x∗ such that, if the sequence x0, x1, x2, . . . generated by gradient descent
with constant step-size ηk = 1/L ever enters the neighborhood U, then the
sequence converges at least linearly to x∗.
Proof (sketch). It is sufficient to observe that gradient descent in this
setting is simultaneous iteration through relaxation:

xk+1 = g(xk) = xk − (1/L)∇f(xk).   (10.14)

The Jacobian of g at the fixed point x∗ is Jg(x∗) = In − (1/L)∇²f(x∗). Under the
assumptions, the eigenvalues of ∇²f(x∗) all lie in (0, L), so that ‖Jg(x∗)‖₂ <
1. Thus, by continuity, g is a contraction map in a neighborhood of x∗. From
there, one can deduce linear convergence of xk to x∗ (after U has been
entered).
Notice that x∗ being a local optimum readily implies ∇²f(x∗) ⪰ 0, and
the Lipschitz condition readily implies ∇²f(x∗) ⪯ LIn. Linear convergence
can also be established when the line-search algorithm is used.

The main take-away (informal): if f is sufficiently smooth,
gradient descent with appropriate step-sizes converges to critical
points without conditions on the initial iterate x0; while the convergence
may be sublinear and to a saddle point, it is normally
expected that the convergence rate will eventually be linear and
to a local optimum. If f is convex, we get convergence to a global
minimum.

10.2 More algorithms

Continuous optimization is a whole field of research, extending far beyond
gradient descent. One important algorithm you actually already know: Newton's
method. Indeed, for convex f at least, minimizing f is equivalent
to finding x such that ∇f(x) = 0: this is a system of (usually) nonlinear
equations. Applying Newton's method to find roots of ∇f(x) leads to this
iteration:

xk+1 = xk − (∇²f(xk))⁻¹∇f(xk),

assuming the inverse exists. This step is computed by solving a linear system
where the matrix is the Hessian of f at xk. Importantly, one does not
construct the Hessian matrix to do this. That would be very expensive in most
applications. Instead, one resorts to matrix-free solvers, which only require
the ability to compute products of the form ∇²f(xk)u for vectors u.
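The Newton iteration above can be sketched in a few lines of Python (illustrative, not code from these notes; the dense `np.linalg.solve` call stands in for whatever linear solver is appropriate, and the strictly convex test function is a choice made here):

```python
import numpy as np

def newton(grad_f, hess_f, x0, num_iters=20):
    """Newton's method for optimization: at each step, solve a linear
    system whose matrix is the Hessian (never form the inverse)."""
    x = x0
    for _ in range(num_iters):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

# Example: f(x) = sum(exp(x_i) - x_i) is strictly convex, with
# gradient exp(x) - 1 and (diagonal) Hessian diag(exp(x));
# the unique minimizer is x = 0.
grad_f = lambda x: np.exp(x) - 1.0
hess_f = lambda x: np.diag(np.exp(x))
x = newton(grad_f, hess_f, np.array([1.0, -1.0]))
```

On this example the iterates converge quadratically once close to the minimizer, which is Newton's trademark behavior.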
Another interpretation of Newton's method for optimization is the following:
at xk, approximate f with a second-order Taylor expansion:

f(x) ≈ f(xk) + (x − xk)ᵀ∇f(xk) + ½(x − xk)ᵀ∇²f(xk)(x − xk).

If f is strictly convex,5 then this quadratic approximation of f is itself strictly
convex. Find x which minimizes the quadratic: this coincides with xk+1! In
other words: an iteration of Newton's method for optimization consists in
moving to the critical point of the quadratic approximation of f around the
current iterate.
5 Strictly convex means the Hessian is positive definite, rather than only positive
semidefinite.

What if f is not strictly convex? Then, the quadratic may not be convex,
so that its critical point is not necessarily a minimizer: it could be a maximizer,
or a saddle point. If such is the case, then moving to that point is ill
advised. A better strategy consists in recognizing that the Taylor expansion
can only be trusted in a small neighborhood around xk. Thus, the quadratic
should be minimized in that trusted region only (instead of blindly jumping
to the critical point). This is the starting point of the so-called trust-region
method, which is an excellent algorithm widely used in practice.
What if we do not have access to the Hessian? One possibility is to ap-
proximate the Hessian using finite differences of the gradient. Alternatively,
one can resort to the popular BFGS algorithm, which only requires access to
the gradient and works great in practice as well.
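The finite-difference idea for Hessian-vector products can be sketched as follows (illustrative; the step h and the quadratic test function are choices made here, and picking h well is a delicate matter in its own right, as discussed in Chapter 2):

```python
import numpy as np

def approx_hess_vec(grad_f, x, u, h=1e-6):
    """Approximate (Hessian of f at x) @ u by a finite difference of
    the gradient: (grad f(x + h u) - grad f(x)) / h."""
    return (grad_f(x + h * u) - grad_f(x)) / h

# Check on f(x) = 0.5 * x^T A x, whose Hessian is exactly A.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
grad_f = lambda x: A @ x
x = np.array([1.0, -1.0])
u = np.array([0.5, 2.0])
hv = approx_hess_vec(grad_f, x, u)
# hv is close to A @ u = [3.0, 6.5], up to finite-difference error.
```

Such approximate products are exactly what a matrix-free solver needs, at the cost of one extra gradient evaluation per product.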
Chapter 11

What now?

Let’s take a quick look in the mirror. We discussed numerical algorithms to
solve the following problems:

1. Solve Ax = b, for various kinds of A’s, also in a least-squares sense.
This led us to consider LU and QR factorizations.

2. Solve f(x) = 0 for f : Rⁿ → Rⁿ, with n = 1 or n > 1.

3. Compute eigenvalues and eigenvectors of a matrix A, Ax = λx.

4. Interpolate or approximate a function f with polynomials, which led
us to the mesmerizing world of orthogonal polynomials: f ≈ pn.
5. Approximate derivatives and integrals of functions, f′(x) and ∫ₐᵇ f(x) dx.

6. Compute the minimum of a function f , minx f (x).

For most problems, we also assessed, through analysis, which aspects can make
them harder or easier. For example, Ax = b is more difficult to solve if A is
poorly conditioned, and f is more difficult to approximate with polynomials
if its high-order derivatives go wild. We also acknowledged the effects of
inexact arithmetic.
Hopefully, you were convinced that mathematical proofs and numerical
experimentation inform each other. Both are necessary to gain confidence in
the algorithms we develop, and eventually use in settings where failure has
consequences.
What now? The problems we studied are fundamental, in that they
appear throughout the sciences and engineering. I am confident that you
will encounter these problems in your own work, in various ways. With even
more certainty, you will encounter problems we did not address at all. Some

of these appear in chapters of [SM03] we did not open. By now, you are well
equipped to study these problems and algorithms on your own.
1. Piecewise polynomial approximation (splines): instead of interpolating
f with a polynomial of high degree, divide [a, b] into smaller intervals,
and approximate f with a low-degree polynomial on each interval sep-
arately. Only, do it in a way that the polynomials “connect”: the
piecewise polynomial function should be continuous, and continuously
differentiable a number of times. These requirements lead to a banded
linear system: you know how to solve this efficiently.

2. Initial value problems for ODEs: in such problems, the unknown is a


function f , which has to satisfy a differential equation. If this equation
is linear and homogeneous, then the solutions are obtained explicitly
in terms of the eigenvectors and eigenvalues of the corresponding ma-
trix. If the system is linear but inhomogeneous, then explicit solutions
still exist if the inhomogeneous term is polynomial. If it is not polyno-
mial, you know how to approximate it accurately with a polynomial.
If the ODE is nonlinear, then stepping methods can be used: the gra-
dient descent algorithm is actually an example of a stepping method
(explicit Euler) applied to a specific ODE, so the concept is not too
foreign. Some of these stepping methods (implicit Euler) require solving
a nonlinear equation at each step. In electrical engineering, ODEs
dictate the behavior of electrical circuits with impressive accuracy. To
simulate a circuit before sending it to production (a costly stage), state
of the art industrial software solves these ODEs as efficiently and ac-
curately as they can. ODEs are also used to model chemical reactions
in industrial plants, ecosystems in predator-prey models, and so many
other situations. The natural question that ensues is: can we con-
trol these systems? The answer is yes to a large extent, and leads to
control theory, which ultimately relies on our capacity to solve ODEs
numerically.

3. Boundary value problems for ODEs: in one example of this, a differential
equation dictates the behavior of a system (for example, F = d(mv)/dt
dictates the ballistics of a space rocket), and the goal is to determine
the initial conditions one should pick (fuel loaded in the tank,
propeller settings, etc.) so that the system will attain a specified state
at a specified time (for example, attain a geostationary orbit and stay
there after a certain amount of time.) In this scenario, there exists
a function f which maps the initial conditions (the variables) to the
final conditions. The goal is to pick the variables such that the final
conditions are as prescribed; thus: we are looking for the roots of this
function. It is nonlinear, and to evaluate it we must solve the initial
value problem (for example using a stepping method): it is thus crucial
to use a nonlinear equation solver which requires as few calls to the
function as possible.
4. The finite element method (FEM) for ODEs and PDEs: this method
is used extensively in materials sciences, mechanical engineering, geo-
sciences, climate modeling and many more. For example, a differential
equation dictates the various mechanical constraints on the wings of
an airplane in flight, as a function of shape, wind speed and materials
used. Solving this equation informs us about which specific points of
the wing are at risk of breaking first, that is: which points should be
engineered with particular care. The solution to this PDE is a func-
tion, and it can be cast as the solution to an optimization problem (by
the same principle that states a physical system at rest is in the state
that minimizes energy.) Thus, we must minimize some cost function
(an energy function) over a space of functions. Spaces of functions are
usually infinite dimensional, so we have to do something about that.
The key step in FEM is to mesh the domain of the PDE (the airplane
wing) into small elements (tetrahedrons for example), and to define a
low-dimensional family of functions over this mesh. A trivial example is
to allow the value of the function at each mesh vertex to be a variable,
and to define the function at all other points through piecewise linear
interpolation. If the energy function is quadratic (as is normally the
case), minimizing it as a function of the vertex values boils down to one
(very) large and structured linear system (each element only interacts
with its immediate neighbors, so that the corresponding matrix is
sparse). Such systems are best solved with matrix-free solvers.
This idea of reducing an infinite dimensional optimization problem over a
space of functions to a finite dimensional problem (apparent in our work with
polynomial approximation, and also present in FEMs as described above) is
also the root of all modern applications of (deep) neural networks in machine
learning, which you have certainly heard about. There, the basic problem
is as follows: given examples (x1 , y1 ), . . . , (xn , yn ) (say, xi is an image, and
yi is a number which states whether this image represents a cat, a dog,
...), find a function f such that if we encounter a new image (an image we
have never seen before), then y = f (x) is a good indication of what the
image contains (a cat, a dog, ...) This problem is called learning. At its
heart, it is a function approximation problem if we assume that a “perfect”
function f indeed exists (what else could we do?). Neural networks are a fancy
way of defining a finite-dimensional space of nonlinear functions. This space
is parameterized by the weights of the neural connections. To learn, one
optimizes the weights such that the neural network appropriately predicts
the yi ’s at least for the known xi ’s. Thus, it is an optimization problem over
a finite-dimensional space of functions. Of course, there is a lot more to it
(for instance, it is of great importance to determine whether the obtained
neural network actually learned to generalize, as opposed to just being able
to parrot the known examples), but this is the gist of it.

Welcome to the world of numerical analysis!


Bibliography

[SM03] Endre Süli and David F. Mayers. An Introduction to Numerical
Analysis. Cambridge University Press, 2003.

[TBI97] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra,
volume 50. SIAM, 1997.
