
Numerical Methods

Radostin Simitev
Simon Candelaresi

September 15, 2020


Contents

1 Root finding 1
1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Interval bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Basic considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Algorithm and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Fixed-point iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.1 Solution of equations by iterative maps . . . . . . . . . . . . . . . . . . . . . 4
1.4.2 Convergence of iterated maps . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.3 Iterative refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.4 Order of convergence of iterative sequences . . . . . . . . . . . . . . . . . . 9
1.4.5 The Newton-Raphson method . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Hands-on projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.1 Fixed point iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.2 Finding all simple roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Interpolation and Approximation 13


2.1 Polynomial interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 The interpolation problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Newton’s recursive method . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Lagrange interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Approximation of functions by polynomial interpolants . . . . . . . . . . . . . . . . 18
2.2.1 The approximation problem . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Error bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Piecewise approximation and examples . . . . . . . . . . . . . . . . . . . . 20
2.3 Chebyshev economisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 The problem of grid optimization . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Chebyshev polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 The economization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Hands-on projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Lagrange polynomial interpolation . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Newton polynomial interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Numerical integration 29
3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 General setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.3 Types of quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Newton-Cotes quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 General rule and error bound . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Examples: Trapezium rule, Simpson rule, etc. . . . . . . . . . . . . . . . . . . 31
3.3 Gaussian quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.3.2 Orthogonal bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Gaussian quadrature rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.4 Summary and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Recurrence relations for orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . 41
3.5 Hands-on projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 Simple and composite quadrature . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.2 Double integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.3 Adaptive quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.3.2 Simpson’s and composite Simpson’s rules . . . . . . . . . . . . . . 43
3.5.3.3 The adaptive refinement procedure . . . . . . . . . . . . . . . . . 43
3.5.3.4 Illustration and validation . . . . . . . . . . . . . . . . . . . . . . 44

4 Systems of linear equations 45


4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Direct methods for solution of linear systems . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Gaussian elimination and LU-factorisation . . . . . . . . . . . . . . . . . . . 46
4.2.3 The Thomas tridiagonal algorithm . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.4 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.5 Solution of linear systems by LU-factorisation . . . . . . . . . . . . . . . . . 53
4.2.6 Efficient methods for LU-factorisation . . . . . . . . . . . . . . . . . . . . . 54
4.3 Iterative methods for solutions of linear systems of equations . . . . . . . . . . . . . . . 57
4.3.1 General formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Gauss-Jacobi and Gauss-Seidel methods . . . . . . . . . . . . . . . . . . . . 58
4.3.3 Convergence of linear iterative methods . . . . . . . . . . . . . . . . . . . . 60
4.3.3.1 Absolute condition . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.3.2 Diagonal dominance . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Successive Over-Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Eigenvalue approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.2 Gershgorin’s Circle Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5 Numerical differentiation by finite differences 69


5.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.1 General setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.2 Finite-difference problem formulation . . . . . . . . . . . . . . . . . . . . . 69
5.2 General derivative approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Using Taylor’s Theorem to find derivatives . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4 Partial derivatives and Differential operators . . . . . . . . . . . . . . . . . . . . . . 74
5.4.1 Laplacian and its computational molecule (stencil) . . . . . . . . . . . . . . . 75
5.5 Differentiation matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6 Discretisation of differential equations by finite-differences 79


6.1 Finite-difference problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 Discretization of steady-state problems (Space) . . . . . . . . . . . . . . . . . . . . . 80
6.2.1 Steady-state problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2.2 Linear ODEs with Dirichlet boundary conditions . . . . . . . . . . . . . . . 82
6.2.3 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2.4 “Experimental measurement” of the order of accuracy . . . . . . . . . . . . . 84
6.2.5 BVP with gradient boundary conditions . . . . . . . . . . . . . . . . . . . . 86
6.2.5.1 First-order for-/backward differences . . . . . . . . . . . . . . . . 86
6.2.5.2 Second-order for-/backward differences . . . . . . . . . . . . . . . 88

6.2.5.3 Fictitious nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


6.2.6 Non-linear equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.6.1 The multi-dimensional Newton-Raphson method . . . . . . . . . . 90
6.2.6.2 Use in nonlinear FD . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.7 Elliptic equations in R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.8 Richardson extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.9 Curvilinear domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Discretization of time-dependent problems (Time) . . . . . . . . . . . . . . . . . . . 98
6.3.1 The Euler method for ODE IVPs . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.2 The Euler method for the diffusion equation . . . . . . . . . . . . . . . . . . 99
6.3.3 The Richardson method for the diffusion equation . . . . . . . . . . . . . . . 99
6.3.4 The DuFort-Frankel method for the diffusion equation . . . . . . . . . . . . . . 101
6.3.5 The Crank-Nicolson method for the diffusion equation . . . . . . . . . . . . . 102
6.3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4 Hands-on projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4.1 A two-point boundary value problem . . . . . . . . . . . . . . . . . . . . . . 105
6.4.2 Poisson’s equation in a R2 annulus . . . . . . . . . . . . . . . . . . . . . . . 106

7 Convergence of finite-difference methods 109


7.1 Convergence, consistency and stability of steady-state problems . . . . . . . . . . . . 109
7.2 Convergence of linear parabolic problems . . . . . . . . . . . . . . . . . . . . . . . 112
7.2.1 General conditions for convergence illustrated on a particular case . . . . . . 112
7.2.2 Consistency of methods for the diffusion equation . . . . . . . . . . . . . . . 114
7.2.3 Stability of finite-difference methods for the diffusion equation . . . . . . . . 115
7.2.3.1 Direct matrix method . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.3.2 Method of Fourier modes . . . . . . . . . . . . . . . . . . . . . . 118
7.3 Convergence of the Euler method for nonlinear IVP ODEs . . . . . . . . . . . . . . . . 121
Chapter 1

Root finding

1.1 Problem formulation

Problem 1.1. Let f (x) be a real-valued continuous function on x ∈ [a,b]. Find the root x∗ of the algebraic
equation
f (x∗ ) = 0.
First of all, we need to make sure the problem is well-posed.
Claim 1.2. (Existence of solutions) If f (a) f (b) < 0 then there exists at least one c ∈ (a,b) such that f (c) = 0.

Proof. As f (a) f (b) < 0, f (x) changes sign on the interval. Since f (x) is continuous by the Intermediate
Value Theorem it takes all values between f (a) and f (b) including 0 at some c ∈ (a,b). 

Claim 1.3. (Uniqueness of solution) If f(x) is also differentiable and f′(x) > 0 or f′(x) < 0 on x ∈ (a,b),
then the solution of f(x) = 0 is unique.

Proof. The function is monotone (increasing or decreasing) so it takes all values between f(a) and f(b)
at most once. □

Remark 1.4. Practically, one may sketch or plot f (x) to make sure the problem is well-posed.

1.2 Errors

§ 1.5. Errors are of fundamental importance in Numerical Analysis. If the error of a numerical method
cannot be found then the method is essentially useless.
Definition 1.6. Let ĉ be the exact value (the exact mathematical answer to a problem being solved) and
c be an approximation obtained by some numerical method. The difference
E = ĉ−c
is called the error of the numerical method.
The absolute value of the error
|E | = | ĉ−c|
is called the absolute error of the numerical method.
§ 1.7. In most situations the exact value ĉ and the exact error E are not known. Often one may be able to
prove that there is a limitation (a bound) on the error.


Definition 1.8. Let ĉ be the exact value and c be an approximation obtained by some numerical method. If
there exists ε > 0 such that
    |ĉ − c| ≤ ε,
then ε is called an error bound of the numerical method.
Remark 1.9. The last inequality is also conventionally written as
    ĉ = c ± ε.
Often the exact error and the error bound are both called “the error”, partly because the exact error is unknown.
This, however, is not correct.
Remark 1.10. Often we require a numerical method to return a result within some prescribed tolerance
or accuracy τ. This means that we require
| ĉ−c| ≤ ε < τ.

1.3 Interval bisection

1.3.1 Basic considerations

§ 1.11. (Idea) The main idea of the Interval Bisection method is to start with an interval that contains exactly
one root and then construct a sequence of increasingly shorter intervals each of which also contains the
root. Once the root is “bracketed” in a sufficiently narrow interval we declare it found.
Claim 1.12. Let I0 = [a0, b0] contain a unique root x∗ of f(x∗) = 0. Then the root is also contained in each
member In of the following sequence {In, n = 0, 1, ...} of increasingly shorter intervals

    In = [an, bn] = [an−1, cn−1]   if f(an−1) f(cn−1) ≤ 0,
                    [cn−1, bn−1]   if f(bn−1) f(cn−1) ≤ 0,

with cn−1 = (an−1 + bn−1)/2.

Proof. Assume that In−1 = [an−1,bn−1 ] already contains the unique root of f (x∗ ) = 0. Split the interval
into two equal pieces using the midpoint cn−1 = (an−1 + bn−1 )/2. Now, if f (an−1 ) f (cn−1 ) ≤ 0 then there
is a change in sign in the first half of the interval so the root is contained there and we select In = [an−1,cn−1 ].
On the other hand, if f (bn−1 ) f (cn−1 ) ≤ 0 then there is a change in sign in the second half of the interval
so the root is contained there and we select In = [cn−1,bn−1 ].
The process can now be repeated narrowing the root further. 

Claim 1.13. Let In = [an, bn] contain a unique root x∗ of f(x∗) = 0. Let the midpoint cn = (an + bn)/2 be
chosen as a numerical approximation to the root x∗. Then the error of this approximation is bounded
by the error bound
    |x∗ − cn| ≤ ε = (bn − an)/2.

Proof. We wish to find an error bound ε so that
    |x∗ − c| ≤ ε.
By the definition of modulus this is equivalent to
    −ε ≤ x∗ − c ≤ ε,
i.e.
    c − ε ≤ x∗ ≤ c + ε.
Using interval notation instead,
    x∗ ∈ [c − ε, c + ε].
We (a) assume that c = cn = (an + bn)/2 and (b) require that [c − ε, c + ε] = [an, bn], so we solve either of
    (an + bn)/2 − ε = an
or
    (an + bn)/2 + ε = bn
to obtain ε = (bn − an)/2.
Reasoning more informally, the point cn is in the middle of the interval [an, bn]. The exact solution x∗ is also contained
within the interval, x∗ ∈ [an, bn], and so the distance between the exact and the approximate solution is at most
half the length of the interval. This distance is the error bound. □

1.3.2 Algorithm and analysis

We can now devise the following Interval Bisection algorithm.


Algorithm 1.14. (Interval Bisection Method)
1. Choose a starting interval I0 = [a0, b0] that contains a unique root x∗ of f(x∗) = 0.
2. Construct a sequence of increasingly narrower intervals In in the following way.
   (a) Take the mid-point
           cn−1 = (an−1 + bn−1)/2.
   (b) If f(an−1) f(cn−1) < 0 then select In = [an−1, cn−1].
   (c) If f(cn−1) f(bn−1) < 0 then select In = [cn−1, bn−1].
   (d) If f(cn−1) = 0 then cn−1 is the root exactly. STOP.
3. Repeat 2 until n ≥ N, where N is a given fixed number of iterations, or until the length of the interval
   satisfies (bn − an) < 2τ, where τ is a prescribed tolerance.
4. The approximation to the root at each step n is cn. This approximation is bounded by the error bound
       |x∗ − cn| ≤ εn = (bn − an)/2 = (bn−1 − an−1)/2^2 = ··· = (b0 − a0)/2^(n+1).
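
Algorithm 1.14 translates almost line by line into code. The following MATLAB function is a minimal
sketch of the method; the name bisect and its argument list are illustrative and not part of these notes.

    function c = bisect(f, a, b, tau, nmax)
    % Minimal sketch of Algorithm 1.14 (interval bisection).
    % f   - function handle with f(a)*f(b) < 0
    % tau - prescribed tolerance;  nmax - maximum number of iterations
        if f(a)*f(b) > 0, error('f must change sign on [a,b]'); end
        for n = 1:nmax
            c = (a + b)/2;                  % midpoint = current approximation
            if f(c) == 0, return; end       % hit the root exactly (step 2d)
            if f(a)*f(c) < 0                % root bracketed in [a,c] (step 2b)
                b = c;
            else                            % root bracketed in [c,b] (step 2c)
                a = c;
            end
            if (b - a) < 2*tau, break; end  % error bound (b-a)/2 below tolerance
        end
        c = (a + b)/2;
    end

For instance, bisect(@(x) cos(x) - x, 0, pi/2, 5e-3, 50) reproduces the root x ≈ 0.74 of Example 1.16 below.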
Example 1.15. Find the number of iterations n needed to achieve accuracy of N decimal digits using
the interval bisection algorithm.
Solution. If we require N decimal places of accuracy then the tolerance is
    τ = (1/2) 10^−N = 5×10^−(N+1).
So the error bound needs to be tighter than the tolerance,
    εn ≤ 5×10^−(N+1).
Solving
    (b0 − a0)/2^(n+1) ≤ (10/2) 10^−(N+1),
we find that the number of iterations n must satisfy
    n ≥ (N log 10 + log(b0 − a0))/log 2
to achieve the desired accuracy. ^
Example 1.16. Find the root of f (x) = cosx − x to accuracy of 2 decimal places.
Solution. Show by using a sketch that f (x) = cosx − x has a single positive real root and find it correct to
2 decimal places.
From a sketch of cosx on the interval in question it is clear that there is only one root, x∗ ∈ [0,π/2]. Check
f (0) = 1−0 = 1 > 0 and f (π/2) = 0− π/2 = −π/2 < 0. Take the initial interval as [a0,b0 ] = [0,π/2] then
table 1.1 shows the results of interval bisection.
To obtain the solution correct to 2 decimal places requires
    n ≥ (2 log 10 + log(π/2))/log 2 ≈ 7.3,
so that at least 8 iterations are needed for the desired accuracy. The solution correct to 2 d.p. is x ≈ 0.74
(see Table 1.1). ^

  n     an        bn        cn        f(an)   f(bn)   f(cn)    En
  0     0         π/2       π/4         +       -       -      π/4
  1     0         π/4       π/8         +       -       +      0.3927
  2     0.3927    0.7854    0.5890      +       -       +      0.1963
  3     0.5890    0.7854    0.6872      +       -       +      0.0982
  4     0.6872    0.7854    0.7363      +       -       +      0.0491
  5     0.7363    0.7854    0.7609      +       -       -      0.0245
  6     0.7363    0.7609    0.7486      +       -       -      0.01227
  7     0.7363    0.7486    0.7424      +       -       -      0.0061
  8     0.7363    0.7424    0.7394      +       -       -      0.003

Table 1.1: The results of interval bisection.

1.4 Fixed-point iterative methods

1.4.1 Solution of equations by iterative maps

Definition 1.17. The point x∗ is called a fixed point of the map g(x) if x∗ = g(x∗ ).
Claim 1.18. Let f (x) = g(x)− x. Then the equations f (x) = 0 and x = g(x) are equivalent and the fixed
point x∗ of g(x), if it exists, is a root of f (x) = 0

Proof. Obviously true. 


Remark 1.19. If f (x) = 0 ⇔ x = g(x), then the iterated map (the iterative scheme) xn+1 = g(xn ) can be used
to find the root of f (x) = 0 if the sequence {xn } converges to a fixed point x∗ as n → ∞. The solution is x∗ .
Example 1.20. Devise an iterative scheme to solve the equation f (x) = 2sin(x)− x = 0.
Solution. f (x) = 2sin(x)− x = 0 = g(x)− x so xn+1 = 2sinxn is an iterative fixed point map. ^
§ 1.21. This section is concerned with this class of iterative methods, called “fixed-point iterative methods”.

1.4.2 Convergence of iterated maps

For an iterative method to work, it must converge to its fixed-point. We now establish conditions for
convergence.
Definition 1.22 (Lipschitz condition). A function g(x) is said to satisfy a Lipschitz condition in an interval
I = [a,b] if L ≥ 0 exists such that
    |g(x) − g(y)| ≤ L |x − y|,    ∀x, y ∈ I.
The constant L is called a Lipschitz constant.

Claim 1.23. If g(x) is Lipschitz then it is continuous.

Proof. Immediately from the definition. □

Claim 1.24. If g(x) is differentiable in an interval I, with bounded derivative, then g(x) is Lipschitz with
Lipschitz constant L = max_{η∈I} |g′(η)|.

Proof. By the Mean Value Theorem there exists η ∈ (x,y) such that
    g′(η) = (g(y) − g(x))/(y − x).
Then
    |g(x) − g(y)| = |g′(η)| × |x − y| ≤ L |x − y|,
where
    L = max_{η∈I} |g′(η)|. □

Definition 1.25 (Contraction mapping). The function g(x) is called a contraction mapping in I if it satisfies
a Lipschitz condition in I with a Lipschitz constant L < 1.

Remark 1.26. It is called a contraction since, when L < 1,
    |g(x) − g(y)| < |x − y|,
so the distance between g(x) and g(y) is less than the distance between x and y.

Claim 1.27 (Fixed Point Theorem). Let, for all x ∈ [a,b],
1. g ∈ C[a,b] (i.e. g is continuous),
2. g(x) ∈ [a,b] (i.e. g maps [a,b] into itself).
Then g(x) has at least one fixed point in the interval [a,b].
Let, in addition,
3. |g′(x)| ≤ L < 1 on (a,b) (i.e. let g(x) be a contraction on the interval (a,b): for all x ∈ (a,b) the derivative
   g′ exists and a Lipschitz constant 0 < L < 1 exists with |g′(x)| ≤ L < 1).
Then g(x) has a unique fixed point x∗ = g(x∗) in the interval [a,b].
Furthermore, for any initial guess x0 ∈ [a,b], the sequence defined by xn = g(xn−1), n ≥ 1, converges to the
unique fixed point x∗ ∈ [a,b].

Proof. Consider the function f(x) = g(x) − x.

(Existence) First, f(x) ∈ C[a,b]. Second, f(a) = g(a) − a ≥ 0 since g(a) ∈ [a,b], and also f(b) = g(b) − b ≤ 0
since g(b) ∈ [a,b]. So f(a) f(b) ≤ 0 and f changes sign on the interval, so by Claim 1.2 the equation f(x) = 0
has at least one root in [a,b], and these roots are fixed points of g(x).

(Uniqueness) The function f(x) is monotonically decreasing on [a,b]. Indeed,
    f′(x) < 0  ⟺  g′(x) − 1 < 0  ⟺  g′(x) < 1,
which is true since |g′(x)| < 1. Then by Claim 1.3 the equation f(x) = 0 has exactly one (unique) root in [a,b].

(Convergence from any initial guess) Using the iterative scheme xn = g(xn−1) we first derive the following
useful result:
    |xn+1 − xn| = |g(xn) − g(xn−1)| ≤ L|xn − xn−1|
                ≤ L^2 |xn−1 − xn−2|
                ≤ L^3 |xn−2 − xn−3|
                ...
                ≤ L^n |x1 − x0| = L^n |g(x0) − x0|.
Now, let x∗ be the unique fixed point of g that we just found, so x∗ = g(x∗). Consider
    |xn+1 − x∗| = |g(xn) − x∗| = |g(xn) − g(x∗)| ≤ L|xn − x∗|,
and applying this repeatedly,
    |xn+1 − x∗| ≤ L^(n+1) |x0 − x∗|.
As n → ∞, L^(n+1) → 0 (because L < 1). Therefore
    |xn − x∗| → 0,
and so xn → x∗ and the method converges to the fixed point. □

Example 1.28. Find all roots of e^x = 3x^2.

Solution. First we observe that 3x^2 = e^x has roots in each of the intervals [−1,0], [0,1] and [3,4].
The first step is to rearrange to obtain the form x = g(x). This can be done in infinitely many ways, but
some obvious choices present themselves.
1. x = √(e^x/3). In this case g(x) = √(e^x/3). We must check to see if g is a contraction in any of the
   intervals, so
       g′(x) = e^(x/2)/(2√3).
   On the interval [0,1] we have |g′(x)| < 1, so the iterative method
       xn+1 = e^(xn/2)/√3
   will converge to the root in [0,1].
2. x = −√(e^x/3). In this case g(x) = −√(e^x/3). We must check to see if g is a contraction in any of the
   intervals, so
       g′(x) = −e^(x/2)/(2√3).
   On the interval [−1,0] we have |g′(x)| < 1, so the iterative method
       xn+1 = −e^(xn/2)/√3
   will converge to the root in [−1,0].
3. Another rearrangement is x = log(3x^2) = log 3 + 2 log x, in which case g(x) = log 3 + 2 log x. We must
   check that it is a contraction in [3,4], so
       g′(x) = 2/x,
   and on the interval [3,4] we have |g′(x)| < 1, so the iterative method
       xn+1 = log 3 + 2 log xn
   will converge to the root in [3,4].
^
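
The three convergent rearrangements above are easily tried out numerically. The following MATLAB script
is a minimal sketch (it assumes a MATLAB release that allows local functions in scripts, R2016b or later;
the function name fixedpoint and the starting guesses are illustrative only).

    % Apply the three convergent rearrangements of e^x = 3x^2 from Example 1.28.
    r1 = fixedpoint(@(x)  exp(x/2)/sqrt(3),  0.5, 1e-10, 100);   % root in [0,1]
    r2 = fixedpoint(@(x) -exp(x/2)/sqrt(3), -0.5, 1e-10, 100);   % root in [-1,0]
    r3 = fixedpoint(@(x) log(3) + 2*log(x),  3.5, 1e-10, 100);   % root in [3,4]
    disp([r1 r2 r3])

    function x = fixedpoint(g, x0, tol, nmax)
    % Minimal fixed-point iteration x_{n+1} = g(x_n), stopped when successive
    % iterates agree to within tol or after nmax steps.
        x = x0;
        for n = 1:nmax
            xnew = g(x);
            if abs(xnew - x) < tol, x = xnew; return; end
            x = xnew;
        end
    end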
Claim 1.29. An error bound at the n-th step of the iterative process xn+1 = g(xn) is
    |xn − x∗| ≤ L^n/(1 − L) |g(x0) − x0|,
assuming that the sequence {xn} converges to a fixed point x∗ and g(x) is a contraction.

Proof. This result tells us that the error depends on L and on the quality of the initial guess, |g(x0) − x0|.
Take m > n,
    |xm − xn| = |xm − xm−1 + xm−1 − xm−2 + ··· + xn+1 − xn|
              ≤ |xm − xm−1| + |xm−1 − xm−2| + ··· + |xn+1 − xn|
              ≤ L^(m−1)|x1 − x0| + L^(m−2)|x1 − x0| + ··· + L^n|x1 − x0|
              = L^n (1 + L + L^2 + ··· + L^(m−n−1)) |x1 − x0|
              = L^n (1 − L^(m−n))/(1 − L) |x1 − x0|.
Take m → ∞; we have already shown xm → x∗ and L^m → 0. Then note that |x1 − x0| = |g(x0) − x0| to give
    |xn − x∗| ≤ L^n/(1 − L) |g(x0) − x0|. □

Remark 1.30. The following is a variant of the existence and uniqueness part of the Fixed Point Theorem 1.27 given above. It
provides a way to look for an initial guess x0 and a neighbourhood radius r where a unique fixed point exists.

Claim 1.31 (Banach fixed-point theorem). Let g(x) be a contraction mapping in the neighbourhood of x0 given by I = [x0 − r, x0 + r],
r > 0 (i.e. let g(x) satisfy a Lipschitz condition in the interval I with Lipschitz constant 0 ≤ L < 1). Let the initial estimate x0 for
the fixed point of g(x) be such that
    |g(x0) − x0| ≤ (1 − L) r.
Then
1. All members of the sequence {xn} generated by xn+1 = g(xn) lie in the interval I.
2. The sequence {xn} generated by xn+1 = g(xn) converges to a fixed point x∗.
3. The fixed point x∗ is unique in I.

Proof. We first derive the following useful result:
    |xn+1 − xn| = |g(xn) − g(xn−1)| ≤ L|xn − xn−1|
                ≤ L^2 |xn−1 − xn−2|
                ≤ L^3 |xn−2 − xn−3|
                ...
                ≤ L^n |x1 − x0| = L^n |g(x0) − x0| ≤ L^n (1 − L) r.

1. Now we prove that the distance between any xn+1 and the initial guess is less than the radius of the neighbourhood:
    |xn+1 − x0| = |xn+1 − xn + xn − xn−1 + ··· − x1 + x1 − x0|           (adding and subtracting terms)
                ≤ |xn+1 − xn| + |xn − xn−1| + ··· + |x1 − x0|            (triangle inequality)
                ≤ (L^n + L^(n−1) + ··· + L + 1) |x1 − x0|                (result above)
                = (1 − L^(n+1))/(1 − L) |g(x0) − x0|                     (x1 = g(x0))
                ≤ (1 − L^(n+1)) r ≤ r                                    (quality of initial guess)
so that xn+1 ∈ I.

2. We will show that the sequence {xn} is a Cauchy sequence. Take m > n,
    |xm − xn| = |xm − xm−1 + xm−1 − xm−2 + ··· + xn+1 − xn|
              ≤ |xm − xm−1| + |xm−1 − xm−2| + ··· + |xn+1 − xn|
              ≤ L^(m−1)|x1 − x0| + L^(m−2)|x1 − x0| + ··· + L^n|x1 − x0|
              = L^n (1 + L + L^2 + ··· + L^(m−n−1)) |x1 − x0|
              = L^n (1 − L^(m−n))/(1 − L) |x1 − x0|
              ≤ L^n (1 − L^(m−n))/(1 − L) (1 − L) r
              = L^n (1 − L^(m−n)) r
              ≤ L^n r.
Now we show that for any given ε > 0 we can find an N(ε) such that for n > N(ε) we have |xm − xn| < ε, so that {xn} is a Cauchy
sequence (see definition below) and so converges to a fixed point x∗ (see claim below). Indeed, it is possible to solve
    |xm − xn| ≤ L^n r < ε
for n for any ε:
    n > (log ε − log r)/log L.

3. Assume there is another solution in the interval I, say y∗. Then, as the solutions are distinct, we have |x∗ − y∗| > 0. As both are solutions,
    |x∗ − y∗| = |g(x∗) − g(y∗)| ≤ L|x∗ − y∗| < |x∗ − y∗|.
This is a contradiction, so x∗ is unique. □

In the proof we used the following result.

Definition 1.32 (Cauchy sequence). A sequence {x1, x2, x3, ...} of real numbers is called a Cauchy sequence if for every positive
real number ε > 0 there exists a positive integer N such that for all natural numbers m, n > N
    |xm − xn| < ε.

Claim 1.33. All Cauchy sequences converge (in a complete metric space).

1.4.3 Iterative refinement

§ 1.34. If a map g(x) is not a contraction map in the neighbourhood of the desired fixed point, a simple
modification can be made to turn g into a contraction. The idea is to introduce a parameter which we can
adjust so that the conditions for convergence are satisfied.
Claim 1.35. Let x∗ be a fixed point of g(x). Then x∗ is also a fixed point of the map
    G(x, λ) = (g(x) + λx)/(1 + λ),
where λ ≠ −1 is an arbitrary parameter.

Proof. Let x∗ be the fixed point of g(x). Then, provided λ ≠ −1, we have
    x∗ = g(x∗),
    x∗ + λx∗ = g(x∗) + λx∗,
    x∗ = (g(x∗) + λx∗)/(1 + λ) = G(x∗, λ).
So x∗ is also a fixed point of
    G(x, λ) = (g(x) + λx)/(1 + λ). □

Remark 1.36. Selecting a value of λ s.t. |∂x G(x, λ)| < 1 on x ∈ [a,b], so as to make G(x, λ) a contraction
mapping and ensure that
    xn+1 = G(xn, λ)
converges to x∗, is called iterative refinement.

Example 1.37. Consider xn+1 = G(xn, λ) with G(x, λ) defined as in Claim 1.35. Let x∗ be a fixed point of
g(x). Select a value of the parameter λ ≠ −1 so that G′(x∗) is minimal by modulus.

Solution. A wise choice of λ is to make G′(x) as small as possible in a neighbourhood of x∗. Consider
    ∂x G(x, λ) = (g′(x) + λ)/(1 + λ).
Take λ = −g′(x0), where x0 is the initial guess. Then with this choice of λ check that G is a contraction
on our interval. If it is, we can safely use G to iterate to the required fixed point. ^
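
A minimal MATLAB sketch of iterative refinement follows; the test map, starting guess and finite-difference
step are illustrative only. The map g(x) = √(e^x/3) is not a contraction near the root of e^x = 3x^2 in [3,4]
(see Example 1.28), so the plain iteration diverges there, whereas the refined map G(x,λ) converges.

    % Iterative refinement (Claim 1.35 / Example 1.37):
    % iterate G(x,lambda) = (g(x) + lambda*x)/(1 + lambda) with lambda = -g'(x0),
    % the derivative being estimated by a central finite difference.
    g   = @(x) sqrt(exp(x)/3);               % not a contraction on [3,4]
    x0  = 3.5;  h = 1e-6;                    % illustrative initial guess and step
    lam = -(g(x0 + h) - g(x0 - h))/(2*h);    % lambda = -g'(x0)  (note lam ~= -1 here)
    G   = @(x) (g(x) + lam*x)/(1 + lam);
    x = x0;
    for n = 1:50
        xnew = G(x);
        if abs(xnew - x) < 1e-10, x = xnew; break; end
        x = xnew;
    end
    x                                        % approximately 3.733, the root in [3,4]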

1.4.4 Order of convergence of iterative sequences

§ 1.38. Once a sequence produced by an iterative method converges, it is of practical importance to know
how fast it converges compared to other iterative sequences.
The speed (or the rate) of convergence is measured by the so-called order of convergence.

Definition 1.39 (Order of convergence of sequences). The order of convergence of a sequence {xn, n = 0,1,...}
to its fixed point x∗ is defined by the integer α such that
    lim_{n→∞} en+1/en^α = C ≠ 0,
where en = xn − x∗.

Example 1.40. Show that interval bisection is an order 1 method.
Solution. Consider the error after n iterations, en. We know that the error on the next iteration (so after
n+1 iterations) is at most half the error on the current iteration,
    en+1 = (1/2) en.
The interval length halves after each iteration, as does the error. Rearranging this gives
    en+1/en = 1/2,
and on taking the limit we see that α = 1 and C = 1/2, so that interval bisection is an order 1 method. ^

Claim 1.41. Let {xn+1 = g(xn), n = 0,1,...} be a sequence with fixed point x∗. Let g(x) be k+1 times
differentiable in a neighbourhood of x∗ and let g′(x∗) = g″(x∗) = ... = g^(k−1)(x∗) = 0 and g^(k)(x∗) ≠ 0. Then
the order of convergence is k, the order of the first non-zero derivative of g evaluated at the fixed point
x∗. (Note that g^(n) is the n-th derivative of g.)

Proof. Taylor expand about x∗ as follows:
    xn+1 = g(xn) = g(xn − x∗ + x∗) = g(x∗ + en)
          = g(x∗) + en g′(x∗) + (1/2) en^2 g″(x∗) + ··· + (1/(k−1)!) en^(k−1) g^(k−1)(x∗)
            + (1/k!) en^k g^(k)(x∗) + (1/(k+1)!) en^(k+1) g^(k+1)(η)
          = x∗ + (1/k!) en^k g^(k)(x∗) + (1/(k+1)!) en^(k+1) g^(k+1)(η),   where η ∈ (xn, x∗).
Now
    en+1 = (1/k!) en^k g^(k)(x∗) + (1/(k+1)!) en^(k+1) g^(k+1)(η),
    en+1/en^k = (1/k!) g^(k)(x∗) + (1/(k+1)!) en g^(k+1)(η),
    lim_{n→∞} en+1/en^k = (1/k!) g^(k)(x∗) ≠ 0,
so that the method is order k. □

§ 1.42. It is possible to evaluate the derivatives g^(k)(x∗) even if the fixed point x∗ is not known. See the Claim
about the order of convergence of the Newton-Raphson method, where the known fact f(x∗) = 0 is used.
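
In practice the order of convergence can also be “measured” from successive differences en = |xn − xn−1|,
since en+1 ≈ C en^α gives α ≈ log(en+1/en)/log(en/en−1). The following MATLAB sketch applies this to
the Newton-Raphson iteration for f(x) = cos x − x purely as an illustration (the map, starting guess and
number of steps are not from the notes).

    g  = @(x) x - (cos(x) - x)/(-sin(x) - 1);   % Newton map for cos(x) - x
    x  = 1.0;  xs = x;                          % illustrative initial guess
    for n = 1:4
        x = g(x);  xs(end+1) = x;               %#ok<AGROW>
    end
    e = abs(diff(xs));                          % e_n = |x_n - x_{n-1}|
    alpha = log(e(3:end)./e(2:end-1)) ./ log(e(2:end-1)./e(1:end-2))
    % alpha comes out close to 2, in line with the quadratic convergence of
    % Newton's method (see Claim 1.50 below).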

1.4.5 The Newton-Raphson method

§ 1.43. An iterative method that is frequently used and very efficient is the Newton-Raphson method.
§ 1.44. The following notation will be useful throughout the lecture notes.

Definition 1.45 (Big-O notation). Let f(x) and g(x) be two functions with identical domains. We write
    f(x) = O(g(x))
if there is M > 0 and a ∈ R s.t.
    |f(x)| ≤ M |g(x)|   for all x > a.
Remark 1.46. This means that, asymptotically, f(x) grows no faster than a constant multiple of g(x).

With this we can now introduce the Newton-Raphson method for the solution of f(x) = 0.

Claim 1.47 (Newton-Raphson formula). The root x∗ of f(x) = 0 is given by
    x∗ = x0 − f(x0)/f′(x0) − (1/2) (f″(η)/f′(x0)) (x∗ − x0)^2,
where x0 is arbitrary and η is some point in (x0, x∗).
Using Big-O notation we write
    x∗ = x0 − f(x0)/f′(x0) + O((x∗ − x0)^2).

Proof. Let x∗ be the exact solution of f(x∗) = 0 and let x0 be some “initial guess”. Then we can expand f(x)
using Taylor's theorem. By Taylor's theorem there exists an η ∈ (x0, x∗) s.t.
    0 = f(x∗) = f(x0 + (x∗ − x0))
              = f(x0) + (x∗ − x0) f′(x0) + (1/2) f″(η)(x∗ − x0)^2
              = f(x0) + x∗ f′(x0) − x0 f′(x0) + (1/2) f″(η)(x∗ − x0)^2.
Now solve for x∗ to get
    x∗ = x0 − f(x0)/f′(x0) − (1/2) (f″(η)/f′(x0)) (x∗ − x0)^2. □

Algorithm 1.48 (Newton-Raphson method).
• Assume that the error in the initial guess is sufficiently small, |x∗ − x0| ≪ 1; then the terms O((x∗ − x0)^2)
  are even smaller and can be neglected.
• The remaining part of the formula is then no longer exact, so we use
      xn+1 = xn − f(xn)/f′(xn)
  to generate an iterative sequence converging to the solution x∗ of the original equation f(x) = 0.
  This is the Newton-Raphson method.
• Terminate the iterations after some tolerance criterion is met,
      |xn − xn−1| < τ,
  where τ is the desired accuracy.
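
A minimal MATLAB sketch of Algorithm 1.48 follows; the name newton and its argument list are
illustrative and not part of these notes.

    function x = newton(f, df, x0, tau, nmax)
    % Minimal sketch of the Newton-Raphson method (Algorithm 1.48).
    % f, df - handles for f(x) and f'(x);  x0 - initial guess;
    % tau   - desired accuracy;  nmax - maximum number of iterations.
        x = x0;
        for n = 1:nmax
            xnew = x - f(x)/df(x);            % Newton-Raphson update
            if abs(xnew - x) < tau            % tolerance criterion of Algorithm 1.48
                x = xnew; return
            end
            x = xnew;
        end
        warning('Newton-Raphson did not reach the tolerance in nmax steps.');
    end

For example, newton(@(x) cos(x) - x, @(x) -sin(x) - 1, 1, 1e-10, 50) returns x∗ ≈ 0.7391, the root of
Example 1.16, in a handful of iterations.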
§ 1.49. This procedure is fast when it converges, but it need not converge at all. Whether it does or not
depends on the quality of the starting guess and the properties of the function f(x). To execute the method
we require f(x) and f′(x), both of which could be computationally costly to calculate.

Claim 1.50. The Newton-Raphson method for calculating a simple root of f(x) always converges in a
sufficiently small neighbourhood of the solution x∗ and has a quadratic order of convergence (in general).

Proof. A simple root means that f′(x∗) ≠ 0. Newton's method is
    xn+1 = xn − f(xn)/f′(xn) = g(xn)
with g(x) = x − f(x)/f′(x). Calculate the derivatives at the fixed point:
    g′(x) = 1 − f′/f′ + f f″/(f′)^2 = f f″/(f′)^2,
so
    g′(x∗) = 0,
as f(x∗) = 0. This means the method is not linear; it is at least quadratic since
    g″(x) = f″/f′ + f f‴/(f′)^2 − 2 f (f″)^2/(f′)^3,
which evaluated at the fixed point gives g″(x∗) = f″(x∗)/f′(x∗) ≠ 0 in general. Therefore Newton's method is
an order 2, or quadratic, method. The constant is used to estimate the behaviour of the error,
    en+1 ≈ (1/2) (f″(x∗)/f′(x∗)) en^2. □

1.5 Hands-on projects

1.5.1 Fixed point iterations

Question. In this project you will devise and implement in Matlab fixed-point iterative methods for the
solution of the nonlinear equation
    log x − 1/(x − 1) = 0.
1. Produce a figure to illustrate that this equation has two separate roots – one in the interval (0,1), and a
second one in the interval (2,3). Justify your observations using existence and uniqueness arguments.
(Marks: 6)
2. Create a Matlab function iterate(g,x0,tol,maxiter) implementing a fixed point iteration
of the type xn+1 = g(xn ). The Matlab function should take as arguments the mathematical function to
be iterated, the initial guess, the required tolerance, and a “safety” parameter specifying the maximum
number of iterations allowed. The function should return a list [xn, errn, ordn] of three lists
containing all values of the iterates {xn,n = 0,...}, the error measured {en = |xn − xn−1 |,n = 0,...} and
the order of the iterative scheme, respectively, computed at each iteration step. (Marks: 10)
3. Consider the iterative scheme g(x) = exp(1/(x − 1)) on the interval (0,1).
   (a) Produce a plot of g′(x) to illustrate that this scheme will converge on this interval.
   (b) Produce figures of the iterates, the error measured, and the order of the iterative scheme in this
       case if accuracy of 6 decimal digits is required. Comment on the behaviour of the error. Provide
       the value of the root. What is the order of convergence?
   (c) Choose a suitable Lipschitz constant and use it to find a theoretical estimate of the number of
       iterations so that the estimate is in good agreement with the actual iteration count. (Marks: 10)
4. Consider the iterative scheme g(x) = exp(1/(x − 1)) on the interval (2,3).
   (a) Produce a plot of g′(x) to illustrate that this scheme will diverge on this interval.
   (b) Produce figures of the iterates, the error measured, and the order of the iterative scheme with
       an initial guess arbitrarily close to the solution (which you may identify graphically). Comment on
       the behaviour of the error. Use maxiter = 10. (Marks: 6)
5. Use the iterative refinement method to define a new scheme G(x,λ) based on g(x) = exp(1/(x − 1))
   so that it is convergent on the interval (2,3). Repeat all tasks requested in part 3 for the new map
   G(x,λ). (Marks: 8)

1.5.2 Finding all simple roots

Devise a robust and efficient scheme for finding all simple roots of a given (real nonlinear scalar) equation
f (x) = 0 in a given interval x ∈ [a,b].
Background. Important qualities of numerical methods, in addition to convergence, are (a) efficiency –
requires a small number of function evaluations; (b) robustness – fails rarely, if ever; (c) minimality - uses
a minimal amount of user input, e.g. does not require the derivative in addition to the function. No single

method meets all criteria. For instance, the Newton-Raphson map g[ f ](x) is guaranteed to converge quickly
if the initial iterate x0 is “close enough” to a simple root so that |gx0 [ f ](x0 )| < 1. However, convergence is only
local and the neighbourhood of convergence is difficult to assess ahead of time for a given f (x). In contrast,
the bisection method is slow but it is guaranteed to converge provided only that a bracketing interval is found
on which f (x) changes sign. A natural idea, then, is to combine the two in a fast and robust hybrid method.

Question 1. Write a MATLAB function [ai,ci,bi] = bisect(f,ai,bi,niter) to perform the


interval bisection method for a given function f on an interval given by its left and right ends ai and bi.
The MATLAB function must check if f (x) changes sign in the interval and if so perform a specified number
of iterations niter and then return the estimated root in ci and the left and right ends of the bracket where
it is located in ai and bi; else return in ci a NaN MATLAB object. (Marks: 8)
Question 2. Write a MATLAB function ci = newtonraphson(f,x0,tol) to perform the Newton-
Raphson method for the function f starting from initial iterate x0. The function must check if the absolute
value of the derivative of the Newton-Raphson map at x0 is less than unity, and if so perform iterations
and return in ci the estimated root; else return in ci a NaN value. To avoid non-essential input all derivatives
needed must be computed using the finite-difference formula gprim=(g(x+eps)-g(x-eps))/(2*eps),
where eps=1e-5. Use |xk − xk−1 | < tol(1 + |xk |) as a stopping condition for iterations; what is your
interpretation of this condition? (Marks: 8)
Question 3. Write a MATLAB function rts=findroots(f,a,b,n,tol) to combine the bisection
and the Newton-Raphson methods as follows. Uniformly partition the given interval [a,b] into n intervals
of equal length. For each subinterval [ai,bi ] over which the function changes sign use 3 bisection steps
to approach the root more closely calling your bisect. Next, call your newtonraphson to perform
Newton-Raphson iterations if the method converges, otherwise revert back to the last bisection bracket
and apply 3 further bisection steps. Loop until the stopping condition | f (xk )| < tol is met. Return in the
array rts all roots found. (Marks: 14)
Question 4. Write a MATLAB script Q04.m where the function findroots is used to find all of the roots of the
Chebyshev polynomials Tn (x) of degrees n from 1 to 20 with tolerance tol=1e-12. In one figure, plot all roots
found on the abscissa axis versus the polynomial degree on the ordinate axis. In a second figure, plot together
the graphs of all Chebyshev polynomials Tn (x). For the polynomial T15 compare the roots found numerically
with the known exact solutions and print the absolute error in each root in the format of the following example.

Roots and their Errors for the Chebyshev polinomial T[5]:


-9.51056516295153531182e-01 +0.00000000000000000000e+00
-5.87785252292473137103e-01 +0.00000000000000000000e+00
+3.21265594383589607162e-17 +9.33588993957266210749e-17
+5.87785252292473137103e-01 +1.11022302462515654042e-16
+9.51056516295153531182e-01 +0.00000000000000000000e+00

[Hint: you may use fprintf with format ’%+30.20e’.] (Marks: 10)
Chapter 2

Interpolation and Approximation

2.1 Polynomial interpolation

2.1.1 The interpolation problem

Problem 2.1. Given a set of n+1 data points (xi,yi ), find a polynomial p(x) that passes through each and
every point. Such a polynomial is called a polynomial interpolant for the set of points.
The solution can be summarised in the following statement.
Claim 2.2. Given a set of n+1 data points (xi, yi) with distinct x coordinates, xi ≠ xj for i ≠ j, there exists a
unique polynomial
    pn(x) = Σ_{k=0}^{n} ak x^k
of degree at most n such that pn(xi) = yi, i = 0,...,n. The polynomial coefficients are determined as a solution
of the equations
    Σ_{k=0}^{n} ak xi^k = yi,   i = 0,...,n.                               (2.1)

Proof. A polynomial interpolant of degree n has the form
    pn(x) = Σ_{k=0}^{n} ak x^k = a0 + a1 x + a2 x^2 + ··· + an x^n.        (2.2)
The condition that pn(x) passes through each of the points (xi, yi), i = 0,1,...,n, is
    pn(xi) = yi,   i = 0,1,...,n,   i.e.
    Σ_{k=0}^{n} ak xi^k = yi,   i = 0,...,n,   i.e.
    yi = a0 + a1 xi + a2 xi^2 + ··· + an xi^n,   i = 0,1,...,n,
i.e. in explicit matrix form
    [ 1  x0  x0^2  ···  x0^n ] [ a0 ]   [ y0 ]
    [ 1  x1  x1^2  ···  x1^n ] [ a1 ] = [ y1 ]
    [ ·   ·    ·   ···    ·  ] [ ·  ]   [ ·  ]
    [ 1  xn  xn^2  ···  xn^n ] [ an ]   [ yn ],
i.e.
    V a = y.                                                               (2.3)
This system of linear equations has a unique solution for the vector of coefficients a if and only if the matrix
V is non-singular (i.e. det V ≠ 0). The particular matrix Vik = xi^k is a Vandermonde matrix, see Definition
2.3 below. Given that the values of x are distinct, xi ≠ xj for i ≠ j, the determinant of a Vandermonde matrix
is nonzero, see Lemma 2.4. □

Definition 2.3. Given a set of xi, i = 0,1,...,n, the square (n+1)×(n+1) matrix
    V = [ 1  x0  x0^2  ···  x0^n ]
        [ 1  x1  x1^2  ···  x1^n ]
        [ ·   ·    ·   ···    ·  ]
        [ 1  xn  xn^2  ···  xn^n ],   i.e.  Vik = xi^k,
is called a Vandermonde matrix.

Lemma 2.4. The determinant of a Vandermonde matrix Vik = xi^k is given by
    det V = ∏_{0≤i<j≤n} (xj − xi),
so that det V = 0 if and only if xi = xj for some i ≠ j.

Proof. The proof of the expression for the determinant of V is by induction.
For n = 1 we have
    det V_{2×2} = det [ 1  x0 ; 1  x1 ] = x1 − x0.
Assume that the expression is true for an n×n Vandermonde matrix. Then consider the (n+1)×(n+1)
Vandermonde matrix and perform the column operations Col(j) → Col(j) − x0 Col(j−1), working from the
last column towards the second. The first row becomes (1, 0, 0, ..., 0), and row i (i = 1,...,n) becomes
    (1, xi − x0, xi^2 − xi x0, ..., xi^n − xi^(n−1) x0) = (1, (xi − x0), xi(xi − x0), ..., xi^(n−1)(xi − x0)).
Expanding the determinant along the first row and factoring (xi − x0) out of each remaining row i gives
    det(V_{(n+1)×(n+1)}) = (x1 − x0)(x2 − x0)···(xn − x0) det(V_{n×n}),
where V_{n×n} is the n×n Vandermonde matrix built on x1,...,xn. By the induction hypothesis this equals
    det(V_{(n+1)×(n+1)}) = ∏_{0≤i<j≤n} (xj − xi).
By induction the result is true.
Finally, note that if xi = xj for some i ≠ j then there is a zero term (xj − xi) in the product above, so the
determinant then vanishes. □

Remark 2.5. In principle, these results solve the polynomial interpolation problem by reducing it to a
solution of a set of linear algebraic equations, a topic we will discuss later in this course.
§ 2.6. One may solve equations (2.1), for example, by direct Gauss elimination. This, however, is very
inefficient, and in the rest of this chapter we will consider more efficient methods of polynomial interpolation.
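
For small n the Vandermonde system (2.3) can nevertheless be solved directly; the following MATLAB
sketch does this for the data used in the examples below (it assumes a MATLAB release with implicit
expansion, R2016b or later, and is meant only to illustrate equation (2.3)).

    xi = [0; 1; 2];                 % x-coordinates of the data points
    yi = [3; 4; 6];                 % y-coordinates
    n  = numel(xi) - 1;
    V  = xi.^(0:n);                 % Vandermonde matrix, V(i,k+1) = xi(i)^k
    a  = V \ yi;                    % coefficients a0, a1, ..., an
    disp(a.')                       % [3 0.5 0.5], i.e. p2(x) = 3 + x/2 + x^2/2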

2.1.2 Newton’s recursive method

§ 2.7. The aim is to use a polynomial of degree n to interpolate a set of n+1 distinct points, (xi,yi ) for i = 0,1,...n.
Suppose we can find a polynomial that interpolates the first k +1 of these points, pk (x). We use pk (x) to
construct a polynomial which passes through the first k +1 points and the point (xk+1,yk+1 ). We work through
the list of points, generating a sequence of polynomials that pass through successively more points. Adding the
final point (xn,yn ) yields the pn (x). The algorithm is recursive and the final result is summarised as follows.
Claim 2.8. Given a set of n+1 data points (xi, yi), a polynomial pn(x) of degree n such that pn(xi) = yi,
i = 0,...,n, is given by the recurrence relation
    pn(t) = pn−1(t) + (yn − pn−1(xn)) ∏_{i=0}^{n−1} (t − xi)/(xn − xi),   with p0(t) = y0.        (2.4)

Proof. (Proof by induction.)
Check that p0(x0) = y0, so the claim is true for 1 point.
Assume pn−1(t) passes through the first n points: pn−1(xi) = yi, i = 0,...,n−1.
Now prove that
(a) pn(t) passes through the last point (xn, yn). Do that by substituting xn in equation (2.4) to get
    pn(xn) = pn−1(xn) + (yn − pn−1(xn)) ∏_{i=0}^{n−1} (xn − xi)/(xn − xi) = pn−1(xn) + (yn − pn−1(xn))·1 = yn,
because all terms in the numerator and the denominator of the product cancel.
(b) pn(t) still satisfies pn(xi) = yi for all “old” points i = 0,...,n−1. Indeed, for any i = 0,...,n−1, the product
in (2.4) contains a term (xi − xi) = 0, so the entire product vanishes. Only pn−1(xi) remains on the RHS of
(2.4), and that is equal to yi by assumption.

(Proof by construction.) From the proof by induction it is perhaps not obvious how the recurrence relation
is derived, so let us construct (2.4). Assume that we have already found pn−1(t) that passes through the
first points i = 0,...,n−1, and we would now like to adjust it to a polynomial pn(t) of degree n that satisfies
(a) and (b) above. Such a polynomial is
    pn(t) = pn−1(t) + A ∏_{i=0}^{n−1} (t − xi).
Indeed, this is of degree n because of the product of n linear terms. In addition it passes through the first
points i = 0,...,n−1 by the same argument as in (b) above. Finally, the constant A is determined from the
condition that pn(t) passes through the last point (xn, yn),
    pn(xn) = yn = pn−1(xn) + A ∏_{i=0}^{n−1} (xn − xi),
giving
    A = (yn − pn−1(xn)) ∏_{i=0}^{n−1} 1/(xn − xi). □

Example 2.9. Use Newton's recursive method to find the polynomial interpolant through the points (0,3),
(1,4) and (2,6).
Solution. First find p0,
    p0(x) = 3.
Generate p1(x) using the recurrence,
    p1(x) = p0(x) + (y1 − p0(x1)) (x − x0)/(x1 − x0) = 3 + x.
Finally, generate p2(x) using the recurrence,
    p2(x) = p1(x) + (y2 − p1(x2)) (x − x0)(x − x1)/((x2 − x0)(x2 − x1))
          = 3 + x + (1/2) x(x − 1)
          = 3 + x/2 + x^2/2.
^

[Figure 2.1: The sequence of polynomials p0(x) = 3, p1(x) = 3 + x and p2(x) = 3 + x/2 + x^2/2 generated
by Newton's recursive method.]
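
The recurrence (2.4) can be coded compactly. The MATLAB sketch below uses the data of Example 2.9;
representing each pn by an anonymous function handle is just one way of expressing the recursion, and the
sketch assumes scalar evaluation points.

    xi = [0 1 2];  yi = [3 4 6];             % data of Example 2.9
    p  = @(t) yi(1) + 0*t;                   % p0(t) = y0
    for n = 2:numel(xi)
        w = @(t) prod((t - xi(1:n-1)) ./ (xi(n) - xi(1:n-1)));  % correction factor
        c = yi(n) - p(xi(n));                                   % y_n - p_{n-1}(x_n)
        pPrev = p;
        p = @(t) pPrev(t) + c * w(t);        % recurrence (2.4)
    end
    p(1.5)                                   % evaluates the interpolant, here 4.875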

2.1.3 Lagrange interpolation

§ 2.10. Instead of using a recurrence we may write the polynomial interpolant pn (x) in terms of cardinal
or basis functions.

First, we find convenient basis functions.


Definition 2.11. The Lagrange basis functions are defined as
    Ln,k(t) = ∏_{i=0, i≠k}^{n} (t − xi)/(xk − xi),                         (2.5)
where {xi, i = 0,...,n} is a set of distinct real numbers (xi ≠ xj for i ≠ j).

Lemma 2.12. The Lagrange basis functions satisfy the “orthogonality-like” property
    Ln,k(xj) = δkj,                                                        (2.6)
where δjk = 1 if j = k and δjk = 0 if j ≠ k is the Kronecker delta symbol.

Proof. To prove this compute Ln,k(xk) and Ln,k(xj), k ≠ j. We have
    Ln,k(xk) = ∏_{i=0, i≠k}^{n} (xk − xi)/(xk − xi) = 1,
as all terms in the product cancel. We also have
    Ln,k(xj) = ∏_{i=0, i≠k}^{n} (xj − xi)/(xk − xi) = (xj − xj)/(xk − xj) ∏_{i=0, i≠k, i≠j}^{n} (xj − xi)/(xk − xi) = 0,
as (xj − xj) = 0. □

Next, we use these Lagrange basis functions to construct a polynomial interpolant.


Claim 2.13. The polynomial of degree n
    pn(t) = Σ_{k=0}^{n} yk Ln,k(t)                                         (2.7)
passes through each and every one of the n+1 data points (xi, yi), so that pn(xi) = yi, i = 0,...,n.

Proof. Starting from (2.6), we have
    pn(xj) = Σ_{k=0}^{n} yk Ln,k(xj) = Σ_{k=0}^{n} yk δkj
           = y0 δ0j + ... + yj δjj + ... + yn δnj
           = 0 + ... + 0 + yj + 0 + ... + 0
           = yj,
so that pn(t) passes through the points (xi, yi) for i = 0,1,...,n as required. □

Example 2.14. Find the Lagrange basis functions based on the points (0,3), (1,4) and (2,6). Find the
Lagrange polynomial interpolant.
Solution. These are
    L2,0(t) = (t − 1)(t − 2)/((0 − 1)(0 − 2)) = (1/2)(t − 1)(t − 2),
    L2,1(t) = (t − 0)(t − 2)/((1 − 0)(1 − 2)) = t(2 − t),
    L2,2(t) = (t − 0)(t − 1)/((2 − 0)(2 − 1)) = (1/2) t(t − 1).
Use the Lagrange basis functions to find the degree 2 interpolating polynomial to the points (0,3), (1,4)
and (2,6). The interpolating polynomial is
    p(x) = 3 L2,0(x) + 4 L2,1(x) + 6 L2,2(x).
Substitute for the Ln,k to give
    p(x) = (3/2)(x − 1)(x − 2) + 4x(2 − x) + (6/2) x(x − 1),
which upon simplification gives
    p(x) = 3 + x/2 + x^2/2.
^
[Figure 2.2: The Lagrange basis functions L2,0(x) = (x − 1)(x − 2)/2, L2,1(x) = x(2 − x) and
L2,2(x) = x(x − 1)/2.]
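
The Lagrange form (2.7) is equally direct to evaluate. The following MATLAB sketch uses the data of
Example 2.14; the variable names are illustrative only.

    xi = [0 1 2];  yi = [3 4 6];
    t  = linspace(0, 3, 301);                  % evaluation grid
    p  = zeros(size(t));
    for k = 1:numel(xi)
        Lk = ones(size(t));                    % Lagrange basis function L_{n,k}(t)
        for i = [1:k-1, k+1:numel(xi)]
            Lk = Lk .* (t - xi(i)) / (xi(k) - xi(i));
        end
        p = p + yi(k) * Lk;                    % p_n(t) = sum_k y_k L_{n,k}(t)
    end
    plot(t, p, 'b-', xi, yi, 'ro'), grid on    % interpolant and the data points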

2.2 Approximation of functions by polynomial interpolants

2.2.1 The approximation problem

Problem 2.15. Find an approximation to the function f (x) on the interval x ∈ [a,b] given a set of n+1 data
points {(xi, f (xi )),i = 0,1...n}.
Remark 2.16 (Solution). One possible solution is the following.
• Approximate f (x) on x ∈ [a,b] by the unique polynomial interpolant pn (x) through the given points.
• Then there is no error at the given data points
yi = f (xi ) = pn (xi ),i = 0,1...n.
• However, error is introduced at all other points x ∈ [a,b], x ≠ xi, i = 0,1,...,n.

2.2.2 Error bound

Claim 2.17 (Error). Let pn(x) be the unique polynomial interpolant of degree n approximating an (n+1)-
times differentiable function f(x) on an interval x ∈ [a,b], so that pn(xi) = f(xi), i = 0,...,n. Then there exists
a number ξ(x) ∈ (a,b) such that the error in the approximation of f(x) by pn(x) at a general point x ∈ [a,b] is
    f(x) − pn(x) = f^(n+1)(ξ(x))/(n+1)! ∏_{i=0}^{n} (x − xi).              (2.8)

Proof. Fix x ≠ xk and consider the function
    g(t) = f(t) − pn(t) − [f(x) − pn(x)] ∏_{i=0}^{n} (t − xi)/(x − xi).
The function g(t) is zero at n+2 points. To see this, consider
    g(xk) = f(xk) − pn(xk) − [f(x) − pn(x)] ∏_{i=0}^{n} (xk − xi)/(x − xi).
Then f(xk) − pn(xk) is zero because pn is the polynomial interpolant to the points (xk, f(xk)), k = 0,1,...,n.
The product
    ∏_{i=0}^{n} (t − xi)/(x − xi) = (t − xk)/(x − xk) ∏_{i=0, i≠k}^{n} (t − xi)/(x − xi)
is clearly zero at t = xk. Therefore
    g(xk) = 0.
This gives n+1 zeros of g. Where is the other? Evaluate g(t) at t = x,
    g(x) = f(x) − pn(x) − [f(x) − pn(x)] ∏_{i=0}^{n} (x − xi)/(x − xi) = f(x) − pn(x) − [f(x) − pn(x)] = 0.
So g(t) has n+2 zeros, at xk for k = 0,1,...,n and at x.
Using Rolle's Theorem, the fact that g(t) has n+2 zeros implies that the function g′(t) is zero at at least
n+1 points, between the zeros of g(t). Apply Rolle's theorem to g′(t) and discover that g″(t) is zero at
at least n points, between the zeros of g′(t). Continue to discover that there is at least one point, which
we label ξ(x), at which the (n+1)-th derivative of g is zero, that is
    ∃ξ(x):  g^(n+1)(ξ(x)) = 0.
Having proved the existence of such a point we now evaluate g^(n+1)(t). We have
    g^(n+1)(t) = f^(n+1)(t) − d^(n+1)/dt^(n+1) pn(t) − [f(x) − pn(x)] ∏_{i=0}^{n} 1/(x − xi) · d^(n+1)/dt^(n+1) (t^(n+1) + ...).
We know that n+1 derivatives of a degree-n polynomial give zero, and n+1 derivatives of t^(n+1) give (n+1)!, so that
    g^(n+1)(t) = f^(n+1)(t) − [f(x) − pn(x)] (n+1)!/∏_{i=0}^{n} (x − xi).
Evaluation of g^(n+1)(t) at the location ξ(x) yields the error formula (2.8),
    f(x) − pn(x) = f^(n+1)(ξ(x))/(n+1)! ∏_{i=0}^{n} (x − xi)
for some ξ(x) ∈ (a,b). □

Remark 2.18. This is an extension of the familiar Taylor remainder formula and it reduces to the latter
if x0,x1,x2 ...xn → x ∗ ∈ (a,b) !!!

The proof of the error formula uses Rolle’s theorem which we state without proof.
Claim 2.19. (Rolle's theorem) Let F(x) be a differentiable function on [a,b] such that F(a) = F(b). Then
there exists a number c ∈ (a,b) such that
    F′(c) = 0.

Claim 2.20 (Error bound). The error in approximating a function f(x) by a polynomial interpolant pn(x)
is bounded by
    |f(x) − pn(x)| ≤ 1/(n+1)! · max_{x∈[a,b]} |f^(n+1)(x)| · max_{x∈[a,b]} |∏_{i=0}^{n} (x − xi)|.

Proof. The error formula (2.8) is used to bound the error,
    |f(x) − pn(x)| = |f^(n+1)(ξ(x))/(n+1)! ∏_{i=0}^{n} (x − xi)|
                   = 1/(n+1)! |f^(n+1)(ξ(x))| |∏_{i=0}^{n} (x − xi)|
                   ≤ 1/(n+1)! · max_{x∈[a,b]} |f^(n+1)(x)| · max_{x∈[a,b]} |∏_{i=0}^{n} (x − xi)|.   (2.9)
□

[Figure 2.3: Plot of sin x and the degree-2 polynomial interpolant p2(x) through the points (0,0), (π/2,1)
and (π,0). What is the error between f(x) = sin x and p2(x) away from the interpolation points, at a
general point x?]
2.2.3 Piecewise approximation and examples

§ 2.21. The error bound (2.9) was determined assuming a given number of data points n. Can we choose
the degree of the interpolating polynomial n such that the maximum error given by the error formulas (2.8)
or (2.9) is as small as possible? This problem is discussed below.
§ 2.22. High-degree polynomials are rapidly oscillating. This leads to very large errors away from the given
data points. For this reason, low-degree piecewise approximation procedures are typically used in practice.
Algorithm 2.23 (Piecewise linear approximation).
• The interval [a,b] is partitioned into a collection of n sub-intervals Ii = [xi−1, xi], i = 1,...,n, using each of
  the given n+1 data points xi.
• A linear polynomial interpolant p^1_i(x) is computed on each sub-interval Ii.
• The linear polynomial interpolants p^1_i(x) are “glued” together into a composite interpolant function
  P(x) covering the full interval [a,b],
      P(x) = Σ_{i=1}^{n} p^1_i(x) Ji(x),   where Ji(x) = 1 if x ∈ Ii and Ji(x) = 0 otherwise.
  (Notice that P(x) selects the appropriate linear fit depending on the interval.)

Remark 2.24. Quadratic, cubic or quintic “composite” approximation procedures can be constructed
in a similar way. Polynomial degrees higher than these are typically not used in practice.

Example 2.25. Consider a table of values of f(x) = sin x on [0,π] at a uniform partition with step π/12.
If we use “composite” linear interpolation between the points, what is the maximum error we make?
Solution. Choose an interval over which sin x is approximated by a linear polynomial, [x0, x1]. We know
that the spacing h is h = π/12, so that x1 = x0 + h. On such an interval we know, using the error formula
with n = 1,
    f(x) − p1(x) = f″(ξ(x))/2 (x − x0)(x − x0 − h)
for some ξ(x). Then taking the modulus and maximising the error on the RHS,
    |f(x) − p1(x)| ≤ (1/2) max_{ξ∈[x0,x0+h]} |f″(ξ)| · max_{x∈[x0,x0+h]} |(x − x0)(x − x0 − h)|.
The maximum value of the quadratic term is easily shown to be h^2/4. We are interested not just in the
error on a particular interval but in the error over the whole domain. Let
    P(x) = Σ_{i=1}^{n} p^1_i(x) J_{[xi−1, xi)}(x),
where J is defined as above. Then P(x) selects the appropriate linear fit depending on the interval.
Finally,
    |f(x) − P(x)| ≤ (h^2/8) max_{ξ∈[0,π]} |f″(ξ)| = h^2/8,
as f″(x) = −sin x. Using h = π/12 gives a maximum error of h^2/8 = 0.0085674. ^
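
The bound of Example 2.25 is easy to check numerically. In the MATLAB sketch below the built-in
interp1 with the 'linear' option plays the role of the composite linear interpolant P(x); the fine grid size
is an arbitrary choice.

    h      = pi/12;
    xi     = 0:h:pi;                          % uniform partition of [0,pi]
    x      = linspace(0, pi, 10001);          % fine grid for measuring the error
    P      = interp1(xi, sin(xi), x, 'linear');
    errmax = max(abs(sin(x) - P))             % observed maximum error
    bound  = h^2/8                            % theoretical bound, 0.0085674

The observed maximum error comes out just below the theoretical bound h^2/8.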
Example 2.26. A table of log10 is to be made on a uniform partition of some interval. We wish to use
“composite” linear interpolation between the grid points and require four decimal places of accuracy at
all grid points. What should be the step of the partition?
Solution. Any number X can be written X = (10^−k X) 10^k for k ∈ Z. Choose k such that 1 ≤ 10^−k X < 10. Then
    log10 X = log10 (10^−k X · 10^k) = log10 (10^−k X) + log10 10^k = log10 (10^−k X) + k,
so we need only tabulate the values of log10 between 1 and 10. Let h be the spacing, and consider linear
interpolation on a general interval [a, a+h]. Let the polynomial interpolant on that interval be p^a_1(x). Using
the error formula,
    f(x) − p^a_1(x) = (1/2) f″(ξ(x)) (x − a)(x − a − h),
    |f(x) − p^a_1(x)| ≤ (1/2) max_{x∈[a,a+h]} |f″(x)| · max_{x∈[a,a+h]} [(x − a)(a + h − x)]
                     ≤ (h^2/8) max_{x∈[a,a+h]} |f″(x)|.
Now f′(x) = 1/(x log 10) and f″(x) = −1/(x^2 log 10). On the interval [a, a+h] the largest value of |f″(x)|
occurs at x = a, so that
    |f(x) − p^a_1(x)| ≤ h^2/(8 a^2 log 10),
where 1 ≤ a < 10. The largest value of the right-hand side occurs when a = 1, so
    E(x) ≤ h^2/(8 log 10),
where E(x) is the error at x. We require, for four decimal places of accuracy,
    E(x) ≤ 5×10^−5,
which is guaranteed if
    h^2/(8 log 10) < 5×10^−5,
that is h < 0.0303. ^

2.3 Chebyshev economisation

2.3.1 The problem of grid optimization

§ 2.27. Given a function f (x) then in the error formulas (2.8) or (2.9) we have control over
1. the degree of the interpolating polynomial n and
2. the location of the n+1 interpolation points xk .
Can we choose the location of the interpolation points such that the maximum error given by the error
formulas (2.8) or (2.9) is as small as possible?

In other words,
Problem 2.28. Find the optimal partition {xi } = {x0,x1,...,xn } of the interval x ∈ [a,b] (i.e. the optimal
distribution of interpolating points) so that the term
    max_{x∈[a,b]} |∏_{i=0}^{n} (x − xi)|                                   (2.10)
in formula (2.9) takes its minimal value? What is this minimal value?
§ 2.29. Below we demonstrate that the sought after optimal distribution of interpolating points is the partition
given by the roots of the Chebyshev polynomials.

2.3.2 Chebyshev polynomials

We first need several facts about the Chebyshev polynomials.


Definition 2.30. The Chebyshev polynomial (of the first kind) of order m ∈ N is defined as
Tm (cosθ) = cos(mθ), θ ∈ R.
Remark 2.31. This looks like a trigonometric function but it can be easily converted in polynomial form.
Example 2.32. Write out the first two Chebyshev polynomials and convert them to polynomial form.
Solution. This is done by expanding cos(mθ) in terms of cos θ and then using the change of variable
z = cos θ ∈ [−1,1]:
    T0(cos θ) = 1                       ⟹  T0(z) = 1,
    T1(cos θ) = cos θ                   ⟹  T1(z) = z,
    T2(cos θ) = cos 2θ = 2cos^2 θ − 1   ⟹  T2(z) = 2z^2 − 1.
^

The polynomial form of the general Chebyshev polynomial of order n is found by a recurrence relation
as follows.
Claim 2.33.
Tm+1 (z) = 2zTm (z)−Tm−1 (z), T0 (z) = 1, T1 (z) = z, z ∈ [−1,1].

Proof. Use the addition formulas of trigonometry as follows:
$$ T_{m+1} = \cos\big((m+1)\theta\big) = \cos(m\theta)\cos\theta - \sin(m\theta)\sin\theta. $$
Sines must be eliminated as they do not appear in the definition of the Chebyshev polynomials, so use the addition formula
$$ T_{m-1} = \cos\big((m-1)\theta\big) = \cos(m\theta)\cos\theta + \sin(m\theta)\sin\theta. $$
Then add term by term,
$$ T_{m+1} + T_{m-1} = 2\cos\theta\cos(m\theta) = 2 z T_m, $$
which can now be rearranged to get
$$ T_{m+1}(z) = 2 z T_m(z) - T_{m-1}(z). $$
T₀ and T₁ were computed above. □

We now need to know the roots of the Chebyshev polynomials. Since Tm(z) is a polynomial of degree m it has m roots, as follows.
Claim 2.34. The m roots of Tm(z) = 0 are given by
$$ z_k = \cos\left(\frac{(2k-1)\pi}{2m}\right), \qquad k = 1, \ldots, m. $$

Proof. The roots are found from the pair of equations
$$ z_k = \cos\theta_k, \qquad \cos(m\theta_k) = 0, $$
the first being the change of variable and the second being the RHS of the definition of Tm(z). The solutions of the second equation cos(mθk) = 0 are mθk = −π/2 + πk for k ∈ Z, and substituting these into the first
equation we obtain the result. 
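Claims 2.33 and 2.34 are easy to check numerically. The following Matlab sketch (the function names are ours, purely for illustration) evaluates Tm by the recurrence and confirms that it vanishes at the claimed roots, up to rounding error.

```matlab
function checkChebRoots(m)
% Verify Claim 2.34: T_m vanishes at z_k = cos((2k-1)*pi/(2m)), k = 1..m.
    k = 1:m;
    z = cos((2*k - 1)*pi/(2*m));                 % claimed roots of T_m
    fprintf('max |T_%d(z_k)| = %.2e\n', m, max(abs(chebT(m, z))));
end

function T = chebT(m, z)
% Evaluate T_m(z) using the recurrence of Claim 2.33.
    Tprev = ones(size(z));                       % T_0
    if m == 0, T = Tprev; return; end
    T = z;                                       % T_1
    for j = 1:m-1
        [Tprev, T] = deal(T, 2*z.*T - Tprev);    % T_{j+1} = 2 z T_j - T_{j-1}
    end
end
```

For example, checkChebRoots(5) prints a value at the level of machine precision.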

Claim 2.35. The Chebyshev polynomial Tm(z) can be represented as
$$ T_m(z) = 2^{m-1} \prod_{k=1}^{m} (z - z_k), \tag{2.11} $$
where {zk} are the m roots of Tm(z) in the interval z ∈ [−1, 1].

Proof. Since the Chebyshev polynomial Tm(z) has m roots, given by the zk, we can write it as a product of m linear factors,
$$ T_m(z) = A(z - z_1)(z - z_2)\cdots(z - z_m) = A \prod_{k=1}^{m} (z - z_k), $$
for some constant A. We can determine the constant A by considering the original definition of Tm(cosθ). Write s = e^{iθ}; then the definition of cosine gives
$$ z = \cos\theta = \frac{1}{2}\left(s + \frac{1}{s}\right), \qquad \cos(m\theta) = \frac{1}{2}\left(s^m + \frac{1}{s^m}\right), $$
so
$$ T_m\!\left(\frac{1}{2}\left(s + \frac{1}{s}\right)\right) = \frac{1}{2}\left(s^m + \frac{1}{s^m}\right), $$
but also
$$ T_m\!\left(\frac{1}{2}\left(s + \frac{1}{s}\right)\right) = A \prod_{k=1}^{m}\left(\frac{1}{2}\left(s + \frac{1}{s}\right) - z_k\right) = A\,\frac{s^m}{2^m} + \ldots $$
Comparing the coefficients of s^m gives the value of A as
$$ A = 2^{m-1}. $$
In summary,
$$ T_m(z) = 2^{m-1} \prod_{k=1}^{m} (z - z_k), \tag{2.12} $$
where zk are the roots of Tm(z) = 0. □

2.3.3 The economization

We can now prove the following results related to the main question of this section.
Claim 2.36. Let {zk} be the m roots of the Chebyshev polynomial Tm(z) in the interval z ∈ [−1, 1]. Then
$$ \max_{z \in [-1,1]} \left| \prod_{k=1}^{m} (z - z_k) \right| = \frac{1}{2^{m-1}}. $$

Proof. We make use of the various representations of Tm derived above. On one hand,
$$ \max_{z \in [-1,1]} |T_m(z)| = \max_{\theta \in [0,\pi]} |T_m(\cos\theta)| = \max_{\theta \in [0,\pi]} |\cos(m\theta)| = 1. $$
On the other hand,
$$ \max_{z \in [-1,1]} |T_m(z)| = \max_{z \in [-1,1]} \left| 2^{m-1} \prod_{k=1}^{m} (z - z_k) \right| = 1. $$
Comparing the above,
$$ \max_{z \in [-1,1]} \left| \prod_{k=1}^{m} (z - z_k) \right| = \frac{1}{2^{m-1}}, $$
as required. □

A maximal error of this form appears to be rather small. We show that there is no other partition {Yk, k = 1, ..., m} that makes the product even smaller.
Claim 2.37. Let {Yk : Yk ∈ [−1, 1], k = 1, ..., m} be a partition of the interval z ∈ [−1, 1]. Then
$$ \min_{\{Y_k\}} \left( \max_{z \in [-1,1]} \left| \prod_{k=1}^{m} (z - Y_k) \right| \right) = \max_{z \in [-1,1]} \left| \prod_{k=1}^{m} (z - z_k) \right| = \frac{1}{2^{m-1}}, $$
i.e. the minimum is attained on the partition {zk} constructed from the roots zk of the Chebyshev polynomial Tm(z), at {Yk} = {zk}.

Proof. Prove by contradiction. Assume that there exists a set of Yk such that
$$ \max_{z \in [-1,1]} \left| \prod_{k=1}^{m} (z - Y_k) \right| < \frac{1}{2^{m-1}}. $$
Define
$$ Q(z) = \prod_{k=1}^{m} (z - z_k) - \prod_{k=1}^{m} (z - Y_k). $$
(a) Note that Q(z) is a polynomial of degree m − 1, since the highest-order terms in the two products cancel, and it can therefore have AT MOST m − 1 roots.
(b) Evaluate Q(z) at the points z = cos(jπ/m), j = 0, 1, ..., m. Since
$$ Q(z) - \prod_{k=1}^{m} (z - z_k) = -\prod_{k=1}^{m} (z - Y_k), $$
we have, using (2.12) on the left and the assumption on the right,
$$ \left| Q\big(\cos(j\pi/m)\big) - \frac{1}{2^{m-1}} T_m\big(\cos(j\pi/m)\big) \right| < 2^{1-m}, $$
$$ \left| Q\big(\cos(j\pi/m)\big) - 2^{1-m}\cos(j\pi) \right| < 2^{1-m}, $$
$$ \left| Q\big(\cos(j\pi/m)\big) - 2^{1-m}(-1)^j \right| < 2^{1-m}. $$
The modulus splits into two inequalities,
$$ 2^{1-m}\big(-1 + (-1)^j\big) < Q\big(\cos(j\pi/m)\big) < 2^{1-m}\big(1 + (-1)^j\big). $$
Test the inequalities for some values of j:
at j = 0 :  0 < Q < 2·2^{1−m},  so Q is positive;
at j = 1 :  −2·2^{1−m} < Q < 0,  so Q is negative;
...
So as j increases by 1 the sign of Q(z) changes, and Q(z) has a root between consecutive evaluation points. Since cos(jπ/m), j = 0, 1, ..., m, are m + 1 distinct points, Q(z) has at least m roots.
There is a contradiction between (a) and (b), so our original assumption is false. □

When using the "Chebyshev" and "error" formulas together, a rescaling and a reindexing are needed.
Remark 2.38 (RE-SCALING). Note that the interval where the functional approximation is sought is not the same as the interval where the Chebyshev polynomials are defined, i.e. x ∈ [a, b] ≠ [−1, 1] ∋ z. To use the "Chebyshev" results a change of variable (rescaling) needs to be made:
$$ x = a + \frac{1}{2}(b-a)(z+1) = \frac{a+b}{2} + \frac{b-a}{2}\,z, \qquad \text{where } x \in [a,b] \text{ and } z \in [-1,1]. $$

Proof (linear "rescaling"). More generally, let z ∈ [c, d] and x ∈ [a, b]. Then
$$ \frac{z-c}{d-c} = \frac{x-a}{b-a} $$
preserves the ratio of distances and can be solved for x,
$$ x = a + \frac{b-a}{d-c}(z-c). \qquad \square $$

Remark 2.39 (RE-INDEXING). Note also that the indexing in the product differs by 1 between the "Chebyshev" and "error" formulas. When a polynomial interpolant of degree n is sought, n + 1 points are needed and Tn+1 must be used, so in the "Chebyshev" formulas m = n + 1 must be taken.

So finally, the answer to the original question of this section is the following.
Corollary 2.40.
$$ \min_{\{x_k\}} \left( \max_{x \in [a,b]} \left| \prod_{k=0}^{n} (x - x_k) \right| \right) = \frac{(b-a)^{n+1}}{2^{2n+1}}, $$
which occurs at {xk = a + (b−a)(zk + 1)/2}, where {zk} are the roots of Tn+1.
Example 2.41. Find the maximum error of approximation of f(x) = sin x on [0, π] by a polynomial interpolant of degree 2 on (a) a Chebyshev-nodes partition and (b) a uniform partition.
Solution. (a) The polynomial interpolant of degree 2 uses 3 points. To make the error bound as small as possible we generate the interpolation points from the zeros of T₃(z), which are
$$ z_0 = -\sqrt{3}/2, \qquad z_1 = 0, \qquad z_2 = \sqrt{3}/2. $$
Rescaling gives the interpolation points
$$ x_0 = \frac{\pi}{2} - \frac{\pi}{2}\frac{\sqrt{3}}{2}, \qquad x_1 = \frac{\pi}{2}, \qquad x_2 = \frac{\pi}{2} + \frac{\pi}{2}\frac{\sqrt{3}}{2}. $$
The minimal value of the error bound is then
$$ |f(x) - p_2(x)| \le \frac{1}{3!} \max_{x \in [0,\pi]} |f'''(x)| \; \max_{x \in [0,\pi]} \left| \prod_{k=0}^{2} (x - x_k) \right| \le \frac{1}{3!}\,\frac{\pi^3}{2^5} \approx 0.16149. $$
(b) The nodes of a uniform partition are 0, π/2, π, with a step h = π/2. Working as in Example 2.25, the maximum value of the product term in (2.9) is easily shown to be 2h³/(3√3). Applying formula (2.9) we then get
$$ |f(x) - p_2(x)| \le \frac{1}{3!}\,\frac{\pi^3}{12\sqrt{3}} \approx 0.2486. $$
The error on a "Chebyshev partition" is much smaller. ^

2.4 Hands-on projects

2.4.1 Lagrange polynomial interpolation

Question [Lagrange polynomial interpolation] In this question you will perform both simple and piecewise
polynomial interpolation to approximate a given function as well as compare measured errors to theoretical
error bounds.
1. Create a Matlab function p = LagrangeInterp(x,y,t) for 'simple' Lagrange polynomial interpolation of a set of data points with coordinates specified by the input vectors x and y. The function should return an output vector p of values of the interpolant at the "query" points specified by the input vector t. The degree of the interpolant should be determined from the length of the vectors x (or y). (One possible structure is sketched at the end of this question.)
(Marks: 8)
2. Use LagrangeInterp to approximate f(x) = sinh(sin(x²)) on [0, π] with a polynomial of degree 14 as follows.
(a) Define x and y as the x- and y-coordinates of 15 equidistant data points sampling the function on the interval, i.e. {(xi, yi = sinh(sin(xi²))), i = 0...14}.
(b) Generate a vector t of 200 equidistant points {ti, i = 1...200} on the interval [0, π].
(c) Call LagrangeInterp(x,y,t) to compute the vector p of the values of the polynomial interpolant evaluated at {ti, i = 1...200}.
Plot f(x) = sinh(sin(x²)), the 15 data points, and the result of your polynomial interpolation on the interval [0, π], all in one figure so that the three graphs can be compared.
Plot the error of the approximation {ei = |sinh(sin(ti²)) − pi|, i = 1...200} as a function of x.
(Marks: 5)
3. Create a Matlab function p = PiecewiseInterp(interp,m,x,y,t) for piecewise (composite) interpolation of degree m, where the input vectors x, y and t and the output vector p are as described in part 1, and interp is a function for "simple" interpolation in the sense of part 1. The PiecewiseInterp function should
(a) partition vectors x, y and t as appropriate,
(b) interpolate on each partition by calling the function interp (e.g. created in part 1),
(c) concatenate the results to produce a vector p. (HINT: You may use the Matlab syntax s(a <= s & s <= b) to select all elements of an array s that satisfy the condition a ≤ s ≤ b. A possible structure is sketched at the end of this question.)
(Marks: 8)
4. Use PiecewiseInterp to approximate f(x) = sinh(sin(x²)) on [0, π] with a piecewise quadratic polynomial as follows.
(a) Define x and y as the x- and y-coordinates of 15 equidistant data points sampling the function on the interval, i.e. {(xi, yi = sinh(sin(xi²))), i = 0...14}.
(b) Generate a vector t of 200 equidistant points {ti, i = 1...200} on the interval [0, π].
(c) Call PiecewiseInterp(@LagrangeInterp,2,x,y,t) with m = 2 to compute the vector p of the values of the piecewise quadratic polynomial interpolant evaluated at {ti, i = 1...200}.
Plot f(x) = sinh(sin(x²)), the 15 data points, and the result of your polynomial interpolation on the interval [0, π], all in one figure so that the three graphs can be compared.
(Marks: 5)
5. Estimate the error bound of interpolation of f(x) = sinh(sin(x²)) on [0, π] by a piecewise quadratic polynomial. Plot the error bound and the actual error {ei = |sinh(sin(ti²)) − pi|, i = 1...200} in one figure. Is the error bound satisfied?
(Marks: 10)
6. Compare and comment on your results from parts 2, 4 and 5.
(Marks: 4)
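For reference, one possible way to organise the two functions requested in parts 1 and 3 is sketched below. These are sketches only (the piecewise version assumes the number of data intervals is divisible by m); any correct implementation is acceptable.

```matlab
function p = LagrangeInterp(x, y, t)
% 'Simple' Lagrange interpolation: evaluate, at the query points t, the
% polynomial of degree length(x)-1 through the data points (x, y).
    n = length(x);
    p = zeros(size(t));
    for i = 1:n
        Li = ones(size(t));                      % Lagrange basis function for node i
        for j = [1:i-1, i+1:n]
            Li = Li .* (t - x(j)) / (x(i) - x(j));
        end
        p = p + y(i) * Li;
    end
end
```

```matlab
function p = PiecewiseInterp(interp, m, x, y, t)
% Piecewise (composite) interpolation of degree m: split the data into
% consecutive groups of m+1 points and call 'interp' on each group.
    p = zeros(size(t));
    for k = 1:m:length(x)-m
        a = x(k);  b = x(k+m);                   % current sub-interval [a, b]
        mask = (a <= t) & (t <= b);              % query points inside this piece
        p(mask) = interp(x(k:k+m), y(k:k+m), t(mask));
    end
end
```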

2.4.2 Newton polynomial interpolation

Question In this question you will approximate a given function by polynomial interpolants. You will apply the Newton interpolation method on uniform and Chebyshev partitions and compare measured errors. See hints overleaf.
1. Create a Matlab function p = NewtonInterp(x,y,t) for 'simple' Newton polynomial interpolation of a set of data points with coordinates specified by the input vectors x and y. The function should return an output vector p of values of the interpolant at the "query" points specified by the input vector t. The degree of the interpolant should be determined from the length of the vectors x (or y). (One possible structure is sketched at the end of this question.)
(Marks: 12)
2. Use NewtonInterp to approximate the “Runge” function f (x) = 1/(1+25(x −4)2 ) on a uniform
partition of the interval x ∈ [3,5] with a polynomial of degree 10 as follows.
(a) Define x and y as the x- and y-coordinates of 11 equidistant data points sampling the function
on the interval, i.e. {(xi,yi = 1/(1+25(xi −4)2 ), i = 0...10}.
(b) Generate a vector t of 200 equidistant points {ti, i = 1...200} on the interval [3,5].
(c) Call NewtonInterp(x,y,t) to compute the vector p of the values of the polynomial interpolant
evaluated at {ti, i = 1...200}.
Plot f (x) = 1/(1+25(x −4)2 ) and the result of your polynomial interpolation on the interval [3,5],
both in one figure so that the graphs can be compared.
Plot the absolute error of the approximation {ei = |1/(1+25(ti −4)2 )−pi |,i = 1..200} as a function of x.
(Marks: 8)
3. Use NewtonInterp to approximate the "Runge" function f(x) = 1/(1+25(x−4)²) on a "Chebyshev" partition of the interval x ∈ [3, 5] with a polynomial of degree 10 as follows.
(a) Create a Matlab function z = ChebRoots(n) that returns the n roots of the Chebyshev polynomial Tn(z) in [−1, 1].
(b) Create a Matlab function x = remap(z,a,b) that remaps a number z ∈ [−1, 1] to a number x ∈ [a, b]. (Sketches of both helper functions are given at the end of this question.)
(c) Use ChebRoots and remap to generate a "Chebyshev" partition of the interval [3, 5]. Then proceed as in Part 2 to compute, evaluate and plot the polynomial interpolant and the absolute error on the interval.
(Marks: 8)
4. Using a for loop repeat Parts 2 and 3 to compute and evaluate polynomial interpolants of all even
degrees from 2 to 16 for both uniform and Chebyshev partitions. Create the following figures that
you may combine in a composite figure using the subplot command.
(a) Plot the “uniform” polynomial interpolants of all degrees in one figure. Plot their absolute errors
in a second figure.
(b) Plot the “Chebyshev” polynomial interpolants of all degrees in a third figure. Plot their absolute
errors in a fourth figure.
(c) Plot the maximum absolute error as a function of the polynomial degree in a fifth figure including
two curves - one for the “uniform” partition and one for the “Chebyshev” partition. Display the numbers
in a three-column table. To find the maximum absolute error use the inbuilt Matlab function max.
(Marks: 10)
5. Comment on your results from parts 2, 3 and 4.
(Marks: 2)
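For reference, sketches of the functions requested in parts 1 and 3 are given below (one possible approach only; the divided-difference table is computed in place).

```matlab
function p = NewtonInterp(x, y, t)
% 'Simple' Newton interpolation via divided differences, evaluated at t.
    n = length(x);
    c = y(:).';                                  % will hold the Newton coefficients
    for j = 2:n
        for i = n:-1:j
            c(i) = (c(i) - c(i-1)) / (x(i) - x(i-j+1));
        end
    end
    p = c(n) * ones(size(t));                    % Horner-like evaluation of Newton form
    for i = n-1:-1:1
        p = p .* (t - x(i)) + c(i);
    end
end

function z = ChebRoots(n)
% The n roots of the Chebyshev polynomial T_n(z) in [-1, 1].
    z = cos((2*(1:n) - 1)*pi/(2*n));
end

function x = remap(z, a, b)
% Map a number z in [-1, 1] linearly to a number x in [a, b].
    x = a + (b - a)*(z + 1)/2;
end
```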
Chapter 3

Numerical integration

3.1 Problem formulation

3.1.1 General setting

§ 3.1. Numerical integration (or numerical quadrature) is a family of algorithms for numerical approximation
of the value of a definite integral to a given degree of accuracy.
§ 3.2. An approximation to an integral of a function f (x) can be made by first finding a convenient numerical
approximation to f (x) and then integrating the result.
One method of approximation to the function f (x) is polynomial interpolation as discussed in the previous
chapter.
§ 3.3. Approximation by polynomial interpolant of degree n requires a set of n+1 data points {(xi, f (xi )),i =
0..n}.
§ 3.4. These data points are then used for integration. We typically use sums of the form Σⁿᵢ₌₀ αᵢ f(xᵢ). Such sums are known as weighted sums, where the set of constants {αᵢ, i = 0..n} are called weights and the set {xᵢ, i = 0..n} are called nodes (integration nodes).

3.1.2 Problem formulation

Problem 3.5. Let f(x) be a smooth real-valued function on some bounded interval x ∈ [a, b] ⊂ R. Find an approximation to the definite integral of f(x) in the form of a weighted sum of function values obtained at a finite set of integration nodes {xi},
$$ \int_a^b f(x)\,dx = \sum_{i=0}^{n} \alpha_i f(x_i) + E, \tag{3.1} $$
where {αi} is the set of weights, independent of f(x), {xi} is the set of nodes, and E is the error of the approximation.
§ 3.6. A weighted sum of function values is called a quadrature rule in this context.
Remark 3.7. Variations and generalizations of the problem can be formulated, e.g. to find a set of nodes {xi} and a set of weights {αi} such that
$$ \left| \int_a^b w(x) f(x)\,dx - \sum_{i=0}^{n} \alpha_i f(x_i) \right| < E, \tag{3.2} $$
where w(x) > 0 is a weight function, also independent of f(x), and E > 0 is an error bound of the approximation.


3.1.3 Types of quadrature

§ 3.8. Types of quadrature by inclusion of the end points:


• Closed quadrature - includes the endpoints a and b as nodes,
• Open quadrature - does not include the endpoints a and b as nodes.
§ 3.9. Types of quadrature by distribution of integration points:
• Newton-Cotes quadrature - the nodes xi are uniformly spaced.
• Gaussian quadrature - the nodes xi are not uniformly spaced.
§ 3.10. Types of quadrature by “degree” of the polynomial interpolant:
• Simple quadrature - all given integration nodes are combined in a simple relatively “high-degree”
quadrature formula.
• Composite quadrature - The interval of integration [a,b] is partitioned into a collection of disjoint
sub-intervals Ii , i = 1,...n followed by the sum of applications of a simple quadrature to each interval.
Thus a relatively “low-degree” quadrature formula is obtained.
Example 3.11 (The midpoint method). Approximate a definite integral using one node only.
Solution. The mid-point method,
$$ \int_a^b f(x)\,dx \approx h f(c), \tag{3.3} $$
where
$$ h = b - a, \qquad c = \frac{a+b}{2}, $$
is in the form of a quadrature rule with α₁ = h = b − a and x₁ = c = (a+b)/2. ^

3.2 Newton-Cotes quadrature

3.2.1 General rule and error bound

§ 3.12. Newton-Cotes quadrature rules are numerical integration formulas obtained in the following way.
1. The integrand function f(x) is approximated via polynomial interpolation,
$$ f(x) = p_n(x) \pm E_{\mathrm{interp}}, $$
where Einterp denotes the interpolation error, using
• a uniform partition of the interval of integration x ∈ [a, b], and
• Lagrange's interpolation formula, which is particularly convenient because it involves f(xi) and can immediately be put in the form of a quadrature rule (3.1).
2. The approximating polynomial pn(x) is integrated. (Note that polynomials are easy to integrate.)
Claim 3.13 (Newton-Cotes quadrature). Consider an (n+1)-times differentiable function f(x) on the interval x ∈ [a, b] and a uniform partition of the same interval,
$$ \{x_i = a + ih, \quad h = (b-a)/n, \quad i = 0 \ldots n\}. $$
Then
$$ \left| \int_a^b f(x)\,dx - \sum_{i=0}^{n} \alpha_i f(x_i) \right| \le E, \tag{3.4} $$
$$ \alpha_i = \int_a^b L_{n,i}(x)\,dx, \tag{3.5} $$
$$ E = \frac{1}{(n+1)!}\,\max_{x \in [a,b]} \left| f^{(n+1)}(x) \right| \int_a^b \prod_{i=0}^{n} |x - x_i|\,dx, \tag{3.6} $$
where L_{n,i}(x) are the Lagrange basis functions.

Proof. Approximate f(x) by a polynomial pn(x) found by Lagrange interpolation at the given nodes. The error of such an approximation is given by Claim 2.17. Integrating the latter we find the Newton-Cotes quadrature rule as well as an error bound for the quadrature. In detail,
$$ f(x) - p_n(x) = \frac{1}{(n+1)!} f^{(n+1)}(\xi(x)) \prod_{i=0}^{n} (x - x_i), \qquad \xi \in (a,b), $$
and since pn(x) is constructed by Lagrange interpolation,
$$ f(x) - \sum_{i=0}^{n} f(x_i) L_{n,i}(x) = \frac{1}{(n+1)!} f^{(n+1)}(\xi) \prod_{i=0}^{n} (x - x_i). $$
Integrating both sides,
$$ \int_a^b f(x)\,dx - \sum_{i=0}^{n} f(x_i) \int_a^b L_{n,i}(x)\,dx = \frac{1}{(n+1)!} \int_a^b f^{(n+1)}(\xi) \prod_{i=0}^{n} (x - x_i)\,dx. $$
Defining the "weights" as
$$ \alpha_i = \int_a^b L_{n,i}(x)\,dx, $$
this can be written as
$$ \int_a^b f(x)\,dx - \sum_{i=0}^{n} \alpha_i f(x_i) = \frac{1}{(n+1)!} \int_a^b f^{(n+1)}(\xi) \prod_{i=0}^{n} (x - x_i)\,dx. $$
Now, one error bound is
$$ \left| \int_a^b f(x)\,dx - \sum_{i=0}^{n} \alpha_i f(x_i) \right| \le \frac{1}{(n+1)!} \max_{x \in [a,b]} \left| f^{(n+1)}(x) \right| \int_a^b \prod_{i=0}^{n} |x - x_i|\,dx. \qquad \square $$

3.2.2 Examples: Trapezium rule, Simpson rule, etc.

Example 3.14 (Trapezium rule). Find the two-point closed Newton-Cotes quadrature for the evaluation of ∫ₐᵇ f(x) dx.
Solution. First we find the relevant Lagrange polynomials. As there are only two points and the end points must be included, the interpolation points are x₀ = a and x₁ = b. Then
$$ L_{1,0}(x) = \frac{x - x_1}{x_0 - x_1} = \frac{x - b}{a - b} = \frac{b - x}{h} \qquad\text{and}\qquad L_{1,1}(x) = \frac{x - x_0}{x_1 - x_0} = \frac{x - a}{b - a} = \frac{x - a}{h}. $$
The weights for the quadrature are then
$$ \alpha_0 = \int_a^b L_{1,0}(x)\,dx = \int_a^b \frac{b-x}{h}\,dx = \frac{1}{h}\left[-\frac{1}{2}(b-x)^2\right]_a^b = \frac{1}{2}(b-a) $$
and
$$ \alpha_1 = \int_a^b L_{1,1}(x)\,dx = \int_a^b \frac{x-a}{h}\,dx = \frac{1}{h}\left[\frac{1}{2}(x-a)^2\right]_a^b = \frac{1}{2}(b-a). $$
The quadrature rule is then
$$ \int_a^b f(x)\,dx \approx \alpha_0 f(x_0) + \alpha_1 f(x_1) = \frac{1}{2}(b-a) f(a) + \frac{1}{2}(b-a) f(b) = \frac{1}{2}(b-a)\big(f(a) + f(b)\big). \tag{3.7} $$
This is the trapezium rule. ^
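As a quick illustration, the rule (3.7) is one line of Matlab (the function name is ours):

```matlab
function v = Trapezium(f, a, b)
% Two-point closed Newton-Cotes (trapezium) rule (3.7) on [a, b].
    v = (b - a) * (f(a) + f(b)) / 2;
end
```

For example, Trapezium(@exp, 0, 1) returns (1 + e)/2 ≈ 1.8591, compared with the exact value e − 1 ≈ 1.7183.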
Example 3.15 (Simpson's rule). Find the three-point closed Newton-Cotes quadrature for the evaluation of ∫ₐᵇ f(x) dx.

Solution. First we find the relevant Lagrange polynomials. As there are three points, two of which must be the end points, and the points must be equally spaced, the interpolation points are x₀ = a, x₂ = b and x₁ = c, where c is the midpoint
$$ c = \frac{a+b}{2}. $$
Define h = (b−a)/2 so that x₀ = c − h, x₁ = c and x₂ = c + h.
It is convenient to introduce the change of variable x = c + hu (the inverse is u = (x−c)/h, the differential is dx = h du). In terms of u the integration nodes are u₀ = −1, u₁ = 0 and u₂ = 1.
The Lagrange polynomials are invariant w.r.t. this change of variable, e.g. the first one,
$$ L_{2,0} = \frac{(u-u_1)(u-u_2)}{(u_0-u_1)(u_0-u_2)} = \frac{\big((x-c)/h-(x_1-c)/h\big)\big((x-c)/h-(x_2-c)/h\big)}{\big((x_0-c)/h-(x_1-c)/h\big)\big((x_0-c)/h-(x_2-c)/h\big)} = \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)}. $$
So compute L₂,₀ in terms of u,
$$ L_{2,0}(u) = \frac{u(u-1)}{(-1)(-2)}, $$
and then find the weight α₀,
$$ \alpha_0 = \frac{h}{2}\int_{-1}^{1} u(u-1)\,du = \frac{h}{2}\left[\frac{u^3}{3} - \frac{u^2}{2}\right]_{-1}^{1} = \frac{h}{3}. $$
Proceed similarly to find the rest of the weights. A more cumbersome approach, where the change of variable is made only at the point of integration, is presented below.
In terms of x the Lagrange polynomials are
$$ L_{2,0}(x) = \frac{(x-c)(x-c-h)}{2h^2}, \qquad L_{2,1}(x) = \frac{(x-c+h)(c+h-x)}{h^2}, \qquad L_{2,2}(x) = \frac{(x-c+h)(x-c)}{2h^2}, $$
and the corresponding weights are
$$ \alpha_0 = \int_a^b L_{2,0}(x)\,dx = \int_{c-h}^{c+h}\frac{(x-c)(x-c-h)}{2h^2}\,dx = \frac{h}{2}\int_{-1}^{1} u(u-1)\,du = \frac{h}{3}, $$
$$ \alpha_1 = \int_a^b L_{2,1}(x)\,dx = h\int_{-1}^{1}(u+1)(1-u)\,du = \frac{4h}{3}, $$
$$ \alpha_2 = \int_a^b L_{2,2}(x)\,dx = \frac{h}{2}\int_{-1}^{1}(u+1)u\,du = \frac{h}{3}. $$
The quadrature rule is then
$$ \int_a^b f(x)\,dx \approx \alpha_0 f(x_0) + \alpha_1 f(x_1) + \alpha_2 f(x_2) = \frac{h}{3} f(a) + \frac{4h}{3} f(c) + \frac{h}{3} f(b) = \frac{h}{3}\big(f(a) + 4f(c) + f(b)\big). \tag{3.8} $$
This is Simpson’s rule. ^
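Similarly, a minimal Matlab sketch of Simpson's rule (3.8):

```matlab
function v = Simpson(f, a, b)
% Three-point closed Newton-Cotes (Simpson's) rule (3.8) on [a, b].
    h = (b - a) / 2;
    v = h * (f(a) + 4*f((a + b)/2) + f(b)) / 3;
end
```

Simpson(@exp, 0, 1) gives ≈ 1.7189, already much closer to e − 1 than the trapezium rule.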
Example 3.16. Generate the composite quadrature based on the application of the trapezium rule on N equal-length sub-intervals of [a, b].

Solution. Label the sub-intervals as Ii with end-points [x_{i−1}, x_i] for i = 1, ..., N. The intervals are of equal length with x_i − x_{i−1} = h = (b−a)/N, and x₀ = a and x_N = b. Then the trapezium rule on Ii is
$$ \int_{I_i} f(x)\,dx \approx \frac{h}{2}\big(f(x_{i-1}) + f(x_i)\big), $$
and the sum of integrals over the sub-intervals is
$$ \int_a^b f(x)\,dx = \sum_{i=1}^{N}\int_{I_i} f(x)\,dx \approx \sum_{i=1}^{N}\frac{h}{2}\big(f(x_{i-1}) + f(x_i)\big) = \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{N-1} f(x_i) + f(x_N)\right). $$
^
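The composite rule just derived can be coded directly (a sketch; the function name is ours):

```matlab
function v = CompositeTrapezium(f, a, b, N)
% Composite trapezium rule on N equal sub-intervals of [a, b].
    h  = (b - a) / N;
    x  = a + h*(0:N);                        % nodes x_0, ..., x_N
    fx = arrayfun(f, x);                     % function values at the nodes
    v  = h/2 * (fx(1) + 2*sum(fx(2:end-1)) + fx(end));
end
```

Doubling N should reduce the error by roughly a factor of four, in line with the N⁻² bound derived in Example 3.18 below.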
Example 3.17. Find a bound for the error in using the trapezium rule to approximate ∫ₐᵇ f(x) dx.
Solution. The trapezium rule was derived as the closed Newton-Cotes quadrature based on interpolation at 2 points. Therefore n = 1 and we use p₁(x) to interpolate at the points x₀ = a and x₁ = b. Therefore, labelling the trapezium rule as T(f), we have
$$ \left| \int_a^b f(x)\,dx - T(f) \right| \le \frac{1}{2} \max_{x \in [a,b]} |f''(x)| \int_a^b (x-a)(b-x)\,dx = \frac{h^3}{12} \max_{x \in [a,b]} |f''(x)|, $$
where h = b − a. ^
Example 3.18. Find a bound for the error in using the composite trapezium rule to approximate ∫ₐᵇ f(x) dx.

Solution. Label the composite trapezium rule as Tc(f) and let the N sub-intervals be Ii, i = 1, ..., N. Also let the interpolating polynomial on the interval Ii be p_{1,i}(x). Then
$$ \left| \int_a^b f(x)\,dx - T_c(f) \right| = \left| \int_a^b f(x)\,dx - \sum_{i=1}^{N}\int_{I_i} p_{1,i}(x)\,dx \right| = \left| \sum_{i=1}^{N}\int_{I_i}\big(f(x) - p_{1,i}(x)\big)\,dx \right| \le \sum_{i=1}^{N}\int_{I_i}\big|f(x) - p_{1,i}(x)\big|\,dx, $$
from the triangle inequality. Now, applying the result from the previous example with the length of Ii equal to (b−a)/N,
$$ \le \sum_{i=1}^{N}\frac{(b-a)^3}{12N^3}\max_{x \in I_i}|f''(x)| \le \frac{(b-a)^3}{12N^2}\max_{x \in [a,b]}|f''(x)|. $$
To summarise,
$$ \left| \int_a^b f(x)\,dx - T_c(f) \right| \le \frac{(b-a)^3}{12N^2}\max_{x \in [a,b]}|f''(x)|. \tag{3.9} $$
^

3.3 Gaussian quadrature

3.3.1 Problem formulation

§ 3.19 (Idea of Gaussian quadrature). Recall that a quadrature rule takes the form
$$ \int_a^b f(x)\,dx \approx \sum_{k=1}^{n} \alpha_k f(x_k) $$
for some constant weights αk.
There are 2n degrees of freedom in this expression, namely:
• the n constants αk, and
• the locations of the n discretisation points xk.
We can use these degrees of freedom to ensure that the quadrature rule is exact for polynomials of degree less than 2n, that is deg f ≤ 2n − 1. This is the main idea of Gaussian quadrature rules.
Problem 3.20. Let P(x) be a polynomial of degree deg(P) ≤ 2n − 1. Find a quadrature rule for evaluating ∫ₐᵇ w(x) P(x) dx exactly (without errors).
§ 3.21. In contrast, in Newton-Cotes rules the choice of the n discretisation points xk is not optimised, since they are chosen to be equidistant for convenience. This failure to impose the n available conditions leaves Newton-Cotes rules exact only for polynomials of degree at most n − 1. The difference in accuracy is marked!

3.3.2 Orthogonal bases

To describe the method of Gaussian quadrature we introduce (or recall) some details on polynomial bases.
Definition 3.22. Let f(x) and g(x) be integrable on [a, b]. Then an inner product with respect to a weight function w(x) > 0 is defined as
$$ \langle f, g\rangle_w = \int_a^b w(x) f(x) g(x)\,dx. $$
Claim 3.23. Properties of the inner product include
$$ \langle f, g\rangle = \langle g, f\rangle, \qquad \langle f, g+h\rangle = \langle f, g\rangle + \langle f, h\rangle, \qquad \langle f, \lambda g\rangle = \lambda\langle f, g\rangle, \quad \lambda \in \mathbb{R}. $$

Proof. These follow from the linearity of the integral. □

Definition 3.24. Two functions f(x) and g(x) are called orthogonal on [a, b] (where a or b or both may be ∞) with respect to a weight function w(x) > 0 if
$$ \langle f, g\rangle_w = \int_a^b w(x) f(x) g(x)\,dx = 0. $$
Remark 3.25. If f and g are polynomials they are called orthogonal polynomials.
Definition 3.26. A set of polynomials {φ₀(x), φ₁(x), ..., φₙ(x)}, where φk(x) is of degree k and the coefficient of x^k is unity, is called an orthogonal set with respect to the non-negative weight function w(x) if
$$ \int_a^b w(x)\,\varphi_i(x)\,\varphi_j(x)\,dx = \delta_{ij}\,g_i = \begin{cases} 0, & i \ne j,\\ g_i > 0, & i = j,\end{cases} $$
where the gi are constants and δij is the Kronecker δ-symbol.

Example 3.27. Examples of common orthogonal sets of polynomials include:

  Orthogonal family     Weight w(x)             Interval [a, b]
  Chebyshev             1/sqrt(1 - x^2)         [-1, 1]
  Legendre              1                       [-1, 1]
  Laguerre              e^{-x}                  [0, ∞)
  Hermite               e^{-x^2}                (-∞, ∞)
  Jacobi                (1 - x)^α (1 + x)^β     [-1, 1]

Example 3.28. Show that Chebyshev polynomials are orthogonal on [−1, 1] with respect to the weight function 1/√(1 − x²).
Solution. We need to show that
$$ I = \int_{-1}^{1} \frac{T_n(x)\,T_m(x)}{\sqrt{1-x^2}}\,dx = \begin{cases} 0, & n \ne m,\\ \pi/2, & n = m \ne 0. \end{cases} $$
This can easily be proved by setting x = cosθ in the expression for I; then dx = −sinθ dθ and θ runs from π to 0, so that
$$ I = \int_0^{\pi} \cos(n\theta)\cos(m\theta)\,d\theta. $$
If n ≠ m, this leads to
$$ I = \int_0^{\pi}\frac{1}{2}\big(\cos((n+m)\theta) + \cos((n-m)\theta)\big)\,d\theta = \frac{1}{2}\left[\frac{\sin((n+m)\theta)}{n+m} + \frac{\sin((n-m)\theta)}{n-m}\right]_0^{\pi} = 0, \qquad \text{if } n \ne m. $$
And if n = m (≠ 0), then
$$ I = \int_0^{\pi}\frac{1}{2}\big(\cos(2n\theta) + 1\big)\,d\theta = \frac{\pi}{2}. $$
^
Example 3.29. Legendre polynomials are defined by
$$ P_n(x) = \frac{1}{2^n n!}\frac{d^n}{dx^n}\left(x^2 - 1\right)^n. $$
These are orthogonal on [−1, 1] with respect to the weight function w(x) = 1,
$$ \int_{-1}^{1} P_m(x) P_n(x)\,dx = \begin{cases} 0, & m \ne n,\\ \dfrac{2}{2n+1}, & m = n. \end{cases} $$

Claim 3.30 (Expansion in a basis). Any non-trivial polynomial Q(x) of degree n can be decomposed in an orthogonal polynomial basis {φi(x), i = 0...n} as
$$ Q(x) = \sum_{i=0}^{n} a_i \varphi_i(x), \qquad a_i = \frac{\langle Q, \varphi_i\rangle}{g_i}. $$

Proof. Assume
$$ Q(x) = \sum_{i=0}^{n} a_i \varphi_i(x) $$
holds and find the constants ai. To find an, take the inner product with the basis vector φn:
$$ \langle Q, \varphi_n\rangle = \sum_{i=0}^{n} a_i\langle \varphi_i, \varphi_n\rangle = \sum_{i=0}^{n} a_i\,\delta_{i,n}\,g_n = a_n g_n. $$
So
$$ a_n = \frac{\langle Q, \varphi_n\rangle}{g_n}, $$
as required; the same argument with φk in place of φn gives ak for every k. □

Claim 3.31. Let φk(x) be a polynomial of degree k and a member of the orthogonal basis set {φi(x), i = 0...n}. Then (i) φk(x) has k distinct zeros {xj, j = 1..k} in (a, b), and (ii) φk(x) can be written in the form
$$ \varphi_k(x) = \prod_{j=1}^{k}(x - x_j) $$
(with the coefficient of x^k being 1).

Proof. If k = 0 then φ₀ = 1 and clearly has no zeros. Let k ≥ 1; then by orthogonality
$$ \int_a^b w\,\varphi_k\,\varphi_0\,dx = \int_a^b w\,\varphi_k(x)\,dx = 0. $$
Now, w > 0, so φk must change sign in (a, b).
Suppose φk changes sign m times, with m < k, in (a, b) at x₁, x₂, ..., xm, with a < x₁ < x₂ < ... < xm < b. Then
$$ \varphi_k(x) = \beta(x)\prod_{i=1}^{m}(x - x_i), $$
where β(x) is a polynomial of degree k − m > 0 with fixed sign in (a, b). Let
$$ S(x) = \prod_{i=1}^{m}(x - x_i); $$
then S(x)φk(x) takes the sign of β(x) in (a, b) and therefore
$$ \int_a^b w(x) S(x)\varphi_k(x)\,dx \ne 0. $$
But S(x) is a polynomial of degree m < k, therefore
$$ S(x) = \sum_{j=0}^{m} s_j\varphi_j(x) $$
for some sj, and therefore
$$ \int_a^b w(x) S(x)\varphi_k(x)\,dx = \sum_{j=0}^{m} s_j\int_a^b w(x)\varphi_j(x)\varphi_k(x)\,dx = 0, $$
a contradiction; therefore m = k. □

We can now derive the Gaussian quadrature rule as follows.

3.3.3 Gaussian quadrature rules

Claim 3.32 (Gaussian Quadrature). Let P(x) be a polynomial of degree deg(P) ≤ 2n − 1. Let φn(x) be a polynomial of degree deg(φn) = n that is a member of an orthogonal basis set with weight function w(x), and let the zeros of φn be denoted by x₀, x₁, ..., x_{n−1} ∈ (a, b). Then
$$ \int_a^b w(x) P(x)\,dx = \sum_{i=0}^{n-1} \alpha_i P(x_i), \qquad \alpha_i = \int_a^b w(x) L_{n-1,i}(x)\,dx, $$
where L_{k,i}(x) are Lagrange basis functions (note that deg(L_{n−1,i}) = n − 1).

Proof. Represent P(x) as
$$ P(x) = q(x)\varphi_n(x) + r(x), $$
where q(x) and r(x) are polynomials of degree less than n. Represent q(x) and r(x) in the following way:
$$ q(x) = \sum_{j=0}^{n-1}\beta_j\varphi_j(x), \qquad r(x) = \sum_{j=0}^{n-1} r(x_j)\,L_{n-1,j}(x). $$
Now, calculate the definite integral in question:
$$ \int_a^b w(x) P(x)\,dx = \sum_{j=0}^{n-1}\beta_j\int_a^b w(x)\varphi_j\varphi_n\,dx + \sum_{j=0}^{n-1} r(x_j)\int_a^b w(x) L_{n-1,j}(x)\,dx $$
(using orthogonality in the first integral and the definition of αj in the second integral)
$$ = 0 + \sum_{j=0}^{n-1} r(x_j)\,\alpha_j $$
(representing r(x) = P(x) − q(x)φn(x))
$$ = \sum_{j=0}^{n-1}\big(P(x_j) - q(x_j)\varphi_n(x_j)\big)\alpha_j $$
(and since the xj are roots of φn)
$$ = \sum_{j=0}^{n-1} P(x_j)\,\alpha_j. \qquad \square $$

Remark 3.33. An n-point Newton-Cotes rule is exact for polynomials of degree at most n − 1. An n-point Gaussian rule is exact for polynomials of degree at most 2n − 1. This is a major increase in accuracy and the best one can do theoretically.

Now, as always, we have to estimate the error.


Claim 3.34. For a function f ∈ C^{2n}[a, b], the error of using a Gaussian quadrature rule is given by
$$ \left| \int_a^b w(x) f(x)\,dx - \sum_{k=1}^{n} \alpha_k f(x_k) \right| \le \frac{1}{(2n)!}\max_{\xi \in [a,b]}|f^{(2n)}(\xi)|\int_a^b \varphi_n^2(x)\,w(x)\,dx. $$

Proof. Omitted. 

3.3.4 Summary and examples

§ 3.35. In summary, to find the n-point Gaussian quadrature:

1. Find the orthogonal family of polynomials φi(x) up to degree n, given the weight function and the interval [a, b].
2. Find the n zeros of φn(x); label them x₁, ..., xn.
3. Calculate the Lagrange polynomials based on the points xk.
4. Calculate the weights
$$ \alpha_k = \int_a^b w(x) L_{n-1,k}(x)\,dx. $$
5. The quadrature
$$ \int_a^b w(x) f(x)\,dx \approx \sum_{k=1}^{n}\alpha_k f(x_k) $$
is exact for f a polynomial of degree up to 2n − 1.

Example 3.36 (Gauss-Legendre quadrature). Find the three-point Gaussian quadrature rule for
$$ \int_{-1}^{1} f(x)\,dx. $$

Solution. Hidden in the question are the weight function w(x) = 1 and the range [−1, 1], so we need Legendre polynomials. The three nodes will be the three roots of P₃(x). Now, from the definition,
$$ P_3(x) = \frac{1}{2^3\,3!}\frac{d^3}{dx^3}\left(x^2-1\right)^3 = \frac{1}{48}\frac{d^2}{dx^2}\left[6x\left(x^2-1\right)^2\right] = \frac{1}{48}\frac{d}{dx}\left[6\left(x^2-1\right)^2 + 24x^2\left(x^2-1\right)\right] = \frac{1}{48}\left(120x^3 - 72x\right) = \frac{5}{2}x^3 - \frac{3}{2}x $$
(note that we have dropped the requirement that the coefficient of x³ is 1 since we are only interested in the zeros of P₃).
The quadrature points satisfy P₃(x) = 0 and are
$$ x = 0, \qquad x = \pm\sqrt{\frac{3}{5}}. $$
For the Gaussian quadrature rule we need the cardinal functions based on these points:
$$ L_{2,0} = \frac{(x-0)\left(x-\sqrt{3/5}\right)}{\left(-\sqrt{3/5}\right)\left(-2\sqrt{3/5}\right)} = \frac{5x\left(x-\sqrt{3/5}\right)}{6}, \qquad \alpha_0 = \int_{-1}^{1} L_{2,0}\,dx = \int_{-1}^{1}\frac{5}{6}\left(x^2 - \sqrt{\tfrac{3}{5}}\,x\right)dx = \frac{5}{9}, $$
$$ L_{2,1} = \frac{\left(x+\sqrt{3/5}\right)\left(x-\sqrt{3/5}\right)}{\left(\sqrt{3/5}\right)\left(-\sqrt{3/5}\right)} = -\frac{5}{3}\left(x^2 - \frac{3}{5}\right), \qquad \alpha_1 = \int_{-1}^{1} L_{2,1}\,dx = -\frac{10}{3}\left(\frac{1}{3} - \frac{3}{5}\right) = \frac{8}{9}, $$
$$ L_{2,2} = \frac{\left(x+\sqrt{3/5}\right)(x-0)}{\left(2\sqrt{3/5}\right)\left(\sqrt{3/5}\right)} = \frac{5x\left(x+\sqrt{3/5}\right)}{6}, \qquad \alpha_2 = \int_{-1}^{1} L_{2,2}\,dx = \frac{5}{9}. $$
The three-point quadrature rule is then
$$ \int_{-1}^{1} f(x)\,dx \approx \frac{5}{9} f\!\left(-\sqrt{\frac{3}{5}}\right) + \frac{8}{9} f(0) + \frac{5}{9} f\!\left(\sqrt{\frac{3}{5}}\right). $$
This is exact for all polynomials f (x) of degree up to 2×3−1 = 5. ^
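The resulting rule is trivial to apply in Matlab (a sketch; the function name is ours):

```matlab
function v = GaussLegendre3(f)
% Three-point Gauss-Legendre rule on [-1, 1]; exact for deg(f) <= 5.
    x = [-sqrt(3/5), 0, sqrt(3/5)];    % nodes: roots of P_3
    w = [ 5/9, 8/9, 5/9 ];             % weights
    v = sum(w .* arrayfun(f, x));
end
```

For instance, GaussLegendre3(@(x) x.^4) returns 0.4000, the exact value of ∫₋₁¹ x⁴ dx = 2/5, as the theory predicts.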
Example 3.37 (Gauss-Chebyshev quadrature). Determine the two-point Gaussian quadrature rule for
$$ \int_{-1}^{1}\frac{f(x)}{\sqrt{1-x^2}}\,dx $$
and use this to find
$$ \int_{0}^{1}\frac{\cos x}{\sqrt{1-x^2}}\,dx. $$
Solution. We need Chebyshev polynomials: T₂(x) = 2x² − 1, with roots at x = ±1/√2, and therefore
$$ \int_{-1}^{1}\frac{f(x)}{\sqrt{1-x^2}}\,dx = a_0 f\!\left(-\frac{1}{\sqrt{2}}\right) + a_1 f\!\left(\frac{1}{\sqrt{2}}\right), $$
an expression which we know is exact for f(x) = 1 and f(x) = x. These give
$$ \pi = a_0 + a_1, \qquad a_0 - a_1 = 0, $$
so that
$$ \int_{-1}^{1}\frac{f(x)}{\sqrt{1-x^2}}\,dx = \frac{\pi}{2} f\!\left(-\frac{1}{\sqrt{2}}\right) + \frac{\pi}{2} f\!\left(\frac{1}{\sqrt{2}}\right). $$
For the particular example given, using the fact that the integrand is even,
$$ \int_{0}^{1}\frac{\cos x}{\sqrt{1-x^2}}\,dx = \frac{1}{2}\int_{-1}^{1}\frac{\cos x}{\sqrt{1-x^2}}\,dx \approx \frac{\pi}{2}\cos\frac{1}{\sqrt{2}} \approx 1.1942. $$
^
Example 3.38 (Gauss-Laguerre quadrature). Find the 2-point Gaussian quadrature formula for
$$ \int_{0}^{\infty} e^{-x} f(x)\,dx. $$

Solution. 1. The weight is w(x) = e^{−x}. The interval is [0, ∞). The orthogonal polynomials are Laguerre polynomials. We need the following result:
$$ \int_0^{\infty} e^{-x} x^n\,dx = n!\,. $$
2. Start with φ₀(x) = 1. Find φ₁(x) = x + a, where a is such that ⟨φ₀, φ₁⟩ = 0:
$$ \int_0^{\infty} e^{-x}\varphi_0(x)\varphi_1(x)\,dx = \int_0^{\infty} e^{-x}\,1\cdot(x+a)\,dx = 1! + 0!\,a = 0, $$
which has solution a = −1, so that
$$ \varphi_1(x) = x - 1. $$
Next, φ₂(x) = x² + ax + b, where a and b are such that ⟨φ₀, φ₂⟩ = 0 and ⟨φ₁, φ₂⟩ = 0. This is equivalent to
$$ \langle 1, \varphi_2\rangle = 0 = \int_0^{\infty} e^{-x}\left(x^2 + ax + b\right)dx = 2! + 1!\,a + 0!\,b, \qquad \langle x, \varphi_2\rangle = 0 = \int_0^{\infty} e^{-x}\left(x^3 + ax^2 + bx\right)dx = 3! + 2!\,a + 1!\,b, $$
which has solution a = −4 and b = 2, so that
$$ \varphi_2(x) = x^2 - 4x + 2. $$
3. The roots of φ₂(x) are x₁ = 2 − √2 and x₂ = 2 + √2.
4. The Lagrange basis functions are
$$ L_1(x) = \frac{x - x_2}{x_1 - x_2}, \qquad L_2(x) = \frac{x - x_1}{x_2 - x_1}. $$
5. The nodes for the quadrature are 2 − √2 and 2 + √2.
6. The weights are
$$ \alpha_1 = \int_0^{\infty} e^{-x} L_1(x)\,dx = \int_0^{\infty} e^{-x}\frac{x - x_2}{x_1 - x_2}\,dx = \frac{1 - x_2}{x_1 - x_2} = \frac{\sqrt{2}+1}{2\sqrt{2}}, \qquad \alpha_2 = \int_0^{\infty} e^{-x} L_2(x)\,dx = \frac{1 - x_1}{x_2 - x_1} = \frac{\sqrt{2}-1}{2\sqrt{2}}. $$
7. Summary:
$$ \int_0^{\infty} e^{-x} f(x)\,dx \approx \frac{\sqrt{2}+1}{2\sqrt{2}}\,f\!\left(2 - \sqrt{2}\right) + \frac{\sqrt{2}-1}{2\sqrt{2}}\,f\!\left(2 + \sqrt{2}\right). $$
^

Example 3.39 (Gauss-Hermite quadrature). Find the 4-point Gaussian quadrature formula for
$$ \int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx. $$
Solution. 1. The weight is w(x) = e^{−x²}. The interval is (−∞, ∞). The orthogonal polynomials are Hermite polynomials. We need the following results, for k ≥ 0 an integer:
$$ \int_{-\infty}^{\infty} e^{-x^2} x^{2k+1}\,dx = 0, \qquad \int_{-\infty}^{\infty} e^{-x^2} x^{2k}\,dx = \frac{(2k)!\sqrt{\pi}}{2^{2k}\,k!}. $$
2. Start with φ₀(x) = 1. Find φ₁(x) = x + a, where a is such that ⟨φ₀, φ₁⟩ = 0:
$$ \int_{-\infty}^{\infty} e^{-x^2}\varphi_0(x)\varphi_1(x)\,dx = \int_{-\infty}^{\infty} e^{-x^2}\,1\cdot(x+a)\,dx = \sqrt{\pi}\,a = 0, $$
which has solution a = 0, so that
$$ \varphi_1(x) = x. $$
Next, φ₂(x) = x² + ax + b, where a and b are such that ⟨φ₀, φ₂⟩ = 0 and ⟨φ₁, φ₂⟩ = 0. This is equivalent to
$$ \langle 1, \varphi_2\rangle = 0 = \int_{-\infty}^{\infty} e^{-x^2}\left(x^2 + ax + b\right)dx = \frac{\sqrt{\pi}}{2} + \sqrt{\pi}\,b, \qquad \langle x, \varphi_2\rangle = 0 = \int_{-\infty}^{\infty} e^{-x^2}\left(x^3 + ax^2 + bx\right)dx = \frac{\sqrt{\pi}}{2}\,a, $$
which has solution a = 0 and b = −1/2, so that
$$ \varphi_2(x) = x^2 - \frac{1}{2}. $$
Next, φ₃(x) = x³ + ax² + bx + c, where a, b and c are such that ⟨φj, φ₃⟩ = 0 for j = 0, 1 and 2. This is equivalent to
$$ \langle 1, \varphi_3\rangle = 0 = \frac{\sqrt{\pi}}{2}\,a + \sqrt{\pi}\,c, \qquad \langle x, \varphi_3\rangle = 0 = \frac{3\sqrt{\pi}}{4} + \frac{\sqrt{\pi}}{2}\,b, \qquad \langle x^2, \varphi_3\rangle = 0 = \frac{3\sqrt{\pi}}{4}\,a + \frac{\sqrt{\pi}}{2}\,c, $$
which has solution a = c = 0 and b = −3/2, so that
$$ \varphi_3(x) = x^3 - \frac{3}{2}x. $$
You should notice that, by symmetry, we can write the next polynomial as φ₄(x) = x⁴ + ax² + b. With
$$ \langle 1, \varphi_4\rangle = 0 = \frac{3\sqrt{\pi}}{4} + \frac{\sqrt{\pi}}{2}\,a + \sqrt{\pi}\,b, \qquad \langle x^2, \varphi_4\rangle = 0 = \frac{15\sqrt{\pi}}{8} + \frac{3\sqrt{\pi}}{4}\,a + \frac{\sqrt{\pi}}{2}\,b, $$
which has solution a = −3 and b = 3/4, so that
$$ \varphi_4(x) = x^4 - 3x^2 + \frac{3}{4}. $$
3. The roots of φ₄(x) are ±x± with
$$ x_{\pm} = \sqrt{\frac{3}{2} \pm \sqrt{\frac{3}{2}}}. $$
4. Now form the Lagrange basis functions based on these points, and integrate the product of each of them with the weight function over the interval to obtain the weights.
5. Summary:
$$ \int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx \approx \frac{\sqrt{\pi}}{4}\left(1 - \sqrt{\frac{2}{3}}\right)\big[f(-x_+) + f(x_+)\big] + \frac{\sqrt{\pi}}{4}\left(1 + \sqrt{\frac{2}{3}}\right)\big[f(-x_-) + f(x_-)\big]. $$
^

3.4 Recurrence relations for orthogonal polynomials

The orthogonal family of polynomials φ₀(x), φ₁(x), ..., φₙ(x) form a basis for polynomials of degree less than or equal to n (deg(φk) = k). Use the following notation:
$$ \langle f, g\rangle = \int_a^b f(x) g(x) w(x)\,dx. $$
Then the orthogonality of the polynomials φk(x) is expressed as
$$ \langle \varphi_i, \varphi_j\rangle = 0, \quad \text{if } i \ne j. $$
Often a normalisation is chosen so that ⟨φi, φi⟩ = 1, but this is not always the most convenient choice for some families of polynomials. We will instead set the coefficient of xⁿ in φn to be 1.
The polynomial φ_{n+1}(x) is a degree n+1 polynomial and so can be written
$$ \varphi_{n+1}(x) = x\varphi_n(x) + B_n\varphi_n(x) + C_n\varphi_{n-1}(x) + \sum_{j=0}^{n-2} D_j\varphi_j(x) $$
for some constants Bn, Cn and Dj, j = 0, ..., n−2. Orthogonality requires that
$$ \langle \varphi_{n+1}, \varphi_i\rangle = 0, \quad \text{for } 0 \le i \le n. $$
In particular, if k = 0, 1, ..., n−2 we have
$$ \langle \varphi_{n+1}, \varphi_k\rangle = \langle x\varphi_n, \varphi_k\rangle + B_n\langle \varphi_n, \varphi_k\rangle + C_n\langle \varphi_{n-1}, \varphi_k\rangle + \sum_{j=0}^{n-2} D_j\langle \varphi_j, \varphi_k\rangle; $$
using the orthogonality of the family we have
$$ \langle \varphi_{n+1}, \varphi_k\rangle = \langle x\varphi_n, \varphi_k\rangle + D_k\langle \varphi_k, \varphi_k\rangle. $$
Using the definition of ⟨·,·⟩ it is easy to see that
$$ \langle x\varphi_n, \varphi_k\rangle = \langle \varphi_n, x\varphi_k\rangle, $$
and as k ≤ n−2 the product xφk is a polynomial of degree less than or equal to n−1 and can be written as
$$ x\varphi_k = \sum_{i=0}^{n-1} q_i\varphi_i; $$
therefore the integral
$$ \langle \varphi_n, x\varphi_k\rangle = \sum_{i=0}^{n-1} q_i\langle \varphi_n, \varphi_i\rangle = 0, $$
as i ≤ n−1. This leads to the conclusion that Dk = 0 and therefore that
$$ \varphi_{n+1}(x) = (x + B_n)\varphi_n(x) + C_n\varphi_{n-1}(x). $$
The constants Bn and Cn are determined by the remaining two orthogonality conditions, ⟨φ_{n+1}, φ_{n−1}⟩ = 0 and ⟨φ_{n+1}, φn⟩ = 0, leading to
$$ C_n = -\frac{\langle \varphi_n, x\varphi_{n-1}\rangle}{\langle \varphi_{n-1}, \varphi_{n-1}\rangle} = -\frac{\langle \varphi_n, \varphi_n\rangle}{\langle \varphi_{n-1}, \varphi_{n-1}\rangle} = -\frac{\langle \varphi_n, x^n\rangle}{\langle \varphi_{n-1}, \varphi_{n-1}\rangle}, \qquad B_n = -\frac{\langle \varphi_n, x\varphi_n\rangle}{\langle \varphi_n, \varphi_n\rangle}. $$
Some examples of recurrence relations for standard families of orthogonal polynomials are (note that the normalisation condition of being monic has been abandoned):
Tn+1 = 2xTn − Tn−1, Chebyshev
Pn+1 = ((2n+1)/(n+1)) x Pn − (n/(n+1)) Pn−1, Legendre
Ln+1 = −(1/(n+1)) x Ln + ((2n+1)/(n+1)) Ln − (n/(n+1)) Ln−1, Laguerre
Hn+1 = 2xHn (x)−2nHn−1, Hermite.
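These three-term recurrences are the standard way to evaluate orthogonal polynomials in code. A Matlab sketch for the Legendre family, using the relation quoted above (the function name is ours):

```matlab
function P = LegendreEval(n, x)
% Evaluate P_n(x) via (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}, P_0 = 1, P_1 = x.
    Pprev = ones(size(x));                              % P_0
    if n == 0, P = Pprev; return; end
    P = x;                                              % P_1
    for k = 1:n-1
        [Pprev, P] = deal(P, ((2*k+1)*x.*P - k*Pprev)/(k+1));
    end
end
```

The roots of P₃ found from this evaluation (for instance with fzero) are 0 and ±√(3/5), the Gauss-Legendre nodes of Example 3.36.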

3.5 Hands-on projects

3.5.1 Simple and composite quadrature

Here you will apply the theory of section 3.2 of the Lecture Notes to find, analyse and use an unseen
Newton-Cotes quadrature rule.
1. Find a two-point open Newton-Cotes quadrature rule for the approximation of the definite integral ∫ₐᵇ f(x) dx. (Marks: 5)
2. Find a bound for the error in using the rule derived in part 1. (Marks: 5)
3. Create a Matlab function v = TwoPtOpenRule(f,a,b) implementing the quadrature rule derived
in part 1. Your function should take as input a function f to be integrated on an interval x ∈ [a,b]
with end points a and b. The function should return a real value.
(Marks: 5)
4. Create a Matlab function v = Comp(Rule,f,a,b,N) implementing composite quadrature. Your function should take as inputs a function Rule defining a simple quadrature rule to be applied on N equally long parts of the interval x ∈ [a, b], with end points a and b, where the integrand function f is to be integrated. The function should return a real value. (One possible structure is sketched at the end of this question.) (Marks: 5)
5. Find a bound for the error in using a composite rule based on the two-point open Newton-Cotes
quadrature rule derived in part 1.
(Marks: 5)
6. Calling your functions from parts 3 and 4, use your composite two-point open Newton-Cotes quadrature rule to approximate
$$ I = \int_0^{\pi}\sin x\,dx, $$
using n = 2^k, k = 0, 1, ..., 8 subintervals of the region of integration. At each value of n, calculate numerically the error and the order of accuracy of your numerical scheme. Your code should output the results in the following format:

  n   Approx Value        Error           Order Acc
  1   2.7206990464e+00   -7.206990e-01    1000.372888
  2   2.1457476866e+00   -1.457477e-01    2.305924
  4   2.0347862159e+00   -3.478622e-02    2.066885
  .   ............        ............    ......
Does this numerical test confirm your analysis from part 5? (Marks: 6)
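A possible structure for the composite driver of part 4 (a sketch only; any correct implementation is acceptable):

```matlab
function v = Comp(Rule, f, a, b, N)
% Composite quadrature: apply the simple rule 'Rule' on each of the N
% equally long sub-intervals of [a, b] and sum the results.
    edges = linspace(a, b, N+1);
    v = 0;
    for k = 1:N
        v = v + Rule(f, edges(k), edges(k+1));
    end
end
```

It would be called, for example, as Comp(@TwoPtOpenRule, @sin, 0, pi, 8).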

3.5.2 Double integration

Proper multiple integrals can be represented as iterated single integrals and approximated numerically using quadrature rules for single integration. For instance, if D is a domain defined by g(x) ≤ y ≤ h(x), where a ≤ x ≤ b, then
$$ \iint_D f(x,y)\,dx\,dy = \int_a^b dx\int_{g(x)}^{h(x)} f(x,y)\,dy. $$

7. Use your functions from parts 3 and 4 to approximate the double integral
$$ I = \iint_D x y^2\,dx\,dy, $$
where D is the region in the first quadrant bounded by the curve y = 4x², the x axis and the line x = 1. Partitioning into n = 2^k, k = 0, 1, ..., 8 equally long subintervals in both the x and y directions, generate a table giving numerically the value of the integral, the error and the order of accuracy of your numerical scheme at each n. (Marks: 7)

8. For comparison, replace the function TwoPtOpenRule(f,a,b) written in part 3 with a function
SimpsonsRule(f,a,b), implementing the simple Simpson’s rule. Repeat parts 6 and 7. Compare
the results and comment on the order of convergence.
(Marks: 2)

3.5.3 Adaptive quadrature

3.5.3.1 Background

Error estimates for quadrature rules were derived in lectures, but we did not consider mechanisms for (a)
deciding just how many integration nodes are necessary to obtain a required accuracy and (b) for minimizing
this number. This number will depend critically on the form of the integrand; if it oscillates rapidly then
a large number of points is required; if it varies only very slowly, then a small number of points is sufficient.
A procedure for concentrating the integration nodes of a given quadrature rule in subintervals where they are
needed to achieve a desired accuracy with minimal effort is called adaptive quadrature. This Assignment
will guide you in a step-by-step process of deriving, implementing and testing an adaptive Simpson’s rule.

3.5.3.2 Simpson’s and composite Simpson’s rules

Consider the three-point closed simple Simpson's rule S(f; a, b) with error E as introduced in lectures:
$$ \int_a^b f(x)\,dx = S(f;a,b) + E(f;a,b), \qquad S(f;a,b) = \frac{(b-a)}{6}\left(f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b)\right). \tag{3.10} $$

Question 3.40. Using appropriate Taylor expansions, show that the error in the simple Simpson's rule (3.10) is
$$ E(f;a,b) = -\frac{(b-a)^5}{2880}\,f^{(iv)}(\xi), \qquad \text{for some } \xi \in [a,b]. \tag{3.11} $$
(Marks: 5)
Question 3.41. Using (3.10), derive the composite Simpson’s rule Sc ( f ;a,b,N) for N subintervals of [a,b]
of equal length. In the process you should (a) define an appropriate step length h; (b) index the integration
nodes appropriately, and finally (c) provide an expression in the form of a weighted sum of function values
f (x j ). (Marks: 5)
Question 3.42. The error in the composite Simpson's rule Sc(f; a, b, N) is given by
$$ E_c(f;a,b,N) = -\frac{(b-a)^5}{2880\,N^4}\,f^{(iv)}(\eta), \qquad \text{for some } \eta \in [a,b]. \tag{3.12} $$
Is it possible to derive this expression directly from equation (3.11) without further assumptions? Why or why not? Even if it is possible to use (3.11), explain how you would obtain (3.12) using Taylor expansions, i.e. start from your formula for Sc(f; a, b, N) derived in Question 3.41 and (a) write out the expansion of f(x) at a general node xj, (b) integrate the expansion of f(x) to get an expansion of ∫ₐᵇ f(x) dx, and finally (c) explain how you would form the error Ec(f; a, b, N), but do not perform further calculations. (Marks: 5)
Question 3.43. Implement the composite Simpson’s rule you derived in Question 3.41 as a Matlab function
Sc(f,a,b,N). Here f is a function handle holding the integrand f (x) and a,b are the endpoints of the
integration interval [a,b] and N is the number of subintervals N where the simple rule is applied. You may
find Listing ?? useful. (Marks: 5)
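One possible structure for Sc is sketched below (a sketch only; the loop simply applies the simple rule (3.10) on each sub-interval):

```matlab
function v = Sc(f, a, b, N)
% Composite Simpson's rule: the simple rule (3.10) on N equal sub-intervals of [a, b].
    h = (b - a) / N;
    v = 0;
    for k = 1:N
        xl = a + (k-1)*h;  xr = xl + h;                      % current sub-interval
        v  = v + (xr - xl)/6 * (f(xl) + 4*f((xl + xr)/2) + f(xr));
    end
end
```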

3.5.3.3 The adaptive refinement procedure

An adaptive refinement procedure may be applied to any (n+1)-point quadrature rule Q as follows.

1. Consider an interval [a,b].


2. Obtain a first integral approximation Q1 on [a,b] using n+1 nodes.
3. Obtain a second integral approximation Q2 on [a,b] using 2n+1 nodes.
4. Obtain an estimate of the error using Q1 and Q2 .
5. If the error estimate is less than some required tolerance ε accept a suitably corrected value of Q1 (or Q2 ).
6. Otherwise split [a,b] into two equal subintervals and repeat the above process over each subinterval
with tolerances adjusted to ε/2 (why ε/2?).
Note that this is a “recursive procedure” - it is most easily defined in terms of itself and ends when a “stopping
condition” is met. To implement the stopping condition we now need to find an estimate of the error on
a subinterval.
Question 3.44. Consider an interval [a, b] small enough to justify the assumption f^{(iv)}(x) = const on [a, b]. With this assumption, eliminate f^{(iv)}(x) between equations (3.11) and (3.12) to show that
$$ E_c(f;a,b,2) \simeq \frac{1}{15}\big(S_c(f;a,b,2) - S(f;a,b,1)\big) \quad\text{and then}\quad \int_a^b f(x)\,dx = S_c(f;a,b,2) + E_c(f;a,b,2), \tag{3.13} $$
where ≃ denotes equality under the stated assumption. (Marks: 6)
Question 3.45. Implement an adaptive Simpson’s rule as a Matlab function Sa(f,a,b,eps) using
equations (3.13) at step 5 of the adaptive refinement procedure. The code in Listing ?? may be useful. Make
sure you provide comments in your script file using % to explain what you are doing. (Marks: 4)
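A sketch of one way to organise the recursion, using (3.13) for the error estimate (function and variable names are illustrative, not a required submission):

```matlab
function v = Sa(f, a, b, eps)
% Adaptive Simpson's rule: accept the corrected value when the error
% estimate (3.13) meets the tolerance, otherwise bisect and recurse.
    c  = (a + b)/2;
    Q1 = SimpleS(f, a, b);                        % S(f; a, b, 1)
    Q2 = SimpleS(f, a, c) + SimpleS(f, c, b);     % Sc(f; a, b, 2)
    E  = (Q2 - Q1)/15;                            % estimate of Ec(f; a, b, 2)
    if abs(E) < eps
        v = Q2 + E;                               % corrected value, eq. (3.13)
    else
        v = Sa(f, a, c, eps/2) + Sa(f, c, b, eps/2);
    end
end

function v = SimpleS(f, a, b)
% Simple Simpson's rule (3.10) on [a, b].
    v = (b - a)/6 * (f(a) + 4*f((a + b)/2) + f(b));
end
```

Recording the integration nodes for the plot in Question 3.46 requires a little extra bookkeeping (for example an additional output argument accumulated through the recursion).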

3.5.3.4 Illustration and validation

Question 3.46. Compute an approximation of the integral
$$ \int_0^2 \sin\!\left(1 - 25\,\mathrm{erf}\!\left(\frac{x-1}{0.2\sqrt{2}}\right)\right)dx \tag{3.14} $$
using your Matlab functions of the composite and the adaptive Simpson’s rules from Questions 3.43 and
3.45. In order to illustrate the adaptive refinement procedure produce a plot showing (a) a graph of the
integrand as well as (b) the location of the integration nodes. Compare that with a similar plot obtained
when using your composite rule Matlab function. (Marks: 5)
Question 3.47. Compute an approximation of the integral
$$ I = \int_{-3}^{5}\exp\!\left(-50(x-1)^2\right)dx \tag{3.15} $$
using your Matlab function of the adaptive Simpson's rule from Question 3.45 for 10 different values of the tolerance ε from 10⁻⁶ to 10⁻¹². The integral has the exact value I = erf(20√2)·√(2π)/10. Plot graphs to demonstrate that your adaptive Simpson's function produces an error smaller than the required tolerance. (Marks: 5)
Chapter 4

Systems of linear equations

4.1 Problem Formulation

In this chapter we will be concerned with the following central problem


Problem 4.1. Let A be a non-singular n×n matrix of real numbers and let b be a column vector of length
n, where n ∈ N. Solve the system of linear equations
Ax = b. (4.1)
The reader is well aware of the importance of this problem in mathematics and its applications.
As usual we need to know when equation (4.1) has a solution. The answer is given by the well known
Fundamental Theorem of Invertible Matrices, a simplified version of which follows.
Claim 4.2. (Fundamental Theorem of Invertible Matrices) Let A be a n×n matrix. The following statements
are equivalent:
1. Ax = b has a unique solution for every b ∈ Rn .
2. A is invertible.
3. Ax = 0 has only the trivial solution x = 0.
4. The reduced echelon form of A is the identity matrix In .
§ 4.3. Types of numerical methods for solution of (4.1) fall into two large classes
• Direct methods – these are based on Gaussian elimination strategies.
• Iterative methods – these are fixed-point iterative methods as discussed in Ch 1. Methods important
for linear systems include Gauss-Jacobi, Gauss-Seidel.

4.2 Direct methods for solution of linear systems

§ 4.4. • A sequence of invertible row operations that solves equation (4.1) is known as Gaussian
elimination.
• Variations of the procedure exist to cater for various types of matrices and to minimize the amount
of calculation.
• An efficient strategy is LU-factorisation, which is closely related to Gaussian elimination.


4.2.1 Preliminaries

Definition 4.5. An elementary matrix is a matrix obtained by performing an elementary row operation
on the identity matrix In .
Claim 4.6. Any elementary row operation (ERO) on a matrix A can be represented as a multiplication
of A from the left by the corresponding elementary matrix.
Definition 4.7. Matrices of the form
$$ L = \begin{pmatrix} L_{11} & 0 & 0 & \ldots & 0\\ L_{21} & L_{22} & 0 & \ldots & 0\\ L_{31} & L_{32} & L_{33} & \ldots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ L_{n1} & L_{n2} & L_{n3} & \ldots & L_{nn} \end{pmatrix}, \qquad U = \begin{pmatrix} U_{11} & U_{12} & U_{13} & \ldots & U_{1n}\\ 0 & U_{22} & U_{23} & \ldots & U_{2n}\\ 0 & 0 & U_{33} & \ldots & U_{3n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \ldots & U_{nn} \end{pmatrix}, $$
are called a lower-triangular matrix and an upper-triangular matrix, respectively.

Definition 4.8. Let A be a square matrix. A factorization of A of the form A = LU, if it exists, where L is lower triangular and U is upper triangular, is called an LU-factorization of A.

4.2.2 Gaussian elimination and LU-factorisation

Claim 4.9 (Gaussian elimination). The problem Ax = b can be solved in two steps as follows.
1. Row-echelon reduction: The problem Ax = b is reduced to an upper-triangular form Ux = c by a sequence of EROs represented by
$$ \left(\prod_{i=n-1}^{1} G_i\right)[A|b] = [U|c]. $$
The upper-triangular matrix U is called the row-echelon form of A.
2. Back-substitution: An iteration is performed in the order i = n, n−1, ..., 1 that gives the solution as
$$ x_i = \frac{c_i - \sum_{j=i+1}^{n} u_{ij} x_j}{u_{ii}}. $$
Claim 4.10 (An LU-factorisation). One possible LU-factorisation of A is A = LU, where
$$ L = \prod_{i=1}^{n-1} G_i^{-1} $$
is a lower-triangular matrix and where Gi are the elementary matrices used in the first step of Gaussian
Elimination to convert A to its row-echelon form U.
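Before illustrating the factorisation, note that the back-substitution step of Claim 4.9 translates directly into code; a minimal Matlab sketch (the function name is ours):

```matlab
function x = BackSub(U, c)
% Solve the upper-triangular system U x = c by back-substitution.
    n = length(c);
    x = zeros(n, 1);
    for i = n:-1:1
        x(i) = (c(i) - U(i, i+1:n)*x(i+1:n)) / U(i, i);
    end
end
```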

Proof (simple illustration on a 3×3 example, appropriate for blackboard presentation). Perform a Gaussian elimination and an LU-factorisation of the matrix
$$ A = \begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix}. $$
Row-echelon reduction phase. Outside the matrix we record the row ratios used for elimination; the first step uses the ratios a₂₁/a₁₁ and a₃₁/a₁₁:
$$ \begin{pmatrix} a_{11} & a_{12} & a_{13}\\ 0 & a^{1}_{22} & a^{1}_{23}\\ 0 & a^{1}_{32} & a^{1}_{33} \end{pmatrix}, \qquad R_2 \to R_2 + (-a_{21}/a_{11}) R_1, \quad R_3 \to R_3 + (-a_{31}/a_{11}) R_1, $$
and so
$$ G_1 = \begin{pmatrix} 1 & 0 & 0\\ -a_{21}/a_{11} & 1 & 0\\ -a_{31}/a_{11} & 0 & 1 \end{pmatrix}. $$
Next,
$$ \begin{pmatrix} a_{11} & a_{12} & a_{13}\\ 0 & a^{1}_{22} & a^{1}_{23}\\ 0 & 0 & a^{2}_{33} \end{pmatrix}, \qquad R_3 \to R_3 + \left(-a^{1}_{32}/a^{1}_{22}\right) R_2, $$
and so
$$ G_2 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -a^{1}_{32}/a^{1}_{22} & 1 \end{pmatrix}. $$
The Gaussian elimination terminates and we identify the upper-triangular matrix U as
$$ U = \begin{pmatrix} a_{11} & a_{12} & a_{13}\\ 0 & a^{1}_{22} & a^{1}_{23}\\ 0 & 0 & a^{2}_{33} \end{pmatrix} = G_2 G_1 A = \left(\prod_{i=2}^{1} G_i\right) A. $$
Back-substitution phase. Renaming elements we have
$$ \begin{pmatrix} u_{11} & u_{12} & u_{13}\\ 0 & u_{22} & u_{23}\\ 0 & 0 & u_{33} \end{pmatrix}\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} = \begin{pmatrix} c_1\\ c_2\\ c_3 \end{pmatrix}. $$
Solving from the bottom up gives
$$ x_3 = c_3/u_{33}, \qquad x_2 = (c_2 - u_{23}x_3)/u_{22}, \qquad x_1 = (c_1 - u_{12}x_2 - u_{13}x_3)/u_{11}, $$
or in general
$$ x_i = \frac{c_i - \sum_{j=i+1}^{n} u_{ij} x_j}{u_{ii}}. $$
LU-factorisation. We then have
$$ G_2 G_1 A = U \quad\text{so}\quad A = G_1^{-1} G_2^{-1} U = LU. $$
The inverses of G₁ and G₂ are
$$ G_1^{-1} = \begin{pmatrix} 1 & 0 & 0\\ a_{21}/a_{11} & 1 & 0\\ a_{31}/a_{11} & 0 & 1 \end{pmatrix}, \qquad G_2^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & a^{1}_{32}/a^{1}_{22} & 1 \end{pmatrix}, $$
since the reverse process uses the same coefficients with the opposite sign.
Finally, check that L is indeed lower-triangular:
$$ L = \prod_{i=1}^{2} G_i^{-1} = G_1^{-1} G_2^{-1} = \begin{pmatrix} 1 & 0 & 0\\ a_{21}/a_{11} & 1 & 0\\ a_{31}/a_{11} & a^{1}_{32}/a^{1}_{22} & 1 \end{pmatrix}. \qquad \square $$


Proof (Gaussian elimination, general case). Step 1. The sequence of operations transforming the set of equations Ax = b to Ux = c is as follows. First write the augmented matrix [A|b],
$$ \begin{pmatrix} a_{11} & a_{12} & a_{13} & \ldots & a_{1n} & b_1\\ a_{21} & a_{22} & a_{23} & \ldots & a_{2n} & b_2\\ a_{31} & a_{32} & a_{33} & \ldots & a_{3n} & b_3\\ \vdots & & & & & \vdots\\ a_{n1} & a_{n2} & a_{n3} & \ldots & a_{nn} & b_n \end{pmatrix}. $$
Subtract
$$ l_{r1} = \frac{a_{r1}}{a_{11}} $$
times row 1 from row r, for r = 2, 3, ..., n. This operation can be written as a product of [A|b] with an elementary matrix G₁ to give
$$ G_1[A|b] = \begin{pmatrix} 1 & 0 & 0 & \ldots & 0\\ -a_{21}/a_{11} & 1 & 0 & \ldots & 0\\ -a_{31}/a_{11} & 0 & 1 & \ldots & 0\\ \vdots & & & \ddots & \\ -a_{n1}/a_{11} & 0 & 0 & \ldots & 1 \end{pmatrix}[A|b] = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \ldots & a_{1n} & b_1\\ 0 & a'_{22} & a'_{23} & \ldots & a'_{2n} & b'_2\\ 0 & a'_{32} & a'_{33} & \ldots & a'_{3n} & b'_3\\ \vdots & & & & & \vdots\\ 0 & a'_{n2} & a'_{n3} & \ldots & a'_{nn} & b'_n \end{pmatrix}. $$
The algorithm is then repeated on the smaller set of equations beginning at (row, column) = (2, 2), then (3, 3), etc., with matrices Gi defined similarly to G₁, until the final step gives the form Ux = c with U upper triangular,
$$ \left(\prod_{i=1}^{n-1} G_i\right)[A|b] = \begin{pmatrix} u_{11} & u_{12} & \ldots & u_{1n} & c_1\\ 0 & u_{22} & \ldots & u_{2n} & c_2\\ \vdots & & \ddots & \vdots & \vdots\\ 0 & 0 & \ldots & u_{nn} & c_n \end{pmatrix} = [U|c], $$
as required.

Step 2. This special form can be used to find the solution by back-substitution. Starting with the last row, xn = cn/unn. With xn known we find x_{n−1} by solving a linear equation, etc. This process is summarised in the following formula, with xi the i-th component of the vector x and ci the i-th component of the vector c:
$$ x_i = \frac{c_i - \sum_{j=i+1}^{n} u_{ij} x_j}{u_{ii}}. $$
Taken in the order i = n, n−1, ..., 1, the right-hand side is known at each step.

Proof (LU-factorisation). From the proof of Claim 4.9 we notice that the row-echelon reduction step of the Gaussian elimination process can be written as
$$ \left(\prod_{i=1}^{n-1} G_i\right)[A|b] = [U|c]. $$
Then A can be factorised as
$$ A = \left(\prod_{i=1}^{n-1} G_i^{-1}\right) U. $$
Here Gi is the matrix representing elimination using row i after step i − 1 of the Gaussian elimination algorithm; therefore Gi and its inverse are given by
$$ G_i = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & -l_{(i+1)i} & \ddots & \\ & & \vdots & & \\ & & -l_{ni} & & 1 \end{pmatrix}, \qquad G_i^{-1} = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & l_{(i+1)i} & \ddots & \\ & & \vdots & & \\ & & l_{ni} & & 1 \end{pmatrix}, $$
respectively, where, as a reminder,
$$ l_{ri} = \frac{A^{(i-1)}_{ri}}{A^{(i-1)}_{ii}}. $$
The product G₁⁻¹ ⋯ G_{n−1}⁻¹ is lower triangular,
$$ L = G_1^{-1}\cdots G_{n-1}^{-1} = \begin{pmatrix} 1 & 0 & 0 & \ldots & 0\\ l_{21} & 1 & 0 & \ldots & 0\\ l_{31} & l_{32} & 1 & \ldots & 0\\ \vdots & & & \ddots & \\ l_{n1} & l_{n2} & l_{n3} & \ldots & 1 \end{pmatrix}. \qquad \square $$

Example 4.11. Use Gaussian elimination to solve the system of equations
$$ \begin{aligned} 8x + y - z &= 8,\\ 2x + y + 9z &= 12,\\ x - 7y + 2z &= -4. \end{aligned} $$

Solution. The augmented matrix is
$$ \begin{pmatrix} 8 & 1 & -1 & 8\\ 2 & 1 & 9 & 12\\ 1 & -7 & 2 & -4 \end{pmatrix}. $$
Step 1: use the first row to remove all the remaining entries in the first column of the augmented matrix by suitable row operations,
$$ \begin{pmatrix} 8 & 1 & -1 & 8\\ 0 & 3/4 & 37/4 & 10\\ 0 & -57/8 & 17/8 & -5 \end{pmatrix}, $$
where the transformations performed are
$$ R_2 \to R_2 - \tfrac{2}{8} R_1, \qquad R_3 \to R_3 - \tfrac{1}{8} R_1. $$
Step 2: eliminate all other entries in the second column of the augmented matrix to obtain
$$ \begin{pmatrix} 8 & 1 & -1 & 8\\ 0 & 3/4 & 37/4 & 10\\ 0 & 0 & 90 & 90 \end{pmatrix}, $$
where the transformation performed is
$$ R_3 \to R_3 + \tfrac{19}{2} R_2. $$
Now solve by backward substitution:
$$ 90z = 90 \;\Rightarrow\; z = 1, \qquad \tfrac{3}{4}y = -\tfrac{37}{4}z + 10 \;\Rightarrow\; y = 1, \qquad 8x = -y + z + 8 \;\Rightarrow\; x = 1, $$
and so the solution is x = 1, y = 1, z = 1. ^
Example 4.12. Find an LU decomposition of the matrix
$$ A = \begin{pmatrix} 8 & 1 & -1\\ 2 & 1 & 9\\ 1 & -7 & 2 \end{pmatrix}. $$

Solution. Outside the matrix we record the row ratios used for elimination.
Step 1 (ratios 2/8 = 1/4 and 1/8):
$$ \begin{pmatrix} 8 & 1 & -1\\ 0 & 3/4 & 37/4\\ 0 & -57/8 & 17/8 \end{pmatrix}, \qquad R_2 \to R_2 - \tfrac{2}{8} R_1, \quad R_3 \to R_3 - \tfrac{1}{8} R_1. $$
Step 2 (ratio −19/2):
$$ \begin{pmatrix} 8 & 1 & -1\\ 0 & 3/4 & 37/4\\ 0 & 0 & 90 \end{pmatrix}, \qquad R_3 \to R_3 - \left(-\tfrac{19}{2}\right) R_2. $$
Gaussian elimination terminates and we identify the upper-triangular matrix U as
$$ U = \begin{pmatrix} 8 & 1 & -1\\ 0 & 3/4 & 37/4\\ 0 & 0 & 90 \end{pmatrix}, $$
and L is composed of the recorded row ratios,
$$ L = \begin{pmatrix} 1 & 0 & 0\\ 1/4 & 1 & 0\\ 1/8 & -19/2 & 1 \end{pmatrix}. $$
^
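The procedure of Example 4.12 is easily automated. A Matlab sketch of Gaussian elimination without pivoting that returns the factors (the function name is ours):

```matlab
function [L, U] = LUFactor(A)
% Doolittle LU-factorisation by Gaussian elimination (no pivoting).
    n = size(A, 1);
    L = eye(n);
    U = A;
    for k = 1:n-1
        for r = k+1:n
            L(r, k) = U(r, k) / U(k, k);             % row ratio l_rk
            U(r, :) = U(r, :) - L(r, k) * U(k, :);   % eliminate entry (r, k)
        end
    end
end
```

Calling LUFactor([8 1 -1; 2 1 9; 1 -7 2]) reproduces the L and U found above.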

4.2.3 The Thomas tridiagonal algorithm

The Thomas algorithm (after Llewellyn Thomas) is a simplified form of Gaussian elimination for the case of a tridiagonal matrix A.
Example 4.13 (Tridiagonal algorithm). Use Gaussian elimination to solve the system of equations Tx = F, where T is a general tridiagonal matrix,
$$ T = \begin{pmatrix} a_1 & b_1 & 0 & 0 & \cdots & \cdots & 0\\ c_2 & a_2 & b_2 & 0 & \cdots & \cdots & 0\\ 0 & c_3 & a_3 & b_3 & 0 & \cdots & \vdots\\ \vdots & & \ddots & \ddots & \ddots & & \vdots\\ 0 & \cdots & \cdots & \cdots & 0 & c_n & a_n \end{pmatrix}. $$

Solution. First, the row operation R₂ → R₂ − (c₂/a₁)R₁ is used to remove c₂, giving the tridiagonal matrix
$$ T^{(1)} = \begin{pmatrix} a_1 & b_1 & 0 & 0 & \cdots & \cdots & 0\\ 0 & \hat a_2 & b_2 & 0 & \cdots & \cdots & 0\\ 0 & c_3 & a_3 & b_3 & 0 & \cdots & \vdots\\ \vdots & & \ddots & \ddots & \ddots & & \vdots\\ 0 & \cdots & \cdots & \cdots & 0 & c_n & a_n \end{pmatrix}, $$
where â₂ and f̂₂ are defined by
$$ a_2 \to \hat a_2 = a_2 - \frac{c_2}{a_1} b_1, \qquad f_2 \to \hat f_2 = f_2 - \frac{c_2}{a_1} f_1. $$
This algorithm is recursive. The row operation affects only a₂ and f₂; the second row of T⁽¹⁾ can now be used to remove c₃, provided â₂ is non-zero. This will always be true provided T is non-singular. After (n−1) repetitions of this algorithm, in which the k-th repetition operates on the k-th and (k+1)-th rows of T⁽ᵏ⁾, the original equations Tx = F will be reduced to the new system T⁽ⁿ⁻¹⁾x = F̂, where
$$ T^{(n-1)} = \begin{pmatrix} a_1 & b_1 & 0 & 0 & \cdots & \cdots & 0\\ 0 & \hat a_2 & b_2 & 0 & \cdots & \cdots & 0\\ 0 & 0 & \hat a_3 & b_3 & 0 & \cdots & \vdots\\ \vdots & & & \ddots & \ddots & & \vdots\\ 0 & \cdots & \cdots & \cdots & 0 & 0 & \hat a_n \end{pmatrix}. $$
Since T⁽ⁿ⁻¹⁾ is now an upper-triangular matrix, T⁽ⁿ⁻¹⁾x = F̂ can be solved for x by backward substitution. ^
The computational effort in this algorithm is directly proportional to n, the number of equations in the system.
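A Matlab sketch of the Thomas algorithm, with the diagonals stored as vectors (a the main diagonal, b the super-diagonal, c the sub-diagonal; c(1) and b(n) are unused; names are ours):

```matlab
function x = Thomas(c, a, b, f)
% Solve the tridiagonal system T x = f by the Thomas algorithm,
% assuming no pivoting is required (T non-singular and well behaved).
    n = length(a);
    for i = 2:n                          % forward elimination sweep
        w    = c(i) / a(i-1);
        a(i) = a(i) - w * b(i-1);
        f(i) = f(i) - w * f(i-1);
    end
    x = zeros(n, 1);
    x(n) = f(n) / a(n);
    for i = n-1:-1:1                     % backward substitution sweep
        x(i) = (f(i) - b(i) * x(i+1)) / a(i);
    end
end
```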

4.2.4 Pivoting

In Gaussian elimination

§ 4.14. • Gaussian elimination works without problems unless the diagonal entry of the row being
used to perform the elimination is zero.
• If the diagonal entry is zero we must reorder the rows to proceed.
• Reordering of rows and columns in this context is called pivoting.
• The case where only rows are swapped is called partial pivoting.
Example 4.15. Use Gaussian elimination to solve the system of equations
$$ \begin{aligned} 8x + y - z &= 8,\\ 8x + y + 36z &= 12,\\ x - 7y + 2z &= -4. \end{aligned} $$

Solution. The augmented matrix is
$$ \begin{pmatrix} 8 & 1 & -1 & 8\\ 8 & 1 & 36 & 12\\ 1 & -7 & 2 & -4 \end{pmatrix}. $$
Step 1: use the first row to remove all the remaining entries in the first column of the augmented matrix by suitable row operations,
$$ \begin{pmatrix} 8 & 1 & -1 & 8\\ 0 & 0 & 37 & 4\\ 0 & -57/8 & 17/8 & -5 \end{pmatrix}, $$
where the transformations performed are
$$ R_2 \to R_2 - \tfrac{8}{8} R_1, \qquad R_3 \to R_3 - \tfrac{1}{8} R_1. $$
At this stage the diagonal entry in the second row is zero and so we cannot use multiples of this row to eliminate the entries below the diagonal in column 2. When the diagonal entry becomes zero like this we pivot. That is, we swap rows or columns to make sure that the diagonal entry is non-zero and proceed as before. ^
Example 4.16. Use Gaussian elimination with partial pivoting to solve the system of equations
8x + y−z = 8
2x + y+9z = 12
x −7y+2z = −4

Solution. The augmented matrix is


 8 1 −1 8 
 2 1 9 12 
 
 1 −7 2 −4 

The row with largest modulus in the first entry is already the first row, so there is no need to swap any rows.
 
Use the first row to remove all the remaining entries in the first column of the augmented matrix by suitable
row operations.
 8 1 −1 8 
 0 3/4 37/4 10 

 0 −57/8 17/8 −5 
 
where the transformations performed are
 
2 1
R2 → R2 − R1, R3 → R3 − R1 .
8 8
Now examine the entries in column 2 on, and below, row 2. We swap rows so that the modulus of $a'_{22}$ is as large
as possible. In this case we swap row 3 and row 2 to achieve this. The operation is written $R_2 \leftrightarrow R_3$, leaving
\[
\left(\begin{array}{ccc|c} 8 & 1 & -1 & 8 \\ 0 & -57/8 & 17/8 & -5 \\ 0 & 3/4 & 37/4 & 10 \end{array}\right).
\]
Now proceed as usual. Add multiples of row 2 to rows 3 and onwards to make the entries below the diagonal
in column 2 zero,
\[
\left(\begin{array}{ccc|c} 8 & 1 & -1 & 8 \\ 0 & -57/8 & 17/8 & -5 \\ 0 & 0 & 180/19 & 180/19 \end{array}\right);
\]
the transformation that does this is
\[
R_3 \to R_3 + \tfrac{2}{19} R_2 .
\]
Using backsubstitution we obtain, as before,
x = 1, y = 1, z = 1.
^
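A possible implementation of Gaussian elimination with partial pivoting, applied to the system of Example 4.16, is sketched below. The function name gauss_solve and the use of NumPy are illustrative assumptions; the algorithm is the swap-then-eliminate procedure described above, followed by back substitution.

```python
import numpy as np

def gauss_solve(A, b):
    """Gaussian elimination with partial pivoting and back substitution;
    a sketch of the procedure of Example 4.16."""
    A = np.asarray(A, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n = len(b)
    for k in range(n - 1):
        # Partial pivoting: bring the largest-modulus entry of column k
        # (on or below the diagonal) into the pivot position.
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]
            b[[k, p]] = b[[p, k]]
        # Eliminate the entries below the pivot.
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    # Back substitution.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = [[8, 1, -1], [2, 1, 9], [1, -7, 2]]
b = [8, 12, -4]
print(gauss_solve(A, b))   # expected: [1. 1. 1.]
```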

Loss of Significance

§ 4.17 (Loss of Significance). As well as enabling the algorithm to continue when a pivot is exactly zero, pivoting also
plays an important role in making the algorithm numerically stable. The next example illustrates this.
Example 4.18. Use Gaussian elimination to solve the system of equations
0.0001x1 +1.00x2 = 1.00
1.00x1 +1.00x2 = 2.00
on a computer holding 3 significant figures only.
Solution. The actual solution is x1 ≈ 1.00010 and x2 ≈ 0.99990 to five decimal places. The augmented matrix is
\[
\left(\begin{array}{cc|c} 1.00\times 10^{-4} & 1.00 & 1.00 \\ 1.00 & 1.00 & 2.00 \end{array}\right).
\]
Eliminating we obtain
\[
\left(\begin{array}{cc|c} 1.00\times 10^{-4} & 1.00 & 1.00 \\ 0 & 1.00-10^{4} & 2.00-10^{4} \end{array}\right),
\]
which is stored correct to three significant figures as
\[
\left(\begin{array}{cc|c} 1.00\times 10^{-4} & 1.00 & 1.00 \\ 0 & -10^{4} & -10^{4} \end{array}\right).
\]
Back substitution gives the solution as x2 = 1 and x1 = 0, and it is clear that a drastic error has occurred.
If we repeat the problem using partial pivoting, the augmented matrix is
\[
\left(\begin{array}{cc|c} 1.00\times 10^{-4} & 1.00 & 1.00 \\ 1.00 & 1.00 & 2.00 \end{array}\right).
\]
We pivot to bring the largest (in modulus) entry in column 1 into row 1, that is $R_1 \leftrightarrow R_2$,
\[
\left(\begin{array}{cc|c} 1.00 & 1.00 & 2.00 \\ 1.00\times 10^{-4} & 1.00 & 1.00 \end{array}\right).
\]
Eliminating we get
\[
\left(\begin{array}{cc|c} 1.00 & 1.00 & 2.00 \\ 0 & 1.00-10^{-4} & 1.00-2\times 10^{-4} \end{array}\right),
\]
which the three-significant-figure computer will store as
\[
\left(\begin{array}{cc|c} 1.00 & 1.00 & 2.00 \\ 0 & 1.00 & 1.00 \end{array}\right),
\]
and back substitution gives x1 = 1.00 and x2 = 1.00. This is a lot closer to the actual solution. ^

LU-factorisation in the case of partial pivoting

§ 4.19. If we allow (or if it is necessary to do) pivoting then instead of producing lower-triangular L and
upper-triangular U with
A= LU
we produce L and U and a permutation matrix P such that
P A= LU
so that a permutation of the rows of A has the Doolittle LU decomposition.
To solve Ax = b we first multiply by the permutation matrix P
Ax = b, =⇒ P Ax = Pb =⇒ LU x = Pb
and use Lz = Pb and U x = z along with forward and back substitution to find the solution.
In order to construct the permutation matrix P, we must keep track of the row interchanges. It is easiest
to do this by storing a vector $v = (1,2,\dots,n)^T$ that records the row interchanges. This is illustrated by the
following example.

Example 4.20. Find a LU decomposition of the matrix A via Gaussian Elimination with pivoting, when
the matrix A is given by
 0 1/3 −25/18 4/3 
 1 3/2 −25/12 3 
 
A= 
 2 −1 1/2 0 
 
 −2 2 −1/6 1 

Solution. Step 1. Pivot: swap rows 1 and 3, $R_1 \leftrightarrow R_3$. We keep track of the permutations using a column
vector v, which becomes $v = (3,2,1,4)^T$, and the matrix becomes
\[
\begin{pmatrix} 2 & -1 & 1/2 & 0 \\ 1 & 3/2 & -25/12 & 3 \\ 0 & 1/3 & -25/18 & 4/3 \\ -2 & 2 & -1/6 & 1 \end{pmatrix} .
\]
Now eliminate using row 1 with $R_2 \to R_2 - (1/2) R_1$, $R_3 \to R_3 - (0) R_1$, $R_4 \to R_4 - (-1) R_1$,
storing the row multiples $1/2$, $0$, $-1$ (they will form the first column of L),
\[
\begin{pmatrix} 2 & -1 & 1/2 & 0 \\ 0 & 2 & -7/3 & 3 \\ 0 & 1/3 & -25/18 & 4/3 \\ 0 & 1 & 1/3 & 1 \end{pmatrix} .
\]
Step 2. No need to pivot, so just eliminate using row 2 with $R_3 \to R_3 - (1/6) R_2$ and $R_4 \to R_4 - (1/2) R_2$
(multipliers $1/6$ and $1/2$),
\[
\begin{pmatrix} 2 & -1 & 1/2 & 0 \\ 0 & 2 & -7/3 & 3 \\ 0 & 0 & -1 & 5/6 \\ 0 & 0 & 3/2 & -1/2 \end{pmatrix} .
\]
Next, pivot by swapping rows 3 and 4, $R_3 \leftrightarrow R_4$. This also swaps the corresponding row multiples (used to
construct L), and v becomes $(3,2,4,1)^T$,
\[
\begin{pmatrix} 2 & -1 & 1/2 & 0 \\ 0 & 2 & -7/3 & 3 \\ 0 & 0 & 3/2 & -1/2 \\ 0 & 0 & -1 & 5/6 \end{pmatrix} .
\]
Finally, eliminate using row 3 with $R_4 \to R_4 - (-2/3) R_3$,
\[
\begin{pmatrix} 2 & -1 & 1/2 & 0 \\ 0 & 2 & -7/3 & 3 \\ 0 & 0 & 3/2 & -1/2 \\ 0 & 0 & 0 & 1/2 \end{pmatrix} .
\]
The algorithm terminates yielding L and U,
\[
L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1/2 & 1 & 0 & 0 \\ -1 & 1/2 & 1 & 0 \\ 0 & 1/6 & -2/3 & 1 \end{pmatrix}, \qquad
U = \begin{pmatrix} 2 & -1 & 1/2 & 0 \\ 0 & 2 & -7/3 & 3 \\ 0 & 0 & 3/2 & -1/2 \\ 0 & 0 & 0 & 1/2 \end{pmatrix},
\]
and the permutation can be read from the final state of $v = (3,2,4,1)^T$, giving
\[
P = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix} .
\]
To conclude we have found matrices P, L and U such that P A= LU.
 
^
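The same bookkeeping can be carried out in code. The sketch below (the helper name lu_partial_pivot and the NumPy usage are illustrative assumptions) stores the multipliers in L, the reduced matrix in U and the row interchanges in a vector v, exactly as in Example 4.20; on the matrix above it reproduces the permutation (3, 2, 4, 1).

```python
import numpy as np

def lu_partial_pivot(A):
    """Doolittle LU factorisation with partial pivoting, returning L, U and
    the permutation vector v such that A[v] = L @ U (a sketch of the
    procedure of Example 4.20)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    U = A.copy()
    L = np.eye(n)
    v = np.arange(n)                      # records the row interchanges
    for k in range(n - 1):
        p = k + np.argmax(np.abs(U[k:, k]))
        if p != k:                        # swap rows of U, of v, and of the
            U[[k, p], :] = U[[p, k], :]   # already computed multipliers in L
            v[[k, p]] = v[[p, k]]
            L[[k, p], :k] = L[[p, k], :k]
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U, v

A = np.array([[0, 1/3, -25/18, 4/3],
              [1, 3/2, -25/12, 3],
              [2, -1, 1/2, 0],
              [-2, 2, -1/6, 1]])
L, U, v = lu_partial_pivot(A)
print(v + 1)                      # expected permutation vector (3, 2, 4, 1)
print(np.allclose(A[v], L @ U))   # True: P A = L U
```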

4.2.5 Solution of linear systems by LU-factorisation

Claim 4.21. Let A have the LU-factorisation A = LU. Then the solution of Ax = b is $x = U^{-1} L^{-1} b$.

Proof. Trivial. 

Remark 4.22. The expression $x = U^{-1} L^{-1} b$ can be written out in explicit form ready for computation. There
is no need to invert L and U explicitly.
Example 4.23. Write out the explicit form of x =U −1 L −1 b.
Solution. When the matrix A can be expressed as A = LU, we take advantage of the special structure of
L and U to solve Ax = b. Write
Ax = LU x = b
and define U x = z. Then z is the solution to Lz = b, and x is the solution to U x = z. As L is lower-triangular,
z is found by forward substitution, and as U is upper triangular x is found by back substitution.
Forward substitution. In component form,
\[
\sum_{j=1}^{n} L_{ij} z_j = b_i,
\qquad
\sum_{j=1}^{i-1} L_{ij} z_j + L_{ii} z_i + \sum_{j=i+1}^{n} L_{ij} z_j = b_i,
\qquad
z_i = \frac{b_i - \sum_{j=1}^{i-1} L_{ij} z_j}{L_{ii}},
\]
as $L_{ij} = 0$ for $j > i$. If these equations are taken in the order $i = 1,2,\dots,n$ the right-hand side is known at every
stage.
Backward substitution. Once z is determined, solve Ux = z by back substitution; in components,
\[
\sum_{j=1}^{n} U_{ij} x_j = z_i,
\qquad
\sum_{j=1}^{i-1} U_{ij} x_j + U_{ii} x_i + \sum_{j=i+1}^{n} U_{ij} x_j = z_i,
\qquad
x_i = \frac{z_i - \sum_{j=i+1}^{n} U_{ij} x_j}{U_{ii}},
\]
as $U_{ij} = 0$ for $j < i$. If these equations are taken in the order $i = n, n-1, \dots, 1$ the right-hand side is known
at every stage. ^
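Forward and backward substitution are only a few lines each. The sketch below assumes the factors L and U are already available (here the Doolittle factors of the matrix of Example 4.16 are typed in by hand; they are derived in Example 4.32 below) and solves Ax = b via Lz = b, Ux = z.

```python
import numpy as np

def forward_sub(L, b):
    """Solve L z = b for lower-triangular L by forward substitution."""
    n = len(b)
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

def backward_sub(U, z):
    """Solve U x = z for upper-triangular U by back substitution."""
    n = len(z)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[1, 0, 0], [1/4, 1, 0], [1/8, -19/2, 1]])
U = np.array([[8, 1, -1], [0, 3/4, 37/4], [0, 0, 90]])
b = np.array([8.0, 12.0, -4.0])
x = backward_sub(U, forward_sub(L, b))
print(x)   # expected: [1. 1. 1.]
```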

4.2.6 Efficient methods for LU-factorisation

Claim 4.24 (Uniqueness). The LU-factorisation is not unique.

Proof. The number of entries of L to be determined is n(n+1)/2 and the number of entries of U to be
determined is n(n+1)/2 giving a total of n2 +n unknowns. The number of equations in the expression
A= LU
is n2 . Thus we have n2 equations for n2 +n unknowns and must provide n equations in order to make the
decomposition unique. There is a variety of choices for these equations and therefore the LU-factorisation
is not unique. 

§ 4.25 (Types of LU-factorisation). Different types of LU-factorisations are possible because of the freedom
to choose n additional equations. Typical alternatives are
1. Doolittle LU-decomposition – In Doolittle LU-decomposition the n extra equations are chosen as
Lii = 1, i = 1,2,...n
so that the diagonal entries of L are 1.
(The Doolittle alternative is used in these lecture notes.)

2. Crout LU-decomposition In Crout LU-decomposition the n extra equations are chosen as


Uii = 1, i = 1,2,...n
so that the diagonal entries of U are 1.
3. Cholesky LU-decomposition In this case the n extra equations are chosen as
Lii =Uii, i = 1,2,...n.
The Cholesky decomposition does not always exist.
This is most easily seen by considering the determinant. Suppose the matrix A has $\det A < 0$ and that
the Cholesky factorisation exists. Then
\[
\det A = (\det L)(\det U) = \prod_{i=1}^{n} L_{ii} \prod_{i=1}^{n} U_{ii} = \prod_{i=1}^{n} (L_{ii} U_{ii}) = \prod_{i=1}^{n} L_{ii}^2 > 0,
\]
which is a contradiction, so the Cholesky factorisation does not exist.
§ 4.26. LU-factorisation is of numerical interest because there exist efficient methods to compute LU-
factorisation.
§ 4.27. The Crout’s algorithm solves the set of n2 +n equations of LU-factorisation quite trivially by just
arranging them in a certain order!
Claim 4.28 (Column-wise Crout’s algorithm). Let A= LU be a LU-factorisation of a given N ×N matrix
A into a product of a unit-lower-triangular matrix L with lii = 1 and a upper-triangular matrix U. The
elements of U and L are given by
\[
u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj}, \qquad \text{for } i = 1,2,\dots,j, \tag{4.2}
\]
\[
l_{ij} = \frac{1}{u_{jj}} \Big( a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \Big), \qquad \text{for } i = j+1, j+2, \dots, N, \tag{4.3}
\]
for every $j = 1,2,\dots,N$.

Proof. The elements $L_{ij}$ and $U_{ij}$ are determined from the set of equations
\[
\sum_{k=1}^{n} L_{ik} U_{kj} = A_{ij}, \qquad L_{ii} = 1 .
\]
The elements $U_{ij}$ satisfy
\[
\sum_{k=1}^{i-1} L_{ik} U_{kj} + L_{ii} U_{ij} + \sum_{k=i+1}^{n} L_{ik} U_{kj} = A_{ij},
\]
and so
\[
U_{ij} = \Big( A_{ij} - \sum_{k=1}^{i-1} L_{ik} U_{kj} - \sum_{k=i+1}^{n} L_{ik} U_{kj} \Big) \Big/ (L_{ii} = 1) .
\]
Since L is lower triangular, $L_{i,i+1} = \dots = L_{i,n} = 0$, so the second sum vanishes, leaving only
\[
U_{ij} = A_{ij} - \sum_{k=1}^{i-1} L_{ik} U_{kj}, \qquad \text{for } i = 1,2,\dots,j .
\]
Similarly, the elements $L_{ij}$ satisfy
\[
\sum_{k=1}^{j-1} L_{ik} U_{kj} + L_{ij} U_{jj} + \sum_{k=j+1}^{n} L_{ik} U_{kj} = A_{ij},
\]
and so
\[
L_{ij} = \Big( A_{ij} - \sum_{k=1}^{j-1} L_{ik} U_{kj} - \sum_{k=j+1}^{n} L_{ik} U_{kj} \Big) \Big/ U_{jj} .
\]
Since U is upper triangular, $U_{j+1,j} = \dots = U_{n,j} = 0$, so the second sum vanishes, leaving only
\[
L_{ij} = \Big( A_{ij} - \sum_{k=1}^{j-1} L_{ik} U_{kj} \Big) \Big/ U_{jj}, \qquad \text{for } i = j+1, j+2, \dots, n .
\]

Calculating in the order j = 1,2,···,n all u’s and l’s that occur in the right-hand side of (4.2) and (4.3) are
already determined by the time they are needed. 

Example 4.29. (Column-wise Crout’s algorithm) Use the Column-wise Crout algorithm to find the
LU-factorisation of
\[
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} .
\]
Solution. LU = A, so
\[
\begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix}
\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}
= \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} .
\]
Working column-by-column. First column j = 1
 
a11 = 1∗u11 +0∗0+0∗0 → u11 = a11
a21 = l21 u11 +1∗0+0∗0 → l21 = a21 /a11
a31 = l31 u11 +l32 ∗0+0∗1 → l31 = a31 /a11

Then column j = 2
a12 = 1∗u12 +0∗u22 +0∗0 → u12 = a12
a22 = l21 u12 +1∗u22 +0∗0 → u22 = a22 −a12 a21 /a11
a32 = l31 u12 +l32 ∗u22 +1∗0 → l32 = (a32 −a31 /a11 a12 )/(a22 −a12 a21 /a11 )

Then column j = 3
a13 = 1∗u13 +0∗u23 +0∗u33 → u13 = a13
a23 = l21 u13 +1∗u23 +0∗u33 → u23 = a23 −a21 /a11 a13
a33 = l31 u13 +l32 ∗u23 +1∗u33 → u33 = a33 −a31 /a11 a13 −(a32 −a31 /a11 a12 )/(a22 −a12 a21 /a11 )(a23 −a21 /a11 a13 )

^
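Equations (4.2) and (4.3) can be coded directly. A minimal sketch is given below (no pivoting, so non-zero pivots are assumed; the helper name crout_columnwise is hypothetical).

```python
import numpy as np

def crout_columnwise(A):
    """Column-wise Crout/Doolittle factorisation implementing (4.2)-(4.3):
    unit lower-triangular L (l_ii = 1) and upper-triangular U with A = L U."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    L = np.eye(N)
    U = np.zeros((N, N))
    for j in range(N):
        # Equation (4.2): entries of column j of U, for rows i = 0..j.
        for i in range(j + 1):
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        # Equation (4.3): entries of column j of L, for rows i = j+1..N-1.
        for i in range(j + 1, N):
            L[i, j] = (A[i, j] - L[i, :j] @ U[:j, j]) / U[j, j]
    return L, U

A = np.array([[8.0, 1.0, -1.0], [2.0, 1.0, 9.0], [1.0, -7.0, 2.0]])
L, U = crout_columnwise(A)
print(np.allclose(L @ U, A))   # True
```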
§ 4.30 (Partial pivoting). Observe the following.
1. The u’s and l’s that occur in the right-hand side of (4.2) and (4.3) are already determined by the time
they are needed. This makes the method efficient.
2. The a’s corresponding to u’s and l’s that are already determined are never needed again. This allows
to overwrite the matrix A which is useful for pivoting, see below, and for reducing computer memory
requirements.
3. Equation (4.2) in the case $i = j$ (its final application) is exactly the same as equation (4.3) except for the
division in the latter; in both cases the upper limit of the sum is $k = j-1\,(=i-1)$. This fact allows partial
pivoting to be incorporated (selecting a pivot, i.e. a diagonal element $u_{jj}$, that is large in modulus for the division
in (4.3)), which is essential for the method to work. Practically, compute
$l'_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}$, swap the rows of A (which you have been overwriting with L and U)
so that the row attaining $m = \max\{|u_{jj}|, |l'_{j+1,j}|, |l'_{j+2,j}|, \dots, |l'_{N,j}|\}$ is brought into row $j$, and only then
divide by the pivot $m$ to fully implement equation (4.3). Keep a permutation matrix P, as you are now actually factorising $PA = LU$ instead.
§ 4.31. An alternative strategy also called “Crout’s algorithm” is to solve the equations crosswise as in
the next example. However, pivoting is difficult to incorporate in this CROSSWISE Crout’s algorithm.

Example 4.32. (CROSSWISE Crout’s algorithm) Find the Doolittle LU-decomposition of the matrix
 8 1 −1 
A=  2 1 9  .
 
 1 −7 2 
 
Solution. The Doolittle LU-decomposition,
 8 1 −1   1 0 0   U11 U12 U13 
A=  2 1 9  =  L21 1 0   0 U22 U23  .
  
 1 −7 2   L31 L32 1   0 0 U33 
We solve for the entries Li j and Ui j in a specific order. First use the equations given by the first row of A,
   
8 = U11
1 = U12
−1 = U13
then use the equations given by the remaining entries in the first column of A,
\[
2 = L_{21} U_{11} \implies L_{21} = \tfrac14, \qquad
1 = L_{31} U_{11} \implies L_{31} = \tfrac18 .
\]
Now use the remaining entries on the 2nd row of A,
\[
1 = L_{21} U_{12} + U_{22} \implies U_{22} = 1 - \tfrac14 (1) = \tfrac34, \qquad
9 = L_{21} U_{13} + U_{23} \implies U_{23} = 9 - \tfrac14 (-1) = \tfrac{37}{4},
\]
then use the remaining entries in the 2nd column of A,
\[
-7 = L_{31} U_{12} + L_{32} U_{22} \implies -7 - \tfrac18 (1) = \tfrac34 L_{32} \implies L_{32} = -\tfrac{19}{2},
\]
and finally the remaining entry in column 3 of A,
\[
2 = L_{31} U_{13} + L_{32} U_{23} + U_{33} \implies U_{33} = 2 - \tfrac18 (-1) - \big({-\tfrac{19}{2}}\big)\big(\tfrac{37}{4}\big) \implies U_{33} = 90 .
\]
To summarise, the Doolittle LU-decomposition of A is
\[
A = \begin{pmatrix} 8 & 1 & -1 \\ 2 & 1 & 9 \\ 1 & -7 & 2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 1/4 & 1 & 0 \\ 1/8 & -19/2 & 1 \end{pmatrix}
\begin{pmatrix} 8 & 1 & -1 \\ 0 & 3/4 & 37/4 \\ 0 & 0 & 90 \end{pmatrix} .
\]
^

4.3 Iterative methods for solutions of linear systems of equations

4.3.1 General formulation

Claim 4.33. The set of linear equations Ax = b can be written in the form x = (I − P−1 A)x + P−1 b where
P is any invertible matrix with dimensions same as those of A.

Proof. Consider the equivalent expressions


Ax = b,
Px + Ax −Px = b,
Px +(A−P)x = b,
\[
x = P^{-1} b + (I - P^{-1} A) x = H x + c .
\]
In the last line, H is a matrix and c is a constant vector. This is notation that we will find useful later. 

Remark 4.34. The sequence given by x (n+1) = H x n +c = (I −P−1 A)x n +P−1 b can be used as a fixed-point
iterative scheme for the solution of Ax = b if it converges. Fixed-point iterative schemes were introduced
in Chapter 1.

4.3.2 Gauss-Jacobi and Gauss-Seidel methods

Before we formulate the methods we make the following remarks.


Claim 4.35. An n×n matrix A can always be represented as
A= D+ L +U,
where D, L and U are a diagonal, lower-triangular with Lii = 0 and upper-triangular with Uii = 0 matrix,
respectively.

Proof. This is trivial, e.g. the 3×3 case
\[
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
= \begin{pmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & 0 \\ 0 & 0 & a_{33} \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & 0 \\ a_{21} & 0 & 0 \\ a_{31} & a_{32} & 0 \end{pmatrix}
+ \begin{pmatrix} 0 & a_{12} & a_{13} \\ 0 & 0 & a_{23} \\ 0 & 0 & 0 \end{pmatrix},
\]
or A = D + L + U.


Note that L and U are different from L and U of the LU-factorization!!!


Claim 4.36. Let D be a diagonal matrix with elements dii . Then D−1 is also diagonal with elements 1/dii .

Proof. This is also trivial; to prove it, multiply
\[
D D^{-1} = \begin{pmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & 0 \\ 0 & 0 & a_{33} \end{pmatrix}
\begin{pmatrix} 1/a_{11} & 0 & 0 \\ 0 & 1/a_{22} & 0 \\ 0 & 0 & 1/a_{33} \end{pmatrix} = I .
\]

We can now formulate the Gauss-Jacobi and the Gauss-Seidel methods as particular cases of Remark 4.34.
Example 4.37. (The Gauss-Jacobi method) Let A= D+ L +U where D is the main diagonal of A, and L
and U denote respectively the lower and upper triangular sections of A. Formulate an iterative method
for Ax = b with P = D.
Solution. Consider the transformations
Ax = b,
(D+ L +U)x = b
Dx = −(L +U)x +b,
x = −D−1 (L +U)x +D−1 b.
Then
P = D, H = −D−1 (L +U), c = D−1 b.
To use the scheme it is most useful to write this same formula in explicit elementwise form. ^
Definition 4.38. The iterative scheme
x n+1 = −D−1 (L +U)x n +D−1 b, x0 given.
is called the Gauss-Jacobi method for solution of Ax = b.

Remark 4.39. The explicit form of the Gauss-Jacobi method is
\[
x_i^{m+1} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{m} - \sum_{j=i+1}^{n} a_{ij} x_j^{m} \Big), \qquad i = 1,\dots,n. \tag{4.4}
\]
Example 4.40. (The Gauss-Seidel method) Let A = D+ L +U where D is the main diagonal of A, and L
and U denote respectively the lower and upper triangular sections of A. Formulate an iterative method
for Ax = b with P = D+ L.
Solution. Consider the transformations
Ax = b,
(D+ L +U)x = b
(D+ L)x = −U x +b,
x = −(D+ L)−1U x +(D+ L)−1 b.
Then
P = D+ L, H = −(D+ L)−1U, c = (D+ L)−1 b.
To use the scheme it is most useful to write this same formula in explicit elementwise form. ^
Definition 4.41. The iterative scheme
x n+1 = −(D+ L)−1U x n +(D+ L)−1 b, x0 given.
is called the Gauss-Seidel method of solution of Ax = b.
Remark 4.42. The explicit form of the Gauss-Seidel method is
\[
x_i^{m+1} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{m+1} - \sum_{j=i+1}^{n} a_{ij} x_j^{m} \Big), \qquad i = 1,\dots,n. \tag{4.5}
\]
Remark 4.43. The Gauss-Jacobi and the Gauss-Seidel methods are convenient because D, L and U are
easily found and D and (D + L) are easily inverted (although we do not show this for (D + L)).

This is illustrated below.


Example 4.44. Use Gauss-Jacobi and Gauss-Seidel algorithms to solve the system of equations
8x + y − z = 8
2x + y + 9z = 12
x − 7y + 2z = −4.

Solution. The original equations are re-ordered (for reasons to be discussed later) to give
\[
8x + y - z = 8, \qquad x - 7y + 2z = -4, \qquad 2x + y + 9z = 12 .
\]
Now we implement formulas (4.4) and (4.5). Divide by the diagonal element,
\[
x = 1 - y/8 + z/8, \qquad y = 4/7 + x/7 + 2z/7, \qquad z = 4/3 - 2x/9 - y/9 .
\]
Then these equations form the basis of the Gauss-Jacobi and Gauss-Seidel iterative schemes, which in this
instance are respectively
\[
\text{Gauss-Jacobi:} \quad
x_1^{(n+1)} = 1 - \frac{x_2^{(n)}}{8} + \frac{x_3^{(n)}}{8}, \quad
x_2^{(n+1)} = \frac{4}{7} + \frac{x_1^{(n)}}{7} + \frac{2 x_3^{(n)}}{7}, \quad
x_3^{(n+1)} = \frac{4}{3} - \frac{2 x_1^{(n)}}{9} - \frac{x_2^{(n)}}{9},
\]
\[
\text{Gauss-Seidel:} \quad
x_1^{(n+1)} = 1 - \frac{x_2^{(n)}}{8} + \frac{x_3^{(n)}}{8}, \quad
x_2^{(n+1)} = \frac{4}{7} + \frac{x_1^{(n+1)}}{7} + \frac{2 x_3^{(n)}}{7}, \quad
x_3^{(n+1)} = \frac{4}{3} - \frac{2 x_1^{(n+1)}}{9} - \frac{x_2^{(n+1)}}{9} .
\]

Both schemes take the starting values $x_1^{(0)} = x_2^{(0)} = x_3^{(0)} = 0$. Convergence is rapid, with the solution
$x_1 = x_2 = x_3 = 1$ being obtained approximately after 3 iterations.

Iteration | Gauss-Jacobi: x1, x2, x3 | Gauss-Seidel: x1, x2, x3
0         | 0, 0, 0                  | 0, 0, 0
1         | 1, 4/7, 4/3              | 1, 5/7, 65/63
2         | 53/56, 23/21, 26/27      | 253/252, 1781/1764, ∼1
^
§ 4.45. Gauss-Seidel converges faster because we use more accurate values at each step.
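Formulas (4.4) and (4.5) can be implemented directly. Below is a short sketch (hypothetical helper names, a fixed iteration count rather than a stopping test) applied to the re-ordered, diagonally dominant system of Example 4.44.

```python
import numpy as np

def jacobi(A, b, x0, iters):
    """Gauss-Jacobi iteration (4.4); convergence is not checked here."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    n = len(b)
    x = np.asarray(x0, float).copy()
    for _ in range(iters):
        x_new = np.empty(n)
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x_new[i] = (b[i] - s) / A[i, i]
        x = x_new
    return x

def gauss_seidel(A, b, x0, iters):
    """Gauss-Seidel iteration (4.5): updated components are used
    immediately within the same sweep."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    n = len(b)
    x = np.asarray(x0, float).copy()
    for _ in range(iters):
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

A = np.array([[8.0, 1.0, -1.0], [1.0, -7.0, 2.0], [2.0, 1.0, 9.0]])
b = np.array([8.0, -4.0, 12.0])
print(jacobi(A, b, np.zeros(3), 10))        # approx [1, 1, 1]
print(gauss_seidel(A, b, np.zeros(3), 10))  # approx [1, 1, 1]
```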

4.3.3 Convergence of linear iterative methods

4.3.3.1 Absolute condition

Definition 4.46. Let H be an n×n matrix with eigenvalues $\lambda_i$. The maximum modulus of the eigenvalues,
\[
\rho(H) = \max_{i=1\dots n} |\lambda_i| ,
\]
is called the spectral radius of H.

Claim 4.47. The sequence {x n } given by the iterated map x n+1 = G(x n ) = H x n +c converges starting from
any initial guess x 0 if and only if the spectral radius of H is less than unity, ρ(H) < 1.

Proof. Let x ∗ be the fixed point for which x ∗ = G(x ∗ ). Define the error after the n-th iteration to be
en = x ∗ − x n .
Then
en = x ∗ − x n = G(x ∗ )−G(x n−1 ) = H x ∗ +c−H x n−1 −c = H(x ∗ − x n−1 ) = Hen−1 .
Then
en = Hen−1 = H 2 en−2 = ... = H n e0 .
Let $(\lambda_k, v_k)$, $k = 1\dots n$, be the eigenpairs of H. Then the eigenvectors $\{v_k, k = 1\dots n\}$ form a complete basis
of $\mathbb{R}^n$, so we can decompose the error vector $e_0$ in terms of the eigenvectors,
\[
e_0 = \sum_{k=1}^{n} c_k v_k ,
\]
and
\[
e_n = H^n e_0 = \sum_{k=1}^{n} \lambda_k^n c_k v_k .
\]
For convergence we require the norm (any norm) of the error to tend to zero,
\[
\lim_{n\to\infty} \| e_n \| = 0 .
\]
Since the norm of the error satisfies
\[
\| e_n \| = \Big\| \sum_{k=1}^{n} \lambda_k^n c_k v_k \Big\| \le \rho(H)^n \Big\| \sum_{k=1}^{n} c_k v_k \Big\| = \rho(H)^n \| e_0 \| ,
\]
where $\rho(H) = \max_{k=1\dots n} |\lambda_k|$ is the spectral radius of H, it is clear that the sequence converges if and only if $\rho(H) < 1$. 

Example 4.48. Show that Gauss-Jacobi iteration for the coefficient matrix A, where
 1 3 0 
A=  2 2 1 
 
 0 1 2 
will diverge.
 

Solution. The Gauss-Jacobi method


x n+1 = −D−1 (L +U)x n +D−1 b,
will converge if the spectral radius ρ(−D−1 (L +U)) < 1. So we compute the eigenvalues of −D−1 (L +U),
that is, solve the equation
−D−1 (L +U)v = λv,
(L +U)v +Dλv = 0,
((L +U)+λD)v = 0,
that has non-trivial solutions when
det((L +U)+λD) = 0.
Calculate the eigenvalues,
\[
\begin{vmatrix} \mu & 3 & 0 \\ 2 & 2\mu & 1 \\ 0 & 1 & 2\mu \end{vmatrix}
= \mu (4\mu^2 - 1) - 3(4\mu) = 0,
\]
so that
\[
\mu = 0, \qquad \mu = -\sqrt{\tfrac{13}{4}}, \qquad \mu = \sqrt{\tfrac{13}{4}} .
\]
There are some eigenvalues whose modulus is greater than 1 and so the Jacobi iteration for this matrix
will diverge in general. ^
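The convergence test of Claim 4.47 can also be carried out numerically. The sketch below (assuming NumPy; the helper name jacobi_spectral_radius is hypothetical) computes the spectral radius of the Gauss-Jacobi iteration matrix for the matrix of Example 4.48.

```python
import numpy as np

def jacobi_spectral_radius(A):
    """Spectral radius of the Gauss-Jacobi iteration matrix
    H = -D^{-1}(L+U); the iteration converges iff this is < 1."""
    A = np.asarray(A, dtype=float)
    D = np.diag(np.diag(A))
    LU = A - D                      # L + U (off-diagonal part of A)
    H = -np.linalg.solve(D, LU)     # -D^{-1}(L+U) without forming D^{-1}
    return max(abs(np.linalg.eigvals(H)))

A = np.array([[1.0, 3.0, 0.0], [2.0, 2.0, 1.0], [0.0, 1.0, 2.0]])
print(jacobi_spectral_radius(A))   # about 1.80 > 1, so Jacobi diverges
```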
Example 4.49. Determine whether or not the Gauss-Jacobi algorithm converges for the matrix
 2 2 1 
A=  1 3 0  .
 
 0 1 2 
 
Solution. So the eigenvalues λ satisfy
the characteristic equation
2λ 2 1

1 3λ 0 = 12λ3 −4λ+1 = 0.


0 1 2λ
This cubic has one real root and two complex conjugate roots.
The stationary values of this cubic function satisfy 36λ2 −4 = 0 from which it follows that the cubic has
a minimum turning point at (1/3,1/9) and a maximum turning point at (−1/3,17/9). Therefore the cubic
has exactly one real solution and it lies in the interval (−1,−1/3).
The remaining solutions are complex conjugate pairs. Viete’s formula, x1 x2 ... xn = (−1)n a0 /an where
x1,x2 ..xn are the roots of an x n +...a1 x +a0 = 0, then gives that the product of the three solutions is −1/12.
So the complex conjugate pair solutions must lie within a circle of radius 1/2 about the origin. Therefore
all eigenvalues lie within the unit circle and so the Gauss-Jacobi algorithm converges. ^

4.3.3.2 Diagonal dominance

§ 4.50. The absolute condition for convergence ρ(H) < 1 is difficult to establish as eigenvalues are not easy
to find especially for matrices of large dimensions. So we discuss weaker but more practical conditions.
Definition 4.51. Let V be a vector of dimension n. The supremum norm of V is defined as
\[
\|V\| = \max_{1\le j\le n} |V_j| .
\]
Definition 4.52. Let A be an n×n matrix. The supremum norm of A is defined as the maximum row sum
of the moduli of its elements,
\[
\|A\| = \max_{1\le i\le n} \sum_{j=1}^{n} |a_{ij}| .
\]

Claim 4.53. ||Hv|| ≤ ||H|| ||v||.



Proof.
\[
\|Hv\| = \max_{1\le i\le n} \Big| \sum_{k=1}^{n} H_{ik} v_k \Big|
\le \max_{1\le i\le n} \sum_{k=1}^{n} |H_{ik}| |v_k|
\le \max_{1\le i\le n} \Big( \max_{1\le k\le n} |v_k| \Big) \sum_{k=1}^{n} |H_{ik}|
= \|v\| \max_{1\le i\le n} \sum_{k=1}^{n} |H_{ik}|
= \|v\| \, \|H\| .
\]


Claim 4.54. A sufficient condition for convergence of the sequence {x n } given by the iterated map
x n+1 = H x n +c is ||H|| < 1.

Proof. Let (λk ,vk ),k = 1...n be the eigenpairs of H so


Hvk = λk vk .
Then
||Hvk || = ||λk vk || = |λk | ||vk ||.
On the other hand,
||Hvk || ≤ ||H|| ||vk ||
so
||H|| ||vk || ≥ |λk | ||vk ||
and
||H|| ≥ |λk |.
Therefore, if ||H|| < 1 then |λk | < 1 and so ρ(H) < 1 and so convergence. 

Definition 4.55. An n×n matrix A is said to be diagonally dominated if the sum (in modulus) of all the
off-diagonal elements in any row is less than the modulus of the diagonal element in that row, i.e.
\[
|a_{ii}| > \sum_{j \ne i} |a_{ij}| .
\]

Claim 4.56. (Sufficient but not necessary condition) The Gauss-Jacobi method for the solution of Ax = b
converges if the matrix A is diagonally-dominated.

Proof. Let A = D + L + U. Then the Gauss-Jacobi iteration has matrix $H = -D^{-1}(L+U)$ and we have to show that
\[
\| -D^{-1}(L+U) \| < 1 .
\]
Suppose A is diagonally dominated. Express
\[
D^{-1} A = I + D^{-1}(L+U), \qquad -D^{-1}(L+U) = I - D^{-1} A .
\]
Let V be an arbitrary vector. Consider
\[
\| -D^{-1}(L+U) V \| = \max_{1\le i\le n} \big| [-D^{-1}(L+U) V]_i \big|
= \max_{1\le i\le n} \Big| \sum_{q=1}^{n} [-D^{-1}(L+U)]_{iq} V_q \Big|
= \max_{1\le i\le n} \Big| \sum_{q=1}^{n} [I - D^{-1} A]_{iq} V_q \Big|
\]
\[
= \max_{1\le i\le n} \Big| V_i - \sum_{q=1}^{n} \frac{a_{iq}}{a_{ii}} V_q \Big|
= \max_{1\le i\le n} \Big| - \sum_{q=1, q\ne i}^{n} \frac{a_{iq}}{a_{ii}} V_q \Big|
\le \|V\| \max_{1\le i\le n} \sum_{q=1, q\ne i}^{n} \Big| \frac{a_{iq}}{a_{ii}} \Big|
< \|V\| ,
\]
where the last step follows because A is diagonally dominated, so $\sum_{q\ne i} |a_{iq}/a_{ii}| < 1$ for every i by definition.
Finally, since V is arbitrary and the bound is uniform in V, we obtain
\[
\|H\| = \| -D^{-1}(L+U) \| < 1 ,
\]
and so Gauss-Jacobi converges by Claim 4.54. 

Claim 4.57. (Sufficient but not necessary condition) The Gauss-Seidel method for the solution of Ax = b
converges if the matrix A is diagonally-dominated.

Proof. Let A= D+ L +U. Then the Gauss-Seidel has matrix H = −(D+ L)−1U and we have to show that
||−(D+ L)−1U|| < 1.
Suppose A is diagonally dominated. Introduce notation E = −D−1 L and F = −D−1U. Observe first that
(D+ L)−1U = [D(I +D−1 L)]−1U = (I +D−1 L)−1 D−1U = −(I −E)−1 F.
We must now show that ||(D+ L)−1UV || = ||(I − E)−1 FV || < ||V || for any vector V. Since E is a strictly
lower triangular n×n matrix then E n = 0 and it now follows that
\[
(I - E)^{-1} = I + E + E^2 + \dots + E^{n-1} .
\]
The individual elements of $(I-E)^{-1}$ may be compared (in modulus) to get
\[
|(I-E)^{-1}| = |I + E + E^2 + \dots + E^{n-1}|
\le I + |E| + |E|^2 + \dots + |E|^{n-1}
= (I - |E|)^{-1} .
\]
Recall from the analysis of the Gauss-Jacobi algorithm that (|E | + |F |)|V | ≤ |V |. This inequality may be
re-expressed in the form |F ||V | ≤ (I −|E |)|V | which in turn gives (I −|E |)−1 |F ||V | ≤ |V |. Therefore, given
any vector V, it follows that
||(I −E)−1 FV || < ||V ||.
and so (D+ L) U is a contraction mapping if A is a diagonally dominated matrix.
−1 

Remark 4.58. In other words, a sufficient condition for the convergence of Gauss-Jacobi and Gauss-Seidel
methods is strict diagonal dominance of the matrix A. Strict diagonal dominance is only a sufficient condition,
the method may converge even if this is not the case. If we do not have strict diagonal dominance, we must
check the eigenvalues of the iteration matrix.
Remark 4.59. We may re-order the rows and columns of the matrix to obtain a new matrix A0 which is as
close as possible to having strict diagonal dominance. In general this improves the convergence properties
of the iterative methods.
Example 4.60. Decide whether Gauss-Jacobi and Gauss-Seidel methods can be made to converge or not
for the two matrices A
 −1 2 0   1 2 4 
(i) A=  3 1 1 , (ii) A=  1/8 1 1  .
   
 
 1 2 −4   −1 4 1 
   
Solution. (i) A is not diagonally-dominated. However, we can reorder equations by swapping R1 and R2 .
The new matrix
\[
A' = \begin{pmatrix} 3 & 1 & 1 \\ -1 & 2 & 0 \\ 1 & 2 & -4 \end{pmatrix}
\]

is strictly diagonally-dominated, as $3 > 1 + 1$, $2 > |{-1}| + 0$ and $|{-4}| > 1 + 2$. Therefore Gauss-Jacobi and
Gauss-Seidel will both converge based on $A'$.


(ii) In the second case we do not have strict diagonal dominance, even if rows are interchanged. So we
must compute the spectral radius.
Gauss-Jacobi case:
H = −D−1 (L +U),
Hv = λv,
−D−1 (L +U)v = λv,
(L +U)v +Dλv = 0,
((L +U)+λD)v = 0,
that has non-trivial solutions when
det((L +U)+λD) = 0.
So the eigenvalues λ satisfy the characteristic equation
\[
\begin{vmatrix} \lambda & 2 & 4 \\ 1/8 & \lambda & 1 \\ -1 & 4 & \lambda \end{vmatrix}
= \lambda^3 - \tfrac14 \lambda = 0 .
\]
The roots are λ1 = 0 and λ2,3 = ±1/2. So |λ1,2,3 | < 1 or ρ(H) < 1. Therefore the Gauss-Jacobi converges,
even though we do not have strict diagonal dominance.
Gauss-Seidel case:
H = −(D+ L)−1U,
Hv = λv,
−(D+ L)−1Uv = λv,
(U +(D+ L)λ)v = 0,
det(U +(D+ L)λ) = 0,
\[
\begin{vmatrix} \lambda & 2 & 4 \\ \lambda/8 & \lambda & 1 \\ -\lambda & 4\lambda & \lambda \end{vmatrix}
= \lambda^3 + \tfrac{7}{4}\lambda^2 - 2\lambda = 0 .
\]
The roots are $\lambda_1 = 0$ and $\lambda_{2,3} = (-7 \pm \sqrt{177})/8$. Since $|(-7 - \sqrt{177})/8| > 1$, we have $\rho(H) > 1$ and so Gauss-Seidel diverges. ^

4.4 Successive Over-Relaxation

This is a modification of the Gauss-Seidel scheme and is based on an idea used throughout numerical
analysis (compare with iterative refinement in root-finding) — that we can turn non-convergent iterative
schemes into convergent schemes by use of a parameter that we have the freedom to adjust.
Example 4.61. (Successive Over-Relaxation) Let A = D + L + U where D is the main diagonal of A, and
L and U denote respectively the lower and upper triangular sections of A. Formulate an iterative method
for Ax = b with H = (D+ωL)−1 ((1−ω)D−ωU), where ω is an arbitrary parameter.
Solution. Consider
0 = Ax −b,
0 = −ω(Ax −b),
Dx = Dx −ω(Ax −b),
Dx = Dx −ωDx −ωL x −ωU x +ωb (decompose A as usual),
(D+ωL)x = (ωb−ωU x +(1−ω)Dx),
x = (D+ωL)−1 ((1−ω)D−ωU)x +(D+ωL)−1 ωb= H x +c.

This can be used as an iterative scheme


Definition 4.62. The iterative scheme
(D+ωL)xn+1 = (ωb−ωU xn +(1−ω)Dxn )
is called the Successive Over-Relaxation method for solution of Ax = b.
The choice ω = 1 results in Gauss-Seidel iteration. The constant ω can be chosen to make the scheme
converge faster, or to turn a divergent scheme into a convergent one. ^
Example 4.63. Write out the explicit form of the SOR method.
Solution. SOR iteration for Ax = b is given by
xn+1 = ω(D+ωL)−1 b−(D+ωL)−1 (ωU −(1−ω)D)xn
which can be written
(D+ωL)xn+1 = ωb−ωU xn +(1−ω)Dxn . (4.6)
Suppose we have an N × N matrix, and label the components of $x_n$ as $x_n^{(i)}$ for $i = 1,2,\dots,N$. Then the j-th row
of (4.6) is given in component form as
\[
\sum_{i=1}^{N} D_{ji} x_{n+1}^{(i)} + \omega \sum_{i=1}^{N} L_{ji} x_{n+1}^{(i)}
= \omega b_j - \omega \sum_{i=1}^{N} U_{ji} x_n^{(i)} + (1-\omega) \sum_{i=1}^{N} D_{ji} x_n^{(i)} ,
\]
\[
D_{jj} x_{n+1}^{(j)} = \omega \Big( b_j - \sum_{i=1}^{j-1} L_{ji} x_{n+1}^{(i)} - \sum_{i=j+1}^{N} U_{ji} x_n^{(i)} \Big) + (1-\omega) D_{jj} x_n^{(j)} ,
\]
\[
A_{jj} x_{n+1}^{(j)} = \omega \Big( b_j - \sum_{i=1}^{j-1} A_{ji} x_{n+1}^{(i)} - \sum_{i=j+1}^{N} A_{ji} x_n^{(i)} \Big) + (1-\omega) A_{jj} x_n^{(j)} ,
\]
\[
x_{n+1}^{(j)} = (1-\omega)\, x_n^{(j)} + \omega \, \frac{ b_j - \sum_{i=1}^{j-1} A_{ji} x_{n+1}^{(i)} - \sum_{i=j+1}^{N} A_{ji} x_n^{(i)} }{ A_{jj} } ,
\]
for $j = 1,2,\dots,N$, where the special properties of L, U and D and their relation to A have been used.

From this final expression,
\[
x_{n+1}^{(j)} = (1-\omega)\, x_n^{(j)} + \omega \, \frac{ b_j - \sum_{i=1}^{j-1} A_{ji} x_{n+1}^{(i)} - \sum_{i=j+1}^{N} A_{ji} x_n^{(i)} }{ A_{jj} } ,
\]
it is clear that $x_{n+1}$ is just a linear interpolation between the current value $x_n$ and that obtained by
Gauss-Seidel. ^
Example 4.64. Write down the practical implementation of the SOR iterative scheme with parameter ω,
for the solution of
 2 0 1  X   1 
 2 4 1   Y  =  1 .
    
 0 1 4  Z   3 
    
Simplify the expression when ω = 3/2.
    

Solution. The iteration is
\[
X_{n+1} = (1-\omega) X_n + \omega \Big( \frac{1 - Z_n}{2} \Big), \qquad
Y_{n+1} = (1-\omega) Y_n + \omega \Big( \frac{1 - 2X_{n+1} - Z_n}{4} \Big), \qquad
Z_{n+1} = (1-\omega) Z_n + \omega \Big( \frac{3 - Y_{n+1}}{4} \Big),
\]
and when ω = 3/2 we have
\[
X_{n+1} = \frac{3 - 2X_n - 3Z_n}{4}, \qquad
Y_{n+1} = \frac{3 - 6X_{n+1} - 4Y_n - 3Z_n}{8}, \qquad
Z_{n+1} = \frac{9 - 3Y_{n+1} - 4Z_n}{8} .
\]
^
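The explicit form of Example 4.63 can be coded as a single sweep. Below is a sketch (the helper name sor, the fixed number of sweeps and the relaxation parameter are illustrative assumptions) applied to the system of Example 4.64 and compared with a direct solve.

```python
import numpy as np

def sor(A, b, omega, x0, iters):
    """Successive Over-Relaxation sweep in the explicit component form
    derived in Example 4.63; omega = 1 recovers Gauss-Seidel."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    n = len(b)
    x = np.asarray(x0, float).copy()
    for _ in range(iters):
        for j in range(n):
            s = A[j, :j] @ x[:j] + A[j, j + 1:] @ x[j + 1:]
            x[j] = (1 - omega) * x[j] + omega * (b[j] - s) / A[j, j]
    return x

A = np.array([[2.0, 0.0, 1.0], [2.0, 4.0, 1.0], [0.0, 1.0, 4.0]])
b = np.array([1.0, 1.0, 3.0])
# omega chosen arbitrarily for illustration; different choices change the
# convergence rate (and can even destroy convergence).
print(sor(A, b, 1.2, np.zeros(3), 50))
print(np.linalg.solve(A, b))   # reference solution for comparison
```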

4.5 Eigenvalue approximation

4.5.1 Problem formulation

Problem 4.65. Let A be a n×n matrix. Find (or numerically approximate) the spectrum of A i.e. the set
of all eigenvalues λk and corresponding eigenvectors vk of A,
Avk = λk vk .
Claim 4.66. The eigenvalues of A are precisely the solutions λ of the algebraic equation
det(A−λk I) = 0.

In general this is a degree-n polynomial in λk . No direct methods exist for the calculation of the roots
when n ≥ 5. Therefore, there are no direct methods to compute the eigenvalues of a general n×n matrix
A when n ≥ 5. We can, however, develop methods to estimate the regions of the complex plane in which
the eigenvalues lie and find iterative methods to estimate certain eigenvalues.

4.5.2 Gershgorin’s Circle Theorem

This theorem gives some useful, even though incomplete, information about the distribution of the eigenvalues
of a matrix A in the complex plane.
Claim 4.67. (Gershgorin’s Circle Theorem)
• 1. The eigenvalues of an n×n matrix A lie in the union of the n Gershgorin row circles,
\[
\lambda_k \in \bigcup_{i} R_i, \qquad
R_i = \Big\{ z \in \mathbb{C} : |z - A_{ii}| \le \sum_{j=1, j\ne i}^{n} |A_{ij}| \Big\} .
\]
• 2. The eigenvalues of an n×n matrix A lie in the union of the n Gershgorin column circles,
\[
\lambda_k \in \bigcup_{i} C_i, \qquad
C_i = \Big\{ z \in \mathbb{C} : |z - A_{ii}| \le \sum_{j=1, j\ne i}^{n} |A_{ji}| \Big\} .
\]
• 3. If the union of m of the circles forms a set that is disjoint from the remaining n−m circles, then m
eigenvalues lie in this union of m circles.

Proof. For the first part consider an eigenvalue λ; the corresponding eigenvector v has a component
which is greater than or equal to, in modulus, the remaining components. That is, there exists an i with
\[
|v_i| \ge |v_j|, \qquad j = 1,2,\dots,n .
\]
Then for Av = λv the i-th row is
\[
\sum_{j=1}^{n} A_{ij} v_j = \lambda v_i ,
\qquad
(\lambda - A_{ii}) v_i = \sum_{j=1, j\ne i}^{n} A_{ij} v_j ,
\]
\[
|\lambda - A_{ii}| \, |v_i| = \Big| \sum_{j=1, j\ne i}^{n} A_{ij} v_j \Big|
\le \sum_{j=1, j\ne i}^{n} |A_{ij}| \, |v_j|
\le |v_i| \sum_{j=1, j\ne i}^{n} |A_{ij}| ,
\]
\[
|\lambda - A_{ii}| \le \sum_{j=1, j\ne i}^{n} |A_{ij}| ,
\]
so $\lambda \in R_i$. Therefore, all the λ must lie in the union of the $R_i$,
\[
\lambda \in \bigcup_{i} R_i .
\]

For the second part, note that the eigenvalues of A are the eigenvalues of AT , and apply the first part to AT .
The third part regarding m circles is harder to prove and is omitted. 

Example 4.68. Sketch the Gershgorin row circles for the matrix
 9 1 1 
A=  0 1 1  . (4.7)
 
 −2 4 0 
 
Solution. The Gershgorin circles are C1 with centre 9 and radius |1|+|1| = 2, C2 with centre 1 and radius
|0|+|1| = 1, and C3 with centre 0 and radius |−2|+|4| = 6. ^
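The circle centres and radii are simple row sums, so they are easy to compute. The sketch below (hypothetical helper name) does this for the matrix (4.7) and compares with numerically computed eigenvalues, which must lie in the union of the circles.

```python
import numpy as np

def gershgorin_row_circles(A):
    """Return (centre, radius) pairs of the Gershgorin row circles of A."""
    A = np.asarray(A, dtype=float)
    centres = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centres)
    return list(zip(centres, radii))

A = np.array([[9.0, 1.0, 1.0], [0.0, 1.0, 1.0], [-2.0, 4.0, 0.0]])
print(gershgorin_row_circles(A))   # [(9, 2), (1, 1), (0, 6)]
print(np.linalg.eigvals(A))        # eigenvalues lie in the union of the circles
```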
Chapter 5

Numerical differentiation by finite differences

5.1 Problem formulation

5.1.1 General setting

§ 5.1. Numerical differentiation is a family of algorithms for numerical approximation of the value of a
derivative at a given point to a given degree of accuracy.
Definition 5.2. Let f(x) be a real function, defined on a neighbourhood of $x_0$: $|x - x_0| < r$. Let $h \ne 0$ with
$x_0 + h \in [x_0 - r, x_0 + r]$. If the limit
\[
D[f](x_0) \equiv f'(x_0) = \lim_{h\to 0} \frac{f(x_0 + h) - f(x_0)}{h}
\]
exists, then f(x) is said to be differentiable at x0 and f 0(x0 ) is called the derivative of f at x0 .
§ 5.3. An approximation to a derivative of a function f (x) can be made by first finding a convenient numerical
approximation to f (x) and then differentiating the result.

5.1.2 Finite-difference problem formulation

§ 5.4. One method of approximation to the function f (x) is polynomial interpolation as discussed previously.
Problem 5.5. Given a set of n+1 data points (xi, f (xi )), where xi ∈ [a,b] and f (x) is differentiable on [a,b],
find the derivative f 0(x) at x ∈ [a,b].
Remark 5.6. A finite difference is a mathematical expression of the form f (x + b) − f (x + a). A finite
difference divided by b−a is known as a difference quotient or Newton quotient.

5.2 General derivative approximation

Claim 5.7. Let f (x) be (n+1) times differentiable on x ∈ [a,b] and let xi,i = 0,1,...,n be a n+1-point partition
of the interval [a,b]. Then
Õn
f (x) =
0 0
f (xk )Ln,k (x)+ε (5.1)
k=0

69
70 CHAPTER 5. NUMERICAL DIFFERENTIATION BY FINITE DIFFERENCES

where
n n
1 d Ö 1 Ö d
ε= f (n+1) (ξ(x)) (x − xk )+ (x − xk ) f (n+1) (ξ(x)). (5.2)
(n+1)! dx k=0 (n+1)! k=0 dx

Proof. Approximate f(x) by its polynomial interpolant, namely
\[
f(x) = \sum_{k=0}^{n} f(x_k) L_{n,k}(x) + \frac{1}{(n+1)!} f^{(n+1)}(\xi(x)) \prod_{k=0}^{n} (x - x_k) .
\]
Then differentiate to obtain the result. 

Remark 5.8. If x , xi , i = 0,1,...,n, it is not possible to estimate the error accurately since ξ(x) ∈ [a,b] is
in general unknown. However, if x = xi , then the error can be determined as follows.
Corollary 5.9.
\[
\Big| f'(x_j) - \sum_{k=0}^{n} f(x_k) L'_{n,k}(x_j) \Big| \le \varepsilon, \tag{5.3}
\]
where
\[
\varepsilon = \frac{1}{(n+1)!} \max_{x\in[a,b]} \big| f^{(n+1)}(x) \big| \;
\max_{x\in[a,b]} \Big| \prod_{\substack{k=0 \\ k\ne j}}^{n} (x_j - x_k) \Big| .
\]

Proof. The proof of this corollary is similar to the analogous corollary for polynomial interpolation.
Observe that the term multiplying $\frac{d}{dx} f^{(n+1)}(\xi(x))$ in Equation (5.2) is equal to 0 in this case, that is
$\prod_{k=0}^{n} (x_j - x_k) = 0$. 

Remark 5.10. Optimal partitions can be found by a process similar to Chebyshev economization.
Example 5.11. Derive the three-point differentiation formulas for f (x) at the nodes of the 3-point partition
x0,x1,x2 of [a,b].
Solution. The number of nodes is (n + 1) = 3, hence, n = 2. Then, construct the Lagrange polynomials
L2,0,L2,1 and L2,2 .
\[
L_{2,0}(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}
= \frac{x^2 - (x_1 + x_2) x + x_1 x_2}{(x_0 - x_1)(x_0 - x_2)} .
\]
Now differentiate $L_{2,0}(x)$,
\[
L'_{2,0}(x) = \frac{2x - (x_1 + x_2)}{(x_0 - x_1)(x_0 - x_2)} .
\]
Similarly, obtain the differentiated Lagrange polynomials $L'_{2,1}(x)$ and $L'_{2,2}(x)$, i.e.
\[
L'_{2,1}(x) = \frac{2x - (x_0 + x_2)}{(x_1 - x_0)(x_1 - x_2)}, \qquad
L'_{2,2}(x) = \frac{2x - (x_0 + x_1)}{(x_2 - x_0)(x_2 - x_1)} .
\]
Then the derivative at a particular node can be approximated by using the formula in Equation (5.2) as follows,
\[
f'(x_j) = f(x_0) L'_{2,0}(x_j) + f(x_1) L'_{2,1}(x_j) + f(x_2) L'_{2,2}(x_j)
+ \frac{f^{(3)}(\xi_j)}{6} \prod_{\substack{k=0 \\ k\ne j}}^{2} (x_j - x_k) .
\]
When $j = 1$, i.e. at the node $x_1$, the derivative can be approximated by the expression
\[
f'(x_1) = f(x_0) L'_{2,0}(x_1) + f(x_1) L'_{2,1}(x_1) + f(x_2) L'_{2,2}(x_1)
+ \frac{f^{(3)}(\xi)}{6} (x_1 - x_0)(x_1 - x_2) .
\]
Similarly for j = 0 and j = 2. ^
Example 5.12. Simplify the general formula obtained in Example 5.11 in the case of a uniform partition
with step size h , 0.

Solution. The points of the partition are x0 , x1 = x0 +h, x2 = x0 +2h. Only the derivative at the node x1 will
be simplified. The derivatives at the other two nodes can be found similarly.
As in the previous example, compute the Lagrange polynomials and differentiate to obtain
\[
L'_{2,0}(x_1) = L'_{2,0}(x_0 + h) = \frac{2x_0 + 2h - x_0 - h - x_0 - 2h}{(x_0 - x_0 - h)(x_0 - x_0 - 2h)}
= -\frac{h}{2h^2} = -\frac{1}{2h} .
\]
Similarly,
\[
L'_{2,1}(x_1) = \dots = 0, \qquad L'_{2,2}(x_1) = \dots = \frac{1}{2h} .
\]
Now calculate the error ε,
\[
\varepsilon = \frac{1}{6} f^{(3)}(\xi_1) (x_0 + h - x_0)(x_0 + h - x_0 - 2h) = -\frac{1}{6} f^{(3)}(\xi_1) h^2 .
\]
Hence, using the formula for the derivative $f'(x_1)$,
\[
f'(x_1) = \frac{f(x_2) - f(x_0)}{2h} - \frac{1}{6} f^{(3)}(\xi_1) h^2
= \frac{f(x_2) - f(x_0)}{2h} + O(h^2) .
\]
The formula for $f'(x_1)$ is called the central difference formula. Note that in the final expression the term
$O(h^2)$ denotes the truncation error in Big-Oh notation.
The expressions for $f'(x_0)$ and $f'(x_2)$ are referred to as the forward-difference formula and the
backward-difference formula respectively, and are given by
\[
\text{Forward-difference:} \quad f'(x_0) = \frac{-3 f(x_0) + 4 f(x_1) - f(x_2)}{2h} + \frac{h^2}{3} f^{(3)}(\xi_0),
\]
\[
\text{Backward-difference:} \quad f'(x_2) = \frac{f(x_0) - 4 f(x_1) + 3 f(x_2)}{2h} + \frac{h^2}{3} f^{(3)}(\xi_2) .
\]
^
Remark 5.13. Note that the accuracy we have achieved with 3 points is O(h2 ).
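The O(h²) behaviour is easy to confirm numerically. The following short sketch (the test function sin and the evaluation point are illustrative assumptions) halves h and watches the error of the central and forward three-point formulas shrink by roughly a factor of four each time.

```python
import numpy as np

f, x = np.sin, 1.0     # exact derivative is cos(1)
for h in [0.1, 0.05, 0.025]:
    central = (f(x + h) - f(x - h)) / (2 * h)
    forward = (-3 * f(x) + 4 * f(x + h) - f(x + 2 * h)) / (2 * h)
    print(h, abs(central - np.cos(x)), abs(forward - np.cos(x)))
```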

5.3 Using Taylor’s Theorem to find derivatives

Remark 5.14. Taylor’s Theorem provides an alternative method to find derivatives and estimate their
truncation errors. This is demonstrated on some examples below.
Remark 5.15 (Notation). Throughout this section, we consider a uniform partition $\{x_i = x_0 + ih;\ i \in \mathbb{N}\}$
and denote $f_i = f(x_i)$, $f'_i = f'(x_i)$, etc.
Example 5.16. Use Taylor series expansions to derive the following finite difference approximations to
derivatives.
• First order derivative
– Forward-difference, first-order accurate
\[
f'(x_i) = \frac{f_{i+1} - f_i}{h} + O(h),
\]
– Backward-difference, first-order accurate
\[
f'(x_i) = \frac{f_i - f_{i-1}}{h} + O(h),
\]
– Centred-difference, second-order accurate
\[
f'(x_i) = \frac{f_{i+1} - f_{i-1}}{2h} + O(h^2),
\]
– Backward-difference, second-order accurate
\[
f'(x_i) = \frac{3 f_i - 4 f_{i-1} + f_{i-2}}{2h} + O(h^2),
\]
– Forward-difference, second-order accurate
\[
f'(x_i) = \frac{-3 f_i + 4 f_{i+1} - f_{i+2}}{2h} + O(h^2).
\]
• Second order derivative, second-order accurate
\[
f''_i = \frac{f_{i-1} - 2 f_i + f_{i+1}}{h^2} + O(h^2).
\]

Solution. Forward-difference:
\[
f_i = f(x_i), \qquad
f_{i+1} = f(x_{i+1}) = f(x_i + h) = f(x_i) + h f'(x_i) + \frac{h^2}{2!} f^{(2)}(\xi) .
\]
Now subtract $f_i$ from $f_{i+1}$ to obtain
\[
f_{i+1} - f_i = h f'(x_i) + \frac{h^2}{2!} f^{(2)}(\xi),
\]
then rearrange in order to obtain an expression for $f'(x_i)$:
\[
f'(x_i) = \frac{f_{i+1} - f_i}{h} - \frac{h}{2} f^{(2)}(\xi) .
\]
Note that the truncation error is O(h).
Backward-difference:
\[
f_i = f(x_i), \qquad
f_{i-1} = f(x_{i-1}) = f(x_i - h) = f(x_i) - h f'(x_i) + \frac{h^2}{2!} f^{(2)}(\xi) .
\]
Again, subtract to obtain
\[
f_i - f_{i-1} = h f'(x_i) - \frac{h^2}{2!} f^{(2)}(\xi),
\]
then
\[
f'(x_i) = \frac{f_i - f_{i-1}}{h} + \frac{h}{2} f^{(2)}(\xi) = \frac{f_i - f_{i-1}}{h} + O(h) .
\]
Again, note the truncation error is O(h). In conclusion, the accuracy we have achieved is O(h).
Centred-difference:
\[
f_{i+1} = f(x_i) + h f'(x_i) + \frac{h^2}{2} f^{(2)}(x_i) + \frac{h^3}{6} f^{(3)}(\xi),
\qquad
f_{i-1} = f(x_i) - h f'(x_i) + \frac{h^2}{2} f^{(2)}(x_i) - \frac{h^3}{6} f^{(3)}(\xi) .
\]
Subtract,
\[
f_{i+1} - f_{i-1} = 2h f'(x_i) + \frac{2h^3}{6} f^{(3)}(\xi) .
\]
Then
\[
f'(x_i) = \frac{f_{i+1} - f_{i-1}}{2h} + O(h^2) .
\]
Backward-difference:
\[
f_{i-2} = f(x_i - 2h) = f(x_i) - 2h f'(x_i) + \frac{4h^2}{2} f^{(2)}(x_i) - \frac{8h^3}{6} f^{(3)}(\xi),
\]
\[
f_{i-1} = f(x_i - h) = f(x_i) - h f'(x_i) + \frac{h^2}{2} f^{(2)}(x_i) - \frac{h^3}{6} f^{(3)}(\xi),
\qquad
f_i = f(x_i) .
\]
Combine the three expressions to obtain
\[
3 f_i - 4 f_{i-1} + f_{i-2} = 2h f'(x_i) + O(h^3) .
\]
Then
\[
f'_i = f'(x_i) = \frac{3 f_i - 4 f_{i-1} + f_{i-2}}{2h} + O(h^2) .
\]

Forward-difference: obtained in a similar manner, the forward-difference formula is given by
\[
f'_i = \frac{-3 f_i + 4 f_{i+1} - f_{i+2}}{2h} + O(h^2) .
\]
Note that the accuracy for three points is O(h^2). ^
Example 5.17. Use the values fi−1 , fi , fi+1 to construct an approximation of the second derivative fi00 that
is of at least second order O(h2 ).
Solution. We take a linear combination of $f_{i-1}$, $f_i$, and $f_{i+1}$ and try to find coefficients α, β, and γ such
that the requirements hold, i.e.
\[
f''_i + O(h^2) = \alpha f_{i-1} + \beta f_i + \gamma f_{i+1} . \tag{5.4}
\]
Now use the Taylor expansion of each function and substitute in Equation (5.4) to obtain
\[
f''_i + O(h^2) = (\alpha + \beta + \gamma) f_i + (\gamma - \alpha) h f'_i + (\gamma + \alpha) \frac{h^2}{2} f''_i
+ (\gamma - \alpha) \frac{h^3}{6} f'''_i + \frac{h^4}{24} (\alpha + \gamma) f^{(4)}(\xi) .
\]
Now in order for the requirements to hold, we require
\[
\alpha + \beta + \gamma = 0, \qquad \gamma - \alpha = 0, \qquad \frac{h^2}{2} (\alpha + \gamma) = 1 .
\]
Solving the system of equations yields
\[
\alpha = \frac{1}{h^2}, \qquad \beta = -\frac{2}{h^2}, \qquad \gamma = \frac{1}{h^2} .
\]
Finally, substituting the values for the coefficients in Equation (5.4) we obtain
\[
f''_i = \frac{f_{i-1} - 2 f_i + f_{i+1}}{h^2} + O(h^2) .
\]
^
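The three matching conditions of Example 5.17 form a small linear system that can also be solved numerically. The sketch below solves for the coefficients scaled by h² (an illustrative normalisation), recovering the stencil (1, −2, 1)/h².

```python
import numpy as np

# Conditions from Example 5.17 for the scaled coefficients (alpha*h^2,
# beta*h^2, gamma*h^2): constant terms cancel, first-derivative terms
# cancel, and the second-derivative terms give 1.
M = np.array([[1.0, 1.0, 1.0],    # alpha + beta + gamma = 0
              [-1.0, 0.0, 1.0],   # gamma - alpha = 0
              [0.5, 0.0, 0.5]])   # (alpha + gamma)/2 = 1 (scaled by h^2)
rhs = np.array([0.0, 0.0, 1.0])
print(np.linalg.solve(M, rhs))    # -> [1, -2, 1], i.e. (f_{i-1} - 2 f_i + f_{i+1})/h^2
```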
Example 5.18. Find a forward difference approximation of y 0(ti ) that is at least second order, using a
uniform grid and the notation yi = y(ti ).
Solution. We must take a linear combination of $y_i$, $y_{i+1}$ and $y_{i+2}$ and we require
\[
\alpha y_i + \beta y_{i+1} + \gamma y_{i+2} = y'(t_i) + O(h^2) . \tag{5.5}
\]
Use Taylor expansion to obtain
\[
(\alpha + \beta + \gamma) y(t_i) + h (\beta + 2\gamma) y'(t_i) + \frac{h^2}{2} (\beta + 4\gamma) y''(t_i) + o(h^2) = y'(t_i) + O(h^2), \tag{5.6}
\]
which gives the simultaneous equations $\alpha + \beta + \gamma = 0$, $\beta + 2\gamma = 1/h$ and $\beta + 4\gamma = 0$, so $\beta = 2/h$,
$\gamma = -1/(2h)$ and $\alpha = -3/(2h)$, so
\[
\frac{-3 y_i + 4 y_{i+1} - y_{i+2}}{2h} = y'(t_i) + O(h^2) \tag{5.7}
\]
is a second order finite difference approximation to $y'(t_i)$. ^
0 ^
Example 5.19. Use the values y(t), y(t − h) and y(t + λh) to construct an approximation of y 00(t) and
calculate the leading order error.
Solution. We must take a linear combination and we require
\[
\alpha y(t) + \beta y(t-h) + \gamma y(t + \lambda h) = y''(t) + O(h^p), \tag{5.8}
\]
where p is to be determined. Use Taylor expansion to see
\[
(\alpha + \beta + \gamma) y(t) + h (-\beta + \gamma\lambda) y'(t) + \frac{h^2}{2} \big( \beta + \lambda^2 \gamma \big) y''(t)
+ \frac{h^3}{6} \big( -\beta + \lambda^3 \gamma \big) y'''(t) + O(h^4) = y''(t) + O(h^p), \tag{5.9}
\]
which gives the equations $\alpha + \beta + \gamma = 0$, $-\beta + \lambda\gamma = 0$ and $h^2 (\beta + \lambda^2 \gamma)/2 = 1$, which have solution
\[
\alpha = -\frac{2(\lambda+1)}{\lambda(\lambda+1) h^2}, \qquad
\beta = \frac{2\lambda}{\lambda(\lambda+1) h^2}, \qquad
\gamma = \frac{2}{\lambda(\lambda+1) h^2},
\]
so
\[
\frac{2}{\lambda(\lambda+1) h^2} \big( -(\lambda+1) y(t) + \lambda y(t-h) + y(t+\lambda h) \big)
= y''(t) + \frac{h}{3} (\lambda - 1) y'''(t) + O(h^2) . \tag{5.10}
\]
For general $\lambda \ne 1$ the error is of order h; for $\lambda = 1$ the error is of order $h^2$. In the case $\lambda = 1$ the finite difference formula is
\[
\frac{y(t+h) + y(t-h) - 2 y(t)}{h^2} = y''(t) + O(h^2) . \tag{5.11}
\]

5.4 Partial derivatives and Differential operators

Below we use Taylor’s theorem to derive formulas for certain second order partial derivatives and for the
Laplacian operator in R2 .
Remark 5.20. Throughout this section, we consider a uniform grid $\{(x_i, y_j);\ x_i = x_0 + ih,\ y_j = y_0 + jk;\ i,j \in \mathbb{N}\}$
and denote $f_{i,j} = f(x_i, y_j)$.

Recall the R2 version of Taylor’s Theorem.


Claim 5.21.
\[
f(x+h, y+k) = \sum_{n=0}^{N} \frac{1}{n!} (h \partial_x + k \partial_y)^n f(x,y)
+ \frac{1}{(N+1)!} (h \partial_x + k \partial_y)^{N+1} f(\xi(x), \chi(y))
\]
\[
= f(x,y) + h f_x + k f_y + \frac{1}{2} \Big[ h^2 f_{xx} + 2hk f_{xy} + k^2 f_{yy} \Big]
+ \frac{1}{6} \Big[ h^3 f_{xxx} + 3 h^2 k f_{xxy} + 3 h k^2 f_{xyy} + k^3 f_{yyy} \Big] + \cdots
\]
Example 5.22. Show that the following usual central difference approximations are valid,
\[
\frac{\partial f(x_i, y_j)}{\partial x} = \frac{f_{i+1,j} - f_{i-1,j}}{2h} + O(h^2), \qquad
\frac{\partial f(x_i, y_j)}{\partial y} = \frac{f_{i,j+1} - f_{i,j-1}}{2k} + O(k^2),
\]
\[
\frac{\partial^2 f(x_i, y_j)}{\partial x^2} = \frac{f_{i+1,j} - 2 f_{i,j} + f_{i-1,j}}{h^2} + O(h^2), \qquad
\frac{\partial^2 f(x_i, y_j)}{\partial y^2} = \frac{f_{i,j+1} - 2 f_{i,j} + f_{i,j-1}}{k^2} + O(k^2),
\]
\[
\frac{\partial^2 f(x_i, y_j)}{\partial x \partial y} = \frac{f_{i+1,j+1} - f_{i-1,j+1} - f_{i+1,j-1} + f_{i-1,j-1}}{4hk} + O(hk),
\]
where $f_{i,j} = f(x_0 + ih, y_0 + jk)$.
Solution. The derivations proceed as previously.
To maintain simplicity, it will henceforth be assumed that h = k, that is, the resolutions of the mesh in the
x and y directions are identical so that the region in which the solution is sought is now partitioned into
squares of side h by the nodes. The simplified finite difference formulae for the first and second order partial
derivatives of f become
\[
\frac{\partial f(x_i, y_j)}{\partial x} = \frac{f_{i+1,j} - f_{i-1,j}}{2h} + O(h^2), \qquad
\frac{\partial f(x_i, y_j)}{\partial y} = \frac{f_{i,j+1} - f_{i,j-1}}{2h} + O(h^2),
\]
\[
\frac{\partial^2 f(x_i, y_j)}{\partial x^2} = \frac{f_{i+1,j} - 2 f_{i,j} + f_{i-1,j}}{h^2} + O(h^2), \qquad
\frac{\partial^2 f(x_i, y_j)}{\partial y^2} = \frac{f_{i,j+1} - 2 f_{i,j} + f_{i,j-1}}{h^2} + O(h^2),
\]
\[
\frac{\partial^2 f(x_i, y_j)}{\partial x \partial y} = \frac{f_{i+1,j+1} - f_{i-1,j+1} - f_{i+1,j-1} + f_{i-1,j-1}}{4h^2} + O(h^2) .
\]
For instance, formula 5 is obtained as follows:
\[
A = f_{i+1,j+1} - f_{i-1,j+1} = 2h \,\partial_x f_{i,j+1} + O(h^3),
\qquad
B = f_{i+1,j-1} - f_{i-1,j-1} = 2h \,\partial_x f_{i,j-1} + O(h^3),
\]
\[
A - B = 2h \,\partial_x ( f_{i,j+1} - f_{i,j-1} ) + O(h^3)
= 2h \big( 2h \,\partial_x \partial_y f_{i,j} + O(h^3) \big)
= 4 h^2 \,\partial^2_{xy} f_{i,j} + O(h^4) .
\]
^

5.4.1 Laplacian and its computational molecule (stencil)

A computational “molecule” (or “stencil”) is a convenient way to describe the complex mix of variables
necessary to construct the finite difference representation of a differential operator.
Example 5.23. Construct a computational stencil for the Laplacian operator
∇2 = ∂x2 +∂y2 .

Solution. The Laplacian at $(x_i, y_j)$ in two dimensions has the finite difference form
\[
u_{xx} + u_{yy} = \frac{u_{i+1,j} - 2 u_{i,j} + u_{i-1,j}}{h^2} + \frac{u_{i,j+1} - 2 u_{i,j} + u_{i,j-1}}{h^2} + O(h^2)
= \frac{u_{i+1,j} + u_{i,j+1} - 4 u_{i,j} + u_{i,j-1} + u_{i-1,j}}{h^2} + O(h^2) .
\]
The centre of the molecule (stencil) is positioned at the node at which the operator is to be expressed and
the entries of the molecule (stencil) indicate the proportion of the surrounding nodes to be mixed in order
to obtain the finite difference representation of the operator. Assuming the molecular orientation
\[
\begin{matrix}
(i-1,j+1) & (i,j+1) & (i+1,j+1) \\
(i-1,j) & (i,j) & (i+1,j) \\
(i-1,j-1) & (i,j-1) & (i+1,j-1)
\end{matrix}
\]
the Laplace operator corresponds to the computational molecule (stencil)
\[
u_{xx} + u_{yy} = \frac{1}{h^2}
\begin{matrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{matrix}
\]
Similarly, the mixed derivative operator corresponds to the computational molecule (stencil)
\[
u_{xy} = \frac{1}{4h^2}
\begin{matrix} -1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & -1 \end{matrix}
\]
^

5.5 Differentiation matrices

Remark 5.24. Differentiation is a linear operator. The numerical approximations of differentiation


preserve this property. Because of linearity, numerical differentiation can be represented as a matrix-vector
multiplication. The matrices involved are known as numerical differentiation matrices.
Example 5.25. Consider a uniform partition {x0,...,xn } with xi = x0 +ih and a set of corresponding data
values { f0,..., fn }. Represent the standard second order centred-difference formula as a matrix-vector
multiplication.
Solution. The standard second order centred-difference formula is
fi+1 − fi−1
fi0 = +O(h2 ).
2h

Assume that the problem is periodic so that f−1 = fn and f0 = fn+1 . This is needed as we cannot apply this
formula at the end points; alternatively we can use back/forward-difference formulas there. Then we have
the following representation as a matrix-vector multiplication
\[
\begin{pmatrix} f'_0 \\ f'_1 \\ \vdots \\ f'_n \end{pmatrix}
= \begin{pmatrix}
0 & \frac{1}{2h} & 0 & 0 & \cdots & -\frac{1}{2h} \\
-\frac{1}{2h} & 0 & \frac{1}{2h} & 0 & \cdots & 0 \\
0 & -\frac{1}{2h} & 0 & \frac{1}{2h} & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
\frac{1}{2h} & 0 & \cdots & \cdots & -\frac{1}{2h} & 0
\end{pmatrix}
\begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{pmatrix} + O(h^2),
\]
i.e.
\[
f' = D f .
\]
The matrix D on the RHS is the numerical differentiation matrix in this case. ^
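A sketch of this periodic differentiation matrix in NumPy, applied to f(x) = sin x on [0, 2π) (the grid size and the test function are illustrative assumptions), is given below; the maximum error behaves like O(h²).

```python
import numpy as np

n = 64
h = 2 * np.pi / n
x = h * np.arange(n)
# Periodic central-difference differentiation matrix, as in Example 5.25.
D = (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / (2 * h)
D[0, -1] = -1 / (2 * h)   # wrap-around entries from periodicity
D[-1, 0] = 1 / (2 * h)
f = np.sin(x)
print(np.max(np.abs(D @ f - np.cos(x))))   # error of order h^2
```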
Example 5.26. Consider a uniform grid $\{(x_i, y_j);\ x_i = x_0 + ih,\ y_j = y_0 + jh;\ i,j \in \mathbb{N}\}$ with corresponding data
values $u_{i,j} = u(x_i, y_j)$. Construct a differentiation matrix for the usual finite-difference approximation to
the Laplacian operator $\nabla^2$.
Solution.
\[
D = \begin{pmatrix}
T & I & 0 & \cdots & \cdots & 0 \\
I & T & I & 0 & \cdots & 0 \\
0 & I & T & I & 0 & \cdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & \cdots & \cdots & I & T
\end{pmatrix}
\]
where I is the (n−1)×(n−1) identity matrix and T is the (n−1)×(n−1) tri-diagonal matrix
\[
T = \begin{pmatrix}
-4 & 1 & 0 & \cdots & \cdots & 0 \\
1 & -4 & 1 & 0 & \cdots & 0 \\
0 & 1 & -4 & 1 & 0 & \cdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & \cdots & \cdots & 1 & -4
\end{pmatrix} .
\]
^

The fundamental difference between the one-dimensional and higher dimensional finite difference schemes
lies in the fact that any node has more than two neighbours contributing to the numerical solution (in fact 4 in
this case). Consequently, nodes close geographically to each other can no longer be numbered sequentially.
An enumeration scheme for the unknowns ui, j is required. The choice is ultimately arbitrary, but one obvious
possibility is
vi+(n−1)∗(j−1) = ui, j , (i, j) ∈ (1,···,n−1)×(1,···,n−1).
The vector V of unknowns has dimension (n−1)2 and enumerates the nodes from left to right (in the direction
of increasing x) and from bottom to top (in the direction of increasing y). The vector V is then our new set of
data points renumbered. The numerical scheme for the Laplacian can then be represented as a matrix-vector
multiplication
∇2V = DV,
Where D is the differentiation matrix in this case, and has the following structure. The k th row of D will contain
the elements of the Laplacian computational module appropriately positioned. The different possibilities are
(a) If i = 1 then Dk,k = −4 and Dk,k+1 = 1.
(b) If 1 < i < (n−1) then Dk,k−1 = 1, Dk,k = −4 and Dk,k+1 = 1.
(c) If i = n−1 then Dk,k−1 = 1 and Dk,k = −4.

For each value of k, the value of j is examined. If j = 1 then Dk,k+n−1 = 1, while if 1 < j < (n −1) then
Dk,k−n+1 = Dk,k+n−1 = 1, and finally if j = n−1 then Dk,k−n+1 = 1. The structure of D may be represented
efficiently by the block matrix form
\[
D = \begin{pmatrix}
T & I & 0 & \cdots & \cdots & 0 \\
I & T & I & 0 & \cdots & 0 \\
0 & I & T & I & 0 & \cdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & \cdots & \cdots & I & T
\end{pmatrix}
\]
where I is the (n−1)×(n−1) identity matrix and T is the (n−1)×(n−1) tri-diagonal matrix
\[
T = \begin{pmatrix}
-4 & 1 & 0 & \cdots & \cdots & 0 \\
1 & -4 & 1 & 0 & \cdots & 0 \\
0 & 1 & -4 & 1 & 0 & \cdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & \cdots & \cdots & 1 & -4
\end{pmatrix} .
\]
This D is the block of the differentiation matrix situated in the "middle". A first and a last row must be
added, but their form will depend on the particular boundary conditions chosen, so we will not specify them
within this example.
Chapter 6

Discretisation of differential equations by finite differences

In the following chapters we discuss the methods for numerical solution of problems formulated in terms
of differential equations.
Remark 6.1. A differential equation is a relationship between a function, its arguments and its derivatives.
Remark 6.2. A problem formulated in terms of differential equations involves 3 elements:
• a differential equation (or a set of),
• a domain of the independent variables (region of integration),
• additional conditions that may be (a) initial conditions – imposed at the initial moment of time (if
time is an independent variable), (b) boundary conditions – conditions imposed at the boundaries
of the domain of integration (if space is an independent variable).
§ 6.3. Symbolically we may write a differential equation problem (equation and associated conditions and
domain) as
\[
M[u] = F_i \Big( x_1, x_2, \dots, x_m;\ u_1, u_2, \dots, u_n;\ \dots, \frac{\partial^k u_j}{\partial x_s^{k_p} \cdots \partial x_t^{k_q}}, \dots \Big) = 0, \qquad i = 1,2,\dots,n .
\]
Remark 6.4. There are three major classes of numerical methods for solution of differential equations
• Finite-Difference methods,
• Finite-Element methods,
• Spectral methods.

In this course we will consider in some detail Finite-Difference methods.

6.1 Finite-difference problem formulation

§ 6.5. Outline of the FD method


• In the finite-difference method we approximate the continuous derivatives of a differential equation
problem by finite difference formulas.
• The continuous differential equation problem is then reduced to a set of algebraic equations for values
of the unknown function on a grid.
Remark 6.6. The usual questions arise about
• Existence and uniqueness of the solution to the difference equations,


• Convergence of the solution of the difference equations to the solution of the continuous problem,
• Efficiency of the computation algorithm,
• Estimation and control of errors.

6.2 Discretization of steady-state problems (Space)

§ 6.7. In this section we demonstrate the procedure of obtaining “difference” equations from a continuous
problem at hand. We will not yet be concerned with convergence. We will use a number of examples
grouped by some common features. References to material in earlier chapters will be made.

6.2.1 Steady-state problems

A wide class of steady-state problems in science and engineering is formulated in terms of second-order
elliptic Boundary Value Problems. To be more specific a second-order elliptic Boundary Value Problem
in 2D is defined as follows.
Definition 6.8. A Boundary Value Problem (BVP) for a second-order elliptic PDE in two independent
variables $(x,y) \in A \subset \mathbb{R}^2$ is specified by the equation
\[
A \frac{\partial^2 u}{\partial x^2} + B \frac{\partial^2 u}{\partial x \partial y} + C \frac{\partial^2 u}{\partial y^2}
+ D \frac{\partial u}{\partial x} + E \frac{\partial u}{\partial y} + F u = g, \tag{6.1}
\]
where $A = A(x,y,u_x,u_y,u), \dots, F = F(x,y,u_x,u_y,u)$ and $B^2 - 4AC < 0$, subject to the boundary conditions
\[
u = f_1(x,y) \quad (x,y) \in C_1, \qquad
\frac{\partial u}{\partial n} = f_2(x,y) \quad (x,y) \in C_2, \qquad
\frac{\partial u}{\partial n} + f_3(x,y)\, u = f_4(x,y) \quad (x,y) \in C_3,
\]
where $f_1(x,y), \dots, f_4(x,y)$ are known functions and $\partial A = C_1 \cup C_2 \cup C_3$.
Remark 6.9. Note the following general remarks.
• In the case of one independent variable (6.1) reduces to an ODE. Similarly, PDE (6.1) can extended
to the case of three or more independent variables.
• PDE (6.1) may include linear as well as non-linear cases.
Remark 6.10. Commonly occurring boundary conditions are known under specific names.
• The case C1 = ∂A is known as a Dirichlet (or Value) boundary condition.
• The case C2 = ∂A is known as a Neumann (or Flux or Gradient) boundary condition.
• The case C3 = ∂A is known as a Robin boundary condition and is a linear combination of the other two.

As in other chapters in numerical analysis, we should aim to understand if the mathematical problem is
well-posed in the first place before attacking numerically.
Remark 6.11. (Existence and Uniqueness in a particular case) A 1D second-order ODE boundary value
problem may have the following types of solutions
• a two-parameter family of solutions,
• a one-parameter family of solutions,
• a unique solution,
• no solutions.

To understand this, recall that a 2nd-order ODE has a general solution involving 2 arbitrary parameters.
The boundary conditions may be such that, when we try to impose them, they leave both parameters
undetermined, or they fix one but leave the other parameter undetermined, or they allow both parameters to be
fixed, or finally they are inconsistent with the equation and no values of the parameters satisfying
both exist. Similar statements remain true for differential equations of different order.
We will illustrate some of these possibilities in the following examples.
Example 6.12. (Two-parameter family of solutions) Solve the following BVP
y'' + 4y = 0,  y(−π) = y(π),  y'(−π) = y'(π).
(Such BCs are called periodic BCs.)
Solution. Using the techniques of Chapter 1 we find the general solution to the ODE
\[
y(x) = A\cos(2x) + B\sin(2x), \qquad y'(x) = -2A\sin(2x) + 2B\cos(2x) .
\]
Imposing the BCs,
\[
y(\pi) = A = y(-\pi) = A, \qquad y'(-\pi) = 2B = y'(\pi) = 2B,
\]
which leaves A and B undetermined. Thus, the BVP has a two-parameter family of solutions
y(x) = Acos(2x)+Bsin(2x).
In this case that happens because the solution of the ODE as well as the BC are 2π periodic. ^
Example 6.13. (One-parameter family of solutions) Solve the following BVP
y 00 + y = cos(2x) y 0 (0) = 0, y 0 (π) = 0.

Solution. Using the techniques of Chapter 1 we find the general solution to the ODE
y(x) = Acosx +Bsinx −1/3cos2x.
Imposing the BCs
y 0 (0) = −Asinx +Bcosx +2/3sin2x| x=0 = B = 0,
y 0 (π) = −Asinx +Bcosx +2/3sin2x| x=π = −B = 0.
Thus B = 0, while A remains undetermined. We have the solution
y(x) = Acosx −1/3cos2x,
which is a one-parameter family of solutions. ^
Example 6.14. (Unique solution) Solve the following initial value problem (IVP), given by the ODE
6y 00 −5y 0 + y = 0,
subject to the initial condition y(0) = 4 and y 0 (0) = 0.
Solution. We seek solutions of the form $e^{\lambda x}$. Since the equation is linear and homogeneous, any linear combination of such
solutions is also a solution. If we put this into the ODE we obtain for λ the equation
\[
6\lambda^2 - 5\lambda + 1 = 6\big(\lambda^2 - \tfrac{5}{6}\lambda + \tfrac{1}{6}\big) = 6\big(\lambda - \tfrac{1}{3}\big)\big(\lambda - \tfrac{1}{2}\big) = 0
\]
and hence the general solution
\[
y(x) = \alpha e^{x/3} + \beta e^{x/2}
\]
for arbitrary constants α, β. Differentiating we have
\[
y' = \frac{\alpha}{3} e^{x/3} + \frac{\beta}{2} e^{x/2} .
\]
Then
\[
y(0) = 4 \implies 4 = \alpha + \beta \implies \beta = 4 - \alpha,
\]
\[
y'(0) = 0 \implies 0 = \frac{\alpha}{3} + \frac{\beta}{2} = \frac{2\alpha + 3\beta}{6} = \frac{2\alpha + 12 - 3\alpha}{6} = 2 - \frac{\alpha}{6}
\implies \alpha = 12, \ \beta = -8 .
\]
Hence, the solution, subject to the initial conditions, is
\[
y(x) = 12 e^{x/3} - 8 e^{x/2} .
\]
^

Example 6.15. (No solutions) Solve the following BVP


y'' + 4y = 4x,  y(−π) = y(π),  y'(−π) = y'(π).

Solution. Using the techniques of Chapter 1 we find the general solution to the ODE
y(x) = Acos(2x)+Bsin(2x)+ x.
Imposing the BCs
y(π) = π+ A= y(−π) = −π+ A,
which is always false. Thus, the ODE and the BCs are inconsistent and there are no A and B that satisfy them. There is no solution. ^

6.2.2 Linear ODEs with Dirichlet boundary conditions

Example 6.16. (Second-order ODE with constant coefficients and Dirichlet BCs) Construct a Finite-
Difference scheme for the solution of the boundary value problem
u xx = f (x), u(0) =U0 u(1) =U1
where f (x) is a prescribed function and U0 and U1 are given.
Solution. We use a uniform partition of [0,1] by (n+1) evenly spaced nodes 0 = x0 < x1 < ··· < xn = 1 where
xk = k/n. Then, h = 1/n so that xk = k h. In the usual way let uk denote u(xk ) then
\[
u_{xx}(x_k) = \frac{u_{k+1} - 2 u_k + u_{k-1}}{h^2} + O(h^2)
\]
at each interior point of [0,1]. The boundary conditions are satisfied by asserting that $u_0 = U_0$ and $u_n = U_1$.
The finite difference solution for u(x) is now obtained by enforcing the original equation at each interior
node of [0,1], that is,
\[
\frac{u_{k+1} - 2 u_k + u_{k-1}}{h^2} + O(h^2) = f_k, \qquad k = 1, 2, \dots, (n-1),
\]
where $f_k = f(x_k)$. When arranged in sequence, these equations become
\[
u_2 - 2 u_1 + u_0 = h^2 f_1 + O(h^4), \quad
u_3 - 2 u_2 + u_1 = h^2 f_2 + O(h^4), \quad
\dots, \quad
u_n - 2 u_{n-1} + u_{n-2} = h^2 f_{n-1} + O(h^4) .
\]
However, $u_0$ and $u_n$ are known and so the first and last equations are re-expressed in the form
\[
u_2 - 2 u_1 = h^2 f_1 - U_0 + O(h^4), \qquad
-2 u_{n-1} + u_{n-2} = h^2 f_{n-1} - U_1 + O(h^4) .
\]
Thus the unknown values $u_1, u_2, \dots, u_{n-1}$ are the solution of the system of linear equations
\[
\begin{pmatrix}
-2 & 1 & 0 & 0 & \cdots & \cdots & 0 \\
1 & -2 & 1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & -2 & 1 & 0 & \cdots & \cdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & \cdots & \cdots & 0 & 1 & -2
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ \vdots \\ u_{n-1} \end{pmatrix}
= \begin{pmatrix} h^2 f_1 - U_0 \\ h^2 f_2 \\ \vdots \\ \vdots \\ h^2 f_{n-1} - U_1 \end{pmatrix} + O(h^4) . \tag{6.2}
\]
• The coefficient matrix of this system has a main diagonal in which each entry is −2 and sub- and super-
diagonals in which each entry is 1. All other entries are zero. Matrices of this type are called tri-diagonal
matrices. Efficient methods for the solution of tri-diagonal systems exist as discussed in section 4.2.3.
• The order of convergence is 2.
^
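To make the construction concrete, the following Matlab sketch assembles and solves system (6.2). It is illustrative only and not part of the original notes; the function name and the test data are hypothetical, and f is assumed to be supplied as a vectorised function handle.

function u = dirichlet_poisson_1d(f, U0, U1, n)
% Illustrative sketch of the scheme of Example 6.16: solve u_xx = f(x) on [0,1]
% with u(0) = U0, u(1) = U1 on a uniform grid of n+1 nodes (hypothetical helper).
h = 1/n;
x = (0:n)'*h;                               % nodes x_0, ..., x_n
e = ones(n-1,1);
A = spdiags([e -2*e e], -1:1, n-1, n-1);    % tri-diagonal matrix of (6.2)
b = h^2 * f(x(2:n));                        % right-hand side h^2 f_k at interior nodes
b(1)   = b(1)   - U0;                       % boundary data folded into the first equation
b(end) = b(end) - U1;                       % ... and into the last equation
u = [U0; A\b; U1];                          % full solution including the boundary values
end

For instance, u = dirichlet_poisson_1d(@(x) sin(x), 0, 0, 64); the tri-diagonal solve is performed by the backslash operator, which exploits the sparsity discussed in section 4.2.3.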
Example 6.17. (General second order ODE with non-constant coefficients and Dirichlet BCs) Construct
a Finite-Difference scheme for the solution of the boundary value problem
a(x)u xx +b(x)u x +c(x)u = f (x), u(x0 ) =U0, u(xn ) =U1,
where a(x),b(x),c(x) and f (x) are known functions of x, the constants U0 and U1 are given.

Solution. The numerical solution of ordinary differential equations with non-constant coefficients follows
a similar pattern to the constant coefficient case in the previous example.
Let x0 and xn be the left hand and right hand endpoints of an interval which is uniformly dissected by
the points xk = x0 + k h where h = (xn − x0 )/n. At the k th internal node of the interval [x0,xn ], the finite
difference representation of the differential equation is
a_k [ (u_{k+1} − 2u_k + u_{k−1})/h² + O(h²) ] + b_k [ (u_{k+1} − u_{k−1})/(2h) + O(h²) ] + c_k u_k = f_k
where ak = a(xk ), bk = b(xk ), ck = c(xk ) and fk = f (xk ). This equation is now multiplied by 2h2 and the
terms re-arranged to give
(2ak +hbk )uk+1 −(4ak −2h2 ck )uk +(2ak −hbk )uk−1 = 2h2 fk +O(h4 )
for k = 1,2,···(n−1). As previously, the first and last equations take advantage of the boundary conditions
to get the final system of equations
−(4a1 −2h2 c1 )u1 +(2a1 +hb1 )u2 = 2h2 f1 −(2a1 −hb1 )U0 +O(h4 )
.. ..
. = .
(2ak −hbk )uk−1 −(4ak −2h2 ck )uk +(2ak +hbk )uk+1 = 2h2 fk +O(h4 ) .
.. ..
. = .
(2an−1 −hbn−1 )un−2 −(4an−1 −2h2 cn−1 )un−1 = 2h2 fn−1 −(2an−1 +hbn−1 )U1 +O(h4 )

• The matrix of the resulting set of linear algebraic equations is again tri-diagonal. Efficient methods
for the solution of tri-diagonal systems exist as discussed in section 4.2.3.
• The order of convergence is 2.
^

6.2.3 Norms

The following definitions will be useful in discussing the questions of convergence and measuring errors of
vectors and functions.
Definition 6.18. A norm || · || on a vector space V is a real valued function such that for every u,v ∈ V and
for every a ∈ R
1. ||v|| ≥ 0; and ||v|| = 0 if and only if v = 0,
2. ||u+v|| ≤ ||u||+||v||, (triangle inequality),
3. ||av|| = |a|||v||.
There are many ways to define a vector norm that satisfy the above definition. Some examples follow.
Example 6.19. (Vector norms in Rn ) Let x = (x1,x2,...,xn ) ∈ Rn .
• p-norm or L_p-norm
||x||_p = ( Σ_{i=1}^{n} |x_i|^p )^{1/p} = ( |x_1|^p + |x_2|^p + ... + |x_n|^p )^{1/p},   p ≥ 1,
• L_∞-norm or maximum norm (or uniform norm)
||x||_∞ = max_{1≤i≤n} |x_i| = max( |x_1|, |x_2|, ..., |x_n| ).
It can be shown that this is the limit of the L_p-norm as p → ∞.

Definition 6.20. (Matrix norm) The norm of a matrix A ∈ R^{m×m} is denoted by ||A|| and has the property
that C = || A|| is the smallest value of the constant C for which the bound
|| Ax|| ≤ C||x||

holds for every vector x ∈ R^m. Alternatively,
||A|| = max_{x≠0} ||Ax||/||x|| = max_{||x||=1} ||Ax||.

Example 6.21. (Matrix norms in R^{n×m})
• ||A|| = sup_{||x||=1, x∈R^m} ||Ax||,
• ||A||_1 = max_j Σ_i |a_{ij}|   (maximum absolute column sum),
• ||A||_2 = √ρ(AᵀA),
• ||A||_∞ = max_i Σ_j |a_{ij}|   (maximum absolute row sum).

Functions may be considered as infinite-dimensional vectors. Hence, function norms are induced from
vector norms by replacing summation by integration.
Example 6.22. (Function norms)
• p-norm: ||f||_p = ( ∫_a^b |f(x)|^p dx )^{1/p},
• Uniform norm: ||f||_∞ = max_{[a,b]} |f(x)|.

6.2.4 “Experimental measurement” of the order of accuracy

Finite-difference schemes are discretization methods and so their order of accuracy is defined as in section ??.
In simple cases it is directly possible to determine the order of convergence of a finite-difference scheme as
illustrated in the simple examples of the previous section. In more involved examples the order of convergence
of a scheme is not immediately obvious. Thus, a method for “experimental measurement” of the order of con-
vergence is desirable. In fact, Theorem ?? already provides such method even when the exact solution of a BVP
is not known. If the exact solution is known as is the case in some examples below, this can be done even easier.
Recall results for “Experimental measurement” of the order of accuracy:
Claim 6.23 (When exact solution is known). Let N[ f ](h) be a discretisation scheme for approximation
of the problem M[ f ]. The order of accuracy of N[ f ](h) is
2^p + O(h^{q−p}) = ( M[f] − N[f](2h) ) / ( M[f] − N[f](h) ).

Proof. Consider the approximations


M[ f ] = N[ f ](h)+cp h p +cq hq +o(hq ),
M[ f ] = N[ f ](2h)+cp (2h) p +cq (2h)q +o(hq ),
as a linear system of equations for unknowns p and cp . Eliminate cp to get the result. 

Claim 6.24 (When exact solution is not known). The order of accuracy of a discretisation scheme N[ f ](h) is
2^p + O(h^{q−p}) = ( N[f](4h) − N[f](2h) ) / ( N[f](2h) − N[f](h) ).

Proof. This is Claim ?? that we have proven previously. 
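As a small illustration (not from the notes), the following Matlab lines apply Claim 6.24 to three hypothetical values N[f](h), N[f](2h), N[f](4h) to estimate the order p.

N_h  = 1.0001;  N_2h = 1.0004;  N_4h = 1.0016;   % hypothetical approximations N[f](h), N[f](2h), N[f](4h)
ratio = (N_4h - N_2h)/(N_2h - N_h);              % approaches 2^p as h -> 0, by Claim 6.24
p_est = log2(ratio);                             % estimated order of accuracy (= 2 for these made-up numbers)
fprintf('estimated order p = %.2f\n', p_est);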

Example 6.25. Let the numerical solution and the exact solution of the ODE problem u'' = f(x) at the node
x_k be N[f](h) = û_k and M[f] = u_k, respectively. Give a representation of M[f] − N[f](h).

Solution. Observe that expressions like M[f] − N[f](h) actually represent truncation errors. We can have
various representations of M[f] − N[f](h), say
• Local truncation error E_k[n] = u_k − û_k,
• (Global) root-mean-square error defined by the L_2 vector norm
E_n(u − û) = ||u − û||_2/(n+1) = (1/(n+1)) [ Σ_{k=0}^{n} (u_k − û_k)² ]^{1/2},   (6.3)
• (Global) mean absolute error defined by the L_1 vector norm
E_n(u − û) = ||u − û||_1/(n+1) = (1/(n+1)) Σ_{k=0}^{n} |u_k − û_k|.   (6.4)

Other error measures are possible. Measures usually paint the same picture, but some differences appear,
e.g. it is well known that the L 1 error is more discriminating. For convenience we shall always use the
L 2 norm for measuring errors in the following and use the notation En to denote the value of the L 2 norm
when applied to a dissection containing 2n intervals of uniform length h. ^

Further in this chapter the orders of convergence of various finite difference schemes are measured
experimentally by applying Claim 6.23 to examples with known solutions.
Example 6.26. Measure the order of convergence of the finite-difference scheme of Example 6.17 written for
the boundary value problem
u xx = sin(αx), u(0) = u(1) = 0.

Solution. This corresponds to the choices f(x) = sin(αx) and U0 = U1 = 0.


The exact solution to this differential equation is
u(x) = ( x sin(α) − sin(αx) ) / α².
Table 6.1 gives the L 1 and L 2 norms of the numerical solution to this differential equation for various choices
of n when α = 1 and α = 50 respectively.

α=1 α = 50
n En En−1 /En En En−1 /En
4 0.0002242 − 0.0121297 −
8 0.0000561 3.904 0.0121549 0.960
16 0.0000140 4.001 0.0001235 0.990
32 0.0000035 4.002 0.0000663 0.984
64 0.0000009 4.000 0.0000151 1.860
128 0.0000002 4.000 0.0000037 4.401
256 0.0000001 4.000 0.0000009 4.090
512 0.0000000 4.000 0.0000002 4.020
1024 0.0000000 4.000 0.0000001 4.010

Table 6.1: Comparison of L 2 errors in the solution of u xx = sin(αx) for the boundary conditions
u(0) = u(1) = 0 when α = 1 and when α = 50.

Points to note: First, an accurate solution for α = 1 and α = 50 can be obtained by using a suitably fine
mesh. Of course, in practice the finite numerical precision available in the computation will intervene and
provide a practical limit to the maximum available accuracy. Second, the ratio of errors has limiting value
4 independent of the value of α. As n is doubled, h is halved, and the error decreases by a factor of 4, that
is, the error is proportional to h2 . This is in complete agreement with the presumed accuracy of the finite
difference approximation of the second derivative.
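A Matlab sketch of this “experimental measurement” is given below. It is illustrative rather than the code used to generate Table 6.1, and it assumes the Dirichlet scheme of Example 6.16 together with the error measure (6.3).

alpha = 1;
uexact = @(x) (x*sin(alpha) - sin(alpha*x))/alpha^2;   % exact solution quoted above
Eprev = NaN;
for n = [4 8 16 32 64 128]
    h = 1/n;  x = (0:n)'*h;
    e = ones(n-1,1);
    A = spdiags([e -2*e e], -1:1, n-1, n-1);           % interior difference operator
    b = h^2*sin(alpha*x(2:n));                         % U0 = U1 = 0 here
    u = [0; A\b; 0];
    En = norm(u - uexact(x), 2)/(n+1);                 % L2 error measure (6.3)
    fprintf('n = %4d   En = %.7e   ratio = %.3f\n', n, En, Eprev/En);
    Eprev = En;
end

The printed ratio should approach 4 as n is doubled, in agreement with Table 6.1.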

Example 6.27. Measure the order of convergence of finite-difference scheme of Example 6.17 written for
the boundary value problem
x 2 u xx + xu x +u = x, u(ε) = u(1) = 0,
in which ε ∈ (0,1).
Solution. This equation corresponds to the choices a(x) = x 2 , b(x) = f (x) = x, c(x) = 1 and U0 = U1 = 0.
The exact solution to this differential equation is
u(x) = (1/2) [ x + ( sin(log(x/ε)) − ε sin(log x) ) / sin(log ε) ].
Table 6.2 gives the L 1 and L 2 norms of the numerical solution to this differential equation for various choices
of n when ε = 0.01 and when ε = 0.001.

ε = 0.01 ε = 0.001
n En En−1 /En En En−1 /En
4 1.8429306 − 4.4799446 −
8 0.6984934 0.141 0.0312476 0.127
16 0.2161714 2.638 0.3327640 143.4
32 0.0752496 3.231 0.5445338 0.094
64 0.0206367 2.873 0.8809907 0.611
128 0.0045843 3.646 3.9401618 0.618
256 0.0010199 4.502 0.5773118 0.224
512 0.0002431 4.495 0.1126978 6.825
1024 0.0000599 4.195 0.0236637 5.123
4096 0.0000037 4.056 0.0011824 4.762
16384 0.0000002 4.004 0.0000719 4.292
65536 0.0000000 4.000 0.0000045 4.023

Table 6.2: Comparison of L 2 errors in the solution of x 2 u xx + xu x +u = x for the boundary conditions
u(ε) = u(1) = 0 when ε = 0.01 and when ε = 0.001.

As previously, it is clear that the finite difference approach based on central difference formulae has error
O(h2 ) provided h is suitably small. The difficulty in this application stems from the fact that the differential
equation has a singular solution at x = 0. The closer ε approaches zero, the more troublesome will be the
numerical solution. All numerical procedures to solve this differential equation will experience the same
difficulty. For example, in the vicinity of x = ε the ratio u_xx/u has order ε^{−2}, which is very large when ε is
small. Consequently the error in the finite difference approximation to the derivatives of u around x = ε will
experience a true numerical error of order (h/ε)2 . This is only small when h is very small. The standard way
to eliminate problems of this sort is to re-express the differential equation in terms of another independent
variable, say z = log(x) in this particular example. ^

6.2.5 BVP with gradient boundary conditions

We consider several alternative ways to impose gradient (aka Neumann or flux) BCs.

6.2.5.1 First-order for-/backward differences

Remark 6.28. First-order accurate forward and/or backward difference formulas for the derivatives can
be used at the appropriate boundaries. We recall the relevant formulas.
Left hand boundary:  u_x(x_0) = (u_1 − u_0)/h + O(h).        Right hand boundary:  u_x(x_n) = (u_n − u_{n−1})/h + O(h).

Example 6.29. Use the finite difference algorithm to solve the differential equation
a(x)u xx +b(x)u x +c(x)u = f (x), u(x0 ) =U0, u x (xn ) = G1
where a(x),b(x),c(x) and f (x) are known functions of x, and the constants U0 and G1 are given. Use second-
order accurate formulas for the derivatives at the internal points and first-order accurate for-/backward
formulas at the boundaries.
Solution. One enforces the ordinary differential equation at each interior node of [x0,xn ] and uses the finite
difference representation of the boundary condition at x = xn . Note that one further equation is now required
because un is no longer known and has to be determined as part of the solution process. The finite difference
equations are

−(4a1 −2h2 c1 )u1 +(2a1 +hb1 )u2 = 2h2 f1 −(2a1 −hb1 )U0 +O(h4 )
.. ..
. = .
(2ak −hbk )uk−1 −(4ak −2h2 ck )uk +(2ak +hbk )uk+1 = 2h2 fk +O(h4 )
.
.. ..
. = .
(2an−1 −hbn−1 )un−2 −(4an−1 −2h2 cn−1 )un−1 +(2an +hbn )un = 2h2 fn−1 +O(h4 )
−un−1 +un = hG1 +O(h2 )

^
Remark 6.30. The inferior accuracy of the last equation degrades the accuracy of the entire solution.
Example 6.31. Measure the experimental error of the numerical solution of
x 2 u xx + xu x +u = x, u(ε) = u 0(1) = 0, ε = 0.01.
found by the above procedure.
Solution. This equation corresponds to the choices a(x) = x 2 , b(x) = f (x) = x, c(x) = 1 and U0 = G1 = 0
and has exact solution
u(x) = (1/2) [ x + ( sin(log(ε/x)) − ε cos(log x) ) / cos(log ε) ].
Table 6.3 illustrates the error structure of the numerical solution based on the finite difference method.

n En En−1 /En
4 3.7369443 −
8 3.4181676 1.246
16 3.1350210 1.093
32 2.5472682 1.090
64 1.4788262 1.231
128 0.5404178 1.722
256 0.1713286 2.736
512 0.0598533 3.154
1024 0.0235584 2.862
2048 0.0101889 2.541
4096 0.0046959 2.312
8192 0.0022481 2.169
32768 0.0005433 2.089

Table 6.3: Comparison of L 2 errors in the solution of x 2 u xx + xu x +u = x where u(0.01) = u 0(1) = 0. The
gradient boundary condition is implemented to O(h) accuracy.

It is clear from Table 6.3 that the accuracy of the finite difference scheme is degraded to the accuracy of
its worst equation - in this case the boundary condition equation which is O(h) accurate. We need O(h2 )
accurate forward and backward difference formulae for gradients. ^

6.2.5.2 Second-order for-/backward differences

Remark 6.32. Second-order accurate forward and/or backward difference formulas for the derivatives
can be used at the appropriate boundaries. We recall the relevant formulas.
Left hand boundary:  u_x(x_0) = (−3u_0 + 4u_1 − u_2)/(2h) + O(h²).        Right hand boundary:  u_x(x_n) = (u_{n−2} − 4u_{n−1} + 3u_n)/(2h) + O(h²).
Example 6.33. Use the finite difference algorithm to solve the differential equation
a(x)u xx +b(x)u x +c(x)u = f (x), u(x0 ) =U0, u x (xn ) = G1
where a(x),b(x),c(x) and f (x) are known functions of x, and the constants U0 and G1 are given. Use
second-order accurate centred formulas for the derivatives at the internal points and fore-/backward
formulas at the boundaries.
Solution. This is the same problem as in Example 6.29. The new finite difference formulation of the original problem only
differs from the original formulation in respect of one equation, namely that contributed by the boundary
condition at x = 1. The numerical form is
−(4a1 −2h2 c1 )u1 +(2a1 +hb1 )u2 = 2h2 f1 −(2a0 −hb0 )U0 +O(h4 )
.. ..
. = .
(2ak −hbk )uk−1 −(4ak −2h2 ck )uk +(2ak +hbk )uk+1 = 2h2 fk +O(h4 )
.. .. .
. = .
(2an−1 −hbn−1 )un−2 −(4an−1 −2h2 cn−1 )un−1 +(2an +hbn )un = 2h2 fn−1 +O(h4 )
un−2 −4un−1 +3un = 2hG1 +O(h3 )

Unfortunately, the matrix of the resulting system of equations is no longer tri-diagonal, but this causes no
difficulty since the matrix can be converted to tri-diagonal form by a row operation in which the penultimate
row of the matrix is used to eliminate un−2 from the last row of the matrix. ^
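The following Matlab fragment sketches this row operation for the simpler constant-coefficient problem u_xx = f(x), u(0) = U0, u_x(1) = G1 (an assumption made only to keep the example short; the data f, U0 and G1 below are hypothetical).

n = 64;  h = 1/n;  x = (0:n)'*h;
f = @(x) sin(pi*x);  U0 = 0;  G1 = 0;            % hypothetical data
A = sparse(n, n);  b = zeros(n, 1);              % unknowns u_1, ..., u_n
for k = 1:n-1                                    % interior equations u_{k-1} - 2u_k + u_{k+1} = h^2 f_k
    if k > 1, A(k, k-1) = 1; end
    A(k, k) = -2;  A(k, k+1) = 1;
    b(k) = h^2*f(x(k+1));
end
b(1) = b(1) - U0;                                % fold in the Dirichlet value at x = 0
A(n, n-2) = 1;  A(n, n-1) = -4;  A(n, n) = 3;    % O(h^2) backward-difference boundary row
b(n) = 2*h*G1;
A(n, :) = A(n, :) - A(n-1, :);                   % row operation: eliminate u_{n-2} ...
b(n)   = b(n)   - b(n-1);                        % ... restoring tri-diagonal form
u = [U0; A\b];                                   % full solution including u_0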
Example 6.34. Measure the experimental error of the numerical solution of
x 2 u xx + xu x +u = x, u(ε) = u 0(1) = 0, ε = 0.01.
found by the above procedure.

n En En−1 /En n En En−1 /En


4 3.7567566 − 512 0.0261418 4.060
8 3.4135978 1.116 1024 0.0064812 4.097
16 3.1071541 1.101 2048 0.0016165 4.033
32 2.4869724 1.099 4096 0.0004039 4.009
64 1.3708621 1.249 8192 0.0001009 4.002
128 0.4348676 1.814 16384 0.0000252 4.001
256 0.1071061 3.152 32768 0.0000063 4.000

Table 6.4: L 2 errors in the solution of x 2 u xx + xu x +u = x with boundary conditions u(0.01) = u x (1) = 0
are compared. The gradient boundary condition is implemented to O(h2 ) accuracy.

Solution. Table 6.4 illustrates the behaviour of the numerical solution based on the finite difference method
in which the gradient boundary condition is implemented to O(h2 ) accuracy. Clearly O(h2 ) accuracy is
now recovered by using the second order accurate implementation of the boundary condition. ^

6.2.5.3 Fictitious nodes

Remark 6.35. Second-order accurate centred difference formulas for the derivatives can be used at the
appropriate boundaries. We recall the relevant formulas.
u_x(x_0) = (u_1 − u_{−1})/(2h) + O(h²),        u_x(x_n) = (u_{n+1} − u_{n−1})/(2h) + O(h²).
This however introduces the so called “fictitious nodes”.
Example 6.36. Use the finite difference algorithm to solve the differential equation
a(x)u xx +b(x)u x +c(x)u = f (x), u(x0 ) =U0, u x (xn ) = G1
where a(x),b(x),c(x) and f (x) are known functions of x, and the constants U0 and G1 are given. Use
second-order accurate CENTERED formulas throughout.
Solution. The most direct way to derive a second order accurate boundary condition is to use the central
difference formulae
u_x(x_0) = (u_1 − u_{−1})/(2h) + O(h²),        u_x(x_n) = (u_{n+1} − u_{n−1})/(2h) + O(h²).
The difficulty with these approximations is that they introduce fictitious solutions u−1 and un+1 at the
fictitious nodes x−1 (for u x (x0 )) and xn+1 (for u x (xn )) respectively. Both nodes are fictitious because they
lie outside the region in which the differential equation is valid. The introduction of a fictitious node creates
another unknown to be determined. The procedure is as follows. The differential equation is assumed to
be valid at the node x_n to obtain
a_n [ (u_{n+1} − 2u_n + u_{n−1})/h² + O(h²) ] + b_n G_1 + c_n u_n = f_n,        (u_{n+1} − u_{n−1})/(2h) = G_1 + O(h²).
The task is now to eliminate un+1 between both equations. To do this, multiply the first equation by h2
and the second equation by 2han and subtract to get
2an un−1 +(h2 cn −2an )un = h2 fn −(2an +hbn )hG1 +O(h3 ).
This equation is now used as the final equation of the tri-diagonal system which becomes

−(4a1 −2h2 c1 )u1 +(2a1 +hb1 )u2 = 2h2 f1 −(2a0 −hb0 )U0 +O(h4 )
.. ..
. = .
(2ak −hbk )uk−1 −(4ak −2h2 ck )uk +(2ak +hbk )uk+1 = 2h2 fk +O(h4 )
.
.. ..
. = .
(2an−1 −hbn−1 )un−2 −(4an−1 −2h2 cn−1 )un−1 +(2an +hbn )un = 2h2 fn−1 +O(h4 )
2an un−1 +(h2 cn −2an )un = h2 fn −(2an +hbn )hG1 +O(h3 )

^
Example 6.37. Measure the experimental error of the numerical solution of
x 2 u xx + xu x +u = x, u(ε) = u 0(1) = 0, ε = 0.01.
found by the above procedure.
Solution. Table 6.5 compares the error structure of the backward difference and fictitious node implementations
of the gradient boundary condition when ε = 0.01.
^

6.2.6 Non-linear equations

Remark 6.38. When the continuous problem is non-linear, the resulting algebraic equations obtained by
the application of the finite-difference procedure are also nonlinear.
Sets of nonlinear algebraic equations are solved by fixed-point iteration methods as discussed in Chapter 1.

Backward difference Fictitious nodes


n En En−1 /En En En−1 /En
4 3.7567566 − 3.7573398 −
8 3.4135978 1.116 3.4144871 1.234
16 3.1071541 1.101 3.1108516 1.100
32 2.4869724 1.099 2.4917194 1.098
64 1.3708621 1.249 1.3755215 1.248
128 0.4348676 1.814 0.4371938 1.811
256 0.1071061 3.152 0.1078104 3.146
512 0.0261418 4.060 0.0263256 4.055
1024 0.0064812 4.097 0.0065275 4.095
2048 0.0016165 4.033 0.0016281 4.033
4096 0.0004039 4.009 0.0004068 4.009
8192 0.0001009 4.002 0.0001017 4.002
16384 0.0000252 4.001 0.0000254 4.001
32768 0.0000063 4.000 0.0000064 4.000

Table 6.5: Comparison of L 2 errors in the solution of x 2 u 00 + xu 0 + u = x for the boundary conditions
u(0.01) = u 0(1) = 0. The results on the left use the backward difference implementation and those on the right
use the fictitious nodes implementation of the gradient boundary condition u 0(1) = 0. Both implementations
are O(h2 ) accuracy.

6.2.6.1 The multi-dimensional Newton-Raphson method

Claim 6.39. The Taylor series expansion formula for a function F : R^n → R^m is
F(r + a) = Σ_{k=0}^{∞} (1/k!) (a·∇)^k F(r),
where r is the position vector, ∇ is the nabla operator, so a·∇ is the directional derivative in the direction
of a. The function F(r) is assumed to have the required properties and the sum is assumed to converge.

Proof. Omitted.

Definition 6.40. Consider a function F : R^n → R^m. The matrix of first-order partial derivatives
J_{ij} = ∂F_i/∂x_j
is known as the Jacobian of F.
Example 6.41. Write out the Taylor series expansion of a function F : R2 → R2 near a point r0 to O((r−r0 )2 )
in terms of the Jacobian of the function.
Solution.
F(r) = F(r0 +(r−r0 )) = F(r0 )+((r−r0 )·∇)F(r0 )+O((r−r0 )2 )
= F(r0 )+(r−r0 )·(∇F(r0 ))+O((r−r0 )2 )
= F(r0 )+(r−r0 )· J[F(r0 )]+O((r−r0 )2 ),
where
∇F ≡ J[F] = ∂F_i/∂x_j
is a matrix of 1st order partial derivatives known as the Jacobian. ^

With this we can easily derive the formula of the multi-dimensional Newton-Raphson method.
Claim 6.42. The multi-dimensional Newton-Raphson method is
rn+1 = G(rn ) = rn − J −1 (rn )F(rn ), (6.5)

where the notation J(rn ) = J[F(rn )] is used for brevity.

Proof. Obvious modification of the proof of Claim ?? of Chapter 1. 

Remark 6.43. Inverting matrices is computationally expensive, so the multi-dimensional Newton-Raphson formula (6.5)
is used in the equivalent form
J(r0 )r∗ = J(r0 )r0 −F(r0 ).
This represents a set of algebraic equations for the components of the vector r∗ and so the various efficient
methods presented in Chapter 4 may be used.
Example 6.44. Find the solutions of the simultaneous equations
x y−sinx − y 2 = 0, x 2 y 3 −tany−1 = 0.
using the Newton-Raphson algorithm.
Solution. Here F(X) = [ xy − sin x − y², x²y³ − tan y − 1 ] and the Jacobian of this vector is

J(X) = [ y − cos x      x − 2y          ]
       [ 2xy³           3x²y² − sec²y   ]

With starting values X = [0.0, −1.0] the first iteration produces the linear system

[ −2      2         ] [ x1 ]   [ −2      2         ] [  0 ]   [      −1        ]
[  0   −sec²(−1)    ] [ y1 ] = [  0   −sec²(−1)    ] [ −1 ] − [ −1 − tan(−1)   ]
Then the first 10 iterations assuming a fixed A and with A updated as iteration proceeds (Newton-Raphson)
are given in Table 6.6.

A not updated Newton-Raphson


n x y x y
1 −0.3372779 −0.8372779 −0.3372779 −0.8372779
2 −0.3686518 −0.8247921 −0.3762218 −0.8235464
3 −0.3748875 −0.8230965 −0.3767514 −0.8235034
4 −0.3762579 −0.8230912 −0.3767515 −0.8235034
.. .. .. .. ..
. . . . .
15 −0.3767514 −0.8235032 −0.3767515 −0.8235034
16 −0.3767515 −0.8235034 −0.3767515 −0.8235034

Table 6.6: Estimated solutions of the simultaneous equations xy −sinx − y 2 = 0 and x 2 y 3 −tany −1 = 0
using a fixed A and a continually updated A (Newton-Raphson).
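A Matlab sketch of the Newton-Raphson column of Table 6.6 follows; it is illustrative and not the code that produced the table.

F = @(X) [ X(1)*X(2) - sin(X(1)) - X(2)^2;
           X(1)^2*X(2)^3 - tan(X(2)) - 1 ];
J = @(X) [ X(2) - cos(X(1)),   X(1) - 2*X(2);
           2*X(1)*X(2)^3,      3*X(1)^2*X(2)^2 - sec(X(2))^2 ];
X = [0.0; -1.0];                        % starting values as in the example
for k = 1:16
    X = X - J(X)\F(X);                  % solve J*dX = F(X) rather than inverting J
end
fprintf('x = %.7f, y = %.7f\n', X(1), X(2));   % should approach -0.3767515, -0.8235034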

6.2.6.2 Use in nonlinear FD

Example 6.45. Use the finite difference algorithm to solve the boundary value problem
u xx = u2, u(0) = 0, u(1) = 1.

Solution. In the usual notation, the finite difference approximation to this equation is
uk+1 −2uk +uk−1 −h2 uk2 = 0, k = 1,···,(n−1)

where u0 = 0 and un = 1. These values appear in the first and last equations and so F is
F1 = u2 −2u1 −h2 u12
.. ..
. = .
Fk = uk+1 −2uk +uk−1 −h2 uk2 .
.. ..
. = .
Fn−1 = 1−2un−1 +un−2 −h2 u2n−1
The Jacobian matrix is J_{ik} = ∂F_i/∂u_k. In this example, it is clear that J is tri-diagonal. To be specific,

    [ −2−2h²u_1       1             0            ···         0           ]
    [     1        −2−2h²u_2        1            ···         0           ]
J = [     0            1         −2−2h²u_3       ···         0           ]
    [     ⋮            ⋱             ⋱            ⋱          ⋮           ]
    [     0           ···            0             1    −2−2h²u_{n−1}    ]

The algorithm is started by guessing the initial solution. Often a suitable guess is obtained by assuming
any sensible function of x connecting the boundary data - say u(x) = x in this instance. The refinement
of the mesh is set by choosing N and the initial estimate of u0,u1,···,un determined from the guessed solution.
The value of n is then doubled and the computation repeated. The root mean square error norm
E_n = [ (1/(n+1)) Σ_{k=0}^{n} ( u_k^{(1)} − u_{2k}^{(2)} )² ]^{1/2}
is then used to decide when the algorithm should terminate. With a stopping criterion of En < 10−8 , Table
6.7 shows the result of twelve iterations of the Newton-Raphson algorithm.

N EN E N −1 /E N
4 0.0001037 −
8 0.0000262 3.959
16 0.0000066 3.992
32 0.0000016 3.998
64 0.0000004 4.000

Table 6.7: Estimated L_2 norm of the errors incurred in the solution of u_xx = u² with boundary conditions u(0) = 0 and u(1) = 1. The
accuracy is based on 12 iterations of the Newton-Raphson algorithm.
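The following Matlab sketch implements the Newton-Raphson iteration for this nonlinear scheme (illustrative only; the grid size and the number of iterations are simply chosen to mirror the example).

n = 64;  h = 1/n;  x = (0:n)'*h;
u = x;                                       % initial guess connecting the boundary data
v = u(2:n);                                  % interior unknowns u_1, ..., u_{n-1}
e = ones(n-1,1);
D2 = spdiags([e -2*e e], -1:1, n-1, n-1);    % second-difference operator
bc = zeros(n-1,1);  bc(end) = 1;             % contributions of u_0 = 0 and u_n = 1
for iter = 1:12
    F = D2*v + bc - h^2*v.^2;                % residual of the difference equations
    J = D2 - spdiags(2*h^2*v, 0, n-1, n-1);  % tri-diagonal Jacobian
    v = v - J\F;                             % Newton-Raphson update
end
u = [0; v; 1];                               % assembled finite-difference solution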

6.2.7 Elliptic equations in R2

The prototypical equations in this class are the Laplace and the Poisson equations, both involving the
Laplacian operator.
Remark 6.46. The differentiation matrix of the Laplacian is sparse, that is, it is no longer tri-diagonal
but it still has a large number of zero elements.
Remark 6.47. • Direct Gauss elimination methods are very inefficient for sparse linear systems.
• Iterative methods are much more efficient for sparse systems.
• Iterative methods were introduced in Chapter 4.
Example 6.48. (May 2001) Consider the boundary value problem
u xx +uyy = −2, (x,y) ∈ D

where D is the interior of the rectangle with vertices at (0,0), (0,1), (1/2,0) and (1/2,1), and the solution
is required to satisfy the boundary conditions
u(x,0) = 1, u(x,1) = 0, u(0,y) = 1− y, u x (1/2,y) = u2 (1/2,y).
(i) Construct a finite difference representation of this problem for general h using the fictitious nodes
procedure to describe the gradient boundary condition.
(ii) Write out the specific equations to be solved in the case in which h = 1/4.
(iii) Restructure these equations in a form suitable for iteration.

Solution. (i) Take x_i = ih and y_j = jh where h = 1/(2n). In this case i = 0, 1, ···, n and j = 0, 1, ···, 2n. The
Laplace molecule gives
ui−1, j +ui+1, j +ui, j−1 +ui, j+1 −4ui, j = −2h2
where i = 1,···,(n−1) and j = 1,···,(2n−1). The boundary conditions yield
u_{0,j} = 1 − jh,   j = 0, ···, 2n,
u_{i,0} = 1,   i = 0, ···, n,
u_{i,2n} = 0,   i = 0, ···, n.
The boundary condition on x = 1/2 requires more work. The fictitious nodes procedure gives
u_{n+1,j} − u_{n−1,j} = 2h u²_{n,j},
u_{n−1,j} + u_{n+1,j} + u_{n,j−1} + u_{n,j+1} − 4u_{n,j} = −2h²,       j = 1, ···, 2n−1.
The fictitious value u_{n+1,j} is now eliminated between these equations to obtain
2u_{n−1,j} + 2h u²_{n,j} + u_{n,j−1} + u_{n,j+1} − 4u_{n,j} = −2h²,    j = 1, ···, 2n−1.
This is the final form for the boundary condition on x = 1/2.
(ii) Now take h = 1/4, i.e. n = 2. Let u_1, ···, u_6 be the unknowns defined in the diagram.
[Diagram of the grid for h = 1/4, labelling the six unknown nodes u_1, ···, u_6, omitted.]

The finite difference equations for h = 1/4 are therefore

u2 + 3/4 + 1 + u3 − 4u1 = −1/8                    u2 + u3 − 4u1 = −15/8
2u1 + 2h·u2² + 1 + u4 − 4u2 = −1/8                4u1 + u2² + 2u4 − 8u2 = −9/4
u4 + 1/2 + u1 + u5 − 4u3 = −1/8          →        u4 + u1 + u5 − 4u3 = −5/8
2u3 + 2h·u4² + u2 + u6 − 4u4 = −1/8               4u3 + u4² + 2u2 + 2u6 − 8u4 = −1/4
u6 + 1/4 + u3 + 0 − 4u5 = −1/8                    u6 + u3 − 4u5 = −3/8
2u5 + 2h·u6² + u4 − 4u6 = −1/8                    4u5 + u6² + 2u4 − 8u6 = −1/4
(iii) The j-th equation is solved for u_j to get the basic formulation of the numerical problem, which is subsequently
rewritten in iterative form and leads to the Gauss-Seidel algorithm

u1 = (1/4)( u2 + u3 + 15/8 )                          u1^(k+1) = (1/4)( u2^(k) + u3^(k) + 15/8 )
u2 = 2/(8−u2) ( 2u1 + u4 + 9/8 )                      u2^(k+1) = 2/(8−u2^(k)) ( 2u1^(k+1) + u4^(k) + 9/8 )
u3 = (1/4)( u1 + u4 + u5 + 5/8 )            →         u3^(k+1) = (1/4)( u1^(k+1) + u4^(k) + u5^(k) + 5/8 )
u4 = 2/(8−u4) ( u2 + 2u3 + u6 + 1/8 )                 u4^(k+1) = 2/(8−u4^(k)) ( u2^(k+1) + 2u3^(k+1) + u6^(k) + 1/8 )
u5 = (1/4)( u3 + u6 + 3/8 )                           u5^(k+1) = (1/4)( u3^(k+1) + u6^(k) + 3/8 )
u6 = 2/(8−u6) ( u4 + 2u5 + 1/8 )                      u6^(k+1) = 2/(8−u6^(k)) ( u4^(k+1) + 2u5^(k+1) + 1/8 )
^
Example 6.49. Construct the finite difference scheme for the solution of the partial differential equation
u xx +uyy +a(x,y)u x +b(x,y)uy +c(x,y)u = f
with Dirichlet boundary conditions. Investigate conditions under which the resulting system of linear
equations will be diagonally dominated.
Solution. The central difference representation of the differential equation is
(u_{i−1,j} + u_{i,j−1} − 4u_{i,j} + u_{i,j+1} + u_{i+1,j})/h² + a_{i,j}(u_{i+1,j} − u_{i−1,j})/(2h) + b_{i,j}(u_{i,j+1} − u_{i,j−1})/(2h) + c_{i,j} u_{i,j} = f_{i,j}.
This equation is multiplied by 2h2 and like terms collected together to obtain
(2−hai, j )ui−1, j +(2−hbi, j )ui, j−1 −(8−2h2 ci, j )ui, j +(2+hbi, j )ui, j+1 +(2+hai, j )ui+1, j = 2h2 fi, j .
Strict diagonal dominance requires that
|8−2h2 ci, j | > |2−hai, j |+|2−hbi, j |+|2+hbi, j |+|2+hai, j |.
For suitably small h, this inequality becomes
8−2h2 ci, j > 2−hai, j +2−hbi, j +2+hbi, j +2+hai, j → ci, j < 0.
Thus the finite difference representation of the partial differential equation will be strictly diagonally
dominated provided c(x,y) < 0 in the region in which the solution is sought. ^
Example 6.50. Construct the finite difference scheme for the solution of the partial differential equation
u xx +uyy = −2, (x,y) ∈ D, u = 0 on ∂D,
where D = (−1,1)×(−1,1). Iterate the finite difference scheme for chosen h, and use the exact solution
u(x,y) = 1 − y² − (32/π³) Σ_{k=0}^{∞} [ (−1)^k cosh((2k+1)πx/2) cos((2k+1)πy/2) ] / [ (2k+1)³ cosh((2k+1)π/2) ]

for the boundary value problem to investigate the convergence of the finite difference scheme.
Solution. The finite difference scheme to solve u xx +uyy = −2 is
ui−1, j +ui, j−1 −4ui, j +ui, j+1 +ui+1, j = −2h2,
which may be re-expressed in the form
u_{i,j} = (1/4) [ u_{i−1,j} + u_{i,j−1} + u_{i,j+1} + u_{i+1,j} ] + h²/2     (6.6)
with h = 2/n, xi = −1+ih and y j = −1+ j h. Equation (6.6) may be used as the basis of the iterative algorithm
u_{i,j}^{(k+1)} = (1/4) [ u_{i−1,j}^{(k+1)} + u_{i,j−1}^{(k+1)} + u_{i,j+1}^{(k)} + u_{i+1,j}^{(k)} ] + h²/2,     u_{i,j}^{(0)} = 0     (6.7)
where the initial solution is taken to be u = 0, which satisfies the boundary condition. For example, when
n = 2 then u_{1,1}^{(1)} = u_{1,1}^{(2)} = ··· = h²/2 and the iterative algorithm clearly terminates after two iterations. To
control the number of Gauss-Seidel iterations, the root mean square error norm
E_n = [ (1/(n+1)²) Σ_{i,j=0}^{n} ( u_{i,j}^{new} − u_{i,j}^{old} )² ]^{1/2} = (1/(n+1)) [ Σ_{i,j=0}^{n} ( u_{i,j}^{new} − u_{i,j}^{old} )² ]^{1/2}
is used to terminate the Gauss-Seidel iteration whenever E_n < 10^{−6}. The result of this procedure, now measuring
the numerical error against the true solution, is given in Table 6.8.
the numerical error against the true solution, is given in Table 6.8.

n ITER En En−1 /En


2 2 0.0893708 −
4 19 0.0207940 4.298
8 70 0.0048225 4.312
16 245 0.0011738 4.109
32 837 0.0003828 3.067
64 2770 0.0004880 0.784
128 8779 0.0016882 0.289

Table 6.8: Comparison of L 2 errors in the solution of u xx +uyy = −2 when using Gauss-Seidel iteration
and the termination condition En < 10−6 .

Clearly the Gauss-Seidel algorithm is using many more iterations to produce an inferior result as the mesh
is refined by increasing n. This is a counter-intuitive result. The difficulty lies in the stopping criterion.
The use of an absolute termination condition can be problematic when an iterative procedure converges
very slowly as happens here. The usual way to ameliorate this hazard is to use a termination condition
which becomes increasingly demanding as the number of iterations increases. One popular choice is to
make the termination condition inversely proportional to the number of iterations performed. The same
calculation repeated with the termination condition En < 10−5 /IT E R, where IT E R counts the number
of iterations of the Gauss-Seidel algorithm is shown in Table 6.9.

n ITER En En−1 /En IT E R/n2


2 2 0.0893708 − 0.500
4 20 0.0207934 4.298 1.250
8 83 0.0048172 4.316 1.297
16 336 0.0011487 4.194 1.313
32 1344 0.0002802 4.100 1.313
64 5379 0.0000696 4.024 1.313
128 21516 0.0000178 3.901 1.313

Table 6.9: Comparison of L 2 errors in the solution of u xx +uyy = −2 when using Gauss-Seidel iteration
and the termination condition En < 10−5 /IT E R.

The fact that ITER/n² is effectively constant is simply another manifestation of the fact that the numerical
accuracy is proportional to h², that is, 1/n². ^
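A Matlab sketch of the Gauss-Seidel iteration (6.7), including an iteration-dependent stopping criterion of the kind just described, is shown below (illustrative; the tolerance and grid size are hypothetical).

n = 32;  h = 2/n;
u = zeros(n+1, n+1);                     % zero initial guess; boundary rows/columns stay zero
tol = 1e-5;  iter = 0;  err = inf;
while err > tol/max(iter,1)              % termination tightens as iterations accumulate
    uold = u;
    for i = 2:n                          % sweep over interior nodes only
        for j = 2:n
            u(i,j) = 0.25*(u(i-1,j) + u(i,j-1) + u(i,j+1) + u(i+1,j)) + h^2/2;
        end
    end
    iter = iter + 1;
    err = sqrt(sum((u(:) - uold(:)).^2))/(n+1);
end
fprintf('converged after %d iterations\n', iter);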

6.2.8 Richardson extrapolation

To obtain FD methods of greater accuracy, we may (a) reduce the discretization step h or (b) use methods
of acceleration of convergence.
Acceleration of convergence was already discussed in Chapter 3 section ?? and all results hold for
FD-discretizations.
Example 6.51. (Richardson extrapolation for FD schemes) Consider a FD numerical scheme A[uh ] = 0
for the solution of a continuous problem P[u] = 0 such that the error in solutions is
u−uh = O(h2 ),
where u is the exact solution of P[u] = 0 and uh is the exact solution of A[uh ] = 0. Combine two solutions u(1)
and u(2) of A[uh ] = 0 obtained at discretisation steps h1 and h2 , respectively, to obtain a new approximation
w of higher accuracy.
Solution. Since the error is O(h²) then
u^{(1)} = u + c h_1² + O(h_1³),
u^{(2)} = u + c h_2² + O(h_2³).
Eliminate the unknown constant c between these two equations to obtain
u = ( h_2² u^{(1)} − h_1² u^{(2)} ) / (h_2² − h_1²) + O(h_2²h_1³ − h_1²h_2³)/(h_2² − h_1²) = w + O(h_2²h_1³ − h_1²h_2³)/(h_2² − h_1²).
In the case when h_2 = h_1/2, it is easy to see that the “extrapolated” solution is
u = (4u^{(2)} − u^{(1)})/3 + O(h³).
It is also easy to see that the new approximation
w = (4u^{(2)} − u^{(1)})/3
is more accurate than either of u(1) and u(2) .
This example is a particular case of the more general formula of Claim ??.
Similarly to Romberg integration, the process described in this example may be continued further to obtain
a sequence of approximations of increasing accuracy. ^
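A tiny numerical illustration (made-up numbers, not from the notes) of the extrapolation formula:

exact = 1.0;                         % hypothetical exact value at a node
u1 = exact + 0.01;                   % O(h^2) approximation with step h   (error c*h^2)
u2 = exact + 0.0025;                 % same scheme with step h/2          (error c*(h/2)^2)
w  = (4*u2 - u1)/3;                  % extrapolated value of higher accuracy
fprintf('errors: %.4e  %.4e  %.4e\n', abs(u1-exact), abs(u2-exact), abs(w-exact));

Here the leading error term cancels exactly because the made-up errors are purely quadratic in h.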
Example 6.52. Construct the finite difference equations for the numerical solution of the boundary value
problem
u'' − u' tan x = −1,   u(0) = 0,   u'(π/4) = −1.
Apply the Richardson extrapolation method for acceleration of convergence to increase the order of
convergence. The exact solution is u(x) = logcosx.
Solution. Take h = π/(4n) then a(x) = 1, b(x) = −tanx, c(x) = 0 and f (x) = −1. The boundary conditions
are U0 = 0 and G1 = −1. The finite difference equations based on the fictitious nodes procedure are therefore
−4u1 +(2+hb1 )u2 = 2h2 f1 +O(h4 )
.. ..
. = .
(2−hbk )uk−1 −4uk +(2+hbk )uk+1 = 2h2 fk +O(h4 )
.. .. .
. = .
(2−hbn−1 )un−2 −4un−1 +(2+hbn−1 )un = 2h2 fn−1 +O(h4 )
2un−1 −2un = h2 fn +(2+hbn )h+O(h3 )
Suppose initially that n = 40 - the initial choice of n is arbitrary but should be sensible. Let the finite
difference solution computed at these nodes be u_0^{(1)}, ···, u_n^{(1)}. Now double n and recompute the new solution.

Let this solution be u_0^{(2)}, ···, u_{2n}^{(2)}. The solution u_k^{(1)} may be compared directly with the solution u_{2k}^{(2)} since
both apply at the same node. The root mean square
E_n = [ (1/(n+1)) Σ_{k=0}^{n} ( u_k^{(1)} − u_{2k}^{(2)} )² ]^{1/2}
is a measure of the error incurred in computing the original solution. In this case E40 is determined. The
procedure may be continued by doubling n again - E80 is now computed. The algorithm continues to double
n and terminates whenever En < ε where ε is a user supplied error bound. Moreover, since we already know
that the procedure is O(h2 ), we expect the ratio En−1 /En to have a limiting value of 4 so that every iteration of
this loop approximately reduces En by a factor of 4. Table 6.10 illustrates the convergence properties of En .

n En En−1 /En
40 0.0001026 −
80 0.0000261 3.922
160 0.0000066 3.960
320 0.0000017 3.980
640 0.0000004 3.990

Table 6.10: Estimated L 2 norm of the errors incurred in the solution of u xx −tanxu x = −1 with boundary
conditions u(0) = 0 and u x (π/4) = −1.

Richardson extrapolation is used to improve the estimated solution and the L_2 norm is computed with respect
to the exact solution so as to check the improved convergence. To be specific, u^{(1)} and u^{(2)} are combined
to give the more accurate solution û^{(1)} which is then used for the norm computation. In the next stage, u^{(2)}
and u^{(3)} are used to give the more accurate solution û^{(2)} which is then used for the norm computation. The
procedure is repeated. Table 6.11 records the results of these calculations.

n En En−1 /En
40 0.1166×10−5 −
80 0.1493×10−6 7.810
160 0.1890×10−7 7.902
320 0.2378×10−8 7.946
640 0.3029×10−9 7.851

Table 6.11: Estimated L_2 norm of the errors incurred in the solution of u_xx − tan x u_x = −1 with boundary
conditions u(0) = 0 and u x (π/4) = −1. The norm is computed from the exact solution and the Richardson
extrapolation of the finite difference solution.

6.2.9 Curvilinear domains

Remark 6.53. Problems on curvilinear domains are treated easily by the finite difference procedure if their
boundary curves can be represented as coordinate lines in some appropriately chosen set of coordinates.
Example 6.54. Describe a procedure for using the FD method for the solution of Poisson’s equation
∇²u = f(r⃗) inside a circle of radius a centred at the origin of the coordinate system.
Solution. In polar coordinates (r,θ), the Laplacian operator takes the form
∇² = ∂²/∂r² + (1/r) ∂/∂r + (1/r²) ∂²/∂θ².

So in polar coordinates the given equation is
∂²u/∂r² + (1/r) ∂u/∂r + (1/r²) ∂²u/∂θ² = f(r,θ).     (6.8)
The region of integration is
r ×θ = [0,a]×[0,2π],
and given boundary conditions, say Dirichlet, will be specified at r = a,
u(a,θ) = g(θ).
This equation may be solved using the finite difference procedure by taking r_i = ih and θ_j = jk in which
h = a/n and k = 2π/m. Each derivative appearing in equation (6.8) may be replaced by its finite difference
representation to get
(u_{i+1,j} − 2u_{i,j} + u_{i−1,j})/h² + (1/r_i)(u_{i+1,j} − u_{i−1,j})/(2h) + (1/r_i²)(u_{i,j+1} − 2u_{i,j} + u_{i,j−1})/k² = f_{i,j}
except at the origin. The finite difference equation contributed by the origin may be constructed by returning
to the original cartesian version of the Laplace molecule. This molecule can be used provided nodes of
θ fall on the coordinate axes, that is, when m is a multiple of 4. ^
Remark 6.55. However, domains that exhibit little or no regularity are most effectively treated using
finite-element methods.

6.3 Discretization of time-dependent problems (Time)

In this section we will derive a variety of FD schemes for time-dependent problems, primarily for the
diffusion equation. In this section we will not be concerned with boundary and initial conditions nor with
errors and convergence.
Remark 6.56. FD schemes for time-dependent problems are obtained in the same way as FD schemes
for time-independent problems – by replacing the exact derivatives with their FD approximating formulas.
Examples follow.
Remark 6.57. Time is typically discretised on a uniform grid as follows
t −→ t j = t0 + j k, y(t) −→ y(t j ) = y j , j ∈ N.

6.3.1 The Euler method for ODE IVPs

Example 6.58. Derive the forward explicit Euler method (step)


y j+1 = y j +k f (y j ,t j ) (6.9)
for approximation of the ODE IVP
dy/dt = f(y,t),     y(t_0) = y_0.

Solution. Using the forward difference formula for y', we write down the following difference equation for
the approximate solution y j
(y_{j+1} − y_j)/k = f(y_j, t_j)     (6.10)
This numerical scheme is called the forward Euler method or simply the Euler method.
The Euler method (6.10) is a difference equation for the y_j and it is straightforward to solve by noting that
we can write it in the form
y_{j+1} = y_j + k f(y_j, t_j),     (6.11)
and in this form it is clear that given y_0 we can use the difference equation repeatedly to calculate any
y_j. The initial data y_0 = y(t_0) is given by the initial conditions for the IVP.

The method is called explicit because y j+1 can be expressed explicitly by y j and iterations can be
performed. ^
Example 6.59. Derive the backward implicit Euler method (step)
y j+1 −k f (y j+1,t j+1 ) = y j (6.12)
for approximation of the ODE IVP
dy/dt = f(y,t),     y(t_0) = y_0.

Solution. Using the backward difference formula for y', we write down the following difference equation for
the approximate solution y_j
(y_{j+1} − y_j)/k = f(y_{j+1}, t_{j+1})     (6.13)
This numerical scheme is called the backward Euler method. This method is an example of an implicit
method, in which the expression is written implicitly for y_{j+1}. At every step a nonlinear equation
v = y_j + k f(v, t_{j+1})
must be solved for v to give y_{j+1}.
The explicit and implicit methods have different stability properties. ^
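The sketch below contrasts the two Euler steps on a simple model problem (the choice dy/dt = −5y, y(0) = 1 is an assumption made purely for illustration; the linear right-hand side lets the implicit equation be solved directly).

f = @(y,t) -5*y;                                  % hypothetical model problem
k = 0.1;  t = 0:k:2;  m = numel(t);
yf = zeros(1,m);  yb = zeros(1,m);
yf(1) = 1;  yb(1) = 1;
for j = 1:m-1
    yf(j+1) = yf(j) + k*f(yf(j), t(j));           % forward (explicit) Euler step (6.9)
    yb(j+1) = yb(j)/(1 + 5*k);                    % backward Euler: v = y_j - 5*k*v solved for v
end
plot(t, yf, 'o-', t, yb, 's-', t, exp(-5*t), 'k-');
legend('forward Euler', 'backward Euler', 'exact');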

6.3.2 The Euler method for the diffusion equation

Example 6.60. Derive the forward Euler method


ui, j+1 = ui, j +r(ui+1, j −2ui, j +ui−1, j )
with computational molecule
0 1 0
ut −u xx = −r −(1−2r) −r
0 0 0
for approximation of the I-BVP for the diffusion equation
u_t = u_xx.
Here and below r = k/h² is called the Courant number.
Solution. Let ui, j = u(xi,t j ), then Euler’s scheme applied to the diffusion equation gives
u_{i,j+1} = u_{i,j} + r(u_{i+1,j} − 2u_{i,j} + u_{i−1,j})
where r = k/h² is called the Courant number. This finite difference scheme may be rewritten
ui, j+1 −rui+1, j −(1−2r)ui, j −rui−1, j = 0
which in turn leads to the computational molecule
0 1 0
ut −u xx = −r −(1−2r) −r
0 0 0
^
Example 6.61. Use the Euler method to solve the diffusion equation ut = u xx where x ∈ (0, 1) with
initial condition u(x,0) = sin(2πx) and boundary conditions u(0,t) = u(1,t) = 0. The exact solution is
u(x,t) = e^{−4π²t} sin(2πx).
Solution. The results are shown in Table 6.12. Note how the efficacy of the method depends on the size of r: a small h entails an even smaller k. ^
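A Matlab sketch of the forward Euler iteration for this problem is given below (illustrative; the particular n and k are chosen so that r is small enough for stable behaviour).

n = 32;  h = 1/n;  k = 1e-4;  r = k/h^2;          % here r is about 0.1
x = (0:n)'*h;
u = sin(2*pi*x);                                  % initial condition
for j = 1:round(0.05/k)                           % advance to t = 0.05
    unew = u;
    unew(2:n) = u(2:n) + r*(u(3:n+1) - 2*u(2:n) + u(1:n-1));
    u = unew;                                     % boundary values remain zero
end
err = max(abs(u - exp(-4*pi^2*0.05)*sin(2*pi*x)));
fprintf('r = %.3f, max error at t = 0.05: %.2e\n', r, err);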

6.3.3 The Richardson method for the diffusion equation

Example 6.62. Derive the Richardson method


ui, j+1 = ui, j−1 +2r(ui+1, j −2ui, j +ui−1, j )

k values for n = 8 k values for n = 16 k values for n = 32


t u(1/4,t) 10−2 10 −3 10 −4 10−2 10 −3 10 −4 10−2 10−3 10−4
0.01 0.67 0.63 0.68 0.69 0.56 0.62 0.63 0.59 0.66 0.66
0.02 0.45 0.39 0.47 0.47 0.34 0.42 0.42 0.36 0.44 0.45
0.03 0.31 0.24 0.32 0.32 0.21 0.28 0.29 0.22 0.29 0.30
0.04 0.21 0.15 0.22 0.22 0.13 0.19 0.19 0.13 −23.6 0.20
0.05 0.14 0.10 0.15 0.15 0.08 0.13 0.13 0.08 ∗∗∗ 0.14
0.06 0.09 0.06 0.10 0.11 0.05 0.09 0.09 0.05 ∗∗∗ 0.09
0.07 0.06 0.04 0.07 0.07 0.03 0.06 0.06 0.03 ∗∗∗ 0.06
0.08 0.04 0.02 0.05 0.05 0.02 0.04 0.04 0.02 ∗∗∗ 0.04
0.09 0.03 0.01 0.03 0.03 0.01 0.03 0.03 0.01 ∗∗∗ 0.03
0.10 0.02 0.01 0.02 0.02 0.01 0.02 0.02 −0.03 ∗∗∗ 0.02
0.11 0.01 0.01 0.01 0.02 0.00 0.01 0.01 1.54 ∗∗∗ 0.01
0.12 0.01 0.00 0.01 0.01 0.00 0.01 0.01 −59.6 ∗∗∗ 0.01
0.13 0.01 0.00 0.01 0.01 0.00 0.01 0.01 ∗∗∗ ∗∗∗ 0.01
0.14 0.00 0.00 0.00 0.01 0.00 0.00 0.00 ∗∗∗ ∗∗∗ 0.00
0.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ∗∗∗ ∗∗∗ 0.00

Table 6.12: Calculations illustrating the use of Euler’s method to solve ut = u xx with initial
condition u(x,0) = sin(2πx) and boundary conditions u(0,t) = u(1,t) = 0. The exact solution is
2
u(x,t) = e−4π t sin(2πx).

with computational molecule


0 1 0
ut −u xx = −2r 4r −2r
0 −1 0
for approximation of the I-BVP for the diffusion equation
ut = u xx .

Solution. Richardson’s method replaces the time derivative by its central difference formula
du_i/dt = (u_{i,j+1} − u_{i,j−1})/(2k) + O(k²).
The result is the Richardson finite difference scheme
u_{i,j+1} − u_{i,j−1} = (2k/h²)(u_{i+1,j} − 2u_{i,j} + u_{i−1,j}) = 2r(u_{i+1,j} − 2u_{i,j} + u_{i−1,j})
with computational molecule
0 1 0
ut −u xx = −2r 4r −2r
0 −1 0
One difficulty with the Richardson scheme arises at j = 0 since the backward solution ui,−1 does not exist.
One way around this difficulty is to use the Euler scheme for the first step of the Richardson scheme. ^
Example 6.63. Use Richardson’s method to solve the diffusion equation ut = u xx where x ∈ (0,1) with
initial condition u(x,0) = sin(2πx) and boundary conditions u(0,t) = u(1,t) = 0. The exact solution is
u(x,t) = e^{−4π²t} sin(2πx).
Solution. To investigate the effectiveness of the Richardson algorithm, the scheme is started with j = 1. The
initial condition is used to obtain values of ui,0 and the exact solution is used to get values for ui,1 . Values
for ui, j with j > 1 are obtained by iterating the scheme. This arrangement overstates the effectiveness of
the Richardson scheme since the solutions at j = 0 and j = 1 are exact.
Table 6.13 illustrates the use of Richardson’s algorithm to solve the initial boundary problem discussed
previously. Richardson’s algorithm is unstable for the choices of h and k in this table. In fact, it will be

shown later that the algorithm is unstable for all choices of h and k. This result, however, is not obvious a
priori and is counter-intuitive in the respect that the treatment of the time derivative is more accurate than the
Euler scheme and so one would expect the Richardson algorithm to be superior to the Euler algorithm. ^

k values for n = 8 k values for n = 16 k values for n = 32


t u(1/4,t) 10−2 10−3 10−4 10 −2 10 −3 10 −4 10−2 10−3 10−4
0.01 0.67 0.49 0.66 0.68 0.44 0.60 0.62 0.46 0.64 ∗∗∗
0.02 0.45 0.30 0.45 0.47 0.28 0.41 0.42 0.30 −23.4 ∗∗∗
0.03 0.31 0.27 0.31 0.32 0.22 0.28 0.29 0.23 ∗∗∗ ∗∗∗
0.04 0.21 0.10 0.21 0.22 0.11 0.16 2.08 0.12 ∗∗∗ ∗∗∗
0.05 0.14 0.19 0.14 0.15 0.13 ∗∗∗ ∗∗∗ 0.13 ∗∗∗ ∗∗∗
0.06 0.09 −0.04 0.09 0.10 0.00 ∗∗∗ ∗∗∗ 0.02 ∗∗∗ ∗∗∗
0.07 0.06 0.22 0.06 0.07 0.13 ∗∗∗ ∗∗∗ 0.12 ∗∗∗ ∗∗∗
0.08 0.04 −0.21 0.03 0.05 −0.10 ∗∗∗ ∗∗∗ −0.09 ∗∗∗ ∗∗∗
0.09 0.03 0.38 0.00 0.03 0.21 ∗∗∗ ∗∗∗ 1.14 ∗∗∗ ∗∗∗
0.10 0.02 −0.49 −0.02 0.02 −0.26 ∗∗∗ ∗∗∗ −73.7 ∗∗∗ ∗∗∗
0.11 0.01 0.75 −0.05 0.01 0.40 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗
0.12 0.01 −1.05 −0.08 0.00 −0.46 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗
0.13 0.01 1.53 −0.12 −0.01 −1.62 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗
0.14 0.00 −2.20 −0.19 −0.01 48.1 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗
0.15 0.00 3.18 −0.34 −0.02 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗

Table 6.13: Calculations illustrating the use of Richardson’s algorithm to solve ut = u xx with initial
condition u(x,0) = sin(2πx) and boundary conditions u(0,t) = u(1,t) = 0.

6.3.4 The DuFort-Frankel method for the diffusion equation

Example 6.64. Derive the DuFort-Frankel method


(1+2r)ui, j+1 −(1−2r)ui, j−1 −2rui+1, j −2rui−1, j = 0
with computational molecule
0 1+2r 0
ut −u xx = −2r 0 −2r
0 −1+2r 0
for approximation of the I-BVP for the diffusion equation
ut = u xx .

Solution. The Dufort-Frankel method is a modification of the Richardson method in which ui, j is replaced
by its time averaged value, that is,
u_{i,j} = (u_{i,j+1} + u_{i,j−1})/2 + O(k²).
With this modification of the Richardson algorithm, we get the Dufort-Frankel algorithm
(u_{i,j+1} − u_{i,j−1})/(2k) = (1/h²) [ u_{i+1,j} − 2 × (u_{i,j+1} + u_{i,j−1})/2 + u_{i−1,j} ].
Taking r = k/h2 , the Dufort-Frankel algorithm becomes
(1+2r)ui, j+1 −(1−2r)ui, j−1 −2rui+1, j −2rui−1, j = 0
with computational molecule
0 1+2r 0
ut −u xx = −2r 0 −2r
0 −1+2r 0
^
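A Matlab sketch of one way to advance the DuFort-Frankel scheme is shown below (illustrative; a forward Euler step is used to supply the missing first time level, and accuracy still requires k to be refined along with h, cf. Table 6.14).

n = 32;  h = 1/n;  k = 1e-3;  r = k/h^2;
x = (0:n)'*h;
uold = sin(2*pi*x);                                                 % level j = 0 (initial condition)
u = uold;
u(2:n) = uold(2:n) + r*(uold(3:n+1) - 2*uold(2:n) + uold(1:n-1));   % Euler start-up step to level j = 1
for j = 2:round(0.05/k)
    unew = uold;
    unew(2:n) = ((1-2*r)*uold(2:n) + 2*r*(u(3:n+1) + u(1:n-1)))/(1+2*r);
    uold = u;  u = unew;                                            % boundary values remain zero
end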

Example 6.65. Use the DuFort-Frankel method to solve the diffusion equation ut = u xx where x ∈ (0,1)
with initial condition u(x,0) = sin(2πx) and boundary conditions u(0,t) = u(1,t) = 0. The exact solution
is u(x,t) = e^{−4π²t} sin(2πx).
Solution. Table 6.14 illustrates the result of using the Dufort-Frankel algorithm to solve the original partial
differential equation by iterating
u_{i,j+1} = ((1−2r)/(1+2r)) u_{i,j−1} + (2r/(1+2r)) u_{i+1,j} + (2r/(1+2r)) u_{i−1,j}
taking ui,0 from the boundary condition and ui,1 from the exact solution. The numerical scheme is clearly
more stable with no evidence of numerical blow-up.
However, it would appear that the process of increasing n does not in itself ensure a more accurate solution.
Spatial refinement cannot be achieved without an accompanying temporal refinement. Table 6.14 repeats
some of these calculations with a wide range of spatial discretisations. The most important point to observe
from this Table is that the scheme appears to converge as n increases, but the convergence is not to the true
solution of the partial differential equation.
This is the worst possible scenario. Without knowing the exact solution, one might (erroneously) take the
result of this numerical calculation to be a reasonable approximation to the exact solution of the partial
differential equation. ^

n = 32 n = 320 n = 3200
t u(1/4,t) k = 10−2 k = 10−3 k = 10−4
0.01 0.6738 0.3477 0.5742 0.6021
0.02 0.4540 0.0218 0.1873 0.2083
0.03 0.3059 −0.3038 −0.1993 −0.1853
0.04 0.2062 −0.6290 −0.5854 −0.5783
0.05 0.1389 −0.9537 −0.9710 −0.9708

Table 6.14:
Calculations illustrating the use of Dufort-Frankel algorithm to solve ut = u xx with initial condition
u(x,0) = sin(2πx) and boundary conditions u(0,t) = u(1,t) = 0.

6.3.5 The Crank-Nicolson method for the diffusion equation

Example 6.66. Derive the Crank-Nicolson method


−(r/2) u_{i+1,j+1} + (1+r) u_{i,j+1} − (r/2) u_{i−1,j+1} = (r/2) u_{i+1,j} + (1−r) u_{i,j} + (r/2) u_{i−1,j}.
with computational molecule
−r/2 1+r −r/2
ut −u xx = −r/2 −1+r −r/2
0 0 0
for approximation of the I-BVP for the diffusion equation
ut = u xx .

Solution. The Crank-Nicolson algorithm averages not only ui, j but also ui+1, j and ui−1, j . Therefore the
Crank-Nicolson formulation of the diffusion equation is
(u_{i,j+1} − u_{i,j−1})/(2k) = (1/(2h²)) [ u_{i+1,j+1} + u_{i+1,j−1} − 2(u_{i,j+1} + u_{i,j−1}) + u_{i−1,j+1} + u_{i−1,j−1} ].
This equation can be re-expressed in the form
(u_{i,(j−1)+2} − u_{i,(j−1)})/(2k) = (1/(2h²)) [ u_{i+1,(j−1)+2} + u_{i+1,(j−1)} − 2(u_{i,(j−1)+2} + u_{i,(j−1)}) + u_{i−1,(j−1)+2} + u_{i−1,(j−1)} ].
In particular, the value of the solution at time t_j nowhere enters the algorithm, which in practice advances the
solution from t_{j−1} to t_{j+1} through a time step of length 2k. By reinterpreting k to be 2k, the Crank-Nicolson
algorithm becomes
(u_{i,j+1} − u_{i,j})/k = (1/(2h²)) [ u_{i+1,j+1} + u_{i+1,j} − 2(u_{i,j+1} + u_{i,j}) + u_{i−1,j+1} + u_{i−1,j} ].
The Courant number r = k/h² is now introduced to obtain
u_{i,j+1} − u_{i,j} = (r/2) [ u_{i+1,j+1} + u_{i+1,j} − 2(u_{i,j+1} + u_{i,j}) + u_{i−1,j+1} + u_{i−1,j} ],
which may be re-arranged to give
−(r/2) u_{i+1,j+1} + (1+r) u_{i,j+1} − (r/2) u_{i−1,j+1} = (r/2) u_{i+1,j} + (1−r) u_{i,j} + (r/2) u_{i−1,j}.
The computational molecule corresponding to the Crank-Nicolson algorithm is therefore
−r/2 1+r −r/2
ut −u xx = −r/2 −1+r −r/2 .
0 0 0
^
Example 6.67. Formulate in matrix form the Crank-Nicolson algorithm for the problem
ut −u xx = f (x,t), (x,t) ∈ [0,1]×[0,∞)
u(x,0) = w(x),
u(0,t) = u(1,t) = 0.

Solution. Let U (j) = (u0, j ,u1, j ,···,un, j ) be the vector of u values at time t j then the Crank-Nicolson algorithm for
the solution of the diffusion equation ut = u xx with Dirichlet boundary conditions may be expressed in the form
T_L U^{(j+1)} = T_R U^{(j)} + F^{(j)}
in which T_L and T_R are tri-diagonal matrices

        [ 1+r   −r/2    0      0    ···   ···    0   ]
        [ −r/2   1+r   −r/2    0    ···   ···    0   ]
T_L  =  [  0    −r/2    1+r  −r/2    0    ···   ···  ] ,
        [  ⋮      ⋱      ⋱     ⋱     ⋱     ⋱    ···  ]
        [  0     ···    ···   ···   ···  −r/2   1+r  ]

        [ 1−r    r/2    0      0    ···   ···    0   ]
        [  r/2   1−r    r/2    0    ···   ···    0   ]
T_R  =  [  0     r/2    1−r   r/2    0    ···   ···  ] ,
        [  ⋮      ⋱      ⋱     ⋱     ⋱     ⋱    ···  ]
        [  0     ···    ···   ···   ···   r/2   1−r  ]
and the vector F is
F^{(j)} = ( f_{1,j}, f_{2,j}, ···, f_{n−1,j} )ᵀ.
The first and last rows of these matrices contain the boundary conditions which may also arise in the
expression for F (j) . The initial solution is determined from the initial conditions and subsequent solutions
are obtained by solving a tri-diagonal system of equations. ^
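A Matlab sketch of this matrix formulation for the homogeneous problem (f = 0, so F^{(j)} = 0) follows; it is illustrative and reuses the initial and boundary data of the examples above.

n = 32;  h = 1/n;  k = 1e-3;  r = k/h^2;
x = (0:n)'*h;
u = sin(2*pi*x);                                        % initial condition
e = ones(n-1,1);
TL = spdiags([-r/2*e (1+r)*e -r/2*e], -1:1, n-1, n-1);  % tri-diagonal matrix T_L
TR = spdiags([ r/2*e (1-r)*e  r/2*e], -1:1, n-1, n-1);  % tri-diagonal matrix T_R
for j = 1:round(0.05/k)                                 % advance to t = 0.05
    u(2:n) = TL\(TR*u(2:n));                            % boundary values remain zero
end
err = max(abs(u - exp(-4*pi^2*0.05)*sin(2*pi*x)));
fprintf('max error at t = 0.05: %.2e\n', err);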

Example 6.68. Use the Crank-Nicolson method to solve the diffusion equation ut = u xx where x ∈ (0,1)
with initial condition u(x,0) = sin(2πx) and boundary conditions u(0,t) = u(1,t) = 0. The exact solution
2
is u(x,t) = e−4π t sin(2πx).

k values for n = 8 k values for n = 16 k values for n = 32


t u(1/4,t) 10−2 10−3 10−4 10−2 10−3 10−4 10−2 10−3 10−4
0.01 0.67 0.68 0.69 0.69 0.62 0.63 0.63 0.66 0.66 0.66
0.02 0.45 0.47 0.47 0.47 0.42 0.42 0.42 0.44 0.45 0.45
0.03 0.31 0.32 0.32 0.32 0.28 0.29 0.29 0.30 0.30 0.30
0.04 0.21 0.22 0.22 0.22 0.19 0.19 0.19 0.20 0.20 0.20
0.05 0.14 0.15 0.15 0.15 0.13 0.13 0.13 0.13 0.14 0.14
0.06 0.09 0.10 0.11 0.11 0.09 0.09 0.09 0.09 0.09 0.09
0.07 0.06 0.07 0.07 0.07 0.06 0.06 0.06 0.06 0.06 0.06
0.08 0.04 0.05 0.05 0.05 0.04 0.04 0.04 0.04 0.04 0.04
0.09 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03
0.10 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
0.11 0.01 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01
0.12 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
0.13 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
0.14 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00
0.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Table 6.15: The numerical solution of ut = u xx with initial condition u(x,0) = sin(2πx) and boundary
conditions u(0,t) = u(1,t) = 0 calculated by the Crank Nicolson algorithm.

Solution. Table 6.15 gives the solution of the original problem using the Crank-Nicolson algorithm.
Table 6.16 illustrate the convergence of this solution as spatial resolution is refined.

n = 32 n = 320 n = 3200
t u(1/4,t) k = 10−2 k = 10−3 k = 10−4
0.01 0.67382545 0.67030550 0.67379098 0.67382511
0.02 0.45404074 0.44930947 0.45399428 0.45404027
0.03 0.30594421 0.30117461 0.30589725 0.30594374
0.04 0.20615299 0.20187900 0.20611081 0.20615257
0.05 0.13891113 0.13532060 0.13887560 0.13891078
0.06 0.09360186 0.09070614 0.09357313 0.09360157
0.09 0.02863695 0.02731839 0.02862376 0.02863681
0.12 0.00876131 0.00822760 0.00875593 0.00876125
0.15 0.00268047 0.00247795 0.00267842 0.00268045

Table 6.16:
Calculations illustrating the convergence of the Crank Nicolson algorithm for various combinations of
parameters.

6.3.6 Summary

Several numerical schemes have been examined with varying degrees of success. The performance of each
scheme is now summarised.

Euler Converges if k is sufficiently small but otherwise will blow-up.


Richardson Seems to be unstable for all choices of the Courant number r.
Dufort-Frankel Converges but not necessarily to the correct solution!
Crank-Nicolson Seems to exhibit stable behaviour for all choices of the Courant number
and converges to the correct solution as spatial resolution is improved.

6.4 Hands-on projects

6.4.1 A two-point boundary value problem

In this assignment, we will use the Newton-Raphson method to find the finite-difference solution to the
two-point second-order nonlinear boundary value problem (BVP)
u'' + exp(u) u' = µ sin(2πx),   u(0) = 1,   u'(1) + u³(1) = 0,     (6.14)
We will investigate the behaviour of the numerical scheme as a function of the grid spacing. We will also
investigate the behaviour of the solution as a function of the parameter µ.
Below is a list of specific questions that are useful as a guide in completing these tasks.
Question 6.69. Use a uniform grid composed of N +1 nodes. Define the grid spacing and the grid points
xi . Introduce appropriate notation to describe the approximate solution to the BVP at each node. (Marks: 2)
Question 6.70. Use second-order of convergence finite-difference approximations to generate difference
equations that correspond to the ODE discretised at the interior nodes. (Marks: 5)
Question 6.71. Generate difference equations corresponding to the boundary conditions. Ensure second-
order of convergence and use the backward-difference formula for u 0 where appropriate. (Marks: 5)
Question 6.72. Write the difference equations in the form F⃗(u⃗) = 0 and calculate the entries of the Jacobian
J_{ij} = ∂F_i/∂u_j. (Marks: 8)
Question 6.73. Complete the Matlab code for the solution of BVP (6.14) provided below. You should specify
the value of the parameter µ, the number of nodes N, the interval end points, the desired error tolerance,
the initial approximation, the entries of the vector F⃗ and the entries of the Jacobian J.
When you run the script it should generate a solution matrix SOL whose columns are successive approx-
imations to the solution to the difference equations. Plot the iterates (the columns of SOL) to visualize the
steps the Newton-Raphson method takes as it converges towards the solution. Use µ = 10, tolerance 10−8
and sufficient number of nodes. (Marks: 8)
Question 6.74. Modify the script produced in Question 6.73 into a function that takes as its argument
the value for µ and outputs the values of u at x = 1/4, 1/2, and 3/4. Use this function to produce a plot
showing curves of u(1/4), u(1/2) and u(3/4) against the parameter µ for 51 values of µ from −20 to 30.
Use tolerance 10−8 and sufficient number of nodes. (Marks: 6)

We now investigate the order of convergence of one of the two main numerical methods used – the
finite-difference discretisation method. The solution depends on two independent numerical parameters,
namely the number of finite-difference discretisation nodes N and the number of Newton-Raphson
iterations n. The rate of convergence of the finite-difference method is related to N, and we keep all other
parameter values fixed. To simplify the notation we denote the approximation to the solution u at x = 1/2
and µ = 10, after n = 10 Newton-Raphson iterations by v, i.e.
v(N) = u( x = 1/2, µ = 10; n = 10, N ).

Question 6.75. Demonstrate using a graph that the order of convergence of the finite-difference disretisation
method is 2. One way to do this is to consider the convergence of the value of u at a fixed location, say
x = 1/2, then plot the ratio
( v(2^{k−1}) − v(2^{k−2}) ) / ( v(2^k) − v(2^{k−1}) )
as a function of k and observe that the ratio tends to 4 as k is increased. (See Theorem 3.15 and Tables
such as 5.4 in the revised Lecture notes). (Marks: 6)

6.4.2 Poisson’s equation in a R2 annulus

Consider the problem
∇²u(x,y) = −2,   (x,y) ∈ D,
u(x,y) = sin( tan⁻¹(y/x) )   on (x,y) ∈ ∂D1,                  (6.15)
u(x,y) = cos( 3 tan⁻¹(y/x) )   on (x,y) ∈ ∂D2,
where D ⊂ R² is the closed annular region sandwiched between a circle ∂D1 of radius 1 and a circle ∂D2
of radius 3, both centred at the origin (0,0) of the coordinate system. In this assignment, you will apply the
finite-difference method to solve problem (6.15). You will also investigate the behaviour of the numerical
scheme as a function of the grid spacing and illustrate your solution by plots.
A list of specific questions follows to guide you in this task. Further remarks of how to apply the finite
difference method to problems posed on curvilinear domains are given in section 6.3.8 of the updated
Lecture Notes posted on Moodle.
Question 6.76. The transformations fromq
Cartesian coordinates (x, y) to polar coordinates (r, θ) are given by
r = x 2 + y 2, θ = tan−1 (y/x),
while in polar coordinates the Laplacian operator takes the form
∂2 1 ∂ 1 ∂2
∇2 = 2 + + .
∂r r ∂r r 2 ∂θ 2
Using these facts, transform problem (6.15), including the PDE, the boundary conditions and the domain
of integration, from (x, y) to (r, θ). You should find that the domain of integration is a rectangular region
in terms of r and θ. (Marks: 2)
Question 6.77. Set up a regular rectangular grid composed of n+1 nodes in the radial direction and m+1
nodes in the azimuthal direction. Define grid spacings ∆r and ∆θ and grid points ri and θ j . Introduce
appropriate notation to describe the approximation to the solution of BVP (6.15) at each node. Provide
a sketch of the grid. Remember to list the range of values for the indices i and j where appropriate.
Use the Matlab convention that indices MUST start from 1, rather than the usual mathematical
convention where indices start from 0. (Marks: 5)
Question 6.78. Use second-order of convergence, centred-difference approximations to the derivatives
to generate difference equations that correspond to the PDE discretised at the interior nodes. Explicitly
state the ranges of the i and j indices. (Marks: 3)
Question 6.79. Generate difference equations corresponding to the boundary conditions. Ensure second-
order of convergence. [HINT: Note, that in Cartesian coordinates the region of integration D is an annulus
and so only 2 boundary conditions are explicitly stated in formulation (6.15). However, in polar coordinates
D is a rectangle and it needs one boundary condition on each one of its four sides. Require that u(r,θ)
is a C 1 -smooth function, i.e. it is continuous and has a continuous first derivative in the azimuthal direction.
These are known as “periodic boundary conditions”.] (Marks: 5)
Question 6.80. Write the difference equations in the form of a Gauss-Jacobi iteration. Remember to list
index ranges in the Matlab convention. (Marks: 8)
Question 6.81. Write a Matlab function with the header
6.4. HANDS-ON PROJECTS 107

function [r,t,u,niter] = PolarPoisson(n,m,tol)

that iterates the Gauss-Jacobi equations of Question 6.80. The function should take as inputs the numbers
of discretisation steps in radial and azimuthal directions, n and m, respectively, and a specified tolerance τ.
The function should return a vector of values of ri , a vector of values of θ j , a matrix containing the solution
ui, j and the number of iterations taken to satisfy the stopping condition
|| Ẽ || = ||u(p+1) −u(p) || < τ,
where p is a integer denoting the iteration step number. [HINT: The function

CartesianPoisson

that solves the simpler Poisson problem


∇2 u(x,y) = −2, (x,y) ∈ [0,1]×[0,1],
u(0,y) = 0, u(1,y) = 0, (6.16)
uy0 (x,0) = 0, uy0 (x,1) = 0,
is provided in Listing 1 as an example.] (Marks: 10)
Question 6.82. Run your

PolarPoisson

function to produce plots of the solution both in polar coordinates (r, θ) and in Cartesian coordinates (x, y).
The Matlab plotting command surf may be used for plotting, while the Matlab command pol2cart may
be used to convert from polar to Cartesian coordinates. Experiment with changing the boundary conditions
and the non-homogeneous term in problem (6.15) to get other interesting solutions. (Marks: 7)
108 CHAPTER 6. FD DISCRETISATION
Chapter 7

Convergence of finite-difference methods

The properties required by a numerical scheme to enable it to solve a given differential equation problem
can be summarised in terms of the consistency, stability and convergence of the scheme.
A scheme is said to be consistent if it solves the problem it purports to solve. In the context of the diffusion
equation, the numerical scheme must be consistent with the partial differential equation ut = u xx . It would
appear, for example, that the Dufort-Frankel algorithm is not consistent.
A numerical algorithm is said to be stable provided small errors in arithmetic remain bounded. For example,
particular combinations of temporal and spatial discretisation may cause the scheme to blow up as happens
with the Euler scheme. It would appear that there are no combinations of h and k for which the Richardson
scheme is stable. On the other hand, calculation would suggest that the Crank-Nicolson scheme is stable
for all combinations of temporal and spatial discretisation.
A scheme converges if it is both consistent and stable as temporal and spatial discretisation is reduced.

7.1 Convergence, consistency and stability of steady-state problems

Problem 7.1. Consider a continuous linear problem


L[û] = L û− f = 0, (7.1)
where L is a linear differential operator representing a given linear differential equation or a system and the
associated boundary conditions, f is a given vector function and û is the exact solution satisfying problem (7.1).
Let the finite-difference system of algebraic equations used for the discrete approximation of problem (7.1) be
A(h) [u(h) ] = A(h) u(h) −F(h) = 0, (7.2)
where A(h) is the differentiation matrix corresponding to L, F(h) is vector corresponding to f , and u(h) is
the solution of (7.2), and [](h) denotes that these objects depend on the size of the discretization used as
represented by the parameter h.
Investigate the convergence of the finite-difference solution u(h) to the exact solution û as h tends to 0, i.e.
investigate the conditions for
lim kE(h) k = lim ku(h) − ûk = 0.
h→0 h→0
Example 7.2. Construct a finite-difference scheme for the solution of
û 00 − f = 0, û(0) = 0, û(1) = 0.

109
110 CHAPTER 7. CONVERGENCE OF FINITE-DIFFERENCE METHODS

Solution. Here the continuous problem is


 û 00 − f 
L[û] =  û(0)  = 0.
 

 û(1) 
The FD representation is
 
uk−1 −2uk +uk+1
− fk = 0, u0 = 0, un = 0,
h2
or in matrix form
 −2 1 ··· ··· ···  
[u(h) ]1   f1 
1  1 −2 1 ··· ···  
A(h) [u(h) ] = 2     ···  −  ···  = A(h) u(h) −F(h) = 0.
h  . . . . . .
  
. . .   u(h) n−1   fn−1 

 ··· ···
 
Obviously, the exact solutions of these two equations are in general different. ^
Definition 7.3. The quantity
E(h) = u(h) − û,
is called the overall error of the solution to problem (7.1).
Definition 7.4. The quantity
τ(h) = A(h) [û]−L[û] = A(h) [û] = A(h) û−F(h) . (7.3)
is called the local truncation error (local residual) of the numerical scheme A(h) [] = 0.
Example 7.5. Find the local truncation error of the numerical scheme
uk−1 −2uk +uk+1
A(h) [u] = − fk = 0
h2
for the solution of
u 00 − f = 0.

Solution. Now
ûk−1 −2ûk + ûk+1
A(h) [û] = − fk = (ûk00 +O(h2 ))− fk ,
h2
but
L[û] = ûk00 − fk = 0,
so using the definition directly
τ(h) = A(h) [û]−L[û] = O(h2 ).
^
Remark 7.6. Note that the exact solution to the continuous problem is not actually needed to find τ despite
the definition!
Claim 7.7. The overall error E(h) is related to the local truncation error τ(h) and the differentiation matrix
A as follows
E(h) = −A−1
(h) τ(h) .

Proof. Subtract equations (7.2) and (7.3)


A(h) u(h) −F(h) = 0,
A(h) û−F(h) = τ(h),
to get
A(h) (u(h) − û) = A(h) E(h) = −τ(h) .
Then
E(h) = −A−1
(h) τ(h) .


Example 7.8. Illustrate Claim 7.7 in the case of the Example 7.2.
7.1. CONVERGENCE, CONSISTENCY AND STABILITY OF STEADY-STATE PROBLEMS 111

Solution.
 −2/h2 1/h2 −1
··· ··· ···   O(h2 ) 
 1/h2 −2/h2 1/h2

E(h) =  ··· ···   ···  = −A−1 τ(h) .
 
(h)
.. .. ..   2

. . . ···   O(h ) 

 ···

^
Claim 7.9. The solution u(h) of the numerical scheme A(h) u(h) −F(h) = 0 converges to the solution û of the
continuous problem L û− f = 0 if the following two conditions are both satisfied.
1. The norm of the local truncation error vanishes as h → 0,
lim kτ(h) k = 0,
h→0
2. There is a constant C independent of h and a number h0 such that
k A−1
(h) k ≤ C for all h < h0 .

Proof. Since
E(h) = −A−1
(h) τ(h), (7.4)
then
kE(h) k = k A−1
(h) τ(h) k ≤ k A(h) k kτ(h) k ≤ Ckτ(h) k.
−1

Now
0 ≤ lim kE(h) k ≤ C lim kτ(h) k,
h→0 h→0
0 ≤ lim kE(h) k ≤ 0.
h→0


Definition 7.10. (Consistency) The numerical scheme A(h) u(h) − F(h) = 0 is called consistent with the
continuous problem L û− f = 0 if the local truncation error vanishes as h → 0,
lim kτ(h) k = 0.
h→0
Definition 7.11. (Stability) The numerical scheme A(h) u(h) −F(h) = 0 is called stable if there is a constant
C independent of h and a number h0 such that
k A−1
(h) k ≤ C for all h < h0 .
Remark 7.12. Although Claim 7.9 has been demonstrated only for the linear boundary value problem,
it holds for many other types of finite-difference methods for differential equations.

So with these definitions Claim 7.9 is often summarized as follows.


Claim 7.13. (The fundamental theorem of FD methods) A numerical scheme converges to the solution of
its continuous problem if it is stable and consistent, or
CONSISTENCY + STABILITY =⇒ CONVERGENCE.
Remark 7.14. • Consistency is usually the easy part to check.
• Stability is the hard part. – Even for the linear boundary value problem just discussed it is not at
all clear how to check the condition k A−1
(h)
k < C since the matrices A(h) get larger as h → 0.
• For other problems it may not even be clear how to define stability in an appropriate way. There
are many different definitions of âĂIJstabilityâĂİ for different types of problems.
Example 7.15. Investigate the convergence of the of the solution of the numerical scheme
uk−1 −2uk +uk+1
A(h) [u] = − fk = 0.
h2
to the solution of the problem
û 00 − f = 0, û(0) = 0, û(1) = 0.
(Continuing with example 7.2.)
112 CHAPTER 7. CONVERGENCE OF FINITE-DIFFERENCE METHODS

Solution. (Stability) The matrix A(h) is given in example 7.2. Consider the 2-norm, which, because A(h)
is symmetric is
k A(h) k = ρ(A(h) ) = max |λ p |,
p
where ρ(A(h) ) is the spectral radius and λ p is a eigenvalue. Again because A(h) is symmetric
(h) k = ρ(A(h) ) = max |λ p | = (min |λ p |) .
k A−1 −1 −1 −1
p p
One can show by direct substitution that in the case h = 1/n the eigenpairs of this standard tridiagonal matrix are
2 p
λ p = 2 (cos(pπh)−1), u j = sin(pπh j) for p = 1,2,...,n−1.
h
Note, that these expressions are far from incidental; u is, in fact, the discrete version of the eigenfunction
p

sinx of the ∂x2 operator.


Using a Maclaurin expansion
2 2 −p2 π 2 h2 p4 π 4 h4
λ p = 2 (cos(pπh)−1) = 2 ( + +O(h6 )) = −p2 π 2 +O(h2 ).
h h 2 24
So the smallest eigenvalue by modulus is λ1 = −π 2 and so
2
(h) k < 1/π = const.
k A−1
So the method is stable.
(Consistency) The truncation error was found above to be O(h2 ) and tends to zero as h tends to 0+ ,
O(h2 ) → 0 as h → 0+
.
(Convergence) Then, convergence follows, indeed. ^

7.2 Convergence of linear parabolic problems

7.2.1 General conditions for convergence illustrated on a particular case

The relation between consistency, stability and convergence can be demonstrated in a particular case that
is simple enough but captures the main points.
Problem 7.16. Consider a continuous linear parabolic problem
L[û] = ∂t û−∂xx û− f (x,t) = 0, (7.5)
with associated initial and boundary conditions and let û be the exact solution satisfying problem (7.5).
Consider a finite-difference approximation of problem (7.5) in the form of a system of algebraic equations
A(h,k) [u(h,k) ] = u(h,k), j+1 −B(h,k) u(h,k), j −F(h,k), j = 0, (7.6)
where B(h,k) is the differentiation matrix, and [](h,k) denotes that these objects depend on the size of the
discretization used as represented by the parameters h and k.
Establish conditions for convergence of the finite-difference solution u(h,k) to the exact solution û at any
arbitrary but finite time t ∗ = t j = j k as h,k tend to 0, i.e. establish conditions for
lim ∗ kE(h,k), j k = lim ∗ ku(h,k), j − ûk = 0.
h,k→0, jk=t h,k→0, jk=t

Claim 7.17. The overall error E(h,k) is related to the local truncation error τ(h,k) and the differentiation
matrix B(h,k) by the bound
Õj
j−s
kE(h,k), j k ≤ kB(h,k) k kτ(h,k),s−1 k. (7.7)
s=1

Proof. Subtract the definition of local truncation error from equation (7.6)
u j+1 = Bu j +Fj ,
û j+1 = Bû j +Fj +τj ,
7.2. CONVERGENCE OF LINEAR PARABOLIC PROBLEMS 113

to get
(u j+1 − û j+1 ) = B(u j − û j )−τj .
that is the equation for the overall error
E j+1 = BE j −τj .
Solve by recursion
j
Õ
E j = BE j−1 −τj−1 = B(BE j−2 −τj−2 )−τj−1 = ..... = B j E0 − B j−s τs−1
s=1
Assume that the initial conditions are enforced exactly so E0 = 0, then
j
Õ
E j = − B j−s τs−1 .
s=1
This expression shows how the local truncation error accumulates step by step. Bound the last expression
to obtain the result
j
Õ Õj Õ j
kE j k = k B τs−1 k ≤
j−s
kB τs−1 k ≤
j−s
kB j−s k kτs−1 k.
s=1 s=1 s=1


Claim 7.18 (Lax equivalence theorem in a particular case). Given bound (7.7) the necessary and sufficient
conditions for convergence are consistency together with stability, i.e.
1. (consistency)
lim kτ(h,k) k = 0,
h,k→0
2. (stability) there exist a finite constant such that
j
lim ∗ kB(h,k) k ≤ M j .
h,k→0, jk=t

Proof. In general the truncation error at a particular time step has the form kτs k < Ts k α , where Ts is a
sufficiently large finite constant and α is the order of accuracy wrt time. Let us denote by T a sufficiently
large finite constant such that
max kτs k ≤ T k α .
s=1.. j
Also let us denote by M a sufficiently large
 finite constant such
 that
j
max lim ∗ kB(h,k) k ≤ M.
s=1.. j h,k→0, jk=t
Then
j j
MT k α = MT( j k)k α−1 = MT k α−1 t ∗,
Õ Õ
kE j k ≤ kB j−s k kτs−1 k ≤
s=1 s=1
where t ∗ is a finite moment in time at which we wish to establish convergence.
Now since M is finite (stability) and t ∗ is also finite, and limk→0T = 0 (consistency) we obtain convergence
lim MT k α−1 t ∗ = 0.
k→0


Claim 7.19. A necessary and sufficient condition for stability is


kBk ≤ 1
.

Proof. Since
j
kB(h,k) k ≤ kBk j ,
this remains finite as j → ∞ only and if only kBk ≤ 1. 

Remark 7.20. In 2-norm, kBk = ρ(B), so the condition for stability can be written as ρ(B) ≤ 1.
114 CHAPTER 7. CONVERGENCE OF FINITE-DIFFERENCE METHODS

Claim 7.21. For the homogeneous diffusion equation ut −u xx = 0 the condition for stability may be written as
ku j+1 k ≤ ku j k.

Proof. From Claim ??


kBk ≤ 1,
Multiply both sides by ku j k,
kBk ku j k ≤ 1ku j k,
so
kBu j k ≤ ku j k
and because for the homogeneous diffusion equation u j+1 = Bu j , we have
ku j+1 k ≤ ku j k.


7.2.2 Consistency of methods for the diffusion equation

The Dufort-Frankel and the Crank-Nicolson algorithms are examined for consistency in this section, as
examples of the method.
In both cases, the analysis takes advantage of the Taylor series expansions
∂ ûi, j k 2 ∂ 2 ûi, j
ûi, j+1 = ûi, j +k + +O(k 3 ),
∂t 2 ∂t 2
∂ ûi, j k 2 ∂ 2 ûi, j
ûi, j−1 = ûi, j −k + +O(k 3 ),
∂t 2 ∂t 2
∂ ûi, j h2 ∂ 2 ûi, j h3 ∂ 3 ûi, j
ûi+1, j = ûi, j +h + + +O(h4 ),
∂x 2 ∂ x2 6 ∂ x3
∂ ûi, j h2 ∂ 2 ûi, j h3 ∂ 3 ûi, j
ûi−1, j = ûi, j −h + − +O(h4 ).
∂x 2 ∂ x2 6 ∂ x3
Example 7.22. Investigate the consistency of the Dufort-Frankel algorithm
ui, j+1 −ui, j−1 ui+1, j −ui, j+1 −ui, j−1 +ui−1, j
A[u] = − =0
2k h2
with the diffusion equation
(∂t −∂x2 )û = 0.

Solution. Let û be the exact solution of the diffusion equation. We calculate the local truncation error at
an arbitrary point with indices i and j on the discretisation grid
τi, j = A[û].
Each term is now replaced from its Taylor series to obtain after some algebra
∂ ûi, j ∂ 2 ûi, j  k  2 ∂ 2 ûi, j
τi, j = − + +O(h2 )+O(k 2 )+O(k 3 /h2 ).
∂t ∂ x2 h ∂t 2
. Since û is the exact solution of the diffusion equation the first two terms vanish
∂ ûi, j ∂ 2 ûi, j
− = 0,
∂t ∂ x2
and the local truncation error is
 k  2 ∂ 2 û
+O(h2 )+O(k 2 )+O(k 3 /h2 ).
i, j
τi, j =
h ∂t 2
Any implementation of the Dufort-Frankel algorithm in which h and k individual tend to zero with k/h = c,
a constant, will solve ut = u xx −c2 utt .
Consequently the Dufort-Frankel algorithm is not consistent for some choices of the parameters. ^
7.2. CONVERGENCE OF LINEAR PARABOLIC PROBLEMS 115

Example 7.23. Investigate the consistency of the Crank-Nicolson algorithm


ui, j+1 −ui, j 1 h i
A[u] = − 2
ui+1, j+1 +ui+1, j −2(ui, j+1 +ui, j )+ui−1, j+1 +ui−1, j = 0.
k 2h
with the diffusion equation
(∂t −∂x2 )û = 0.

Solution. Let û be the exact solution of the diffusion equation. We calculate the local truncation error at
an arbitrary point with indices i and j on the discretisation grid
τi, j = A[û].
As with the Dufort-Frankel algorithm, each term is now replaced from its Taylor series. The algebra is
more complicated and so we list the individual components as follows,
ûi, j+1 − ûi, j ∂ ûi, j k ∂ 2 ûi, j
= + +O(k 2 )
k ∂t 2 ∂t 2
∂ 2 ûi, j+1
ûi+1, j+1 −2ûi, j+1 + ûi−1, j+1 = h2 +O(h4 )
∂ x2
∂ 2 ûi, j
ûi+1, j −2ûi, j + ûi−1, j = h2 +O(h4 ).
∂ x2
When these components are inserted into the Crank-Nicolson algorithm, the result is
∂ ûi, j k ∂ 2 ûi, j 2 1 h 2 ∂ 2 ûi, j+1 2 ∂ 2 ûi, j 4
i
τi, j = + +O(k )− h +h +O(h )
∂t 2 ∂t 2 2h2 ∂ x2 ∂ x2
which in turn simplifies to
∂ ûi, j k ∂ 2 ûi, j 2 1 h ∂ 2 ûi, j ∂ 3 ûi, j 2 2
i
τi, j = + +O(k )− 2 +k +O(k )+O(h ) .
∂t 2 ∂t 2 2 ∂ x2 ∂ x 2 ∂t
It is now straight forward algebra to observe that this equation can be rewritten in the form
∂ ûi, j ∂ 2 ûi, j k  ∂ 3 ûi, j ∂ 2 ûi, j 
τi, j = − − − +O(k 2 )+O(h2 ).
∂t ∂ x 2 2 ∂ x 2 ∂t ∂t 2
Since û is the exact solution of the diffusion equation the first two terms vanish
∂ ûi, j ∂ 2 ûi, j
− = 0,
∂t ∂ x2
and when this is differentiated wrt time the second two terms also vanish
 ∂ 3 û ∂ 2 ûi, j 
i, j
− = 0,
∂ x 2 ∂t ∂t 2
so
τi, j = O(k 2 )+O(h2 ).
In the limit of k,h → 0 this vanishes,
τi, j = O(k 2 )+O(h2 ) → 0 as k,h → 0.
Thus, the Crank-Nicolson algorithm is consistent. ^

7.2.3 Stability of finite-difference methods for the diffusion equation

Two methods, one including the effect of boundary conditions and the other excluding the effect of boundary
conditions, are used to investigate stability. Both methods are attributed to John von Neumann. The approach
which excludes the effect of boundary conditions is usually called the “Fourier-modes method” while that which
includes the effect of boundary conditions is usually called the “matrix method”. In practice, it is normally
assumed that the boundary conditions have a negligible impact on the stability of the numerical procedure.
116 CHAPTER 7. CONVERGENCE OF FINITE-DIFFERENCE METHODS

7.2.3.1 Direct matrix method

Remark 7.24. The Direct matrix method consists of


1. Find the matrix B of the numerical method u j+1 = Bu j +Fj .
2. Establish conditions for stability by imposing ρ(B) ≤ 1.
Example 7.25. (Dirichlet BCs) Use the matrix method to investigate the stability of the Euler method
u p,q+1 = u p,q +r(u p+1,q −2u p,q +u p−1,q ) = (1−2r)u p,q +ru p+1,q +ru p−1,q
for the solution of the diffusion problem
ut −u xx = f (x,t), (x,t) ∈ [0,1]×[0,∞),
u(x,0) = w(x),
u(0,t) = g1, u(1,t) = g2 .

Solution. The values of u0,q and un,q are given by the boundary conditions in the Dirichlet problem. The aim
of the algorithm is to compute u1,q,···,un−1,q for all values of q. Let U (q) = (u1,q,···,un−1,q )T then Euler’s
algorithm corresponds to the matrix operation
U (q+1) =TU (q) +r F (q) (7.8)
where the (n−1)×(n−1) tri-diagonal matrix T and the (n−1) dimensional vector F (q) are
 1−2r r 0 0 ··· ··· 0 
  u0,q 
 r
 1−2r r 0 ··· ··· 0   0 
 
 0
T =  r 1−2r r 0 ··· ··· ,

=
(q) 
 ..  .
F  . 
 .. .. .. .. .. ..

 . . . . . . ··· 
  0 
 
 0 ··· 0 r 1−2r 
   u 
 ··· ···   n,q 

Gershgorin’s circle theorem asserts that all the eigenvalues of T (which are known to be real) lie within
the region
D = {z ∈ C : |z−(1−2r)| =r }∪{z ∈ C : |z−(1−2r)| = 2r } = {z ∈ C : |z−(1−2r)| = 2r }.
This is a circle centre (1−2r,0) and radius 2r. The extremities of the circle on the x-axis are the points (1−4r,0)
and (1,0). Stability is therefore guaranteed in the Euler algorithm provided 1−4r > −1, that is, r < 1/2. ^
Example 7.26. (Backward Euler scheme) Provide an example of the stability Backward Euler scheme to
illustrate B−1 .
Solution. ^
Example 7.27. (Gradient boundary conditions) Use the matrix method to investigate the stability of the
Euler method
u p,q+1 = u p,q +r(u p+1,q −2u p,q +u p−1,q ) = (1−2r)u p,q +ru p+1,q +ru p−1,q
for the solution of the diffusion problem
ut −u xx = f (x,t), (x,t) ∈ [0,1]×[0,∞),
u(x,0) = w(x),
u x (0,t) = 0, u x (1,t) = 0.

Solution. As previously, ut is replaced by its forward difference approximation and u xx is replaced by its
central difference approximation to obtain
ui, j+1 = ui, j +r(ui+1, j −2ui, j +ui−1, j )+O(k 2,k h2 ), i = 1,···,(n−1)
in which r is the Courant number. The boundary conditions when expressed in terms of forward/backward
differences of u give
−3u0, j +4u1, j −u2, j 3un, j −4un−1, j +un−2, j
= O(h2 ), = O(h2 ).
2h 2h
7.2. CONVERGENCE OF LINEAR PARABOLIC PROBLEMS 117

The values of the solution at time j +1 can be computed from the solution at time j using the results
u1, j+1 = (1−2r)u1, j +ru2, j +ru0, j +O(k h2,k 2 )

u2, j+1 = (1−2r)u2, j +ru3, j +ru1, j +O(k h2,k 2 )

un−1, j+1 = (1−2r)un−1, j +run, j +run−2, j +O(k h2,k 2 )

un−2, j+1 = (1−2r)un−2, j +run−1, j +run−3, j +O(k h2,k 2 )


These results are now used to compute u0, j+1 and un, j+1 as follows:-
1h i
u0, j+1 = 4u1, j+1 −u2, j+1 +O(h3 )
3
1h i
= (4−9r)u1, j +(6r −1)u2, j +4ru0, j −ru3, j +O(k h2,k 2,h3 )
3
1h i
un, j+1 = 4un−1, j+1 −un−2, j+1 +O(h3 )
3
1h i
= (4−9r)un−1, j +4run, j +(6r −1)un−2, j −run−3, j +O(k h2,k 2,h3 )
3
Now let U (j) = (u0, j ,···,un, j )T then the Euler scheme for the determination of the solution becomes
U (j+1) = AU (j)
where A is the (n+1)×(n+1) matrix
 4r 4−9r 6r −1 r
0 0 

− ···
 3 3 3 3


 r 1−2r r 0 ··· ··· 0 

 
 0 r 1−2r r 0 ··· ··· 

 .. .. .. .. .. ..
 
 . . . . . .

 b 

 0 r 6r −1 4−9r 4r 
··· ··· −
3 3 3 3 
 
Clearly A is neither symmetric or tri-diagonal. Nevertheless Gershgorin’s circle theorem indicates that

all the eigenvalues of A are located within the region


D = {z ∈ C : |z−(1−2r)| = 2r }∪{z ∈ C : |3z−4r | =r +|4−9r |+|6r −1|}.
As previously, the circle centre (1-2r,0) lies within the unit circle provided r < 1/2. However the analysis
of the second circle requires the treatment of several cases. The boundary values are r = 4/9 and r = 1/6.
Case 1 Suppose 0 < r < 1/6 then 0 < 9r < 3/2 < 4. The second circle is
|3z−4r | =r +4−9r +1−6r = 5−14r → |3z−4r | = 5−14r.
The line segment of the x-axis lying within this circle is −5 + 14r ≤ 3x − 4r ≤ 5 − 14r, which becomes
after simplification −5/3+6r ≤ x ≤ 5/3−10r/3. This line segment lies within the unit circle provided
5/3−10r/3 < 1 and −1 < −5/3+6r, that is, provided 1/5 < r and 1/9 < r. But r < 1/6 by assumption
and so these conditions cannot be satisfied.
Case 2 When 1/6 < r < 4/9, the second circle has equation
|3z−4r | =r +4−9r +6r −1 = 3−2r → |3z−4r | = 3−2r.
The line segment of the x-axis lying within this circle is −3+2r ≤ 3x −4r ≤ 3−2r, which becomes after
simplification −1+2r ≤ x ≤ 1+2r/3. Clearly 1+2r/3 > 1 for r > 0 and so this line segment never lies within
the unit circle.
Case 3 When 4/9 < r < 1/2, the second circle has equation
|3z−4r | =r +9r −4+6r −1 = 16r −5 → |3z−4r | = 16r −5.
The line segment of the x-axis lying within this circle is −16r + 5 ≤ 3x − 4r ≤ 16r − 5, which becomes
after simplification 5/3 − 4r ≤ x ≤ 20r/3 − 5/3. This line segment lies within the unit circle provided
20r/3−5/3 < 1 and −1 < 5/3−4r, that is, provided r < 2/5 and r < 2/3. But r > 4/9 and so again these
conditions cannot be satisfied.
118 CHAPTER 7. CONVERGENCE OF FINITE-DIFFERENCE METHODS

In conclusion, Gershgorin’s theorem on this occasion provides no useful information regarding the stability
of the algorithm. In fact, the issue of stability is of secondary importance in this problem to the issue of
accuracy. The detailed analysis of the error structure in this Neumann problem indicates that the numerical
scheme no longer has the expected O(k h2,k 2 ) accuracy of the Dirichlet problem. The error at each iteration
is dominated by O(h3 ) which behaves like O(k h) assuming that k = r h2 = O(1). The treatment of the
Neumann boundary value problem for the diffusion equation is non-trivial and is not pursued further here.
^

7.2.3.2 Method of Fourier modes

§ 7.28 (Scope). The method of Fourier modes (due to von Neumann) provides a simple approach to
investigating the stability in the case of
• Linear homogeneous problems
• Periodic boundary conditions (or infinite domains).
§ 7.29 (Basic idea). • The method of Fourier modes can be regarded as a discrete version of the method
of separation of variables for the solution of certain PDEs.
• Similarly to the continuous case, the method relies on the fact that the complex exponentials are
eigenfunctions of finite-difference operators.
Claim 7.30. Let {u p,0 = u(hp,0),p = 0..n,h > 0} be a set of points that discretise the initial condition u(x,0)
of a PDF problem. The equivalent representation in terms of a sum of Fourier modes
n
Õ
u p,0 = Ak eiαk ph
k=0
holds, where αk = kπ/(nh).

Proof. The relations


n
Õ
u p,0 = Ak eiαk ph
k=0
provide n+1 equations for the n+1 unknowns A1 ...An . These can be solved using the orthogonality relation
for eiαk ph (
1 l ikπx/l −imπx/l 0,m , k,

e e dx =
2l −l 1,m = k.

1
Claim 7.31. The eigenmodes of the finite-difference operator ∆v p = 2h v p+1 −v p−1 are ( hi sin(hα),eiαph ).


Proof.
1  iα(p+1)h iα(p−1)h 
∆eiαph = e −e
2h
1  iαh −iαh  iαph
= e −e e
2h
i
= sin(hα) eiαph .
h

1
Claim 7.32. The eigenmodes of the finite-difference operator ∆2 v p = ∆∆v p = (2h)2 v p+2 −2v p +v p−2 are


(− h12 sin2 (hα),eiαph ).

Proof. Apply the above one more time. 


7.2. CONVERGENCE OF LINEAR PARABOLIC PROBLEMS 119

Remark 7.33. So for linear difference equations


1. the initial Fourier modes are preserved,
2. the amplitudes of the initial Fourier modes evolve in time independently of each other because separate
solutions are additive.
Thus to investigate this evolution it is sufficient to consider one Fourier mode only
u p,q = λ q eiαph,
where p and q are the indices for time and space, respectively, i.e. tq = qk and, x p = ph, and α is the

frequency of the mode, while i = −1, and λ is called the “amplification factor” of the scheme.
Claim 7.34. Assuming a numerical scheme for the homogeneous diffusion equation has Fourier mode
solutions λ q eiαph it is stable whenever |λ| ≤ 1 and unstable whenever |λ| > 1.

Proof. ||uq+1 || ≤ ||uq || so


||λ q+1 eiαph || ≤ ||λ q eiαph ||
and cancelling all common factors
|λ| < 1.


Remark 7.35 (Summary). The method of Fourier modes can be summarized as follows.
1. Assume solution ansatz u pq = λ q eiαph .
2. Substitute ansatz into the given equation.
3. Find λ = λ(k,h).
4. Establish conditions for stability |λ| ≤ 1.
Example 7.36. Use the method of Fourier-modes to investigate the stability of the Euler method
u p,q+1 = u p,q +r(u p−1,q −2u p,q +u p+1,q ).

Solution. Assume
u p,q = λ q eiαph .
Substitute to obtain  
λ q+1 eiαph = λ q eiαph +r λ q eiα(p−1)h −2λ q eiαph +λ q eiα(p+1)h .
This equation simplifies immediately
 to   
λ = 1+r e −iαh
−2+eiαh = 1+2r cosαh−1 = 1−4rsin2 αh/2.
Clearly λ is real-valued and satisfies λ < 1. Therefore Euler’s method is stable provided λ > −1 which in
turn requires that
1
1−4rsin2 αh/2 > −1 → 2
>r
2sin αh/2
for all choices of α. This condition can be satisfied only provided r < 1/2, and therefore the numerical
stability of the Euler method requires that r < 1/2. ^
Example 7.37. Use the method of Fourier-modes to investigate the stability of the Richardson method
u p,q+1 = u p,q−1 +2r(u p+1,q −2u p,q +u p−1,q ).

Solution. Assume
u p,q = λ q eiαph .
Substitute to obtain  
λ q+1 eiαph = λ q−1 eiαph +2r λ q eiα(p+1)h −2λ q eiαph +λ q eiα(p−1)h .
This equation simplifies immediately
 to   
λ = λ +2r e −2+e−iαh = λ−1 +4r cosαh−1 = λ−1 −8rsin2 αh/2.
−1 iαh

In conclusion, λ is the solution of the quadratic equation


λ2 +8λrsin2 αh/2−1 = 0
120 CHAPTER 7. CONVERGENCE OF FINITE-DIFFERENCE METHODS

with solutions q
λ = −4rsin2 αh/2± 1+16r 2 sin4 αh/2.
Clearly both solutions are real but one solution satisfies |λ| > 1. Therefore Richardson’s method is never
stable (as was observed in the numerical example). The oscillatory nature of blow-up in Richardson’s
algorithm is due to the fact that instability ensues through a negative value of λ. ^
Example 7.38. Use the method of Fourier-modes to investigate the stability of the Dufort-Frankel method
u p,q+1 = u p,q−1 +2r(u p+1,q −u p,q+1 −u p,q−1 +u p−1,q ).

Solution. Assume
u p,q = λ q eiαph .
Substitute to obtain  
λ q+1 eiαph = λ q−1 eiαph +2r λ q eiα(p+1)h −λ q+1 eiαph −λ q−1 eiαph +λ q eiα(p−1)h .
This equation simplifies immediately to
λ = λ−1 +2r(eiαh −λ−λ−1 +e−iαh ) = −2rλ+(1−2r)λ−1 +4rcos(αh).
In conclusion, λ is the solution of the quadratic equation
(1+2r)λ2 −4λrcosαh−(1−2r) = 0
with solutions √
2rcosαh± 1−4r 2 sin2 αh
λ= .
1+2r
The roots of this quadratic are either both real or form a complex conjugate pair. If the roots are real, the
triangle inequality gives
2rcosαh± √1−4r 2 sin2 αh |2rcosαh|+ √1−4r 2 sin2 αh 2r +1
|λ| = ≤ ≤ = 1.

1+2r 1+2r 1+2r
On the other hand, if the roots are complex then 2r > 1 and
4r 2 cos2 αh+4r 2 sin2 αh−1 4r 2 −1 2r −1
|λ| 2 = = ≤ ≤ 1.
(1+2r)2 (1+2r)2 2r +1
Either way, the Dufort-Frankel scheme is stable. This means that it will not blow-up due to the presence
of rounding error, although it has already been shown to be inconsistent. It is now clear why this scheme
converges, but not necessarily to the desired solution. ^
Example 7.39. Use the method ofh Fourier-modes to investigate the stability of the Crank-Nicolson method
r i
u p,q+1 = u p,q + u p+1,q+1 +u p+1,q −2(u p,q+1 +u p,q )+u p−1,q+1 +u p−1,q .
2

Solution. Assume
u p,q = λ q eiαph .
Substitute to obtain
rh
λ q+1 eiαph = λ q eiαph + λ q+1 eiα(p+1)h +λ q eiα(p+1)h
2
  i
− 2 λ q+1 eiαph +λ q eiαph +λ q+1 eiα(p−1)h +λ q eiα(p−1)h .
This equation simplifies immediately to
r h iαh iαh i
λ = 1+ λe +e −2(λ+1)+λe−iαh +e−iαh .
2
Further simplification gives
1−2rsin2 αh/2
λ = 1+r(λcosαh+cosαh−λ−1) → λ = .
1+2rsin2 αh/2
Irrespective of the choice of k or h, it is clear that |λ| < 1 and therefore the Crank-Nicolson scheme is stable.
This means that it will not blow-up due to the presence of rounding error. ^
7.3. CONVERGENCE OF THE EULER METHOD FOR NONLINEAR IVP ODES 121

7.3 Convergence of the Euler method for nonlinear IVP ODEs

In this final section, we provide for completeness a proof that the Euler method for the IVPs in a scalar ODE
is convergent even in the nonlinear case. Note how the notions of consistency, stability and convergence
feature in the following proof.
Claim 7.40. Under suitable conditions (see proof) the Euler method for the solution of
y 0 = f (y,t), y(a) = α, t ∈ [a,b]
has an error bound as derived below and converges when the time step tends to zero.

Proof. In order to analyse the Euler method we define the local truncation error (LTE). This is the remainder
when the difference equations are evaluated using the true solution to the IVP. In the case of the Euler
method this is
yi+1 − yi
τi+1 = − f (yi,ti ). (7.9)
h
In general if the LTE is O(h ) then the method is said to be an order p method. If we have τi → 0 as h → 0
p

then the method is said to be consistent. We define the error as


ei = ui − yi . (7.10)
A method is is said to be convergent if the error at some fixed location tends to zero. We now prove a result
linking consistency and convergence for the Euler method.
Let τ = maxi |τi | the maximum local truncation error. Assume that f (y,t) is such that it satisfied a uniform
Lipschitz condition with Lipschitz constant K, that is
| f (y,t)− f ( ŷ,t)| ≤ K |y− ŷ|.
We deduce a bound for the error, first noting that ui satisfies the difference equation and using the definition
of local truncation error
ui+1 = ui +h f (ui,ti )
yi+1 = yi +h f (yi,ti )+hτi+1
subtraction leads to
|ei+1 | = |ei +h( f (ui,ti )− f (yi,ti ))−hτi+1 |
≤ |ei |+h| f (ui,ti )− f (yi,ti )|+h|τi+1 | triangle inequality
≤ |ei |+h| f (ui,ti )− f (yi,ti )|+hτ definition of τ
≤ |ei |+hK |ui − yi |+hτ use of uniform Lipschitz property
= (1+hK)|ei |+hτ
this recurrence can be used repeatedly to find that  τ
|ei | ≤ (1+K h)i+1 |e0 |+ (7.11)
K
note that 1+ x ≤ e for x ≥ 0 so (1+ x) ≤ e for x ≥ 0 and
x n nx
 therefore
τ
|ei | ≤ e K(ti −t0 )
|e0 |+ (7.12)
K
where we have used that hi = ti −t0 . Examine the errorat some fixed location ti = b, with t0 = a, then
K(b−a) τ
|eb | ≤ e |e0 |+
K
and note that if |ea | = 0 (we usually take the initial conditions for the difference equation to be the same
as those for the IVP so this is the case) and if τ → 0 as h → 0 (the method is consistent) then |eb | → 0 as
h → 0 and so the Euler method is convergent. 

You might also like