Numerical Analysis
Lecture Notes Part 1 (Pre-Midsem)
MA 214, Spring 2014-15
Authors:
S. Baskar and S. Sivaji Ganesh
Department of Mathematics
Indian Institute of Technology Bombay
Powai, Mumbai 400 076.
Contents
Mathematical Preliminaries
  1.3 Differentiation
  1.4 Integration
  1.7 Exercises
Error Analysis
  2.4.2 Multiplication
  2.4.3 Division
  2.6 Exercises
Numerical Linear Algebra
  3.2.5 LU Factorization
CHAPTER 1
Mathematical Preliminaries
This chapter reviews some of the concepts and results from calculus that are frequently used in this course. We recall important definitions and theorems, and outline their proofs briefly. The reader is assumed to be familiar with a first course in calculus.
In Section 1.1, we introduce sequences of real numbers, and in Section 1.2 we discuss the concepts of limit and continuity together with the intermediate value theorem. This theorem plays a basic role in finding initial guesses in iterative methods for solving nonlinear equations. In Section 1.3 we define the derivative of a function, and prove Rolle's theorem and the mean-value theorem for derivatives. The mean-value theorem for integration is discussed in Section 1.4. These two theorems are crucially used in devising methods for numerical integration and differentiation. Finally, Taylor's theorem is discussed in Section 1.5; it is essential for the derivation and error analysis of almost all numerical methods discussed in this course. In Section 1.6 we introduce tools useful in discussing the speed of convergence of sequences and the rate at which a function f(x) approaches f(x_0) as x → x_0.
Let a, b ∈ R be such that a < b. We use the notations [a, b] and (a, b) for the closed and the open intervals, respectively, defined by
    [a, b] = { x ∈ R : a ≤ x ≤ b }   and   (a, b) = { x ∈ R : a < x < b }.
Note that a bounded sequence need not converge. The monotonicity assumption in the above theorem is very important. The following result is known as the algebra of limits of sequences.
Theorem 1.8. Let {a_n} and {b_n} be two sequences. Assume that lim_{n→∞} a_n and lim_{n→∞} b_n exist. Then
(1) lim_{n→∞} (a_n + b_n) = lim_{n→∞} a_n + lim_{n→∞} b_n.
(2) lim_{n→∞} c a_n = c lim_{n→∞} a_n, for any constant c ∈ R.
(3) lim_{n→∞} (a_n b_n) = ( lim_{n→∞} a_n ) ( lim_{n→∞} b_n ).
(4) lim_{n→∞} (1/a_n) = 1 / lim_{n→∞} a_n, provided lim_{n→∞} a_n ≠ 0.
(1) Let f be a function defined on the left side of a, except possibly at a itself. Then, we say the left-hand limit of f(x) as x approaches a equals l, and write
    lim_{x→a−} f(x) = l,
if we can make the values of f(x) arbitrarily close to l (as close to l as we like) by taking x to be sufficiently close to a and x less than a.
(2) Let f be a function defined on the right side of a, except possibly at a itself. Then, we say the right-hand limit of f(x) as x approaches a equals r, and write
    lim_{x→a+} f(x) = r,
if we can make the values of f(x) arbitrarily close to r (as close to r as we like) by taking x to be sufficiently close to a and x greater than a.
(3) Let f be a function defined on both sides of a, except possibly at a itself. Then, we say the limit of f(x) as x approaches a equals L, and write
    lim_{x→a} f(x) = L,
if we can make the values of f(x) arbitrarily close to L (as close to L as we like) by taking x to be sufficiently close to a (on either side of a) but not equal to a.
Remark 1.10. Note that in each of the above definitions the value of the function
f at the point a does not play any role. In fact, the function f need not be defined
at the point a.
In the previous section, we have seen some limit laws in the context of sequences.
Similar limit laws also hold for limits of functions. We have the following result, often
referred to as the limit laws or as algebra of limits.
Theorem 1.11. Let f, g be two functions defined on both sides of a, except possibly at a itself. Assume that lim_{x→a} f(x) and lim_{x→a} g(x) exist. Then
(1) lim_{x→a} (f(x) + g(x)) = lim_{x→a} f(x) + lim_{x→a} g(x).
(2) lim_{x→a} c f(x) = c lim_{x→a} f(x), for any constant c ∈ R.
(3) lim_{x→a} (f(x) g(x)) = ( lim_{x→a} f(x) ) ( lim_{x→a} g(x) ).
(4) lim_{x→a} (1/f(x)) = 1 / lim_{x→a} f(x), provided lim_{x→a} f(x) ≠ 0.
Theorem 1.14 (Sandwich Theorem). Let f, g, and h be given functions such that
(1) f(x) ≤ g(x) ≤ h(x) when x is in an interval containing a (except possibly at a), and
(2) lim_{x→a} f(x) = lim_{x→a} h(x) = L.
Then
    lim_{x→a} g(x) = L.
We will now give a rigorous definition of the limit of a function. Similar definitions can be written down for left-hand and right-hand limits of functions.
Definition 1.15. Let f be a function defined on some open interval that contains a, except possibly at a itself. Then we say that the limit of f(x) as x approaches a is L, and we write
    lim_{x→a} f(x) = L,
if for every ε > 0 there exists a δ > 0 such that |f(x) − L| < ε whenever 0 < |x − a| < δ.
A function f is said to be
(1) continuous from the right at a if lim_{x→a+} f(x) = f(a),
(2) continuous from the left at a if lim_{x→a−} f(x) = f(a),
(3) continuous at a if
    lim_{x→a} f(x) = f(a).
1.3 Differentiation

Definition 1.20 (Derivative).
The derivative of a function f at a, denoted by f'(a), is
    f'(a) = lim_{h→0} ( f(a + h) − f(a) ) / h,                                  (1.3.1)
provided the limit exists. The derivative can equivalently be written as
    f'(a) = lim_{x→a} ( f(x) − f(a) ) / (x − a),                                (1.3.2)
and
    f'(a) = lim_{h→0} ( f(a + h) − f(a − h) ) / (2h),                           (1.3.3)
provided the limits exist.
Interpretation: Take the graph of f and draw the line joining the points (a, f(a)) and (x, f(x)). Take its slope and take the limit of these slopes as x → a. Then the point (x, f(x)) tends to (a, f(a)). The limit is nothing but the slope of the tangent line at (a, f(a)) to the curve y = f(x). This geometric interpretation will be very useful in describing the Newton-Raphson method in the context of solving nonlinear equations.
Theorem 1.22. If f is differentiable at a, then f is continuous at a.
Proof: For x ≠ a, write
    f(x) = ( ( f(x) − f(a) ) / (x − a) ) (x − a) + f(a).
Taking the limit as x → a in the last equation yields the desired result, since the first factor tends to f'(a) while (x − a) tends to 0.
The converse of Theorem 1.22 is not true. For, the function f(x) = |x| is continuous at x = 0 but is not differentiable there.
Theorem 1.23. Suppose f is differentiable at a. Then there exists a function φ such that
    f(x) = f(a) + (x − a) f'(a) + (x − a) φ(x),
and lim_{x→a} φ(x) = 0.
Proof: Define φ by
    φ(x) = ( f(x) − f(a) ) / (x − a) − f'(a),   x ≠ a.
Since f is differentiable at a, the result follows on taking limits on both sides of the last equation as x → a.
Theorem 1.24 (Rolle's Theorem). Let f be a function that satisfies the following three hypotheses:
(1) f is continuous on the closed interval [a, b].
(2) f is differentiable on the open interval (a, b).
(3) f(a) = f(b).
Then there is a number c in the open interval (a, b) such that f'(c) = 0.
The following theorem is a consequence of Rolle's theorem.
Theorem 1.25 (Mean Value Theorem). Let f be continuous on the closed interval [a, b] and differentiable on the open interval (a, b). Then there is a number c in (a, b) such that
    f'(c) = ( f(b) − f(a) ) / (b − a),
or, equivalently,
    f(b) − f(a) = f'(c)(b − a).
Proof: The strategy is to define a new function φ(x) satisfying the hypotheses of Rolle's theorem. The conclusion of Rolle's theorem for φ should yield the conclusion of the Mean Value Theorem for f.
Define φ on [a, b] by
    φ(x) = f(x) − f(a) − ( ( f(b) − f(a) ) / (b − a) ) (x − a).
We can apply Rolle's theorem to φ on [a, b], as φ satisfies the hypotheses of Rolle's theorem. Rolle's theorem asserts the existence of c ∈ (a, b) such that φ'(c) = 0, that is, f'(c) = ( f(b) − f(a) ) / (b − a). This concludes the proof of the Mean Value Theorem.
1.4 Integration
In Theorem 1.25, we have discussed the mean value property for the derivative of a
function. We now discuss the mean value theorems for integration.
Theorem 1.26 (Mean Value Theorem for Integrals). If f is continuous on [a, b], then there exists a number c in [a, b] such that
    ∫_a^b f(x) dx = f(c)(b − a).
Proof: Let m and M be the minimum and maximum values of f in the interval [a, b], respectively. Then,
    m(b − a) ≤ ∫_a^b f(x) dx ≤ M(b − a).
Since f is continuous, the result follows from the intermediate value theorem.
Observe that the first mean value theorem for integrals asserts that the average of an integrable function f on an interval [a, b], namely
    ( 1/(b − a) ) ∫_a^b f(x) dx,
belongs to the range of the function f.
Interpretation: Let f be a function on [a, b] with f ≥ 0. Draw the graph of f and find the area under the graph lying between the ordinates x = a and x = b. Also, look at a rectangle with base as the interval [a, b] and height f(c), and compute its area. Both values are the same.
Theorem 1.26 is often referred to as the first mean value theorem for integrals. We now state the second mean value theorem for integrals, which is a more general form of Theorem 1.26.
Theorem 1.27 (Second Mean Value Theorem for Integrals). Let f and g be continuous on [a, b], and let g(x) ≥ 0 for all x ∈ [a, b]. Then there exists a number c ∈ [a, b] such that
    ∫_a^b f(x) g(x) dx = f(c) ∫_a^b g(x) dx.
1.5 Taylor's Theorem

Theorem 1.29 (Taylor's Theorem). Let f be (n + 1)-times differentiable on an open interval containing the points a and x. Then there exists a number ξ between a and x such that
    f(x) = T_n(x) + ( f^{(n+1)}(ξ) / (n + 1)! ) (x − a)^{n+1},              (1.5.5)
where T_n is the Taylor polynomial of degree n for f at the point a, given by
    T_n(x) = Σ_{k=0}^{n} ( f^{(k)}(a) / k! ) (x − a)^k,                      (1.5.4)
and the second term on the right hand side of (1.5.5) is called the remainder term.
Proof: Let us assume x > a and prove the theorem; the proof is similar if x < a.
Define g(t) by
    g(t) = f(t) − T_n(t) − A (t − a)^{n+1}
and choose A so that g(x) = 0, which gives
    A = ( f(x) − T_n(x) ) / (x − a)^{n+1}.
Note that
    g^{(k)}(a) = 0   for k = 0, 1, ..., n.
Also, observe that the function g is continuous on [a, x] and differentiable in (a, x). Apply Rolle's theorem to g on [a, x] (after verifying all the hypotheses of Rolle's theorem) to get a point c_1 ∈ (a, x) with g'(c_1) = 0. Applying Rolle's theorem repeatedly to g', g'', ..., g^{(n)}, we obtain a point c_{n+1} between a and x such that g^{(n+1)}(c_{n+1}) = 0, which gives
    A = f^{(n+1)}(c_{n+1}) / (n + 1)!,
and therefore
    f(x) − T_n(x) = ( f^{(n+1)}(c_{n+1}) / (n + 1)! ) (x − a)^{n+1}.
Observe that the mean value theorem (Theorem 1.25) is a particular case of Taylor's theorem (take n = 0 in (1.5.5)).
Remark 1.30. The representation (1.5.5) is called the Taylor's formula for the function f about the point a.
Taylor's theorem helps us to obtain an approximate value of a sufficiently smooth function in a small neighborhood of a given point a when the value of f and all its derivatives up to a sufficient order are known at the point a. For instance, if we know f(a), f'(a), ..., f^{(n)}(a), and we seek an approximate value of f(a + h) for some real number h, then Taylor's theorem can be used to get
    f(a + h) ≈ f(a) + f'(a) h + ( f''(a) / 2! ) h² + ⋯ + ( f^{(n)}(a) / n! ) h^n.
Note here that we have not added the remainder term and therefore used the approximation symbol ≈. Observe that the remainder term
    ( f^{(n+1)}(ξ) / (n + 1)! ) h^{n+1}
is not known since it involves the evaluation of f^{(n+1)} at some unknown value ξ lying between a and a + h. Also, observe that as h → 0, the remainder term approaches zero, provided f^{(n+1)} is bounded. This means that for smaller values of h, the Taylor polynomial gives a good approximation of f(a + h).
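As a quick illustration (ours, not part of the original notes; the choices f(x) = e^x, a = 0, h = 0.5 and n = 4 are arbitrary), the following MATLAB sketch evaluates the Taylor approximation of f(a + h) and compares the actual truncation error with the remainder estimate:

% Taylor approximation of f(a+h) for f(x) = exp(x), a = 0, so f^(k)(a) = 1.
a = 0; h = 0.5; n = 4;
approx = 0;
for k = 0:n
    approx = approx + h^k / factorial(k);        % add f^(k)(a)/k! * h^k
end
err   = abs(exp(a + h) - approx);                % actual truncation error
bound = exp(a + h) * h^(n+1) / factorial(n+1);   % remainder estimate with M = e^(a+h)
fprintf('approx = %.8f, error = %.2e, bound = %.2e\n', approx, err, bound);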
Let I = [a, b] be an interval containing the point a and suppose |f^{(n+1)}(t)| ≤ M_{n+1} for all t ∈ I. Then the remainder term in (1.5.5) can be estimated as
    | ( f^{(n+1)}(ξ) / (n + 1)! ) (x − a)^{n+1} | ≤ ( M_{n+1} / (n + 1)! ) |x − a|^{n+1}.
We can further get an estimate of the remainder term that is independent of x as
    | ( f^{(n+1)}(ξ) / (n + 1)! ) (x − a)^{n+1} | ≤ ( M_{n+1} / (n + 1)! ) (b − a)^{n+1},
which holds for all x ∈ I. Observe that the right hand side of the above estimate is a fixed number. We refer to such estimates as remainder estimates.
In most applications of Taylor's theorem, one never knows ξ precisely. However, in view of the remainder estimate given above, it does not matter as long as we know that the remainder can be bounded by obtaining a bound M_{n+1} which is valid for all ξ between a and x.
Definition 1.32 (Truncation Error).
The remainder term involved in approximating f(x) by the Taylor polynomial T_n(x) is also called the truncation error.
Example 1.33. A second degree polynomial approximation to
    f(x) = √(x + 1),   x ∈ [−1, ∞),
using the Taylor's formula about a = 0 is given by
    f(x) ≈ 1 + x/2 − x²/8,
where the remainder term is neglected and hence what we obtained here is only an approximate representation of f.
The truncation error is obtained using the remainder term in the formula (1.5.5) with n = 2 and is given by
    x³ / ( 16 √( (1 + ξ)⁵ ) ),
for some point ξ between 0 and x.
Definition 1.34 (Taylor Series). Let f be an infinitely differentiable function defined in a neighborhood of a point a. The series
    Σ_{k=0}^{∞} ( f^{(k)}(a) / k! ) (x − a)^k
is called the Taylor's series of f about the point a.
The question now is when this series converges and what is the limit of this series. These questions are answered in the following theorem.
Theorem 1.35. Let f be C^∞(I) and let a ∈ I. Assume that there exists an open interval I_a ⊂ I of the point a such that there exists a constant M (which may depend on a) with
    | f^{(k)}(x) | ≤ M^k
for all x ∈ I_a and k = 0, 1, 2, ⋯. Then for each x ∈ I_a, we have
    f(x) = Σ_{k=0}^{∞} ( f^{(k)}(a) / k! ) (x − a)^k.
Example 1.36. As another example, let us approximate the function f(x) = cos(x) by a polynomial using Taylor's theorem about the point a = 0. First, let us take the Taylor's series expansion
    f(x) = cos(0) − sin(0) x − ( cos(0) / 2! ) x² + ( sin(0) / 3! ) x³ + ⋯
         = Σ_{k=0}^{∞} ( (−1)^k / (2k)! ) x^{2k}.
Truncating this series after the term k = n gives the approximation
    f(x) ≈ Σ_{k=0}^{n} ( (−1)^k / (2k)! ) x^{2k},
[Figure 1.1 about here.]
Fig. 1.1. Comparison between the graph of f(x) = cos(x) and the Taylor polynomials of degree 2 (n = 1) and degree 10 (n = 5) about the point a = 0.
which is the Taylor polynomial of degree 2n for the function f(x) = cos(x) about the point a = 0. The remainder term is given by
    (−1)^{n+1} ( cos(ξ) / (2(n + 1))! ) x^{2(n+1)},
where ξ lies between 0 and x. It is important to observe here that for a given n, we get the Taylor polynomial of degree 2n. Figure 1.1 shows the comparison between the Taylor polynomials (red dot and dash line) of degree 2 (n = 1) and degree 10 (n = 5) for f(x) = cos(x) about a = 0 and the graph of cos(x) (blue solid line). We observe that for n = 1, the Taylor polynomial gives a good approximation in a small neighborhood of a = 0. But sufficiently away from 0, this polynomial deviates significantly from the actual graph of f(x) = cos(x). Whereas, for n = 5, we get a good approximation in a sufficiently large neighborhood of a = 0.
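The comparison behind Figure 1.1 can be reproduced numerically with the following MATLAB sketch (ours; the sampling interval and the error report on [−1, 1] are arbitrary choices):

% Compare cos(x) with its Taylor polynomial of degree 2n about a = 0.
x = linspace(-6, 6, 400);
for n = [1 5]
    p = zeros(size(x));
    for k = 0:n
        p = p + (-1)^k * x.^(2*k) / factorial(2*k);   % degree-2n Taylor polynomial
    end
    mask = abs(x) <= 1;
    fprintf('n = %d: max |cos(x) - p(x)| on [-1,1] is %.2e\n', ...
            n, max(abs(cos(x(mask)) - p(mask))));
end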
1.6 Orders of Convergence

Definition 1.38 (Big Oh and Little oh). Let {a_n} and {b_n} be two sequences.
(1) We say a_n = O(b_n) (read as "a_n is big Oh of b_n") if there exist a constant C > 0 and a natural number N such that |a_n| ≤ C |b_n| for all n ≥ N.
(2) We say a_n = o(b_n) (read as "a_n is little oh of b_n") if for every ε > 0 there exists a natural number N such that |a_n| ≤ ε |b_n| for all n ≥ N.
Remark 1.39.
(1) If b_n ≠ 0 for every n, then we have a_n = O(b_n) if and only if the sequence {a_n / b_n} is bounded. That is, there exists a constant C such that
    | a_n / b_n | ≤ C.
(2) If b_n ≠ 0 for every n, then we have a_n = o(b_n) if and only if the sequence {a_n / b_n} converges to 0. That is,
    lim_{n→∞} a_n / b_n = 0.
(3) For any pair of sequences {a_n} and {b_n} such that a_n = o(b_n), it follows that a_n = O(b_n). The converse is not true. Consider the sequences a_n = n and b_n = 2n + 3, for which a_n = O(b_n) holds but a_n = o(b_n) does not hold.
Similar notions are used for functions: we say f(x) = O(g(x)) as x → x_0 if there exist a constant C and a δ > 0 such that |f(x)| ≤ C |g(x)| whenever |x − x_0| < δ.
For instance, by Taylor's theorem,
    cos(x) = Σ_{k=0}^{n} ( (−1)^k / (2k)! ) x^{2k} + (−1)^{n+1} ( cos(ξ) / (2(n + 1))! ) x^{2(n+1)},
where ξ lies between 0 and x. Writing g(x) for the remainder term, we have
    | g(x) | = ( | cos(ξ) | / (2(n + 1))! ) |x|^{2(n+1)} ≤ |x|^{2(n+1)} / (2(n + 1))!,
and therefore
    g(x) = O( x^{2(n+1)} )   as x → 0.
Suppose a sequence {a_n} converges to a, i.e., lim_{n→∞} a_n = a. We would like to measure the speed at which the convergence takes place. For example, consider
    lim_{n→∞} 1/(2n + 3) = 0   and   lim_{n→∞} 1/n² = 0.
We feel that the first sequence goes to zero linearly and the second goes with a much superior speed because of the presence of n² in its denominator. We will now define the notion of order of convergence precisely.
Definition 1.42 (Rate of Convergence or Order of Convergence).
Let {a_n} be a sequence such that lim_{n→∞} a_n = a.
(1) We say that the rate of convergence is at least linear if there exist a constant c < 1 and a natural number N such that
    |a_{n+1} − a| ≤ c |a_n − a|   for all n ≥ N.
(2) We say that the rate of convergence is at least superlinear if there exist a sequence {ε_n} that converges to 0 and a natural number N such that
    |a_{n+1} − a| ≤ ε_n |a_n − a|   for all n ≥ N.
(3) We say that the rate of convergence is at least quadratic if there exist a constant C (not necessarily less than 1) and a natural number N such that
    |a_{n+1} − a| ≤ C |a_n − a|²   for all n ≥ N.
(4) Let α ∈ R₊. We say that the rate of convergence is at least α if there exist a constant C (not necessarily less than 1) and a natural number N such that
    |a_{n+1} − a| ≤ C |a_n − a|^α   for all n ≥ N.
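The order α can be estimated from successive errors. The MATLAB sketch below (ours; the Newton iteration for √2 and the use of four iterations are arbitrary choices) estimates α from log(|e_{n+1}|/|e_n|) / log(|e_n|/|e_{n−1}|), where e_n = a_n − a:

% Estimate the order of convergence of the Newton iteration for sqrt(2).
a = sqrt(2); x = 1.0; e = zeros(1,4);
for n = 1:4
    x = (x + 2/x)/2;            % Newton iteration x_{n+1} = (x_n + 2/x_n)/2
    e(n) = abs(x - a);          % error |a_n - a|
end
for n = 2:3
    alpha = log(e(n+1)/e(n)) / log(e(n)/e(n-1));
    fprintf('n = %d, estimated order = %.3f\n', n, alpha);   % close to 2
end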
1.7 Exercises
Sequences of Real Numbers
(1) Let L be a real number and let {a_n} be a sequence of real numbers. If there exists a positive integer N such that
    |a_n − L| ≤ λ |a_{n−1} − L|
for all n ≥ N and for some fixed λ ∈ (0, 1), then show that a_n → L as n → ∞.
(2) Consider the sequences {a_n} and {b_n}, where
    a_n = 1/n,   b_n = 1/n²,   n = 1, 2, ⋯.
Clearly, both the sequences converge to zero. For ε = 10^{−2}, obtain the smallest positive integers N_a and N_b such that
    |a_n| ≤ ε whenever n ≥ N_a,   and   |b_n| ≤ ε whenever n ≥ N_b.
For any ε > 0, show that N_a > N_b.
(3) Let {x_n} and {y_n} be two sequences such that x_n, y_n ∈ [a, b] and x_n ≤ y_n for each n = 1, 2, ⋯. If x_n → b as n → ∞, then show that the sequence {y_n} converges. Find the limit of the sequence {y_n}.
(4) Let I_n = [ (n − 2)/(2n), (n + 2)/(2n) ], n = 1, 2, ⋯, and let {a_n} be a sequence with a_n chosen arbitrarily in I_n for each n = 1, 2, ⋯. Show that a_n → 1/2 as n → ∞.
(9) Let f : [0, 1] → [0, 1] be a continuous function. Prove that the equation f(x) = x has at least one solution lying in the interval [0, 1] (Note: A solution of this equation is called a fixed point of the function f).
(10) Show that the equation f(x) = x, where
    f(x) = sin( (x + 1)/2 ),   x ∈ [−1, 1],
has at least one solution in [−1, 1].
Differentiation
(11) Let c ∈ (a, b) and f : (a, b) → R be differentiable at c. If c is a local extremum (maximum or minimum) of f, then show that f'(c) = 0.
(12) Suppose f is differentiable in an open interval (a, b). Prove the following statements:
(a) If f'(x) ≥ 0 for all x ∈ (a, b), then f is non-decreasing.
(b) If f'(x) = 0 for all x ∈ (a, b), then f is constant.
(c) If f'(x) ≤ 0 for all x ∈ (a, b), then f is non-increasing.
(13) Let f : [a, b] → R be given by f(x) = x². Find a point c specified by the mean value theorem for derivatives. Verify that this point lies in the interval (a, b).
Integration
(14) Let g : [0, 1] → R be a continuous function. Show that there exists a c ∈ (0, 1) such that
    ∫_0^1 x² (1 − x)² g(x) dx = (1/30) g(c).
(15) If n is a positive integer, show that
    ∫_{√(nπ)}^{√((n+1)π)} sin(t²) dt = (−1)^n / c,
where √(nπ) ≤ c ≤ √((n + 1)π).
Taylor's Theorem
(16) Find the Taylor's polynomial of degree 2 for the function
    f(x) = √(x + 1)
about the point a = 1. Also find the remainder.
(17) Use Taylor's formula about a = 0 to evaluate approximately the value of the function f(x) = e^x at x = 0.5 using three terms (i.e., n = 2) in the formula. Obtain the remainder R_2(0.5) in terms of the unknown c. Compute approximately the possible values of c and show that these values lie in the interval (0, 0.5).
(18) Prove or disprove the following statements:
(i) (n + 1)/n² = o(1/n) as n → ∞.
(ii) (n + 1)/n² = O(1/n) as n → ∞.
(iii) (n + 1)/√n = o(1) as n → ∞.
(iv) 1/ln n = o(1/n) as n → ∞.
(v) 1/(n ln n) = o(1/n) as n → ∞.
(vi) e^n / n^5 = O(1/n) as n → ∞.
CHAPTER 2
Error Analysis
Numerical analysis deals with developing methods, called numerical methods, to approximate a solution of a given mathematical problem (whenever a solution exists). The approximate solution obtained by a numerical method involves an error, which is precisely the difference between the exact solution and the approximate solution. Thus, we have
    Exact Solution = Approximate Solution + Error.
We call this error the mathematical error.
The study of numerical methods is incomplete if we don't develop algorithms and implement the algorithms as computer codes. The outcome of the computer code is a set of numerical values to the approximate solution obtained using a numerical method. Such a set of numerical values is called the numerical solution to the given mathematical problem. During the process of computation, the computer introduces a new error, called the arithmetic error, and we have
    Approximate Solution = Numerical Solution + Arithmetic Error.
The error involved in the numerical solution when compared to the exact solution can be worse than the mathematical error and is now given by
    Exact Solution = Numerical Solution + Mathematical Error + Arithmetic Error.
The Total Error is defined as
    Total Error = Mathematical Error + Arithmetic Error.
A digital calculating device can hold only a finite number of digits because of
memory restrictions. Therefore, a number cannot be stored exactly. Certain approximation needs to be done, and only an approximate value of the given number will
finally be stored in the device. For further calculations, this approximate value is
used instead of the exact value of the number. This is the source of arithmetic error.
In this chapter, we introduce the floating-point representation of a real number and illustrate a few ways to obtain a floating-point approximation of a given real number. We further introduce different types of errors that we come across in numerical computations.

2.1 Floating-Point Representation

Let β ∈ N with β ≥ 2. Any real number can be represented in the form
    (−1)^s × (.d_1 d_2 ⋯ d_n d_{n+1} ⋯)_β × β^e,                          (2.1.1)
where
    (.d_1 d_2 ⋯ d_n d_{n+1} ⋯)_β = d_1/β + d_2/β² + ⋯ + d_n/β^n + d_{n+1}/β^{n+1} + ⋯   (2.1.2)
is a β-fraction called the mantissa, s is called the sign, e is an integer called the exponent, and the number β is called the radix. The representation (2.1.1) of a real number is called the floating-point representation.
Remark 2.1. When β = 2, the floating-point representation (2.1.1) is called the binary floating-point representation and when β = 10, it is called the decimal floating-point representation. Throughout this course, we always take β = 10.
Due to memory restrictions, a computing device can store only a finite number of digits in the mantissa. In this section, we introduce the floating-point approximation and discuss how a given real number can be approximated.

2.1.1 Floating-Point Approximation

A computing device stores a real number with only a finite number of digits in the mantissa. Although different computing devices have different ways of representing the numbers, here we introduce a mathematical form of this representation, which we will use throughout this course.
Definition 2.2 (n-Digit Floating-point Number).
Let β ∈ N with β ≥ 2. An n-digit floating-point number in base β is of the form
    (−1)^s × (.d_1 d_2 ⋯ d_n)_β × β^e,                                     (2.1.3)
where
    (.d_1 d_2 ⋯ d_n)_β = d_1/β + d_2/β² + ⋯ + d_n/β^n,                     (2.1.4)
with digits d_i and exponent e as in (2.1.1).
If your computer is not showing inf for i = 308.25472, try increasing the value of i till you get inf.
Example 2.8 (Underflow). Run the following MATLAB code on a computer with a 32-bit Intel processor:
j=-323.6;
if(10^j>0)
    disp('10^j is still positive')      % illustrative message; not the original output text
else
    disp('10^j has underflowed to 0')   % illustrative message; not the original output text
end
Remark 2.11. Most of the modern processors, including Intel, use the IEEE 754 standard format. This format uses 52 bits for the mantissa (in the 64-bit binary representation), 11 bits for the exponent and 1 bit for the sign. This representation is called the double precision number.
When we perform a computation without any floating-point approximation, we say that the computation is done using infinite precision (also called exact arithmetic).
    f(100000) = 100000 ( √100001 − √100000 ).
The evaluation of √100001 using six-digit rounding is as follows:
    √100001 ≈ 316.229347 ≈ 0.316229347 × 10³,
so that fl(√100001) = 0.316229 × 10³. Similarly,
    fl(√100000) = 0.316228 × 10³.
The six-digit rounded approximation of the difference between these two numbers is
    fl( fl(√100001) − fl(√100000) ) = 0.1 × 10^{−2}.
Finally, we have
    fl(f(100000)) = fl(100000) × (0.1 × 10^{−2})
                  = (0.1 × 10^{6}) × (0.1 × 10^{−2})
                  = 100.
Using six-digit chopping, the value of fl(f(100000)) is 200.
Remark 2.21. The number of significant digits roughly measures the number of leading non-zero digits of x_A that are correct relative to the corresponding digits in the true value x. However, this is not a precise way to get the number of significant digits, as is evident from the above examples.
The role of significant digits in numerical calculations is very important in the sense that the loss of significant digits may result in drastic amplification of the relative error, as illustrated in the following example.
    f(x) = x / ( √(x + 1) + √x ).
With this new form of f, we obtain f(100000) = 158.114000 using six-digit rounding.
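The effect of this rewriting can be seen in finite precision. The MATLAB sketch below (ours; using single precision to make the cancellation visible at x = 100000) compares the two algebraically equivalent forms of f(x) = x(√(x + 1) − √x):

% Loss of significance: naive form vs. rewritten form in single precision.
x = single(100000);
f1 = x .* (sqrt(x + 1) - sqrt(x));   % subtraction of nearly equal numbers
f2 = x ./ (sqrt(x + 1) + sqrt(x));   % rewritten form, no cancellation
fprintf('naive form: %.6f   rewritten form: %.6f\n', f1, f2);
% In double precision the exact value is about 158.1135.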
Example 2.24. Consider evaluating the function
    f(x) = 1 − cos x
near x = 0. Since cos x ≈ 1 for x near zero, there will be loss of significance in the process of evaluating f(x) for x near zero. So, we have to use an alternative formula for f(x) such as
    f(x) = 1 − cos x = (1 − cos²x) / (1 + cos x) = sin²x / (1 + cos x),
which can be evaluated quite accurately for small x.
Remark 2.25. Unlike the above examples, we may not be able to write an equivalent formula for the given function to avoid loss of significance in the evaluation. In such cases, we have to go for a suitable approximation of the given function by other functions, for instance a Taylor's polynomial of desired degree, that do not involve loss of significance.
2.4.1 Addition and Subtraction

Let x_A and y_A be approximations to x_T and y_T, respectively. The relative error in x_A ± y_A is given by
    E_r(x_A ± y_A) = ( x_T E_r(x_A) ± y_T E_r(y_A) ) / ( x_T ± y_T ).        (2.4.12)
The above expression shows that there can be a drastic increase in the relative error during subtraction of two approximate numbers whenever x_T ≈ y_T, as we have witnessed in Examples 2.22 and 2.23. On the other hand, it is easy to see from (2.4.12) that
    |E_r(x_A + y_A)| ≤ |E_r(x_A)| + |E_r(y_A)|,
which shows that the relative error propagates slowly in addition. Note that such an inequality in the case of subtraction is not possible.
2.4.2 Multiplication
The relative error E_r(x_A y_A) is given by
    E_r(x_A y_A) = ( x_T y_T − x_A y_A ) / ( x_T y_T )
                 = ( x_T y_T − (x_T − Δx)(y_T − Δy) ) / ( x_T y_T ),   where Δx = x_T − x_A, Δy = y_T − y_A,
                 = ( Δx y_T + x_T Δy − Δx Δy ) / ( x_T y_T )
                 = Δx/x_T + Δy/y_T − (Δx/x_T)(Δy/y_T).
Thus, we have
    E_r(x_A y_A) = E_r(x_A) + E_r(y_A) − E_r(x_A) E_r(y_A).                  (2.4.13)
Thus, we have
    E_r(x_A / y_A) = ( 1 / (1 − E_r(y_A)) ) ( E_r(x_A) − E_r(y_A) ).          (2.4.14)
The above expression shows that the relative error increases drastically during division whenever E_r(y_A) ≈ 1. This means that y_A has nearly 100% error when compared to y, which is very unlikely because we always expect the relative error to be very small, i.e., very close to zero. In this case the right hand side is approximately equal to E_r(x_A) − E_r(y_A). Hence, we have
    |E_r(x_A / y_A)| ≈ |E_r(x_A) − E_r(y_A)| ≤ |E_r(x_A)| + |E_r(y_A)|,
which shows that the relative error propagates slowly in division.
2.4.4 Total Error
In Subsection 2.1.4, we discussed the procedure of performing arithmetic operations using n-digit floating-point approximation. The computed value fl(fl(x) ⊙ fl(y)) involves an error (when compared to the exact value x ⊙ y) which comprises
(1) the error in fl(x) and fl(y) due to n-digit rounding or chopping of x and y, respectively, and
(2) the error in fl(fl(x) ⊙ fl(y)) due to n-digit rounding or chopping of the number fl(x) ⊙ fl(y).
The total error is defined as
    (x ⊙ y) − fl(fl(x) ⊙ fl(y)) = [ (x ⊙ y) − (fl(x) ⊙ fl(y)) ] + [ (fl(x) ⊙ fl(y)) − fl(fl(x) ⊙ fl(y)) ],
in which the first term on the right hand side is called the propagated error and the second term is called the floating-point error. The relative total error is obtained by dividing both sides of the above expression by x ⊙ y.
Consider the integrals
    I_n = ∫_0^1 x^n / (x + 5) dx,   n = 0, 1, 2, ⋯,
which satisfy I_n + 5 I_{n−1} = 1/n. This gives the forward iteration
    I_n = 1/n − 5 I_{n−1},   I_0 = ln(6/5),
and, equivalently, the backward iteration
    I_{n−1} = 1/(5n) − I_n/5,   starting from I_30 ≈ 0.54046330 × 10^{−2}.
The following table shows the computed value of I_n using the forward iteration. The numbers are computed in MATLAB using double precision arithmetic and the final answer is rounded to 6 digits after the decimal point.

    n     Forward Iteration
    1     0.088392
    5     0.028468
    10    0.015368
    15    0.010522
    20    0.004243
    25    11.740469
    30    -36668.803026

Clearly the backward iteration gives the exact value up to the number of digits shown, whereas the forward iteration tends to increase the error and gives entirely wrong values. This is due to the propagation of error from one iteration to the next iteration. In the forward iteration, the total error from one iteration is magnified by a factor of 5 at the next iteration. In the backward iteration, the total error from one iteration is divided by 5 at the next iteration. Thus, in this example, with each iteration, the total error tends to increase rapidly in the forward iteration and tends to decrease rapidly in the backward iteration.
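Both iterations are easy to reproduce. The following MATLAB sketch (ours; the starting value for the backward sweep is the approximate I_30 quoted above) computes the two sequences:

% Forward vs. backward evaluation of I_n = integral_0^1 x^n/(x+5) dx.
N = 30;
If = zeros(1, N+1); If(1) = log(6/5);          % If(k) holds I_{k-1}
for n = 1:N
    If(n+1) = 1/n - 5*If(n);                   % forward: error multiplied by 5
end
Ib = zeros(1, N+1); Ib(N+1) = 0.0054046330;    % approximate starting value I_30
for n = N:-1:1
    Ib(n) = 1/(5*n) - Ib(n+1)/5;               % backward: error divided by 5
end
fprintf('forward  I_20 = %12.6f\nbackward I_20 = %12.6f\n', If(21), Ib(21));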
Suppose x_A is an approximation to x, and consider the error committed in evaluating f at x_A instead of at x. By the mean value theorem,
    f(x) − f(x_A) = f'(ξ)(x − x_A)
for some ξ between x and x_A. Thus, we have
    E_r(f(x_A)) = ( f(x) − f(x_A) ) / f(x) = ( f'(ξ) / f(x) ) x E_r(x_A).      (2.5.15)
Since x_A and x are assumed to be very close to each other and ξ lies between x and x_A, we may make the approximation
    f(x) − f(x_A) ≈ f'(x)(x − x_A).
In view of (2.5.15), we have
    E_r(f(x_A)) ≈ ( f'(x) x / f(x) ) E_r(x_A).                                 (2.5.16)
The expression inside the brackets on the right hand side of (2.5.16) is the amplification factor for the relative error in f(x_A) in terms of the relative error in x_A. Thus, this expression plays an important role in understanding the propagation of the relative error in evaluating the function value f(x), and hence motivates the following definition.
Definition 2.27 (Condition Number of a Function).
The condition number of a continuously differentiable function f at a point x = c is given by
    | f'(c) c / f(c) |.                                                         (2.5.17)
The condition number of a function at a point x = c can be used to decide whether the evaluation of the function at x = c is well-conditioned or ill-conditioned, depending on whether this condition number is small or large as we approach this point. It is not possible to decide a priori how large the condition number should be to say that the function evaluation is ill-conditioned; it depends on the circumstances in which we are working.
Definition 2.28 (Well-Conditioned and Ill-Conditioned). The process of evaluating a continuously differentiable function f at a point x = c is said to be well-conditioned if the condition number
    | f'(c) c / f(c) |
at c is small. The process of evaluating a function at x = c is said to be ill-conditioned if it is not well-conditioned.
Example 2.29. Consider the function f(x) = √x, for all x ∈ [0, ∞). Then
    f'(x) = 1/(2√x),   for all x ∈ (0, ∞).
The condition number of f is
    | f'(x) x / f(x) | = 1/2,   for all x ∈ (0, ∞),
which shows that taking square roots is a well-conditioned process.
Example 2.30. Consider the function
    f(x) = 10/(1 − x²),   x ≠ ±1.
Then f'(x) = 20x/(1 − x²)², and therefore
    | f'(x) x / f(x) | = | ( 20x/(1 − x²)² ) x / ( 10/(1 − x²) ) | = 2x² / |1 − x²|,
and this number can be quite large for |x| near 1. Thus, for x near 1 or −1, the process of evaluating this function is ill-conditioned.
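The growth of this condition number near x = 1 can be observed numerically. The MATLAB sketch below (ours; the sample points and the central-difference derivative are arbitrary choices) estimates |f'(x) x / f(x)| for f(x) = 10/(1 − x²):

% Condition number of f(x) = 10/(1-x^2), estimated with a numerical derivative.
f  = @(x) 10 ./ (1 - x.^2);
x  = [0.5 0.9 0.99 0.999];
h  = 1e-6;
fp = (f(x + h) - f(x - h)) / (2*h);   % central difference approximation of f'
cond_f = abs(fp .* x ./ f(x));        % condition number at each point
disp([x; cond_f]);                    % grows like 2x^2/|1 - x^2| near x = 1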
The above two examples give us a feeling that if the process of evaluating a function is well-conditioned, then we tend to get a small propagated relative error. But this is not true in general, as shown in the following example.
Example 2.31. Consider the function
    f(x) = √(x + 1) − √x,   for all x ∈ (0, ∞).
For all x ∈ (0, ∞), the condition number of this function is
    | f'(x) x / f(x) | = (1/2) | ( 1/√(x + 1) − 1/√x ) x / ( √(x + 1) − √x ) |
                       = x / ( 2 √(x + 1) √x )
                       ≤ 1/2,                                                   (2.5.18)
which shows that the process of evaluating f is well-conditioned for all x ∈ (0, ∞). But, if we calculate f(12345) using six-digit rounding, we find
    f(12345) = √12346 − √12345 ≈ 111.113 − 111.108 = 0.005,
while, actually, f(12345) = 0.00450003262627751⋯. The calculated answer has about 10% error.
The above example shows that a well-conditioned process of evaluating a function at
a point is not enough to ensure the accuracy in the corresponding computed value. We
need to check for the stability of the computation, which we discuss in the following
subsection.
2.5.1 Stable and Unstable Computations
Suppose there are n steps to evaluate a function f(x) at a point x = c. Then the total process of evaluating this function is said to have instability if at least one of the n steps is ill-conditioned. If all the steps are well-conditioned, then the process is said to be stable.
Example 2.32. We continue the discussion of Example 2.31 and check the stability in evaluating the function f. Let us analyze the computational process. The evaluation of f at x = x_0 consists of the following four computational steps:
    x_1 := x_0 + 1,   x_2 := √x_1,   x_3 := √x_0,   x_4 := x_2 − x_3.
Now consider the last two steps, where we have already computed x_2 and are going to compute x_3 and finally evaluate the function
    f_4(t) := x_2 − t.
At this step the condition number for f_4 is given by
    | f_4'(t) t / f_4(t) | = | t / (x_2 − t) |.
Thus, f_4 is ill-conditioned when t approaches x_2. Therefore, the above process of evaluating the function f(x) is unstable.
Let us rewrite the same function f(x) as
    f̃(x) = 1 / ( √(x + 1) + √x ).
The computational process of evaluating f̃ at x = x_0 is
    x_1 := x_0 + 1,   x_2 := √x_1,   x_3 := √x_0,   x_4 := x_2 + x_3,   x_5 := 1/x_4.
It is easy to verify that each of the above steps is well-conditioned. For instance, the last step defines
    f_5(t) = 1 / (x_2 + t),
whose condition number satisfies
    | f_5'(t) t / f_5(t) | = | t / (x_2 + t) | ≈ 1/2
for t sufficiently close to x_2. Therefore, this process of evaluating f̃(x) is stable. Recall from Example 2.23 that the above expression gives a more accurate value for sufficiently large x.
Remark 2.33. As discussed in Remark 2.25, we may not be lucky all the time to come out with an alternate expression that leads to a stable evaluation for a given function when the original expression leads to an unstable evaluation. In such situations, we have to compromise and go for a suitable approximation by other functions with a stable evaluation process. For instance, we may try approximating the given function with its Taylor's polynomial, if possible.
2.6 Exercises
Floating-Point Approximation
(1) Let X be a sufficiently large number which results in an overflow of memory on a computing device. Let x be a sufficiently small number which results in an underflow of memory on the same computing device. Then give the output of the following operations:
(i) x × X   (ii) 3 × X   (iii) 3 × x   (iv) x/X   (v) X/x.
(2) In the following problems, show all the steps involved in the computation.
(i) Using 5-digit rounding, compute 37654 + 25.874 − 37679.
(ii) Let a = 0.00456, b = 0.123, c = −0.128. Using 3-digit rounding, compute (a + b) + c, and a + (b + c). What is your conclusion?
(iii) Let a = 2, b = 0.6, c = −0.602. Using 3-digit rounding, compute a × (b + c), and (a × b) + (a × c). What is your conclusion?
(3) To find the mid-point of an interval [a, b], the formula (a + b)/2 is often used. Compute the mid-point of the interval [0.982, 0.987] using 3-digit chopping. On the number line represent all the three points. What do you observe? Now use the more geometric formula a + (b − a)/2 to compute the mid-point, once again using 3-digit chopping. What do you observe this time? Why is the second formula more geometric?
(4) In a computing device that uses n-digit rounding binary floating-point arithmetic, show that δ = 2^{−n} is the machine epsilon. What is the machine epsilon in a computing device that uses n-digit rounding decimal floating-point arithmetic? Justify your answer.
Types of Errors
(5) If fl(x) is the approximation of a real number x in a computing device, and ε is the corresponding relative error, then show that fl(x) = (1 − ε)x.
(6) Let x, y and z be real numbers whose floating point approximations in a computing device coincide with x, y and z respectively. Show that the relative error in computing x(y + z) equals ε_1 + ε_2 − ε_1 ε_2, where ε_1 = E_r(fl(y + z)) and ε_2 = E_r(fl(x × fl(y + z))).
(7) Let ε = E_r(fl(x)). Show that
(i) |ε| ≤ 10^{−n+1} if the computing device uses n-digit (decimal) chopping.
(ii) |ε| ≤ (1/2) 10^{−n+1} if the computing device uses n-digit (decimal) rounding.
(iii) Can the equality hold in the above inequalities?
(20) Check for stability of computing the function
    f(x) = (x + 1)^{1/3} − x^{1/3}
for large values of x.
(21) Check for stability of computing the function
    g(x) = ( (3 + x²/3) − (3 − x²/3) ) / x²
for values of x very close to 0.
(22) Check for stability of computing the function
    h(x) = sin²x / (1 − cos²x)
for values of x very close to 0.
CHAPTER 3
Numerical Linear Algebra
In this chapter, we study methods for solving systems of linear equations and for computing an eigenvalue and the corresponding eigenvector of a matrix. The methods for solving linear systems are categorized into two types, namely, direct methods and iterative methods. Theoretically, direct methods give the exact solution of a linear system and therefore these methods do not involve mathematical error. However, when we implement the direct methods on a computer, because of the presence of arithmetic error, the computed value from a computer will still be an approximate solution. On the other hand, an iterative method generates a sequence of approximate solutions to a given linear system which is expected to converge to the exact solution.
An important direct method is the well-known Gaussian elimination method. After a short introduction to linear systems in Section 3.1, we discuss the direct methods in Section 3.2. We recall the Gaussian elimination method in Subsection 3.2.1 and study the effect of arithmetic error on the computed solution. We further count the number of arithmetic operations involved in computing the solution. This operation count reveals that the method is expensive in terms of computational time. In particular, when the given system has a tri-diagonal structure, the Gaussian elimination method can be suitably modified so that the resulting method, called the Thomas algorithm, is more efficient in terms of computational time. After introducing the Thomas algorithm in Subsection 3.2.4, we discuss the LU factorization methods for a given matrix and the solution of a system of linear equations using LU factorization in Subsection 3.2.5.
Some matrices are sensitive to even a small error in the right hand side vector of a linear system. Such a matrix can be identified with the help of the condition number of the matrix. The condition number of a matrix is defined in terms of the matrix norm. In Section 3.3, we introduce the notion of matrix norms, define the condition number of a matrix and discuss a few important theorems that are used in the error analysis of iterative methods. We continue the chapter with a discussion of iterative methods for linear systems in Section 3.4, where we introduce two basic iterative methods and discuss sufficient conditions under which the methods converge.
3.1 System of Linear Equations

The general form of a system of n linear equations in the n unknowns x_1, x_2, ⋯, x_n is
    a_{11} x_1 + a_{12} x_2 + ⋯ + a_{1n} x_n = b_1
    a_{21} x_1 + a_{22} x_2 + ⋯ + a_{2n} x_n = b_2
        ⋮
    a_{n1} x_1 + a_{n2} x_2 + ⋯ + a_{nn} x_n = b_n.                         (3.1.1)
Throughout this chapter, we assume that the coefficients a_{ij} and the right hand side numbers b_i, i, j = 1, 2, ⋯, n are real.
The above system of linear equations can be written in the matrix notation as
    [ a_{11} a_{12} ⋯ a_{1n} ] [ x_1 ]   [ b_1 ]
    [ a_{21} a_{22} ⋯ a_{2n} ] [ x_2 ] = [ b_2 ]                            (3.1.2)
    [   ⋮      ⋮         ⋮   ] [  ⋮  ]   [  ⋮  ]
    [ a_{n1} a_{n2} ⋯ a_{nn} ] [ x_n ]   [ b_n ]
or, in short,
    Ax = b,                                                                  (3.1.3)
where A stands for the n × n matrix with entries a_{ij}, x = (x_1, x_2, ⋯, x_n)^T and the right hand side vector b = (b_1, b_2, ⋯, b_n)^T.
Let us now state a result concerning the solvability of the system (3.1.2).
Theorem 3.1. Let A be an n × n matrix and b ∈ R^n. Then the following statements concerning the system of linear equations Ax = b are equivalent.
(1) det(A) ≠ 0.
(2) For each right hand side vector b, the system Ax = b has a unique solution x.
3.2.1 Naive Gaussian Elimination Method

We describe the method for a system of three equations in three unknowns; the general case is analogous:
    a_{11} x_1 + a_{12} x_2 + a_{13} x_3 = b_1
    a_{21} x_1 + a_{22} x_2 + a_{23} x_3 = b_2                               (3.2.4)
    a_{31} x_1 + a_{32} x_2 + a_{33} x_3 = b_3.
For convenience, we call the first, second, and third equations by names E_1, E_2, and E_3 respectively.
Step 1: If a_{11} ≠ 0, then define
    m_{21} = a_{21} / a_{11},   m_{31} = a_{31} / a_{11}.                    (3.2.5)
We will now obtain a new system that is equivalent to the system (3.2.4) as follows:
Retain the first equation E_1 as it is.
Replace the second equation E_2 by the equation E_2 − m_{21} E_1.
Replace the third equation E_3 by the equation E_3 − m_{31} E_1.
The new system equivalent to (3.2.4) is given by
    a_{11} x_1 + a_{12} x_2 + a_{13} x_3 = b_1
    0 + a_{22}^{(2)} x_2 + a_{23}^{(2)} x_3 = b_2^{(2)}                       (3.2.6)
    0 + a_{32}^{(2)} x_2 + a_{33}^{(2)} x_3 = b_3^{(2)},
where the coefficients are given by
    a_{ij}^{(2)} = a_{ij} − m_{i1} a_{1j},   b_i^{(2)} = b_i − m_{i1} b_1,   i, j = 2, 3.   (3.2.7)
Note that the variable x_1 has been eliminated from the last two equations.
Step 2: If a_{22}^{(2)} ≠ 0, then define
    m_{32} = a_{32}^{(2)} / a_{22}^{(2)}.                                     (3.2.8)
We still use the same names E_1, E_2, E_3 for the first, second, and third equations of the modified system (3.2.6), respectively. We will now obtain a new system that is equivalent to the system (3.2.6) as follows:
Retain the first two equations in (3.2.6) as they are.
Replace the third equation by the equation E_3 − m_{32} E_2.
The new system is given by
    a_{11} x_1 + a_{12} x_2 + a_{13} x_3 = b_1
    0 + a_{22}^{(2)} x_2 + a_{23}^{(2)} x_3 = b_2^{(2)}                       (3.2.9)
    0 + 0 + a_{33}^{(3)} x_3 = b_3^{(3)},
where a_{33}^{(3)} = a_{33}^{(2)} − m_{32} a_{23}^{(2)} and b_3^{(3)} = b_3^{(2)} − m_{32} b_2^{(2)}.
Observe that the system (3.2.9) is readily solvable for x_3 if the coefficient a_{33}^{(3)} ≠ 0. Substituting the value of x_3 in the second equation of (3.2.9), we can solve for x_2. Substituting the values of x_2 and x_3 in the first equation, we can solve for x_1. This solution phase of the (Naive) Gaussian elimination method is called the Backward substitution phase.
The coefficient matrix of the system (3.2.9) is an upper triangular matrix given by
    U = [ a_{11}   a_{12}        a_{13}
          0        a_{22}^{(2)}  a_{23}^{(2)}
          0        0             a_{33}^{(3)} ],                              (3.2.10)
and the multipliers collected along the way form the lower triangular matrix
    L = [ 1       0       0
          m_{21}  1       0
          m_{31}  m_{32}  1 ].                                                (3.2.11)
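A compact MATLAB sketch of the elimination and backward substitution just described (ours; the test data are the coefficients of the 3 × 3 example solved later in this section, whose exact solution is (2.6, −3.8, −5)):

% Naive Gaussian elimination with backward substitution (no pivoting).
A = [6 2 2; 2 2/3 1/3; 1 2 -1];  b = [-2; 1; 0];
n = length(b);
for k = 1:n-1
    for i = k+1:n
        m = A(i,k) / A(k,k);                 % multiplier m_{ik}
        A(i,k:n) = A(i,k:n) - m*A(k,k:n);    % E_i <- E_i - m_{ik} E_k
        b(i) = b(i) - m*b(k);
    end
end
x = zeros(n,1);
for i = n:-1:1                               % backward substitution
    x(i) = (b(i) - A(i,i+1:n)*x(i+1:n)) / A(i,i);
end
disp(x')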
Example 3.3. Consider the system of equations
    [ 0  1 ] [ x_1 ]   [ 1 ]
    [ 1  1 ] [ x_2 ] = [ 2 ].                                                 (3.2.12)
Step 1 cannot be started as a_{11} = 0. Thus the naive Gaussian elimination method fails.
Example 3.4. Let 0 < ε ≪ 1. Consider the system of equations
    [ ε  1 ] [ x_1 ]   [ 1 ]
    [ 1  1 ] [ x_2 ] = [ 2 ].                                                 (3.2.13)
Since ε ≠ 0, after Step 1 of the Naive Gaussian elimination method, we get the system
    [ ε        1        ] [ x_1 ]   [ 1          ]
    [ 0   1 − ε^{−1}    ] [ x_2 ] = [ 2 − ε^{−1} ].                           (3.2.14)
Thus the solution is given by
    x_2 = ( 2 − ε^{−1} ) / ( 1 − ε^{−1} ),   x_1 = ( 1 − x_2 ) ε^{−1}.        (3.2.15)
Example 3.5. Consider the system of equations
    6 x_1 + 2 x_2 + 2 x_3 = −2
    2 x_1 + (2/3) x_2 + (1/3) x_3 = 1                                         (3.2.17)
    x_1 + 2 x_2 − x_3 = 0.
Let us solve this system using the (naive) Gaussian elimination method using 4-digit rounding.
In 4-digit rounding approximation, the above system takes the form
    6.000 x_1 + 2.000 x_2 + 2.000 x_3 = −2.000
    2.000 x_1 + 0.6667 x_2 + 0.3333 x_3 = 1.000
    1.000 x_1 + 2.000 x_2 − 1.000 x_3 = 0.000.
After eliminating x_1 from the second and third equations, we get (with m_{21} = 0.3333, m_{31} = 0.1667)
    6.000 x_1 + 2.000 x_2 + 2.000 x_3 = −2.000
    0.000 x_1 + 0.0001 x_2 − 0.3333 x_3 = 1.667                               (3.2.18)
    0.000 x_1 + 1.667 x_2 − 1.333 x_3 = 0.3334.
After eliminating x_2 from the third equation, we get (with m_{32} = 16670)
    6.000 x_1 + 2.000 x_2 + 2.000 x_3 = −2.000
    0.000 x_1 + 0.0001 x_2 − 0.3333 x_3 = 1.667
    0.000 x_1 + 0.0000 x_2 + 5555 x_3 = −27790.
Using back substitution, we get x_1 = 1.335, x_2 = 0 and x_3 = −5.003, whereas the actual solution is x_1 = 2.6, x_2 = −3.8 and x_3 = −5. The difficulty with this elimination process is that the coefficient of x_2 in the second equation of (3.2.18) should have been zero, but rounding error prevented it and makes the relative error in the computed solution very large.
The above examples highlight the inadequacy of the Naive Gaussian elimination method. These inadequacies can be overcome by modifying the procedure of the Naive Gaussian elimination method. There are many kinds of modifications. We will discuss one of the most popular modified methods, which is called the Modified Gaussian elimination method with partial pivoting.
3.2.2 Modified Gaussian Elimination Method with Partial Pivoting

Consider again the system of three equations
    a_{11} x_1 + a_{12} x_2 + a_{13} x_3 = b_1
    a_{21} x_1 + a_{22} x_2 + a_{23} x_3 = b_2                                (3.2.19)
    a_{31} x_1 + a_{32} x_2 + a_{33} x_3 = b_3.
For convenience, we call the first, second, and third equations by names E_1, E_2, and E_3 respectively.
Step 1: Define s_1 = max { |a_{11}|, |a_{21}|, |a_{31}| }. Note that s_1 ≠ 0 (why?). Let k be the least number such that s_1 = |a_{k1}|. Interchange the first equation and the k-th equation. Let us re-write the system after this modification:
    a_{11}^{(1)} x_1 + a_{12}^{(1)} x_2 + a_{13}^{(1)} x_3 = b_1^{(1)}
    a_{21}^{(1)} x_1 + a_{22}^{(1)} x_2 + a_{23}^{(1)} x_3 = b_2^{(1)}        (3.2.20)
    a_{31}^{(1)} x_1 + a_{32}^{(1)} x_2 + a_{33}^{(1)} x_3 = b_3^{(1)},
where
    a_{11}^{(1)} = a_{k1}, a_{12}^{(1)} = a_{k2}, a_{13}^{(1)} = a_{k3}, a_{k1}^{(1)} = a_{11}, a_{k2}^{(1)} = a_{12}, a_{k3}^{(1)} = a_{13}; b_1^{(1)} = b_k, b_k^{(1)} = b_1,   (3.2.21)
and the rest of the coefficients a_{ij}^{(1)} are the same as a_{ij}, as all equations other than the first and the k-th remain untouched by the interchange of the first and k-th equations. Now eliminate the x_1 variable from the second and third equations of the system (3.2.20). Define
    m_{21} = a_{21}^{(1)} / a_{11}^{(1)},   m_{31} = a_{31}^{(1)} / a_{11}^{(1)}.   (3.2.22)
We will now obtain a new system that is equivalent to the system (3.2.20) as follows:
The first equation will be retained as it is.
Replace the second equation by the equation E_2 − m_{21} E_1.
Replace the third equation by the equation E_3 − m_{31} E_1.
The new system is given by
    a_{11}^{(1)} x_1 + a_{12}^{(1)} x_2 + a_{13}^{(1)} x_3 = b_1^{(1)}
    0 + a_{22}^{(2)} x_2 + a_{23}^{(2)} x_3 = b_2^{(2)}                        (3.2.23)
    0 + a_{32}^{(2)} x_2 + a_{33}^{(2)} x_3 = b_3^{(2)},
where
    a_{ij}^{(2)} = a_{ij}^{(1)} − m_{i1} a_{1j}^{(1)},   b_i^{(2)} = b_i^{(1)} − m_{i1} b_1^{(1)},   i, j = 2, 3.
Note that the variable x_1 has been eliminated from the last two equations.
Step 2: Define s_2 = max { |a_{22}^{(2)}|, |a_{32}^{(2)}| }. Note that s_2 ≠ 0 (why?). Let l be the least number such that s_2 = |a_{l2}^{(2)}|. Interchange the second row and the l-th row. Let us re-write the system after this modification:
    a_{11}^{(1)} x_1 + a_{12}^{(1)} x_2 + a_{13}^{(1)} x_3 = b_1^{(1)}
    0 + a_{22}^{(3)} x_2 + a_{23}^{(3)} x_3 = b_2^{(3)}                        (3.2.24)
    0 + a_{32}^{(3)} x_2 + a_{33}^{(3)} x_3 = b_3^{(3)},
where the coefficients a_{ij}^{(3)} and b_i^{(3)} are given by
    a_{22}^{(3)} = a_{l2}^{(2)}, a_{23}^{(3)} = a_{l3}^{(2)}, a_{l2}^{(3)} = a_{22}^{(2)}, a_{l3}^{(3)} = a_{23}^{(2)}; b_2^{(3)} = b_l^{(2)}, b_l^{(3)} = b_2^{(2)}.
We still use the same names E_1, E_2, E_3 for the first, second, and third equations of the modified system (3.2.24), respectively.
In case l = 2, both the second and third equations stay as they are. Let us now eliminate x_2 from the last equation. Define
    m_{32} = a_{32}^{(3)} / a_{22}^{(3)}.                                      (3.2.25)
We will now obtain a new system that is equivalent to the system (3.2.24) as follows:
The first two equations in (3.2.24) will be retained as they are.
Replace the third equation by the equation E_3 − m_{32} E_2.
The new system is given by
    a_{11}^{(1)} x_1 + a_{12}^{(1)} x_2 + a_{13}^{(1)} x_3 = b_1^{(1)}
    0 + a_{22}^{(3)} x_2 + a_{23}^{(3)} x_3 = b_2^{(3)}                        (3.2.26)
    0 + 0 + a_{33}^{(4)} x_3 = b_3^{(4)},
where
    a_{33}^{(4)} = a_{33}^{(3)} − m_{32} a_{23}^{(3)},   b_3^{(4)} = b_3^{(3)} − m_{32} b_2^{(3)}.
Now the system (3.2.26) is readily solvable for x_3 if the coefficient a_{33}^{(4)} ≠ 0. In fact, it is non-zero (why?). Substituting the value of x_3 in the second equation of (3.2.26), we can solve for x_2. Substituting the values of x_2 and x_3 in the first equation, we can solve for x_1. This solution phase of the modified Gaussian elimination method with partial pivoting is called the Backward substitution phase.
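The same procedure is easy to write down in MATLAB. The sketch below (ours) applies partial pivoting at each step and is tested on the system of Example 3.3, where the naive method fails:

% Gaussian elimination with partial pivoting.
A = [0 1; 1 1];  b = [1; 2];
n = length(b);
for k = 1:n-1
    [~, p] = max(abs(A(k:n,k)));              % choose the pivot row
    p = p + k - 1;
    A([k p],:) = A([p k],:);  b([k p]) = b([p k]);   % interchange rows k and p
    for i = k+1:n
        m = A(i,k)/A(k,k);
        A(i,k:n) = A(i,k:n) - m*A(k,k:n);
        b(i) = b(i) - m*b(k);
    end
end
x = zeros(n,1);
for i = n:-1:1                                % backward substitution
    x(i) = (b(i) - A(i,i+1:n)*x(i+1:n)) / A(i,i);
end
disp(x')                                      % expected: x = (1, 1)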
3.2.3 Operations Count in Naive Gaussian Elimination Method
It is important to know the length of a computation and for that reason, we count
the number of arithmetic operations involved in Gaussian elimination. Let us divide
the count into three parts.
(1) The elimination step: We now count the additions/subtractions, multiplications and divisions in going from the given system to the triangular system.
    Step     Additions/Subtractions     Multiplications     Divisions
    1        (n−1)²                     (n−1)²              n−1
    2        (n−2)²                     (n−2)²              n−2
    ⋮           ⋮                          ⋮                  ⋮
    n−1      1                          1                   1
    Total    n(n−1)(2n−1)/6             n(n−1)(2n−1)/6      n(n−1)/2

Here we have used the formulas
    Σ_{j=1}^{p} j = p(p+1)/2,   Σ_{j=1}^{p} j² = p(p+1)(2p+1)/6,   p ≥ 1.
Let us explain the first row of the above table. In the first step, the computation of m_{21}, m_{31}, ⋯, m_{n1} involves (n − 1) divisions. For each i, j = 2, 3, ⋯, n, the computation of a_{ij}^{(2)} involves a multiplication and a subtraction. In total, there are (n − 1)² multiplications and (n − 1)² subtractions. Note that we do not count the operations involved in computing the coefficients of x_1 in the 2nd to n-th equations (namely, a_{i1}^{(2)}), as we do not compute them and simply take them as zero. Similarly, the other entries in the above table can be accounted for.
(2) Modification of the right hand side: Proceeding as before, we get
    Additions/Subtractions = (n − 1) + (n − 2) + ⋯ + 1 = n(n − 1)/2,   Multiplications = n(n − 1)/2.
(3) The backward substitution:
    Additions/Subtractions = n(n − 1)/2,   Multiplications = n(n − 1)/2,   Divisions = n.
Adding up these counts, the total number of arithmetic operations grows like a constant multiple of n³ for large n (roughly (2/3)n³ operations in all), and the elimination step accounts for almost all of this cost.
3.2.4 Thomas Method for Tri-diagonal Systems

Consider a tri-diagonal system of linear equations, in which each equation involves at most three consecutive unknowns. Writing the sub-diagonal, diagonal and super-diagonal coefficients of the i-th equation as α_i, β_i and γ_i, respectively, the system is
    β_1 x_1 + γ_1 x_2 = b_1
    α_2 x_1 + β_2 x_2 + γ_2 x_3 = b_2
    α_3 x_2 + β_3 x_3 + γ_3 x_4 = b_3                                          (3.2.27)
        ⋮
    α_n x_{n−1} + β_n x_n = b_n.
Dividing the first equation by β_1 (assumed non-zero), it takes the form
    x_1 + e_1 x_2 = f_1,   e_1 = γ_1/β_1,   f_1 = b_1/β_1.
Eliminating x_1 from the second equation of the system (3.2.27), by multiplying the above equation by α_2 and subtracting the resulting equation from the second equation of (3.2.27), we get
    x_2 + e_2 x_3 = f_2,   e_2 = γ_2/(β_2 − α_2 e_1),   f_2 = (b_2 − α_2 f_1)/(β_2 − α_2 e_1).
We now generalize the above procedure: assuming that the j-th equation has been reduced to the form
    x_j + e_j x_{j+1} = f_j,
where e_j and f_j are known quantities, reduce the (j + 1)-th equation to the form
    x_{j+1} + e_{j+1} x_{j+2} = f_{j+1},   e_{j+1} = γ_{j+1}/(β_{j+1} − α_{j+1} e_j),   f_{j+1} = (b_{j+1} − α_{j+1} f_j)/(β_{j+1} − α_{j+1} e_j).
In particular, the (n − 1)-th equation reduces to
    x_{n−1} + e_{n−1} x_n = f_{n−1},   e_{n−1} = γ_{n−1}/(β_{n−1} − α_{n−1} e_{n−2}),   f_{n−1} = (b_{n−1} − α_{n−1} f_{n−2})/(β_{n−1} − α_{n−1} e_{n−2}).
To obtain the reduced form of the n-th equation of the system (3.2.27), eliminate x_{n−1} from the n-th equation by multiplying the above equation by α_n and subtracting the resulting equation from the n-th equation of (3.2.27), which gives the reduced system
    [ 1  e_1                 ] [ x_1     ]   [ f_1     ]
    [    1    e_2            ] [ x_2     ] = [ f_2     ]
    [         ⋱      ⋱       ] [  ⋮      ]   [  ⋮      ]
    [                 1      ] [ x_n     ]   [ f_n     ],
which is an upper triangular matrix and hence, by back substitution, we can get the solution.
Remark 3.6. If the denominator of any of the e_j's or f_j's is zero, then the Thomas method fails. This is the situation when β_j − α_j e_{j−1} = 0, which is the coefficient of x_j in the reduced equation. A suitable partial pivoting, as done in the modified Gaussian elimination method, may sometimes help us to overcome this problem.
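A minimal MATLAB sketch of this reduction and back substitution (ours; it assumes the diagonals are stored in vectors alpha, beta, gamma as in the notation above, with alpha(1) and gamma(n) unused, and uses the 1D Laplacian matrix as arbitrary test data):

% Thomas algorithm for a tri-diagonal system.
n = 5;
alpha = -ones(n,1); beta = 2*ones(n,1); gamma = -ones(n,1);   % example data
b = ones(n,1);
e = zeros(n,1); f = zeros(n,1);
e(1) = gamma(1)/beta(1);  f(1) = b(1)/beta(1);
for j = 2:n
    den  = beta(j) - alpha(j)*e(j-1);        % fails if this denominator is zero
    e(j) = gamma(j)/den;
    f(j) = (b(j) - alpha(j)*f(j-1))/den;
end
x = zeros(n,1); x(n) = f(n);
for j = n-1:-1:1
    x(j) = f(j) - e(j)*x(j+1);               % back substitution
end
disp(x')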
3.2.5 LU Factorization

In Theorem 3.1, we have stated that when a matrix is invertible, then the corresponding linear system can be solved. Let us now ask the next question:
Can we give examples of a class (or classes) of invertible matrices for which the system of linear equations (3.1.2), given by
    [ a_{11} a_{12} ⋯ a_{1n} ] [ x_1 ]   [ b_1 ]
    [ a_{21} a_{22} ⋯ a_{2n} ] [ x_2 ] = [ b_2 ]
    [   ⋮      ⋮         ⋮   ] [  ⋮  ]   [  ⋮  ]
    [ a_{n1} a_{n2} ⋯ a_{nn} ] [ x_n ]   [ b_n ],
is easily solvable?
There are three types of matrices whose simple structure makes the linear system readily solvable. These matrices are as follows:
(1) Invertible diagonal matrices: These matrices look like
    [ d_1  0    0   ⋯  0
      0    d_2  0   ⋯  0
      ⋮               ⋱
      0    0    0   ⋯  d_n ],
with d_i ≠ 0 for each i = 1, 2, ⋯, n. In this case, the solution x = (x_1, x_2, ⋯, x_n)^T of the system is given by
    x = ( b_1/d_1, b_2/d_2, ⋯, b_n/d_n )^T.
(2) Invertible lower triangular matrices: These matrices look like
    [ l_{11}  0       0       ⋯  0
      l_{21}  l_{22}  0       ⋯  0
      ⋮                         ⋱
      l_{n1}  l_{n2}  l_{n3}  ⋯  l_{nn} ],
with l_{ii} ≠ 0 for each i = 1, 2, ⋯, n. The linear system takes the form
    [ l_{11}  0       ⋯  0      ] [ x_1 ]   [ b_1 ]
    [ l_{21}  l_{22}  ⋯  0      ] [ x_2 ] = [ b_2 ]                           (3.2.28)
    [   ⋮                  ⋮    ] [  ⋮  ]   [  ⋮  ]
    [ l_{n1}  l_{n2}  ⋯  l_{nn} ] [ x_n ]   [ b_n ].
From the first equation, we have x_1 = b_1/l_{11}. Substituting this value into the second equation, we get
    x_2 = ( b_2 − l_{21} b_1/l_{11} ) / l_{22}.
Proceeding in this manner, we solve for the vector x. This procedure of obtaining the solution may be called forward substitution.
(3) Invertible upper triangular matrices: These matrices look like
    [ u_{11}  u_{12}  u_{13}  ⋯  u_{1n}
      0       u_{22}  u_{23}  ⋯  u_{2n}
      ⋮                         ⋱
      0       0       0       ⋯  u_{nn} ],
with u_{ii} ≠ 0 for each i = 1, 2, ⋯, n. The linear system takes the form
    [ u_{11}  u_{12}  ⋯  u_{1n} ] [ x_1 ]   [ b_1 ]
    [ 0       u_{22}  ⋯  u_{2n} ] [ x_2 ] = [ b_2 ]                           (3.2.29)
    [   ⋮                  ⋮    ] [  ⋮  ]   [  ⋮  ]
    [ 0       0       ⋯  u_{nn} ] [ x_n ]   [ b_n ].
From the last equation, we have
    x_n = b_n / u_{nn}.
Substituting this value of x_n in the penultimate equation, we get the value of x_{n−1} as
    x_{n−1} = ( b_{n−1} − u_{n−1,n} b_n/u_{nn} ) / u_{n−1,n−1}.
Proceeding in this manner, we solve for the vector x. This procedure of obtaining the solution may be called backward substitution.
In general, an invertible matrix A need not be one among the simple structures listed above. However, in certain situations we can always find an invertible lower triangular matrix L and an invertible upper triangular matrix U in such a way that
    A = LU.
In this case, the system Ax = b becomes
    L ( U x ) = b.
To solve for x, we first solve the lower triangular system
    Lz = b
for the vector z, which can be obtained easily using forward substitution. After obtaining z, we solve the upper triangular system
    Ux = z
for the vector x, which is again obtained easily using backward substitution.
Remark 3.7. In the Gaussian elimination method discussed in Section 3.2.1, we have seen that a given matrix A can be reduced to an upper triangular matrix U by an elimination procedure, and the system can thereby be solved using backward substitution. In the elimination procedure, we also obtained a lower triangular matrix L in such a way that
    A = LU,
as remarked in this section.
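A minimal MATLAB sketch of this two-stage solve (ours; the factors L and U and the right hand side are those of the worked Doolittle example appearing below in this subsection):

% Solve Ax = b from A = LU by one forward and one backward substitution.
L = [1 0 0; 1 1 0; -2 3 1];
U = [1 1 1; 0 1 1; 0 0 -2];
b = [1; 1; 1];
n = length(b);
z = zeros(n,1);
for i = 1:n                          % forward substitution: Lz = b
    z(i) = (b(i) - L(i,1:i-1)*z(1:i-1)) / L(i,i);
end
x = zeros(n,1);
for i = n:-1:1                       % backward substitution: Ux = z
    x(i) = (z(i) - U(i,i+1:n)*x(i+1:n)) / U(i,i);
end
disp(x')                             % expected (1, 3/2, -3/2), as in the example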
Multiplying the first row of L with the columns of U, and the rows of L with the first column of U, we obtain
    u_{11} = a_{11},  u_{12} = a_{12},  u_{13} = a_{13},  l_{21} u_{11} = a_{21},  l_{31} u_{11} = a_{31}.   (3.2.31)
These give the first column of L and the first row of U. Next multiply row 2 of L times columns 2 and 3 of U, to obtain
    a_{22} = l_{21} u_{12} + u_{22},   a_{23} = l_{21} u_{13} + u_{23}.                                       (3.2.32)
These can be solved for u_{22} and u_{23}. Next multiply row 3 of L to obtain
    l_{31} u_{12} + l_{32} u_{22} = a_{32},   l_{31} u_{13} + l_{32} u_{23} + u_{33} = a_{33}.                (3.2.33)
These equations yield values for l_{32} and u_{33}, completing the construction of L and U. In this process, we must have u_{11} ≠ 0 and u_{22} ≠ 0 in order to solve for L, which is true because of the assumption that all the leading principal minors of A are non-zero.
The decomposition we have found is the Doolittle's factorization of A.
Example 3.14. Consider the matrix
    A = [  1  1  1
           1  2  2
          -2  1 -1 ].
Using (3.2.31), we get
    u_{11} = 1, u_{12} = 1, u_{13} = 1,   l_{21} = a_{21}/u_{11} = 1,   l_{31} = a_{31}/u_{11} = −2.
From (3.2.32), u_{22} = 2 − 1 = 1 and u_{23} = 2 − 1 = 1, and from (3.2.33), l_{32} = (1 − (−2)(1))/1 = 3 and u_{33} = −1 − (−2)(1) − (3)(1) = −2. Thus A has the Doolittle factorization
    A = [ 1  0  0     [ 1  1  1
          1  1  0   ×   0  1  1
         -2  3  1 ]     0  0 -2 ].
Further, taking b = (1, 1, 1)^T, we now solve the system Ax = b using LU factorization, with the matrix A given above. As discussed earlier, first we have to solve the lower triangular system
    [ 1  0  0 ] [ z_1 ]   [ 1 ]
    [ 1  1  0 ] [ z_2 ] = [ 1 ]
    [-2  3  1 ] [ z_3 ]   [ 1 ].
Forward substitution yields z_1 = 1, z_2 = 0, z_3 = 3. Keeping the vector z = (1, 0, 3)^T as the right hand side, we now solve the upper triangular system
    [ 1  1  1 ] [ x_1 ]   [ 1 ]
    [ 0  1  1 ] [ x_2 ] = [ 0 ]
    [ 0  0 -2 ] [ x_3 ]   [ 3 ].
Backward substitution yields x_1 = 1, x_2 = 3/2, x_3 = −3/2.
Crout's factorization
In Doolittle's factorization, the lower triangular matrix has a special property (unit diagonal). If we instead ask the upper triangular matrix to have this special property in the LU decomposition, we obtain Crout's factorization. We give the definition below.
Definition 3.15 (Crout's factorization). A matrix A is said to have a Crout's factorization if there exists a lower triangular matrix L, and an upper triangular matrix U with all diagonal elements equal to 1, such that
    A = LU.
The computation of the Crout's factorization for a 3 × 3 matrix can be done in a similar way as done for Doolittle's factorization in Example 3.13. For invertible matrices, one can easily obtain Doolittle's factorization from Crout's factorization and vice versa.
Write
    A = [ A_k    a
          a^T    a_{(k+1)(k+1)} ],
where A_k is the k × k principal sub-matrix of A and a = (a_{1(k+1)}, a_{2(k+1)}, ⋯, a_{k(k+1)})^T. Observe that A_k is positive definite and therefore, by our assumption, there exists a unique k × k lower triangular matrix L_k such that A_k = L_k L_k^T. Define
    L = [ L_k    0
          l^T    l_{(k+1)(k+1)} ],
where the real number l_{(k+1)(k+1)} and the vector l = (l_{1(k+1)}, l_{2(k+1)}, ⋯, l_{k(k+1)})^T are to be chosen such that A = LL^T. That is,
    [ A_k    a              ]   [ L_k    0              ] [ L_k^T    l              ]
    [ a^T    a_{(k+1)(k+1)} ] = [ l^T    l_{(k+1)(k+1)} ] [ 0^T      l_{(k+1)(k+1)} ].   (3.2.34)
Here 0 denotes the zero vector of dimension k. Clearly,
    L_k l = a,                                                                            (3.2.35)
which by forward substitution yields the vector l. Here, we need to justify that L_k is invertible, which is left as an exercise.
Finally, we have l^T l + l²_{(k+1)(k+1)} = a_{(k+1)(k+1)}, and this gives
    l²_{(k+1)(k+1)} = a_{(k+1)(k+1)} − l^T l,                                             (3.2.36)
provided the positivity of l²_{(k+1)(k+1)} is justified, which follows from taking the determinant on both sides of (3.2.34) and using property (3) of Lemma 3.17. A complete proof of this justification is left as an exercise.
Example. Consider the matrix
    A = [  9  3  -2
           3  2   3
          -2  3  23 ].
We can check that this matrix is positive definite by any of the equivalent conditions listed in Lemma 3.17. Therefore, we expect a unique Cholesky factorization for A.
(1) For n = 1, we have A_1 = (9) and therefore let us take L_1 = (3).
(2) For n = 2, we have
    A_2 = [ 9  3
            3  2 ].
Therefore, we seek
    L_2 = [ L_1  0        ] = [ 3  0
            l    l_{22}   ]     l  l_{22} ].
The condition A_2 = L_2 L_2^T gives 3l = 3 and l² + l²_{22} = 2, that is, l = 1 and l_{22} = 1. Thus, we have
    L_2 = [ 3  0
            1  1 ].
(3) For n = 3, we seek
    L = [ L_2   0
          l^T   l_{33} ],
where l^T = (l_{13}, l_{23}) and l_{33} are to be obtained in such a way that A = LL^T. The vector l is obtained by solving the lower triangular system (3.2.35) with a = (−2, 3)^T (by forward substitution), which gives l = (−2/3, 11/3)^T. Finally, from (3.2.36), we have l²_{33} = 82/9. Thus, the required lower triangular matrix L is
    L = [  3     0      0
           1     1      0
         -2/3  11/3  √82/3 ].
It is straightforward to check that A = LL^T.
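The same factorization can be computed column by column in MATLAB. The sketch below (ours) checks the result on the matrix of the example above:

% Cholesky factorization A = L*L', computed column by column.
A = [9 3 -2; 3 2 3; -2 3 23];
n = size(A,1);
L = zeros(n);
for j = 1:n
    L(j,j) = sqrt(A(j,j) - L(j,1:j-1)*L(j,1:j-1)');            % diagonal entry
    for i = j+1:n
        L(i,j) = (A(i,j) - L(i,1:j-1)*L(j,1:j-1)') / L(j,j);   % below-diagonal entries
    end
end
disp(L);  disp(norm(A - L*L'));    % the residual should be at round-off level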
It can be observed that the idea of the LU factorization method for solving a system of linear equations is parallel to the idea of the Gaussian elimination method. However, the LU factorization method has the advantage that once the factorization A = LU is done, the solution of the linear system Ax = b is reduced to solving two triangular systems. Thus, if we want to solve a large set of linear systems where the coefficient matrix A is fixed but the right hand side vector b varies, we can do the factorization once for A and then use this factorization with different b vectors. This obviously reduces the computational cost drastically, as we have seen in the operation count of the Gaussian elimination method that the elimination part is the most expensive part of the computation.
Example 3.23. On R^n, the following define norms:
    ‖x‖₂ = ( Σ_{i=1}^{n} |x_i|² )^{1/2}   (the Euclidean norm),              (3.3.37)
    ‖x‖∞ = max_{1 ≤ i ≤ n} |x_i|   (the maximum norm),                        (3.3.38)
    ‖x‖₁ = Σ_{i=1}^{n} |x_i|   (the l¹ norm).                                 (3.3.39)
All the three norms defined above are indeed norms; it is easy to verify that they satisfy the defining conditions of a norm given in Definition 3.22.
Example 3.24. Let us compute norms of some vectors now. Let x = (4, 4, 4, 4)^T, y = (0, 5, 5, 5)^T, z = (6, 0, 0, 0)^T. Verify that ‖x‖₁ = 16, ‖y‖₁ = 15, ‖z‖₁ = 6; ‖x‖₂ = 8, ‖y‖₂ ≈ 8.66, ‖z‖₂ = 6; ‖x‖∞ = 4, ‖y‖∞ = 5, ‖z‖∞ = 6.
From this example we see that asking which vector is big does not make sense. But once the norm is fixed, this question makes sense, as the answer depends on the norm used. In this example each of the three vectors is the biggest one in some norm.
Remark 3.25. In our computations, we employ any one of the norms depending on convenience. It is a fact that all vector norms on R^n are equivalent; we will not elaborate further on this.
Here a_{ij} denotes the element in the i-th row and j-th column of A. There can be many matrix norms on M_n(R). We will describe some of them now.
Example 3.27. The following defines a norm on M_n(R):
    ‖A‖ = ( Σ_{i=1}^{n} Σ_{j=1}^{n} |a_{ij}|² )^{1/2}.
Given a vector norm ‖·‖ on R^n, the matrix norm subordinate to it is defined by
    ‖A‖ := max { ‖Ax‖ : x ∈ R^n, ‖x‖ = 1 }.                                   (3.3.40)
The formula (3.3.40) indeed defines a matrix norm on M_n(R). The proof of this fact is beyond the scope of our course. In this course, by matrix norm, we always mean a norm subordinate to some vector norm. An equivalent and more useful formula for the matrix norm subordinate to a vector norm is given in the following lemma.
Lemma 3.29. The matrix norm subordinate to a vector norm ‖·‖ satisfies
    ‖A‖ = max_{z ≠ 0} ‖Az‖ / ‖z‖.                                             (3.3.41)
Proof: For any z ≠ 0, the vector x = z/‖z‖ satisfies ‖x‖ = 1, and therefore
    max_{‖x‖=1} ‖Ax‖ = max_{z ≠ 0} ‖A(z/‖z‖)‖ = max_{z ≠ 0} ‖Az‖/‖z‖.
The matrix norm subordinate to a vector norm has additional properties as stated in the following theorem, whose proof is left as an exercise.
Theorem 3.30. Let ‖·‖ be a matrix norm subordinate to a vector norm. Then
(1) ‖Ax‖ ≤ ‖A‖ ‖x‖ for all x ∈ R^n.
(2) ‖I‖ = 1, where I is the identity matrix.
(3) ‖AB‖ ≤ ‖A‖ ‖B‖ for all A, B ∈ M_n(R).
We will now state a few results concerning matrix norms subordinate to some of the vector norms described in Example 3.23. We will not discuss their proofs.
Theorem 3.31 (Matrix norm subordinate to the maximum norm).
The matrix norm subordinate to the l∞ norm (also called the maximum norm, given in (3.3.38)) on R^n is denoted by ‖A‖∞ and is given by
    ‖A‖∞ = max_{1 ≤ i ≤ n} Σ_{j=1}^{n} |a_{ij}|.                              (3.3.42)
The norm ‖A‖∞ given by the formula (3.3.42) is called the maximum-of-row-sums norm of A.
Theorem 3.32 (Matrix Norm Subordinate to the l¹-norm).
The matrix norm subordinate to the l¹ norm (given in (3.3.39)) on R^n is denoted by ‖A‖₁ and is given by
    ‖A‖₁ = max_{1 ≤ j ≤ n} Σ_{i=1}^{n} |a_{ij}|.                              (3.3.43)
The norm ‖A‖₁ given by the formula (3.3.43) is called the maximum-of-column-sums norm of A.
Description and computation of the matrix norm subordinate to the Euclidean vector norm on R^n is more subtle.
Theorem 3.33 (Matrix norm subordinate to the Euclidean norm). The matrix norm subordinate to the l² norm (the Euclidean norm given in (3.3.37)) on R^n is denoted by ‖A‖₂ and is given by
    ‖A‖₂ = √( max { λ₁, λ₂, ⋯, λ_n } ),                                       (3.3.44)
where λ₁, λ₂, ⋯, λ_n are the eigenvalues of the matrix A^T A. The norm ‖A‖₂ given by the formula (3.3.44) is called the spectral norm of A.
Example 3.34. Let us now compute ‖A‖∞ and ‖A‖₂ for the matrix
    A = [ 1  1  1
          1  2  2
          2  1  1 ].
(1) ‖A‖∞ = 5, since
    Σ_{j=1}^{3} |a_{1j}| = 3,   Σ_{j=1}^{3} |a_{2j}| = 5,   Σ_{j=1}^{3} |a_{3j}| = 4.
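These norms are easy to evaluate numerically. The following MATLAB sketch (ours) computes the three subordinate norms of the matrix in Example 3.34 directly from the formulas, and cross-checks them with MATLAB's built-in norm function:

% Subordinate matrix norms of the matrix in Example 3.34.
A = [1 1 1; 1 2 2; 2 1 1];
rowsum = max(sum(abs(A), 2));     % maximum-of-row-sums    = ||A||_inf
colsum = max(sum(abs(A), 1));     % maximum-of-column-sums = ||A||_1
spec   = sqrt(max(eig(A'*A)));    % spectral norm          = ||A||_2
fprintf('%g  %g  %g\n', rowsum, colsum, spec);
fprintf('%g  %g  %g\n', norm(A, inf), norm(A, 1), norm(A, 2));   % built-in check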
Theorem 3.35. Let A be an invertible n × n matrix, and let x and x̃ satisfy
    Ax = b   and   A x̃ = b̃,
respectively, where b and b̃ are given vectors. Then
    ‖x − x̃‖ / ‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖b − b̃‖ / ‖b‖                                  (3.3.45)
for any fixed vector norm and the matrix norm subordinate to this vector norm.
Proof: Since A(x − x̃) = b − b̃, we have x − x̃ = A⁻¹(b − b̃), and therefore
    ‖x − x̃‖ ≤ ‖A⁻¹‖ ‖b − b̃‖.                                                 (3.3.46)
The inequality (3.3.46) estimates the error in the solution caused by the error in the right hand side vector of the linear system Ax = b. The inequality (3.3.45) is concerned with estimating the relative error in the solution in terms of the relative error in the right hand side vector b.
Since Ax = b, we get ‖b‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖. Therefore ‖x‖ ≥ ‖b‖ / ‖A‖. Using this in (3.3.46), we obtain (3.3.45).
Remark 3.36.
(1) In the above theorem, it is important to note that a vector norm is fixed and the
matrix norm used is subordinate to this fixed vector norm.
(2) The theorem holds no matter which vector norm is fixed as long as the matrix
norm subordinate to it is used.
(3) In fact, whenever we do analysis on linear systems, we always fix a vector norm
and then use matrix norm subordinate to it.
[
\
Notice that the constant appearing on the right hand side of the inequality (3.3.45)
(which is }A} }A1 }) depends only on the matrix A (for a given vector norm). This
number is called the condition number of the matrix A. Notice that this condition
number depends very much on the vector norm being used on Rn and the matrix
norm that is subordinate to the vector norm.
Definition 3.37 (Condition Number of a Matrix). Let A be an n n invertible
matrix. Let a matrix norm be given that is subordinate to a vector norm. Then the
condition number of the matrix A (denoted by pAq) is defined as
pAq : }A} }A1 }.
(3.3.47)
Remark 3.38. From Theorem 3.35, it is clear that if the condition number is small,
then the relative error in the solution will also be small whenever the relative error
in the right hand side vector is small. On the other hand, if the condition number is
large, then the relative error could be very large even though the relative error in the
right hand side vector is small. We illustrate this in the following example.
[
\
Example 3.39. The linear system
5x1 ` 7x2 0.7
7x1 ` 10x2 1
B askar and Sivaji
71
M A 214, IITB
1 2 3 n1
1 1 1 1
n`1
2 3 4
Hn
1 1
1
1
2n1
n n`1 n`1
(3.3.48)
For n 4, we have
72
M A 214, IITB
25
13620 28000
12
.
pAq
}A}
(3.3.49)
Proof. We have
1
1
1
pAq
}A}}A1 }
}A}
1
1
1
where y 0 is arbitrary. Take y Az, for some arbitrary vector z. Then we get
}Az}
1
1
.
pAq
}A} }z}
Let z 0 be such that Bz 0 (this is possible since B is singular), we get
1
}pA Bqz}
pAq
}A}}z}
}pA Bq}}z}
}A}}z}
}A B}
,
}A}
and we are done.
[
\
From the above theorem it is apparent that if A is close to a singular matrix, then
the reciprocal of the condition number will be near to zero, ie., pAq itself will be
large. Let us illustrate this in the following example.
Example 3.44. Clearly the matrix
1 1
1 1
73
M A 214, IITB
1 1`
.
1 1
1 1
.
1 ` 1
2`
4
2.
Thus, if 0.01, then pAq 40, 000. As 0, the matrix A tends to approach
the matrix B and consequently, the above inequality says that the condition number
pAq tends to 8.
Also, we can see that when we attempt to solve the system Ax b, then the above
inequality implies that a small relative perturbation in the right hand side vector b
could be magnified by a factor of at least 40, 000 for the relative error in the solution.
[
\
74
M A 214, IITB
a11 0
0 a22
D .. ..
. .
0
0
0
..
.
0 ann
where aii , i 1, 2, , n are the diagonal elements of the matrix A. Then the given
system of linear equations can be re-written as
Dx Cx ` b.
(3.4.50)
If we assume that the right hand side vector is fully known to us, then the above
system can be solved very easily as D is a diagonal matrix. But obviously, the right
hand side vector cannot be a known quantity because it involves the unknown vector
x. Rather, if we choose (arbitrarily) some specific value for x, say x xp0q , then the
resulting system
Dx Cxp0q ` b
can be readily solved. Let us call the solution of this system as xp1q . That is,
Dxp1q Cxp0q ` b.
Now taking x xp1q on the right hand side of (3.4.50) we can obtain the solution
of this system, which we denote as xp2q and repeat this procedure to get a general
iterative procedure as
Dxpk`1q Cxpkq ` b, k 0, 1, 2, .
If D is invertible, then the above iterative procedure can be written as
xpk`1q Bxpkq ` c, k 0, 1, 2, ,
(3.4.51)
75
M A 214, IITB
p0q
p0q
p0q
Let xp0q px1 , x2 , x3 qT be an initial guess to the true solution x, which is chosen
arbitrarily. Define a sequence of iterates (for k 0, 1, 2, ) by
,
1
pk`1q
pkq
pkq /
x1
x1
pk`1q
x2
pk`1q
x3
1
pkq
pkq
p2 x2 2x3 q,
6
1
pkq
pkq
p1 x1 0.5x3 q,
4
1
pkq
pkq
p0 ` x1 0.5x2 q.
4
(3.4.52)
76
M A 214, IITB
p1 4x2 0.5x3 q,
pk`1q
p2 6x1 2x3 q,
1
pkq
pkq
p0 ` x1 0.5x2 q.
x1
x2
pk`1q
x3
pkq
pkq
pkq
pkq
[
\
Thus, we need to look for a condition on the system for which the Jacobi iterative
sequence converges to the exact solution. Define the error in the k th iterate xpkq
compared to the exact solution by
epkq x xpkq .
77
M A 214, IITB
|aij | |aii |,
i 1, 2, , n.
j1,ji
We now prove the sucient condition for the convergence of the Jacobi method. This
theorem asserts that if A is a diagonally dominant matrix, then B in (3.4.51) of the
Jacobi method is such that }B}8 1.
Theorem 3.49. If the coecient matrix A is diagonally dominant, then the Jacobi
method (3.4.51) converges to the exact solution of Ax b.
Proof: Since A is diagonally dominant, the diagonal entries are all non-zero and
hence the Jacobi iterating sequence xpkq given by
1
pk`1q
pkq
, i 1, 2, , n.
xi
bi
aij xj
(3.4.53)
aii
j1,ji
is well-defined. Each component of the error satisfies
pk`1q
ei
aij pkq
ej ,
a
ii
j1
i 1, 2, , n.
(3.4.54)
ji
which gives
Baskar and Sivaji
78
M A 214, IITB
aij pkq
pk`1q
}e }8 .
|ei
|
(3.4.55)
aii
j1
ji
Define
aij
.
max
aii
1in
(3.4.56)
j1
ji
Then
pk`1q
|ei
| }epkq }8 ,
(3.4.57)
(3.4.58)
The matrix A is diagonally dominant if and only if 1. Then iterating the last
inequality we get
}epk`1q }8 k`1 }ep0q }8 .
(3.4.59)
Therefore, if A is diagonally dominant, the Jacobi method converges.
[
\
Remark 3.50. Observe that the system given in Example 3.46 is diagonally dominant, whereas the system in Example 3.47 is not so.
[
\
3.4.2 Gauss-Seidel Method
Gauss-Seidel method is a modified version of the Jacobi method discussed in Section
3.4.1. We demonstrate the method in the case of a 3 3 system and the method for
a general n n system can be obtained in a similar way.
Example 3.51. Consider the 3 3 system
a11 x1 ` a12 x2 ` a13 x3 b1 ,
a21 x1 ` a22 x2 ` a23 x3 b2 ,
a31 x1 ` a32 x2 ` a33 x3 b3 .
When the diagonal elements of this system are non-zero, we can rewrite the above
system as
1
pb1 a12 x2 a13 x3 q,
a11
1
x2
pb2 a21 x1 a23 x3 q,
a22
1
pb3 a31 x1 a32 x2 q.
x3
a33
x1
Let
B askar and Sivaji
79
M A 214, IITB
p0q
p0q
xp0q px1 , x2 , x3 qT
be an initial guess to the true solution x. Define a sequence of iterates (for k
0, 1, 2, ) by
,
1
pk`1q
pkq
pkq
/
/
x1
a23 x3 q,
x2
pb2 a21 x1
pGSSq
/
a22
/
/
/
1
pk`1q
pk`1q /
pk`1q
pb3 a31 x1
a32 x2
q. /
x3
a33
This sequence of iterates is called the Gauss-Seidel iterative sequence and the
method is called Gauss-Seidel Iteration method.
Remark 3.52. Compare (JS) and (GSS).
[
\
Theorem 3.53. If the coecient matrix is diagonally dominant, then the GaussSeidel method converges to the exact solution of the system Ax b.
Proof:
Since A is diagonally dominant, all the diagonal elements of A are non-zero, and
hence the Gauss-Seidel iterative sequence given by
#
+
i1
n
1
pk`1q
pk`1q
pkq
xi
bi aij xj
aij xj
, i 1, 2, , n.
(3.4.60)
aii
j1
ji`1
is well-defined. The error in each component is given by
pk`1q
ei
i1
aij pk`1q
aij pkq
ej
ej , i 1, 2, , n.
a
a
ii
ii
j1
ji`1
(3.4.61)
aij
,
i
aii
j1
aij
,
i
aii
ji`1
with the convention that 1 n 0. Note that given in (3.4.56) can be written
as
max pi ` i q
1in
80
M A 214, IITB
|ei
| i }epk`1q }8 ` i }epkq }8 , i 1, 2, , n.
(3.4.62)
}epk`1q }8 |el
|.
(3.4.63)
l
}epkq }8 .
1 l
i
.
1in 1 i
max
(3.4.64)
(3.4.65)
(3.4.66)
i
i r1 pi ` i qs
i
r1 s 0,
1 i
1 i
1 i
(3.4.67)
we have
1.
(3.4.68)
(3.4.69)
Recall from Chapter 2 that the mathematical error is due to the approximation
made in the numerical method where the computation is done without any floatingpoint approximation ( ie., without rounding or chopping). Observe that to get the
B askar and Sivaji
81
M A 214, IITB
(3.4.70)
(3.4.71)
This shows that the error e satisfies a linear system with the same coecient matrix
A as in the original system Ax b, but a dierent right hand side vector. Thus,
by having the approximate solution x in hand, we can obtain the error e without
knowing the exact solution x of the system.
3.4.4 Residual Corrector Method
When we use a computing device for solving a linear system, irrespective to whether
we use direct methods or iterative methods, we always get an approximate solution.
An attractive feature (as discussed in the above section) of linear systems is that the
error involved in the approximate solution when compared to the exact solution can
theoretically be obtained exactly. In this section, we discuss how to use this error to
develop an iterative procedure to increase the accuracy of the obtained approximate
solution using any other numerical method.
There is an obvious diculty in the process of obtaining e as the solution of the
system (3.4.71), especially on a computer. Since b and Ax are very close to each
other, the computation of r involves loss of significant digits which leads to zero
residual error, which is of no use. To avoid this situation, the calculation of (3.4.70)
should be carried out at a higher-precision. For instance, if x is computed using
single-precision, then r can be computed using double-precision and then rounded
back to single precision. Let us illustrate the computational procedure of the residual
error in the following example.
Example 3.55 (Computation of residual at higher precision).
Consider the system
0.729x1 ` 0.81x2 ` 0.9x3 0.6867
x1 ` x2 ` x3 0.8338
1.331x1 ` 1.210x2 ` 1.100x3 1.000
Baskar and Sivaji
82
M A 214, IITB
83
M A 214, IITB
(3.4.72)
where
Aepkq r pkq
with
r pkq b Axpkq ,
for k 0, 1, . The above iterative procedure is called th residual corrector
method (also called the iterative refinement method). Note that in computing
r pkq and epkq , we use a higher precision than the precision used in computing xpkq .
Example 3.56. Using Gaussian elimination with pivoting and four digit rounding
in solving the linear system
x1 ` 0.5x2 ` 0.3333x3 1
0.5x1 ` 0.3333x2 ` 0.25x3 0
0.3333x1 ` 0.25x2 ` 0.2x3 0
we obtain the solution as x p9.062, 36.32, 30.30qT . Let us start with an initial
guess of
xp0q p8.968, 35.77, 29.77qT .
Using 8-digit rounding arithmetic, we obtain
r p0q p0.00534100, 0.00435900, 0.00053440qT .
After solving the system Aep0q r p0q using Gaussian elimination with pivoting using
8-digit rounding and the final answer is rounded to four digits, we get
ep0q p0.0922, 0.5442, 0.5239qT .
Hence, the corrected solution after the first iteration is
xp1q p9.060, 36.31, 30.29qT .
Similarly, we can predict the error in xp1q when compared to the exact solution and
correct the solution to obtain the second iterated vector as
r p1q p0.00065700, 0.00037700, 0.00019800qT ,
ep2q p0.0017, 0.0130, 0.0124qT ,
xp2q p9.062, 36.32, 30.30qT ,
and so on.
[
\
The convergence of the iterative procedure to the exact solution is omitted for this
course.
Baskar and Sivaji
84
M A 214, IITB
(3.4.73)
Thus, for a given suciently small positive number , we stop the iteration if
}r pkq } ,
for some vector norm } }.
85
M A 214, IITB
86
M A 214, IITB
[
\
Example 3.60. A matrix may have a unique dominant eigenvalue or more than one
dominant eigenvalues. Further, even if dominant eigenvalue is unique the corresponding algebraic and geometric multiplicities could be more than one, and also both algebraic and geometric multiplicities may not be the same. All these possibilities are
illustrated in this example.
(1) The matrix
1
0
0
1
A 0 2
0
0 1
has eigenvalues 1, 1, and 2. The matrix A has a unique dominant eigenvalue,
which is 2 as this is the largest in absolute value, of all eigenvalues. Note that
the dominant eigenvalue of A is a simple eigenvalue.
(2) The matrix
1 3
4
1
B 0 2
0 0 2
has eigenvalues 1, 2, and 2. According to our definition, the matrix B has two
dominant eigenvalues. They are 2 and 2. Note that both the dominant eigenvalues of B are simple eigenvalues.
1
C1 0
0
3 4
2
2 5 , C2
0
0 2
0
2
, C3
2
0
1
2
The matrix C1 has a unique dominant eigenvalue 2, which has algebraic multiplicity 2 and geometric multiplicity 1. The matrix C2 has a unique dominant
eigenvalue 2, whose algebraic and geometric multiplicities equal 2. The matrix
C3 has a unique dominant eigenvalue 2, which has algebraic multiplicity 2 and
geometric multiplicity 1.
[
\
B askar and Sivaji
87
M A 214, IITB
|1 | |2 | |3 | |n |
c1 0.
2
n
p0q
Ax c1 1 v 1 ` ` cn n v n 1 c1 v 1 ` c2
v 2 ` ` cn
vn .
1
1
Note here that we have assumed 1 0, which follows from the Assumption (1)
above.
Pre-multiplying by A again and simplying, we get
2
2
n
2
v 2 ` ` cn
A2 xp0q 21 c1 v 1 ` c2
vn
1
1
For each k P N, applying A k-times on xp0q yields
k
k
n
2
k p0q
k
v 2 ` ` cn
A x 1 c1 v 1 ` c2
vn
1
1
88
(3.5.76)
M A 214, IITB
(3.5.77)
For c1 0, the right hand side of the above equation is a scalar multiple of the
eigenvector.
From the above expression for Ak xp0q , we also see that
pAk`1 xp0q qi
1 ,
lim
k8 pAk xp0q qi
(3.5.78)
where i is any index such that the fractions on the left hand side are meaningful
k
(which is the case when xp0q R Y8
k1 KerA ).
The power method generates two sequences tk u and txpkq u, using the results
(3.5.77) and (3.5.78), that converge to the dominant eigenvalue 1 and the corresponding eigen vectors v 1 , respectively. We will now give the method of generating
these two sequences.
Step 1: Choose a vector xp0q arbitrarily and set y p1q : Axp0q .
p1q
Step 2: Define 1 : yi , where i P t1, , nu is the least index such that
p1q
}y p1q }8 |yi |
and set
xp1q : y p1q {1 .
From xp1q , we can obtain 2 and xp2q in a similar way.
Power method iterative sequence:
In general, for k 0, 1, , we choose the initial vector xp0q arbitrarily and generate the sequences tpkq u and txpkq u using the formulas
pk`1q
k`1 yi
, xpk`1q
y pk`1q
,
k`1
(3.5.79)
where
pk`1q
|.
(3.5.80)
89
M A 214, IITB
cj v j ,
(3.5.81)
j1
KerAk .
k1
Axpkq
Ay pkq
AAxpk1q
A2 xpk1q
Ak`1 xp0q
.
k`1
k`1 k
k`1 k
k`1 k
k`1 k 1
Therefore, we have
where mk`1
p0q
1{p1 2 k`1 q. But, x
cj v j , c1 0. Therefore
j1
c1 v 1 `
j2
90
cj
j
1
k`1
vj
M A 214, IITB
k`1
n
c1 v 1 ` cj j
1 mk`1 k`1
vj .
1
1
j2
8
Since |j {1 |k 0 as k 8, we get
1
8.
|c1 |}v 1 }8
k8
lim xpk`1q
k8
$
v1
either
`
}v 1 }8
&
v1
k`1
lim mk`1 1 c1 v 1 or
. (3.5.82)
}v 1 }8
k8
or
oscillates between
%
the above two vectors
k8
k8
0, for all k N.
pk`1q
0, for all k N.
yj
Similarly, we see that
xj
Also, we know that
k`1
yj
pk`1q
xj
91
pAxpkq qj
,
pxpk`1q qj
M A 214, IITB
pv 1 qj
1 .
lim k`1
k8
[
\
Note that the above theorem does not guarantee the convergence of the sequence
txn u to an eigenvector corresponding to the dominant eigenvalue. However, if the
dominant eigenvalue has an eigenvector with a unique dominant component, then
this sequence converges as discussed in the following theorem.
Theorem 3.64 (Second Convergence Theorem for Power Method).
Let A be an n n matrix satisfying the hypotheses (H1), (H2), and (H3) of
Theorem 3.62. In addition to these hypotheses,
(H4) let v 1 pv11 , v12 , , v1n qT be such that there exists a unique index j P t1, 2, , nu
with the property
|v1j | }v 1 }8
(3.5.83)
92
M A 214, IITB
(3.5.87)
Observe that
pAk xp0q qi
k1 c1 v1i ` k2 c2 v2i ` ` kn cn vni
.
pAk xp0q qj
k1 c1 v1j ` k2 c2 v2j ` ` kn cn vnj
The last equation can be written in the form
k2 c2
kn cn
v
`
`
vni
2i
pAk xp0q qi
k1 c1
k1 c1
pAk xp0q qj
k c 2
k c n
v1j ` 2k v2j ` ` nk vnj
1 c 1
1 c1
v1i `
(3.5.88)
v1i
as k 8. Since,
v1j
(3.5.89)
for i j,
pAk xp0q qj 1.
(3.5.90)
kn cn
vnj
k1 c1
.
c
k1
n
n
vnj
k1
1 c1
(3.5.91)
Thus,
lim k 1 .
k8
(3.5.92)
93
M A 214, IITB
, ,
, 1,
, , k p0q
.
pAk xp0q qj
pAk xp0q qj
pAk xp0q qj
pA x qj
xpkq
1
v
v1j 1
which
[
\
3
0 0
6 2 .
A 4
16 15 5
94
M A 214, IITB
95
M A 214, IITB
B 0
0
3
4
2
1 ,
0 2
which has eigenvalues 1, 2, and 2. Clearly, the matrix B has two dominant eigenvalues, namely, 2 and 2. We start with an initial guess xp0q p1, 1, 1q and the first
five iterations generated using power method are given below:
Iteration No: 1
y 1 Ax0 p8.000000, 3.000000, 2.000000qT
1 8.000000
y
x1 1 p1.000000, 0.375000, 0.250000qT
1
96
M A 214, IITB
97
M A 214, IITB
3
0 0
6 2 ,
A 4
16 15 5
The eigenvalues of this matrix are
1 3, 2 1 and 3 0.
Baskar and Sivaji
98
M A 214, IITB
k
k
n
3
v 3 ` ` cn
Ak v k2 c2 v 2 ` c3
vn .
2
2
This makes the iteration to converge to 2 , which is the next dominant eigenvalue.[
\
99
M A 214, IITB
fi
91.4 22.0 44.8000
86.4 fl .
A 175.2 41.0
105.2 26.0 51.4000
The eigenvalues of this matrix are 1 5, 2 3 and 3 1. The corresponding
eigenvectors are v 1 p3, 5, 4qT , v 2 p2, 6, 1qT and v 3 p1, 2, 3qT .
Note that the matrix A satisfies the hypothesis (H1) since -5 is the unique dominant eigenvalue and it is also a simple eigenvalue. The matrix A satisfies the hypothesis (H2) as all eigenvalues are distinct and hence eigevectors form a basis for R3 .
Thus the fate of the power method iterates depends solely on the choice of the initial
guess xp0q and whether it satisfies the hypothesis (H3)
Let us take the initial guess xp0q p1, 0.5, 0.25qT . Note that c1 0 for this initial guess. Thus the initial guess satisfies the hypothesis (H3). Therefore by the
theorem on Power method (Theorem 3.62), the iterative sequences generated by
power method converges to the dominant eigenvalue 1 5 and the corresponding eigenvector (with a scalar multiple) 15 v 1 .
Let us take the initial guess xp0q p0, 0.5, 0.25qT . Note that c1 0 for this initial guess. Thus the initial guess satisfies the hypothesis (H3). Therefore by the
theorem on Power method (Theorem 3.62), the iterative sequences generated by
power method converges to the dominant eigenvalue 1 5 and the corresponding eigenvector (with a scalar multiple) 15 v 1 . Compare this with Example 3.70. In
the present case the first coordinate of the initial guess vector is zero, just as in
Example 3.70. In Example 3.70 the power method iterate converged to the second
dominant eigenvalue and the corresponding eigenvector, which does not happen
in the present case. The reason is that in the Example 3.70, c1 0 for the initial
guess chosen, but in the current example c1 0.
[
\
100
M A 214, IITB
|akj |,
j1
jk
and Dk denotes the closed disk in the complex plane with centre akk and radius k ,
i.e.,
(
Dk z P C : |z akk | k .
(1) Each eigenvalue of A lies in one of the disks Dk . That is, no eigenvalue of A lies
in Cz Ynk1 Dk .
(2) Suppose that among the disks D1 , D2 , , Dn , there is a collection of m disks
whose union (denoted by R1 ) is disjoint from the union of the rest of the nm disks
(denoted by R2 ). Then exactly m eigenvalues lie in R1 and nm eigenvalues lie in
R2 (here each eigenvalue is counted as many times as its algebraic multiplicity).
Proof. We will prove only (i) as it is easy, and the proving (ii) is beyond the scope
of this course.
Let be an eigenvalue of A. Then there exists a v pv1 , v2 , , vn q P Rn and v 0
such that
Av v
(3.5.93)
Let 1 r n be such that |vr | maxt|v1 |, |v2 |, , |vn |u. The rth equation of the
system of equations (3.5.93) is given by (actually, of Av v 0)
ar1 v1 ` ` ar,r1 vr1 ` parr qvr ` ar,r`1 vr`1 ` ` arn vn 0
From the last equation, we get
arr
vr1
vr`1
vn
v1
ar1 ` `
ar,r1 `
ar,r`1 ` `
arn
vr
vr
vr
vr
(3.5.94)
Taking modulus on both sides of the equation (3.5.94), and using the triangle inequality |a ` b| |a| ` |b| repeatedly we get
B askar and Sivaji
101
M A 214, IITB
|v1 |
|vr1 |
|vr`1 |
|vn |
|ar1 | ` `
|ar,r1 | `
|ar,r`1 | ` `
|arn | (3.5.95)
|vr |
|vr |
|vr |
|vr |
|vs |
|vr |
1. The last
(3.5.96)
Observe that the right hand side of the inequality (3.5.96) is r . This proves that
P Dr .
[
\
Example 3.74. For the matrix
4 1 1
0 2 1 ,
2 0 9
the Gerschgorins disks are given by
(
D1 z P C : |z4| 2 , D2 z P C : |z2| 1 ,
(
D3 z P C : |z9| 2 .
Draw a picture of these disks and observe that D3 neither intersects D1 nor D2 . By
(ii) of Theorem 3.73, D3 has one eigenvalue and D1 YD2 has two eigenvalues counting
multiplicities. Note that the eigenvalues are approximately 4.6318, 1.8828 P D1 Y D2
and 8.4853 P D3 .
[
\
Remark 3.75. Gerschgorins circle theorem is helpful in finding bound for eigenvalues. For the matrix in Example 3.74, any number z in D1 satisfies |z| 6. Similarly
any number z in D2 satisfies |z| 3, and any number z in D3 satisfies |z| 11. Since
any eigenvalue lies in one of three disks, we can conclude that || 11.
[
\
Remark 3.76. The main disadvantage of the power method discussed in Section
3.5.1 is that if a given matrix has more than one dominant eigenvalues, then the
method may not converge. So, for a given matrix, we do not know whether the power
method will converge or not. Also, as the power method is reasonably slow (see
Example 3.72 for an illustration) we may have to perform reasonably large number
of iterations to come to know that the method is not actually converging.
Thus, a tool to find out whether the given matrix has a unique dominant eigenvalue
or not is highly desirable. The Gerschgorin theorem 3.73 can sometime be used to see
if power method can be used for a given matrix. For instance, in Example 3.74, we
see that the power method can be used to obtain an approximation to the dominant
eigenvalue.
[
\
Since the matrices A and its transpose (denoted by AT ) have same eigenvalues, we
can apply Gerschgorin Circle Theorem to AT and conclude the following corollary.
Baskar and Sivaji
102
M A 214, IITB
|ajk |,
j1
jk
and Bk denotes the closed disk in the complex plane with centre akk and radius k .
That is,
(
Bk z P C : |z akk | k .
(1) Each eigenvalue of A lies in one of the disks Bk . That is, no eigenvalue of A lies
in Cz Ynk1 Bk .
(2) Suppose that among the disks B1 , B2 , , Bn , there is a collection of m disks whose
union (denoted by C1 ) is disjoint from the union of the rest of the n m disks
(denoted by C2 ). Then exactly m eigenvalues lie in C1 and nm eigenvalues lie in
C2 (here each eigenvalue is counted as many times as its algebraic multiplicity).[
\
Remark 3.78. Advantage of the above corollary is the following: Let A be an n n
matrix. Let R denote the region defined by R Ynk1 Dk where Dk are as given by
Theorem 3.73. Similarly, let C denote the region defined by C Ynk1 Bk where Bk
are as given by Corollary 3.77. It may happen that C R, in which case we will get
better estimate for eigenvalues of A. It may not happen for every matrix A. However
whenever such a thing happens, we get better bounds for eigenvalues of A. Such
bounds which we obtain using both informations R and C may be called optimum
bounds. Let us illustrate this with two examples.
[
\
Example 3.79. For the matrix
3
0
1
0 2
2
0
2 3
the k , k (in the notations of Theorem 3.73 and Corollary 3.77) are given by
1 1, 2 2, 3 3 and 1 0, 2 2, 3 3
Therefore, the regions R and C are given by
R Ynk1 Dk tz : |z 3| 1u Y tz : |z ` 2| 2u Y tz : |z ` 3| 3u
and
C Ynk1 Bk tz : |z 3| 0u Y tz : |z ` 2| 2u Y tz : |z ` 3| 3u.
Clearly C R and hence the bounds obtained using C are optimum bounds. We can
also see that any eigenvalue of the given matrix satisfies
3 || 6.
Draw both regions C and R.
[
\
103
M A 214, IITB
3.6 Exercises
Gaussian Elimination Methods
(1) Solve the following systems of linear equations using Naive Gaussian elimination
method, and modified Gaussian elimination method with partial pivoting.
(i) 6x1 ` 2x2 ` 2x3 2,
x1 ` 2x2 x3 0.
x1 ` x2 ` x3 0.8338,
3x1 3x2 ` x3 1,
x1 ` x2 3.
104
M A 214, IITB
an1 x1 ` `ann xn bn ,
where aij 0 whenever i j 2. Write the general form of this system. Use
naive Gaussian elimination method to solve it, taking advantage of the elements
that are known to be zero. Count the number of operations involved in this computation.
(6) Use Thomas method to solve the tri-diagonal system of equations
3x1 ` x2 2,
2x1 3x2 ` x3 6,
2x2 3x3 ` x4 1,
2x3 3x4 2.
LU Decomposition
(7) Prove or disprove the following statements:
(i) An invertible matrix has at most one Doolittle factorization.
(ii) If a singular matrix has a Doolittle factorization, then the matrix has at least
two Doolittle factorizations.
(8) Prove that if an invertible matrix A has an LU -factorization, then all principal
minors of A are non-zero.
105
M A 214, IITB
2 2 1
1 1 1
3 2 1
is invertible but has no LU factorization. Do a suitable interchange of rows to get
an invertible matrix, which has an LU factorization.
(12) Consider
2
6 4
17 17 .
A 6
4 17 20
4 6 2
A 6 10 3
2 3 5
106
M A 214, IITB
Matrix Norms
(17) The following inequalities show that the notion of convergent sequences of vectors
in Rn is independent of the vector norm. Show that the following inequalities hold
for each x P Rn
?
(i) }x}8 }x}2 n}x}8 ,
(ii) }x}8 }x}1 n}x}8 ,
?
(iii) }x}2 }x}1 n}x}2 .
(18) Show that the norm defined on the set of all n n matrices by
}A} : max |aij |
1in
1jn
0.1 0.1
Apq
1.0
2.5
For each P R, compute the condition number of Apq. Determine an 0 such
that the condition number of Ap0 q is the minimum of the set t pApqq : P Ru.
In the computation of condition numbers, use the matrix norm that is subordinate
to the maximum vector norm on R2 .
(22) Consider the following two systems of linear equations
x1 ` x2 1, x1 ` 2x2 2.
and
104 x1 ` 104 x2 104 , x1 ` 2x2 2.
Let us denote the first and second systems by A1 x b1 and A2 x b2 respectively.
Use maximum-norm for vectors and the matrix norm subordinate to maximumnorm for matrices in your computations.
(i) Solve each of the above systems using Naive Gaussian elimination method.
107
M A 214, IITB
Iterative Methods
(23) Let A be a diagonally dominant matrix. Show that all the diagonal elements of
A are non-zero (i.e., aii 0 for i 1, 2, , n.). As a consequence, the iterating
sequences of Jacobi and Gauss-Seidel methods are well-defined if the coecient
matrix A in the linear system Ax b is a diagonally dominant matrix.
(24) Find the nn matrix B and the n-dimensional vector c such that the Gauss-Seidal
method can be written in the form xpk`1q Bxpkq ` c, k 0, 1, 2, .
(25) For each of the following systems, write down the formula for iterating sequences
of Jacobi and Gauss-Seidel methods. Compute three iterates by taking x0
p0, 0, 0qT . Discuss if you can guarantee that these sequences converge to the exact
solution. In case you are not sure about convergence, suggest another iterating
sequence that converges to the exact solution if possible; and justify that the new
sequence converges to the exact solution.
(i) 5x1 ` 2x2 ` x3 0.12, 1.75x1 ` 7x2 ` 0.5x3 0.1, x1 ` 0.2x2 ` 4.5x3 0.5.
(ii) x1 2x2 ` 2x3 1, x1 ` x2 x3 1, 2x1 2x2 ` x3 1.
(iii) x1 ` x2 ` 10x3 1, 2x1 ` 3x2 ` 5x3 6, 3x1 ` 2x2 3x3 4.
[Note: As it is only asked to discuss the guarantee of convergence, it is enough
to check the sucient condition for convergence, ie. to check whether the given
system is diagonally dominant or not.]
(26) Consider the following system of linear equations
0.8647x1 ` 0.5766x2 0.2885,
108
M A 214, IITB
Eigenvalue Problems
(27) The matrix
3 0 0
A 2 1 0
1 0 2
has eigenvalues 1 3, 1 and 3 2 and the corresponding eigenvectors may be taken as v1 p1, 0, 0qT , v2 p1, 2, 0qT and v3 p1, 0, 5qT . Perform
3 iterations to find the eigenvalue and the corresponding eigen vector to which
the power method converges when we start the iteration with the initial guess
xp0q p0, 0.5, 0.75qT . Without performing the iteration, find the eigenvalue and
the corresponding eigenvector to which the power method converges when we start
the iteration with the initial guess xp0q p0.001, 0.5, 0.75qT . Justify your answer.
(28) Find the value of }A}8 for the matrix A given in the above problem using the
power method with xp0q p0, 0.5, 0.75qT .
(29) The matrix
5.4
0
0
A 113.0233 0.5388 0.6461
46.0567 6.4358 0.9612
109
M A 214, IITB
0.5
0
0.2
3.15
1 .
A 0
0.57
0
7.43
lie, given that all eigenvalues of A are real. Show that power method can be
applied for this matrix to find the dominant eigenvalue without computing eigenvalues explicitly. Compute the first three iterates of Power method sequences.
(32) Use the Gerschgorin Circle theorem to determine bounds for the eigenvalues for
the following matrices. Also find optimum bounds wherever possible. Also draw
the pictures of all the regions given by Greschgorin circles.
4 1
0
1
0 0
1
1
4 1 ,
0 1 ,
1 1 2
1 1
4
1 0 1
1
4.75 2.25 0.25
2 2 1
1
.
2.25 4.75
1.25 ,
0 1
3 2
0.25 1.25
4.75
1 0
1
4
(33) Show that the imaginary parts of the
3
1
1{2
eigenvalues of
1{3 2{3
4
0
1{2 1
110
M A 214, IITB
Index
l1 norm, 67
n-digit floating-point number, 26
Absolute error, 32
Arithmetic error, 25
Backward substitution, 50, 55, 60
Big Oh, 19, 20
Binary representation, 26
Bounded sequence, 6
Choleskys factorization, 64
Chopping a number, 30
Condition number
of a function, 40
of a matrix, 71
Continuity of a function, 9
Continuous function, 9
Convergent sequence, 6
Crouts factorization, 63
Decimal representation, 26
Decreasing sequence, 6
Derivative of a function, 10
Diagonally dominant matrix, 78
Dierentiable function, 10
Direct method, 47, 49, 74
Dominant eigenvalue, 86
Doolittles factorization, 61
Double precision, 30
Eigenvalue
Power method, 89
dominant, 86
Error, 32
absolute, 32
arithmetic, 25
floating-point, 38
in iterative procedure, 77
mathematical, 25
percentage, 32
propagated, 38
propagation of, 36
relative, 32
relative total, 38
residual, 82
total, 25, 38
truncation, 16, 33
Euclidean norm, 67
Exact arithmetic, 30
Exponent, 26
First mean value theorem for integrals, 13
Floating-point
approximation, 29
error, 38
representation, 26
Forward
elimination, 50, 55
substitution, 59
Gauss-Seidel method, 80
Gaussian elimination method
modified, 52
Naive, 49
operations count, 55
Gerschgorins
circle theorem, 101
disk, 102
Hilbert matrix, 72
Ill-conditioned
function evaluation, 41
Ill-conditioned matrix, 72
Increasing sequence, 6
Infinite precision, 30
Intermediate value theorem, 10
Iterative method, 47, 74
Gauss-Seidel, 80
Jacobi, 75
refinement, 84
Index
residual corrector, 84
Jacobi method, 75
Limit
of a function, 8
of a sequence, 6
left-hand, 7
of a function, 9
right-hand, 8
Linear system
direct method, 47, 49
Gaussian elimination method, 49, 52
iterative method, 47
LU factorization, 60
Thomas method, 56
Little oh, 19, 20
LU factorization/decomposition, 60
Choleskys, 64
Crouts, 63
Doolittles, 61
Machine epsilon, 32
Mantissa, 26
Mathematical error, 25
Matrix norm, 68
maximum-of-column-sums, 69
maximum-of-row-sums, 69
spectral, 70
subordinate, 68
Maximum norm, 67
Maximum-of-column-sums norm, 69
Maximum-of-row-sums norm, 69
Mean value theorem
derivative, 12, 15
integrals, 13
Modified Gaussian elimination method, 52
Monotonic sequence, 6
Naive Gaussian elimination method, 49
Norm
matrix, 68
maximum-of-column-sums, 69
maximum-of-row-sums, 69
spectral, 70
subordinate, 68
vector, 67
l1 , 67
Euclidean, 67
maximum, 67
Oh
Big and Little, 19, 20
Optimum bound, 103
Order
of convergence, 21
Overflow, 27
Percentage error, 32
Positive definite matrix, 64
Power method, 88, 89
Precision, 29
double, 30
infinite, 30
Principal minors, 61
leading, 61
Principal sub-matrix, 61
Propagated error, 38
Propagation of error, 36
Radix, 26
Rate of convergence, 21
Relative error, 32
Relative total error, 38
remainder estimate, 16
Remainder term
in Taylors formula, 14
Residual
error, 82
vector, 82
Residual corrector method, 84
Rolles theorem, 11
Rounding a number, 30
Sandwich theorem, 6, 9
Second mean value theorem for integrals, 13
Sequence, 5
bounded, 6
convergent, 6
decreasing, 6
increasing, 6
limit, 6
monotonic, 6
Sign, 26
Significant digits, 33
loss of, 35
number of, 34
Spectral norm, 70
Stability
in function evaluation, 42
Stable computation, 42
Stopping criteria
method for nonlinear equations, 85
Subordinate norms, 68
Taylors
formula, 15
polynomial, 14
series, 17
Theorem, 33
theorem, 14
Thomas method, 56
Total error, 25, 38
Triangle inequality, 67, 68
Truncation error, 16, 33
112
M A 214, IITB
Section Index
Underflow, 27
Unit round, 32
Unstable computation, 42
Vector norm, 67
l1 , 67
Euclidean, 67
maximum, 67
Well-conditioned
function evaluation, 41
Well-conditioned matrix, 72
Wilkinsons example, 86
113
M A 214, IITB