Floating Point Numbers and Errors

The document discusses floating point numbers, focusing on computer arithmetic, roundoff errors, and the representation of numbers in a finite digit system. It covers definitions of key concepts such as normalized numbers, machine epsilon, and IEEE standards for binary floating-point arithmetic. Additionally, it highlights the importance of understanding error propagation and the implications of using different arithmetic methods in computational settings.

MATH3511

Assoc Prof Linda Stals

Semester 1, 2025

Contents
1 Floating Point Numbers
1.1 Computer Arithmetic and Roundoff Errors
1.2 Measuring Errors
1.3 Accumulation of Errors
1.4 Catastrophic Cancellation

Definitions
1.1 Definition (Floating Point System)
1.2 Definition (Normalised Numbers)
1.3 Definition (Machine Epsilon)
1.4 Definition (Absolute and relative errors)
1.5 Definition (Floating point value)
1.6 Definition (Floating point model)

Examples
1.1 Example (Floating point model - multiplication)
1.2 Example (Floating point model - subtraction)
1.3 Example (Floating point model - addition)
1.4 Example (Floating point model - complex multiplication)
1.5 Example (Catastrophic cancellation)

Algorithms

Matlab Code
1 Modification of code to avoid cancellation

1 Floating Point Numbers

Computer Numbers
A knowledge of computer hardware is important for efficient computation. The speed of a computer is often measured by the number of floating point operations per second (flops). The usual notation is 1,000,000 flops = 1 Megaflop; 1000 Megaflops = 1 Gigaflop; 1000 Gigaflops = 1 Teraflop; 1000 Teraflops = 1 Petaflop; 1000 Petaflops = 1 Exaflop.
Computer hardware is of primary importance when it comes to the efficiency of the computations. Some machines (small PCs) may carry out thousands of operations per second, while others (supercomputers) carry out billions of operations per second. To get the best performance out of a machine, we need to use techniques that are at the forefront of both the mathematical and computer sciences.
Computers have a limited number of digits to store a number. If a number contains too
many digits, we must shorten it to fit into the available space. We usually accomplish this
by rounding (or truncating) the number. From a numerical computing perspective, this is
crucial because it indicates that errors can be present from the moment data is entered. As
the calculation proceeds, further errors are introduced and can propagate, sometimes with
disastrous consequences. For example, on a hand calculator that stores ten decimal places,
the result of the multiplication

0.234567 × 0.9876543 = 0.2316711061881

would either be rounded off to the nearest number with only ten decimal places, giving

0.2316711062

or chopped (rounded down, truncated), discarding the extra digits after the tenth, giving

0.2316711061.

Rounding changes the number by at most 0.5 × 10−10 while chopping can produce an error
(change) of up to 10−10 .
In this section, we will consider ways in which these errors propagate during elementary operations, as well as the phenomenon of catastrophic cancellation.
There are models designed to predict the behaviour of the error in a general setting. We
shall briefly discuss a popular model in this course. Our goal is to use the model to explain
the type of behaviour we see in a set of example model problems.
Be aware this is still a research area, and many other models are in use.

1.1 Computer Arithmetic and Roundoff Errors

Floating Point System


Every number processed on a computer must be represented with a finite number of digits.
Two possible forms of representation are fixed point and floating point. In a fixed point
system, all numbers have a fixed number of decimal places. For example

0.011 11.000 1100.000.

In a floating point system, numbers are represented in a scientific notation with a fixed number of significant digits. With three significant digits, the floating point representations of the numbers given above are

1.10 × 10^−2    1.10 × 10^1    1.10 × 10^3.
Machine Numbers
While we work in the decimal system, the computer operates with a binary system. The storage of even some simple numbers, such as 1/10, in the binary representation requires rounding or chopping to be represented by a fixed number of digits. A computer can store only a finite set of numbers exactly. They are called machine numbers.
Floating Point Arithmetic
It is most common on computers to use floating point arithmetic, which means that
numbers are essentially stored in scientific notation.

Definition 1.1 (Floating Point System). The floating point system is characterised by:

• β, the base or radix of the system

• t the precision of the system (number of significant figures)

• [L, U ] the exponent range

A number in the floating point system has the form

±(b0 + b1/β + · · · + bt−1/β^(t−1)) × β^e

where 0 ≤ bi < β for i = 0, . . . , t − 1 and L ≤ e ≤ U.

The fractional part or the mantissa is the number

m = b0 + b1/β + · · · + bt−1/β^(t−1).

It follows that 1 ≤ m < β.
The number e is called the exponent and the ±1 is the sign of the number.
The precision t is related to the number of digits, while L and U determine the smallest
and largest possible number that may be stored. See the discussion below.
Observe we can rewrite the floating point number as

±(b0 β^(t−1) + b1 β^(t−2) + · · · + bt−1) β^(e−t+1),    (1)

with the advantage being that the term inside the bracket is now an integer.
Normalised Numbers

Definition 1.2 (Normalised Numbers). To ensure uniqueness of representation we can specify


that for all non-zero numbers the first digit b0 must be non-zero. The system is then said to
be normalised.

Figure 1: The floating point numbers are not evenly spaced

For example, 0.03 = 0.03 × 100 = 0.30 × 10−1 = 3.00 × 10−2 = 30.0 × 10−3 = · · · . In a
normalised system 0.03 will be stored as 3.00 × 10−2 .
For systems using base 2 the leading digit for a normalised number will always be 1 (and
therefore does not need to be explicitly stored).
In a normalised system, the smallest positive floating point number is β^L. The largest floating point number is

(β − 1) (Σ_{i=0}^{t−1} β^(−i)) β^U = Σ_{i=0}^{t−1} β^(U−i+1) − Σ_{i=0}^{t−1} β^(U−i)
                                   = Σ_{i=0}^{t−1} β^(U+1−i) − Σ_{i=1}^{t} β^(U+1−i)
                                   = β^(U+1) − β^(U+1−t).
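These formulas are easy to verify by brute force in a toy system. The Python sketch below uses illustrative, made-up parameters (β = 2, t = 3, exponent range −1 to 2) and enumerates every positive normalised number via the integer form in Equation (1):

```python
# Toy normalised system (illustrative parameters): beta = 2, t = 3, L = -1, U = 2.
# Each positive number is mu * beta**(e - t + 1) with integer mu in [beta^(t-1), beta^t).
beta, t, L, U = 2, 3, -1, 2
numbers = sorted({mu * beta**(e - t + 1)
                  for e in range(L, U + 1)
                  for mu in range(beta**(t - 1), beta**t)})
print(numbers[0], numbers[-1])   # 0.5 7, i.e. beta^L and beta^(U+1) - beta^(U+1-t)
print(numbers)                   # note the spacing doubles each time e increases
```

The second print also illustrates the uneven spacing shown in Figure 1: within one exponent the numbers are equally spaced, and the spacing doubles when the exponent increases.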

Floating Point Number Spacing


If we fix the exponent and vary the mantissa, the numbers are equally spaced. However, as we increase the exponent, the spacing between numbers increases. See Figure 1, which is taken from [Link] (where emin = L, emax = U). The GUI that was used to generate the Figure is available at [Link].
IEEE Arithmetic
Should a computer use rounding or chopping? Should the numbers be normalised? What
happens when you convert an integer to a real number? How many bits should be used to
store a single precision number? What results should calculations like 0/0 return?
The IEEE Standard for Binary Floating-Point Arithmetic addresses such questions and
is widely used. Incorporating such standards means that a piece of code will return the same
result independent of what machine is used (assuming that we are considering single-processor
machines).
Most computers use the same standard floating point system, first defined by the IEEE
(Institute of Electrical and Electronics Engineers) in 1985 and last updated in 2019.
The most common format in use today is IEEE double precision, but other standards are
defined too:

Table 1: An example double precision number

0 10000000011 1011100100010000000000000000000000000000000000000000

IEEE format β t L U
Half precision 2 11 −14 15
Single precision 2 24 −126 127
Double precision 2 53 −1022 1023
Quadruple precision 2 113 −16382 16383

One double precision number takes 64 bits/binary digits (8 bytes) of memory:


• 1 bit for the sign (0 for +, 1 for −)

• 11 bits for the exponent e (save the binary representation of e + U ∈ {1, . . . , 2046})

• 52 bits for the mantissa (t = 53 by not explicitly storing the normalised b0 = 1)

As an example, consider the double precision number (sign-exponent-mantissa) given in Table 1.
This represents:

• Sign 0 is +.

• Exponent 10000000011 is e + U = 2^10 + 2^1 + 2^0 = 1027, so e = 4 (since U = 1023).

• Mantissa 1011100100010000000000000000000000000000000000000000. Consequently, the mantissa represents b0 = 1 (normalised) plus b1 = b3 = b4 = b5 = b8 = b12 = 1, all other bi = 0.

Hence it represents the (decimal) number

+(1 + 2^−1 + 2^−3 + 2^−4 + 2^−5 + 2^−8 + 2^−12) × 2^4 = 27.56640625 (exactly).
Note that the exponent bits can never be all zeros or all ones (i.e. e + U ̸= 0, 2047).
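The decoding above can be checked mechanically. The following Python sketch rebuilds the value from the bit fields of Table 1, and cross-checks it against the standard library's byte-level interpretation of the same 64 bits:

```python
import struct

# bit fields from Table 1: sign, 11 exponent bits, then 52 mantissa bits
bits = "0" + "10000000011" + "1011100100010000" + "0" * 36
sign = -1.0 if bits[0] == "1" else 1.0
e = int(bits[1:12], 2) - 1023                     # remove the bias U = 1023
m = 1 + sum(int(b) * 2.0**-(i + 1) for i, b in enumerate(bits[12:]))
print(sign * m * 2.0**e)                          # 27.56640625

# sanity check: the same 64 bits interpreted directly as an IEEE double
print(struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0])   # 27.56640625
```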
The double numeric type values in Matlab are stored using IEEE-754 format for double
precision. This consists of

• 52+1 bits describing a fractional part of the form 1.b1 b2 ...b52 where each of the b1 , b2
etc. is a bit [0 or 1]. Since the first bit is always 1, it does not need to be stored.

• 11 bits describing an exponent e giving a factor 2e . Of the 2048 possible values, the
ones actually used are -1022 to 1023, with the base of the powers being 2.

• one bit for the sign of the fractional part.

With β = 2, the largest possible number is 2^1024 − 2^(1024−53), giving approximately 1.80 × 10^308. See [Link] for more detail.
The IEEE standard also defines some special ‘numbers’, such as ±0 and ±∞. See Table
2.

Table 2: Some special cases represented in IEEE format

0 00000000000 0000000000000000000000000000000000000000000000000000 = +0.0


1 00000000000 0000000000000000000000000000000000000000000000000000 = -0.0
0 11111111111 0000000000000000000000000000000000000000000000000000 = +inf
1 11111111111 0000000000000000000000000000000000000000000000000000 = -inf
0 11111111111 ************* any 52 bits not all zero ************* = NaN

NaN means "not a number", and is the result of performing undefined operations such as 0/0, or of arithmetic involving existing NaN values, such as NaN + 3 = NaN. Note the convention that NaN == NaN, NaN > NaN and NaN < NaN are all false (this can cause problems).
The value −0 acts essentially the same way as 0 for arithmetic, and has the convention
that +0 = −0 is true. It comes about from calculations like (−1) × (+0) or from rounding
very small negative numbers to zero.
The values ±∞ act as you might expect: ∞ − 1 = ∞ and 3 > −∞, but 0 × ∞ = NaN.
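These conventions are easy to observe directly; for instance, in Python (whose float is an IEEE double):

```python
import math

nan, inf = float("nan"), float("inf")
print(nan == nan)                  # False: NaN compares unequal, even to itself
print(math.isnan(0.0 * inf))       # True: 0 times infinity is undefined
print(inf - 1.0)                   # inf
print(-0.0 == 0.0)                 # True: +0 and -0 compare equal
print(math.copysign(1.0, -0.0))    # -1.0: the sign bit of -0 is still set
```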
Overflow and underflow

• If the result of a computation is larger than that allowed by the computer, you have an
overflow.
• When a result is smaller than the smallest positive normalised number, it either becomes zero
or is considered inaccurate because it has fewer than the normal 53 significant bits. This
loss of accuracy is called underflow.
• Underflow is a more subtle problem than overflow, but still important, and some ma-
chines will stop with an error message (or at least print a warning message) when it
occurs.

Matlab interprets overflow as 'infinity', which is printed as inf or −inf according to its sign, and continues computing. Other operations with no well-defined answer, like 0.0/0.0, give the result NaN. Note that in Matlab evaluating, say, sqrt(−1.0) will result in a complex number. Other languages behave differently: C's sqrt(−1.0) returns NaN, while Python's math.sqrt(−1.0) raises an error.
Machine Epsilon

Definition 1.3 (Machine Epsilon). The smallest positive machine number εM such that
1 + εM ̸= 1 is called machine epsilon.

In our floating point model, 1 = 1 × β^0. So the next biggest number is 1 + β^(1−t) × β^0. Hence εM = β^(1−t).
For normalised IEEE 64 bit floating point numbers εM = 2^−52 ≃ 2.2204 × 10^−16. So this is roughly the maximum relative error introduced when you round or chop a number. Another way of describing this is that numbers have 53 significant bits, or slightly better than fifteen significant decimal digits.
For traditional 32-bit machines (single precision) εM = 2^−23 ≃ 1.1921 × 10^−7. That is, numbers have 24 significant bits, or close to seven significant decimal digits.
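In Python, where float is an IEEE double, these values can be confirmed directly:

```python
import sys

print(sys.float_info.epsilon == 2**-52)   # True: eps_M for IEEE doubles
print(1.0 + 2**-52 == 1.0)                # False: adding eps_M changes the sum
print(1.0 + 2**-53 == 1.0)                # True: half of eps_M rounds away
print(sys.float_info.dig)                 # 15 guaranteed decimal digits
```

The 1 + 2^−53 case rounds back to 1 because it lies exactly halfway between 1 and the next machine number, and ties round to the even mantissa.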

1.2 Measuring Errors

Rounding and Truncation


For example, with nine decimal places in the fraction,

231.6711061881 = 2.316711061881 × 10^2

is rounded to

2.316711062 × 10^2 = 231.6711062

or chopped to

2.316711061 × 10^2 = 231.6711061.
This is rounding to ten significant figures.
The actual error in the chopped value above is about 1 × 10^−7. Is that a large or small error? It depends on the size of the numbers being calculated, which leads us to consider the difference between absolute and relative errors.

Definition 1.4 (Absolute and relative errors). If x̃ is an approximation of x, the error in this approximation is

x − x̃,

the absolute error is

|x − x̃|,

and the relative error is

(x − x̃) / x.

The relative error provides a better understanding of the significance of the error. As seen above, rounding or chopping to ten significant decimal digits introduces a relative error of about 10^−10 regardless of the size of the number.

1.3 Accumulation of Errors

Floating Point Model


The rounding errors of about one part in 10^16 are rarely a problem in themselves. The combined effect over the many calculations in an algorithm can be far worse (see the collection of real-world examples at [Link]).

Definition 1.5 (Floating point value). The floating point value fl(x) is the result of the
rounding (or chopping) done on x. fl(x) is the floating point number representation of x as
given in Definition 1.1.

Recall the number of digits in the mantissa is the number of significant digits. The
rounding or chopping is done on the fractional part.

Write x ∈ R as

x = (Σ_{i=0}^{∞} bi / β^i) β^e
  = (Σ_{i=0}^{∞} bi / β^(i−t+1)) β^(e−t+1)
  = (Σ_{i=0}^{∞} bi β^((t−1)−i)) β^(e−t+1)
  = µ β^(e−t+1),

where β^(t−1) ≤ µ. We have assumed a normalised system where 0 < b0.


Consider

µ = Σ_{i=0}^{∞} bi β^((t−1)−i) = Σ_{i=0}^{t−1} bi β^((t−1)−i) + Σ_{i=t}^{∞} bi β^((t−1)−i).

Compare the above equation to the floating point representation given in Equation (1).
Then y1 = ⌊µ⌋ β^(e−t+1) and y2 = ⌈µ⌉ β^(e−t+1) are two floating point numbers such that y1 ≤ x ≤ y2. (⌊µ⌋ rounds a number down to the nearest integer, ⌈µ⌉ rounds a number up to the nearest integer.)
So

|x − fl(x)| ≤ β^(e−t+1)

for chopping and

|x − fl(x)| ≤ β^(e−t+1)/2

for rounding. Furthermore,

|x − fl(x)| / |x| ≤ β^(e−t+1) / (µ β^(e−t+1)) ≤ β^(1−t) = εM

for chopping and

|x − fl(x)| / |x| ≤ β^(1−t)/2 = εM/2

for rounding.

Definition 1.6 (Floating point model). Any system of floating point numbers will have an
upper limit on rounding ε, so that

fl(x) = x(1 + δ), |δ| ≤ ε

for numbers in the allowable range of absolute values.

When using rounding, which is the case for IEEE, ε = β^(1−t)/2. Note that |δ| ≤ 2^−24 in single precision and |δ| ≤ 2^−53 in double precision (if rounding).
Machines may have slight differences, but we will assume that whenever two machine
numbers are combined through some arithmetical procedure, they are first combined, then
normalised, rounded off, and finally stored in memory.

Figure 2: Absolute error |x − fl(x)| when using single precision

We will define the floating point rounding function "fl" whose value is the result of the rounding done on its argument by the machine in use, and then consider the accumulated error through the examination of the effects of addition, subtraction, multiplication and division. Thus, the result that you would expect for multiplication of two numbers x and y that can be represented exactly in the machine's floating point system is fl(x × y). If, alternatively, the numbers themselves have to be rounded on input, the result is fl(fl(x) × fl(y)).
The floating point model in Definition 1.6 gives a uniform bound on the fractional change
in a number when it is rounded to a normalised value.
Numerical Experiment - Single Precision
As an experiment, I randomly generated 3000 real numbers x such that x ∈ [−100000, 100000].
I then rounded them to single precision and plotted |fl(x) − x|. The result is in Figure 2.
(Actually, I cheated a bit. Since we can’t store real numbers on a computer, I randomly
generated 3000 quadruple precision numbers, and used those as the ‘real’ numbers.)
Let us now consider the relative error |fl(x) − x|/|x| as shown in Figure 3. Notice that the relative error is ≤ 2^−24 ≈ 5.96 × 10^−8.
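The experiment can be reproduced with the Python standard library alone; `struct` rounds a double to the nearest single precision value. This is a sketch with a fixed seed, not the original random data:

```python
import random
import struct

def fl_single(x):
    """Round a double to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

random.seed(0)                     # fixed seed, not the original random numbers
xs = [random.uniform(-100000, 100000) for _ in range(3000)]
worst = max(abs(fl_single(x) - x) / abs(x) for x in xs)
print(worst <= 2**-24)             # True: relative error bounded by eps/2 = 2^-24
```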
Multiplication

Example 1.1 (Floating point model - multiplication). Use the floating point model given in
Definition 1.6 to bound the error when multiplying two floating point numbers.

Using the bound ε, consider the total rounding error in the multiplication:

fl(fl(x) × fl(y)) = fl(x(1 + δ1) × y(1 + δ2))
                  = [x(1 + δ1) y(1 + δ2)](1 + δ3),

where |δi| ≤ ε. Thus

fl(fl(x) × fl(y)) = (xy)(1 + δ4)

Figure 3: Relative error |x − fl(x)|/|x| when using single precision

where
(1 + δ4 ) = (1 + δ1 )(1 + δ2 )(1 + δ3 ).
The extreme values of (1 + δ4) give

(1 − ε)^3 = 1 − 3ε + O(ε^2) ≤ 1 + δ4 ≤ (1 + ε)^3 = 1 + 3ε + O(ε^2).

So considering the very small value of ε we have roughly that

|δ4| ≤ 3ε

(more carefully, |δ4| ≤ 3ε + 3ε^2 + ε^3).


So for multiplication the combination of three errors is roughly additive.
To continue with the numerical experiment, I randomly generated 3000 points x, y and evaluated |fl(fl(x) × fl(y)) − x × y|/|x × y| using single precision. The results shown in Figure 4 have relative error ≤ 3 × 2^−24 ≈ 1.79 × 10^−7.
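This experiment, too, can be replayed with a short Python sketch (fixed seed and an arbitrary range [1, 1000], not the original data), using the same single precision rounding helper:

```python
import random
import struct

def fl_single(x):
    """Round a double to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

random.seed(1)
worst = 0.0
for _ in range(3000):
    x, y = random.uniform(1.0, 1000.0), random.uniform(1.0, 1000.0)
    rel = abs(fl_single(fl_single(x) * fl_single(y)) - x * y) / abs(x * y)
    worst = max(worst, rel)
print(worst <= 3 * 2**-24)         # True: consistent with |delta4| <= 3*eps
```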
Addition and Subtraction
The situation for addition and subtraction is different as seen by the following example.
Using ten significant decimal digits we have

x = 0.8888888888888888 − 0.8888888888444444
  = 0.0000000000444444,
x̃ = fl(0.8888888888888888) − fl(0.8888888888444444)
  = 0.8888888889 − 0.8888888888
  = 0.0000000001.

Figure 4: Relative error |fl(fl(x) × fl(y)) − x × y|/|x × y| when using single precision

The relative error is

|x − x̃| / |x| ≈ 1.25.

But ε ≃ 10^−10 and the relative error is more than 10^10 times larger. We have essentially no significant figures of accuracy. This is catastrophic!
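We can replay this with Python's `decimal` module standing in for the ten-digit calculator (the unary `+` operator rounds a value to the current working precision):

```python
from decimal import Decimal, getcontext

getcontext().prec = 30                 # enough digits for an exact difference
x = Decimal("0.8888888888888888")
y = Decimal("0.8888888888444444")
exact = x - y
print(exact)                           # 4.44444E-11

getcontext().prec = 10                 # now a ten-digit "calculator"
approx = (+x) - (+y)                   # unary + rounds the inputs first
print(approx)                          # 1E-10
print(abs(exact - approx) / exact)     # about 1.25, a huge relative error
```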
If instead we try adding the two numbers,

x = 0.8888888888888888 + 0.8888888888444444
= 1.7777777777333332,
x̃ = fl(0.8888888888888888) + fl(0.8888888888444444)
= 0.8888888889 + 0.8888888888
= 1.7777777777.

The relative error is

|x − x̃| / |x| ≈ 1.9 × 10^−11.

Example 1.2 (Floating point model - subtraction). Use the floating point model given in
Definition 1.6 to get a more precise understanding of why subtraction may be problematic with
floating point numbers.

Let's calculate the roundoff error when exactly subtracting two different numbers:

fl(x) − fl(y) = x(1 + δ1) − y(1 + δ2)
             = (x − y) (1 + (xδ1 − yδ2)/(x − y))
             = (x − y)(1 + δ3)

Figure 5: Relative error |fl(fl(x) − fl(y)) − (x − y)|/|x − y| when using single precision

where

|δ3| ≤ 2ε max(|x|, |y|) / |x − y|.

This can be very large if x − y is very small relative to x and y. Again, I randomly generated 3000 real numbers x, y ∈ [0, 100000] and evaluated |fl(fl(x) − fl(y)) − (x − y)|/|x − y| using single precision. The results shown in Figure 5 have relative errors ≫ 2^−24 ≈ 5.96 × 10^−8.
Beware of subtracting two similar numbers.
For example, recall the finite difference formula

f′(x) ≈ (f(x + h) − f(x)) / h.    (2)

In exact arithmetic, we expect the error to decrease as h → 0. However, f(x + h) ≈ f(x) for h small. We saw in the chapter on numerical differentiation that the large roundoff errors have a major impact on the accuracy of the solution. See Figure 6 for a reminder.
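The effect is easy to reproduce. A small Python sketch for f(x) = x sin x at x = 1 (the function used in Figure 6):

```python
import math

f = lambda t: t * math.sin(t)
dfdx = math.sin(1.0) + math.cos(1.0)      # exact f'(x) = sin x + x cos x at x = 1

for h in (1e-2, 1e-6, 1e-10, 1e-14):
    approx = (f(1.0 + h) - f(1.0)) / h
    print(h, abs(approx - dfdx))
# the error shrinks with h at first (truncation error), then grows again as the
# cancellation in f(1 + h) - f(1) amplifies roundoff
```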

Example 1.3 (Floating point model - addition). Use the floating point model given in Definition 1.6 to show that if we add two numbers with the same sign, the relative errors are well-behaved.
If x, y > 0,

fl(fl(x) + fl(y)) = (x(1 + δ1) + y(1 + δ2))(1 + δ3)
                  = (x + y) + x(δ1 + δ3 + δ1δ3) + y(δ2 + δ3 + δ2δ3)
                  = (x + y)(1 + δ4),

where

|δ4| (x + y) = |x(δ1 + δ3 + δ1δ3) + y(δ2 + δ3 + δ2δ3)|
             ≤ max{|δ1 + δ3 + δ1δ3|, |δ2 + δ3 + δ2δ3|} (x + y)
             ≤ (2ε + ε^2)(x + y).
Figure 6: Error in approximating f′(1) for f(x) = x sin x using forward differences, together with O(h) and O(h^2) reference lines.

So

fl(fl(x) + fl(y)) = (x + y)(1 + δ4), |δ4| ≤ 2ε + ε^2 ≃ 2ε.

This is consistent with the results of the numerical experiment shown in Figure 7.
Multiplication of Complex Numbers
Before looking at the next example, note that if r, s ∈ R,

2(r^2 + s^2) − (r + s)^2 = (r − s)^2 ≥ 0.    (3)

Consider x = a + ib and y = c + id ∈ C. The multiplication of two complex numbers is given by

xy = (ac − bd) + i(ad + bc).

Let us suppose that a, b, c and d are floating point numbers as specified in Definition 1.1. That is, a = fl(a), b = fl(b), c = fl(c) and d = fl(d). Now, as an additional example, let us apply the floating point model to complex multiplication.

Example 1.4 (Floating point model - complex multiplication). Use the floating point model
given in Definition 1.6 to bound the error when multiplying two complex numbers.

According to the model,

fl(xy) = (ac(1 + δ1) − bd(1 + δ2))(1 + δ3) + i(ad(1 + δ4) + bc(1 + δ5))(1 + δ6)
       = ac(1 + δ1)(1 + δ3) − bd(1 + δ2)(1 + δ3) + i(ad(1 + δ4)(1 + δ6) + bc(1 + δ5)(1 + δ6))
       = (ac − bd) + i(ad + bc) + e,

where |δi | ≤ ε for 1 ≤ i ≤ 6 and

e = ac (δ1 + δ3 + δ1 δ3 ) − bd (δ2 + δ3 + δ2 δ3 ) + i (ad (δ4 + δ6 + δ4 δ6 ) + bc (δ5 + δ6 + δ5 δ6 )) .


Figure 7: Relative error |fl(fl(x) + fl(y)) − (x + y)|/|x + y| when using single precision

Now

|e|^2 = (ac(δ1 + δ3 + δ1δ3) − bd(δ2 + δ3 + δ2δ3))^2 + (ad(δ4 + δ6 + δ4δ6) + bc(δ5 + δ6 + δ5δ6))^2
      ≤ (|ac| |δ1 + δ3 + δ1δ3| + |bd| |δ2 + δ3 + δ2δ3|)^2 + (|ad| |δ4 + δ6 + δ4δ6| + |bc| |δ5 + δ6 + δ5δ6|)^2
      ≤ γ^2 ((|ac| + |bd|)^2 + (|ad| + |bc|)^2),

where

γ = max{|δ1 + δ3 + δ1δ3|, |δ2 + δ3 + δ2δ3|, |δ4 + δ6 + δ4δ6|, |δ5 + δ6 + δ5δ6|} ≤ 2ε + O(ε^2).

From Equation (3),

|e|^2 ≤ 2γ^2 (|ac|^2 + |bd|^2 + |ad|^2 + |bc|^2)
      = 2γ^2 (a^2 + b^2)(c^2 + d^2)
      = 2γ^2 |x|^2 |y|^2.

Finally

fl(xy) = xy(1 + δ), where |δ| = |e| / |xy| ≤ √2 γ ≈ 2√2 ε.

1.4 Catastrophic Cancellation

Cancellation Error
In any calculation there is the danger of catastrophic cancellation. Suppose two numbers agree (are equal) in all but the last digit; then their difference x − y can have only one significant digit of accuracy, and this without any roundoff error. Future calculations will then have at most one significant digit. In extreme cases such routines can become
x      ex_Taylor(x)            e^x
1 2.718281828459046e+00 2.7182818284590455e+00
10 2.202646579480671e+04 2.2026465794806718e+04
20 4.851651954097905e+08 4.8516519540979028e+08
40 2.353852668370201e+17 2.3538526683701997e+17
-1 3.678794411714424e-01 3.6787944117144233e-01
-10 4.539992967040021e-05 4.5399929762484847e-05
-20 6.147561828914626e-09 2.0611536224385579e-09
-40 3.116951588217358e-01 4.2483542552915889e-18

Table 3: Accuracy of the Matlab function ex_Taylor

numerically unstable and lead to completely incorrect solutions. Such calculations are known
as catastrophic cancellation.
This potential problem is often difficult to foresee, although it is sometimes possible to
avoid. In general, if x and y are represented with n significant digits, and agree to one or
more digits, then their difference will not be accurate to the full n digits.
Catastrophic Cancellation Example
Let us consider another situation in which this problem arises.

Example 1.5 (Catastrophic cancellation). Calculate e^x using the formula

e^x = Σ_{n=0}^{∞} x^n / n!.

Consider the following Matlab program which can be used to approximate ex .

function ex_approx = ex_Taylor(x)

% Program to calculate exp(x) using naive Taylor series method

ex_approx = 1;
term = 1;
n = 0;
while ( ex_approx + term ~= ex_approx )
    n = n + 1;
    term = term*(x/n);
    ex_approx = ex_approx + term;
end

Note that we stop the calculation when the addition of a new term does not make a difference
to the sum.
Table 3 compares the results from this program with the exact value of ex . When x > 0
the results are fine. However, when x < 0 the results become progressively worse as x → −∞.
What is happening? For positive x we have a sum of positive terms. For negative x we
have a sum of positive and negative terms.
Table 4 displays the terms in the series e−25 . It is easy to see that in a situation like this,
catastrophic cancellation can and does occur.
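A direct Python transcription of the Matlab routine shows the same behaviour: accurate for positive x, badly wrong for large negative x.

```python
import math

def exp_taylor(x):
    """Naive Taylor series for exp(x): stop when a term no longer changes the sum."""
    s, term, n = 1.0, 1.0, 0
    while s + term != s:
        n += 1
        term *= x / n
        s += term
    return s

for x in (20.0, -20.0):
    rel = abs(exp_taylor(x) - math.exp(x)) / math.exp(x)
    print(x, rel)   # tiny relative error for +20, O(1) relative error for -20
```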

n     n-th term in Taylor's series
0     1
1     -24
2     2.8850 × 10^2
...
12    8.3320 × 10^7
13    −1.5598 × 10^8
14    2.7134 × 10^8
15    −4.4086 × 10^8
...
85    −8.1807 × 10^−7
86    8.1814 × 10^−7
87    −8.1812 × 10^−7
88    8.1813 × 10^−7

Table 4: Terms of the series approximation of e−25 .

In this case we can avoid the situation in which catastrophic cancellation will occur by using the formula

e^x = 1 / e^(−x)

for x < 0.
Specifically, we use the code given in Listing 1.

Listing 1: Modification of code to avoid cancellation


function ex_approx = ex_new(x)

% Program to calculate exp(x) avoiding cancellation errors

if x > 0
    ex_approx = ex_Taylor(x);
else
    ex_approx = 1.0/ex_Taylor(-x);
end

As can be seen in Table 5, the results obtained with the new algorithm are quite satisfac-
tory.
A particularly interesting example of catastrophic cancellation, given by Rump [2], is the expression

y = 333.75 b^6 + a^2 (11 a^2 b^2 − b^6 − 121 b^4 − 2) + 5.5 b^8 + a/(2b)

with a = 77617 and b = 33096. The exact value of this expression is

−54767/66192 ≈ −0.8273960599468.

x      ex_new(x)               e^x
1 2.718281828459045e+00 2.7182818284590455e+00
10 2.202646579480671e+04 2.2026465794806718e+04
20 4.851651954097905e+08 4.8516519540979028e+08
40 2.353852668370201e+17 2.3538526683701997e+17
-1 3.678794411714423e-01 3.6787944117144233e-01
-10 4.539992976248485e-05 4.5399929762484847e-05
-20 2.061153622438557e-09 2.0611536224385579e-09
-40 4.248354255291587e-18 4.2483542552915889e-18

Table 5: Accuracy of the new Matlab function ex_new

Now what do we get if we use double precision floating point arithmetic? All the constants in the expression can be represented exactly in double precision, so we might expect an accurate result. But using Matlab we obtain −1.1805916207174 × 10^21!
If we were to do this calculation in extended precision we would still obtain essentially
the same result. It turns out that you need at least 37 decimal digits of accuracy to obtain a
reasonable result.
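The failure is easy to reproduce in any IEEE double precision environment; for example, the following Python sketch evaluates the same expression:

```python
# Rump's expression evaluated in IEEE double precision (Python floats)
a, b = 77617.0, 33096.0
y = (333.75 * b**6
     + a**2 * (11 * a**2 * b**2 - b**6 - 121 * b**4 - 2)
     + 5.5 * b**8
     + a / (2 * b))
print(y)   # of order 1e21 -- nowhere near the true value -0.827396...
```

The intermediate terms are of order 10^36, so a single unit in the last place of those terms is already of order 10^21, and the cancellation leaves nothing but roundoff.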
An explanation of the behaviour of this function can be found in the paper by Cuyt et al. [1].
Error in Continuous Processes
Numerical schemes to solve differential equations will be dealt with later in the course, but Euler's method is one you may already have heard of. The errors which occur in most of these computational problems are a combination of roundoff errors due to the finite nature of the floating point system, discretisation errors, convergence errors and errors due to the nature of the numerical scheme chosen. Of course, we could use various tools in Matlab to do our calculations to any desired precision, but then the speed of our calculations would be slow. So we usually make the compromise of using floating point systems which allow fast calculations, but which maintain only a limited number of significant figures.
As we have seen in the previous discussions, a calculation with many operations may be swamped by roundoff errors; one with too few may lack sufficient accuracy. For instance, when we use the fourth order Runge-Kutta method (see the ODE chapter) with a time-step of size h to approximate the solution of an ODE at time T, we will use O(1/h) operations, which will typically produce a roundoff error of size O(ε/h). As shown in Figure 8, the total error is a combination of discretisation error and roundoff error.
For high efficiency there needs to be a tradeoff between these errors and the required accuracy in context, and thus for each problem the most efficient method may be different.

Figure 8: Typical errors using the 4th order Runge-Kutta method, showing discretisation error dominating for large h and roundoff error dominating for smaller h.

References
[1] A. Cuyt, B. Verdonk, S. Becuwe, and P. Kuterna. A remarkable example of catastrophic cancellation unraveled. Computing, 66:309–320, 2001. doi:10.1007/s006070170028.

[2] S. M. Rump. Algorithms for verifying inclusions - theory and practice. In R. E. Moore,
editor, Reliability in Computing, pages 109–126. Academic Press, New York, 1988.

