0% found this document useful (0 votes)

32 views20 pages

Floating Point Numbers and Errors

The document discusses floating point numbers, focusing on computer arithmetic, roundoff errors, and the representation of numbers in a finite digit system. It covers definitions of key concepts such as normalized numbers, machine epsilon, and IEEE standards for binary floating-point arithmetic. Additionally, it highlights the importance of understanding error propagation and the implications of using different arithmetic methods in computational settings.

Uploaded by

desheng wang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

32 views20 pages

Floating Point Numbers and Errors

Uploaded by

desheng wang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

MATH3511

Assoc Prof Linda Stals

Semester 1, 2025

1
Contents
1 Floating Point Numbers 3
1.1 Computer Arithmetic and Roundoff Errors . . . . . . . . . . . . . . . . . . . 3
1.2 Measuring Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Accumulation of Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Catastrophic Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Definitions
1.1 Definition (Floating Point System) . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Definition (Normalised Numbers) . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Definition (Machine Epsilon) . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Definition (Absolute and relative errors) . . . . . . . . . . . . . . . . . . . . . 8
1.5 Definition (Floating point value) . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Definition (Floating point model) . . . . . . . . . . . . . . . . . . . . . . . . . 9

Examples
1.1 Example (Floating point model - multiplication) . . . . . . . . . . . . . . . . 10
1.2 Example (Floating point model - subtraction) . . . . . . . . . . . . . . . . . . 12
1.3 Example (Floating point model - addition) . . . . . . . . . . . . . . . . . . . 13
1.4 Example (Floating point model - complex multiplication) . . . . . . . . . . . 14
1.5 Example (Catastrophic cancellation) . . . . . . . . . . . . . . . . . . . . . . . 16

Algorithms

Matlab Code
1 Modification of code to avoid cancellation . . . . . . . . . . . . . . . . . . . . 17

2
1 Floating Point Numbers

Computer Numbers
A knowledge of Computer Hardware is important for efficient computation. The speed
of a computer is often measured by the number of floating point operations per second. The
usual notation is 1,000,000 flops = 1 Megaflop; 1000 Megaflops = 1 Gigaflop, 1000 Gigaflops
= 1 Teraflop, 1000 Teraflop = 1 Exaflop.
Computer hardware is of primary importance when it comes to the efficiency of the com-
putations. Some (small PC’s) may carry out thousands of operations per second, while others
(super-computers) carry out billions of operations per second. To get the best performance
out of a machine, we need to use techniques that are at the forefront of both the mathematical
and computer sciences.
Computers have a limited number of digits to store a number. If a number contains too
many digits, we must shorten it to fit into the available space. We usually accomplish this
by rounding (or truncating) the number. From a numerical computing perspective, this is
crucial because it indicates that errors can be present from the moment data is entered. As
the calculation proceeds, further errors are introduced and can propagate, sometimes with
disastrous consequences. For example, on a hand calculator that stores ten decimal places,
the result of the multiplication

0.234567 × 0.9876543 = 0.2316711061881

would either be rounded off to the nearest number with only ten decimal places, giving

0.2316711062

or chopped (rounded down, truncated), discarding the extra digits after the tenth, giving

0.2316711061.

Rounding changes the number by at most 0.5 × 10−10 while chopping can produce an error
(change) of up to 10−10 .
In this section, we will consider ways in which these errors propagate during elementary
operations, as well as the phenomenon of a catastrophic cancellation.
There are models designed to predict the behaviour of the error in a general setting. We
shall briefly discuss a popular model in this course. Our goal is to use the model to explain
the type of behaviour we see in a set of example model problems.
Be aware this is still a research area, and many other models are in use.

1.1 Computer Arithmetic and Roundoff Errors

Floating Point System

Every number processed on a computer must be represented with a finite number of digits.
Two possible forms of representation are fixed point and floating point. In a fixed point
system, all numbers have a fixed number of decimal places. For example

0.011 11.000 1100.000.

3
In a floating point system, numbers are represented in a scientific notation where there
are a fixed number of significant digits. The floating point representations of the numbers
given above are
1.10 × 10−2 1.10 × 101 1.10 × 103
if the number of significant digits is 3 here.
Machine Numbers
While we work in the decimal system, the computer operates with a binary system. The
1
storage of even some simple numbers, such as 10 , in the binary representation requires round-
ing or chopping to be represented by a fixed number of digits. A computer can store only a
finite set of numbers exactly. They are called machine numbers.
Floating Point Arithmetic
It is most common on computers to use floating point arithmetic, which means that
numbers are essentially stored in scientific notation.

Definition 1.1 (Floating Point System). The floating point system is characterised by:

• β, the base or radix of the system

• t the precision of the system (number of significant figures)

• [L, U ] the exponent range

A number in the floating point system has the form

b1 bt−1
±1 b0 + + · · · + t−1 β e
β β

where 0 ≤ bi < β for i = 0, ..., t − 1 and L ≤ e ≤ U .

The fractional part or the mantissa is the number

b1 bt−1
m = b0 + + · · · + t−1 .
β β
It follows that 1 ≤ m < β.
The number e is called the exponent and the ±1 is the sign of the number.
The precision t is related to the number of digits, while L and U determine the smallest
and largest possible number that may be stored. See the discussion below.
Observe we can rewrite the floating point number as

±1 b0 β t−1 + b1 β t−2 + · · · + bt−1 β e−t+1 ,

(1)

with the advantage being that the term inside the bracket is now an integer.
Normalised Numbers

Definition 1.2 (Normalised Numbers). To ensure uniqueness of representation we can specify

that for all non-zero numbers the first digit b0 must be non-zero. The system is then said to
be normalised.

4
Figure 1: The floating point numbers are not evenly spaced

For example, 0.03 = 0.03 × 100 = 0.30 × 10−1 = 3.00 × 10−2 = 30.0 × 10−3 = · · · . In a
normalised system 0.03 will be stored as 3.00 × 10−2 .
For systems using base 2 the leading digit for a normalised number will always be 1 (and
therefore does not need to be explicitly stored).
In a normalised system, the smallest positive floating point number is β L . The largest
floating point number is
t−1 t−1 t−1
!
X X X
(β − 1) β −i β U = β U −i+1 − β U −i
i=0 i=0 i=0
t−1
X Xt
= β U +1−i − β U +1−i
i=0 i=1
U +1 U +1−t
=β −β .

Floating Point Number Spacing

If we fix the exponent and vary the mantissa, the numbers are equally spaced. However,
as we increase the exponent, the spacing between numbers increases. See Figure 1 which is
taken from [Link]
rs/?s_tid=blogs_rc_1 (where emin = L, emax = U ) The gui that was used to generate the
above Figure is available at [Link]
37976-numerical-computing-with-matlab
IEEE Arithmetic
Should a computer use rounding or chopping? Should the numbers be normalised? What
happens when you convert an integer to a real number? How many bits should be used to
store a single precision number? What results should calculations like 0/0 return?
The IEEE Standard for Binary Floating-Point Arithmetic addresses such questions and
is widely used. Incorporating such standards means that a piece of code will return the same
result independent of what machine is used (assuming that we are considering single-processor
machines).
Most computers use the same standard floating point system, first defined by the IEEE
(Institute of Electrical and Electronics Engineers) in 1985 and last updated in 2019.
The most common format in use today is IEEE double precision, but other standards are
defined too:

5
Table 1: An example double precision number

0 10000000011 1011100100010000000000000000000000000000000000000000

IEEE format β t L U
Half precision 2 11 −14 15
Single precision 2 24 −126 127
Double precision 2 53 −1022 1023
Quadruple precision 2 113 −16382 16383

One double precision number takes 64 bits/binary digits (8 bytes) of memory:

• 1 bit for the sign (0 for +, 1 for −)

• 11 bits for the exponent e (save the binary representation of e + U ∈ {1, . . . , 2046})

• 52 bits for the mantissa (t = 53 by not explicitly storing the normalised b0 = 1)

As an example, consider the double precision number (sign-exponent-mantissa) given in

Table 1.
This represents:

• Sign 0 is +.

• Exponent 10000000011 is e + U = 210 + 21 + 20 = 1027, so e = 4 (since U = 1023).

• Mantissa 1011100100010000000000000000000000000000000000000000. Consequently,

the mantissa represents b0 = 1 (normalised) plus b1 =b3 =b4 =b5 =b8 =b12 =1, all other
bi = 0.

Hence it represents the (decimal) number

+ 1 + 2−1 + 2−3 + 2−4 + 2−5 + 2−8 + 2−12 × 24 = 27.56640625 (exactly).

Note that the exponent bits can never be all zeros or all ones (i.e. e + U ̸= 0, 2047).
The double numeric type values in Matlab are stored using IEEE-754 format for double
precision. This consists of

• 52+1 bits describing a fractional part of the form 1.b1 b2 ...b52 where each of the b1 , b2
etc. is a bit [0 or 1]. Since the first bit is always 1, it does not need to be stored.

• 11 bits describing an exponent e giving a factor 2e . Of the 2048 possible values, the
ones actually used are -1022 to 1023, with the base of the powers being 2.

• one bit for the sign of the fractional part.

With β = 2, the largest possible number is 21024 − 2(1024−53) giving approximately 1.80 ×
10308 .See [Link] for more detail.
The IEEE standard also defines some special ‘numbers’, such as ±0 and ±∞. See Table
2.

6
Table 2: Some special cases represented in IEEE format

0 00000000000 0000000000000000000000000000000000000000000000000000 = +0.0

1 00000000000 0000000000000000000000000000000000000000000000000000 = -0.0
0 11111111111 0000000000000000000000000000000000000000000000000000 = +inf
1 11111111111 0000000000000000000000000000000000000000000000000000 = -inf
0 11111111111 ************* any 52 bits not all zero ************* = NaN

NaN means not a number, and is a result of performing undefined operations such
as 0/0, or arithmetic involving existing NaN values, such as NaN + 3 = NaN. Note the
convention that NaN = NaN and NaN > NaN and NaN < NaN are all false (this can cause
problems).
The value −0 acts essentially the same way as 0 for arithmetic, and has the convention
that +0 = −0 is true. It comes about from calculations like (−1) × (+0) or from rounding
very small negative numbers to zero.
The values ±∞ act as you might expect: ∞ − 1 = ∞ and 3 > −∞, but 0 × ∞ = NaN.
Overflow and underflow

• If the result of a computation is larger than that allowed by the computer, you have an
overflow.
• When a result is smaller than the smallest positive number, then it either becomes zero
or is considered inaccurate because it has less than the normal 52 significant bits. This
loss of accuracy is called underflow.
• Underflow is a more subtle problem than overflow, but still important, and some ma-
chines will stop with an error message (or at least print a warning message) when it
occurs.

Matlab interprets overflow as ‘infinity’ which is printed as inf or −inf according to its
sign, and continues computing. Other operations with no well-defined answer, like 0.0/0.0,
give the result NaN. Note that evaluating, say, sqrt(−1.0) will result in a complex number.
However in other languages such as Python, sqrt(−1.0) will return NaN.
Machine Epsilon

Definition 1.3 (Machine Epsilon). The smallest positive machine number εM such that
1 + εM ̸= 1 is called machine epsilon.

In our floating point model, 1 = 1 × β 0 . So the next biggest number is 1 + β 1−t β 0 .

Hence εM = β 1−t .
For normalised IEEE 64 bit floating point numbers εM = 2−52 ≃ 2.2204×10−16 . So this is
roughly the maximum relative error introduced when you round or chop a number. Another
way of describing this is that numbers have 52 significant bits, or slightly better than fifteen
significant decimal digits.
For traditional a 32-bit machines εM = 2−23 ≃ 1.1921 × 10−7 . That is, numbers have 23
significant bits or close to seven significant decimal digits.

7
1.2 Measuring Errors

Rounding and Truncation

For example, with nine decimal places in the fraction,

231.6711061881 = 2.316711061881 × 102

is rounded to
2.316711062 × 102 = 231.6711062
or chopped to
2.316711061 × 102 = 231.6711061.
This is rounding to ten significant figures.
The actual error in the numerical example above is 1 × 10−7 . Is that a large or small
error? It depends on the size of the numbers being calculated, which leads us to consider the
difference between absolute and relative errors.

Definition 1.4 (Absolute and relative errors). If x̃ is an approximation of x, the error in

this approximation is
x − x̃,
the absolute error is
|x − x̃|,
and the relative error is
x − x̃
.
x

The relative error provides a better understanding of the significance of the error. As seen
above, rounding or chopping to ten significant decimal places introduces a relative error of
about 10−10 regardless of the size of the number.

1.3 Accumulation of Errors

Floating Point Model

The rounding errors of about one part in 1016 are rarely a problem in themselves. The
combined effects over the many calculations in an algorithm can be far worse. ([Link]
[Link]/~arnold/disasters/[Link])

Definition 1.5 (Floating point value). The floating point value fl(x) is the result of the
rounding (or chopping) done on x. fl(x) is the floating point number representation of x as
given in Definition 1.1.

Recall the number of digits in the mantissa is the number of significant digits. The
rounding or chopping is done on the fractional part.

8
Write x ∈ R as
∞
!
X bi
x= βe
βi
i=0
∞
!
X bi
= β e−t+1
β i−t+1
i=0
∞
!
X
= bi β (t−1)−i β e−t+1
i=0
e−t+1
= µβ ,

where β t−1 ≤ µ. We have assumed a normalised system where 0 < b0 .

Consider
X∞ t−1
X ∞
X
(t−1)−i (t−1)−i
µ= bi β = bi β + bi β (t−1)−i .
i=0 i=0 i=t

Definition 1.6 (Floating point model). Any system of floating point numbers will have an
upper limit on rounding ε, so that

fl(x) = x(1 + δ), |δ| ≤ ε

for numbers in the allowable range of absolute values.

When using rounding, which is the case of IEEE, ε = β 1−t /2. Note that |δ| ≤ 2−24 in single
precision and |δ| ≤ 2−53 in double precision (if rounding).
Machines may have slight differences, but we will assume that whenever two machine
numbers are combined through some arithmetical procedure, they are first combined, then
normalised, rounded off, and finally stored in memory.

9
Absolute Error - Single
0.0040
0.0035
0.0030
0.0025
0.0020
0.0015
0.0010
0.0005
0.0000
0 500 1000 1500 2000 2500 3000

Figure 2: Absolute error |x − fl(x)| when using single precision

We will define the floating point rounding function “fl” whose value is the result of the
rounding done on its argument by the machine in use, and then consider the accumulative error
through the examination of the effects of addition, subtraction, multiplication and division.
Thus, the result that you would expect for multiplication of two numbers x and y that can
be represented exactly in the machine’s floating point system is fl(x × y). If, alternatively,
the numbers themselves have to be rounded when input, the result is fl(fl(x) × fl(y)).
The floating point model in Definition 1.6 gives a uniform bound on the fractional change
in a number when it is rounded to a normalised value.
Numerical Experiment - Single Precision
As an experiment, I randomly generated 3000 real numbers x such that x ∈ [−100000, 100000].
I then rounded them to single precision and plotted |fl(x) − x|. The result is in Figure 2.
(Actually, I cheated a bit. Since we can’t store real numbers on a computer, I randomly
generated 3000 quadruple precision numbers, and used those as the ‘real’ numbers.)
Let us now consider the relative |fl(x) − x|/|x| as shown in Figure 3. Notice that the
relative error is ≤ 2−24 ≈ 5.96 × 10−8 .
Multiplication

Example 1.1 (Floating point model - multiplication). Use the floating point model given in
Definition 1.6 to bound the error when multiplying two floating point numbers.

Using the bound ε, consider the total rounding error in the multiplication:

fl(fl(x) × fl(y)) = fl(x(1 + δ1 ) × y(1 + δ2 ))

= [x(1 + δ1 )y(1 + δ2 )](1 + δ3 ),

where |δi | ≤ ε.
Thus
fl(fl(x) × fl(y)) = (xy)(1 + δ4 )

10
1e 8 Relative Error - Single
6

0
0 500 1000 1500 2000 2500 3000

Figure 3: Relative error |x − fl(x)|/|x| when using single precision

where
(1 + δ4 ) = (1 + δ1 )(1 + δ2 )(1 + δ3 ).
The extreme values of (1 + δ4 ) give

(1 − ε)3 = 1 − 3ε + O(ε2 ) ≤ 1 + δ4
≤ (1 + ε)3 = 1 + 3ε + O(ε2 ).

So considering the very small value of ε we have roughly that

|δ4 | ≤ 3ε

(more carefully, |δ4 | ≤ 3ε + 3ε2 + ε3 ).

So for multiplication the combination of three errors is roughly additive.
To continue with the numerical experiment, I randomly generated 3000 points x, y and
evaluated |fl(fl(x) × fl(y)) − x × y|/|x × y| using single precision. The results shown in
Figure 4 have relative error is ≤ 3 × 2−24 ≈ 1.79 × 10−7 .
Addition and Subtraction
The situation for addition and subtraction is different as seen by the following example.
Using ten significant decimal digits we have

x = 0.8888888888888888 − 0.8888888888444444
= 0.000000000444444,
x̃ = fl(0.8888888888888888) − fl(0.8888888888444444)
= 0.8888888889 − 0.8888888888
= 0.0000000001.

11
1e 7 Relative Error - Single
1.4
1.2
1.0
0.8
0.6
0.4
0.2
0.0
0 500 1000 1500 2000 2500 3000

Figure 4: Relative error |fl(fl(x) × fl(y)) − x × y|/|x × y| when using single precision

The relative error is

x − x̃
≈ 0.775.
x
But ε ≃ 10−10 and the relative error is ≈ 8 × 109 times larger. We have only one significant
figure of accuracy. This is catastrophic!
If instead we try adding the two numbers,

x = 0.8888888888888888 + 0.8888888888444444
= 1.7777777777333332,
x̃ = fl(0.8888888888888888) + fl(0.8888888888444444)
= 0.8888888889 + 0.8888888888
= 1.7777777777.

The relative error is

x + x̃
≈ 1.9 × 10−11 .
x

Example 1.2 (Floating point model - subtraction). Use the floating point model given in
Definition 1.6 to get a more precise understanding of why subtraction may be problematic with
floating point numbers.

Let’s calculate the roundoff error when exactly subtracting two, different, numbers:

fl(x) − fl(y) = x(1 + δ1 ) − y(1 + δ2 )

xδ1 − yδ2
= (x − y) 1 +
x−y
= (x − y)(1 + δ3 )

12
Relative Error - Single
10 5

10 6

10 7

10 8

10 9

10 10

10 11

0 500 1000 1500 2000 2500 3000

Figure 5: Relative error |fl(fl(x) − fl(y)) − (x − y)|/|x − y| when using single precision

where
max(|x|, |y|)
|δ3 | ≈ 2ε .
|x − y|
This can be very large if x−y is very small relative to x and y. Again, I randomly generated
3000 real numbers x, y ∈ [0, 100000] and evaluated |fl(fl(x) + fl(y)) − (x − y)|/|x − y| using
single precision. The results shown in Figure 5 have relative error is ≫ 2−24 ≈ 5.96 × 10−8 .
Beware of subtracting two similar numbers.
For example, recall the finite different formula
f (x + h) − f (x)
f ′ (x) ≈ . (2)
h
In exact arithmetic, we expect the error to decrease as h → 0. However, f (x + h) ≈ f (x)
for h small. We saw in the chapter on numerical differentiation that the large roundoff errors
have a major impact on the accuracy of the solution. See Figure 6 for a reminder.

Example 1.3 (Floating point model - addition). Use the floating point model given in Def-
inition 1.6 to show that if we add two numbers with the same sign, the relative errors are
well-behaved.
If x, y > 0
fl(fl(x) + fl(y)) = (x(1 + δ1 ) + y(1 + δ2 ))(1 + δ3 )
= x + y + x (δ1 + δ3 + δ1 δ3 ) + y(δ2 + δ3 + δ2 δ3 )
= x + y + δ4 ,
where
|δ4 | = |x (δ1 + δ3 + δ1 δ3 ) + y(δ2 + δ3 + δ2 δ3 )|
≤ max{|(δ1 + δ3 + δ1 δ3 )| , |(δ2 + δ3 + δ2 δ3 )|}|x + y|
≤ (2ε + ε2 )|x + y|.

13
100
Truncation Error
O(h)
10−2 O(h^2)

10−4
Error

10−6

10−8

10−10
100 10−2 10−4 10−6 10−8
h

Figure 6: Error in approximating f ′ (1) for f (x) = x sin x using forward differences.

So
fl(fl(x) + fl(y)) = (x + y)(1 + δ4 ), |δ4 | ≤ 2ε + ε2 ≃ 2ε.
This is consistent with the results of the numerical experiment shown in Figure 7.
Multiplication of Complex Numbers
Before looking at the next example, note that if r, s ∈ R

2(r2 + s2 ) − (r + s)2 = (r − s)2 ≥ 0. (3)

Consider x = a + ib and y = c + id ∈ C. Suppose that the multiplication of two complex

numbers is given by
xy = (ac − bd) + i(ad + bc).
Let us suppose that a, b, c and d are floating point numbers as specified in Definition 1.1.
That is a = fl(a), b = fl(b), c = fl(c) and d = fl(d). Now, as an additional example, let us
apply the floating point model to complex multiplication.

Example 1.4 (Floating point model - complex multiplication). Use the floating point model
given in Definition 1.6 to bound the error when multiplying two complex numbers.

According to the model,

fl(xy) = (ac(1 + δ1 ) − bd(1 + δ2 )) (1 + δ3 ) + i ((ad(1 + δ4 ) + bc(1 + δ5 )) (1 + δ6 )

= ac(1 + δ1 )(1 + δ3 ) − bd(1 + δ2 )(1 + δ3 ) + i ((ad(1 + δ4 )(1 + δ6 ) + bc(1 + δ5 )(1 + δ6 ))
= (ac − bd) + i(ad + bc) + e,

where |δi | ≤ ε for 1 ≤ i ≤ 6 and

e = ac (δ1 + δ3 + δ1 δ3 ) − bd (δ2 + δ3 + δ2 δ3 ) + i (ad (δ4 + δ6 + δ4 δ6 ) + bc (δ5 + δ6 + δ5 δ6 )) .

14
1e 7 Relative Error - Single
1.0

0.8

0.6

0.4

0.2

0.0
0 500 1000 1500 2000 2500 3000

Figure 7: Relative error |fl(fl(x) + fl(y)) − (x + y)|/|x + y| when using single precision

Now

|e|2 = (ac (δ1 + δ3 + δ1 δ3 ) − bd (δ2 + δ3 + δ2 δ3 ))2 + (ad (δ4 + δ6 + δ4 δ6 ) + bc (δ5 + δ6 + δ5 δ6 ))2

≤ (|ac| |δ1 + δ3 + δ1 δ3 | + |bd| |δ2 + δ3 + δ2 δ3 |)2 + (|ad| |δ4 + δ6 + δ4 δ6 | + |bc| |δ5 + δ6 + δ5 δ6 |)2

≤ γ 2 (|ac| + |bd|)2 + (|ad| + |bc|)2 ,

where

γ = max{|δ1 + δ3 + δ1 δ3 | , |δ2 + δ3 + δ2 δ3 | , |δ4 + δ6 + δ4 δ6 | , |δ5 + δ6 + δ5 δ6 |} ≤ 2ϵ + O(ϵ2 ).

From Equation (3),

|e|2 ≤ 2γ 2 |ac|2 + |bd|2 + |ad|2 + |bc|2

= 2γ 2 a2 + b2 c2 + d2

= 2γ 2 |x|2 |y|2 .

Finally
fl(xy) = xy(1 + e)
√ √
where |e| ≤ 2γ ≈ 2 2ϵ.

1.4 Catastrophic Cancellation

Cancellation Error
In any calculation there is the danger of a catastrophic calculation. Suppose two numbers
agree (are equal) in all but the last digit, then their difference x − y can have only one
significant digit of accuracy, and this without any roundoff error. Future calculations will
then only have at most one significant digit. Such routines in extreme cases can become

15
x ex(x) ex
1 2.718281828459046e+00 2.7182818284590455e+00
10 2.202646579480671e+04 2.2026465794806718e+04
20 4.851651954097905e+08 4.8516519540979028e+08
40 2.353852668370201e+17 2.3538526683701997e+17
-1 3.678794411714424e-01 3.6787944117144233e-01
-10 4.539992967040021e-05 4.5399929762484847e-05
-20 6.147561828914626e-09 2.0611536224385579e-09
-40 3.116951588217358e-01 4.2483542552915889e-18

Table 3: Accuracy of the Matlab ex function

numerically unstable and lead to completely incorrect solutions. Such calculations are known
as catastrophic cancellation.
This potential problem is often difficult to foresee, although it is sometimes possible to
avoid. In general, if x and y are represented with n significant digits, and agree to one or
more digits, then their difference will not be accurate to the full n digits.
Catastrophic Cancellation Example
Let us consider another situation in which this problem arises.

Example 1.5 (Catastrophic cancellation). Calculate ex using the formula

∞
x
X xn
e = .
n!
n=0

Consider the following Matlab program which can be used to approximate ex .

function ex_approx = ex_Taylor(x)

% Program to calculate exp(x) using naive Taylor series method

ex_approx = 1;
term = 1;
n=0;
while ( ex_approx+term ˜= ex_approx )
n = n + 1;
term = term *(x/n);
ex_approx = ex_approx + term;
end

Note that we stop the calculation when the addition of a new term does not make a difference
to the sum.
Table 3 compares the results from this program with the exact value of ex . When x > 0
the results are fine. However, when x < 0 the results become progressively worse as x → −∞.
What is happening? For positive x we have a sum of positive terms. For negative x we
have a sum of positive and negative terms.
Table 4 displays the terms in the series e−25 . It is easy to see that in a situation like this,
catastrophic cancellation can and does occur.

16
n n-th term in Taylor’s series
0 1
1 -24
2 2.8850 × 102
..
.
12 8.3320 × 107
13 −1.5598 × 108
14 2.7134 × 108
15 −4.4086 × 108
..
.
85 −8.1807 × 10−7
86 8.1814 × 10−7
87 −8.1812 × 10−7
88 8.1813 × 10−7

Table 4: Terms of the series approximation of e−25 .

In this case we can avoid the situation in which catastrophic cancellation will occur by
using the formula
1
ex = −x
e
for x < 0.
Specifically we the code given in Listing 1

Listing 1: Modification of code to avoid cancellation

function ex_approx = ex_new(x)

% Program to calculate exp(x) avoiding cancellation errors

if x > 0
ex_approx = ex_Taylor(x);
else
ex_approx = 1.0/ex_Taylor(−x);
end

As can be seen in Table 5, the results obtained with the new algorithm are quite satisfac-
tory.
A particularly interesting example given by Rump [2] of catastrophic cancellation is given
by the expression
a
y = 333.75b6 + a2 (11a2 b2 − b6 − 121b4 − 2) + 5.5b8 +
2b
with a = 77617 and b = 33096. The exact value of this expression is
−54767
≈ −.8273960599468.
66192

17
x new ex(x) ex
1 2.718281828459045e+00 2.7182818284590455e+00
10 2.202646579480671e+04 2.2026465794806718e+04
20 4.851651954097905e+08 4.8516519540979028e+08
40 2.353852668370201e+17 2.3538526683701997e+17
-1 3.678794411714423e-01 3.6787944117144233e-01
-10 4.539992976248485e-05 4.5399929762484847e-05
-20 2.061153622438557e-09 2.0611536224385579e-09
-40 4.248354255291587e-18 4.2483542552915889e-18

Table 5: Accuracy of new Matlab new_ex

Now what do we get if we use double precision floating point arithmetic? All the constants
in the expression can be represented exactly in double precision, so we might expect an
accurate result. But using Matlab we obtain −1.1805916207174D+21!
If we were to do this calculation in extended precision we would still obtain essentially
the same result. It turns out that you need at least 37 decimal digits of accuracy to obtain a
reasonable result.
An explanation of the behaviour of this function can be found in the paper by Cuyt et. al. [1].
Error in Continuous Processes
Numerical schemes to solve differential equations will be dealt with later in the course,
but Eulers Method is one which already may have heard. Errors which occur in most of these
computational problems are a combination of roundoff errors due to the finite nature of the
floating point system, errors of discretisation, convergence errors and errors due to the nature
of the numerical scheme chosen. Of course, we could use various tools in Matlab to do our
calculation to any desired precision, but then the speed of our calculations will be slow. So
we usually make the compromise of using floating point systems which allow fast calculations,
but which maintain only a limited number of significant figures.
As we have seen in the previous discussions, calculations with many operations may be
swamped by roundoff errors; one with too few, may lack sufficient accuracy. For instance,
when we use the fourth order Runge-Kutta method (see the ODE Chapter) with a time-step
of size h to approximate the solution of an ODE at time T , we will use O(1/h) operations,
which will typically produce a roundoff error of size O(ε/h). As shown in Figure 8, the total
error is a combination of discretisation error and roundoff error.
For high efficiency there needs to be a balance / tradeoff / compromise between the errors
and required accuracy in context and thus for each problem, the most efficient method may
be different.

18
−6

−8

−10

−12
log(error)

−14

−16

−18

−20

−22
0 2 4 6 8 10 12 14 16
n

Figure 8: Typical errors using 4th Order Runge Kutta method, showing discretisation error
dominating for large h and roundoff error dominating for smaller h.

19
References
[1] A. Cuyt, B. Verdonk, S. Becuwe, and P. Kuterna. A remarkable example of catastrophic
cancellation unraveled. Computing, 66:309–320, 2001. [Link]
ticle/10.1007/s006070170028.

[2] S. M. Rump. Algorithms for verifying inclusions - theory and practice. In R. E. Moore,
editor, Reliability in Computing, pages 109–126. Academic Press, New York, 1988.

Understanding Floating-point Arithmetic
No ratings yet
Understanding Floating-point Arithmetic
30 pages
Understanding Floating Point Arithmetic
No ratings yet
Understanding Floating Point Arithmetic
11 pages
Numerical Error Analysis in Computation
No ratings yet
Numerical Error Analysis in Computation
9 pages
Floating Point Representation Explained
No ratings yet
Floating Point Representation Explained
30 pages
Numerical Analysis Lecture Notes
No ratings yet
Numerical Analysis Lecture Notes
6 pages
Numerical Methods and Error Analysis
No ratings yet
Numerical Methods and Error Analysis
24 pages
Computer Architecture: Arithmetic Operations
No ratings yet
Computer Architecture: Arithmetic Operations
48 pages
Understanding Approximation Errors in Engineering
No ratings yet
Understanding Approximation Errors in Engineering
27 pages
Floating Point Arithmetic and Errors
No ratings yet
Floating Point Arithmetic and Errors
192 pages
Floating Point Arithmetic & Error Analysis
No ratings yet
Floating Point Arithmetic & Error Analysis
19 pages
Roundoff Errors in Floating-Point Arithmetic
No ratings yet
Roundoff Errors in Floating-Point Arithmetic
13 pages
Understanding Round-Off Errors
No ratings yet
Understanding Round-Off Errors
46 pages
Computer Arithmetic Fundamentals
No ratings yet
Computer Arithmetic Fundamentals
155 pages
Numerical Analysis Lecture Notes
No ratings yet
Numerical Analysis Lecture Notes
68 pages
Round-off and Truncation Errors Explained
No ratings yet
Round-off and Truncation Errors Explained
26 pages
From Durham's Course Work
No ratings yet
From Durham's Course Work
5 pages
Data Representation in Computing
No ratings yet
Data Representation in Computing
28 pages
Basic Computer Arithmetic
No ratings yet
Basic Computer Arithmetic
21 pages
MATH 2160 Numerical Analysis Notes
No ratings yet
MATH 2160 Numerical Analysis Notes
111 pages
Floating Point Arithmetic and Errors
No ratings yet
Floating Point Arithmetic and Errors
158 pages
CHAP 03e
No ratings yet
CHAP 03e
32 pages
Machine Arithmetic: Fixed vs Floating Point
No ratings yet
Machine Arithmetic: Fixed vs Floating Point
26 pages
IEEE Standard for Floating-Point Arithmetic
No ratings yet
IEEE Standard for Floating-Point Arithmetic
3 pages
Floating-Point Arithmetic Basics
No ratings yet
Floating-Point Arithmetic Basics
14 pages
Understanding Errors in Numerical Methods
No ratings yet
Understanding Errors in Numerical Methods
10 pages
Finite Precision Arithmetic Explained
No ratings yet
Finite Precision Arithmetic Explained
72 pages
Floating Point Number Representation
No ratings yet
Floating Point Number Representation
11 pages
Computer Arithmetic and Error Analysis
No ratings yet
Computer Arithmetic and Error Analysis
18 pages
Floating-Point Number Representation
No ratings yet
Floating-Point Number Representation
43 pages
CIT 335: Computational Science Overview
No ratings yet
CIT 335: Computational Science Overview
10 pages
Approximation and Errors in Computation
No ratings yet
Approximation and Errors in Computation
24 pages
Lecture 3
No ratings yet
Lecture 3
12 pages
Computer Arithmetic Techniques Explained
No ratings yet
Computer Arithmetic Techniques Explained
67 pages
Understanding Floating Point Arithmetic
No ratings yet
Understanding Floating Point Arithmetic
36 pages
Understanding Numerical Errors and Precision
No ratings yet
Understanding Numerical Errors and Precision
15 pages
Central Processing Unit Overview
No ratings yet
Central Processing Unit Overview
78 pages
Introduction to Computer Number Systems
No ratings yet
Introduction to Computer Number Systems
41 pages
Splines
No ratings yet
Splines
16 pages
Numerical Analysis Course Overview
No ratings yet
Numerical Analysis Course Overview
43 pages
Floating Point Number Systems Overview
No ratings yet
Floating Point Number Systems Overview
46 pages
ESA/390 Floating Point Enhancements Overview
No ratings yet
ESA/390 Floating Point Enhancements Overview
135 pages
Design and Implementation of Fast Floating Point Multiplier Unit
No ratings yet
Design and Implementation of Fast Floating Point Multiplier Unit
5 pages
Numerical Analysis Lecture 1
No ratings yet
Numerical Analysis Lecture 1
9 pages
Classical and Modern Numerical Analysis Theory Methods and Practice 1St Edition Azmy S. Ackleh
No ratings yet
Classical and Modern Numerical Analysis Theory Methods and Practice 1St Edition Azmy S. Ackleh
110 pages
Understanding Number Representation Errors
No ratings yet
Understanding Number Representation Errors
65 pages
Numerical Analysis and Binary Systems
No ratings yet
Numerical Analysis and Binary Systems
6 pages
Decimal Representation of 1/750
No ratings yet
Decimal Representation of 1/750
30 pages
CIT 335: Computational Science Guide
No ratings yet
CIT 335: Computational Science Guide
102 pages
Notes On Computational and Numerical Analysis
No ratings yet
Notes On Computational and Numerical Analysis
10 pages
Cit335 Nounmedia Summary
100% (1)
Cit335 Nounmedia Summary
46 pages
Floating Point Arithmetic and Errors
No ratings yet
Floating Point Arithmetic and Errors
15 pages
Integer Division and Number Representation
No ratings yet
Integer Division and Number Representation
16 pages
Subtraction Roundoff Errors Explained
No ratings yet
Subtraction Roundoff Errors Explained
22 pages
Understanding Floating Point Numbers
No ratings yet
Understanding Floating Point Numbers
4 pages
Floating-Point Representation in Computing
No ratings yet
Floating-Point Representation in Computing
6 pages
Errors in Floating-Point Arithmetic
No ratings yet
Errors in Floating-Point Arithmetic
22 pages
Computer Arithmetic Overview
No ratings yet
Computer Arithmetic Overview
14 pages
Machine Arithmetic Fundamentals
No ratings yet
Machine Arithmetic Fundamentals
20 pages
Music Assessment for Elementary Students
No ratings yet
Music Assessment for Elementary Students
3 pages
Red Hat System Administration I Overview
No ratings yet
Red Hat System Administration I Overview
6 pages
Flexural Design Examples for Beams
No ratings yet
Flexural Design Examples for Beams
53 pages
Data Models: Structured vs Unstructured
No ratings yet
Data Models: Structured vs Unstructured
43 pages
Understanding Monochrome vs. B&W Photos
No ratings yet
Understanding Monochrome vs. B&W Photos
2 pages
Fortrea Q3 2024 Earnings Overview
No ratings yet
Fortrea Q3 2024 Earnings Overview
11 pages
Laval-Montreal French Test Guide
No ratings yet
Laval-Montreal French Test Guide
2 pages
Fiber Laser Control Board Manual
No ratings yet
Fiber Laser Control Board Manual
19 pages
JAFRA Fall/Winter 2017 Product Catalog
No ratings yet
JAFRA Fall/Winter 2017 Product Catalog
49 pages
Lear 35 Specifications and Limits Guide
No ratings yet
Lear 35 Specifications and Limits Guide
240 pages
Sort Disk Usage in Linux by Size
No ratings yet
Sort Disk Usage in Linux by Size
4 pages
Cost-Benefit Analysis for Decision Making
No ratings yet
Cost-Benefit Analysis for Decision Making
38 pages
IF5 Molecular Geometry and Polarity
No ratings yet
IF5 Molecular Geometry and Polarity
46 pages
Tax 2 - Midterm Reviewer
No ratings yet
Tax 2 - Midterm Reviewer
13 pages
Electrical Substation Protection Equipment
No ratings yet
Electrical Substation Protection Equipment
50 pages
Understanding Transitive & Intransitive Verbs
No ratings yet
Understanding Transitive & Intransitive Verbs
3 pages
Case Summary: Gavin Bosarge
No ratings yet
Case Summary: Gavin Bosarge
38 pages
One-Dimensional Motion Practice Problems
100% (2)
One-Dimensional Motion Practice Problems
4 pages
Mikey and Takemichi: Omegaverse Love
No ratings yet
Mikey and Takemichi: Omegaverse Love
8 pages
Berg Balance Scale Assessment Guide
No ratings yet
Berg Balance Scale Assessment Guide
6 pages
Mustan 05-09 Automatic
No ratings yet
Mustan 05-09 Automatic
6 pages
MAPEH 8 Lesson Plan: Health Fitness
No ratings yet
MAPEH 8 Lesson Plan: Health Fitness
4 pages
Particulate Nature of Matter for Grade 8
0% (1)
Particulate Nature of Matter for Grade 8
18 pages
Immediate Core Buildup in Endodontics
No ratings yet
Immediate Core Buildup in Endodontics
9 pages
Medical Microbiology Overview and Concepts
No ratings yet
Medical Microbiology Overview and Concepts
19 pages
High Cycle Fatigue Analysis Techniques
No ratings yet
High Cycle Fatigue Analysis Techniques
4 pages
Financial Management MCQs and Answers
100% (1)
Financial Management MCQs and Answers
22 pages
Personal Accident Insurance Policy Details
No ratings yet
Personal Accident Insurance Policy Details
4 pages
Vasantdada Patil College Roll Numbers
No ratings yet
Vasantdada Patil College Roll Numbers
128 pages
Nokia Small Cell Solutions Overview
100% (1)
Nokia Small Cell Solutions Overview
32 pages

Floating Point Numbers and Errors

Uploaded by

Floating Point Numbers and Errors

Uploaded by

MATH3511

Assoc Prof Linda Stals

0.234567 × 0.9876543 = 0.2316711061881

1.1 Computer Arithmetic and Roundoff Errors

Floating Point System

0.011 11.000 1100.000.

• β, the base or radix of the system

• t the precision of the system (number of significant figures)

• [L, U ] the exponent range

A number in the floating point system has the form

where 0 ≤ bi < β for i = 0, ..., t − 1 and L ≤ e ≤ U .

The fractional part or the mantissa is the number

±1 b0 β t−1 + b1 β t−2 + · · · + bt−1 β e−t+1 ,

Definition 1.2 (Normalised Numbers). To ensure uniqueness of representation we can specify

Floating Point Number Spacing

One double precision number takes 64 bits/binary digits (8 bytes) of memory:

• 52 bits for the mantissa (t = 53 by not explicitly storing the normalised b0 = 1)

As an example, consider the double precision number (sign-exponent-mantissa) given in

• Exponent 10000000011 is e + U = 210 + 21 + 20 = 1027, so e = 4 (since U = 1023).

• Mantissa 1011100100010000000000000000000000000000000000000000. Consequently,

Hence it represents the (decimal) number

+ 1 + 2−1 + 2−3 + 2−4 + 2−5 + 2−8 + 2−12 × 24 = 27.56640625 (exactly).

• one bit for the sign of the fractional part.

0 00000000000 0000000000000000000000000000000000000000000000000000 = +0.0

In our floating point model, 1 = 1 × β 0 . So the next biggest number is 1 + β 1−t β 0 .

Rounding and Truncation

231.6711061881 = 2.316711061881 × 102

Definition 1.4 (Absolute and relative errors). If x̃ is an approximation of x, the error in

1.3 Accumulation of Errors

Floating Point Model

where β t−1 ≤ µ. We have assumed a normalised system where 0 < b0 .

fl(x) = x(1 + δ), |δ| ≤ ε

for numbers in the allowable range of absolute values.

Figure 2: Absolute error |x − fl(x)| when using single precision

fl(fl(x) × fl(y)) = fl(x(1 + δ1 ) × y(1 + δ2 ))

Figure 3: Relative error |x − fl(x)|/|x| when using single precision

So considering the very small value of ε we have roughly that

(more carefully, |δ4 | ≤ 3ε + 3ε2 + ε3 ).

The relative error is

The relative error is

fl(x) − fl(y) = x(1 + δ1 ) − y(1 + δ2 )

0 500 1000 1500 2000 2500 3000

2(r2 + s2 ) − (r + s)2 = (r − s)2 ≥ 0. (3)

Consider x = a + ib and y = c + id ∈ C. Suppose that the multiplication of two complex

According to the model,

fl(xy) = (ac(1 + δ1 ) − bd(1 + δ2 )) (1 + δ3 ) + i ((ad(1 + δ4 ) + bc(1 + δ5 )) (1 + δ6 )

where |δi | ≤ ε for 1 ≤ i ≤ 6 and

e = ac (δ1 + δ3 + δ1 δ3 ) − bd (δ2 + δ3 + δ2 δ3 ) + i (ad (δ4 + δ6 + δ4 δ6 ) + bc (δ5 + δ6 + δ5 δ6 )) .

|e|2 = (ac (δ1 + δ3 + δ1 δ3 ) − bd (δ2 + δ3 + δ2 δ3 ))2 + (ad (δ4 + δ6 + δ4 δ6 ) + bc (δ5 + δ6 + δ5 δ6 ))2

γ = max{|δ1 + δ3 + δ1 δ3 | , |δ2 + δ3 + δ2 δ3 | , |δ4 + δ6 + δ4 δ6 | , |δ5 + δ6 + δ5 δ6 |} ≤ 2ϵ + O(ϵ2 ).

From Equation (3),

|e|2 ≤ 2γ 2 |ac|2 + |bd|2 + |ad|2 + |bc|2

1.4 Catastrophic Cancellation

Table 3: Accuracy of the Matlab ex function

Example 1.5 (Catastrophic cancellation). Calculate ex using the formula

Consider the following Matlab program which can be used to approximate ex .

function ex_approx = ex_Taylor(x)

% Program to calculate exp(x) using naive Taylor series method

Table 4: Terms of the series approximation of e−25 .

Listing 1: Modification of code to avoid cancellation

% Program to calculate exp(x) avoiding cancellation errors

Table 5: Accuracy of new Matlab new_ex

You might also like