Floating Point Numbers and Errors
Semester 1, 2025
Contents
1 Floating Point Numbers 3
1.1 Computer Arithmetic and Roundoff Errors . . . . . . . . . . . . . . . . . . . 3
1.2 Measuring Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Accumulation of Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Catastrophic Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Definitions
1.1 Definition (Floating Point System) . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Definition (Normalised Numbers) . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Definition (Machine Epsilon) . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Definition (Absolute and relative errors) . . . . . . . . . . . . . . . . . . . . . 8
1.5 Definition (Floating point value) . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Definition (Floating point model) . . . . . . . . . . . . . . . . . . . . . . . . . 9
Examples
1.1 Example (Floating point model - multiplication) . . . . . . . . . . . . . . . . 10
1.2 Example (Floating point model - subtraction) . . . . . . . . . . . . . . . . . . 12
1.3 Example (Floating point model - addition) . . . . . . . . . . . . . . . . . . . 13
1.4 Example (Floating point model - complex multiplication) . . . . . . . . . . . 14
1.5 Example (Catastrophic cancellation) . . . . . . . . . . . . . . . . . . . . . . . 16
Algorithms
Matlab Code
1 Modification of code to avoid cancellation . . . . . . . . . . . . . . . . . . . . 17
1 Floating Point Numbers
Computer Numbers
A knowledge of computer hardware is important for efficient computation. The speed
of a computer is often measured by the number of floating point operations per second (flops). The
usual notation is: 10^6 flops = 1 Megaflop; 1000 Megaflops = 1 Gigaflop; 1000 Gigaflops
= 1 Teraflop; 1000 Teraflops = 1 Petaflop; 1000 Petaflops = 1 Exaflop.
Computer hardware is of primary importance when it comes to the efficiency of the
computations. Some machines (small PCs) may carry out thousands of operations per second, while
others (supercomputers) carry out billions of operations per second. To get the best performance
out of a machine, we need to use techniques that are at the forefront of both the mathematical
and computer sciences.
Computers have a limited number of digits to store a number. If a number contains too
many digits, we must shorten it to fit into the available space. We usually accomplish this
by rounding (or truncating) the number. From a numerical computing perspective, this is
crucial because it indicates that errors can be present from the moment data is entered. As
the calculation proceeds, further errors are introduced and can propagate, sometimes with
disastrous consequences. For example, on a hand calculator that stores ten decimal places,
the result of the multiplication
would either be rounded off to the nearest number with only ten decimal places, giving
0.2316711062
or chopped (rounded down, truncated), discarding the extra digits after the tenth, giving
0.2316711061.
Rounding changes the number by at most 0.5 × 10^{−10} while chopping can produce an error
(change) of up to 10^{−10}.
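Rounding versus chopping to ten decimal places can be illustrated with Python's decimal module. Note that the value 0.23167110615 used below is hypothetical (the operands of the multiplication are not shown above); it was chosen only because it is consistent with the two ten-digit results quoted.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

# Hypothetical value with more than ten decimal places.
x = Decimal("0.23167110615")

rounded = x.quantize(Decimal("1e-10"), rounding=ROUND_HALF_UP)  # round to nearest
chopped = x.quantize(Decimal("1e-10"), rounding=ROUND_DOWN)     # truncate extra digits

print(rounded)  # 0.2316711062
print(chopped)  # 0.2316711061
```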
In this section, we will consider ways in which these errors propagate during elementary
operations, as well as the phenomenon of catastrophic cancellation.
There are models designed to predict the behaviour of the error in a general setting. We
shall briefly discuss a popular model in this course. Our goal is to use the model to explain
the type of behaviour we see in a set of example model problems.
Be aware this is still a research area, and many other models are in use.
In a floating point system, numbers are represented in scientific notation with a
fixed number of significant digits. For example, with three significant digits the numbers
0.011, 11 and 1100 have the floating point representations
1.10 × 10^{−2}, 1.10 × 10^{1}, 1.10 × 10^{3}.
Machine Numbers
While we work in the decimal system, the computer operates with a binary system. The
storage of even some simple numbers, such as 1/10, in the binary representation requires rounding
or chopping to be represented by a fixed number of digits. A computer can store only a
finite set of numbers exactly. They are called machine numbers.
Floating Point Arithmetic
It is most common on computers to use floating point arithmetic, which means that
numbers are essentially stored in scientific notation.
Definition 1.1 (Floating Point System). A floating point system is characterised by a base β,
a precision t (the number of significant digits), and an exponent range L ≤ e ≤ U. A floating
point number has the form
x = ±(b_0.b_1 b_2 ⋯ b_{t−1}) × β^e = ±(b_0 b_1 ⋯ b_{t−1}) × β^{e−t+1},    (1)
where the digits satisfy 0 ≤ b_i ≤ β − 1,
with the advantage being that the term inside the bracket is now an integer.
Normalised Numbers
Definition 1.2 (Normalised Numbers). A floating point number is normalised if the leading
digit b_0 is nonzero. Every nonzero number then has a unique floating point representation.
Figure 1: The floating point numbers are not evenly spaced
For example, 0.03 = 0.03 × 10^0 = 0.30 × 10^{−1} = 3.00 × 10^{−2} = 30.0 × 10^{−3} = · · · . In a
normalised system 0.03 will be stored as 3.00 × 10^{−2}.
For systems using base 2 the leading digit for a normalised number will always be 1 (and
therefore does not need to be explicitly stored).
In a normalised system, the smallest positive floating point number is β^L. The largest
floating point number is
(β − 1) ( Σ_{i=0}^{t−1} β^{−i} ) β^U = Σ_{i=0}^{t−1} β^{U−i+1} − Σ_{i=0}^{t−1} β^{U−i}
                                    = Σ_{i=0}^{t−1} β^{U+1−i} − Σ_{i=1}^{t} β^{U+1−i}
                                    = β^{U+1} − β^{U+1−t}.
Table 1: An example double precision number
0 10000000011 1011100100010000000000000000000000000000000000000000
IEEE format β t L U
Half precision 2 11 −14 15
Single precision 2 24 −126 127
Double precision 2 53 −1022 1023
Quadruple precision 2 113 −16382 16383
The double numeric type values in Matlab are stored using the IEEE-754 format for double
precision. This consists of
• 1 sign bit (0 means +).
• 11 bits for the exponent e (the binary representation of e + U = e + 1023 ∈ {1, . . . , 2046} is
stored). Of the 2048 possible bit patterns, the exponents actually used run from −1022 to
1023, with the base of the powers being 2. Note that the exponent bits can never be all
zeros or all ones (i.e. e + 1023 ≠ 0, 2047); those patterns are reserved for special values.
• 52+1 bits describing a fractional part of the form 1.b1 b2 ...b52 where each of the b1 , b2
etc. is a bit [0 or 1]. Since the first bit is always 1, it does not need to be stored.
With β = 2, the largest possible number is 2^{1024} − 2^{1024−53}, giving approximately 1.80 ×
10^{308}. See [Link] for more detail.
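The formula 2^{1024} − 2^{1024−53} can be checked directly; here is a short Python sketch (Python floats are IEEE-754 double precision):

```python
import sys

# Largest double: all 52 stored fraction bits set, exponent e = 1023,
# i.e. (2 - 2**-52) * 2**1023 = 2**1024 - 2**(1024-53).
largest = (2.0 - 2.0**-52) * 2.0**1023

print(largest)                        # 1.7976931348623157e+308
print(largest == sys.float_info.max)  # True
print(largest * 2)                    # inf  (overflow)
```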
The IEEE standard also defines some special ‘numbers’, such as ±0 and ±∞. See Table
2.
Table 2: Some special cases represented in IEEE format
exponent bits all 0, fraction bits all 0     ±0
exponent bits all 0, fraction bits not all 0 subnormal numbers
exponent bits all 1, fraction bits all 0     ±∞
exponent bits all 1, fraction bits not all 0 NaN
NaN means Not a Number, and is the result of performing undefined operations such
as 0/0, or arithmetic involving existing NaN values, such as NaN + 3 = NaN. Note the
convention that the comparisons NaN = NaN, NaN > NaN and NaN < NaN are all false (this can cause
problems).
The value −0 acts essentially the same way as 0 for arithmetic, and has the convention
that +0 = −0 is true. It comes about from calculations like (−1) × (+0) or from rounding
very small negative numbers to zero.
The values ±∞ act as you might expect: ∞ − 1 = ∞ and 3 > −∞, but 0 × ∞ = NaN.
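These conventions can be verified directly; in Python, for example:

```python
import math

nan = float("nan")
inf = float("inf")

# All comparisons involving NaN are false.
print(nan == nan, nan < nan, nan > nan)   # False False False

# Negative zero compares equal to zero, but keeps its sign.
print(0.0 == -0.0)                        # True
print(math.copysign(1.0, -0.0))           # -1.0

# Infinity behaves as expected, except for indeterminate forms.
print(inf - 1 == inf, 3 > -inf)           # True True
print(math.isnan(0.0 * inf))              # True
```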
Overflow and underflow
• If the result of a computation is larger than that allowed by the computer, you have an
overflow.
• When a result is smaller than the smallest positive number, then it either becomes zero
or is considered inaccurate because it has less than the normal 52 significant bits. This
loss of accuracy is called underflow.
• Underflow is a more subtle problem than overflow, but still important, and some ma-
chines will stop with an error message (or at least print a warning message) when it
occurs.
Matlab interprets overflow as ‘infinity’ which is printed as inf or −inf according to its
sign, and continues computing. Other operations with no well-defined answer, like 0.0/0.0,
give the result NaN. Note that evaluating, say, sqrt(−1.0) will result in a complex number.
However in other languages such as Python, sqrt(−1.0) will return NaN.
Machine Epsilon
Definition 1.3 (Machine Epsilon). The smallest positive machine number εM such that
1 + εM ̸= 1 is called machine epsilon.
Hence εM = β^{1−t}.
For normalised IEEE 64-bit floating point numbers εM = 2^{−52} ≃ 2.2204 × 10^{−16}. So this is
roughly the maximum relative error introduced when you round or chop a number. Another
way of describing this is that numbers have 52 significant bits, or slightly better than fifteen
significant decimal digits.
For traditional 32-bit machines (single precision) εM = 2^{−23} ≃ 1.1921 × 10^{−7}. That is, numbers have 23
significant bits, or close to seven significant decimal digits.
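Machine epsilon can be found experimentally by repeated halving, following the definition above. A short Python sketch (Python floats are IEEE double precision):

```python
import sys

# Halve eps until 1 + eps/2 is rounded back to 1; the last eps for which
# 1 + eps differs from 1 is machine epsilon.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2

print(eps)                            # 2.220446049250313e-16
print(eps == 2.0**-52)                # True
print(eps == sys.float_info.epsilon)  # True
```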
1.2 Measuring Errors
Recall the example from earlier: the result of the multiplication is rounded to
2.316711062 × 10^2 = 231.6711062
or chopped to
2.316711061 × 10^2 = 231.6711061.
This is rounding (or chopping) to ten significant figures.
The actual error in the numerical example above is 1 × 10^{−7}. Is that a large or small
error? It depends on the size of the numbers being calculated, which leads us to consider the
difference between absolute and relative errors.
Definition 1.4 (Absolute and relative errors). Let x̃ be an approximation to the exact value
x. The absolute error is |x − x̃|, and the relative error is |x − x̃|/|x| (for x ≠ 0).
The relative error provides a better understanding of the significance of the error. As seen
above, rounding or chopping to ten significant decimal digits introduces a relative error of
about 10^{−10} regardless of the size of the number.
Definition 1.5 (Floating point value). The floating point value fl(x) is the result of the
rounding (or chopping) done on x. fl(x) is the floating point number representation of x as
given in Definition 1.1.
Recall the number of digits in the mantissa is the number of significant digits. The
rounding or chopping is done on the fractional part.
Write x ∈ R as
x = ( Σ_{i=0}^{∞} b_i / β^i ) β^e
  = ( Σ_{i=0}^{∞} b_i / β^{i−t+1} ) β^{e−t+1}
  = ( Σ_{i=0}^{∞} b_i β^{(t−1)−i} ) β^{e−t+1}
  = μ β^{e−t+1},
Compare the above equation to the floating point representation given in Equation (1).
Then y1 = ⌊µ⌋β e−t+1 and y2 = ⌈µ⌉β e−t+1 are two floating point numbers such that
y1 ≤ x ≤ y2 . (⌊µ⌋ rounds a number down to the nearest integer, ⌈µ⌉ rounds a number up to
the nearest integer.)
So
|x − fl(x)| ≤ β e−t+1 ,
for chopping and
|x − fl(x)| ≤ β e−t+1 /2,
for rounding.
Furthermore, since b_0 ≠ 0 for a normalised number and hence μ ≥ β^{t−1},
|x − fl(x)|/|x| ≤ β^{e−t+1}/(μ β^{e−t+1}) ≤ β^{1−t} = εM
for chopping and
|x − fl(x)|/|x| ≤ β^{1−t}/2 = εM/2
for rounding.
Definition 1.6 (Floating point model). Any system of floating point numbers will have an
upper limit ε on the relative rounding error, so that for machine numbers x, y and any basic
arithmetic operation ∘ ∈ {+, −, ×, ÷},
fl(x ∘ y) = (x ∘ y)(1 + δ), |δ| ≤ ε.
When using rounding, which is the case for IEEE, ε = β^{1−t}/2. Note that |δ| ≤ 2^{−24} in single
precision and |δ| ≤ 2^{−53} in double precision (if rounding).
Machines may have slight differences, but we will assume that whenever two machine
numbers are combined through some arithmetical procedure, they are first combined, then
normalised, rounded off, and finally stored in memory.
Figure 2: Absolute error |fl(x) − x| when rounding 3000 random numbers to single precision
We will define the floating point rounding function “fl” whose value is the result of the
rounding done on its argument by the machine in use, and then consider the accumulative error
through the examination of the effects of addition, subtraction, multiplication and division.
Thus, the result that you would expect for multiplication of two numbers x and y that can
be represented exactly in the machine’s floating point system is fl(x × y). If, alternatively,
the numbers themselves have to be rounded when input, the result is fl(fl(x) × fl(y)).
The floating point model in Definition 1.6 gives a uniform bound on the fractional change
in a number when it is rounded to a normalised value.
Numerical Experiment - Single Precision
As an experiment, I randomly generated 3000 real numbers x such that x ∈ [−100000, 100000].
I then rounded them to single precision and plotted |fl(x) − x|. The result is in Figure 2.
(Actually, I cheated a bit. Since we can’t store real numbers on a computer, I randomly
generated 3000 quadruple precision numbers, and used those as the ‘real’ numbers.)
Let us now consider the relative error |fl(x) − x|/|x|, as shown in Figure 3. Notice that the
relative error is ≤ 2^{−24} ≈ 5.96 × 10^{−8}.
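This experiment can be reproduced without special hardware by using struct to round a double to the nearest single precision value; a Python sketch:

```python
import random
import struct

def fl32(x):
    """Round a Python double to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

random.seed(0)
worst = 0.0
for _ in range(3000):
    x = random.uniform(-100000, 100000)
    rel = abs(fl32(x) - x) / abs(x)
    worst = max(worst, rel)

print(worst <= 2.0**-24)  # True: the relative error never exceeds 2^-24
```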
Multiplication
Example 1.1 (Floating point model - multiplication). Use the floating point model given in
Definition 1.6 to bound the error when multiplying two floating point numbers.
Using the bound ε, consider the total rounding error in the multiplication:
fl(fl(x) × fl(y)) = (x(1 + δ1) × y(1 + δ2))(1 + δ3),
where |δi| ≤ ε. Thus
fl(fl(x) × fl(y)) = (xy)(1 + δ4)
Figure 3: Relative error |fl(x) − x|/|x| when using single precision
where
(1 + δ4 ) = (1 + δ1 )(1 + δ2 )(1 + δ3 ).
The extreme values of (1 + δ4) give
(1 − ε)³ = 1 − 3ε + O(ε²) ≤ 1 + δ4 ≤ (1 + ε)³ = 1 + 3ε + O(ε²),
so
|δ4| ≤ 3ε + O(ε²).
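The bound |δ4| ≲ 3ε can be checked numerically in single precision (where ε = 2^{−24}), again simulating single precision arithmetic with struct:

```python
import random
import struct

def fl32(x):
    """Round a double to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

eps = 2.0**-24  # rounding bound for IEEE single precision
random.seed(1)
worst = 0.0
for _ in range(3000):
    x = random.uniform(1.0, 1000.0)
    y = random.uniform(1.0, 1000.0)
    # The product of two singles is exact in double precision, so one
    # final rounding of the double product models fl(fl(x) * fl(y)).
    z = fl32(fl32(x) * fl32(y))
    rel = abs(z - x * y) / (x * y)
    worst = max(worst, rel)

print(worst <= 3 * eps)  # True, consistent with |delta_4| <= 3*eps
```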
As a concrete illustration, consider subtracting two nearby numbers when only ten significant
digits are stored:
x = 0.8888888888888888 − 0.8888888888444444 = 0.0000000000444444,
x̃ = fl(0.8888888888888888) − fl(0.8888888888444444)
  = 0.8888888889 − 0.8888888888
  = 0.0000000001.
The computed difference x̃ has no correct significant digits.
Figure 4: Relative error |fl(fl(x) × fl(y)) − x × y|/|x × y| when using single precision
In contrast, adding the same two numbers is harmless:
x = 0.8888888888888888 + 0.8888888888444444 = 1.7777777777333332,
x̃ = fl(0.8888888888888888) + fl(0.8888888888444444)
  = 0.8888888889 + 0.8888888888
  = 1.7777777777.
Example 1.2 (Floating point model - subtraction). Use the floating point model given in
Definition 1.6 to get a more precise understanding of why subtraction may be problematic with
floating point numbers.
Let's calculate the roundoff error when exactly subtracting two different rounded numbers:
fl(x) − fl(y) = x(1 + δ1) − y(1 + δ2) = (x − y)(1 + δ3),
Figure 5: Relative error |fl(fl(x) − fl(y)) − (x − y)|/|x − y| when using single precision
where
|δ3| = |xδ1 − yδ2|/|x − y| ≤ ε(|x| + |y|)/|x − y| ≤ 2ε max(|x|, |y|)/|x − y|.
This can be very large if x − y is very small relative to x and y. Again, I randomly generated
3000 real numbers x, y ∈ [0, 100000] and evaluated |fl(fl(x) − fl(y)) − (x − y)|/|x − y| using
single precision. The results shown in Figure 5 have relative errors ≫ 2^{−24} ≈ 5.96 × 10^{−8}.
Beware of subtracting two similar numbers.
For example, recall the finite difference formula
f′(x) ≈ (f(x + h) − f(x))/h.    (2)
In exact arithmetic, we expect the error to decrease as h → 0. However, f (x + h) ≈ f (x)
for h small. We saw in the chapter on numerical differentiation that the large roundoff errors
have a major impact on the accuracy of the solution. See Figure 6 for a reminder.
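The interplay between truncation error and cancellation in (2) can be seen in a few lines; here f(x) = x sin x at x = 1, as in Figure 6:

```python
import math

def f(x):
    return x * math.sin(x)

def fprime(x):
    return math.sin(x) + x * math.cos(x)  # exact derivative

x = 1.0
errors = []
for k in range(1, 16):
    h = 10.0**-k
    approx = (f(x + h) - f(x)) / h
    errors.append(abs(approx - fprime(x)))

# The error first decreases with h (truncation error O(h)), then grows
# again as roundoff from the cancellation f(x+h) - f(x) takes over.
print(min(errors) < errors[0])   # True
print(errors[-1] > min(errors))  # True
```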
Example 1.3 (Floating point model - addition). Use the floating point model given in Def-
inition 1.6 to show that if we add two numbers with the same sign, the relative errors are
well-behaved.
If x, y > 0,
fl(fl(x) + fl(y)) = (x(1 + δ1) + y(1 + δ2))(1 + δ3)
= x + y + x(δ1 + δ3 + δ1δ3) + y(δ2 + δ3 + δ2δ3),
where
|x(δ1 + δ3 + δ1δ3) + y(δ2 + δ3 + δ2δ3)| ≤ max{|δ1 + δ3 + δ1δ3|, |δ2 + δ3 + δ2δ3|} (x + y)
≤ (2ε + ε²)(x + y).
Figure 6: Error in approximating f ′ (1) for f (x) = x sin x using forward differences.
So
fl(fl(x) + fl(y)) = (x + y)(1 + δ4 ), |δ4 | ≤ 2ε + ε2 ≃ 2ε.
This is consistent with the results of the numerical experiment shown in Figure 7.
Multiplication of Complex Numbers
Before looking at the next example, note that if r, s ∈ R then
(|r| + |s|)² ≤ 2(r² + s²).
Example 1.4 (Floating point model - complex multiplication). Use the floating point model
given in Definition 1.6 to bound the error when multiplying two complex numbers.
Figure 7: Relative error |fl(fl(x) + fl(y)) − (x + y)|/|x + y| when using single precision
Now write x = a + bi and y = c + di, so that xy = (ac − bd) + (ad + bc)i. Each real product
carries a relative error bounded by ε, so the computed real and imaginary parts satisfy
|fl(ac − bd) − (ac − bd)| ≤ γ(|ac| + |bd|), |fl(ad + bc) − (ad + bc)| ≤ γ(|ad| + |bc|),
where γ collects the rounding errors (γ ≈ 2ε). Using the inequality (|r| + |s|)² ≤ 2(r² + s²),
|fl(xy) − xy|² ≤ γ²(|ac| + |bd|)² + γ²(|ad| + |bc|)²
≤ 2γ²((ac)² + (bd)² + (ad)² + (bc)²)
= 2γ²(a² + b²)(c² + d²)
= 2γ²|x|²|y|².
Finally,
fl(xy) = xy(1 + e), where |e| ≤ √2 γ ≈ 2√2 ε.
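The bound |e| ≤ √2 γ can be tested against exact rational arithmetic; here is a Python sketch using the fractions module as the exact reference (ε = 2^{−53} for double precision rounding, as in Definition 1.6):

```python
import math
from fractions import Fraction

x = 0.1 + 0.2j
y = 0.3 - 0.4j
z = x * y  # complex product computed with double precision real arithmetic

# Exact product of the stored doubles, via exact rational arithmetic.
a, b = Fraction(x.real), Fraction(x.imag)
c, d = Fraction(y.real), Fraction(y.imag)
exact_re = a * c - b * d
exact_im = a * d + b * c

err = math.hypot(float(Fraction(z.real) - exact_re),
                 float(Fraction(z.imag) - exact_im))
bound = 2 * math.sqrt(2) * 2.0**-53 * abs(x) * abs(y)

print(err <= bound)  # True
```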
Cancellation Error
In any calculation there is the danger of a catastrophic calculation. Suppose two numbers
agree (are equal) in all but the last digit, then their difference x − y can have only one
significant digit of accuracy, and this without any roundoff error. Future calculations will
then only have at most one significant digit. Such routines in extreme cases can become
Table 3: Taylor series approximation ex(x) compared with the exact value of e^x
x     ex(x)                     e^x
1     2.718281828459046e+00     2.7182818284590455e+00
10    2.202646579480671e+04     2.2026465794806718e+04
20    4.851651954097905e+08     4.8516519540979028e+08
40    2.353852668370201e+17     2.3538526683701997e+17
-1    3.678794411714424e-01     3.6787944117144233e-01
-10   4.539992967040021e-05     4.5399929762484847e-05
-20   6.147561828914626e-09     2.0611536224385579e-09
-40   3.116951588217358e-01     4.2483542552915889e-18
numerically unstable and lead to completely incorrect solutions. Such calculations are known
as catastrophic cancellation.
This potential problem is often difficult to foresee, although it is sometimes possible to
avoid. In general, if x and y are represented with n significant digits, and agree to one or
more digits, then their difference will not be accurate to the full n digits.
Catastrophic Cancellation Example
Let us consider another situation in which this problem arises.
% Taylor series approximation of e^x (assumes x has been set)
ex_approx = 1;
term = 1;
n = 0;
while ( ex_approx + term ~= ex_approx )
    n = n + 1;
    term = term*(x/n);
    ex_approx = ex_approx + term;
end
Note that we stop the calculation when the addition of a new term does not make a difference
to the sum.
Table 3 compares the results from this program with the exact value of e^x. When x > 0
the results are fine. However, when x < 0 the results become progressively worse as x → −∞.
What is happening? For positive x we have a sum of positive terms. For negative x we
have a sum of positive and negative terms.
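The Matlab loop above translates directly into Python, and shows the failure for negative arguments:

```python
import math

def exp_taylor(x):
    """Sum the Taylor series of e^x until adding the next term
    no longer changes the partial sum (as in the Matlab code above)."""
    ex_approx, term, n = 1.0, 1.0, 0
    while ex_approx + term != ex_approx:
        n += 1
        term *= x / n
        ex_approx += term
    return ex_approx

good = abs(exp_taylor(20) - math.exp(20)) / math.exp(20)
bad = abs(exp_taylor(-20) - math.exp(-20)) / math.exp(-20)

print(good)  # tiny: close to machine epsilon
print(bad)   # order 1: the sum is dominated by cancellation error
```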
Table 4 displays the terms in the Taylor series for e^{−24}. It is easy to see that in a situation like this,
catastrophic cancellation can and does occur.
Table 4: Terms in the Taylor series
n     n-th term
0     1
1     −24
2     2.8850 × 10^2
⋮
12    8.3320 × 10^7
13    −1.5598 × 10^8
14    2.7134 × 10^8
15    −4.4086 × 10^8
⋮
85    −8.1807 × 10^{−7}
86    8.1814 × 10^{−7}
87    −8.1812 × 10^{−7}
88    8.1813 × 10^{−7}
In this case we can avoid the situation in which catastrophic cancellation will occur by
using the formula
e^x = 1/e^{−x}
for x < 0.
Specifically, we use the code given in Listing 1:
if x > 0
    ex_approx = ex_Taylor(x);
else
    ex_approx = 1.0/ex_Taylor(-x);
end
As can be seen in Table 5, the results obtained with the new algorithm are quite satisfactory.
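The same fix written in Python, with the Taylor loop redefined here so the snippet is self-contained:

```python
import math

def exp_taylor(x):
    """Sum the Taylor series of e^x until the partial sum stops changing."""
    ex_approx, term, n = 1.0, 1.0, 0
    while ex_approx + term != ex_approx:
        n += 1
        term *= x / n
        ex_approx += term
    return ex_approx

def exp_fixed(x):
    """Avoid cancellation for x < 0 by using e^x = 1/e^(-x)."""
    if x > 0:
        return exp_taylor(x)
    return 1.0 / exp_taylor(-x)

rel = abs(exp_fixed(-20) - math.exp(-20)) / math.exp(-20)
print(rel < 1e-12)  # True: the modified algorithm is accurate
```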
A particularly interesting example of catastrophic cancellation, given by Rump [2], is the
expression
y = 333.75 b^6 + a^2 (11a^2 b^2 − b^6 − 121b^4 − 2) + 5.5 b^8 + a/(2b)
with a = 77617 and b = 33096. The exact value of this expression is
−54767/66192 ≈ −0.8273960599468.
Table 5: Results from the modified algorithm compared with the exact value of e^x
x     new ex(x)                 e^x
1     2.718281828459045e+00     2.7182818284590455e+00
10    2.202646579480671e+04     2.2026465794806718e+04
20    4.851651954097905e+08     4.8516519540979028e+08
40    2.353852668370201e+17     2.3538526683701997e+17
-1    3.678794411714423e-01     3.6787944117144233e-01
-10   4.539992976248485e-05     4.5399929762484847e-05
-20   2.061153622438557e-09     2.0611536224385579e-09
-40   4.248354255291587e-18     4.2483542552915889e-18
Now what do we get if we use double precision floating point arithmetic? All the constants
in the expression can be represented exactly in double precision, so we might expect an
accurate result. But using Matlab we obtain −1.1805916207174 × 10^21!
If we were to do this calculation in extended precision we would still obtain essentially
the same result. It turns out that you need at least 37 decimal digits of accuracy to obtain a
reasonable result.
An explanation of the behaviour of this function can be found in the paper by Cuyt et al. [1].
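Rump's example can be checked in Python: exact rational arithmetic (the fractions module) recovers −54767/66192, while the same expression evaluated in double precision is wildly wrong.

```python
from fractions import Fraction

a, b = 77617, 33096

# Exact evaluation with rational arithmetic.
A, B = Fraction(a), Fraction(b)
exact = (Fraction(33375, 100) * B**6
         + A**2 * (11 * A**2 * B**2 - B**6 - 121 * B**4 - 2)
         + Fraction(55, 10) * B**8
         + Fraction(a, 2 * b))

print(exact == Fraction(-54767, 66192))  # True
print(float(exact))                      # approximately -0.8273960599468

# The same expression in double precision arithmetic.
af, bf = float(a), float(b)
dbl = (333.75 * bf**6
       + af**2 * (11 * af**2 * bf**2 - bf**6 - 121 * bf**4 - 2)
       + 5.5 * bf**8
       + af / (2 * bf))

print(abs(dbl - float(exact)) > 1.0)     # True: catastrophically wrong
```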
Error in Continuous Processes
Numerical schemes to solve differential equations will be dealt with later in the course,
but Euler's method is one you may already have heard of. The errors which occur in most of these
computational problems are a combination of roundoff errors due to the finite nature of the
floating point system, discretisation errors, convergence errors and errors due to the nature
of the numerical scheme chosen. Of course, we could use various tools in Matlab to carry out our
calculations to any desired precision, but then our calculations would be slow. So
we usually make the compromise of using floating point systems which allow fast calculations,
but which maintain only a limited number of significant figures.
As we have seen in the previous discussions, a calculation with too many operations may be
swamped by roundoff errors, while one with too few may lack sufficient accuracy. For instance,
when we use the fourth order Runge-Kutta method (see the ODE chapter) with a time-step
of size h to approximate the solution of an ODE at time T, we will use O(1/h) operations,
which will typically produce a roundoff error of size O(ε/h). As shown in Figure 8, the total
error is a combination of discretisation error and roundoff error.
For high efficiency there needs to be a trade-off between the competing errors
and the required accuracy, and so the most efficient method may differ from problem to problem.
Figure 8: Typical errors using 4th Order Runge Kutta method, showing discretisation error
dominating for large h and roundoff error dominating for smaller h.
References
[1] A. Cuyt, B. Verdonk, S. Becuwe, and P. Kuterna. A remarkable example of catastrophic
cancellation unraveled. Computing, 66:309–320, 2001. doi:10.1007/s006070170028.
[2] S. M. Rump. Algorithms for verifying inclusions - theory and practice. In R. E. Moore,
editor, Reliability in Computing, pages 109–126. Academic Press, New York, 1988.