Machine Arithmetic: Fixed vs Floating Point
Contents

1 Overview

18.330 Lecture Notes
1 Overview
Consider an irrational real number like π = 3.1415926535..., represented by an
infinite non-repeating sequence of decimal digits. Clearly an exact specification
of this number requires an infinite amount of information. In contrast, computers
must represent numbers using only a finite quantity of information, which
clearly means we won’t be able to represent numbers like π without some error.
In principle there are many different ways in which numbers could be repre-
sented on machines, each of which entails different tradeoffs in convenience and
precision. In practice, there are two types of representations that have proven
most useful: fixed-point and floating-point numbers. Modern computers use
both types of representation. Each method has advantages and drawbacks, and
a key skill in numerical analysis is to understand where and how the computer’s
representation of your calculation can go catastrophically wrong.
The easiest way to think about computer representation of numbers is to
imagine that the computer represents numbers as finite collections of decimal
digits. Of course, in real life computers store numbers as finite collections of
binary digits. However, for our purposes this fact will be an unimportant im-
plementation detail; all the concepts and phenomena we need to understand
can be pictured most easily by thinking of numbers inside computers as finite
strings of decimal digits. At the end of our discussion we will discuss the minor
points that need to be amended to reflect the base-2 reality of actual computer
numbers.
Figure 1: In a 7-digit fixed-point system, each number consists of a string of 7
digits, each of which may run from 0 to 9.
The numbers representable in this scheme are

S_rep = { −999.9999, −999.9998, −999.9997, . . . , −000.0001, +000.0000,
          +000.0001, +000.0002, . . . , +999.9998, +999.9999 }.
Notice something about this list of numbers: They are all separated by the
same absolute distance, in this case 0.0001. Another way to say this is that the
density of the representable set is uniform over the real line (at least between
the endpoints R_min,max = ±999.9999): any interval of a given width contains the
same number of exactly representable fixed-point numbers, wherever it lies on
the line. For example, between 1 and 2 there are 10^4 exactly representable
fixed-point numbers, and between 101 and 102 there are also 10^4 exactly
representable fixed-point numbers.
Rounding error
Another way to characterize the uniform density of the set of exactly rep-
resentable fixed-point numbers is to ask this question: Given an arbitrary
real number r in the interval [Rmin, Rmax], how near is the nearest exactly-
representable fixed-point number? If we denote this number by fi(r), then the
statement that holds for fixed-point arithmetic is:

for all r ∈ R, Rmin < r < Rmax, ∃ ε with |ε| ≤ EPSABS such that

fi(r) = r + ε.    (1)
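Statement (1) is easy to play with on a real machine. The sketch below mimics our 7-digit decimal fixed-point system by rounding to 4 decimal places; the names `fi` and `EPSABS` follow the text and are defined here for illustration, not taken from any library.

```julia
# Mimic the 7-digit fixed-point system: 4 digits after the decimal point,
# so the spacing between representable numbers is 0.0001 and the
# worst-case rounding error is EPSABS = 0.00005.
fi(r) = round(r, digits=4)
const EPSABS = 0.5e-4

r = pi                    # an arbitrary real number inside (Rmin, Rmax)
eps_r = fi(r) - r         # the epsilon of equation (1)
@assert abs(eps_r) <= EPSABS
```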
Error-free calculations
There are many calculations that can be performed in a fixed-point system with
no error. For example, suppose we want to add the two numbers 12.34 and
742.55. Both of these numbers are exactly representable in our fixed-point
system, as is their sum 754.89, so the addition proceeds with no error:
Figure 3: Arithmetic operations in which both the inputs and the outputs are
exactly representable incur no error.
Non-error-free calculations
On the other hand, here’s a calculation that is not error-free.
Figure 4: A calculation that is not error-free. The exact answer here is
24/7 = 3.42857142857143..., but with finite precision we must round the
answer to the nearest representable number.
Overflow
The error in Figure 4 is not particularly serious. However, there is one type of cal-
culation that can go seriously wrong in a fixed-point system. Suppose, in the
calculation of Figure 3, that the first summand were 412.34 instead of 12.34.
The correct sum is
412.34 + 742.55 = 1154.89.
However, in fixed-point arithmetic, our calculation looks like this:
Figure 5: Overflow in fixed-point arithmetic.
The leftmost digit of the result has fallen off the end of our computer! This
is the problem of overflow: the number we are trying to represent does not fit
in our fixed-point system, and our fixed-point representation of this number is
not even close to being correct (154.89 instead of 1154.89). If you are lucky,
your computer will detect when overflow occurs and give you some notification,
but in some unhappy situations the (completely, totally wrong) result of this
calculation may propagate all the way through to the end of your calculation,
yielding highly befuddling results.
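Modern hardware integer types behave exactly this way. Here is a small illustration in Julia, using the 16-bit integer type `Int16`, whose representable range is −32768 to 32767:

```julia
# Int16 stores integers from -32768 to 32767. Adding past the top of the
# range wraps around silently -- no warning, just a drastically wrong answer.
a = Int16(32767)            # the largest representable Int16
b = a + Int16(1)            # the true sum, 32768, does not fit
@assert b == Int16(-32768)  # the result has wrapped around
```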
The problem of overflow is greatly mitigated by the introduction of floating-
point arithmetic, as we will discuss next.
Figure 6: A floating-point scheme with a 5-decimal-digit mantissa and a two-
decimal-digit exponent. For example,

12.34 = +1.2340 × 10^+01
754.89 = +7.5489 × 10^+02
Rounding error
For a real number r, let fl(r) be the real number closest to r that is exactly
representable in a floating-point scheme. Then the statement analogous to (1)
is
for all r ∈ R, |r| < Rmax, ∃ ε with |ε| ≤ EPSREL such that

fl(r) = r(1 + ε).    (2)
julia> eps()
2.220446049250313e-16
The phenomenon that arises when you subtract two nearly equal floating-point
numbers is called catastrophic loss of numerical precision; to emphasize that
it is the main pitfall you need to worry about we will refer to it as the big
floating-point kahuna.
Table 1: Monthly U.S. population data (in thousands) for February and March 2011.

    February 2011    311,189
    March 2011       311,356

These data have enough precision to allow us to compute the actual change
in population (in thousands) to three-digit precision:

311,356 − 311,189 = 167.    (3)
But now suppose we try to do this calculation using the floating-point system
discussed in the previous section, in which the mantissa has 5-digit precision.
The floating-point representations of the numbers in Table 1 are

311,189 → 3.1119 × 10^5,    311,356 → 3.1136 × 10^5.

Subtracting, we find

3.1136 × 10^5 − 3.1119 × 10^5 = 1.7000 × 10^2.    (4)
Comparing (3) and (4), we see that the floating-point version of our answer is
170, to be compared with the exact answer of 167. Thus our floating-point
calculation has incurred a relative error of about 2 × 10^−2. But, as noted above,
the value of EPSREL for our 5-significant-digit floating-point scheme is approxi-
mately 10^−5! Why is the error in our calculation 2000 times larger than machine
precision?
What has happened here is that almost all of our precious digits of precision
are wasted because the numbers we are subtracting are much bigger than their
difference. When we use floating-point registers to store the numbers 311,356
and 311,189, almost all of our precision is used to represent the digits 311,
which are the ones that give zero information for our calculation because they
cancel in the subtraction.
More generally, if we have N digits of precision and the first M digits of
x and y agree, then we can only compute their difference to around N − M
digits of precision. We have thrown away M digits of precision! When M is
large (close to N ), we say we have experienced catastrophic loss of numerical
precision. Much of your work in practice as a numerical analyst will be in
developing schemes to avoid catastrophic loss of numerical precision.
In 18.330 we will refer to catastrophic loss of precision as the big floating-
point kahuna. It is the one potential pitfall of floating-point arithmetic that you
must always have in the back of your mind.
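We can reproduce the population example on a real machine by simulating a 5-significant-digit mantissa with Julia's `round(x, sigdigits=5)`; the helper name `fl5` is invented for this sketch.

```julia
# Simulate 5-decimal-digit floating point by rounding to 5 significant digits.
fl5(x) = round(x, sigdigits=5)

feb = fl5(311189.0)          # stored as 3.1119e5
mar = fl5(311356.0)          # stored as 3.1136e5
diff = mar - feb             # floating-point answer: 170 (exact answer: 167)
@assert diff == 170.0

relerr = abs(diff - 167.0) / 167.0   # about 1.8e-2, some 2000x machine precision
@assert relerr > 1e-2
```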
Stepsize h = 2/3

First suppose we start with a stepsize of h = 2/3. This number is not exactly
representable; in our 5-decimal-digit floating-point scheme, it is rounded to

fl(2/3) = 6.6667 × 10^−1    (6a)

and thus
Stepsize h = 2/30

Now let's shrink the stepsize by a factor of 10 and try again. Like the old
stepsize h = 2/3, the new stepsize h = 2/30 is not exactly representable. In our
5-decimal-digit floating-point scheme, it is rounded to

fl(2/30) = 6.6667 × 10^−2    (7a)
Note that our floating-point scheme allows us to specify this h with just as much
precision as we were able to specify the previous value of h [equation (6a)] –
namely, 5-digit precision. So we certainly don’t suffer any loss of precision at
this step.
The sequence of floating-point numbers that our computation generates is
now
and thus
Analysis
The key equations to look at are (6b) and (7b). As we noted above, our floating-
point scheme represents 2/3 and 2/30 with the same precision – namely, 5 digits.
Although the second number is 10 times smaller, the floating-point scheme uses
the same mantissa for both numbers and just adjusts the exponent appropriately.
The problem arises when we attempt to cram these numbers inside a floating-
point register that must also store the quantity 1, as in (6b) and (7b). Because
the overall scale of the number is set by the 1, we can't simply adjust the
exponent to accommodate all the digits of 2/30. Instead, we lose digits off the
right end – more specifically, we lose one more digit off the right end in (7b)
than we did in (6b). However, when we go to perform the division in (6c) and (7c),
the numerator is the same 5-digit-accurate h value we started with [eqs. (6a)
and (7a)]. This means that each digit we lost by cramming our number together
with 1 now amounts to an extra lost digit of precision in our final answer.
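The same digit loss is visible in ordinary 64-bit arithmetic: store a small h together with 1, subtract the 1 back out, and the h you recover is no longer the h you started with. The value h = 1e-8 below is chosen only for illustration.

```julia
h = 1.0e-8
h_stored = (1.0 + h) - 1.0    # h after being crammed into a register with 1
@assert h_stored != h         # digits have fallen off the right end of h

relerr = abs(h_stored - h) / h
@assert relerr > 1e-12        # roughly half of our ~16 digits are gone
```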
When ∆ ≪ x, the two terms on the RHS are nearly equal, and subtracting them
gives rise to catastrophic loss of precision. For example, if x = 900, ∆ = 4e-3,
the calculation on the RHS becomes
30.00006667 − 30.00000000
and we waste the first 6 decimal digits of our floating-point precision; in the
5-decimal-digit scheme discussed above, this calculation would yield precisely
zero useful information about the number we are seeking.
However, there is a simple workaround. Consider the identity
(√(x+∆) − √x)(√(x+∆) + √x) = (x + ∆) − x = ∆,   i.e.   √(x+∆) − √x = ∆ / (√(x+∆) + √x).
The RHS of this equation is a safe way to compute a value for the LHS; for
example, with the numbers considered above, we have
∆ / (√(x+∆) + √x) = 4e-3 / (30.0000667 + 30.0000000) ≈ 6.667e-5.
Even if we can’t store all the digits of the numbers in the denominator, it doesn’t
matter; in this way of doing the calculation those digits aren’t particularly
relevant anyway.
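Here is the rearrangement at work in 64-bit arithmetic. With the numbers above both forms agree, but if ∆ is made small enough the naive difference returns exactly zero while the rationalized form still produces a sensible answer.

```julia
x, d = 900.0, 4.0e-3
naive = sqrt(x + d) - sqrt(x)         # subtracts two nearly equal numbers
safe  = d / (sqrt(x + d) + sqrt(x))   # rationalized form: no cancellation
@assert isapprox(safe, 6.667e-5, rtol=1e-3)
@assert isapprox(naive, safe, rtol=1e-6)

d2 = 1.0e-14                          # now x + d2 rounds to exactly x ...
@assert sqrt(x + d2) - sqrt(x) == 0.0         # ... so the naive form gives 0
@assert d2 / (sqrt(x + d2) + sqrt(x)) > 0.0   # the safe form still works
```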
function DirectSum(X, N)
Sum=0.0;
for n=1:N
Sum += X;
end
Sum
end
Suppose we divide some number Y into N equal parts and add them all up.
How accurately do we recover the original value of Y? The following figure plots
the quantity

|DirectSum(Y/N, N) − Y| / Y

for the case Y = π and various values of N. Evidently we incur significant errors
for large N .
Figure 7: Relative error in the quantity DirectSum(Y/N, N), for Y = π, as a
function of N.
function RecursiveSum(X, N)
if N < BaseCaseThreshold
Sum = DirectSum(X,N)
else
Sum = RecursiveSum(X,N/2) + RecursiveSum(X,N/2);
end
Sum
end
What this function does is the following: If N is less than some threshold value
BaseCaseThreshold (which may be 100 or 1000 or so), we perform the sum
directly. However, for larger values of N we perform the sum recursively: We
evaluate the sum by adding together two return values of RecursiveSum. The
following figure shows that this slight modification completely eliminates the
error incurred in the direct-summation process:
Figure 8: Relative error in direct and recursive summation: the quantity
RecursiveSum(Y/N, N), for Y = π, as a function of N.
3 Caution: The function RecursiveSum as implemented here actually only works for even
values of N . Can you see why? For the full, correctly-implemented version of the function,
see the code [Link] available from the “Lecture Notes” section of the website.
Analysis
Why does such a simple prescription so thoroughly cure the disease? The basic
intuition is that, in the case of DirectSum with large values of N , by the time
we are on the 10,000th loop iteration we are adding X to a number that is 104
times bigger than X. That means we instantly lose 4 digits of precision off the
right end of X, giving rise to a random rounding error. As we go to higher and
higher loop iterations, we are adding the small number X to larger and larger
numbers, thus losing more and more digits off the right end of our floating-point
register.
In contrast, in the RecursiveSum approach we never add X to any number
that is more than BaseCaseThreshold times greater than X. This limits the
number of digits we can ever lose off the right end of X. Higher-level additions
are computing the sum of numbers that are roughly equal to each other, in
which case the rounding error is on the order of machine precision (i.e. tiny).
For a more rigorous analysis of the error in direct and pairwise summation,
see the Wikipedia page on the topic4 , which was written by MIT’s own Professor
Steven Johnson.
4 [Link] summation
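The two schemes are easy to compare head-to-head. The sketch below is a self-contained reimplementation (lowercase names to avoid clashing with the versions in the text); N is taken to be a power of two so that the repeated halving is exact.

```julia
function direct_sum(x, n)
    s = 0.0
    for _ in 1:n
        s += x
    end
    return s
end

function recursive_sum(x, n)
    # below the base-case threshold, sum directly; otherwise split in half
    n <= 1000 && return direct_sum(x, n)
    half = div(n, 2)
    return recursive_sum(x, half) + recursive_sum(x, half)
end

N = 2^22                  # a power of two, so halving never leaves a remainder
Y = Float64(pi)
direct_err    = abs(direct_sum(Y / N, N) - Y) / Y
recursive_err = abs(recursive_sum(Y / N, N) - Y) / Y
@assert recursive_err <= direct_err   # pairwise summation is at least as accurate
```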
julia> 19%7
5
julia> exp(1000)
Inf
A NaN is the only floating-point value that compares unequal to itself, and this
fact can actually be used to test whether a given number is NaN.
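The self-inequality test can be checked directly; NaN is the only IEEE value that compares unequal to itself:

```julia
x = 0.0 / 0.0        # 0/0 produces NaN ("not a number")
@assert x != x       # NaN is the only value that is unequal to itself
@assert isnan(x)     # the built-in test rests on the same principle

y = 1.0 / 0.0        # 1/0 produces Inf, which *does* equal itself
@assert y == y && !isnan(y)
```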
Arbitrary-precision arithmetic
In the examples above we discussed the kinds of errors that can arise when
you do floating-point arithmetic with a finite-length mantissa. Of course it
is possible to chain together multiple floating-point registers to create a longer
mantissa and achieve any desired level of floating-point precision. (For example,
by combining two 64-bit registers we obtain a 128-bit register, of which we might
set aside 104 bits for the mantissa, roughly doubling the number of significant
digits we can store.) Software packages that do this are called arbitrary-precision
arithmetic packages; an example is the gnu mp library^5.
Be forewarned, however, that arbitrary-precision arithmetic packages are not
a panacea for numerical woes. The basic issue is that, whereas single-precision
5 [Link]
Since xnearest deviates from x0 by something like 10^−15, we find that f(xnearest)
deviates from f(x0) by something like 10^−30, i.e. the digits begin to disagree in
the 30th decimal place. But our floating-point registers can only store 15 decimal
digits, so the difference between f (x0 ) and f (xnearest ) is completely lost; the two
function values are utterly indistinguishable to our computer.
Moreover, as we consider points x lying further and further away from x0 ,
we find that f (x) remains floating-point indistinguishable from f (x0 ) over a
6 This is where the assumption that |x0| ∼ 1 comes in; the more general statement would be
that the nearest floating-point numbers not equal to x0 would be something like x0 ± 10^−15 |x0|.
wide interval near x0 . Indeed, the condition that f (x) be floating-point distinct
from f(x0) requires that (x − x0)^2 fit into a floating-point register that is also
storing f0 ≈ 1. This means that we need^7

(x − x0)^2 ≳ ε_machine

or

|x − x0| ≳ √ε_machine.    (10)
This explains why, in general, we can only pin down minima to within the
square root of machine precision, i.e. to roughly 8 decimal digits on a modern
computer.
On other hand, suppose the function g(x) has a root at x0 . In the vicinity
of x0 we have the Taylor expansion

g(x) = (x − x0) g′(x0) + (1/2)(x − x0)^2 g″(x0) + · · ·    (11)
which differs from (8) by the presence of a linear term. Now there is generally
no problem distinguishing g(x0 ) from g(xnearest ) or g at other floating-point
numbers lying within a few machine epsilons of x0 , and hence in general we will
be able to pin down the value of x0 to close to machine precision. (Note that
this assumes that g has only a single root at x0 ; if g has a double root there,
i.e. g′(x0) = 0, then this analysis falls apart. Compare this to the observation
we made earlier that the convergence of Newton’s method is worse for double
roots than for single roots.)
Figure 7 illustrates these points. The upper panel in this figure plots,
for the function f(x) = f0 + (x − x0)^2 [corresponding to equation (8) with
x0 = f0 = (1/2)f″(x0) = 1], the deviation of f(x) from its value f(x0) versus the
deviation of x from x0, as computed in standard 64-bit floating-point arithmetic.
Notice that f (x) remains indistinguishable from f (x0 ) until x deviates from x0
by at least 10−8 ; thus a computer minimization algorithm cannot hope to pin
down the location of x0 to better than this accuracy.
In contrast, the lower panel of Figure 7 plots, for the function g(x) = (x − x0)
[corresponding to equation (11) with x0 = g′(x0) = 1], the deviation of g(x) from
g(x0 ) versus the deviation of x from x0 , again as computed in standard 64-bit
floating-point arithmetic. In this case our computer is easily able to distinguish
points x that deviate from x0 by as little as 2 · 10−16 . This is why numerical
root-finding can, in general, be performed with many orders of magnitude better
precision than minimization.
7 This is where the assumptions that |f0| ∼ 1 and |f″(x0)| ∼ 1 come in; the more general
statement would be that we need (x − x0)^2 |f″(x0)| ≳ ε_machine · |f0|.
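Both claims can be checked directly in 64-bit arithmetic with the model functions used in the figures, f(x) = 1 + (x − 1)^2 and g(x) = x − 1 (i.e. x0 = 1):

```julia
f(x) = 1.0 + (x - 1.0)^2    # quadratic minimum at x0 = 1
g(x) = x - 1.0              # simple root at x0 = 1

# Near the minimum, f cannot distinguish x from x0 until |x - x0| ~ sqrt(eps):
@assert f(1.0 + 1.0e-9) == f(1.0)   # (1e-9)^2 = 1e-18 vanishes next to 1
@assert f(1.0 + 1.0e-7) != f(1.0)   # (1e-7)^2 = 1e-14 survives

# Near the root, g distinguishes x from x0 essentially at machine precision:
@assert g(1.0 + 2.0e-16) != g(1.0)
```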
[Plots: the upper panel shows f(x) − f(x0) (vertical scale up to ∼ 1.6 × 10^−15)
versus x − x0 over the range ±4 × 10^−8; the lower panel shows g(x) − g(x0)
versus x − x0 over the range ±4 × 10^−16.]
pi; however, the situation here is quite different from that of mathematica. In matlab/julia,
the symbol pi refers to the best floating-point approximation to π that is available on the
hardware you are using; this is a rational number that approximates π to (typically) 15
or so digits, but is not equal to π. In contrast, in mathematica the symbol Pi specifies
abstractly the exact number π, in the sense of the labeling/indexing scheme described in the
text. If you ask mathematica to print out a certain number of digits of Pi, you will get a
rational approximation similar to that in julia; however, in mathematica you can ask for any
number of these digits, and moreover a calculation such as Exp[I*Pi/4] will yield the exact
number (√2/2)(1 + i), represented abstractly by a symbol. In contrast, in julia the calculation
exp(im*pi/4.0) will yield a rational number that approximates (√2/2)(1 + i) to roughly
15 digits. Be aware of this distinction between symbolic math software and numerical math
software!
• Irrational algebraic numbers. Because irrational numbers like √2 cannot
be represented as ratios of integers, one might think their specification re-
quires an infinite quantity of information. However, within the irrationals
lives a subset of numbers that may be specified with finite information:
the algebraic numbers, defined as the roots of polynomials with rational
coefficients. Thus, the number x = √2, though irrational, satisfies the
equation x^2 − 2 = 0 and is thus an algebraic number; similarly, the roots
of the polynomial 7x^3 + (4/3)x^2 + 9 are algebraic numbers. If we tried to
specify these numbers to our counterpart by communicating their digits
(in any base), we would have to send an infinite amount of information;
but by sending the coefficients of the polynomials that define them we
can specify the exact numbers using only finitely much information.
• Transcendental numbers. Finally, we come to the transcendental numbers.
These are real numbers like π or e which are not the roots of any polyno-
mial with rational coefficients9 and which thus cannot be communicated
to our counterpart using only a finite amount of information.
9 If we allow non-rational coefficients, then the numbers π and e are, of course, the roots
of polynomials (for example, π is a root of x − π = 0), but that's cheating. Also, π and e are
the roots of non-polynomial equations which may be written in the form of infinite power
series: for example, π is a root of

    sin x = x − x^3/6 + x^5/120 − · · · = 0

and e − 1 is a root of

    −1 + ln(1 + x) = −1 + x − x^2/2 + x^3/3 − · · · = 0.

However, this is also cheating, because polynomials by definition must have only finitely many
terms.