BASIC COMPUTER ARITHMETIC

TSOGTGEREL GANTUMUR

Abstract. First, we consider how integers and fractional numbers are represented and
manipulated internally on a computer. Then we develop a basic theoretical framework for
analyzing algorithms involving inexact arithmetic.

Contents
1. Introduction
2. Integers
3. Euclidean division
4. Floating point numbers
5. Floating point arithmetic
6. Propagation of error
7. Summation and product

1. Introduction
There is no way to encode all real numbers by using finite length words, even if we use an
alphabet with countably many characters, because the set of finite sequences of integers is
countable. Fortunately, the real numbers support many countable dense subsets, and hence
the encoding problem of real numbers may be replaced by the question of choosing a suitable
countable dense subset. Let us look at some practical examples of real number encodings.
● Decimal notation. Examples: 36000, 2.35(75), −0.000072.
● Scientific notation. Examples: 3.6⋅10^4, −72⋅10^−6. The general form is m⋅10^e. In order
to have a unique (or near unique) representation of each number, one can impose a
normalization, such as requiring 1 ≤ ∣m∣ < 10.
● System with base/radix β. Example: m2 m1 m0 . m−1 = m2 β^2 + m1 β + m0 + m−1 β^−1. The
dot separating the integer and fractional parts is called the radix point.
● Binary (β = 2), octal (β = 8), and hexadecimal (β = 16) numbers.
● Babylonian sexagesimal (β = 60) numbers. This format is “true floating point,” in
the sense that no radix point is used, but the position of the radix point is inferred
from context. For example, 46 ∶ 2 ∶ 16 = (46 ⋅ 60^2 + 2 ⋅ 60 + 16) ⋅ 60^k where k is arbitrary.
● Mixed radix. Example: 13(360) 10(20) 3 = 13 ⋅ 360 + 10 ⋅ 20 + 3 (Mayan calendar format).
● Rationals: p/q, p ∈ Z, q ∈ N, q ≠ 0. This builds on a way to represent integers.
● Continued fraction: [4; 2, 1, 3] = 4 + 1/(2 + 1/(1 + 1/3)).
● Logarithmic number systems: b^x, where x ∈ R is represented in some way.
● More general functions: √2, e^2, sin 2, arctan 1.
As history has shown, simple base-β representations (i.e., place-value systems) seem to be
best suited for general purpose computations, and they are accordingly used in practically
all modern digital computing devices.

Date: January 28, 2018.

Inevitably, not only must real numbers be approximated by the fractional numbers that are
admissible on the particular device, but the operations on the admissible numbers themselves
must also be approximated. Although the details of how fractional numbers are implemented
may vary from device to device, it is possible to formulate a general set of assumptions
encompassing practically all implementations, which is still specific enough so as not to miss
any crucial details of any particular implementation.
In what follows, we first describe how integers and fractional numbers are handled on the
majority of modern computing devices, and will then be naturally led to the aforementioned
set of assumptions, formulated here in the form of two axioms. Once the axioms have been
formulated, all the precursory discussions become just one (albeit the most important)
example satisfying the axioms. Finally, we illustrate by examples how the axioms can be used to
analyze algorithms that deal with real numbers, which gives us an opportunity to introduce
several fundamental concepts of numerical analysis.

2. Integers
Given a base (or radix) β ∈ N, β ≥ 2, any integer a ∈ Z can be expressed as

a = ± ∑_{k=0}^∞ ak β^k, (1)

where 0 ≤ ak ≤ β − 1 is called the k-th digit of a in base β. The digits can also be thought of
as elements of Z/βZ, the integers modulo β. Obviously, each integer has only finitely many
nonzero digits, and the digits are uniquely determined by a. More precisely, we have
ak = ⌊∣a∣/β^k⌋ mod β, (2)
where ⌊x⌋ = max{n ∈ Z ∶ n ≤ x} is the floor function.
In modern computers, we have β = 2, i.e., binary numbers are used, mainly because it
simplifies hardware implementation. Recall that one binary digit is called a bit. At the
hardware level, modern CPU’s handle 64 bit sized integers, which, in the nonnegative (or
unsigned) case, would range from 0 to M − 1, where M = 2^64 ≈ 18⋅10^18. Note that in embedded
systems, 16 bit (or even 8 bit) microprocessors are more common, so that M = 2^16 = 65536.
There are several methods to encode signed integers, with the most popular one being the
so called two’s complement. This makes use of the fact that M − a ≡ −a mod M and hence
M − a behaves exactly like −a in the modulo M arithmetic:
(M − a) + b = M + b − a ≡ b − a mod M,
(M − a) ⋅ b = M b − ab ≡ −ab mod M,
(M − a) ⋅ (M − b) = M^2 − M (a + b) + ab ≡ ab mod M,        (3)
(M − a)b + r = c ⟺ −ab + r ≡ c mod M, etc.

Thus for the 64 bit architecture, the signed integers would be Z̃ = {−2^63, . . . , 2^63 − 1}, where the
numbers 2^63, . . . , 2^64 − 1 are internally used to represent −2^63, . . . , −1. Note that 2^63 ≈ 9 ⋅ 10^18.
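The identities (3) can be checked directly in Python. The following is an illustrative sketch only, not how hardware implements it; the helper names are ours, and masking with M − 1 plays the role of reduction modulo M.

```python
# Two's complement via arithmetic modulo M = 2^64.

M = 2 ** 64
MASK = M - 1

def encode(a: int) -> int:
    """Two's complement word of a signed integer in [-2^63, 2^63 - 1]."""
    return a & MASK

def decode(w: int) -> int:
    """Signed value of a 64-bit word; words >= 2^63 represent negatives."""
    return w - M if w >= 2 ** 63 else w

# The identities (3): M - a behaves exactly like -a modulo M.
a, b = 7, 5
assert (encode(-a) + encode(b)) & MASK == encode(b - a)
assert (encode(-a) * encode(-b)) & MASK == encode(a * b)
```

The point of the sketch is that no sign logic is needed: one reduction modulo M at the end of each operation handles positive and negative operands uniformly.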
Within either of the aforementioned signed and unsigned ranges, each arithmetic operation
takes a few clock cycles, with the multiplication and division being the most expensive. Hence
it makes sense to measure the complexity of an algorithm by how many multiplication and
division operations it needs to execute (unless of course the algorithm has disproportionately
many additions and/or subtractions, in which case the relevant operations would be those). If
the result of an operation goes out of the admissible range, a flag (or exception) called integer
overflow would be raised. The other type of error one should watch out for is division by zero.
COMPUTER ARITHMETIC 3

[Figure 1 plots the cost of an arithmetic operation against operand length, with a mark at 64 bits.]

Figure 1. The fixed length of standard integers can be modelled by assigning
infinite cost to any operation associated to integers longer than the standard
size (red curve). For platforms that can handle arbitrarily large integers
(bignums), the cost of each arithmetic operation grows depending on the length
of the input variables (blue curve).

Integers outside of the usual ranges are called bignums, and are handled at the software level
by representing them as arrays of “digits,” where a single “digit” could now hold numbers up
to ∼2^64 or ∼10^6 etc., since the digit-by-digit operations would be performed by the built-in
arithmetic. In other words, we use (1) with a large β. There is no such thing as “bignum
overflow,” in the sense that the allowed size of a bignum would only be limited by the com-
puter memory. Programming languages such as Python, Haskell, and Ruby have built-in
implementations of bignum arithmetic. Obviously, the complexity of an algorithm involving
bignum variables will depend on the lengths of those bignum variables. For example, addition
or subtraction of two bignums of respective lengths n and m has the complexity of O(n + m)
in general. This type of complexity measure is usually called bit complexity.

[Figure 2 shows the grade school algorithms carried out digit by digit, e.g. the addition
7725 + 5308 = 13033 and the subtraction 9631 − 9306 = 325, with carries and borrows
written above the columns.]

Figure 2. Illustration of the grade school algorithms.

When two integers are given by their digits, elementary arithmetic operations can be
performed by employing the usual algorithms that we learn in grade school. This is relevant
to both built-in and bignum arithmetics. First, we can reduce the general case to the case
where the two integers are nonnegative. Then the digits of the sum s = a + b are given by the
recurrence relation
ck β + sk = ak + bk + ck−1 , k = 0, 1, . . . , (4)
where 0 ≤ sk ≤ β − 1, and c−1 = 0. Since a0 + b0 ≤ β + (β − 2), we have c0 ≤ 1, and by induction,
all the “carry digits” satisfy ck ≤ 1. The digit-wise operations such as (4) are performed

either by elementary logic circuits in case our goal is the built-in arithmetic, or by the built-in
arithmetic in case we are talking about implementing a bignum arithmetic.
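The carry recurrence (4) can be sketched in a few lines of Python; the function name is ours, and digits are stored little-endian (least significant first).

```python
# The carry recurrence (4) for adding two nonnegative numbers
# given as little-endian digit lists.

def add_digits(a, b, beta=10):
    """Digits of a + b via c_k*beta + s_k = a_k + b_k + c_{k-1}, c_{-1} = 0."""
    s, carry = [], 0
    for k in range(max(len(a), len(b))):
        ak = a[k] if k < len(a) else 0
        bk = b[k] if k < len(b) else 0
        carry, sk = divmod(ak + bk + carry, beta)
        s.append(sk)          # 0 <= sk <= beta - 1, and carry is 0 or 1
    if carry:
        s.append(carry)
    return s

# 7725 + 5308 = 13033, the addition of Figure 2:
assert add_digits([5, 2, 7, 7], [8, 0, 3, 5]) == [3, 3, 0, 3, 1]
```

Note that `divmod` delivers the carry c_k and the digit s_k in one step, and, in accordance with the induction in the text, the carry never exceeds 1.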
To compute the difference d = a − b, we can always assume a ≥ b, and the digits of d are
dk = ck β + ak − bk − ck−1 , k = 0, 1, . . . , (5)
where the “borrowed digits” ck are uniquely determined by the condition 0 ≤ dk ≤ β − 1, and
c−1 = 0. It is straightforward to show that 0 ≤ ck ≤ 1.
Addition and subtraction can also be treated simultaneously as follows. First, we introduce
the intermediate representation, which we call the Cauchy sum or difference:

a ± b = ∑_{k=0}^∞ x∗k β^k, (6)
with x∗k = ak ± bk ∈ {1 − β, . . . , 2β − 2}. This is not a standard representation because we may
have x∗k < 0 or x∗k > β − 1. The true digits of x = a ± b can be found from
ck β + xk = x∗k + ck−1, k = 0, 1, . . . , (7)
where the integers ck ∈ {−1, 0, 1} are uniquely determined by the condition 0 ≤ xk ≤ β − 1, and
c−1 = 0. As long as we ensure that both numbers are nonnegative, and that a ≥ b in case of
subtraction, the result will have an expansion with finitely many nonzero digits.
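The normalization step (7) can be sketched as follows; the function name is ours. Python's floor division conveniently yields carries in {−1, 0, 1}, so the same code handles both carries and borrows.

```python
# Carry propagation (7): convert generalized digits x*_k = a_k ± b_k
# (possibly < 0 or > beta - 1) into proper base-beta digits.

def normalize(xstar, beta=10):
    """Proper digits from c_k*beta + x_k = x*_k + c_{k-1}, c_{-1} = 0."""
    x, carry = [], 0
    for xs in xstar:
        carry, xk = divmod(xs + carry, beta)   # floor division handles borrows
        x.append(xk)
    while carry > 0:                           # a trailing carry adds digits;
        carry, xk = divmod(carry, beta)        # a negative carry would mean a < b
        x.append(xk)
    return x

# Cauchy sum and difference of 46 and 27 (little-endian digits):
assert normalize([6 + 7, 4 + 2]) == [3, 7]   # 46 + 27 = 73
assert normalize([6 - 7, 4 - 2]) == [9, 1]   # 46 - 27 = 19
```

As in the text, nonnegativity of both numbers, and a ≥ b in the case of subtraction, guarantee that the final carry is not negative.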
To multiply two positive integers, we first define the Cauchy product
ab = (∑_{j=0}^∞ aj β^j) ⋅ (∑_{i=0}^∞ bi β^i) = ∑_{k=0}^∞ (∑_{j=0}^k aj bk−j) β^k = ∑_{k=0}^∞ p∗k β^k, (8)
where
p∗k = ∑_{j=0}^k aj bk−j, (9)
is the k-th generalized digit of ab. In general, p∗k can be larger than β − 1, and so (8) is not
the base-β expansion of the product ab. However, the proper digits 0 ≤ pk ≤ β − 1 of ab can
be found by
ck β + pk = p∗k + ck−1, k = 0, 1, . . . , (10)

where c−1 = 0. One way to find the base-β expansion of pk would be to do the summation in
(9) from the beginning in base-β arithmetic.
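The two stages of the multiplication algorithm, accumulating the generalized digits (9) and then propagating carries as in (10), can be sketched as follows; the function name is ours.

```python
# Schoolbook multiplication via the Cauchy product (8)-(10).

def multiply_digits(a, b, beta=10):
    """Digits of a*b for positive a, b given as little-endian digit lists."""
    pstar = [0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for i, bi in enumerate(b):
            pstar[j + i] += aj * bi          # p*_k = sum_j a_j b_{k-j}, cf. (9)
    p, carry = [], 0
    for ps in pstar:
        carry, pk = divmod(ps + carry, beta) # carry propagation, cf. (10)
        p.append(pk)
    while carry:                             # the carry may span several digits
        carry, pk = divmod(carry, beta)
        p.append(pk)
    return p

# 46 * 13 = 598:
assert multiply_digits([6, 4], [3, 1]) == [8, 9, 5]
```

The double loop makes the O(nm + 1) bit complexity mentioned below visible directly in the code.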
Let the lengths of the input variables a and b be
n = max{k ∶ ak ≠ 0}, m = max{k ∶ bk ≠ 0}. (11)
Then as previously discussed, the bit complexity of addition and subtraction is O(n + m + 1).
Note that “+1” is introduced into O(. . .) because according to the way n and m are defined,
the lengths of a and b are n + 1 and m + 1, respectively. On the other hand, it is easy to see
that the corresponding bit complexity of our multiplication algorithm is O(nm + 1). In fact, there exist asymptotically
much faster multiplication algorithms such as the Karatsuba algorithm and the Schönhage-
Strassen algorithm, with the latter having the bit complexity of O(n + m + 1) up to some
logarithmic factors.

3. Euclidean division
Now we consider a division algorithm in the context of the preceding section. Again, what
follows is relevant to both built-in and bignum arithmetics. We assume that a and b are
positive integers. The goal is to compute the digits of q ≥ 0 and 0 ≤ r < b, satisfying
a = qb + r.
Here and in what follows, unless otherwise specified, all variables are integer variables. The
algorithm we are going to build is an adaptation of the usual long division algorithm we study

in grade school. Without loss of generality, we can assume that n ≥ m and a > b ≥ 2, where n
and m are as defined in (11).
As a warm up, let us treat the special case m = 0 first. In this case, b has only one digit,
i.e., 2 ≤ b ≤ β − 1, so division can be performed in a straightforward digit-by-digit fashion.
This case is sometimes called “short division.” Thus the first step of the division algorithm
would be to divide an by b, as
an = qn b + rn ,
where 0 ≤ rn < b is the remainder, and qn ≥ 0 is the quotient. Obviously, qn ≤ β − 1 because
an ≤ β − 1. Computation of qn and rn should be performed in the computer’s built-in arithmetic.
To proceed further, we combine rn with the (n − 1)-st digit of a, and divide it by b, that is,
rn β + an−1 = qn−1 b + rn−1 ,
where 0 ≤ rn−1 < b. Since rn < b, we are guaranteed that qn−1 ≤ β − 1. For bignum arithmetic,
computation of qn−1 and rn−1 should be performed in the computer’s built-in arithmetic, and
since this involves division of the 2-digit number rn β + an−1 by the 1-digit number b, 2-digit
numbers must be within the reach of the built-in arithmetic. More precisely, we need β^2 ≤ M.
For built-in arithmetic, division of 2-bit numbers can be implemented either through logic
gates, or by a lookup table. The aforementioned procedure is repeated until we retrieve the
last digit a0 , and we finally get
a = an β^n + . . . + a0 = (qn b + rn) β^n + an−1 β^{n−1} + . . . + a0
  = qn b β^n + (rn β + an−1) β^{n−1} + . . . + a0
  = qn b β^n + (qn−1 b + rn−1) β^{n−1} + . . . + a0 = . . .
  = qn b β^n + qn−1 b β^{n−1} + . . . + q0 b + r0                    (12)
  = b ∑_{k=0}^n qk β^k + r0,
which shows that qk is the k-th digit of q, and that r = r0 .
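The derivation (12) is exactly the loop of short division: at each step a two-digit number r·β + a_k is divided by b. A minimal sketch (function name is ours, digits big-endian):

```python
# "Short division" (12): divide a multi-digit a by a one-digit b by
# repeatedly dividing two-digit numbers r*beta + a_k.

def short_division(a_digits, b, beta=10):
    """Quotient digits and remainder of a / b for 2 <= b < beta;
    a_digits is big-endian."""
    q, r = [], 0
    for ak in a_digits:                   # most significant digit first
        qk, r = divmod(r * beta + ak, b)  # r < b keeps qk <= beta - 1
        q.append(qk)
    return q, r

# 925 / 7 = 132 remainder 1, the first example of Figure 3:
assert short_division([9, 2, 5], 7) == ([1, 3, 2], 1)
```

The invariant r < b after each step is what guarantees that every intermediate dividend fits in two digits, matching the β^2 ≤ M requirement of the text.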

[Figure 3 shows three worked examples of short division, laid out digit by digit:
925 ÷ 7 = 132 remainder 1, 925 ÷ 9 = 102 remainder 7, and 925 ÷ 5 = 185 remainder 0.]

Figure 3. “Short division”: To divide any number by a one-digit number, it
is enough to be able to divide any two-digit number by the one-digit number.

In the general case m > 0, the overall structure of the algorithm does not change, but
there will be one crucial new ingredient in the details. Before describing the algorithm, let us
introduce a convenient new notation. For 0 ≤ k ≤ ℓ, let
a[k,ℓ] = ak + ak+1 β + . . . + aℓ β^{ℓ−k},
which is simply the number consisting of those digits of a that are numbered by k, . . . , ℓ. For
example, when β = 10 and a = 1532, we have a[2,4] = 15. The first step of our algorithm is to
compute qn−m and 0 ≤ rn−m < b satisfying
a[n−m,n] = qn−m b + rn−m . (13)

Since the number of digits of a[n−m,n] is the same as that of b, we have qn−m ≤ β − 1. Next,
we compute qn−m−1 and 0 ≤ rn−m−1 < b satisfying
rn−m β + an−m−1 = qn−m−1 b + rn−m−1 . (14)
Since rn−m < b, we are guaranteed that qn−m−1 ≤ β −1. We repeat this process until we retrieve
the last digit a0 , and as before, we have
a = a[n−m,n] β^{n−m} + an−m−1 β^{n−m−1} + . . . + a0
  = (qn−m b + rn−m) β^{n−m} + an−m−1 β^{n−m−1} + . . . + a0
  = qn−m b β^{n−m} + (rn−m β + an−m−1) β^{n−m−1} + . . . + a0
  = qn−m b β^{n−m} + (qn−m−1 b + rn−m−1) β^{n−m−1} + . . . + a0 = . . .       (15)
  = qn−m b β^{n−m} + qn−m−1 b β^{n−m−1} + . . . + q0 b + r0
  = b ∑_{k=0}^{n−m} qk β^k + r0,
which shows that qk is the k-th digit of q, and that r = r0 .

[Figure 4 shows three worked examples of long division, laid out digit by digit:
925 ÷ 77 = 12 remainder 1, 85355 ÷ 1236 = 69 remainder 71, and
123004 ÷ 7777 = 15 remainder 6349.]

Figure 4. Examples of division by multiple-digit numbers. Note that in each
stage of each case, the current digit of the result can be accurately estimated
by dividing the number formed by the first 2 or 3 digits of the intermediate
dividend, by the number formed by the first 2 digits of the divisor. For instance,
in the 1st step of the 2nd case, we have 85/12 = 7, while the correct digit is 6.
In the 2nd stage, we have 111/12 = 9, which is the correct digit.

This seems all well and good, except that there is a catch: In (13) and (14), we divide by
b, which has m + 1 digits, and we cannot rely on the built-in arithmetic since m can be large.
We encounter the divisions (13) and (14) in each step of the paper-and-pencil long division
method. There, what helps is intuition and the fact that in practice we usually have m not
too large. Here, we need to replace intuition by a well defined algorithm. We shall consider
here an approach that is based on a few crucial observations. The first observation is that
since rn−m < b and an−m−1 < β, we have
rn−m β + an−m−1 ≤ (b − 1)β + β − 1 = bβ − 1,
so that the left hand side of (14) has at most m + 2 digits. Noting that the left hand side of
(13) has m + 1 digits, we now see that (13) and (14) only require divisions of a number not
exceeding bβ − 1 by b. In other words, the original division problem a/b has been reduced to
the case a ≤ bβ − 1 (and hence with m ≤ n ≤ m + 1). This indeed helps, because if two numbers
have roughly the same number of digits, then the first few digits of both numbers can be used
to compute a very good approximation of the quotient. For instance, as we shall prove below
in Lemma 1, it turns out that under the assumption a ≤ bβ − 1, if
a[m−1,n] = q∗ b[m−1,m] + r∗, (16)
with 0 ≤ r∗ < b[m−1,m], then
q ≤ q∗ ≤ q + 1. (17)

This means that the quotient of the number formed by the first 2 or 3 digits of a, divided
by the number formed by the first 2 digits of b, is either equal to the quotient q of a divided
by b, or off by 1. The cases q ∗ = q + 1 can easily be detected (and immediately corrected) by
comparing the product q ∗ b with a. For bignum arithmetic, the division (16) can be performed
in the built-in arithmetic, because the number of digits of any of the operands therein does
not exceed 3. As this requires that 3-digit numbers (i.e., numbers of the form c0 + c1 β + c2 β^2)
should be within the reach of the built-in arithmetic, we get the slightly more stringent condition
β^3 ≤ M, compared to the previous β^2 ≤ M. It is our final requirement on β, and hence, for
instance, we can take β = 2^21 or β = 10^6. On the other hand, for built-in arithmetic, as before,
the options are logic gates and a lookup table.
We now prove the claim (17).
Lemma 1. Assume that 2 ≤ b < a ≤ bβ − 1 and hence m ≤ n ≤ m + 1, where n and m are as in
(11). Let q and 0 ≤ r < b be defined by
a = qb + r, (18)
and with some (small) integer p ≥ 0, let q ∗ and 0 ≤ r∗ < b[m−p,m] be defined by
a[m−p,n] = q ∗ b[m−p,m] + r∗ . (19)
Then we have
q ≤ q∗ < q + 1 + β^{1−p}, (20)
and so q ≤ q∗ ≤ q + 1 holds as long as p ≥ 1.
Proof. Let a∗ = a[m−p,n] β^{m−p}, let b∗ = b[m−p,m] β^{m−p}, and redefine r∗ to be r∗ β^{m−p}. Then
rescaling of the equation (19) by the factor β^{m−p} gives
a∗ = q∗ b∗ + r∗, with 0 ≤ r∗ ≤ b∗ − β^{m−p}. (21)
We also have
β^m ≤ b∗ ≤ b < b∗ + β^{m−p} and β^n ≤ a∗ ≤ a < a∗ + β^{m−p}. (22)
We start by writing
q − q∗ = (a − r)/b − (a∗ − r∗)/b∗ = a/b − a∗/b∗ + r∗/b∗ − r/b. (23)

Then invoking b ≥ b∗ and r ≥ 0, we get
q − q∗ ≤ (a − a∗)/b∗ + r∗/b∗ < (β^{m−p} + r∗)/b∗ ≤ 1, (24)
where we have used a − a∗ < β^{m−p} in the second step, and r∗ ≤ b∗ − β^{m−p} in the third step.
This shows that q − q ∗ < 1, and hence q ≤ q ∗ .
To get an upper bound on q∗, we proceed as
q∗ − q = a∗/b∗ − a/b + r/b − r∗/b∗ ≤ a/b∗ − a/b + r/b = (a/b) ⋅ (b − b∗)/b∗ + r/b, (25)
where we have simply used a∗ ≤ a and r∗ ≥ 0. Now, the upper bounds a < bβ, b − b∗ < β^{m−p},
b∗ ≥ β^m, and r ≤ b − 1 give
q∗ − q < (bβ/b) ⋅ (β^{m−p}/β^m) + (b − 1)/b = β^{1−p} + 1 − 1/b < β^{1−p} + 1, (26)
which yields the desired estimate q∗ < q + 1 + β^{1−p}. The proof is complete. □

Finally, let us put the algorithm together in a coherent form.


Algorithm 1: Pseudocode for long division
Data: Integers a = ∑_{k=0}^n ak β^k and b = ∑_{k=0}^m bk β^k in radix β, satisfying a > b ≥ 2.
Result: Digits of the quotient q and the remainder r satisfying a = qb + r and 0 ≤ r < b.
Without loss of generality, assume an ≠ 0 and bm ≠ 0.
Compute q∗ and 0 ≤ r∗ < b[m−1,m] satisfying a[n−1,n] = q∗ b[m−1,m] + r∗, cf. (13).
if q∗ b ≤ a[n−m,n] then
    qn−m := q∗
else
    qn−m := q∗ − 1
end
rn−m := a[n−m,n] − qn−m b
for k := n − m to 1 do
    Let c := rk β + ak−1
    if c = 0 then Set qk−1 := 0 and rk−1 := 0, and go to the next iteration.
    Let the expansion of c be c = ∑_{i=0}^ℓ ci β^i with cℓ ≠ 0.
    Compute q∗ and 0 ≤ r∗ < b[m−1,m] satisfying c[m−1,ℓ] = q∗ b[m−1,m] + r∗, cf. (14).
    if q∗ b ≤ c then
        qk−1 := q∗
    else
        qk−1 := q∗ − 1
    end
    rk−1 := c − qk−1 b
end
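A minimal Python sketch of Algorithm 1 follows; the function name is ours, and Python's `//` on small operands stands in for the built-in arithmetic.

```python
# Long division with quotient-digit estimation (Lemma 1 with p = 1).

def long_division(a, b, beta=10):
    """Divide positive integers a > b >= 2. Each quotient digit is first
    estimated from the leading digits, then corrected downward by at most 1."""
    def digits(x):  # big-endian digit list of x in base beta
        d = []
        while x:
            d.append(x % beta)
            x //= beta
        return d[::-1]

    da = digits(a)
    m = len(digits(b)) - 1                          # b has m + 1 digits
    b_top = b // beta ** (m - 1) if m >= 1 else b   # b[m-1,m], the leading 2 digits
    q, r = 0, 0
    for ak in da:
        c = r * beta + ak                           # current dividend, at most b*beta - 1
        c_top = c // beta ** (m - 1) if m >= 1 else c
        qk = c_top // b_top                         # estimate q*: equals qk or qk + 1
        if qk * b > c:                              # overshoot by 1: correct it
            qk -= 1
        r = c - qk * b                              # 0 <= r < b
        q = q * beta + qk
    return q, r

# The examples of Figure 4:
assert long_division(925, 77) == (12, 1)
assert long_division(85355, 1236) == (69, 71)
```

Note how Lemma 1 is used: the estimate from the leading digits is never too small and overshoots by at most 1, so a single comparison against q∗·b suffices to correct it.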

Exercise 1. Estimate the bit complexity of the long division algorithm.

4. Floating point numbers


Since the real numbers are uncountable, in general, they must be approximately represented.
Perhaps the simplest proposal would be to use the integers internally, but to interpret
them in such a way that m ∈ Z represents κm, where κ is some fixed small constant, such as
κ = 0.001. This gives us access to the subset κZ̃ ⊂ R, where Z̃ ⊂ Z is the set of all admissible
integers in the given setting. With reference to how we represent integers, it is convenient to
have κ = β^e, where e is a fixed integer, so that the accessible numbers are of the form
a = m β^e, m ∈ Z̃. (27)
For example, with β = 10 and e = −2, we get the numbers such as 1.00, 1.01, . . . , 1.99, and with
β = 10 and e = −3, we get 1.000, 1.001, . . . , 1.999, etc. These are called fixed point numbers,
which can be imagined as a uniformly spaced net placed on the real number line. In practice,
there is a trade-off between precision and range: For example, supposing that we can store
only 4 decimal digits per (unsigned) number, taking e = −1 gives the numbers 000.0, . . . , 999.9,
while the choice e = −3 would result in 0.000, . . . , 9.999. With 64 bit integers, if we take e = −30
(and β = 2), which corresponds to 30 log10 2 ≈ 9 digits after the radix point in decimals, we can
cover an interval of width 2^34 ≈ 17 ⋅ 10^9. Thus, fixed point numbers are only good for working
with moderately sized quantities, although the implementation is straightforward and they
offer a uniform precision everywhere within the range.
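The precision/range trade-off of (27) can be made concrete with exact rationals; the function name is ours.

```python
# The precision/range trade-off for unsigned fixed point numbers m*beta**e.
from fractions import Fraction

def fixed_point_grid(digits, e, beta=10):
    """Spacing and largest value of the numbers m*beta**e,
    0 <= m < beta**digits."""
    step = Fraction(beta) ** e
    largest = (beta ** digits - 1) * step
    return step, largest

# With 4 decimal digits: e = -1 covers 0 ... 999.9 in steps of 0.1,
# while e = -3 covers only 0 ... 9.999, but in steps of 0.001.
assert fixed_point_grid(4, -1) == (Fraction(1, 10), Fraction('999.9'))
assert fixed_point_grid(4, -3) == (Fraction(1, 1000), Fraction('9.999'))
```

Shrinking e refines the grid and shrinks the range by exactly the same factor, which is the uniform-precision property noted in the text.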

A simple modification of the above scheme yields a system that can handle extremely wide
ranges: We let the exponent e be a variable in (27), leading to floating point numbers
a = m β^e, m ∈ M, e ∈ E, (28)
where M ⊂ Z and E ⊂ Z. In this context, m and e are called the mantissa (or significand)
and the exponent of a, respectively.
● Note that the role of e is simply to tell the location of the radix point. In practice,
e is moderate, so it does not require much storage. For example, the volume of the
observable universe is roughly 10^185 cubic Planck lengths.
● Hence the main information content of a number is in the mantissa. Thus, 1000002
and 2890.032 contain basically the same amount of information.
● The number of digits in the mantissa (excluding the zeroes in the beginning and end)
is called the number of significant digits.
Modern computers handle floating point numbers at the hardware level, following the
predominant IEEE 754 standard. Perhaps the most popular among the formats provided by this
standard is the double precision format, which uses 64 bits per number as follows.
● 1 bit for the sign of m.
● 52 bits for the magnitude of m. The first bit of m is not stored here, and is taken to
be 1 in the so called normalized regime. So the smallest positive value of m in this
regime is 2^52, the next number is 2^52 + 1, and so on, the largest value is 2^53 − 1. In
decimals, 52 bits can be rephrased as roughly 16 significant digits.
● 11 bits for e. It gives 2048 possible exponents, but 2 of these values are used as special
flags, leaving 2046 possibilities, which we use as: E = {−1074, −1073, . . . , 971}.
● One of the special values of e is to signal an underflow, and activate the denormalized
regime. In this regime, the first bit of m is implied to be 0, and e = −1074.
● Depending on the value of m, the other special value of e is used to signal signed
infinities (overflow) or NaN (not a number, such as 0/0).
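The three bit fields above can be inspected directly with Python's struct module; the function name is ours.

```python
# Unpacking the sign, exponent, and mantissa fields of an IEEE 754 double.
import struct

def double_fields(x: float):
    """Return (sign, biased exponent, stored mantissa) of a double."""
    (bits,) = struct.unpack('>Q', struct.pack('>d', x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits; 0 and 2047 are the special flags
    mantissa = bits & ((1 << 52) - 1)     # 52 stored bits; the leading 1 is implicit
    return sign, exponent, mantissa

assert double_fields(1.0) == (0, 1023, 0)            # 1.0 = +1.0 * 2^0
assert double_fields(float('inf')) == (0, 2047, 0)   # special exponent value
```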
Example 2. To have a better understanding, let us consider a simplified model, where β = 10,
two digits are allowed for m, and E = {−1, 0, 1}. In the normalized regime, the mantissa
must satisfy 10 ≤ m ≤ 99, and hence the smallest positive number is 10 ⋅ 10−1 = 1, and the
largest number is 99 ⋅ 101 = 990. In the denormalized regime, the nonnegative numbers are
0, 0.1, . . . , 0.9. This situation is illustrated in Figure 5.
As for the double precision format, the smallest normalized positive number is
a_∗ = 2^52 ⋅ 2^{−1074} = 2^{−1022} ≈ 10^{−308}, (29)
and the largest possible number is
a^∗ = (2^53 − 1) ⋅ 2^971 ≈ 2^1024 ≈ 10^308. (30)
If the result of a computation goes beyond a^∗ in absolute value, we get an overflow. On the
other hand, an underflow occurs when a computation produces a number that is smaller than
a_∗ in absolute value. In contrast to older formats where 0 was the only admissible number
in the so-called underflow gap (−a_∗, a_∗), the current standard supports gradual underflow,
meaning that it has denormalized numbers sprinkled throughout this gap. However, gradual
or not, an underflow means a contaminated outcome, and should be avoided at all cost.
Furthermore, the distance between two consecutive double precision numbers behaves like
δx ∼ ε∣x∣, where ε ≈ 10^{−16}, (31)
in the normalized regime, and
δx ∼ 2^{−1022} ε, (32)

[Figure 5 shows the simplified system of Example 2 on the number line, with
the regions e = −1, e = 0, e = 1, and the overflow region marked.]

Figure 5. The red dots at the left signify the denormalized regime (the un-
derflow gap). The graph of a logarithm function is shown in the background.

in the denormalized regime. This can be thought of as the “resolution” of the double precision
floating point numbers.
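The spacing (31)-(32) can be inspected with `math.ulp` (available since Python 3.9), which returns the gap between a float and the next one up.

```python
# The "resolution" of double precision numbers, cf. (31) and (32).
import math

eps = 2.0 ** -52   # the spacing at 1.0; the epsilon scale of (31)

assert math.ulp(1.0) == eps
assert math.ulp(2.0 ** 100) == eps * 2.0 ** 100   # spacing ~ eps*|x| when normalized
assert math.ulp(0.0) == 2.0 ** -1074              # constant spacing when denormalized
```

The exact equalities hold here because the sample points are powers of two; in general the spacing tracks ε∣x∣ only up to a factor of β.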
Remark 3. It is clear that the normalized regime is where we want to be in the course of
any computation, and one needs a theoretical guarantee that the algorithm stays within this
regime. Put differently, we need to analyze the potential situations where underflow or
overflow (or NaN for that matter) could be produced, and should modify the algorithm to
account for those. Once this is taken care of, the upper and lower limits of e become irrelevant,
and we can set E = Z in (28), leading to what can be called generalized floating point numbers
R̃ = R̃(β, N) = { ± ∑_{k=0}^N ak β^{k+e} ∶ 0 ≤ ak ≤ β − 1, e ∈ Z}. (33)
Note that in order to ensure uniqueness of a representation a = ± ∑_{k=0}^N ak β^{k+e} for any given
a ∈ R̃, we can assume aN ≠ 0 unless a = 0.
Thus in the double precision format, or in any other reasonable setting, the floating point
numbers can be modelled by a set R̃ ⊂ R, satisfying the following assumption.
Axiom 0. There exist ε ≥ 0 and a map fl ∶ R → R̃ such that
∣fl(x) − x∣ ≤ ε∣x∣ for all x ∈ R. (34)
The parameter ε is called the machine epsilon or the machine precision.
Note that in the context of (33), we may take fl(x) = max{y ∈ R̃ ∶ y ≤ x} and ε = β^{−N}, or
more precisely, ε = (β − 1)β^{−N−1}. By taking fl ∶ R → R̃ as rounding to the closest floating point
number, we can even get ε = ½ β^{−N}, but this does not give any improvement for β = 2.
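A toy realization of the map fl of Axiom 0 for R̃(β, N) of (33) can be sketched with exact rationals, here by truncation toward zero (which for x ≥ 0 coincides with fl(x) = max{y ≤ x} above); the function name is ours.

```python
# A toy fl: keep the leading N+1 significant base-beta digits of x.
from fractions import Fraction

def fl(x, beta=10, N=2):
    """Truncate x to an element of R~(beta, N), cf. (33) and (34)."""
    x = Fraction(x)
    if x == 0:
        return x
    e = 0
    while abs(x) >= Fraction(beta) ** (e + N + 1):   # find e with
        e += 1                                       # beta^(e+N) <= |x| < beta^(e+N+1)
    while abs(x) < Fraction(beta) ** (e + N):
        e -= 1
    m = int(x / Fraction(beta) ** e)                 # truncate the mantissa
    return m * Fraction(beta) ** e

# |fl(x) - x| <= eps*|x| with eps = beta^(-N), cf. (34):
x = Fraction(10201, 10000)                # 1.0201
assert fl(x) == Fraction(102, 100)        # 1.02
assert abs(fl(x) - x) <= Fraction(1, 10 ** 2) * abs(x)
```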

5. Floating point arithmetic


Let us now discuss the basic arithmetic operations on R̃. An immediate observation is
that R̃ is in general not closed under arithmetic operations. For instance, thinking of β = 10

and N = 2 in (33), we have 1.01 ∈ R̃, but 1.01 × 1.01 = 1.0201 ∉ R̃. Therefore, we need to
approximate these operations. As a benchmark, we may consider fl(x + y), fl(x − y), fl(x × y),
etc., as approximations to x + y, x − y, x × y, etc., respectively, for which we have the estimates
∣fl(x ± y) − (x ± y)∣ ≤ ε∣x ± y∣, ∣fl(x × y) − (x × y)∣ ≤ ε∣x × y∣, etc. (35)
In practice, however, it would be inefficient to compute x ± y exactly, in order to produce
fl(x ± y), if x and y have very different magnitudes, as in, e.g., 1050 + 10−50 . Hence we need a
more direct way to approximate x ± y.
Example 4. It turns out that a simple truncation is enough when the signs of the two
summands are the same. Thus thinking of N = 2 and β = 10 in (33), in order to compute
101 + 2.6, we truncate 2.6 to 2, and use s̃ = 101 + 2 as an approximate sum.
Lemma 5 (Truncated sum). Let R̃ be as in (33), and suppose that a, b ∈ R̃ are given by
a = ∑_{k=0}^N ak β^{k+e}, and b = ∑_{k=0}^N bk β^{k+e−m}, (36)
with aN ≠ 0 and m ≥ 0. Then the “truncated sum”
s̃ = a + ∑_{k=m}^N bk β^{k+e−m}, (37)
satisfies the error bound
∣s̃ − (a + b)∣ ≤ β^{−N} (a + b). (38)
Proof. First of all, note that we can set e = 0 since we are only interested in relative errors.
Then we proceed as
0 ≤ a + b − s̃ = ∑_{k=0}^{m−1} bk β^{k−m} ≤ ∑_{k=0}^{m−1} (β − 1)β^{k−m}
  = (β − 1)(1 + β + . . . + β^{m−1})β^{−m}                    (39)
  = (β^m − 1)β^{−m} ≤ 1,
where we have used the fact that bk ≤ β − 1. On the other hand, we have
a + b ≥ a ≥ aN β^N ≥ β^N, or 1 ≤ β^{−N}(a + b), (40)
which completes the proof. □
Example 6. The situation with subtraction (in the sense that the summands have differing
signs) is slightly more complicated. To illustrate, consider the subtraction 10.1 − 9.95 = 0.15,
and again thinking of N = 2 and β = 10 in (33), note that the truncated subtraction gives
10.1 − 9.9 = 0.2. Then the absolute error is 0.2 − 0.15 = 0.05, and the relative error is
0.05/0.15 = 1/3. This is much larger than the desired level 10^{−2}.
Nevertheless, the solution is very simple: We use a few extra “guard digits” to perform
subtraction, and then round the result to a nearest floating point number. In fact, a single
guard digit is always sufficient. So for the preceding example, with 1 guard digit, we get
10.1 − 9.95 = 0.15 as the intermediate result, and since 0.15 ∈ R̃, subtraction is exact. If
we were computing, say, 15.1 − 0.666, with 1 guard digit, the intermediate result would be
15.10 − 0.66 = 14.44, and after rounding, our final result would be 14.4.
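Both examples can be replayed in a toy system with β = 10, N = 2, j = 1 (and e = 0, so a's digits are integers, as in the proof of Lemma 7 below); the function names are ours, and Python's round(), which breaks ties to even, serves as an acceptable "nearest" rnd.

```python
# A toy check of subtraction with guard digits, cf. (43).
from fractions import Fraction

def rnd(x, beta=10, N=2):
    """Round x to a nearest number with N+1 significant base-beta digits."""
    if x == 0:
        return Fraction(0)
    e = 0
    while abs(x) >= Fraction(beta) ** (e + N + 1):
        e += 1
    while abs(x) < Fraction(beta) ** (e + N):
        e -= 1
    unit = Fraction(beta) ** e
    return round(x / unit) * unit

def guard_subtract(a, b, beta=10, N=2, j=1):
    """Drop the digits of b lying more than j places below a's last digit
    (keeping j guard digits), then round the difference, cf. (43)."""
    b_trunc = Fraction(int(Fraction(b) * beta ** j), beta ** j)
    return rnd(Fraction(a) - b_trunc, beta, N)

# 10.1 - 9.95, scaled by 10 so a's digits are integers: the result is exact.
assert guard_subtract(101, Fraction('99.5')) == Fraction('1.5')
# 15.1 - 0.666, scaled by 10: intermediate 151 - 6.6 = 144.4, rounded to 144.
assert guard_subtract(151, Fraction('6.66')) == 144
```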
In the following lemma, we take R̃ as in (33), and let rnd ∶ R → R̃ be an operation satisfying
∣x − rnd(x)∣ ≤ δβ^{−N}∣x∣, x ∈ R, (41)
with some constant 1/2 ≤ δ ≤ (β − 1)/β. Note that if this operation is rounding to a nearest
floating point number, then δ = 1/2, whereas for a simple truncation we can set δ = (β − 1)/β.

Lemma 7 (Subtraction with guard digits). Let
a = ∑_{k=0}^N ak β^{k+e}, and b = ∑_{k=0}^N bk β^{k+e−m}, (42)
with aN ≠ 0 and m ≥ 0. Then the truncated subtraction with j ≥ 1 guard digits
d̃ = rnd(a − ∑_{k=m−j}^N bk β^{k+e−m}), (43)
satisfies the error bound
∣d̃ − (a − b)∣ ≤ (δ + (1 + δ)β^{1−j}) β^{−N} (a − b). (44)
In particular, taking j = 1 and δ = 1/2, we can ensure ∣d̃ − (a − b)∣ ≤ 2β^{−N} (a − b).
Proof. Without loss of generality, assume e = 0. Let
d∗ = a − ∑_{k=m−j}^N bk β^{k−m}, (45)
be the intermediate result, and note that the final result d̃ = rnd(d∗) satisfies
∣d̃ − d∗∣ ≤ δβ^{−N} d∗. (46)
Since the intermediate result is exact (meaning d∗ = a − b) if m ≤ j, we can further assume
that m > j ≥ 1. Then we proceed similarly to the proof of Lemma 5, and get
0 ≤ d∗ − (a − b) = ∑_{k=0}^{m−j−1} bk β^{k−m} ≤ ∑_{k=0}^{m−j−1} (β − 1)β^{k−m}
  = (β − 1)(1 + β + . . . + β^{m−j−1})β^{−m}                    (47)
  = (β^{m−j} − 1)β^{−m} ≤ β^{−j}.
A lower bound on a − b can be obtained as follows.
a − b ≥ aN β^N − ∑_{k=0}^N bk β^{k−m} ≥ β^N − ∑_{k=0}^N (β − 1)β^{k−m}
  = β^N − (β − 1)(1 + β + . . . + β^N)β^{−m}                    (48)
  = β^N − (β^{N+1} − 1)β^{−m} ≥ β^N − β^{N−1} + β^{−m}
  ≥ (β − 1)β^{N−1}.
Finally, an application of the triangle inequality, in combination with (46) and (47), gives
∣d̃ − (a − b)∣ ≤ ∣d̃ − d∗∣ + ∣d∗ − (a − b)∣ ≤ δβ^{−N} d∗ + β^{−j}
  ≤ δβ^{−N}(β^{−j} + a − b) + β^{−j}                            (49)
  ≤ δβ^{−N}(a − b) + ((1 + δβ^{−N})β^{1−j}/(β − 1)) β^{−N}(a − b),
where we have used d∗ ≤ β^{−j} + a − b in the penultimate step, and 1 ≤ (β^{1−N}/(β − 1))(a − b)
from (48) in the last step. The proof is complete, as β ≥ 2 and N ≥ 0. □
Turning to multiplication and division, recall that
(m1 β^{e1}) × (m2 β^{e2}) = (m1 m2) β^{e1+e2},
(m1 β^{e1}) / (m2 β^{e2}) = (m1/m2) β^{e1−e2}.    (50)

To compute m1 m2 exactly, one should be able to work with ∼ 2N significant digits. The
exact result can then be rounded to N digits. A more efficient choice would be a truncated
multiplication (also called a “short product”), cf. Exercise 2. For division, we may simply
apply the long division algorithm until the first N digits are obtained. Hence multiplication
and division of floating point numbers are completely straightforward to implement, provided
that we have good algorithms for integer arithmetic.
These discussions justify the following general assumption.
Axiom 1. For each ⋆ ∈ {+, −, ×, /}, there exists a binary operation ⍟ ∶ R̃ × R̃ → R̃ such that
∣x ⋆ y − x ⍟ y∣ ≤ ε∣x ⋆ y∣, x, y ∈ R̃, (51)
where of course, division by zero is excluded.
Remark 8. Once we have formulated the axioms, the idea is obviously to use them as a
foundation, so that the analysis (as well as the design) of algorithms does not depend on the
specific details of how floating point numbers were implemented. For instance, we do not need to be
concerned with the parameters β and N , or even with what exactly the set R̃ is. All we need
is the knowledge that there is a set R̃ ⊂ R satisfying the axioms with some parameter ε ≥ 0.
In this regard, any 5-tuple (R̃, ⊕, ⊖, ⊗, ⊘) satisfying Axiom 0 and Axiom 1 may be called a
floating point system with machine precision ε. Then the special case with R̃ = R and ε = 0
would be called exact arithmetic.
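For instance, IEEE 754 double precision arithmetic (the `float` type of most programming languages) behaves as a floating point system with β = 2 and ε = 2⁻⁵³. The sketch below probes Axiom 1 empirically in Python; the helper name, the trial counts, and the sampling range are our own choices, and of course a finite test proves nothing.

```python
import random
from fractions import Fraction

# Unit roundoff of IEEE 754 double precision (beta = 2, 53-bit mantissa).
EPS = 2.0 ** -53

def max_relative_error(trials=1000, seed=0):
    """Empirically check (51) for +, -, *, / on random doubles."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(trials):
        x, y = rng.uniform(1.0, 2.0), rng.uniform(1.0, 2.0)
        # Fraction(...) converts a float exactly, so the comparisons are exact.
        pairs = [
            (Fraction(x) + Fraction(y), x + y),
            (Fraction(x) - Fraction(y), x - y),
            (Fraction(x) * Fraction(y), x * y),
            (Fraction(x) / Fraction(y), x / y),
        ]
        for exact, computed in pairs:
            if exact != 0:
                worst = max(worst, abs((Fraction(computed) - exact) / exact))
    return float(worst)

# The observed relative error never exceeds the machine precision EPS.
assert max_relative_error() <= EPS
```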
Remark 9. In case one needs more precision than allowed by the default floating point
numbers, a robust option is arbitrary precision formats, which are usually implemented at the
software level. Arbitrary precision simply means that the mantissa of a number is now a bignum,
and the arithmetic operations can be performed to stay within any given error tolerance. The
cost of operations must then depend on the number of significant digits, as in Figure 1.
Exercise 2 (Short product). We have reduced multiplication of floating point numbers to
multiplication of two positive integers, cf. (50). Recall from Section 2 the multiplication algorithm based on the Cauchy product

ab = (∑_{j=0}^{N} a_j β^j) ⋅ (∑_{i=0}^{N} b_i β^i) = ∑_{k=0}^{∞} (∑_{j=0}^{k} a_j b_{k−j}) β^k,   (52)

cf. (8). We assume the normalization a_N ≠ 0 and b_N ≠ 0. With the intent of saving resources,
let us ignore the terms with k < m in the latter sum, with the truncation parameter m, that
is, we replace the product ab by

p̃ = ∑_{k=m}^{∞} (∑_{j=0}^{k} a_j b_{k−j}) β^k.

Show that

0 ≤ ab − p̃ ≤ ab ⋅ β^{m+3−2N}.
What would be a good choice for the value of m, in the context of floating point multiplication?
Exercise 3 (Sterbenz lemma). With R̃ = R̃(β, N) as in (33), let a, b ∈ R̃ be positive numbers
satisfying a/2 ≤ b ≤ 2a. Then show that a ⊖ b = a − b.
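Before attempting the proof, the claim can be probed numerically in binary double precision (β = 2). The helper below is our own illustration, checking random pairs with exact rational arithmetic as the reference; it is evidence, not a proof.

```python
import random
from fractions import Fraction

def sterbenz_holds(trials=10000, seed=1):
    """Check that a - b is computed exactly whenever a/2 <= b <= 2a."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(1e-3, 1e3)
        b = rng.uniform(0.5 * a, 2.0 * a)
        b = min(max(b, 0.5 * a), 2.0 * a)   # keep b safely inside [a/2, 2a]
        # Exact rational difference vs. the floating point subtraction.
        if Fraction(a) - Fraction(b) != Fraction(a - b):
            return False
    return True

assert sterbenz_holds()
```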
6. Propagation of error
A numerical algorithm is an algorithm that takes a finite sequence of floating point num-
bers (and possibly integers) as input, and produces a finite sequence of floating point numbers
(and possibly integers) as output. Here a floating point number means an element of R̃ as in
the axioms. The algorithm itself can contain the usual logical constructs such as conditional
statements and loops, and a set of predefined operations on integers and floating point num-
bers, including the arithmetic operations, comparisons, and evaluation of some elementary
functions. If we fix any particular input value, then after unrolling the loops, and replacing
the conditional statements by the taken branches, the algorithm becomes a simple linear se-
quence of operations. This sequence in general depends on details of the floating point system
(R̃, ⊕, ⊖, ⊗, ⊘) that are more fine-grained than the axioms (i.e., on how the system is really
implemented), but the idea is that we should avoid algorithms whose performance critically
depends on those details, so that the axioms provide a solid foundation for all analysis. Hence
in the end, we are led to the analysis of sequences such as
R³ →^{(1, 1, exp)} R³ →^{(×, 1)} R² →^{+} R,
R̃³ →^{(1, 1, ẽxp)} R̃³ →^{(⊗, 1)} R̃² →^{⊕} R̃,   (53)
where the upper row corresponds to exact arithmetic, and the lower row to inexact arithmetic.
To clarify, the preceding sequence can be rewritten as
(x, y, z) ↦ (x, y, exp z) ↦ (xy, exp z) ↦ xy + exp z, (54)
and so it approximates the function f(x, y, z) = xy + exp z by f̃(x, y, z) = (x ⊗ y) ⊕ ẽxp(z),
with ẽxp ∶ R̃ → R̃ being some approximation of exp. In the context of numerical algorithms,
theoretical analysis of perturbations in the output due to the inexactness of floating point
arithmetic is known as roundoff error analysis. We illustrate it by the example in (54).
● If all operations except the last step were exact, then we would be computing a ⊕ b,
which is an approximation of a + b, where a = xy and b = exp(z).
● However, those operations are inexact, so the input to the last step is not the “true”
(or “intended”) values a and b, but their approximations ã = x ⊗ y and b̃ = ẽxp(z).
● Hence the computed value ã ⊕ b̃ will be an approximation of ã + b̃.
We can put it in the form of a diagram.

    (a, b)  ⇝  (ã, b̃)
      |           |
      +           +
      ↓           ↓
    a + b  ⇝  ã + b̃  ⇝  ã ⊕ b̃        (55)
Here the squiggly arrows indicate perturbations from the “true values,” due to, e.g., inexact
arithmetic. The error committed in the lower right squiggly arrow can be accounted for with
the help of Axiom 1:
∣ã ⊕ b̃ − (ã + b̃)∣ ≤ ε∣ã + b̃∣, (56)
or equivalently,
ã ⊕ b̃ = (1 + η)(ã + b̃) for some ∣η∣ ≤ ε. (57)
On the other hand, the behaviour of the error committed in the lower left squiggly arrow is
something intrinsic to the operation of summation itself, since this simply reflects how the
sum behaves with respect to inexact summands. Thus, putting
ã = a + ∆a, b̃ = b + ∆b, (58)
we have
ã + b̃ − (a + b) = ∆a + ∆b, (59)
where, e.g., ∆a = ã − a is called the absolute error in ã. Recall that we have access to the
approximation ã, but do not have access to the “true value” a. We may read (59) as: Absolute
errors are simply combined during summation.
Next, dividing (59) through by a + b, we get

ε_{a+b} ∶= (ã + b̃ − (a + b)) / (a + b) = (a ε_a + b ε_b) / (a + b),   (60)

where, e.g., ε_a = ∆a/a is called the relative error in ã. We see that relative errors get combined,
with weights a/(a + b) and b/(a + b), respectively. In particular, if a + b ≈ 0, then the relative error ε_{a+b}
can be large, potentially catastrophic.
Remark 10. In the floating point context, the aforementioned phenomenon of potentially
catastrophic growth in relative error is called cancellation of digits. For example, consider

      126.1
    − 125.8
    ───────
        0.3

If we suppose that 126.1 and 125.8 had errors of size ≈ 0.1 in them, meaning that all their digits
were significant, the error in the result 0.3 can be as large as ≈ 0.2, which is barely 1 significant
digit of accuracy. Since the true result could be as small as ≈ 126.0 − 125.9 = 0.1, the relative error
of the result can only be bounded by ≈ 200%. The origin of the term “cancellation of digits”
is also apparent: The first 3 digits of the two numbers in the input cancelled each other. We
should stress here that the root cause of this phenomenon is the intrinsic sensitivity of the sum
(or rather, subtraction) with respect to perturbations in the summands. It has nothing to do
with floating point arithmetic per se (in fact, the subtraction in the preceding example was
exact). However, cancellation of digits is a constant enemy of numerical algorithms, precisely
because inputs are almost never exact: in most cases they are themselves results of inexact operations.
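The effect is easy to reproduce in Python. In the sketch below, `a_true` and `b_true` are made-up “true values” chosen purely for illustration; only the mechanism matters.

```python
# Cancellation of digits: subtracting nearly equal numbers amplifies
# the relative error already present in the inputs.
a_true, b_true = 126.05, 125.85   # hypothetical "true" values
a, b = 126.1, 125.8               # inputs known only to within ~0.1

diff = a - b                      # computed difference: about 0.3
true_diff = a_true - b_true       # intended difference: about 0.2

rel_err_inputs = abs(a - a_true) / a_true            # ~0.04%: tiny
rel_err_output = abs(diff - true_diff) / true_diff   # ~50%: huge

assert rel_err_inputs < 1e-3
assert rel_err_output > 0.4
```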
Turning back to (60), assuming that ∣ε_a∣ ≤ ε and ∣ε_b∣ ≤ ε, we get

∣ε_{a+b}∣ ≤ (∣a∣ + ∣b∣)/∣a + b∣ ⋅ ε.   (61)

Here, we can think of the quantity

κ₊(a, b) = (∣a∣ + ∣b∣)/∣a + b∣,   (62)

as expressing the sensitivity of a + b with respect to perturbations in a and b. This is called
the condition number of addition.
We proceed further, by using the triangle inequality and the estimate (56), as

∣ã ⊕ b̃ − (a + b)∣ ≤ ∣ã ⊕ b̃ − (ã + b̃)∣ + ∣ã + b̃ − (a + b)∣ ≤ ε∣ã + b̃∣ + ε_{a+b}∣a + b∣,   (63)

where we are now thinking of ε_{a+b} as a manifestly nonnegative quantity (i.e., we denoted ∣ε_{a+b}∣
by ε_{a+b}). Then invoking

∣ã + b̃∣ ≤ (1 + ε_{a+b})∣a + b∣,   (64)

and (61), we end up with

∣ã ⊕ b̃ − (a + b)∣ / ∣a + b∣ ≤ ε(1 + ε_{a+b}) + ε_{a+b} ≤ ((1 + ε)κ₊(a, b) + 1)ε,   (65)

which takes into account both the inexactness of the input, and the inexactness of the summation
operation.
Let us do the same analysis for multiplication. We start with
ãb̃ − ab = b∆a + a∆b + ∆a∆b, (66)
and division by ab yields

ε_{ab} ∶= (ãb̃ − ab) / (ab) = ε_a + ε_b + ε_a ε_b ≈ ε_a + ε_b.   (67)

Thus, relative errors are simply combined during multiplication. If we assume that ∣ε_a∣ ≤ ε
and ∣ε_b∣ ≤ ε, we get

∣ε_{ab}∣ ≤ 2ε + ε² ≤ κ×(a, b) ⋅ ε,   (68)

where the condition number of multiplication is

κ×(a, b) ≈ 2.   (69)

The full analysis involving inexact multiplication is exactly the same as in the case of addition,
and the final result we obtain is

∣ã ⊗ b̃ − ab∣ / ∣ab∣ ≤ ε(1 + ε_{ab}) + ε_{ab} ≤ ((1 + ε)κ×(a, b) + 1)ε.   (70)
Qualitatively, of course, κ×(a, b) remains bounded independently of a and b, while κ₊(a, b)
can become unbounded, exhibiting cancellation of digits.
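This contrast is easy to observe numerically. In the sketch below the inputs and perturbation sizes are arbitrary choices of ours, with a + b ≈ 10⁻⁶ so that κ₊(a, b) ≈ 2 × 10⁶.

```python
# Products are well conditioned (kappa_x ~ 2): small relative perturbations
# of the factors perturb the product by a comparable relative amount.
# Sums near zero are ill conditioned: the same perturbations are amplified
# by kappa_+ = (|a| + |b|)/|a + b|, which here is about 2e6.
delta = 1e-8                        # size of the input perturbations
a, b = 1.0, -0.999999               # a + b is close to 0

at = a * (1 + delta)                # relative error delta in a
bt = b * (1 + 0.5 * delta)          # relative error delta/2 in b

rel_err_prod = abs(at * bt - a * b) / abs(a * b)       # ~1.5 * delta
rel_err_sum = abs((at + bt) - (a + b)) / abs(a + b)    # ~5e-3: amplified

assert rel_err_prod < 3 * delta
assert rel_err_sum > 1e4 * rel_err_prod
```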
Remark 11 (Univariate functions). Let f ∶ I → R be a differentiable function, with I ⊂ R
being an open interval. Suppose that x̃ = x + ∆x is a perturbation of the “true value” x ∈ I,
and let z = f (x). Then we have
z̃ ∶= f (x̃) = f (x + ∆x) ≈ f (x) + f ′ (x)∆x, (71)
and so
∆z ∶= z̃ − z ≈ f ′ (x)∆x, (72)
for ∆x small. From this, we can estimate the relative error, as

∆z/z ≈ f′(x)∆x / f(x) = (x f′(x) / f(x)) ⋅ (∆x/x).   (73)

The quantity

κ_f(x) = x f′(x) / f(x) = (log f(x))′ / (log x)′,   (74)
is called the (asymptotic) condition number of f at x, which represents the relative error
amplification factor, in the asymptotic regime where the error is small. The sign of κ_f has
little importance, so we are really thinking of taking the absolute value of the right hand side
of (74). Note that this can be thought of as the “derivative measured against relative error,”
or as the derivative of log f taken with respect to the variable log x. Since the argument of
a function always involves perturbation, either due to measurement error, initial rounding,
or inexact operations in the preparation steps, the condition number reflects the intrinsic
difficulty of computing the function in floating point arithmetic.
Example 12. Let us compute the condition numbers for some common functions.
● For f(x) = 1/x, we have κ(x) = 1.
● For f(x) = x^α, we have κ(x) = α.
● For f(x) = e^x, we have κ(x) = x.
● For f(x) = x − 1, we have κ(x) = x/(x − 1) = 1 + 1/(x − 1). Cancellation of digits at x = 1.
● For f(x) = x + 1, we have κ(x) = x/(x + 1). Cancellation of digits at x = −1.
● For f(x) = cos x, we have κ(x) = x tan x. Cancellation of digits at x = π/2 + πn, n ∈ Z.
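These values can be double-checked numerically via (74). The helper `condition_number` below is our own; it estimates f′ by a central difference, so it only approximates the asymptotic condition number.

```python
import math

def condition_number(f, x, h=1e-6):
    """Estimate kappa_f(x) = |x f'(x) / f(x)| via a central difference."""
    fprime = (f(x + h) - f(x - h)) / (2 * h)
    return abs(x * fprime / f(x))

# kappa = alpha for powers, |x| for exp, and large near the
# cancellation point x = 1 of the map x -> x - 1.
assert abs(condition_number(lambda x: x ** 3, 2.0) - 3.0) < 1e-4
assert abs(condition_number(math.exp, 5.0) - 5.0) < 1e-4
assert condition_number(lambda x: x - 1.0, 1.001) > 500
```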
Example 13. Consider the root x = 1 − √(1 − q) of the quadratic x² − 2x + q = 0, where we
assume q ≈ 0. Suppose that √(1 − q) was computed with relative error ε, i.e., the computed
root is x̃ = 1 − (1 + ε)√(1 − q). Then we have

(x − x̃)/x = ε√(1 − q)/x ≈ 2ε/q,   (75)

where we have used the fact that x ≈ q/2 for q ≈ 0. Since 2ε/q → ∞ as q → 0, our algorithm
exhibits cancellation of digits. This occurs even if the input argument q is its true value,
because the computation of √(1 − q) is inexact. We may think of the algorithm as decomposing
the function f(q) = 1 − √(1 − q) into two factors, as

f = g ○ h,   (76)

where g(y) = 1 − y and h(q) = √(1 − q). Since h(q) is computed inexactly, and g(y) is poorly
conditioned near y = 1, we get cancellation of digits.
Figure 6. The quadratic y = x² − 2x + q.
Can this be fixed? To answer this question, we compute the condition number of f:

κ_f(q) = q f′(q) / f(q) = q / (2√(1 − q)(1 − √(1 − q))) ≈ 1, for q ≈ 0,   (77)
which indicates that there should be no intrinsic difficulty of computing f (q) for q ≈ 0, and
hence the preceding algorithm, i.e., the decomposition (76), was a poorly designed one. In
fact, a way out suggests itself, if we keep in mind that subtraction of nearly equal quantities
should be avoided. Namely, the following transformation gives a well behaved algorithm.

√ √ 1+ 1−q q
f (q) = 1 − 1 − q = (1 − 1 − q) √ = √ . (78)
1+ 1−q 1+ 1−q
Let us go over the involved operations one by one. First, since q ≈ 0, the subtraction 1 − q
is well conditioned. Second,
√ the square root is always well
√ conditioned for positive numbers,
cf, Example 12. Then as 1 − q ≈ 1, the summation 1 + 1 − q is well conditioned. Finally,
division is always well conditioned. We conclude that the algorithm implicitly defined by (78)
does not have cancellation of digits near q = 0.
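The difference between the decomposition (76) and the rewriting (78) is easy to see numerically. A sketch in Python, using the decimal module for a high-precision reference value; the choice q = 10⁻¹² and the error thresholds are ours.

```python
import math
from decimal import Decimal, getcontext

def root_naive(q):
    # Decomposition (76): 1 - sqrt(1 - q), cancellation near q = 0.
    return 1.0 - math.sqrt(1.0 - q)

def root_stable(q):
    # Rewriting (78): q / (1 + sqrt(1 - q)), no cancellation.
    return q / (1.0 + math.sqrt(1.0 - q))

getcontext().prec = 60
q = 1e-12
exact = 1 - (1 - Decimal(q)).sqrt()   # 60-digit reference value

err_naive = abs(Decimal(root_naive(q)) - exact) / exact
err_stable = abs(Decimal(root_stable(q)) - exact) / exact

assert err_stable < Decimal("1e-14")  # near full double precision
assert err_naive > Decimal("1e-5")    # many significant digits lost
```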
Exercise 4. Perform a detailed roundoff error analysis of (78).
Remark 14 (Bivariate functions). Let f ∶ U → R be a continuously differentiable bivariate
function, with U ⊂ R² being an open set. Suppose that x̃ = x + ∆x and ỹ = y + ∆y are
perturbations of the “true values” (x, y) ∈ U, and let z = f(x, y). Then we have

z̃ ∶= f(x̃, ỹ) ≈ f(x̃, y) + ∂_y f(x̃, y)∆y ≈ f(x, y) + ∂_x f(x, y)∆x + ∂_y f(x̃, y)∆y
≈ f(x, y) + ∂_x f(x, y)∆x + ∂_y f(x, y)∆y,   (79)

where in the last step we have used the continuity of ∂_y f. This gives

(z̃ − z)/z ≈ (∂_x f(x, y)/f(x, y))∆x + (∂_y f(x, y)/f(x, y))∆y
= (x ∂_x f(x, y)/f(x, y)) ⋅ (∆x/x) + (y ∂_y f(x, y)/f(x, y)) ⋅ (∆y/y),   (80)

and assuming that the relative errors of x̃ and ỹ are of the same magnitude, we are led to the
definition that the asymptotic condition number of f is

κ_f(x, y) = ∣x ∂_x f / f∣ + ∣y ∂_y f / f∣ = κ_{x,f} + κ_{y,f},   (81)

which can be thought of as the sum of two condition numbers, one in the x direction and the
other in the y direction.
Example 15. Let us compute the condition numbers for some common functions.
● For f(x, y) = x + y, we have κ(x, y) = (∣x∣ + ∣y∣)/∣x + y∣, cf. (62).
● For f(x, y) = x^α y^β, we have κ(x, y) = ∣α∣ + ∣β∣. Putting α = 1 and β = −1, we get the
(asymptotic) condition number of division.
Exercise 5. Generalize the notion of condition number to functions of n variables. Then
compute the condition numbers of the sum and the product of n numbers.
Exercise 6. Let I ⊂ R and J ⊂ R be open intervals, thought of as the domain and codomain
of some collection of functions f ∶ I → J. We associate to I and J the “error metrics”

e(x, ∆x) = g(x)∆x,    e(z, ∆z) = h(z)∆z,   (82)

where, e.g., e(x, ∆x) is to be understood as the error measure of the perturbation ∆x near
the point x, and g and h are positive functions. For a differentiable function f ∶ I → J, we
have ∆z ≈ f′(x)∆x, and so

h(z)∆z ≈ h(z)f′(x)∆x = (h(f(x))f′(x)/g(x)) ⋅ g(x)∆x,   (83)

leading us to define the generalized asymptotic condition number

κ_f(x) = h(f(x))f′(x)/g(x),   (84)

associated to the error metrics (82). Note that the usual condition number (74) is obtained
by setting g(x) = 1/x and h(z) = 1/z.
(a) Take I = J = R, and find all error metrics for which the generalized condition number of
any translation f (x) = x + a (a ∈ R) is equal to 1.
(b) Take I = J = (0, ∞), and find all error metrics for which the generalized condition number
of any scaling f (x) = λx (λ > 0) is equal to 1.
7. Summation and product
In this section, we consider computation of the sum and product

s_n = x_1 + x_2 + … + x_n,    p_n = x_1 × x_2 × … × x_n,   (85)

of a given collection x = (x_1, …, x_n) ∈ R^n. First of all, let us look at the condition numbers.
Thus, introduce perturbations

x̃_k = x_k + ∆x_k,   k = 1, …, n,   (86)

and let

s̃_n = x̃_1 + x̃_2 + … + x̃_n,    p̃_n = x̃_1 × x̃_2 × … × x̃_n.   (87)

Then we have

s̃_n − s_n = ∆x_1 + ∆x_2 + … + ∆x_n,   (88)

and assuming that ∣∆x_k∣ ≤ ε∣x_k∣ for k = 1, …, n, we get

∣s̃_n − s_n∣ / ∣s_n∣ ≤ (∣x_1∣ + ∣x_2∣ + … + ∣x_n∣) / ∣s_n∣ ⋅ ε.   (89)

From this, we read off the condition number of summation

κ₊(x) = (∣x_1∣ + ∣x_2∣ + … + ∣x_n∣) / ∣x_1 + x_2 + … + x_n∣,   (90)

which is a generalization of (62). Naturally, we have cancellation of digits near s_n = 0.
Turning to the product, we start with

p̃_n − p_n = x̃_1 x̃_2 ⋯ x̃_n − x_1 x_2 ⋯ x_n = (x̃_1 − x_1)x̃_2 ⋯ x̃_n + x_1 x̃_2 ⋯ x̃_n − x_1 x_2 ⋯ x_n
= (x̃_1 − x_1)x̃_2 ⋯ x̃_n + x_1(x̃_2 − x_2)x̃_3 ⋯ x̃_n + x_1 x_2 x̃_3 ⋯ x̃_n − x_1 x_2 ⋯ x_n = …
= (x̃_1 − x_1)x̃_2 ⋯ x̃_n + x_1(x̃_2 − x_2)x̃_3 ⋯ x̃_n + … + x_1 x_2 ⋯ x_{n−1}(x̃_n − x_n).   (91)

Then invoking the estimates ∣x̃_k − x_k∣ ≤ ε∣x_k∣ and ∣x̃_k∣ ≤ (1 + ε)∣x_k∣, we infer

∣p̃_n − p_n∣ ≤ (ε(1 + ε)^{n−1} + ε(1 + ε)^{n−2} + … + ε(1 + ε) + ε)∣x_1 x_2 ⋯ x_n∣
= ((1 + ε)^n − 1)∣p_n∣ = (nε + (n(n − 1)/2)ε² + … + nε^{n−1} + ε^n)∣p_n∣
≤ (nε + n²ε² + … + n^{n−1}ε^{n−1} + n^n ε^n)∣p_n∣
≤ nε/(1 − nε) ⋅ ∣p_n∣,   (92)

where we have assumed that nε < 1. This implies that for perturbations satisfying, say, nε ≤ 1/2,
the condition number of the product satisfies

κ×(x) ≤ 2n,   (93)

and asymptotically, we have κ×(x) ≤ n as ε → 0.
Next, we look at the effect of inexact arithmetic on products with unperturbed input.
Introduce the notation

p̄_k = x_1 ⊗ x_2 ⊗ … ⊗ x_k,   k = 1, 2, …, n,   (94)

and invoking Axiom 1, we have

p̄_2 = x_1 ⊗ x_2 = (1 + η_1)x_1 x_2,
p̄_3 = p̄_2 ⊗ x_3 = (1 + η_2)p̄_2 x_3 = (1 + η_1)(1 + η_2)x_1 x_2 x_3,
…
p̄_n = p̄_{n−1} ⊗ x_n = (1 + η_{n−1})p̄_{n−1} x_n = (1 + η_1)(1 + η_2)⋯(1 + η_{n−1})x_1 x_2 ⋯ x_n,   (95)
for some η_1, …, η_{n−1} satisfying ∣η_k∣ ≤ ε, k = 1, …, n − 1. This yields

∣p̄_n − p_n∣ = ∣(1 + η_1)(1 + η_2)⋯(1 + η_{n−1}) − 1∣ ∣p_n∣ ≤ ((1 + ε)^{n−1} − 1)∣p_n∣,   (96)

and so

∣p̄_n − p_n∣ / ∣p_n∣ ≤ (n − 1)ε / (1 − (n − 1)ε) ≤ 2nε,   (97)

where in the last step we have assumed that nε ≤ 1/2.
Exercise 7. Combine the effects of input perturbation and inexact arithmetic for products.
That is, estimate x̃1 ⊗ . . . ⊗ x̃n − x1 × . . . × xn , where the notations are as above.
Finally, we deal with inexact summation.
Example 16 (Swamping). Thinking of N = 2 and β = 10 in (33), we have 10.0 ⊕ 0.01 = 10.0,
and by repeating this operation, we get, for instance

10.0 ⊕ 0.01 ⊕ 0.01 ⊕ … ⊕ 0.01 = 10.0,   (98)

with 100 copies of 0.01, giving the relative error ≈ 10%. On the other hand, we have

0.01 ⊕ 0.01 ⊕ … ⊕ 0.01 ⊕ 10.0 = 11.0,   (99)

again with 100 copies of 0.01, which is the exact result. This suggests that one should sum the small numbers first. In
particular, floating point addition is not associative.
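The β = 10, N = 2 arithmetic of this example can be simulated with Python's decimal module set to 3 significant digits; a sketch:

```python
from decimal import Decimal, getcontext

getcontext().prec = 3   # 3 significant decimal digits, mimicking (33)

big = Decimal("10.0")
small = Decimal("0.01")

# Small terms added last: each one is swamped by the large partial sum.
s1 = big
for _ in range(100):
    s1 = s1 + small          # 10.0 + 0.01 rounds back to 10.0

# Small terms added first: they accumulate before meeting 10.0.
s2 = Decimal("0")
for _ in range(100):
    s2 = s2 + small          # 0.01, 0.02, ..., 1.00: all exact
s2 = s2 + big

assert s1 == Decimal("10.0")  # relative error about 10%, cf. (98)
assert s2 == Decimal("11.0")  # exact result, cf. (99)
```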
Figure 7. The “naive” summation algorithm (100). Errors made in the early
steps get amplified more, as they must go through all the subsequent steps.
As with the product, introduce the notation

s̄_k = x_1 ⊕ x_2 ⊕ … ⊕ x_k,   k = 1, 2, …, n,   (100)

and invoking Axiom 1, we have

s̄_2 = x_1 ⊕ x_2 = (1 + η_1)(x_1 + x_2),
s̄_3 = s̄_2 ⊕ x_3 = (1 + η_2)(s̄_2 + x_3) = (1 + η_1)(1 + η_2)(x_1 + x_2) + (1 + η_2)x_3,
…
s̄_n = s̄_{n−1} ⊕ x_n = (1 + η_{n−1})(s̄_{n−1} + x_n)
= (1 + η_1)⋯(1 + η_{n−1})(x_1 + x_2) + (1 + η_2)⋯(1 + η_{n−1})x_3 + … + (1 + η_{n−1})x_n,   (101)

for some η_1, …, η_{n−1} satisfying ∣η_k∣ ≤ ε, k = 1, …, n − 1. This yields

∣s̄_n − s_n∣ ≤ ∣(1 + η_1)⋯(1 + η_{n−1}) − 1∣ ∣x_1 + x_2∣
+ ∣(1 + η_2)⋯(1 + η_{n−1}) − 1∣ ∣x_3∣ + … + ∣η_{n−1}∣ ∣x_n∣   (102)
≤ ((1 + ε)^{n−1} − 1)∣x_1 + x_2∣ + ((1 + ε)^{n−2} − 1)∣x_3∣ + … + ε∣x_n∣.
Since (1 + ε)^k − 1 = kε + O(k²ε²), terms such as x_1 and x_2 carry more weight in the final error
than terms such as x_n, explaining the swamping phenomenon we have seen in Example 16.
By using the simple estimate (1 + ε)^k − 1 ≤ nε/(1 − nε) =∶ ρ(ε, n) on all pre-factors, we arrive at

∣s̄_n − s_n∣ / ∣s_n∣ ≤ nε/(1 − nε) ⋅ (∣x_1∣ + … + ∣x_n∣)/∣x_1 + … + x_n∣ = ρ(ε, n)κ₊(x) ≤ 2nκ₊(x)ε,   (103)

where in the last step we have assumed that nε ≤ 1/2.
Remark 17. Recall that the condition number κ₊(x) reflects how error propagates through
the summation map x ↦ s_n. Then ρ(ε, n) has to do with the particular way this map is
implemented, i.e., how x ↦ s_n is approximated by a sequence of floating point operations.
Exercise 8. Combine the effects of input perturbation and inexact arithmetic for summation.
That is, estimate x̃1 ⊕ . . . ⊕ x̃n − (x1 + . . . + xn ), where the notations are as above.
Exercise 9. To reduce the parameter ρ(ε, n) = O(nε), as well as to alleviate the phenomenon
of swamping, one may consider the pairwise summation algorithm, depicted in Figure 8.

Figure 8. Pairwise summation algorithm.

More precisely, we set

σ(x_1, x_2) = x_1 ⊕ x_2,   (104)

and

σ(x_1, x_2, …, x_{2k}) = σ(x_1, x_2, …, x_k) ⊕ σ(x_{k+1}, x_{k+2}, …, x_{2k}),   (105)

for k ≥ 2. This defines an algorithm for summing x_1, x_2, …, x_n, when n is a power of 2.
(a) Extend the algorithm to arbitrary integer n, not necessarily a power of 2.
(b) Show that ρ(ε, n) = O(ε log n) for this algorithm.
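A possible implementation, including one natural way of handling an n that is not a power of 2 (splitting at ⌊n/2⌋), might look as follows. This is a sketch to experiment with, not the intended solution of the exercise.

```python
def pairwise_sum(xs):
    """Pairwise (recursive) summation, cf. Figure 8; assumes len(xs) >= 1."""
    n = len(xs)
    if n == 1:
        return xs[0]
    mid = n // 2   # split point; works for any n, not just powers of 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

# Compared to naive left-to-right summation, the roundoff factor grows
# like log n instead of n; 0.1 is not exactly representable in binary,
# so the errors are visible already for moderate n.
xs = [0.1] * 10**5
naive = 0.0
for x in xs:
    naive += x
target = 10**4   # the ideal sum, up to the representation error of 0.1

assert abs(pairwise_sum(xs) - target) < abs(naive - target)
```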
Exercise 10. Let x_1, x_2, … be a sequence of floating point numbers, and let s_n = x_1 + … + x_n.
Consider Kahan’s compensated summation algorithm

y_n = x_n + e_{n−1}
s̃_n = s̃_{n−1} + y_n
e_n = (s̃_{n−1} − s̃_n) + y_n,   n = 1, 2, …,

where each operation is performed in floating point arithmetic, and s̃_0 = e_0 = 0.
(a) Explain why you would expect the roundoff accuracy of this method to be better than
that of the naive summation method.
(b) Show that

∣s̃_n − s_n∣ ≤ [Cε + O(ε²)] ∑_{k=1}^{n} ∣x_k∣,

where C is some constant, and ε is the machine epsilon.
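The recurrence translates directly into code. The comparison below against naive summation is only an illustration in the spirit of part (a), not a proof of the bound in part (b); the test data is our own choice.

```python
def kahan_sum(xs):
    """Compensated summation: e carries the roundoff lost at each step."""
    s, e = 0.0, 0.0                # s_0 = e_0 = 0
    for x in xs:
        y = x + e                  # y_n = x_n + e_{n-1}
        t = s + y                  # s_n = s_{n-1} + y_n
        e = (s - t) + y            # e_n = (s_{n-1} - s_n) + y_n
        s = t
    return s

xs = [0.1] * 10**5
naive = 0.0
for x in xs:
    naive += x

# Kahan's sum of 1e5 copies of 0.1 lands far closer to 10**4
# than the naive left-to-right sum does.
assert abs(kahan_sum(xs) - 10**4) < abs(naive - 10**4)
```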