
Module 2 – ROUNDOFF and TRUNCATION ERROR

Objectives:
1. Understanding the distinction between accuracy and precision.
2. Learning how to quantify error.
3. Learning how error estimates can be used to decide when to terminate an iterative calculation.
4. Understanding how roundoff errors occur because digital computers have a limited ability to
represent numbers.
5. Understanding why floating-point numbers have limits on their range and precision.
6. Recognizing that truncation errors occur when exact mathematical formulations are represented
by approximations.
7. Knowing how to use the Taylor series to estimate truncation errors.
8. Understanding how to write forward, backward, and centered finite-difference approximations
of first and second derivatives.
9. Recognizing that efforts to minimize truncation errors can sometimes increase roundoff errors.

ROUNDOFF ERRORS
Roundoff errors arise because digital computers cannot represent some quantities exactly.

They are important to engineering and scientific problem solving because they can lead to erroneous
results.

In certain cases, they can actually lead to a calculation going unstable and yielding obviously erroneous
results. Such calculations are said to be ill-conditioned. Worse still, they can lead to subtler discrepancies
that are difficult to detect.

Two major facets of roundoff errors involved in numerical calculations:


1. Digital computers have magnitude and precision limits on their ability to represent numbers.

2. Certain numerical manipulations are highly sensitive to roundoff errors. This can result both from
mathematical considerations and from the way in which computers perform arithmetic operations.

4.2.1 Computer Number Representation


Numerical roundoff errors are directly related to the manner in which numbers are stored in a computer.

The fundamental unit whereby information is represented is called a word.

This is an entity that consists of a string of binary digits, or bits. Numbers are typically stored in one or
more words.

A number system is merely a convention for representing quantities.

Because we have 10 fingers and 10 toes, the number system that we are most familiar with is the
decimal, or base-10, number system.
A base is the number used as the reference for constructing the system. The base-10 system uses the 10
digits—0, 1, 2, 3, 4, 5, 6, 7, 8, and 9—to represent numbers. By themselves, these digits are satisfactory
for counting from 0 to 9.

For larger quantities, combinations of these basic digits are used, with the position or place value
specifying the magnitude.

The rightmost digit in a whole number represents a number from 0 to 9. The second digit from the right
represents a multiple of 10.

The third digit from the right represents a multiple of 100 and so on.

For example, if we have the number 8642.9, then we have eight groups of 1000, six groups of 100, four
groups of 10, two groups of 1, and nine groups of 0.1, or

(8 × 10^3) + (6 × 10^2) + (4 × 10^1) + (2 × 10^0) + (9 × 10^-1) = 8642.9

This type of representation is called positional notation

Now, because the decimal system is so familiar, it is not commonly realized that there are alternatives.

For example, if human beings happened to have eight fingers and toes, we would undoubtedly have
developed an octal, or base-8, representation.

In the same sense, our friend the computer is like a two-fingered animal who is limited to two states—
either 0 or 1. This relates to the fact that the primary logic units of digital computers are on/off
electronic components.

Hence, numbers on the computer are represented with a binary, or base-2, system. Just as with the
decimal system, quantities can be represented using positional notation.

For example, the binary number 101.1 is equivalent to (1 × 2^2) + (0 × 2^1) + (1 × 2^0) + (1 × 2^-1) = 4 + 0 + 1 + 0.5 = 5.5 in the decimal system.
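
As an illustration, the same positional expansion can be carried out in a short Python sketch (the helper function below is purely illustrative and not part of the module):

```python
# Evaluate the binary string 101.1 by positional notation (base 2)
def binary_to_decimal(bits):
    whole, _, frac = bits.partition(".")
    value = sum(int(b) * 2**p for p, b in enumerate(reversed(whole)))
    value += sum(int(b) * 2**-(p + 1) for p, b in enumerate(frac))
    return value

print(binary_to_decimal("101.1"))   # 5.5
```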

Integer Representation
The signed magnitude method employs the first bit of a word to indicate the sign, with a 0 for positive and a 1 for negative.

The remaining bits are used to store the number. For example, the integer value of 173 is represented in
binary as 10101101:

(10101101)_2 = 2^7 + 2^5 + 2^3 + 2^2 + 2^0 = 128 + 32 + 8 + 4 + 1 = (173)_10

Therefore, the binary equivalent of −173 would be stored on a 16-bit computer as


1000000010101101

If such a scheme is employed, there clearly is a limited range of integers that can be represented.
Again assuming a 16-bit word size, if one bit is used for the sign, the 15 remaining bits can represent
binary integers from 0 to 111111111111111. The upper limit can be converted to a decimal integer, as in
(1 × 2^14) + (1 × 2^13) + . . . + (1 × 2^1) + (1 × 2^0) = 32,767.

Note that this value can be simply evaluated as 2^15 − 1. Thus, a 16-bit computer word can store decimal integers ranging from −32,767 to 32,767.

In addition, because zero is already defined as 0000000000000000, it is redundant to use the number
1000000000000000 to define a “minus zero.” Therefore, it is conventionally employed to represent an
additional negative number: −32,768, and the range is from −32,768 to 32,767

For an n-bit word, the range would be from −2^(n−1) to 2^(n−1) − 1. Thus, 32-bit integers would range from −2,147,483,648 to +2,147,483,647.
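
These range limits follow directly from the formula −2^(n−1) to 2^(n−1) − 1; a minimal Python sketch confirming the 16-bit and 32-bit values quoted above:

```python
# Integer range for an n-bit signed word: -2**(n-1) to 2**(n-1) - 1
for n in (16, 32):
    lo, hi = -2**(n - 1), 2**(n - 1) - 1
    print(f"{n}-bit integers: {lo:,} to {hi:,}")
# 16-bit integers: -32,768 to 32,767
# 32-bit integers: -2,147,483,648 to 2,147,483,647
```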

The foregoing serves to illustrate how all digital computers are limited in their capability to represent
integers. That is, numbers above or below the range cannot be represented.

A more serious limitation is encountered in the storage and manipulation of fractional quantities

Floating-Point Representation
Fractional quantities are typically represented in computers using floating-point format. In this approach,
which is very much like scientific notation, the number is expressed as
±s × b^e
where s = the significand (or mantissa), b = the base of the number system being used, and e = the
exponent

Normalization
Prior to being expressed in this form, the number is normalized by moving the decimal place over so that
only one significant digit is to the left of the decimal point.

This is done so computer memory is not wasted on storing useless nonsignificant zeros. For example, a value like 0.005678 could be represented in a wasteful manner as 0.005678 × 10^0. However, normalization would yield 5.678 × 10^−3, which eliminates the useless zeros.

EXAMPLE 4.2 Implications of Floating-Point Representation


Problem Statement. Suppose that we had a hypothetical base-10 computer with a 5-digit word size.
Assume that one digit is used for the sign, two for the exponent, and two for the mantissa. For simplicity,
assume that one of the exponent digits is used for its sign, leaving a single digit for its magnitude

Solution. A general representation of the number following normalization would be s_1 d_1.d_2 × 10^(s_0 d_0), where s_0 and s_1 = the signs, d_0 = the magnitude of the exponent, and d_1 and d_2 = the magnitudes of the significand digits.

What is the largest possible positive quantity that can be represented? Clearly, it would correspond to
both signs being positive, and all magnitude digits set to the largest possible value in base-10, that is, 9:

Largest value = +9.9 × 10^+9


So, the largest possible number would be a little less than 10 billion. Although this might seem like a big number, it’s really not that big. For example, this computer would be incapable of representing a commonly used constant like Avogadro’s number (6.022 × 10^23).

In the same sense, the smallest possible positive number would be: Smallest value = +1.0 × 10^−9

Again, although this value might seem pretty small, you could not use it to represent a quantity like Planck’s constant (6.626 × 10^−34 J·s). Similar negative values could also be developed.

Large positive and negative numbers that fall outside the range would cause an overflow error. In a
similar sense, for very small quantities there is a “hole” at zero, and very small quantities would usually
be converted to zero. Recognize that the exponent overwhelmingly determines these range limitations.
For example, if we increase the mantissa by one digit, the maximum value increases slightly to 9.99 × 10^9. In contrast, a one-digit increase in the exponent raises the maximum by 90 orders of magnitude to 9.9 × 10^99.

When it comes to precision, however, the situation is reversed. Whereas the significand plays a minor
role in defining the range, it has a profound effect on specifying the precision. This is dramatically
illustrated for this example where we have limited the significand to only 2 digits.

As in Fig. 4.5, just as there is a “hole” at zero, there are also “holes” between values. For example, a simple rational number with a finite number of digits like 2^−5 = 0.03125 would have to be stored as 3.1 × 10^−2 or 0.031. Thus, a roundoff error is introduced.

For this case, it represents a relative error of

ε_t = (0.03125 − 0.031)/0.03125 = 0.008
While we could store a number like 0.03125 exactly by expanding the digits of the significand, quantities with infinite digits must always be approximated. For example, a commonly used constant such as π (= 3.14159…) would have to be represented as 3.1 × 10^0 or 3.1. For this case, the relative error is

ε_t = (3.14159… − 3.1)/3.14159… = 0.0132
Although adding significand digits can improve the approximation, such quantities will always have some
roundoff error when stored in a computer
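
As a rough check on these figures, the following Python sketch chops a value to a two-digit significand and computes the resulting relative error (the chopping helper is an illustrative assumption, not a prescribed routine):

```python
import math

def store_two_digits(x):
    """Chop x to a two-digit significand, d1.d2 x 10^e (illustrative helper)."""
    e = math.floor(math.log10(abs(x)))        # exponent after normalization
    return math.floor(x / 10**e * 10) / 10 * 10**e

for true in (0.03125, math.pi):
    stored = store_two_digits(true)
    rel_err = abs(true - stored) / true
    print(f"{true:.6f} stored as {stored:g}, relative error = {rel_err:.4f}")
# 0.031250 stored as 0.031, relative error = 0.0080
# 3.141593 stored as 3.1, relative error = 0.0132
```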

Let us now consider how floating-point quantities are actually represented in a real computer using base-2, or binary, numbers.

Because binary numbers consist exclusively of 0s and 1s, a bonus occurs when they are normalized. That is, the bit to the left of the binary point will always be one! This means that this leading bit does not have to be stored. Hence, nonzero binary floating-point numbers can be expressed as

±(1 + f) × 2^e

where f = the mantissa (i.e., the fractional part of the significand). For example, if we normalized the binary number 1101.1, the result would be 1.1011 × 2^3 or (1 + 0.1011) × 2^3.

Thus, although the original number has five significant bits, we only have to store the four fractional bits:
0.1011
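
In Python, the normalized form of a stored double can be inspected with math.frexp, which returns a mantissa m in [0.5, 1) and an exponent e such that x = m × 2^e; a small sketch relating this to the (1 + f) × 2^e form used above:

```python
import math

x = 13.5                # binary 1101.1
m, e = math.frexp(x)    # x = m * 2**e with 0.5 <= m < 1
f = 2 * m - 1           # shift to the 1.f form: x = (1 + f) * 2**(e - 1)
print(m, e)             # 0.84375 4
print(1 + f, e - 1)     # 1.6875 3   (1.6875 decimal = 1.1011 binary)
```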

Arithmetic Manipulations of Computer Numbers


Aside from the limitations of a computer’s number system, the actual arithmetic manipulations involving
these numbers can also result in roundoff error.

We will employ a hypothetical decimal computer with a 4-digit mantissa and a 1-digit exponent

When two floating-point numbers are added, the numbers are first expressed so that they have the
same exponents.

For example, if we want to add 1.557 + 0.04341,

the computer would express the numbers as 0.1557 × 10^1 + 0.004341 × 10^1. Then the mantissas are added to give 0.160041 × 10^1.

Now, because this hypothetical computer only carries a 4-digit mantissa, the excess number of digits get chopped off and the result is 0.1600 × 10^1. Notice how the last two digits of the second number (41) that were shifted to the right have essentially been lost from the computation.
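
The following is a minimal Python sketch of that hypothetical machine: it chops results to a 4-digit mantissa and reproduces the loss of the trailing digits 41 (the helper function is illustrative only):

```python
import math

def chop4(x):
    """Chop x to a 4-digit mantissa, 0.d1d2d3d4 * 10**e (illustrative helper)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1     # exponent so that mantissa < 1
    mant = math.trunc(x / 10**e * 1e4) / 1e4   # keep four digits, chop the rest
    return mant * 10**e

a, b = 1.557, 0.04341
print(chop4(a + b))   # 1.6  -> 0.1600 * 10**1; the trailing digits 41 are lost
```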

TRUNCATION ERRORS

Truncation errors are those that result from using an approximation in place of an exact mathematical
procedure.
A truncation error was introduced into the numerical solution because the difference equation only approximates the true value of the derivative.

To gain insight into the properties of such errors, we now turn to a mathematical formulation that is used
widely in numerical methods to express functions in an approximate fashion—the Taylor series.

4.3.1 The Taylor Series

Taylor’s theorem and its associated formula, the Taylor series, are of great value in the study of numerical methods. In essence, the Taylor theorem states that any smooth function can be approximated as a polynomial. The Taylor series then provides a means to express this idea mathematically in a form that can be used to generate practical results.

A useful way to gain insight into the Taylor series is to build it term by term. A good problem context for
this exercise is to predict a function value at one point in terms of the function value and its derivatives
at another point

Suppose that you are blindfolded and taken to a location on the side of a hill facing downslope (Fig. 4.7). We’ll call your horizontal location x_i and your vertical distance with respect to the base of the hill f(x_i). You are given the task of predicting the height at a position x_{i+1}, which is a distance h away from you.

At first, you are placed on a platform that is completely horizontal so that you have no idea that the hill is sloping down away from you. At this point, what would be your best guess at the height at x_{i+1}? If you think about it (remember you have no idea whatsoever what’s in front of you), the best guess would be the same height as where you’re standing now! You could express this prediction mathematically as

f(x_{i+1}) ≅ f(x_i)    (4.9)

This relationship, which is called the zero-order approximation, indicates that the value of f at the new point is the same as the value at the old point. This result makes intuitive sense because if x_i and x_{i+1} are close to each other, it is likely that the new value is similar to the old value.

Equation (4.9) provides a perfect estimate if the function being approximated is, in fact, a constant. For
our problem, you would be right only if you happened to be standing on a perfectly flat plateau.
However, if the function changes at all over the interval, additional terms of the Taylor series are
required to provide a better estimate.

So now you are allowed to get off the platform and stand on the hill surface with one leg positioned in
front of you and the other behind. You immediately sense that the front foot is lower than the back foot.
In fact, you’re allowed to obtain a quantitative estimate of the slope by measuring the difference in
elevation and dividing it by the distance between your feet

With this additional information, you are clearly in a better position to predict the height f(x_{i+1}). In essence, you use the slope estimate to project a straight line out to x_{i+1}. You can express this prediction mathematically as

f(x_{i+1}) ≅ f(x_i) + f'(x_i)h    (4.10)

This is called a first-order approximation because the additional first-order term consists of a slope f'(x_i) multiplied by h, the distance between x_i and x_{i+1}. Thus, the expression is now in the form of a straight line that is capable of predicting an increase or decrease of the function between x_i and x_{i+1}.

Although Eq. (4.10) can predict a change, it is only exact for a straight-line, or linear, trend. To get a
better prediction, we need to add more terms to our equation.

So now you are allowed to stand on the hill surface and take two measurements. First, you measure the slope behind you by keeping one foot planted at x_i and moving the other one back a distance Δx. Let us call this slope f'_b(x_i). Then you measure the slope in front of you by keeping one foot planted at x_i and moving the other one forward a distance Δx. Let us call this slope f'_f(x_i). You immediately recognize that the slope behind is milder than the one in front. Clearly the drop in height is “accelerating” downward in front of you. Thus, the odds are that f(x_{i+1}) is even lower than your previous linear prediction.

As you might expect, you are now going to add a second-order term to your equation and make it into a
parabola. The Taylor series provides the correct way to do this as in
f(x_{i+1}) ≅ f(x_i) + f'(x_i)h + (f''(x_i)/2!) h^2    (4.11)

To make use of this formula, you need an estimate of the second derivative. You can use the last two
slopes you determined to estimate it as

f''(x_i) ≅ (f'_f(x_i) − f'_b(x_i)) / Δx    (4.12)
Thus, the second derivative is merely a derivative of a derivative; in this case, the rate of change of the
slope
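
Equation (4.12) can be tried out numerically. The sketch below (the test function f(x) = x^3 and the step size are arbitrary choices for illustration) builds the forward and backward slopes and divides their difference by Δx:

```python
def f(x):
    return x**3    # test function; the true second derivative is 6x

x_i, dx = 1.0, 0.1
slope_front = (f(x_i + dx) - f(x_i)) / dx     # slope measured in front, f'_f
slope_back = (f(x_i) - f(x_i - dx)) / dx      # slope measured behind, f'_b
second = (slope_front - slope_back) / dx      # Eq. (4.12)
print(second)   # approximately 6.0, the true value of f''(1)
```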

Before proceeding, let us look carefully at Eq. (4.11). Recognize that all the values subscripted i represent values that you have estimated. That is, they are numbers. Consequently, the only unknowns are the values at the prediction position x_{i+1}. Thus, it is a quadratic equation of the form

f(h) ≅ a_2 h^2 + a_1 h + a_0

Thus, we can see that the second-order Taylor series approximates the function with a second order
polynomial.

Clearly, we could keep adding more derivatives to capture more of the function’s curvature. Thus, we
arrive at the complete Taylor series expansion

f(x_{i+1}) = f(x_i) + f'(x_i)h + (f''(x_i)/2!) h^2 + … + (f^(n)(x_i)/n!) h^n + R_n    (4.13)

Note that because Eq. (4.13) is an infinite series, an equal sign replaces the approximate sign that was
used in Equations (4.9) through (4.11). A remainder term is also included to account for all terms from
n+1 to infinity
R_n = (f^(n+1)(ξ)/(n+1)!) h^(n+1)    (4.14)

where the subscript n connotes that this is the remainder for the nth-order approximation and ξ is a value of x that lies somewhere between x_i and x_{i+1}.

We can now see why the Taylor theorem states that any smooth function can be approximated as a
polynomial and that the Taylor series provides a means to express this idea mathematically

In general, the nth -order Taylor series expansion will be exact for an nth -order polynomial.

For other differentiable and continuous functions, such as exponentials and sinusoids, a finite number of
terms will not yield an exact estimate.

Each additional term will contribute some improvement, however slight, to the approximation. Only if an
infinite number of terms are added will the series yield an exact result

Although the foregoing is true, the practical value of Taylor series expansions is that, in most cases, the
inclusion of only a few terms will result in an approximation that is close enough to the true value for
practical purposes.

The assessment of how many terms are required to get “close enough” is based on the remainder term
of the expansion (Eq. 4.14).
This relationship has two major drawbacks.
First, ξ is not known exactly but merely lies somewhere between x_i and x_{i+1}.

Second, to evaluate Eq. (4.14), we need to determine the (n+1)th derivative of f (x). To do this, we need
to know f (x). However, if we knew f (x), there would be no need to perform the Taylor series
expansion in the present context!

Despite this dilemma, Eq. (4.14) is still useful for gaining insight into truncation errors. This is because we
do have control over the term h in the equation. In other words, we can choose how far away from x we
want to evaluate f (x) , and we can control the number of terms we include in the expansion.
Consequently, Eq. (4.14) is often expressed as
R_n = O(h^(n+1))

where the nomenclature O(h^(n+1)) means that the truncation error is of the order of h^(n+1). That is, the error is proportional to the step size h raised to the (n + 1)th power.

Although this approximation implies nothing regarding the magnitude of the derivatives that multiply h^(n+1), it is extremely useful in judging the comparative error of numerical methods based on Taylor series expansions. For example, if the error is O(h), halving the step size will halve the error. On the other hand, if the error is O(h^2), halving the step size will quarter the error.
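
This order behavior is easy to observe. The sketch below (using exp(x) about x = 0 as an arbitrary test case) compares the zero-order Taylor approximation, whose remainder is O(h), with the first-order approximation, whose remainder is O(h^2), as h is halved:

```python
import math

# Zero-order Taylor approximation of exp(h) about 0 has error O(h);
# the first-order approximation has error O(h**2).
for h in (0.1, 0.05, 0.025):
    err0 = abs(math.exp(h) - 1.0)          # exp(h) ~ 1
    err1 = abs(math.exp(h) - (1.0 + h))    # exp(h) ~ 1 + h
    print(f"h = {h:<6} zero-order error = {err0:.5f}  first-order error = {err1:.6f}")
# Halving h roughly halves err0 and roughly quarters err1.
```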

In general, we can usually assume that the truncation error is decreased by the addition of terms to the
Taylor series. In many cases, if h is sufficiently small, the first- and other lower-order terms usually
account for a disproportionately high percent of the error. Thus, only a few terms are required to obtain
an adequate approximation.

EXAMPLE. Approximation of a Function with a Taylor Series Expansion


Problem Statement. Use Taylor series expansions with n = 0 to 6 to approximate f(x) = cos x at x_{i+1} = π/3 on the basis of the value of f(x) and its derivatives at x_i = π/4. Note that this means that h = π/3 − π/4 = π/12.

Solution.
Our knowledge of the true function allows us to determine the correct value f(π/3) = 0.5. The zero-order approximation is

f(π/3) ≅ cos(π/4) = 0.707106781

which represents a percent relative error of

ε_t = |(0.5 − 0.707106781)/0.5| × 100% = 41.4%
For the first-order approximation, we add the first derivative term, where f'(x) = −sin x:

f(π/3) ≅ cos(π/4) − sin(π/4)(π/12) = 0.521986659


which has |ε_t| = 4.40%. For the second-order approximation, we add the second derivative term, where f''(x) = −cos x:

f(π/3) ≅ cos(π/4) − sin(π/4)(π/12) − (cos(π/4)/2)(π/12)^2 = 0.497754491

with |ε_t| = 0.449%. Thus, the inclusion of additional terms results in an improved estimate. The process can be continued, and the results are listed in the following table:

Order n    f^(n)(x)    f(π/3)         |ε_t| (%)
0          cos x       0.707106781    41.4
1          −sin x      0.521986659    4.40
2          −cos x      0.497754491    0.449
3          sin x       0.499869147    2.62 × 10^−2
4          cos x       0.500007551    1.51 × 10^−3
5          −sin x      0.500000304    6.08 × 10^−5
6          −cos x      0.499999988    2.44 × 10^−6

Notice that the derivatives never go to zero as would be the case for a polynomial. Therefore, each
additional term results in some improvement in the estimate.

However, also notice how most of the improvement comes with the initial terms. For this case, by the
time we have added the third-order term, the error is reduced to 0.026%, which means that we have
attained 99.974% of the true value.

Consequently, although the addition of more terms will reduce the error further, the improvement
becomes negligible.
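
The table can be reproduced with a short Python script; the sketch below hard-codes the repeating derivative cycle of cos x and accumulates the Taylor terms:

```python
import math

x_i, h = math.pi / 4, math.pi / 12
true = math.cos(math.pi / 3)    # 0.5

# Successive derivatives of cos x evaluated at x_i repeat in a cycle of four
derivs = [math.cos(x_i), -math.sin(x_i), -math.cos(x_i), math.sin(x_i)]

approx = 0.0
for n in range(7):
    approx += derivs[n % 4] * h**n / math.factorial(n)
    err = abs((true - approx) / true) * 100
    print(f"order {n}: f(pi/3) ~ {approx:.9f}   |e_t| = {err:.3g}%")
```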

TOTAL NUMERICAL ERROR


The total numerical error is the summation of the truncation and roundoff errors.

In general, the only way to minimize roundoff errors is to increase the number of significant figures of
the computer. Further, we have noted that roundoff error may increase due to subtractive cancellation
or due to an increase in the number of computations in an analysis.
In contrast, Example 4.4 demonstrated that the truncation error can be reduced by decreasing the step
size. Because a decrease in step size can lead to subtractive cancellation or to an increase in
computations, the truncation errors are decreased as the roundoff errors are increased

Therefore, we are faced by the following dilemma:

The strategy for decreasing one component of the total error leads to an increase of the other
component. In a computation, we could conceivably decrease the step size to minimize truncation errors
only to discover that in doing so, the roundoff error begins to dominate the solution and the total error
grows!

Thus, our remedy becomes our problem (Fig. 4.11). One challenge that we face is to determine an
appropriate step size for a particular computation.

We would like to choose a large step size to decrease the amount of calculations and roundoff errors
without incurring the penalty of a large truncation error.

If the total error is as shown in Fig. 4.11, the challenge is to identify the point of diminishing returns
where roundoff error begins to negate the benefits of step-size reduction
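
One way to visualize this tradeoff is to estimate a derivative with a centered difference over a range of step sizes; in the Python sketch below (the test function and evaluation point are arbitrary), the total error first decreases with h because of truncation error and then grows again as roundoff dominates:

```python
import math

x = 1.0
true = math.cos(x)    # exact derivative of sin x at x = 1

# Centered-difference estimate of the derivative: truncation error shrinks as
# O(h**2), but roundoff error from subtractive cancellation grows as h shrinks.
for h in (1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11):
    estimate = (math.sin(x + h) - math.sin(x - h)) / (2 * h)
    print(f"h = {h:.0e}   total error = {abs(estimate - true):.2e}")
```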
