Gauss Elimination for Linear Equations
The general approach to solving simultaneous linear equations is known as Gauss elimination.
There are several variants of the basic technique; however, all start with the basic idea of
reducing a set of equations to an upper triangular form that can be easily solved. Two important
parts of this process are (1) the detection of a system of equations that does not have a unique
solution and (2) reducing round-off error in the calculations. These notes do not cover iterative
methods for the solution of equations. These methods will be discussed as part of the
consideration of sparse matrices that arise in the numerical solution of partial differential
equations.
Although we are used to writing equations in the form 3x + 2y + 6z = 3, etc., the computer
solutions are based on a more general notation. Indeed, we will drop all reference to writing the
equations as such and simply use the data in the equations. In particular, we can write the
system of equations in matrix notation as Ax = b. We will develop a set of algorithms that uses
the data in the A matrix and the b vector to obtain the solutions in the x vector, without reference
to actual algebraic equations.
A key feature of algorithms for solving simultaneous equations is called a “pivoting strategy” that
is used to reduce round-off error that occurs in the solutions of these equations. These notes
begin with background material on computer data representation to give background on the topic
of round-off error.
Because of the binary representation of data in computers, numbers are not represented exactly.
This is no different than the custom that we practice when we represent π as 3.14, 3.14159,
3.14159265358979, or some other number depending on the accuracy we require in a given
calculation. Although the binary representation of numbers in a computer uses base two instead
of our base ten, the computer designer’s decision on the number of significant figures used to
represent an inexact number limits the accuracy in our calculations.
Numbers in computers are typically represented as integer data types, used for counting, and
floating-point data types, used for computations. Although the names for the data types differ in
different languages, the floating-point data types can be thought of as single precision, double
precision, and extended precision. Each successive type uses more bits to represent the number.
In a typical computer 32 bits (4 bytes) are used for single precision, 64 bits (8 bytes) for double
precision, and 80 bits (10 bytes) for extended precision. Standards for computing languages give
wide options to compiler vendors and the word size for the different types may be different for
different machines and even for different compilers on the same machine.
The structure of a floating-point data type represents the number in a binary exponential form,
m·2^c, where c is the characteristic (or exponent) and m is the mantissa. This is like our usual
representation of numbers such as 6.023×10^23, but it uses 2 instead of 10 for the exponent base. As
more bits are used for the word, both the range of the powers available in the characteristic and
the number of significant digits in the mantissa increase. For the typical single-precision floating-
point data type, the magnitude of the numbers can range between 10^-38 and 10^38, and the
mantissa can represent about 7 significant figures. A typical double-precision type can range
between 10^-308 and 10^308, and the mantissa can represent more than 15 significant figures.
The “machine epsilon” is the name given to the largest number such that 1 + (machine epsilon) =
1 in the computer. For a 4-byte single precision this value is 1.19×10^-7, and for an 8-byte double
precision, the machine epsilon is 2.22×10^-16. This shows us that the single precision cannot quite
represent seven significant figures (otherwise the machine epsilon would be less than 10^-7) while
the double precision can easily represent 15 significant figures. Although single precision may
seem to have significant accuracy for most engineering calculations, it is possible that rounding
that occurs in a series of calculations can increase the difference between the true solution and
the one obtained in a computer. This is particularly true in the solution of simultaneous algebraic
equations, and algorithms have been devised to reduce this round-off error. See the additional
information on round-off error in Appendix A and binary numbers in Appendix C.
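The machine epsilon is easy to estimate with a short program. The C++ sketch below, included here as an illustration, halves a trial value until adding half of it to 1 no longer changes the stored result; on typical IEEE hardware it prints values close to those quoted above.

#include <iostream>

// Find the smallest power of two eps such that 1 + eps still differs from 1 in type T.
template <typename T>
T machineEpsilon() {
    T eps = 1;
    while (static_cast<T>(1) + eps / static_cast<T>(2) != static_cast<T>(1))
        eps /= 2;
    return eps;
}

int main() {
    std::cout << "single precision: " << machineEpsilon<float>() << "\n";  // about 1.19e-7
    std::cout << "double precision: " << machineEpsilon<double>() << "\n"; // about 2.22e-16
}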
In order to solve a set of simultaneous linear equations we use a process that replaces equations
in the set by equivalent equations. We can replace any equation in the set by a linear
combination of other equations without changing the solution of the system of equations. For
example, consider the simple set of two equations with two unknowns. Here we use the notation
x1 and x2 in place of the more conventional x and y for our two unknowns. This allows us to
generalize our processes to equation sets of any size and to introduce matrix notation.
$$\begin{aligned} 3x_1 + 5x_2 &= 13 \\ 7x_1 - 4x_2 &= -1 \end{aligned} \qquad [1]$$
You can confirm that x1 = 1 and x2 = 2 is a solution to this set of equations. To solve this set of
equations we can replace the second equation by a new equation, which is a linear combination
of the two equations without changing the solution. The particular combination we seek is one
that will eliminate x1. We can do this by subtracting the first equation, multiplied by 7/3, from the
second equation to obtain the following pair of equations, which is equivalent to the original set in
equation [1].
$$\begin{aligned} 3x_1 + 5x_2 &= 13 \\ -\tfrac{47}{3}\, x_2 &= -\tfrac{94}{3} \end{aligned} \qquad [2]$$
We can readily solve the second equation to find x2 = 2, and substitute this value of x2 into the first
equation to find that x1 = [13 – 5(2)]/3 = 1. The same process can be applied to the solution of
three equations in three unknowns. Consider the following set of equations.
$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \quad (A) \\ -3x_1 + 2x_2 + 9x_3 &= 13 \quad (B) \\ 7x_1 + 3x_2 + 8x_3 &= 14 \quad (C) \end{aligned} \qquad [3]$$
The general procedure for combining two equations, A and B, to eliminate the x k term from
equation B is described below. The description uses equation set [3] above as an example
showing how to combine the first equation (A) and the second equation (B) to eliminate the x1
coefficient from the second equation.
• Multiply equation A by the ratio of the xk coefficient in equation B to the xk
coefficient in equation A. For example, multiply the first equation in [3] by –3/2 to obtain
the intermediate result –3x1 + 6x2 + 39x3 = 51, as the first step in eliminating the x1
coefficient in the second equation.
• Subtract the equation found in the previous step from equation B. The result will
be a new equation in which the xk coefficient is zero. In the example we subtract –3x1
+ 6x2 + 39x3 = 51 from –3x1 + 2x2 + 9x3 = 13 to give the result –4x2 – 30x3 = –38.
• Replace equation B by the equation found in the previous step. In the example, the
new second equation, –4x2 – 30x3 = –38, has a zero coefficient for x1.
In practice, all three steps outlined above are combined into a single set of operations. We can
summarize all the work in the example above (multiplying the first (A) equation by –3/2 and
subtracting the result from the second (B) equation) by the following set of operations. This gives
us our new second equation, whose x1 coefficient is zero.

$$0x_1 + \left[2 - \tfrac{(-3)}{2}(-4)\right] x_2 + \left[9 - \tfrac{(-3)}{2}(-26)\right] x_3 = 13 - \tfrac{(-3)}{2}(-34) \qquad [4]$$

The same operations, with the multiplier 7/2, give a new third equation whose x1 coefficient is zero.

$$0x_1 + \left[3 - \tfrac{7}{2}(-4)\right] x_2 + \left[8 - \tfrac{7}{2}(-26)\right] x_3 = 14 - \tfrac{7}{2}(-34) \qquad [5]$$

Replacing the second and third equations in [3] with these results gives the following equivalent set of equations.

$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \\ 0x_1 - 4x_2 - 30x_3 &= -38 \\ 0x_1 + 17x_2 + 99x_3 &= 133 \end{aligned} \qquad [6]$$
We see that we have reduced the last two equations to two equations in two unknowns and we
can make the x2 coefficient in the third equation zero by multiplying the second equation by
17/(–4) and subtracting the result from the third equation. (Since the x1 coefficients in both
equations are zero, the result of the subtraction will have a zero for both the x1 and x2 coefficients.)
$$0x_1 + \left[17 - \tfrac{17}{(-4)}(-4)\right] x_2 + \left[99 - \tfrac{17}{(-4)}(-30)\right] x_3 = 133 - \tfrac{17}{(-4)}(-38) \qquad [7]$$
We can use equation [7] to replace the third equation in [6], giving the following set of equations.
$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \\ 0x_1 - 4x_2 - 30x_3 &= -38 \\ 0x_1 + 0x_2 - \tfrac{57}{2}\, x_3 &= -\tfrac{57}{2} \end{aligned} \qquad [8]$$
This set of equations should have the same solution as the original set of equations in [3], since
none of our operations have changed the values of the unknowns. We say that this set of
equations is equivalent to the original set of equations in that it has the same set of solutions.
We see that we can easily solve the third equation for x3 = (–57/2)/(–57/2) = 1. Once we know
that x3 = 1, we can substitute this result into the second equation and solve for x2.
$$x_2 = \frac{-38 - (-30)x_3}{(-4)} = \frac{-38 - (-30)(1)}{(-4)} = \frac{-8}{(-4)} = 2 \qquad [9]$$
We now know both x2 and x3 and we can substitute these values into the first equation to obtain
x1.

$$x_1 = \frac{-34 + 4x_2 + 26x_3}{2} = \frac{-34 + 4(2) + 26(1)}{2} = 0 \qquad [10]$$
We have seen how to solve two equations in two unknowns and three equations in three
unknowns. We want to generalize the process to solve n equations in n unknowns. To do this,
we introduce the following notation for this general case: the coefficient of xj in equation i is written
as aij. The right-hand side of equation i is written as bi. With this notation we write our general
equation as follows.

$$\sum_{j=1}^{n} a_{ij}\, x_j = b_i \qquad i = 1, \ldots, n \qquad [11]$$
In matrix notation, we would define an A matrix whose coefficients are [aij], a right-hand side
vector, b, and a solution vector, x. These are written out in full below.

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \qquad [12]$$
The basic process to solve this system of equations is known as Gauss elimination. The goal of
this process is to reduce the original set of equations to an equivalent set in which there is only
one unknown in the last equation, two unknowns in the next-to-last equation, three unknowns in
the equation before that, and so forth, until only the first equation contains all the unknowns. The
operations that we use to do this do not change the solutions. However, the final result, which is
called an upper-triangular form, can be readily solved for all the unknowns. This final, upper-
triangular form, in matrix notation, is shown below.

$$\begin{bmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1n} \\ 0 & \alpha_{22} & \cdots & \alpha_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \alpha_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix} \qquad [13]$$
In order to have the same solutions, all the operations are done on the complete equation; this
means that we change not only the coefficients in the A matrix, but also the coefficients in the
right-hand-side b vector. (The same operations that are used to obtain the revised coefficient
matrix are used to obtain the revised right-hand-side vector.)
The revised A and b matrices, shown in equation [13] are obtained in a series of steps. This is a
generalization of the process used above for two and three equations. (In the following
discussion, we talk about operations on rows and operations on equations. These mean the
same thing. Operations on rows of the A matrix and similar operations on rows of the b vector
are equivalent to operations on the equations represented by the matrix coefficients.) In the first
step, the x1 coefficients are eliminated from all equations except the first one. Equations [4] and
[5] show numerical examples of this process. To generalize the process, we say that the first row
is multiplied by a factor and subtracted from row i in such a way that ai1, the coefficient of x1 in row
i, is set to zero. The resulting equation is then used in place of equation i. This is repeated for
each row below the first row. In order for this subtraction to set ai1 to zero, we have to multiply the
first equation by (ai1/a11). We must apply this same subtraction process to all terms in row i,
including the bi term. This operation, which can be represented by the following replacement
operations,1 is done on the coefficients in rows (equations) 2 to n.
$$a_{ij} \leftarrow a_{ij} - \frac{a_{i1}}{a_{11}}\, a_{1j} \quad (j = 1, \ldots, n) \qquad \text{and} \qquad b_i \leftarrow b_i - \frac{a_{i1}}{a_{11}}\, b_1 \qquad i = 2, \ldots, n \qquad [14]$$
After equation [14] is applied, the only nonzero x1 coefficient is in the first equation (represented
by the first row of the A matrix and the b vector.) You can confirm that this will set ai1 = 0 for i > 1.
You can also apply the formulae in [14] to the system of equations in [3] to obtain equations [4]
and [5]. The elimination process is next applied to make the x2 coefficients on all equations below
the second equation zero. Here we use the same process. We multiply the second row
1 Here we use the computer pseudocode notation a1j ← <expression> to show that the old value of a1j is
replaced by the value computed in the <expression>, where the <expression> may contain the old value of
a1j. This avoids the requirement in mathematical notation to distinguish between the old and new values of
aij in an equation.
(equation) by the factor ai2/a22 and subtract it from row i, where i ranges from 3 to n, and use the
result to replace row i. The replacement operations to accomplish this are shown below.
$$a_{ij} \leftarrow a_{ij} - \frac{a_{i2}}{a_{22}}\, a_{2j} \quad (j = 2, \ldots, n) \qquad \text{and} \qquad b_i \leftarrow b_i - \frac{a_{i2}}{a_{22}}\, b_2 \qquad i = 3, \ldots, n \qquad [15]$$
Equation [15] has the same form as equation [14]; only the fixed subscripts use row 2 instead of
row 1, and the starting points for the row and column operations are different. Again, you can apply
equation [15] to the system of equations in [6] to obtain equation [7].
The process described by equations [14] and [15] can be continued until the upper-triangular form
shown in equation [13] is obtained. We can develop a general form for this operation that can be
used in a computer code. The equation that is subtracted from all equations below it, at any step
in the process, is called the pivot equation (or the pivot row in the matrix). The goal of the
subtraction process is to set arow,pivot, the coefficient of xpivot, in each row below the pivot row to
zero. To do this we (1) multiply the pivot row by arow,pivot/apivot,pivot, (2) subtract it from the given
row, and (3) use the result to replace the given row. This gives the following generalization of
equations [14] and [15], where we use the index k to denote the pivot row.
$$a_{ij} \leftarrow a_{ij} - \frac{a_{ik}}{a_{kk}}\, a_{kj} \quad (j = k, \ldots, n) \qquad \text{and} \qquad b_i \leftarrow b_i - \frac{a_{ik}}{a_{kk}}\, b_k \qquad i = k+1, \ldots, n \qquad [16]$$
The replacement operations in [16] are applied using all rows except the last row as pivot rows.
This process will produce the upper-triangular form shown in equation [13]. From this form, we
can easily find all the values of xi. The process used to find these unknowns from the upper
triangular matrix is called back substitution, because the unknowns are found in reverse order.
From equation [13] we see that we can simply find xn as βn/αnn. Once xn is known, we find xn-1 =
[βn-1 – αn-1,n xn]/αn-1,n-1. The general equation for this back-substitution process, inferred from
equation [13], is shown below.
$$x_i = \frac{\beta_i - \sum_{j=i+1}^{n} \alpha_{ij}\, x_j}{\alpha_{ii}} \qquad i = n-1,\, n-2,\, \ldots,\, 1 \qquad [17]$$
When we are solving for xi, all previous values of xj required in the summation are known.
The C++ code below shows a simplified version2 of how this method is applied to the solution of
equations. (We assume that all data values are declared and initialized.) The number of
equations is equal to the number of unknowns, n. The first of three loops used to get the upper
triangular array, is the loop that uses all rows (except the last row) as the pivot row. Here the
program variable, pivot, is used as the subscript for this row. The code execution is simplified by
augmenting the a matrix so that ai,n+1 = bi. This allows the code to proceed without separate
consideration of similar operations on the A and b matrix components.
// augment column n+1 of a matrix with b values
for ( row = 1; row <=n; row++) a[row][n+1] = b[row];
// get upper triangular array
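A minimal sketch of the elimination and back-substitution loops is shown below. Like the snippet above, it assumes one-based subscripts (so the arrays must be dimensioned to allow indices 1 through n+1), that all variables are declared, and that no pivot element is zero.

    // forward elimination using each row except the last as the pivot row
    for (pivot = 1; pivot <= n - 1; pivot++)
        for (row = pivot + 1; row <= n; row++) {
            double factor = a[row][pivot] / a[pivot][pivot];
            // apply the replacement operation [16] across the augmented matrix
            for (col = pivot; col <= n + 1; col++)
                a[row][col] -= factor * a[pivot][col];
        }
    // back substitution, equation [17], using the augmented column as b
    for (row = n; row >= 1; row--) {
        x[row] = a[row][n+1];
        for (col = row + 1; col <= n; col++)
            x[row] -= a[row][col] * x[col];
        x[row] /= a[row][row];
    }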
2 As discussed below, actual code would have to account for the possibility that the system of equations
might not have a solution. It would also use different operations to reduce round-off error.
The process outlined above for the solution of a set of simultaneous equations is known as the
Gaussian elimination procedure. Alternative procedures such as the Gauss-Jordan method and
LU decomposition, work in a similar manner. They produce an upper triangular matrix or diagonal
matrix that is then used to solve for the values of xi in reverse order.
If the solution process outlined above is used on certain matrices, it may not be possible to obtain
a unique solution. Consider the two sets of equations shown below.
$$\begin{aligned} 3x_1 + 5x_2 &= 13 \\ 6x_1 + 10x_2 &= 26 \end{aligned} \qquad \text{and} \qquad \begin{aligned} 3x_1 + 5x_2 &= 13 \\ 6x_1 + 10x_2 &= 27 \end{aligned} \qquad [18]$$
In the set of equations on the left, the second equation is simply twice the first equation. If we
multiply the first equation by two and subtract it from the second equation, we get the result that 0
= 0. Thus, the second equation gives us no new information on the relationship between x1 and
x2. We say that this system of equations has an infinite number of solutions. Any value of x2 =
2.6 – 0.6x1 will satisfy both equations. The second set of equations has no solutions. If we
multiply the first equation by two and subtract it from the second equation, we have the result that
0 = 1! Thus, this second set of equations is incompatible and does not have a solution.3
This simple example shows a general result. When the Gaussian elimination procedure produces
a row of the A matrix that is all zeros, we will not have a unique solution to our problem. If the
corresponding row of the b vector is zero, we will have an infinite number of solutions; if the b
vector row is not zero, the system of equations has no solution. In each case, the A matrix has no
inverse and is called a singular matrix. We cannot find a unique solution to Ax = b when we have
a singular matrix. In a singular matrix, there is a linear dependence among the rows. Since the
Gauss elimination process is a set of linear operations on the rows of the matrix, it can be used to
detect linear dependence in the rows of a matrix.
We detect a singular matrix by looking for zeros in the pivot column. This is the column which
contains the diagonal term, apivot,pivot on the pivot row. Just finding one zero on the pivot column
3 This result has a geometric interpretation. When we have two simultaneous linear algebraic equations, we
can plot each equation in x1–x2 space. The solution to the pair of equations is located at the point where
both equations intersect. If we did this for the left set of equations in [18], we would only have a single line.
There is no second equation to give us an intersection point. Any point along the single line satisfies the pair
of equations. If we plot the pair of equations on the right, we would get parallel lines that never intersect.
Thus, there is no intersection to provide a solution.
does not mean the matrix is singular. Consider the following set of equations, which is similar to
equation [3] except that a22 and b2 have been changed.
$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \\ -3x_1 + 6x_2 + 9x_3 &= 21 \\ 7x_1 + 3x_2 + 8x_3 &= 14 \end{aligned} \qquad [19]$$
Applying the same steps we used in equations [4] and [5] gives the following result in place of
equation [6].

$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \\ 0x_1 + 0x_2 - 30x_3 &= -30 \\ 0x_1 + 17x_2 + 99x_3 &= 133 \end{aligned} \qquad [20]$$
Here we see that a22 = 0, but Gauss elimination wants to divide by this element. Although we
cannot divide by zero, we can solve the problem by swapping the second and third equations.
This swap will not affect the results, since we can place our equations in any order. Once we
swap the two equations in this simple example, we have our upper-triangular form that we can
solve for the solutions.
Consider the following set of equations. This is a further modification of the set of equations in
[19] in which the third equation has been changed.

$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \\ -3x_1 + 6x_2 + 9x_3 &= 21 \\ -x_1 + 2x_2 - 17x_3 &= -13 \end{aligned} \qquad [21]$$
When we apply Gauss elimination, using the first row as the pivot row, we get the following result.
$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \\ 0x_1 + 0x_2 - 30x_3 &= -30 \\ 0x_1 + 0x_2 - 30x_3 &= -30 \end{aligned} \qquad [22]$$
In this case, we cannot swap rows two and three to get a nonzero value of a22. We see that
both of these equations give us x3 = 1 and we are left with one equation (the first equation) in two
unknowns. Thus, the set of equations in [21] has an infinite number of solutions. For any
arbitrary constant, α, the values x3 = 1, x2 = α, and x1 = 2α – 4 will provide a solution to the set of equations
in [21]. We see that this occurs because the third equation is a linear combination of the first two.
(In this example, equation three is simply the sum of the first two equations.) If the right-hand side
of the third equation in [21] had been an arbitrary number K, in place of –13, we would have the
following result in place of [22].
$$\begin{aligned} 2x_1 - 4x_2 - 26x_3 &= -34 \\ 0x_1 + 0x_2 - 30x_3 &= -30 \\ 0x_1 + 0x_2 - 30x_3 &= K - \tfrac{(-1)}{2}(-34) \end{aligned} \qquad [23]$$
We can subtract the second equation from the third equation to obtain the result that 0 = (K + 13)/(–30).
This can only be true if K = –13. A set of equations with K having any other value cannot have a
solution. The two cases shown in equations [19] and [21] are ones in which the A matrix is
singular. It does not possess an inverse because the rows of the matrix are linearly dependent.
In such cases, we cannot find a unique solution to the equations.
The detection of a singular matrix arises naturally in the Gauss elimination process. We have to
swap equations when we find a zero as the diagonal element in the pivot row. If we cannot find a
nonzero element in the column that contains this element, called the pivot column, in equations
below the pivot row, we know that we have a singular matrix. In practical applications, we test for
a number that is small compared to round-off error rather than testing for zero. Having all zeros in
the pivot column means that we can use the other columns to obtain zeros in all columns in the
last row of the matrix. Thus, we produce a row of all zeros, which is an indication of a singular
matrix.
Gauss elimination involves many calculations, particularly for large systems of equations. In
addition, some sets of equations may be ill conditioned. “Ill-conditioned” is the name given to
systems in which the rows of the matrix are nearly, but not exactly, linearly dependent. Conventional
computer codes for Gauss elimination and its variants use some form of row or column exchange
to minimize round-off error. Exchange of rows is straightforward, but exchange of columns
changes the identity of the variables and is more difficult to implement. The extra effort for
computations with column pivoting is usually not worth the effort, and the general approach used,
called scaled partial pivoting, is described below.
Initially a scale factor is found for each equation. This factor is the element of maximum absolute
value among the coefficients on the left-hand side of the equation. If each equation is divided by its scale factor, then the
maximum coefficient in each equation is 1. Before using any row of the matrix as the pivot row,
all rows from the pivot row to the last row in the matrix are examined to find the row with the
maximum (scaled absolute value) element in the pivot column. This row is then swapped with the
pivot row. The process of using one row as the pivot row then proceeds in the usual fashion.
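A sketch of how this selection might look in code is shown below. It is illustrative only: the array names, the one-based indexing, and the helper variables are assumptions consistent with the other snippets in these notes, and the scale factors are used only to compare candidate pivots rather than to actually divide the rows. (fabs comes from <cmath>.)

    // compute the scale factor (largest |a| in the row) for each equation
    for (row = 1; row <= n; row++) {
        scale[row] = 0;
        for (col = 1; col <= n; col++)
            if (fabs(a[row][col]) > scale[row]) scale[row] = fabs(a[row][col]);
    }
    // before eliminating with row 'pivot', pick the best row at or below it
    maxPivot = pivot;
    for (row = pivot + 1; row <= n; row++)
        if (fabs(a[row][pivot]) / scale[row] >
            fabs(a[maxPivot][pivot]) / scale[maxPivot]) maxPivot = row;
    // swap rows 'pivot' and 'maxPivot' (including the b values), then eliminate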
Condition of a matrix
In numerical analysis, a problem is called ill conditioned if a small relative change in input
produces a large relative change in the results. In general, then, the condition number for any
problem can be defined as follows.
$$\text{Condition} = \frac{\left| \dfrac{\text{Change in result}}{\text{result}} \right|}{\left| \dfrac{\text{Change in input}}{\text{input}} \right|} \qquad [24]$$
For calculations with vectors and matrices, we use the norms to obtain relationships about the
condition of a problem. We have to first discuss how we compute the norm of a matrix in such a
way that it is compatible with the norm of a vector.
Computing matrix norms – The norm of a matrix may be computed in various ways. The simplest
way, which is an extension of the concept of the length of a vector, is called the Frobenius norm.
This is defined as follows.
$$\|A\|_{\text{Frobenius}} = \sqrt{ \sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^2 } \qquad [25]$$
The norm of a square matrix, A, with n rows and n columns, can also be defined in terms of the
norm of a vector. This is based on the idea that if x is a column vector with n rows then Ax is also
a vector with n rows. Thus, we can define a vector norm for both ||Ax|| and ||x||. For any x which
is not 0, so that ||x|| ≠ 0, the ratio of these can be interpreted as a vector norm for the matrix, A.
In particular, we define this norm as follows:
$$\|A\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} \qquad [26]$$
Equation [26] tells us that we seek the vector, x, which produces the maximum value of the ratio
in equation [26], provided that x is not the null vector, 0. The actual value of the matrix norm
defined according to equation [26] depends on the definition that we choose for the vector norm.
As noted in Kreyszig, it is possible to show that equation [26] leads to the following definition of a
matrix norm if we choose the “one” norm (sum of absolute values of the vector components).
$$\|A\| = \max_{j} \sum_{i=1}^{n} \left| a_{ij} \right| \qquad [27]$$
This is called the column sum norm because we take the sum of the absolute values of the
elements in each column. We then pick the maximum of these column sums as the norm. We
can also show that choosing the “infinity” norm as our definition of a norm gives the following, row
sum, definition of the matrix norm.
$$\|A\| = \max_{i} \sum_{j=1}^{n} \left| a_{ij} \right| \qquad [28]$$
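Both of these norms are trivial to compute. The sketch below, added here as an illustration (the array name and one-based indexing follow the other snippets in these notes), evaluates the row-sum (infinity) norm of equation [28]; the column-sum norm of [27] is the same loop with the subscripts exchanged.

    // row-sum (infinity) norm: the largest sum of absolute values in any row
    double normInf = 0;
    for (int i = 1; i <= n; i++) {
        double rowSum = 0;
        for (int j = 1; j <= n; j++)
            rowSum += fabs(a[i][j]);     // fabs from <cmath>
        if (rowSum > normInf) normInf = rowSum;
    }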
Regardless of the definition of the matrix norm, the basic form of the definition in equation [26]
tells us that ||A|| ||x|| equals the maximum value of ||Ax|| over all vectors with that value of ||x||. Since
any value of ||Ax|| must be less than this maximum possible value, we can say that ||Ax|| ≤ ||A|| ||x||.
This means that the following inequality must always hold.

$$\|Ax\| \le \|A\|\, \|x\| \qquad [29]$$
If x̃ is an approximate solution to Ax = b, we can define the residual, r, as follows.

$$r = b - A\tilde{x} = Ax - A\tilde{x} \qquad [30]$$

We want to see the effect that this residual has on the relative error in the solution.
If we premultiply this equation by A-1, we obtain an equation for the difference between the correct
and incorrect solution vectors. We then take the norm of this difference and use the inequality in
[29] ( ||Ab|| ≤ ||A|| ||b|| ) to obtain the following result.

$$x - \tilde{x} = A^{-1} r \qquad \Rightarrow \qquad \left\| x - \tilde{x} \right\| = \left\| A^{-1} r \right\| \le \left\| A^{-1} \right\| \left\| r \right\| \qquad [31]$$
We can also apply the result that ||Ab|| ≤ ||A|| ||b|| to Ax = b as follows.

$$\|b\| = \|Ax\| \le \|A\|\, \|x\| \qquad \Rightarrow \qquad \frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|} \qquad [32]$$
Dividing the inequality in [31] by ||x|| gives the following result.

$$\frac{\left\| x - \tilde{x} \right\|}{\|x\|} \le \left\| A^{-1} \right\| \left\| r \right\| \frac{1}{\|x\|} \qquad [33]$$
We now use a basic rule for inequalities: if u ≤ cv, where c is positive, and v ≤ w, then u ≤ cw. We
use this rule to combine the inequalities in [32] and [33] so as to eliminate the 1/||x|| term on the
right-hand side of [33].

$$\frac{\left\| x - \tilde{x} \right\|}{\|x\|} \le \left\| A^{-1} \right\| \left\| r \right\| \frac{1}{\|x\|} \le \left\| A^{-1} \right\| \left\| r \right\| \frac{\|A\|}{\|b\|} \qquad [34]$$
We solve for the relative error in x in terms of the ratio of the norm of the residual to the norm of the
right-hand-side vector, b.

$$\frac{\left\| x - \tilde{x} \right\|}{\|x\|} \le \left\| A^{-1} \right\| \left\| A \right\| \frac{\|r\|}{\|b\|} = \kappa(A)\, \frac{\|r\|}{\|b\|} \qquad [35]$$
In this equation we have defined the condition number, κ(A), of the matrix, A, as the product of
the norms, ||A|| ||A-1||. Equation [35] tells us that the relative error in the solution vector, x, is
proportional to the condition number times the relative residual. A large condition number
indicates that there may be difficulties in obtaining an accurate solution. There is no magic value
for a condition number that indicates a potential problem with solving linear equations. Generally
a condition number greater than 100 indicates a possible problem. As an example, consider the
following matrix.
$$A = \begin{bmatrix} 1.00001 & 0.99999 \\ 1 & 1 \end{bmatrix} \qquad [36]$$
We see that the two rows of this matrix are almost, but not quite, the same. When we are close to
having a linear dependence in the rows of a matrix, we have an ill conditioned matrix. To
determine the condition number, we first find the inverse using the general formula for the inverse
of a (2 x 2) matrix shown below.
$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{-1} = \frac{1}{a_{11} a_{22} - a_{21} a_{12}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \qquad [37]$$
Applying this formula to the matrix of equation [36] gives the inverse of that matrix.
$$A^{-1} = \begin{bmatrix} 50000 & -49999.5 \\ -50000 & 50000.5 \end{bmatrix} \qquad [38]$$
The norms for the A and A-1 matrices are computed in the table below. Both the row-sum
(infinity) norm and the column-sum (one) norm are computed.

norm                  ||A||       ||A-1||       κ(A) = ||A|| ||A-1||
row sum (infinity)    2.00000     100,000.5     200,001
column sum (one)      2.00001     100,000       200,001
In this example both choices of a norm give a condition number of 200,001, indicating that the
matrix is very ill conditioned. We expected this result since we could see that the two equations
implied by the original A matrix were very nearly identical. In a larger matrix, we can have linear
dependencies that are not so obvious. In such cases, the condition number will indicate possible
problems with the solution of equations based on those matrices.
The condition number also enters into the measure of the effects of changes in the input data in A
or b on changes in the solution. Appendix B shows how the condition number enters into these
results.
The Gauss-Jordan method

This is a variant of Gauss elimination in which the pivot row is subtracted not only from the rows
below the pivot row, but also from the rows above it. (Of course, the pivot row is multiplied by a
suitable factor to obtain zeros in the pivot column in all those rows.) With this change, the A
matrix is converted to a diagonal matrix, so the solutions for x are simply the modified values in
the b vector, divided by the diagonal elements on the row. The Gauss-Jordan method commonly
divides each pivot row by its diagonal element so that the solutions are simply the elements of the
modified b vector. This method is not commonly used because it involves more computational
work than Gauss elimination. It is sometimes used in the computation of matrix inverses.
If an inverse matrix is required, which is an unusual case, it can be found by the Gauss
elimination approach or one of the variants of that approach. If we define B = A-1, we have AB =
AA-1 = I. The matrices in the formula for AB = I are shown below.

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \qquad [39]$$
This form as well as the usual equation for matrix multiplication, written for equation [39] below,
show that an individual column of the matrix B = A-1 is found from the same process used to solve
Ax = b. To find column j of the inverse, the right-hand-side vector, b is all zeros except for row j,
which is a one. The column vector with the solution, x, is the jth column of the inverse.
$$AB = I \qquad \Rightarrow \qquad \sum_{k=1}^{n} a_{ik}\, b_{kj} = \delta_{ij} \qquad [40]$$
Because the Gauss-Jordan method is often applied to the computation of matrix inverses, the
code for this method is usually written to solve AX = B. In this notation, B is a matrix with one or
more columns corresponding to one or more right hand side (b) vectors and X is a matrix that
contains the solutions for each of those b vectors.
A simplified C++ code to illustrate the Gauss-Jordan method is shown below. This is a
modification of the code for Gauss elimination shown earlier. Here the variable nrhs denotes
the number of right-hand-side vectors that are used in the solution. The array b[row,rhs]
contains all the elements of the different right-hand-side vectors. This array is also used to
contain the solution. This is a common approach in linear equation solvers; the solution vector(s)
are contained in the space originally used for the right-hand-side vector(s).
b[row][rhs] = a[row][rhs+n];
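A hypothetical sketch of such a Gauss-Jordan routine, consistent with the assignment above and with the one-based indexing of the earlier snippets (and omitting the singularity tests and pivoting that production code would need), is:

    // augment columns n+1..n+nrhs of a with the right-hand-side vectors
    for (row = 1; row <= n; row++)
        for (rhs = 1; rhs <= nrhs; rhs++) a[row][n+rhs] = b[row][rhs];
    for (pivot = 1; pivot <= n; pivot++) {
        // divide the pivot row by its diagonal element
        double d = a[pivot][pivot];
        for (col = pivot; col <= n + nrhs; col++) a[pivot][col] /= d;
        // eliminate the pivot column from every other row, above and below
        for (row = 1; row <= n; row++) {
            if (row == pivot) continue;
            double factor = a[row][pivot];
            for (col = pivot; col <= n + nrhs; col++)
                a[row][col] -= factor * a[pivot][col];
        }
    }
    // the solutions replace the right-hand-side vectors
    for (row = 1; row <= n; row++)
        for (rhs = 1; rhs <= nrhs; rhs++) b[row][rhs] = a[row][rhs+n];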
The LU method
The LU method is a variation on the standard Gaussian elimination method. The LU method does
the preliminary steps that provide a simple approach to solving a set of equations even when the
right-hand-side vector, b, will be specified at some future time. This approach is based on
converting the A matrix into a product of a lower triangular matrix, L, and an upper triangular
matrix, U. Once these definitions are made, we can take advantage of the simple properties of
triangular matrices for the easy solution of a set of equations. The LU method has the advantage
that we can obtain the necessary triangular form for back-substitution even if we do not know the
right-hand-side b vector. This is useful for doing real-time data analysis in which we receive new
b vectors for which we want to solve the same set of simultaneous equations.
$$A = LU \qquad [41]$$
With the specification that L is lower triangular and U is upper triangular, we can write the matrix
product in equation [41] as shown in equation [42], below.

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & 1 \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \end{bmatrix} \qquad [42]$$
We can see that the U matrix in equation [42] is just the usual upper-triangular form that results
from the elimination part of the Gaussian elimination algorithm. This shows the similarity
to Gaussian elimination.4 The L matrix contains the various factors used to produce the U matrix
and will be used to perform the same operations on the b vector.
4 Note that both L and U have terms on the principal diagonal. The LU factorization defined here is not
unique. It is possible to have the unit diagonal on the U matrix instead of the L matrix. In general, it is
possible for the diagonal to have an arbitrary definition so long as it is consistent with the overall result of
The usual formula for matrix multiplication can be applied to the A = LU product.
$$a_{ij} = \sum_{k=1}^{n} l_{ik}\, u_{kj} \qquad [43]$$
We will use this formula to derive expressions for the components of the L and U arrays. We can
do this because of the zeros in the structure of the L and U arrays. For example, the first row of
the A matrix is the same as the first row of the U matrix because the only nonzero element in the
first row of the L matrix is l11, which equals one. To see this, we apply equation [43] to compute
a1j.
$$a_{1j} = \sum_{k=1}^{n} l_{1k}\, u_{kj} = l_{11} u_{1j} + (0)u_{2j} + (0)u_{3j} + \cdots + (0)u_{nj} = u_{1j} \qquad [44]$$
Although we wrote this equation to compute a1j, we already know a1j from the specification of the
simultaneous linear equation problem. What equation [44] gives us is the equation that we can
use to compute u1j.
$$u_{1j} = a_{1j} \qquad [45]$$
Similarly, we can apply the general matrix multiplication equation in [43] to the first column of the
A matrix, taking advantage of the zeros below the first element in the U matrix. This gives the
following result.
$$a_{i1} = \sum_{k=1}^{n} l_{ik}\, u_{k1} = l_{i1} u_{11} + l_{i2}(0) + l_{i3}(0) + \cdots + l_{in}(0) = l_{i1} u_{11} \qquad [46]$$
We can use this equation to solve for the first column of the L matrix, using the value of u11 found
from equation [45].
$$l_{i1} = \frac{a_{i1}}{u_{11}} \qquad [47]$$
We now have the first row of U and the first column of L. We can then proceed to compute the
second row of U. To do this we apply equation [43] to the computation of the a2j coefficients in
the original matrix. Because l22 = 1, and all other l2j with j > 2 are zero, this gives the following
result.
$$a_{2j} = \sum_{k=1}^{n} l_{2k}\, u_{kj} = l_{21} u_{1j} + l_{22} u_{2j} + (0)u_{3j} + \cdots + (0)u_{nj} = l_{21} u_{1j} + u_{2j} \qquad [48]$$
equation [42] and the restrictions that (1) L has no terms above the principal diagonal and (2) U has no terms
below the principal diagonal.
We can solve this equation for u2j, using the value of l21, found from equation [47] (for the first
column of the L array), and the values of u1j = a1j, found from equation [45] (for the first row of the
U array).

$$u_{2j} = a_{2j} - l_{21} u_{1j} \qquad j = 2, \ldots, n \qquad [49]$$
Having found the second row of the U array, we now have sufficient information to solve for the
second column of the L array. To do this we apply our general equation in [43] to the second
column of the original A array.
$$a_{i2} = \sum_{k=1}^{n} l_{ik}\, u_{k2} = l_{i1} u_{12} + l_{i2} u_{22} + l_{i3}(0) + \cdots + l_{in}(0) = l_{i1} u_{12} + l_{i2} u_{22} \qquad [50]$$
If we solve this equation for li2, we see that we can find all the elements of the second column
of L from the original matrix coefficients ai2 and previously computed values from the second
column of the U array (u12 and u22) and the first column of the L array (li1).

$$l_{i2} = \frac{a_{i2} - l_{i1} u_{12}}{u_{22}} \qquad i = 3, \ldots, n \qquad [51]$$
We can generalize this approach of finding one row of U followed by finding the column of
the L array with the same index. The general equation for the elements of row m of the U
matrix can be induced from equation [48]. That equation recognized that the value of l22 was
one and that values of l2k with k > 2 were zero. To generalize equation [48] we replace the index 2
by the general row index, m, so that it becomes an equation for amj. We recognize that the
diagonal term, lmm, is one and values of lmk with k > m are zero. We can thus rewrite equation
[48] for the general A matrix element, amj, as shown below.
$$a_{mj} = \sum_{k=1}^{m-1} l_{mk}\, u_{kj} + l_{mm} u_{mj} \qquad (\text{every term from } k = m+1 \text{ to } n \text{ is zero}) \qquad [52]$$
We wrote the lmm umj term separately because we will later set lmm = 1. All terms in the
sum for k > m are zero because values of lmk with k > m are zero. Setting lmm = 1 and solving
equation [52] for umj gives the following result.
$$u_{mj} = a_{mj} - \sum_{k=1}^{m-1} l_{mk}\, u_{kj} \qquad j = m, \ldots, n \qquad [53]$$
In a similar manner, we can generalize equation [50]: we solve for aim instead of ai2. The
summation in the general equation will only run from k = 1 to k = m, because the ukm terms in the
sum are zero for k > m. We can separate the final lim umm term out from the rest of the sum and
solve for lim to obtain the following result for the generalization of equation [50].
$$l_{im} = \frac{a_{im} - \sum_{k=1}^{m-1} l_{ik}\, u_{km}}{u_{mm}} \qquad i = m+1, \ldots, n \qquad [54]$$
Equations [53] and [54] provide the general approach for calculating the L and U matrix
coefficients.
In computer calculations, the L and U matrices may be stored in the same locations used for the
original A matrix. We can see how this is possible by examining the structure of the L and U
matrices shown in equation [42]. The only place that the L and U matrices overlap is on the
principal diagonal. However, on this diagonal, all the diagonal elements of the L matrix are equal
to one. If we remember that lii = 1 in all our calculations, we do not have to store this unit diagonal
of the L matrix. The matrix below shows how the components of the L and U matrices (except for
the lii terms on the diagonal that are all one) can be stored in the original A matrix.

$$\begin{bmatrix} u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\ l_{21} & u_{22} & u_{23} & \cdots & u_{2n} \\ l_{31} & l_{32} & u_{33} & \cdots & u_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & l_{n3} & \cdots & u_{nn} \end{bmatrix} \qquad [55]$$
The storage of L and U components in the space allocated for A makes the code simpler to write
but more difficult to understand. We use the notation for the original aij matrix to refer to the
components of both the L and U matrices. For example, the statements below show how we can
implement equation [53] to compute umj in Fortran 95 or C++ code.
Fortran 95:

    do j = m, n
      do k = 1, m-1
        a(m,j) = a(m,j) &
               - a(m,k) * a(k,j)
      end do
    end do

C++:

    for ( j = m; j <= n; j++)
      for ( k = 1; k <= m-1; k++)
        a[m][j] -= (a[m][k] * a[k][j]);
Anyone reviewing the code (including the original programmer) must recognize that the a(m,j)
(or a[m][j]) array components in the code, sometimes refer to the components of the original A
array, sometimes to the components of the U array and sometimes to the components of the L
array. The m subscript is set in the code by an outer loop (not shown here) that does the code
shown for all rows and columns. Before the loops shown above start, a(m,j) contains amj. At
the end of the loops, umj is stored in this array component. The array components a(m,k) and
a(k,j) contain lmk and ukj, respectively.
In the derivation of the process for finding the L/U decomposition of the A matrix, we outlined the
calculation as proceeding by finding one row of U followed by finding one column of L. In practice we
can compute the L and U arrays by looping over all columns of the A array. In the first column of
the A array, we compute u11 and all values of lk1 from k = 2 to k = n. We then proceed to column
2 where we compute u12, u22, and all values of lk2 from k = 3 to n.
The formulas used for these column-by-column calculations are summarized below: in column j, the entries on and above the diagonal become U components, and the entries below the diagonal become L components.

$$u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik}\, u_{kj} \quad (i \le j) \qquad\qquad l_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1} l_{ik}\, u_{kj}}{u_{jj}} \quad (i > j)$$
The following C++ code will rearrange an original A matrix into the L/U components, stored in the
same location as the original A matrix.
a[i][j] /= a[j][j];
}
}
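A hypothetical reconstruction of such a column-by-column, in-place factorization, consistent with the line above and with equations [53] and [54] (one-based indexing as elsewhere; no pivoting or zero-divisor checks), is shown below.

    // loop over the columns of the A array
    for (j = 1; j <= n; j++) {
        // rows 1..j of column j become the u(i,j) values, equation [53]
        for (i = 1; i <= j; i++)
            for (k = 1; k <= i-1; k++)
                a[i][j] -= a[i][k] * a[k][j];
        // rows j+1..n of column j become the l(i,j) values, equation [54]
        for (i = j+1; i <= n; i++) {
            for (k = 1; k <= j-1; k++)
                a[i][j] -= a[i][k] * a[k][j];
            a[i][j] /= a[j][j];
        }
    }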
Actual code would have to have a strategy for ensuring that diagonal elements like uii are not
zero. Additional computing strategies would use a pivoting strategy to reduce round-off error.
Once the L/U matrix is formed, the solution of the equation Ax = b is done by defining an
intermediate vector, y, so that the solution proceeds in two steps.
$$y = Ux \quad \text{so that} \quad Ax = LUx = Ly = b \qquad\qquad y = L^{-1}b \quad \text{and} \quad x = U^{-1}y \qquad [56]$$
Solution of the equations y = L-1b and x = U-1y is straightforward because both L and U are
triangular matrices that can be readily solved by forward and backward substitution, respectively.
The equation Ly = b can be represented as follows.
$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & 0 \\ l_{31} & l_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & l_{n3} & \cdots & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{bmatrix} \qquad [57]$$
From this equation, we can see that the various values of yi are given by the following sequence
of equations.
$$\begin{aligned} y_1 &= b_1 \\ y_2 &= b_2 - l_{21} y_1 \\ y_3 &= b_3 - l_{31} y_1 - l_{32} y_2 \end{aligned} \qquad [58]$$
The general equation for this forward solution is given by equation [59].
$$y_i = b_i - \sum_{k=1}^{i-1} l_{ik}\, y_k \qquad i = 1, 2, \ldots, n \qquad [59]$$
Once the y vector is known, the remaining matrix equation, Ux = y, has the following form.

$$\begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad [60]$$

From this form we can find the unknowns in reverse order, as shown below.
$$x_n = \frac{y_n}{u_{nn}} \qquad\quad x_{n-1} = \frac{y_{n-1} - u_{n-1,n}\, x_n}{u_{n-1,n-1}} \qquad\quad x_{n-2} = \frac{y_{n-2} - u_{n-2,n}\, x_n - u_{n-2,n-1}\, x_{n-1}}{u_{n-2,n-2}} \qquad [61]$$
The general equation for this backward solution is given by equation [62].
$$x_i = \frac{y_i - \sum_{k=i+1}^{n} u_{ik}\, x_k}{u_{ii}} \qquad i = n,\, n-1,\, n-2,\, \ldots,\, 2,\, 1 \qquad [62]$$
The code for the two steps in the final solution process, once the L/U components have been
stored in the original A array, is shown below.
x[i] /= a[i][i];
}
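A sketch of both substitution steps, consistent with the fragment above and with equations [59] and [62] (the L and U components are assumed to be stored in the a array, with the unit diagonal of L implied and not stored), might be:

    // forward substitution, equation [59]: solve Ly = b
    for (i = 1; i <= n; i++) {
        y[i] = b[i];
        for (k = 1; k <= i-1; k++)
            y[i] -= a[i][k] * y[k];
    }
    // back substitution, equation [62]: solve Ux = y
    for (i = n; i >= 1; i--) {
        x[i] = y[i];
        for (k = i+1; k <= n; k++)
            x[i] -= a[i][k] * x[k];
        x[i] /= a[i][i];
    }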
In a computer program, the intermediate vector, y, and the final solution vector, x, are usually
stored in the same location as the original right-hand-side vector, b. With this storage, the code
above can be replaced by the more compact code shown below.
b[i] /= a[i][i];
}
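A sketch of the compact version, with y and x overwriting b in place (again a reconstruction consistent with the fragment above, not a definitive implementation), follows; only the array name changes.

    // forward substitution with y stored in b
    for (i = 1; i <= n; i++)
        for (k = 1; k <= i-1; k++)
            b[i] -= a[i][k] * b[k];
    // back substitution with x stored in b
    for (i = n; i >= 1; i--) {
        for (k = i+1; k <= n; k++)
            b[i] -= a[i][k] * b[k];
        b[i] /= a[i][i];
    }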
We see that all the operations on the right-hand-side values, bi, can be done using the information from the
L/U array. It is not necessary to know the bi values in advance, as it is with Gaussian elimination.
In fact, the elements of the L array are actually the multipliers that are applied to the bi values
during normal Gaussian elimination.
The reduction of a matrix to its L/U form uses equations that are similar to Gaussian elimination.
This reduction process is susceptible to the same problems that plague Gaussian elimination.
The diagonal element, umm, that is used as a divisor in equation [54] may be zero. In addition, the
arrangement of the system of equations may affect the round-off error. Both of these problems
may be addressed by a strategy known as partial pivoting. In this strategy rows, but not columns,
are rearranged to eliminate zeros on the diagonal and reduce round-off error.
In the partial pivoting strategy, we proceed row by row to apply equations [53] and [54]. The
current row on which we are operating is called the pivot row. Before applying equation [53], we
search the column with the same index as the pivot row, called the pivot column, for the maximum
element (in absolute value). The row with this maximum value is then exchanged with the current
pivot row. In one practice, known as scaled partial pivoting, each row is first divided by the
maximum element (in absolute value) in the row. This makes the maximum element in each row
equal to one.
When this exchange is done in standard Gaussian elimination both the left-hand side of the
equation with the aij coefficients and the right-hand side solution vector, with components bi, can
be exchanged at the same time. In the LU method, we work only with the left-hand side
coefficients. Because of this we have to create a way of remembering the row swaps that we
make. This is done by creating a one-dimensional integer array, swap[row]. Initially this array
is initialized by code like the following.
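A minimal sketch of that initialization, assuming the same one-based indexing as the other snippets:

    for ( row = 1; row <= n; row++) swap[row] = row;  // each equation starts in its own row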
That is, the initial value of swap[row] is simply the row number. When two rows, say row
pivot and row maxPivot are exchanged, the following code is used to keep track of the
exchanges.
temp = swap[pivot];
swap[pivot] = swap[maxPivot];
swap[maxPivot] = temp;
When this is done consistently, the swap[row] array will contain the original row numbers of the
equations as they are arranged at the end of the LU process. To see this, consider the example
below.
Assume that the first swap occurs when row 4 is the pivot row; assume further that row 4 is to be
exchanged with row 7 (pivot = 4, maxPivot = 7). Since this is the first exchange, swap(4) = 4 and
swap(7) = 7 from the initialization code. The swap code then operates as follows: temp = swap[4]
= 4; swap[4] = swap[7] = 7; and swap[7] = temp = 4. We thus know that the equation whose
coefficients are currently in row 7 was originally equation 4 and vice versa.
If there is a subsequent exchange when row 7 is the pivot row (pivot = 7), where row 11
(maxPivot = 11) is exchanged with row 7, the swap code produces the following result: temp =
swap[7] = 4; swap[7] = swap[11] = 11; and swap[11] = temp = 4.
The values of the first thirteen elements in the swap(row) array, after these two exchanges, are
shown in the table below.
row 1 2 3 4 5 6 7 8 9 10 11 12 13
swap[row] 1 2 3 7 5 6 11 8 9 10 4 12 13
This table shows how the swap[row] array tracks the changes due to row swaps. The value of
the swap[row] array, for a particular row, is the original row number of the equation currently
located in that row. For example, the value of swap[11] = 4 shows that the equation originally in
row 4 has been moved to row 11.
The final swap[row] array, at the end of the formation of the LU arrays, is used to get the correct
b[i] values in the substitution process. The structure of the LU array does not change; only
the identity of the original rows has changed. Thus, when the forward-substitution process in the
example here reaches the fourth row it has to use the right-hand side value from the 7th equation
in the original set. The modifications to the code shown earlier that handle the final steps with
pivoting are shown below.
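A sketch of how these final steps might look (a reconstruction consistent with the description that follows, not the original code):

    // forward substitution picks up each right-hand-side value through swap[]
    for (i = 1; i <= n; i++) {
        y[i] = b[swap[i]];   // right-hand side of the equation now stored in row i
        for (k = 1; k <= i-1; k++)
            y[i] -= a[i][k] * y[k];
    }
    // back substitution is unchanged because the column order was not changed
    for (i = n; i >= 1; i--) {
        x[i] = y[i];
        for (k = i+1; k <= n; k++)
            x[i] -= a[i][k] * x[k];
        x[i] /= a[i][i];
    }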
In this code, the y[i] array is used to hold intermediate values. For the ith row, this array is
initialized by the statement y[i] = b[swap[i]]. This places the correct right-hand-side b
value into the solution process. When this code reaches the 4th row, in the example above, this
statement will set y[4] = b[swap[4]] = b[7]. This statement correctly uses the right-hand side value
from the 7th equation whose left-hand-side coefficients have been substituted into the 4th row.
The solution process continues, however, with the correct calculations for the 4th (and
subsequent) rows.
The intermediate y array is also used in the first part of the loop that obtains the final solutions.
This preserves the algorithm required for the back-substitution process. There is no need to use the
swap[row] array for the intermediate y[i] values or the final x[i] values. The ordering of these
values depends on the column order. So long as we do not swap columns, the order of the
unknowns (and the intermediate yi variables) does not change.
The array used for the input b array may be used for storing the y values and the final x values to
save storage and make the code slightly more compact (although less understandable). This
assumes that there is no further need for the b array after the solution has been computed.
APPENDICES

Appendix A – Round-off error and error propagation

There are many sources of error in computer analyses. All of these have analogs to ordinary
human computation. Numerical analysis schemes that represent continuous operations in
calculus by approximating functions or finite difference expressions have truncation error that is
due to the mathematical approximation. When we represent a repeating decimal or an irrational
number with a fixed number of decimal places, we have a round-off error. Such errors are more
important in computers where we can do an enormous number of sequential calculations that will
expand an initial round-off error. In creating mathematical models of real engineering systems,
we can have modeling errors. For example, the assumption of a linear stress-strain relationship
may be extended into a nonlinear region. Finally, there are blunders. These may be errors in
experimental data, incorrect data entry into a program, incorrect coding of a program, and the like.
If we perform calculations with erroneous data, we will introduce errors in our results. If two
quantities, x and y, are represented by incorrect values x̃ and ỹ, we define the errors in each
term, εx and εy, as follows.

$$\tilde{x} = x + \varepsilon_x \qquad\qquad \tilde{y} = y + \varepsilon_y \qquad [A-1]$$
In considering error propagation, we assume that the errors in the two terms in an operation can
add. Thus, for addition or subtraction we have the following result.
$$\tilde{x} + \tilde{y} = (x + \varepsilon_x) + (y + \varepsilon_y) = x + y + (\varepsilon_x + \varepsilon_y) \qquad [A-2]$$
The relative error, εrel, in this addition or subtraction operation is defined as follows.
$$\varepsilon_{rel} = \frac{(\tilde{x} + \tilde{y}) - (x + y)}{x + y} = \frac{\varepsilon_x + \varepsilon_y}{x + y} \approx \frac{\varepsilon_x + \varepsilon_y}{\tilde{x} + \tilde{y}} \qquad [A-3]$$
The final term in this equation is the estimate of the error that we can calculate. (We do not know
x and y exactly, only their approximations, x̃ and ỹ.) If we add two numbers with opposite signs
or subtract two numbers with the same sign, we can have a very large relative error if the two
numbers are close in magnitude. In some cases, such as finding the roots of a quadratic
equation, it is possible to develop algorithms to avoid such large relative errors.
The relative error for multiplication is found as shown below. In this process, the product of two
error terms, εxεy, is assumed to be small compared to terms which are first order in the error.
$$\tilde{x}\tilde{y} = (x + \varepsilon_x)(y + \varepsilon_y) = xy + x\varepsilon_y + y\varepsilon_x + \varepsilon_x\varepsilon_y \approx xy + x\varepsilon_y + y\varepsilon_x \qquad [A-4]$$

$$\varepsilon_{rel} = \frac{\tilde{x}\tilde{y} - xy}{xy} \approx \frac{x\varepsilon_y + y\varepsilon_x}{\tilde{x}\tilde{y}} \approx \frac{\varepsilon_y}{\tilde{y}} + \frac{\varepsilon_x}{\tilde{x}} \qquad [A-5]$$
The derivation of the relative error for division, shown below, assumes that the relative error in the
divisor, measured as εy/ỹ, is small enough to allow the use of a series expansion for 1/(1 + x), which only
converges if |x| is less than one.
$$\frac{\tilde{x}}{\tilde{y}} = \frac{x + \varepsilon_x}{y + \varepsilon_y} = \frac{x + \varepsilon_x}{y} \left( \frac{1}{1 + \varepsilon_y/y} \right) \approx \frac{x + \varepsilon_x}{y} \left( 1 - \frac{\varepsilon_y}{y} \right) \approx \frac{x}{y} + \frac{\varepsilon_x}{y} - \frac{x\,\varepsilon_y}{y^2} \qquad [A-6]$$
As usual, we assume that the errors add so that we have the following result for error propagation
in division.
$$\varepsilon_{rel} = \frac{\dfrac{\tilde{x}}{\tilde{y}} - \dfrac{x}{y}}{\dfrac{x}{y}} \approx \left( \frac{\varepsilon_x}{y} + \frac{x\,\varepsilon_y}{y^2} \right) \frac{y}{x} = \frac{\varepsilon_x}{x} + \frac{\varepsilon_y}{y} \approx \frac{\varepsilon_x}{\tilde{x}} + \frac{\varepsilon_y}{\tilde{y}} \qquad [A-7]$$
Thus, both multiplication and division have the same result for the combined relative error.
The derivations above apply to individual operations and look at the worst-case condition where
the errors are assumed to add. Advanced analyses of error propagation take a statistical
approach to the analysis of error propagation and produce less severe errors. However, the size
of the errors in a set of calculations increases as the number of required calculations increases.
Ill-conditioned problems – Certain problems lead to difficulties in solution because small changes in
the inputs can make a large change in the results. As an example of this, consider the following
algorithm, known as Newton’s method, for the solution of one equation in one unknown. The
equation is expressed as f(x) = 0 and we seek the value of x that makes this true. We denote the
successive guesses to the solution as x0, x1, …xn, … . The general rule for going from one
iteration on x, xn to a new iteration value, xn+1, is given by the following equation.
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad [A-8]$$
We assume that f(x) is a function whose first derivative, f′(x) = df/dx, can be computed.
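As a concrete illustration (not part of the original notes), the iteration in equation [A-8] can be coded in a few lines; the function, its derivative, the starting guess, and the tolerance below are all assumptions chosen for the example.

    #include <cmath>
    #include <iostream>

    int main() {
        double x = 2;                           // assumed starting guess
        for (int n = 0; n < 50; n++) {
            double f  = x * x - 3;              // example equation: f(x) = x^2 - 3 = 0
            double fp = 2 * x;                  // its derivative, f'(x) = 2x
            double xNew = x - f / fp;           // equation [A-8]
            if (std::fabs(xNew - x) < 1e-12) { x = xNew; break; }
            x = xNew;
        }
        std::cout << "root near " << x << "\n"; // prints about 1.73205
    }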
Newton’s method can be extended to a system of nonlinear equations. We assume that we have
N equations in N unknowns. Each individual equation can be written in the form fi(x1, x2, …, xN) = 0,
where we have N such equations from f1 = 0 to fN = 0. We can represent the set of equations in
the vector notation f(x) = 0. Our iterations in this multidimensional system take us from one
multidimensional point at iteration n, x(n), to the new iteration point, x(n+1). To get the iteration
equation for this process we expand each equation in a Taylor series about a fixed point in our N-
dimensional space, x(n). If we drop terms that are second order and higher, this series becomes
$$f\!\left(x^{(n+1)}\right) \approx f\!\left(x^{(n)}\right) + J^{(n)} \left( x^{(n+1)} - x^{(n)} \right) \qquad [A-9]$$
In this series, we have defined J(n) as the matrix of partial derivatives, $J_{km}^{(n)} = \left( \partial f_k / \partial x_m \right)^{(n)}$, where the
(n) superscript in this definition indicates that the partial derivatives are evaluated at the point x =
x(n). Writing equation [A-9] in full matrix form, with higher-order terms neglected, gives the
following result.

$$\begin{bmatrix} f_1(x^{(n+1)}) \\ \vdots \\ f_N(x^{(n+1)}) \end{bmatrix} \approx \begin{bmatrix} f_1(x^{(n)}) \\ \vdots \\ f_N(x^{(n)}) \end{bmatrix} + \begin{bmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_N \\ \vdots & \ddots & \vdots \\ \partial f_N/\partial x_1 & \cdots & \partial f_N/\partial x_N \end{bmatrix}^{(n)} \begin{bmatrix} x_1^{(n+1)} - x_1^{(n)} \\ \vdots \\ x_N^{(n+1)} - x_N^{(n)} \end{bmatrix} \qquad [A-10]$$
We can also write equation [A-10] in a summation form for the general function, fk.

$$f_k\!\left(x^{(n+1)}\right) \approx f_k\!\left(x^{(n)}\right) + \sum_{m=1}^{N} J_{km}^{(n)} \left( x_m^{(n+1)} - x_m^{(n)} \right) = f_k\!\left(x^{(n)}\right) + \sum_{m=1}^{N} \left( \frac{\partial f_k}{\partial x_m} \right)^{(n)} \left( x_m^{(n+1)} - x_m^{(n)} \right) \qquad [A-11]$$
Since the solution we seek is for f(x) = 0, we set fk(x(n+1)) = 0 for k = 1, …, N. Since the functions
are, in general, nonlinear, we do not expect this to give us the correct solution. However, we
hope that it will provide an iterative procedure that will eventually lead to a solution. Setting the
left-hand side of equation [A-11] equal to zero gives us the following system of linear equations to
solve for the increment in x.
$$\sum_{m=1}^{N} J_{km}^{(n)} \left( x_m^{(n+1)} - x_m^{(n)} \right) = \sum_{m=1}^{N} \left( \frac{\partial f_k}{\partial x_m} \right)^{(n)} \left( x_m^{(n+1)} - x_m^{(n)} \right) = -f_k\!\left(x^{(n)}\right) \qquad k = 1, \ldots, N \qquad [A-12]$$
At each iteration step we have to solve the system of linear equations in [A-12] to obtain the
new values for our unknowns, xm(n+1).
An example of ill conditioning in this process occurs in solving for thermodynamic properties of
liquids. Thermodynamic property equations for real substances are usually written in terms of
temperature and specific volume (or temperature and density) as independent variables. If we
are trying to find the temperature and density that give a specified entropy and pressure, we are
trying to solve equations of the form s – s0 = 0 and P – P0 = 0. In this case equation [A-12]
represents the two following equations.
$$\left( \frac{\partial s}{\partial T} \right)\!\left( T^{(n+1)} - T^{(n)} \right) + \left( \frac{\partial s}{\partial v} \right)\!\left( v^{(n+1)} - v^{(n)} \right) = s_0 - s\!\left(T^{(n)}, v^{(n)}\right)$$

$$\left( \frac{\partial P}{\partial T} \right)\!\left( T^{(n+1)} - T^{(n)} \right) + \left( \frac{\partial P}{\partial v} \right)\!\left( v^{(n+1)} - v^{(n)} \right) = P_0 - P\!\left(T^{(n)}, v^{(n)}\right) \qquad [A-13]$$
We can apply Cramer’s rule to write the solution for the changes in temperature and specific
volume (at each iteration) as follows.
$$T^{(n+1)} - T^{(n)} = \frac{\left[ s_0 - s\!\left(T^{(n)}, v^{(n)}\right) \right] \dfrac{\partial P}{\partial v} - \left[ P_0 - P\!\left(T^{(n)}, v^{(n)}\right) \right] \dfrac{\partial s}{\partial v}}{\dfrac{\partial s}{\partial T}\dfrac{\partial P}{\partial v} - \dfrac{\partial s}{\partial v}\dfrac{\partial P}{\partial T}}$$

$$v^{(n+1)} - v^{(n)} = \frac{\left[ P_0 - P\!\left(T^{(n)}, v^{(n)}\right) \right] \dfrac{\partial s}{\partial T} - \left[ s_0 - s\!\left(T^{(n)}, v^{(n)}\right) \right] \dfrac{\partial P}{\partial T}}{\dfrac{\partial s}{\partial T}\dfrac{\partial P}{\partial v} - \dfrac{\partial s}{\partial v}\dfrac{\partial P}{\partial T}} \qquad [A-14]$$
The partial derivative, ∂P/∂v, is infinite for an incompressible fluid. For liquids, which are
virtually incompressible, this is a very large number. In the volume correction equation, this
derivative appears in both the numerator and denominator. However, in the temperature
correction equation it appears in the denominator only. In both equations the terms in the
numerator, s0 – s(T,v) and P0 – P(T,v), represent the differences between the desired entropy and
pressure values and the current guesses for those quantities. These differences in the temperature
correction equation are divided by the large value of ∂P/∂v, which dominates the denominator.
Because of this, the temperature correction can easily become zero to within the machine epsilon
even when there are still significant differences between the guesses and the desired values for
pressure and entropy.
Appendix B – Effect of changes in A or b on the solution

If the data in the A matrix are changed by an amount δA (or if there are errors of this amount), the
solution will change by an amount δx, such that

$$(A + \delta A)(x + \delta x) = b \qquad [B-1]$$

$$\begin{aligned} Ax + A\,\delta x + \delta A\,(x + \delta x) &= b \\ A\,\delta x &= -\,\delta A\,(x + \delta x) \end{aligned} \qquad [B-2]$$
Premultiplying the second equation in [B-2] by A–1 gives (after noting that A–1A δx = I δx = δx)

$$\delta x = -A^{-1}\,\delta A\,(x + \delta x) \qquad [B-3]$$
Taking norms of equation [B-3] and applying the inequality ||Ax|| ≤ ||A|| ||x|| two times gives
the result in [B-4]. (Notice that taking the norm removes the minus sign, since the norm, as a
measure of size, is a positive quantity. This is also true in mechanics. A vector velocity
of –10 m/s has a magnitude of 10 m/s.)

$$\|\delta x\| \le \left\| A^{-1} \right\| \left\| \delta A \right\| \left\| x + \delta x \right\| \qquad [B-4]$$
We can divide equation [B-4] by ||x + δx|| and use the condition number κ(A), defined in equation [35]
as the product of the norms ||A|| ||A-1||, to replace ||A-1||.

$$\frac{\|\delta x\|}{\|x + \delta x\|} \le \left\| A^{-1} \right\| \left\| \delta A \right\| = \kappa(A)\, \frac{\|\delta A\|}{\|A\|} \qquad [B-5]$$
Equation [B-5] tells us that the relative change in the norm of the solution is less than the relative
change in the norm of the A matrix times the condition number, κ(A). A large condition number
will amplify the effect of changes (or data errors) in the A matrix on the solution.
Similarly, if the data in the b vector are changed by an amount δb (or if there are errors of this
amount), the solution will change by an amount δx, such that

$$A(x + \delta x) = b + \delta b \qquad [B-6]$$
$$A\,\delta x = \delta b \qquad [B-7]$$
Premultiplying equation [B-7] by A–1 gives (after noting that A–1A δx = I δx = δx)
$$\delta x = A^{-1}\,\delta b \qquad [B-8]$$
Taking norms of equation [B-8], applying the inequality ||Ax|| ≤ ||A|| ||x||, and replacing the
norm ||A-1|| by the ratio κ(A)/||A|| gives the result in [B-9].
$$\|\delta x\|=\|A^{-1}\,\delta b\|\le\|A^{-1}\|\,\|\delta b\|=\kappa(A)\,\frac{\|\delta b\|}{\|A\|}$$  [B-9]

Dividing both sides of [B-9] by ||x|| gives

$$\frac{\|\delta x\|}{\|x\|}\le\kappa(A)\,\frac{\|\delta b\|}{\|A\|\,\|x\|}$$  [B-10]
Since Ax = b, we have the following relationship for norms:

$$\|b\|=\|Ax\|\le\|A\|\,\|x\|$$  [B-11]
If we replace ||A|| ||x|| in the denominator of the larger term in the inequality [B-10] by ||b||, we will
be dividing by a smaller term. Thus, the inequality will still hold. This leads to our final conclusion
shown below.
$$\frac{\|\delta x\|}{\|x\|}\le\kappa(A)\,\frac{\|\delta b\|}{\|b\|}$$  [B-12]
Again, we see that the condition number of the matrix determines the relationship between the
relative changes in data (or errors) in the b vector and the relative changes in the solution, δx.
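To make the bound in [B-12] concrete, the sketch below computes the condition number of a small, nearly singular matrix (the entries are hypothetical, chosen only for illustration) using the infinity norm, the maximum absolute row sum, and then evaluates the amplification that [B-12] allows for a relative data error of 10⁻⁸.

```cpp
#include <cstdio>
#include <cmath>

// Infinity norm of a 2x2 matrix: the maximum absolute row sum.
double normInf(const double A[2][2]) {
    double r0 = std::fabs(A[0][0]) + std::fabs(A[0][1]);
    double r1 = std::fabs(A[1][0]) + std::fabs(A[1][1]);
    return r0 > r1 ? r0 : r1;
}

int main() {
    // A nearly singular 2x2 matrix (hypothetical entries).
    double A[2][2] = {{1.0, 1.0}, {1.0, 1.0001}};

    // Invert the 2x2 matrix directly from its determinant.
    double det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
    double Ainv[2][2] = {{ A[1][1]/det, -A[0][1]/det},
                         {-A[1][0]/det,  A[0][0]/det}};

    double kappa = normInf(A) * normInf(Ainv);  // condition number
    printf("condition number = %g\n", kappa);   // about 4e4
    printf("bound on relative solution change for a 1e-8 data error: %g\n",
           kappa * 1e-8);                       // about 4e-4
    return 0;
}
```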
Introduction – Most people have heard that a computer is a binary device, but many students are
not familiar with the details of binary numbers and their representation on a computer. This
appendix provides background on binary numbers and their use in computer representation of
numerical data. Typical representations in C++ are used as examples, but the representation and
some of the quantitative limits are similar for other languages.
Binary numbers – The interpretation of binary numbers is a specific application of the general
approach to representing numbers in any base. In our usual base ten, we recognize that the
number 132 represents 1 times 10² plus 3 times 10¹ plus 2 times 10⁰. We can create a formula for
representing this number in base ten and then generalize it to other bases. To start, we write
the three digits in our number (the 1, the 3, and the 2 in the number 132) as the symbol dᵢ for the
digit in position i. We number the digits from right to left, starting with the symbol d₀ for the
rightmost digit, which is 2 in this example of the number 132. The first digit to the
left of d₀ is d₁, which is 3 in this example. The next digit to the left is d₂, which is 1 in our example.
We can represent a general three-digit number by the symbols d₂d₁d₀. Since this is a base-ten
number, we can calculate the value of the number as d₂(10²) + d₁(10¹) + d₀(10⁰). Note that the
range of digits for a base-ten number is zero to nine.
How can we make this more general? We want to represent a number, in any base, with an
arbitrary number of digits. For a base-ten number with N digits, we can write the
following equation for the value of the number.

$$d_{N-1}d_{N-2}\cdots d_1 d_0 = d_{N-1}(10^{N-1}) + d_{N-2}(10^{N-2}) + \cdots + d_1(10^1) + d_0(10^0)$$

Here, we have written the number on the left as a sequence of digits and on the right as the
calculation that gives its value. Note that we number the digits from zero to N–1 for our total of
N digits. We can rewrite the equation above using the usual sigma notation for sums.
$$d_{N-1}d_{N-2}\cdots d_1 d_0 = \sum_{k=0}^{N-1} d_k\,(10^k)$$
In this equation, we use ten because that is the base of our usual number system. To consider
other number systems such as binary (base 2), octal (base 8), or hexadecimal (base 16), we use
the same formula, but we replace the factor of ten by the appropriate base, b.
$$d_{N-1}d_{N-2}\cdots d_1 d_0 = \sum_{k=0}^{N-1} d_k\,(b^k)$$
When we want to formally represent the base that we are using for a number, we include it as a
subscript following the number. Thus, the base-ten number shown above as 132 could be
formally written as 132₁₀ to emphasize that this is a base-ten number.
Note that the value of any “digit”, dₖ, in a given numbering system must be less than the base b.
We know that decimal (base-ten) digits range from zero to 9; similarly, octal digits range from 0 to
7 and binary digits range from 0 to 1. Hexadecimal digits range from 0 to f, where we have the
following correspondence between hexadecimal digits and base-ten numbers: a₁₆ = 10₁₀; b₁₆ =
11₁₀; c₁₆ = 12₁₀; d₁₆ = 13₁₀; e₁₆ = 14₁₀; f₁₆ = 15₁₀. We can write our example number 132₁₀ in the
following bases: 132₁₀ = 10000100₂ = 204₈ = 84₁₆. The table below shows the numbers from 0₁₀
to 19₁₀ in four bases (decimal, binary, octal, and hexadecimal).

Decimal   Binary   Octal   Hexadecimal
   0          0       0        0
   1          1       1        1
   2         10       2        2
   3         11       3        3
   4        100       4        4
   5        101       5        5
   6        110       6        6
   7        111       7        7
   8       1000      10        8
   9       1001      11        9
  10       1010      12        a
  11       1011      13        b
  12       1100      14        c
  13       1101      15        d
  14       1110      16        e
  15       1111      17        f
  16      10000      20       10
  17      10001      21       11
  18      10010      22       12
  19      10011      23       13
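The power-series formula above is easy to turn into code. The following minimal C++ sketch (the function name valueInBase is ours, purely for illustration) evaluates a string of digits in any base up to 16 by accumulating the sum in Horner form, value = value·b + dₖ, which is algebraically identical to summing the terms dₖ(bᵏ).

```cpp
#include <cstdio>

// Evaluate a digit string in base b (2 <= b <= 16) using the formula
// d(N-1)...d1 d0 = sum over k of d_k * b^k, with digits 0-9 and a-f.
long valueInBase(const char *digits, int b) {
    long value = 0;
    for (const char *p = digits; *p != '\0'; ++p) {
        int d = (*p >= 'a') ? (*p - 'a' + 10) : (*p - '0');
        value = value * b + d;  // Horner form of the power sum
    }
    return value;
}

int main() {
    printf("10000100 (base 2)  = %ld\n", valueInBase("10000100", 2));  // 132
    printf("204      (base 8)  = %ld\n", valueInBase("204", 8));       // 132
    printf("84       (base 16) = %ld\n", valueInBase("84", 16));       // 132
    return 0;
}
```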
For binary numbers the only digits we can have are zero and one. Thus, our general-base
equation is written as follows for binary (b = 2) numbers.
$$d_{N-1}d_{N-2}\cdots d_1 d_0 = \sum_{k=0}^{N-1} d_k\,(2^k)$$
How large a number can we represent on a computer if we use 16 binary digits? (A binary digit is
usually called a bit, and eight binary digits or bits are called a byte.) The minimum value of our 16-
bit binary number is zero, and the maximum occurs when all the bits are one. With N = 16, our
sum index k ranges from 0 to 15, and we get the following result when all sixteen bits are one.
$$d_{15}d_{14}\cdots d_1 d_0 = \sum_{k=0}^{15} d_k\,(2^k)\quad\text{so that}\quad
1111111111111111_2 = \sum_{k=0}^{15} 1\,(2^k) = 2^{15} + 2^{14} + \cdots + 2^1 + 2^0 = 65\,535$$
In any case like this one, where all the binary bits are one, we can add one to the number to
obtain the following result.
$$1_{N-1}1_{N-2}\cdots 1_1 1_0 + 1 = \sum_{k=0}^{N-1} 2^k + 1 = 1_N 0_{N-1}0_{N-2}\cdots 0_1 0_0 = 2^N + \sum_{k=0}^{N-1} 0\,(2^k) = 2^N$$

$$\text{so that}\quad 1_{N-1}1_{N-2}\cdots 1_1 1_0 = 2^N - 1$$
We can express this result as follows: the maximum value of a binary number with N digits is
2ᴺ – 1. If N = 16, we obtain the result shown previously: the maximum number that we can
represent is 2¹⁶ – 1 = 65 535₁₀. If we wanted to use these 16 bits to represent both positive and
negative numbers, we could interpret the electronic bits in the computer so that the electronic
binary 0 was the lowest number and the computer value of 1000000000000000₂ = 32 768₁₀ would
represent zero. In this case our computer circuit would have to subtract 32 768₁₀ from the binary
bits stored in memory to give us our actual number. Since our 16-bit number ranges from 0 to
65 535₁₀ in the computer memory, subtracting 32 768₁₀ from this range gives us a range
of –32 768₁₀ to 32 767₁₀ for our 16-bit number when we want to have negative numbers.
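The offset scheme just described can be sketched in a few lines of C++. This is only an illustration of the excess-32 768 interpretation above; real hardware normally stores negative integers in the two's-complement representation instead.

```cpp
#include <cstdio>

int main() {
    // Interpret raw 16-bit patterns with the excess-32768 scheme described
    // above: the stored value minus 32768 gives the actual number.
    unsigned raw_zero = 0x8000;  // 1000000000000000 in binary = 32768
    unsigned raw_min  = 0x0000;  // all sixteen bits zero
    unsigned raw_max  = 0xFFFF;  // all sixteen bits one = 65535

    printf("%d\n", (int)raw_zero - 32768);  // prints      0
    printf("%d\n", (int)raw_min  - 32768);  // prints -32768
    printf("%d\n", (int)raw_max  - 32768);  // prints  32767
    return 0;
}
```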
Of course, choosing more or fewer than 16 bits would affect the range of our numbers. In the next
section, we discuss the various data types and show how their allowed ranges are governed by
the way in which the numbers are represented as binary information.
Visual C++ as an example of typical data types – C++ has two basic kinds of numerical data
types: integer numbers, without decimal points, and “floating point” numbers, which have a decimal
point. (Some accounting systems use “fixed point” types that have a fixed range for decimal
numbers. Some specialized programs like MATLAB offer variable precision, where a user can
specify the number of significant digits.) The C++ standard allows the sizes of the integer data types
to be set by the compiler vendors, but there is a set of rules that governs the relationships among the
sizes of the various types. The names of the integer data types shown below are common to all
C++ implementations. The ranges shown for each type are those used in Visual C++; however,
these ranges are common to many other compilers. (Note that the int data type, according to the
C++ standard, can have a range anywhere between that of the short and that of the long):
short a 16-bit integer data type with range –32 768 to 32 767
int a 32-bit integer data type with range –2 147 483 648 to 2 147 483 647
long a 32-bit integer data type with range –2 147 483 648 to 2 147 483 647
unsigned short a 16-bit integer data type with range 0 to 65 535
unsigned int a 32-bit integer data type with range 0 to 4 294 967 295
unsigned long a 32-bit integer data type with range 0 to 4 294 967 295
The ranges for the short and the unsigned short are just the numbers that we found above
when analyzing the 16-bit binary number. For a 32-bit number, the maximum range is from 0 to
2³² – 1 = 4 294 967 295. This is the range for the unsigned long and unsigned int. When we want
negative integers, we use half this range for negative numbers and half for positive numbers and
zero.
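Because the standard lets these sizes vary, a program can ask the compiler for its own limits. The sketch below prints the ranges provided by whatever compiler is in use; the output may differ from the Visual C++ values quoted above.

```cpp
#include <iostream>
#include <limits>

int main() {
    // Report the integer ranges supplied by the current compiler.
    std::cout << "short:          " << std::numeric_limits<short>::min()
              << " to "             << std::numeric_limits<short>::max() << '\n';
    std::cout << "int:            " << std::numeric_limits<int>::min()
              << " to "             << std::numeric_limits<int>::max() << '\n';
    std::cout << "long:           " << std::numeric_limits<long>::min()
              << " to "             << std::numeric_limits<long>::max() << '\n';
    std::cout << "unsigned short: 0 to "
              << std::numeric_limits<unsigned short>::max() << '\n';
    std::cout << "unsigned int:   0 to "
              << std::numeric_limits<unsigned int>::max() << '\n';
    return 0;
}
```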
Floating point data types are represented in the binary equivalent of a power notation. That is, we
can represent numbers like 6.02×10²³ and 1.38×10⁻¹⁶ in terms of their power of ten (called the
characteristic) and their prefactor (called the mantissa). We also require a bit to tell us whether the
number is positive or negative. The formats of floating point numbers used in computers are
governed by an international standard that started as IEEE standard 754 in 1985. In this standard
a double-precision floating point number is stored in eight bytes or 64 bits. These 64 bits are used
as follows: 1 sign bit, 11 bits for the (binary) characteristic, and 52 bits for the mantissa.
With 11 bits for the characteristic, we can handle a range from 0 to 2¹¹ – 1 = 2047. The maximum
value in this range corresponds to 2²⁰⁴⁷ ≈ 10⁶¹⁶. This range is divided between negative and
positive exponents, giving magnitudes between roughly 10⁻³⁰⁸ and 10³⁰⁸. The mantissa of a
normalized binary number always starts with a 1. This 1 is not stored, so the effective number of
bits for the mantissa is really 52 + 1 = 53. Fifty-three binary bits can store a number up to 2⁵³ – 1 ≈
9.007×10¹⁵. Thus the mantissa can represent almost sixteen significant decimal digits.
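The 1/11/52 split can be inspected directly. The following minimal sketch assumes the machine stores a double in the IEEE 754 format, which is true on essentially all modern hardware.

```cpp
#include <cstdio>
#include <cstdint>
#include <cstring>

int main() {
    double x = 1.0;
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);            // copy the raw 64 bits

    uint64_t sign     = bits >> 63;                 // 1 sign bit
    uint64_t charac   = (bits >> 52) & 0x7FF;       // 11-bit characteristic
    uint64_t mantissa = bits & ((1ULL << 52) - 1);  // 52 stored mantissa bits

    // For x = 1.0: sign = 0; characteristic = 1023 (the stored exponent is
    // biased by 1023, so the true exponent is zero); mantissa = 0 because
    // the leading 1 of the normalized mantissa is implicit and not stored.
    printf("sign = %llu, characteristic = %llu, mantissa = %llu\n",
           (unsigned long long)sign, (unsigned long long)charac,
           (unsigned long long)mantissa);
    return 0;
}
```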
C++ has three floating point types: float, double, and long double. Each of these data types
represents a limited range of positive and negative numbers and zero. The range for double is
shown below.

2.2250738585072014×10⁻³⁰⁸ to 1.7976931348623158×10³⁰⁸
In Microsoft Visual C++, there is no difference between a double and a long double. Other
compiler vendors, who do offer a difference, usually give the long double a range of approximately
3.4×10⁻⁴⁹³² to 1.2×10⁴⁹³².
The float data type occupies 4 bytes or 32 bits; the double type occupies 8 bytes or 64 bits; the
long double described in the previous paragraph occupies 10 bytes or 80 bits. The number of bits
for the characteristic defines the exponent range, and the number of bits for the mantissa defines
the number of significant figures. The float type represents about seven significant figures.
Double and long double (in systems where long double uses more than 8 bytes for storage)
represent about 15 and 19 significant figures, respectively.
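The difference in significant figures is easy to demonstrate: a change in the eighth significant digit is lost in float arithmetic but retained in double arithmetic.

```cpp
#include <cstdio>

int main() {
    // float carries about 7 significant decimal digits, double about 15,
    // so adding 1.0e-8 to 1 is lost in float but preserved in double.
    float  f = 1.0f + 1.0e-8f;
    double d = 1.0  + 1.0e-8;
    printf("float : %.10f\n", f);  // prints 1.0000000000
    printf("double: %.10f\n", d);  // prints 1.0000000100
    return 0;
}
```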
Integer data types are usually handled on the computer so that there is no check on the maximum
or minimum value. Because of this you can do something like the following.
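A minimal sketch of such an experiment is shown below, assuming a 32-bit int. (Strictly, signed integer overflow is undefined behavior in C++, but typical two's-complement hardware wraps around as shown.)

```cpp
#include <iostream>
#include <climits>  // INT_MAX

int main() {
    int n = INT_MAX;                       // largest value an int can hold
    std::cout << "n     = " << n << '\n';  // 2147483647
    ++n;                                   // no range check is performed
    std::cout << "n + 1 = " << n << '\n';  // -2147483648 on typical hardware
    return 0;
}
```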
Adding one to the maximum integer gives the minimum integer. For floating point variables,
exceeding the maximum positive value or going below the most negative value gives an overflow
error. These numbers are displayed by Visual C++ as 1.#INF and –1.#INF, respectively. Another
number you may see displayed on your computer screen is –1.#IND; the IND is an abbreviation
for INDefinite. This results from attempted mathematical operations such as dividing zero by zero,
raising zero to the zeroth power, etc.