algorithms-14-00198
Review
Decimal Multiplication in FPGA with a Novel Decimal
Adder/Subtractor
Mário P. Véstias 1, * and Horácio C. Neto 2
Abstract: Financial and commercial data are mostly represented in decimal format. To avoid errors
introduced when converting some decimal fractions to binary, these data are processed with decimal
arithmetic. Most processors only have hardwired binary arithmetic units. So, decimal operations
are executed with slow software-based decimal arithmetic functions. For the fast execution of
decimal operations, dedicated hardware units have been proposed and designed in FPGA. Decimal
multiplication is found in most decimal-based applications and so its optimized design is very
important for fast execution. In this paper two new parallel decimal multipliers in FPGA are
proposed. These are based on a new decimal adder/subtractor also proposed in this paper. The new
decimal multipliers improve state-of-the-art parallel decimal multipliers. Compared to previous
architectures, implementation results show that the proposed multipliers achieve 26% better area
and 12% better performance. Also, the new decimal multipliers reduce the area and performance gap
to binary multipliers and are smaller for 32 digit operands.
Keywords: decimal multiplication; decimal adder; parallel multiplication; excess-3 coding; FPGA
quirement. For example, commercial applications usually do not require high decimal
arithmetic performance.
However, the fast increase of commercial and financial transactions requires fast
decimal arithmetic computing to meet real-time requirements and exact computations.
Some approaches to binary/decimal computing [5,10] were adopted for the design of
processors with special units for decimal floating-point arithmetic, like the IBM eServer
z900 [11], the IBM POWER6 [12] and the IBM z10 [13].
Since the set of applications taking advantage of these specialized units is somewhat
limited, most processors only include a few specific instructions to help in the
execution of decimal operations performed in software. In this scenario, FPGAs (Field
Programmable Gate Arrays) may be a good alternative for the execution of decimal arithmetic
with dedicated hardware modules, as in many other applications [14,15]. Many financial
applications already use FPGAs to speed up the execution of their algorithms, so a
reprogrammable hardware platform is already available. Besides, since logic in FPGAs
is implemented with look-up tables, the gap between binary and decimal arithmetic is
smaller than when implemented in an ASIC (Application Specific Integrated Circuit).
Decimal multiplication is a fundamental arithmetic operation used in many applications
and in the design of other arithmetic functions. Therefore, fast decimal multipliers
are important to obtain fast decimal-based applications. This paper proposes two new
methods for parallel decimal multiplication on FPGA with different tradeoffs between
area and performance. The methods are based on a new decimal adder/subtractor.
The results obtained with the new decimal multipliers improve both the area and
the performance of the best state-of-the-art decimal multipliers. Additionally, the new
multipliers considerably reduce the implementation gap between decimal and binary
multipliers in FPGA.
This paper is organized as follows. Section 2 describes the state of the art in decimal
multiplication. Section 3 introduces the decimal adder/subtractor. Section 4 describes the
proposed decimal multipliers. Section 5 presents the results of the new decimal multipliers
and compares them with previous parallel decimal multipliers. Section 6 concludes
the paper.
2. Related Work
Processors with dedicated decimal hardware multipliers implement them with iterative
algorithms [16,17] to reduce the size of the arithmetic unit. However, iterative
algorithms are slow compared to parallel implementations. For fast execution, parallel
decimal multiplication consists of partial product generation for each multiplier digit
followed by partial product addition. Partial product generation of an N × N
multiplication can be implemented with N × N small digit-by-digit multipliers or with N
digit-by-multiplicand multipliers. A digit-by-digit multiplier can be implemented with logic
or with look-up tables [18–20] for a fast and compact design. However, given the quadratic
number of digit-by-digit multipliers necessary to implement a multiplication, these
solutions are viable only for small operand sizes. The proposal in [21] considered recoding of
operands to simplify digit-by-digit multiplication for partial product generation. However,
the performance and area of a decimal multiplier based on digit-by-digit multiplication are
still worse than those of a multiplier with one partial product for each multiplier digit.
The approach followed to implement a 1 × N multiplier is to determine the decimal
multiples of the multiplicand. A direct approach to design a decimal multiplier based on
multiples generates all multiples of the multiplicand and then selects the required multiples
according to the multiplier digits. The selected multiples are then shifted and added to
generate the final product. While simple, the method requires, for each multiplier digit,
a large multiplexer with all multiples as inputs, as well as the generation of all multiples
from A to 9A. Since the generation of some multiples is not carry-free, this solution degrades
the performance of the multiplier.
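The direct multiples-based scheme described above can be sketched at the integer level as follows. This is only an illustrative model, not code from any of the cited designs: the function name `multiply_by_multiples` is hypothetical, and Python integers stand in for the BCD vectors of a hardware implementation.

```python
def multiply_by_multiples(a: int, b: int) -> int:
    """Multiply non-negative integers with one partial product per multiplier digit."""
    multiples = [k * a for k in range(10)]          # all multiples 0, A, 2A, ..., 9A
    product = 0
    for position, digit_char in enumerate(reversed(str(b))):
        partial = multiples[int(digit_char)]        # multiplexer: select the multiple
        product += partial * 10 ** position         # shift by the digit weight and add
    return product


print(multiply_by_multiples(1234, 5678))  # 7006652
```

The list `multiples` plays the role of the multiple generators and the indexing plays the role of the large per-digit multiplexer mentioned in the text.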
Therefore, authors started to consider only a limited set of multiples. In [22] only
multiples A, 2A, 4A, 5A are used, since they can be generated without carry propagation
Algorithms 2021, 14, 198 3 of 21
(multiple 4A is generated from 2A in sequence as 2× 2A). The other multiples are obtained
by adding two of these multiples. Since multiple 4A cannot be generated in a single carry-
free step, it has been removed from the set of base multiples in [23]. The other multiples
are obtained by adding a multiple from the set {0, 5X, 10X} and a multiple from the set
{−2X, −X, 0, X, 2X}. For fast selection of multiples, digits of the multiplier are first recoded,
but the solution requires a large multiplexer for each multiplier digit.
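The decomposition of a multiplier digit into one multiple from {0, 5X, 10X} plus one from {−2X, −X, 0, X, 2X} can be checked with a short sketch. The function name `recode_digit` and the closed form `(d + 2) // 5` are assumptions chosen for brevity; they are not the recoding logic of [23].

```python
from typing import Tuple


def recode_digit(d: int) -> Tuple[int, int]:
    """Return (upper, lower) with d == 5 * upper + lower,
    upper in {0, 1, 2} and lower in {-2, -1, 0, 1, 2}."""
    if not 0 <= d <= 9:
        raise ValueError("expected a decimal digit")
    upper = (d + 2) // 5          # selects 0, 5X or 10X
    lower = d - 5 * upper         # selects -2X, -X, 0, X or 2X
    return upper, lower


for digit in range(10):
    print(digit, recode_digit(digit))
```

For example, digit 8 is recoded as 10X − 2X and digit 3 as 5X − 2X.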
Since then, other sets of multiples and encodings were considered. In [24] two different
decimal encodings (4221 and 5211) are used to generate and reduce the partial products
with two different architectures. In one of the architectures the multiplier is recoded into a
signed-digit (SD) set [−5, 5], while in the other each multiplier digit is encoded as Y_U × 5 + Y_L,
as in [23], where Y_U ∈ {0, 1, 2} and Y_L ∈ {−2, −1, 0, 1, 2}. Signed-digit (SD) recoding of
the multiplier in the set [−5, 5] was adopted by several authors for the implementation
of a decimal multiplier [25,26]. The base architecture generates multiples {0, X, 2X, 3X,
4X, 5X}. These are selected for each partial product and the output is complemented to
obtain the negative of the multiple. The partial products are then reduced with a partial
product reduction module. Different representations are used to improve the generation
of complements and the decimal addition. The radix-5 algorithm proposed in [24] was
followed by [27] but using a hybrid 8421–5421 representation.
In [28] a decimal multiplier is proposed using a redundant decimal addition algorithm
based on a weighted bit-set encoding. The method generates double BCD (Binary-Coded
Decimal) numbers using decimal multiples 2X, 4X, and 5X. The redundant decimal adder
is used to reduce the generated 2n BCD partial products to a redundant number in the
range of [0, 15]. The final redundant product is then converted to BCD encoding.
The special case of constant decimal multiplication was considered in [29]. Constant
decimal multiplication is widely used in economic and financial applications. The authors
address this problem to design a solution with smaller area, power and delay compared to
constant decimal multiplication implemented with a general decimal multiplier. The work
proposes a new redundant digit set in {0, 18} and a 3:1 compressor. The results show an
improvement in the area up to 89%.
Partial products are then added in a step known as partial product reduction, using
decimal adders. Partial product reduction can be designed with an adder tree or with a
multioperand adder. An adder tree successively reduces pairs of partial products until
a final result is obtained. Multioperand addition takes into account that multiple partial
products have to be reduced into a single value. In [30] three techniques were proposed for
multioperand decimal addition. Two of the approaches consider speculative addition, which
speculates about BCD correction values that are corrected while adding the operands. The other
technique uses a binary adder that produces a binary sum which is then corrected. This
last technique achieved the best area-delay results. A mixed binary and BCD multioperand
addition was proposed in [31]. Digits in a column are all added in binary, converted to
decimal and finally added with decimal adders.
In [22,23] the adder tree is implemented with decimal carry look-ahead adders. In [32]
partial products are recoded to 4221. This codification simplifies addition since it avoids the
correction step. The method reduces three partial products to two equally weighted 4221
decimal digits. These two operands are then converted to BCD and added to generate the
final result.
A different approach for decimal multiplication considers binary multipliers as the
base arithmetic unit [33–36]. This permits using binary multipliers that are faster and
may already be available in the system. Also, it implements both binary and decimal
multiplication in a single module. The method first converts the BCD operands of the
multiplication to binary. The converted operands are then multiplied using the binary
multiplier. The binary product is then converted to BCD. The main drawback of the binary-
based method is the large overhead introduced by the converters [35,37]. A balanced
solution was proposed in [34] that subdivides the multiplier and the multiplicand into
smaller blocks and applies the method to each of these sub-blocks. The partials are then
aligned and added using decimal adders to generate the final product.
Most works on decimal multiplication target ASICs, but several architectures have
been proposed for FPGA and coarse-grained reconfigurable computing [38]. Any of the
previous architectures can be directly mapped to FPGA. However, a careful adaptation of
the design leads to a more efficient architecture since logic functions in FPGAs are imple-
mented with look-up tables. In [39] a parallel implementation of a multiplier was mapped
in Virtex-4 FPGA from Xilinx. The architecture obtains the partial products using digit by
digit multiplication with a binary multiplier followed by binary to BCD conversion [40].
The work in [41] described previously was mapped on a 6-input LUT FPGA.
A new optimization of the multiplication algorithm was considered in [42] where the
application of the Karatsuba-Ofman algorithm reduces the area of the parallel decimal
multipliers on FPGA at the cost of an increase in delay. A BCD multiplier using the atomic
1 × 1 digit multiplier was proposed in [43]. The effort of the work is on the partial product
reduction unit. The two-digit partial products of all 1 × 1 digit multiplications are correctly
aligned to generate the complete partial products. The partial products are then reduced
with a mix of binary decimal compressors and decimal adders.
Recently, a new decimal multiplier [44] improved the area of the best previous decimal
multipliers on FPGA by about 20%. The solution considers a new decimal adder based on
a mixed BCD/excess-6 representation and a 5221 recoding of the multiplier digits. Partial
products are obtained from the addition of a multiple in the set {0, 2X, 5X, 2X + 5X} and a
multiple in the set {X, 2X}.
This paper proposes two novel decimal multipliers on FPGA with different area/performance
tradeoffs, both of which improve the area and performance of state-of-the-art multipliers.
Both methods use a new adder/subtractor based on the excess-3 representation
of multiples. Two different sets of multiples are considered: {0, X, 2X, 5X, 10X} and {2X,
4X, 5X}. Partial products are obtained by the addition or subtraction of two multiples of
the sets. The method permits a very efficient generation of multiples, which considerably
reduces the required resources. The area of the largest decimal multiplier is smaller than
the area of an equivalent binary multiplier.
3. Decimal Adder/Subtractor
In [45] a decimal adder was proposed that considers an excess-6 representation to
avoid carry propagation of addition. This adder is used in the proposed multipliers to
implement the adder tree. It also serves as the base for a novel decimal adder/subtractor
necessary for the design of the partial product generators. To better understand the new
adder/subtractor, the decimal adder proposed in [45] is briefly described.
the operands must be converted to excess-6. If both are represented in excess-6 then one
must be converted to BCD.
The adjustment of the operands and their addition were designed with a single level
of LUT-6 and the carry-chain of the FPGA. The expressions of the generate and propagate
signals of a single digit adder are as follows [45]:
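\[
\begin{aligned}
p[3] &= \begin{cases}
z[3] \oplus (z[2] \lor z[1]) \oplus w[3] & \text{if } w_{bcd} = z_{bcd} = 1\\
z[3] \oplus \overline{z[2]\, z[1]} \oplus w[3] & \text{if } w_{bcd} = z_{bcd} = 0\\
z[3] \oplus w[3] & \text{if } w_{bcd} \neq z_{bcd}
\end{cases}\\
p[2] &= \begin{cases}
\overline{z[2] \oplus z[1]} \oplus w[2] & \text{if } w_{bcd} = z_{bcd} = 1\\
z[2] \oplus z[1] \oplus w[2] & \text{if } w_{bcd} = z_{bcd} = 0\\
z[2] \oplus w[2] & \text{if } w_{bcd} \neq z_{bcd}
\end{cases}\\
p[1] &= \begin{cases}
\overline{z[1]} \oplus w[1] & \text{if } w_{bcd} = z_{bcd}\\
z[1] \oplus w[1] & \text{if } w_{bcd} \neq z_{bcd}
\end{cases}\\
p[0] &= z[0] \oplus w[0]
\end{aligned}
\tag{1}
\]
\[
g[3] = w[3], \quad g[2] = w[2], \quad g[1] = w[1], \quad g[0] = w[0]
\tag{2}
\]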
Signals wbcd and zbcd indicate whether the digits w and z are represented in BCD or
in excess-6.
The addition of two decimal numbers whose digits are represented in BCD or excess-6
is implemented with a chain of single digit BCD/excess-6 adders.
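The behaviour of such a chain of digit adders can be modelled with plain integer arithmetic. The sketch below is an assumption-laden illustration, not the LUT-level circuit of [45]: both inputs are taken in BCD (so one is moved to excess-6 before the binary addition), the flag of each output digit is taken from the carry out of that digit, and the names `to_digits`, `add_bcd_excess6` and `decode` are hypothetical.

```python
def to_digits(value, n):
    """Little-endian list of n decimal digits of a non-negative integer."""
    return [(value // 10 ** i) % 10 for i in range(n)]


def add_bcd_excess6(w_digits, z_digits, carry_in=0):
    """Add two equal-length little-endian BCD digit lists with 4-bit binary adders.

    Returns (digits, isbcd, carry_out). A result digit is plain BCD when its
    isbcd flag is 1 and excess-6 when the flag is 0.
    """
    digits, isbcd, carry = [], [], carry_in
    for w, z in zip(w_digits, z_digits):
        total = w + (z + 6) + carry      # one operand in BCD, the other in excess-6
        carry = total >> 4               # binary carry out of the 4-bit digit
        digits.append(total & 0xF)
        isbcd.append(carry)              # a carry out leaves a plain BCD digit behind
    return digits, isbcd, carry


def decode(digits, isbcd, carry_out):
    """Convert a BCD/excess-6 digit list back to an integer."""
    value = sum((d if flag else d - 6) * 10 ** i
                for i, (d, flag) in enumerate(zip(digits, isbcd)))
    return value + carry_out * 10 ** len(digits)


digits, isbcd, cout = add_bcd_excess6(to_digits(47, 2), to_digits(38, 2))
print(digits, isbcd, cout, decode(digits, isbcd, cout))  # [5, 14] [1, 0] 0 85
```

The point of the mixed representation is visible in the loop: the decimal carry coincides with the binary carry of the 4-bit adder, so no correction step sits on the carry path.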
\[ w - z = w + 9'z + 1 = w + (9 - z) + 1 \tag{3} \]
where 9'z is the nine's complement of z. Using the BCD/excess-6 adder to execute this
addition, we must add six to the equation:
\[ w - z = w + (9 - z) + 6 + 1 = w + (15 - z) + 1 = w + \overline{z} + 1 \tag{4} \]
w          z          Action
BCD        BCD        none
BCD        excess-6   z → z−6
excess-6   BCD        z → z+6
excess-6   excess-6   none
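The subtraction rule w − z = w + (15 − z) + 1 can be exercised with a small integer-level model. This is a sketch under stated assumptions: both operands are given in BCD (the first row of the table above, so no adjustment is needed), each output flag is taken from the digit's carry out, and the helper names are hypothetical.

```python
def to_digits(value, n):
    """Little-endian list of n decimal digits of a non-negative integer."""
    return [(value // 10 ** i) % 10 for i in range(n)]


def sub_bcd(w_digits, z_digits):
    """Compute w - z digit by digit as w + (15 - z) + carry, with initial carry 1.

    Returns (digits, isbcd, carry_out); carry_out == 1 means w >= z.
    """
    digits, isbcd, carry = [], [], 1
    for w, z in zip(w_digits, z_digits):
        total = w + (15 - z) + carry     # 15 - z is the bitwise complement of the 4-bit z
        carry = total >> 4
        digits.append(total & 0xF)
        isbcd.append(carry)              # carry out: BCD digit, otherwise excess-6
    return digits, isbcd, carry


def decode(digits, isbcd):
    """Convert a BCD/excess-6 digit list back to an integer."""
    return sum((d if flag else d - 6) * 10 ** i
               for i, (d, flag) in enumerate(zip(digits, isbcd)))


digits, isbcd, cout = sub_bcd(to_digits(52, 2), to_digits(17, 2))
print(digits, isbcd, cout, decode(digits, isbcd))  # [11, 3] [0, 1] 1 35
```

When w is smaller than z the final carry is 0 and the decoded digits hold the ten's complement of the difference.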
Similar to the adder described in the previous Section, the subtraction of two digits is
implemented with the generate and propagate signals defined as follows
\[
g[3] = w[3], \quad g[2] = w[2], \quad g[1] = w[1], \quad g[0] = w[0]
\tag{6}
\]
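\[
\begin{aligned}
p[3] &= \begin{cases}
z[3] \oplus \overline{\overline{op}\; z[2]\, z[1]} \oplus w[3] & \text{if } w_{bcd} = 0,\ z_{bcd} = 0\\
z[3] \oplus \overline{\overline{op}\; \overline{z[2]}\, \overline{z[1]}} \oplus w[3] & \text{if } w_{bcd} = 1,\ z_{bcd} = 1\\
z[3] \oplus (op\; \overline{z[2]}\, \overline{z[1]}) \oplus w[3] & \text{if } w_{bcd} = 0,\ z_{bcd} = 1\\
z[3] \oplus (op\; z[2]\, z[1]) \oplus w[3] & \text{if } w_{bcd} = 1,\ z_{bcd} = 0
\end{cases}\\
p[2] &= \begin{cases}
z[2] \oplus \overline{\overline{op}\; \overline{z[1]}} \oplus w[2] & \text{if } w_{bcd} = 0,\ z_{bcd} = 0\\
z[2] \oplus \overline{\overline{op}\; z[1]} \oplus w[2] & \text{if } w_{bcd} = 1,\ z_{bcd} = 1\\
z[2] \oplus (op\; z[1]) \oplus w[2] & \text{if } w_{bcd} = 0,\ z_{bcd} = 1\\
z[2] \oplus (op\; \overline{z[1]}) \oplus w[2] & \text{if } w_{bcd} = 1,\ z_{bcd} = 0
\end{cases}\\
p[1] &= \begin{cases}
z[1] \oplus w[1] & \text{if } w_{bcd} \neq z_{bcd}\\
\overline{z[1]} \oplus w[1] & \text{if } w_{bcd} = z_{bcd}
\end{cases}\\
p[0] &= op \oplus z[0] \oplus w[0]
\end{aligned}
\tag{7}
\]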
The generate signals are always
\[
g[3] = w[3], \quad g[2] = w[2], \quad g[1] = w[1], \quad g[0] = w[0]
\tag{8}
\]
The propagate signals of the generic BCD/excess-6 adder are functions of only two
inputs when wbcd ≠ zbcd, while in the generic BCD/excess-6 subtractor they are functions
of only two inputs when wbcd = zbcd.
In the case of the generic adder/subtractor circuit, not considering input op, signals
p[0] and p[1] are functions of two variables, while signals p[2] and p[3] are functions of 3
and 4 variables, independently of inputs wbcd and zbcd. Therefore, the complexity of the
generic adder/subtractor has increased.
It is possible to reduce this complexity to only two variables, as in the generic adder
and the generic subtractor, by using operands represented in excess-3. Consider two BCD
digits, w and z, converted to excess-3 (add three), we3 and ze3, respectively. The addition
of we3 and ze3 is given by
\[ w_{e3} + z_{e3} = (w + 3) + (z + 3) = w + (z + 6) \tag{9} \]
This is the same as saying that one of the operands is represented in BCD and the other
in excess-6. In addition, when the operands use different representations, the propagate
signals are functions of only two variables.
Considering the same two digits in excess-3, the subtraction of we3 and ze3 is given by
\[ w_{e3} - z_{e3} = (w + 3) - (z + 3) = w - z \tag{10} \]
This is the same as saying that both operands are in BCD. In subtraction, when the operands
use the same representation, the propagate signals are functions of only two variables.
Hence, when the operands are represented in excess-3, the propagate signals of both
operations are in their most simplified form, as follows:
\[ p[i] = op \oplus z_{e3}[i] \oplus w_{e3}[i], \quad i = 0, \ldots, 3 \tag{11} \]
This property will allow the simplification of the multiplier to be described in the
next Section.
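The excess-3 property can be checked exhaustively with a bit-level model of one digit. The sketch assumes a propagate signal p[i] = op ⊕ w[i] ⊕ z[i], a generate signal equal to the operand bit w[i], and a carry chain built from two-input multiplexers; the function names are hypothetical and the model is not the authors' VHDL.

```python
def addsub_digit_e3(we3, ze3, op, carry_in):
    """One digit of add (op=0) or subtract (op=1) on excess-3 inputs, bit by bit.

    Returns (digit, carry_out). The 4-bit digit is BCD when carry_out is 1 and
    excess-6 when carry_out is 0.
    """
    carry, digit = carry_in, 0
    for i in range(4):
        w_bit = (we3 >> i) & 1
        z_bit = (ze3 >> i) & 1
        p = op ^ w_bit ^ z_bit            # propagate: two data bits plus op
        g = w_bit                         # generate: the operand bit
        digit |= (p ^ carry) << i         # sum bit
        carry = carry if p else g         # carry-chain multiplexer
    return digit, carry


def digit_value(digit, carry_out):
    """Decimal value of a BCD/excess-6 result digit."""
    return digit if carry_out else digit - 6


print(addsub_digit_e3(7 + 3, 5 + 3, 0, 0))  # (2, 1): 7 + 5 = 12
print(addsub_digit_e3(3 + 3, 5 + 3, 1, 1))  # (14, 0): 3 - 5 = 8 - 10
```

For subtraction the carry-in of the least significant digit is 1, matching the "+ 1" of the complement in Equation (4).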
4. Decimal Multiplier
Considering two operands, A and B, with n decimal digits (a_i) and (b_i), respectively,
given by
\[ A = a_{n-1} a_{n-2} \ldots a_0 = \sum_{i=0}^{n-1} a_i \times 10^i \tag{12} \]
\[ B = b_{n-1} b_{n-2} \ldots b_0 = \sum_{i=0}^{n-1} b_i \times 10^i \tag{13} \]
The product A × B is a number with 2n decimal digits (p_i) given by
\( \sum_{i=0}^{2n-1} p_i \times 10^i \).
A decimal digit, x_i, is coded with four bits according to the expression
\[ \sum_{j=0}^{3} x_i[j] \times w_i[j] \tag{14} \]
In this paper, the design of the multipliers follows the algorithm of most previous
approaches. Partial products are first generated and then reduced with a tree of decimal
adders. Each partial product results from the multiplication of a multiplier digit by the
multiplicand. The partial product is obtained by the addition or subtraction of two
multiples of the multiplicand. This paper considers two different sets of multiples for the
partial product generation, which lead to different tradeoffs between delay and area. In the
following, both methods for partial product generation are described.
Table 3. Generation of multiples using subsets S1 = {0, 5A, 10A} and S2 = {0, A, 2A}.
Multiple   ∈ S1   ∈ S2   Operation
0          0      0      0 + 0
A          0      A      0 + A
2A         0      2A     0 + 2A
3A         5A     2A     5A − 2A
4A         5A     A      5A − A
5A         5A     0      5A + 0
6A         5A     A      5A + A
7A         5A     2A     5A + 2A
8A         10A    2A     10A − 2A
9A         10A    A      10A − A
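The entries of Table 3 can be verified mechanically. The sketch below transcribes the table into a dictionary (the name `TABLE3` and the function `partial_product` are hypothetical) and checks that every digit k yields k × A at the integer level.

```python
# digit: (coefficient taken from S1, coefficient taken from S2, sign of the operation)
TABLE3 = {
    0: (0, 0, +1), 1: (0, 1, +1), 2: (0, 2, +1), 3: (5, 2, -1), 4: (5, 1, -1),
    5: (5, 0, +1), 6: (5, 1, +1), 7: (5, 2, +1), 8: (10, 2, -1), 9: (10, 1, -1),
}


def partial_product(a, digit):
    """Partial product digit * a obtained from one add or subtract of two multiples."""
    s1, s2, sign = TABLE3[digit]
    return s1 * a + sign * s2 * a


print([partial_product(7, d) for d in range(10)])
# [0, 7, 14, 21, 28, 35, 42, 49, 56, 63]
```

Only the multiples A, 2A and 5A have to be generated explicitly, since 10A is a one-digit shift of A.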
The hardware design of the partial product generator includes one multiplexer to
select the multiple from the subset S1 , another multiplexer to select the second multiple
from S2 and one decimal adder/subtractor (see Figure 1).
[Figure 1: block diagram with multiplexer selectors Sa and Sb, multiplier digit bi, multiples 5Ae3, 2Ae3 and Ae3, operation input OP and Add/Sub blocks.]
Selectors Sa and Sb of the multiplexers, and operand selector op are functions of the
4-bit multiplier digit, bi = bi [3]bi [2]bi [1]bi [0], as described in Table 4.
[Figure: block diagram with blocks Sel-Add/Sub and Add/Sub; signals bi, 5Ae3, 2Ae3, Ae3, Sa, Sb, Ma and OP.]
Block Sel − Add/Sub sums (op = 0) or subtracts (op = 1) the output of the multiplexer
with one multiple in the set {0, A, 2A}, determined by selector {Sb }. Considering the
propagate and generate expressions of the BCD/excess-6 adder/subtractor with operands
represented in excess-3, the expressions of the propagate and generate signals for a single
digit of block Sel − Add/Sub are as follows:
where ma is a digit from the output of the multiplexer, Ma, and ae3 and 2ae3 are digits from
multiples Ae3 and 2Ae3, respectively.
Connecting these propagate and generate signals with a carry chain provides the
circuit for a single digit of block Sel − Add/Sub with a carry-in and a carry-out (see the
implementation of a single digit in Figure 3).
[Figure 3: implementation of a single digit of block Sel-Add/Sub. Inputs for each bit i: 2ae3[i], ae3[i], ma[i], op, Sb[1] and Sb[0]; carry chain from cin to cout; outputs isbcd and pp (4 bits).]
The complete partial product is obtained with a chain of these single-digit blocks.
The generated partial products are in BCD/excess-6 format. Therefore, each partial
product output consists of N+1 BCD/excess-6 digits, {PP_0, PP_1, ..., PP_{N−1}}, and N+1 bits,
{ISbcd_0, ISbcd_1, ..., ISbcd_{N−1}}, one for each digit, indicating if the digit is represented in
BCD or excess-6.
The partial product generator produces all N partial products in parallel using N
(single digit) partial product generators (see Figure 4).
Multiple   ∈ S1   ∈ S2   Operation
0          4A     4A     4A − 4A
A          5A     4A     5A − 4A
2A         4A     2A     4A − 2A
3A         5A     2A     5A − 2A
4A         4A     0      4A + 0
5A         5A     0      5A + 0
6A         4A     2A     4A + 2A
7A         5A     2A     5A + 2A
8A         4A     4A     4A + 4A
9A         5A     4A     5A + 4A
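To see the second set of multiples at work in a complete multiplication, the following integer-level sketch builds every partial product from the table above and accumulates the shifted results. The names `METHOD2` and `multiply_method2` are hypothetical, and the plain integer sum stands in for the decimal adder tree.

```python
# digit: (coefficient from {4A, 5A}, coefficient from {0, 2A, 4A}, op) with op 1 = subtract
METHOD2 = {
    0: (4, 4, 1), 1: (5, 4, 1), 2: (4, 2, 1), 3: (5, 2, 1), 4: (4, 0, 0),
    5: (5, 0, 0), 6: (4, 2, 0), 7: (5, 2, 0), 8: (4, 4, 0), 9: (5, 4, 0),
}


def multiply_method2(a, b):
    """Multiply non-negative integers using only the multiples 2A, 4A and 5A."""
    product = 0
    for position, digit_char in enumerate(reversed(str(b))):
        s1, s2, op = METHOD2[int(digit_char)]
        first, second = s1 * a, s2 * a
        partial = first - second if op else first + second
        product += partial * 10 ** position      # align the partial product
    return product


print(multiply_method2(9876, 54321))  # 536474196
```

Compared with the first method, the first operand of the adder/subtractor is always 4A or 5A, which is why a two-input multiplexer suffices.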
The architecture of the partial product circuit is similar to that designed for method 1,
except that the multiplexer has only two inputs: 4A and 5A (see Figure 5).
[Figure 5: block diagram with blocks Sel-Add/Sub and Add/Sub; signals bi, 5Ae3, 4Ae3, 2Ae3, Sa, Sb, Ma and OP.]
Selectors Sa and Sb , and operand selector op are functions of the multiplier digit,
bi = bi [3]bi [2]bi [1]bi [0], as described in Table 6.
Also, to simplify the adder/subtractor of the partial product generator all multiples
are represented in excess-3. Considering the equations of the propagate and generate
expressions of the adder/subtractor with operands represented in excess-3, the expressions
of the propagate and generate signals for a single digit of block Sel − Add/Sub are as
follows:
p[3] = op ⊕ ma[3] ⊕ (Sb [0] Sb [1] 2ae3[3] + Sb [0] Sb [1] 4ae3[3])
p[2] = op ⊕ ma[2] ⊕ (Sb [0] Sb [1] 2ae3[2] + Sb [0] Sb [1] 4ae3[2])
p[1] = op ⊕ ma[1] ⊕ (Sb [0] Sb [1] 2ae3[1] + Sb [0] Sb [1] 4ae3[1])
p[0] = op ⊕ ma[0] ⊕ (Sb [0] Sb [1] 2ae3[0] + Sb [0] Sb [1] 4ae3[0]) (16)
g[3] = ma[3]
g[2] = ma[2]
g[1] = ma[1]
g[0] = ma[0]
Propagate and generate signals are interconnected with a carry chain to generate the
circuit for a single digit of block Sel − Add/Sub with a carry-in and a carry-out (see the
implementation of a single digit in Figure 6).
[Figure 6: implementation of a single digit of block Sel-Add/Sub. Inputs for each bit i: 2ae3[i], 4ae3[i], ma[i], OP, Sb[1] and Sb[0]; carry chain from cin to cout; outputs isbcd and pp (4 bits).]
Similar to method 1, the complete partial product is obtained with a chain of these
single-digit blocks. All N partial products are generated in parallel with N partial product
generators. The carry-in of the first module receives the op signal.
a    a_i[3−0]   ae3_i[3−0]   y_i[3−0]
0    0000       0011         0011
1    0001       0100         0100
2    0010       0101         0101
3    0011       0110         0110
4    0100       0111         0111
5    0101       1000         1000
6    0110       1001         1001
7    0111       1010         1010
8    1000       1011         1011
9    1001       1100         1100
Considering now the excess-3 of 2A, 2Ae3 = Y = y_{n−1} ... y_0, each digit y_i is obtained
according to Table 8.
a    2a_i[4−0]   2ae3_i[4−0]   y_i[3−0]               y_i[3−0]
                               (2ae3_{i−1}[4] = 1)    (2ae3_{i−1}[4] = 0)
0    0 0000      0 0011        0100                   0011
1    0 0010      0 0101        0110                   0101
2    0 0100      0 0111        1000                   0111
3    0 0110      0 1001        1010                   1001
4    0 1000      0 1011        1100                   1011
5    1 0000      1 0011        0100                   0011
6    1 0010      1 0101        0110                   0101
7    1 0100      1 0111        1000                   0111
8    1 0110      1 1001        1010                   1001
9    1 1000      1 1011        1100                   1011
Multiple 2a_i of digit a_i is obtained according to the second column of Table 8. In this
case, there is a carry out bit 2a_i[4]. However, since the least significant bit of 2a_i is always
zero, there is no carry propagation. In contrast, multiple 2ae3_i always has a one at the
least significant bit. So, y_i is the addition of 2ae3_i plus the carry out from 2ae3_{i−1}, that is,
it depends on 2ae3_i and 2ae3_{i−1}. The solution proposed in this paper first determines all
carries 2ae3_i[4] and then determines y_i as a function of 2ae3_i and 2ae3_{i−1}[4], as described in
Table 8.
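The two-step generation of 2Ae3 can be modelled as follows. This is an integer-level sketch of the rule in Table 8 under the assumption that A is given as plain BCD digits; the function names are hypothetical. The carry of each digit depends only on that digit, so the second step adds at most one to an odd value and never ripples further.

```python
def to_digits(value, n):
    """Little-endian list of n decimal digits of a non-negative integer."""
    return [(value // 10 ** i) % 10 for i in range(n)]


def double_excess3(a_digits):
    """Excess-3 digits of 2A from the BCD digits of A, plus the final carry."""
    # Step 1: per-digit doubling; the carry bit depends only on the digit itself.
    carries = [1 if a >= 5 else 0 for a in a_digits]            # bit 4 of 2a_i
    low = [(2 * a) % 10 + 3 for a in a_digits]                  # 2ae3_i[3..0], always odd
    # Step 2: add the carry of the previous digit (none for the first digit).
    y = [low[i] + (carries[i - 1] if i > 0 else 0) for i in range(len(a_digits))]
    return y, carries[-1]


y, carry = double_excess3(to_digits(268, 3))
print([format(d, "04b") for d in y], carry)  # ['1001', '0110', '1000'] 0
```

Here 2 × 268 = 536, and the printed digits (least significant first) are 6, 3 and 5 in excess-3.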
The implementation of Y = 4Ae3 is based on multiple 2A. First, multiple B = 2A is
determined according to the second column of Table 8. Then, multiple Y = 4Ae3 = 2Be3 is
determined like multiple 2Ae3 described previously.
Considering the excess-3 of 5A, 5Ae3 = Y = yn−1 . . . y0 , each digit yi is obtained
according to Table 9.
a    5a_{i−1}[6−0]   y_i[3−0]         y_i[3−0]
                     (a_i[0] = 1)     (a_i[0] = 0)
0    000 0000        1000             0011
1    000 0101        1000             0011
2    001 0000        1001             0100
3    001 0101        1001             0100
4    010 0000        1010             0101
5    010 0101        1010             0101
6    011 0000        1011             0110
7    011 0101        1011             0110
8    100 0000        1100             0111
9    100 0101        1100             0111
The multiple 5a_i of one digit results in two digits. The most significant digit is in the
range 0 to 4 and the least significant digit is 0 or 5, depending on whether digit a_i is even or
odd, respectively. So, each digit y_i depends on input digit a_{i−1} and the least significant bit of
digit a_i, where a_{−1} = 0000. So, multiple 5A is generated in a single step without any carry
propagation according to Table 9.
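The carry-free generation of 5Ae3 can be sketched directly from the rule stated above: each output digit is the tens digit of 5a_{i−1}, plus five when a_i is odd, plus three for the excess-3 offset. The function names are hypothetical and the digits are computed arithmetically rather than read from a table.

```python
def to_digits(value, n):
    """Little-endian list of n decimal digits of a non-negative integer."""
    return [(value // 10 ** i) % 10 for i in range(n)]


def five_times_excess3(a_digits):
    """Excess-3 digits (little-endian, n + 1 of them) of 5A from the BCD digits of A."""
    n = len(a_digits)
    y = []
    for i in range(n + 1):
        prev = a_digits[i - 1] if i > 0 else 0       # a_{-1} = 0
        lsb = a_digits[i] & 1 if i < n else 0        # no digit above the most significant one
        y.append(prev // 2 + 5 * lsb + 3)            # tens of 5a_{i-1} + units of 5a_i + 3
    return y


print(five_times_excess3(to_digits(37, 2)))  # [8, 11, 4], i.e. 5, 8, 1 -> 185
```

Since the tens digit is at most 4 and the units digit is 0 or 5, each sum stays below ten and no carry crosses a digit boundary.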
The multiple generators described above assume that the input number is in BCD. To be more
generic, all multiple generators were designed with an extra input (isbcd) to specify if the
number is in BCD or excess-6. With generic multiple generators, the multipliers are also
generic, making it possible to interconnect several adders/subtractors or multipliers without
having to convert the output to BCD.
The multioperand addition is designed using an adder tree, similar to [44]. The tree
has L = log2 N levels of adders. Each level i has N/2^(i+1) adders, where level 0 is the first set
of adders.
The complete partial product reduction tree for N operands of size N+1 uses
(N/2) × log2(N) + N^2 − N BCD/excess-6 single digit adders (digAdder). The critical path of the
adder tree with N partials is given by log2 N digit adders plus 4 × 2N carry chain bits.
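The closed-form adder count can be cross-checked against a level-by-level sum. The sketch rests on an assumption that is not stated in the text: an adder at level i is taken to need N + 2^i single digit adders, because its two operands are offset by 2^i digits and the low digits pass through unchanged. The function names are hypothetical.

```python
def digit_adders_by_level(n):
    """Sum the single digit adders over all levels of the tree (n a power of two)."""
    levels = n.bit_length() - 1                       # log2(n) for a power of two
    total = 0
    for i in range(levels):
        adders_in_level = n // 2 ** (i + 1)           # N / 2^(i+1) adders at level i
        digits_per_adder = n + 2 ** i                 # assumed width of one adder
        total += adders_in_level * digits_per_adder
    return total


def digit_adders_closed_form(n):
    """(N/2) * log2(N) + N^2 - N."""
    return (n // 2) * (n.bit_length() - 1) + n * n - n


for n in (2, 4, 8, 16, 32):
    print(n, digit_adders_by_level(n), digit_adders_closed_form(n))
```

Under that assumption both counts agree, for example 68 digit adders for N = 8.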
A digit of the product must be converted to BCD if isbcd is 0. The logical ex-
pressions to convert a BCD/excess-6 digit, d = d[3]d[2]d[1]d[0] to a BCD digit, dbcd =
dbcd [3]dbcd [2]dbcd [1]dbcd [0] are as follows:
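One possible bit-level formulation of this conversion is sketched below. It is derived here from subtracting six modulo 16 (adding 1010 in binary) and is an assumption, so the expressions used in the actual design may be factored differently; the function name `to_bcd_digit` is hypothetical.

```python
def to_bcd_digit(d, isbcd):
    """Convert a 4-bit BCD/excess-6 digit to BCD with bit-level logic."""
    if isbcd:
        return d                                   # already BCD, pass through
    d3, d2, d1, d0 = (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1
    b0 = d0                                        # bit 0 is unchanged
    b1 = d1 ^ 1                                    # bit 1 is inverted
    b2 = d2 ^ d1                                   # carry into bit 2 is d1
    b3 = d3 ^ ((d2 & d1) ^ 1)                      # carry into bit 3 is d2 AND d1
    return (b3 << 3) | (b2 << 2) | (b1 << 1) | b0


print([to_bcd_digit(v + 6, 0) for v in range(10)])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each output bit depends on at most three digit bits and the flag, so the conversion of one output bit fits comfortably in a single 6-input LUT.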
Figure 7. Architecture of the proposed decimal multipliers. (a) Decimal multiplier with method 1
and (b) decimal multiplier with method 2. In both, the output is P = A × B (2N digits).
There are two lines in the Table for the multiples generator: the first line is when the
operands and the result are in BCD and the second line is when they are in BCD/excess-6
representation. In the last case, the final converter is not used.
The area of multiplier 1 with inputs and output represented in BCD is 2N^2 − 7 higher
than that of multiplier 2. When inputs and output are represented in BCD/excess-6, multiplier 1
has an area 2N^2 + N − 6 higher than that of multiplier 2.
5. Results
All designs were described in VHDL and implemented in a Virtex-7 FPGA (-3 speed
grade). The architecture was simulated, synthesized, placed and routed using Vivado
19.1 from Xilinx. The area and delay of all implementations after place and route are
presented in Tables 11 and 12.
Table 11. Logic area (LUTs) and delay (ns) of both multipliers with BCD inputs and output for
different number of digits in a Virtex-7 FPGA, speed grade -3.
           Multiplier 1                Multiplier 2
Size       Model     Area     Delay    Model     Area     Delay
2 × 2      88        88       3.56     87        87       4.62
4 × 4      280       280      4.97     255       255      5.92
8 × 8      960       960      6.58     839       839      7.96
16 × 16    3488      3504     8.93     2983      3001     10.22
32 × 32    13,184    13,248   12.26    11,143    11,194   13.12
34 × 34    14,892    14,976   13.04    12,587    12,643   13.91
Table 12. Logic area (LUTs) and delay (ns) of both multipliers with BCD/excess-6 inputs and output
for different number of digits in a Virtex-7 FPGA, speed grade -3.
           Multiplier 1                Multiplier 2
Size       Model     Area     Delay    Model     Area     Delay
2 × 2      83        83       3.02     79        79       3.84
4 × 4      271       271      4.32     239       239      5.31
8 × 8      943       943      6.06     807       814      7.23
16 × 16    3455      3471     8.37     2919      2937     9.31
32 × 32    13,119    13,183   11.99    11,015    11,076   12.94
34 × 34    14,823    14,907   12.84    12,451    12,535   13.78
Multiplier 1 is the fastest while multiplier 2 is the smallest. These relations are
determined by the multiples generators and the partial product generators. Multiplier
2 needs multiple 4×, which is harder to generate and has a longer critical path than the
other multiples. This determines the longer critical path of the second multiplier. On the
other hand, the multiplexer of the partial product generator of multiplier 1 has three inputs,
while the multiplexer of multiplier 2 has only two. This determines the smaller area of
multiplier 2.
As can be observed from the results, multiplier 2 is 18% smaller than multiplier 1 for
the 32 digit multiplier. This difference decreases for smaller multipliers. On the other hand,
multiplier 1 is 30% faster than multiplier 2 for the 2 digit multiplier. This difference decreases
for larger multipliers (see Figures 8 and 9).
[Figure 8: area (LUTs) versus multiplier size (2x2 to 32x32) for Multiplier 1 and Multiplier 2.]
[Figure 9: delay (ns) versus multiplier size (2x2 to 32x32) for Multiplier 1 and Multiplier 2.]
The difference in delay between the proposed multipliers is mostly due to the multiples
generators. Therefore, the absolute difference between both multipliers is almost constant
and the relative delay difference decreases with the size of the multipliers.
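The trend can be reproduced from the delays reported in Table 11 (BCD inputs and output). The sketch below only rearranges those published numbers; the dictionary name `DELAY_NS` and the function `delay_gap` are hypothetical.

```python
# size N: (delay of multiplier 1, delay of multiplier 2) in ns, taken from Table 11
DELAY_NS = {
    2: (3.56, 4.62), 4: (4.97, 5.92), 8: (6.58, 7.96),
    16: (8.93, 10.22), 32: (12.26, 13.12), 34: (13.04, 13.91),
}


def delay_gap(n):
    """Absolute (ns) and relative delay difference of multiplier 2 versus multiplier 1."""
    d1, d2 = DELAY_NS[n]
    return d2 - d1, (d2 - d1) / d1


for n in DELAY_NS:
    absolute, relative = delay_gap(n)
    print(f"{n:2d} x {n:<2d}  {absolute:.2f} ns  {100 * relative:.0f}%")
```

The absolute gap stays roughly between 0.9 and 1.4 ns for all sizes, while the relative gap falls from about 30% at 2 × 2 to about 7% at 32 × 32.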
The results of the proposed multipliers were compared with state-of-the-art decimal
multipliers implemented in FPGA [41–44] (see Table 13).
All designs consider registered inputs and outputs. The multiplier in [41] uses the
final flip-flops to implement the final conversion from the internal representation to BCD.
To register the outputs an extra level of LUTs is required. In our circuit, the extra level of
LUTs implements both the converter and the register. In this reference, the decimal digit
adder occupies five LUTs, while our decimal adder occupies only four. For a fair comparison,
the multiplier from [41] was redesigned and implemented with our decimal adder.
The decimal multiplier from [42] uses the Karatsuba-Ofman algorithm to reduce the
area at the cost of delay. The area reduction impact is higher for larger operands. However,
even for the 16 × 16 decimal multiplier, the proposed multiplier 1 achieves a similar area
with a reduction of 34% in the delay. Multiplier 2 reduces the area by 15% for the 16 × 16
multiplier with a reduction of 28% in the delay. The improvements obtained with the
proposed decimal multipliers are higher for smaller operands since the relative overhead
of the Karatsuba-Ofman algorithm is higher for smaller operands.
Table 13. Comparison of the proposed decimal multipliers with state of the art works for different number of digits.
The proposed multiplier 1 is smaller (up to 14%) and faster than the previous best
decimal multiplier from [44]. This is due to the reduction in the area of the partial product
generator. The proposed multiplier 2 further improves the area of the multiplier from [44] (up
to 35%) and also the performance for the multipliers with operands of 16 and 32 digits.
The area of multiplier 2 was compared with the area of a binary multiplier of
equivalent input operands generated for best speed with Xilinx Core Generator (see results in
Table 14).
Table 14. Comparison of the proposed decimal multipliers with state of the art works for different
number of digits.
Small decimal multipliers are relatively expensive compared to the binary multiplier,
but the proposed 16 × 16 decimal multiplier 2 has an area only 4% higher than the area
of the binary multiplier and the proposed 32 × 32 decimal multiplier 2 is smaller than the
binary multiplier. The reduction of the area ratio is explained by the overhead associated
with the generation of the multiples, which is amortized as the multiplier size increases. The
delay of the decimal multiplier is always worse but the relative difference also decreases with the
operand size for the same reasons.
delay. Compared to a binary multiplier, the larger multipliers have a comparable area
but a worse delay. Both area and delay ratios decrease with the operand size.
As future work, the proposed multiplier will be used to implement decimal floating-
point multiplication and fused multiplication-addition.
Author Contributions: Conceptualization, M.P.V. and H.C.N.; methodology, M.P.V. and H.C.N.;
software, M.P.V. and H.C.N.; validation, M.P.V. and H.C.N.; formal analysis, M.P.V. and H.C.N.;
investigation, M.P.V. and H.C.N.; resources, M.P.V. and H.C.N.; data curation, M.P.V. and H.C.N.;
writing—original draft preparation, M.P.V. and H.C.N.; writing—review and editing, M.P.V. and
H.C.N.; visualization, M.P.V. and H.C.N.; supervision, M.P.V.; project administration, M.P.V.; funding
acquisition, M.P.V. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by national funds through Fundação para a Ciência e a Tecnologia
(FCT) with Reference UIDB/50021/2020.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Tsang, A.; Olschanowsky, M. A Study of Database 2 Customer Queries; Technical report; IBM Santa Teresa Laboratory: San Jose, CA,
USA, 1991.
2. IEEE Standards Committee. 754-2008 IEEE Standard for Floating-Point Arithmetic; IEEE: New York, NY, USA, 2008; pp. 1–58.
3. Quinn, K. Ever had problems rounding off figures? This stock exchange has. Wall Str. J. 1983, 202, 37.
4. IBM Corporation. The Telco Benchmark. 2017. Available online: http://speleotrove.com/decimal/telcoSpec.html (accessed on
20 May 2020).
5. Cowlishaw, M.F. Decimal floating-point: Algorism for Computers. In Proceedings of the 16th IEEE International Symposium on
Computer Arithmetic, Santiago de Compostela, Spain, 15–18 June 2003; pp. 104–111.
6. IBM Corporation. Decimal Arithmetic FAQ. 2007. Available online: http://speleotrove.com/decimal/decifaq1.html#needed
(accessed on 20 May 2020).
7. Cornea, M.; Anderson, C.; Harrison, J.; Tang, P.; Schneider, E.; Tsen, S. A software implementation of the IEEE 754R decimal
floating-point arithmetic using the binary encoding format. In Proceedings of the IEEE 18th Symposium on Computer Arithmetic,
Montpellier, France, 25–27 June 2007; pp. 29–37.
8. ANSI C decNumber Library v3.68. Available online: http://speleotrove.com/decimal/decnumber.html (accessed on 27 June
2021).
9. GNU C Compiler Library. Available online: https://www.gnu.org/software/libc/ (accessed on 27 June 2021).
10. Cornea, M.; Crawford, J. IEEE 754R Decimal Floating-Point Arithmetic: Reliable and Efficient Implementation for Intel
Architecture Platforms. Intel Technol. J. 2007, 11, 91–94. [CrossRef]
11. Busaba, F.; Krygowski, C.A.; Li, W.H.; Schwarz, E.M.; Carlough, S.R. The IBM z900 Decimal Arithmetic Unit. In Proceedings of
the Asilomar Conference on Signals, Systems, Computers, Pacific Grove, CA, USA, 4–7 November 2001; pp. 1335–1339.
12. Le, H.Q.; Starke, W.J.; Fields, J.S.; O’Connell, F.P.; Nguyen, D.Q.; Ronchetti, B.J.; Sauer, W.M.; Schwarz, E.M.; Vaden, M.T. IBM
POWER6 microarchitecture. IBM J. Res. Dev. 2007, 51, 639–662. [CrossRef]
13. Webb, C.F. IBM z10: The Next-Generation Mainframe Microprocessor. IEEE Micro 2008, 28, 19–29. [CrossRef]
14. Zhao, Y.; Wang, D.; Wang, L. Convolution Accelerator Designs Using Fast Algorithms. Algorithms 2019, 12, 112. [CrossRef]
15. Deabes, W. FPGA Implementation of ECT Digital System for Imaging Conductive Materials. Algorithms 2019, 12, 28. [CrossRef]
16. Vestias, M.P.; Neto, H.C. Revisiting the Newton-Raphson Iterative Method for Decimal Division. In Proceedings of the 2011
21st International Conference on Field Programmable Logic and Applications, Chania, Greece, 5–7 September 2011; pp. 138–143.
[CrossRef]
17. Véstias, M.P.; Neto, H.C. Iterative decimal multiplication using binary arithmetic. In Proceedings of the 2011 VII Southern
Conference on Programmable Logic (SPL), Cordoba, Argentina, 13–15 April 2011; pp. 257–262. [CrossRef]
18. Larson, R.H. High-Speed Multiply Using Four Input Carry-Save Adder. IBM Tech. Discl. Bull. 1973, 16, 2053–2054.
19. Ueda, T. Decimal Multiplying Assembly and Multiply Module. U.S. Patent 5,379,245, 3 January 1995.
20. Castillo, E.; Lloris, A.; Morales, D.P.; Parrilla, L.; García, A.; Botella, G. A new area-efficient BCD-digit multiplier. Digit. Signal
Process. 2017, 62, 1–10. [CrossRef]
21. Erle, M.A.; Schwarz, E.M.; Schulte, M.J. Decimal Multiplication with Efficient Partial Product Generation. In Proceedings of the
17th IEEE Symposium on Computer Arithmetic, Cape Cod, MA, USA, 27–29 June 2005; pp. 21–28.
22. Erle, M.A.; Schulte, M.J. Decimal multiplication via carry-save addition. In Proceedings of the 14th IEEE International Conference
on Application Specific Systems, San Diego, CA, USA, 9–11 June 2003; pp. 348–358.
23. Lang, T.; Nannarelli, A. A radix-10 combinational multiplier. In Proceedings of the IEEE 40th International Asilomar Conference
on Signals, Systems, and Computers, Kos Island, Greece, 29 October–1 November 2006; pp. 313–317.
24. Vázquez, A.; Antelo, E.; Montuschi, P. Improved Design of High-Performance Parallel Decimal Multipliers. IEEE Trans. Comput.
2010, 59, 679–693. [CrossRef]
25. Gorgin, S.; Jaberipur, G. Sign-Magnitude Encoding for Efficient VLSI Realization of Decimal Multiplication. IEEE Trans. Very
Large Scale Integr. VLSI Syst. 2017, 25, 75–86. [CrossRef]
26. Cui, X.; Dong, W.; Liu, W.; Swartzlander, E.E.; Lombardi, F. High Performance Parallel Decimal Multipliers Using Hybrid BCD
Codes. IEEE Trans. Comput. 2017, 66, 1994–2004. [CrossRef]
27. Zhu, M.; Jiang, Y.; Yang, M.; Chen, T. On High-Performance Parallel Decimal Fixed-Point Multiplier Designs. Comput. Electr. Eng.
2014, 40, 2126–2138. [CrossRef]
28. Gorgin, S.; Jaberipur, G. A fully redundant decimal adder and its application in parallel decimal multipliers. Microelectron. J.
2009, 40, 1471–1481. [CrossRef]
29. Hoseininasab, S.S.; Nikmehr, H. Architectures for multiple constant decimal multiplication. Comput. Electr. Eng. 2019, 75, 31–45.
[CrossRef]
30. Kenney, R.D.; Schulte, M.J. High Speed Multioperand Decimal Adders. IEEE Trans. Comput. 2005, 54, 953–963. [CrossRef]
31. Dadda, L. Multioperand Parallel Decimal Adder: A Mixed Binary and BCD Approach. IEEE Trans. Comput. 2007, 56, 1320–1328.
[CrossRef]
32. Vázquez, A.; Antelo, E.; Montuschi, P. A New Family of High-Performance Parallel Decimal Multipliers. In Proceedings of the
IEEE 18th Symposium on Computer Arithmetic, Montpellier, France, 25–27 June 2007; pp. 195–204.
33. Neto, H.; Véstias, M. Decimal Multiplier on FPGA using Embedded Binary Multipliers. In Proceedings of the International
Conference on Field Programmable Logic and Applications, Dublin, Ireland, 27–31 August 2008; pp. 197–202.
34. Véstias, M.; Neto, H. Parallel Decimal Multipliers using Binary Multipliers. In Proceedings of the IEEE 6th Southern Pro-
grammable Logic Conference, Pernambuco, Brazil, 24–26 March 2010; pp. 73–78.
35. Fazlali, M.; Valikhani, H.; Timarchi, S.; Malazi, H.T. Fast Architecture for Decimal Digit Multiplication. Microprocess. Microsyst.
2015, 39, 296–301. [CrossRef]
36. Mukkamala, S.; Rathore, P.; Peesapati, R. Decimal multiplication using compressor based-BCD to binary converter. Eng. Sci.
Technol. Int. J. 2018, 21, 1–6. [CrossRef]
37. Al-Khaleel, O.; Al-Qudah, Z.; Al-Khaleel, M.; Papachristou, C. High performance FPGA-based decimal-to-binary conversion
schemes for decimal arithmetic. Microprocess. Microsyst. 2013, 37, 287–298. [CrossRef]
38. Emami, S.; Sedighi, M. An Optimized Reconfigurable Architecture for Hardware Implementation of Decimal Arithmetic. Comput.
Electr. Eng. 2017, 63, 18–29. [CrossRef]
39. Sutter, G.; Todorovich, E.; Bioul, G.; Vázquez, M.; Deschamps, J.P. FPGA Implementations of BCD Multipliers. In Proceedings of
the IEEE International Conference on Reconfigurable Computing and FPGAs, Cancun, Mexico, 9–11 December 2009; pp. 36–41.
40. Jaberipur, G.; Kaivani, A. Binary-coded decimal digit multipliers. IET Comput. Digit. Tech. 2007, 1, 377–381. [CrossRef]
41. Vázquez, A.; de Dinechin, F. Efficient implementation of parallel BCD multiplication in LUT-6 FPGAs. In Proceedings of the
2010 International Conference on Field-Programmable Technology (FPT), Beijing, China, 8–10 December 2010; pp. 126–133.
42. Véstias, M.; Neto, H. Parallel Decimal Multipliers and Squarers Using Karatsuba-Ofman’s Algorithm. In Proceedings of the 15th
Euromicro Conference on Digital System Design, Cesme, Izmir, Turkey, 5–8 September 2012; pp. 782–788.
43. Gao, S.; Al-Khalili, D.; Langlois, J.; Chabini, N. Efficient Realization of BCD Multipliers Using FPGAs. Int. J. Reconfigurable
Comput. 2017, 2017, 2410408. [CrossRef]
44. Véstias, M.P.; Neto, H.C. Improving the area of fast parallel decimal multipliers. Microprocess. Microsyst. 2018, 61, 96–107.
[CrossRef]
45. Neto, H.C.; Véstias, M.P. Decimal addition on FPGA based on a mixed BCD/excess-6 representation. Microprocess. Microsyst.
2017, 55, 91–99. [CrossRef]