Efficient Realization of BCD Multipliers Using FPGAs

In this paper, a novel BCD multiplier approach is proposed. The main highlight of the proposed architecture is the generation of the partial products and parallel binary operations based on 2-digit columns. The 1 × 1-digit multipliers used for the partial product generation are implemented directly by 4-bit binary multipliers without any code conversion. The binary results of the 1 × 1-digit multiplications are organized according to their two-digit positions to generate the 2-digit column-based partial products. A binary-decimal compressor structure is developed and used for partial product reduction. These reduced partial products are added in optimized 6-LUT BCD adders. The parallel binary operations and the improved BCD addition result in improved performance and reduced resource usage. The proposed approach was implemented on Xilinx Virtex-5 and Virtex-6 FPGAs with emphasis on the critical path delay reduction. Pipelined BCD multipliers were implemented for 4 × 4, 8 × 8, and 16 × 16-digit multipliers. Our realizations achieve an increase in speed by up to 22% and a reduction of LUT count by up to 14% over previously reported results.


Introduction
The traditional approach of using binary number system based operations in a decimal system requires front-end and back-end conversion. These conversions can take a significant amount of processing time and consume a large area. A more important problem is that fractional decimal numbers expressed in a binary format may lose accuracy. This can have a major impact in finance and commercial applications. To solve these problems, interest in hardware design of decimal arithmetic is growing. This has led to the incorporation of specifications of decimal arithmetic in the IEEE 754-2008 standard for floating-point arithmetic [1]. The development of decimal operations in hardwired designs with high performance and low resource usage is expected to facilitate the implementation of various applications [2].
Multiplication is a complex operation among decimal computations. To speed up this operation, early decimal multipliers were designed at the gate level targeting ASICs. The authors in [3] proposed an improved iterative decimal multiplier approach to reduce the number of iteration cycles. To avoid a large number of decimal to binary conversions, a two-digit stage was used as the basic block for the iterative Binary Coded Decimal (BCD) multiplier. To further speed up the multiplication, parallel decimal multipliers were proposed. Binary multipliers and binary to BCD conversion were utilized to implement 1 × 1-digit multipliers, and different binary compressors were employed for the result of the multiplier [4][5][6]. To avoid the binary to decimal conversion, recoding methods were used to generate the partial products of the BCD multiplier [7,8]. A Radix-10 combinational multiplier was introduced in [7], and Radix-4 and Radix-5 recoding methods were presented in [8]. In [9], Radix-5 recoding was combined with BCD code converters using BCD4221 and BCD5211 codes to simplify the partial product generation and reduction. In the past two years, some ASIC-based designs for the realization of decimal multiplication were proposed in [10][11][12][13][14]. The recoding methods and BCD code conversions were used in these designs for efficient implementation in ASICs.
Although there are a number of approaches to implement decimal multipliers in ASICs, utilizing the same methods in FPGA devices is not necessarily efficient. With recent advancements in FPGA technology, enhanced architectures, and availability of various hardware resources, the FPGA platform is recognized as a viable alternative to ASICs in many cases. To make efficient use of FPGA resources in the implementation of decimal multiplication, new algorithms and approaches have been developed. The authors in [15] implemented decimal multipliers using embedded binary multiplier blocks in FPGAs. The binary-BCD conversion was implemented using base-1000 as an intermediate base, and the result was converted to BCD using a shift-add-3 algorithm. In [16], the authors presented a double-digit decimal multiplier technique that performs 2-digit multiplications simultaneously in one clock cycle; the overall multiplication was then performed serially. In [17,18], a 1 × 1-digit multiplier was designed directly with BCD inputs/outputs and implemented using 6-input or 4-input LUTs. To sum the results of 1 × 1-digit multipliers, a fast carry-chain decimal adder was also proposed in [18]. These decimal-operation-based approaches avoided the conversions but also impacted the speed. Vázquez and De Dinechin implemented a BCD multiplier using a recoding technique [19]. Signed-Digit (SD) Radix-5 was employed to recode one of the input operands of the multiplier for the generation of the partial products. 6-input LUTs and fast carry chains in Xilinx FPGAs were used to generate the building blocks and the decimal adders. To increase the performance, the authors in [20] implemented a parallel decimal multiplier based on the Karatsuba-Ofman algorithm. The building blocks used in the Karatsuba-Ofman algorithm were designed based on the approach proposed in [19]. Another SD-based decimal multiplier approach was proposed in [21]. The recoding was based on SD Radix-10. BCD4221, 5211, and 5421 converters were used for the partial product generation. BCD4221-based compressors and adders were utilized in this approach. Although the BCD4221-based operations are similar to binary operations, the recoding and the different code conversions still lead to delay and resource cost.
In this paper, we propose a new parallel binary-operation-based decimal multiplier approach. Binary operations are performed for the 1 × 1-digit multiplication and the partial product reduction based on columns with two digits in each column. The operations for all columns are processed in parallel. After the column-based binary operations, binary to decimal conversions are required, but the bit sizes of the operands to be converted are limited because they are bounded per column. In this paper, an improved 6-LUT-based BCD adder and a 2-digit column-based binary-decimal compressor are also presented. Our proposed approach was implemented in Xilinx Virtex-5 and Virtex-6 FPGAs. The results are compared with Radix-recoding-based approaches using a BCD4221 coding scheme. The proposed approach achieves improved FPGA performance in part because of the parallel binary operations and small size conversions.
The organization of this paper is as follows. Section 2 presents optimized building blocks required by the BCD multiplication. The proposed multiplier architecture and the schemes of the partial product generation and reduction are presented in Section 3. The implementation results of n × n-digit BCD multipliers are presented in Section 4. Conclusions are given in Section 5.

Proposed Building Blocks for the Realization of BCD Multiplication
In this section, proposed schemes for an improved 6-input LUTs-based BCD adder and a mixed binary-decimal compressor are presented. These schemes will be utilized as the basic building blocks to construct our proposed BCD multipliers presented in Section 3.

6-Input LUTs-Based BCD Adder
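Let $A = [a_3 a_2 a_1 a_0]$ and $B = [b_3 b_2 b_1 b_0]$ be the two BCD input digits and $c_{in}$ the incoming carry, and let $A_1 = [a_3 a_2 a_1]$ and $B_1 = [b_3 b_2 b_1]$ denote their three most significant bits, so that $A = 2A_1 + a_0$ and $B = 2B_1 + b_0$. Then, the addition is presented as

$$A + B + c_{in} = 2(A_1 + B_1) + (a_0 + b_0 + c_{in}). \tag{2}$$

In (2), $A_1$ or $B_1$ takes values in the binary set {000, 001, 010, 011, 100}, and the full adder $[a_0 + b_0 + c_{in}]$ has two outputs, the carry $c_0$ and the sum $s_0$. The function $F = [f_4 f_3 f_2 f_1]$ is a three-bit adder with the add-3 correction merged, which can be expressed as

$$F = \begin{cases} A_1 + B_1, & A_1 + B_1 < 5 \\ A_1 + B_1 + 3, & A_1 + B_1 \ge 5. \end{cases} \tag{3}$$

To calculate the final result in BCD format, the carry $c_0$ of the full adder must be added to $F$. As a special case, an add-3 correction must be considered if $F = 4$ and $c_0 = 1$ to achieve a correct final result. Table 1 is the truth table for the final correction.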
Therefore, the proposed scheme requires the following steps:
(i) Decompose the addition into two adders: a full adder that adds the least significant bits of the two input operands together with the incoming carry, and a 3-bit adder with the add-3 correction merged for the remaining bits. This function decomposition is presented in (2).
(ii) Implement the full adder and the 3-bit adder merged with an add-3 correction as presented in (3).
(iii) Add the carry of the full adder to the output of the 3-bit adder using MUX-XOR networks. The multiplexers generate the propagated carries and the XOR gates output the sum bits.
(iv) Perform a final correction for the case in which the carry of the full adder is equal to "1" and the sum of the 3-bit adder is equal to "4" to obtain the final result.
Figure 1 shows the architecture of this approach. In this design, if the carry of the full adder, $c_0$, is "0", there is no change to the result of the 3-bit adder and no carry is propagated. The output of the BCD adder is then the same as that of the 3-bit adder, that is, $[c_{out}\, s_3 s_2 s_1] = [f_4 f_3 f_2 f_1]$, where $f_1$ to $f_4$ are the output bits of the 3-bit adder. However, if $c_0$ is "1," the carry must be added to the result of the 3-bit adder. First, XOR 1 and MUX 1 add $c_0$ to $f_1$ and generate the sum $s_1 = (f_1 \text{ XOR } c_0)$ and the carry $c_1 = (f_1 \text{ AND } c_0)$. If $c_0 = 1$ and $f_1 = 0$, the sum $s_1$ is equal to "1" and no carry ($c_1 = 0$) is propagated. However, if $c_0 = 1$ and $f_1 = 1$, the sum $s_1$ is equal to "0," and the carry $c_1$ is propagated to the next stage. The same procedure applies to XOR 2 and MUX 2. MUX 3 produces the output carry, $c_{out}$. Based on the truth table listed in Table 1, the output carry $c_{out}$ is the same as $f_4$ when $f_3 = 0$ and the same as $c_0$ when $f_3 = 1$, which is realized by MUX 3. In this case, propagating $c_0$ from the output of the full adder directly to the input of MUX 3 reduces the critical path delay. This has a significant performance impact on the large size BCD ripple adders required by BCD multipliers.

Figure 2: Two-group operands with the mixed binary-decimal format.
To achieve a correct final result, a final correction must be applied to the sum in the case of $c_0 = 1$ and $F = 4$. Before the final correction, the sum of the adder is equal to $(s_3 s_2 s_1) = (f_3 f_2 f_1) + c_0$, which is $(101)_2$ in this case. Therefore, under the condition of $c_0 = 1$ and $F = 4$, the final add-3 correction is performed on $(s_3 s_2 s_1)$, and the final result is equal to $(101)_2 + (011)_2 = (1000)_2$, that is, a sum of $(000)_2$ with an output carry of "1." In this case, the final outputs, denoted here $z_3$ and $z_1$, have to be forced to "0." Otherwise, $z_3$ and $z_1$ are the same as $s_3$ and $s_1$, respectively. Thus, the final correction performed on $s_3$ and $s_1$ is equal to

$$z_k = \begin{cases} 0, & c_0 = 1 \text{ and } F = 4 \\ s_k, & \text{otherwise} \end{cases} \qquad k \in \{1, 3\}.$$

The proposed 1-digit BCD adder was coded in VHDL and implemented in a Xilinx Virtex-6 6vlx75tff784 FPGA with a −3 speed grade using ISE 13.1 [23]. The results are compared with the carry-ripple BCD adder approach proposed in [19] using the same FPGA. The delays were extracted from the post-place-and-route static timing report, and the LUT usage was obtained from the place-and-route report. Table 2 lists the implementation results.
Table 2 shows that the improved 6-LUT-based BCD adder approach achieves better performance compared with the reference BCD adder.Although the improvement in delay is approximately 2%, for large size adders the cumulative effect can be significant.
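The four steps of the adder scheme can be checked with a small bit-level software model. The following Python sketch is only an illustration under stated assumptions: the function name `bcd_digit_add` and all signal names (`c0`, `f1` to `f4`, `s1` to `s3`) are chosen here for readability and are not taken from the VHDL implementation, and the model says nothing about LUT mapping or timing.

```python
def bcd_digit_add(a, b, cin):
    """Bit-level model of a 1-digit BCD adder built from a full adder,
    a 3-bit adder with merged add-3 correction, and a MUX-XOR chain.
    Signal names are illustrative assumptions, not the paper's VHDL names."""
    if not (0 <= a <= 9 and 0 <= b <= 9 and cin in (0, 1)):
        raise ValueError("inputs must be BCD digits and a 1-bit carry")

    # Step (i): split each digit into its LSB and its upper three bits (0..4).
    a0, b0 = a & 1, b & 1
    a1, b1 = a >> 1, b >> 1

    # Full adder on the least significant bits and the incoming carry.
    t = a0 + b0 + cin
    s0, c0 = t & 1, t >> 1

    # Step (ii): 3-bit adder with the add-3 correction merged.
    f = a1 + b1
    if f >= 5:
        f += 3
    f1, f2, f3, f4 = f & 1, (f >> 1) & 1, (f >> 2) & 1, (f >> 3) & 1

    # Step (iii): MUX-XOR chain adds c0 to the 3-bit adder output.
    s1, c1 = f1 ^ c0, f1 & c0
    s2, c2 = f2 ^ c1, f2 & c1
    s3 = f3 ^ c2
    cout = c0 if f3 else f4  # f3 = 1 only when the 3-bit sum equals 4

    # Step (iv): final correction when c0 = 1 and the 3-bit sum equals 4.
    if c0 and f3:
        s3, s1 = 0, 0

    digit = (s3 << 3) | (s2 << 2) | (s1 << 1) | s0
    return cout, digit
```

For example, `bcd_digit_add(9, 1, 0)` exercises the special case (3-bit sum equal to 4 with a carry from the full adder) and returns a carry of 1 with a sum digit of 0.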

Binary-Decimal Compression.
The binary-decimal (BD) compression performs 2-digit column-based binary operations and binary to decimal conversions. The input operands of the BD compression are the results of 1 × 1-digit BCD multipliers presented in binary format, and the output of the BD compression is in BCD format. Since a 1 × 1-digit BCD multiplier results in a 2-digit decimal number, the binary inputs are based on 2-digit decimal positions. The input operands of the BD compression take one of the two forms given in (7) and (8), where m is the number of operands to be compressed and n is the number of digits in each operand. Each term of an operand is expressed in a binary format but placed in a 2-digit decimal position. Since each term is the result of a 1 × 1-digit BCD multiplier, it has 7 binary bits.

(iv) For each column, converting the binary sum to decimal with two digits as the sum and the other digits as the carry
(v) Saving the decimal sums and carries in carry-save format for all columns based on their decimal positions

As an example, Figure 3 illustrates the BD compression with m input operands for the case presented in Figure 2(a). This procedure can also be used for the case in Figure 2(b).
In this case, the BD compression first compresses the m binary operands to one binary sum using binary compressors and binary adders. In this step, the binary compressors reduce the m binary operands to $(\lfloor \log_2 m \rfloor + 1)$ operands; then the binary adders add these operands to produce a binary sum. Then, the binary sum is converted to a decimal number. The decimal number consists of a two-digit decimal sum and one or more decimal carries, each of which takes a two-digit position. The decimal sum of column $i$ is located at the $10^{2i}$ column, and the carries are located at the next higher two-digit columns. Then, the decimal sum and carries for each column are saved in carry-save format based on their digit positions. Therefore, only a few decimal operands are generated after the BD compression: one for the decimal sums plus one for each decimal carry. The number of decimal carries depends on the value of m. If m is between 2 and 123, the maximal decimal result in each column is 81 × 123 = 9963, for which the decimal sum is 63 and the decimal carry is 99. In this case, only one decimal carry is generated. Thus, up to 123 such binary operands can be compressed to two decimal operands, one for the decimal sum and the other for the decimal carry. This arrangement results in a fast way to reduce the number of partial products for a BCD multiplier.

Proposed BCD Multiplier Approach
In this section, we present a binary-decimal compression (BDC) based BCD multiplier. The proposed approach consists of 1 × 1-digit binary multiplication, partial product generation, binary-decimal compression, and decimal addition. Figure 4 shows a block diagram which captures all the steps for this approach.

1 × 1-Digit Binary Multipliers
The 1 × 1-digit binary multiplier receives two 1-digit BCD operands and outputs a binary result. The maximal output is 9 × 9 = 81 = [1010001]$_2$, which is a 7-bit binary number. Since a 1-digit 8421BCD number is the same as a 4-bit binary number, a 4 × 4-bit binary multiplier is used to implement the 1 × 1-digit binary multiplier. In our approach, the 4 × 4-bit binary multiplier is simply coded as $a \times b$, where $a$ and $b$ are 1-digit 8421BCD numbers.

Partial Product Generation (PPG).
The partial product generation is based on 1 × 1-digit binary multipliers. The binary outputs of the 1 × 1-digit binary multipliers are grouped according to their decimal positions. A triangular organization of the partial products is used for the BCD multiplier, which is similar to our previous work proposed in [24] for a binary multiplier. For the BCD multiplication, let us assume that the input operands of the multiplier are $A$ and $B$. They are in BCD format and can be expressed as

$$A = \sum_{i=0}^{n-1} A_i \times 10^{i}, \qquad B = \sum_{i=0}^{n-1} B_i \times 10^{i}. \tag{9}$$

By multiplying $A$ and $B$ in (9), the product becomes

$$A \times B = \sum_{i=0}^{n-1} (A_i \times B_i)\, 10^{2i} + \sum_{i=0}^{n-2} \sum_{j=1}^{n-1-i} (A_{i+j} \times B_i + B_{i+j} \times A_i)\, 10^{2i+j}, \tag{10}$$

where $A_i \times B_i$, $A_{i+j} \times B_i$, and $B_{i+j} \times A_i$ are the products from 1 × 1-digit binary multipliers. These 1 × 1-digit binary multipliers are organized based on their decimal positions, and the architecture of the BCD multiplier is shown in Figure 5.
Based on the decimal positions of the results of the 1 × 1-digit binary multipliers, these partial products are separated into two groups. The first group is composed of $P_0$, $P_3$, $P_4$, ..., $P_{(2n-3)}$, and $P_{(2n-2)}$. The second group is composed of $P_1$, $P_2$, ..., $P_{(2n-5)}$, and $P_{(2n-4)}$. The number of operands in each of the columns is shown in Figure 6. The maximal number of operands in the first group is $(n - 1)$, which occurs at columns $(n/2)$ and $(n/2 - 1)$. The maximal number of operands in the second group is $n$, which occurs at column $(n/2 - 1)$.
As an example, Figure 7 shows the organization of a 4 × 4-digit BCD multiplier. In this example, the operands in group 1 are located at the decimal positions $10^{2i}$ with $i = 0, 1, 2, 3$, and the number of operands in each column is 1, 3, 3, and 1, respectively. The operands in group 2 are located at the decimal positions $10^{2i+1}$ with $i = 0, 1, 2$, and the number of operands in each column is 2, 4, and 2, respectively.
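The column sizes quoted for the 4 × 4-digit example follow directly from grouping the 1 × 1-digit products by decimal position. The short Python sketch below counts the operands per column for both groups; the function name `column_operand_counts` is a hypothetical helper, and the sketch counts products only, without modeling the triangular wiring of Figure 5.

```python
def column_operand_counts(n):
    """Count the 1x1-digit products per 2-digit column for an n x n-digit
    multiplier. Group 1 holds even decimal positions (10^(2i)), group 2
    holds odd decimal positions (10^(2i+1))."""
    counts = [0] * (2 * n - 1)
    for i in range(n):          # digit index of the first operand
        for j in range(n):      # digit index of the second operand
            counts[i + j] += 1  # product lands at decimal position 10^(i+j)
    group1 = counts[0::2]
    group2 = counts[1::2]
    return group1, group2
```

For `n = 4` this gives 1, 3, 3, 1 operands in group 1 and 2, 4, 2 operands in group 2, and the maxima are `n - 1` and `n`, as stated above.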

Partial Product Reduction.
Based on the architecture of the BCD multiplier, the partial products are in mixed binary-decimal format. To reduce the number of partial products, two steps are performed: partial product compression and partial product conversion.

Partial Product Compression.
The partial product compression performs $(m : 1)$ compression for the binary operands in each column using efficient binary compression and addition methods. The binary compressors first reduce $m$ binary operands, with $m$ between $2^k$ and $2^{k+1} - 1$, to $(k + 1)$ binary operands in each column. For example, for $k = 3$ the number of operands to be compressed is $m$ = 8 to 15. After the compression, 4 binary operands are generated.
Then, these binary operands after the compression are added in binary to obtain a binary sum.Thus, the m operands are compressed to a single one for all columns.
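One way to see why between 8 and 15 operands can be reduced to 4 is to count the ones in each bit position and spread the bits of that count over separate output operands. The Python sketch below illustrates this counter-style reduction; it is an assumption made for illustration (the function name `compress_operands` is hypothetical) and is not claimed to be the compressor structure used in the implementation.

```python
def compress_operands(operands):
    """Reduce m non-negative binary operands to floor(log2 m) + 1 operands
    with the same total, using a per-bit-position ones counter."""
    m = len(operands)
    if m == 0:
        raise ValueError("at least one operand is required")
    out_count = m.bit_length()  # equals floor(log2 m) + 1 for m >= 1
    width = max(operands).bit_length()
    outputs = [0] * out_count
    for bit in range(width):
        # Number of ones at this bit position (0..m), fits in out_count bits.
        ones = sum((x >> bit) & 1 for x in operands)
        for j in range(out_count):
            if (ones >> j) & 1:
                # Bit j of the count has weight 2^(bit + j).
                outputs[j] |= 1 << (bit + j)
    return outputs
```

The outputs can then be summed by ordinary binary adders to obtain the single binary sum for the column.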

Partial Product Conversion.
The partial product conversion converts the binary sum to decimal operands. Double-Dabble (DD) converters [25] can be used in this step. Since the column-based operations produce limited size binary sums in each column, the conversions introduce only a small delay overhead.
After the binary to decimal conversion, normally only 3 or 4 decimal operands are generated. If there are 12 binary operands or fewer in one column, the maximal sum is 81 × 12 = 972, which is a 3-digit decimal number. Thus, the decimal sum has two digits and the decimal carry has only one digit. Moreover, the decimal carries in the two groups are located at different digit positions. Therefore, the carries can be combined into one decimal operand. Figure 8 illustrates this situation for the 4 × 4-digit BCD multiplier example. Only three decimal operands are generated after the partial product reduction.
However, if there are more than 12 operands in one column, at least four digits are required in this column because 81 × 13 = 1053. Thus, the decimal carry has two digits. In this case, 4 decimal operands will be generated after the partial product reduction. For example, based on the number of operands in each column for a 16 × 16-digit BCD multiplier, as shown in Figure 9(a), columns 6, 7, 8, and 9 in the first group each produce a 4-digit decimal number, and columns 6, 7, and 8 in the second group also each produce a 4-digit decimal number. The decimal operands after the conversion are shown in Figure 9(b), where $S_1$ and $C_1$ are the decimal sum and carry for group 1 and $S_2$ and $C_2$ are the decimal sum and carry for group 2. By combining the decimal carries in the two groups, Figure 9(c) shows the decimal operand organization. One combined carry operand collects the first digit of the carries for all columns, and a second combined carry operand collects the second digit of the carries for the related columns. After partial product reduction, four decimal operands are generated for the 16 × 16-digit BCD multiplier, as shown in Figure 9(c).
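The binary to decimal conversion of a column sum can be modeled with the standard shift-and-add-3 (Double-Dabble) procedure. The Python sketch below is a generic software rendering of that algorithm, given only as an illustration; the function name `double_dabble` and its parameters are hypothetical and do not describe the converter structure of [25] in detail.

```python
def double_dabble(value, bits, digits):
    """Convert a non-negative binary value of the given bit width into
    packed BCD with the given number of digits (shift-and-add-3)."""
    if value < 0 or value >= (1 << bits) or value >= 10 ** digits:
        raise ValueError("value does not fit the stated widths")
    bcd = 0
    for i in range(bits - 1, -1, -1):
        # Add 3 to every BCD digit that is 5 or more before shifting.
        for d in range(digits):
            if (bcd >> (4 * d)) & 0xF >= 5:
                bcd += 3 << (4 * d)
        # Shift left by one and bring in the next binary bit (MSB first).
        bcd = (bcd << 1) | ((value >> i) & 1)
    return bcd
```

For a column with 12 operands, the 10-bit sum 972 converts to the packed BCD value `0x972`, whose lower two digits form the decimal sum and whose upper digit is the decimal carry.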

Final Decimal Addition (FDA).
To obtain the final result of the BCD multiplier, the decimal operands generated after the partial product reduction must be added using decimal adders. BCD ripple adders are used in our approach. These BCD ripple adders are built using our improved 6-LUTs-based BCD adders. Since only 3 or 4 decimal operands need to be added, two-level BCD ripple adders are required. Figure 10 shows the final addition of the BCD multiplication. If there are only three decimal operands to be added, the BCD adder2 in this figure is removed.
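The complete flow, from partial product generation to the three or four decimal operands that enter the final addition, can be checked with a behavioral model. The Python sketch below is a minimal illustration under stated assumptions: the function name `bdc_multiply` and the operand names `s1`, `s2`, `c1`, `c2` are chosen here, Python integers stand in for BCD words, and the final addition is performed with ordinary integer addition rather than BCD ripple adders.

```python
def bdc_multiply(a, b, n):
    """Behavioral model of an n x n-digit BDC-based multiplication.
    Returns the decimal operands (s1, s2, c1, c2) whose sum is a * b."""
    if not (0 <= a < 10 ** n and 0 <= b < 10 ** n):
        raise ValueError("operands must be non-negative n-digit numbers")
    a_digits = [(a // 10 ** i) % 10 for i in range(n)]
    b_digits = [(b // 10 ** i) % 10 for i in range(n)]

    # Partial product generation: binary 1x1-digit products per decimal position.
    columns = [[] for _ in range(2 * n - 1)]
    for i, ad in enumerate(a_digits):
        for j, bd in enumerate(b_digits):
            columns[i + j].append(ad * bd)

    s1 = s2 = c1 = c2 = 0
    for pos, products in enumerate(columns):
        binary_sum = sum(products)         # binary compression and addition
        dsum = binary_sum % 100            # two-digit decimal sum
        dcarry = binary_sum // 100         # decimal carry (up to two digits)
        weight = 10 ** pos
        if pos % 2 == 0:
            s1 += dsum * weight            # group 1 sums (even positions)
        else:
            s2 += dsum * weight            # group 2 sums (odd positions)
        c1 += (dcarry % 10) * weight * 100     # first carry digit of every column
        c2 += (dcarry // 10) * weight * 1000   # second carry digit, if any
    return s1, s2, c1, c2
```

For a 4 × 4-digit multiplication the fourth operand `c2` is always zero, so only three decimal operands remain; for 16 × 16 digits it can be nonzero, which corresponds to the four-operand case described above.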

Pipelined Multipliers.
Based on the architecture of the BCD multiplier, a 4-stage pipelined BCD multiplier is illustrated in Figure 11.
In this pipelined multiplier, the 1 × 1-digit (4 × 4-bit) binary multiplication and the binary compression and addition are combined in the first stage. In this stage, all operations in each column are in binary format. The second stage converts the binary numbers to decimal using the Double-Dabble (DD) converter [25]. Since the input operand of the conversion is based on each column, the number of bits in the input operands is limited. Therefore, the delay for the conversions is relatively small. After the binary to decimal conversion, 3 or 4 decimal operands are generated and need to be added. To add these decimal operands, two levels of additions are performed. For a larger size multiplier, more pipeline stages may be required. Figure 12 shows an 8-stage pipeline strategy.

Implementation Results
The proposed BCD multiplier approach was implemented in Xilinx Virtex-5 and Virtex-6 FPGAs. The Xilinx ISE 13.1 suite [23] was used for the synthesis and implementation. 4 × 4-bit binary multipliers were used for the partial product generation. The mixed binary-decimal compressors were employed for partial product reduction. The improved 6-LUTs-based BCD adders were connected as ripple adders and used to sum the compressed partial products and generate the final result. Our multipliers were implemented targeting Xilinx xc5vlx330ff1760-2 and xc6vlx760ff1760-2 FPGA devices. The total delay and the number of LUTs used were extracted after synthesis and implementation and compared with those of the multipliers proposed in [21,22]. Figures 13 and 14 illustrate the timing information and LUT usage for 4 × 4, 8 × 8, and 16 × 16-digit pipelined BCD multipliers based on our proposed approach and on the architecture presented in [21]. The implementation targeted Virtex-5 and Virtex-6 FPGAs, which are the exact same devices used in [21]. The number of pipeline stages was selected based on the best implementation result for each of the multipliers. The total delay, clock cycle time, and LUT usage are shown in the parts of these two figures labeled (a), (b), and (c), respectively.
Compared with the results presented in [21], our proposed approach achieves improvements in all cases as shown in these figures. On average, the total delay reductions are 22.5% and 14.3% with 14.6% and 16.6% LUT savings when targeting Virtex-5 and Virtex-6 FPGAs, respectively.
The 16 × 16-digit multiplier with 5, 6, and 7 pipeline stages was implemented targeting a Virtex-5 FPGA. The results were compared with the architecture in [22] and are presented in Table 3. The total delay of all pipeline stages and the worst-case clock cycle for one pipeline stage were extracted and used for speed comparison.
Compared with the result reported in [22], our approach achieves faster performance in terms of the total delay and worst-case minimum clock period. On average, the total delay is reduced by 20.2% and the clock cycle by 21.0%, with an 8.7% LUT penalty.

International Journal of Reconfigurable Computing
Thus, our approach compares favorably with the architectures in [21,22]. The improvement comes in part from the use of parallel and binary operations, as well as our fast BCD additions. By using 1 × 1-digit binary multipliers and the 2-digit column-based binary-decimal compressors, fast parallel operations are performed with small size binary numbers. These binary-decimal compressors efficiently reduce the number of partial products to 3 or 4 decimal operands, which simplifies the decimal additions required by the multiplication. Moreover, in the decimal addition, our fast BCD adder decreases the propagation delay for BCD ripple adders. All of these lead to a superior multiplier architecture.

Conclusions
In this paper, a new n × n-digit BCD multiplier approach was proposed. This approach uses 1 × 1-digit binary multipliers for the partial product generation. 2-digit column-based binary operations are used for partial product reduction. This proposed binary-decimal compression scheme makes efficient use of a parallel strategy and of fast binary operation schemes to reduce the number of partial products of the multiplier. After the binary-decimal compression, in general only 3 or 4 operands need to be added in decimal to obtain the final result of a BCD multiplier. To perform the decimal additions, a fast 6-LUTs-based BCD adder was proposed to realize the BCD ripple adders required for the multiplication. The proposed BCD multipliers were pipelined and implemented on Xilinx Virtex-5 and Virtex-6 FPGAs. Compared with existing architectures, improved results have been achieved.

Figure 2 illustrates these two-group operands, where (a) and (b) correspond to (7) and (8), respectively. The difference between Figures 2(a) and 2(b) is the decimal positions of the columns. The binary-decimal compression performs the following steps:
(i) Aligning the input operands based on 2-digit decimal position. All operands in the same column should have the same 2-digit decimal position
(ii) Compressing all operands in each of the columns using binary compressors
(iii) Adding the compressed binary operands in each column using binary adders

Figure 5: Triangular organization of the partial products of the BCD multiplier.

Figure 6: Number of operands in each of the columns.
Figures 13 and 14, panel (c): Number of LUTs used versus the number of digits in each input operand.

Table 1: Final correction for the BCD adder.

Table 2: Comparison of the implementation results for the BCD adders.