In this paper, a novel BCD multiplier approach is proposed. The main highlight of the proposed architecture is the generation of the partial products and parallel binary operations based on 2-digit columns. The 1 × 1-digit multipliers used for the partial product generation are implemented directly by 4-bit binary multipliers without any code conversion. The binary results of the 1 × 1-digit multiplications are organized according to their two-digit positions to generate the 2-digit column-based partial products. A binary-decimal compressor structure is developed and used for partial product reduction. These reduced partial products are added in optimized 6-LUT BCD adders. The parallel binary operations and the improved BCD addition result in improved performance and reduced resource usage. The proposed approach was implemented on Xilinx Virtex-5 and Virtex-6 FPGAs with emphasis on the critical path delay reduction. Pipelined BCD multipliers were implemented for 4 × 4, 8 × 8, and 16 × 16-digit multipliers. Our realizations achieve an increase in speed of up to 22% and a reduction in LUT count of up to 14% over previously reported results.
The traditional approach of using binary-number-system-based operations in a decimal system requires front-end and back-end conversions. These conversions can take a significant amount of processing time and consume a large area. A more important problem is that fractional decimal numbers expressed in a binary format may lose accuracy. This can have a major impact in financial and commercial applications. To solve these problems, interest in the hardware design of decimal arithmetic is growing. This has led to the incorporation of specifications for decimal arithmetic in the IEEE 754-2008 standard for floating-point arithmetic [
Multiplication is a complex operation among decimal computations. To speed up this operation, early decimal multipliers were designed at the gate level targeting ASICs. The authors in [
Although there are a number of approaches to implement decimal multipliers in ASICs, utilizing the same methods in FPGA devices is not necessarily efficient. With recent advancements in FPGA technology, enhanced architectures, and availability of various hardware resources, the FPGA platform is recognized as a viable alternative to ASICs in many cases. To make efficient use of FPGA resources in the implementation of decimal multiplication, new algorithms and approaches have been developed. The authors in [
In this paper, we propose a new parallel binary-operation-based decimal multiplier approach. Binary operations are performed for the 1 × 1-digit multiplication and the partial product reduction based on columns with two digits in each column. The operations for all columns are processed in parallel. After the column-based binary operations, binary-to-decimal conversions are required, but the bit sizes of the operands to be converted are limited by the column width. In this paper, an improved 6-LUT-based BCD adder and a 2-digit column-based binary-decimal compressor are also presented. Our proposed approach was implemented in Xilinx Virtex-5 and Virtex-6 FPGAs. The results are compared with radix-recoding-based approaches using a BCD-4221 coding scheme. The proposed approach achieves improved FPGA performance in part because of the parallel binary operations and the small size of the conversions.
The organization of this paper is as follows. Section
In this section, proposed schemes for an improved 6-input LUT-based BCD adder and a mixed binary-decimal compressor are presented. These schemes will be utilized as the basic building blocks to construct our proposed BCD multipliers presented in Section
The 6-input LUT-based 1-digit BCD adder is based on the use of 6-input LUTs and MUX-XOR networks in FPGAs. It is an improved version of the architecture presented in [
Assume that the input operands of the adder are
To calculate the final result in BCD format, the carry
Final correction for the BCD adder.



Comments 




0 0 0 0  0 0 0 0  0 0 0 1  “+3” is not required 
0 0 0 1  0 0 0 1  0 0 1 0  “+3” is not required 
0 0 1 0  0 0 1 0  0 0 1 1  “+3” is not required 
0 0 1 1  0 0 1 1  0 1 0 0  “+3” is not required 
0 1 0 0  0 1 0 0  1 0 0 0 

0 1 0 1  x x x x  x x x x  
0 1 1 0  x x x x  x x x x  
0 1 1 1  x x x x  x x x x  
1 0 0 0  1 0 0 0  1 0 0 1  “+3” has been performed 
1 0 0 1  1 0 0 1  1 0 1 0  “+3” has been performed 
1 0 1 0  1 0 1 0  1 0 1 1  “+3” has been performed 
1 0 1 1  1 0 1 1  1 1 0 0  “+3” has been performed 
1 1 0 0  x x x x  x x x x  
1 1 0 1  x x x x  x x x x  
1 1 1 0  x x x x  x x x x  
1 1 1 1  x x x x  x x x x 
Therefore, the proposed scheme requires the following steps:
Decompose the addition into two adders: one is a full adder for adding the two least significant bits of the input operands with the incoming carry, and the other is a 3-bit adder with a merged add-3 correction for the remaining bits. This function decomposition is presented in (
Implement the full adder and the 3-bit adder merged with an add-3 correction as presented in (
Add the carry of the full adder to the output of the 3-bit adder using MUX-XOR networks. The multiplexers generate the propagated carries and the XOR gates output the sum bits.
Perform a final correction for the case where the carry of the full adder equals “1” and the sum of the 3-bit adder equals “4” to obtain the final result.
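The four steps can be checked with a short behavioral model. The following Python sketch is not the LUT-level mapping, since the decomposition equations are not reproduced in this text. It assumes that the full adder handles the least significant bits plus the incoming carry, that the 3-bit adder operates on the upper three bits of each digit, and that the merged add-3 correction is applied when the 3-bit sum is 5 or more. The name `bcd_digit_add` and all variable names are illustrative.

```python
def bcd_digit_add(a, b, cin=0):
    """Behavioral model of a 1-digit BCD adder (a, b in 0..9, cin in 0..1).

    Returns (sum_digit, carry_out). Names and structure are illustrative
    assumptions, not the LUT equations of the paper.
    """
    # Step 1a: full adder on the least significant bits and the incoming carry.
    lsb_total = (a & 1) + (b & 1) + cin
    s0 = lsb_total & 1           # least significant sum bit
    c1 = lsb_total >> 1          # carry of the full adder

    # Step 1b/2: 3-bit adder on the remaining bits with add-3 merged
    # (assumption: the correction is applied when the 3-bit sum is >= 5).
    t = (a >> 1) + (b >> 1)      # 0..8
    if t >= 5:
        t += 3                   # 8..11, bit 3 becomes the decimal carry

    # Step 3: add the carry of the full adder (MUX-XOR network in hardware).
    z = t + c1

    # Step 4: final correction when the full-adder carry is 1 and the
    # 3-bit sum is 4: the true value is 5, so add 3 to obtain 8.
    if c1 == 1 and t == 4:
        z = 8

    carry_out = z >> 3
    sum_digit = ((z & 0b111) << 1) | s0
    return sum_digit, carry_out
```

For example, `bcd_digit_add(5, 5)` exercises the final-correction case: the full adder produces a carry, the 3-bit sum is 4, and the corrected result is digit 0 with a decimal carry.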
Figure
Improved 1-digit BCD adder using 6-LUTs and MUX-XOR network in FPGA.
In this design, if the carry of the full adder,
To achieve a correct final result, a final correction in the cases of
The proposed 1-digit BCD adder was coded in VHDL and implemented in a Virtex-6 6vlx75tff784 Xilinx FPGA with a −3 speed grade using ISE 13.1 [
Comparison of the implementation results for the BCD adders.
Improved 6LUT  Reference [  

Delay (ns)  LUTs  Delay (ns)  LUTs 
1.372  10  1.397  10 
Table
The binary-decimal (BD) compression performs 2-digit column-based binary operations and binary-to-decimal conversions. The input operands of the BD compression are the results of 1 × 1-digit BCD multipliers presented in binary format, and the output of the BD compression is in BCD format. Since a 1 × 1-digit BCD multiplier results in a 2-digit decimal number, the binary inputs are based on 2-digit decimal positions. The input operands of the BD compression are
Two-group operands with the mixed binary-decimal format.
The binarydecimal compression performs the following steps:
Aligning the input operands based on their 2-digit decimal position. All operands in the same column should have the same 2-digit decimal position
Compressing all operands in each of the columns using binary compressors
Adding the compressed binary operands in each column using binary adders
For each column, converting the binary sum to decimal with two digits as the sum and the other digits as the carry
Saving the decimal sums and carries in carry-save format for all columns based on their decimal positions
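The per-column part of these steps can be illustrated with a small behavioral sketch in Python. It models the binary compression and addition as a plain integer sum and the binary-to-decimal conversion as a split into a two-digit sum and a carry. The function name `bd_compress_column` and the input check against 81 (the largest 1 × 1-digit product) are assumptions made for this illustration, not part of the described hardware.

```python
def bd_compress_column(operands):
    """Compress the binary operands of one 2-digit column.

    operands: iterable of 1 x 1-digit products (each in 0..81), all
    aligned to the same 2-digit decimal position.
    Returns (decimal_sum, decimal_carry): decimal_sum holds the two
    digits that stay in this column, decimal_carry holds the remaining
    higher digits.
    """
    operands = list(operands)
    if any(not 0 <= op <= 81 for op in operands):
        raise ValueError("each operand must be a 1 x 1-digit product (0..81)")

    # Binary compressors followed by a binary adder: modeled as a sum.
    binary_sum = sum(operands)

    # Binary-to-decimal conversion: two digits as sum, the rest as carry.
    decimal_sum = binary_sum % 100
    decimal_carry = binary_sum // 100
    return decimal_sum, decimal_carry
```

For instance, the operands 81, 45, and 7 add up to 133, so the column keeps the decimal sum 33 and passes on the decimal carry 1.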
As an example, Figure
Binary-decimal compression.
In this case, the BD compression first compresses the m binary operands to one binary sum using binary compressors and binary adders. In this step, the binary compressors reduce m binary operands to
Then, the binary sum is converted to a decimal number. The decimal number has a two-digit decimal sum,
In this section, we present a binary-decimal compression (BDC) based BCD multiplier. The proposed approach consists of 1 × 1-digit binary multiplication, partial product generation, binary-decimal compression, and decimal addition. Figure
Block diagram of the proposed BDC-based BCD multiplier.
The 1 × 1-digit binary multiplier receives two 1-digit BCD operands and outputs a binary result. The maximal output is 9 × 9 = 81 = [
The partial product generation is based on 1 × 1-digit binary multipliers. The binary outputs of the 1 × 1-digit binary multipliers are grouped according to their decimal positions. A triangular organization of the partial products is used for the BCD multiplier, which is similar to our previous work proposed in [
Triangular organization of the partial products of the BCD multiplier.
Based on the decimal positions of the results of the 1 × 1-digit binary multipliers, these partial products are separated into two groups. The first group is composed of
Number of operands in each of the columns.
As an example, Figure
A 4 × 4-digit BCD multiplier.
Triangular organization
Two groups of the partial products
Based on the architecture of the BCD multiplier, the partial products are in mixed binary-decimal format. To reduce the number of partial products, two steps are performed: partial product compression and partial product conversion.
Then, these binary operands after the compression are added in binary to obtain a binary sum. Thus, the m operands are compressed to a single one for all columns.
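The overall data flow (1 × 1-digit binary products, grouping into 2-digit columns, per-column binary summation, and conversion to a decimal sum and carry) can be sketched behaviorally in Python. One point is an assumption: the exact definition of the two groups is not reproduced in this text, so the sketch places a product of digits i and j in group 0 when i + j is even and in group 1 when i + j is odd, which gives every column a fixed 2-digit decimal position. The final accumulation with Python integers stands in for the decimal addition stage. All names are illustrative.

```python
def to_digits(value, n):
    """Return the n least significant decimal digits of value, LSD first."""
    return [(value // 10 ** k) % 10 for k in range(n)]


def bcd_multiply_columns(a_digits, b_digits):
    """Behavioral model of a column-based BCD multiplication.

    a_digits, b_digits: decimal digits, least significant first.
    Grouping by parity of i + j is an assumption of this sketch.
    """
    # columns[g][k] collects the binary products whose 2-digit column
    # starts at decimal position 2 * k + g.
    columns = {0: {}, 1: {}}
    for i, a in enumerate(a_digits):
        for j, b in enumerate(b_digits):
            pos = i + j
            columns[pos % 2].setdefault(pos // 2, []).append(a * b)  # <= 81

    total = 0
    for group in (0, 1):
        for k, operands in columns[group].items():
            binary_sum = sum(operands)          # binary compression + addition
            dec_sum = binary_sum % 100          # stays in this column
            dec_carry = binary_sum // 100       # moves to the next column
            weight = 10 ** (2 * k + group)
            total += dec_sum * weight + dec_carry * weight * 100
    return total
```

A quick check is to compare the result with ordinary integer multiplication, for example `bcd_multiply_columns(to_digits(1234, 4), to_digits(5678, 4)) == 1234 * 5678`.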
After the binary-to-decimal conversion, normally only 3 or 4 decimal operands are generated. If there are 12 binary operands or fewer in one column, the maximal sum is 81 × 12 = 972, which is a 3-digit decimal number. Thus, the decimal sum has two digits and the decimal carry has only one digit. Moreover, the decimal carries in the two groups are located at different digit positions. Therefore, the carries can be combined into one decimal operand. Figure
Partial product reduction for a 4 × 4-digit BCD multiplier.
However, if there are more than 12 operands in one column, at least four digits are required in this column because 81 × 13 = 1053. Thus, the decimal carry has two digits. In this case, 4 decimal operands will be generated after the partial product reduction. For example, based on the number of operands in each column for a 16 × 16-digit BCD multiplier, as shown in Figure
Partial product reduction for a 16 × 16-digit BCD multiplier.
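The threshold of 12 operands follows directly from the largest 1 × 1-digit product, 81. The short Python sketch below computes how many decimal carry digits a column can need for a given operand count. The helper name `carry_digits` is illustrative.

```python
MAX_PRODUCT = 9 * 9  # largest 1 x 1-digit product


def carry_digits(num_operands):
    """Number of decimal carry digits a column with num_operands needs.

    The two least significant digits of the column sum form the decimal
    sum; any further digits form the decimal carry.
    """
    max_sum = MAX_PRODUCT * num_operands
    return max(len(str(max_sum)) - 2, 0)
```

With 12 operands the worst-case sum is 972 and one carry digit suffices, while 13 operands give 1053 and require two carry digits.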
To obtain the final result of the BCD multiplier, the decimal operands generated after the partial product reduction must be added using decimal adders. BCD ripple adders are used in our approach. These BCD ripple adders are built using our improved 6-LUT-based BCD adders. Since only 3 or 4 decimal operands need to be added, two levels of BCD ripple adders are required. Figure
The final addition for a BCD multiplier.
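A two-level arrangement of BCD ripple adders can be modeled as follows. This Python sketch is behavioral only: each digit position performs a decimal add with a rippling carry, and the pairing of operands in the first level is an assumption, since the figure is not reproduced in this text. The names `bcd_ripple_add` and `add_decimal_operands` are illustrative.

```python
def bcd_ripple_add(x, y):
    """Add two equal-length digit lists (LSD first) with a rippling carry.

    Returns a digit list that is one digit longer than the inputs.
    """
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    result, carry = [], 0
    for a, b in zip(x, y):
        s = a + b + carry
        carry = 1 if s >= 10 else 0
        result.append(s - 10 * carry)
    result.append(carry)
    return result


def add_decimal_operands(p, q, r, s):
    """Two-level addition of four decimal operands (LSD-first digit lists).

    For three operands, pass a list of zeros as the fourth.
    """
    # Level 1: two ripple adders work side by side (assumed pairing).
    pq = bcd_ripple_add(p, q)
    rs = bcd_ripple_add(r, s)
    # Level 2: one ripple adder combines the two intermediate sums.
    return bcd_ripple_add(pq, rs)
```

For example, adding the 3-digit operands 999, 999, 999, and 999 (each written as `[9, 9, 9]`) yields the digit list `[6, 9, 9, 3, 0]`, which reads as 03996.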
Based on the architecture of the BCD multiplier, a 4-stage pipelined BCD multiplier is illustrated in Figure
4-stage pipelined BCD multiplier.
In this pipelined multiplier, the 1 × 1-digit (4 × 4-bit) binary multiplication and the binary compression and addition are combined in the first stage. In this stage, all operations in each column are in binary format. The second stage converts the binary numbers to decimal using the Double-Dabble (DD) converter [
An 8-stage pipelined multiplier.
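The binary-to-decimal conversion stage relies on the Double-Dabble method. As a reminder of how that method works, the following Python sketch implements the textbook shift-and-add-3 procedure. It is a generic illustration and not necessarily the structure of the converter used in the pipeline. The function name and parameters are illustrative.

```python
def double_dabble(value, n_bits, n_digits):
    """Convert an unsigned binary value to BCD by shift-and-add-3.

    value: integer with 0 <= value < 2 ** n_bits
    n_digits: number of BCD digits, large enough to hold the result
    Returns the decimal digits, least significant first.
    """
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    bcd = 0
    for i in range(n_bits - 1, -1, -1):
        # Before each shift, add 3 to every BCD digit that is 5 or more,
        # so that doubling it produces a correct decimal carry.
        for d in range(n_digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        # Shift left by one and bring in the next binary bit (MSB first).
        bcd = (bcd << 1) | ((value >> i) & 1)
    return [(bcd >> (4 * d)) & 0xF for d in range(n_digits)]
```

A column sum of 972, for example, fits in 10 bits and converts to the digits 2, 7, 9 (least significant first), that is, a decimal sum of 72 and a decimal carry of 9.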
The proposed BCD multiplier approach was implemented in Xilinx Virtex-5 and Virtex-6 FPGAs for 4 × 4, 8 × 8, and 16 × 16-digit pipelined BCD multipliers. The ISE 13.4 tool suite [
Figures
Implementation results using Virtex-5 FPGA.
Total delay
Clock cycle
Number of LUTs used
Implementation results using Virtex-6 FPGA.
Total delay
Clock cycle
Number of LUTs used
Compared with the results presented in [
The 16 × 16-digit multiplier with 5, 6, and 7 pipeline stages was implemented targeting a Virtex-5 FPGA. The results were compared with the architecture in [
Results compared with [
# of pipeline 
[ 
Proposed  Comparison  

Total delay (ns)  Clock cycle (ns)  #LUTs  Total delay (ns)  Clock cycle (ns)  #LUTs  Delay reduction (%)  Clock cycle time reduction (%)  LUT saving (%)  
5  27.400  5.480  6438  19.025  3.805  6843  30.57  30.57  −6.29 
6  28.740  4.830  6664  22.242  3.707  6918  22.61  23.25  −3.81 
7  30.660  4.460  5992  28.392  4.056  6953  7.40  9.06  −16.04 
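The comparison columns are relative differences with respect to the reference design. The Python sketch below recomputes them from the delay, clock cycle, and LUT figures in the table. The helper name `reduction_percent` is illustrative, and a negative value means the proposed design uses more of the resource than the reference.

```python
def reduction_percent(reference, proposed):
    """Relative reduction of proposed versus reference, in percent."""
    return (reference - proposed) / reference * 100.0


# (stages, reference, proposed), each design as
# (total delay in ns, clock cycle in ns, LUT count), taken from the table.
ROWS = [
    (5, (27.400, 5.480, 6438), (19.025, 3.805, 6843)),
    (6, (28.740, 4.830, 6664), (22.242, 3.707, 6918)),
    (7, (30.660, 4.460, 5992), (28.392, 4.056, 6953)),
]

for stages, ref, new in ROWS:
    delay, clock, luts = (reduction_percent(r, n) for r, n in zip(ref, new))
    print(f"{stages} stages: delay {delay:.2f}%, clock {clock:.2f}%, LUTs {luts:.2f}%")
```

For the 5-stage design, the total delay drops from 27.400 ns to 19.025 ns, which is the 30.57% reduction listed in the table, while the LUT count rises from 6438 to 6843, a saving of −6.29%.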
Compared with the result proposed in [
Thus, our approach compares favorably with the architectures in [
In this paper, a new
The authors declare that they have no competing interests.