High-Speed FPGA 10's Complement Adders-Subtractors

This paper first presents a study of classical BCD adders, from which a carry-chain type adder is redesigned to fit Xilinx FPGA platforms. Some new concepts are presented to compute the P and G functions for carry-chain optimization purposes, and several alternative designs are presented. Then, attention is given to FPGA implementations of add/subtract algorithms for 10's complement BCD numbers. Carry-chain type circuits have been designed on 4-input LUT (Virtex-4, Spartan-3) and 6-input LUT (Virtex-5) Xilinx FPGA platforms. All designs are presented with the corresponding time performance and area consumption figures. Results have been compared to straightforward implementations of a decimal ripple-carry adder and an FPGA 2's complement binary adder-subtractor using the dedicated carry logic, both carried out on the same platform. Shorter delays have been recorded for decimal numbers within the same range of operands.


Introduction
In a number of computer arithmetic applications, decimal systems are preferred to binary ones. The reasons come not only from the complexity of coding/decoding interfaces but mostly from the lack of precision and clarity in the results of binary systems.
Decimal arithmetic plays a key role in data processing environments such as commercial, financial, and Internet-based applications [1][2][3]. The performance required by applications with intensive decimal arithmetic is not met by most conventional software-based decimal arithmetic libraries [1]. Hardware implementation embedded in recently commercialized general-purpose processors [3,4] is gaining importance.
Furthermore, IEEE has recently published a new standard, IEEE 754-2008 [5], that supports the floating-point representation of decimal numbers.
At the moment, Binary Coded Decimal (BCD) is used for decimal arithmetic algorithm implementations. Although other coding systems may be of interest, BCD seems to be the best choice so far. Issues of hardware realization of decimal arithmetic units appear to be widely open: potential improvements are expected in algorithm concepts as well as in hardware design. This paper summarizes some new concepts about carry-chain type algorithms for adding BCD numbers. Two key ideas have been introduced: (i) the propagate P and generate G functions are computed from the input data instead of intermediate BCD sums, and (ii) the functions have been implemented on Xilinx Virtex-4 [6] and Virtex-5 [7] FPGA platforms, taking advantage of the 6-input LUT structure of the Virtex-5 version.
Signed-number addition is used as a primitive operation for computing most arithmetic functions, so it deserves particular attention. It is well known that in classical algorithms the execution time of any program or circuit is proportional to the number N of digits of the operands. In order to minimize the computation time, several ideas have been proposed in the literature [8,9]. Most of them consist in modifying the classical algorithm in such a way as to minimize the computation time of each carry; the time complexity may still be proportional to N, but the proportionality constant may be reduced. Moreover, it has to be pointed out that, within the same range, decimal addition involves a shorter carry propagation process than the straight binary code. It will be shown in the practical implementations that adding BCD digits not only saves coding interfaces but also reduces delay. Hardware consumption for BCD will be greater if coding and decoding processes are not considered; as of today, the dramatic decrease in hardware cost stimulates work on time saving.
In this paper, decimal carry-chain and ripple-carry adders have been implemented on Virtex-4 Xilinx FPGA platforms for a number of operand sizes; comparative performances are presented for binary and BCD digit operands.
Additionally, three adder-subtractor designs have been implemented on Xilinx Virtex-5 FPGA platforms for a number of operand sizes; comparative performances are presented for binary and BCD digit operands, respectively. Adder-subtractor inputs are 10's complement signed BCD numbers; the sign-change algorithm is used whenever subtraction is at hand.

Base-B Ripple-Carry Adders
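Consider the base-B representations of two n-digit numbers:

$$x = x(n-1) \cdot B^{n-1} + x(n-2) \cdot B^{n-2} + \cdots + x(0) \cdot B^{0},$$
$$y = y(n-1) \cdot B^{n-1} + y(n-2) \cdot B^{n-2} + \cdots + y(0) \cdot B^{0}. \quad (1)$$

Algorithm 1 (pencil and paper) computes the (n+1)-digit representation of the sum $z = x + y + c_{in}$, where $c_{in}$ is an initial carry equal to 0 or 1.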

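Algorithm 1. Classic addition (ripple carry):
c(0) := c_in;
for i in 0 .. n-1 loop
  if x(i) + y(i) + c(i) > B - 1 then c(i+1) := 1; else c(i+1) := 0; end if;
  z(i) := (x(i) + y(i) + c(i)) mod B;
end loop;
z(n) := c(n);

As c(i + 1) is a function of c(i), the execution time of Algorithm 1 is proportional to n (Figure 1). In order to reduce the execution time of each iteration step, Algorithm 1 can be modified as shown in Section 3.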
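As an illustration, the ripple-carry scheme of Algorithm 1 can be sketched in Python. This is a minimal software model, not the hardware description used in the paper; the function name `ripple_carry_add` and the choice of storing digits least significant first are assumptions made for the example.

```python
def ripple_carry_add(x, y, base=10, c_in=0):
    """Add two equal-length digit lists (least significant digit first).

    Returns the (n+1)-digit sum, also least significant digit first.
    """
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    z = []
    carry = c_in                      # c(0) := c_in
    for xi, yi in zip(x, y):
        total = xi + yi + carry
        z.append(total % base)        # z(i) := (x(i) + y(i) + c(i)) mod B
        carry = 1 if total > base - 1 else 0   # c(i+1)
    z.append(carry)                   # z(n) := c(n)
    return z


# 478 + 256 in base 10, digits listed least significant first
print(ripple_carry_add([8, 7, 4], [6, 5, 2]))
```

Each carry depends on the previous one, which is why the loop cannot be parallelized as written and the delay grows linearly with n.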

Base-B Carry-Chain Adders
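First define two binary functions of two B-valued variables, namely, the propagate (P) and generate (G) functions:

$$P(a, b) = \begin{cases} 1 & \text{if } a + b = B - 1, \\ 0 & \text{otherwise}; \end{cases} \qquad G(a, b) = \begin{cases} 1 & \text{if } a + b > B - 1, \\ 0 & \text{otherwise}. \end{cases} \quad (2)$$

The next carry c(i+1) can be calculated as follows:

if P(x(i), y(i)) = 1 then c(i+1) := c(i); else c(i+1) := G(x(i), y(i)); end if; (3)

The corresponding modified Algorithm 2 is the following one.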

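Algorithm 2. Carry-chain addition
-- computation of the generation and propagation conditions:
for i in 0 .. n-1 loop g(i) := G(x(i), y(i)); p(i) := P(x(i), y(i)); end loop;
-- carry computation:
c(0) := c_in;
for i in 0 .. n-1 loop
  if p(i) = 1 then c(i+1) := c(i); else c(i+1) := g(i); end if;
end loop;
-- sum computation:
for i in 0 .. n-1 loop z(i) := (x(i) + y(i) + c(i)) mod B; end loop;
z(n) := c(n);

Comments.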
(1) Instruction sentence (3) is equivalent to the following Boolean equation:

$$c(i+1) = \overline{P(i)} \cdot G(i) + P(i) \cdot c(i). \quad (4)$$

Furthermore, if the preceding relation is used, then the definition of the generate function can be modified: G(i) must be 1 if x(i) + y(i) > B − 1 and 0 if x(i) + y(i) < B − 1, and may be chosen freely (0 or 1) if x(i) + y(i) = B − 1. (5)

(2) Another Boolean equation equivalent to (4) is

$$c(i+1) = G(i) + P(i) \cdot c(i). \quad (6)$$

If the preceding relation is used, then the definition of the propagate function can be modified: P(i) must be 1 if x(i) + y(i) = B − 1 and 0 if x(i) + y(i) < B − 1, and may be chosen freely (0 or 1) if x(i) + y(i) > B − 1. (7)

The structure of an n-digit adder with separate carry calculation is shown in Figure 2.

The Cy.Ch (carry-chain) cell computes the next carry, that is to say

$$c(i+1) = \overline{P(i)} \cdot G(i) + P(i) \cdot c(i),$$

so that G(i) generates a carry, whatever happens upstream in the carry chain, and P(i) propagates the carry from level i − 1. The mod B sum cell calculates

$$z(i) = (x(i) + y(i) + c(i)) \bmod B.$$

As regards the computation time T, the critical path is shaded in Figure 2. It has been assumed that $T_{sum} > T_{Cy.Ch}$.

Another interesting time is the delay $T_{carry}(n)$ from c(0) to c(n), assuming that all propagate and generate functions have already been calculated:

$$T_{carry}(n) = n \cdot T_{Cy.Ch}.$$

Comments. The carry-chain cells are binary circuits, whereas the generate-propagate and the mod B sum cells are B-ary ones. Equation (4) can be implemented by a 2-to-1 binary multiplexer (Figure 3(a)), while (6) can be implemented by a 2-gate circuit (Figure 3(b)). In the first case, the per-digit delay of a carry-chain adder is equal to the delay $T_{mux2-1}$ of a 2-to-1 binary multiplexer, whatever the base B is.

If B = 2 and the carry-chain cell of Figure 3(a) is used, then P(i) = x(i) ⊕ y(i) and G(i) can be chosen equal to, for example, y(i). The corresponding cell for an n-bit binary adder is shown in Figure 4.
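The separation between generate-propagate computation, carry chain, and mod B sum can be sketched as follows. This is a software model under stated assumptions: the function names are hypothetical, digits are stored least significant first, and the carry chain is modeled as the 2-to-1 multiplexer selection (propagate selects the incoming carry, otherwise the generate value is used). The last function models the binary case mentioned above, where G(i) is simply taken as y(i).

```python
def propagate(a, b, base):
    """P = 1 when the two digits add up to exactly base - 1."""
    return 1 if a + b == base - 1 else 0


def generate(a, b, base):
    """G = 1 when the two digits alone produce a carry."""
    return 1 if a + b > base - 1 else 0


def carry_chain_add(x, y, base=10, c_in=0):
    """Carry-chain addition on digit lists (least significant digit first)."""
    n = len(x)
    # Step 1: all P and G values depend only on the inputs (no carry needed).
    p = [propagate(x[i], y[i], base) for i in range(n)]
    g = [generate(x[i], y[i], base) for i in range(n)]
    # Step 2: binary carry chain, one 2-to-1 multiplexer per digit.
    c = [c_in]
    for i in range(n):
        c.append(c[i] if p[i] else g[i])
    # Step 3: mod B sums, each using its own carry.
    z = [(x[i] + y[i] + c[i]) % base for i in range(n)]
    z.append(c[n])
    return z


def binary_cell_carry(xi, yi, ci):
    """Base-2 carry cell: P = x xor y, G chosen equal to y."""
    p = xi ^ yi
    return ci if p else yi
```

Only step 2 is sequential; steps 1 and 3 are independent per digit, which is what allows the per-digit delay to shrink to that of one multiplexer.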

Base-10 Complement and Addition
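Figure 4: Binary adder cell.

General principles of complement representations are available in the literature [8,9]. This paper restricts itself to the 10's complement system. A one-to-one function R(x), associating a natural number to x, is defined as follows.

Every integer x belonging to the range

$$-\frac{10^n}{2} \le x < \frac{10^n}{2}$$

is represented by $R(x) = x \bmod 10^n$, so that the integer represented in the form $x_{n-1} x_{n-2} \ldots x_1 x_0$ is

$$x = \begin{cases} R(x) & \text{if } R(x) < 10^n/2, \\ R(x) - 10^n & \text{if } R(x) \ge 10^n/2. \end{cases} \quad (12)$$

The conditions (12) may be more simply expressed as

$$x = \begin{cases} R(x) & \text{if } x_{n-1} < 5, \\ R(x) - 10^n & \text{if } x_{n-1} \ge 5. \end{cases}$$

Another way to express a 10's complement number is

$$x = x'_{n-1} \cdot 10^{n-1} + \sum_{i=0}^{n-2} x_i \cdot 10^i,$$

where $x'_{n-1} = x_{n-1} - 10$ if $x_{n-1} \ge 5$ and $x'_{n-1} = x_{n-1}$ otherwise, while the sign definition rule is the following one: if x is negative, then $x_{n-1} \ge 5$; otherwise $x_{n-1} < 5$.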
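A small numeric sketch may help to fix the representation. It assumes the usual n-digit 10's complement range from −10^n/2 up to, but not including, 10^n/2; the function names are hypothetical and the values are handled as Python integers rather than BCD digit vectors.

```python
def to_tens_complement(x, n):
    """Return R(x) = x mod 10**n for an integer x in the n-digit range."""
    half = 10**n // 2
    if not -half <= x < half:
        raise ValueError("x is outside the n-digit 10's complement range")
    return x % 10**n


def from_tens_complement(r, n):
    """Recover the signed integer from its n-digit representation."""
    return r - 10**n if r >= 10**n // 2 else r


def is_negative(r, n):
    """Sign rule: the most significant digit is 5 or more for negatives."""
    return r // 10**(n - 1) >= 5


print(to_tens_complement(-25, 3))
print(from_tens_complement(975, 3))
```

With n = 3, the integer −25 is stored as 975, whose leading digit 9 marks it as negative.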

Ten's Complement Sign Change.
Given an n-digit 10's complement integer x, the inverse z = −x of x is an (n + 1)-digit 10's complement integer. Actually, the only case in which −x cannot be represented with n digits is x = −10^n/2, so that −x = 10^n/2, that is to say, −x = 0·10^n + 5·10^(n−1). The computation of the representation of −x is based on the following property. Assuming x to be represented as a 10's complement number R(x), sign-extended to n + 1 digits, −x may be readily computed as

$$R(-x) = \left( \left( 10^{n+1} - 1 - R(x) \right) + 1 \right) \bmod 10^{n+1}.$$

A straightforward inversion algorithm then consists in representing x with n + 1 digits, complementing every digit to 9, and then adding 1. Observe that sign extension is obtained by adding a digit 0 to the left of a positive number, or a digit 9 to the left of a negative number.
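The inversion procedure (sign-extend, complement every digit to 9, add 1) can be sketched on digit lists as follows. The function name and the least-significant-first digit order are assumptions made for this example.

```python
def negate_tens_complement(digits):
    """Negate an n-digit 10's complement number.

    digits: list of decimal digits, least significant first.
    Returns the (n+1)-digit representation of the negated value.
    """
    # Sign extension: prepend 9 for a negative number, 0 otherwise.
    sign_digit = 9 if digits[-1] >= 5 else 0
    extended = digits + [sign_digit]
    # Digitwise 9's complement.
    nines = [9 - d for d in extended]
    # Add 1, dropping the carry out of the most significant digit.
    result = []
    carry = 1
    for d in nines:
        total = d + carry
        result.append(total % 10)
        carry = total // 10
    return result


# -500 is the 3-digit value 500; its negation needs 4 digits: 0500
print(negate_tens_complement([0, 0, 5]))
```

The extra digit is what makes the corner case x = −10^n/2 representable after negation.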

Base-10 Adders
5.1. Base-10 Ripple-Carry Adders. For B = 10, the classic and naïve ripple-carry approach [8] for a BCD decimal adder cell can be implemented as in Figure 5. Observe that the critical path involves the carry propagation through 7 binary adders plus a 4-bit Boolean circuit (checking whether the sum s is greater than 9 or not).
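The behavior of such a ripple-carry BCD digit cell can be modeled in a few lines: a binary addition, a greater-than-9 check, and a conditional add-6 correction. This is a behavioral sketch only; it does not reproduce the gate-level wiring of Figure 5, and the function name is hypothetical.

```python
def bcd_digit_add(x, y, c_in):
    """Add two BCD digits (0..9) and a carry-in; return (digit, carry_out)."""
    s = x + y + c_in                 # plain binary sum, at most 19
    c_out = 1 if s > 9 else 0        # Boolean check: is the sum above 9?
    # Adding 6 and keeping 4 bits is the same as subtracting 10.
    z = (s + 6) & 0xF if c_out else s
    return z, c_out


print(bcd_digit_add(7, 8, 1))
```

Because the carry-out is known only after the binary addition and the comparison, every digit adds this whole path to the carry propagation delay.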

Base-10 Carry-Chain Adders.
If B = 10, the carry-chain circuit remains unchanged, but the P and G functions as well as the modulo-10 sums are somewhat more complex. In base 2, the mod B sum cell is a single XOR function, while the mod 10 sum cell is more complex, as suggested by Figure 5.
In base 2, the P and G cells are, respectively, synthesized by XOR and AND functions, while in base 10, P and G are now defined as follows:

$$P(i) = \begin{cases} 1 & \text{if } x(i) + y(i) = 9, \\ 0 & \text{otherwise}; \end{cases} \qquad G(i) = \begin{cases} 1 & \text{if } x(i) + y(i) > 9, \\ 0 & \text{otherwise}. \end{cases} \quad (17)$$

A straightforward way to synthesize P and G is shown in Figure 6. Nevertheless, functions P and G may be directly computed from the x(i) and y(i) inputs. The following formulas (18) are Boolean expressions of conditions (17), where $p_j = x_j \oplus y_j$, $g_j = x_j \cdot y_j$, and $k_j = \overline{x_j} \cdot \overline{y_j}$ are the binary propagator, generator, and carry-kill for the jth components of the BCD digits x(i) and y(i).
The ith cell of the BCD carry-chain adder is shown in Figure 7. It is made of a first mod 16 adder stage, a carry-chain cell driven by the G-P functions, and an output adder stage performing a correction (adding 6) whenever the carry-out is one. Actually, a zero carry-out c(i+1) indicates that the mod 16 sum does not exceed 9 if c(i) = 0, or 8 if c(i) = 1, so no correction is needed. Otherwise, the add-6 correction applies.
The G-P functions may be computed according to Figure 6, using the outputs of the mod 16 stage, including the carry-out $s_4$. With more hardware consumption, but with shorter delays, formulas (18) may be used.

Figure 6: G-P cell for BCD adder.
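The three-stage cell described above (mod 16 adder, G-P driven carry cell, add-6 output stage) can be modeled behaviorally as follows. The Boolean expressions used for G and P are one form consistent with the conditions "sum greater than 9" and "sum equal to 9" applied to the sum bits s4..s0; they are an assumption of this sketch and are not claimed to match the exact wiring of Figure 6. Function names are hypothetical.

```python
def gp_from_sum(s):
    """Derive (G, P) from s = x + y, a 5-bit value s4..s0, for BCD x and y."""
    s0, s1, s2, s3, s4 = ((s >> k) & 1 for k in range(5))
    g = s4 | (s3 & (s2 | s1))        # s > 9
    p = s0 & s3 & (1 - g)            # s == 9
    return g, p


def bcd_carry_chain_cell(x, y, c_in):
    """One BCD digit cell; returns (digit, carry_out)."""
    s = x + y                        # first stage: binary adder
    g, p = gp_from_sum(s)
    c_out = g | (p & c_in)           # carry-chain cell
    # Output stage: add the carry-in, plus 6 when a carry-out occurs.
    z = (s + c_in + (6 if c_out else 0)) & 0xF
    return z, c_out


print(bcd_carry_chain_cell(4, 5, 1))
```

The point of the arrangement is that c_out depends on c_in only through the single term p & c_in, so the carry path between digits is one carry-chain cell long.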

FPGA Implementations of the Base-10 Adders on 4-Input LUTs Xilinx Platforms
The base-10 adders of Figures

FPGA Implementation of the Base-10 Carry-Chain Adder.
In order to make the best use of the resources, the design has been achieved using relative location techniques (RLOC) [12] with low-level component instantiations. This first architecture is called GP a.
The adding stages are implemented as shown in Figures 8(a) and 8(b), while the carry-chain structure with the G-P functions has been implemented as shown in Figure 9, where G is computed according to Figure 6, while P is computed as equivalent to the expression of Figure 6. Figure 9 emphasizes that G depends on $s_1$, $s_2$, $s_3$, and $s_4$, while P is computed from $s_0$, $s_3$, and G.
The time delay corresponding to the 4-bit adder stage (Figure 8(a)) and the output adder stage (Figure 8(b)) is given as Both adder stages of Figures 8(a) and 8(b) need the same hardware requirement; computed in slices, the area consumption is given as The complexity figures of the carry-chain circuit for a 4-digit unit, as shown in Figure 9, are given as where $T_{con1}$ stands for the average connection delay between two neighboring slices of the same CLB.
The overall circuit is represented in Figure 10. The overall time delay is computed from formulas (21), (22), and (24): where $T_{con2}$ stands for the average connection delay between two slices located in neighboring columns. $T_{con2}$ has to be counted twice, to cover both the connection delay between the 4-bit adder and the carry chain and the one between the carry chain and the output adder.

y(i) inputs using the Boolean expression (18). Using 4-input LUTs (4-LUTs), a first implementation (Figure 11) computes This architecture, called GP b, is shown in Figure 11. The corresponding time and area of a carry-chain cell using this architecture is The complete cell includes a 4-bit adder and a conditional 3-bit output adder adding 6 whenever necessary (similar to Figure 5). The overall time delay and area consumption using this carry-computation cell is:

International Journal of Reconfigurable Computing

The results in area and speed are poor compared to the GP a implementation (obtaining G-P from the results of the 4-bit adder).
Another alternative is based on the use of dedicated multiplexers. Xilinx Spartan-3, Virtex-2, and Virtex-4 devices have look-up table multiplexers (muxf5, muxf6, muxf7, muxf8) in order to construct functions of 5, 6, 7, and 8 variables without using the general-purpose routing fabric.
Using this feature, the circuit of Figure 12 (GP c) can be implemented using the following relations: The corresponding time and area of a carry-chain cell GP c is where $T_{6\text{-}LUT}$ stands for the delay from an LUT input to a muxf6 output. The complete cell also includes a 4-bit adder and a conditional 3-bit adder. The overall delay-area for the GP c cell is

In the Ad-I version, the carry-chain structure with the G-P functions is computed according to Figure 6. The Xilinx Virtex-5 6-input/2-output LUT is built as two 5-input functions, while the sixth input controls a 2-to-1 multiplexer, allowing implementation of either two 5-input functions or a single 6-input one; so the G and P functions fit in a single LUT, as shown in Figure 13.
In a second version, Ad-II, the carry chain is sped up thanks to a direct computation of G-P, namely, using inputs x(i) and y(i) instead of the intermediate sum bits $s_k$.
For this purpose one could use formulas (18); nevertheless, in order to minimize time and hardware consumption, the implementation of P(i) and G(i) is revisited as follows.
Remembering that P(i) = 1 whenever the arithmetic sum x(i) + y(i) = 9, one defines a 6-input function pp(i) set to 1 whenever the arithmetic sum of the first 3 bits of x(i) and y(i) (that is, $x_3 x_2 x_1$ and $y_3 y_2 y_1$) is 4. Then P(i) may be computed as

$$P(i) = \left( x_0(i) \oplus y_0(i) \right) \cdot pp(i). \quad (36)$$

On the other hand, gg(i) is defined as a 6-input function set to 1 whenever the arithmetic sum of the first 3 bits of x(i) and y(i) is 5 or more. So, remembering that G(i) = 1 whenever the arithmetic sum x(i) + y(i) > 9, G(i) may be computed as

$$G(i) = gg(i) + pp(i) \cdot x_0(i) \cdot y_0(i). \quad (37)$$

As Xilinx Virtex-5 LUTs may compute 6-variable functions, gg(i) and pp(i) may be synthesized using 2 LUTs in parallel, while G(i) and P(i) are computed through an additional single LUT, as shown in Figure 14.
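The decomposition into pp(i) and gg(i) can be checked exhaustively with a short script. The sketch below assumes that "the first 3 bits" are the three most significant bits of each BCD digit, and it combines pp and gg with the least significant bits in one way that is consistent with the definitions of P and G (sum equal to 9, sum greater than 9); the function name is hypothetical.

```python
def gp_direct(x, y):
    """Compute (G, P) for two BCD digits without forming their sum."""
    hx, hy = x >> 1, y >> 1          # three most significant bits of each digit
    x0, y0 = x & 1, y & 1            # least significant bits
    pp = 1 if hx + hy == 4 else 0    # 6-input function of x3..x1, y3..y1
    gg = 1 if hx + hy >= 5 else 0    # 6-input function of x3..x1, y3..y1
    p = pp & (x0 ^ y0)               # upper part worth 8, plus exactly 1
    g = gg | (pp & x0 & y0)          # upper part worth 10+, or 8 plus 2
    return g, p


# Exhaustive check against the arithmetic definitions.
for x in range(10):
    for y in range(10):
        assert gp_direct(x, y) == (int(x + y > 9), int(x + y == 9))
print("pp/gg decomposition agrees with the definitions for all BCD digit pairs")
```

Since pp and gg each depend on six input bits, they map onto one 6-input LUT apiece, and the final combination needs only four signals.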

10's Complement BCD Carry-Chain Adder-Subtractor.
To compute X + Y, an algorithm similar to that of Section 7.1 is used. In order to compute X − Y, the 10's complement subtraction algorithm actually adds (−Y) to X.

10's Complement (AS-I). The 10's complement sign change algorithm may be implemented through a digitwise 9's complement stage followed by an add-1 operation. It can be shown that the 9's complement binary components $z_3$, $z_2$, $z_1$, and $z_0$ of a given BCD digit $y_3 y_2 y_1 y_0$ are expressed as

$$z_3 = \overline{y_3} \cdot \overline{y_2} \cdot \overline{y_1}, \qquad z_2 = y_2 \oplus y_1, \qquad z_1 = y_1, \qquad z_0 = \overline{y_0}. \quad (38)$$

For a first implementation, AS-I, Figure 15 presents a 9's complement implementation using 6-input/2-output LUTs, available in the Virtex-5 Xilinx technology. $\overline{A}/S$ is the add/subtract control signal; if $\overline{A}/S = 1$ (subtract), the formulas in (38) apply; otherwise $\overline{A}/S = 0$ and $z_j(i) = y_j(i)$ for all i, j.
The AS-I circuit is similar to the Ad-I circuit (Figures 8 and 13), using, instead of input Y, the input Z as produced by the circuit of Figure 15.
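The digitwise 9's complement stage with its add/subtract control can be sketched at bit level. The Boolean expressions in the code are the standard ones for 9 − y on a BCD digit and are checked exhaustively below; the function name and the `subtract` flag (standing for the add/subtract control signal) are assumptions of this example.

```python
def nines_complement_stage(y, subtract):
    """Return 9 - y (bitwise construction) when subtracting, else y unchanged."""
    if not subtract:
        return y
    y0, y1, y2, y3 = ((y >> k) & 1 for k in range(4))
    z3 = (1 - y3) & (1 - y2) & (1 - y1)   # set only for y = 0 or 1
    z2 = y2 ^ y1
    z1 = y1
    z0 = 1 - y0
    return (z3 << 3) | (z2 << 2) | (z1 << 1) | z0


# Exhaustive check over the ten BCD digits.
for y in range(10):
    assert nines_complement_stage(y, True) == 9 - y
    assert nines_complement_stage(y, False) == y
print("9's complement stage verified for all BCD digits")
```

Feeding the output of this stage to an unmodified BCD adder, with the initial carry set to 1, yields the subtraction.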

Carry-Chain Stage Computing G and P Directly from the Input Data (AS-II).
As far as addition is concerned, the P and G functions may be implemented according to formulas (36) and (37). The idea of AS-II is to compute the corresponding functions in subtract mode and then to multiplex according to the add/subtract control signal $\overline{A}/S$. For this reason, assuming that the operation at hand is X + (±Y), one defines on the one hand ppa(i) and gga(i) according to Section 7.1, that is, using the straight values of Y's BCD components. On the other hand, pps(i) and ggs(i) are defined according to the same Section 7.1 but using $z_k(i)$ as computed by the 9's complement circuit shown in Figure 15. As the $z_k(i)$ are expressed from the $y_k(i)$ by (38), both pps(i) and ggs(i) may be computed directly from $x_k(i)$ and $y_k(i)$, as shown in Figure 17. Nevertheless, for subtraction, the computation of $z_0(i) = \overline{y_0(i)}$ is carried out at the output LUT level. So formulas (36) and (37) are then expressed as

Figure 13: FPGA carry-chain circuit for Ad-I.
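A behavioral sketch of an adder-subtractor whose G and P are obtained directly from the input bits in both modes is given below. It is only an illustration under stated assumptions: the function names are hypothetical, the upper bits of the 9's complement are derived from y3, y2, y1 with the standard Boolean expressions for 9 − y, the result is taken modulo 10^n (sign extension and overflow are not handled), and the LUT mapping of Figure 17 is not reproduced.

```python
def gp_add_sub(x, y, subtract):
    """(G, P) for x + y, or for x + (9 - y) when subtract is set."""
    x0, y0 = x & 1, y & 1
    y1, y2, y3 = (y >> 1) & 1, (y >> 2) & 1, (y >> 3) & 1
    if subtract:
        # Upper three bits of the 9's complement, taken straight from y3, y2, y1.
        hz = (((1 - y3) & (1 - y2) & (1 - y1)) << 2) | ((y2 ^ y1) << 1) | y1
        w0 = 1 - y0                  # inverted least significant bit
    else:
        hz = y >> 1
        w0 = y0
    hx = x >> 1
    pp = 1 if hx + hz == 4 else 0
    gg = 1 if hx + hz >= 5 else 0
    return gg | (pp & x0 & w0), pp & (x0 ^ w0)


def bcd_add_sub(x_digits, y_digits, subtract):
    """n-digit X + Y or X - Y, modulo 10**n; digits least significant first."""
    carry = 1 if subtract else 0     # the add-1 of the sign change
    out = []
    for x, y in zip(x_digits, y_digits):
        g, p = gp_add_sub(x, y, subtract)
        w = 9 - y if subtract else y
        out.append((x + w + carry) % 10)
        carry = g | (p & carry)      # carry-chain cell
    return out


# 42 - 17 = 25, digits least significant first
print(bcd_add_sub([2, 4], [7, 1], True))
```

In this model the only mode-dependent work in the carry path is the selection between the add and subtract versions of pp and gg, which mirrors the multiplexing idea described above.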

Experimental Results
8.1. Xilinx Virtex-4 Adder Implementations. The base-10 adders have been implemented on the Xilinx Virtex-4 FPGA family, speed grade -11 [6]. The synthesis and implementation have been carried out with XST (Xilinx Synthesis Technology) [13] and Xilinx ISE (Integrated System environment) version 10.1 [14]. Performances of different N-digit BCD adders have been compared to those of an M-bit binary carry-chain adder (implemented by XST [13] using Xilinx fast carry logic) covering the same range, that is, such that

$$M = N \cdot \log_2(10) \cong 3.322 \cdot N.$$

The time and hardware complexities of an M-bit ripple-carry adder implemented on the same 4-LUT based Xilinx FPGA are given by

$$C_{M\text{-bit adder}} = M \text{ LUTs} = 3.322 \times N \text{ LUTs}. \quad (43)$$

Formulas (26), (30), (34), and (42) show that, asymptotically, $T_{N\text{-digit adder}}$ should be somewhat smaller than $T_{M\text{-bit adder}}$. Nevertheless, as shown by the experimental results, the additive values appearing in (26), (30), and (34) are not negligible for reasonable values of N, so the saving in time will mainly appear for applications where BCD-to-binary coding and decoding operations play a significant role in the overall delay.
Post place-and-route time delays and area consumptions are quoted in Tables 1 and 2, respectively, where N stands for the number of BCD digits while M stands for the number of bits required to cover the decimal N-digit range. The results presented in the table are as follows:

Figure 17: FPGA implementation of the carry-chain stage AS-II for BCD adder-subtractor.

Figure 18 shows the delays for the compared adders. Observe that, for the technology at hand, Table 1 and Figure 18 suggest that for N > 48 the carry-chain decimal implementation of adders is faster than the binary one for the equivalent range. Furthermore, for small numbers of digits to add (N < 40), the GP c architecture is faster than the other decimal implementations.
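To make the "same range" comparison concrete, the binary width matching an N-digit decimal operand can be computed as below. Rounding up to a whole number of bits is an assumption of this sketch (the text only gives the real-valued ratio of about 3.322 bits per digit), and the function name is hypothetical.

```python
def equivalent_bits(n_digits):
    """Smallest M such that 2**M >= 10**n_digits (exact integer arithmetic)."""
    return (10**n_digits - 1).bit_length()


for n in (8, 16, 32, 48):
    m = equivalent_bits(n)
    # The binary carry chain has m stages; the decimal one has only n.
    print(f"N = {n:2d} digits -> M = {m:3d} bits, ratio M/N = {m / n:.3f}")
```

The ratio stays close to 3.322, which is the factor by which the decimal carry chain is shorter than the binary one for operands covering the same range.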

Virtex-5 Adder-Subtractor Implementations.
The adder-subtractor circuits have been implemented on the Xilinx Virtex-5 family with speed grade -2 [7]. The synthesis and implementation have been carried out with XST (Xilinx Synthesis Technology) [13] and Xilinx ISE (Integrated System environment) version 10.1 [14]. The critical parts were designed using low-level component instantiation (lut6_2, muxcy, xorcy, etc.) in order to obtain the desired behavior. Performances of different N-digit BCD adders have been compared to those of an M-bit binary carry-chain adder (implemented by XST) covering the same range, that is, such that $M = N \cdot \log_2(10) \cong 3.322 \cdot N$.

Comments. Observe that, for large operands, the decimal operations are faster than the binary ones.
The overall area with respect to binary computation is not negligible. In Virtex-4, the area increases, with respect to an equal-range binary adder, by a factor between 2.4 and 5.4. In the 6-input LUT family Virtex-5, an adder-subtractor is between 3.0 and 3.9 times bigger.

Conclusions
The present interest in BCD arithmetic systems stimulates further research at both the algorithmic and design levels.
Considering that hardware costs are increasingly affordable, full hardware BCD units are now very attractive, with a growing potential in the near future. This paper has developed some implementations of BCD adders and subtractors on FPGA platforms. Experimental results emphasize time performance with reasonable costs in terms of area. Compared with the binary system, the decimal implementations become faster as operand sizes grow (break-even around 50 digits).
One of the key points about delays comes from the fact that the carry-propagation computation remains binary; a faster carry-chain circuit can then be designed because, for the same operand range, the number of digits (and therefore of carries to propagate) is lower in decimal than in binary. In the carry-chain structures studied in this paper, the propagate P and generate G functions are more complex, and therefore more time and area consuming, than in the binary ones; therefore, the speed improvements only appear for large enough operands. The break-even point is obviously technology dependent, so it could be expected to occur for a smaller number of digits in the near future.
The area overhead with respect to binary computation is not negligible; it is around five times in Virtex-4 and nearly four times in Virtex-5. That is mainly due to the more complex definition of the carry propagate and carry generate functions and to the final mod 10 reduction. The decreasing costs of technology make hardware consumption less central.
For BCD addition, the performance considerations on the Xilinx Virtex-5 platform are similar to those of the 4-input LUT-based Virtex-4 technology. That is, the addition of BCD digits remains faster than its binary counterpart under the same conditions.

$C_{N\text{-digit adder-c}} = 16 \cdot N$ LUTs.

7.1. Base-10 BCD Carry-Chain Adder. In a first version, Ad-I, the adding stage and correction stage are implemented as shown in Figures 8(a) and 8(b), respectively.

Figure 10: FPGA implementation of an N-digit BCD Adder.

Figure 16: FPGA implementation of the adder stage for a 10's complement BCD adder-subtractor.

Figure 18: Delay in ns for different adders in Virtex-4.
The adder structure of Figure 2 is based on Algorithm 2. The G-P (Generate-Propagate) cell calculates the Generate and Propagate functions (2).

Table 3 exhibits the post-placement and routing delays in ns for the decimal adder implementations Ad-I and Ad-II of Section 7.1; Table 4 exhibits the delays in ns for the adder-subtractor implementations AS-I and AS-II of Section 7.2. Table 5 lists the consumed areas expressed in terms of 6-input look-up tables (6-input LUTs). The estimated area presented in Table 5 was empirically confirmed.

Table 3: Delays in ns for decimal and binary adders in Virtex-5 -2.

Table 5: Area in 6-input LUTs for different adders and adders-subtractors.