FPGA-Based Synthesis of High-Speed Hybrid Carry Select Adders

The carry select adder is a square-root time high-speed adder. In this paper, FPGA-based synthesis of conventional and hybrid carry select adders is described with a focus on high speed. Conventionally, carry select adders are realized using the following: (i) full adders and 2:1 multiplexers, (ii) full adders, binary to excess-1 code converters, and 2:1 multiplexers, and (iii) sharing of common Boolean logic. On the other hand, hybrid carry select adders involve a combination of carry select and carry lookahead adders with/without the use of binary to excess-1 code converters. In this work, two new hybrid carry select adders are proposed involving the carry select and section-carry based carry lookahead subadders with/without binary to excess-1 converters. Seven different carry select adders were implemented in Verilog HDL and their performances were analyzed under two scenarios, dual-operand addition and multioperand addition, where individual operands are of sizes 32 and 64 bits. In the case of dual-operand additions, the hybrid carry select adder comprising the proposed carry select and section-carry based carry lookahead configurations is the fastest. With respect to multioperand additions, the hybrid carry select adders containing the carry select and conventional carry lookahead or section-carry based carry lookahead structures produce similar optimized performance.


Introduction
The carry select adder (CSLA) belongs to the family of high-speed square-root time adders [1,2] and provides a good compromise between the low area occupancy of ripple carry adders (RCAs) and the high-speed performance of carry lookahead adders (CLAs) [2,3]. In the existing literature, many flavors of carry select addition have been realized on both ASIC and FPGA platforms, with ASIC implementations being predominant. CSLAs usually involve duplication of RCA structures with presumed carry inputs of binary 0 and binary 1 to enable parallel addition and thereby speed up the addition process [3,4]. To reduce the area overhead of CSLAs caused by replication of the RCA structures, an add-one circuit (also called a binary to excess-1 converter, BEC) is introduced [5][6][7]. Carry select addition can also be performed by utilizing the common Boolean logic (CBL) [8] shared between the sum and carry outputs of a full adder [9].
Nevertheless, due to the serial cascading of full adder modules, the delay metric would not decrease although the area parameter would reduce. Further, optimizations at the device and gate levels [10][11][12][13][14][15] and in realization styles [16,17] have been carried out to reduce area, improve speed, and minimize the power-delay product of CSLAs on the basis of semicustom and full-custom ASIC-style synthesis. Rather than realizing pure CSLAs, hybrid architectures incorporating carry select and carry lookahead structures have also been proposed [18][19][20][21] to improve the design efficiency of CSLAs. Moreover, some FPGA implementations of CSLAs have been attempted [21][22][23]. Overall, a survey of published literature reveals that CSLAs have been widely implemented using the following topologies and computational elements:
(i) (Conventional) CSLA: full adders and 2:1 multiplexers (MUXes)
(ii) CSLA with BEC: full adders, BECs, and 2:1 MUXes
(iii) CSLA based on CBL sharing
(iv) Hybrid CSLA and CLA structures
(v) Hybrid CSLA and CLA including BECs.
In general, CSLAs are composed using a carry select architecture with/without BECs or may consist of a mix of carry select and carry lookahead configurations with/without BECs. CSLAs constructed using pure carry select structures are called "homogeneous CSLAs" and CSLAs realized using a combination of carry select and carry lookahead structures are labeled as "heterogeneous/hybrid CSLAs." The interest in hybrid CSLAs is supported by the fact that heterogeneous adders tend to optimize the design metrics better than homogeneous adders [24]. In a recent work [25], section-carry based CLAs (SCBCLAs) were proposed as an alternative to conventional CLAs; for a 32-bit addition operation, the SCBCLA was found to exhibit a propagation delay 15.2% lower than that of the conventional CLA. Motivated by this result, two new hybrid CSLA architectures are proposed in this work: a hybrid CSLA incorporating CSLA and SCBCLA, and another hybrid CSLA embedding CSLA, SCBCLA, and BECs. This paper builds upon our prior work [21] by analyzing the performance of different CSLA architectures with respect to diverse input partitions for different addition widths for the case of dual-operand addition, and further evaluates the efficacy of the conventional and proposed CSLAs with respect to multioperand additions.
The remaining part of this paper is organized as follows. With 8-bit addition as a running example, Section 2 describes the conventional CSLA topologies with and without BEC logic and also the CSLA based on sharing of CBL. Section 3 presents the architectures of hybrid CSLAs incorporating CLAs and SCBCLAs with/without BEC logic. In Section 4, the performance of different CSLA topologies is evaluated for dual-operand and multioperand additions with operand sizes of 32 and 64-bits. Finally, the conclusions follow in Section 5.

Homogeneous CSLA Architectures
The RCA and homogeneous CSLA architectures are shown in Figure 1 for an example case of 8-bit addition. Figure 1(a) depicts an 8-bit RCA, which is formed by a cascade of full adder modules; the full adder [9] is an arithmetic building block that adds an augend bit and an addend bit (say, a and b) along with any carry input (cin) and produces two outputs, namely, sum (Sum) and carry overflow (Cout). Since the carry ripples from one full adder stage to the next, the propagation delay of the RCA increases linearly with the adder width. The CSLA basically partitions the input data into groups and addition within the groups is carried out in parallel; that is, the CSLA is composed of partitioned and duplicated RCAs. It can be seen from Figure 1 that the least significant 4-bit adder stages of the RCA and the CSLAs are identical. However, in the RCA the carry produced by the least significant nibble simply propagates bit-by-bit through the more significant nibble, whereas in the CSLAs it serves as the selection input for the MUXes present in the more significant position. Figure 1(b) shows the 8-bit conventional CSLA comprising full adders and 2:1 MUXes, henceforth referred to simply as "CSLA." In the CSLA shown in Figure 1(b), the full adders in the most significant nibble position are duplicated with presumed carry inputs (cin) of 0 and 1; that is, one 4-bit RCA with a carry input of 0 and another 4-bit RCA with a carry input of 1 are used. Notice that both these RCAs have the same augend and addend inputs. While the least significant 4-bit RCA adds the augend inputs (a3 to a0) to the addend inputs (b3 to b0), the more significant 4-bit RCAs simultaneously add the augend inputs (a7 to a4) to the addend inputs (b7 to b4), with presumed carry inputs (cin) of 0 and 1.
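The organization of the 8-bit CSLA of Figure 1(b) can be illustrated with a small behavioral model. The following Python sketch is not the Verilog description used in this work; the function names (`full_adder`, `ripple_carry_add`, `csla_8bit`) and the bit-list representation are assumptions made purely for illustration.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, cout)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a_bits, b_bits, cin):
    """Add two equal-length, LSB-first bit lists; returns (sum_bits, cout)."""
    sums, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry


def to_bits(x, width):
    """LSB-first list of the low `width` bits of x."""
    return [(x >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Integer value of an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def csla_8bit(a, b, cin=0):
    """8-bit conventional CSLA with a 4-4 input partition."""
    a_bits, b_bits = to_bits(a, 8), to_bits(b, 8)
    # Least significant nibble: a single 4-bit RCA.
    low_sum, c4 = ripple_carry_add(a_bits[:4], b_bits[:4], cin)
    # Most significant nibble: duplicated RCAs with presumed carries 0 and 1.
    sum0, cout0 = ripple_carry_add(a_bits[4:], b_bits[4:], 0)
    sum1, cout1 = ripple_carry_add(a_bits[4:], b_bits[4:], 1)
    # c4 plays the role of the common select input of the 2:1 MUXes.
    high_sum, cout = (sum1, cout1) if c4 else (sum0, cout0)
    return from_bits(low_sum + high_sum), cout
```

In hardware the three RCAs operate concurrently; the sequential calls above only model the function, not the timing.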
Since two additions are performed, two sets of sum and carry outputs are produced, one based on 0 as the carry input and another based on 1 as the carry input, which are in turn fed as inputs to the 2:1 MUXes. The number of MUXes used depends on the size of the duplicated RCA. To determine the true sum outputs and the real value of the carry overflow pertaining to the most significant nibble position, the carry output (c4) from the least significant 4-bit RCA is used as the common select input for all the MUXes; thereby the correct result, corresponding to either the RCA with 0 as the carry input or the RCA with 1 as the carry input, is selected as the output. Figure 1(c) portrays the 8-bit CSLA containing full adders, 2:1 MUXes, and BEC logic, henceforth identified as "CSLA BEC". Figure 1(c) also shows the internals of the 5-bit BEC, which is depicted by the circuit shown within the oval. The CSLA BEC differs from the CSLA in that, instead of an RCA with a presumed carry input of 1 in the more significant nibble position, the BEC circuit is introduced. The BEC logic adds binary 1 to the least significant bit of its binary inputs and produces the resultant sum and carry as output. As seen in Figure 1(c), the BEC accepts as inputs the sum and carry outputs of the RCA having a presumed carry input of 0, adds binary 1 to this input, and produces the resulting sum and carry overflow as output. The correct result is then chosen between the output of the RCA with a presumed carry input of 0 and the output of the BEC logic. The carry output c4 of the least significant RCA is used to determine the correct set of sum and carry outputs for the most significant nibble position. The logic equations governing the 5-bit BEC are given below. In the equations, ∼ signifies logical inversion, ⊕ implies logical XOR, and • represents logical conjunction.
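As an illustration of the add-one behavior just described, a 5-bit BEC can be modeled using only inversion, XOR, and AND. The sketch below uses a common textbook formulation; the signal ordering, the names `bec5` and `csla_bec_8bit`, and the use of integer addition to stand in for the RCAs are assumptions and not necessarily the exact form used in Figure 1(c).

```python
def bec5(bits):
    """5-bit binary to excess-1 converter on an LSB-first bit list.

    Each output bit toggles when all lower-order input bits are 1,
    which is exactly the effect of adding binary 1.
    """
    b0, b1, b2, b3, b4 = bits
    return [
        b0 ^ 1,                      # ~b0
        b1 ^ b0,
        b2 ^ (b0 & b1),
        b3 ^ (b0 & b1 & b2),
        b4 ^ (b0 & b1 & b2 & b3),
    ]


def csla_bec_8bit(a, b, cin=0):
    """8-bit CSLA BEC with a 4-4 partition (RCAs modeled by integer addition)."""
    low = (a & 0xF) + (b & 0xF) + cin
    c4 = low >> 4
    # Upper-nibble RCA with presumed carry 0: four sum bits plus carry out.
    rca0 = (a >> 4) + (b >> 4)
    rca0_bits = [(rca0 >> i) & 1 for i in range(5)]
    # c4 selects between the RCA output and the BEC output.
    chosen = bec5(rca0_bits) if c4 else rca0_bits
    high = sum(bit << i for i, bit in enumerate(chosen[:4]))
    return (low & 0xF) | (high << 4), chosen[4]
```

The BEC input never exceeds 30 (15 + 15), so the incremented value always fits in five bits.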
The CSLA constructed on the basis of sharing of CBL is depicted in Figure 2 and will be referred to as "CSLA CBL" henceforth. The CSLA CBL adder is founded upon the full adder logic, whose underlying equations are given below with a, b, and cin being the primary inputs, and Sum and Cout being the primary outputs. In (3), "+" implies logical disjunction:
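For reference, one standard form of the full adder equations, consistent with the reductions for cin = 0 and cin = 1 discussed next, may be written as follows. The symbols a and b for the augend and addend bits, and the particular factoring of the carry expression, are assumptions; other equivalent factorings exist.

```latex
\begin{align}
\mathrm{Sum}  &= a \oplus b \oplus \mathit{cin} \tag{2} \\
\mathrm{Cout} &= (a \cdot b) + \mathit{cin} \cdot (a \oplus b) \tag{3}
\end{align}
```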

Advances in Electronics
Referring to (2) and (3), it may be understood that, for a carry input (cin) of 0, (2) and (3) reduce to Sum = (a ⊕ b) and Cout = (a • b). With cin = 1, (2) and (3) become Sum = ∼(a ⊕ b) and Cout = (a + b). Based on this principle, sum and carry outputs for both possible values of the input carry are generated simultaneously and fed as inputs to two 2:1 MUXes. The correct sum and carry outputs are determined by the carry input, which serves as the select input for the two MUXes. Though the costly duplicated RCA and RCA with BEC logic structures are eliminated through this approach, leading to savings in area, the carry still propagates from stage to stage, and so the critical data path delay tends to be proportional to the length of the full adder cascade. As a consequence, the delay of the CSLA CBL adder may be close to that of the RCA, which is confirmed by the simulation results given in Section 4.
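The CBL sharing principle can be sketched behaviorally as below. This is an illustrative Python model, with the hypothetical names `cbl_bit` and `csla_cbl_add`; it shows that each bit position precomputes both candidate results and that the carry still ripples through every stage.

```python
def cbl_bit(a, b, cin):
    """One CSLA CBL bit slice: returns (sum, cout)."""
    # Candidate outputs for cin = 0.
    sum0, cout0 = a ^ b, a & b
    # Candidate outputs for cin = 1.
    sum1, cout1 = (a ^ b) ^ 1, a | b
    # Two 2:1 MUXes, both selected by the incoming carry.
    return (sum1, cout1) if cin else (sum0, cout0)


def csla_cbl_add(a, b, width, cin=0):
    """Cascade of `width` CBL slices; returns (sum, cout)."""
    carry, result = cin, 0
    for i in range(width):
        s, carry = cbl_bit((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry
```

The loop makes the stage-to-stage carry dependence explicit, which is why the delay is expected to scale with the adder width much like an RCA.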

Heterogeneous/Hybrid CSLA Architectures
Apart from synthesizing the basic CSLA topologies viz. CSLA, CSLA BEC, and CSLA CBL, hybrid CSLA architectures involving CSLA and CLA/SCBCLA were also implemented with the intention of minimizing the maximum propagation path delay. It is well known that a CLA is faster than an RCA, and hence it may be worthwhile to have a CLA as a replacement for the least significant RCA in the CSLA structure. Although the concept of carry lookahead is widely understood, the concept of section-carry based carry lookahead may not be as well known; hence, to explain the distinction between the two, sample 4-bit lookahead logic realized using these two approaches is portrayed in Figure 3. For details on different section-carry based carry lookahead structures and SCBCLA constructions using them, the interested reader is directed to references [25][26][27], which constitute prior works in the realm of synchronous and asynchronous designs. The section-carry based carry lookahead generator shown enclosed within the circle in Figure 3 produces a single lookahead carry signal corresponding to a "section" or "group" of the adder inputs (hence the term "section-carry"), while the conventional carry lookahead generator encapsulated within the rectangle produces multiple lookahead carry signals, one for each pair of augend and addend primary inputs. The section-carry based carry lookahead generator thus differs from the traditional carry lookahead generator in that bit-wise lookahead carry signals need not be computed for the former. The XOR and AND gates used for producing the necessary propagate and generate signals (P3 to P0 and G3 to G0) are highlighted using dotted lines in Figure 3; these constitute the propagate-generate logic referred to in Figures 4 and 5.
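The functional distinction between the two generators can be illustrated with a short Python sketch for a 4-bit section. The function names and the flattened sum-of-products expressions are assumptions for illustration; the gate-level factoring of the actual section-carry generators in [25][26][27] may differ.

```python
def propagate_generate(a, b, width=4):
    """Propagate (XOR) and generate (AND) signals, LSB first."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    return p, g


def conventional_lookahead(p, g, c0):
    """Conventional generator: one lookahead carry per bit position (c1..c4)."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def section_carry(p, g, c0):
    """Section-carry generator: only the carry out of the 4-bit section."""
    return (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
```

Both generators agree on the section's carry output; the section-carry version simply omits the intermediate carries c1 to c3.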
The 8-bit hybrid CSLAs with/without BEC logic that comprise a CLA in the least significant stage, viz. the "CSLA-CLA" and "CSLA BEC-CLA" adder types, are shown in Figure 4. On the other hand, the 8-bit hybrid CSLAs with/without BEC logic that incorporate an SCBCLA in the least significant stage, viz. the "CSLA-SCBCLA" and "CSLA BEC-SCBCLA" adder varieties, are portrayed in Figure 5. Both the conventional CLA and the SCBCLA comprise three functional blocks: propagate-generate logic, lookahead carry generator, and the sum producing logic. Not only is the carry lookahead generator different for CLA and SCBCLA adders, but the sum producing logic is also different; in the case of the CLA, the sum producing logic comprises only XOR gates, whereas in the SCBCLA, the sum producing logic consists of full adders and an XOR gate, with the XOR gate providing the sum for the most significant bit position of the section (bit 3). While rippling of carries occurs internally within the carry-propagate adder constituting the SCBCLA and producing the requisite sums, the lookahead carry signal corresponding to an adder section is generated independently (in parallel) and serves as the lookahead carry input for the successive CSLA stage.
The various adders, including the hybrid CSLAs viz. CSLA-CLA, CSLA BEC-CLA, CSLA-SCBCLA, and CSLA BEC-SCBCLA, were described topologically in Verilog HDL, similar to previous works [16,[21][22][23]25], to perform two kinds of addition operations viz. dual-operand addition and multioperand addition. For dual-operand addition, two binary operands having sizes of 32 bits and 64 bits, respectively, were considered. For multioperand addition, the addition of four binary operands, each of size 32 bits, and the addition of four binary operands, each of size 64 bits, were considered. Moreover, two types of multioperand addition were performed, based on (i) the carry save adder (CSA) topology and (ii) the bit-partitioned addition scheme.
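Returning to the hybrid organization of Figure 5, the 8-bit CSLA-SCBCLA structure can be sketched behaviorally as follows. This Python model is an illustration under stated assumptions: the names `scbcla_4bit` and `csla_scbcla_8bit` are hypothetical, the section carry is written as a flat sum-of-products, and the duplicated upper-nibble RCAs are modeled by integer addition.

```python
def scbcla_4bit(a, b, c0):
    """4-bit SCBCLA section: returns (sum, section carry)."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
    # Section carry: computed from P, G and c0 only, independent of the ripple.
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    # Sum logic: full adders ripple through bits 0..2 ...
    carry, total = c0, 0
    for i in range(3):
        total |= (p[i] ^ carry) << i
        carry = g[i] | (p[i] & carry)
    # ... and a lone XOR gate produces the sum for bit 3.
    total |= (p[3] ^ carry) << 3
    return total, c4


def csla_scbcla_8bit(a, b, cin=0):
    """8-bit hybrid adder: SCBCLA low nibble, carry select high nibble."""
    low, c4 = scbcla_4bit(a & 0xF, b & 0xF, cin)
    cand0 = (a >> 4) + (b >> 4)        # RCA with presumed carry 0
    cand1 = cand0 + 1                  # RCA with presumed carry 1
    chosen = cand1 if c4 else cand0    # 2:1 MUXes selected by c4
    return low | ((chosen & 0xF) << 4), chosen >> 4
```

The BEC variant would replace `cand1` with the output of an add-one circuit applied to `cand0`.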
All the adders were synthesized using a 90 nm FPGA (XC3S1600E) [28], with speed optimization specified as the design goal in the Xilinx 9.1i ISE design suite. The critical path delay and area values (in terms of number of basic logic elements viz. BELs) were ascertained after automatic place-and-route. The results of dual-operand additions shall be presented first, followed by the results obtained for multioperand additions.

Dual-Operand Addition.
CSLAs can be implemented on the basis of uniform or nonuniform primary input partitions; accordingly, they are labeled as "uniform" or "nonuniform" CSLAs in a structural sense. "Input partitioning" basically means splitting up the primary inputs into groups so that addition can be done in parallel within the partitions; it should be noted that input partitioning is inherent to all CSLAs except the CSLA CBL type (shown in Figure 2), which has a regular carry select structure and hence is devoid of input partitions. Referring to Figure 1(b), it can be seen that 8 pairs of inputs have been split into two uniform or equal-sized groups of 4 input pairs; thus it can be said that the 8-bit CSLA is realized according to a 4-4 input partition. [29] were considered for realizing the 64-bit CSLAs. Figure 7 depicts the propagation delay variations subject to different primary input partitions for the six CSLA architectures. The trend line highlighted in Figure 6 shows that the uniform 8-8-8-8 input partition consistently yields the least propagation delay (varying from 17 ns to 20 ns) across the various 32-bit homogeneous and heterogeneous CSLAs. Similarly, the trend line indicated in black in Figure 7 conveys that the uniform 16-16-16-16 input partition results in the least data path delay (varying from 27 ns to 29 ns) for the different homogeneous and heterogeneous 64-bit CSLAs.
The maximum combinational path delay (also called the "critical path delay") encountered and the total number of BELs consumed by the different homogeneous and heterogeneous CSLAs to perform the addition of two 32-bit operands and two 64-bit operands are shown in Tables 1 and 2, respectively. The optimum delay and area values are in found to have the longest path delay of 37.604 ns.
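The notion of an input partition can be made concrete with a behavioral model that accepts an arbitrary list of group widths. In this Python sketch, the function name `csla_partitioned` and the convention that the partition is written from the most significant group to the least significant group are assumptions; per-group RCAs are modeled by integer addition.

```python
def csla_partitioned(a, b, partition, cin=0):
    """Carry select addition for a given input partition, e.g. [8, 8, 8, 8].

    `partition` lists group widths from most to least significant
    (an assumed convention). Returns (sum, cout).
    """
    result, shift, carry = 0, 0, cin
    for index, width in enumerate(reversed(partition)):
        mask = (1 << width) - 1
        group_a = (a >> shift) & mask
        group_b = (b >> shift) & mask
        if index == 0:
            # Least significant group: a single adder using the real carry in.
            total = group_a + group_b + carry
        else:
            # Other groups: both candidates are precomputed, then one is
            # selected by the carry arriving from the previous group.
            cand0 = group_a + group_b
            cand1 = group_a + group_b + 1
            total = cand1 if carry else cand0
        result |= (total & mask) << shift
        carry = total >> width
        shift += width
    return result, carry
```

For example, `csla_partitioned(x, y, [8, 8, 8, 8])` models a uniform 32-bit CSLA, while `[16, 16, 16, 16]` models the uniform 64-bit case. The model captures function only; the delay differences between partitions arise from the hardware structure.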
Compared to the maximum delay of the hybrid CSLA-SCBCLA, the hybrid CSLA BEC-SCBCLA adder, which is the other proposed hybrid CSLA topology, has a comparable speed performance of 18.052 ns. With respect to area, however, the RCA and CSLA CBL structures require fewer BELs than all the CSLAs. Hence it is inferred from Figure 6 and Table 1 that, for the addition of two input operands of size 32 bits, the hybrid CSLA-SCBCLA adder is preferable over all other homogeneous and heterogeneous CSLAs and the favorable input data partition is 8-8-8-8.
Based on a similar observation, by referring to Figure 7 and Table 2, it can be seen that the 16-16-16-16 input partition is optimum from a delay (i.e., speed) perspective for 64-bit dual-operand addition. The proposed CSLA BEC-SCBCLA constructed using the 16-16-16-16 input data partition has the least latency among all the adder topologies; however, the other proposed CSLA viz. CSLA-SCBCLA based on the same input partition features almost the same delay. In terms of area occupancy, though, the 64-bit RCA is the optimum. Nevertheless, the data path delay of the RCA is about 1.6× that of the proposed CSLA BEC-SCBCLA based on a 16-16-16-16 input partition.

Multioperand Addition.
The performance of different homogeneous and heterogeneous CSLAs is evaluated based on case studies of multioperand addition involving 4 binary operands, with respective sizes of 32 bits and 64 bits. Two multioperand addition schemes are considered, one involving the carry save adder (CSA) topology and another involving the bit-partitioning method.

CSA Based Multioperand Addition.
The structure of an example CSA used to add four n-bit binary numbers is shown in Figure 8. Here, the bits of the four operands, each indexed from (n − 1) down to 0, represent the primary inputs and the sum bits Sum(n+1) to Sum(0) represent the primary outputs. The subscript 0 denotes the LSB and the subscript (n − 1) denotes the MSB. As shown in Figure 8, there are three adders in three levels to perform the addition of four input operands. In each CSA, the carry output signal of the current bit at a level is not transferred to the next bit adder of the same level as the carry input; instead, the carry output is transferred to the next bit adder in the lower level as the carry input. In the top-level adder, three of the four numbers are added simultaneously; that is, the bits corresponding to any one number could act as input carries for the full adders of the first level CSA. In the next lower level, the remaining fourth number is added. The adder in the bottom level, shown within the ellipse in Figure 8, is portrayed here as a simple RCA, but any dual-operand adder can be used to compute the final sum. Experimentation was performed by having different dual-operand adders viz. the RCA and various homogeneous and heterogeneous CSLAs in the final adder stage of the CSA shown in Figure 8, to analyze their relative performance for two different addition scenarios: (i) addition of four binary operands, each of size 32 bits, and (ii) addition of four binary operands, each of size 64 bits. The FPGA-based synthesis results viz. delay and area obtained for the addition of four binary operands, each of size 32 bits, are given in Table 3 with the optimized values in bold font. Since the 8-8-8-8 primary input partition was found to yield the least data path delay, as evident from Figure 6 and Table 1, it was preferred for the various CSLA realizations.
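The three-level CSA organization described above can be sketched in Python as follows. The function names and the use of whole integers as bit vectors are assumptions for illustration; the final addition stands in for whichever dual-operand adder (RCA or CSLA variant) occupies the bottom level.

```python
def carry_save_level(x, y, z):
    """One CSA level: a row of full adders with no horizontal carry chain.

    Returns (sum_vector, carry_vector) such that
    sum_vector + carry_vector == x + y + z.
    """
    sum_vector = x ^ y ^ z
    # Each carry is passed to the next higher bit position of the level below.
    carry_vector = ((x & y) | (x & z) | (y & z)) << 1
    return sum_vector, carry_vector


def csa_add4(a, b, c, d):
    """Add four operands: two carry save levels, then a final adder."""
    s1, c1 = carry_save_level(a, b, c)    # top level: three operands
    s2, c2 = carry_save_level(s1, c1, d)  # next level: the fourth operand
    return s2 + c2                        # bottom level: dual-operand adder
```

Only the bottom level involves carry propagation across the word, which is why the choice of the final dual-operand adder dominates the overall delay.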
It can be seen from Table 3 that the CSLA BEC-CLA adder yields the least delay, with the proposed CSLA BEC-SCBCLA adder closely following it with just a 1.7% delay difference. The conventional CLA, when used in the final adder stage of the CSA as a "homogeneous adder," reports a critical path delay of 34.306 ns. In contrast, when the conventional CLA is used along with the CSLA inclusive of the BEC as a "heterogeneous adder" (CSLA BEC-CLA), it enables a considerable decrease in maximum data path delay of 37.8%, vindicating the observation made in [24] that heterogeneous adders are preferable over homogeneous adders for delay optimization. Although the use of RCA and CSLA CBL adders in the final adder stage of the CSA helps to minimize the area occupancy compared to their counterparts, they suffer from a substantial increase in delay of about 87% over the CSLA BEC-CLA type.
The synthesis results obtained for the addition of four binary operands, each of size 64 bits, are shown in Table 4 and the optimized values are in bold font. Since the 16-16-16-16 uniform input partition was found to be delay optimal (refer to Figure 7 and the number of input operands by an approximate linear order.
To reduce the logic depth of the adder tree, a bit-partitioning strategy was presented in [30] in the context of self-timed multioperand addition, which involves splitting up the entire group of data operands into a desired number of subgroups; the intermediate addition results of the subgroups are finally added to produce the final sum. The bit-partitioning approach basically parallelizes the multioperand addition and is illustrated through Figure 9 for an example scenario in which the addition of a number of equal-sized binary operands is considered. A "dot" represents a bit position in Figure 9. The entire set of inputs is divided into two equal-sized groups (as an example), referred to as fields: one field comprises the inputs from the lower half of the positions and the other field comprises the inputs from the upper half, the number of positions being split assumed to be even. Addition within the individual fields is performed simultaneously, and the sum bits generated as intermediate outputs from these fields are then added together using a final dual-operand adder to produce the required sum. The bit-partitioning scheme might help to speed up the addition, especially when several operands have to be added, by performing parallel column-wise addition of row-wise partitions. For example, considering the addition of 32 data operands, each of size 32 bits, the CSA topology would encounter thirty full adder delays plus the delay associated with the final dual-operand adder.
On the other hand, based on the bit-partitioning technique, considering eight partitions with each partition comprising four data operands, the bit-partitioned multioperand adder based upon the CSA topology could encounter a reduced propagation delay of about four full adder delays plus the delay of a dual-operand adder, depending upon the implementation. Also, a high regularity would be implicit within the overall architecture as the gate-level hardware is being duplicated.
In this work, the bit-partitioning scheme was employed to partition the set of four inputs into two input groups (the two fields shown in Figure 9), and the outputs of the two fields were then added to produce the final sum. Several dual-operand adders were used separately to realize the bit-partitioned addition viz. RCA, CSLA CBL, CSLA, CSLA BEC, CSLA-CLA, CSLA BEC-CLA, CSLA-SCBCLA, and CSLA BEC-SCBCLA. The different bit-partitioned addition structures were individually synthesized using the same FPGA (XC3S1600E). It should be noted that the focus here is only on evaluating the performance of the RCA and the different CSLAs as employed for multioperand addition, and not on the efficacy of the bit-partitioning scheme as such (i.e., no comparison is made with the results of the previous subsection). This is because, as mentioned in the preceding discussion, the bit-partitioning technique is scalable, can be custom-defined, and could potentially reduce latency primarily for additions involving larger dimensions, as compared with conventional combinational tree structures. Table 5 presents the timing and area results obtained for the synthesis of bit-partitioned multi-input addition of 4 binary operands, each of size 32 bits, on the basis of the RCA and various homogeneous and heterogeneous CSLAs. Since the 8-8-8-8 uniform input partition was found to be delay-optimum for realizing the 32-bit CSLAs (refer to Figure 6 and Table 1), only this uniform input partition has been considered for implementing the various homogeneous and hybrid CSLAs corresponding to the two fields of the bit-partitioned multioperand addition. To sum up the outputs of the two fields, a 33-bit dual-operand adder is required, in which case an extra bit has been added to the most significant position of the various CSLA input partitions. The optimum synthesis metrics obtained for the example multi-input addition are in bold font in Table 5.
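The four-operand bit-partitioned arrangement described above, including the 33-bit final adder for 32-bit operands, can be sketched as follows. This is a behavioral Python illustration; the names `ripple_add` and `bit_partitioned_add4` are hypothetical, and a ripple adder stands in for whichever dual-operand adder is being evaluated.

```python
def ripple_add(x, y, width):
    """Bit-level dual-operand adder; the result has width + 1 bits."""
    carry, total = 0, 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        total |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | (carry & (xi ^ yi))
    return total | (carry << width)


def bit_partitioned_add4(a, b, c, d, width=32):
    """Add four operands as two fields of two operands each."""
    # The two fields are independent and can be evaluated in parallel.
    first_field = ripple_add(a, b, width)     # (width + 1)-bit result
    second_field = ripple_add(c, d, width)    # (width + 1)-bit result
    # Final dual-operand adder: 33 bits wide when width is 32.
    return ripple_add(first_field, second_field, width + 1)
```

Substituting a CSLA model for `ripple_add` in all three positions would correspond to the alternative realizations compared in Tables 5 and 6.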
It can be seen that the proposed CSLA BEC-SCBCLA achieves the least computation time (27.056 ns) among all the adders. In comparison, the increases in delay for the other bit-partitioned multioperand adders incorporating the RCA, CSLA CBL, CSLA, CSLA BEC, CSLA-CLA, CSLA BEC-CLA, and CSLA-SCBCLA types are found to be 47.6%, 56.1%, 15.9%, 3%, 15.9%, 3%, and 2.1%, respectively. However, the RCA results in the lowest area occupancy (190 BELs) and the CSLA CBL adder occupies nearly the same area with just 5 more BELs. Nevertheless, the bit-partitioned multioperand adder based upon the RCA pays a 47.6% delay penalty in comparison with that utilizing the CSLA BEC-SCBCLA. Table 6 shows the delay and area values obtained for the synthesis of bit-partitioned addition of four input operands of size 64 bits, corresponding to different adder architectures, with the CSLAs utilizing the 16-16-16-16 uniform input partition since this partition was found to be delay optimal (refer to Figure 7 and Table 2). In terms of area, the RCA is found to be the optimum architecture. However, in terms of critical path delay, the proposed CSLA-SCBCLA achieves a good delay reduction of 38.2% compared to the maximum path delay of the RCA based bit-partitioned multioperand adder.

Conclusions
The CSLA is an important member of the high-speed adder family. In this paper, existing CSLA architectures, both homogeneous and heterogeneous, have been described and two new hybrid CSLA topologies have been put forward: (i) a carry select-cum-section-carry based carry lookahead adder (CSLA-SCBCLA) and (ii) a carry select-cum-section-carry based carry lookahead adder including BEC logic (CSLA BEC-SCBCLA). The speed performances of the various CSLA structures have been analyzed based on case studies of 32-bit and 64-bit dual-operand and multioperand additions. Both uniform and nonuniform input data partitions were considered for the various CSLA implementations and FPGA-based synthesis was performed. It has been found that, for dual-operand additions, the proposed CSLA-SCBCLA/CSLA BEC-SCBCLA architecture is faster than all other homogeneous and heterogeneous CSLAs. For bit-partitioned multi-input additions, the proposed CSLA-SCBCLA/CSLA BEC-SCBCLA architecture also promises high speed. For multioperand addition based on the CSA topology, however, the conventional CSLA BEC-CLA and the proposed CSLA BEC-SCBCLA architectures were found to exhibit an optimized and comparable speed performance. From the inferences derived through this work, it is likely that the proposed hybrid CSLA architectures could achieve enhanced performance over conventional CSLAs for ASIC-based synthesis as well.