Design of High-Speed Adders for Efficient Digital Design Blocks

The core of every microprocessor and digital signal processor is its data path. At the heart of the data path and addressing units, in turn, are arithmetic units, which include adders. Parallel-prefix adders offer a highly efficient solution to the binary addition problem and are well suited for VLSI implementations. This paper presents the design and comparison of high-speed parallel-prefix adders such as the Kogge-Stone, Brent-Kung, Sklansky, and Kogge-Stone Ling adders. The Kogge-Stone Ling adder is found to perform more efficiently than the other adders. Here, Kogge-Stone Ling adders and ripple adders are incorporated as part of a lattice filter in order to verify their functionality. The operating frequency of the lattice filter increases if the parallel-prefix Kogge-Stone Ling adder is used instead of ripple adders, since the combinational delay of the Kogge-Stone Ling adder is lower. Further, different tree adder structures are designed and compared using both CMOS logic and transmission gate logic. Using these adders, unsigned and signed comparators are designed as an application example and compared in terms of performance parameters such as area, delay, and power consumed. The design and simulations are done using a 65 nm CMOS design library.


Introduction
Binary addition is one of the most primitive and most commonly used operations in computer arithmetic. A large variety of algorithms and implementations have been proposed for binary addition [1][2][3]. Parallel-prefix adder tree structures such as Kogge-Stone [4], Sklansky [5], Brent-Kung [6], Han-Carlson [7], and Kogge-Stone using Ling adders [8,9] can be used to obtain higher operating speeds. Parallel-prefix adders are suitable for VLSI implementation since they rely on simple cells and maintain regular connections between them. VLSI integer adders are critical elements in general-purpose and digital signal processors since they are employed in the design of arithmetic-logic units, floating-point arithmetic data paths, and address generation units. Moreover, digital signal processing makes extensive use of addition in the implementation of digital filters, either directly in hardware or in specialized digital signal processors (DSPs). In integer addition, any decrease in delay translates directly into an increase in throughput. In the nanometer range, it is very important to develop addition algorithms that provide high performance while reducing power consumption. The adder should be primarily fast and secondarily efficient in terms of power consumption and chip area.
For wide adders (N > 16), the delay of carry look-ahead adders becomes dominated by the delay of passing the carry through the look-ahead stages. This delay can be reduced by looking ahead across the look-ahead blocks. In general, we can construct a multilevel tree of look-ahead structures to achieve a delay that grows with log N. Such adders are variously referred to as tree adders or parallel-prefix adders. Many parallel-prefix networks have been described in the literature, especially in the context of addition. The classic networks include Brent-Kung, Sklansky, Kogge-Stone, and Han-Carlson adders.
The basic components of adders can be designed in many ways. Initially, the combinational delay and functionality can be verified using HDLs, and optimization can be carried out at the architecture level. At the second level, optimization can also be achieved by using specific logic families in the design. In this paper, adder components are designed, analyzed, and compared using CMOS gates and transmission gates with a 130 nm technology file, which is a deep-submicron technology. Several variants of the carry look-ahead equations, such as Ling carries [9], have been presented that simplify carry computation and can lead to faster structures. Most high-speed adders depend on the previous carry to generate the present sum. Ling adders [8,9], on the other hand, use Ling carry and propagate bits to calculate the sum bit. As a result, the dependency on the previous bit addition is reduced; that is, the ripple effect is lowered.
This paper provides a comparative study of the implementation of the above-mentioned high-speed adders. By designing and implementing them, we found that power consumption and area were reduced drastically when the gates were implemented using transmission gates, without compromising speed. Finally, as an application example, a magnitude comparator is designed using the Kogge-Stone Ling adder to verify its efficiency.

Adders
Here, $g_i$ is the bit generate and $p_i$ is the bit propagate. The schematics of $g_i$ and $p_i$ in the CMOS and transmission gate design styles are shown in Figure 1. These signals are then used in the last stage to compute the final sum and carry bits. There are many ways to develop the intermediate stages, the most common being parallel prefix. Many parallel-prefix networks have been described in the literature, especially in the context of addition. In this paper, we have used the Kogge-Stone, Han-Carlson, Sklansky, and Brent-Kung implementations of the CLA and the Kogge-Stone implementation of the Ling adder. The PG logic in all adders is generally represented in the form of cells. These diagrams, known as cell diagrams, are used to compare a variety of adder architectures in the following sections. Two cells are used to implement all the adders: the grey cell and the black cell. Their basic block diagrams are shown in Figure 2.
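As a rough illustration of the cells described above, the following Python sketch models the first PG stage and the two cells. It assumes the common definitions g_i = a_i AND b_i and p_i = a_i XOR b_i, a black cell that outputs both the group generate and group propagate, and a grey cell that outputs only the group generate. The function names are hypothetical, and the reference adder chains grey cells serially rather than in a tree.

```python
def bit_pg(a, b, n):
    """First stage: per-bit generate and propagate lists, LSB first.

    Assumed convention: g_i = a_i AND b_i, p_i = a_i XOR b_i.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    return g, p


def black_cell(upper, lower):
    """Merge two adjacent (G, P) groups into one (G, P) group."""
    g_up, p_up = upper
    g_lo, p_lo = lower
    return g_up | (p_up & g_lo), p_up & p_lo


def gray_cell(upper, g_lower):
    """Same merge as the black cell, but only the group generate is kept."""
    g_up, p_up = upper
    return g_up | (p_up & g_lower)


def add_with_cells(a, b, n):
    """Reference adder: chains gray cells bit by bit (no tree, for clarity)."""
    g, p = bit_pg(a, b, n)
    carry, total = 0, 0
    for i in range(n):
        total |= (p[i] ^ carry) << i            # s_i = p_i XOR carry into bit i
        carry = gray_cell((g[i], p[i]), carry)  # carry out of bit i
    return total, carry
```

The tree adders in the following sections differ only in how these cells are wired together, not in what each cell computes.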

Analysis of Adders
In this paper, mathematical analysis is given for Ling adders. Similar analysis can be given for all other adders as well.
3.1. Brent-Kung Implementation. The Brent-Kung tree computes prefixes for 2-bit groups. These are used to find prefixes for 4-bit groups, which in turn are used to find prefixes for 8-bit groups, and so forth. The prefixes then fan back down to compute the carries-in to each bit. The tree requires $2\log_2 N - 1$ stages. The fanout is limited to 2 at each stage.
The diagram shows buffers used to minimize the fanout and loading on the gates, but, in practice, the buffers are generally omitted.The basic blocks used in this case are gray and black cells which are explained in Section 2. This adder is implemented for 8, 16, and 32 bits using CMOS logic and transmission gate logic.
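The up-sweep and fan-back-down structure of the Brent-Kung tree can be sketched in Python as follows. This is a behavioural model under stated assumptions, not the schematic-level design used in this work: it assumes N is a power of two, uses XOR for the bit propagate, omits the buffers, and uses the hypothetical name `brent_kung_add`. It also counts the prefix stages so that the $2\log_2 N - 1$ figure can be checked.

```python
def brent_kung_add(a, b, n):
    """Add two n-bit integers (n a power of two) with a Brent-Kung prefix tree.

    Returns (sum, carry_out, number_of_prefix_stages).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # bit propagate (XOR assumed)
    G, P = g[:], p[:]
    stages = 0

    def merge(i, j):
        # Prefix cell: group ending at i absorbs the adjacent lower group ending at j.
        G[i], P[i] = G[i] | (P[i] & G[j]), P[i] & P[j]

    # Up-sweep: prefixes for 2-bit, 4-bit, 8-bit, ... groups.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            merge(i, i - d)
        stages += 1
        d *= 2

    # Down-sweep: fan the prefixes back down to the remaining positions.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            merge(i, i - d)
        stages += 1
        d //= 2

    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i else 0    # carry into bit i is the prefix G[i-1:0]
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1], stages
```

For N = 8 the sketch uses 3 up-sweep and 2 down-sweep stages, and for N = 16 it uses 4 and 3.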

3.2. Sklansky Implementation. The Sklansky or divide-and-conquer tree reduces the delay to $\log_2 N$ stages by computing intermediate prefixes along with the large group prefixes. This comes at the expense of fanouts that double at each level. The gates fan out to (8, 4, 2, 1), respectively. These high fanouts cause poor performance on wide adders unless the high-fanout gates are appropriately sized or the critical signals are buffered before being used for the intermediate prefixes.
Transistor sizing can cut into the regularity of the layout because multiple sizes of each cell are required although the larger gates can spread into adjacent columns.
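The doubling fanout of the Sklansky tree can be made concrete with a small Python model. The sketch assumes N is a power of two and XOR bit propagate, and the name `sklansky_add` is hypothetical. Alongside the sum, it records the largest number of cells driven by any single prefix in each stage.

```python
def sklansky_add(a, b, n):
    """Add two n-bit integers (n a power of two) with a Sklansky prefix tree.

    Returns (sum, carry_out, fanouts), where fanouts[k] is the largest number
    of cells driven by a single prefix in stage k (first stage first).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    fanouts = []
    k = 0
    while (1 << k) < n:
        readers = {}
        for i in range(n):
            if (i >> k) & 1:                    # i lies in the upper half of its 2^(k+1) block
                j = ((i >> k) << k) - 1         # top bit of the lower half of that block
                G[i], P[i] = G[i] | (P[i] & G[j]), P[i] & P[j]
                readers[j] = readers.get(j, 0) + 1
        fanouts.append(max(readers.values()))
        k += 1

    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1], fanouts
```

For a 16-bit adder the recorded fanouts are 1, 2, 4, and 8 from the first stage to the last, which is the (8, 4, 2, 1) pattern read from the final stage backwards.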

3.3. Han-Carlson Adder. The Han-Carlson trees are a family of networks between Kogge-Stone and Brent-Kung. The logic performs Kogge-Stone on the odd-numbered bits and then uses one more stage to ripple into the even positions.
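A minimal Python sketch of one member of this family is given below, assuming N is a power of two and XOR bit propagate; the name `han_carlson_add` is hypothetical. The first stage pairs each odd bit with its even neighbour, a Kogge-Stone network then runs over the odd positions only, and a final stage ripples into the even positions.

```python
def han_carlson_add(a, b, n):
    """Add two n-bit integers (n a power of two) with a Han-Carlson prefix tree.

    Returns (sum, carry_out, number_of_prefix_stages).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    stages = 0

    # Stage 1: every odd position absorbs its even neighbour.
    for i in range(1, n, 2):
        G[i], P[i] = G[i] | (P[i] & G[i - 1]), P[i] & P[i - 1]
    stages += 1

    # Kogge-Stone restricted to the odd positions (distances 2, 4, 8, ...).
    d = 2
    while d < n:
        prev_G, prev_P = G[:], P[:]             # all cells read the previous stage
        for i in range(1, n, 2):
            if i - d >= 0:
                G[i] = prev_G[i] | (prev_P[i] & prev_G[i - d])
                P[i] = prev_P[i] & prev_P[i - d]
        stages += 1
        d *= 2

    # Final stage: ripple from each odd position into the even position above it.
    for i in range(2, n, 2):
        G[i], P[i] = G[i] | (P[i] & G[i - 1]), P[i] & P[i - 1]
    stages += 1

    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1], stages
```

In this sketch the stage count is $\log_2 N + 1$, one more than Kogge-Stone, in exchange for roughly half as many cells in the middle stages.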

3.4. Kogge-Stone Adders. The main difference between Kogge-Stone adders and other adders is their high performance. The adder calculates the carry corresponding to every bit with the help of group generate and group propagate signals. In this adder, the number of logic levels is $\log_2 N$, and the fanout is 2.
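The Kogge-Stone network can be modelled in a few lines of Python. The sketch below assumes N is a power of two and XOR bit propagate, and `kogge_stone_add` is a hypothetical name. In every stage, each position i combines its own prefix with the prefix at distance d below it, with d doubling from stage to stage.

```python
def kogge_stone_add(a, b, n):
    """Add two n-bit integers (n a power of two) with a Kogge-Stone prefix tree.

    Returns (sum, carry_out, number_of_prefix_stages).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    stages = 0
    d = 1
    while d < n:
        prev_G, prev_P = G[:], P[:]             # every cell reads the previous stage
        for i in range(d, n):
            G[i] = prev_G[i] | (prev_P[i] & prev_G[i - d])
            P[i] = prev_P[i] & prev_P[i - d]
        stages += 1
        d *= 2

    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1], stages
```

Each prefix feeds at most two cells of the next stage (its own column and the column d positions higher), which is where the fanout of 2 comes from.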
3.5. Ling Adders. Ling [8] proposed a simpler form of the CLA equations which relies on adjacent bit pairs $(a_i, b_i)$ and $(a_{i-1}, b_{i-1})$. Along with the bit generate and bit propagate, we introduce another prefix bit, the half-sum bit, given by $d_i = a_i \oplus b_i$.

Figure 5: Ling generate and propagate in Ling CLA.
Figure 6: Block generate and propagate (Ling carry) using CMOS and transmission gates.
Now, instead of the traditional carries, a new type of carry, known as the Ling carry, is produced. The ith Ling carry $H_i$ is defined in [11], and each $H_i$ can in turn be expanded in terms of the bit generate and propagate signals. We can see from (5) that Ling carries can be calculated much faster than Boolean carries. Consider the case of $c_4$ and $H_4$: if we assume that all gates have only two inputs, the calculation of $c_4$ requires five logic levels, whereas that of $H_4$ requires only four.
Although the computation of the carry is simplified, calculating the sum bits from Ling carries is more complicated. Substituting (5) into the traditional-carry expression for the sum bit, (9), gives the sum in terms of the Ling carry. However, according to [12], the computation of the bits $s_i$ can be transformed into the form of (11), which can be implemented using a multiplexer with $H_{i-1}$ as the select line. No extra delay is added by the Ling carries when computing the sum, since the delay of the XOR gate is almost equal to that of the multiplexer and the inputs to the multiplexer take less time to compute than the Ling carry.
In [9], a methodology to develop parallel-prefix Ling adders using the Kogge-Stone [4] and Knowles [8] algorithms was developed. For n-bit addition, the Ling carries $H_i$ and $H_{i+1}$ are written in terms of auxiliary generate and propagate terms. Taking the 3rd and 4th Ling carries as an example, each expression can be reduced using (13) and then rewritten with the "•" operator. This allows the parallel-prefix computation of Ling adders using separate trees [9] for the even- and odd-indexed positions. Using this methodology, we implemented a 16-bit adder using the Kogge-Stone tree and then used that block to develop 32- and 64-bit adders. The gates and blocks used for this implementation were then modified using transmission gates.
Cells other than the gray and black cells that are used as components in the Ling adder are explained in Figures 3 and 4. Figure 3 forms the first stage of the adder. It generates the bit generate, bit propagate, and half-sum bits (for Ling adders), that is, $g_i$, $p_i$, and $d_i$, respectively, which are used extensively in the next stages to generate the block generate and propagate.
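The Ling carry and the multiplexer-based sum can be illustrated with a bit-serial Python sketch. It is not the parallel-prefix tree of [9]. It assumes the formulation commonly used in the Ling adder literature: $g_i = a_i b_i$, $p_i = a_i + b_i$ (inclusive OR), $d_i = a_i \oplus b_i$, the Ling carry $H_i = c_i + c_{i-1}$, which satisfies $H_i = g_i + p_{i-1} H_{i-1}$, and the Boolean carry recovered as $c_i = p_i H_i$. The name `ling_add` is hypothetical.

```python
def ling_add(a, b, n):
    """Bit-serial model of Ling addition for two n-bit integers (no carry-in).

    Returns (sum, carry_out, H) where H[i] is the i-th Ling carry.
    Assumed definitions: g = AND, p = OR, d = XOR (half sum).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    d = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    H = []
    total = 0
    h_prev, p_prev = 0, 0                      # H_{-1} = 0 and p_{-1} = 0
    for i in range(n):
        # Multiplexer: H_{i-1} selects between d_i and d_i XOR p_{i-1}.
        s_i = (d[i] ^ p_prev) if h_prev else d[i]
        total |= s_i << i
        h_prev = g[i] | (p_prev & h_prev)      # H_i = g_i + p_{i-1} H_{i-1}
        p_prev = p[i]
        H.append(h_prev)

    carry_out = p[n - 1] & H[n - 1]            # c_{n-1} = p_{n-1} H_{n-1}
    return total, carry_out, H
```

Because both multiplexer inputs depend only on $d_i$ and $p_{i-1}$, they are available well before $H_{i-1}$ arrives, which is the reason the sum stage adds no extra delay.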
Figure 5 is used to generate the Ling carry H which is nothing but the block generate.This is then used to find subsequent group generate and propagate with the block shown in Figure 6.
Finally, the block generates are used together with the bit propagate and half-sum bits to calculate the final sum, as in Figures 7 and 8.
Adders are extensively used as a part of filters. Lattice filter structures are used in various signal processing applications, and they are considered in the present work. The block diagram of a third-order lattice filter is shown in Figure 9. The ripple adders in the lattice filter are replaced with Kogge-Stone Ling adders using component instantiation in VHDL. Initially, the Kogge-Stone Ling adder is implemented in VHDL to observe its functionality and combinational delay. The combinational delay of the 32-bit Kogge-Stone Ling adder is found to be 12.492 ns, which is much less than the 15.504 ns of the ripple adder. If components with lower combinational delay are used in sequential circuits, the clock period is reduced, which in turn increases the clock frequency. After synthesis, the Ling adder implementation resulted in 15% less delay than the ripple adders.
For the cascaded lattice filter shown in Figure 9, the minimum period is obtained after synthesis with ripple adders. The clock frequency of any digital filter block is found to increase if the Kogge-Stone Ling adder is used instead. This approach can be used for any digital block where the operating speed needs to be high.
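The relation between combinational delay and clock frequency can be illustrated with a short calculation. The sketch assumes, purely for illustration, that the adder delay alone sets the clock period, which ignores register overhead and the remaining filter logic. It therefore gives upper bounds, not the synthesized filter frequencies. The function name is hypothetical.

```python
def max_clock_mhz(critical_path_ns):
    """Upper bound on clock frequency (MHz) for a given critical-path delay (ns)."""
    return 1e3 / critical_path_ns


# Adder delays quoted in the text for the 32-bit adders.
ling_ns, ripple_ns = 12.492, 15.504
f_ling = max_clock_mhz(ling_ns)
f_ripple = max_clock_mhz(ripple_ns)
print(f"Kogge-Stone Ling bound: {f_ling:.2f} MHz")
print(f"Ripple adder bound:     {f_ripple:.2f} MHz")
```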

Simulations and Results
Schematics are constructed for 8-bit and 32-bit adders using CMOS and transmission gates, as given in Figures 10, 11, 12, 13, and 14. In each circuit, power, area, and delay are measured. This is done by designing the basic components, such as the black and grey cells, using CMOS and transmission gates. The performance parameters are obtained using a 65 nm technology file and compared between the adders built from CMOS gates and those built from transmission gates. The result summary for all the adders is given in Table 1.

Application Example
Here, signed and unsigned magnitude comparators [13,14] are designed using the Kogge-Stone Ling adder. A magnitude comparator determines the larger of two binary numbers.
To compare two unsigned numbers A and B, compute $B - A = B + \bar{A} + 1$. A zero detector indicates that the numbers are equal. Figure 15(a) shows an 8-bit unsigned comparator built from a carry-ripple adder and two complement units. The relative magnitude is determined from the carry-out (C) and zero (Z) signals. For wider inputs, any of the faster adder architectures can be used. Figure 15(b) shows an 8-bit signed comparator.
Comparing signed two's complement numbers is slightly more complicated because of the possibility of overflow when subtracting two numbers with different signs. Instead of simply examining the carry-out, we must determine if the result is negative (N, indicated by the most significant bit of the result) and if it overflows the range of possible signed numbers. The overflow signal V is true if the inputs had different signs (most significant bits) and the output sign is different from the sign of B. The actual sign of the difference B − A is S = N XOR V because overflow flips the sign. If this corrected sign is negative (S = 1), we know that A > B. Again, the other relations can be derived from the corrected sign and the Z signal. The carry signal is used here as well for comparison purposes. Using the Kogge-Stone Ling adder as the basic block of the comparator gives much better performance since its combinational delay is lower.
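The flag logic described above can be checked with a small behavioural model in Python. The sketch computes B − A as B plus the bitwise complement of A plus 1 on n-bit words and derives the C, Z, N, and V signals. The function names are hypothetical, and an ordinary integer addition stands in for the adder block.

```python
def compare_flags(a, b, n):
    """Compute B - A = B + ~A + 1 on n-bit words and return (C, Z, N, V)."""
    mask = (1 << n) - 1
    total = (b & mask) + (~a & mask) + 1
    diff = total & mask
    C = (total >> n) & 1                        # carry-out of the adder
    Z = int(diff == 0)                          # zero detector
    N = (diff >> (n - 1)) & 1                   # sign bit of the result
    sign_a = (a >> (n - 1)) & 1
    sign_b = (b >> (n - 1)) & 1
    # Overflow: inputs have different signs and the result sign differs from B's.
    V = int(sign_a != sign_b and N != sign_b)
    return C, Z, N, V


def unsigned_a_gt_b(a, b, n):
    """A > B for unsigned n-bit words: no carry-out from B + ~A + 1."""
    C, _, _, _ = compare_flags(a, b, n)
    return C == 0


def signed_a_gt_b(a, b, n):
    """A > B for two's complement n-bit words: corrected sign S = N XOR V is 1."""
    _, _, N, V = compare_flags(a, b, n)
    return (N ^ V) == 1
```

Equality follows from Z alone in both cases, and the remaining relations (A ≤ B, A < B, A ≥ B) are combinations of Z with C or with S.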

Conclusions
From the above work, it was seen that the clock frequency of the IIR filter using the Ling adder was higher than that of the same IIR filter using a simple ripple adder. The combinational path delay of the Ling adder was found to be 15% less than that of the ripple adder. Using transmission gates reduced the area of the adder, and hence of the comparator built from it, compared with the CMOS logic implementation.
Using transmission gate logic also reduced the delay and power consumption of the adder, and hence of the comparator built from these adders, compared with the CMOS logic implementation. The power consumed by the comparator using the Ling adder is less than that consumed by comparators designed using the other tree adders.

2.1. Carry Look Ahead Adders. Consider the n-bit addition of two numbers, $A = a_{n-1}, a_{n-2}, \ldots, a_0$ and $B = b_{n-1}, b_{n-2}, \ldots, b_0$, resulting in the sum $S = s_{n-1}, s_{n-2}, \ldots, s_0$ and a carry $C_{out}$. The first stage in the CLA computes the bit generate and bit propagate.

Figure 1: Schematic of bit generate circuit using CMOS and transmission gate design style.

Figure 2: Block diagram of grey cell and black cell.

Figure 8: Sum block in Ling adder using CMOS and transmission gates.

Table 1: Delay, power, and area consumed for different adders: a comparison. Reported quantities: power (W), delay (s), and area (number of transistors).