A General Design Methodology for Synchronous Early-Completion-Prediction Adders in Nano-CMOS DSP Architectures

Synchronous early-completion-prediction adders (ECPAs) are used in high-clock-rate, high-precision DSP datapaths, as they allow a dominant share of single-cycle operations even if the worst-case carry propagation delay is longer than the clock period. Previous works have also demonstrated ECPA advantages for average leakage reduction and NBTI effect reduction in nanoscale CMOS technologies. This paper illustrates a general systematic methodology to design ECPA units targeting nanoscale CMOS technologies, which is not yet available in the current literature. The method is fully compatible with standard VLSI macrocell design tools and standard adder structures and includes automatic definition of critical test patterns for postlayout verification. A design example is included, reporting speed and power data superior to previous works.


Introduction
Fast integer adders are an essential component of most DSP datapaths. Synchronous early-completion-prediction adders (ECPAs) [1], also known as variable-latency adders [2], have been introduced for high-clock-rate, high-precision datapaths, as they allow single-cycle operations even if the clock period is shorter than the worst-case carry propagation delay. Thanks to the data dependency of the actual carry chain propagation, multicycle operations can be kept statistically rare, thus allowing an overall speed improvement. The industrial effectiveness of the idea was first proven by the design of a full-custom ECPA unit for a DSP datapath at Toshiba Labs [1]. The logic foundation of that adder is shown in [3]. An extension to multiply unit design has been shown in [4]. The works in [2] and [5] have recently pointed out the potential of variable-latency units in nano-CMOS addition units for reducing average leakage power consumption and improving robustness to NBTI faults occurring in nanoscale technologies.
An ECPA consists of a conventional adder plus a completion-prediction logic unit (Figure 1). The prediction unit estimates the actual critical path length in the adder depending on the operand values and hence the cycle count of the operation for the target cycle time. This approach differs from asynchronous completion detection units [6][7][8], as it is based on a totally synchronous scheme. From the design point of view, the logic specification of the prediction function depends on the target cycle time and on the estimation of the variable completion time of the adder, in order to define the cycle count output. Moreover, the speed of the prediction unit is critical, since the prediction must always be completed in a single cycle in order to be effective.
No general design methodology for ECPA VLSI cores has been proposed yet. In [3], Lee and Asada analyzed the design problem on the basis of a 2-input-gate unit delay within a ripple-carry adder structure. In [1], Kondo et al. address the full-custom design case of a fast carry-select structure. In [9], Nowick et al. deal with the design of speculative-completion adders, similar in principle to ECPAs but again addressing asynchronous design. In [2], a 64-bit carry-select design case is presented, where the prediction logic synthesis assumes a priori that single-cycle latency occurs when the carry propagation chain is shorter than 32 bits.
We present a detailed, general method for the design of completion-prediction logic in full-custom adder macrocells in nanoscale CMOS technologies, targeting carry-select and hybrid carry-select/carry-lookahead addition schemes. Notably, the insertion of the prediction logic unit does not modify the adder logic in any way. The proposed method supports the prediction of any cycle count latency value, not only single-cycle and two-cycle latency.
The paper illustrates the reference architecture template, the adder propagation delay model upon which the design method is built, a detailed description of the design procedure, the general validation of the approach, and a specific example with performance results referring to a 32 nm CMOS technology and reporting speed and power data.

Architecture Template
We start from the well-known carry-select addition scheme [10][11][12], illustrated in Figure 2. The logic components indicated with "setup" produce the propagate and generate bits out of the operand bits. The components indicated with "adder block" produce potential carry and sum values to be selected by the chain of 2:1 multiplexers in the lower part of the picture. The proposed method targets two standard addition schemes: (i) a conventional, low-area carry-select scheme with ripple adder circuitry in its adder blocks (in the following, CSA); (ii) a hybrid scheme of carry-select with carry-lookahead circuitry in its adder blocks (in the following, CLA/CSA).
In both schemes, we refer to the operand size as $n$ bits and the block size as $b$ bits.
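The carry-select scheme described above can be illustrated with a minimal behavioral sketch in Python. It is not the circuit of Figure 2: the function name and the parameter names `n` (operand size) and `blk` (block size) are assumptions introduced here, and each block simply computes both candidate results before a 2:1 selection driven by the incoming carry.

```python
def carry_select_add(a, b, n=32, blk=4):
    """Behavioral carry-select addition: returns (sum mod 2**n, carry_out)."""
    if n % blk != 0:
        raise ValueError("operand size must be a multiple of the block size")
    mask = (1 << blk) - 1
    carry = 0          # input carry of block 0 is assumed to be 0
    result = 0
    for i in range(n // blk):
        xa = (a >> (i * blk)) & mask
        xb = (b >> (i * blk)) & mask
        s0 = xa + xb          # adder block result for input carry = 0
        s1 = xa + xb + 1      # adder block result for input carry = 1
        chosen = s1 if carry else s0      # 2:1 multiplexer driven by the incoming carry
        result |= (chosen & mask) << (i * blk)
        carry = chosen >> blk             # selected output carry of this block
    return result, carry
```

For instance, `carry_select_add(0xFFFFFFFF, 1)` wraps around to a zero sum with the output carry set.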
A design option is the introduction of internal pipeline stages in the adder. As an ECPA unit aims to statistically exploit single-cycle operation latency, the introduction of pipelined operations may yield relatively low effectiveness at the expense of a pipeline register delay overhead. Its effectiveness was studied by instruction-level simulations of SPECint benchmarks in [8]. For a 2-cycle worst-case latency ECPA sustaining a 1 GHz clock rate, only 6% of the additions benefited from the internal pipeline. Here we illustrate the design of a nonpipelined ECPA specifically addressing single-cycle latency maximization; specific applications may benefit from the introduction of a pipeline in the architecture.
The clocking strategy underlying the proposed high-speed ECPA design is a two-phase symmetric clock with dynamic logic and transparent latches [11,12]. Figure 3 illustrates the architecture and timing of the operations. The setup logic and the adder block logic are implemented in Domino style, precharged on the low clock half-cycle. The selection chain multiplexers are implemented in static logic, while the prediction logic is implemented in Domino style, precharged in the high clock half-cycle. The input register of the whole adder is split into two latches, one between the adder block and the selection chain and one after the selection chain. The whole adder structure is developed adopting a standard VLSI macrocell design tool chain.
In case of single-cycle operation, the adder operates as a normal single-cycle Domino circuit. In case of multicycle operation, the input of the adder blocks should not change for the time needed to complete the selection chain and the sum generation. This is normally accomplished by the input registers of the arithmetic unit in the datapath architecture. In general, depending on the operand values and on the clock cycle time, an addition can take $1 + k$ cycles, where $k$ ranges from 0 (single-cycle operation) to a few units (usually not more than 3 in design cases of practical interest).

Timing Analysis and Model of the Variable Completion Time
In the analyzed adder schemes, the addition operation proceeds as follows: after the propagate and generate bit vectors have been set up, any block having at least one propagate bit at "0" produces a valid anticipated output carry bit independent of its input carry bit. Then, the output carry bits of the remaining blocks are evaluated by means of the carry selection chain. The longest chain of unknown carry bits determines the duration of the latter operation [7,8,14,15]. When all the carries are ready, the adder performs the sum selection to produce the result. Referring to Figure 3, the delays of the components involved in the above operations are the setup logic delay, the adder block delay (including a latch delay), the multiplexer delay, and the latch delay, namely, $T_{\mathrm{SETUP}}$, $T_{\mathrm{BLOCK}}$, $T_{\mathrm{MUX}}$, and $T_{\mathrm{LATCH}}$. Finally, let us refer to the length of the longest chain of consecutive unknown carries as $L_{\max}$, an integer ranging from 0 to a maximum set by the number of blocks, and to the prediction unit propagation delay as $T_{\mathrm{PRED}}$. Given a clock cycle time $T$, the timing conditions for correctly performing an addition in $1 + k$ clock cycles are the precharge phase conditions (1) and the evaluation phase conditions (2), (3), and (4). In (4), the term $T_{\mathrm{BLOCK}} + T_{\mathrm{MUX}}$ is the time needed to produce the anticipated carries [7,15], the term $T_{\mathrm{MUX}} \cdot L_{\max}$ is the delay of the longest carry selection chain, and the last $T_{\mathrm{MUX}}$ is the delay of the final 2:1 sum selection after all the carries are ready (minor conservative approximations contained in (4), mainly in the time for producing the anticipated carries and in the individual role of the least and most significant blocks, slightly overestimate the addition time).
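To make the notion of the longest chain of unknown carries concrete, the following minimal Python sketch computes it from two operands. The function names and the parameters `n` and `blk` are hypothetical. Excluding block 0 from the chain is an assumption made here, on the grounds that its input carry is known.

```python
def block_propagate_flags(a, b, n=32, blk=4):
    """Return, for each block, True if all of its propagate bits are 1."""
    if n % blk != 0:
        raise ValueError("operand size must be a multiple of the block size")
    p = (a ^ b) & ((1 << n) - 1)      # bitwise propagate vector p = a XOR b
    mask = (1 << blk) - 1
    return [((p >> (i * blk)) & mask) == mask for i in range(n // blk)]


def longest_unknown_chain(a, b, n=32, blk=4):
    """Longest run of consecutive blocks (block 0 excluded) with no anticipated carry."""
    best = run = 0
    for flag in block_propagate_flags(a, b, n, blk)[1:]:
        run = run + 1 if flag else 0   # a block with a "0" propagate bit breaks the chain
        best = max(best, run)
    return best
```

With 32-bit operands and 4-bit blocks, `longest_unknown_chain(0x00000FF0, 0)` sees blocks 1 and 2 as all-propagate and returns a chain length of 2.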
Equation (1) must be verified after the circuit implementation but does not usually represent a problem, as the evaluation pull-down, performed sequentially, is normally slower than the pull-up precharge, performed in parallel [16]. Equations (2) and (3) are necessary conditions for the correct implementation of the ECPA and constitute a circuit design constraint. Equation (4) constitutes the functional specification of the prediction logic, as the smallest integer $k$ satisfying it defines the predicted cycle count $1 + k$. The worst-case cycle count occurs when the length of the longest chain of unknown carries, $L_{\max}$, takes its maximum value. We characterized the propagation delay of the CMOS circuits in the adder structure by means of NGSPICE simulation [17], referring to a 32 nm CMOS technology. Transistor sizing was optimized for critical path speed according to the logical effort method [16]. Table 1 summarizes the resulting component delays for the two adder schemes. So that the reported characterization can be used in different technologies, the delay values are normalized with respect to the reference time unit FO4, given by the delay of an inverter driving four identical inverters, which is a characteristic datum for a given technology. Note that the latch delay $T_{\mathrm{LATCH}}$ is always 1.7 FO4 delay units, as we use a fixed latch cell with a fixed load.
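The way (4) turns a chain length into a cycle count can be sketched in Python as follows. Only the delay terms described in the text are used. The right-hand side of (4) is not reproduced here, so the time budget (`first_cycle_budget` plus `k` full periods) is an assumption, and the delay values in the usage note are placeholders, not the Table 1 data.

```python
import math


def evaluation_delay(l_max, t_block, t_mux):
    """Delay terms described for (4): anticipated carries + selection chain + final sum mux."""
    return (t_block + t_mux) + t_mux * l_max + t_mux


def predicted_extra_cycles(l_max, t_cycle, t_block, t_mux, first_cycle_budget=None):
    """Smallest integer k >= 0 such that the delay fits in first_cycle_budget + k * t_cycle."""
    if first_cycle_budget is None:
        first_cycle_budget = t_cycle      # assumption: a full period is usable in the first cycle
    excess = evaluation_delay(l_max, t_block, t_mux) - first_cycle_budget
    return max(0, math.ceil(excess / t_cycle))
```

With hypothetical delays of 6 FO4 for the block and 2 FO4 for the multiplexer and an 18.5 FO4 cycle, `[predicted_extra_cycles(L, 18.5, 6.0, 2.0) for L in range(8)]` gives one column of a prediction table: zero extra cycles up to a chain length of 4 and one extra cycle beyond it.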

Synthesis Method of the Completion Prediction Unit
According to (4), the prediction circuit must evaluate $L_{\max}$, convert this information into the binary cycle count of the addition, and/or activate a synchronous completion signal at the $k$th cycle after the addition has started. Figure 4 shows the generic architecture of the prediction logic, composed of three stages.
(1) The first stage computes the dependency bit $d_i$ of each block $i$ (except for block 0); that is,
$$d_i = \overline{c_i^{0}} \wedge c_i^{1}, \qquad (5)$$
where $c_i^{0}$ and $c_i^{1}$ are the potential carry values produced in block $i$ by the two adders with input carry = 0 and input carry = 1, respectively. Since $c_i^{0}$ = "1" implies $c_i^{1}$ = "1", $d_i$ is "1" exactly when the two potential carries differ. The output carry of a block depends on the preceding block if and only if $d_i$ is "1", provided that $c_i^{0}$ and $c_i^{1}$ have reached a stable state [7].
(2) The second stage evaluates $L_{\max}$ and encodes its value by signals named $L_j$, such that $L_j$ = "1" means $L_{\max} \ge j$. To synthesize $L_1$, we have to find any block $i$ having $d_i$ active; for $L_2$ we have to find any two consecutive blocks $i$ and $i+1$ having both $d_i$ and $d_{i+1}$ active; and so on. As a general expression, we have
$$L_j = \bigvee_{i} \left( d_i \wedge d_{i+1} \wedge \cdots \wedge d_{i+j-1} \right), \qquad (6)$$
where $i$ ranges over all blocks for which the $j$ consecutive dependency bits exist. Referring to a target cycle time $T$, from (4) in conjunction with the delay values in Table 1, we can evaluate the cycle count for all the possible values of $L_{\max}$ and hence for all the combinations of the logic variables $L_j$. We can build up a prediction table for each adder type and size, defining the correct values of $k$ for a set of values of the target cycle time $T$. Table 2 shows an example of two prediction tables, for the CSA and the CLA/CSA, respectively. The shaded part covers those values that do not match (3). Once we choose the column corresponding to the target cycle time, we can look at the prediction table as a truth table with the logic signals $L_j$ as input variables and the cycle count (e.g., its binary expression) as output variable. Adjacent rows having the same output cycle count correspond to two logic minterms differing only in one input variable $L_j$, occurring as direct and complemented. Such adjacent rows can, therefore, collapse into a single row by Boolean reduction, resulting in a final truth table dedicated to the target adder and target clock period, where only a subset of effective $L_j$ variables appears, drastically reducing the hardware overhead in both the second and third stages of the prediction logic. Section 6 presents an example of the truth table generation.
(3) The third stage converts the logic signals $L_j$ into a binary expression of $k$. The binary digits of the number $k$ can be explicitly specified in the truth table and synthesized as a (very simple) combinational function of the $L_j$ variables. If the datapath architecture does not require a binary expression of the cycle count but rather a synchronous wait signal to flag the completion of the operation, the third stage can be a (very simple) synchronous state machine, directly driven by the logic signals $L_j$, activating a synchronous completion signal $k$ cycles after the current one. An example of both solutions is shown in Section 6.
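The second and third stages can be sketched behaviorally in Python as follows. This is an illustration, not the Domino implementation: the function names are hypothetical, the dependency bits are passed as a list of 0/1 values for blocks 1 and up, and the set of effective thresholds is an input that would come from the collapsed truth table.

```python
def chain_signals(dep_bits):
    """Return {j: flag}, where flag j is True iff some j consecutive dependency bits are all 1."""
    m = len(dep_bits)
    return {
        j: any(all(dep_bits[i:i + j]) for i in range(m - j + 1))
        for j in range(1, m + 1)
    }


def cycle_count_bits(signals, thresholds):
    """Third-stage sketch: k counts the active effective signals; also return k in binary."""
    # A longer chain implies every shorter one, so counting active thresholds yields k.
    k = sum(1 for j in thresholds if signals[j])
    width = max(1, len(thresholds).bit_length())
    return k, format(k, "0{}b".format(width))
```

As a usage example with an illustrative threshold set `(1, 6)`, seven active dependency bits give `cycle_count_bits(chain_signals([1] * 7), (1, 6)) == (2, "10")`, while a single isolated dependent block gives `(1, "01")`.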
The fast full-custom circuit implementation of the first two logic stages relies on dynamic Domino circuits precharged on the low clock half-cycle, according to the two-phase clocking strategy sketched in Figure 2. The third stage, thanks to its inherently low complexity, can be implemented as static logic without compromising speed. The schematic of the critical path of the prediction logic is sketched in Figure 5: the first and second stages, implementing (5) and (6), consist of single-stage Domino circuits. All of the signals coming out of the Domino gates are precharged low during the low clock phase. The external logic is supposed to sample the output of the prediction unit on the clock falling edge.

Design Validation Procedure
The validation of the prediction logic design must address two issues: verifying the predicted cycle count correctness and verifying that the prediction unit evaluates the cycle count within the second half-cycle of the current clock cycle.
To verify the correct prediction in a postlayout SPICE simulation, we can automatically generate the critical test patterns corresponding to the boundaries between different cycle count predictions shown by the rows of the truth table.
In each row, only one variable occurs with explicit value "0"; the corresponding critical addition operands are all the input patterns that set a string of consecutive dependency bits high. In formulas, such critical operand values are defined by (7) for each truth table row. A special test case is the prediction of the worst-case cycle count, which corresponds to operands = 2 and = . To test the prediction unit critical path delay in a postlayout SPICE-level simulation, the adder input to be used is the same as for the worst-case cycle count prediction.
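One possible way to generate operands that set a given string of consecutive dependency bits high is sketched below in Python. It is an illustrative construction, not necessarily the one given by (7). The function name, the `start_block` parameter, and the choice to generate a carry just below the run (so that a carry actually ripples through it) are assumptions introduced here.

```python
def critical_operands(run_len, start_block=1, n=32, blk=4):
    """Operands whose blocks start_block .. start_block + run_len - 1 are all-propagate."""
    nblocks = n // blk
    if start_block < 1 or run_len < 0 or start_block + run_len > nblocks:
        raise ValueError("run does not fit in the adder")
    # One operand has all ones in the run, the other all zeros: every propagate bit is 1 there.
    run_mask = ((1 << (run_len * blk)) - 1) << (start_block * blk)
    # Both operands set the top bit of the block below the run: a carry is generated there.
    gen_bit = 1 << (start_block * blk - 1)
    a = run_mask | gen_bit
    b = gen_bit
    return a, b
```

For example, `critical_operands(2)` returns `(0xFF8, 0x8)`: blocks 1 and 2 are all-propagate, block 0 generates a carry, and the sum `0x1000` shows the carry rippling through both blocks.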
Example of ECPA Unit Design
(6) synthesis and circuit design of the three stages of the prediction logic (from (5), (6), and the truth table),
(7) adder circuit design through standard full-custom VLSI design tools,
(8) postlayout design validation through SPICE simulation of the prediction logic critical path and of the set of test additions produced at step (5).
If postlayout simulation reports a prediction fault (i.e., predicted cycle count lower than real latency), one may permanently update the prediction table accordingly and repeat the synthesis process. All of the steps can be implemented in a software ECPA macrocell generator.

Design Example.
We show the case of a prediction unit designed for a 32-bit operand, 4-bit block size CLA/CSA ECPA, considering a 32 nm metal-gate high-K CMOS process characterized by an FO4 propagation delay of 8.7 ps. We target a clock frequency of 6 GHz, that is, 166.4 ps cycle time, equivalent to 18.5 FO4 delay units. Such clock speed is extremely high for the reference CMOS process.
From Table 2, selecting the cycle time column labeled with number 25.5, we have 3 possible values for the cycle count, that is, 1, 2, and 3. Hence the prediction table will collapse into a 3-row truth table, shown in Table 3. Consequently, the logic functions to be implemented are the dependency bits of (5) and, from (6), the two signals $L_1$ and $L_6$, which flag chains of at least one and at least six dependent blocks. The logic synthesis of the cycle count obtained from Table 3 is the 2-bit expression $k_1 = L_6$, $k_0 = \overline{L_6} \wedge L_1$. As an alternative, the synthesis of the wait signal can be obtained by the state machine specified in Figure 6. Figure 7 shows the transistor-level design of the whole prediction circuitry resulting from applying the synthesis procedure. The transistor sizing in the prediction unit as well as in the adder is optimized according to the logical effort method [16].
The prediction logic transistor count is 181, including the static logic for encoding the binary digits of the number $k$ and the state machine to produce the wait signal (in practice they are mutually exclusive solutions).
The critical test cases result from (7) in conjunction with Table 3. The total hardware complexity of this ECPA unit is 1705 transistors, resulting from the prediction circuitry (181 transistors) plus the hybrid CLA/CSA adder (1524 transistors). Figure 8 shows the layout of the adder macrocell, total size being 6.6 μm × 5.8 μm.

Speed Performance Results.
We tested the critical paths of the resulting circuits by NGSPICE simulation, verifying that all of the tests give correct predictions. An additional verification was performed at the logic level on the full architecture of the adder, by means of a nano-CMOS dedicated delay model [18], confirming the positive results of the tests. Finally, NGSPICE BSIM4 simulation of the critical path of the prediction unit confirms that the evaluation of the predicted cycle count is always completed within one clock cycle. Figure 9 shows the NGSPICE output of the latter simulated test, referring to the worst-case cycle count, that is, $k = 2$, at 6 GHz clock frequency. The cycle count bits $k_1$ and $k_0$ correctly take the "10" value on the falling clock edge after the presentation of the addition operands, and the wait signal correctly remains high for two consecutive falling clock edges. The critical path of the prediction logic goes from the $c_i^{1}$ signal (potential carry bit of a carry-select block coming out of the slave latch in Figure 3) to the wait signal and has a slack time of over 30% of the clock period.
The statistical speed performance of the proposed example is shown in Table 4, compared with previous early-completion-prediction designs for which performance data are available in the literature. The work in [1] refers to a variable-latency 32-bit carry-select ALU, the work in [13] refers to a variable-latency 32-bit Brent-Kung adder implementation, and the work in [9] refers to an asynchronous early-completion 32-bit Brent-Kung adder. In Table 4, the speed-up value refers to the average improvement with respect to a fixed-latency synchronous implementation of the same adder design. In the proposed approach, a fixed-latency implementation is simply obtained by eliminating the completion prediction logic, with no modification of the conventional adder structure. Performance results are reported for random uniformly distributed input operands and for real operands obtained from execution traces of the SPECint benchmark suite, except for [9], for which the published performance refers to the Dhrystone and Espresso benchmarks. The reported performance values are taken directly from the results claimed in [1,9,13]. In all cases, the proposed adder attains a higher speed-up than the previously published works.
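How a statistical cycle count arises for uniformly distributed random operands can be sketched in Python as follows. This is an illustration and does not reproduce Table 4. It assumes that each dependency bit is "1" independently with probability `q` (for uniform operands and all-propagate blocks of `blk` bits, `q` would be `2**-blk`) and that the extra cycle count `k` is the number of effective chain-length thresholds reached. Both the function names and the threshold set are hypothetical.

```python
from fractions import Fraction
from itertools import product


def longest_run(bits):
    """Length of the longest run of consecutive 1s."""
    best = run = 0
    for bit in bits:
        run = run + 1 if bit else 0
        best = max(best, run)
    return best


def average_cycles(num_dep_bits, thresholds, q):
    """Exact expected cycle count 1 + k over all dependency-bit patterns."""
    q = Fraction(q)
    total = Fraction(0)
    for bits in product((0, 1), repeat=num_dep_bits):
        ones = sum(bits)
        prob = q ** ones * (1 - q) ** (num_dep_bits - ones)
        k = sum(1 for j in thresholds if longest_run(bits) >= j)
        total += prob * (1 + k)
    return total
```

For seven dependency bits, `q = Fraction(1, 16)`, and the illustrative thresholds `(1, 6)`, `float(average_cycles(7, (1, 6), Fraction(1, 16)))` is about 1.36 cycles per addition. Dividing a fixed worst-case cycle count by such an average gives a rough speed-up figure, under the further assumption that a fixed-latency design always spends the worst-case count at the same clock.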

Power Saving and NBTI Mitigation Estimation.
The average power consumption of the designed unit, simulated at 1 V power supply with random inputs at 6 GHz, is 0.148 mW, subdivided into 0.127 mW dynamic and 0.021 mW leakage power consumption. The statistical speed advantage over a fixed-latency implementation can be effectively used for reducing power consumption at the same operations/second throughput, by means of $V_{DD}$ reduction and clock period relaxation [2]. In the proposed design example, the supply voltage that gives the same average throughput as a fixed-latency implementation is $V_{DD}$ = 0.6 V, with a relaxed clock period of 330 ps. Table 5 shows the resulting energy saving with respect to the fixed-latency design. The normalized results are compared with the energy saving attained by the variable-latency carry-select design in 70 nm CMOS described in [2], for which power saving data are available in the literature and for which the supply voltage compatible with the same throughput as the fixed-latency implementation is reported as 0.77 V.
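As a rough first-order check on the voltage scaling quoted above, and not a reproduction of Table 5, one can assume that the dynamic energy per operation scales with the square of the supply voltage for a fixed switched capacitance $C$. This scaling law is an assumption introduced here; leakage and any change in switching activity are ignored.

```latex
\frac{E_{\mathrm{dyn}}(0.6\,\mathrm{V})}{E_{\mathrm{dyn}}(1.0\,\mathrm{V})}
  \approx \frac{C\,(0.6\,\mathrm{V})^{2}}{C\,(1.0\,\mathrm{V})^{2}} = 0.36
```

Under that assumption, the dynamic energy per operation would drop by roughly 64% when the supply is lowered from 1.0 V to 0.6 V.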
According to the characterization of NBTI reported in [19], the shift in the PMOS threshold voltage $V_{th}$ caused by NBTI is strongly dependent on the $V_{DD}$ value. As a result, the same mechanism applied for power consumption reduction can be applied to NBTI mitigation. In the proposed design example, the statistical performance advantage allows a 40% reduction of $V_{DD}$ (from 1.0 V to 0.6 V), which results in a 35% reduction in the $V_{th}$ shift in one year of circuit operation, according to [19]. As a supplementary countermeasure, like any variable-latency unit, the proposed design can be equipped with guard band violation sensors in order to detect the NBTI effect on completion time and adjust the predicted latency, as shown in the variable-latency adder presented in [2].

Conclusions
A general methodology has been presented for synthesizing the prediction logic of early-completion-prediction adders (ECPAs), also known as variable-latency (VL) adders, which have been proposed for reducing leakage and NBTI failures in high-speed DSP datapaths realized in nano-CMOS technology.
While previous works present specific design cases, a general method for prediction logic synthesis is not available in the literature. The proposed method utilizes the well-known high-speed carry-select and hybrid carry-select/carry-lookahead structures as reference addition schemes, and the prediction logic does not affect the adder logic design in any way. The design method is implemented through a standard VLSI custom macrocell design tool chain. Finally, the methodology includes an automatic way to generate critical test patterns for the ECPA postlayout validation.
The resulting ECPA circuit complexity is competitive with conventional high-speed adders, as the hardware overhead is on the order of 10% of the adder logic (181 transistors against 1524 in the design example). A design case in 32 nm CMOS technology, simulated at postlayout SPICE BSIM4 level, sustains a 6 GHz clock frequency with correct cycle count predictions. Results on statistical speed performance advantage, power consumption reduction, and NBTI mitigation have been obtained with respect to a fixed-latency implementation of the same adder architecture.