Synchronous early-completion-prediction adders (ECPAs) are used in high-clock-rate, high-precision DSP datapaths because they complete most operations in a single cycle even when the worst-case carry propagation delay exceeds the clock period. Previous works have also demonstrated ECPA advantages in reducing average leakage and NBTI effects in nanoscale CMOS technologies. This paper presents a general, systematic methodology for designing ECPA units in nanoscale CMOS technologies, which is not yet available in the literature. The method is fully compatible with standard VLSI macrocell design tools and standard adder structures, and it includes the automatic definition of critical test patterns for postlayout verification. A design example is included, reporting speed and power data superior to those of previous works.
Fast integer adders are an essential component of most DSP datapaths. Synchronous early-completion-prediction adders (ECPAs) [
An ECPA consists of a conventional adder plus a completion-prediction logic unit (Figure
General scheme of an ECPA unit.
No general design methodology for ECPA VLSI cores has been proposed yet. In [
We present a detailed, general method for the design of completion-prediction logic in full-custom adder macrocells in nanoscale CMOS technologies, targeting carry-select and hybrid carry-select/carry-lookahead addition schemes. Notably, inserting the prediction logic unit does not modify the adder logic in any way. The proposed method supports the prediction of any cycle-count latency value, not only single-cycle and two-cycle latencies.
The paper presents the reference architecture template, the adder propagation delay model on which the design method is built, a detailed description of the design procedure, the general validation of the approach, and a specific example in 32 nm CMOS technology with speed and power results.
We start from the well-known carry-select addition scheme [
a conventional, low-area carry-select scheme with ripple adder circuitry in its adder blocks (in the following, CSA);
a hybrid scheme of carry-select with carry-lookahead circuitry in its adder blocks (in the following, CLA/CSA).
Generic scheme of carry-select addition.
In both schemes, we refer to the operand size as
A design option is the introduction of internal pipeline stages in the adder. As an ECPA unit aims to statistically exploit single-cycle operation latency, pipelining may yield relatively little benefit at the cost of a pipeline register delay overhead. Its effectiveness was studied by instruction-level simulations of SPECint benchmarks in [
The clocking strategy underlying the proposed high-speed ECPA design is a two-phase symmetric clock with dynamic logic and transparent latches [
ECPA architecture template and timing diagram of the adder, for a single-cycle and a two-cycle operation.
In the case of single-cycle operation, the adder operates as a normal single-cycle Domino circuit. In the case of multicycle operation, the inputs of the adder blocks must not change during the time needed to complete the selection chain and the sum generation. This is normally ensured by the input registers of the arithmetic unit in the datapath architecture. In general, depending on the operand values and on the clock cycle time, an addition can take 1 +
In the analyzed adder schemes, the addition proceeds as follows: after the propagate and generate bit vectors have been set up, any block having at least one propagate bit at “0” produces a valid anticipated output carry bit independent of its input carry bit. Then, the output carry bits of the remaining blocks are evaluated by means of the carry selection chain. The longest chain of unknown carry bits determines the duration of the latter operation [
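The chain-length reasoning above can be sketched in a few lines of Python. This is only an illustration: the function names, the 32-bit default operand width, and the 4-bit default block size are assumptions, and the sketch ignores the fact that the least significant block has a known input carry.

```python
def block_propagate_flags(a, b, n=32, m=4):
    """Return one flag per m-bit block: True when every propagate bit
    (a XOR b) in the block is 1, so the block's carry-out depends on
    its carry-in. Assumes n is a multiple of m."""
    p = (a ^ b) & ((1 << n) - 1)
    mask = (1 << m) - 1
    return [((p >> (j * m)) & mask) == mask for j in range(n // m)]


def longest_unknown_chain(a, b, n=32, m=4):
    """Length, in blocks, of the longest run of consecutive blocks whose
    carry-out is not known before the carry selection chain resolves it."""
    longest = run = 0
    for depends_on_carry_in in block_propagate_flags(a, b, n, m):
        run = run + 1 if depends_on_carry_in else 0
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    # Lower four blocks propagate, upper four blocks do not.
    print(longest_unknown_chain(0x0000FFFF, 0x00000000))
```

A block with any propagate bit at 0 breaks the run, which matches the observation that such a block produces its carry-out without waiting for the carry-in.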
Referring to Figure
In (
Equation (
We characterized the propagation delay of the CMOS circuits in the adder structure by means of NGSPICE simulation [
Delay estimates for different adder schemes, assuming
Scheme | Operand size (bits) | Delay term | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
CSA | 32 | Total delay | 31.7 | 29.3 | 28.1 | 27.4 | 27.2 | 28.0 | 29.1 | 29.4 | 30.8 |
| | | | 3.7 | 3.6 | 3.5 | 3.4 | 3.4 | 3.3 | 3.2 | 3.2 | 3.2 |
| | | | 9.2 | 11.0 | 12.8 | 14.7 | 16.5 | 18.2 | 19.8 | 21.6 | 23.4 |
| | | | 2.7 | 2.6 | 2.6 | 2.5 | 2.4 | 2.4 | 2.4 | 2.4 | 2.4 |
| | 64 | Total delay | 57.8 | 50.6 | 46.1 | 43.6 | 41.2 | 41.2 | 40.2 | 39.8 | 40.5 |
| | | | 4.0 | 4.0 | 3.9 | 3.8 | 3.7 | 3.6 | 3.6 | 3.5 | 3.4 |
| | | | 10.1 | 12.4 | 14.5 | 16.4 | 18.5 | 12.1 | 21.9 | 23.8 | 25.4 |
| | | | 3.0 | 2.9 | 2.8 | 2.7 | 2.7 | 2.6 | 2.6 | 2.6 | 2.5 |
CLA/CSA | 32 | Total delay | 29.3 | 26.0 | 23.9 | 22.6 | 21.8 | 22.0 | 22.5 | 22.8 | 23.8 |
| | | | 4.2 | 4.1 | 4.1 | 4.2 | 4.3 | 4.2 | 4.0 | 4.2 | 4.1 |
| | | | 4.2 | 5.0 | 5.9 | 7.0 | 8.3 | 9.5 | 10.9 | 12.6 | 14.3 |
| | | | 3.0 | 3.0 | 3.0 | 3.0 | 3.1 | 3.0 | 2.9 | 3.0 | 3.0 |
| | 64 | Total delay | 56.7 | 49.0 | 44.1 | 41.0 | 38.2 | 37.5 | 36.7 | 36.4 | 36.7 |
| | | | 4.4 | 4.5 | 4.6 | 4.6 | 4.8 | 4.7 | 4.8 | 5.0 | 4.8 |
| | | | 4.4 | 5.2 | 6.2 | 7.3 | 8.6 | 9.9 | 11.4 | 13.1 | 14.8 |
| | | | 3.2 | 3.2 | 3.3 | 3.2 | 3.4 | 3.4 | 3.4 | 3.5 | 3.4 |
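As an illustration of how the delay table can be consulted, the sketch below picks the column with the smallest total delay for a given scheme and operand size. It assumes that the column labels 4 to 12 denote the adder block size in bits, which the surviving table header does not state; the dictionary and function names are hypothetical.

```python
# Assumed meaning of the column labels 4..12: adder block size in bits.
BLOCK_SIZES = list(range(4, 13))

# "Total delay" rows copied from the delay table above.
TOTAL_DELAY = {
    ("CSA", 32): [31.7, 29.3, 28.1, 27.4, 27.2, 28.0, 29.1, 29.4, 30.8],
    ("CSA", 64): [57.8, 50.6, 46.1, 43.6, 41.2, 41.2, 40.2, 39.8, 40.5],
    ("CLA/CSA", 32): [29.3, 26.0, 23.9, 22.6, 21.8, 22.0, 22.5, 22.8, 23.8],
    ("CLA/CSA", 64): [56.7, 49.0, 44.1, 41.0, 38.2, 37.5, 36.7, 36.4, 36.7],
}


def fastest_block_size(scheme, operand_bits):
    """Return (block_size, total_delay) for the column with minimum total delay."""
    delays = TOTAL_DELAY[(scheme, operand_bits)]
    best = min(range(len(delays)), key=delays.__getitem__)
    return BLOCK_SIZES[best], delays[best]


if __name__ == "__main__":
    for key in TOTAL_DELAY:
        print(key, fastest_block_size(*key))
```

The minimum-delay column is only one possible selection criterion; an ECPA design may prefer a different column because the average latency, not the worst case, is what matters.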
According to (
Generic structure and timing diagram of the prediction logic for a single-cycle and a two-cycle operation.
(1) The first stage computes the dependency bit
(2) The second stage evaluates
To synthesize
Prediction table for the cycle count in two different ECPA schemes with
Column headers 16 to 24: cycle time in FO4 units.
Scheme | | | | | | | | 16 | 16.5 | 17 | 17.5 | 18 | 18.5 | 19 | 19.5 | 20 | 20.5 | 21 | 21.5 | 22 | 22.5 | 23 | 23.5 | 24 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CSA | 0 | — | — | — | — | — | — | | | | | | | | | | | | | | | | | 1 |
| | 1 | 0 | — | — | — | — | — | | | | | | | | | | | | | | | | | 1 |
| | — | 1 | 0 | — | — | — | — | | | | | | | | | | | | | | | | | 2 |
| | — | — | 1 | 0 | — | — | — | | | | | | | | | | | | | | | | | 2 |
| | — | — | — | 1 | 0 | — | — | | | | | | | | | | | | | | | | | 2 |
| | — | — | — | — | 1 | 0 | — | | | | | | | | | | | | | | | | | 2 |
| | — | — | — | — | — | 1 | 0 | | | | | | | | | | | | | | | | | 2 |
| | — | — | — | — | — | — | 1 | | | | | | | | | | | | | | | | | 2 |
CLA/CSA | 0 | — | — | — | — | — | — | | | | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1 | 0 | — | — | — | — | — | | | | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | — | 1 | 0 | — | — | — | — | | | | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |
| | — | — | 1 | 0 | — | — | — | | | | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | — | — | — | 1 | 0 | — | — | | | | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | — | — | — | — | 1 | 0 | — | | | | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | — | — | — | — | — | 1 | 0 | | | | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | — | — | — | — | — | — | 1 | | | | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
(3) The third stage converts the logic signals
The fast full-custom circuit implementation of the first two logic stages relies on dynamic Domino circuits precharged on the low clock half-cycle, according to the two-phase clocking strategy sketched in Figure
Critical circuit path in the prediction logic implementation.
The validation of the prediction logic design must address two issues: verifying the predicted cycle count correctness and verifying that the prediction unit evaluates the cycle count
To verify the correct prediction in a postlayout SPICE simulation, we can automatically generate the critical test patterns corresponding to the boundaries between different cycle count predictions shown by the rows of the truth table. In each row, only one
To test the prediction unit critical path delay in a postlayout SPICE-level simulation, the adder input to be used is the same as for the worst-case cycle count prediction; that is,
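The idea of boundary test patterns can be illustrated with a small generator that builds an operand pair whose longest run of carry-dependent blocks has a prescribed length. This is a sketch under assumptions, not the pattern equation of the method: the function names, the placement of the chain starting at block 1, and the choice of a fully generating block just below the chain are all hypothetical.

```python
def chain_test_pattern(chain_blocks, start_block=1, n=32, m=4):
    """Build operands (a, b) with exactly `chain_blocks` consecutive
    all-propagate blocks starting at `start_block`. The block below the
    chain generates a carry so that the chain is actually exercised."""
    if start_block < 1 or start_block + chain_blocks > n // m:
        raise ValueError("chain does not fit in the operand")
    block_mask = (1 << m) - 1
    # Generating block: a = b = all ones, so its propagate bits are 0
    # and its carry-out is 1.
    shift = (start_block - 1) * m
    a = block_mask << shift
    b = block_mask << shift
    # Propagating blocks: a = all ones, b = 0.
    for j in range(start_block, start_block + chain_blocks):
        a |= block_mask << (j * m)
    # All remaining blocks stay at a = b = 0 and break the chain.
    return a, b


def longest_chain(a, b, n=32, m=4):
    """Longest run of blocks whose propagate bits (a XOR b) are all 1."""
    p, mask = a ^ b, (1 << m) - 1
    longest = run = 0
    for j in range(n // m):
        run = run + 1 if ((p >> (j * m)) & mask) == mask else 0
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    for k in range(8):
        a, b = chain_test_pattern(k)
        print(k, hex(a), hex(b), longest_chain(a, b))
```

One pattern per chain length on either side of each cycle count boundary would then be applied in the postlayout simulation.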
Designing an ECPA with the proposed approach is an eight-step process. The first two steps are performed off-line, and their results can be reused in different ECPA designs.
(1) delay table compilation through SPICE characterization,
(2) prediction table compilation through delay tables, (
(3) adder architecture configuration and selection of the prediction table,
(4) column selection in the prediction table and truth table generation,
(5) generation of the test addition set through (
(6) synthesis and circuit design of the three prediction logic stages (from (
(7) adder circuit design through standard full-custom VLSI design tools,
(8) postlayout design validation through SPICE simulation of the prediction logic critical path and of the test addition set produced at step (5).
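The column selection and truth table generation step can be sketched as follows, using the CLA/CSA rows of the prediction table. The sketch assumes that the 14 surviving values of each row belong to the cycle times 17.5 to 24 FO4, and all names are hypothetical.

```python
# Cycle times (FO4 units) assumed to match the 14 values of each CLA/CSA row.
CYCLE_TIMES = [17.5 + 0.5 * i for i in range(14)]  # 17.5, 18.0, ..., 24.0

# One list per prediction table row, top to bottom.
CLA_CSA_ROWS = [
    [1] * 14,
    [2] * 5 + [1] * 9,
    [2] * 10 + [1] * 4,
    [2] * 14,
    [2] * 14,
    [2] * 14,
    [3] + [2] * 13,
    [3] * 4 + [2] * 10,
]


def select_column(rows, cycle_times, cycle_time_fo4):
    """Pick the predicted cycle count of every row for one cycle time."""
    col = cycle_times.index(cycle_time_fo4)
    return [row[col] for row in rows]


def collapse(column):
    """Merge consecutive rows with the same count into
    (first_row, last_row, cycle_count) entries of a truth table."""
    groups = []
    for i, count in enumerate(column):
        if groups and groups[-1][2] == count:
            groups[-1] = (groups[-1][0], i, count)
        else:
            groups.append((i, i, count))
    return groups


if __name__ == "__main__":
    column = select_column(CLA_CSA_ROWS, CYCLE_TIMES, 18.5)
    print(column)
    print(collapse(column))
```

Each boundary between two groups marks a row where the predicted cycle count changes, which is where a dependency bit is needed in the truth table and where a critical test pattern is placed.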
If postlayout simulation reports a prediction fault (i.e., predicted cycle count lower than real latency), one may permanently update the prediction table accordingly and repeat the synthesis process. All of the steps can be implemented in a software ECPA macrocell generator.
We show the case of a prediction unit designed for a 32-bit operand, 4-bit block size CLA/CSA ECPA, considering a 32 nm metal-gate high-K CMOS process characterized by an FO4 propagation delay of 8.7 ps. We target a clock frequency of 6 GHz, that is, 166.4 ps cycle time, equivalent to 18.5 FO4 delay units. Such a clock speed is extremely high for the reference CMOS process.
From Table
Truth table for the prediction unit of a 32-bit, 4-bit block, CLA/CSA ECPA for a cycle time equivalent to 18.5 FO4 delay units.
| | | Predicted | Binary output |
---|---|---|---|
0 | — | 0 | 0 0 |
1 | 0 | 1 | 0 1 |
— | 1 | 2 | 1 0 |
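The truth table can be read as a small priority encoder. The sketch below is a behavioral model only, with hypothetical names for the two input bits, whose symbols are not legible in the table header. It assumes that when both the first and the third row match, the row with the larger predicted value wins, which is the conservative choice.

```python
def predicted_value(first_bit, second_bit):
    """Behavioral model of the three truth table rows."""
    if second_bit:   # row (don't care, 1): predicted value 2
        return 2
    if first_bit:    # row (1, 0): predicted value 1
        return 1
    return 0         # row (0, don't care): predicted value 0


def binary_output(value):
    """Two-bit encoding of the predicted value, most significant bit first."""
    return (value >> 1) & 1, value & 1


if __name__ == "__main__":
    for first_bit in (0, 1):
        for second_bit in (0, 1):
            value = predicted_value(first_bit, second_bit)
            print(first_bit, second_bit, value, binary_output(value))
```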
The logic synthesis of the cycle count
State diagram of a synchronous FSM generating the wait signal.
Figure
Circuit implementation of the prediction logic example.
The prediction logic transistor count is 181, including the static logic for encoding the binary digits of number
The critical test cases resulting from (
Test pattern for predicted latency of 1 cycle (
Test patterns for predicted latency of 2 cycles (
Test pattern for predicted latency of 3 cycles (
The total hardware complexity of this ECPA unit is 1705 transistors, resulting from the prediction circuitry (181 transistors) plus the hybrid CLA/CSA adder (1524 transistors). Figure
Layout of the adder macrocell design example.
We tested the critical paths of the resulting circuits by NGSPICE simulation, verifying that all of the tests give correct prediction. An additional verification was performed at the logic level on the full architecture of the adder, by means of a nano-CMOS dedicated delay model [
SPICE simulation of the prediction logic example.
The statistical speed performance of the proposed example is shown in Table
Statistical performance data (average speedup with respect to fixed latency implementation).
Design | Random operands | Real operands |
---|---|---|
[ | n.a. | 1.79 |
[ | 1.19 | 1.01 |
[ | n.a. | 1.78 |
This work | 2.13 | 2.43 |
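For orientation, an average speedup of this kind can be computed from the distribution of predicted cycle counts. The sketch below assumes that both implementations run at the same clock period and that the fixed-latency unit always takes its worst-case cycle count; the function name and the example distribution are invented for illustration and are not data from the table.

```python
def average_speedup(fixed_latency_cycles, cycle_distribution):
    """Ratio between the fixed latency and the average variable latency.

    cycle_distribution maps a cycle count to the fraction of additions
    that complete in that many cycles; the fractions must sum to 1."""
    total = sum(cycle_distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    average_cycles = sum(c * f for c, f in cycle_distribution.items())
    return fixed_latency_cycles / average_cycles


if __name__ == "__main__":
    # Hypothetical distribution: half of the additions take 1 cycle,
    # half take 2 cycles, against a fixed latency of 3 cycles.
    print(average_speedup(3, {1: 0.5, 2: 0.5}))
```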
The average power consumption of the designed unit simulated at 1 V power supply with random inputs at 6 GHz is 0.148 mW, subdivided in 0.127 mW dynamic and 0.021 mW leakage power consumption.
The statistical speed advantage over a fixed-latency implementation can be effectively used for reducing power consumption at the same operations/second throughput, by means of
Power saving over fixed-latency implementation having the same operations/second throughput.
Design | | Dynamic power saving | Leakage power saving | Total power saving |
---|---|---|---|---|
[ | 0.77 V | 36% | 77% | 44% |
This work | 0.60 V | 78% | 89% | 81% |
According to the characterization of NBTI reported in [
A general methodology has been presented for synthesizing the prediction logic of early-completion-prediction adders (ECPAs), also known as variable-latency (VL) adders, which have been proposed for reducing leakage and NBTI failures in high-speed DSP datapaths realized in nano-CMOS technology.
While previous works present specific design cases, a general method for prediction logic synthesis is not available in the literature. The proposed method uses the well-known high-speed carry-select and hybrid carry-select/carry-lookahead schemes as reference addition schemes, and the prediction logic does not affect the adder logic design in any way. The design method is implemented through a standard VLSI custom macrocell design tool chain. Finally, the methodology includes an automatic way to generate critical test patterns for ECPA postlayout validation.
The resulting ECPA circuit complexity is competitive with conventional high-speed adders, as the hardware overhead is only 10% of the adder logic. A design case in 32 nm CMOS technology, simulated at postlayout SPICE BSIM4 level, sustains a 6 GHz clock frequency with correct cycle count predictions. Results on statistical speed performance advantage, power consumption reduction, and NBTI mitigation have been obtained with respect to a fixed-latency implementation of the same adder architecture.