Redundant Logic Insertion and Latency Reduction in Self-Timed Adders



Introduction
The 2009 International Technology Roadmap for Semiconductors (ITRS) on design predicts that adaptive digital circuits will become increasingly necessary as a consequence of increasing variability [1]. This is owing to a blurring of the boundary between catastrophic faults caused by manufacturing defects and parametric faults resulting from device and interconnect variability. The ITRS [1] projects a growing requirement for asynchronous global signaling and emphasizes the need for continued development of asynchronous logic and circuit design tools. This is significant in the context of a key challenge faced in modern IC design, namely, distributing a centralized clock signal throughout the chip with acceptably low skew while keeping the power, congestion, and area costs of traditional repeater insertion in long global clock lines to a minimum. Indeed, as variability increases, circuits can exhibit faulty behavior similar to that caused by catastrophic defects. The major sources of failure include (i) process variations: statistical variations of device parameters such as channel length, threshold voltage, and mobility; (ii) lifetime variations: variations that shift physical parameters over the operating life of a circuit; and (iii) intrinsic noise: noise sources (shot noise, thermal noise, and random noise) that are inherent to normal device operation and become dominant at small feature sizes. At a time when variability has become prominent and reliability tends to assume greater significance than quality of results in nanometer-scale digital circuits, the self-timed design paradigm offers an attractive alternative to conventional synchronous design. In fact, self-timed logic circuits are inherently tolerant of process, temperature, and parameter uncertainties [2][3][4][5][6]. A recent work [7] by Chelcea et al. demonstrated the superior resiliency of asynchronous circuits vis-à-vis their synchronous counterparts in the presence of parametric variations (probabilistic device delays) for the case of a 32-bit Brent-Kung adder and a 16-bit multiplier.

VLSI Design
Self-timed circuits also have better electromagnetic compatibility [8] and noise susceptibility attributes compared to synchronous designs [9], consume power only where and when active [10,11], and feature excellent design reusability [12]. Moreover, self-timed circuits are self-checking [13,14] and are latency insensitive thus being naturally elastic or adaptive.
Although the term "self-timed" has been used to refer to asynchronous circuits, it is important to note that self-timed circuits actually constitute a robust class of asynchronous circuits, namely, input/output mode circuits. In general, circuits corresponding to the input/output operating mode do not impose timing assumptions on when the environment should respond to the circuit. The robustness attribute in self-timed circuits usually results from employing a delay-insensitive (DI) code for data representation, communication, and processing, and a 4-phase (return-to-zero) handshake signaling convention is commonly adopted. Among the family of DI codes [15], the dual-rail (1-of-2) code is widely preferred owing to its simplicity and ease of logic implementation.
According to dual-rail data encoding, each data wire d is represented using two encoded data wires d0 and d1, as shown in Figure 1. A transition on the d0 wire indicates that a zero has been transmitted, while a transition on the d1 wire indicates that a one has been transmitted. Since the request signal is embedded within the data wires, a transition on either d0 or d1 informs the receiver about the validity of data. The condition of both d0 and d1 being zero at the same time is referred to as the spacer or empty data. Simultaneous transitions on both d0 and d1 are illegal, since the coding scheme is unordered [16]; that is, no codeword forms a subset of another codeword.
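The dual-rail convention described above can be sketched in a few lines of Python. This is only an illustrative model: the function names `encode_dual_rail` and `decode_dual_rail` are hypothetical, and wires are represented as a `(d0, d1)` tuple of logic levels.

```python
SPACER = (0, 0)  # both rails low: empty data

def encode_dual_rail(bit):
    """Map a single bit to its (d0, d1) pair: 0 -> (1, 0), 1 -> (0, 1)."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (1, 0) if bit == 0 else (0, 1)

def decode_dual_rail(d0, d1):
    """Return 0 or 1 for valid data, None for the spacer; reject (1, 1)."""
    if (d0, d1) == SPACER:
        return None                      # no data present yet
    if (d0, d1) == (1, 1):
        raise ValueError("illegal codeword: both rails asserted")
    return 1 if d1 == 1 else 0
```

Decoding returns `None` for the spacer so that a caller can distinguish "no data yet" from a valid zero.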
With reference to Figure 1, the 4-phase handshake protocol is explained as follows (the explanation remains valid for data representation using any DI data encoding scheme).
(i) The dual-rail data bus is initially in the spacer state. The sender transmits the codeword (valid data). This results in "low" to "high" transitions on the bus wires that correspond to nonzero bits of the codeword (i.e., one rail of every dual-rail signal is assigned the logic "high" state).
(ii) After the receiver receives the codeword, it drives the ackout (ackin) wire "high" ("low").
(iii) The sender waits for the ackin to go "low" and then resets the data bus (i.e., the data bus is driven to the spacer state).
(iv) After an unbounded but finite (positive) amount of time, the receiver drives the ackout (ackin) wire "low" ("high"). A single transaction is now said to be complete, and the system is ready to proceed with the next transaction.
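The four steps above can be traced with a small behavioral sketch. It is not a timing-accurate model: the function name `four_phase_transaction` is hypothetical, and ackin is assumed to be simply the complement of ackout, as the "high" ("low") pairing in the text suggests.

```python
def four_phase_transaction(bits):
    """Return the (step, bus, ackout, ackin) states of one 4-phase transaction."""
    spacer = [(0, 0)] * len(bits)
    # Dual-rail codeword as (d0, d1) pairs: 0 -> (1, 0), 1 -> (0, 1).
    valid = [(1, 0) if b == 0 else (0, 1) for b in bits]
    trace = []

    def record(step, bus, ackout):
        # Assumption: ackin is modelled as the complement of ackout.
        trace.append((step, tuple(bus), ackout, 1 - ackout))

    record("initial spacer", spacer, 0)
    record("(i) sender places valid codeword", valid, 0)
    record("(ii) receiver raises ackout", valid, 1)
    record("(iii) sender resets bus to spacer", spacer, 1)
    record("(iv) receiver lowers ackout", spacer, 0)
    return trace

for state in four_phase_transaction([1, 0]):
    print(state)
```

The final state equals the initial one, which is what makes the system ready for the next transaction.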
The timing diagram for the 4-phase asynchronous signaling protocol is shown in Figure 2, with the request (req) signal, which is actually embedded within the data wires, explicitly shown to describe the handshaking. The dual-rail code is the simplest member of the general family of delay-insensitive m-of-n codes [15], where m lines are asserted "high" out of a total of n physical lines to represent a codeword. The size (i.e., number of unique symbols) of a generic m-of-n code is given by the binomial coefficient C(n, m) = n!/(m!(n − m)!). The dual-rail code is ideally suited for representing a single bit of binary information.
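As a quick check of the code-size formula, the following sketch enumerates the codewords of an m-of-n code and tests the unordered property (no codeword covers another). The helper names `m_of_n_codewords` and `is_unordered` are illustrative and not taken from the cited literature.

```python
from itertools import combinations
from math import comb

def m_of_n_codewords(m, n):
    """All n-wire codewords with exactly m wires asserted."""
    return [tuple(1 if i in chosen else 0 for i in range(n))
            for chosen in combinations(range(n), m)]

def is_unordered(codewords):
    """True if no codeword's asserted wires are a subset of another's."""
    def covers(x, y):
        return all(xi >= yi for xi, yi in zip(x, y))
    return not any(covers(x, y)
                   for x in codewords for y in codewords if x != y)

for m, n in [(1, 2), (1, 4), (2, 4)]:
    words = m_of_n_codewords(m, n)
    print(f"{m}-of-{n}: {len(words)} codewords, unordered={is_unordered(words)}")
```

Every m-of-n code is unordered by construction, since all codewords assert the same number of wires; the check only fails for mixed sets such as `[(1, 0), (1, 1)]`.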
To represent two bits of information, the dual-rail code can be concatenated as shown in Table 1 or can equivalently be represented through a 1-of-4 code.
The 1-of-4 encoded values of single-rail inputs given in Table 1 represent only one of many possible encodings, and an arbitrary choice is portrayed here. Two binary bits of information are represented by asserting only half as many physical lines as logic "high" in the 1-of-4 code as in the dual-rail code, although both coding schemes require the same number of physical lines. As a result, the 1-of-4 encoding scheme experiences only half the transitions of the dual-rail encoding convention. Thus the dynamic power dissipation of the former scheme is likely to be better than that of the latter due to reduced switching activity. This phenomenon was confirmed with the practical example of an ARM Thumb instruction decoder [17]. However, considering the additional encoding and decoding circuitry required for realizing 1-of-4 encoded self-timed data paths in comparison with dual-rail encoded self-timed data paths [18], the power savings gained by the former might diminish.
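The halving of switching activity can be illustrated by counting wire transitions per 4-phase cycle for two bits of data. The 1-of-4 assignment used below (value v asserts wire v) is an arbitrary assumption made for this sketch and is not necessarily the assignment of Table 1; all function names are hypothetical.

```python
def dual_rail_pair(b1, b0):
    """Two concatenated dual-rail signals: four wires, two asserted."""
    enc = lambda b: (1, 0) if b == 0 else (0, 1)
    return enc(b1) + enc(b0)

def one_of_four(b1, b0):
    """Assumed one-hot assignment: value 2*b1 + b0 asserts that wire index."""
    index = 2 * b1 + b0
    return tuple(1 if i == index else 0 for i in range(4))

def transitions_per_cycle(codeword):
    """Each asserted wire rises once and falls once in a return-to-zero cycle."""
    return 2 * sum(codeword)

for b1 in (0, 1):
    for b0 in (0, 1):
        dr = transitions_per_cycle(dual_rail_pair(b1, b0))
        q = transitions_per_cycle(one_of_four(b1, b0))
        print(f"bits {b1}{b0}: dual-rail {dr} transitions, 1-of-4 {q} transitions")
```

For every input value the dual-rail pair makes four transitions per cycle and the 1-of-4 code makes two, independent of the particular one-hot assignment chosen.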
Although higher order encoding schemes are available, apart from the dual-rail code that allows easier mapping between conventional binary functions, the other widely used DI code is the 1-of-4 code. This is owing to the reason that for self-timed data paths, encoding by sender and membership test and decoding by receiver are important aspects, and consequently the encoding and decoding complexity is dependent on the message space to be coded [19]. When the dual-rail code and 1-of-4 code are used to represent exactly one bit and two bits of binary information, respectively, they are said to be complete [14]. A code is said to be complete if and only if it contains all code words as implied by its definition. Even with one missing codeword, it would be labeled incomplete. A DI coding scheme, in general, is required to be unordered and complete.
Seitz classified a self-timed logic circuit into two robust categories on the basis of its indicating (acknowledging) genre as strongly indicating and weakly indicating [20]. It was also shown therein that a legal interconnection of strongly or weakly indicating logic circuits gives rise to a larger strong or weak-indication logic circuit.
[Table 1 column headings: Single-rail inputs; Dual-rail encoded data; 1-of-4 encoded data.]
(i) Strong Indication. In this case, the self-timed circuit waits for all of its inputs (valid/spacer) to arrive before it starts to produce all the outputs (valid/spacer). The sequencing constraints are given below.
(a) All the inputs become defined (valid)/undefined (spacer) before any output becomes defined/undefined; that is, any or all of the output(s) become defined/undefined only after all the inputs have become defined/undefined.
(b) All the outputs become defined/undefined before any input becomes undefined/defined.
(ii) Weak Indication. According to this, the self-timed circuit is allowed to produce any of the outputs (valid/spacer) even with a subset of the inputs (valid/ spacer). However, Seitz's weak timing specifications require that at least one output (valid/spacer) should not have been produced until after all the inputs (valid/spacer) have arrived. The sequencing constraints in this case are as follows.
(a) Some inputs become defined (undefined) before some outputs become defined (undefined); that is, some outputs could become defined (undefined) only after at least some inputs have become defined (undefined).
(b) All the inputs become defined (undefined) before all the outputs become defined (undefined); that is, all the outputs could become defined (undefined) only after all the inputs have become defined (undefined).
(c) All the outputs become defined (undefined) before any input becomes undefined (defined).
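The sequencing constraints for a single wavefront (valid or spacer) can be checked against observed arrival times, as in the sketch below. This is a simplification: it only examines constraints (a) and (b) within one phase, it ignores the cross-phase constraint that all outputs settle before any input changes again, and the function name `classify_indication` is hypothetical.

```python
def classify_indication(input_times, output_times):
    """Classify one wavefront from the arrival times of inputs and outputs."""
    first_in, last_in = min(input_times), max(input_times)
    first_out, last_out = min(output_times), max(output_times)

    # Strong: no output appears until every input has arrived.
    if first_out > last_in:
        return "strong"
    # Weak: some output may be early, but at least one input precedes the
    # first output and the last output waits for the last input.
    if first_out > first_in and last_out > last_in:
        return "weak"
    return "violation"

print(classify_indication([1, 2, 3], [4, 5]))   # all outputs after all inputs
print(classify_indication([1, 2, 5], [3, 6]))   # one early output, one late
print(classify_indication([1, 2, 5], [3, 4]))   # every output before last input
```

A trace that satisfies the strong constraints also satisfies the weak ones; the function reports the strictest category that holds.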
The signaling scheme for strong-and weak-indication timing regimes in terms of the input-output characteristics is illustrated graphically in Figure 3, which summarizes the sequencing constraints mentioned above. In general for iterative circuits, weakly indicating implementations are preferable compared to strongly indicating versions since the former's computation time is data dependent for valid data and may exhibit constant latency for spacer data, while the latter is always bound by worst-case latency for both valid data and spacers [21].

Redundant Logic Insertion
This section deals with an efficient method of reducing the critical path delay of self-timed adders by means of a novel concept called redundant logic insertion. In general, the concept can be extended to effect latency reduction in any iterative logic circuit that comprises a cascade of basic building blocks. Redundancy insertion implies the inclusion of extra (redundant) logic in a nonredundant implementation, without modifying the function that the implementation synthesizes, in order to speed up the propagation of certain signals that are required to drive the subsequent stages of a circuit cascade.
Logic redundancy can be incorporated into a self-timed circuit implementation by careful duplication of similar logic, and this can lead to multiple acknowledgements, which might be useful in simplifying the timing assumptions. Additionally, this procedure could facilitate faster reset of logic during the return-to-zero phase with a constant latency. Logic redundancy achieved through input-incomplete gates basically introduces the weak-indication property into the circuit, as it relaxes the indication constraints of those outputs that are considered as candidates for optimization. (Input-incomplete gates need not wait for the arrival of all their inputs to produce the required output under all scenarios; examples include AND gates and OR gates. If any one of its inputs is assigned a 0 (1), the output of the AND gate (OR gate) becomes a 0 (1).) Logic redundancy can be either implicit or explicit in the circuit. The drawbacks of this approach are small increases in area and power. Since logic duplication is involved, switching activity would increase due to multiple acknowledgements, consequently pushing up the dynamic power and resulting in increased average power dissipation. However, the area and power overheads may be marginal, depending upon the functionality, its initial nonredundant implementation, and eventually the degree of logic redundancy introduced. We will now consider some case studies to demonstrate the benefits of redundancy insertion on the basis of the self-timed ripple carry adder (RCA) architecture, where logic redundancy is targeted towards the carry output function since the carry is required to propagate between successive stages of the adder architecture.

Implicit Logic Redundancy.
The basic equations corresponding to a dual-rail encoded full adder are given by (1)-(4). Here (a0, a1), (b0, b1), and (cin0, cin1) represent the dual-rail encoded augend, addend, and carry inputs of the adder, while (Sum0, Sum1) and (Cout0, Cout1) represent the dual-rail encoded sum and carry outputs of the adder, respectively. The circuit shown in Figure 4 corresponds to our synthesized dual-rail encoded full adder, henceforth referred to as the SSSC DRE adder (single sum, single carry dual-rail encoded adder). Three steps are involved in the synthesis process: (i) deriving the minimum orthogonal sum-of-products form of a given logic functionality [22], (ii) speed-independent decomposition of logic to facilitate realization using standard cells [23], and (iii) performing logic optimizations to pave the way for latency reduction. In the figures, the C-element is represented by the AND gate symbol with the marking C on its periphery. (The Muller C-element governs the rendezvous of input signals. It produces a 1 (0) if all its inputs are 1 (0); otherwise it retains its existing steady state. The C-element (also called C-gate) is classified as an input-complete gate, as it waits for the arrival of all its input(s) to produce the desired output.)
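The difference between an input-complete C-element and an input-incomplete AND gate can be seen in a short behavioral sketch. The class `CElement` and function `and_gate` are hypothetical names for illustration; this is a zero-delay model, not one of the gate-level models used in the paper.

```python
class CElement:
    """Muller C-element: output follows the inputs only when they all agree."""

    def __init__(self, state=0):
        self.state = state

    def update(self, *inputs):
        if all(v == 1 for v in inputs):
            self.state = 1          # all inputs high: output rises
        elif all(v == 0 for v in inputs):
            self.state = 0          # all inputs low: output falls
        return self.state           # otherwise the previous state is held

def and_gate(*inputs):
    """Input-incomplete for falling outputs: a single 0 forces the output low."""
    return 1 if all(v == 1 for v in inputs) else 0

c = CElement()
for x, y in [(1, 0), (1, 1), (1, 0), (0, 0)]:
    print(f"inputs=({x},{y})  C-element={c.update(x, y)}  AND={and_gate(x, y)}")
```

For rising outputs the two gates behave identically, since both need all inputs high. When one input returns to zero, the AND gate resets at once while the C-element waits for the remaining input.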
Firstly, it can be noticed that the responsibility of indication is confined to the sum outputs of the adder block, thereby freeing the carry signal from indication constraints, which facilitates fast carry propagation. Even with the arrival of a subset of the inputs, the carry outputs could become defined/undefined, while the sum outputs would have to wait for the arrival of all the inputs to become defined/undefined. Thus the full adder satisfies Seitz's weak-indication timing constraints. This style of implementation is labeled as the biased approach [24], as there is no distribution of inputs indication between the primary outputs. In other words, the primary outputs are not collectively responsible for acknowledging the arrival of all the primary inputs and internal outputs. Our proposed synthesis solution corresponds to a direct synthesis strategy and differs from the method presented in [24] in that the latter generates a dual-rail asynchronous gate pair or a delay-insensitive minterm synthesis (DIMS) equivalent [25] of each synchronous logic gate. In fact, the process of generating a dual-rail asynchronous gate pair for a synchronous logic gate is based on the dual-rail combinational logic style [26,27]. The asynchronous dual-rail gate pair equivalent or the DIMS equivalent of each synchronous logic gate is eventually realized using proprietary null convention logic (NCL) macros [28], which are constructed on the basis of threshold logic [29].
Secondly, the full adder block depicted in Figure 4 features implicit logic redundancy. The intermediate gate output functions "int1" and "int2" are found embedded within the logic producing the carry outputs Cout0 and Cout1, respectively, however, in their input-incomplete forms. The principal advantages of this full adder with respect to the n-bit self-timed RCA architecture shown in Figure 5 are (i) fast carry output production and propagation when carry-generate (a1 = b1 = 1) and carry-kill (a0 = b0 = 1) conditions occur and (ii) reset of the entire adder circuitry with an approximate propagation delay of only two full adders during the return-to-zero phase, regardless of the adder size. The latter advantage results from the fact that the intermediate dual-rail output carries of all the full adder modules connected in a cascade could be reset in parallel as the dual-rail encoded augend and addend inputs of every adder stage are reset. Subsequently, the dual-rail sum outputs of all the adder stages would be reset as their input carries assume a spacer state. This leads to a constant latency operation for application of spacer data, while data-dependent latency would manifest when valid data is applied. Indeed, this attribute becomes inherent in all the redundancy incorporated self-timed adders. The worst-case latency results when the carry-propagate mode is activated with respect to all the individual full adder stages; this happens for the case when a1 = b0 = 1 or a0 = b1 = 1.
Figure 5: n-bit dual-rail encoded self-timed carry-ripple adder structure.
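The data dependence of the valid-phase latency can be approximated by the longest run of consecutive carry-propagate stages for a given operand pair. The sketch below works on single-rail operand values and uses the run length as a rough proxy for the carry ripple depth; it is not the timing model used for the reported simulations, and the function names are hypothetical.

```python
def stage_mode(a_bit, b_bit):
    """Carry behavior of one full adder stage for single-rail bits a and b."""
    if a_bit == 1 and b_bit == 1:
        return "generate"    # dual-rail condition a1 = b1 = 1
    if a_bit == 0 and b_bit == 0:
        return "kill"        # dual-rail condition a0 = b0 = 1
    return "propagate"       # a1 = b0 = 1 or a0 = b1 = 1

def longest_propagate_run(a, b, n):
    """Longest chain of adjacent stages whose carry must wait for carry-in."""
    longest = run = 0
    for k in range(n):
        mode = stage_mode((a >> k) & 1, (b >> k) & 1)
        run = run + 1 if mode == "propagate" else 0
        longest = max(longest, run)
    return longest

print(longest_propagate_run(0b1111, 0b0000, 4))  # every stage propagates
print(longest_propagate_run(0b1111, 0b1111, 4))  # every stage generates
print(longest_propagate_run(0b0110, 0b0001, 4))  # mixed operands
```

When every stage is in carry-propagate mode the run spans the whole adder, which corresponds to the worst case described above.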
The SSSC DRE adder shares some properties with Martin's full adder [30], which is nevertheless a stand-alone full-custom transistor-level realization.

Explicit Logic Redundancy.
We now consider a variety of scenarios where logic redundancy is explicit in a circuit design. To this end, we analyze some adder circuits which employ a uniform DI data encoding protocol (dual-rail encoding) for both primary inputs and outputs, or a combination of DI codes (dual-rail and 1-of-4 codes) for primary inputs, but a single DI code (dual-rail code) for the primary outputs.

Single-Bit Adder Based on Hybrid Input Encoding.
The term "hybrid input encoding" specifies a mix of at least two different DI data encoding schemes as adopted for the primary inputs. Considering the single-bit full adder block, the augend and addend input bits can be encoded using a 1-of-4 code, while the carry input and the primary outputs can adopt the dual-rail code; that is, hybrid encoding of primary inputs and uniform encoding of primary outputs are resorted to. The structure of the n-bit hybrid input encoded self-timed RCA is depicted by Figure 6, which is similar to the topology shown in Figure 5 with the exception that the augend and addend single-rail inputs are now encoded using the 1-of-4 code.
The general expressions governing a full adder block utilizing hybrid input encoding for inputs and dual-rail encoding for outputs are given below. In the equations that follow, (i0, i1, i2, i3) represents the 1-of-4 encoded equivalent of the single-rail adder inputs (a, b), with a single-rail to 1-of-4 data representation scheme adopted as shown in Table 1. The full adder block that synthesizes equation (2) inclusive of carry output logic optimization is portrayed by Figure 7. Henceforth, this adder module shall be identified as the SSSC HIE NRL adder (single sum, single carry hybrid input encoded nonredundant logic adder). As the name implies, all the gates that constitute this adder are irredundant. It can be observed from Figure 7 that the sum outputs are entrusted with the responsibility of inputs indication, while the carry outputs could evaluate to the correct state whenever the carry-kill or carry-generate condition occurs without having to wait for the carry input. Thus the SSSC HIE NRL adder corresponds to the weak-indication timing model.
The synthesized hybrid input encoded full adder block that incorporates logic redundancy is shown in Figure 8.
Here, gates C1 and C2 denote 2-input C-elements, while gates g1 and g2 represent 2-input AND gates. It can be noticed in the figure that the functions realized by C1 and C2 are identical to those implemented by g1 and g2, respectively, for the case of upgoing transitions. Hence, redundancy is explicit in the present design, henceforth referred to as the SSSC HIE RL adder (single sum, single carry hybrid input encoded redundant logic adder). With respect to this adder design, logic redundancy is found to be beneficial in two ways. During the spacer phase, all the sum outputs could be reset in a parallel fashion, as the dual-rail carry output of the kth stage of an n-bit adder could be reset based on its 1-of-4 encoded augend and addend inputs, and the dual-rail sum output of the (k + 1)th adder stage would depend only on the dual-rail carry input supplied by its preceding stage. There is also a benefit in terms of improving the computation speed during the valid data phase. This becomes evident on comparing the designs portrayed by Figures 7 and 8; it can be observed that the carry propagation delay is less in the case of the SSSC HIE RL adder (AND2, OR2 gate delays) than in the SSSC HIE NRL adder.
Figure 9: Dual-rail encoded n-bit RCA architecture comprising dual-bit adder blocks.

Dual-Bit Adder Utilizing Homogeneous Data Encoding.
We now analyze the effect of introducing redundant logic in a self-timed dual-bit adder module that employs homogeneous data encoding for both its primary inputs and outputs. The homogeneous encoding procedure refers to a similar DI data encoding protocol as adopted for all the primary inputs and outputs of a function block, here dual-rail data encoding. The dual-bit adder block consists of dual-rail encoded versions of five single-rail inputs, namely, a1, a0, b1, b0, and cin, and three single-rail outputs Cout, Sum1, and Sum0, where (a1, a0) and (b1, b0) represent the addend and augend inputs and cin the carry input. The output Cout is the carry output or overflow bit of the addition process, and Sum1 and Sum0 signify the most significant and least significant sum output bits, respectively.
The reduced orthogonal sum-of-products forms corresponding to the encoded outputs of the dual-bit adder are given below, expressed in terms of their encoded inputs. In an orthogonal sum-of-products form, the logical conjunction of any pair of product terms yields a null. The architecture of the n-bit self-timed carry-ripple adder structure that features dual-bit adder modules is shown in Figure 9. The synthesized dual-bit adder module is portrayed by Figure 10. It shall be referred to as the DSSC DRE adder module (dual sum, single carry dual-rail encoded adder) in the subsequent discussions. Figure 10 depicts the redundant AND gates (shaded gates) inserted into the DSSC DRE adder block. The nonredundant adder block would not feature the AND gates rg1 and rg2, and so one of the inputs for the OR2 gates producing Cout1 and Cout0 would be the outputs of C-elements (C1 and C2), which are the nets labeled as gn2 and gn3, respectively. In fact, gn2 and gn3 would be isochronic forks in the nonredundant version. Isochronic forks are the weakest compromise to delay insensitivity [31], leading to quasi-delay-insensitive (QDI) circuit implementations. According to the isochronicity assumption, if a transition on a wire fork is acknowledged, then the transition on the other wire forks is also said to be acknowledged. It was shown in [32] that QDI circuits which include isochronic fork assumptions can be realized even in nanometer-scale dimensions. Indeed, QDI circuits are the practically implementable DI circuits, and they constitute the robust class of self-timed circuits. In the redundant dual-bit adder shown in Figure 10, the OR2 gates producing Cout1 and Cout0 have gn1 and gn4 feeding as inputs, respectively. For the case of low-to-high transitions, the AND gates rg1 and rg2 are functionally equivalent to C-elements C1 and C2. The gate output node labeled "isf" signifies an isochronic fork junction.
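The orthogonality property (every pair of product terms has a null conjunction) is easy to test mechanically. The sketch below uses the single-rail carry function of a one-bit full adder as a stand-in example, not the dual-bit adder expressions of this section; the term representation and the names `disjoint` and `is_orthogonal` are assumptions made for illustration.

```python
from itertools import combinations, product

def disjoint(t1, t2):
    """Two product terms conflict if some shared variable has opposite polarity."""
    return any(var in t2 and t2[var] != val for var, val in t1.items())

def is_orthogonal(terms):
    """True if the conjunction of every pair of product terms is null."""
    return all(disjoint(p, q) for p, q in combinations(terms, 2))

def evaluate(terms, assignment):
    """Evaluate a sum of products under a complete variable assignment."""
    return int(any(all(assignment[v] == val for v, val in t.items())
                   for t in terms))

# Carry of a one-bit full adder: each term maps a variable to its literal value.
orthogonal_carry = [{"a": 1, "b": 1},
                    {"a": 0, "b": 1, "cin": 1},
                    {"a": 1, "b": 0, "cin": 1}]
plain_carry = [{"a": 1, "b": 1}, {"b": 1, "cin": 1}, {"a": 1, "cin": 1}]

print(is_orthogonal(orthogonal_carry), is_orthogonal(plain_carry))

# Both forms implement the same function on all eight input combinations.
for a, b, cin in product((0, 1), repeat=3):
    env = {"a": a, "b": b, "cin": cin}
    assert evaluate(orthogonal_carry, env) == evaluate(plain_carry, env)
```

Because at most one product term of an orthogonal form can be active for any input, at most one of the gates realizing those terms fires per valid codeword.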
Referring to Figure 10, it can be observed that an upgoing transition on the fork isf (isf↑) would be followed by either gn2↑ or gn3↑ in case of the nonredundant DSSC DRE adder block and by (gn1↑, gn2↑) or (gn3↑, gn4↑) in case of the DSSC DRE adder module that incorporates logic redundancy; this explains the possible multiple acknowledgements. It can be observed that there is a possibility for fast or eager reset during the return-to-zero phase as a result of introducing logic redundancy into the adder. During the spacer phase, all the sum outputs could be reset in a parallel fashion, as the carry output of the previous dual-bit adder stage could be reset even by its corresponding augend and addend inputs without having to wait for an input carry from the preceding stage. The advantage of latency reduction gained by introduction of redundant logic is due to the lower data path delay encountered, as the critical path in every dual-bit adder stage contains input-incomplete gates instead of a mix of input-complete and input-incomplete gates as in the original nonredundant version.

Dual-Bit Adder
Heterogeneous data encoding refers to a mix of DI codes (here, dual-rail and 1-of-4 codes) used to encode the primary inputs and outputs of a self-timed logic circuit. A dual-bit adder block based on heterogeneous DI data encoding can represent the augend, addend inputs, and sum outputs by a 1-of-4 code, while the input and output carry signals can be represented using the dual-rail code. Adopting such an encoding scheme, the minimized expressions for the function block outputs are given below. It is to be noted that the 1-of-4 code assignments for the augend, addend inputs, and the sum outputs are the reverse of the assignments given in Table 1:
Cout0 = a0b3cin0 + a1b2cin0 + a2b1cin0
The dual-bit adder module that synthesizes (4)-(9) is shown in Figure 11. Henceforth, this adder shall be referred to as the DSSC HE adder (dual sum, single carry heterogeneously encoded adder). The DSSC HE adder block satisfies the weak-indication timing constraints. The 1-of-4 encoded sum outputs assume responsibility for indicating the arrival of all the adder inputs, while the dual-rail encoded carry output can be relaxed with respect to ensuring input completeness. Logic redundancy, as introduced into the DSSC HE adder module, is shown in the figure with the input-incomplete AND gates (shaded gates) marked as rg1 and rg2. The notation is similar to that of Figure 10, so the discussion of the previous section holds for this scenario too. As in the earlier case, the sum output(s) of the (i + 1)th dual-bit adder stage could be reset based on the carry input from the ith dual-bit adder stage, and there does not arise any need for resetting of the entire carry chain during the return-to-zero phase.
Figure 12: Heterogeneously encoded dual-bit adder based n-bit self-timed RCA.
Figure 13: (a) Self-timed system handling heterogeneously encoded inputs and outputs, (b) dual-rail to 1-of-4 encoder, (c) 1-of-4 to dual-rail decoder.
The n-bit self-timed carry-ripple adder architecture that encompasses heterogeneously encoded dual-bit adder modules is shown in Figure 12. The self-timed system configuration that supports the RCA topology is depicted in Figure 13. A subset of the dual-rail inputs (augends and addends) is 1-of-4 encoded before being fed to the function block for data processing, while the remaining inputs (dual-rail encoded input carry) are fed as such. The non-dual-rail outputs produced by the logic block (sum outputs) are decoded before being passed on to the next stage, while the dual-rail outputs (output carry) are driven directly to the next stage. The encoding and decoding costs equate to 28 and 12 transistors per bit, respectively.

Simulation Mechanism and Results
To demonstrate the usefulness of the proposed concept of logic redundancy insertion, simulations have been performed by considering a 32-bit self-timed RCA architecture. In this context, a subset of well-known self-timed design methods [25,33,34] is considered in this work. Various 32-bit self-timed RCAs were built by considering different adder building blocks-32 single-bit adder blocks or 16 dual-bit adder modules. Before discussing the simulation results, the mechanism of estimating the design metrics is elucidated. The delay parameter refers to the maximum propagation delay (critical path delay) encountered in the data path, which is a sum of the latencies of the input register and that of the combinatorial adder logic. The delay metric was estimated using PrimeTime. To avoid the notion of a clock source, a virtual clock was used as a remote reference to constrain the input and output ports of the design. The area and power metrics correspond to the input registers, completion detection logic, and the 32-bit combinatorial adder. The delay and power metrics consider estimated parasitics in addition to the parameters associated with actual components (gates). The area metric gives a combined account of the area of all the logic cells. The total/average power dissipation is the sum of dynamic and static power components, where dynamic power is in turn composed of switching and internal power consumption values. NC-Sim has been used for functional simulation and also to obtain the switching activity files corresponding to gate-level simulations of Verilog descriptions of various 32-bit self-timed adders. Input data were supplied to the adders at a time interval of 15 ns through a random test bench which models the environment. The switching activity files obtained were subsequently used for power estimation using PrimeTime PX. 
The simulations targeted a PVT corner of the 130 nm bulk CMOS standard cell library whose recommended supply voltage is 1.32 V and the ambient junction temperature is −40 °C. All the circuit inputs possess the driving strength of the minimum-sized inverter of the cell library, while the outputs are associated with a fanout-of-4 drive strength. Appropriate buffering for the input acknowledgement signal was provided where necessary to eliminate timing violations. Since identical registers and a similar completion detection circuit were used for all the 32-bit adders, the area and power metrics can be correlated with that of the function block, thus paving the way for a straightforward comparison between adders synthesized on the basis of different self-timed design methods. Strong/weak-indication adders corresponding to various self-timed design methods were constructed manually and were subsequently optimized for minimum latency by taking into account the physical constraints of the target cell library. (A 130 nm CMOS standard cell library was used. The maximum fan-in of the AND gate and OR gate in this library is 4 and 3, respectively. The granularity of the C-element ranges from 2 to 4 inputs, and the gate-level C-element models are given in [35].) The delay, area, and power metrics corresponding to the simulations of various nonredundant 32-bit self-timed RCAs are given in Table 2. The nature of indication of the different adders is mentioned within brackets in the 1st column of the table. The values specified within brackets in the 3rd column of the table signify the area of the respective individual single-bit/dual-bit self-timed adder block. The delay, area, and power parameters of the different redundant logic incorporated 32-bit self-timed RCAs are given in Table 3.
Introduction of logic redundancy into the dual-bit adder module synthesized on the basis of Toms and Edwards' approach [33] was not considered, since it would change the indication property of the original synthesis solution. Therefore, only the redundant versions of the other adders were considered for comparison in Table 3. By comparing the results given in Tables 2 and 3, it is found that logic redundancy insertion has enabled a mean delay reduction of 21.1%, with associated area and power penalties of 2.3% and 0.8%, respectively. On average, the increase in size of an individual self-timed single-bit/dual-bit adder module is found to be 2.8% after incorporating redundant logic.
With reference to the DSSC HE adder module shown in Figure 11, a further peephole optimization was carried out by merging the gate rg1 with the OR gate producing Cout1, and rg2 with the OR gate producing Cout0, and replacing these combinations with complex gates (AO12 cells). Simulations were repeated for this case study, and the delay, area, and power values corresponding to the 32-bit RCA comprising optimized redundant DSSC HE adder blocks are found to be 4 ns, 10953 µm², and 696.8 µW, respectively. The optimized redundant DSSC HE adder block occupies less area than the nonoptimized redundant DSSC HE adder block by 1.5%. Hence, the 32-bit self-timed RCA comprising a cascade of optimized redundant DSSC HE adder blocks exhibits reduced delay in comparison with the nonredundant DSSC HE adder module based 32-bit self-timed RCA by 31%. However, in terms of area and average power, the latter features reduced figures to the tune of 0.6% and 1.2%, respectively, compared to the former.

Conclusions
A new concept of redundant logic insertion was described in this paper that can be used to minimize the data path delay of self-timed arithmetic circuits. It was shown that introduction of logic redundancy is feasible with respect to many self-timed design methods, especially for synthesizing iterative logic specifications. The advantages of logic redundancy insertion have been demonstrated on the basis of 32-bit self-timed carry-ripple addition. It has been inferred from the simulation results that a significant reduction in latency could be achieved at the expense of only marginal increases in area and power metrics. It was also discussed how logic redundancy paves the way for constant latency operation by permitting fast reset when applying spacer data, while data-dependent latency is encountered for addition of valid data.