Break-before-Make CMOS Inverter for Power-Efficient Delay Implementation

A modified static CMOS inverter with two inputs and two outputs is proposed to reduce short-circuit current, so that delay can be increased with lower power overhead where slow operation is required. The circuit is based on a bidirectional delay element connected in series with the PMOS and NMOS switching transistors. It provides differences in the dynamic response so that the direct-path current in the next stage is reduced. The switching transistors are never ON at the same time. Characteristics of various delay element implementations are presented and verified by circuit simulations. A global optimization procedure is used to obtain the most power-efficient transistor sizing. The performance of the modified CMOS inverter chain is compared to the standard implementation for various delays. The energy (charge) per delay is reduced by up to 40%. The use of the proposed delay element is demonstrated by implementing a low-power delay line and a leading-edge detector cell.


Introduction
Serial connection of inverters is often used for implementing low-precision delay in digital systems. However, the delay of cascaded inverters is power efficient only for small delays, which leads to an excessive power loss when longer delays are required. Each inverter in the chain drains additional parasitic energy that is approximately equal to the dynamic energy required for changing its input. Often the number of delay stages in a chain is reduced at the expense of increased node capacitances as long as capacitive loads do not introduce excessive direct-path energy.
Direct-path current is a well-known source of internal dynamic power consumption in CMOS logic. In well-designed circuits, it is estimated to be less than 20% of the dynamic dissipation [1], but it may increase prohibitively in circuits with significant capacitive loads. The problem is efficiently solved if the NMOS and PMOS gates of the CMOS inverter are driven by separate, time-skewed signals. This solution has been applied for large capacitive loads in [2] and later in [3][4][5]. All of these circuits have additional driving stages inserted in front of the split inverter inputs. The overhead of the additional components in terms of area and power consumption is justified only if it is outweighed by the savings obtained in the driving stages of large capacitive loads. At the gate level, the overhead is hardly justified since the loads are small. Other gate-level techniques have also been proposed with the aim of reducing internal static power due to leakage currents in nanometer technologies [6,7]. These techniques come with an area overhead and do not improve internal dynamic power.
The solution proposed in Figure 1 addresses the internal dynamic power consumption problem by inserting a bidirectional delay element at the inverter output to provide time-skewed signals for the next split-input inverter stage. The proposed structure provides break-before-make (BBM) switching with very low component overhead. No additional stages are needed. The overhead is low enough for the circuit to be used with small loads that are common in gate-level circuit design.

Circuit Operation
The input signal is transformed into two time-skewed signals by the CMOS-to-BBM converter in Figure 1(a). At the high-to-low transition of the input signal, both transistors switch: the PMOS turns on and the NMOS goes into a high-impedance state. The PMOS pulls its output node to logical high, with the delay defined by the PMOS transistor. The second output node follows with a delay defined by the bidirectional delay element. A similar process is repeated in the opposite direction at the low-to-high transition of the input signal. The output signal pair of the CMOS-to-BBM converter is used as the time-skewed input signal pair for the proposed inverter in Figure 1(b).

The proposed inverter is composed of a serially connected PMOS transistor, a bidirectional delay element, and an NMOS transistor. Circuit operation is best explained by an ideal transport delay timing diagram (Figure 1(b), right). The inverter input and output signals are applied as signal pairs. From the point of view of static signals, the two signals in a signal pair represent the same logical level. Because of the built-in delay, they never change simultaneously. The first transition, also referred to as isolation (nonbold slope in the timing diagram in Figure 1(b)), is followed by the second transition, also referred to as information (bold slope in the timing diagram in Figure 1(b)). The isolation slope always precedes the information slope in any logical transition. Isolation time is the time interval between the isolation slope and the information slope. During the isolation time, the inverter output is in a high-impedance state (indicated by the grayed output signal areas in the timing diagram in Figure 1(b)), preserving the old logical state on the capacitive load and preventing the direct-path current from flowing between the power supply and the ground.
Let us first assume that the two input nodes change from logical low to logical high, as presented in the timing diagram in Figure 1(b). The voltage on the PMOS gate input rises first, representing the isolation slope. The information slope at the NMOS gate input follows after the rising isolation time. The isolation slope switches the PMOS transistor into a high-impedance state. The old logical state is preserved on the capacitive load until the information slope turns on the NMOS transistor, which pulls its output node to logical low. Because of the bidirectional delay element, the output node on the PMOS side transits to logical low later than the output node on the NMOS side, and the state switching transient is then complete. A similar process takes place when the inputs transit from logical high to logical low.
The input isolation times are reproduced at the output, thus allowing cascaded operation. Time skewing between the input and output nodes of serially connected inverters is therefore guaranteed throughout the whole cascade. Now, let us assume that one or both input isolation times are increased. This increases the time during which both transistors are in a high-impedance state but does not affect the skew between the output signals. The latter depends only on the bidirectional delay element. Isolation times of serially connected inverters therefore do not accumulate from stage to stage. If the isolation time is reduced, the circuit functionality is maintained until either isolation time becomes zero, at which point the circuit operation becomes identical to that of a standard CMOS inverter. Generally, the isolation slopes are defined by the inverter transistors, while the information slopes are defined by the delay element.
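The non-accumulation of isolation times can be illustrated with a minimal sketch of the ideal transport-delay model. It assumes that the output isolation slope follows the input information slope after a fixed stage delay, and that the output information slope follows after a fixed skew set by the delay element. The names `SignalPair`, `bbm_stage`, `cascade`, `t_d`, and `t_skew` are hypothetical and used only for illustration; times are in picoseconds.

```python
from dataclasses import dataclass


@dataclass
class SignalPair:
    """Edge times (in ps) of one logical transition on a BBM signal pair."""
    isolation: float    # time of the isolation slope
    information: float  # time of the information slope

    @property
    def isolation_time(self) -> float:
        # Interval between the isolation and the information slope.
        return self.information - self.isolation


def bbm_stage(inp: SignalPair, t_d: float, t_skew: float) -> SignalPair:
    """Ideal transport-delay model of one BBM inverter stage (assumption).

    The output isolation slope follows the input information slope after
    t_d; the output information slope follows t_skew later.
    """
    out_isolation = inp.information + t_d
    return SignalPair(out_isolation, out_isolation + t_skew)


def cascade(inp: SignalPair, n_stages: int, t_d: float, t_skew: float) -> SignalPair:
    """Propagate one transition through n_stages identical BBM stages."""
    pair = inp
    for _ in range(n_stages):
        pair = bbm_stage(pair, t_d, t_skew)
    return pair
```

In this model, a longer input isolation time only shifts the output edges in time; the output isolation time stays equal to `t_skew` at every stage.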
The role of the input signals differs depending on the transition. The isolation slope is the falling edge at the NMOS gate or the rising edge at the PMOS gate. Similarly, the rising edge at the NMOS gate and the falling edge at the PMOS gate represent the information slope. Because of the dual input and output signals, the standard timing parameters must be redefined. The rising and falling delay times are defined between the input information slope and the output isolation slope, as presented in the timing diagram in Figure 1(b). Standard rise and fall time definitions (10% to 90% and vice versa) apply to each signal separately. The rising and falling isolation times are defined as the delay between the isolation slope and the corresponding information slope.
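As an illustration of the 10% to 90% definition applied to a single signal, the following sketch extracts the rise time from a sampled waveform by linear interpolation. The function names and the list-based waveform representation are assumptions made for this example, not part of the original measurement setup.

```python
def crossing_time(times, values, level):
    """Time of the first upward crossing of `level` (linear interpolation)."""
    samples = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v0 < level <= v1:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("level is not crossed by the waveform")


def rise_time(times, values, v_low, v_high):
    """10%-90% rise time of a rising waveform between v_low and v_high."""
    swing = v_high - v_low
    t10 = crossing_time(times, values, v_low + 0.1 * swing)
    t90 = crossing_time(times, values, v_low + 0.9 * swing)
    return t90 - t10
```

The same helper, applied to each signal of a pair separately, gives the per-signal rise times; the fall time follows by mirroring the waveform.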
Time-skewed input signals are merged back into a single output by the BBM-to-CMOS converter in Figure 1(c). A falling isolation slope on the NMOS gate input is followed by the falling information slope on the PMOS gate input. The first switches the NMOS transistor into a high-impedance state. The low logical state on the output is preserved on the capacitive load until the information slope turns on the PMOS transistor. During the isolation time, the converter output is in a high-impedance state. The PMOS then pulls the output node to logical high after its delay time. A similar process takes place at the rising isolation and information slopes.
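The propagation delay $t_p$ is defined as the average of the low-to-high ($t_{pLH}$) and the high-to-low ($t_{pHL}$) transition delays. For the BBM inverter (Figure 1(b)), each of the two is the sum of the corresponding delay time and isolation time. For a symmetric circuit with equal rising and falling delay times $t_d$ and equal rising and falling isolation times $t_i$, the propagation delay is given by

$$t_p = \frac{t_{pLH} + t_{pHL}}{2} = t_d + t_i. \qquad (1)$$

Assuming that $t_d$ is proportional to the propagation delay of an equally sized standard CMOS inverter ($t_{p,\mathrm{CMOS}}$), the following linear relation can be obtained:

$$t_p = k_1 t_{p,\mathrm{CMOS}} + t_i. \qquad (2)$$

By definition, $t_{p,\mathrm{CMOS}}$ is the average of the rise and the fall delay ($t_{pLH,\mathrm{CMOS}}$ and $t_{pHL,\mathrm{CMOS}}$) of a CMOS inverter. The eliminated direct path in the proposed inverter provides more switching current for charging and discharging the capacitive load represented by the next stage. Therefore, $t_d < t_{p,\mathrm{CMOS}}$ and $k_1 < 1$.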
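The delay definitions can be written out as a short sketch. It assumes that each transition delay of the BBM inverter is the sum of a delay time and an isolation time, as the redefined timing parameters suggest; the function and argument names are hypothetical.

```python
def propagation_delay(t_plh, t_phl):
    """Average of the low-to-high and high-to-low transition delays."""
    return 0.5 * (t_plh + t_phl)


def bbm_propagation_delay(t_d_lh, t_i_lh, t_d_hl, t_i_hl):
    """Propagation delay of a BBM stage (assumed: delay time + isolation time)."""
    return propagation_delay(t_d_lh + t_i_lh, t_d_hl + t_i_hl)


def bbm_delay_from_cmos(k1, t_p_cmos, t_i):
    """Linear relation for a symmetric stage with t_d = k1 * t_p_cmos, k1 < 1."""
    return k1 * t_p_cmos + t_i
```

For a symmetric stage the two forms agree: with a delay time of 80 ps and an isolation time of 40 ps, both give 120 ps when `k1 * t_p_cmos` equals 80 ps.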

Charge Delay Analysis
Circuit power efficiency is measured by the power-delay product (PDP) and the energy-delay product (EDP = PDP · $t_p$) [8,9]. PDP corresponds to the energy required for one gate switch. EDP, on the other hand, represents a trade-off between energy and performance. Usually, PDP and EDP should be as low as possible, thus resulting in minimum delay at the minimal possible energy consumption. When designing a low-precision delay, the situation is turned upside down. The goal is to implement the required delay at the lowest possible energy consumption. EDP can be reduced by reducing the supply voltage or the charge. Because the supply voltage cannot be changed, this means that we are looking for the minimal charge required for implementing the requested delay.
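The link between charge and the two figures of merit can be sketched as follows, assuming that the energy of one switching event equals the charge drawn from the supply multiplied by the supply voltage. All names are illustrative.

```python
def energy_per_switch(charge, v_dd):
    """PDP: energy drawn from the supply for one switching event (E = Q * V_DD)."""
    return charge * v_dd


def energy_delay_product(charge, v_dd, t_p):
    """EDP = PDP * t_p."""
    return energy_per_switch(charge, v_dd) * t_p
```

With the supply voltage and the requested delay fixed, both quantities scale linearly with the charge, which is why the charge alone is used as the cost in the analysis.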
A standard two-stage CMOS buffer is depicted in Figure 2. A simplified transistor model (3) [10] is assumed. The model merges transistor geometry and technology parameters into the factor $\beta = (\mu\varepsilon_{ox}/t_{ox})(W/L)$, where $\mu$ is the surface mobility of the carriers, $\varepsilon_{ox}$ and $t_{ox}$ are the permittivity and thickness of the gate insulator, and $W$ and $L$ are the transistor channel's width and length, respectively. Transistor capacitances are represented by constant gate capacitances of the PMOS and NMOS transistors.

If the input signal rise and fall times are zero, then no direct-path current is present in the first stage. The dynamic charge (4) for charging and discharging the capacitive loads represents the only consumption in the first stage. By solving Kirchhoff's current law for the internal node, the input rise and fall times for the second stage can be obtained. They can be approximated by (5) [11]. Constants $k_2$ and $k_3$ are equal for $V_{Tn} = -V_{Tp}$.

The second stage has no load. The internal voltage is directly transferred to the output. Therefore, no dynamic consumption is present in the second stage. Since the rise and fall times (5) of the second stage input signal are not zero, direct-path current flows during the transition, resulting in a static charge. With a linear approximation of the internal voltage transitions during the rise and fall times (5), the static charge can be calculated. Because there is no capacitive load, no delay is added in the second stage. The low-to-high and the high-to-low output delays are proportional to the fall and rise times of the second stage input signal, respectively. If the second stage is symmetric, $k_4 = k_5 = 1/2$. Assuming $V_{Tn} = -V_{Tp} = V_T$ and a symmetric first stage $\beta_{n1} = \beta_{p1} = \beta_1$, the rise and the fall times (5) are equal. Additional symmetry in the second stage, $\beta_{n2} = \beta_{p2} = \beta_2$, simplifies the charge-delay relation $Q(t_{p,\mathrm{CMOS}})$ into (10). The propagation delay is generated by the first stage ($\beta_1$) and the second stage gate capacitances. The static and the dynamic consumption increase linearly with $t_{p,\mathrm{CMOS}}$.
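How transistor geometry enters the simplified model can be sketched with the standard first-order expressions for the gain factor and the gate capacitance. The relative permittivity of 3.9 (silicon dioxide) and the function names are assumptions made for this example, not values from the text.

```python
EPS_0 = 8.854e-12  # vacuum permittivity in F/m


def oxide_capacitance_per_area(t_ox, eps_r=3.9):
    """Gate insulator capacitance per unit area (eps_r = 3.9 assumes SiO2)."""
    return eps_r * EPS_0 / t_ox


def gain_factor(mu, t_ox, w, l):
    """Gain factor: mobility * (eps_ox / t_ox) * (W / L)."""
    return mu * oxide_capacitance_per_area(t_ox) * w / l


def gate_capacitance(t_ox, w, l):
    """First-order gate capacitance: (eps_ox / t_ox) * W * L."""
    return oxide_capacitance_per_area(t_ox) * w * l
```

Doubling the channel length halves the gain factor and doubles the gate capacitance, so a longer channel both weakens the drive and increases the load it presents to the previous stage.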

Bidirectional Delay Implementation
In the simplest case, the bidirectional delay can be implemented with a single resistor (Figure 3), which in combination with the capacitances of the next stage provides the isolation time.
The dynamic charge (4) required for charging and discharging the capacitive loads remains the only cause of power consumption in the first stage.
The transients of the internal node voltages are required for computing the isolation times. Equations (11) and (12) must be solved to obtain them. The solution of these equations is complicated and is not appropriate for manual calculation.
Discharging of the two load capacitances can be dealt with separately in (11), assuming a high $R$.
For $R \to \infty$ the current through the delay element vanishes, so the capacitance connected directly to the conducting transistor is discharged first. The influence of the current through $R$ is negligible, and (11) simplifies into a first-order equation for a single node. The same deduction holds for charging the load capacitances in (12), which simplifies in the same way for $R \to \infty$. The delay times can be obtained by solving the simplified versions of (11) and (12) as in (5). The isolation time depends on the $RC$ constant, as given by (13) and (14). Constants $k_6$ and $k_7$ are equal for $V_{Tn} = -V_{Tp}$. Since the two internal node voltages are time skewed, static consumption in the second stage is zero. The total consumption is therefore equal to the dynamic consumption in the first stage (4). Assuming $V_{Tn} = -V_{Tp} = V_T$ and a symmetric first stage $\beta_{n1} = \beta_{p1} = \beta_1$, the propagation delay (1) can be expressed as (15), which can be used to obtain the charge-delay relation (16). Charge consumption again increases linearly with the propagation delay.
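The dependence of the skew on the $RC$ constant can be illustrated with a first-order estimate. The sketch below is not the paper's expression: it assumes an ideal resistor, a constant load capacitance, a step drive from 0 to the supply voltage, and that the next stage responds when the node crosses a threshold voltage. All names are hypothetical.

```python
import math


def rc_threshold_crossing_time(r, c, v_dd, v_t):
    """Time for an RC node charged from 0 V toward v_dd to reach v_t.

    First-order model: v(t) = v_dd * (1 - exp(-t / (r * c))).
    """
    if not 0.0 < v_t < v_dd:
        raise ValueError("v_t must lie strictly between 0 and v_dd")
    return r * c * math.log(v_dd / (v_dd - v_t))
```

Under these assumptions the crossing time is a constant multiple of `r * c`, so it scales linearly with the resistance for a fixed load and threshold.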
To ensure rise and fall delay symmetry ($t_{pLH} = t_{pHL}$, where each side is the sum of the corresponding delay time and an $RC$-dependent isolation term with constants $k_9$ and $k_{10}$, respectively), a balance among the variables in (13) and (14) is required. If both stages are symmetric and $V_{Tn} = -V_{Tp}$, then the second stage gate capacitances of the PMOS and NMOS transistors must also be equal. These capacitances depend only on the gate capacitance in the first approximation. The equality can be met by increasing the channel length of the second stage NMOS transistor, which degrades its driving performance. Yet another way is to assume that the capacitive load is composed of gate capacitance and various stray capacitances ($C = C_{gate} + C_{stray}$). A smaller NMOS gate can be partly compensated by adding more parasitic capacitance to the NMOS gate. Larger parasitic NMOS stray capacitance can be introduced by different gate connections in case the layout allows such modifications.
The analysis above holds if $R$ is high enough and static consumption is consequently negligible. The isolation slope must end before the information slope begins. In the first approximation, this condition leads to a lower bound on $R$.
The dependency of on is depicted in Figure 4. Although the concept of introducing a bidirectional delay using $R$ can be expanded to logic gates, the overhead of increased delay combined with double wiring hardly justifies the energy savings. This approach shows its advantage in circuits with productive use of delay, such as edge-triggered storage elements and clock distribution networks.
The Scientific World Journal

Optimization Problem
Analysis of the circuit in Figure 3 shows that obtaining a specific requested delay with minimum charge consumption is an optimization problem. The minimum of the function $Q(R, \mathbf{w}, \mathbf{l})$ represents the optimal solution. Vectors $\mathbf{w}$ and $\mathbf{l}$ represent transistor channel widths and lengths, which define the gain factors ($\beta$) and the capacitances. The implicit constraint $t_{pLH} = t_{pHL} = t_{requested}$ is imposed on the solution. Delay implementation with $R$ causes long charging phases of the load capacitances. Transient phenomena of the information slope may not be concluded before the transistors in the next stage switch state (Figure 5). The input signal must remain constant during the transient; otherwise, the next delay is shortened. For this reason, another implicit constraint defining the maximum length of the transient is introduced into the optimization problem: the $RC$-defined transients of both internal nodes must be shorter than $t_{max}$. The input signal must stay constant for at least $t_{max}$ after every transition.
The manual calculation is derived from a simplified static transistor model (3). Dynamic behavior is modeled with constant gate capacitances, although in reality these capacitances are voltage dependent. The optimization procedure of a real-world BBM buffer must consider numerous higher order effects that were neglected in the first approximation, such as the following: (i) the input signal is not an ideal rectangular voltage source, (ii) the output load is not zero, (iii) the MOSFET model should include higher order static (channel-length modulation, short-channel effect, subthreshold conductivity, etc.) and dynamic (nonlinear capacitances, etc.) effects, and (iv) parasitics (layout, wiring, etc.) have to be taken into account.
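One common way to hand such implicit constraints to a general-purpose optimizer is a penalized cost function. The sketch below is only an assumption about how this could be done, not the formulation used by the authors. The dictionary keys, the tolerance, and the penalty weight are hypothetical; the measurements would come from a circuit simulation.

```python
def penalized_cost(measure, t_requested, t_max, tol=0.01, weight=1e3):
    """Charge consumption inflated by penalties for violated constraints.

    measure: dict with keys 'charge', 't_plh', 't_phl', 't_transient'
             (hypothetical names for values extracted from a simulation).
    """
    penalty = 0.0
    # Delay constraint: both transition delays must match the requested delay.
    for key in ("t_plh", "t_phl"):
        rel_err = abs(measure[key] - t_requested) / t_requested
        penalty += max(0.0, rel_err - tol)
    # Transient constraint: the internal transient must settle within t_max.
    penalty += max(0.0, measure["t_transient"] / t_max - 1.0)
    return measure["charge"] * (1.0 + weight * penalty)
```

A sizing that meets all constraints is ranked by its charge alone, while any violation inflates the cost in proportion to its relative size, which keeps the penalty independent of the units used for charge.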

Transistor-Based Delay Implementation
Implementing $R$ with high-resistance polysilicon is area consuming and poorly controlled. One or more MOS transistors can be used instead. If a single delay transistor is connected as a diode or a triode (Figure 6), then the output voltage swing is reduced. The voltage drop is defined by the threshold voltage of the delay transistor. Voltage swing reduction applies to one or both outputs depending on the configuration. This has several implications. The dynamic charge (4) is reduced in proportion to the reduced voltage swing, consequently reducing the power consumption. The delay increases due to the transistor's high resistance in the saturation region, which results in long transient phenomena in the information slope. On the other hand, a smaller voltage excursion is required for reaching the threshold voltage of the next stage, which in turn decreases the delay. A fairly high supply voltage > 5 is required. Long information transients and the high supply voltage make the reduced swing topologies inappropriate.
Full swing can be achieved with additional level restoration transistors (Figure 7). The PMOS (NMOS) level restoration transistor restores the high (low) level. Level restoration transistor(s) can be combined in parallel with any delay element from Figure 6. The control signals are delayed versions of the input signals, which can, for instance, be obtained at the next stage output.
At the gate level, every additional component introduces its own parasitic capacitances, causing additional power overhead that must be justified. Therefore, the number of transistors must be kept as low as possible. At least two transistors are needed for a full-swing delay implementation. Possible topologies are shown in Figure 8. There are four combinations: PMOS and NMOS delay transistors in triode mode (PtNt), level restoration transistors without delay transistors (PfNf), and a PMOS or an NMOS delay transistor with the appropriate level restoration transistor (PtNf and PfNt). The level restoration control signals are taken from the next stage output. Using feedback for level restoration is a logical choice, since level restoration is required immediately after the next stage switches state.

Noninverting Delay Cell
The dual-ramp (i.e., BBM) CMOS inverter is well suited for building low-power low-precision delay elements. It conveniently combines the delay with short-circuit current elimination. Generally, the delay circuit can be constructed as a cascade of several BBM stages comprising the elements depicted in Figure 1. In the simplest case, the circuit can be reduced to the interfacing elements depicted in Figures 1(a) and 1(c). This results in a 6-transistor noninverting delay cell when the full-swing delay topologies from Figure 8 are used.
To verify the proposed principle, all four variations of the simple dual-ramp delay cell were compared to a standard serial connection of two CMOS inverters. All five delay circuits were sized for the smallest possible charge consumption at a required delay. Digital cell sizing, including delay, is highly dependent on the required fan-in and fan-out properties. To eliminate this dependence, standard input and output unit inverters were added, defining equal fan-in and fan-out properties (Figure 9) for the delay circuits. Both inverters contribute to the delay and are considered part of the cell. The final sizing (i.e., the optimization result) of a delay circuit is therefore tailored to the selected pair of input and output standard unit inverters. Standard CMOS and dual-ramp (BBM) delay cells with input and output buffers are shown in Figure 9.

Results
Sizing the cells from Figure 9 to a required delay is an eight- or a twelve-dimensional optimization problem. Finding the global minimum is not a trivial task, especially if there are many local minima. Therefore, every optimization run was repeated several times in various parts of the parameter space until the global minimum was confirmed.
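The practice of repeating runs from different parts of the parameter space can be sketched as a multi-start search. The example below uses a plain random-perturbation local search on a standard multimodal test function (Rastrigin) in place of a circuit simulation. It is not the SADE method used in this work, and all names are illustrative.

```python
import math
import random


def rastrigin(x):
    """Multimodal test function with global minimum 0 at the origin."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def local_search(f, x0, bounds, rng, steps=2000, sigma=0.05):
    """Greedy random-perturbation search that keeps only improving points."""
    x, fx = list(x0), f(x0)
    for _ in range(steps):
        cand = [min(max(xi + rng.gauss(0.0, sigma * (hi - lo)), lo), hi)
                for xi, (lo, hi) in zip(x, bounds)]
        fc = f(cand)
        if fc < fx:
            x, fx = cand, fc
    return x, fx


def multi_start(f, bounds, n_starts=10, seed=0):
    """Repeat the local search from random starting points; return the best run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_starts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        runs.append(local_search(f, x0, bounds, rng))
    best = min(runs, key=lambda run: run[1])
    return best, runs
```

Comparing the final values across the runs gives a rough indication of whether several independent starts agree on the same minimum, which is the confirmation step described above.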
A parallel version of the SADE [12] global optimization method was used. The optimization procedure ran in parallel on a cluster of 100 computing nodes driven by the PyOPUS [13] library. 25 Intel Core i5 2.66 GHz processors (4 nodes per processor) were used. Circuit simulations were performed by the Synopsys HSPICE circuit simulator with the TSMC 0.18 μm/1.8 V process parameters.
Besides transistor sizing, the delay and power dissipation also depend on the circuit layout. Ideally, automatic layout and extraction of parasitic node capacitances should be performed in every iteration before the simulation. The authors could not include the layout and extraction steps in the optimization loop because the available computing power, although considerable, was limited. Therefore, the node capacitances due to layout were not taken into account. To approximate the real conditions, transistor parasitics due to the connection geometry were included. Layout rules for the drain and source diffusions (area of 0.8 μm × $W$ and perimeter of 1.5 μm + $W$) were applied. In spite of this imperfection, the results obtained for the standard and the proposed delay cell still indicate the capabilities of the two topologies.
Straightforward sizing of the cells produces inappropriate results. The optimizer finds a sizing with small consumption and a perfect delay match. These are in fact degenerate circuits whose operation depends on poorly defined parasitics, causing very long internal transients. To obtain circuits whose operation relies on well-defined process parameters, such as gate capacitance and intrinsic transconductance, additional implicit constraints were required. The first set of safeguarding constraints avoids extreme over- and undershoots, thus preventing circuit operation based on parasitic capacitances (e.g., Miller capacitance).
Without these constraints, Miller capacitances of large transistors can become the main source of the delay time. In that unwanted case, the delay results from charging and discharging the parasitic capacitances.
The second set of constraints ensures that the steady state is reached after $t_{max}$ (Figure 5). Very long internal transients with smaller charge consumption would otherwise be superior from the optimization point of view.
The third set represents additional requirements needed to obtain noise-resistant circuits. Noise margins are obtained by requiring stable steady-state node voltages during substrate potential disturbances.

Figure 10 illustrates the results summarized in Table 1. Each topology was sized targeting delays from 100 ps to 5 ns. Charge consumption growth with delay becomes approximately linear for delays above 1 ns, as (9) and (16) predict. The topology with a PMOS transistor in triode mode and an NMOS level restoration transistor (PtNf) turns out to be the most efficient. In comparison with the standard topology, charge savings are slightly higher than 40%.

Table 1: Charge consumption results for the standard (std.) and the BBM delay cells depicted in Figure 9, measured for one rising and one falling slope. The columns denoted by the percent sign represent the percentage of the standard delay cell's consumption.

Elimination of static consumption can be observed in the third stage. It is the only stage actually driven by time-skewed signals. Drain currents for the 750 ps delay sizing are depicted in Figure 11. Similar current transients can be observed with other BBM topologies and delays. The large direct-path current in the standard topology (shadowed) is almost completely eliminated in the PtNf topology. The optimization procedure obtained the required delay with large channel lengths in the second stage. This means that the bulk of the delay is caused by the third stage gate capacitances.

Applications
Dual-ramp BBM delay cells can be used for constructing low-power delay lines. All that needs to be done is to replace the standard delay elements with the proposed ones, as shown in Figure 12. The number of stages in one delay element can vary depending on the required delay.
The BBM inverter with level restoration can be used as a key element in a leading-edge pulse generator ( Figure 13). Pulse generators are frequently used for generating precharge or data-strobe pulses in dynamic logic and flip-flops. Since, in this case, the delay is needed only for the low-to-high transition, the circuit can be simplified. A single BBM stage provides enough delay for the AND-type edge detection.
The width of the generated pulse is defined by the AND gate delay and the parasitic capacitance of node combined with the resistance of the discharging NMOS feedback transistor. The inherent delay of the output pulse dictated by the AND gate provides enough time to reset node through the NMOS feedback transistor. The voltage level of node is restored through the precharging PMOS transistor. In this context, the BBM inverter acts as a feedback switch with limited impact on the delay.
The BBM topology in Figure 13 was compared to a standard leading-edge pulse generator with various delay line lengths. Low-power cells (inverter and NAND gate) from an industry standard library were used. Input and output buffers were added to equalize fan-in and fan-out properties. Manufacturing process layout rules defining transistor geometry were applied during the sizing procedure (i.e., optimization). The previously described constraints were again essential for obtaining sensible results.
The results are summarized in Table 2. The pulse width, the total charge consumption, and the delay line charge consumption of the standard leading-edge pulse generator with 3, 5, 7, 9, and 11 cascaded inverters in the delay were measured first. Then, the BBM topology was sized to the individual pulse widths. The charge consumptions of the equivalent BBM based leading-edge pulse generators were obtained. The simplicity of the delay implementation saves up to 50% of the total switching energy compared to the standard realization. The advantage of the BBM based circuit is clearly presented when the charge consumption of the delay line is measured separately. For standard implementation, the consumption linearly increases with the number of inverters in the delay. On the other hand, the consumption of the BBM inverter is almost constant. This is due to the constant number of transistors. A slight increase can be observed for longer pulse widths, which is caused by higher parasitic capacitances of larger transistors required in that case. Note that the consumption of the inverter supplying the feedback is included only in the total charge consumption measurement.
Table 2: Charge consumption results for the standard and the BBM based leading-edge pulse generator depicted in Figure 13, measured for one input impulse. The columns denoted by the percent sign represent the percentage of the standard realization's consumption.

Two potential topology modifications not requiring the feedback are given in Figures 14 and 15. The AND gate delay therefore does not affect the width of the generated pulse. The price for removing the feedback is an increased number of transistors, which causes higher charge consumption compared to the feedback implementation from Figure 13. The delay is defined more precisely if the discharge current is controlled by a biased MOS transistor [14] (Figure 14). The leading-edge delay of the input signal is defined by the time needed for discharging the parasitic capacitance of node through the NMOS feedback transistor M1. Transistor output signal for switching the current through M2/M3 when the biasing voltage on the M1 gate is needed [15].
The DC power consumption of the biasing circuit can be reduced by dynamic biasing presented in Figure 15. When the input signal is low, the circuit prepares the initial conditions for the delay transient: the voltage of node is clamped by M2 to and node is precharged through M3 to . In the active phase, when input goes to high, node is charged by the parasitic capacitance of node through the transmission gate M4/M5, thus raising the M1 gate voltage to the desired level for the delay transient. The effective voltage at the M1 gate is given by = ( + )/( + ), where and are the parasitic capacitances of nodes and , respectively. Parasitic capacitances can be trimmed by adding diffusion areas to the relevant transistors or using gate oxide capacitors. The transistors in the switching network (M2⋅ ⋅ ⋅ M6) are minimum sized.
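The voltage that results from connecting two precharged parasitic capacitances through a transmission gate follows from charge conservation. The sketch below shows this generic charge-sharing relation. The argument names are hypothetical and do not correspond to the node names in Figure 15, and leakage and switch charge injection are ignored.

```python
def shared_voltage(c_a, v_a, c_b, v_b):
    """Final voltage after two capacitors at v_a and v_b are connected.

    Charge conservation: (c_a * v_a + c_b * v_b) / (c_a + c_b).
    """
    if c_a <= 0 or c_b <= 0:
        raise ValueError("capacitances must be positive")
    return (c_a * v_a + c_b * v_b) / (c_a + c_b)
```

The result always lies between the two initial voltages and moves toward the voltage of the larger capacitance, which is why trimming the parasitic capacitances sets the resulting gate voltage.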

Conclusion
A modified static CMOS inverter has been presented which reduces direct-path current in circuits where the delay is a required part of the circuit's functionality. The proposed BBM inverter is well suited for building low-power low-precision delay elements due to its ability to combine delay and direct-path current elimination in a single stage. The suppression is based on the serially connected delay element in the inverter output, which provides time-skewed output signals. Two output signals provide additional capabilities for compact functional solutions. The principle of operation has been verified by performing delay cell optimizations for various delay element implementations. With the exception of very short delays, the proposed BBM inverter structure improves the power budget compared to the standard cascaded inverter transport delay implementation. Besides delay lines, variations of the proposed topology can be used in other slow-transition circuits. An edge-detector circuit featuring the BBM topology has been presented.