DPFFs: C2MOS Direct Path Flip-Flops for Process-Resilient Ultradynamic Voltage Scaling

We propose two master-slave flip-flops (FFs) that utilize the clocked CMOS (C2MOS) technique with an internal direct connection along the main signal propagation path between the master and slave latches and adopt an adaptive body bias technique to improve circuit robustness. The C2MOS structure improves the setup margin and robustness while providing full compatibility with the standard cell characterization flow. Further, the direct path shortens the logic depth and thus speeds up signal propagation, a gain that can be traded for lower power and smaller area. Measurements from test circuits fabricated in 130 nm technology show that the proposed FF operates down to 60 mV, consuming 24.7 pW while improving the propagation delay, dynamic power, and leakage by 22%, 9%, and 13%, respectively, compared with conventional FFs at the iso-output-load condition. The proposed FFs are integrated into an 8 × 8 FIR filter which successfully operates all the way down to 85 mV.


Introduction
The rapidly growing volume of data processing and transfer in contemporary clocked electronics has drawn sustained attention to the design of high-speed, low-power, yet robust sequential timing elements. Together with the clock network, these sequential components account for 30% to 60% of total power consumption in VLSI systems [1]. Moreover, the propagation delay and latency of the timing units are responsible for a large portion of the cycle time as the operating frequency increases. As a result, FF selection and design have a great effect on both reducing power dissipation and providing more slack time for easier time-budgeting in high-performance applications.
FFs and latches are the major building blocks of digital circuits, and their primary function is to store binary data. Many commercial digital applications selectively use master-slave and pulse-triggered FFs. Examples of the master-slave flip-flop (MSFF) include the transmission gate-based FF [2], the push-pull D-type FF [3], and the true single phase clocked FF (TSPC) [4]. Despite its popularity in high-performance designs, the TSPC can malfunction when the clock slope is insufficiently steep. Slow clock edges can cause the clocked pull-up and pull-down networks to be on simultaneously, resulting in an undefined state and a race condition. Further, transistor sizing is critical to correct functionality in the TSPC. With improper sizing, glitches may also occur at the output due to a race condition when the clock transitions [5]. Ishikawa et al. proposed an MSFF in which a hysteresis characteristic of the input circuit of the slave latch prevents an output that has been inverted once from being inverted again in the metastable state [6]. Another edge-triggered FF is the sense amplifier-based FF [7]. All these hard-edged FFs are characterized by a positive setup time, which lengthens the cycle time. Alternatively, pulse-based FFs (PBFFs) [8][9][10] have been widely used to decrease the data-to-output (D-to-Q) delay. In the scan logic of the PBFF, however, the scan control becomes too complex and incompatible with the conventional MSFF when the D-to-Q speed of the FF is enhanced further. The PBFF also requires pulse generators, added internally or externally, which can increase area, power, and routing congestion, even though an external pulse generation scheme provides advantages such as shareability among neighboring FFs and availability of dual-edge triggering. Various other types of FFs and their analysis at normal or high voltages are found in [11][12][13][14][15][16][17][18][19][20]. Considering circuit robustness against race conditions, signal integrity in the presence of clock skew, and the power consumption and performance targeted by commercial products, we focus in this paper on fully static MSFFs [21] for ultralow as well as normal voltage applications.
Based on the conventional MSFFs that are widely used, especially in commercial products, we propose two C2MOS asymmetric direct path master-slave FFs (DPFFs) which use the C2MOS technique at the primary input stage and have a direct signal path between the master and slave latches. The proposed DPFFs adopt a low-power technique, called ABRM (adaptive β-ratio modulation), to dynamically adjust the skewness between PMOS and NMOS transistors in different regions of operation. The proposed FFs provide (1) high speed, which can be traded off to reduce power and area; (2) full compatibility with the most widely used commercial characterization tools while inheriting the main advantages of the conventional FFs, such as high signal integrity and noise immunity against clock skew; and (3) compensation for device mismatches and skewed P/N-ratios as the supply voltage changes from the normal voltage all the way down to the deep subthreshold region (say, 60 mV).
The remainder of this paper is organized as follows. Section 2 explores the FF design metrics of interest. Conventional MSFFs used as references in the paper are presented in Section 3, and the DPFFs are proposed in Section 4. Section 5 briefly explains a body bias technique applied to improve circuit robustness, especially at ultralow voltages. Section 6 discusses simulation and measurement results, and conclusions are drawn in Section 7.

Flip-Flop Design Metrics
In general, gate-level static timing analysis requires that sequential elements be characterized for three important design metrics: setup time, hold time, and propagation delay. These metrics affect system-level features such as speed, signal integrity, and noise immunity under noisy and race conditions. The setup time $t_{setup}$ is the minimum amount of time for which the input data D must remain steady before the active clock edge so that the data is reliably sampled by the clock. Failing to meet it, known as a setup violation, may cause incorrect data to be captured. The hold time $t_{hold}$ is the minimum amount of time for which the input D must remain stable after the active clock edge so that D is correctly sampled. Failing to meet it, known as a hold violation, may cause incorrect data to be latched. Finally, the clock-to-Q delay is the propagation delay that a FF takes to produce its correct output Q after the active clock edge. Considering the polarity of the propagated data, the clock-to-Q delay is defined by $t_{clk2q} = \max(t_{clk2qh}, t_{clk2ql})$, where $t_{clk2qh}$ and $t_{clk2ql}$ are the clock-to-Q low-to-high and high-to-low delays, respectively.
These design metrics are determined as time differences between the data and clock signals and are generally precharacterized and stored in a table indexed by the input slope, clock slope, and output load. Typically, each sequential gate is characterized using commercial library delivery methodologies. The characterization procedure is repeated for a rising or falling edge with various combinations of input slopes and output loads. Figure 1 illustrates the condition that the FF environment in a digital circuit has to satisfy for correct operation. The clock period $T$ must be greater than or equal to the sum of the clock-to-Q delay $t_{clk2q}$, setup time $t_{setup}$, maximum combinational logic delay $t_{logic}$, and relative clock skew $t_{skew}$. The FF delay then has to meet the maximum delay limit given by (1), where a 5% margin on the clock-to-Q delay is considered in order to avoid the metastable state, in which setup and hold violations occur and the output is therefore unpredictable. The worst race condition happens when there is no logic between the two FFs in Figure 1. The internal race immunity of a FF is given by (2).
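The cycle-time condition described in the paragraph above can be written out explicitly. The following is a sketch derived only from the prose definitions: the first line is the stated clock period condition, and the second rearranges it into a bound on the FF delay. The placement of the 5% margin as a factor of 1.05 on the clock-to-Q term is an assumption, not a confirmed form of (1), and no form of the race immunity expression (2) is assumed here.

```latex
% Clock period condition as stated in the text
T \;\ge\; t_{clk2q} + t_{setup} + t_{logic} + t_{skew}

% Rearranged as a limit on the FF delay; the 1.05 factor is an ASSUMED
% reading of the "5% margin of the clock-to-Q delay"
1.05\, t_{clk2q} + t_{setup} \;\le\; T - t_{logic} - t_{skew}
```

Under this reading, any reduction in either $t_{clk2q}$ or $t_{setup}$ directly frees cycle time for the combinational logic between the two FFs.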

Conventional Flip-Flops
The master-slave flip-flop (MSFF) typically consists of two gated latches connected in series, with an inverted enable input to one of them. Clocking causes the FF to either change or retain its output based upon the value of the input signals at the transition. The transmission gate-based FF (TGFF) is known to present the best power-performance trade-off in terms of the total delay ($t_{clk2q} + t_{setup}$) among the fully static FFs. Figure 2(a) shows the TGFF, which was originally designed for the IBM PowerPC microprocessor [2,11]. The main advantages of the TGFF include a short signal path and a low-power feedback. The butterfly-structured low-power feedback in the C2MOS cross-coupled inverters is usually insensitive to overlap of the clocks. The TGFF covers a relatively wide range of the total energy-delay space [11] and presents the least total leakage, averaged across all states, compared with all other FFs, albeit with limited performance and a positive setup time [21]. On the other hand, the use of transmission gates not only degrades signal integrity in the presence of output noise but also increases the sensitivity to race conditions when the two clock phases overlap. The transmission gate T1 at the primary input stage is normally vulnerable to output noise because its inherent bidirectional signal transfer capability can cause output noise to flow back to and disturb the input stage. Moreover, a nonbuffered (or bare) input applied directly to the transmission gate can be limited by the standard cell library characterization flow, since part of the power consumed by the FF must be delivered through the input data D terminal. The authors believe that the characterization flow, even with leading commercial EDA (Electronic Design Automation) tools [22][23][24][25], has a restricted capability for characterizing library cells which, besides the power source $V_{dd}$, require power delivery from the input D as well. From the performance characterization perspective, the current drive of the previous stage can also cause inaccuracy in FF delay measurements.
Figure 2: Transmission gate flip-flops (TGFFs) [2,11].
The modified TGFF (MTGFF) shown in Figure 2(b) addresses the noise immunity issue of the TGFF by adding an inverter buffer at the primary input stage, which keeps the output noise from propagating back to and interfering with the input [11]. In addition, the added inverter ensures compatibility with the characterization constraints by delivering power from a single voltage source $V_{dd}$ to the entire cell, including the transmission gate. The MTGFF inherits the main advantages of the TGFF, such as a low-power feedback to store the cell value. The C2MOS technique along with the transmission gate separates the hold mode from the transparent mode. In general, large transmission gates are used to speed up signal propagation in the transparent mode, resulting in increased area overhead. Unlike the TGFF, the MTGFF achieves high immunity against output noise and full compatibility with the primitive cell characterization methodology thanks to the inverter I1 at the primary input stage. The added inverter, however, requires earlier data arrival, which increases the setup time by one inverter delay. Further, the inserted inverter calls for another inverter I4 at the output stage in order to keep the same polarity as the input, which may increase the propagation delay by the delay of I4.
Despite these unfavorable aspects, its high robustness, low-power features, and full compatibility with the underlying characterization process allow the MTGFF to be successfully embedded in numerous commercial applications such as Intel's mainstream microprocessors and Samsung's SSD (Solid State Drive) controllers. We will use the MTGFF as a reference to evaluate the proposed FFs.
Figure 3 shows the C2MOS FF (C2FF) as another approach to resolving the issues associated with the TGFF. Unlike the MTGFF, the C2FF uses, at the input stage of both the master and slave latches, a C2MOS inverter that combines the inverter and transmission gate of the MTGFF. Using the C2MOS inverter as the input buffer shortens the logic depth along the main signal propagation path, reducing the setup time roughly by the delay of inverter I1. The reduced setup time can increase system performance and robustness by relaxing the timing constraint on the maximum FF delay in (1) and improving the internal race immunity in (2). Note that, to speed up charging or discharging of the FF output, the input data D is applied to the outer PMOS and NMOS transistors, whereas the clock signals ck and ckb are applied to the inner transistors; here we assume that the input data arrives and becomes stable earlier than the clock signals.
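The effect of a shorter setup time on the cycle-time condition of Section 2 can be illustrated numerically. The sketch below uses only the clock period condition stated in the text (period at least the sum of clock-to-Q delay, setup time, logic delay, and skew). The function name and every delay value are hypothetical, chosen for illustration rather than taken from measurements; the second case simply applies a 25% shorter setup time.

```python
def min_clock_period(t_clk2q, t_setup, t_logic, t_skew):
    """Smallest period meeting T >= t_clk2q + t_setup + t_logic + t_skew."""
    return t_clk2q + t_setup + t_logic + t_skew


# Hypothetical delays in picoseconds (NOT measured values).
T_BASE = min_clock_period(t_clk2q=200, t_setup=80, t_logic=500, t_skew=20)

# Same cell with a 25% shorter setup time (80 ps -> 60 ps), all else equal.
T_SHORT = min_clock_period(t_clk2q=200, t_setup=60, t_logic=500, t_skew=20)

# The saved setup time appears one-for-one as recovered cycle time.
print(T_BASE, T_SHORT, T_BASE - T_SHORT)  # prints: 800 780 20
```

The recovered time can either raise the operating frequency or be left as slack for the combinational logic.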

Proposed Direct Path Flip-Flops
As basic and common building blocks of digital systems, FFs are required to offer high performance and low power consumption while providing high robustness under data and clock skews and compatibility with the characterization flow of primitive cells. In this paper, we propose two C2MOS direct path master-slave FFs with an internal direct connection between the asymmetric master and slave latches.
Figure 4 shows the first proposed FF, called the C2MOS DPFF (C2DPFF), which utilizes the C2MOS scheme at the primary input stage and inherits the main advantages of the C2FF, addressing the output noise and noncompatibility issues of the TGFF while reducing the setup time. Unlike in the C2FF, a direct interconnection between the master and slave latches enables prompt signal propagation along the main signal path. The resulting performance gain can be traded off against power, so that the C2DPFF can achieve further power saving along with area reduction. Due to the transistor stacking effect, however, the C2MOS inverter at the input stage may need to be enlarged to offer a current drive (strength) comparable with that of its nonstacked counterpart. A larger gate input capacitance requires a larger current drive from the previous driver, resulting in more power dissipation.
The second proposed FF, called the transmission gate DPFF (TGDPFF) and shown in Figure 5, uses a transmission gate with reduced gate input capacitance at the input stage while leveraging the key advantages of the MTGFF. By employing butterfly-structured C2MOS cross-coupled inverters to store the cell value, the TGDPFF also presents good low-power properties and assures fully static operation.
The direct path structure may, however, suffer write-back glitches between the master and slave storage nodes in both Figures 4 and 5 due to charge sharing through, or the bidirectional signal transfer capability of, the transmission gate [26]. The write-back issue is a form of contention: when the clock transitions high, the value stored in the slave node can write back into the unprotected master node, causing an incorrect bit flip because of reduced noise margins, especially at lower voltages. At lower supply voltages the issue becomes more serious, since degradation of the transistor ON/OFF current ratio, together with random and systematic process variations, affects the stability of the storage nodes. To address the issue, the keepers need to be upsized to strengthen state retention and made interruptible to avoid write contention. During the retention phase, the on-current of the keepers can then fully contend with the off-current of the transmission gates and thus avoid incorrect bit flipping. A clocked CMOS style implementation of the proposed DPFFs replaces the master and slave transmission gates of the conventional circuit topologies with pass-gate-free clocked inverters, thereby eliminating the risk of data write-back through the transmission gate.
On the other hand, the proposed direct path scheme may increase the load capacitance at the direct-path node due to the directly connected transmission gate, T1 in Figure 4 or T2 in Figure 5. Hence, the C2MOS inverter needs to be enlarged to secure enough current drive, which may increase area and power consumption accordingly. This side effect can be compensated by the shortened signal path and the resulting performance gain, which in turn allows smaller DPFFs to be used during synthesis while still meeting a given performance constraint.
The asymmetric structure may cause an unbalanced timing specification for positive or negative edge-triggered FFs, which calls for careful sizing and optimization for the target edge-triggered systems.
Figure 6 shows the power and delay profile of the conventional and proposed FFs with different sizes. Size optimization is performed with an in-house tool which varies the individual transistor sizes in the FFs. The boundaries of size variation and the initial transistor sizes are set using the theory of Logical Effort [27], and prelayout simulations, without layout-extracted parasitics, are performed with various transistor sizes. The optimal points with respect to both power consumption and delay, which have the minimum power-delay product (PDP) for each FF, are marked with black dots. In addition, 3000 Monte Carlo simulations at iso-area conditions show that the proposed DPFFs have variations in key design metrics similar to those of their conventional counterparts (not shown).
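Selecting the PDP-optimal point from a set of sized candidates can be sketched as follows. This is a minimal illustration, not the in-house tool mentioned above: the `SizingPoint` class, the `pdp_optimum` helper, and all power and delay numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizingPoint:
    """One simulated sizing of a flip-flop (all values hypothetical)."""
    label: str
    power_uw: float   # average power in microwatts
    delay_ps: float   # propagation delay in picoseconds

    @property
    def pdp(self) -> float:
        """Power-delay product; uW * ps gives 1e-18 J (attojoules)."""
        return self.power_uw * self.delay_ps


def pdp_optimum(points):
    """Return the candidate with the minimum power-delay product."""
    return min(points, key=lambda p: p.pdp)


# Hypothetical sweep: larger sizing is faster but burns more power.
candidates = [
    SizingPoint("x1", power_uw=2.0, delay_ps=300.0),  # PDP = 600
    SizingPoint("x2", power_uw=3.0, delay_ps=180.0),  # PDP = 540
    SizingPoint("x4", power_uw=5.5, delay_ps=140.0),  # PDP = 770
]

best = pdp_optimum(candidates)
print(best.label, best.pdp)
```

Minimizing the product rather than either factor alone picks the knee of the power-delay curve, which corresponds to the kind of point marked with a black dot in Figure 6.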
Both proposed DPFFs can be extended to include scan logic. For example, Figure 7 shows one possible implementation of the scanned C2DPFF which, like the scanned TGDPFF, includes a total of 36 transistors and functions as a scanned D-type FF with asynchronous reset.

Adaptive Body Bias Technique for Ultralow Voltage Operation
Due to the impact of process variation and skewness between the PMOS and NMOS transistors, circuit robustness can degrade severely, especially in subthreshold operation. This limits how far the supply voltage can be scaled while still providing proper logic functionality with limited voltage headroom under process variation. Hence, it is essential to keep an equal device strength ratio between the transistors in the FFs as well as in the logic cells to minimize the impact of process variation [28].
We previously proposed a circuit technique, ABRM (adaptive β-ratio modulation) [29,30], which dynamically adjusts the P/N-ratio (or β-ratio) of the current drive between the PMOS and NMOS transistors and thus maximizes noise margin and circuit robustness for ultradynamic voltage operation. For the reader's convenience, we briefly restate ABRM as follows.
The body bias technique is used to equalize the strength of the pull-up and pull-down networks when switching back and forth between different regions of operation (Figure 8(a)). Forward body bias (FBB) lowers the threshold voltage $V_{th}$, whereas reverse body bias (RBB) increases $V_{th}$. Body biases are implemented with additional body-biasing rails for the PMOS and NMOS transistors (namely, $V_{pbody}$ and $V_{nbody}$) and a body bias generating circuit.
Figure 8(b) shows the body bias circuit for ABRM. The proposed adaptive body bias generator (BBG) consists of two comparators, switch logic, body bias voltage sources, two reference voltage sources, and an inverter used to monitor the logical threshold voltage $V_M$. The monitored $V_M$ of the inverter is compared against the reference voltages $V_{ref1}$ and $V_{ref2}$. If $V_M$ is below the predetermined reference potential $V_{ref1}$, indicating that the NMOS transistor is stronger than the PMOS transistor, we apply a FBB to the pull-up network (PUN) and/or a RBB to the pull-down network (PDN) to make them equally strong. Conversely, if the monitored $V_M$ is higher than $V_{ref2}$, the β-ratio is too large compared to the optimal value due to a strong PMOS, and we apply a FBB to the PDN (and/or a RBB to the PUN). If $V_M$ lies between the two reference levels, zero body bias (ZBB) is applied to the target system. The generated body bias voltages are fed to the inverter, and the updated $V_M$ is again compared against the reference voltages. With more voltage references (i.e., finer-grained levels), this loop repeats until the best body bias voltages are found.
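The comparison-and-adjust loop described above can be sketched in software. This is a behavioral illustration only, not the BBG circuit: the function names, the action labels, the linear model in which each bias step shifts the monitored logical threshold by a fixed amount, and all voltage values (integer millivolts) are assumptions made for the example.

```python
def select_body_bias(v_m, v_ref1, v_ref2):
    """Decide the bias action from the monitored logical threshold v_m.

    Mirrors the two-comparator decision in the text; labels are hypothetical.
    """
    if v_ref1 > v_ref2:
        raise ValueError("v_ref1 must not exceed v_ref2")
    if v_m < v_ref1:
        # NMOS stronger than PMOS: FBB on the PUN (and/or RBB on the PDN).
        return "strengthen_pun"
    if v_m > v_ref2:
        # PMOS stronger than NMOS: FBB on the PDN (and/or RBB on the PUN).
        return "strengthen_pdn"
    # Inside the reference window: no further adjustment needed.
    return "in_window"


def tune(v_m0, v_ref1, v_ref2, shift_per_step, max_steps=20):
    """Iterate bias steps until v_m falls inside [v_ref1, v_ref2].

    ASSUMED toy model: each PUN step raises v_m by shift_per_step and each
    PDN step lowers it by the same amount. Returns (pun_steps, pdn_steps, v_m).
    """
    pun_steps = pdn_steps = 0
    v_m = v_m0
    for _ in range(max_steps):
        v_m = v_m0 + shift_per_step * (pun_steps - pdn_steps)
        action = select_body_bias(v_m, v_ref1, v_ref2)
        if action == "in_window":
            break
        if action == "strengthen_pun":
            pun_steps += 1
        else:
            pdn_steps += 1
    return pun_steps, pdn_steps, v_m


# Hypothetical case in millivolts: the inverter threshold starts too low.
print(tune(v_m0=100, v_ref1=140, v_ref2=160, shift_per_step=10))
```

If the monitored threshold already lies inside the window on the first comparison, no steps are taken, which corresponds to the zero body bias case.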

Results and Discussions
To compare the conventional FFs and the proposed DPFFs in terms of the design metrics, we implement isolated FF test circuits with a direct probing capability. Figure 9(a) shows the photograph and GDS views of a test chip and the isolated FF test circuits fabricated in a 130 nm process.
The DPFFs are integrated into an 8-tap, 8-bit FIR filter as an example to demonstrate highly robust low-power operation at the circuit level. Figure 9(b) shows the diagram of the FF test circuits in the test chip and the FF design metrics as defined in Section 2. The FF cells in the test circuits are drawn according to the transistor sizes at the PDP optimal points marked in Figure 6. Note that the FF test circuits include two inverters connected in series, which shape the waveforms of both the input and clock signals, whereas four inverters connected in parallel serve as the output load. The same test circuit structure is used for FF simulations, where the sizes of the wave-shaping and output-load inverters are varied to change the slew rate of the input or clock waveforms and the value of the output capacitive load, respectively.
Conventionally, the setup and hold times are independently characterized as the data-to-clock skew at which the increase in the clock-to-Q delay remains within a certain percentage (say, 10%). The basic concept behind setup and hold time characterization is to sample and propagate the data in the stable region of operation. If the data-to-clock skew (or time difference) is too small, a FF fails to capture the data or fails to transfer it correctly. This window of data-to-clock skew is termed the failure region. During timing analysis, the constraints ensure that the FF does not fall into the failure region. In the stable region, the nominal clock-to-Q delay is denoted $t^{0}_{clk2q}$.
Table 1 summarizes the average values of 97 sets of the reference and proposed FFs, measured from the isolated test circuits shown in Figure 9 at the slow-slow (SS) corner with a supply voltage of 1.15 V and 25 °C for delay and power consumption, and at the fast-fast (FF) corner with a supply voltage of 1.25 V and 125 °C for leakage characterization. The area is calculated without scan logic. Compared with the MTGFF, both proposed DPFFs improve the clock-to-Q delay by more than 20% owing to the reduced logic depth, which in turn may allow the direct path FFs to achieve further reductions in area and power consumption. Note that the use of the C2MOS inverter at the primary input stage of the C2FF and C2DPFF lowers the setup time by 25% and 19%, respectively, over the MTGFF. The improvement in delay and setup time relaxes the timing constraints and improves the performance of target systems.
Table 2 shows the delay variation over supply voltage scaling for the conventional and proposed FFs, measured at the SS corner with a supply voltage range of 1.15 V to 0.65 V and −25 °C, where the delay ratio (or sensitivity) is calculated as the ratio of the delay at 0.65 V to the delay at 1.15 V. The proposed DPFFs have delay sensitivity to supply voltage scaling comparable with that of the conventional FFs, especially at low voltages.
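The push-out criterion used for setup time characterization (clock-to-Q delay allowed to grow by at most about 10% over its nominal value) can be sketched as a search over a skew sweep. The function name, the sweep data, and the use of `None` to mark a failed capture are hypothetical; the sketch walks from large skew toward small skew and stops at the first point that exceeds the limit.

```python
def setup_time(sweep, t0_clk2q, pushout=0.10):
    """Smallest data-to-clock skew whose clock-to-Q delay stays within
    (1 + pushout) * t0_clk2q, scanning from the stable region inward.

    sweep: iterable of (skew, clk2q) pairs; clk2q is None if capture failed.
    Returns None if even the largest skew violates the limit.
    """
    limit = (1.0 + pushout) * t0_clk2q
    result = None
    # Start at the largest skew (stable region) and move toward failure.
    for skew, clk2q in sorted(sweep, key=lambda p: p[0], reverse=True):
        if clk2q is None or clk2q > limit:
            break  # entered the push-out / failure region
        result = skew
    return result


# Hypothetical sweep in picoseconds: delay pushes out as the skew shrinks.
sweep = [
    (200, 100), (150, 100), (100, 102),
    (80, 108), (60, 125), (40, None),
]

print(setup_time(sweep, t0_clk2q=100))
```

With a nominal delay of 100 ps the limit is 110 ps, so the 80 ps point is the last one accepted. The hold time can be extracted the same way by sweeping the skew on the other side of the clock edge.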
Figure 10 shows the input D and output Q waveforms measured from the C2DPFF in the deep subthreshold region. The measured minimum supply voltage of $V_{dd,min}$ = 60 mV results in a dynamic switching power of 24.7 pW, five orders of magnitude smaller than in normal voltage operation, with the minimum-sized design at an operating frequency of 50 Hz during ten complete binary cycles (i.e., low-to-high and high-to-low transitions). By comparison, $V_{dd,min}$ of 80 mV and 85 mV is measured for the conventional C2FF and MTGFF, respectively. It is worth mentioning that an output swing of 36 mV is observed at a supply voltage of $V_{dd}$ = 60 mV. This voltage diminution arises because, unlike in the normal $V_{dd}$ region, the OFF leakage current is no longer negligible compared to the (operating) subthreshold current in the ultralow $V_{dd}$ region. That is, $I_{sub}/I_{off} \sim 10^{2}$ for $V_{dd} < V_{th}$, whereas $I_{on}/I_{off} \sim 10^{5}$ for $V_{dd} > V_{th}$, where $I_{on}$ is the (normal) ON current, $I_{sub}$ is the subthreshold current, $I_{off}$ is the leakage current, and $V_{th}$ is the threshold voltage of the device. The main conduction current at high or normal $V_{dd}$ can be explained by the drift mechanism, while the subthreshold current at ultralow $V_{dd}$ is mainly governed by the diffusion mechanism. For example, assume that the input D is set to "1" and a "0" value is driven to the internal node in Figure 4. Then, the PMOS transistor of the output driver I3 must pull the output up to $V_{dd}$ by overriding the (idle) OFF current of the NMOS transistor of I3. If the PMOS operating current is not strong enough to overcome the NMOS OFF current (unlike at high $V_{dd}$), the supply voltage is divided resistively across the transistors and, as a result, the output does not rise all the way to $V_{dd}$ in deep subthreshold operation.
Many studies have been reported on the theoretical and practical limits of CMOS logic operation [26,31,32]. Presenting circuit operation at such aggressively scaled supply voltages (i.e., a 36 mV swing at a supply voltage of 60 mV) does not mean that operating the system at this voltage level is recommended. Rather, the measurement results serve as evidence of the increased circuit robustness obtained with the proposed technique and of a possible advantage of the body bias technique: lowering $V_{dd,min}$ and salvaging a silicon chip that would otherwise fail to operate due to various process variations.
Comparing the FF types in terms of the clock-to-Q delay requires consideration of negative as well as positive influences. As discussed in Section 4, the reduced logic depth along the main signal propagation path usually decreases the propagation delay, whereas the use of the C2MOS inverter generally comes with the stacking effect, which results in a higher threshold voltage and less current drive and thus increases the propagation delay. Comparing the C2FF with the MTGFF, the delay decrease due to the reduced logic depth compensates for the delay increase due to the stacking effect, and hence the delay of the C2FF is comparable to or slightly better than that of the MTGFF.
In the proposed DPFFs, however, the much reduced logic depth further lowers the clock-to-Q delay by roughly one and two inverter delays over the C2FF and MTGFF, respectively, outweighing any delay increase caused by the stacking effect. This is the main advantage of the DPFFs: the direct path connection improves the clock-to-Q delay by 17% over the conventional FFs. From the power and area perspective, this gate delay improvement provides an opportunity to save more power at the iso-performance condition, even with less area. Moreover, the direct path connection allows the use of minimum-sized transistors in the data-storing units (e.g., I1 and I2 in Figure 4), which are now on the noncritical path, providing a further reduction in area, dynamic power, and leakage power by more than 36% (0.3%), 19% (9%), and 37% (6%) over the MTGFF (C2FF), respectively, with the same output load. Note that, in the C2FF, the output driver I2 drives the feedback transistors M1 and M2 as well as the underlying output load, which degrades the output slope, especially in the presence of a large fanout. The C2DPFF and TGDPFF address this fanout issue by using an additional driver, I3 and I4, respectively, dedicated to driving the output as in the MTGFF, which improves the output slope and current drive of the cell and covers a wide range of fanout.
Figure 11 provides the measurement results with various signal polarities and driving strengths. The normalized average values of the MTGFF, C2FF, TGDPFF, and C2DPFF are plotted in red, black, magenta, and blue, respectively. As the setup skew becomes smaller, the contamination delay, that is, the amount of time needed for a change in a logic input to cause an initial change at an output, increases dramatically. Consequently, there is a sharp push-out in the clock-to-Q delay, as shown in the figure. Note that, for a given clock-to-Q delay, the hold time increases with a decrease in the setup time to keep the internal race immunity in (2). One may argue that the DPFFs increase the setup and hold times as a side effect. This is because the decreased nominal clock-to-Q delay lowers the 10% constraint as well, which increases the setup and hold times by definition, even though the absolute values remain almost the same. The authors believe that FFs should be designed to operate with margin in the nominal delay region (i.e., the flat region in Figure 11) for stable and prompt operation.
The increased setup and hold times are acceptable and fully offset by the considerable advantages in propagation delay, power consumption, and area. The proposed DPFFs are fully integrated into an 8 × 8 FIR filter fabricated in 130 nm technology. Figure 11 shows the architecture of the 8-tap, 8-bit FIR filter. Detailed discussions of the filter are found in [29,30]. With the application of ABRM, by which the optimal β-ratio value of the filter is automatically driven by the BBG (body bias generator) as shown in Figure 9, the filter successfully operates all the way down to 85 mV, consuming 40 nW of power at an operating frequency of 240 Hz. This ultralow voltage operation demonstrates the high circuit robustness of the DPFFs, since relative variations become significantly higher with voltage scaling and the circuit becomes much more vulnerable to noise disturbance with limited voltage headroom.

Conclusions
Design metrics are of primary importance for FFs to be used as primitive library cells. We proposed two direct path master-slave FFs which adopt a C2MOS style input buffer to improve performance while providing full compatibility with widely used EDA characterization tools. The internal direct path between the master and slave latches reduces the logic depth along the main signal path, achieving a further speedup in the propagation delay. Measurements from the ABRM-applied test circuits fabricated in 130 nm technology demonstrated the potential advantages of the proposed FFs in design metrics for ultralow as well as normal voltage applications.

Figure 1: Flip-flop environment in a digital system.

Figure 6: Power and delay with different sizes (simulated).

Figure 8: Adaptive β-ratio modulation and variation tolerant body bias circuit.

Figure 9: Test chip with isolated FF test circuits in 130 nm technology.