A Modified Implementation of Tristate Inverter Based Static Master-Slave Flip-Flop with Improved Power-Delay-Area Product

The paper introduces novel architectures for implementation of fully static master-slave flip-flops for low power, high performance, and high density. Based on the proposed structure, traditional C2MOS latch (tristate inverter/clocked inverter) based flip-flop is implemented with fewer transistors. The modified C2MOS based flip-flop designs mC2MOSff1 and mC2MOSff2 are realized using only sixteen transistors each while the number of clocked transistors is also reduced in case of mC2MOSff1. Postlayout simulations indicate that mC2MOSff1 flip-flop shows 12.4% improvement in PDAP (power-delay-area product) when compared with transmission gate flip-flop (TGFF) at 16X capacitive load which is considered to be the best design alternative among the conventional master-slave flip-flops. To validate the correct behaviour of the proposed design, an eight bit asynchronous counter is designed to layout level. LVS and parasitic extraction were carried out on Calibre, whereas layouts were implemented using IC station (Mentor Graphics). HSPICE simulations were used to characterize the transient response of the flip-flop designs in a 180 nm/1.8 V CMOS technology. Simulations were also performed at 130 nm, 90 nm, and 65 nm to reveal the scalability of both the designs at modern process nodes.


Introduction
Flip-flops are the key elements used in sequential digital systems. The appropriate selection of flip-flop topologies is instrumental in the design of VLSI integrated circuits such as microprocessors, microcontrollers, and other high complexity chips. However, factors such as high performance, low power, transistor count, clock load, design robustness, power-delay, and power-area tradeoffs are generally considered before choosing a particular flip-flop design. The highest operating frequency of clocked digital systems is determined by the flip-flops. Flip-flops and clock distribution network generally account for 30-70% of the total chip power consumption [1,2]. Clock load is another major concern for digital system designers and several contributions have been reported in the past to reduce clock load and the associated power dissipation in the clocking network [3][4][5]. A design with elevated transistor count occupies a larger area on chip and leads to an increase in the overall manufacturing cost. Hence, design and implementation of low power high performance flip-flops with the least possible chip area is the main target of the modern chip manufacturing industry.
Flip-flops are broadly classified into three main categories, namely, master-slave [6][7][8][9][10][11], pulse-triggered [12][13][14][15][16][17], and differential flip-flops [18][19][20][21]. Among them, master-slave and pulse-triggered flip-flops are the most efficient in terms of power-delay product. Master-slave flip-flops exhibit positive set-up time and negative hold time requirements and hence are not suitable for high speed systems due to extended data-to-output delays. They are, however, power efficient and can be used in low power applications. Their main limitation is lower robustness to clock skew. Pulse-triggered flip-flops have negative set-up time and thus lead to smaller data-to-output delay. They exhibit an inherent soft clock edge property which minimizes clock skew related cycle time loss. A classification of master-slave flip-flops is further elaborated in Figure 1. Clock-gated topologies use internal clock gating, based on a clock gating logic and a comparator circuit, to suppress power consumption at lower data switching activities. However, clock-gated flip-flops have extended latency due to increased clock-to-output delays, along with increased chip area overhead. Clock-gated structures generally consume less power at low switching activities [22]. TGFF represents the best choice in the nonclock gated flip-flop category in terms of power-delay product [6], whereas the presence of NMOS transistors in the critical path along with partially nongated keepers leads to a less favourable power-delay tradeoff in the case of the write port master-slave flip-flop (WPMS) [7,8] and the pass transistor logic based flip-flop (PTLFF) [9].
In this paper, we introduce an alternative approach for designing C2MOS based master-slave flip-flops, based on a new architecture with reduced transistor count and improved power-delay-area product. The proposed configurations mC2MOSff1 and mC2MOSff2 fall under the nonclock gated flip-flop category as shown in Figure 1.
The rest of the paper is organized as follows. Section 2 compares the conventional master-slave flip-flop configurations with the proposed designs. Section 3 highlights the simulation parameters and test bench along with the techniques used for transistor sizing and the methodology adopted for optimization of timing and power-delay product. Section 4 describes the simulation results. Section 5 concludes the paper. An appendix is added to show calibration of parameters for delay calculations using LE theory and to outline the strategy followed for designing the eight-bit ripple counter.

Figure 2 shows the conventional master-slave flip-flop architecture, whereby two regenerative loops (L1 and L2) are present in the master and slave sections to provide static functionality. Both loops operate independently of each other on complementary clock signals. Regenerative loops are composed of cross coupled inverters. It can be observed from Figure 2 that, for each loop, regenerative action is achieved through one inversion in the forward (critical) path while the other (clocked) inversion takes place in the feedback path. Moreover, there is no common component between the two loops. Since an inverter followed by a transmission gate is equivalent to a clocked inverter, the combination is replaced by a clocked inverter to form a C2MOS based flip-flop architecture as shown in Figure 3 [23]. Two regenerative loops L3 and L4 are used in a similar manner as in the previous case to maintain the static nature of the flip-flop. However, in the proposed architecture shown in Figure 4(a), both inversions take place in the forward (critical) path; loop L6 is completed by a clocked switch, while loop L5 is completed by an inverter in the feedback path. It can be seen from Figure 4(a) that the output node is always driven and never floating, thus ensuring static flip-flop operation.
The size of transistors in the feedback path marked by asterisks ( * ) is kept at 360 nm (minimum technology width) to eliminate race conditions at nodes U and V. Yet another implementation is shown in Figure 4(b) which uses inverter INVX in the critical path and a clocked switch to form a regenerative loop L7. It is to be noted that INVX is common to both the regenerative loops L7 and L8 which is contrary to the realization of previous architectures. Figure 5 represents the actual circuit design based on the proposed architectures in Figure 4, while TGFF is implemented using transmission gates as switches in the conventional architecture as demonstrated in Figure 6.
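The master-slave principle described above can be illustrated with a purely behavioural sketch. It is not a model of the proposed circuits: the class names `Latch` and `MasterSlaveFF` are hypothetical, the internal inversions of the real designs are ignored, and the only property borrowed from the paper is that the two latches are transparent on complementary clock phases, giving negative-edge-triggered behaviour.

```python
class Latch:
    """Level-sensitive static latch: transparent while enabled, holds otherwise."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d  # transparent phase: output follows input
        return self.q   # opaque phase: regenerative loop holds the value


class MasterSlaveFF:
    """Negative-edge-triggered D flip-flop built from two latches (idealized)."""

    def __init__(self):
        self.master = Latch()
        self.slave = Latch()

    def step(self, d, clk):
        # Master is transparent while clk is high, slave while clk is low,
        # so the output changes only after a falling clock edge.
        m = self.master.update(d, enable=(clk == 1))
        return self.slave.update(m, enable=(clk == 0))


ff = MasterSlaveFF()
trace = [ff.step(d, clk) for d, clk in [(1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]]
print(trace)
```

In this sketch a change of the data input while the clock is low does not reach the output until the next falling edge, which is the behaviour the two complementary loops are meant to guarantee.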

Overview of Previous Work and Proposed Designs
It can be clearly observed that mC2MOSff1 and mC2MOSff2 are both realized using sixteen transistors each. As a result, the area occupied by the proposed designs is significantly smaller than that of the conventional designs. Moreover, the number of clocked transistors in mC2MOSff1 is six, as compared to eight in the case of TGFF or the conventional clocked inverter based flip-flop C2MOSff [23].
To illustrate the superior performance of the proposed flip-flop configurations, other flip-flop topologies belonging to the master-slave class, namely, TGFF, WPMS, PTLFF, gated master-slave latch (GMSL) [10], and data transition look ahead flip-flop (DTLA) [11], have been used for comparisons. Of these topologies, GMSL and DTLA represent flip-flops with internal clock gating. Schematic diagrams of WPMS, PTLFF, GMSL, and DTLA are shown in Figures 7, 8, 9, and 10, respectively.

Figure 11 shows the simulation test bench for characterization and comparison of the FF designs [3]. The clock and data signals are fed to the flip-flop through a two stage buffer. The minimum data-to-output delay (DQ,min) is used for performance comparisons. Logical effort theory is extensively used for designing fast CMOS circuits based on pencil and paper calculations and is widely adopted in the literature [24]. Hence, the delay sensitivity factor introduced by Alioto et al. [25], which is based on logical effort theory, has been used for performance optimization.

Simulation Parameters, Test Bench, and Optimization Methodology
A 16-cycle long pseudorandom sequence with a switching factor of 0.5 is supplied at the data input for measurement of average power [26]. Since the delay and power characterization are strongly dependent on the capacitive load offered to FFs [27], capacitive loads of {4, 16, 64}$\,C_{min}$, where $C_{min}$ is the input capacitance of a symmetrical minimum inverter ($W_p = 2W_n = 2W_{min}$), have been used to test the FF behaviour. The transistor sizing methodology adopted is the same as that in [28,29], whereas power-delay product (PDP) and power-delay-area product (PDAP) are the chosen figures of merit (FOM).
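As a small worked example of the loading conditions, the sketch below derives the three load values from the single figure the paper quotes elsewhere, namely that the 16X load equals 19.92 fF. The minimum-inverter capacitance is inferred from that figure and is therefore an assumption rather than a value stated in the text; the variable names are hypothetical.

```python
# 16X load as quoted in the paper (fF).
C_LOAD_16X_FF = 19.92

# Inferred (assumption): input capacitance of the minimum symmetrical inverter.
C_MIN_FF = C_LOAD_16X_FF / 16

# Load capacitance for each of the three loading conditions.
loads_ff = {multiple: multiple * C_MIN_FF for multiple in (4, 16, 64)}

for multiple, cap in loads_ff.items():
    print(f"{multiple}X load = {cap:.2f} fF")
```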
The expression relating the absolute gate capacitance ($C_{GATE}$) in femtofarads (fF) to the absolute transistor width ($W$) in nanometers (nm), obtained at the 180 nm process node by fitting simulation data [30], is given in (1).

The LE method states that the optimized delay of a path of $N$ cascaded stages is

$$D = N F^{1/N} + P, \qquad (2)$$

where $G$, $B$, and $H$ ($= C_L/C_{in}$) are the logical effort, branching effort, and electrical effort, while $P$, $F$ ($= GBH$), and $C_L$ are the parasitic delay, path effort, and final load capacitance, respectively. Relations (3) and (4) follow, and from (2) and (4) one obtains (5), which involves the relative delay increment with respect to the parasitic delay. Equations (4) and (5) indicate that larger values of $C_{in}$ lead to a saturation in the optimized delay. Based on this analysis, the delay sensitivity factor introduced by Alioto et al. [25] is utilized to obtain the upper bound on the transistor widths for exploration of the power-delay design space with least computational effort. The delay sensitivity factor with respect to $C_{in}$ is defined in (6) and is obtained from (3) to (5). The upper bounds on the normalized transistor widths (normalized with respect to $W_{min}$) have been obtained such that the delay sensitivity remains under a minimum value, which is chosen as −5% for our analysis. The input capacitance $C_{in}$ of the flip-flop is expressed in terms of the normalized width $w_1$ in (7).

Figure 12 shows the conventional TGFF design. The sizing is done by assuming the transistors in the critical path to be independent design variables (IDVs) and optimizing for maximum performance using LE theory. The inverter before the transmission gate in the first stage protects the input terminal from noise variations [31]. Table 2 shows the delay variation for increasing $C_{in}$ values. It is noteworthy that the delay saturates at 153 ps for $C_{in}$ = 24.8 fF.
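The saturation of the optimized delay with increasing input capacitance can be sketched numerically using the standard logical effort expression $D = N F^{1/N} + P$ from [24]. The values chosen below for the number of stages, logical effort, branching effort, and parasitic delay are hypothetical placeholders, not the paper's values, and the sensitivity is computed as a normalized quantity (relative delay change per relative change in input capacitance), which is an assumption about the exact definition used in [25].

```python
def path_delay(n_stages, g, b, c_load, c_in, p):
    """Optimized normalized path delay D = N * F**(1/N) + P, with F = G*B*H."""
    h = c_load / c_in          # electrical effort H = C_L / C_in
    f = g * b * h              # path effort F = G * B * H
    return n_stages * f ** (1.0 / n_stages) + p


def delay_sensitivity(n_stages, g, b, c_load, c_in, p, rel_step=1e-4):
    """Normalized sensitivity (dD/D) / (dC_in/C_in) by central difference."""
    up = path_delay(n_stages, g, b, c_load, c_in * (1 + rel_step), p)
    down = path_delay(n_stages, g, b, c_load, c_in * (1 - rel_step), p)
    nominal = path_delay(n_stages, g, b, c_load, c_in, p)
    return (up - down) / (2 * rel_step * nominal)


# Hypothetical path parameters (assumptions for illustration only).
N, G, B, P = 4, 4.0, 2.0, 8.0
C_LOAD = 19.92  # fF, the 16X load quoted in the paper

c_in_values = [2.48, 4.96, 9.92, 12.4, 24.8]  # fF
delays = [path_delay(N, G, B, C_LOAD, c, P) for c in c_in_values]
sens = [delay_sensitivity(N, G, B, C_LOAD, c, P) for c in c_in_values]

for c, d, s in zip(c_in_values, delays, sens):
    print(f"C_in = {c:5.2f} fF  D = {d:6.3f}  sensitivity = {100 * s:6.2f}%")
```

In this simplified model the delay keeps decreasing as the input capacitance grows, but each further increase buys less, which is the reason a sensitivity threshold can be used to bound the transistor widths.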
As a result, the upper bounds on transistor widths are exposed and the limits of the power (energy)-delay design space are defined early in the design cycle [32]. The table also includes the corresponding power dissipation along with the power-delay product, and it is observed that the minimum power-delay product is obtained at $C_{in}$ = 9.92 fF. The technology parameters used for capacitance calculations throughout this paper are listed in Table 3.

Results and Discussion
It is a well-established fact that the conventional C2MOS flip-flop, although slower, is skew tolerant and occupies less area than TGFF [23,33]. Moreover, mC2MOSff1 and mC2MOSff2 show nearly identical characteristics in terms of power, delay, and area, and hence only mC2MOSff1 is considered for comparisons.
The waveforms in Figure 13 represent the transient analysis of mC2MOSff1 carried out over a period of 8 clock cycles. The SPICE simulation results verify correct flip-flop operation at 1 GHz clock frequency (all the flip-flops reported in the paper are designed for negative edge triggered operation). The variation of the absolute data-to-output delay (DQ,min) with FF input capacitance ($C_{in}$) for the 16X (19.92 fF) capacitive load is illustrated in Figure 14.
TGFF utilizes transmission gates in the critical path and hence it is faster than the rival designs. There is exactly the same number of stages in the critical path of TGFF and mC2MOSff1, the only difference being that the latching circuit in the case of TGFF is an inverter followed by a clocked transmission gate (inverting latch), whereas a clocked/tristate inverter is present in mC2MOSff1. The logical effort of both latches is considered to be two; however, an inverter followed by a transmission gate is faster because the output node is driven by both transistors of the transmission gate in parallel, and this behaviour is reflected in Figure 14. It follows that the logical effort of an inverting latch can be assumed to be two for most theoretical purposes, but for comparison with a C2MOS latch it must be taken as slightly less than two if delays are to be modelled precisely.
Equation (2) clearly indicates that a smaller branching effort leads to faster circuit operation. The branching effort for a path with internal fan-out is expressed as [24]

$$b = \frac{C_{\text{on-path}} + C_{\text{off-path}}}{C_{\text{on-path}}},$$

where $C_{\text{on-path}}$ represents the load capacitance along the path under analysis and $C_{\text{off-path}}$ represents the capacitance of the connections that lead off the path. The branching effort along the critical path is given as

$$B = \prod_i b_i.$$

There are two branches each in TGFF and mC2MOSff1, represented as $b_1$, $b_2$ and $b_3$, $b_4$ in Figures 6 and 5(a), respectively. The branching effort corresponding to branches 1, 2, 3, and 4 is calculated as follows.
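The branching effort definition from [24] can be sketched as a short calculation. The capacitance values below are hypothetical placeholders chosen only to show the arithmetic; they are not the node capacitances of TGFF or mC2MOSff1, and the function names are illustrative.

```python
from math import prod


def branching_effort(c_on_path, c_off_path):
    """Branching effort of one node: (C_on-path + C_off-path) / C_on-path."""
    return (c_on_path + c_off_path) / c_on_path


def path_branching_effort(branches):
    """Path branching effort: product of the per-node branching efforts."""
    return prod(branching_effort(on, off) for on, off in branches)


# Hypothetical (on-path, off-path) capacitances in fF for two branch points.
branches = [(8.0, 2.0), (10.0, 2.5)]
B = path_branching_effort(branches)
print(f"Path branching effort B = {B:.4f}")
```

A branch point with no off-path capacitance contributes a factor of exactly one, so removing transistors from a feedback structure that loads the critical path reduces the path branching effort directly.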

Branching Effort in Case of TGFF.
One has the following.
It is clearly observed that the delay of mC2MOSff1 is marginally higher than the delay of TGFF. Now, keeping the other parameters the same and assuming the logical effort of the inverting latch to be 1.8, the updated delay of TGFF is evaluated as $D$ = 12.35 (absolute delay 160.55 ps).
The value of the process dependent parameter $\tau$ is determined as approximately 13 ps using the calibration technique described by Sutherland et al. [24]. The detailed procedure is discussed in the Appendix. The absolute delay measurements obtained through simulation are 162 ps for TGFF and 196 ps for mC2MOSff1, which are in close agreement with the theoretical values of 160.55 ps and 166.27 ps, respectively (typically within 15% error).
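The conversion from normalized to absolute delay, and the comparison against simulation, can be reproduced with a few lines. The sketch uses only numbers quoted in the text (a delay unit of 13 ps, a normalized TGFF delay of 12.35, and the theoretical and simulated delays); the choice to normalize the error by the simulated value is an assumption, since the paper does not state which reference it uses.

```python
TAU_PS = 13.0  # process dependent delay unit quoted in the paper (ps)


def absolute_delay_ps(d_normalized, tau_ps=TAU_PS):
    """Absolute delay as the product of normalized delay and the delay unit."""
    return d_normalized * tau_ps


def relative_error(model_ps, simulated_ps):
    """Model error relative to the simulated value (assumed convention)."""
    return abs(simulated_ps - model_ps) / simulated_ps


tgff_model = absolute_delay_ps(12.35)  # normalized TGFF delay from the text
mc2_model = 166.27                     # theoretical mC2MOSff1 delay (ps)

err_tgff = relative_error(tgff_model, 162.0)  # simulated TGFF delay (ps)
err_mc2 = relative_error(mc2_model, 196.0)    # simulated mC2MOSff1 delay (ps)

print(f"TGFF model delay: {tgff_model:.2f} ps, error {100 * err_tgff:.1f}%")
print(f"mC2MOSff1 model delay: {mc2_model:.2f} ps, error {100 * err_mc2:.1f}%")
```

Normalizing by the model value instead of the simulated value would give slightly different percentages.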
WPMS and PTLFF topologies show degraded performance due to the presence of pass transistors in the critical path, while the speed of the clock-gated structures is worst mainly because a gating circuit is inserted between the clock and the flip-flop terminals, which deteriorates the timing characteristics. The characterizations are done assuming that $C_{in}$ = 12.4 fF and $C_L$ = 19.92 fF (16X), where $C_L$ represents the flip-flop load capacitance.
The variation of average power with $C_{in}$ for the 16X loading condition is depicted in Figure 15. Due to the threshold voltage drop at internal nodes, WPMS and PTLFF display the worst power dissipation characteristics because of short circuit power dissipation. GMSL and DTLA exhibit greater power dissipation than their nongated counterparts because the pseudorandom sequence has an activity factor of 0.5: the additional comparator and clock gating circuit are beneficial only at sufficiently low switching activities and otherwise lead to both increased area and power overhead. Apart from the clock load, the capacitance at the internal nodes of mC2MOSff1 is reduced as compared to TGFF by eliminating transistors TN6 and TP6 from the feedback structure.

Capacitance Calculations at Internal Nodes of mC2MOSff1
Internal Capacitance at Nodes P' and K'. Node P': C(TN12) + C(TP12) = 9.76 fF.

It can be concluded from the calculations above that a total of 19.34 fF of capacitance has been removed from the internal nodes in the critical path of mC2MOSff1 in comparison to TGFF. This leads to reduced internal power dissipation at these nodes, as less capacitance has to be charged or discharged per clock cycle. However, the reduction in the clock load of mC2MOSff1 due to the transistors eliminated from the feedback structure is nullified by the PMOS transistors TP10 and TP11, whose size is twice that of transistors TP1 and TP5 in TGFF. As a result, the total power dissipation of both flip-flops is nearly the same, as can be clearly observed from Figure 16. Following a similar procedure, the clock load of the various flip-flops is obtained and listed in Table 4 along with the number of clocked transistors and the power consumption values. It is seen that TGFF and mC2MOSff1 represent the most efficient designs in terms of reduced power consumption, having power dissipation comparable to DTLA at $C_{in}$ = 12.4 fF and $C_L$ = 19.92 fF. It can be observed that mC2MOSff1 has the least transistor count along with PTLFF, while GMSL and DTLA consist of the maximum number of transistors. Since only sixteen transistors are used for circuit realization of mC2MOSff1, its power dissipation is comparable to that of TGFF. It is worth noting that GMSL and DTLA offer the minimum clock load; as a result, these topologies exhibit the least power dissipation at lower switching activities. The reason for the extended clock-to-output delays of GMSL and DTLA is the insertion of clock gating circuitry, while DTLA has a pulsed operation and hence shows negative set-up time requirements. Based on the power and delay measurements, power-delay product characteristics are derived for all the flip-flops as shown in Figure 16.
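To give a feel for the scale of the internal-node saving, the sketch below applies the usual first-order switching power estimate, activity × C × VDD² × f, to the 19.34 fF quoted above. The supply voltage (1.8 V), clock frequency (1 GHz), and switching factor (0.5) are taken from the paper; treating the switching factor as the per-cycle probability of an energy-drawing transition at these nodes is an assumption, so the result is an order-of-magnitude figure rather than a prediction of the simulated power.

```python
def dynamic_power_w(c_farad, vdd_volt, freq_hz, activity):
    """First-order switching power: activity * C * VDD^2 * f."""
    return activity * c_farad * vdd_volt ** 2 * freq_hz


C_REMOVED_F = 19.34e-15  # capacitance removed from internal nodes (F)
VDD_V = 1.8              # supply voltage of the 180 nm process (V)
F_CLK_HZ = 1e9           # clock frequency used in the transient analysis (Hz)
ACTIVITY = 0.5           # switching factor of the pseudorandom data (assumed to apply here)

p_saved_w = dynamic_power_w(C_REMOVED_F, VDD_V, F_CLK_HZ, ACTIVITY)
print(f"Estimated internal-node saving: {p_saved_w * 1e6:.2f} uW")
```

This estimate covers only the internal nodes; as the text notes, the larger clocked PMOS transistors offset it in the total power.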
The optimum power-delay product of the gated structures GMSL and DTLA is, respectively, 3.30 and 3.34 times greater than the optimum PDP of TGFF. Among the nonclock gated structures, the pass transistor based designs WPMS and PTLFF exhibit 1.77 and 1.57 times higher power-delay product than the benchmark flip-flop TGFF. TGFF also shows 20% improvement over mC2MOSff1 in terms of minimum power-delay product. However, although TGFF represents a better alternative in terms of performance and optimum power-delay product, area requirements also remain a major concern. It has been observed in the literature that the conventional C2MOS based flip-flop is up to 20-25% more efficient in terms of occupied chip area. This stems mainly from the fact that, at layout level, (i) in comparison to TGFF, diffusion areas of most of the transistors can be shared in the C2MOS flip-flop [33], (ii) the number of contact holes can be reduced in the layout pattern [23], and (iii) the less complicated feedback structure leads to fewer interconnections. The layouts were implemented using $C_{in}$ = 12.4 fF, with almost identical transistor sizes throughout the critical path, with the exception of TP10 and TP11 belonging to mC2MOSff1, which are twice the size of TP1 and TP5 in accordance with LE theory. The layouts for TGFF and mC2MOSff1 are shown in Figures 17 and 18, respectively.

... the power consumption of the mC2MOSff1 based counter is comparable to that of the TGFF based counter at varying frequencies. Again, LE theory has been adopted for sizing the individual flip-flops in each counter for optimum performance, as described in detail in the Appendix.
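Returning to the trade-off between PDP and area for TGFF and mC2MOSff1, the relationship between the two figures of merit can be sketched as follows. PDAP is simply PDP multiplied by area, so a design with a worse PDP still wins on PDAP if its area shrinks by more than its PDP grows. The ratios used below are hypothetical placeholders, not the measured values from the paper's tables.

```python
def pdap(power, delay, area):
    """Power-delay-area product."""
    return power * delay * area


def breakeven_area_ratio(pdp_ratio):
    """Largest area ratio (new / reference) for which PDAP does not get worse."""
    return 1.0 / pdp_ratio


# Hypothetical ratios of a candidate design relative to a reference design.
PDP_RATIO = 1.2    # candidate PDP is 1.2x the reference (assumption)
AREA_RATIO = 0.75  # candidate area is 0.75x the reference (assumption)

relative_pdap = pdap(PDP_RATIO, 1.0, AREA_RATIO) / pdap(1.0, 1.0, 1.0)
print(f"Break-even area ratio: {breakeven_area_ratio(PDP_RATIO):.3f}")
print(f"Relative PDAP of candidate: {relative_pdap:.3f}")
```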
The flip-flops were also designed and simulated to layout level, with inclusion of parasitics, in 130 nm, 90 nm, and 65 nm CMOS processes to address scalability issues at more advanced process nodes. The simulation test bench and optimization methodology are similar to those described in Section 3. PVT variations are considered in order to evaluate the performance of the flip-flops at all process corners, namely, FF, SS, FS, and SF, with voltages scaled from 0.9 to 1.1 V and temperatures varied from 0 to 125 °C, as shown in Table 6. The simulation and technology parameters are also listed in Table 6, including the gate capacitance per unit width, which was evaluated to be 1.3 fF/µm by fitting simulation data. In addition, the capacitances per unit length of the poly, metal 1, and metal 2 interconnects are also mentioned.
For illustration purposes, the delay and power variations with the flip-flop input capacitance at different process corners in 65 nm CMOS technology for mC2MOSff1 are shown in Figures 20 and 21, respectively, at 16X capacitive loading. Both mC2MOSff1 and mC2MOSff2 showed correct circuit behaviour at the aforementioned process nodes, which indicates that no internal noise violations exist, especially since logic levels are retained even at the FF process corner. However, it should be pointed out that mC2MOSff1, in a manner similar to TGFF, starts to fail at the SS corner for lower values of $C_{in}$ [34].

Conclusion
In this paper, an alternative architecture for designing C2MOS based flip-flops is presented with a modified feedback strategy while preserving fully static operation. Using the new feedback approach, a modified topology mC2MOSff1 is proposed with decreased parasitic capacitances at internal nodes in comparison to TGFF, which is the best design in terms of PDP. However, postlayout simulations and analyses indicate that the modified configuration mC2MOSff1 presents the best alternative in terms of PDAP among all the conventional designs. Therefore, for high performance applications, TGFF still remains the best choice, but it can be replaced by mC2MOSff1 for high density applications. Comparisons were carried out with state-of-the-art flip-flops in the master-slave class. The simulation results are well supported by mathematical analysis based on logical effort theory within acceptable error (typically less than 15%).

A. Delay Calibration Using LE Theory
For modelling delays using LE theory, all the delays are initially expressed in terms of a basic delay unit $\tau$, which is process dependent, such that the absolute delay is represented as the product of the unitless delay, as shown in (2), and the delay unit $\tau$. Accordingly,

$$D_{abs} = D\,\tau. \qquad (A.1)$$

While $D$ represents the delay for a multistage path, $d$ corresponds to the delay of a single stage logic gate. Parameter $\tau$ needs to be estimated in order to obtain absolute delays, and accordingly a delay versus fanout curve is determined for an inverter, as shown in Figure 22, by fitting simulation data. The curve is approximated as a straight line and the slope of the line represents $\tau$, since $d_{abs} = \tau(gh + p)$ and the logical effort $g$ of an inverter is 1. In our case, $\tau$ is estimated as 13 ps.
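The calibration step can be sketched as an ordinary least-squares line fit of inverter delay against fanout, where the slope gives the delay unit and the intercept divided by the slope gives the inverter parasitic delay. The data points below are synthetic: they are generated from the 13 ps value reported in the paper together with a hypothetical parasitic delay of 1.1, purely to demonstrate that the fit recovers the slope. They are not the paper's simulation data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic inverter delays (ps) from d_abs = tau * (g*h + p) with g = 1.
TAU_TRUE_PS = 13.0  # value reported in the paper
P_INV_TRUE = 1.1    # hypothetical parasitic delay (assumption)
fanouts = [1, 2, 3, 4, 5, 6]
delays_ps = [TAU_TRUE_PS * (h + P_INV_TRUE) for h in fanouts]

tau_ps, intercept_ps = fit_line(fanouts, delays_ps)
p_inv = intercept_ps / tau_ps  # intercept = tau * p for an inverter

print(f"tau = {tau_ps:.2f} ps, inverter parasitic delay p = {p_inv:.2f}")
```

With real simulation data the points would not lie exactly on a line, and the quality of the straight-line approximation would determine how well a single delay unit describes the process.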

B. Implementation of 8-Bit Ripple Counter
An 8-bit asynchronous counter was implemented by converting the D flip-flop configuration to a T flip-flop configuration using an EXOR gate as illustrated in Figure 23.
The T flip-flop designed using TGFF is shown in Figure 24. It is considered to be a five stage design and optimized for highest speed using LE theory. The EXOR gate was realized using transmission gates as revealed in Stage 1 of Figure 24. A similar procedure was followed for designing mC 2 MOSff1 based T flip-flop.
For designing the modulo 256 counter, the output of each stage is connected to the clock terminal of the next stage through two intermediate inverters (acting as a buffer) sized ($W_p$ = 11.52 µm, $W_n$ = 5.76 µm) such that the input capacitance of the first inverter acts as the load capacitance for the flip-flop configuration of the previous stage, as depicted in Figure 25. As a result, the load at the output terminal of each flip-flop is uniformly fixed at 19.92 fF.
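The logical behaviour of such a ripple counter can be checked with a zero-delay behavioural sketch. Each stage is modelled as a negative-edge-triggered T flip-flop (a D flip-flop whose input is T XOR Q), and the non-inverting two-inverter buffer between stages is modelled as a direct connection. The class names are hypothetical and no timing, sizing, or loading effects are represented.

```python
class TFlipFlop:
    """Negative-edge-triggered T flip-flop: next Q = T xor Q on a falling edge."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def clock(self, clk, t=1):
        falling_edge = self._prev_clk == 1 and clk == 0
        self._prev_clk = clk
        if falling_edge:
            self.q = t ^ self.q  # D input is formed by the XOR gate


class RippleCounter:
    """Asynchronous counter: each stage output clocks the following stage."""

    def __init__(self, bits=8):
        self.stages = [TFlipFlop() for _ in range(bits)]

    def apply_clock(self, clk):
        signal = clk
        for stage in self.stages:
            stage.clock(signal)
            signal = stage.q  # buffered output drives the next clock input

    def pulse(self):
        self.apply_clock(1)
        self.apply_clock(0)

    @property
    def value(self):
        return sum(stage.q << i for i, stage in enumerate(self.stages))


counter = RippleCounter(bits=8)
for _ in range(300):
    counter.pulse()
print(counter.value)  # count after 300 clock pulses, modulo 256
```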