Ultra-Low-Voltage Self-Body Biasing Scheme and Its Application to Basic Arithmetic Circuits

Gate level body biasing (GLBB) is assessed in the context of ultra-low-voltage logic designs. To this purpose, a GLBB mirror full adder is implemented in a commercial 45 nm bulk CMOS triple-well technology and compared to equivalent conventional zero body-biased CMOS and dynamic threshold voltage MOSFET (DTMOS) circuits under different running conditions. Postlayout simulations demonstrate that, at the parity of leakage power consumption, the GLBB technique exhibits a significant concurrent reduction of the energy per operation and the delay in comparison to the conventional CMOS and DTMOS approaches. The silicon area required by the GLBB full adder is halved with respect to the equivalent DTMOS implementation, but it is higher in comparison to the conventional CMOS design. The performed analysis also shows that the GLBB solution exhibits a high level of robustness against temperature fluctuations and process variations.


Introduction
Ultra-low-voltage (ULV) operation is a popular design approach to achieve high energy efficiency. When the power supply voltage ($V_{DD}$) is scaled down, dynamic power consumption is considerably decreased; however, as $V_{DD}$ approaches the transistor threshold voltage ($V_{TH}$), the delay starts to increase exponentially [1][2][3][4][5][6] and circuit performance becomes extremely sensitive to process variations and temperature fluctuations [7,8]. These issues have to be addressed before ULV designs can be widely adopted [7].
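The trade-off described above can be illustrated with a first-order model. The following is a minimal sketch, not the simulation setup used in this work: it assumes a switching energy of C·V_DD² and a simplified weak-inversion current that grows exponentially with (V_GS − V_TH). The parameter values (`i0`, `n`, `v_thermal`, and the threshold voltage of 0.45 V) are illustrative assumptions and are not taken from the 45 nm technology.

```python
import math

def dynamic_energy(c_load, vdd):
    """Energy drawn from the supply for one charge/discharge cycle: C * VDD^2."""
    return c_load * vdd ** 2

def subthreshold_current(vgs, vth, i0=1e-7, n=1.5, v_thermal=0.026):
    """Simplified weak-inversion drain current (drain-voltage term omitted).

    i0, n and v_thermal are assumed, illustrative values.
    """
    return i0 * math.exp((vgs - vth) / (n * v_thermal))

def gate_delay(c_load, vdd, vth):
    """First-order delay estimate: charge to move divided by the ON current."""
    return c_load * vdd / subthreshold_current(vdd, vth)

if __name__ == "__main__":
    c_load = 1.2e-15  # 1.2 fF, an illustrative load
    vth = 0.45        # assumed threshold voltage, illustrative only
    for vdd in (0.40, 0.30, 0.20):
        energy = dynamic_energy(c_load, vdd)
        delay = gate_delay(c_load, vdd, vth)
        print(f"VDD={vdd:.2f} V  E_dyn={energy:.3e} J  delay={delay:.3e} s")
```

In this model, halving $V_{DD}$ cuts the switching energy by a factor of four, while the delay grows roughly exponentially once $V_{DD}$ is below $V_{TH}$.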
To boost the performance of ULV designs, while also improving robustness against process and temperature variations, the forward body biasing (FBB) technique can be effectively used [7][8][9][10][11][12][13]. FBB can be applied (also dynamically) at different levels of granularity, ranging from the macroblock level to the transistor level. The key rationale for applying such a technique at the macroblock level is to amortize the silicon area and the body control signal routing complexity of a finer grained implementation. As a drawback, when $V_{TH}$ is reduced at the block level to compensate for variations and/or to provide a temporary speed boost, leakage power is increased for all the gates in the block, while the speed-up would be needed only on timing critical gates. Better energy-delay trade-offs can be obtained by making the body-bias control granularity finer, at the expense of larger silicon area occupancy [13].
Body biasing can be dynamically managed at the transistor level by exploiting the dynamic threshold voltage MOSFET (DTMOS) approach [8]. DTMOS logic uses transistors whose gates are tied to their bodies. As the substrate voltage varies with the gate voltage, the threshold voltage of the device is dynamically changed. When the device is turned ON, its threshold voltage is forced to drop, thus allowing a much higher ON current compared to a standard MOSFET [8]. On the contrary, in the OFF state, the characteristics of a DTMOS transistor become similar to those of a regular MOSFET. A major limitation for the use of bulk DTMOS devices is that a large distance between transistors controlled by different gate signals has to be maintained to ensure correct body isolation between differently body-biased devices [14,15]. This causes not only a larger occupied silicon area but also longer interconnections, which in turn degrade speed and energy performance. As an additional drawback, the large body capacitance and resistance [16] of the devices introduce an additional RC delay in charging the substrate and the input nodes of the DTMOS logic gates [17]. Moreover, the substrate bias voltage of DTMOS logic gates changes even when input transitions do not imply output switching. This charges and discharges the large body capacitances, thus wasting precious dynamic energy [11]. All the above effects can erode the expected advantages of DTMOS circuits.
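The threshold lowering exploited by DTMOS can be illustrated with the classic long-channel body-effect relation, $V_{TH} = V_{TH0} + \gamma(\sqrt{2\phi_F - V_{BS}} - \sqrt{2\phi_F})$. The sketch below is only a textbook approximation for an NMOS device with grounded source; the values of `vth0`, `gamma`, and `phi_f` are assumptions chosen for illustration and do not describe the devices used in this work.

```python
import math

def threshold_voltage(v_bs, vth0=0.45, gamma=0.2, phi_f=0.4):
    """Long-channel body-effect relation for an NMOS transistor.

    v_bs > 0 is a forward body bias and lowers the threshold voltage.
    vth0, gamma and phi_f are assumed, illustrative parameters.
    """
    two_phi = 2.0 * phi_f
    return vth0 + gamma * (math.sqrt(two_phi - v_bs) - math.sqrt(two_phi))

def dtmos_threshold(v_gate, **params):
    """In DTMOS the body is tied to the gate, so V_BS follows V_GS."""
    return threshold_voltage(v_bs=v_gate, **params)

if __name__ == "__main__":
    for v_gate in (0.0, 0.1, 0.2, 0.3):
        print(f"V_G={v_gate:.1f} V  V_TH={dtmos_threshold(v_gate):.4f} V")
```

With the gate low the device behaves as a zero body-biased transistor; as the gate rises, the body follows and the threshold drops, which is the ON-current boost described above.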
Recently, a gate level body biasing technique was proposed [11,18] to overcome the speed and energy limits of DTMOS logic gates. With this solution, the RC delay in charging the body of the devices does not affect the speed of logic gates. Additionally, when input signals switch without changing the logic gate status, the body capacitances are no longer needlessly charged and discharged.
In this work, an extended postlayout analysis of the potential of gate level body-biased (GLBB) nanoscale designs in the low voltage regime is presented. As a main result, we demonstrate that GLBB designs are fully functional, robust, fast, and energy efficient both in the subthreshold and near-threshold regions. The benefits of the proposed scheme are initially evaluated by comparing the suggested approach with zero body-biased (ZBB) CMOS and DTMOS solutions in the case of simple logic gates such as NAND2 and NOR2. Afterwards, a mirror full adder (FA) [18] implemented according to the GLBB technique is compared to equivalent ZBB CMOS and DTMOS counterparts. All the FA designs, evaluated through a preliminary prelayout analysis in [18], were laid out in the ST 45 nm CMOS triple-well technology. It is worth noting that postlayout analysis is strictly required when adaptive body biasing techniques are used in nanometer technologies. This is because the physical distances needed to provide correct body isolation between differently body-biased devices have a very large impact on the delay and energy characteristics of the circuits. All the compared circuits were evaluated in the ultra-low-voltage regime under different running conditions. Depending on the power supply voltage level, the GLBB FA allows delay to be reduced in the ranges of 6%-34% and 24%-40% in comparison to the ZBB CMOS and DTMOS circuits, respectively. This is achieved while also saving energy per operation. As an example, for an 80 FO4 clock cycle period and an activity factor of 10%, the GLBB circuit reduces energy per operation in the ranges of 15%-27% and 47%-77% with respect to the ZBB CMOS and DTMOS FAs. Such energy and speed advantages are obtained at the expense of increased silicon area occupancy in comparison to a conventional ZBB CMOS design, but with an area occupancy about two times smaller than that of the DTMOS implementation. Additionally, the GLBB FA maintains a high level of robustness against temperature and process variations.
The rest of the paper is organized as follows. Section 2 discusses the operating characteristics of the GLBB approach. The compared mirror full adder designs are discussed and comparatively characterized at the postlayout level in Section 3. Finally, Section 4 concludes the paper.

Operational Features of Gate Level Body-Biased Logic Gates
As shown in Figure 1(a), the generic GLBB logic gate consists of two circuit sections: the logic subcircuit, which is responsible for the logical functionality, and the body biasing generator (BBG), which manages the body voltage ($V_B$) of all the devices belonging to the logic subcircuit. The BBG is a simple push-pull amplifier, which acts as a voltage follower for the output voltage $V_{OUT}$ while decoupling the large body capacitances from the output node.
In Figures 1(b)-1(c), the transient behavior of the input voltage ($V_{IN}$), the output voltage ($V_{OUT}$), and the body voltage ($V_B$) is shown for the falling and rising output transitions, respectively. When $V_{OUT}$ is equal to $V_{DD}$ (0 V), the BBG transfers a high (low) voltage onto the $V_B$ net, thus preparing the pull-down (pull-up) network for a faster logic gate switching. Since the MOSFETs of the switching network (either pull-up or pull-down) are already forward body-biased before the arrival of the gate inputs, the output transition is largely favored by a switching current significantly higher than in the ZBB CMOS scheme. Speed improvement also exists with respect to the DTMOS configuration. In fact, the transition of the input signals is not slowed down by the body-induced RC delay, as occurs in sub-DTMOS gates, whereas the high capacitive load seen by the BBG does not constitute a speed bottleneck, since the $V_B$ voltage is always established well before the input transition. Indeed, by inspecting the transient behavior reported in Figures 1(b)-1(c), it is easy to understand that the body-induced RC delay at the output of the BBG represents a benefit, since it allows a slower transition for the body voltage and consequently a faster transition on the gate output.
In spite of the performance advantages previously discussed, GLBB logic gates show a somewhat increased leakage current with respect to their ZBB CMOS and DTMOS counterparts. This is mainly due to the fact that the output voltage transition of the BBG is not rail-to-rail (a PMOS device is used to transfer a low voltage onto the $V_B$ net, whereas an NMOS transistor is used for transferring the high voltage). This reduces the threshold voltage of the leaky devices belonging to the OFF network (either pull-down or pull-up) of the logic subcircuit during the idle status. An additional contribution to the static power consumption of the GLBB logic gates is due to the static current flowing in the BBG circuit. However, such a current is inherently limited by the reverse body biasing of the BBG transistors ($V_{BS,N}$ and $V_{SB,P}$ are always < 0 in the BBG circuit) and becomes negligible if reduced size devices are used for the BBG implementation.
In order to estimate the trade-off potentially offered by the proposed approach, leakage current ($I_{leak}$) versus delay curves are shown in Figures 2(a)-2(b) for the NAND2 and NOR2 logic gates, respectively, in their ZBB CMOS, DTMOS, and GLBB implementations. For a fair comparison, all the logic gates were sized for a $W_P/W_N$ ratio of $\beta \cdot W / W$ between pull-up and pull-down networks, where $\beta$ was chosen to assure symmetric switching delay under the typical NMOS, typical PMOS (TT) process corner, $V_{DD}$ = 300 mV, $T$ = 27 °C, rise and fall times of 500 ps, and a capacitive load of 1.2 fF. The curves have been obtained by varying the sizing factor $W$ from 0.12 μm to 1.2 μm, with a step of 0.12 μm. Reported results are normalized to data obtained for minimum sized ZBB CMOS implementations.
As expected, at a parity of $W$, the GLBB technique shows a higher leakage current than its competitors. This means that, among the evaluated choices, the GLBB style is the least suitable if the minimization of static power is the main design target. On the contrary, if speed is the main design aim, the GLBB style becomes the most reasonable choice, allowing higher performance to be reached at the parity of leakage power consumption, since the boosting action of the BBG allows the delay target to be reached using smaller transistors. Moreover, the GLBB technique allows performance ranges which are unaffordable for both ZBB CMOS and DTMOS configurations.

Benchmark Circuit and Postlayout Comparative Analysis
To further validate the GLBB design technique, the low-voltage mirror FA shown in Figure 3 was designed and compared to equivalent ZBB CMOS and DTMOS counterparts. Devices belonging to the logic subsections of the compared circuits were sized with minimum channel length (i.e., $L_{min}$ = 40 nm), whereas the pull-up/pull-down channel width ratio was chosen to obtain comparable strength for $V_{DD}$ = 0.3 V and $T$ = 27 °C, imposing equal width for series-connected transistors.
In Table 1, the width ratio between pull-up and pull-down networks is explicitly reported for the compared designs and for the different stacking configurations. The sizing factor $W$ was chosen by iterative simulations, imposing similar leakage current at nominal conditions (i.e., TT process corner, $V_{DD}$ = 0.3 V, and $T$ = 27 °C) for all the compared designs.
In order to correctly take into account the impact of layout parasitics on performance, the physical design of the compared circuits was carried out (see Figure 4) according to the design rules imposed by the ST 45 nm bulk CMOS triple-well technology. For DTMOS and GLBB designs, the deep N-well layer was used to shield N-channel devices from the P-type general substrate, thus obtaining P-well regions isolated from the underlying substrate. Each of these regions is surrounded by an N-well region to provide lateral isolation as well [14,15]. Due to the distances needed to provide correct body isolation between differently body-biased devices, implementations exploiting unconventional body biasing (i.e., DTMOS and GLBB) exhibit significantly increased silicon area occupancy in comparison to the ZBB CMOS circuit. In an area optimized layout, the DTMOS implementation requires one isolated P-well region for each different transistor gate signal, thus requiring 5 different isolated P-well islands. On the contrary, in the proposed approach, the number of isolated P-type islands is reduced to 4 (i.e., one for each BBG). This, along with the reduced size of its transistors, allows the proposed implementation to reduce silicon area occupancy by more than 50% with respect to the DTMOS design.
Table 2 reports postlayout comparison results under nominal simulation conditions. Comparative postlayout delay results, evaluated for $V_{DD}$ ranging from 0.2 V to 0.5 V with a voltage step of 0.05 V, are shown in Figure 5. The results are normalized with respect to the delay of the ZBB CMOS design. For $V_{DD}$ = 0.5 V, the suggested approach reduces delay by 34% and 24% with respect to the standard CMOS and DTMOS implementations, respectively. Observing the inset of Figure 5, it is easy to note that as $V_{DD}$ decreases below 0.45 V, the impact of FBB in boosting the performance is reduced, but at a different rate for the GLBB and DTMOS techniques. As a final effect, the speed benefit brought by the suggested approach over the conventional CMOS circuit shrinks to 6% for the minimum considered power supply voltage (i.e., $V_{DD}$ = 0.2 V). On the contrary, the speed advantage with respect to the DTMOS implementation becomes more pronounced, reaching 40% for $V_{DD}$ = 0.2 V (the speed boost that FBB gives to DTMOS is outweighed by the negative impact of the body-induced RC delay).
Figure 6 reports $I_{leak}$ versus $V_{DD}$ for the three compared circuit topologies. Here, $I_{leak}$ is normalized to the value of the CMOS design for $V_{DD}$ = 0.3 V. Due to the adopted sizing criterion, all the circuits have similar $I_{leak}$ for $V_{DD}$ = 0.3 V (see Table 2). However, this property is not maintained for different power supply voltage levels. As $V_{DD}$ drops below 0.3 V, the proposed approach, which benefits from reduced transistor sizing, leads to the lowest $I_{leak}$. On the contrary, the standard CMOS FA exhibits the lowest $I_{leak}$ for $V_{DD}$ > 0.3 V. Note that, for $V_{DD}$ higher than 0.45 V, the parasitic pn junctions of DTMOS devices start to conduct a nonnegligible current, which dramatically increases the leakage power consumption of the DTMOS-based FA.
Figures 7 and 8 compare the energy per operation ($E_{OP}$) behavior versus $V_{DD}$ for the three compared circuit implementations, evaluated under different running conditions.
Results are normalized to the energy data obtained for the conventional CMOS circuit evaluated under the operating condition of $V_{DD}$ = 0.3 V, activity factor ($\alpha$) of 0.2, and clock cycle time ($T_{clk}$) of 80 FO4 (FO4 represents the delay of a CMOS inverter driving four identical inverters), which is typical of low power VLSI circuits [19]. More precisely, Figure 7 plots $E_{OP}$ considering $T_{clk}$ = 80 FO4 for $\alpha$ = 0.1, 0.2, and 0.3.
Considering the lowest activity factor ($\alpha$ = 0.1), the GLBB solution allows $E_{OP}$ to be reduced in the ranges of 15%-27% and 47%-77% with respect to the ZBB CMOS and DTMOS designs, respectively. This is mainly due to the reduced transistor sizes (see Table 1) of the GLBB circuit, which decrease the total physical capacitances on the internal nodes of the circuit, even taking all the parasitics of the layout into account. Additionally, the proposed body biasing technique allows faster transitions of the gates, which in turn diminish the short circuit component of the dynamic energy. The above advantages are even more pronounced for larger activity factors (i.e., when the dynamic energy contribution to the total $E_{OP}$ increases). Due to the previously discussed input capacitance drawbacks, the larger devices, and the longer interconnections, the DTMOS implementation turns out to be very energy hungry. Additionally, the bulk bias voltage of DTMOS devices can change even when input transitions do not imply switching of circuit internal nodes. This further increases the dynamic energy consumption due to unnecessarily charging and discharging the large body capacitances. Figure 8 shows $E_{OP}$ versus $V_{DD}$ when $\alpha$ = 0.2 and for $T_{clk}$ = 50 FO4, 80 FO4, and 100 FO4. It should be noted that as the leakage energy contribution increases (i.e., when $T_{clk}$ increases), the suggested solution continues to maintain significant advantages in terms of total energy, also for $V_{DD}$ higher than 0.3 V.
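The roles of the activity factor and of the clock cycle time in the discussion above can be made explicit with a common first-order decomposition of the energy per operation into a dynamic and a leakage term. This is a sketch under stated assumptions, not the postlayout measurement procedure used in this work: the function names, the lumped switched capacitance `c_switched`, and the omission of the short circuit component are simplifications introduced here for illustration.

```python
def clock_period(n_fo4, fo4_delay):
    """Clock cycle time expressed as a number of FO4 delays (seconds)."""
    return n_fo4 * fo4_delay

def energy_per_operation(alpha, c_switched, vdd, i_leak, t_clk):
    """First-order energy per operation: dynamic plus leakage contribution.

    alpha      : activity factor (fraction of cycles in which the circuit switches)
    c_switched : lumped switched capacitance in farads (assumed quantity)
    i_leak     : leakage current in amperes
    t_clk      : clock cycle time in seconds
    The short circuit component is neglected in this sketch.
    """
    dynamic = alpha * c_switched * vdd ** 2
    leakage = i_leak * vdd * t_clk
    return dynamic + leakage

def normalized(value, reference):
    """Normalize an energy value to a reference operating point."""
    return value / reference
```

In this model a larger $\alpha$ raises the weight of the dynamic term, whereas a longer $T_{clk}$ raises the weight of the leakage term, which matches the trends discussed for Figures 7 and 8.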
Figure 9 better emphasizes the power-delay product (PDP) and delay advantages of the proposed FA when employed in a 16-bit ripple carry adder (RCA). The power of the FA under test is evaluated at the maximum frequency of the whole adder (to correctly take into account the leakage contribution), whereas delay is related to the device under test in the FA chain. In the above scenario, the GLBB FA lowers the minimum PDP point by 22% and 68% in comparison to the CMOS and DTMOS circuits, respectively. This is achieved with a speed boost of 17%/66% when compared to the CMOS/DTMOS implementations. Speed and PDP advantages are recorded over the whole power supply range. Figure 10 describes the behavior of the compared circuits as the temperature varies from −25 °C to 100 °C for $V_{DD}$ = 0.3 V. As shown in Figure 10(a), all the circuits demonstrate similar leakage currents at low operating temperatures (< 25 °C). However, as the temperature increases, the leakage current of the DTMOS circuit increases faster than that of its counterparts, becoming approximately 1.6 times higher for $T$ = 100 °C. Figure 10(b) demonstrates that the GLBB FA maintains its speed advantages over the whole considered operating temperature range.
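For reference, the logic function of a mirror full adder and its chaining into the 16-bit ripple carry adder used as the test case of Figure 9 can be modeled functionally. The sketch below uses the standard mirror-adder decomposition, in which the inverted carry is generated first and then reused to form the inverted sum; it describes the logic only and says nothing about body biasing, sizing, or timing. The function names are hypothetical.

```python
def mirror_full_adder(a, b, cin):
    """Return (sum, cout) for one-bit inputs using the mirror-adder decomposition."""
    # First stage: inverted carry, cout_n = NOT(a*b + cin*(a + b))
    cout_n = 1 - ((a & b) | (cin & (a | b)))
    # Second stage: inverted sum reuses cout_n,
    # sum_n = NOT(a*b*cin + cout_n*(a + b + cin))
    sum_n = 1 - ((a & b & cin) | (cout_n & (a | b | cin)))
    # Output inverters restore the true polarities
    return 1 - sum_n, 1 - cout_n

def ripple_carry_add(x, y, width=16):
    """Add two unsigned integers by chaining `width` full adders."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = mirror_full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

if __name__ == "__main__":
    print(ripple_carry_add(1234, 4321))   # (5555, 0)
    print(ripple_carry_add(0xFFFF, 1))    # (0, 1): the carry ripples through all 16 stages
```

The second call exercises the worst-case carry path, in which the carry has to propagate through every FA of the chain.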
The impact of process variability was investigated by performing Monte Carlo (MC) simulations on 1000 samples for $V_{DD}$ = 0.3 V and $T$ = 27 °C. In this analysis, both interdie and intradie fluctuations were considered. MC leakage and delay results are given in Figures 11 and 12, respectively. When compared to its counterparts, the ZBB CMOS circuit exhibits the lowest mean leakage current (−19% and −9% in comparison to the DTMOS and GLBB designs, resp.) with a slightly higher leakage current variability ($\sigma/\mu$ = 11% for the CMOS design against $\sigma/\mu$ = 8% and 10.4% for the DTMOS and GLBB solutions). On the other hand, the suggested approach proves more robust in terms of delay. In fact, the MC delay results reported in Figure 12 demonstrate that the mirror FA designed according to the proposed design style reaches a mean delay of only 0.5 μs, which is about 20% and 28% lower than that of the standard CMOS (0.63 μs) and DTMOS (0.7 μs) implementations, respectively, while maintaining a delay standard deviation of about 0.21 μs.
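The statistics quoted above can be reproduced from raw Monte Carlo samples with a few lines of code. The sketch below shows how the mean, the standard deviation, and the variability $\sigma/\mu$ are obtained, and checks the relative delay reductions against the mean delays reported in the text. The helper names are hypothetical, and no simulation data are included: only the three mean delay values are taken from the text.

```python
import statistics

def variability(samples):
    """Return (mean, sample standard deviation, sigma/mu) of a list of samples."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return mu, sigma, sigma / mu

def relative_reduction(reference, value):
    """Fractional reduction of `value` with respect to `reference`."""
    return (reference - value) / reference

if __name__ == "__main__":
    # Mean delays reported in the text (GLBB, ZBB CMOS, DTMOS)
    glbb, cmos, dtmos = 0.5, 0.63, 0.7
    print(round(100 * relative_reduction(cmos, glbb), 1))   # 20.6
    print(round(100 * relative_reduction(dtmos, glbb), 1))  # 28.6
```

Both values agree with the rounded figures of about 20% and 28% given above.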

Conclusion
In this work, the advantages of the recently introduced ULV gate level body biasing scheme were investigated. A preliminary analysis performed on simple logic gates demonstrates that the speed boosting provided by the suggested approach allows ULV GLBB circuits to reach performance levels which are unaffordable for both conventional CMOS and DTMOS configurations.
To take into account all the parasitic effects of the gate level body polarization in the case of more complex circuits, a GLBB mirror full adder was laid out and compared against its conventional CMOS and DTMOS counterparts. Postlayout simulation results have shown that the GLBB design style is, at the parity of leakage power consumption, able to obtain significantly higher performance with reduced total energy per operation in comparison to conventional CMOS and DTMOS implementations. The silicon area required by the GLBB full adder is halved with respect to the equivalent DTMOS implementation, but it is higher in comparison to the conventional CMOS design. Finally, the temperature analysis and Monte Carlo simulations show that the GLBB solution exhibits a high level of robustness against temperature fluctuations and process variations.

Figure 1: Logic gate with gate level dynamic body biasing (a) and transient behavior for output falling (b) and rising (c) voltage.

Figure 3: Low-voltage mirror FA designed according to the GLBB technique.

Figure 7: Energy per operation (log scale) for $T_{clk}$ = 80 FO4 and for different activity factors.