Automatic FSM Synthesis for Low-power Mixed Synchronous / Asynchronous Implementation

Power consumption in a synchronous FSM (Finite-State Machine) can be reduced by partitioning it into a number of coupled sub-FSMs where only the part that is involved in a state transition is clocked. Automatic synthesis of a partitioned FSM includes a partitioning algorithm and sub-FSM synthesis to an implementation architecture. In this paper, we first introduce an implementation architecture for partitioned FSMs that uses a gated-clock technique for disabling idle parts of the circuits and asynchronous controllers for communication between the sub-FSMs. We then describe a new transformation procedure for the sub-FSM. The FSM synthesis flow has been automated in a prototype tool that accepts an FSM specification. The tool generates synthesizable RT-level VHDL code with identical cycle-to-cycle input/output behavior in accordance with the specification. An average power reduction of 45% has been obtained for a set of standard FSM benchmarks.


INTRODUCTION
Optimization techniques for low average power consumption in synchronous digital CMOS circuits often attempt to minimize the dynamic power consumption, described as $P = V_{DD}^2 \cdot f \cdot \sum_i \alpha_i C_i$, where $\alpha_i$ is the probability of a signal transition within a clock period at node $i$, $C_i$ is the switched capacitance at node $i$, $V_{DD}$ is the power supply voltage and $f$ is the clock frequency. Power optimization can be made on all abstraction levels, from IC technology to the system level. When optimizing on the gate level, or even higher abstraction levels, power optimization minimizes the product $\alpha \cdot C$, called effective capacitance. Here, both power supply voltage and clock frequency are often regarded as fixed in the system specification and cannot be affected.
For architectural design, it is possible to reduce the effective capacitance by minimizing the communication over long wires that have high capacitance. Placing the required resources, such as processing units and memories, locally within the module reduces global communication [11]. It is also possible to shut down parts of the design that are idle, which makes the effective capacitance equal to zero during that period. Data path units, such as multipliers and ALUs, which are purely combinatorial logic, are shut down by disabling further changes of the values on the input signals.
Here, additional logic is introduced to detect if the unit can be shut down or not. This technique has been described by Alidina et al. in [1] and is called the input-disabling precomputation-based approach. For sequential circuits, gated-clock techniques are used to disable the clock signal to the parts of the design that are idle. For large circuits, such as complex microprocessors, this technique is often referred to as dynamic power management.
Here, there are large functional units, such as cache memories and floating point units, with very specific tasks that are shut down when not used. This type of coarse-grained gated-clock technique can be applied manually by the designer, thanks to the small number of places where clock-gating is introduced and to the fact that the different units are functionally well separated and therefore easy to identify. In order to use fine-grained clock-gating, a single functional unit is partitioned into several sub-units, each of which is conditionally clocked by a gated clock signal. The number of places where the clock is gated increases and it becomes less obvious how to partition the unit. An automated procedure is therefore needed for synthesizing the original design to a gated-clock implementation optimized for power.
For FSMs, the most common approach to low-power design is to divide the FSM into two or more sub-FSMs where only one of these is active at a time. Both the precomputation-based technique and the clock-gating technique have been used. Dasgupta et al. [8] use the precomputation-based technique for PLA (Programmable Logic Array) implementations of FSMs and reduce the effective capacitance in the transition logic and output logic. Benini et al. [5] detect self-loops, i.e., when the next state is equal to the current state, and the clock is gated under this condition. This approach has been extended in [5, 14], where states that are strongly connected, i.e., that have a high probability of state transitions among them, are placed in the same cluster or super-state, and the state transitions within the super-state can be seen as a self-loop for that super-state. These approaches result in partitioning based on the description given in an STG (State Transition Graph). In the paper by Roy et al. [14], the partitioning is based on state assignment. For FSMs with few or no self-loops, e.g., counters, it is possible to detect smaller FSMs that have self-loops. On higher levels of abstraction, the gated-clock approach has been applied for low-power optimization from high-level specifications of hardware. In [7], the control flow is examined and mutually exclusive sections of the computation are detected; these determine the partitioning of the FSM that controls the execution. In another approach, presented by Hwang et al. [9], the FSM is partitioned along with the data path, which leads to an implementation where both the data path and the controller are shut down when idle.
The amount of power that is saved by partitioning the FSM is mainly determined by how well the partitioning algorithm can cluster strongly connected states together in sub-FSMs and by how large the cost is, in terms of power, of making a state transition from one sub-FSM to another. In our work, we have focused on minimizing the cost of making state transitions from one sub-FSM to another. This has led us to a new implementation architecture that is based on a gated-clock technique for shutting down idle sub-FSMs, and asynchronous communication between the sub-FSMs. The two main benefits of having asynchronous communication are as follows. First, the power overhead introduced by the circuits handling sub-FSM communication is up to five times lower than for the corresponding synchronous solution [12]. Second, a more power-efficient protocol can be employed that, on its own, lowers the power consumption by up to a factor of two compared to existing ones [13].
The outline of the rest of this paper is as follows: The next chapter describes the principles behind gated-clock FSMs, points out the problems with fully synchronous implementation architectures and motivates the asynchronous approach we propose. Chapter 3 presents the decomposition model we use and Chapter 4 describes details on how partitioned FSMs are implemented. Chapter 5 presents experimental results from automatically synthesized FSM benchmark circuits.

BACKGROUND
The goal of an implementation architecture for partitioned FSMs is that it should provide an implementation with the same input/output behavior as the FSM specification describes. In our case, the specification is given in STG form for a monolithic FSM with synchronous behavior. The implementation architecture we propose in this paper is similar to the one that we have earlier presented in [12]. It also has much in common with the synchronous architecture that is used by Benini et al. [5]. The proposed architecture is depicted in Figure 1. The main difference compared to the synchronous architecture for partitioned FSMs is that the communication between the sub-FSMs, which is handled by the CCBs, is asynchronous instead of synchronous.
The purpose of the sub-FSM interaction protocol is to control the activation and the deactivation of the sub-FSMs. When a state transition to a state in another sub-FSM takes place, the active sub-FSM generates an event on a communication control signal called a go-signal. This event has the following functional meaning: activate the sub-FSM that contains the destination state of the transition and deactivate the currently active sub-FSM. In general, a sub-FSM may emit many go-signals, one for each external state transition, and it can be activated by one of many incoming state transitions. The CCB that is associated with a sub-FSM enables or disables the clock signal based on go-signals both from its own and other sub-FSMs. In Figure 2, a timing diagram is shown for state transitions between states residing in the two sub-FSMs (FSM0 and FSM1), see Figure 1.

FIGURE 2 Timing diagram for transitions from FSM0 to FSM1 and then back to FSM0.
With asynchronous control for the CCB, we can remove the need for a clock signal in the CCB. For low-power design, it is important to have as small an effective capacitance as possible. The clock signal is the signal with the highest switching activity (twice as high as that of any data signal), so any capacitance added to it will contribute significantly to increased power consumption.
The additional control circuitry introduced by the CCBs will naturally introduce additional power dissipation (power overhead). The number of CCBs is equal to the number of sub-FSMs, but only one of them enables the clock at a time. A CCB has three operational modes:
Hand-over: When a transition from one sub-FSM to another takes place. In this mode, the asynchronous CCB is active and responds to the go-signal.
Enable: This is one of two passive modes. The CCB is passive and enables the local clock signal to the sub-FSM. In this mode, the CCB dissipates no power except in the AND gate enabling the clock.
Disable: The CCB is passive and disables the local clock signal. The power consumption comes from switching the input of the AND gate.
In Figure 3, the energy consumption for both the synchronous and asynchronous CCB is given. From this figure we can observe two things. The first is that energy consumption for the asynchronous CCB is lower in all modes. The second is that the difference in the energy dissipation of the asynchronous CCB in the different modes is larger than the difference of the synchronous CCB. This property is typical for asynchronous circuits, where power is dissipated only when the circuit is active. In a partitioned FSM when there is no hand-over, only one CCB is in enable mode while the rest are in disable mode. In the clock cycle when a hand-over occurs, one CCB is in hand-over mode. The total power consumption in the CCBs can be expressed as:

$P_{CCB} = \alpha \cdot P_{hand-over} + (1 - \alpha) \cdot P_{enable} + (N - 1) \cdot P_{disable}$

where $P_{hand-over}$, $P_{enable}$, and $P_{disable}$ are the power consumption for the CCBs in the different modes, $\alpha$ is the probability of a hand-over and $N$ is the number of CCBs. With asynchronous CCBs, the power consumption can be reduced by a factor of five for CCBs in disable mode, which constitute the majority of the CCBs. This will have a significant impact on the total power overhead, especially when the number of CCBs ($N$) is large.
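The CCB power expression can be evaluated directly. The following is a minimal sketch, assuming per-mode power values are available as plain numbers; the function name `ccb_power` and the numeric values in the usage lines are illustrative assumptions, not measurements from this work.

```python
def ccb_power(alpha, p_handover, p_enable, p_disable, n_ccbs):
    """Average power of all CCBs in a partitioned FSM.

    alpha      -- probability of a hand-over in a clock cycle
    p_handover -- power of a CCB in hand-over mode
    p_enable   -- power of a CCB in enable mode
    p_disable  -- power of a CCB in disable mode
    n_ccbs     -- number of CCBs (N), equal to the number of sub-FSMs
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if n_ccbs < 1:
        raise ValueError("at least one CCB is required")
    # One CCB is either handing over or enabling; the other N-1 are disabled.
    return alpha * p_handover + (1.0 - alpha) * p_enable + (n_ccbs - 1) * p_disable


# Illustrative numbers only (arbitrary units): the disable-mode term
# dominates as the number of CCBs grows.
few = ccb_power(0.1, 4.0, 2.0, 1.0, n_ccbs=2)
many = ccb_power(0.1, 4.0, 2.0, 1.0, n_ccbs=16)
```

The sketch makes the role of the $(N - 1) \cdot P_{disable}$ term visible: lowering the disable-mode power pays off more the finer the partition is.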
The second advantage of having asynchronous control is that the sub-FSM communication protocol can be made more power efficient. The existing synchronous solutions, e.g. [5], require that two sub-FSMs are clocked simultaneously at hand-over. The power consumption at hand-over will be largest here because two sub-FSMs will be active in this cycle. We have removed this requirement by implementing the CCB as an asynchronous controller. A synchronous controller updates its states only at clock edges. In contrast to this, an asynchronous controller can change state as a response to an input change after only some combinatorial delay. We used this property to design an asynchronous protocol that does not require simultaneous clocking of two sub-FSMs at hand-over. The total power consumption for the sub-FSMs in the synchronous ($P_{sub-fsm,synch}$) and the asynchronous ($P_{sub-fsm,asynch}$) case is expressed as:

$P_{sub-fsm,synch} = \sum_{i=0}^{N-1} (T_i + \alpha_i) \cdot P_{sub-fsm,i}$

$P_{sub-fsm,asynch} = \sum_{i=0}^{N-1} T_i \cdot P_{sub-fsm,i}$

where $T_i$ is the duty probability for the $i$th sub-FSM, $P_{sub-fsm,i}$ is the internal power consumption of the $i$th sub-FSM, and $\alpha_i$ is the probability of activation of the $i$th sub-FSM. The total power dissipation in a partitioned FSM is the sum of $P_{CCB}$ and $P_{sub-fsm}$. Using the proposed asynchronous approach reduces both of these components.
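The effect of the extra clocked cycle at hand-over can be illustrated numerically. The sketch below is written under one stated assumption: in the synchronous case, each activation of sub-FSM $i$ costs one additional clocked cycle, so its effective duty becomes $T_i + \alpha_i$. The function names and the numbers are illustrative and not taken from the experiments.

```python
def sub_fsm_power_async(duty, power):
    """Sum of T_i * P_sub-fsm,i over all sub-FSMs."""
    if len(duty) != len(power):
        raise ValueError("duty and power must have the same length")
    return sum(t * p for t, p in zip(duty, power))


def sub_fsm_power_sync(duty, activation, power):
    """Assumed synchronous model: each activation adds one clocked cycle,
    giving (T_i + alpha_i) * P_sub-fsm,i per sub-FSM."""
    if not (len(duty) == len(activation) == len(power)):
        raise ValueError("all inputs must have the same length")
    return sum((t + a) * p for t, a, p in zip(duty, activation, power))


# Two sub-FSMs, illustrative values in arbitrary units.
duty = [0.7, 0.3]            # T_i: fraction of cycles each sub-FSM is active
activation = [0.05, 0.05]    # alpha_i: probability of activation per cycle
power = [2.0, 4.0]           # P_sub-fsm,i: internal power when clocked

p_async = sub_fsm_power_async(duty, power)
p_sync = sub_fsm_power_sync(duty, activation, power)
```

The difference between the two results is the sum of $\alpha_i \cdot P_{sub-fsm,i}$, i.e., the cost of clocking two sub-FSMs in the hand-over cycle under the assumed model.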

FSM DECOMPOSITION
In this chapter, the decomposition model we use is presented first. We then make the necessary definitions that will be used for describing the FSM transformation procedures. Implementation of these procedures is discussed in Chapter 4. In this chapter, we use abstract automata theory as described by Baranov in [3]. There is, however, a small difference in terminology between the work in [3] and other works we refer to in the area of implementation of decomposed FSMs, e.g. [5, 9]. In [3], the initial FSM, which corresponds to a monolithic FSM implementation, is referred to as a source automaton and a sub-FSM is referred to as a component automaton. In this chapter, we use the same abstraction and terminology as in [3]. Elsewhere in this paper, the terminology in most of the referred papers concerning implementation is used.

Decomposition Model
The source Mealy automaton is defined as a sextuple $A = (S, X, Y, \delta, \lambda, s_0)$, where $S$ is the set of states, $X$ is the set of binary inputs, $Y$ is the set of binary outputs, $\delta$ is the transition function, $\lambda$ is the output function and $s_0$ is the initial state. The automaton can be represented in the form of a transition table, where every row defines one transition from a source state to a destination state along with a certain output term according to a certain input term.
Let $\pi$ be a partition on the set $S$. The automaton $A$ can be decomposed into a set of component automata where every block $S^m \in \pi$ defines a component automaton $A^m = (S^m, X^m, Y^m, \delta^m, \lambda^m, s_0^m)$. We call the states $S^m$ the internal states of the component automaton. $X^m$ is the set of input variables at all transitions from the states in $S^m$, and $Y^m$ is the set of output variables at all transitions from the states in $S^m$. $\delta^m$ and $\lambda^m$ are the transition and output functions on the sets $S^m$ and $X^m$. Such a decomposition can be achieved by reordering the groups of transition table rows having the same source state, followed by segmentation according to the blocks of $\pi$.
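The reordering and segmentation of the transition table can be sketched as follows. This is a minimal illustration, assuming a row is stored as source state, input term, destination state and output term; the `Row` type, the `decompose` function and the small example automaton are hypothetical names and data introduced only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    src: str   # source state
    inp: str   # input term, e.g. "1-" ("-" is a don't care)
    dst: str   # destination state
    out: str   # output term


def decompose(rows, partition):
    """Segment a transition table according to the blocks of a partition.

    rows      -- list of Row objects of the source automaton
    partition -- list of disjoint sets of state names (the blocks of pi)
    Returns one row list per block, holding the rows whose source state
    lies in that block.
    """
    block_of = {}
    for m, block in enumerate(partition):
        for state in block:
            if state in block_of:
                raise ValueError(f"state {state} appears in two blocks")
            block_of[state] = m
    tables = [[] for _ in partition]
    # Reorder: group rows with the same source state (the sort is stable).
    for row in sorted(rows, key=lambda r: r.src):
        tables[block_of[row.src]].append(row)
    return tables


rows = [
    Row("s1", "-", "s2", "1"),
    Row("s0", "0", "s0", "0"),
    Row("s2", "-", "s0", "0"),
    Row("s0", "1", "s1", "0"),
]
tables = decompose(rows, [{"s0", "s1"}, {"s2"}])
```

Note that the destination of a row may still lie in another block; such rows are the crossing transitions that the later transformation steps deal with.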

Definitions
In this section, we will define different sets from the component automaton point of view. Let us define $V(s_k)$ to be the set of states from which there are transitions to the state $s_k$; $s_k$ is not included in $V(s_k)$. With $X_h$ we denote an input term that is valid in the expressions where it is used.
$V(s_k) = \{ s_j \mid \delta(s_j, X_h) = s_k,\ j \neq k \}$
Similarly, we define a set of states, $T(s_k)$, not included in $S^m$, to which there are transitions from the states of $S^m$.
We define two similar sets that are of more importance for the whole component automaton. Here, $V(S^m)$ is the set of states in $S$, not included in $S^m$, that have transitions to the states $s_k$ residing in $S^m$.
$T(S^m)$ is the set of states not included in $S^m$ to which there are transitions from the states included in $S^m$.
Let us define the set of states in $S^m$ to which there are transitions from other component automata as $Q(S^m)$. The position of the sets defined above is depicted in Figure 4.
The transitions into the set $T(S^m)$ originate from another subset of $S^m$, which is denoted $W(S^m)$, $W(S^m) \subseteq S^m$. The position of the sets defined above is depicted in Figure 5.
We will use the shorter denotations $V^m$, $W^m$, $Q^m$, and $T^m$ in the rest of this chapter.
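The four boundary sets of a component can be computed in one pass over the transitions. The sketch below assumes the reading used in the transformation steps that follow: $V^m$ and $T^m$ lie outside the block $S^m$, while $Q^m$ (entry states) and $W^m$ (exit states) lie inside it. The function name `boundary_sets` and the example transitions are illustrative assumptions.

```python
def boundary_sets(transitions, block):
    """Compute (V, Q, W, T) for one block S^m.

    transitions -- iterable of (source state, destination state) pairs
    block       -- set of states forming S^m
    V: states outside S^m with a transition into S^m
    Q: states inside S^m that are entered from outside
    W: states inside S^m with a transition leaving S^m
    T: states outside S^m that are entered from S^m
    """
    V, Q, W, T = set(), set(), set(), set()
    for src, dst in transitions:
        if src not in block and dst in block:      # entering transition
            V.add(src)
            Q.add(dst)
        elif src in block and dst not in block:    # exiting transition
            W.add(src)
            T.add(dst)
    return V, Q, W, T


transitions = [("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("s2", "s2")]
V, Q, W, T = boundary_sets(transitions, {"s0", "s1"})
```

For the block $\{s_0, s_1\}$ in this example, the only entering transition is $s_2 \to s_0$ and the only exiting transition is $s_1 \to s_2$.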

Transformation of the Network
In this section, we will present the transformation sequence that results in a modified network description suitable for the implementation architecture we are targeting. In the following presentation, the definitions given in the previous section are used.
The transformation is carried out by the following steps:
1. Replace the transitions from the set $W^m$ to $T^m$ with transitions from $W^m$ to additional states that we call transition states, $G^m$, i.e., replace $\delta(s_j, X_h) = s_k$; $s_j \in W^m$, $s_k \in T^m$ with $\delta'(s_j, X_h) = s_k$; $s_j \in W^m$, $s_k \in G^m$. There is a one-to-one mapping between the elements of $T^m$ and $G^m$. Let us denote the set of states replacing the transitions originating from $V^m$ with $G^{m-}$. These transitions cause the activation of component $m$.
2. Introduce new unconditional transitions from the states in $G^m$ to a single state $d^m$: $\delta(g_i, 1) = d^m$; $g_i \in G^m$.
3. Introduce new transitions originating from the additional state $d^m$. The new transitions are based on all transitions from the set $Q^m$. There is a many-to-one mapping between the elements of $G^{m-}$ and $Q^m$. We define additional inputs $E^m$, one for every state in $Q^m$. The new transition functions can now be evaluated: for every $\delta(s_j, X_h) = s_k$; $s_j \in Q^m$, a transition function $\delta'(d^m, (e_i, X_h)) = s_k$; $s_j \in Q^m$, $e_i \in E^m$ is added.
4. Introduce additional output functions: for every output function $\lambda(s_i, X_h)$; $s_i \in Q^m$, an output function $\lambda'(d^m, (e_i, X_h))$ with the same output value; $e_i \in E^m$ is added. As there are as many entering transitions as there are exiting transitions in the network, we can say that there is a one-to-one mapping between the source transitions, additional transitions and output functions.
5. The first transformation step (replacement of $T^m$ with $G^m$) may result in states in the set $Q^m$ that do not have any incoming transitions. Such states are redundant and can be removed, except in the case where the state is an initial state of the network. An example of source decomposition and a transformed network is given in Figure 6.
6. The resulting network of the previous steps has the same behavior as the source automaton when the initial state is properly defined. Let $s_0$ be the initial state of the source automaton. The initial state of the network has to be defined as follows:
   1. The component containing the state $s_0$ is assigned $s_0$ as its initial state.
   2. Every other component is assigned its own $d$-state as its initial state.
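Steps 1 to 4 and step 6 of the transformation can be sketched on a transition table. This is a minimal illustration under several assumptions that are not part of the original description: rows are `(source, input term, destination, output term)` tuples, transition states are named `g_<target>`, the d-state of component `m` is named `d<m>`, the guard "$e_i$ in conjunction with $X_h$" is written as the string `e_<state>&<input>`, and the output of a g-state row is the e-signal it raises. Step 5 (removal of redundant states) is left out.

```python
def transform(rows, partition, s0):
    """Transform a partitioned source automaton into a network of components.

    rows      -- list of (src, inp, dst, out) tuples of the source automaton
    partition -- list of disjoint sets of state names
    s0        -- initial state of the source automaton
    Returns (tables, initial): one row list and one initial state per component.
    """
    block_of = {s: m for m, block in enumerate(partition) for s in block}
    # Entry states (union of all Q^m): targets of crossing transitions.
    entry = {dst for src, _, dst, _ in rows if block_of[src] != block_of[dst]}
    tables = [[] for _ in partition]
    g_states = [set() for _ in partition]

    def target(m, dst):
        """Step 1: redirect a crossing transition to a transition state."""
        if block_of[dst] == m:
            return dst
        g_states[m].add(dst)
        return f"g_{dst}"

    for src, inp, dst, out in rows:
        m = block_of[src]
        tables[m].append((src, inp, target(m, dst), out))
        if src in entry:
            # Steps 3 and 4: copy the row to the d-state, guarded by e_<src>,
            # keeping the same target and the same output.
            tables[m].append((f"d{m}", f"e_{src}&{inp}", target(m, dst), out))
    for m, targets in enumerate(g_states):
        for dst in sorted(targets):
            # Step 2: unconditional transition to d^m, raising e_<dst>.
            tables[m].append((f"g_{dst}", "1", f"d{m}", f"e_{dst}"))
    # Step 6: the component holding s0 starts there, all others in their d-state.
    initial = [s0 if block_of[s0] == m else f"d{m}" for m in range(len(partition))]
    return tables, initial


rows = [
    ("s0", "0", "s0", "a"),
    ("s0", "1", "s1", "b"),
    ("s1", "-", "s2", "c"),
    ("s2", "-", "s0", "d"),
]
tables, initial = transform(rows, [{"s0", "s1"}, {"s2"}], "s0")
```

In this example, the crossing transition $s_1 \to s_2$ becomes a transition to `g_s2` in component 0, and the transition leaving the entry state $s_2$ is duplicated at `d1` under the guard `e_s2`.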

Functional Equivalence
The initial condition of the network, described in the previous section, guarantees that only one e-signal is active at a time. The equivalence of the source automaton and the transformed network can be proved. In the proof, the notation of transition tables is used. We will use a reordered and segmented source transition table.
Proof
1. Transitions Inside of the Component: There are equivalent rows in the source and transformed transition tables.
2. Exiting Transitions of the Component: For every exiting transition in the source table, there is a matching transition with the same output in the transformed table but with unique target states in the set $G$. Additionally, there are unconditional transitions from the states in $G$ to a single d-state with a unique output signal in $E$.
For every transition from the target state of an exiting transition, there is an additional matching transition in the transformed network from the d-state of the target component.This matching transition has the same target state, output vector and input term in conjunction with the appropriate signal in E.
According to the initial condition, only one component can be in the g-state at a time. Consequently, there can only be one e-signal active at a time. This will uniquely define the transitions to be taken, see Figure 6.
A condition that we call a static $G^m$-state occurs when an automaton enters $G$ in the cycle followed by the entrance to $G^m$. This condition requires special considerations for the implementation. It will be described later in Section 4.4.

Example
Let us decompose the microprogram automaton $A$, given in Figure 7, into a two-component network using state partitioning. According to the tables, there are three crossing transitions of the partition $\pi = \{\{s_1, s_2, s_3\}, \{s_4, s_5\}\}$, which define the number of g-states. From the tables it can be seen that there are as many e-signals in the network as there are crossing transitions. It can also be seen that inter-component communication is formed by the signals $\{e_1, e_2, e_4\}$, where the index is bound to the target state in the source automaton.

IMPLEMENTATION
In this chapter, we will describe how decomposed FSMs are implemented. First we describe where in the design flow the decomposition takes place. Next, an overall picture of the tool that automatically carries out the decomposition is given. After that, the implementation of the FSM transformation steps presented in Chapter 3 is described. Hardware estimation is used to rank the different partition candidates that are generated by the tool. In Section 4.3, the estimation functions and their parameters are given. In order to have a complete synthesis design flow, additional cells must be added to a standard cell library. The implementation and the design details of the cells are given in Section 4.4.

The position of the FSM power optimization procedures that have been implemented in the LIFS tool is depicted in Figure 8. FSM power optimization is one of several synthesis steps in FSM synthesis, which is a part of RTL synthesis. Therefore, it is important that the computational complexity of the power optimization step is kept low in order to keep the total time spent in synthesis low.

An overview of the information flow in LIFS is shown in Figure 9. The FSM description is given as an STG. Currently, we use the KISS2 format from Berkeley [15], but in principle, RT-level VHDL or graphical input could be used, since they all contain the same information. Power consumption in digital CMOS is dominated by the dynamic power consumption, which is highly data-dependent. In order to estimate power consumption, it is necessary to have a set of input data that is, under typical operating conditions, applied to the inputs of the FSM. In LIFS, it is possible to either give these input vectors in a testbench or, when no typical data is available, to specify for each of the inputs the probability of the input being high (logic one). The power optimization is made for a user-given area constraint. The area constraint is given as the maximum acceptable increase in area relative to a monolithic FSM. This allows the designer to trade circuit area for reduced power consumption.

The tool is designed to work in a standard-cell based design flow. In order to make early power and area estimates, data about power and area for three types of cells are needed: a clocked storage element (D flip-flop), a CCB and a gate (2-input NAND gate). The output of the tool is an RT-level VHDL description of the partitioned FSM. This description is normally passed on to a standard logic synthesis tool that produces the gate netlist. Along with the VHDL code, design-specific scripts for logic synthesis are also generated.
The tool is divided into two main parts. The first part collects statistics about the FSM in order to find the state transition probabilities. This part may be omitted from the synthesis run if the transition probabilities are already known from the environment of the FSM or from a previous synthesis run where the STG with probabilities has been generated. The second part is where the actual partitioning takes place.

Statistics Collection
The purpose of the statistics collection methods is to determine the probabilities of state transitions in the STG. The statistics form the basis for the partitioner when clustering states, i.e., when grouping the states whose connections have the highest probability. One of the two implemented methods for collecting statistics is used by the tool.
The first method we call profiling. The second method is based on random walk. Here, the state transition probabilities are based on the input probability vector given by the user. The user specifies, for each of the inputs, the probability of the input being high. Random input vectors are generated from this given input probability vector using a uniform distribution function. The random input vectors are then used to simulate the STG. The simulation is carried out in an STG simulator that is embedded in LIFS. For the design examples that we present later in this paper, the length of the random walk simulation has been set to $n^3$, where $n$ is the total number of arcs in the STG. The simulation for determining the transition probabilities easily becomes very time-consuming for complex FSMs. It is desirable to run this part only once, even if the FSM must be re-synthesized. For that purpose, we have extended the KISS2 format by adding the transition probability for every transition in the FSM specification.
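The random-walk method can be sketched as a small STG simulation. This is not the simulator embedded in LIFS but a minimal illustration under stated assumptions: rows are `(input cube, source, destination)` tuples with KISS2-style cubes over `0`, `1` and `-`, the STG is completely specified, parallel arcs between the same pair of states are counted together, and the reported probability of an arc is the fraction of simulated cycles in which it was taken. The function names are hypothetical.

```python
import random


def matches(cube, vector):
    """True if the input vector is covered by the cube ('-' is a don't care)."""
    return all(c == "-" or c == v for c, v in zip(cube, vector))


def random_walk(rows, s0, p_high, steps, seed=0):
    """Estimate arc probabilities of an STG by random-walk simulation.

    rows   -- list of (input cube, source state, destination state)
    s0     -- state in which the walk starts
    p_high -- per-input probability of the input being high
    steps  -- number of simulated clock cycles
    """
    rng = random.Random(seed)
    counts = {}
    state = s0
    for _ in range(steps):
        # Draw each input bit from a uniform random number and its probability.
        vector = "".join("1" if rng.random() < p else "0" for p in p_high)
        for cube, src, dst in rows:
            if src == state and matches(cube, vector):
                counts[(src, dst)] = counts.get((src, dst), 0) + 1
                state = dst
                break
        else:
            raise ValueError(f"no transition from {state} on input {vector}")
    return {arc: c / steps for arc, c in counts.items()}


rows = [("0", "a", "a"), ("1", "a", "b"), ("-", "b", "a")]
n = len(rows)                                   # number of arcs in the STG
probs = random_walk(rows, "a", p_high=[0.5], steps=n ** 3, seed=1)
```

The walk length `n ** 3` follows the setting used for the design examples; for large STGs this is the reason why it pays off to store the resulting probabilities together with the specification.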

Partitioning
As previously mentioned, the power reduction strategy is to partition the FSM into a number of sub-FSMs. In a partitioned FSM, a state transition can take place inside the sub-FSM or between two different sub-FSMs, which we call a crossing transition. In Chapter 2, we showed that transitions within a sub-FSM dissipate less power than state transitions from one sub-FSM to another. It is also advantageous to have as few sub-FSMs as possible active to reduce the effective capacitance. But at the same time, dividing the FSM into smaller partitions tends to increase the probability of hand-overs occurring.

The partitioning algorithm we use is divided into two phases. In the first phase, a cluster representation of all states in the FSM is built. The states are clustered according to a closeness measure that is based on the size of the mutual state transition probabilities between states. In the second phase, clusters from the first phase are grouped and the FSMs are synthesized. Implementation costs, which are estimates of power and area, are the basis for selecting the final partitioned FSM.
Phase 1: Clustering. The input to the clustering algorithm is an STG with arcs, representing state transitions, labelled with state transition probabilities. We use a hierarchical clustering scheme to build a hierarchical cluster representation of the states in the FSM. Hierarchical clustering is a general technique for clustering similar objects together and it has found applications in many different fields [10]. The algorithm builds a binary tree as illustrated in Figure 10.
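A binary cluster tree of this kind can be built by repeatedly merging the two closest clusters. The sketch below is a generic agglomerative clustering, not necessarily the exact scheme used in the tool. It assumes that the closeness of two clusters is the sum of the transition probabilities in both directions between their states; the function name and the example probabilities are illustrative.

```python
def build_cluster_tree(states, prob):
    """Build a binary cluster tree by agglomerative clustering.

    states -- list of state names
    prob   -- dict mapping (source, destination) to a transition probability
    Returns nested 2-tuples; the leaves are state names.
    """
    if not states:
        raise ValueError("at least one state is required")

    def closeness(a, b):
        # Assumed measure: mutual transition probability between two clusters.
        return sum(prob.get((x, y), 0.0) + prob.get((y, x), 0.0)
                   for x in a for y in b)

    # Each cluster is a pair (member states, subtree).
    clusters = [((s,), s) for s in states]
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda ij: closeness(clusters[ij[0]][0],
                                                   clusters[ij[1]][0]))
        merged = (clusters[i][0] + clusters[j][0],
                  (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]


prob = {("s0", "s1"): 0.4, ("s1", "s0"): 0.3, ("s1", "s2"): 0.1, ("s2", "s0"): 0.2}
tree = build_cluster_tree(["s0", "s1", "s2"], prob)
```

In the example, $s_0$ and $s_1$ exchange control most often and are therefore merged first; $s_2$ joins them at the root of the tree.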
Phase 2: Selection of best partition. From the binary tree built by clustering, it is possible to group the clusters into a large number of combinations that are all candidates for a partitioned FSM. Cutting the cluster tree at a certain level generates the clusters. For example, in Figure 10 the cutting level can be 1, 2, or 3. Cutting at level one, for example, gives a large number of small clusters ($\{s_0\}$, $\{s_1\}$, $\{s_2\}$, $\{s_3\}$, $\{s_4\}$) and cutting at level three gives two clusters ($\{s_0, s_1, s_2, s_3\}$, $\{s_4\}$). For each cut-level it is also possible to perform concatenation of clusters, so that new combinations of clusters can be generated.

Empirically, we found a procedure that for every cut-level generates a reduced number of clusters. The procedure, shown in Figure 11, takes a cluster tree and returns the partitioned FSM with the lowest power consumption for a given area constraint.

FIGURE 11 Partition selection procedure (statements recoverable from the listing):
    TMP <- {c1}, {c2, ..., cN};
    for j = 2 to N-1 {
        PFSM <- synthesize(TMP);
        if Pmin > power(PFSM) and area(PFSM) < amax then
            BEST <- TMP;
        TMP <- {c1}, ..., {cj}, {cj+1, ..., cN};
    }
    return BEST;
where:
CT is the cluster tree for the FSM.
Pmin stores the minimum power.
amax is a user constraint (max. area).
BEST stores the best partition.
TMP is the partition candidate.
PFSM is the partitioned FSM.
C is the set of clusters at a given cut-level.
N is the number of clusters in C.
H is the height of the cluster tree.
sort sorts the clusters by activity; the cluster with the highest internal activity receives the lowest index.
power and area are the HW estimation functions.
synthesize is the FSM synthesis function.
cutlevel returns the clusters in the cluster tree for a given cut-level.

In order to estimate power and area for the partitioned FSM, more details must be known about the implementation. The partitioned FSM, i.e., all sub-FSMs and CCBs, is synthesized. After that, the estimation functions for circuit area and power consumption are applied. The functions for FSM synthesis and hardware estimation are described in more detail below.
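The selection of the best partition can be sketched as a search over a reduced set of candidates per cut-level. This is a hedged reading of the procedure, not a transcription of it: the clusters of each cut-level are assumed to be already sorted by activity, the first `j` clusters are kept separate while the remaining ones are concatenated, and `synthesize`, `power` and `area` are caller-supplied stand-ins for the synthesis and estimation functions. All names and the toy estimators in the usage lines are illustrative assumptions.

```python
import math


def select_partition(cut_levels, synthesize, power, area, a_max):
    """Return the candidate partition with the lowest estimated power.

    cut_levels -- list of cluster lists, one per cut-level, each sorted so
                  that the most active cluster comes first
    synthesize -- function mapping a candidate partition to a partitioned FSM
    power/area -- estimation functions applied to the partitioned FSM
    a_max      -- area constraint; candidates must stay strictly below it
    """
    p_min, best = math.inf, None
    for clusters in cut_levels:
        for j in range(1, len(clusters)):
            # c_1..c_j stay separate; the remaining clusters are concatenated.
            merged = [s for c in clusters[j:] for s in c]
            candidate = [list(c) for c in clusters[:j]] + [merged]
            pfsm = synthesize(candidate)
            p = power(pfsm)
            if p < p_min and area(pfsm) < a_max:
                p_min, best = p, candidate
    return best


# Toy estimators: power grows with the largest sub-FSM, area with their count.
best = select_partition(
    cut_levels=[[["s0"], ["s1"], ["s2"], ["s3"]]],
    synthesize=lambda partition: partition,
    power=lambda pfsm: max(len(block) for block in pfsm),
    area=lambda pfsm: len(pfsm),
    a_max=4,
)
```

With these toy estimators, the fully separated candidate has the lowest power but violates the area constraint, so the search settles on three sub-FSMs. A cut-level with a single cluster yields no candidate in this sketch, which means the monolithic FSM is not considered.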

Sub-FSM Transformation
The FSM synthesis takes the clusters of states, given by the partitioning, and generates one sub-FSM for each of these clusters. The synthesis is made according to the transformation steps presented in Chapter 3. For the implementation, the transformation is divided into five steps.

Sub-FSM Communication
The crossing transitions in the STG (see Fig. 13b) are implemented by the CCBs, clock-gating, and structural composition of the sub-FSMs and CCBs in the partitioned FSM. Let us consider sub-FSM $A^m$ and its associated CCB, $CCB^m$. The function of the CCB is to control the gated clock. The activation of sub-FSM $A^m$ is made by incoming crossing transitions. These transitions are detected by decoding the incoming transition states, denoted $G^{m-}$. Deactivation of $A^m$ occurs at the same time as another sub-FSM is activated (only one sub-FSM is active at a time). The outgoing crossing transitions from $A^m$ are detected by decoding the transition states $G^m$. Signals decoded from $G$ we call g-signals and signals decoded from $G^m$ we call d-signals, see Figure 12. The detection of an activating crossing transition can only be made based on the transitions of the g-signals.
The CCB behavior and implementation are shown in Figure 14.
The CCB is an AFSM (asynchronous FSM) that holds the one-bit state variable $e$, reflecting the state of one crossing transition. The collection of all these state variables gives a global state vector $E$. This state vector is one-hot encoded, i.e., only one bit is set high at a time. A high value indicates the last active crossing transition. $E$ is decoded and used as input signals to the sub-FSMs.

Transition-state Insertion
The implementation architecture we use with asynchronous CCBs, see Figure 1, requires hazard-free g-signals. The g-signals must therefore be decoded from the state variable only. For example, the crossing transitions in Figure 13 are conditioned by the inputs of $X$. At these locations, where the crossing transition is conditioned by an input signal, the Mealy state transition is transformed to a Moore state transition. In the example given in Figure 13a, the initial machine consists of the set of states $S = \{s_0, s_1, s_2, s_3, s_4\}$ and the partitioned machine is $\pi = \{S^1, S^2\}$, where $S^1 = \{s_0, s_1\}$ and $S^2 = \{s_2, s_3, s_4\}$. For every crossing transition we insert a transition state, which gives $G(S^1) = \{g_2, g_3\}$ and $G(S^2) = \{g_0\}$. At this stage, see Figure 13b, the two sub-FSMs are still coupled by the crossing transitions, indicated by the go-signals in the STG. In the actual implementation, these transitions are handled by the hand-over mechanism that involves the CCBs and clock-gating.
A g-state is not added if the source state has no outgoing transition other than the crossing transition (unconditional transition).

Local Transition Insertion
Here, the coupled STGs will be separated. The purpose of having a separate STG for each sub-FSM is that standard synthesis procedures can be used on the STG to get to gate-level implementations. At the occurrence of a crossing transition, there is a state transition to a transition state in the active machine. As a consequence, the sub-FSM containing the destination state is activated. We know that this machine is in one of its transition states. Therefore, all transition states in combination with a global state $E$ act as one of many possible entry states. In the example in Figure 13, the global state vector is $E = \{E^1, E^2\}$, where $E^1 = \{e_0\}$ and $E^2 = \{e_2, e_3\}$.

Removal of Unreachable States
From Figure 13c, it can be seen that some states do not have incoming transitions. These redundant states are $R^1 = \{s_0\}$ and $R^2 = \{s_3\}$, and their function is now, after the two previous steps, located in the transition states.

Setting of Initial States
Each of the sub-FSMs must have an initial state. The initial state, given by the specification of the original machine, will be the reset-state of the sub-FSM in which it is located. For all the other sub-FSMs, an arbitrary transition state can be selected as the initial state, see Figure 13d.
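The removal of redundant states can be sketched as a reachability analysis that starts from the initial state chosen for the sub-FSM. This is a slightly more general check than looking only for states without incoming transitions, and it is an illustrative substitute rather than the procedure used in the tool. Rows are assumed to be `(source, input term, destination, output term)` tuples; the function name and the example rows are hypothetical.

```python
def remove_unreachable(rows, initial):
    """Drop all rows whose source state cannot be reached from `initial`.

    rows    -- list of (src, inp, dst, out) tuples of one sub-FSM
    initial -- initial (reset) state selected for this sub-FSM
    """
    successors = {}
    for src, _, dst, _ in rows:
        successors.setdefault(src, set()).add(dst)
    reachable, stack = {initial}, [initial]
    while stack:
        for nxt in successors.get(stack.pop(), ()):
            if nxt not in reachable:
                reachable.add(nxt)
                stack.append(nxt)
    return [row for row in rows if row[0] in reachable]


# After the two previous steps, s0 is entered only through the state d,
# so the original s0 rows are no longer needed in this sub-FSM.
rows = [
    ("d", "e0", "s1", "x"),
    ("s0", "-", "s1", "y"),
    ("s1", "-", "g", "z"),
    ("g", "1", "d", "e"),
]
kept = remove_unreachable(rows, initial="d")
```

If the original initial state of the machine is used as `initial`, it is reachable by definition and is therefore kept, which matches the exception made for initial states.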

Hardware Estimation
The objective of hardware estimation is to enable ranking of the different partition candidates so that the best partition can be selected. The ranking is based on the implementation costs in terms of power consumption and circuit area. A small number of estimation functions are used, see Table I. The parameters in these functions are based on the technology that is used, data from statistics collection, or they are empirically determined. The parameters and their values are listed in Table II.
The empirically determined parameters are related to details that are not known on the current level of abstraction. For example, the size of the output logic is not known before a gate-level implementation, and the probability for a transition in the state-register is not known before state assignment.

Library Elements
The goal has been to use a standard cell-based design methodology. The output from the tool is a structural description of the partitioned FSM consisting of sub-FSMs and CCBs. The sub-FSM description is an RT-level description that can be fed to any commercial RTL synthesis tool. The CCBs are asynchronous FSMs and standard tools do not in general support synthesis of these circuits. In our approach, we design a 1-bit CCB as a library element on the gate level. We base this CCB on gates from the standard cell library, but for improved performance the CCB can be designed on the transistor level and be included as a cell in the cell library. Various multiple-input CCBs are built based on the 1-bit version. The 1-bit CCB is a controller with two bits in the state variable and can be synthesized under the fundamental mode assumption [16] and with the single input change (SIC) assumption. The transition map and the gate-level solution are given in Figure 14.

TABLE I Estimation functions for energy and area (entries include the CCB energy, the output logic area, and the partitioned FSM area).

In general, a sub-FSM can be activated by one of many sub-FSMs and deactivated by one of many crossing transitions leaving the sub-FSM. For this, multiple-input CCBs are needed. These are generated from the 1-bit CCB. In this way, we can avoid synthesis of complex asynchronous controllers. The extension to multiple-input CCBs is shown in Figure 15. For the 1-bit CCB, the value of e can be used directly for gating the clock. For the multiple-input CCB, an additional output e_ck must be generated and used for gating the clock. As described in Chapter 3, there are situations where we may have static Gin-states from a sub-FSM. This condition will prevent the transition on the g-signal, which is needed for triggering the CCB. In this work, we use a transistor-level solution for handling this situation.
A special D flip-flop, called GDFF, has been designed and included in the cell library to be used in the sub-FSM state register.The g-input of the CCB is positive edge-triggered and to avoid the situation of static Gin-states, it must be guaranteed that all g-signals return to zero before assertion.
With the GDFF, we can guarantee that the g-signals, which are decoded from the state register only, are zero during the first half of the clock period. The GDFF has an additional output (GQ) that is the state of the flip-flop gated with the clock signal. The normal Q-output is used for the transition function and the GQ-output is used only for decoding the g-signals. Due to uncertainties in loading conditions for the different nets between the cells after layout, a gate-level implementation of the function for GDFF may give hazardous results that cannot be accepted for signals to an edge-triggered input (g-input of the CCB). In Figure 16 we propose a transistor-level solution of the GDFF. In order to attain a glitch-free output, it is important that the flip-flop structure has a small clock-to-output delay. Suitable flip-flops are based on RAM cell structures or CVSL style [17]. These flip-flops can be implemented directly and no special delay matching is required.
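The effect of the gated output can be illustrated with a small behavioural model in Python. This is only a sketch at half-period resolution, not a circuit model: it assumes that Q takes the value of D at the start of each period and that GQ is forced to zero during the first half of the period and equals Q during the second half. Clock polarity, delays, and hazards are abstracted away, and the function names are hypothetical.

```python
def gdff_trace(d_values):
    """Return per-half-period samples of (Q, GQ) for a sequence of D values.

    Assumption: Q takes the value of D at the start of each period, and GQ is
    zero in the first half of the period and equal to Q in the second half.
    """
    trace = []
    for d in d_values:
        q = d
        for second_half in (False, True):
            gq = q if second_half else 0
            trace.append((q, gq))
    return trace


def rising_edges(samples):
    """Count 0 -> 1 transitions, starting from an initial value of 0."""
    count, previous = 0, 0
    for value in samples:
        if value and not previous:
            count += 1
        previous = value
    return count


# A state bit that stays at 1 for three periods, drops, and rises again.
trace = gdff_trace([1, 1, 1, 0, 1])
q_edges = rising_edges(q for q, _ in trace)
gq_edges = rising_edges(gq for _, gq in trace)
print(q_edges, gq_edges)
```

A g-signal decoded from Q would show no new edge while the state is static, whereas one decoded from GQ returns to zero in every period and therefore presents a fresh positive edge to the CCB in each period in which the state bit is set.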

EXPERIMENTAL RESULTS
Our tool LIFS, which consists of a partitioning algorithm and a set of transformation rules, has been implemented as a software prototype in the Java language. LIFS, together with any standard RTL synthesis tool, forms a complete synthesis path from the STG description of an FSM to its gate-level implementation, where the implementation architecture is our proposed mixed synchronous/asynchronous architecture. In order to demonstrate the effectiveness of the proposed architecture, eight of the MCNC standard benchmarks [18] were tested. The number of states in the benchmarks ranges from 10 to 121. When estimating energy consumption, the input data pattern to the circuit that is to be characterized is important. For FSMs, the sequence of the input vectors will determine the state transition probabilities and consequently determine how the FSM is partitioned. Unfortunately, typical input data is not specified by the MCNC benchmarks, which makes it difficult to compare the results with other reported works. In this work, we have set the input probability vector, used by the STG simulator in LIFS, to 0.5 for all inputs in all FSMs.

FIGURE 16 Transistor-level implementation of GDFF in static CVSL style. Characteristic equations: Q+ = D, GQ+ = D ∧ CK.

This chapter reports the experimental results from LIFS. First we illustrate the partitioning considerations by an example. We then describe the results of the structural decomposition. After that, the energy consumption and circuit area are reported separately for the sub-FSMs, output logic, and CCBs. Also, the total energy and area for the partitioned FSM are compared to its corresponding monolithic FSM implementation. Finally, we compare the timing by looking at the critical paths in the different implementations.
The energy figures were obtained from gate-level power estimations by Power Compiler (Synopsys) and the area estimates are based on the cell area. The timing information was obtained from static timing analysis in Design Compiler (Synopsys). The target technology is a 0.5 µm CMOS standard cell technology. A wire-load model, supplied by the silicon vendor for this specific library [2], has been used.
Table III shows the main characteristics of the benchmark FSMs. From the top row down, it lists the number of inputs, the number of outputs, and the number of states.
The first phase of the partitioning (clustering) concentrates solely on state transition probabilities when grouping the strongly connected states. From the small FSM example given in Figure 17a, we can see that states s0 and s1 have self-loops with high probabilities and that they are also, in relation to other states, strongly connected. The actual solution given by LIFS for this FSM was s0 and s1 located in one sub-FSM and the rest of the states located in a second sub-FSM. But only looking at state transition probabilities says very little about the implementation costs. The number of g- and entry-states of a sub-FSM plays an important role in implementation costs. In our example, there are two entry-states, {s0, s1}, and two g-states, {g2, g4}, in A1. Sub-FSM A2 also has two entry-states and two g-states. An increase in the number of g-states, |G|, will increase the size of the transition function and may also require larger state memory in the sub-FSM. For each entry-state, an internal enable signal, defined by E, is required. The number of enable signals, |E|, will influence the fan-in of the logic for both the transition function and the output function, see Figure 17b. In summary, a good partition has sub-FSMs with high probabilities of state transitions within the sub-FSM and a small number of entry- and g-states.
Table IV presents structural information that influences the implementation costs of the partitioned FSMs. Here, T denotes the duty probability of the sub-FSM and |S| denotes the number of original states located in the sub-FSM.
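To make the link between the input probability of 0.5 and the duty probability T concrete, the following Python sketch derives state probabilities for a small machine and sums them per sub-FSM. It is not the STG simulator of LIFS: it swaps simulation for an analytic Markov-chain computation by power iteration, which assumes an irreducible, aperiodic chain and independent inputs. The three-state machine, the cube notation for input conditions, and all function names are hypothetical.

```python
def transition_matrix(states, transitions, p_one=0.5):
    """Build state transition probabilities from input cubes.

    transitions: (src, cube, dst), where cube is a string over "0", "1", "-"
                 with one character per input ("-" is a don't-care).
    p_one:       probability that an input is 1 (0.5 for all inputs here).
    """
    index = {s: i for i, s in enumerate(states)}
    matrix = [[0.0] * len(states) for _ in states]
    for src, cube, dst in transitions:
        prob = 1.0
        for bit in cube:
            if bit == "1":
                prob *= p_one
            elif bit == "0":
                prob *= 1.0 - p_one
        matrix[index[src]][index[dst]] += prob
    return matrix


def steady_state(matrix, iterations=1000):
    """Power iteration; assumes an irreducible, aperiodic chain."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


def duty_probabilities(states, pi, block_of):
    """Sum the state probabilities of each sub-FSM."""
    duty = {}
    for state, p in zip(states, pi):
        duty[block_of[state]] = duty.get(block_of[state], 0.0) + p
    return duty


# Hypothetical three-state machine with a single input.
STATES = ["s0", "s1", "s2"]
EDGES = [("s0", "0", "s0"), ("s0", "1", "s1"),
         ("s1", "0", "s1"), ("s1", "1", "s2"),
         ("s2", "-", "s0")]
pi = steady_state(transition_matrix(STATES, EDGES))
duty = duty_probabilities(STATES, pi, {"s0": 1, "s1": 1, "s2": 2})
print(duty)
```

For this toy machine the state probabilities are 0.4, 0.4, and 0.2, so a partition that groups s0 and s1 has a duty probability of 0.8 for that sub-FSM and 0.2 for the other.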
In Table V, the energy consumption and the circuit area for the power-optimized FSMs are presented. The column labelled sub-FSM gives the sum of energy and area for all sub-FSMs. The column labelled output gives the energy and area for the output function, and the column labelled CCB gives the sum of energy and area for all CCBs in the partitioned FSM. The column labelled total A* gives the sum of energy and area, respectively, for the three previously mentioned columns. The next three columns, labelled FSM, output, and total A, contain energy and area for a monolithic implementation. The last column, labelled change, shows the decrease or increase in energy and area for the partitioned FSM in comparison to the corresponding monolithic implementation.

For all the benchmarks we have tested, significant power reductions have been obtained. However, there is a large difference in achieved improvement for the different machines. For example, the bbara FSM seems to have good potential for large power reduction. However, since it is small, the power overhead in sub-FSM communication becomes relatively large. For small FSMs, the area increases when partitioned. When using minimum-length encoding of the states, the partitioned FSM will always require more state-register bits, $\lceil \log_2 |S| \rceil < \sum_{m} \lceil \log_2 |S_m| \rceil$, where the sum runs over the sub-FSMs m = 1, 2, and so on. In the case of ex1, we have a large sub-FSM (A4) that is active most of the time. Further decomposition would have increased the area dramatically but with only small power savings as a result. Cases where large power reductions have been obtained are tma and scf. Here, small clusters with high duty probabilities have been identified. For tma, one sub-FSM with five states is active 96% of the time while the other sub-FSM containing 15 states is only active 4% of the time.
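The state-memory overhead can be checked with a short calculation for the tma split quoted above. The Python sketch below counts only the original states (5 + 15 = 20); the transition states added to each sub-FSM are not included, so the partitioned figure is a lower bound under that assumption. The helper name `state_bits` is hypothetical.

```python
def state_bits(n_states):
    """Bits needed for a minimum-length encoding of n_states states."""
    return max(1, (n_states - 1).bit_length())


# tma: one sub-FSM with 5 original states and one with 15.
monolithic_bits = state_bits(5 + 15)
partitioned_bits = sum(state_bits(n) for n in (5, 15))
print(monolithic_bits, partitioned_bits)
```

The monolithic machine needs 5 state bits, while the two sub-FSMs together need at least 3 + 4 = 7.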
Both tma and scf are large enough so that partitioning them does not introduce excessively large area overhead.

Finally, we present the timing results in Table VI. The column labelled sub-FSM gives the critical path in the sub-FSM and the following column labelled output gives the critical path in the output function for the partitioned FSM. The next two columns, labelled FSM and output, contain the critical paths for the monolithic implementation. The last two columns, labelled change, show the decrease or increase in the critical path for the partitioned FSM in comparison to its corresponding monolithic implementation.
In general, one could expect a slight increase of the delay in the output logic for the partitioned FSM. Here, the fan-in will increase with |E| signals. In the case of ex1, where we have four sub-FSMs and |E| = 8, the increase of the complexity of the logic has significant influence on the delay in the output logic. For most of the benchmarks, we can observe only small changes in the delay of the critical path of the sub-FSMs.
Alternative ways of organizing the logic for the output function will be further investigated.
Asynchronous controllers dissipate less power than synchronous controllers when idle.
A more power efficient hand-over protocol for communication between the sub-FSMs can be employed.
The effectiveness of the proposed implementation architecture, together with the automated synthesis procedure implemented in a software tool, has been demonstrated on the MCNC FSM benchmarks. Power reductions of up to 68% have been achieved at the cost of an increase in area of 20%. We have found these results encouraging and we also see that further improvements are possible. More attention should be given to improving the partitioning algorithm. For some of the benchmarks, we can observe large increases in both area and power for the output logic.
Figure a and consists of: (1) a number of sub-FSMs, (2) an equally large number of CCBs (Clock Control Blocks), and (3) AND gates for gating the clock signal. Alternative implementation architectures are a fully synchronous partitioned FSM, shown in Figure b, and a monolithic implementation, shown in Figure c. The important difference between the two architectures for

FIGURE 3 Energy consumption in synchronous and asynchronous CCB in different modes.

FIGURE 4 The transition sets of a state (left) and component automata (right).

FIGURE 6 Example of source decomposition (top) and a transformed network (bottom).

FIGURE 8 The LIFS tool positioned in a synthesis-based digital design flow.

FIGURE 9 Overview of the LIFS tool.

Profiling uses user-supplied input vectors for simulating the FSM. The tool first generates VHDL code that corresponds to the initial STG specification. It also inserts profiling information during the generation of the VHDL code. The state transitions are traced during simulation and the collected statistics are written back to file. The simulation is made in a standard commercial VHDL simulator.

FIGURE 11 Procedure for selection of best partition.

FIGURE 12 Signal interface of the CCB.

FIGURE 17 Example of partitioned FSM (bbara): (a) STG with transition probabilities, and (b) structure of the implementation.

TABLE II Parameters in estimation functions. Units are pJ and gate equivalents for energy dissipation and area, respectively.

TABLE V Energy consumption [pJ] and circuit area [#gate eq] for partitioned FSMs and monolithic FSM.