System-level Time-stationary Control Synthesis for Pipelined Data Paths *

We address the problem of time-stationary control synthesis for pipelined data paths. The control synthesis system accepts as its input specification a scheduled control/data flow graph with conditional branches, as produced by high-level synthesis tools such as Sehwa [1], and generates an FSM controller. First, the scheduled control/data flow graph is analyzed and the various states are identified. Overlapped states are grouped together to produce L groups, where L is the pipeline latency. Next, state transitions are identified and a state table is generated. Finally, a highly optimized FSM controller is implemented by performing horizontal partitioning and the corresponding state encoding so as to minimize the total controller area. We compared our approach to published work on FSM generation and optimization, and the results indicate that our method yields large savings in total controller area.


INTRODUCTION
Pipelining is a widely used approach for designing high performance digital circuits. As design size increases, pipelined architectures become quite complex, and thus automatic pipeline design synthesis tools are necessary to cope with such complexity. Pipelined data path synthesis problems have been well investigated in [1,2,5]. Sehwa [1] performs allocation of functional modules and scheduling of resources, and estimates the cost of registers and interconnections. In [5], a method for module assignment and Register-Transfer level synthesis of pipelined data paths was presented. Once an RT-level data path is obtained, the corresponding controller can be synthesized.
A time-stationary control mechanism [6] provides the control signals for the entire pipeline from a single source external to the pipeline. The main characteristic is that at each unit of time these controls govern the entire state of the machine. The design of this type of controller is a complex task since the controller must also remember the current pipe state in order to provide control signals to the pipe stages occupied by multiple overlapping tasks.
*This work was supported by NSF Grant #MIP-8909677 and by a TRW fellowship. Based on "Automatic synthesis of time-stationary controllers for pipelined data paths", by J. T. Kim, F. J. Kurdahi
*Corresponding author. Tel.: (714) 856-8104, Fax: (714) 856-4152, e-mail: kurdahi@balboa.eng.uci.edu
The controller is modeled as a Moore style Finite State Machine (FSM) in which the state memory holds the present state, and the combinational parts decide the next state and the primary output functions. The combinational circuits can be implemented using PLAs or random logic. The structure of a time-stationary controller is shown in Figure 1. The controller is vertically partitioned into a Sequencing part and a Command part. The Sequencing circuit solely implements the next state function, whereas the Command logic generates the output function. The Sequencing logic is partitioned horizontally into two parts since it was observed by Paulin [7] that such a partitioning minimizes the total controller area.
We describe a method to generate control specifications of pipelined data paths which makes use of the node labelling and mutual exclusion testing techniques described in [1]. The synthesis of the sequencing part follows the Behavioral Synthesis tasks of scheduling and resource allocation. Once the RT-level data path synthesis is done (i.e., the tasks of module assignment, and multiplexor and register allocation), the control requirements of each RT-level component are known and thus the command logic can be synthesized. Figure 2 describes the flow of tasks in the overall high-level synthesis of data path and control.
Section 2 presents the control specification of pipelined data paths. The partitioning and state assignment algorithms are described in Section 3. Section 4 shows some experimental results. Conclusions are drawn in Section 5.

RELATED WORK
The control synthesis task can be divided into the generation of the control specification and FSM synthesis. These two tasks can be referred to as control synthesis at the register transfer level and at the logic level, respectively.
Most of the previous work on control synthesis at the register transfer level was for nonpipelined systems. Automatic synthesis of microprogrammable control hardware was addressed in [10] using two optimization algorithms, the autonomy algorithm and the attraction algorithm. It is constrained by the capacity of the microprogram storage, the speed requirement, and the number of control signals that can be activated at the same time. The optimization techniques can be used to reduce the number of branches, to shorten conditional branching time, and to reduce the number of micro cycles, which increases the performance of the micro engine. The HAL system tries to use this approach for minimizing the control path cost. CONSPEC [8] dealt with the automatic production of control specifications from high-level behavioral descriptions in control and timing graph form and is designed for interface processors.
Bridge [9] is a high level synthesis system developed at AT&T Bell Laboratory which performs data path and control path allocation for nonpipelined systems by applying either a local slicing or a global slicing technique. The Bridge system starts with a micro-architecture model, and sometimes the implementation can be reduced to a finite state machine or a combinational circuit. Structural synthesis [12] in the Yorktown Silicon Compiler integrates data path and control synthesis for nonpipelined designs.
Several control synthesis works at the logic level are reported in [3,11]. A state table or an equivalent state transition diagram is used as the input specification, and the tasks involved are state assignment and logic minimization with or without topological optimization. The survey on PLA-based FSM optimization is presented in Section 4.2.
... Embedding, and Horizontal Partitioning. In practice, many of the LTO algorithms developed are combinations of the above listed methods and are usually associated with a state encoding scheme. Since PTOs do not modify the logic, they can be applied after processing a PLA with the LTO. We are focusing on LTO in this work.
A crucial step in preparing for the minimization is the task of state assignment. The codes of the states are assigned in such a way that Boolean minimization results. De Micheli proposed a technique for state assignment of Finite State Machines based on symbolic minimization of the FSM combinational component and on a related constrained encoding problem [16]. The algorithm derives a set of coding constraints and encodes the states so that the associated state transitions can be implemented by one common product term. Its distinctive feature is that logic minimization of the symbolic description is applied before state encoding. His solution to the state assignment problem was to find the assignment of minimum code length among the assignments that minimize the number of rows of the PLA. Amann presented state assignment algorithms that permit the synthesis of optimal counter-based PLA FSMs [18]. The algorithms are divided into two steps: state chain calculation and state chain ... Some of these works can be found in [4,17,3].
Vertical partitioning is a classical PLA optimization technique which separates the set of output functions into two or more PLAs while minimizing the number of redundant product terms in all the PLAs [21]. A common technique is the separation of state outputs from command outputs. An initial PLA personality matrix is partitioned to yield the sequence PLA and the command PLA, which generate the next states and the primary output functions, respectively.
Horizontal Partitioning was proposed by Paulin [7] and combines the advantages of traditional vertical partitioning and counter embedding. It allows the reduction of the number of input and/or output columns in the PLAs resulting from the partition. The technique also reduces the total number of product terms, as in counter embedding techniques. The concept of horizontal partitioning used in Paulin's algorithm is a generalization of Amann's work. The significant difference in procedures is how horizontal partitioning is performed. In Paulin's algorithm, the final partitioning of the sequencing PLA into two PLAs is done by considering some Boolean relations between various product terms in the sequence PLA. Paulin's algorithm was implemented with minor modifications and enhancements in [26]. In [7], Paulin proposed a horizontal partitioning scheme of PLA-based FSMs. The objectives for the horizontal partitioning are to: (1) find a partition that satisfies all Boolean relations, and (2) find a partition that holds product terms (PTs) which depend on common outputs and/or common inputs.

CONTROL SPECIFICATION
In this paper, we present a method to synthesize a Moore-style FSM controller specification given a pipelined data path with conditional branches. The input is a scheduled data flow graph which shows operator-to-time-step assignments and dependencies between operations. The output is an FSM specification in the form of a state table. A data flow graph (DFG) is a directed graph representing the functionality of a digital system or a computer program. In a DFG, a node represents an operation on values and a directed edge represents the flow of values between its source and sink nodes. There are many constructs which can represent conditional execution paths in DFGs. In this paper, we use OR-FORK and OR-JOIN (also referred to as a distribute-join (D-J) node pair [1]) to represent conditional execution paths in DFGs. Whenever an execution path is to be selected by some condition, a distribute node must be used to split the values to every possible execution path. Conditional branches can be nested as many levels deep as needed. When the execution path is no longer dependent on the branching condition, a join node is used to indicate the termination of conditional execution. A DFG which is augmented with these D-J constructs is referred to as a Control/Data Flow Graph (CDFG) since it now includes additional control information. At this point, in order to avoid any ambiguity, it is important to define the term latency, which will be used extensively throughout this paper.

DEFINITION
The number of time units between two consecutive initiations in a pipeline is called the latency, L, of the pipeline.
Loops can be modeled in several different ways, e.g. [13,14]. Loops with a small number of iterations can be handled by unrolling as in [1]. Loops can also be treated as conditional blocks, as described in [15], for example. In this case, the inputs to a loop are selected between the inputs from outside the loop and those from the previous loop iteration. The selection conditions can come from the loop counter (in a FOR-loop) or the corresponding conditions from the data path (in a WHILE-loop). Hence, loops can be scheduled without unrolling them at the expense of lower performance and throughput. In this paper, we assume that the system CDFGs to be synthesized are loop free, either as initially specified or through the application of loop elimination techniques as described above. Extensions of this work towards more general CDFGs are currently underway.
Although the technique in this paper can easily be extended to multi-way D-Js, we will assume that the D-J pairs are two-way splits.
The control specification procedure consists of three major steps: preprocessing, state decision, and state transition. In the remainder of this section, we describe these steps and present our approach to solving each one.

Preprocessing
Some pipeline scheduling schemes (such as Sehwa [1]) assume that the conditional nodes, D and J, have a negligible delay compared to other operations, and thus do not assign them to any specific time steps. In preprocessing we rearrange the D and J nodes and insert NOP nodes in the CDFG. NOP nodes are inserted along execution paths in order to simplify the control synthesis tasks, since now only nodes need to be considered (as opposed to nodes and edges). The procedure checks the node distance between two adjacent nodes and, if the distance k is greater than 1, it inserts NOP nodes into the k - 1 stages between the two nodes. The condition for each D-J pair is kept in a one-bit memory and is assumed to be available when the pipeline is initiated or at some earlier time step.
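The NOP insertion step can be illustrated with a minimal Python sketch. It assumes a hypothetical representation in which the CDFG is a list of (source, destination) edges and `stage` maps each node to its time step; the function name `insert_nops` and the `NOPn` labels are illustrative and do not come from the paper.

```python
def insert_nops(edges, stage):
    """Place a NOP node in every stage that an edge skips over.

    edges: list of (src, dst) node pairs (hypothetical CDFG encoding).
    stage: dict mapping node name -> scheduled time step.
    Returns the rewritten edge list and the extended stage map.
    """
    new_edges = []
    new_stage = dict(stage)
    counter = 0
    for src, dst in edges:
        prev = src
        # Stages strictly between the two endpoints hold no node yet.
        for s in range(stage[src] + 1, stage[dst]):
            counter += 1
            nop = f"NOP{counter}"
            new_stage[nop] = s
            new_edges.append((prev, nop))
            prev = nop
        new_edges.append((prev, dst))
    return new_edges, new_stage


edges, stages = insert_nops([("D", "A"), ("A", "J")], {"D": 1, "A": 3, "J": 4})
```

After this pass every edge connects nodes in adjacent stages, so the later steps only need to inspect nodes.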
The number of states is dependent on the scheduling of the D and J nodes. A subgraph of a CDFG is shown as an example in Figure 3. In Figure 3(a), the conditional branch occurs in time step 1, nodes A and B are assigned to stages 3 and 4, respectively, and the J node is assigned to stage 5. There are no operation nodes in stages 2 and 4, but there are two edges in those stages. Edges with no operation nodes in a D-J block increase the number of states in the FSM. Our method to avoid this problem is to keep the D and J nodes as close as possible, i.e., to move a D node to stage i, where both or either one of the next nodes are located in stage i + 1, and a J node to stage j, where both or either one of the previous nodes are scheduled. This always satisfies the availability of the condition for the conditional blocks: since a D node is moved to a later time step, the corresponding condition is still available, and the same can be said about the J nodes. Figure 3(b) shows such an arrangement, where the D node is moved to time step 2 and the J node to time step 3. This reduces the number of edges in time steps 2 and 4, and therefore reduces the number of states. Suppose that the latency is 2; then there are eight states in Figure 3(a), since NOP nodes and time steps are overlapped. They can be reduced to four states after rearrangement of the distribute-join block as in Figure 3(b).
In this paper, we will use the terms stage, time step, and control step interchangeably.

State Decisions
It is assumed that the CDFG schedule is pipelined with a latency of L and that the total number of stages (or time steps) is $n_t$. Conditional branches are handled by using an algorithm described in [1].
The algorithm assigns to every node a label consisting of a sequence of one or more integer codes. Using these labels, we can test for mutual exclusion between any pair of nodes (operations) in pseudo-constant time. Before going any further, we define the following terms, which assume a CDFG schedule with a latency L.
DEFINITION 3 Consider two events in a data flow graph which occur conditionally. If the condition that selects one event always falsifies the condition selecting the other, and vice versa, then the two events are called mutually exclusive with respect to each other.
DEFINITION 4 A set M of nodes is said to be a mutual exclusion set (MES) if all the nodes in M are pairwise mutually exclusive and M is not included in any larger MES M'. For a given time step i, we will denote by $M_{i,1}, M_{i,2}, \ldots$ the MESs which cover the nodes scheduled in i.
Based on Definition 4, MESs are the maximal groups of mutually exclusive operations within a given time step. Next, we find sets of operations which can be executed concurrently in each time step by picking one operation from each MES and combining them.
DEFINITION 5 Let $M_{i,1}, M_{i,2}, \ldots, M_{i,n}$ denote the MESs covering time step i. A Possible Execution Mode (PEM) $P_i$ is defined as a set of n operations, one from each MES: $P_i = \{o_1, \ldots, o_n \mid o_h \in M_{i,h},\ h = 1, \ldots, n\}$. We will denote by $P_{i,1}, P_{i,2}, \ldots$ the different PEMs in time step i.
Thus, each PEM $P_{i,j}$ represents a subset of nodes that can be executed in parallel during time step i. Without loss of generality, we can assume that $1 \le i \le L$. Since the schedule is pipelined, time steps $i, i+L, i+2L, \ldots$ are overlapping, and therefore a state can now be defined as follows.
DEFINITION 6 Given $1 \le i \le L$, a state $S_i$ is defined as a set of PEMs corresponding to the overlapping time steps $i, i+L, i+2L, \ldots$: $S_i = \{P_{k,l} \mid \forall P_{k,l}, P_{m,n} \in S_i,\ k \bmod L = m \bmod L = i\}$, and $S_i$ is not included in any larger state $S'_i$.
We will denote by $S_{i,1}, S_{i,2}, \ldots, S_{i,n_i}$ all the states that can be generated by different combinations of PEMs in time step i and the time steps that overlap with it, where $n_i$ is the number of such different combinations. Since $1 \le i \le L$, we can define groups of states $G_1, G_2, \ldots, G_i, \ldots, G_L$ such that $G_i = \{S_{i,j},\ 1 \le j \le n_i\}$.
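Definitions 5 and 6 can be made concrete with a short Python sketch. It assumes, purely for illustration, that the MESs of each time step are already known and disjoint, and that PEMs of overlapping time steps combine freely because they belong to different pipeline initiations; the names `pems`, `states` and the dictionary layout are hypothetical.

```python
from itertools import product


def pems(mes_list):
    """All PEMs of one time step: pick one operation from each MES."""
    return [frozenset(choice) for choice in product(*mes_list)]


def states(mes_by_step, latency, i):
    """States of group G_i: one PEM from each of the steps i, i+L, i+2L, ..."""
    steps = [t for t in sorted(mes_by_step)
             if t >= i and (t - i) % latency == 0]
    per_step = [pems(mes_by_step[t]) for t in steps]
    return [tuple(combo) for combo in product(*per_step)]


# Hypothetical 4-step schedule: A and B are mutually exclusive in step 2.
schedule = {1: [["D"]], 2: [["A", "B"]], 3: [["J"]], 4: [["C"]]}
g1 = states(schedule, latency=2, i=1)  # overlapping steps 1 and 3
g2 = states(schedule, latency=2, i=2)  # overlapping steps 2 and 4
```

With latency 2 this toy schedule yields one state in the first group and two in the second, one for each branch of the conditional.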

State Transitions
After identifying the states, we need to determine the state transitions. Given a CDFG pipeline-scheduled with a latency L, we observe that state transitions occur between adjacent groups of states in the following sequence: $G_1 \to G_2 \to \cdots \to G_i \to G_{i+1} \to \cdots \to G_L \to G_1$. This is mainly due to the pipelined nature of the scheduling and is shown in Figure 5. This is a key property in our optimization scheme, as will be discussed later. Another important factor affecting the control specifications is the distribution (D) nodes. If the present state has m D nodes, there are $2^m$ possible combinations of input conditions. Given a particular state, we now present a strategy for finding the next state by considering only node labels.
We define two nodes to be compatible if they are not mutually exclusive. Thus, the next state is the one which has all the compatible (i.e., not mutually exclusive) nodes of the present state. Using the state representation outlined in Section 2.2, we can find compatible nodes by searching only the PEMs corresponding to the next time step within states in the next group. Starting with $G_1$, we choose a present state and find all the possible next states in $G_2$ for all possible input conditions. This procedure is repeated for all the states in $G_1$. Next, the states in $G_2$ are considered in the same manner, and so on until all L groups of states have been visited. A 0 (or false) condition in a D node implies that the left branch will be traversed, and thus a transition must occur to one of the states corresponding to the left child of the D node. If a state is not dependent on a particular D node, then ...
A more detailed example will be presented in Section 4.
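The two structural facts used in this step, the cyclic order of the groups and the $2^m$ input combinations of a state with m D nodes, can be sketched in Python as follows. The helper names are hypothetical, and the sketch deliberately stops short of the label-based compatibility search, whose details depend on the labelling scheme of [1].

```python
from itertools import product


def next_group(i, latency):
    """Index of the group that follows G_i in G_1 -> ... -> G_L -> G_1."""
    return i % latency + 1


def input_conditions(d_nodes):
    """All 2**m assignments of 0/1 to the m D nodes of the present state."""
    return [dict(zip(d_nodes, bits))
            for bits in product((0, 1), repeat=len(d_nodes))]


# A present state in G_3 of a latency-3 pipeline with two D nodes:
target_group = next_group(3, latency=3)
conditions = input_conditions(["D1", "D2"])
```

Each entry of `conditions` corresponds to one row of the state table for the present state, and all of its next states are drawn from the single group returned by `next_group`.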

FSM OPTIMIZATION
As part of our approach to FSM optimization, we use a variation of the horizontal partitioning technique. In Paulin's algorithm, the second objective must be weighed against the first one. By exploiting the specific characteristics of the pipeline control synthesis problem, we developed a new algorithm for horizontal partitioning in which both objectives are satisfied without conflict, and which is therefore more efficient in this situation. While it was originally targeted at PLA optimization, this methodology can also be applied to non-PLA control structures, such as random logic, and would still result in area savings, as will be shown experimentally in Section 4. In addition, we extended the horizontal partitioning from two-way partitioning to multi-way partitioning. This enables us to explore more optimization possibilities and thus obtain more area-efficient controller implementations.

The Partitioning Algorithm
The area of the controller logic can be reduced by reducing the number of states (row reduction in PLAs) and also by reducing the number of bits/state (column reduction in PLAs). The first reduction can be achieved by placing all the possible binary relations (objective 1 in Paulin's approach) in either one of the partitions, and the number of bits/state can be minimized by grouping together PTs which depend on common inputs (objective 2). In general these two are interdependent (and sometimes even competing) and thus cannot be optimized simultaneously. However, in our model, the D nodes are scheduled in only one time step, and therefore the inputs to each of the groups $G_i$ are always mutually exclusive. Thus, grouping overlapped stages in a pipelined data path has the important advantage that it solves the first objective in Paulin's algorithm without worrying about the second one. In other words, the input/output relations do not block any binary simplifications between terms because the inputs to each group are mutually exclusive.
We partition the groups into two subsets $SP_1$ and $SP_2$ such that the total area of the resulting partitioned FSM is minimized. If the controller is implemented as a PLA, in order to calculate the area we need to know the numbers of columns $C_{p1}$ and $C_{p2}$ in $SP_1$ and $SP_2$, the numbers of product terms $PT_1$ and $PT_2$, and the number of binary relations in each partition. We can estimate the numbers of rows $R_{p1}$ and $R_{p2}$ in each partition by subtracting the total number of rows reduced by coding constraints from the total number of PTs in the partition. The total area is $Area = R_{p1} \times C_{p1} + R_{p2} \times C_{p2}$. If the controller is implemented in random logic, we can estimate the area of the layout by using the LAST area estimator [24], which can do so quite efficiently and within 5% accuracy for standard cell implementations. In either case, the problem reduces to finding the partitioning which results in a minimum total area.

Exhaustive Search
Since the partitioning scheme now deals with only L groups instead of a much larger number of states, the problem is greatly reduced in size. For small values of L we can find the optimal partitioning by exhaustive search. The number of distinct ways of partitioning L distinct objects $a_1, \ldots, a_L$ into n ($n \le L$) non-empty partitions is given by

$R_{L,n} = \frac{1}{n!} \sum_{e_1 + \cdots + e_n = L,\ e_1 \ge 1, \ldots, e_n \ge 1} \frac{L!}{e_1! \cdots e_n!}$    (1)

Table II shows some numerical values of $R_{L,n}$ for values of L and n up to 6. Since the values of $R_{L,n}$ are less than 100 in all these cases, an exhaustive search to obtain the optimal partition is feasible for small values of L.
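The count in Eq. (1) is the Stirling number of the second kind, so it can be cross-checked numerically. The Python sketch below, with illustrative function names, evaluates Eq. (1) directly and compares it with the standard recurrence $S(L,n) = n\,S(L-1,n) + S(L-1,n-1)$.

```python
from itertools import product
from math import factorial


def r_recurrence(L, n):
    """Stirling number of the second kind via the standard recurrence."""
    if L == n:
        return 1
    if n == 0 or n > L:
        return 0
    return n * r_recurrence(L - 1, n) + r_recurrence(L - 1, n - 1)


def r_formula(L, n):
    """Direct evaluation of Eq. (1): multinomials over e_1+...+e_n = L, e_k >= 1."""
    total = 0
    for e in product(range(1, L + 1), repeat=n):
        if sum(e) == L:
            term = factorial(L)
            for ek in e:
                term //= factorial(ek)
            total += term
    return total // factorial(n)


table = {(L, n): r_recurrence(L, n)
         for L in range(1, 7) for n in range(1, L + 1)}
largest = max(table.values())
```

For L and n up to 6 the largest entry is $R_{6,3} = 90$, which agrees with the statement that all values stay below 100.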
The experimental results in Section 4 show that partitioning the groups of states into more than two partitions can result in more area-efficient implementations than two-way partitioning. The multi-way partitioning algorithm is a simple modification of the two-way partitioning.

Branch and Bound Method
As noted in Section 2, the presence of loops in a CDFG schedule could easily make the latency L relatively large. Exhaustive search would then become computationally intractable. In these cases, the problem is reformulated as a branch-and-bound algorithm to reduce the search space.
In our branch and bound method a decision tree is constructed as follows. The ith level in the tree represents a possible two-way partition with partition sizes of i and n - i, where n is the total number of distinct objects (or groups). The labels on the edges denote the objects selected. For example, the first level shows the possible partitions of sizes 1 and n - 1, the second level represents the possible partitions of sizes 2 and n - 2, and so on, as shown in Figure 7. We observe that there are n levels in the decision tree. In order to implement an efficient branch and bound technique, we need to derive a good bounding function which can be easily computed. In our case, the cost function to be minimized is the sum of the areas of the resulting partitions. In the following, we discuss this issue and propose a bounding technique to be used by the search algorithm. Here, we assume that the target implementation is done using PLAs.
The estimate does not take into consideration the effect of further logic minimization and is only an approximation of the final result.
Let us represent the unpartitioned PLA by a rectangle of width w (the number of input conditions) and height h (the number of PTs). Since two groups or two sets of groups have disjoint PTs and input conditions, we can represent a partitioning as two rectangles connected at one point P(a, b), as shown in Figure 8(a). Let w and h be the sides of the rectangle which represents the unpartitioned design, and let a and b be the sides of one partition. Then the (shaded) area occupied by the partitions is

$f(a, b) = ab + (w - a)(h - b) = 2ab + wh - ah - bw.$

The unshaded portion of the rectangle shows the area saved by the partitioning. As the search goes down to the leaf nodes, one group at a time is moved from the lower partition to the upper partition, so the point P(a, b) can only move to a point $(a + \Delta a, b + \Delta b)$, where $\Delta a \ge 0$ and $\Delta b > 0$, away from the upper left corner of the unpartitioned rectangle, the point S in Figure 8(a). Let us take a closer look at how the function f(a, b) changes in the space bounded by the line segments x = 0, x = w, y = 0, and y = h. We divide the space into four quadrants, as shown in Figure 8(b), separated by the line segments x = w/2 and y = h/2. Note that the decision tree is not balanced. We can make the following observations.
$\partial f(a, b)/\partial b = 2a - w$. If a < w/2, f(a, b) is a decreasing function along the y direction; if a > w/2, f(a, b) is an increasing function along the y direction. Since in Quadrant I a < w/2 and b < h/2, both $\partial f(a, b)/\partial a$ and $\partial f(a, b)/\partial b$ are less than zero; therefore, f(a, b) is a decreasing function in Quadrant I. On the other hand, in Quadrant IV, both $\partial f(a, b)/\partial a$ and $\partial f(a, b)/\partial b$ are greater than zero. Therefore, f(a, b) is always increasing in Quadrant IV while a and/or b is increasing.
OBSERVATION 2 The point C at (a, b) = (w/2, h/2) in Figure 8(b) is a saddle point of f(a, b), and C is unique.
Proof. We prove this using partial differentiation. Applying the sufficient condition for a saddle point [23] shows that the point C is a saddle point and that this point is unique.
OBSERVATION 3 If P lies on either of the line segments a = w/2 or b = h/2, as shown in Figure 8(c), then the area f(a, b) occupied by the partitions is wh/2.
(i) If the point P lies on the segment a = w/2, the total area is $f(w/2, b) = 2(w/2)b + wh - (w/2)h - bw = wh/2$.
(ii) If the point P lies on the segment b = h/2, the total area is $f(a, h/2) = 2a(h/2) + wh - ah - (h/2)w = wh/2$.
Therefore the area f(a, b) = wh/2.
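The saddle-point argument can be written out explicitly. The following is a sketch that assumes the condition cited from [23] is the standard second-derivative test; it is a reconstruction from the definition of f(a, b), not necessarily the authors' original wording.

```latex
\[
\frac{\partial f}{\partial a} = 2b - h, \qquad
\frac{\partial f}{\partial b} = 2a - w,
\]
so both first derivatives vanish only at $(a, b) = (w/2,\ h/2)$, which gives uniqueness.
At that point
\[
\frac{\partial^2 f}{\partial a^2}\,\frac{\partial^2 f}{\partial b^2}
 - \left(\frac{\partial^2 f}{\partial a\,\partial b}\right)^{2}
 = 0 \cdot 0 - 2^{2} = -4 < 0 ,
\]
so the critical point is a saddle point. Equivalently,
\[
f(a, b) - \frac{wh}{2} = 2\left(a - \frac{w}{2}\right)\left(b - \frac{h}{2}\right),
\]
which is zero on the two segments $a = w/2$ and $b = h/2$.
```

The factored form also shows at a glance in which quadrants f(a, b) lies above or below wh/2, since the sign of the product depends only on which side of each segment P falls.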
The above observations show how f(a, b) changes in Quadrants I and IV. In the case of moves in Quadrants II and III, f(a, b) can be decreasing or increasing. We also find that f(a, b) is greater than wh/2 in Quadrants I and IV, but less than wh/2 in Quadrants II and III. Based on these observations we developed a new bounding technique to be used in the branch and bound method for our horizontal partitioning. While the decision tree is searched down to the leaf nodes, point P(a, b) in Figure 8 moves from point S to point T through Quadrants II or III, or through point C. Subsequently, the search goes down the sub-branches until the point P reaches Quadrant IV. Beyond this point, any further move would increase the area f(a, b), and further search through this branch of the tree is not necessary, as can be seen in Figure 9. The sufficient and necessary conditions to detect that point P has already reached Quadrant IV are that: (1) the resulting area of the child node is larger than the parent node's (i.e., f(a, b) is now increasing), and (2) f(a, b) of the child node is greater than wh/2. These two conditions are the bounding factors for our branch and bound based algorithm. The fewer levels we need to investigate, the better this branch and bound scheme works. In other words, if the minimum point is found at the earlier levels for each branch, then the search space is reduced significantly. To satisfy this objective, the objects are sorted according to the size of their diagonal (i.e., by increasing distance to point S).
FIGURE 9 3-D plot of the function f(a, b) for h = w = 20.
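The two bounding conditions can be expressed directly in code. This is a minimal Python sketch with hypothetical function names; it only evaluates the area function and the pruning test for a single parent-to-child move, and is not the full search procedure.

```python
def area(a, b, w, h):
    """Area f(a, b) = ab + (w - a)(h - b) occupied by a two-way partition."""
    return a * b + (w - a) * (h - b)


def should_prune(parent, child, w, h):
    """True when the move parent -> child meets both bounding conditions:
    the area is increasing, and the child's area exceeds wh/2."""
    f_parent = area(parent[0], parent[1], w, h)
    f_child = area(child[0], child[1], w, h)
    return f_child > f_parent and f_child > w * h / 2


# With h = w = 20 (the setting of Figure 9), wh/2 = 200.
early_move = should_prune((2, 2), (5, 5), 20, 20)     # still in Quadrant I
late_move = should_prune((12, 12), (15, 15), 20, 20)  # already in Quadrant IV
```

In the first move the area is still falling, so the search continues; in the second move both conditions hold and the sub-branch can be cut.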

Heuristic Approach
While the branch and bound approach can significantly cut the search time, its worst case performance is still exponential. In order to further reduce the partitioning runtime, we developed a heuristic algorithm which is a variation of the branch and bound method. As mentioned in the previous section, f(a, b) is decreasing with both a and b in Quadrant I. This means that f(a, b) is likely to be large in the earlier levels, since point P is then close to point S in Quadrant I. So, by avoiding the computation of f(a, b) in the earlier levels, we can reduce the overall amount of computation. While the branch and bound scheme prevents searching in Quadrant IV, the heuristic approach tries to reduce the search in both Quadrants I and IV. It controls the starting level of the next sub-branch in Quadrant I and adopts the branch and bound method in Quadrant IV. As depicted in Figure 10, the tree is searched in a depth-first manner. If the minimum value of the previous sub-branch is found at level $l$, then in the next sub-branches we start the search from the same level $l$. Thus, compared with starting the next sub-branch from a level k, where $1 \le k < l$, we save $l - k$ calculations. Figure 10 shows an example of the design space with a latency of 6, where the numbers inside the circles show the total normalized PLA area of each partitioning. Since the minimum partition is found in level 3 (1-2-5), only the solutions at levels greater than or equal to level 3 are computed. Thus only the space between lines 1 and 2 is the design space which the heuristic algorithm explores, and the space above line 3 is the one explored by the branch and bound method.
Both approaches were implemented in one program, as shown in Figure 11. Currently, it is up to the user to decide which method will be used in partitioning. Table III compares the efficiency of the branch and bound method and the heuristic algorithm with respect to the exhaustive search on the quadratic equation solver example depicted in Figure 12. In this example we assume that the values a, b, and c in the equation $ax^2 + bx + c = 0$ are 8-bit integers, so that the number of iterations needed to compute the square root is 7. We unrolled the loop completely and ran Sehwa to schedule the CDFG with different values of latency (L up to 11). As explained in Section 2.2, there are L groups of states to be partitioned. In order to show the efficiency of our branch and bound and heuristic approaches, we created additional cases (with latency greater than 11) by duplicating the groups, i.e., we use each group twice. For example, the case with a latency of 22 is made up by taking the schedule with a latency of 11 and using each group twice. The heuristic always finds the optimal partition except when the latencies are 11 and 22. In these cases, the heuristic finds suboptimal solutions which are only 1.2% and 1.8% greater in area than the optimal solution, respectively.

State Encoding
Once the horizontal partitioning of the state table is done, we need to perform state encoding. Given a set of coding constraints, the objective of this procedure is to assign state codes so that the size of the sequencing logic is reduced. We generate coding constraint groups consisting of states having the same next state and matching primary inputs. States in the same coding constraint group can be collapsed into one common PT, thus reducing the number of states. In addition to the savings from horizontal partitioning, we can also reduce the number of bits/state in the two-way partitioning case by assigning even codes to all the next states of one partition (in a PLA, this will set the last column in the OR-plane to all zeros, and in random logic, this will reduce the gate count and the wiring). To decide on a candidate for this reduction, we compute the number of next states in each partition and check if it is less than $\lceil \log_2(\text{total number of states}) \rceil / 2$. If this applies to both partitions, we choose the partition that results in a larger reduction in area. This is always possible since the number of next states in either partition is less than or equal to half of the total number of states, and the number of available codes is always at least equal to the total number of states. Furthermore, PTs not in the current partition are included, but their next states are set to don't cares. This allows further minimization by logic optimization tools such as Espresso [19] (for PLAs) or MIS [20] (for random logic), since it reduces the number of literals. The state encoding algorithm is shown in Figure 13. The encoding algorithm can be extended to multi-way partitioning in a straightforward manner by dividing the partitions into two blocks and assigning state codes to each as if it were a two-way partitioning.
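Two of the encoding ideas above, forming coding constraint groups and giving even codes to the next states of one partition, can be sketched in Python. The row format (present state, inputs, next state) and both function names are assumptions made for illustration; the actual algorithm of Figure 13 is not reproduced here.

```python
from collections import defaultdict


def coding_constraint_groups(rows):
    """Group present states that share the same inputs and next state.

    rows: iterable of (present_state, inputs, next_state) tuples
    (a hypothetical state-table encoding).
    """
    groups = defaultdict(list)
    for present, inputs, nxt in rows:
        groups[(inputs, nxt)].append(present)
    # Only groups with at least two states can share a product term.
    return [g for g in groups.values() if len(g) > 1]


def even_codes(next_states):
    """Assign the even codes 0, 2, 4, ... so the last code bit is always 0."""
    return {s: 2 * k for k, s in enumerate(next_states)}


rows = [("S1", "0", "S3"), ("S2", "0", "S3"), ("S1", "1", "S4")]
groups = coding_constraint_groups(rows)
codes = even_codes(["S3", "S4"])
```

Because every assigned code is even, the least significant next-state bit of that partition is constant zero, which is the column that drops out of the OR-plane.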

RESULTS
In this section, we present experimental results obtained by applying our approach to two design examples. The first example is from [1]; the second is a reduced instruction set version of the M6502 microprocessor. In both cases, we show evidence that our approach achieves better area savings compared to traditional synthesis methods.

The Sehwa Example
The first example CDFG [1] is shown in Figure 4. We used Sehwa to schedule this CDFG with different latencies. In Example Sehwa-1, the CDFG is scheduled with latency L = 3. Preprocessing adds two NOP nodes to the CDFG: one, (110), is added in time step 2 and the other, (10), in time step 4, so that the procedure deals only with operation nodes. There is a total of 16 states in this example. Starting from state S, we construct the state table. Using the FSM synthesis algorithms of Section 3, the controllers are built with both PLAs and standard cells. For the PLA controller, partitioning saves the second and fifth input columns in partition 1, and the first, third, and fourth input columns in partition 2. The last column of the OR-plane in one of the partitions is also reduced by the state encoding. We then minimized each partition with Espresso. The array area of each PLA can be estimated by $A_{pla} = (2 \times n_i + n_o) \times n_{pt}$, where $n_i$ and $n_o$ are the numbers of input and output bits, respectively, and $n_{pt}$ is the number of product terms. Table IV(a) shows the PLA areas obtained by NOVA [17], by the modified Horizontal Partitioning [26], and by our algorithm, in normalized PLA area units. We added an estimate of the routing and buffering area for the last two approaches, which produce multiple PLAs. We ran NOVA with the i_exact encoding strategy. For the modified Horizontal Partitioning, we tried cluster sizes of 8, 10, 12, 16, and 20 and chose the best result.
In this particular example, our algorithm achieves PLA area savings of more than 24% as compared to the other two algorithms.
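The area estimate and the savings comparison used above can be sketched as follows. The function names, the single `overhead` term standing in for the routing and buffering estimate, and all numbers in the usage note are illustrative assumptions; they are not values from Table IV.

```python
def pla_area(n_inputs, n_outputs, n_product_terms):
    """Array area estimate A_pla = (2 * n_i + n_o) * n_pt, in PLA area units."""
    return (2 * n_inputs + n_outputs) * n_product_terms


def partitioned_area(partitions, overhead=0):
    """Total area of a multi-PLA controller.

    partitions: list of (n_inputs, n_outputs, n_product_terms) tuples.
    overhead: assumed lump estimate of routing and buffering area.
    """
    return sum(pla_area(*p) for p in partitions) + overhead


def saving_percent(baseline_area, our_area):
    """Area saving of our_area relative to baseline_area, in percent."""
    return 100.0 * (baseline_area - our_area) / baseline_area
```

As a made-up example, a single PLA with 5 inputs, 10 outputs, and 20 product terms has an estimated area of 400 units, while two partitions of (3, 6, 10) and (2, 5, 8) with an overhead of 20 units total 212 units, a saving of 47%.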
In Example Sehwa-2, the same CDFG of Figure 4 is scheduled with latency L = 2. Here we use the io_hybrid encoding strategy for NOVA since the i_exact encoding was computationally infeasible (we ran it for more than 70 hours on a Convex supercomputer). The savings are much greater in this case.
In the case of random logic controllers, we minimized the logic using the MIS multi-level logic optimizer. Each partition was optimized separately using NAND, NOR, and INVERTER gates. The three partitions were then merged into one block and laid out using the GDT standard cell place and route tools [27]. Table IV(b) shows that our approach achieves savings of 9.5% and 72.9% in layout area over NOVA for Examples Sehwa-1 and Sehwa-2, respectively. In each case, we chose the standard cell row configuration which resulted in the minimum layout area. Again, we note that for Example Sehwa-2, our comparison with NOVA is based on a suboptimal io_hybrid run because an optimal run of NOVA was computationally infeasible.

Reduced M6502 Example
The Sehwa example, while interesting in its structure, is synthetic (i.e., it does not perform any useful computation). In order to benchmark our controller synthesis approach against realistic cases, we selected the MOSTEK 6502 microprocessor as another example. The specification in ISPS was obtained from the High Level Synthesis benchmark set [30]. In order to obtain an example of manageable size (which can be handled by Sehwa), we reduced the instruction set to four instructions. This resulted in the CDFG shown in Figure 15. Also, the original specification was based on a non-pipelined schedule, which was reflected in the assumed data path. To enable pipelined scheduling with high throughput, we made two modifications to the data path and the CDFG. First, we assume a dual-ported memory in which two memory read operations are permitted to overlap; only one memory write operation is permitted at a time, though. Second, we increased the number of various resources, such as registers, in order to enable the overlapped execution of some register operations in the CDFG.
We used Sehwa to schedule the CDFG with latencies L = 4 and 6. For each schedule, we generated a pipelined RT-level implementation of the data path. Table V shows some statistics on the data paths and state tables for both latency values. Figure 16 shows the data path for L = 6. For both data paths, we used our algorithm to synthesize several implementations of the control part using both PLAs and standard cells. (Here, we could not compare against NOVA because the state table was too large for NOVA to handle, even in io_hybrid mode.) In each case, we generated layouts corresponding to various n-way partitionings of the groups of states for n = 1, 2, 4, 6. Table VI shows area and delay data of the various implementations at latency L = 6. The total area figures for the PLA implementations include the sum of the areas of the individual partitions plus estimates of the buffering and routing areas. The delays of all the implementations are computed as the worst case delay. The PLAs were laid out using octtools [29] in SCMOS 3 µm technology. The standard cells were laid out using the GDT CMOS 3 µm technology. The PLA delay figures were obtained using Crystal [28]. The standard cell delay figures were estimated by MIS. In both styles, the best area and performance were achieved using a four-way partitioning of the controller. The slightly

*Based on "Automatic synthesis of time-stationary controllers for pipelined data paths", by J. T. Kim, F. J. Kurdahi, and N. B. Park, which appeared in the International Conference on Computer Aided Design (ICCAD '91), November 1991. (C) 1991 IEEE.

FIGURE 7 A decision tree.

FIGURE 8 Analysis of horizontal partitioning design space.

FIGURE 10 The decision tree for L = 6.
(PTO) and Logical Topology Optimization (LTO). The PTO tools attempt to minimize PLA area by changing the physical layout of the PLA. An example of a PTO method is PLA folding. LTO methods include Vertical Partitioning, Counter tion

TABLE
Values of R L, for various n and L

Table II
Efficiencies of the branch and bound and heuristic methods

Table III
State decisions for example Sehwa-1
and Table II; assume that the current state is S2,1, then the next state depends S,, to identify input condition Ds. Check

TABLE V
Data path and state table information of the

TABLE VI
Experimental results for the M6502 example with L = 6.