Register-Transfer Synthesis of Pipelined Data Paths

We present a new approach to the problem of register-transfer level design optimization of pipelined data paths. The output of high level synthesis procedures, such as Sehwa, consists of a schedule of operations into time steps, and a fixed set of hardware operators. In order to obtain a register-transfer level design, we must assign operations to specific operators, values to registers, and finish the interconnections. We first perform module assignment with the goal of minimizing the interconnect requirements between RT-level components as a preprocessing procedure to the RT-level design. This will result in a smaller netlist which makes the design more compact and the design process more efficient. In addition to reducing the total number of interconnects, this approach will also reduce the total number of multiplexors in the design by eliminating unnecessary multiplexing at the inputs of shared modules. The interconnect sharing task is modeled as a constrained clique partitioning problem. We developed a fast and efficient polynomial time heuristic procedure to solve this problem. This procedure is 30–50 times faster than other existing heuristics while still producing better results for our purposes. Using this procedure, we can produce near optimal interconnect sharing schemes in a few seconds for most practical size pipelined designs. This efficient approach will enable designers to explore a larger portion of the design space and trade off various design parameters effectively.

INTRODUCTION uring the past decade, much of the research in design automation has focused on the physical level design of VLSI circuits and printed circuit boards.Although there are many inherently complex and difficult tasks left unsolved, the efforts at this level have produced many useful theories and prac- tical results which have been actually put into use in industry.High-level and register-transfer synthesis work also has started producing useful results.At this stage of development of electronic design au- tomation, we believe that there is increased demand for a better interface between these somewhat in- dependently studied levels of design automation.Especially from the high-level synthesis point of view, high-level design tradeoffs and optimizations must be made based on their effects on the final cost and performance of the physical implementations in or- der to guarantee optimality of the final design.An interesting synthesis experiment which demonstrated this aspect of high-level synthesis was reported by McFarland  [1].
Synthesis of digital systems from behavioral specifications is one of the most challenging problems in design automation nowadays.The main goal is to automatically generate a register-transfer level de- sign starting from a behavioral specification, and a set of constraints on overall design parameters such as cost (area) and speed.Behavioral synthesis is a computationally intractable problem when optimal solutions are desired.In fact, most subproblems re- lated to synthesis are known to be NP-complete or NP-hard.In order to reduce the complexity of the global problem to manageable levels, the synthesis process is divided into subtasks as depicted in Figure 1.The specific subtasks addressed in this work are shaded.
The work reported in this paper aims at completing the behavioral pipelined data path synthesis package Functional Specifications Behavioral Synthesis (Sehwa) In our approach, scheduling and resource allocation are not forced by any a priori clustering in order to achieve maximum resource utilization for pipelining.
Finally, BUD also applies only to non-pipelined de- signs.Other important works addressing various as- pects of synthesis of pipelined data paths are re- ported in [8], [5], [91, [10l, and [11].Sehwa [2] and to integrate it with other synthesis tools developed by the authors, CSSP [3] and REAL [4].In this paper we focus on the register-transfer synthesis problem, specifically on the module assign- ment and interconnect sharing subtasks of RT-level synthesis of pipelined data paths.Once a good mod- ule assignment and interconnect sharing schemes are obtained, the pipelined RT-level design can then be completed by using a multiplexor and bus allocation program which can accept partial pre-existing bind- ings, such as MABAL [5].
While the bus allocation, module assignment, and interconnect minimization problems have been dis- cussed in the literature before [6] [7], little work has been published on the module assignment problem for pipelined data path designs.In [6], module as- signment is either forced by schedule or carried out in an ad hoc fashion.Furthermore, the techniques described work only on non-pipelined data path de- signs.BUD [7] attempts to cluster the design prior to scheduling, thus module assignment and intercon- nect sharing are forced as a result of this clustering.
The task of pipeline synthesis is to produce a register- transfer implementation of a pipelined digital com- puter from the data flow (function) description of the tasks to be executed.Constraints set by the designer are: the total available cost budget and the minimum required performance.The design tasks include: 1. scheduling: determine execution order and tim- ing of operations, 2. resource allocation: determine how many of which types of resources to allocate before or after scheduling, and 3. register-transfer level synthesis: detailed assignment of operations to operators, and placement and interconnection of hardware modules.
When an optimal design (with respect to design constraints) is desired, these tasks cannot be per- formed separately.For example, the shortest pos- sible schedule does not always guarantee the fastest performance since it might force the resource allo- cation in such a way that the fastest possible initiation rate of the tasks is not feasible due to resource con- flicts between consecutive tasks.Also, an expensive pipeline does not always guarantee better performance since scheduling may not be able to utilize all the available resources.Finally, the feasibility of re- source allocation within cost constraint can be checked only after the register-transfer level synthesis is completed.
Currently, the pipeline synthesis software package Sehwa [2] synthesizes pipelined data paths at the behavioral level.The inputs to Sehwa are a data flow graph with either a cost or a performance constraint and a module library.Sehwa produces a pipeline schedule in time steps, a list of modules (the number of each selected type of modules), and a resource allocation scheme to time steps in the schedule.Figure 2 shows an example input dataflow graph to the pipeline synthesis program Sehwa [12].The graph shows operations on values and their ordering.In this graph, there are two types of operations,

Time
Steps of three out An optimally scheduled data flow graph of a 16-point FIR filter with three multipliers and five adders for the latency add( + and multiply(*).The figure also shows the output from the pipeline synthesis program Sehwa.The partitioning of the data flow graph into time steps is shown with solid lines.Scheduling determines how many of each type of operators to allocate to each time step.The constraint is not to allocate to each group of overlapping time steps more than the total number of available modules.For this example, the optimization goal is to minimize the length of the schedule within the resource constraint.The allocation of operators to time steps is de- picted in the allocation table (Table I) which shows time steps divided into three groups.The time steps in each group overlap in time due to pipelined ex- ecution.Three multipliers and five adders are chosen and shared over all six time steps.The number of rows in the allocation table corresponds to the num- ber of available modules.The allocation table is an extension of a conventional reservation table [13] [14].
The results of scheduling and resource allocation together with the chosen module set are the inputs to the module assignment task, which performs specific operation-to-operator mapping.In a pipelined data path design, if the pipeline input initiation la- tency, L, is one, all the operators perform useful  operations at all times, except when there are haz- ards or conditional branches.In such cases, no op- erators or registers can be reused during the exe- cution of each cycle, which makes register-transfer synthesis straightforward and simple.However, if the latency L is greater than one, then each operator can be reused at most L times during the execution of each cycle.In such a case, sharing interconnects is very important since the excess amount of wiring space due to multiplexing at the inputs of the operator being shared might weigh against the advan- tage of operator sharing for lower cost designs.

REGISTER-TRANSFER SYNTHESIS OF PIPELINED DATA PATHS
The high-level pipeline synthesis procedure, Sehwa, produces a schedule of operations into time steps, the types and numbers of modules needed, and a resource allocation scheme to time steps.This sched- uling is optimal with respect to the given and chosen constraints (e.g. a cost constraint given by the user and an optimal latency chosen by Sehwa within the cost constraint).In order to complete the register- transfer pipeline data path, we need to perform the following tasks: specific operation-to-module assignment within each time step, multiplexor assignment for shared inputs to modules, storage (stage latches) assignment for pipelined values, and interconnection of all these components.
These design tasks interact with each other.For example, the numbers of multiplexors and registers necessary depend on the specific operation-to- module assignment.The amount of interconnects de- pends partly on how the above components are to be connected.When an optimal design is desired, we need to enumerate all possible RT-level imple- mentations of the given pipeline data path and com- pare the final layouts.However, exhaustive enu- meration is computationally intractable.Even the operation-to-module assignment alone can have ex- ponential time complexity (O(nm) in the worst case where n is the number of operations and rn is the number of modules).
In order to produce a near optimal design in a reasonable amount of time, we need to prioritize the importance of the effect of the design decisions on the final implementation and perform the design tasks in some sequential fashion, although not com- pletely sequential.In this paper, we propose such an approach.

MODULE ASSIGNMENT FOR PIPELINED DATA PATHS
One of the most important optimization tasks in VLSI design is to minimize the total chip area, which also tends to minimize circuit delays.Previous syn- thesis work has,attempted to minimize the active area in layouts by minimizing the total number of active components in the design, such as operators, regis- ters and multiplexors.However, as VLSI designs get more and more complex, wiring area becomes the dominating factor in layout area.The current approach to area optimization addresses wiring area minimization as a side effect of optimizing the active area and not as a first order effect.The main premise of the work presented in this paper is that the wiring effect must be treated as a first order effect during the register-transfer level synthesis in order to optimize the total chip layout area.In order to minimize the total wiring area, we need to 1. minimize the number of interconnection nets to layout and 2. perform optimal placement and routing.The second problem has been extensively studied before and will not be addressed in this paper.Rather, we focus on the first problem, minimizing the number of nets to layout.By attempting to min- imize the number of interconnections, we are auto- matically simplifying the placement and routing tasks.
During RT-level synthesis of pipelined data paths, the number of interconnections depends partially on the allocation scheme of operations to modules and values to registers.For a scheduled data flow graph, if the initiation latency L > 1, each module can be used L times as long as it is not assigned more than once in any overlapping time steps during pipelining.Naturally, a wire which is connected to an output of such a shared module can also be shared together with the module.If the destination operations share the same module, we can share the interconnect without any multiplexors or output control circuitry, as shown in Figure 3.In other words, interconnect sharing also minimizes multiplexors.This fact distin- guishes the proposed interconnect sharing scheme from the conventional bus sharing which requires sharing which requires either input multiplexors or output control circuitry for the modules sharing a common bus.If desired, bus sharing can still be im- plemented even after the interconnect sharing scheme is fixed.Likewise, we can share registers needed to store the value on the shared intercon- nects.In other words, optimal interconnect sharing tends, in general, to optimize the multiplexors and, in some cases, registers at the same time.
The interconnect sharing pattern is determined by the specific operation-to-module assignment.For this reason, our approach to register-transfer level implementation of pipelined data path must start with a good module assignment.
Following behavioral synthesis, the number of modules needed for each type of operation is known.
In an ad hoc approach, the operations are then as- signed to modules individually, the only criterion being that the module be capable of performing the particular operation and that it is not being used to perform other concurrent operations.This ad hoc approach may work fine when the hardware cost function does not include interconnect costs.Assigning an operation to any module capable and free to perform it does not increase or decrease the design cost function.However, when interconnect cost is factored into the cost function, then the assignment of operations to modules does indeed affect the over- all system cost.Figure 4 (a) shows a portion of a scheduled da- ta3ow graph.In (b), a "greedy" module assignment resulted in operations 1 and 3 sharing adder A, 2 and 4 sharing adder B, and 5 and 6 sharing subtractor C. Now if we attempt to share patterns instead of individual operators, then patterns (1,5) and (2,6) are similar and can hence share an adder connected to a subtractor, patterns (3) and (4) are also similar and can therefore share an adder.This will result in the assignment shown in (c) which is less complex and will most probably result in a more compact layout.In addition to having a simpler interconnect pattern, the assignment in (c) has one less multi- plexor than the assignment in (b).

Problem Definition and Procedure Outline
The inputs (received from Sehwa) to the module assignment task are: the scheduled data flow graph which shows operator-to-time step assignment, interconnects between operations, and stage latch requirements, and the corresponding allocation table showing the number and types of modules assigned to each time step.The outputs are: specific operation-to-module assignments, interconnect sharing schemes, and stage latch placement on the shared intercon- nects.
The optimization goal here is to maximize inter- connect sharing in a way that will also maximize stage latch and register sharing and minimize multiplexors.
We now define the conditions of module assign- ment for interconnect sharing, and then discuss the general procedure of module assignment for maxi- mum interconnect sharing.We first define several terms for clarity of discussion.Definition 1: A path vector is defined as an ordered list of the operations and stage latches along a di- rected path in the scheduled data flow graph for pipelining.A path vector always starts from an op- eration and terminates at an operation.The length of a path vector is defined as the number of operations (nodes) in that path.Definition 2: A pair of path vectors, P and Pi are called compatible if and only if all the following con- ditions are true.
1. Pi and Pj have identical resource-usage patterns, i.e., contain identical numbers of operations and stage latches in the same order.
2. Corresponding elements in P; and P. can be as- signed to the same module (the same operator or stage latch). 3. Every element in Pi does not overlap in time with the corresponding element in P. during pi- pelining, i.e., they can share the same module in different non-overlapping time steps.
The pairs of matching operations or stage latches in two compatible path vectors can be assigned to the same module (operator or register).In other words, the two compatible path vectors use the same set of modules and interconnects in the same order in nonoverlapping time steps, as we have seen in the example of Figure 3. Examples of compatible path vectors in the scheduled data flow graph of Figure 2 are <(+1 r*l r+a) (+4 r*4 r+c)>, <(+6 r r*6) (+8 r r *8)>, and <(+b +c)(+d +e)(+f +g)>.
In these expressions, 'r' denotes a stage latch.
For a set of compatible path vectors, interconnect sharing is possible when every pair of path vectors in the set are compatible with each other.For ex- ample, for the compatible set <( + b + c) + d + e) (+f +g)>, the additions +b, +d and +f can be assigned to an adder, say adder 1, and + c, + e and + g can be asigned to another, say adder 2. This way, the three compatible path vectors can share the same physical interconnect from adder 1 to adder 2 without any input multiplexors.

Synthesis Procedure
We first outline the procedure for register-transfer level implementation of pipelined data paths.Then we will present the detailed descriptions of the major steps of the procedure.

PROCEDURE MOD-ASSIGN;
Step 1: Enumerate all the compatible path vectors of certain lengths in the scheduled dataflow graph.
{Approach: Use depth-first graph traversal algorithm starting from each operation in the graph except those in the last time step of the schedule.} Step 2: Determine a good interconnect sharing pat- tern by grouping path vectors that can share the same interconnect.{Approach: Find cliques of size L (pipe ini- tiation latency) or smaller using a polyno- mial-time clique partitioning heuristic.)}/* A module can be reused at most L times in a pipeline with initiation latency L */ Step 3: Resolve conflicts between groups (cliques) obtained in Step 2 (since path vectors are not necessarily disjoint, implementing intercon- nect sharing for a group may make some part of the sharing in other groups not feasible).
Step 4: Perform operator and register assignment ac- cording to the resulting operation/value-tooperator/register assignment and intercon- nect sharing schemes.
Step 5: Complete register-transfer design by adding interconnects, multiplexors, and registers for the rest of the pipelined data path.
In Step 1 of the procedure, path vectors of chosen lengths are generated.Ideally, in order to get the optimal interconnect sharing scheme, we need to enumerate all the possible paths of all possible lengths.Then we need to test interconnect sharing for all possible combinations of the paths.Unfor- tunately, the path enumeration problem alone has exponential time complexity.Finding an optimal sharing scheme for one such combination is also of exponential time complexity.For this reason, to be realistic, we only consider path vectors of length two (nodes) in our discussion.The path vectors of longer lengths can still be covered as chains of these shorter path vectors.The procedure PATH_ENUM, de- scribed in Appendix A.1, uses a recursive, depthfirst graph traversal algorithm, DF_Path, to list all paths of a certain length (or number of nodes).The time complexity of this procedure is O(n) where n is the number of nodes and c is the length of the path vectors.
From the path vector list, we can build graphs which show compatibility relationships among the path vectors.The conditions for compatibility were given by Definition 2. The completed compatibility graphs for the path vectors of Table II are shown in Figure 5.The edge connecting nodes 1 and 2 shows that the paths 1 and 2 can share the same modules (operators and registers) and thus the interconnects.The COM_Path algorithm, described in Appendix A.2, builds compatibility graphs from the list of path vectors.
As mentioned before, in pipelining with latency L, each module can be reused L times.Naturally, the next task will be finding cliques of size L from each connected subgraph.Unfortunately, the clique partitioning problem is NP-Complete.An efficient clique partitioning heuristic procedure is needed.  +d +e) (5 5) (+a r +b) (3 4) (*5 r +d) ( (*7 r r +f) ( (*8r +g) (5 6) Note that we need to get only the cliques of size L or smaller.Also, maximum interconnect sharing is achieved by minimizing the total number of cliques and not by maintaining the size of each clique.To take advantage of these facts, we developed a new clique partitioning procedure instead of using exist- ing general procedures such as Tseng's [6].The time complexity of this procedure is O(L E N2) in the worst case, where L is the pipe latency, E is the number of edges (pairs of compatible paths), and N is the number of nodes (paths) in the compatibility graph.However, as we find cliques, these cliques are removed from the graph and therefore the number of remaining paths is reduced rapidly and the procedure converges fast.The new heuristic procedure, L_CLIQ, described in Appendix A.3, finds near optimal solutions for most graphs when the clique size is smaller than 10.The procedure is programmed in C on a SUN3 workstation and runs in a few seconds for graphs with 100 nodes and 1000 edges.Next, we need to carry out specific module as- signment according to the results of module assign- ment and interconnect sharing schemes.Steps 3 and 4 of the overall procedure (MOD-ASSIGN) are car- ried out in parallel.The procedure, CLIQ_ASSIGN, described in Appendix A.4, uses a resource alloca- tion table to keep track of current status of module assignment and availability of modules, as shown in Table III.For the precise usage of the resource al- location table, refer to [2].
At this stage of the synthesis, there might be op- erations not assigned to any hardware modules yet.There are two approaches to the completion of RTlevel designs.First, we can repeat the overall procedure, MOD-ASSIGN, with shorter path vectors which contain the unassigned operations.A Second approach is to complete the module assignment using a greedy algorithm (since the possibility of finding more interconnect sharing schemes at this stage will be very low).
Once module assignment is complete and an in- terconnect sharing scheme is obtained, we need to complete the RT-level design by adding more reg- isters and multiplexors.In our approach, we perform  optimal register assignment first using the register allocation procedure REAL [4].Then the multi- plexor placement is forced by the module assignment and the register assignment.

Example
Since there has been no reported work on module assignment and interconnect sharing for pipelined designs, we cannot provide any comparative data on the performance of our approach.However, in this section, we apply our technique to the 16-point FIR filter in Figure 2. From this example we generate two different RT-level designs, one using a greedy mod- ule assignment, the other using our technique.Results of the design comparison are presented at the end of this section.
The data flow of a 16-point FIR filter of Figure 2 receives inputs from an external 16-bit shift register, which is not shown here but might actually be built into the same chip with the filter.The data flow graph is scheduled with three multipliers and five adders, which is the minimum set of modules in order to achieve an initiation latency L of three (clock cycles).
The allocation table (Table I) shows resource allo- cation to each group of overlapping time steps during pipelining.As mentioned before, once we get the scheduled data flow graph with resources allocated to each group of overlapping time steps, we need to perform the specific module assignment to each op- eration.The objective function is to minimize the number of interconnection nets, which might also minimize the number of multiplexors and stage latches necessary to complete the register-transfer pipeline design.First, we construct a list of path vectors.In this example, we use path vectors of length two only for simplicity of illustration as shown in Table II.
The second step is to group the path vectors into compatible path vectors.The result of this grouping is shown in Figure 5. Next, we need to partition each of the compatible path graphs into cliques of size L (pipe initiation latency) or smaller.The solid edges in the graphs of Figure 5 show the result of the clique partitioning.
Once we get the cliques of compatible paths, we need to check whether there are any conflicts in im- plementing all the interconnect sharing schemes.For example, clique C1 contains a path from + b to +c and C4 contains a path from *3 to +b.Sharing an interconnect for C1 might make sharing a single in- terconnect for all the paths in C4 not feasible.For this reason, we construct a graph which shows pos- sible interaction between the cliques as shown in Fig- In our current implementation, we start module assignment with the cliques with minimum number of interaction edges with other cliques.Intuitively, if we start implementing the clique with largest num- ber of interaction edges, we are likely to affect and destroy the interconnect sharing pattern of many other cliques.For this example, the first clique to implement will be either C2 or C3.Actually, we have chosen C2 first and C3 next, using the lowest-index- first tie-breaking rule, followed by C1, C5, and then C4.
Figure 7 shows the final result of the interconnect sharing and the corresponding module assignment.Cliques C2, C3, and C1 are implemented without any conflict with each other.However, interconnect sharing for all three path vectors in C5 (12, 17, 21)   is not feasible.The reason is as follows.As shown in Figure 7, interconnect sharing for C1 requires two  FIGURE 7 Interconnect sharing results and partial module assignment adders, Adder 3 and Adder 4. These adders needed to be used together in three consecutive time steps (4, 5 and 6) and thus cannot be reused in any other time steps since the pipeline initiation latency is three.Interconnect sharing for C5 needs +a, +c and + e to be assigned to the same adder.Adder 4 has already been assigned to +c, +e and +g and therefore Adder 4 cannot be assigned to + a. There- fore, path vector 17 (+ a r + b) other path vectors in C5, 17 and 21.By the same reason, for C4, only path vectors 13 and 18 can share the same operators and interconnect.This phase pro- duces a partial module assignment with the goal of minimizing the number of interconnection net, as shown in Figure 7.The rest of the register-transfer synthesis will be the completion of the module as- signment, and placement of multiplexors and regis- ters.The final design was completed by hand and register sharing was optimized using a procedure de- scribed in [4].Figure 8 shows the completed RT- level design.
We also generated another RT-level design using a greedy approach to module assignment, as shown in Figure 9. Table III shows the module assignments for both approaches.statistics on various design components.While both designs have the same number of adders, multipliers, and registers, the optimized design represents a sig- nificant improvement over the non optimized one, using four less multiplexors and eight, 8-bit inter- connects (or 64 wires).This represents about 20 per- cent savings in the size of the steering logic.

CONCLUSION
In this paper, we presented an approach to solving the problem of optimal synthesis of register-transfer pipelined data paths.Synthesis tasks are partitioned and prioritized according to their importance in area optimization of the final register-transfer pipelined data path designs.In the proposed approach, the highest priority is given to the module assignment task.We presented an efficient solution to the mod- ule assignment problem as part of register-transfer synthesis of pipelines.The main objective is to max- imize sharing of interconnections among the mod- ules.A model for sharing using compatibility graphs was proposed and an efficient heuristic was imple- mented in Franz Lisp running on a SUN3 worksta- tion.This procedure produces near optimal inter- connection sharing schemes in seconds for most practical size examples.As a result of this assign- ment, the total number of interconnections is ex- pected to be minimal, and the multiplexing to be simpler.This produces area efficient layouts with im- proved performance.Once module assignment is done, the subsequent tasks of register and multi- plexer allocations, need to be performed before we can obtain a complete RT-level design implemen- tation.The register allocation problem was ad- dressed in [4].The remaining task of multiplexer allocation is currently being researched, and efficient solutions must be proposed before a complete syn- thesis system can be realized which can produce com- plete RT-level designs from behavioral specifica- tions.Our system produced near optimal register- transfer pipelined data paths for most examples which ran quickly as mentioned before.We believe that this type of synthesis work is necessary for the high-level synthesis tools to be fully automated and to be practical.

FIGURE 3
FIGURE 3 Sharing of physical paths

FIGURE 5
FIGURE 5 Compatible path graphs

FIGURE 6
FIGURE 6 Clique interaction graph

TABLE The Resource
Allocation Table for the Schedule of

TABLE III
cannot share the same module nor the same interconnect with the FIGURE8Complete RT-level design with optimized module assignment

Table IV
9An RT-level design with greedy module assignment Design

TABLE IV Comparison
of greedy versus optimized designs