Module Selection in Microarchitectural Synthesis for Multiple Critical Constraint Satisfaction

Accurate design descriptions during synthesis allow efficient use of resources. The appropriate use of distinct implementations of RTL operators helps generate optimal VLSI designs. The system presented here utilizes libraries composed of multiple modules with identical functionality, but distinct performance and area characteristics. Such libraries allow the generation of an accurate estimate of the area and delay of the final design during synthesis. Full use of the module selection capability is possible by allowing the user to specify a total area limit rather than a detailed allocation. Consequently, tradeoffs between different allocations can be fully explored. Scheduling, module selection, and allocation are performed simultaneously to achieve optimal use of area and delay, and to facilitate the incorporation of lower level design considerations into behavioral synthesis. Synthesis decisions are made in a time-constrained and area-constrained fashion, by using both constraints to identify and avoid infeasible design possibilities. Module selection, scheduling, and allocation for pipelined designs are also implemented. Experimental results show that the use of module selection and time-and-area-constrained synthesis results in an area/delay design curve which is superior to the results of traditional systems.


INTRODUCTION
There are many parameters of a final chip design, such as area, power, and performance, which must be addressed to make the chip useful, and its manufacture profitable. At each step of the design process, these parameters must be accurately estimated and used to guide the progress of the design towards a high quality chip. Behavioral synthesis typically suffers from inaccurate estimates because there is little information about the physical layout at the algorithmic level of description. Recently, attempts have been made to connect behavioral synthesis to lower levels of design (floorplanning [31], placement and routing [26]) to achieve better estimates during synthesis. Since design decisions made at the behavioral synthesis level have a great impact on the quality of the final design, it is crucial that good estimates be used to direct synthesis.
The goal of behavioral synthesis is to generate a datapath from an algorithmic description of the desired functionality of the chip. The first step in the process is usually the generation of an intermediate algorithmic representation [24], composed of a control flowgraph describing conditional branching and looping constructs in the behavioral description, and a dataflow graph describing the dataflow dependencies between operations in the algorithm. The task of deciding at which clock cycle each dataflow operation will be performed is called scheduling. The allocation task allots hardware modules to perform the operations in the dataflow graph, and the operator binding task assigns each dataflow operation to a particular allocated hardware module which will perform it. Additionally, an RTL description of the control unit must be generated to sequence the operations as described in the schedule. A layout can then be generated from the RTL description of the entire chip using standard physical design tools.
In the design of a chip with any reasonable degree of complexity, it is likely that more than one implementation of an operator will be utilized. For instance, a slow ripple carry adder could be used as well as a carry-lookahead or carry-select adder. These three different types of adders will have different area and delay characteristics which should be considered during behavioral synthesis while the datapath is being created. The characteristics of the modules used during synthesis should be close to the characteristics of the modules which will be used in the physical design so that accurate estimates of delay and area consumption can be made. Delaying the resolution of abstract modules to actual components until after microarchitectural synthesis may result in inefficient area usage and reduced throughput.
In this paper, we propose integration of module selection into the scheduling and allocation tasks of behavioral synthesis. Traditional hardware allocation is performed under the assumption that only one module type of each functionality exists in a design library. Allocation with module selection allows us to obtain a better estimate of the area of the final design by using area information from a full library rather than the rough approximation afforded by only a single module type. Performing scheduling with module selection allows us to better estimate the delay of the final dataflow graph schedule by using timing information from a full library. Module selection facilitates the efficient use of area and delay resources by allowing non-critical path nodes to be executed by small, low speed modules. Performing module selection during behavioral synthesis facilitates better use of chip area by allowing operator binding to be performed to the full range of library modules.

RESEARCH CONTRIBUTIONS
Design decisions during behavioral synthesis are driven by the need to meet a set of constraints on the properties of the physical design. Performance constraints are critical in many applications where processing must be performed in real time, such as DSP. At the same time, new design requirements such as design-for-test [4, 32, 1], reliability [25], and fault tolerance [19], have made the optimization of area overhead more imperative. For time-critical and area-critical designs, both constraints should be enforced simultaneously during high-level synthesis. Efficient use of resources is necessary to meet user-imposed design constraints. The use of module selection provides a more accurate description of the area and delay characteristics of the design during behavioral synthesis, allowing area and delay resources to be used efficiently. The tradeoff between area and delay can be explored fully when no user-imposed limits on the allocations of each module type are specified. The primary research contributions of our work can be summarized as follows:
Module Selection: Conventional behavioral synthesis methods are extended to accommodate libraries containing modules with identical functionality but different area and delay characteristics.
Time-and-Area Constrained Synthesis: Time and area constraints are simultaneously satisfied by using the area estimation to determine which design decisions will cause the area limit to be violated and using the module delays in the module library to determine which decisions will violate the delay constraint.
Allocation Freedom: Conventional behavioral synthesis algorithms require that the user specify a limit on the allocation of each hardware type. In contrast, we propose the use of a total chip area constraint to enable thorough exploration of the solution space for the appropriate allocation.

PREVIOUS WORK
Many scheduling algorithms which are either time-constrained, such as Force-Directed Scheduling [28], or area-constrained, such as List Scheduling [16], have been explored [5]. To our knowledge, no approach besides the ILP formulations ([6, 22]) performs synthesis under both an area and a delay constraint. Although promising execution time results have been shown, the ILP problem is NP-complete and remains intractable for challenging synthesis problems.
Various forms of the module selection problem have been explored previously. The problem of module set selection through resolution to a restricted library containing a single module type for each operator functionality prior to microarchitectural synthesis has been explored in [14, 15, 17]. This approach selects a single module type for each functionality which will be used in the allocation by generating an area/delay design curve for each possible subset of module types. The selection of a module set and a corresponding clock period has been studied in [3]. The selection of a single module type for each functionality negatively impacts the scheduling of flowgraphs that contain paths of varying criticality; it is desirable to perform critical path nodes on fast modules and non-critical path nodes on slow modules in this case. Selection of a single module type makes this tradeoff impossible.
A module selection algorithm has been proposed by Ramachandran and Gajski [30] which performs component selection in conjunction with scheduling and operator binding using a distribution graph model [28] to estimate the effect of each compound decision on the area and performance of the final design. Ramachandran and Gajski's work expands this model by computing the distribution of each module type in the module library. Ishikawa and De Micheli [13] propose a module selection algorithm which uses heuristics to select module types while meeting a latency constraint. A module selection algorithm for pipelined datapaths is proposed in [23] which uses a detailed module delay model, requiring increased CPU time.
The algorithm proposed in [8] performs allocation with module selection before scheduling, by using a hill climbing technique to explore the search space of different allocations. Since allocation is performed before scheduling, allocation decisions cannot make use of scheduling information to achieve improved results.
Scheduling of pipelined datapaths has been studied in several research projects such as [27, 18, 11, 7, 12]. Even though both area and performance are frequently critical due to the stringent throughput requirements of DSP applications, module selection has not been commonly incorporated.

SYSTEM OVERVIEW
The basic components of the algorithm are heuristic synthesis, time-and-area constrained synthesis, and area estimation. The heuristic synthesis component uses heuristic measures to choose scheduling, allocation, and module selection decisions to be included in the design. The time-and-area constrained synthesis component examines the design state after each heuristic decision and prunes away options that can be seen to lead to area or delay constraint violations. The area estimation component is used by the time-and-area component to determine which design decisions will lead to infeasible designs.
Heuristic decisions are made which limit the degrees of scheduling freedom of flowgraph nodes. Their effects are propagated throughout the design by the time-and-area constrained synthesis component. Such propagation may further force additional decisions automatically. Once such propagation is completely finished, a new allocation is predicted based on the resulting design state and new modules are added to the allocation if necessary. Decisions are made heuristically until all nodes have been committed to clock cycles and bound to modules. The accompanying figure is a flowchart showing the interaction between the time-and-area constrained synthesis and the heuristic synthesis.
One aspect of the heuristic component consists of allocation decisions. An allocation decision limits the freedom of allocated modules to be mapped to different module types. A novel aspect of the algorithm is that the allocation is composed of a number of flexible modules which are identified by the area estimation algorithm to perform the operations in the dataflow graph. A flexible module may be mapped to a set of feasible module types, rather than immediately being fixed to a single module type. This flexible representation allows the system to describe more accurately the information in the partial design state. When an allocation decision is made, the feasible module set of a flexible module is pruned.
While scheduling decisions determine the clock cycle at which a node will be executed, module selection decisions determine the flexible module which will perform the operation. Each node has a set of feasible clock cycles to which it may be scheduled. When a node is scheduled, this set is reduced to a single clock cycle. Each flowgraph node also has a feasible module set and can only be bound to a flexible module whose feasible module set shares some module types in common with the feasible module set of the node. This algorithm performs these two types of decisions in an intertwined fashion to allow all three tasks to benefit from partial design information during synthesis. The degree of intertwining is controlled by a user-defined parameter $\alpha_i$.
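The binding rule for flexible modules can be sketched with set intersection. This is a minimal illustration rather than the authors' implementation: the names `FlexibleModule`, `can_bind`, and `bind` are hypothetical, and narrowing both feasible sets to the shared types at binding time is an assumed policy.

```python
from dataclasses import dataclass, field


@dataclass
class FlexibleModule:
    """An allocated module that is not yet fixed to one library type."""
    feasible_types: set
    bound_nodes: list = field(default_factory=list)


def can_bind(node_types, module):
    """A node may be bound only if it shares a feasible type with the module."""
    return bool(node_types & module.feasible_types)


def bind(node_name, node_types, module):
    """Bind the node and narrow both sets to the shared types (assumed policy)."""
    shared = node_types & module.feasible_types
    if not shared:
        raise ValueError(f"{node_name} shares no module type with the module")
    module.feasible_types = shared
    module.bound_nodes.append(node_name)
    return shared


# Hypothetical example: an adder that may still become fast or medium.
adder = FlexibleModule(feasible_types={"fast", "med"})
node_types = bind("+1", {"med", "slow"}, adder)
```

After the call, both the node and the module are restricted to the single shared type, so a later node that can only use a slow adder can no longer be bound to this module.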
The time-and-area constrained synthesis component is used to prune design options which can be shown to result in constraint violations. This is achieved by considering area consumption in performance determination while control step assignment possibilities are considered in area determination.
The area estimation component generates an estimate of the area by predicting an allocation which is minimally sufficient to perform the flowgraph nodes given the current state of scheduling. The area estimate is used by time-and-area constrained synthesis to determine which design options lead only to infeasible designs, and the predicted allocation produced is compared to the current allocation to determine if new modules need to be allocated.
Hardware utilization can be improved by chaining operations, that is, allowing two or more operations to be performed serially within one clock cycle. This alleviates underutilization by allowing hardware to be used in time in a clock cycle that would otherwise be wasted. The earliest and latest times at which a node can be scheduled, $C_e$ and $C_l$ respectively, are kept as a clock cycle and a time displacement within the clock cycle to enable appropriate handling of chaining.
This paper will first describe how time-and-area constrained synthesis assures that both constraints are met. Subsequently the area estimation algorithm will be presented, followed by a description of the heuristic scheduling, module selection and allocation algorithm. Results demonstrating the effectiveness of the basic algorithm will be presented. Then synthesis of pipelined systems will be discussed, and results of pipelined synthesis with module selection will be presented.

SYNTHESIS
In order to satisfy both time and area constraints, it is necessary to determine which design decisions would lead only to infeasible designs. Once infeasible design options have been identified, they can be avoided by the heuristic synthesis component of the algorithm. A non-exhaustive search, using only information contained in the current state of the design, is sufficient to identify such infeasibilities. In order to determine that a design decision will violate the area constraint, it is necessary to generate an area estimate for the final design.
The scheduling and allocation possibilities are described by the feasible clock cycles of each node, the feasible module types of each node, and the feasible module types of each flexible module in the allocation. Propagation of the effects of each design decision may require that these sets be pruned for affected nodes and modules. Constraint Enforcement causes new design decisions to be made when they are necessary to guarantee that the area and delay constraints are not violated.
Propagation restricts the scheduling freedom of adjacent nodes. The scheduling freedom of a node is limited by restrictions imposed on the scheduling of adjacent nodes in the graph, and by the availability of hardware at different clock cycles. When a node's scheduling freedom is restricted, the $C_e$ and $C_l$ values of all adjacent nodes are recomputed. Any clock cycles which are no longer within the feasible scheduling range of a node are pruned. Computation of the $C_e$ and $C_l$ values considers the area as well as the performance constraint. In figure 2, nodes +X2, +X3, +X4, and +X5 all have $C_l$ equal to 4, the last clock cycle. Consideration of the delay constraints only would allow +X1 to be scheduled as late as clock cycle 4 also (assuming chaining). Yet consideration of the area constraint shows that at most two addition modules can be allocated in a clock cycle. Since only two addition operations can be performed in a clock cycle, it can be determined that operation +X1 can be performed no later than clock cycle 2.
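The area-aware bound in the figure 2 discussion can be reproduced with a small counting sketch. It assumes that +X1 is a predecessor of all four later additions and that every addition uses the same kind of adder; the figure itself is not reproduced here, so that structure, the function name, and its parameters are hypothetical.

```python
from math import ceil


def latest_cycle_for_predecessor(num_successors, successors_latest_cycle,
                                 modules_per_cycle, chaining=True):
    """Latest cycle for a node that must precede same-type successor nodes."""
    # Pack the successors as late as possible, at most
    # `modules_per_cycle` of them in any one clock cycle.
    cycles_needed = ceil(num_successors / modules_per_cycle)
    earliest_successor_cycle = successors_latest_cycle - cycles_needed + 1
    # Module slots left unused in the cycles occupied by the successors.
    free_slots = cycles_needed * modules_per_cycle - num_successors
    if chaining and free_slots > 0:
        # A spare module exists, so the predecessor could be chained
        # into the same cycle as the earliest successor.
        return earliest_successor_cycle
    return earliest_successor_cycle - 1


# Four successors due by cycle 4, two adders affordable per cycle.
bound_with_area_limit = latest_cycle_for_predecessor(4, 4, 2)
# With five adders the delay constraint alone would apply.
bound_without_area_limit = latest_cycle_for_predecessor(4, 4, 5)
```

With two adders the four successors fill cycles 3 and 4 completely, which pushes the predecessor back to cycle 2; with five adders the predecessor can be chained into cycle 4.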
Constraint enforcement limits scheduling and module selection options when area or delay limits are approached. When insufficient area is available to allocate a new module of a certain functionality, the already allocated modules must suffice for the scheduling of all remaining nodes. If the currently allocated hardware of a given functionality is fully utilized at a certain clock cycle, then no other flowgraph operations of that functionality may be committed to that clock cycle. This observation is propagated throughout the partial design state by pruning the feasible clock cycle sets of all appropriate nodes. Feasible module types of a flowgraph node will be restricted if they would result in infeasible schedules or a violation of the area constraint. If the feasible clock cycles of a node are restricted to 2 clock cycles, then a module type will be pruned from the node if its delay would result in a violation of the clock duration constraint at both adjacent clock cycles. This is determined by examining the time displacements within the clock cycle. When a module type has been pruned from all elements of the allocation, and remaining area is not sufficient to allocate a new module of that type, then that module type is pruned from all flowgraph nodes.
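The clock cycle pruning rule can be sketched as follows. This is a simplified illustration that assumes the area limit already rules out allocating another module of the functionality in question; the function name and the dictionary layout are hypothetical.

```python
def prune_full_cycles(feasible_cycles, committed, allocated):
    """Remove fully utilized clock cycles from the uncommitted nodes.

    feasible_cycles: dict node -> set of feasible clock cycles
    committed:       dict clock cycle -> nodes already committed there
    allocated:       number of allocated modules of this functionality
    """
    full = {c for c, used in committed.items() if used >= allocated}
    return {node: cycles - full for node, cycles in feasible_cycles.items()}


# Two adders are allocated; cycle 2 already hosts two additions.
pruned = prune_full_cycles(
    feasible_cycles={"+3": {1, 2, 3}, "+4": {2, 3}},
    committed={2: 2, 3: 1},
    allocated=2,
)
```

In this example cycle 2 is removed from both nodes, which leaves node +4 with a single feasible cycle and so effectively forces its schedule.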
Additionally, the feasible module types of each flexible module are examined during constraint enforcement to determine if pruning is necessary. Area estimation relies on an optimistic estimate of the area based on the slowest feasible module type of each allocated module. Consequently, if there is not enough area remaining to upgrade the module from its slowest to its fastest module type, then the fastest module type is infeasible.
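A minimal sketch of this upgrade check follows. It assumes that the slowest feasible type is also the smallest, and it applies the same test to every feasible type rather than only to the fastest one, which is a generalization made for the example. The area figures are invented.

```python
def prune_unaffordable_types(feasible_types, area_of, remaining_area):
    """Keep only the types the module can still be upgraded to.

    The area estimate already charges for the smallest feasible type, so
    only the extra area of an upgrade has to fit in `remaining_area`.
    """
    base = min(area_of[t] for t in feasible_types)  # assumed: slowest is smallest
    return {t for t in feasible_types if area_of[t] - base <= remaining_area}


# Hypothetical adder library (area in arbitrary units).
library_area = {"slow": 100, "med": 160, "fast": 250}
still_feasible = prune_unaffordable_types({"slow", "med", "fast"},
                                          library_area, remaining_area=80)
```

Here the upgrade from slow to fast would cost 150 units against 80 remaining, so the fast type is pruned while the medium type survives.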

AREA ESTIMATION ALGORITHM
The area estimation component of the algorithm estimates the area remaining by predicting an allocation that is minimally sufficient to perform the flowgraph nodes in the partial design state. The area estimate is used by time-and-area constrained synthesis to determine which design options lead to constraint violation, and to determine if new modules need to be added to the allocation. The estimate generated by the area estimation component is never an overestimate; this consistent characteristic of our design estimate ensures that feasible design options are not pruned.
The method of estimating the area is an application of the pigeonhole principle [2] to nodes confined to ranges of clock cycles. We will use the dataflow graph in figure 3 to demonstrate the use of the pigeonhole principle in predicting a minimum allocation. In figure 3, four nodes must be scheduled within three clock cycles; therefore, by the pigeonhole principle, at least two addition modules must be allocated. The pigeonhole principle can be analogously generalized to consider nodes which are confined to ranges of clock cycles, as well as module types.
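The counting argument for figure 3 reduces to a ceiling division. The helper below is only an illustration of that arithmetic, with a hypothetical name, and it assumes that each module performs at most one operation per clock cycle.

```python
from math import ceil


def min_modules(num_nodes, num_cycles):
    """Pigeonhole lower bound on modules for nodes confined to a cycle range."""
    return ceil(num_nodes / num_cycles)


figure3_bound = min_modules(num_nodes=4, num_cycles=3)
```

Because the result is a lower bound, an area estimate built on it cannot exceed the area that any feasible completion of the design actually needs.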
Our approach is illustrated in the dataflow graph of figure 4, wherein the three shaded nodes (+4, +5, +6) are already scheduled to clock cycles while the other three are free to be scheduled over all three clock cycles (assuming chaining). The unscheduled nodes are annotated with their feasible module types.
Clearly nodes +1, +2, and +3 must be scheduled within clock cycles 1, 2, and 3, and the total module availability over that range of clock cycles and over the feasible module range ({Fast, Med} modules) is three. The availability in the range is figured by counting the number of clock cycles at which each module whose type is a subset of the feasible module range is not utilized. Since there are two flexible modules in this range with three nodes already committed, the total remaining availability in this range is 3. Since there are also three unscheduled nodes in this range, the current allocation is sufficient to perform the schedule. An additional unscheduled node in the range would have resulted in a new module of type {Fast, Med} being introduced.
The algorithm performs optimistic estimates of area so that the partial design state is not unnecessarily constrained. To ensure that the area is an underestimate, the examination of ranges of module types is ordered by the smallest module type contained in the range. This causes smaller modules to be allocated first, and large modules to be allocated only if small modules are insufficient. In general, the availability is reduced to zero inside a range of clock cycles $C$ and module types $M$ if:

$$|MR_M \cap CR_C| = \sum_{c \in C} AV_{c,M} \qquad (1)$$

where $MR_M$ is the set of nodes whose feasible module types are subsets of $M$, $CR_C$ is the set of nodes which must be scheduled within the set of clock cycles $C$, and $AV_{c,M}$ is the number of modules whose feasible module types are a subset of $M$ and are not executing an operation in clock cycle $c$.
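The availability count for one range of clock cycles and module types can be sketched as below. The example mirrors the numbers quoted for figure 4 (two flexible modules, three committed nodes, three unscheduled nodes), but the assignment of committed nodes to particular modules and cycles is invented, as are the function names and data layout.

```python
def availability(modules, cycles, module_range):
    """Idle (module, clock cycle) slots inside the given range.

    modules: list of (feasible_types, busy_cycles) pairs
    """
    total = 0
    for feasible_types, busy_cycles in modules:
        if feasible_types <= module_range:  # module types are a subset of M
            total += sum(1 for c in cycles if c not in busy_cycles)
    return total


def nodes_confined(nodes, cycles, module_range):
    """Unscheduled nodes whose types lie in M and whose cycles lie in C."""
    return sum(1 for types, node_cycles in nodes
               if types <= module_range and node_cycles <= cycles)


C = {1, 2, 3}
M = {"fast", "med"}
# Assumed placement: one module busy in cycles 1 and 2, the other in cycle 3.
modules = [({"fast", "med"}, {1, 2}), ({"med"}, {3})]
unscheduled = [({"fast", "med"}, {1, 2, 3}),
               ({"med"}, {1, 2, 3}),
               ({"fast", "med"}, {1, 2, 3})]

avail = availability(modules, C, M)
demand = nodes_confined(unscheduled, C, M)
needs_new_module = demand > avail
```

With three idle slots and three confined nodes the allocation is exactly sufficient; one more confined node would make the demand exceed the availability and call for another module.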
A new module is added to the current allocation when inequality 2 is true for some range of clock cycles $C$ and module types $M$:

$$|MR_M \cap CR_C| > \sum_{c \in C} AV_{c,M} \qquad (2)$$

HEURISTICS
Once time-and-area constrained synthesis has pruned away all design decisions that are definitely infeasible, there remain a number of feasible design options. We propose a heuristic-based approach to choose a design option in a computationally efficient manner.
Heuristic decisions are of two types: a combined scheduling/module selection decision or an allocation decision. Each scheduling/module selection decision commits a node in the dataflow graph to be performed at a clock cycle, and to be performed by a particular flexible module. Each allocation decision refines the real allocation by pruning a feasible module type from a module.
The heuristic subsystem alternates between a scheduling/module selection phase and an allocation phase. We have observed that the order in which scheduling, module selection, and allocation are performed impacts the optimality of the design. Completion of one phase may limit the solution spaces of the subsequent phases in such a way that no feasible solutions which satisfy both area and delay constraints remain. This algorithm intertwines the tasks of scheduling, allocation, and module selection, so that each task can be guided by partial information from the others. Experimental results show that the degree to which scheduling and allocation decisions are intertwined can have a significant effect on the quality of the design.

Scheduling/Module Selection Decisions
The system schedules each node to a clock cycle, and binds each node to an allocated module. A node is committed to a clock cycle and bound to a flexible module simultaneously. First the node which is most ready to be committed is determined, and then the best clock cycle and module for node commitment are chosen. The node is selected based on the following criteria.
Module Type Similarity: A node which has many feasible module types in common with a module should be bound early. Binding a node to a dissimilar module would result in constraining the feasible module types of the node and/or the module.
Clock Cycle Similarity: A node which must be committed to a module which is unutilized at many of the node's feasible clock cycles should be committed prior to a node which has few clock cycles in common with any module. Binding a node to a module which is utilized during the feasible clock cycles of the node restricts the freedom of the node. Decisions which bind nodes to modules which are dissimilar in terms of available clock cycles are deferred, allowing the algorithm to take advantage of future module allocation that may be more compatible.
Scheduling Freedom: We define the scheduling freedom of a node to be the number of (clock cycle, module type) pairs to which it may be committed. A node which has less freedom should be committed early in the scheduling because the node can lose its scheduling freedom as a result of propagation and constraint enforcement effects of intervening heuristic decisions.
Path Scheduling Freedom: A node has less freedom if it is on a path of the dataflow graph which has little scheduling freedom. The criticality of a path is a measure of the degree of scheduling freedom of a chain of nodes rather than a single node. If the path scheduling freedom for a node is low then it is on a path whose completion time is large compared to the maximum time in which the path must complete to meet the delay constraint. The nodes on such critical paths should be scheduled early because they have less freedom.
The flexible module to which a node will be bound and the clock cycle at which a node will be performed are selected based on the criteria listed below.
Module Type Similarity: This is the same Module Type Similarity criterion used to select the node to be scheduled. A node is bound to a flexible module which has many feasible module types in common with the node.
Uniform Hardware Usage: By distributing nodes across the graph, this rule attempts to fully utilize hardware at all clock cycles. We define a heuristic measure of the collective scheduling freedoms of the predecessor and successor paths containing a node. By committing a node to the clock cycle which most evenly matches the scheduling freedoms of predecessor and successor paths of each node, node clustering into a few clock cycles, which would have resulted in underutilization, is avoided.
Hardware Availability: This rule attempts to fully utilize hardware by committing nodes to states where fewer nodes are likely to use the hardware. A probabilistic estimate of the number of nodes that will require hardware at a clock cycle is generated by adding the path scheduling freedom values of all nodes which may be committed to that clock.
Mathematical formulations of these heuristic measures may be found in [10].
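Of the criteria above, Scheduling Freedom is the simplest to make concrete, since it is defined as a count of (clock cycle, module type) pairs. The sketch below ranks nodes by that count alone; the full system combines it with the other criteria, whose formulations are given in [10] and are not reproduced here. All names and the example data are hypothetical.

```python
def scheduling_freedom(feasible_pairs):
    """Number of (clock cycle, module type) pairs still open to a node."""
    return len(feasible_pairs)


def most_constrained(nodes):
    """Pick the node with the least freedom, breaking ties by name."""
    return min(nodes, key=lambda name: (scheduling_freedom(nodes[name]), name))


# Hypothetical partial design state.
open_pairs = {
    "+1": {(1, "fast"), (1, "med"), (2, "fast")},
    "*1": {(1, "slow")},
    "+2": {(2, "med"), (3, "med")},
}
next_node = most_constrained(open_pairs)
```

Storing the pairs explicitly, rather than multiplying the number of cycles by the number of types, allows for combinations that constraint enforcement has already ruled out.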

Allocation Decisions
For each flexible module in the allocation, the system can heuristically decide to prune either the fastest or the slowest module type from its feasibility set. When the feasible module type of such a module is pruned, that type is also pruned from the feasible module type sets of all nodes bound to that module. For this reason, each node which is bound to the module must be examined to see if such pruning is acceptable.
The fastest module type can be eliminated from the feasible module type set if it is estimated that no node bound to the module will utilize this module type. A probabilistic determination is made as to whether the bound nodes must use this module type. The determination is based on the Path Scheduling Freedom of a node as previously described. If a node's paths have a large degree of scheduling freedom, then that node is less likely to require a fast module type. The slowest module type is pruned using a similar metric which measures the scheduling freedom of paths containing the node under the assumption that each node on the paths is performed by the slowest feasible module type. A fast module type is pruned only if none of the bound nodes require it, while a slow module type is pruned if there is a single bound node which cannot use it.
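The asymmetry between the two pruning rules amounts to an all-versus-any test over the bound nodes. In the sketch below the probabilistic determination for each node is abstracted into a boolean supplied by the caller, which is an assumption; the function names are hypothetical.

```python
def may_prune_fastest(requires_fastest):
    """Prune the fastest type only if no bound node requires it."""
    return not any(requires_fastest)


def must_prune_slowest(can_use_slowest):
    """Prune the slowest type as soon as one bound node cannot use it."""
    return not all(can_use_slowest)


# Three nodes bound to one flexible module (hypothetical per-node verdicts).
prune_fast = may_prune_fastest([False, False, False])
keep_fast = not may_prune_fastest([False, True, False])
prune_slow = must_prune_slowest([True, False, True])
```

A single node that needs the fast type is enough to keep it, and a single node that cannot tolerate the slow type is enough to remove it.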

Intertwining Threshold
The user provides an input parameter, $\alpha_i$, which determines how much scheduling information is needed before an allocation decision can be made. This parameter is used to control the degree of intertwining of the scheduling/module selection decisions and the allocation decisions. When deciding whether or not to prune the feasible module types of a flexible module, a weighted average of the scheduling freedoms of the paths containing the bound nodes is compared to $\alpha_i$, and pruning is performed if the weighted average is greater than $\alpha_i$. Low values of $\alpha_i$ cause allocation decisions to be eager, which provides early direction to the scheduling, while high values cause scheduling decisions to be eager, giving early direction to allocation.
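The threshold comparison can be sketched as follows. The source does not give the weights or the normalization of the path scheduling freedoms, so both are assumptions here: freedoms are taken to lie between 0 and 1, and the weights are arbitrary positive numbers. The function name is hypothetical.

```python
def should_prune(path_freedoms, weights, threshold):
    """Prune when the weighted average path freedom exceeds the threshold."""
    weighted_sum = sum(f * w for f, w in zip(path_freedoms, weights))
    weighted_avg = weighted_sum / sum(weights)
    return weighted_avg > threshold


# Hypothetical freedoms of the paths through three bound nodes.
freedoms = [0.2, 0.6, 0.7]
weights = [1, 1, 2]

eager_allocation = should_prune(freedoms, weights, threshold=0.33)
deferred_allocation = should_prune(freedoms, weights, threshold=0.6)
```

The weighted average in this example is about 0.55, so a low threshold triggers the allocation decision immediately while a higher one defers it until scheduling has reduced the freedoms further.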

SYNTHESIS RESULTS
We have conducted a set of experiments to test the ability of the heuristics to navigate, and of the time-and-area constrained synthesis to prune, the search space. The first example in figure 5 demonstrates that the effects of the time and area constraints are successfully enforced. In this example, time-and-area constrained synthesis pruned all decisions that could be deduced infeasible from the initial constraints. Since almost all decisions were automatically pruned as a result of constraint enforcement, all scheduling, module selection, and allocation decisions were completed except for the limited freedom of nodes +X2 and +X14. In this example, the constraint enforcement part of the system automatically assigned all addition operations to medium speed adders, and all multiplication operations to slow multipliers.
In another set of experiments, we studied the ability of the heuristics to guide the search through the design space under tight constraints. We scheduled the differential equation example [28], the AR-filter [27] and the FIR-filter [27] flowgraphs with constraints and results shown in figures 6, 7, and 8, respectively. Under the given constraints, the solutions identified by the algorithm are the only feasible solutions. The solutions use a rich set of modules and would not have been feasible under the single module type assumption.
To demonstrate that module selection produces designs which utilize area and delay resources more efficiently than designs generated using a single module type for each operator, we compared the results of our system to results generated by the HAL [29] algorithm on the AR-filter dataflow graph. The resulting area-time curves are shown in figure 9. A clock cycle duration of 250 ns was used for this experiment.
The HAL algorithm considers only one module of each functionality, so we supplied it with each pair of adder and multiplier modules in the library shown in figure 10. Our system, which uses the full library, produces a better area/delay curve than HAL in almost every case.
In order to investigate the effect of changing the degree of intertwining of scheduling and allocation decisions, we performed scheduling on the FIR-filter example with different degrees of intertwining by changing the value of $\alpha_i$. The results are shown in figure 11. The area of each result is marked on the graph, and is annotated with the allocation of modules corresponding to that result. In these results, the $\alpha_i$ parameter ranges from 0 to 1. Low values of $\alpha_i$ cause allocation decisions to be eager, which provides early direction to the scheduling. Eager allocation may cause allocation decisions to be made prematurely, as occurred with the $\alpha_i = 0.33$ result shown. High values of $\alpha_i$ cause allocation decisions to be delayed until more information is available about the schedule. The scheduling decisions are made with less allocation information and therefore have a better chance of resulting in a suboptimal design. This occurred with the $\alpha_i = 0.5$ and $\alpha_i = 0.55$ results.
To observe how efficiently the algorithm uses area under different clock cycle constraints, we performed scheduling on the AR filter example with different clock cycle limits. The results are shown in figure 12. Using 8 clock cycles, it is possible to schedule the graph without any chaining, so only slow modules were used. As the number of clock cycles decreases, the number of fast modules needed to schedule the graph increases.
In another experiment, we scheduled the FIR-filter example with different clock durations and clock cy- cle limits, holding the total completion time constant at 180 ns.We define the total completion time to be the real time that is required to perform the entire algorithm, and we compute the real time as the product of the number of clock cycles and the duration of each clock cycle.The results are shown in figure 13.To demonstrate the speed of the algorithm, we per- formed synthesis on the differential equation (DE), FIR-filter (FIR), and the AR-filter with different area and delay constraints.All experiments were per- formed on a Sun-4 SPARC-based CPU 20MHz ma- chine with a Weitek 3170-based floating point unit.The average run times are shown in figure 14.Comparison to other algorithms is difficult since few other algorithms perform all of the tasks that this algorithm does, and published execution time results are lim- ited.We have found a comparable execution time re- sult by the HAL algorithm [28] which performs only a scheduling of the FIR-filter example with similar pa- rameters in 30 seconds.These results show that de- spite simultaneous performance of scheduling, allo- cation, operator binding, and module selection, effec- tive heuristics and tight constraint enforcement produce computationally effective solutions.
In the constant-completion-time experiment of figure 13, a larger clock duration with a smaller number of clock cycles usually requires more area, because more operations must be performed in the same clock cycle. This is not always the case, as evidenced by the decrease in area when the clock duration was increased from 20 ns to 30 ns. The area may decrease as the clock duration increases because certain clock durations are more amenable to chaining.

FIGURE 9 Comparison to HAL using limited libraries.

9. CONSIDERATIONS FOR PIPELINED SYNTHESIS
Dependencies may exist between instances of operations in different iterations of the loop. When an inter-iteration dependency exists, the earliest ($C_e$) and latest feasible clock cycle values of the successor node must be computed differently. Clearly, a node must be scheduled after all of its predecessor nodes in the dataflow graph, so $C_e(n) \geq C_e(m)$, where node m is a predecessor of node n. If the predecessor is reached across an inter-iteration dependency which represents a dependence across i iterations, then the inequality must be changed to $C_e(n) \geq C_e(m) - (i \cdot L)$, where L is the latency. This change reflects the fact that the predecessor node is performed in an instantiation of the loop which was initiated $(i \cdot L)$ clock cycles before the instantiation in which the successor node is performed.
Special considerations must be made when calculating hardware usage at a clock cycle in a pipelined system. Instances of two nodes which are scheduled to clock cycles $c_i$ and $c_j$, respectively, will be executed simultaneously if $c_i \bmod L = c_j \bmod L$. To account for this effect on hardware usage, we consider the hardware at a clock cycle c to be fully utilized when the number of nodes committed to all equivalent clock cycles $c_i$, such that $c_i \bmod L = c \bmod L$, equals the total allocated hardware. We have extended the basic area estimation algorithm to accommodate pipelined designs. When determining the number of nodes which must be committed to a range of clock cycles, all nodes are included which must be committed to the range of clock cycles being considered, or to any clock cycles which are equivalent modulo the latency.
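The two pipelining rules of this section can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names are hypothetical, all nodes are assumed to be single-cycle operations of one module type, and the inter-iteration bound is taken to be C_e(n) >= C_e(m) - i * L, which is an assumption consistent with the explanation that the predecessor's loop instantiation started i * L clock cycles earlier.

```python
from collections import Counter


def modulo_usage(schedule, latency):
    """Count committed nodes per class of clock cycles that are
    equivalent modulo the latency. `schedule` maps node -> clock cycle."""
    return Counter(cycle % latency for cycle in schedule.values())


def fully_utilized(schedule, latency, cycle, allocated_units):
    """True when the nodes committed to all cycles equivalent to `cycle`
    (modulo the latency) equal the allocated hardware."""
    return modulo_usage(schedule, latency)[cycle % latency] == allocated_units


def earliest_bound(pred_earliest, distance, latency):
    """Lower bound that predecessor m imposes on C_e(n).
    distance = 0 is an ordinary edge; distance = i is a dependence
    across i iterations (assumed form: C_e(m) - i * L)."""
    return pred_earliest - distance * latency


# Hypothetical schedule of four nodes with latency L = 3.
schedule = {"m1": 0, "m2": 1, "m3": 3, "m4": 4}
print(fully_utilized(schedule, 3, cycle=3, allocated_units=2))  # True
print(earliest_bound(pred_earliest=5, distance=1, latency=3))   # 2
```

In the example, nodes m1 and m3 occupy cycles 0 and 3, which are equivalent modulo 3, so two units are busy in that cycle class even though no single cycle of one iteration holds two nodes.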

10. PIPELINED SYNTHESIS RESULTS
We have performed the following experiments to test that the area and delay resources are used effectively in the scheduling of pipelined systems.
To test the ability of the system to perform under tight constraints, we scheduled the differential equation example presented in [29], containing its original inter-iteration dependencies. The resulting scheduling and module selection are shown in figure 15. The edges representing inter-iteration dependencies are shown in bold. All of the inter-iteration dependencies represent dependencies between successive iterations. Under these tight constraints, the system discovered a minimum area solution which fully utilizes all adders and multipliers. In the given example, feasible scheduling and module selection possibilities are few. Time-and-area constrained synthesis pruned away all infeasible decisions early in the design process, guiding the heuristics by leaving few scheduling options. The solution uses different module types and would not have been feasible under the single module type assumption.
We have also scheduled the fifth-order elliptic filter dataflow graph with the original inter-iteration dependencies first presented in [20]. The results and constraints of scheduling are shown in figure 16. The prescribed clock cycle limit was 14, but the resulting schedule used 10 clock cycles in order to meet the given latency constraint of 10 clock cycles. The area used by the design is the minimum area possible with the given constraints.
11. DISCUSSION SECTION
We have proposed an algorithm which generates a scheduling, allocation, and operator binding of a dataflow graph G using modules from a library provided by the user. Synthesis is performed within a chip area constraint and timing constraints which include the clock duration and the maximum number of clock cycles. [...] have multiple functionality (e.g., an ALU) as long as there is only one module type which can perform each dataflow graph operation.
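The inputs described above can be pictured with a minimal sketch: a user-provided library in which each operation has several implementations with distinct area and delay, and a check of an allocation against a total area limit. All names, module types, and numeric values are hypothetical and not taken from the paper. Only functional unit area is counted, and filtering candidates by a single clock period is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    operation: str  # dataflow graph operation the module performs
    area: int       # arbitrary area units (hypothetical values)
    delay_ns: int   # propagation delay (hypothetical values)


LIBRARY = [
    Module("add_slow", "+", area=10, delay_ns=30),
    Module("add_fast", "+", area=25, delay_ns=10),
    Module("mul_slow", "*", area=60, delay_ns=60),
    Module("mul_fast", "*", area=120, delay_ns=25),
]


def candidates(operation, clock_ns):
    """Modules implementing `operation` whose delay fits in one clock
    period (single-cycle operations assumed for simplicity)."""
    return [m for m in LIBRARY
            if m.operation == operation and m.delay_ns <= clock_ns]


def allocation_area(allocation):
    """Total functional unit area of an allocation {module name: count}."""
    by_name = {m.name: m for m in LIBRARY}
    return sum(by_name[name].area * count
               for name, count in allocation.items())


def within_area(allocation, area_limit):
    """True when the allocation respects the total area limit."""
    return allocation_area(allocation) <= area_limit


allocation = {"add_slow": 2, "mul_fast": 1}
print(allocation_area(allocation))   # 140
print(within_area(allocation, 150))  # True
```

Expressing the constraint as one total area limit, rather than a fixed count per module type, is what lets different mixes of slow and fast modules compete for the same budget.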
The area estimate is generated after each design decision of this system, so it is important that estimates be obtained in a computationally efficient manner to allow the system to explore tradeoffs quickly. For this reason, the area estimate includes only the functional unit area. As promising new work in layout estimation progresses ([33], [21]), new algorithms will allow a more detailed area estimate while still meeting satisfactory execution time constraints.
The approach used by this system to satisfy area and delay constraints is flexible and may be extended to perform tradeoffs between other conflicting constraints. For instance, a module library could additionally capture power information. Approximation of power constraint satisfaction can be easily achieved if appropriate modeling of cumulative aspects of power is incorporated.

FIGURE 16 Fifth-order elliptic filter results.

12. CONCLUSIONS
In this paper we have presented an algorithm which integrates module selection into high-level synthesis of pipelined and non-pipelined designs. Furthermore, we have illustrated a heuristic algorithm which intertwines module selection and scheduling decisions. The experimental results show that modules are selected appropriately and minimum area designs are achieved. This system successfully performs time-and-area constrained scheduling, even under tight constraints when very few solutions are possible. This success is due both to the heuristics which guide the search through the design space toward feasible designs, and to the time-and-area constrained synthesis approach which guides the search by pruning infeasible design options. Performing module selection allows this system to find solutions under tight constraints when no solution would exist under the single module assumption.
[...]
b. Simultaneously satisfies multiple constraints such as performance and area.
c. Helps bridge the gap between high-level synthesis and physical design through more accurate representation of library components.

FIGURE 1 System flowchart.

FIGURE 2 Earliest and latest feasible clock cycle computation.

FIGURE 3 Four nodes confined to three clock cycles.

FIGURE 4

FIGURE 5 Fifth order elliptic filter benchmark.

FIGURE 6 Differential equation results.

FIGURE 15 Differential equation results.