A Timing-Driven Partitioning System for Multiple FPGAs

Field-programmable systems with multiple FPGAs on a PCB or an MCM are being used by system designers when a single FPGA is not sufficient. We address the problem of partitioning a large technology-mapped FPGA circuit onto multiple FPGA devices of a specific target technology. The physical characteristics of the multiple FPGA system (MFS) pose additional constraints on the circuit partitioning algorithms: the capacity of each FPGA, the timing constraints, the number of I/Os per FPGA, and the pre-designed interconnection patterns of each FPGA and the package. Existing partitioning techniques which minimize only the cut sizes of partitions fail to satisfy these challenges. We therefore present a timing-driven N-way partitioning algorithm based on simulated annealing for technology-mapped FPGA circuits. The signal path delays are estimated during partitioning using a timing model specific to a multiple FPGA architecture. The model combines all possible delay factors in a system with multiple FPGA chips of a target technology. Furthermore, we have incorporated a new dynamic net-weighting scheme to minimize the number of pin-outs for each chip. Finally, we have developed a graph-based global router for pin assignment which can handle the pre-routed connections of our MFS structure. In order to reduce the time spent in the simulated annealing phase of the partitioner, clusters of circuit components are identified by a new linear-time bottom-up clustering algorithm. The annealing-based N-way partitioner executes four times faster using the clusters, as opposed to a flat netlist, with improved partitioning results. For several industrial circuits, our approach outperforms the recursive min-cut bi-partitioning algorithm by 35% in terms of nets cut. Our approach also outperforms an industrial FPGA partitioner by 73% on average in terms of unroutable nets.
Using the performance optimization capabilities of our approach, we have successfully partitioned the MCNC benchmarks, satisfying the critical path constraints and achieving a significant reduction in the longest path delay. An average reduction of 17% in the longest path delay was achieved at a cost of 5% in total wire length.


INTRODUCTION
Field Programmable Gate Arrays (FPGAs) are becoming a mainstream technology in board, system and application-specific integrated circuit (ASIC) design processes. However, design complexity will continue to increase more rapidly than the availability of larger and faster devices. System-level ASIC designers are turning to FPGAs for design verification to take advantage of their low cost and fast prototyping.
Large and complex designs can require multiple iterations in order to achieve a successful design implementation. If the automatic design tools cannot provide a feasible solution, the designer is forced to obtain expert-level architectural knowledge to support the manual intervention required to complete the design. Current FPGA architectures can handle a maximum of only 6000 to 9000 gates, compared to ASIC devices which offer hundreds of thousands. Designers utilize multiple FPGAs when a single FPGA is not sufficient for a design implementation.
Multiple re-programmable FPGAs have been configured on multichip modules (Figure 1) and on boards (Figure 2). A large design can be implemented on a PCB [1] using multiple chips of a particular target FPGA architecture such as Xilinx [2], Actel, or Altera. New field-programmable architectures using multiple FPGAs on multichip modules have emerged to meet the capacity requirements of prototyping large designs. A multiple FPGA system (MFS) can be modeled as a collection of FPGA chips configured on a single board or a package to realize a design.
In order to effectively use MFSs and benefit from shorter time-to-market, users require an automatic method to partition a large design among multiple FPGAs. This process can be viewed either as a divide-and-conquer step to speed the placement and routing phases or as a conventional top-down design process. Each chip in this N-(multiple) chip combination is considered a partition. Any decision made early in the design process will affect the performance of the subsequent design tools. The quality of the partitioning results will influence several aspects of the design implementation: 1) Capacity: The target FPGA architecture used in the MFS has a maximum gate capacity. However, the amount of logic in each chip is limited by the utilization levels that can be handled by the placement and routing tools. Thus, the feasible utilization levels are always some fraction of the maximum gate capacity. The partitioner must ensure that each chip contains a feasibly implementable amount of logic.
2) Congestion in inter-chip communication: The partitioner must be able to minimize the amount of inter-chip communication. The signals external to individual chips must be routed using the limited number of inter-chip connections to produce a feasible partitioning solution. The fixed inter-chip connections of the packaging or the board design restrict the flexibility of routing any signals which are external to a chip. This process can be viewed as pin assignment at the chip level. Any overflow generated during pin assignment will lead to a design which is not implementable.
3) Delay introduced by intra-chip and inter-chip communication: High utilization of logic in individual chips causes congestion in intra-chip routing. This leads to longer paths, and thus longer delays, for signals internal to the chips. Also, the signals which cross one or more chip boundaries in the MFS accumulate a substantial amount of delay associated with the I/O buffers and the inter-chip wire. Depending on the application, this delay can range from high, as in the case of PCBs, to moderate, as in MCM-based systems. The system cycle time will be determined by the length of the longest path from a primary input to a primary output of the entire MFS. The partitioner must satisfy the timing specifications for the MFS.
MFS partitioning over multiple chips can be performed before technology mapping onto the target FPGA or after technology mapping. If the partitioner manipulates the gate-level netlist (as in [3]) before technology mapping, estimation of chip utilization and routability is difficult without the exact count of the target technology logic blocks. These estimations are therefore made conservatively to ensure successful execution of the place and route tools. At this level, there is no information regarding the delay of components and interconnects. Thus, the partitioner will not be able to exercise any timing-driven capabilities for an unmapped circuit. This is a major limitation, because timing problems and achieving minimum system delay are very important for the large, complex designs which would typically be partitioned over multiple FPGAs.
After technology mapping, the partitioner can take into account target FPGA technology specific details such as the total logic block count, the routing resources associated with each group of logic blocks assigned to each partition, and the actual timing information during the partitioning process. Hence, the partitioning process should follow the technology mapping stage.
Thus the constraints of the MFS partitioning problem are: 1) the set of FPGA chips with their locations, dimensions and maximum capacities; 2) the configurations of the chip-level I/O frames for inter-chip signals; 3) the configurations of the MFS package-level I/O slots for the system I/O signals. In addition, the following design constraints need to be satisfied during MFS partitioning: 4) the timing constraints of the system being implemented; 5) additional user constraints, such as the utilization levels within the chips and the preplaced logic which must remain within a particular chip.
The ASIC and board designs which will typically be implemented in an MFS are large and complex. Therefore, it is important to focus on the speed of the partitioning system. A fast partitioner will be compatible with the rest of the design flow of rapid prototyping, one of the major advantages of the MFS.

PREVIOUS WORK
The previous work in partitioning can be classified into three categories. Netlist partitioning includes classical partitioning of a netlist into sub-netlists with the goal of minimizing the communication between the sub-netlists. No physical assignment of these sub-netlists is involved during partitioning, i.e., not in the floor-planning sense. Several improved versions of the classical mincut algorithm by Kernighan-Lin [4], with enhancements by Fiduccia and Mattheyses [5], have been reported [6,7,8,9,10]. Rectilinear partitioning is sometimes viewed as floorplanning. It consists of physical partitioning of the rectilinear regions on a chip in conjunction with netlist partitioning to partition the circuit for its placement into these regions. Hierarchical placement techniques [14,15] for physical design are based on rectilinear partitioning. Multichip partitioning is a relatively new area of research. This problem is the basic rectilinear partitioning problem with additional constraints imposed by a multichip system. Timing-driven system partitioning approaches were reported [16] based on a TCM (Thermal Conduction Module) MCM technology.
These approaches were based on extended N-way Kernighan-Lin partitioning with a cost function including both capacity and timing constraints. The problem of pin assignment was not handled by this approach. The Anyboard rapid prototyping system [2] was created at NCSU for the development and rapid implementation of digital hardware designs. This system consists of the Anyboard PC card, which contains multiple Xilinx FPGAs.
Several commercial vendors have developed tools for partitioning FPGAs. The Prism software tool [17] from NeoCAD provides an environment to perform timing-driven partitioning over multiple FPGAs.
InCA has an FPGA partitioner named Concept Silicon [18] which partitions an FPGA or PLD netlist onto multiple FPGAs. Quickturn's RPM emulation system [19] creates a hardware prototype from an ASIC or full-custom chip netlist. A hierarchical partitioner is used to partition the design over as many FPGAs as necessary.

MOTIVATION FOR A NEW PARTITIONING ALGORITHM
The partitioning of the MFS precedes the subsequent place and route stage of the single FPGAs. The function of the partitioning system in such a top-down flow is to serve as a global placement or floor-planning stage in order to generate a good hierarchical solution. Each partition, or chip, becomes an independent layout problem after partitioning. A partition has a relative position with respect to other partitions. The components assigned to a partition thus acquire a global placement with respect to other components after partitioning. The global connectivity and the relative placement of the components bring in a sense of global distance between components.
The MFS thus needs a rectilinear partitioning approach. The sub-netlists obtained after partitioning need to be physically assigned to the FPGAs. Minimizing the total wire length between components while partitioning will effectively minimize congestion in addition to reducing the global lengths of nets between partitions. The routability of the global connections due to I/O signals and the inter-partition signals can be maximized by minimizing the total wire length between partitions and system I/Os. The signal path timing constraints of the system need to be satisfied. The timing violations need to be computed based on the positions of the components in the MFS.
Multi-way netlist partitioning strategies, such as recursive mincut bipartitioning or enhanced mincut for multiple partitions, only minimize the nets cut between partitions and have no notion of distance between partitions. These methods thus cannot be deployed successfully for this problem, since issues such as total wire length and the length of critical signal paths cannot be controlled. Minimization of the number of pinouts on a partition is just one of several important objectives of an MFS partitioning system. Furthermore, in this paper we will show that N-way partitioning by recursive application of mincut bi-partitioning yields inferior results compared to our new approach.
The basis for our partitioning algorithm is simulated annealing, since it is easy to incorporate signal path timing constraints [23] into the partitioning problem. Initially, this might seem surprising, since the computation time for partitioning a flat netlist using simulated annealing could be similar to that of placing the flat netlist, which is precisely what we seek to avoid by using partitioning. In fact, the partitioning can proceed at least an order of magnitude faster than placement. There are several reasons for this. First, in placement a given component can reside at the position of any other component. In other words, a component can be placed in O(n) positions, where n is the number of components. However, in partitioning, the number of possible positions for a given component is on the order of the number of partitions, and this is essentially a constant, independent of the size of the netlist. Hence, in some sense the complexity of the state space is one order lower. Second, in placement it is necessary to attain the absolute minimum total wire length, since our research has shown that an additional percent or two decrease in wire length tends to yield an additional percent or two of area savings due to a corresponding reduction in congestion. However, in partitioning, it is not necessary to achieve the absolute minimum total wire length as long as the timing requirements and the pin-out constraints are satisfied. Third, we present a new cell clustering algorithm in section 6 which greatly reduces the number of movable objects (clusters), the number of nets (connections) and the complexity of the nets (the number of clusters per net).
Our MFS partitioning system consists of three main phases, as shown in Figure 3. We first reduce the netlist using a new netlist clustering algorithm. The clustered netlist is then partitioned using a simulated-annealing-based partitioning algorithm. Finally, chip-level pin assignment is performed using a graph-based global router.
In section 4, we describe the simulated-annealing based partitioning algorithm. The pin assignment stage is described in section 5. The key issues which affect the performance of the simulated annealing algorithm in phase 2 strongly influenced the design of our new clustering algorithm, which is used in the first phase. We thus present the clustering algorithm in section 6. Section 7 is devoted to the results, and the conclusion is the subject of section 8.

PHASE 2: SIMULATED ANNEALING BASED N-WAY PARTITIONING
The MFS partitioning is achieved by a simulated-annealing-based algorithm. This algorithm is a timing-driven N-way partitioner which understands the notion of distance between the various partitions. It uses a simplified cost function consisting of the total weighted wire length and a penalty for timing violations. The cost function does not include any penalty term involving the capacity of partitions, analogous to the overlap penalty in [7]. Instead, during new state generation, the partitioner only picks moves which are feasible in terms of a pre-defined target utilization. This allows the algorithm to condense the search space of new states and thus improve run time. The system-level pin assignment, or pad placement, must be performed with respect to the direct I/O connectivity of the components and thus must be performed simultaneously with partitioning. The new state generation function picks moves which involve both circuit components and I/O pads, in order to accomplish pad placement at the same time as the MFS partitioning. The annealing schedule used is the statistically derived schedule proposed by Lam [22].
Each FPGA chip is considered a partition, and the chip edges are the cut lines for partitioning. The number of cut lines depends on the number, size and boundaries of the partitions and the size of the MFS. The rectangular regions formed by the intersection of these cut lines are defined as bins. If the cut lines do not produce a satisfactory grid, and thus a satisfactory number of bins, additional cut lines are placed automatically to obtain a finer grid. The purpose of dividing the core into bins is to make wire length calculations more accurate. During the partitioning process, a component to be partitioned moves from bin to bin. Its location at any instant is taken to be the center of the bin to which it currently belongs. The finer the grid the cut lines produce, the higher the number of bins they generate. A large number of bins makes the wire length calculations more precise, especially with respect to timing. However, with a large number of bins, the search space for component moves is large. This makes annealing more expensive in terms of CPU time. An effective trade-off between wire length accuracy and CPU time was obtained by using a number of bins on the order of 20, or the total number of partitions, whichever is greater. Figure 4 shows an example of bin configurations for an MFS with four FPGAs; the I/O slots shown are for the system-level (top-level) pad placement. Note that the chip-level I/O slots (for second-level pads) are not shown.
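The bin-grid trade-off above can be sketched in a few lines. The grid-selection rule and the function names below are our illustration, not the paper's code; we simply pick a near-square grid with at least max(20, number of partitions) bins and place each component at the center of its bin, as described in the text.

```python
import math

def make_bin_grid(n_partitions, min_bins=20):
    """Choose a roughly square grid with at least max(min_bins, n_partitions) bins."""
    target = max(min_bins, n_partitions)
    rows = math.isqrt(target)              # near-square: rows ~ sqrt(target)
    cols = math.ceil(target / rows)
    return rows, cols

def bin_center(row, col, bin_w, bin_h):
    """A component's location is taken to be the center of its current bin."""
    return ((col + 0.5) * bin_w, (row + 0.5) * bin_h)
```

For the four-FPGA example of Figure 4, this rule yields a 4 x 5 grid of 20 bins.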

Cost Function
The partitioner cost function C consists of two terms, as shown in (1). The first term is the total weighted wire length, represented by W. The second term is the timing penalty function, represented by Pt:

C = W + at * Pt (1)

where at weights the timing penalty against the wire length.

Total Weighted Wire Length W

At the end of partitioning, it is desirable to obtain the lowest possible number of pin-outs for each chip, since there is a limited number of chip-level I/O slots. If each net had the same weight, minimizing the total wire length would not generally minimize the number of pin-outs. We therefore introduce a new dynamic net-weighting scheme which serves to minimize the number of pin-outs. In our scheme, nets which traverse two adjacent bins that lie in two different chips (e.g., N2 in Figure 5) must be penalized more than nets which traverse two bins in a single chip (e.g., N1). Since one of the major hard constraints is the number of I/Os available per chip, we formulated a net-weighting scheme which is guided by the number of I/Os a signal needs if it traverses more than one chip. The nets which are restricted to one chip, the single-chip nets, do not need any I/Os and thus have a weight equal to 1.

FIGURE 5 Comparing two nets of the same length but different weights.
Figures 6a and 6b show the conditions when a net traverses two adjacent chips and two diagonal chips, respectively. Both nets need at least two I/Os to make the connection between the two chips. Figure 6c shows a net which traverses three chips and needs at least four I/Os. Figure 6d shows a net which traverses four chips and needs at least six I/Os.
For each net topology encountered in a given MFS, integral weights are assigned depending on the number of I/Os the net needs for inter-chip connections, IOn. A constant, K, increases the difference in weight from the single-chip nets. The weight of a net n is wn = 2*IOn + K; we use K = 2. Figure 6 illustrates the net-weighting scheme. The sum of the half-perimeters of the nets, each weighted by its dynamic net weight, is the total weighted wire length for a particular configuration of cells, as in (2). At any point during the annealing, the new weight of a moved net is re-evaluated and used to compute the weighted wire length. W is given by:

W = Σn (Sx(n) + Sy(n)) * wn, (2)

where Sx(n) and Sy(n) are the width and height of the minimum bounding rectangle of net n, respectively, and wn is the weight of net n. In order to minimize the CPU time necessary to update W for large nets, we use an incremental net-span updating scheme. The incremental scheme devised for the partitioner takes advantage of the gridded nature of the bin structure and thus is simpler and faster than previously reported methods [20,23]. Since detailed placement will follow this partitioning stage, the clusters of components are assigned to the centers of the bins, and the pins of a component are also taken to be at the center of the component. Unlike placement, where the exact location of a pin is needed for precise wire length calculations, the partitioner only needs to store, for each net, the number of pins of the net at each grid line. Thus, updating the pin configuration of a net after a move is easy. The global scale of the grid lines is stored in a lookup table. The x and y spans of a net are calculated using the maximum and minimum of the active grid lines of the net (i.e., grid lines which contain one or more pins of the net).
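As a concrete illustration, the dynamic net weight and the weighted half-perimeter wire length (2) can be sketched as follows. The function names and the per-grid-line data layout are ours; the weight formula wn = 2*IOn + K with K = 2, and weight 1 for single-chip nets, follows the text.

```python
K = 2  # constant added to multi-chip net weights; single-chip nets get weight 1

def net_weight(io_n):
    """Weight of a net needing io_n inter-chip I/Os (w_n = 2*IO_n + K)."""
    return 1 if io_n == 0 else 2 * io_n + K

def weighted_wirelength(nets):
    """Total weighted wire length W as in (2).

    nets: list of (x_counts, y_counts, io_n), where x_counts[i] is the number
    of the net's pins on vertical grid line i (likewise y_counts).  The span
    is taken between the extreme active grid lines, so a move only needs to
    update two counters before W is re-evaluated.
    """
    total = 0
    for x_counts, y_counts, io_n in nets:
        xs = [i for i, c in enumerate(x_counts) if c > 0]
        ys = [i for i, c in enumerate(y_counts) if c > 0]
        span_x = xs[-1] - xs[0]   # Sx(n)
        span_y = ys[-1] - ys[0]   # Sy(n)
        total += (span_x + span_y) * net_weight(io_n)
    return total
```

A single-chip net spanning 3 vertical and 1 horizontal grid lines contributes 4; the same net crossing into an adjacent chip (IOn = 2, weight 6) contributes 24.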

Timing Penalty
The timing penalty in the cost function is calculated based on the slacks generated in the critical paths of the circuit by partitioning. A critical path may consist of several nets. The timing penalty is minimized dynamically during partitioning. In this section, we will first describe the propagation delay model we have designed for a timing path over multiple FPGAs. Based on this model, we will define the timing penalty.
The total delay on a path p over multiple FPGAs is the sum of the delay generated in the configurable logic blocks (CLBs) inside each chip, TL(p), and the total interconnect delay, TR(p).

Tpd(p) = TL(p) + TR(p) (3)

TR(p) is the sum of all the constituent net routing delays, TR(n), due to the intra-chip and inter-chip connections of the nets:

TR(p) = Σ over n in p of TR(n) (4)

Logic Delay: The total logic delay of a path p is:

TL(p) = ND * TCLB, (5)

where ND is the number of logic levels, or the depth of the particular critical path, and is available from the logic synthesis stage of the circuit. TCLB is the intrinsic delay of the CLB. For a given technology and CLB design of an FPGA, TCLB is constant and independent of the configuration and the number of inputs or outputs. (According to SPICE simulation results reported by Singh et al. in [25], the typical worst-case delay of a 4-input lookup-table based logic block in a 1.2 μm CMOS process is about 1.7 ns.) Rose et al. evaluated and compared the performance of Altera, Actel and Xilinx FPGAs in [26]. We will follow their assumption that for 1 μm ≤ λ ≤ 1.75 μm, the gate delay is approximately proportional to λ. Given the TCLB of a CLB for a particular λ, we accordingly scale TCLB for the same CLB design at a different λ in that range.
Routing Delay: The total routing delay of a net n, TR(n), is the sum of the delay due to the intra-chip connections, TS(n), and the inter-chip connections, TM(n), of the net:

TR(n) = TS(n) + TM(n) (6)

Intra-chip Routing Delay: TS(n) is a function of the routing architecture of the FPGAs used, the fanout of a connection, the length of a connection, the process technology, and the programming technology. The two main components of TS(n) are the delay due to the switches in the interconnect path and the parasitics of the wire segments. The delay due to the switches can be modeled for a particular programming technology and the number of switching stages between CLBs in the routing architecture, as shown in [27]. (For anti-fuse technology and single-segment routing, the number of switches between two logic blocks was taken to be 2, and the RC model was formed accordingly in [27].) The total switching delay, including the parasitics of the wire segments used by the net, can be modeled as a lumped RC:

TS(n) = Rsw * Csw, (7)

where Rsw is the equivalent drive resistance, or the switching ON resistance, and Csw is the total load capacitance seen by the driver. Csw consists of the gate input capacitance, Cg, and the parasitic capacitance, Cp, of the wire segments used to form the interconnection.
[For an anti-fuse technology (e.g., Actel), Rsw does not change with λ. However, for pass-transistor technology (e.g., Xilinx), Rsw is proportional to λ. Cg is proportional to λ in both anti-fuse and pass-transistor technology.] Cp depends on the process technology used for the wiring segments; it can be computed using the lumped capacitance model and is proportional to wire length. The wire length of a net can be estimated at the partitioning stage using the half-perimeter bounding box:

Cp = CLh * Sx(n) + CLv * Sy(n), (8)

where CLv and CLh are the capacitances (per unit length) of the vertical and horizontal tracks, or busses, in the routing architecture. Thus (7) can be expanded as:

TS(n) = Rsw * (Cg + CLh * Sx(n) + CLv * Sy(n)) (9)

Inter-chip Routing Delay: In addition to the delay inside the FPGA chips, a net acquires additional delay when it crosses chip boundaries. Depending on the type of MFS, MCM or PCB, the modeling of an interconnect between two chips differs [28]. Interconnect wires on PCBs are usually wider (60-100 μm) and thicker (30-50 μm) than those on thin-film MCMs (where the wire width is in the range 10-25 μm and the thickness 5-8 μm). Figure 7 shows a generalized model for the interconnect of an inter-chip connection in an MFS, following the macro-model described in [28]. The model consists of a transmitter capacitance, a receiver capacitance, and a transmission line modeling the wire segment between them. The capacitor at the driving end, CD, models the output capacitance of the driver and the pad capacitance of the chip on the MFS, while the capacitor at the receiving end, CR, consists of the input capacitance of the receiver and the pad capacitance of the receiver. PCB interconnects usually have low resistance per unit length and thus behave like distributed LC (lossless) transmission lines. These lines are generally terminated with a resistor that matches the characteristic impedance, Z0, to avoid reflections. The total resistance of MCM interconnect lines is comparable to the characteristic impedance (which depends on the structural properties of the substrate), so these lines are lossy; MCM interconnect lines are usually left unterminated [29]. The inductance of the chip-to-MCM bond is assumed to be negligible, which is typical for flip-chip-attached integrated circuits. The line parameters R, L and C of the MFS interconnect depend on material properties such as the dielectric constant of the insulator (ε), the resistivity of the metal (ρ) and the permeability (μ) of free space, as well as the line geometry of the wire. Based on this model, the delay for a chip-to-chip interconnect, TCC, appropriate to the particular MFS is pre-computed.

FIGURE 7 The inter-chip interconnect structure and the circuit used in delay modeling.

We assume that a net which connects to more than one chip is connected by the shortest-path tree between the chips, and we let IOn be the number of inter-chip connections the net requires under this assumption. In our main applications, the spacing between the chips in Figure 8 is comprised of pre-wired connections which run perpendicular to the chip edges. Hence, the total inter-chip connection delay for a net is:

TM(n) = IOn * TCC (10)

Thus, the total routing delay of a net n over multiple FPGAs is:

TR(n) = TS(n) + TM(n) = Rsw * (Cg + CLh * Sx(n) + CLv * Sy(n)) + IOn * TCC (11)

The total path delay is:

Tpd(p) = TL(p) + Σ over n in p of [Rsw * (Cg + CLh * Sx(n) + CLv * Sy(n)) + IOn * TCC] (12)

In (12), we can precompute the expressions which are independent of wire length and the number of inter-chip connections as:

Kp = TL(p) + Σ over n in p of Rsw * Cg (13)
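The delay model of equations (3)-(13) transcribes directly to code. The sketch below is ours (function and parameter names are assumptions), and the values used in the test are placeholders, not figures from the paper.

```python
def path_delay(n_d, t_clb, nets, r_sw, c_g, c_lh, c_lv, t_cc):
    """Total delay (3) of a path over multiple FPGAs.

    n_d:   depth of the path in logic levels (N_D)
    t_clb: intrinsic CLB delay (T_CLB)
    nets:  list of (sx, sy, io_n) -- bounding-box spans and inter-chip I/O count
    The remaining arguments are the technology constants of (9)-(10).
    """
    t_logic = n_d * t_clb                                # (5) logic delay
    t_route = 0.0
    for sx, sy, io_n in nets:
        t_s = r_sw * (c_g + c_lh * sx + c_lv * sy)       # (9) intra-chip RC
        t_m = io_n * t_cc                                # (10) inter-chip
        t_route += t_s + t_m                             # (6), summed as (4)
    return t_logic + t_route                             # (3)
```

The per-net term matches (11), so summing it over the path reproduces (12).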

As reported in [23], the total timing penalty is computed as the sum of the penalties over all the critical paths specified. For each critical timing path p, the user supplies an upper bound Tub(p) and a lower bound Tlb(p) on the required arrival time. The penalty assigned to a path p is the amount by which its delay deviates from the bounds:

P(p) = Tpd(p) - Tub(p)   if Tpd(p) > Tub(p)
P(p) = Tlb(p) - Tpd(p)   if Tpd(p) < Tlb(p)
P(p) = 0                 otherwise (15)
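The piecewise penalty (15), and its sum over the specified critical paths, transcribe directly to code (the names below are ours):

```python
def path_penalty(t_pd, t_ub, t_lb):
    """Amount by which the path delay t_pd violates the bounds [t_lb, t_ub]."""
    if t_pd > t_ub:
        return t_pd - t_ub     # too slow: exceeds the upper bound
    if t_pd < t_lb:
        return t_lb - t_pd     # too fast: falls below the lower bound
    return 0.0                 # within bounds: no penalty

def total_penalty(paths):
    """Sum of penalties over all specified critical paths (P_t)."""
    return sum(path_penalty(*p) for p in paths)
```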
The total timing penalty is the sum of the penalties over all the critical paths specified:

Pt = Σ over p of P(p)

New State Generation

An important objective of the partitioner is to ensure that each FPGA contains a feasibly implementable amount of logic and to satisfy specific capacity constraints given by the user. The partitioner also needs to perform system-level pin assignment, or pad placement, on the system-level I/O frame of the MFS. We have implemented a new approach to the generation of new states in the simulated annealing to attain these objectives. (In this section, when we refer to pads or I/Os, we mean system-level pads or system-level I/Os.) During annealing, the partitioner uses two main types of moves: a single-object move and a pair-wise interchange of two objects. An object is either a circuit component or a pad. During a single-object move, a component is moved from one bin to another bin, or a pad is moved from one I/O slot to another I/O slot. Two objects are exchanged during a pair-wise interchange move.
The partitioner seeks to maintain an ideal distribution of components in the bins in terms of total utilization of logic per bin. The utilization of a bin is defined as the total component area in that bin divided by the bin area. If the feasible utilization level of the FPGAs is set by the user, the partitioner uses that factor to set the a priori target, or baseline, utilization (U) for each bin in those FPGAs. In the absence of such input, the objective of the partitioner is to achieve roughly comparable utilization in each of the bins; the baseline utilization (U) for each bin is then defined as the total component area for all of the chips divided by the sum of the chip areas. To obtain the target utilization, a proposed move is only tested through the annealing criterion if it satisfies the utilization_test. Basically, a proposed move is evaluated only if the new utilization (Unew) does not exceed the baseline value (U) or if the new utilization is closer to the baseline than the current utilization. A move passes the utilization_test according to the pseudo-code shown in Figure 9. The utilization test acts as a screening routine for the proposed moves. Thus, only feasible partitioning solutions are tested through the annealing criterion.
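A minimal sketch of the utilization_test screening rule of Figure 9, under our reading of the acceptance condition just described (accept if Unew stays at or below the baseline, or else moves strictly closer to it):

```python
def utilization_test(u_new, u_cur, u_base):
    """Screen a proposed move before the annealing criterion is applied."""
    if u_new <= u_base:                      # never exceeds the baseline
        return True
    # over the baseline: allowed only if it moves the bin closer to it
    return abs(u_new - u_base) < abs(u_cur - u_base)
```

Rejected moves are simply never offered to the annealing criterion, which is how the search space of new states is condensed.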
In order to generate a new state, an object A is selected randomly from the set of all objects: components and pads. If A is a component, we randomly select a new location within the range-limiter window [21]. The bin which covers this location is noted. If the single move of A to this bin passes the utilization test, the proposed new configuration is obtained by attempting a single move to the new bin. Otherwise, a component B is randomly selected from the list of cells assigned to that bin. The pair-wise interchange of A and B is tested through a utilization test similar to that shown in Figure 9. If the test is passed, the proposed new configuration is attempted by pair-wise interchanging the components A and B. On the other hand, if A is a pad, we randomly select an I/O slot to which A can move. If that I/O slot is empty, a single move of pad A to that slot is attempted. Otherwise, the pad B occupying that slot is trial-interchanged with A.

PHASE 3: GLOBAL ROUTING AND PIN ASSIGNMENT

In this section we describe the third and final phase of the partitioning system. At the end of annealing, the system-level pads have been placed, and each partition contains unplaced components. Following the physical partitioning of the netlist, each of the n partitions (FPGAs) is converted into a complete and independent layout problem in this phase. Pin assignment is performed on the chip-level I/O frames so that the chips in the MFS can be interconnected consistently using the pre-wired connections between chips and those between the chips and system I/Os. This phase is mandatory for MFS partitioning in order to make the application complete. The signals which cross one or more chip boundaries are external signals. Given the total number of pin-outs of each partition, the objective is to assign the external signals to the chip-level I/Os in such a way that there is no overflow.
We employ a graph-based global router in this phase. An example of the global router graph for an MFS with four FPGAs is shown in Figure 12a. A node is defined at the center of each bin. In order to route nets which connect pad pins, additional nodes are defined outside the MFS core as shown. All rectilinearly adjacent node pairs inside the core are connected by edges. However, to avoid route segments connecting adjacent pads, the edges connecting the nodes which represent pads are excluded from the graph except at the corners of the MFS core. This is done to accommodate pads at those corners (e.g. PD1 and PD2) by providing edges, and thus paths, to connect to them. A capacity is assigned to each edge. The edges which intersect any chip boundary are assigned a capacity equal to the number of pre-placed interconnect wires or I/O slots available on that boundary within the range of that edge. All internal edges are assigned a large capacity to encourage the router to use these edges over the external edges if possible. Initially, the weight of an edge is equal to the length of the edge.
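The graph construction might look as follows (a sketch under our own naming; pad nodes and the corner-pad edges are omitted for brevity, and `crosses_boundary` stands in for the geometric chip-boundary test):

```python
def build_router_graph(rows, cols, crosses_boundary, boundary_cap, big_cap=10**6):
    """Global-router graph: one node per grid cell, rectilinear edges.

    crosses_boundary(u, v) -> bool tells whether the edge between grid
    cells u and v intersects a chip boundary; such edges get the capacity
    of the pre-placed wires there, all other edges a large capacity.
    The initial edge weight is the (unit) edge length.
    """
    edges = {}
    for r in range(rows):
        for c in range(cols):
            for r2, c2 in ((r, c + 1), (r + 1, c)):
                if r2 < rows and c2 < cols:
                    u, v = (r, c), (r2, c2)
                    cap = boundary_cap if crosses_boundary(u, v) else big_cap
                    edges[(u, v)] = {"capacity": cap, "weight": 1.0}
    return edges
```

The large capacity on internal edges makes boundary crossings the scarce resource, so overflow is only counted where pre-placed wires actually limit routing.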
The global router seeks the shortest possible routes while minimizing the overflow over the available routing resources. The global routing algorithm is shown in Figure 10. Initially, the shortest-path routes are found for all external signals. Based on these routes, the total overflow is calculated. The function update_edge_weight is used to update the weights of the edges with overflow. The routes which use the edges with overflow are discarded and the corresponding signals are re-routed using an iterative rip-up and re-route scheme. The pseudo-code for generating a route for a net is shown in Figure 11.
Pin assignment is performed based on the final routes given by the global router for the external signals, as shown in Figure 12b. For each route obtained for a net, we generate an I/O pin at each intersection of a chip boundary and a route segment. Let the net consisting of pins pin1, pin2 and pin3 be routed using the L-shaped route as shown in Figure 12a). A pin a is created at the left side of Chip1 and a pin b is created at the bottom side of Chip1. If such an intersection point is on a side shared by two chips, as in the case of pin b, the pin is assigned to both chips. In this case, pin b is assigned to both Chip1 and Chip2 to maintain consistency in the routing path through that point. Since the routes follow the grid lines, a group of pins is likely to be produced at the same intersection point if several nets share that segment.

Set of external nets: N_ext. Set of edges with overflow: E_o. Set of routes for net n: R_n.

Algorithm Global_Route_for_Pin_Assignment
  for all n in N_ext
    r <- generate_a_route(n, 0);
    R_n <- R_n U r;  /* add to the set of routes for net n */
  Calculate overflow on all edges and form E_o;
  Main_iteration <- 0;
  while (total overflow > 0 AND Main_iteration < MaxIteration)
    for all e in E_o
      Update_edge_weight(e);
    for all n whose route uses an edge e in E_o
      r <- generate_a_route(n, Max_improve);
      R_n <- R_n U r;  /* add to the set of routes for net n */
    Main_iteration <- Main_iteration + 1;

FIGURE 10 Global routing algorithm.

Set of pins for net n: P(n). Set of trees for net n: T(n).

Subroutine Generate_a_route(net n, Max_improve)
  Make each node corresponding to a pin in P(n) a tree and form T(n);
  while (|T(n)| > 1)
    Find a shortest-cost path p between two nodes v_i and v_j,
      v_i in T_i and v_j in T_j;  /* T_i in T(n) and T_j in T(n) */
    Merge the trees and path into one tree T_k <- T_i + T_j + p;
    T(n) <- T(n) - T_i - T_j + T_k;
  if (Max_improve > 0)
    Improve_route_tree(T(n), Max_improve);
  return (T(n));

Subroutine Improve_route_tree(T, Max_improve)
  iteration <- 0;
  while (iteration < Max_improve)
    Select a random edge e in T;
    Create a path p1 by tracing e to nodes with degree d > 2;
    Create two trees T1 and T2 by removing p1 from T;
    Find the shortest-cost path p2 between two nodes v_i in T1 and v_j in T2;
    if cost(p2) < cost(p1) then T <- T1 + T2 + p2;
    iteration <- iteration + 1;

FIGURE 11 Pseudo-code for generating a route for a net.
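The route-generation step can be sketched in Python (a simplified variant of Generate_a_route in Figure 11: we grow a single tree toward the nearest unreached pin instead of merging arbitrary tree pairs, and omit Improve_route_tree; all names are ours):

```python
import heapq

def shortest_path(adj, sources, targets):
    """Dijkstra from a set of source nodes to the nearest node in targets.

    adj maps a node to a list of (neighbor, edge_weight) pairs.
    Returns the node sequence of the cheapest source-to-target path.
    """
    dist = {s: 0.0 for s in sources}
    prev = {}
    pq = [(0.0, s) for s in sources]
    heapq.heapify(pq)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        if u in targets:
            path = [u]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    return []

def generate_a_route(adj, pins):
    """Grow a route tree over all pins of a net, one nearest pin at a time."""
    tree = {pins[0]}
    remaining = set(pins[1:]) - tree
    route_edges = []
    while remaining:
        path = shortest_path(adj, tree, remaining)
        route_edges += list(zip(path, path[1:]))
        tree.update(path)
        remaining -= tree
    return route_edges
```

Rip-up and re-route then amounts to raising the weights of overflowed edges (update_edge_weight) and calling `generate_a_route` again for the affected nets.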
In such cases, the pins are assigned in the same order on a shared chip boundary. N independent layout problems are created at the end of the global routing such that they can be independently placed and routed in parallel if need be.
6. PHASE 1: CLUSTERING

6.1. Motivation for a New Clustering Algorithm

The extent to which a netlist is reduced by clustering depends not only on the characteristics of the original netlist but also strongly on the clustering strategy.
The characteristics of the clustered netlist have a strong influence on the quality of results obtained from the partitioner. We have experimented with algorithms using various bottom-up clustering strategies. The effects of the various strategies have helped us to identify the key objectives of a clustering algorithm in the context of our main goal, which is to dramatically speed up simulated annealing-based N-way partitioning. The run time of the simulated annealing algorithm implemented in the chip partitioner depends on various factors related to the netlist.

Moveable components: The total number of components of the MFS which participate in the chip partitioning algorithm determines the total number of moves to be executed in each iteration. A reduction in the number of moveable components will thus improve the run time.

Nets: When a move is attempted with a component, c, the wire length of the set of nets, N_c, associated with c needs to be updated. The time required to update the wire length of a net n in N_c is proportional to the cardinality of the fanout of n, |P_n| (P_n is the set of pins on net n). Therefore, the two main factors controlling the run time associated with this move are |N_c| and |P_n|. Thus, a component with a large number of nets is an expensive component to move, and a net with a large number of pins is an expensive net to update. When components are picked randomly during the annealing process, the larger nets are picked more often than the smaller nets. An objective of the clustering algorithm must therefore be to reduce the fanout of nets as well as to reduce the total number of nets and the total number of pins.
Component size: As described in Section 4.1.3, the chip partitioner uses a utilization test routine to maintain a feasible partitioning solution at each step in the annealing process. If two components picked for pairwise interchange differ in size by a considerable amount, so as to fail the utilization screening test, the partitioner goes on to generate another move. Generating infeasible moves wastes CPU time and inhibits full exploration of the state space. Thus, the clustering algorithm should try to generate clusters which are as uniform in size as possible.
Three prominent clustering metrics were proposed recently. Intuitively, the degree/separation metric used in the random-walk based clustering algorithm, RW-ST [13], strengthens the objective of seeking a minimum cut between clusters. However, the RW-ST heuristic has an overall worst-case time complexity of O(n^3), making it inefficient for application to large circuits. The shortest path clustering algorithm [12], based on the uniform multi-commodity flow problem, was evaluated using the ratio cut metric. In most cases the clusters produced by this method had a large range of sizes. Clusters having such properties are not at all suitable for N-way partitioning, as discussed previously.
A third clustering algorithm, based on (k, l) connectivity, was reported in [24]. If there are k edge-disjoint paths of length at most l between components s and t, then s and t are said to be (k, l)-connected. A cluster was defined as a group of components such that every two components in that group are (k, l)-connected directly or indirectly through transitive closure. We have extended this algorithm for multipin nets and have implemented it for evaluation purposes. We have improved the worst-case time complexity of this algorithm to O(nB^2) for l = 2 and O(nB^3) for l = 3. Here n is the number of nodes/components in the graph and B is a constant representing the upper bound on the number of immediate neighbors a component has in a circuit. However, we found several disadvantages with this algorithm. The (k, l) criterion may yield non-intuitive clusters if the circuit is not structured enough. Also, there is no control over the size of the clusters obtained. The algorithm uses the netlist connectivity graph for enumeration of paths. The use of such an algorithm for large circuits is impractical. Through extensive experimentation with this algorithm for various (k, l), combined with other heuristics, we obtained clusters which were of reasonable quality.
However, since the appropriate value of k was difficult to predict without experimentation on a particular circuit, we could not use this algorithm to generate clusters for the chip partitioner over a wide range of circuits.

Approach for Clustering
Our clustering approach is a bottom-up hierarchical technique based on an agglomerative method of clustering. At each level of the hierarchy, we cluster nodes which qualify to merge with each other. This results in a netlist of reduced complexity. Our experiments on circuits spanning a large size range have shown that the clustering process needs to be adaptive to the current state of the netlist in order to obtain the best N-way partitioning results. Thus we have devised a two-phased natural and adaptive clustering technique.
The first method finds the natural clusters of the circuit. The criterion for merging nodes at each level of the hierarchy is based on the net connectivity and density of the weighted netlist graph. The second and more adaptive strategy is based on a heuristic technique which aims to refine the netlist obtained from the first method for its most effective application in the chip partitioner.
Since we are interested in circuit applications, we wish to minimize the number of inter-cluster nets. Thus, we considered the k-edge-connectivity of a graph as opposed to its k-vertex-connectivity. Our initial concept of a cluster was derived from the notion of (k, l) connectivity [24]. However, we adopted a more general model of graph representation for a netlist which allowed us to handle multipin nets as easily as 2-pin nets. The accumulative weighted graph is simpler than a hypergraph and more accurately represents the circuit structure. The time complexity of our basic algorithm is linear with respect to the number of nodes in the accumulative weighted graph.

Accumulative Weighted Graph
The nodes of the graph represent the components of the circuit, and an edge between two nodes indicates that a hyper-edge contains these two nodes. For example, each multipin net connecting n components is represented by a complete graph contributing n(n - 1)/2 edges to the graph. Our ultimate purpose is to use the clusters for distance-based partitioning. In a final placement, the nets with very high fanout naturally span a larger portion of the core and thus have higher wire lengths. Thus, partitioning programs will have more success in minimizing the net lengths of low-fanout nets. In order to differentiate between large and small nets, we have formulated an edge-weighting scheme based on the fanout of a net. Thus an edge representing a net with n pins is given a weight 1/(n - 1). An edge from a 2-pin net is given a weight of 1. After assignment of weights, we collapse all the edges between a pair of nodes into one edge. This final edge thus carries an aggregate weight equal to the sum of the weights of the original edges between those nodes. In this way, as shown in Figure 13, we can represent the global nature of the connectivity of the netlist through a simplified graph, as a pair of nodes will have at most one edge between them.

DEFINITION 1: An edge of the weighted graph with a weight w > 1.0 is a strong edge.
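The construction of the accumulative weighted graph can be sketched as follows (function and variable names are ours):

```python
from collections import defaultdict

def accumulative_weighted_graph(nets):
    """Collapse a hypergraph netlist into an accumulative weighted graph.

    Each n-pin net contributes a clique on its components; every clique
    edge carries weight 1/(n - 1), and parallel edges between the same
    pair of nodes are merged by summing their weights.
    """
    w = defaultdict(float)
    for pins in nets:
        n = len(pins)
        if n < 2:
            continue  # a single-pin net adds no edges
        for i in range(n):
            for j in range(i + 1, n):
                e = tuple(sorted((pins[i], pins[j])))
                w[e] += 1.0 / (n - 1)
    return dict(w)
```

With this weighting, two components joined by a dedicated 2-pin net accumulate weight 1 on their shared edge, so an aggregate weight above 1.0 (a strong edge, per Definition 1) signals connectivity denser than a single 2-pin net.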

Multilevel Cluster Growth
Natural Clustering Algorithm

The purpose of the first phase of clustering is to extract dense subgraphs (natural clusters) from the weighted graph of the netlist. The initial seed for this agglomerative growth process consists of the individual components in the netlist. We start with the netlist of the circuit and construct the weighted graph.
Nodes which are connected through strong edges are clustered. The Cluster_Natural algorithm is shown in Figure 14. When nodes v_i and v_j are merged using the natural clustering algorithm, the edges between them become internal edges of the cluster to which these two nodes belong. These internal edges do not participate in subsequent levels of clustering. The external edges get new weights depending on the new accumulative graph obtained after updating the netlist. The clusters are grown in consecutive layers by iterative calls to the Cluster_Natural algorithm. At the second level and below, the big clusters already formed at previous levels are not allowed to grow any more. We have defined big as a multiple of the average component size. The process is frozen when no more strong edges are found.
FIGURE 14 The Cluster_Natural algorithm. Input: V[G], the set of vertices in the graph G, and F_e(v), the set of edges adjacent to v; for each vertex v, the nodes reached from v through strong edges in F_e(v) are merged with v.
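One level of Cluster_Natural can be sketched with disjoint sets (union-find), merging every pair of nodes joined by a strong edge (a sketch; the exact merge order and weight updates follow Figure 14):

```python
def cluster_natural(weighted_edges, strong=1.0):
    """One level of natural clustering over an accumulative weighted graph.

    weighted_edges maps (u, v) pairs to aggregate edge weights. Nodes
    joined by an edge of weight > strong are merged via union-find.
    Returns a mapping node -> cluster representative.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), w in weighted_edges.items():
        find(u)
        find(v)  # register both endpoints
        if w > strong:
            parent[find(u)] = find(v)
    return {x: find(x) for x in parent}
```

Iterating this routine on the re-weighted, reduced graph grows the clusters layer by layer until no strong edges remain.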

Adaptive Clustering Algorithm
Since the first phase of clustering is dependent on the density of the graph, the distribution of components in clusters may not be uniform. The clusters obtained in the first phase are thus refined through heuristic adaptive clustering techniques. These techniques were designed to further mold the shape of the reduced netlist to make chip partitioning more efficient.
With a minimal number of changes to the natural clusters, it is desirable to further reduce the complexity of the network while trying to achieve a relatively uniform distribution of cluster sizes. We identified two main ways of reducing the size of the network: (a) collapse a small net (typically a 2-pin net) and thus decrease the total number of nets in the network; (b) reduce the fanout of a large net and thus decrease the average net fanout.
A set of small clusters is formed and used in the algorithm Cluster_Adaptive (Figure 15). A small cluster is defined as a cluster with size less than the average cluster size.
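Reduction (a) can be sketched as follows (our reading of the small-net collapse; the exact rule is given by Cluster_Adaptive in Figure 15, and all names here are ours):

```python
def collapse_small_2pin_nets(nets, sizes, avg_size):
    """Collapse 2-pin nets whose two endpoint clusters are both 'small'
    (size below the average cluster size), merging the endpoints and
    removing the net. sizes is updated in place.
    """
    merged_into = {}

    def rep(c):
        while c in merged_into:
            c = merged_into[c]
        return c

    out = []
    for pins in nets:
        pins = sorted(set(rep(p) for p in pins))
        if len(pins) == 1:
            continue  # net became internal to a cluster; drop it
        if len(pins) == 2 and sizes[pins[0]] < avg_size and sizes[pins[1]] < avg_size:
            a, b = pins
            merged_into[b] = a
            sizes[a] += sizes[b]
            continue  # net collapsed away
        out.append(pins)
    return out
```

Because merging is gated on both endpoints being small, the collapse reduces the net count while pushing the cluster-size distribution toward uniformity.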
We tested this two-phased natural and adaptive approach of clustering on several MFS circuits. The nature of the clusters obtained on one such circuit is shown in Figure 16. The number of clusters containing a particular count of components, and the aggregate cluster area (sum of the constituent component areas) versus the number of components in a cluster, are shown. Note that even though the number of components per cluster has a wide range, the average size (total component area) of the clusters is relatively uniform.
Our algorithm was designed to be an efficient implementation and thus usable for practical applications. Efficient data structures like disjoint-union sets are used to form clusters. The worst-case time complexity of our algorithm for multipin nets is O(nB), where B is a constant representing the upper bound on the number of immediate neighbors a cell has in a circuit and n is the number of cells. Due to the accumulative weighted graph, the space requirements for our algorithm are O(n^2) in the worst case. Practical circuits generate much sparser graphs than complete graphs and thus require approximately O(n) space. Also, this space requirement is for the first level of clustering only. As the network is reduced in successive levels, the space requirement reduces accordingly.

7. RESULTS
Our MFS partitioning system, Tomus, has been developed in C with an X11 graphics interface to provide interactive features. Several industrial circuits were used to test the partitioning system. The circuit descriptions are shown in Table I. Figure 17 shows the partitioning results of the circuit alma for a 2-FPGA MFS. The net connectivity between bins is shown before and after partitioning. This demonstrates the effectiveness of Tomus in minimizing congestion across chip boundaries.

Partitioning results using clustering

The reduced netlist obtained from our clustering algorithm was used during the normal course of the MFS partitioning. Once the MFS partitioning is complete, the clusters are disintegrated into the original components. A low-temperature annealing is performed on these components using only pairwise interchange of components. Since updating the cost function in annealing is the major speed bottleneck, a reduced number of nets and pins speeds up the annealing process compared to using a flat netlist.
Table II compares the partitioning results between the flat mode and the clustered mode of Tomus on a 4-FPGA MFS. For a total of 20 runs of Tomus, we report the minimum of the nets cut, and the average and the minimum overflow (unroutable nets) after chip-level pin assignment. The CPU times reported are on a DEC 5000/200. For the clustered mode, the CPU time includes the time taken to generate clusters. On average we obtained a speedup of up to 5 times while improving the quality of the partitioning results in all cases, as shown in Table II. Based on these encouraging results, we have used the clustered mode of Tomus for the rest of our experiments in the following sections.

Global routing
In this section, we show results of the pin assignment phase in Tomus. We partitioned each circuit using a 4-FPGA MFS. Given the total number of nets cut over the four FPGAs at the end of the simulated annealing phase, global routing was executed for pin assignment. The global router first finds the shortest routes. An iterative rip-up and reroute process then improves these routes to minimize overflow. In Table III, out of a total of 20 runs, we picked the run which gives the best final overflow, and we compare the overflow with the initial shortest routes against the final overflow after overflow minimization. In all but one case, the overflow (unroutable nets) was reduced to zero.
For the case with non-zero overflow, the routing was completed by using unassigned system-level I/Os (which are not available for most designs).

Comparison with Mincut
In this section we compare the N-way partitioning results using Tomus with results from recursive bi-partitioning using mincut. We have implemented the mincut bi-partitioning algorithm proposed by Fiduccia and Mattheyses [5].

Four-Way Partitioning
For an MFS with four FPGAs, we partitioned each circuit by recursively applying Fiduccia and Mattheyses (F-M). Table IV shows the 4-way partitioning results using recursive mincut. For each run of F-M, we start with different initial partitions. Over a total of 30 runs on each circuit, we report the best results in terms of nets cut in Table IV. We used Tomus to physically partition each circuit onto four FPGAs. Table V shows the results using Tomus. Equal partition sizes were specified in both cases. The best results of 30 runs are shown for both F-M and Tomus. For all the circuits, Tomus obtains a lower or equal number of nets cut between the four FPGAs. The average improvement of Tomus over recursive F-M is 34%.

Comparison with InCA
We have also compared the Tomus partitioning results for a 4-FPGA MFS with the test results from an industrial FPGA partitioner (Concept Silicon from InCA). The overflow (number of unroutable nets) after pin assignment is compared in Table VIII. The average improvement in overflow is 73%.

Timing Results
We tested the timing-driven capabilities of the partitioner on the MCNC benchmark circuits. Multiple-FPGA configurations using Xilinx devices on a PCB were used to obtain these results. For each circuit we first used Tomus to find a partitioning without imposing any delay bounds. Using the nominal net lengths obtained from this run, we extracted (in order) the m most delay-critical primary input (PI) to primary output (PO) pin pairs. (The value of m was limited so as not to more than double the overall CPU time versus the case when no delay bounds are imposed. We verified that none of the non-included pin pairs gave rise to a critical delay at the conclusion of the partitioning.) We extract the current longest path between these pins. These constitute the set of critical paths used in our timing penalty function, and we impose the delay bound on these paths. Because a particular path may not always be the critical path for a pair of pins, we update the set of critical paths 150 times (once per iteration) during the course of the annealing-based partitioning.
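As an illustration, a penalty of this kind can be computed by summing the amount by which each tracked critical path exceeds its delay bound (a sketch under our own assumptions; the paper's exact penalty form and delay model are given elsewhere):

```python
def timing_penalty(critical_paths, delay_bound, net_delay):
    """Sum of bound violations over the tracked critical paths.

    critical_paths: list of paths, each a list of net identifiers;
    net_delay: function mapping a net identifier to its estimated delay
    under the current partitioning (intra- or inter-chip).
    """
    penalty = 0.0
    for path in critical_paths:
        d = sum(net_delay(n) for n in path)
        penalty += max(0.0, d - delay_bound)
    return penalty
```

A term of this shape added to the annealing cost function steers moves away from configurations that push tracked paths past their bounds, while leaving paths safely within bound unpenalized.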
We compared the results with the timing penalty deactivated versus the results obtained with the timing penalty activated for each circuit. In Table IX we show the number of paths which were within specifications and the number of paths which were outside the specifications in both cases. Column 1 shows the number of devices used. Using our timing penalty function, Tomus successfully partitioned these circuits, satisfying the timing constraints in all cases. In all the circuits, Tomus achieved a significant reduction in the longest path delay by using the timing penalty function. The average reduction was 17%. These results were obtained at the cost of 17% in nets cut and 5% in wire length on average.

The partitioning algorithm minimizes total weighted wire length in order to minimize the pin-outs of each FPGA. We have introduced a timing model which is specific to a multiple-FPGA architecture. It combines all the possible delay factors involved in a system with multiple FPGA-based chips.
The pin assignment phase is mandatory for MFS partitioning in order to make the application complete.
Although the global router uses a few well-known heuristics, we have made appreciable extensions which can handle the pre-routed connections of an MFS structure. We have introduced a new two-phased natural and adaptive clustering algorithm which has improved the quality and run time of MFS partitioning. The annealing-based N-way partitioner executes four times faster on average using the clusters as opposed to the flat netlist, with improved partitioning results. For several industrial circuits, our approach outperforms the recursive mincut bi-partitioning algorithm by 35% on average in nets cut.
Our approach also outperforms an industrial FPGA partitioner by 73% on average in overflow. We have tested the timing-driven capabilities of the partitioner. Using the timing penalty function, Tomus successfully partitioned several MCNC benchmarks, satisfying the timing constraints in all cases. An average reduction of 17% was achieved in the longest path delay, at the cost of 17% in nets cut and 5% in wire length on average. In the future, we will extend our research to accommodate new MFS designs with new interconnect wiring designs [1][30].

FIGURE 2 An MFS on an MCM.

FIGURE 3 The main phases of the MFS partitioner.

FIGURE 12 a) A global route on the graph. b) Independent chips after pad/pin assignment based on the route.

FIGURE 16 Distribution of MFS circuit components in clusters.

FIGURE 17 Partitioning results of dma on a 2-FPGA MFS, a) before and b) after partitioning.

Table III Overflow minimization using global routing.

Table V Tomus results for four-way partitioning.

Two-Way Partitioning

We then configured an MFS with two FPGAs and conducted a similar experiment. For each run of F-M, we start with different initial partitions. Over a total of 30 runs on each circuit, we report the best results in terms of nets cut in Table VI for the 2-way partitioning. The best results of Tomus for 2-way partitioning from 30 runs per circuit are shown in Table VII. Again, for all circuits, Tomus obtains a lower number of nets cut. The average improvement of Tomus over F-M is 40%.

Table VI F-M mincut results for two-way partitioning.
8. CONCLUSIONS

We have presented a timing-driven N-way partitioner for MFSs. The physical constraints of the MFSs and the timing constraints are satisfied during partitioning.

Table VII Tomus results for two-way partitioning.

Table IX Timing-driven partitioning.