Placement and Routing for Performance-Oriented FPGA Layout

This paper presents a performance-oriented placement and routing tool for field-programmable 
gate arrays. Using recursive geometric partitioning for simultaneous 
placement and global routing, and a graph-based strategy for detailed routing, our tool 
optimizes source-sink pathlengths, channel width and total wirelength. Our results 
compare favorably with other FPGA layout tools, as measured by the maximum 
channel width required to place and route several benchmarks.


INTRODUCTION
Field-programmable gate arrays, or FPGAs, provide a versatile and inexpensive way to implement and test VLSI designs [7,16]. FPGAs are available in a number of styles and configurations [40]. One of the most common FPGA architectures [9,43] consists of a matrix of user-configurable logic blocks interconnected by a set of programmable routing resources (Fig. 1). FPGA reprogrammability is achieved at the expense of performance, as there may be long signal delays through the reconfigurable routing resources [39]. To increase FPGA performance, partitioning and technology mapping have been extensively studied [11,20,27,35]. However, the observation that circuit performance is impacted more by routing delays than by device delays [6,26] has focused recent attention on routing [8,15,31,32,42].
This paper presents a performance-oriented FPGA Placement and Routing (FPR) tool. FPR is based on a recursive geometric strategy for simultaneous placement and global routing, followed by a graph-based detailed-routing phase. FPR heuristically minimizes both wirelength and source-sink pathlengths. Thus, FPR optimizes the number of FPGAs required to implement a given design, as well as the performance of the implementation. In particular, FPR successfully lays out a number of large industrial benchmark circuits using smaller channel widths than other FPGA layout tools, and also optimizes source-sink pathlengths as a secondary criterion.
The rest of the paper is organized as follows. Section 2 provides an overview of our methodology. Section 3, Section 4 and Section 5 detail the main phases of FPR, namely placement, global routing and detailed routing, respectively. Section 6 establishes the efficacy of our implementation on industrial benchmark designs, and we conclude in Section 7. The Appendix develops some theoretical results for multi-weighted graphs used in the multi-objective optimization phase of detailed routing. Preliminary versions of this work have appeared in [1,2,3].

OVERVIEW
FPGA logic blocks typically contain a programmable look-up table, which enables arbitrary combinational-logic functions of up to four variables to be implemented. Each logic block thus contains a small portion of the overall circuit logic. The logic blocks are interconnected by channel segments, which are linked together by switch blocks. The switch blocks contain programmable internal connections among certain subsets of incident channel segments. Switch-block edges are often implemented as pass transistors, which can be "turned on" to interconnect incident channel edges. Finally, connection edges allow logic-block pins to latch onto adjacent channel segments.
During the FPGA design process, placement and routing are performed following the technology mapping phase. Technology mapping decomposes the circuit design into units of logic, which are then assigned to specific logic blocks during placement.
Thus, the input to FPR consists of unplaced logic blocks and a set of nets (a net is a set of logic-block I/O pins that must be interconnected). FPR performs simultaneous placement and global routing using a recursive geometric technique called thumbnail partitioning, which decomposes the circuit area into an m × n grid, for some small fixed m and n. This grid is called the partitioning template. The placement is then optimized and a global routing is determined relative to the partitioning template using optimal rectilinear Steiner arborescences (RSAs) [34] (i.e., minimum-weight shortest-path trees). Since m and n are small and fixed, these optimal RSAs (called thumbnails) may be precomputed for efficient lookup during execution. Setting m = n = 3 yields the basic 3 × 3 partitioning template that is used in our implementation (Fig. 2(a)). Thumbnail partitioning is a generalization of sharp partitioning [5], which in turn is a generalization of quadrisection [38].
Our strategy consists of placement and global routing, followed by detailed routing. During placement and global routing, a partitioning heuristic is used to assign the logic blocks to regions in the partitioning template, minimizing source-sink pathlengths as well as the total length of the thumbnails. When the circuit area is divided according to the partitioning template, each logic block lies in one of the m × n regions. For each net, we construct a pointset in the m × n grid, where a point is present in a region if some logic block associated with the net lies in that region (Fig.
To reduce overall routing congestion, alternative thumbnails are selected in order to balance the number of thumbnail edges that cross each edge of the partitioning template. "Virtual" pins are then created at the intersections of thumbnails and partitioning-template edges (Fig. 2(d)), and the algorithm is then applied recursively to each subregion of the partitioning template. This scheme simultaneously produces both a placement and a global routing in which source-sink pathlengths, total wirelength, and maximum channel congestion are all heuristically minimized. The resulting placement and global routing is then used in the detailed-routing phase to produce a complete routing solution.
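The per-net pointset used above can be illustrated with a short sketch. It assumes a uniform division of a rectangular logic-block array into the m × n template and integer block coordinates; the function names and the uniform split are illustrative assumptions, not part of FPR (where cut lines are chosen to pass through switch blocks).

```python
def region_of(x, y, width, height, m=3, n=3):
    """Map logic-block coordinates (x, y) to a (row, col) template region.

    Assumes 0 <= x < width and 0 <= y < height, and a uniform split
    of the array into m rows and n columns (an illustrative choice).
    """
    col = min(x * n // width, n - 1)
    row = min(y * m // height, m - 1)
    return row, col


def pointset(net_blocks, width, height, m=3, n=3):
    """Regions of the template that contain at least one block of the net."""
    return {region_of(x, y, width, height, m, n) for x, y in net_blocks}


# A four-block net on a 9 x 9 array: two blocks share the top-left region.
example_net = [(0, 0), (1, 2), (8, 8), (4, 4)]
example_points = pointset(example_net, 9, 9)
```

Because two of the four blocks fall into the same region, the resulting pointset has only three points; the thumbnail for the net is then chosen for this reduced pointset.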
During the detailed-routing phase, nets are assigned specific routing resources based on global routes. By modeling the FPGA routing architecture as a graph, efficient graph-based algorithms may be used to produce detailed-routing solutions. Nets are routed one at a time; as resources are committed to nets, the corresponding edges in the underlying graph are made unavailable to subsequent nets.
The next three sections detail the main phases of FPR, namely: (1) logic-block placement and thumbnail selection for balancing congestion, (2) global routing, and (3) detailed routing.

PLACEMENT
The placement phase overlays the FPGA with the partitioning template and initially partitions the design logic into m × n regions. Cut lines of the partitioning template go through switch blocks so that each logic block lies entirely within a single region of the partitioning template. The distribution of logic blocks among regions of the partitioning template is then improved using simulated annealing [28], where a move consists of swapping two logic blocks that lie in different regions of the partitioning template. The simulated annealing objective is to minimize (1) the sum of the maximum source-sink pathlengths in the thumbnails over the nets, and (2) the total length of the thumbnails for all nets. Note that the I/O blocks on the perimeter of the FPGA are not moved during these iterative refinement steps.
Routability is a primary concern during the FPGA design process [6,10]. An important measure of the quality of a placement and global routing is maximum congestion, which in our case is the number of thumbnail edges that cross any given partitioning-template edge. Thus, once logic blocks have been assigned to regions in the partitioning template, a congestion-balancing step is undertaken as follows.
A typical pointset can have many thumbnails; for example, Figure 3 illustrates a pointset and its eight thumbnails. The objective of the congestion-balancing step is to assign one of the precomputed thumbnail alternatives to each net in a manner that minimizes the maximum thumbnail congestion. This task is accomplished using the following greedy heuristic:
  1. Sort the nets in ascending order of the number of distinct thumbnails for each net; and
  2. For each net on this list, choose the thumbnail that minimizes the maximum congestion induced by all previously processed nets.
Intuitively, this scheme postpones the global routing of nets for which there are a greater number of thumbnail choices; this enables FPR to better compensate for the less avoidable congestion incurred earlier by nets with fewer thumbnail choices.
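The greedy congestion-balancing heuristic described above can be sketched as follows. This is a minimal illustration rather than FPR's implementation: each thumbnail is represented simply as a tuple of the template-edge identifiers it crosses, and ties in maximum congestion are broken by the number of crossings, which is an assumption made here.

```python
from collections import Counter


def balance_congestion(net_thumbnails):
    """Greedily pick one thumbnail per net.

    net_thumbnails maps a net name to a list of alternative thumbnails;
    each thumbnail is a tuple of the template edges it crosses.
    Returns (choice, usage), where usage counts crossings per template edge.
    """
    usage = Counter()
    choice = {}
    # Nets with the fewest alternatives are committed first.
    for net in sorted(net_thumbnails, key=lambda n: len(net_thumbnails[n])):
        def cost(thumb):
            trial = usage.copy()
            trial.update(thumb)
            # Primary key: resulting maximum congestion.
            # Secondary key (assumed tie-break): number of crossings.
            return (max(trial.values(), default=0), len(thumb))

        best = min(net_thumbnails[net], key=cost)
        usage.update(best)
        choice[net] = best
    return choice, usage


# Hypothetical template edges e1..e5 and three nets.
nets = {
    "a": [("e1", "e2")],
    "b": [("e1",), ("e3",)],
    "c": [("e2", "e3"), ("e1", "e4"), ("e4", "e5")],
}
choice, usage = balance_congestion(nets)
```

In this toy instance net "a" has no alternative, so "b" and "c" are steered away from the edges it occupies and no template edge is crossed more than once.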

GLOBAL ROUTING
After FPR has mapped the logic blocks to regions in the partitioning template and each net has been assigned a thumbnail, every edge in each thumbnail is then assigned to a specific switch block along the crossed cut-line of the partitioning template. Each such switch block is then conceptually added as a new "virtual" pin in the net. The portion of each net within each region of the partitioning template is then passed on to a lower level of the recursion (this is similar to the virtual terminal [5] and terminal propagation [14] techniques). Thus, the global routing computed for a net corresponds to the topology of its thumbnail.
Assignment of nets to switch blocks is accomplished in a manner similar to PHIroute [37]. The number of nets that can be assigned to each switch block is bounded by the number of nets crossing the cut, divided by the number of switch blocks on the cut. This construction induces a structure that may be represented by a complete bipartite graph with nets in one partition and switch blocks in the other. Edge weights in this graph model the cost of assigning a net to the corresponding switch block. Assignments are then determined by computing a minimum-cost matching [33].
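To make the capacity-bounded assignment concrete, the sketch below finds a minimum-cost assignment by exhaustive search. It is only meant for tiny instances: a practical implementation would use a polynomial-time minimum-cost matching algorithm as in [33]. The cost matrix is hypothetical, and rounding the capacity bound up to an integer is an assumption made here.

```python
from itertools import product
from math import ceil


def assign_nets(cost):
    """cost[i][j] is the cost of sending net i through switch block j.

    Returns (assignment, total_cost), where assignment[i] is the switch
    block chosen for net i.  Exhaustive search, exponential in the
    number of nets; for illustration only.
    """
    n_nets, n_blocks = len(cost), len(cost[0])
    cap = ceil(n_nets / n_blocks)  # assumed rounding of the capacity bound
    best, best_cost = None, float("inf")
    for assign in product(range(n_blocks), repeat=n_nets):
        if any(assign.count(j) > cap for j in range(n_blocks)):
            continue  # some switch block would be over capacity
        total = sum(cost[i][j] for i, j in enumerate(assign))
        if total < best_cost:
            best, best_cost = assign, total
    return best, best_cost


# Four nets crossing a cut with four switch blocks (capacity one each).
example_cost = [
    [0, 2, 4, 6],
    [0, 1, 3, 5],
    [1, 0, 2, 4],
    [5, 3, 1, 0],
]
assignment, total = assign_nets(example_cost)
```

Nets 0 and 1 both prefer switch block 0, so the capacity bound forces one of them onto a costlier block; the cheapest feasible assignment has total cost 3.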
Recursion terminates when a region contains at most one logic block, along with the adjacent channel segments and switch blocks. We then route nets within the channels surrounding the logic block (if it exists) while minimizing the maximum channel congestion. In our implementation, an optimal solution is computed using integer programming [30]. This is efficient in practice since the number of nets involving any single logic block is small [17].

DETAILED ROUTING
After placement and global routing, FPR performs detailed routing by assigning specific channel and switch-block edges to each net. The placement and global-routing phase passes the following information to the detailed router: (1) locations of relevant logic-block pins (i.e., the net to be routed), (2) a "loose" route for the net (leaving unspecified the edges within channel segments and switch blocks), and (3) switch blocks that are likely to serve as Steiner nodes in the detailed routing (Fig. 4).
A design goal for FPR has been the ability to handle a wide variety of FPGA architectures. Towards this goal, we have adopted a graph-based approach to detailed routing. Each switch block contains internal switch-block edges that may be programmed to connect incoming channel edges. The routing structure of the entire FPGA is captured by a routing graph: detailed routes on the FPGA correspond to paths in the routing graph, and vice versa (Fig. 5). In a routing graph, vertices model logic-block and switch-block nodes, while the edges correspond to connection, channel, and switch-block edges. This strategy enables the detailed router to employ generic graph algorithms in order to produce detailed-routing solutions.

FIGURE 4 Global-routing information for a three-pin net, showing the associated logic blocks (dark squares), global route (cross-hatched region), and potential Steiner switch block (large dark square).
Using the routing-graph approach, detailed routing entails interconnecting the logic-block vertices using edges and vertices inside the corresponding global-route region. This goal is modeled by the graph Steiner tree (GST) problem: given a graph $G=(V,E)$, where $V$ is the vertex set and $E \subseteq V \times V$ is a set of weighted edges, find a minimum-weight tree in $G$ that spans a subset of the vertices $N \subseteq V$ (the logic-block vertices in a net), using switch-block vertices as possible Steiner nodes. The cost of a tree $T$, denoted $\mathrm{cost}(T)$, is the sum of the costs of its edges.

Since the GST problem is NP-complete [24], we utilize the heuristic of Kou, Markowsky and Berman [29] (KMB), which approximately solves the GST problem in polynomial time, and is guaranteed to yield solutions with cost less than twice the optimal. While the KMB heuristic always finds a feasible detailed routing if one exists, it often does not "branch" at the appropriate Steiner nodes (Fig. 6(a)). This potential drawback is effectively ameliorated using the greedy strategy described below.
Our detailed-routing algorithm is based on combining a greedy, iterated heuristic [21,25] with the KMB algorithm; we refer to this hybrid method as the Iterated-KMB (IKMB) algorithm [1]. Given a routing graph $G=(V,E)$, a net $N \subseteq V$, and a set $S$ of potential Steiner nodes, we define the savings of $S$ with respect to $N$ as $\Delta\mathrm{KMB}_G(N,S) = \mathrm{KMB}_G(N) - \mathrm{KMB}_G(N \cup S)$. Intuitively, $\Delta\mathrm{KMB}_G(N,S)$ represents the interconnect savings obtained by KMB when the Steiner nodes in $S$ are included into the node set $N$ to be spanned. This is illustrated in Figure 6(b), where using a candidate Steiner node from the shaded switch block results in an optimal solution. In order to efficiently find such Steiner nodes, a set of candidate Steiner nodes is determined for each net. Candidate Steiner nodes are switch-block nodes that correspond to Steiner switch blocks (Fig. 4).
The IKMB method operates by repeatedly finding candidate Steiner nodes that reduce the overall KMB cost by the largest amount, and then including them into a growing set $S$ of Steiner nodes. The cost of the KMB tree over $N \cup S$ decreases with each added node, and the construction terminates when there is no $x \in V$ with $\Delta\mathrm{KMB}_G(N \cup S, \{x\}) > 0$. The final topology is obtained by computing the KMB construction using $N \cup S$ as the pins and the remaining $V - (N \cup S)$ nodes as potential Steiner nodes.
The overall IKMB method is more formally described in Figure 7.
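A compact sketch of the iterated construction follows. It is written under simplifying assumptions that are not taken from FPR: the routing graph is a connected, undirected dictionary-of-dictionaries with string node names, the helper names (`make_graph`, `dijkstra`, `mst`, `kmb`, `ikmb`) are illustrative, and ties are broken arbitrarily. The example graph is a hypothetical three-pin net with one candidate Steiner node `s`.

```python
import heapq
from itertools import combinations


def make_graph(edges):
    g = {}
    for u, v, w in edges:
        g.setdefault(u, {})[v] = w
        g.setdefault(v, {})[u] = w
    return g


def dijkstra(graph, src):
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def mst(nodes, edges):
    """Kruskal's algorithm on (weight, u, v) tuples; returns the tree edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def kmb(graph, terminals):
    """KMB heuristic: returns (cost, edges) of a tree spanning the terminals."""
    terminals = sorted(set(terminals))
    if len(terminals) < 2:
        return 0, []
    sp = {t: dijkstra(graph, t) for t in terminals}
    # MST of the complete shortest-path distance graph over the terminals.
    closure = [(sp[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    sub = set()
    for _, a, b in mst(terminals, closure):
        v = b  # expand the closure edge into a shortest path of graph edges
        while v != a:
            u = sp[a][1][v]
            sub.add((graph[u][v], min(u, v), max(u, v)))
            v = u
    tree = mst({n for _, u, v in sub for n in (u, v)}, sub)
    while True:  # prune non-terminal leaves
        deg = {}
        for _, u, v in tree:
            deg[u] = deg.get(u, 0) + 1
            deg[v] = deg.get(v, 0) + 1
        leaves = {n for n, d in deg.items() if d == 1 and n not in terminals}
        if not leaves:
            break
        tree = [e for e in tree if e[1] not in leaves and e[2] not in leaves]
    return sum(w for w, _, _ in tree), tree


def ikmb(graph, net, candidates=None):
    """Greedily add the candidate Steiner node with the largest KMB savings."""
    net = set(net)
    pool = (set(graph) if candidates is None else set(candidates)) - net
    steiner, cost = set(), kmb(graph, net)[0]
    while pool - steiner:
        gain, x = max((cost - kmb(graph, net | steiner | {x})[0], x)
                      for x in pool - steiner)
        if gain <= 0:
            break
        steiner.add(x)
        cost -= gain
    return kmb(graph, net | steiner)


g = make_graph([("a", "s", 5), ("b", "s", 5), ("c", "s", 5),
                ("a", "b", 8), ("b", "c", 8), ("a", "c", 8)])
kmb_cost = kmb(g, ["a", "b", "c"])[0]
ikmb_cost = ikmb(g, ["a", "b", "c"])[0]
```

On this toy graph plain KMB connects the pins with two direct edges (cost 16), while the iterated version discovers that branching at `s` saves one unit (cost 15). Passing `candidates` restricts the search to a given set of nodes, mirroring the use of candidate Steiner switch blocks.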
The placement and global-routing phases seek to minimize congestion, thereby enabling the detailed router to find a feasible (and high-quality) solution more easily. However, since it is NP-complete to determine whether there exists a feasible detailed-routing solution for all nets [41], we use a deterministic net-ordering scheme to route nets one at a time. When a detailed-routing solution for a net is found, the corresponding routing resources are committed to that net and are made unavailable for subsequent nets (i.e., they are removed from the underlying graph). If infeasibility is encountered during the detailed routing of a net (i.e., some logic-block pin is unreachable in the routing graph from the other pins of the net), the following two heuristics are employed.
First, an incremental "wavefront-expansion" technique is used to gradually "loosen" the global route, allowing the detailed route to detour around local blockages caused by previously routed nets (Fig. 8). Note that wavefront expansion determines the region searched by the routing algorithm, as opposed to the order in which graph edges are explored [22].
Second, we strive to minimize congestion, which is a measure of resource utilization. To gauge congestion, we divide routing resources into disjoint groups according to functional similarity and physical proximity. For example, all channel edges interconnecting the same two switch blocks form a group, as do all edges inside a particular switch block. As nets are routed, the detailed router updates each group's congestion information (i.e., the number of edges in each group taken by all previously routed nets). Multi-objective optimization is used in the IKMB graph searches to heuristically minimize a combination of wirelength and congestion (see the Appendix for additional details). Thus, within the region specified by the global route, our detailed router searches for a feasible solution minimizing both congestion and wirelength.
We found that in practice, the majority of those nets that fail to route using the initial global route become routable after only a single loosening operation. In cases where wavefront expansion fails to produce a routing solution, we employ a "move-to-front" heuristic [36], where unroutable nets are moved to the beginning of the net-routing order and the new routing order is attempted.

FIGURE 7 The Iterated-KMB (IKMB) algorithm.
    Input: A weighted graph $G=(V,E)$ and a net $N \subseteq V$
    Output: A low-cost tree spanning $N$
    $S = \emptyset$
    While $C = \{x \in V - N \mid \Delta\mathrm{KMB}_G(N \cup S, \{x\}) > 0\} \neq \emptyset$ Do
        Find $x \in C$ with maximum $\Delta\mathrm{KMB}_G(N \cup S, \{x\})$
        $S = S \cup \{x\}$
    Return $\mathrm{KMB}_G(N \cup S)$
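The move-to-front net-ordering heuristic can be sketched as a small driver loop. The sketch assumes a caller-supplied `try_route` callback that returns a route or `None`, and a cap on the number of reorderings; both are illustrative assumptions, as is the toy resource model in which each net simply claims one free track.

```python
def route_with_move_to_front(nets, try_route, max_restarts=10):
    """Route nets one at a time; on failure, move the failed net to the front.

    try_route(net, committed) returns a route for `net` given the routes
    already committed, or None if the net is unroutable.
    Returns (order, committed) on success, or None if the restart cap
    (an assumed safeguard) is exhausted.
    """
    order = list(nets)
    for _ in range(max_restarts + 1):
        committed = {}
        failed = None
        for net in order:
            route = try_route(net, committed)
            if route is None:
                failed = net
                break
            committed[net] = route  # resources become unavailable to later nets
        if failed is None:
            return order, committed
        order.remove(failed)
        order.insert(0, failed)
    return None


# Toy model: each net lists the tracks it could use, in order of preference.
options = {"A": ["t1", "t2"], "B": ["t1"]}


def first_free_track(net, committed):
    used = set(committed.values())
    for track in options[net]:
        if track not in used:
            return track
    return None


result = route_with_move_to_front(["A", "B"], first_free_track)
```

With the initial order A, B, net A takes track `t1` and B becomes unroutable; after B is moved to the front, B takes `t1` and A falls back to `t2`.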

EXPERIMENTAL RESULTS
Our algorithms have been implemented using C++ in the Sun/UNIX environment and incorporated into FPR. Two FPGA architectures, corresponding to Xilinx 3000-series and 4000-series parts, were modeled [7,43] (these architectures are identical to the ones used by CGE [8] for the 3000 series, and by SEGA [32] and GBP [42] for the 4000 series). We compared the performance of these tools on fourteen large benchmark circuits: the suite of five 3000-series benchmarks used by [8], and the suite of nine 4000-series benchmarks used by [32] and [42]. The 3000-series benchmarks were routed on FPGAs with switch-block flexibility $F_s = 6$ and connection flexibility $F_c = \lceil 0.6\,W \rceil$, where $W$ is the channel width. The 4000-series benchmarks use FPGAs with $F_s = 3$ and $F_c = W$.
During FPGA physical design, a common objective is to minimize maximum channel width (smaller channel width implies the ability to route larger designs on a fixed-size part). Table I shows the maximum channel widths of actual complete placement and routing solutions produced by FPR; these compare favorably with CGE [8] for the 3000-series benchmarks, and with SEGA [32] and GBP [42] for the 4000-series benchmarks. The channel width required by FPR is smaller than that required by CGE, SEGA, and GBP in 8 of the 14 benchmark circuits, and is equal on all but one of the remaining 6 benchmark circuits (further improvements have been recently obtained in [4]).
We also measured how well FPR optimizes total wirelength and maximum source-sink pathlengths, or radius. Since previous works do not report these statistics, we have implemented a modified version of FPR, called FPR-S, that uses unrooted Steiner trees as thumbnails [17], instead of the preferred arborescence thumbnails described in Section 3. We compared the solutions produced by FPR-S against performance-oriented solutions produced by the unmodified FPR tool. We observe that the additional 1.0% in wirelength used by FPR yields a 6.7% decrease in radius (Table II). We believe the 1.0% total wirelength difference is insignificant but the 6.7% difference in average radius is significant. Therefore we recommend the use of FPR, with its use of RSAs, over FPR-S and other similar tree-based tools. The time to run FPR is comparable to other tools: CPU times to completely lay out the circuits on a Sun SparcServer 10/514 workstation ranged from several minutes for the smallest circuit to several hours for the largest. Figure 9 shows the solution produced by FPR for the smallest of the benchmark circuits.

TABLE II Comparison of arborescence-based FPR against Steiner-tree-based FPR-S. Wirelength statistics reflect the average number of channel segments used by nets in the circuit; radius statistics reflect the average number of channel segments encountered on the longest source-sink path for each net. The Δ% column gives the percent change from FPR-S to FPR.

CONCLUSION
We have developed FPR, a placement and routing tool for FPGAs that combines a recursive geometric strategy for simultaneous placement and global routing with a general graph-based detailed-routing algorithm. FPR addresses performance issues by minimizing source-sink pathlengths as well as total wirelength and maximum channel width. FPR compares favorably to existing tools on both 3000-series and 4000-series Xilinx-type parts, as measured by the maximum channel width required for complete layout of a number of industrial benchmarks.

APPENDIX
tic to operate on multi-weighted graphs, where each of the $k$ optimization criteria is modeled by a separate set of edge weights. The simultaneous optimization is accomplished by transforming these multiple edge weights into a single weighted average, which is then used by IKMB in the normal way. The relative magnitudes of the weighting factors $d_1, d_2, \ldots, d_k$ (i.e., tradeoff parameters) are designer controlled, enabling a smooth tradeoff among the various competing objectives. This technique is flexible in that new criteria are easily incorporated into the model by introducing additional weight sets into the graph. Such a framework subsumes, e.g., "alpha-beta" routing (which has been used for jog minimization in IC design [12,23]), and also has practical application in non-VLSI domains [13].
Let $V = \{v_1, v_2, \ldots, v_n\}$ be a set of nodes, and let $E \subseteq V \times V$ be a set of edges. We define a $k$-weighted graph $G=(V,E)$ to be a weighted graph with a vector-valued weight function $\vec{w}: E \to \mathbb{R}^k$. In other words, associated with each edge $e_{ij} \in E$ is a vector of $k$ real-valued weights $\vec{w}_{ij} = (w_{ij1}, w_{ij2}, \ldots, w_{ijk})$. Note that ordinary weighted graphs are a special case of $k$-weighted graphs, with $k = 1$.
Let $\vec{d} = (d_1, d_2, \ldots, d_k)$ be a vector of $k$ real-valued tradeoff parameters, where $0 \le d_i \le 1$ for $1 \le i \le k$, and $\sum_{i=1}^{k} d_i = 1$. From the $k$-weighted graph $G=(V,E)$ and the tradeoff parameters $\vec{d}$ we construct a new weighted tradeoff graph $G(\vec{d}) = (V,E)$ with weight function $w_{ij} = \vec{w}_{ij} \cdot \vec{d} = \sum_{m=1}^{k} d_m \cdot w_{ijm}$. The tradeoff graph $G(\vec{d})$ is an ordinary weighted graph having the same topology as $G$, but whose single edge weights represent the weighted averages of the multi-weights of $G$, with respect to $\vec{d}$.
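As a small illustration of the tradeoff-graph construction, the sketch below collapses per-edge weight vectors into single weights. The dictionary representation, the function name, and the example weights (read here as wirelength and congestion) are assumptions made for illustration.

```python
def tradeoff_graph(multi_weights, d, tol=1e-9):
    """Collapse a k-weighted graph into an ordinary weighted graph.

    multi_weights maps an edge (u, v) to its tuple of k weights;
    d holds the k tradeoff parameters, each in [0, 1], summing to 1.
    """
    if any(x < 0 or x > 1 for x in d) or abs(sum(d) - 1.0) > tol:
        raise ValueError("tradeoff parameters must lie in [0, 1] and sum to 1")
    collapsed = {}
    for edge, w in multi_weights.items():
        if len(w) != len(d):
            raise ValueError("each edge needs one weight per criterion")
        # Single weight: the d-weighted average of the edge's multi-weights.
        collapsed[edge] = sum(dm * wm for dm, wm in zip(d, w))
    return collapsed


# Two criteria per edge, e.g. (wirelength, congestion).
multi = {("a", "b"): (4, 0), ("b", "c"): (2, 6)}
balanced = tradeoff_graph(multi, (0.5, 0.5))
wirelength_only = tradeoff_graph(multi, (1.0, 0.0))
```

Setting one parameter to 1 and the rest to 0 recovers the graph induced by a single criterion, so the same routine also produces the induced graphs used in the bounds.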
Let $\vec{1} = (1, \ldots, 1)$, and let $\vec{v}_i = (0, \ldots, 0, v_i, 0, \ldots, 0)$ denote the vector obtained from the vector $\vec{v}$ by using $v_i$ in the $i$th place, and the rest of the places set to zero. Thus, $\vec{1}_i$ denotes the vector consisting of zeros everywhere except the $i$th place, which will contain a 1. A $k$-weighted graph $G$ induces $k$ distinct graphs $G_i = G(\vec{1}_i)$, each with an identical topology but with edge weights restricted to only one of the $k$ components of the vector-valued weight function $\vec{w}$.
We define the minimum spanning tree for a multi-weighted graph $G$ with respect to the tradeoff parameters $\vec{d}$ as the ordinary MST over the tradeoff graph $G(\vec{d})$, and denote it by $\mathrm{MST}(G(\vec{d}))$.
Similarly, we can compute the MST on each of the $k$ induced graphs $G_i$, and we denote these $\mathrm{MST}(G_i)$. For convenience we will also use MST to denote the cost of the MST.
We then clearly have that the cost of edge $e_{ij}$ in all $k$ trees is $\sum_{m=1}^{k} w_{ijm}$, and the cost of this edge scaled by the tradeoff parameters is $\sum_{m=1}^{k} d_m \cdot w_{ijm}$, which is equal to the cost of this edge in $\mathrm{MST}(G(\vec{d}))$.
Clearly, if all of the $k$ trees $\mathrm{MST}(G_m)$, $1 \le m \le k$, contain the same edges as $\mathrm{MST}(G(\vec{d}))$, then equality holds and the theorem is true. On the other hand, if $\mathrm{MST}(G(\vec{d}))$ contains an edge that is not in $\mathrm{MST}(G_m)$, $1 \le m \le k$, then the cost of $\mathrm{MST}(G(\vec{d}))$ relative to $\sum_{m=1}^{k} d_m \cdot \mathrm{MST}(G_m)$ can only increase.
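The argument above can be condensed into a single chain of inequalities. The following is a sketch that assumes the lower bound being proved is $\mathrm{MST}(G(\vec{d})) \ge \sum_{m=1}^{k} d_m \cdot \mathrm{MST}(G_m)$, which is what the proof text implies; $T^{*}$ is notation introduced here for the edge set of $\mathrm{MST}(G(\vec{d}))$.

```latex
\begin{align*}
\mathrm{MST}(G(\vec{d}))
  &= \sum_{e_{ij} \in T^{*}} \; \sum_{m=1}^{k} d_m \, w_{ijm}
   = \sum_{m=1}^{k} d_m \sum_{e_{ij} \in T^{*}} w_{ijm} \\
  &\ge \sum_{m=1}^{k} d_m \cdot \mathrm{MST}(G_m).
\end{align*}
```

The inequality holds because $T^{*}$ is a spanning tree of every induced graph $G_m$, so its cost under the $m$th weight component is at least $\mathrm{MST}(G_m)$, and each $d_m$ is non-negative. Equality occurs when $T^{*}$ is simultaneously a minimum spanning tree of every $G_m$.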
Next, we prove the non-existence of general upper bounds. Ideally, we would like to bound the MST cost of arbitrary multi-weighted graphs in terms of only the costs of the $\mathrm{MST}(G_i)$'s, $\vec{d}$, and $n$.
Unfortunately, this is impossible in general:
THEOREM 8.2 For any $k$-weighted graph $G$ over $n$ vertices, and tradeoff parameters $\vec{d}$, the tradeoff graph cost $\mathrm{MST}(G(\vec{d}))$ cannot be bounded from above by any function of only the $\mathrm{MST}(G_i)$'s, $\vec{d}$, $n$, and $k$.
Proof Consider the 2-weighted graph $G=(V,E)$ over $n = 3$ nodes, where $k = 2$ and $0 \le d_1, d_2 \le 1$. Let $M$ be some very large constant, $V = \{a, b, c\}$, and $E = V \times V$, with $w_{ab1} = 0$, $w_{bc1} = 0$, $w_{ac1} = M$, and let $w_{ab2} = M$, $w_{bc2} = 0$, $w_{ac2} = 0$ (see Fig. 10). Observe that $\mathrm{MST}(G_1) = \mathrm{MST}(G_2) = 0$, and that $k = 2$, $n = 3$, $d_1$, and $d_2$ are all constants. On the other hand, $\mathrm{MST}(G(\vec{d})) = \min(d_1 \cdot M, d_2 \cdot M)$, which can be made arbitrarily large for any fixed $\vec{d}$ with $d_1, d_2 > 0$ by making $M$ large enough. Since any expression involving only constants must also be bounded by a constant, $\mathrm{MST}(G(\vec{d}))$ cannot be bounded from above by any function strictly in terms of only $\mathrm{MST}(G_1)$, $\mathrm{MST}(G_2)$, $k$, $n$, and $\vec{d}$. □
The negative result of Theorem 8.2 only applies to non-metric graphs. We now give a general upper bound for metric graphs:
THEOREM 8.3 For any metric $k$-weighted graph $G$ over $n$ vertices, and tradeoff parameters $\vec{d}$, $\mathrm{MST}(G(\vec{d})) \le (n-1) \cdot \sum_{m=1}^{k} d_m \cdot \mathrm{MST}(G_m)$.
Proof Consider an arbitrary edge $e_{ij}$ in $\mathrm{MST}(G(\vec{d}))$ and its cost, $\sum_{m=1}^{k} d_m \cdot w_{ijm}$. Consider the $m$th element in this summation, and the corresponding MST of $G_m$. $\mathrm{MST}(G_m)$ spans vertices $v_i$ and $v_j$, but does not necessarily contain the edge $e_{ij}$. Rather, a path must exist in $\mathrm{MST}(G_m)$ from $v_i$ to $v_j$, denoted $\mathrm{minpath}_{\mathrm{MST}(G_m)}(i,j)$, with cost denoted by $\mathrm{dist}_{\mathrm{MST}(G_m)}(i,j)$. By metricity, $w_{ijm} \le \mathrm{dist}_{\mathrm{MST}(G_m)}(i,j)$.
Since most nets in typical VLSI designs contain three or fewer pins [18], we derive a tighter upper bound for 3-pin nets where metricity holds (i.e., graphs with weight functions satisfying the triangle inequality $\mathrm{dist}(a,b) + \mathrm{dist}(b,c) \ge \mathrm{dist}(a,c)$, $\forall a, b, c \in V$):
THEOREM 8.4 For 2-weighted metric graphs with three nodes, and any scaling vector $\vec{d} = (d_1, d_2)$, the following holds:
Proof Let $G=(V,E)$ be a 3-node 2-weighted graph, with edge weights $(a,x)$, $(b,y)$, and $(c,z)$.
For 4-node graphs the general upper bound of Theorem 8.3 implies a multiplicative factor of $n - 1 = 3$; yet, an extensive computer-aided search has been unable to find an example of a 4-pin net with metric weights where the cost of the tradeoff MST exceeds the lower bound by more than a factor of 3/2. We therefore conjecture that our proven bounds can be made considerably tighter, and leave this as an open problem (recently, tighter bounds were indeed derived for MSTs over multi-weighted graphs [19]).
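The three-node construction used in the proof of Theorem 8.2 is easy to check numerically. The sketch below uses a small Kruskal routine written for this purpose; the function names and the chosen values of $M$, $d_1$ and $d_2$ are illustrative assumptions.

```python
def mst_cost(nodes, weights):
    """Cost of a minimum spanning tree (Kruskal); weights maps (u, v) -> w."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    total = 0
    for (u, v), w in sorted(weights.items(), key=lambda item: item[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
    return total


def counterexample(M, d1, d2):
    """Return (MST(G1), MST(G2), MST(G(d))) for the graph of Theorem 8.2."""
    nodes = "abc"
    w1 = {("a", "b"): 0, ("b", "c"): 0, ("a", "c"): M}  # first weight component
    w2 = {("a", "b"): M, ("b", "c"): 0, ("a", "c"): 0}  # second weight component
    trade = {e: d1 * w1[e] + d2 * w2[e] for e in w1}    # tradeoff graph G(d)
    return mst_cost(nodes, w1), mst_cost(nodes, w2), mst_cost(nodes, trade)


small = counterexample(1000, 0.25, 0.75)
large = counterexample(10**6, 0.25, 0.75)
```

Both induced graphs have a zero-cost spanning tree, yet the tradeoff MST costs min(d1·M, d2·M), which grows without bound as M increases while the induced MST costs stay at zero.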

FIGURE 2 (a) Partitioning template for m=n=3; (b) a sample pointset (the source is at the upper-left); (c) one of its possible thumbnails; and (d) the associated virtual pins.

FIGURE 3 All eight thumbnails for the pointset shown in the

FIGURE 5 Global-routing information (a) is used to construct a routing graph (b) for a Xilinx [43] 4000-series part with channel width 2.

FIGURE 8  Wavefront expansion is used to "loosen" global routes when infeasibility is encountered.

FIGURE 11 A general upper bound in the metric case for $\mathrm{MST}(G(\vec{d}))$ in terms of the $\mathrm{MST}(G_i)$'s, $\vec{d}$, $n$, and $k$: (a) depicts $\mathrm{MST}(G_m)$; (b) depicts $\mathrm{MST}(G(\vec{d}))$; and (c) shows how the cost of the $m$th weight component of each $e_{ij}$ can be bounded by $d_m \cdot \mathrm{MST}(G_m)$.
FIGURE 1 A typical FPGA architecture.
*Corresponding author.
†Our benchmarks and additional related papers may be found at WWW URL http://www.cs.virginia.edu/~vlsicad/.

TABLE Maximum
start by showing a general lower bound for