Tutorial on VLSI Partitioning

This tutorial introduces partitioning with applications to VLSI circuit design. The problem formulations include two-way, multiway, and multi-level partitioning, partitioning with replication, and performance driven partitioning. We describe the models used for multiple pin nets in the partitioning process. To derive optimum solutions, we describe the branch and bound method and, for a special case of circuits, the dynamic programming method. We also explain several heuristics, including group migration algorithms, network flow approaches, programming methods, Lagrange multiplier methods, and clustering methods. We conclude the tutorial with research directions.


INTRODUCTION
Automatic partitioning [5, 61, 72, 78, 95] is becoming an important topic with the advent of deep submicron technologies. An efficient and effective partitioning tool [12, 17, 19, 48, 69, 70, 77, 81, 94, 105] can drastically reduce the complexity of the design process and handle engineering change orders in a manageable scope. Moreover, the quality of the partitioning differentiates the final product in terms of production cost and system performance.
The size of VLSI designs has increased to systems of hundreds of millions of transistors. The complexity of the circuit has become so high that it is very difficult to design and simulate the whole system without decomposing it into sets of smaller subsystems. This divide and conquer strategy relies on partitioning to organize the whole system into a hierarchical tree structure.
Furthermore, a good partitioning tool can decrease the production cost and improve the system performance. With the advance of fabrication technologies, the cost of a transistor drops while the cost of input/output pads remains fairly constant. Consequently, the size of the interface between partitions, e.g., between chips, determines a significant portion of the manufacturing expenses, and the quality of the partitioning has a strong effect on production cost. In addition, in submicron designs, interconnection delays tend to dominate gate delays [8]; therefore system performance is greatly influenced by the partitions.
Partitioning has been applied to solve various aspects of VLSI design problems [5, 36]:
Physical packaging Partitioning decomposes the system in order to satisfy the physical packaging constraints. The partitioning conforms to a physical hierarchy ranging from cabinets, cases, boards, and chips to modular blocks.
Divide and conquer strategy Partitioning is used to tackle the design complexity with a divide and conquer strategy [21]. This strategy is adopted to decompose the project between team members, to construct a logic hierarchy for logic synthesis, to transform the netlist into a physical hierarchy for floorplanning, to allocate cells into regions for placement and RLC extraction, and to manipulate hierarchies between logic and layout for simulation.
System emulation and rapid prototyping One approach for system emulation and prototyping is to construct the hardware with field programmable gate arrays. Usually, the capacity of these field programmable gate arrays is smaller than current VLSI designs. Thus, these prototyping machines are composed of a hierarchical structure of field programmable gate arrays. A partitioning tool is needed to map the netlist into the hardware [110].
Hardware and software codesign For hardware and software codesign, partitioning is used to decompose the designs into hardware and software.
Management of design reuse For huge designs, especially system-on-a-chip, we have to manage design reuse. Partitioning can identify clusters of the netlist and construct functional modules out of the clusters.
While partitioning is a tool required to manage huge systems in many fields, such as efficient storage of large databases on disks and data mining, in this tutorial we focus on partitioning with applications to VLSI circuit design. In the next section, we describe the notation for the tutorial. In section three, the formulations of the partitioning problems are stated. Section four covers the models for multiple pin nets. Section five describes the partitioning algorithms. The tutorial is concluded with research directions.

PRELIMINARIES
In this section, we establish the notation used and formulate the partitioning problems addressed in our approaches. A circuit is represented by a hypergraph $H(V, E)$, where the vertex set $V = \{v_i \mid i = 1, 2, \ldots, n\}$ denotes the set of modules and the hyperedge set $E = \{e_j \mid j = 1, 2, \ldots, m\}$ denotes the set of nets. Each net $e_j$ is a subset of $V$ with cardinality $|e_j| \ge 2$. The modules in $e_j$ are called the pins of $e_j$.
The hypergraph representation for a circuit with 9 modules and 6 signal nets is shown in Figure 1, where nets $e_1$, $e_3$ and $e_5$ are two-pin nets, net $e_6$ is a three-pin net, and nets $e_2$ and $e_4$ are four-pin nets.
When the circuit has only two-pin nets, we can simplify the representation to a graph $G(V, E)$. A net connecting modules $v_i$ and $v_j$ is represented by an edge $e_{ij}$ with a connectivity $c_{ij}$. We set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. We shall show later that for certain formulations we replace multiple pin nets with models of two-pin nets. The replacement is performed when the partitioning algorithm is devised for graph models.
(i) Module Size and Net Connectivity Each module $v_i$ has a size $s_i$ in $R^+$, the positive real numbers. We define $S(V_j) = \sum_{v_i \in V_j} s_i$ to be the size of a partition $V_j$. Each net $e_i$ has a connectivity $c_i$ in $R^+$. By default, $c_i = 1$. For a bus of multiple signal lines, we can represent the bus with a net $e_i$ of connectivity $c_i$ equal to the number of lines. We can also assign higher weights to important nets, which encourages the partitioner to keep the modules of these nets in the same partition.
In this tutorial, we assume that circuits are represented as hypergraphs except when stated otherwise. Hence, the terms circuit, netlist, and hypergraph are used interchangeably throughout the tutorial.
(ii) Partitions and Cuts The set of hyperedges connecting any two-way partition $(V_1, V_2)$ of two disjoint vertex sets $V_1$ and $V_2$ is denoted by a cut $E(V_1, V_2) = \{e_i \in E \mid 0 < |e_i \cap V_1| \text{ and } 0 < |e_i \cap V_2|\}$, i.e., $e_i \in E(V_1, V_2)$ if there exist some pins of $e_i$ in $V_1$ and some different pins of $e_i$ in $V_2$. We define $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$ to be the cut count of the partition $(V_1, V_2)$.
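The cut count of a two-way partition can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the function name `cut_count`, and the small example netlist are assumptions made here and are not part of the tutorial.

```python
from typing import Dict, Set


def cut_count(nets: Dict[str, Set[str]],
              connectivity: Dict[str, float],
              v1: Set[str],
              v2: Set[str]) -> float:
    """Sum the connectivity of every net with pins in both V1 and V2."""
    total = 0.0
    for name, pins in nets.items():
        # A net is cut when it has at least one pin on each side.
        if pins & v1 and pins & v2:
            total += connectivity.get(name, 1.0)  # default connectivity is 1
    return total


# Hypothetical netlist: e2 models a bus of three signal lines.
nets = {"e1": {"a", "b"}, "e2": {"b", "c", "d"}, "e3": {"c", "d"}}
connectivity = {"e2": 3.0}
result = cut_count(nets, connectivity, {"a", "b"}, {"c", "d"})
```

Only `e2` has pins on both sides, so the cut count equals the bus width assigned to that net.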
(iii) Replication Cuts and Directed Cuts For replication cuts and performance driven partitioning, the direction of the nets makes a difference in the process. We classify the pins of each net into two types: source and sink. A directed net $e_i$ is denoted by $(a_i, b_i)$, where $a_i \subseteq V$ are the source pins of the net and $b_i \subseteq V$ are the sink pins of the net. We assume that $|a_i \cup b_i| \ge 2$, $|a_i| \ge 1$ and $|b_i| \ge 1$. Usually, each net has one source pin and multiple sink pins. However, some nets may have multiple sources which share the same interconnect line. Furthermore, one pin can be both a source pin and a sink pin of the same net. Therefore, $a_i$ and $b_i$ may have a nonempty intersection.
For two disjoint vertex sets $X$ and $Y$, we shall use $E(X \to Y)$ to denote the directed cut set from $X$ to $Y$. Net set $E(X \to Y)$ contains all the nets $e_i = (a_i, b_i)$ such that $X$ intersects the source pin set $a_i$ and $Y$ intersects the sink pin set $b_i$, i.e., $E(X \to Y) = \{e_i \mid e_i = (a_i, b_i),\ a_i \cap X \ne \emptyset,\ b_i \cap Y \ne \emptyset\}$.
We use the function $C(X \to Y)$ to denote the total cut count of the nets in $E(X \to Y)$, i.e., $C(X \to Y) = \sum_{e_i \in E(X \to Y)} c_i$.
(iv) Performance Driven Partitioning In performance driven partitioning [106], modules are distinguished into two types: combinational elements and globally clocked registers. In illustrations, we use circles to represent the combinational elements and rectangles to represent the registers (Fig. 13). Each module $v_i$ has an associated delay $d_i$.
A path $p$ of length $k$ from a module $v_i$ to a module $v_j$ is a sequence $(v_{p_0}, v_{p_1}, \ldots, v_{p_k})$ of modules such that $v_i = v_{p_0}$, $v_j = v_{p_k}$ and for each $l \in \{1, 2, \ldots, k\}$, [...]
$$\min_{v_s \in V_1,\ v_t \in V_2} C(V_1, V_2) \tag{1}$$
where $V_1$ and $V_2$ are disjoint and the union of the two sets is equal to $V$. This partitioning is strongly related to a linear placement problem. In a linear placement, we have $|V|$ equally spaced slots on a straight line (Fig. 2).
Modules $v_s$ and $v_t$ are fixed at the two extreme ends, i.e., $v_s$ on the first slot (left end) and $v_t$ on the last slot (right end). The goal is to assign all modules to distinct slots to minimize the total wire length. Let us use $x_i$ to denote the coordinate of module $v_i$ after it is assigned to a slot. The length of a net $e_i$ can be expressed as the difference of the maximum coordinate and the minimum coordinate of the modules in the net, i.e., $\max_{v_j \in e_i} x_j - \min_{v_k \in e_i} x_k$. The total wire length can be expressed as follows.
$$\sum_{e_i \in E} \left( \max_{v_j \in e_i} x_j - \min_{v_k \in e_i} x_k \right) \tag{2}$$
The relation between partitioning and placement can be derived under the assumption that all nets are two-pin nets [50].
THEOREM 3.1 Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be a min-cut partition separating modules $v_s$ and $v_t$. Let $v_s$ and $v_t$ be the two modules located at the two extreme ends of a linear placement. Then, there exists an optimal linear placement solution such that all modules in $V_2$ are on the slots to the right of all modules in $V_1$ (Fig. 2).
FIGURE 2 $(V_1, V_2)$ is a min-cut separating modules $v_s$ and $v_t$. There exists an optimal linear placement in which the modules in $V_2$ are on the right side of the modules in $V_1$.
Thus, we can use the min-cut to partition a linear placement into two smaller problems and still maintain optimality. Conceptually, the modules in $V_1$ (or in $V_2$) have stronger connections within their own set than to the other set. Thus, if the spans of the modules in $V_1$ and in $V_2$ overlap in a linear placement, we can slide all modules in $V_1$ to the left and all modules in $V_2$ to the right to reduce the total wire length. In fact, this is the procedure used to prove the theorem.
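The total wire length of expression (2) is simple to evaluate for a given slot assignment. The sketch below assumes nets are given as sets of module names and slots as integer coordinates; the function name and the four-module example are hypothetical.

```python
from typing import Dict, Iterable, Set


def total_wire_length(nets: Iterable[Set[str]], slot: Dict[str, int]) -> int:
    """Sum over all nets of (max pin coordinate - min pin coordinate)."""
    return sum(max(slot[v] for v in pins) - min(slot[v] for v in pins)
               for pins in nets)


# Hypothetical chain s - a - b - t plus a direct net between s and t.
nets = [{"s", "a"}, {"a", "b"}, {"b", "t"}, {"s", "t"}]
in_order = {"s": 0, "a": 1, "b": 2, "t": 3}
swapped = {"s": 0, "a": 2, "b": 1, "t": 3}   # a and b exchanged

length_in_order = total_wire_length(nets, in_order)
length_swapped = total_wire_length(nets, swapped)
```

Exchanging the two inner modules lengthens the nets to both fixed ends, which is the effect the sliding argument above exploits.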
The min-cut with no size constraints can be found in polynomial time using classical maximum flow techniques [1]. However, it may happen that the optimal solution separates only $v_s$ or $v_t$ from the rest of the modules, i.e., $V_1 = \{v_s\}$ or $V_2 = \{v_t\}$. This result is very likely to occur because most basic VLSI modules have very small degrees of connecting nets (e.g., the degree of a 3-input NAND gate is 4).
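As an illustration of the maximum flow approach, the following sketch uses breadth-first augmenting paths (the Edmonds-Karp variant, chosen here for brevity and not necessarily the technique of [1]) on a graph of two-pin nets. The function name, the edge-list format, and the example circuit are assumptions.

```python
from collections import defaultdict, deque


def min_cut(edges, s, t):
    """Return (cut value, V1) for a min-cut separating s and t.

    edges: iterable of (u, v, connectivity) for undirected two-pin nets.
    """
    cap = defaultdict(lambda: defaultdict(float))
    for u, v, c in edges:
        cap[u][v] += c   # an undirected net is modelled as two arcs
        cap[v][u] += c
    flow = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, residual in cap[u].items():
                if residual > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No path left: the modules reached from s form V1.
            return flow, set(parent)
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical circuit in which s has a single weak connection.
edges = [("s", "a", 1), ("a", "b", 5), ("a", "c", 5),
         ("b", "c", 5), ("c", "t", 5)]
value, v1 = min_cut(edges, "s", "t")
```

In this example the min-cut isolates `s` alone, which is the trivial outcome described above.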

3.1.2. Minimum Cost Ratio Cut
The cost ratio cut formulation supplies a partition different from the min-cut that separates two fixed modules. Thus, if the min-cut cannot provide any nontrivial solution, we may adopt the cost ratio cut to perform another trial.
In cost ratio cut, we fix two modules $v_s$ and $v_t$ at two different sides. Our objective is to find a vertex set $A$ to minimize a cost ratio function:
$$\min_{A} \frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} \tag{3}$$
where vertex set $A$ does not contain $v_s$ and $v_t$. Vertex set $A$ is non-empty, i.e., $S(A) > 0$.
Cost ratio cut is also strongly related to a linear placement. Assuming that all nets are two-pin nets, we can derive the following theorem [22]:
THEOREM 3.2 Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be an optimal cost ratio cut partition. There exists an optimal linear placement solution such that all modules in $A$ are on the slots left of all modules in $V - A - \{v_s\}$.
Conceptually, $C(A, V - A - \{v_s\})$ is the force that pulls $A$ to the right and $C(A, \{v_s\})$ is the force that pushes $A$ to the left. The denominator $S(A)$ is the inertia of the set $A$. A set $A$ with the minimum cost ratio moves with the fastest acceleration toward the left end of the slots.
Example In Figure 3, the circuit contains six modules. The optimum cost ratio cut solution has $A = \{v_1, v_2, v_3\}$. The cost ratio value is [...] (4). The cost ratio value of any other choice of set $A$ is larger than expression (4).
FIGURE 3 A six module circuit to illustrate the cost ratio cut.
The cost ratio cut solution can be found in polynomial time for the special case of series-parallel graphs [22]. We are unaware of algorithms for general cases. Note that the solution may have $V - A - \{v_s\}$ equal to the set $\{v_t\}$. In such a case, the partitioning result is not useful for decomposing the circuit.

Min-cut with Size Constraints
For min-cut with size constraints, we have lower and upper bounds on the partition size, $S_l$ and $S_u$, where $0 < S_l \le S_u < S(V)$ and $S_l + S_u = S(V)$.
The bipartitioning problem is to divide vertex set $V$ into two nonempty partitions $V_1$, $V_2$, where $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$, with the objective of minimizing cut count $C(V_1, V_2)$ subject to the following size constraints:
$$S_l \le S(V_h) \le S_u \quad \text{for } h = 1, 2.$$
The min-cut problem with size constraints is NP-complete [43]. However, because of the importance of the problem in many applications, many heuristic algorithms have been developed.
Random Partitioning We use a random partition estimation of min-cut with size constraints to demonstrate that the quality variation of partitioning results can be significant. Let us simplify the case by assigning the modules a uniform size, i.e., $s_i = 1$ for all $v_i$ in $V$, and the nets a uniform connectivity, i.e., $c_i = 1$ for all $e_i$ in $E$. Let us assume that the modules are partitioned into two sets $V_1$, $V_2$ with equal sizes: $S(V_1) = S(V_2)$.
The partition is performed with an independent random process [10] so that each module has a 50% chance to go to either side. For a net $e_i$ of two pins, we can derive that net $e_i$ belongs to the cut set $E(V_1, V_2)$ with probability 0.5 (Fig. 4).
Similarly, we can derive that for a net $e_i$ of $k$ pins ($k > 2$), the probability that net $e_i$ belongs to cut set $E(V_1, V_2)$ is $(2^k - 2)/2^k$. This probability is larger than 0.5 and approaches one as $k$ increases.
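The cut probability for a $k$-pin net can be checked by enumerating all $2^k$ equally likely side assignments of its pins: only the two assignments that place every pin on the same side leave the net uncut. The function names below are illustrative.

```python
from fractions import Fraction
from itertools import product


def cut_probability(k: int) -> Fraction:
    """Closed form: (2^k - 2) / 2^k."""
    return Fraction(2 ** k - 2, 2 ** k)


def cut_probability_by_enumeration(k: int) -> Fraction:
    """Count the side assignments of k pins that use both sides."""
    cut = sum(1 for sides in product((0, 1), repeat=k) if len(set(sides)) == 2)
    return Fraction(cut, 2 ** k)


two_pin = cut_probability(2)     # Fraction(1, 2)
four_pin = cut_probability(4)    # Fraction(7, 8)
```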
In other words, the expected cut count $C(V_1, V_2)$ is equal to or larger than half the number of nets. For example, a circuit of one million modules usually has a number of nets of the same order, i.e., $|E| = O(|V|) = 1{,}000{,}000$. The expected cut count would be $C(V_1, V_2) \ge 500{,}000$. This number is much worse than the results we can achieve. In practice, the cut counts on circuits of a million modules are usually no more than several thousand [34, 36]. In other words, the probability that a net belongs to a cut set is small, below one percent for a circuit of one million gates.
Suppose the two bounds of partitioned sizes are not equal, $S_l \ne S_u$. Using the proposed random graph model, the expected cut count $C(V_1, V_2)$ is proportional to the product of the two sizes, i.e., $S(V_1) \times S(V_2)$. Consequently, the expected cut count is smallest if the size of one partition approaches the upper bound, $S(V_1) = S_u$, and the size of the other partition approaches the lower bound, $S(V_2) = S_l$. In practice, we do observe this behavior. One partition is fully loaded to its maximum capacity, while the other partition is underutilized with a large capacity left unused. This phenomenon is not desirable for certain applications.

3.1.4. Ratio Cut
Ratio cut formulation integrates the cut count and a partition size balance criterion into a single objective function [87, 109]. Given a partition $(V_1, V_2)$ where $V_1$ and $V_2$ are disjoint and $V_1 \cup V_2 = V$, the objective function is defined as
$$\frac{C(V_1, V_2)}{S(V_1) \times S(V_2)}$$
The numerator of the objective function minimizes the cut count while the denominator avoids uneven partition sizes. Like many other partitioning problems, finding the ratio cut in a general network belongs to the class of NP-complete problems [87].
Any other partition corresponds to a much larger cost.
The Clustering Property of the Ratio Cut The clustering property of the ratio cut can be illustrated by a random graph model. Let us assume that the circuit is a uniformly distributed random graph, with uniform module sizes, i.e., $s_i = 1$. We construct the nets connecting each pair of modules with identical independent probability $f$.
FIGURE 5 An example of seven modules, where partition $(V_1, V_2)$ is a minimum ratio cut.
Consider a cut which partitions the circuit into two subsets $V_1$ and $V_2$ with comparable sizes $\alpha |V|$ and $(1 - \alpha) |V|$ respectively, where $\alpha < 1$. The expected cut count equals the probability $f$ multiplied by the number of possible nets between $V_1$ and $V_2$:
$$\mathrm{Expec}(C(V_1, V_2)) = \alpha (1 - \alpha) |V|^2 f \tag{7}$$
On the other hand, if another cut separates only one module $v_s$ from the rest of the modules, the expected cut count is
$$\mathrm{Expec}(C(\{v_s\}, V - \{v_s\})) = (|V| - 1) f \tag{8}$$
As $|V|$ approaches infinity, the value of Eq. (7) becomes much larger than that of Eq. (8).
This derivation provides another explanation of why the min-cut separating two fixed modules tends to generate very uneven sized subsets. The very uneven sized subsets naturally give the lowest cut value. Therefore, the ratio value $C(V_1, V_2)/(S(V_1) \times S(V_2))$ is proposed to alleviate the hidden size effect. As a consequence, the expected value of this ratio is a constant with respect to different cuts:
$$\mathrm{Expec}\left( \frac{C(V_1, V_2)}{S(V_1) \times S(V_2)} \right) = f \tag{9}$$
Thus, if the nets of the graph are uniformly distributed, all cuts have the same expected ratio value. In other words, the choice of the cuts and the partition sizes makes no difference in such a uniformly distributed random graph. In a general circuit, different cuts generate different ratios. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their corresponding ratios defines the sparsest cut, since this cut deviates the most from the expectation on a uniformly distributed graph.
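To make the ratio $C(V_1, V_2)/(S(V_1) \times S(V_2))$ concrete, the sketch below searches all two-way partitions of a tiny graph by brute force. It assumes unit module sizes and two-pin nets, and it is exponential in the number of modules, so it is only meant for small illustrations. The function names and the two-triangle example are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def ratio_cut_value(edges, v1, v2):
    """C(V1, V2) / (|V1| * |V2|) for unit-size modules."""
    cut = sum(c for u, v, c in edges if (u in v1) != (v in v1))
    return Fraction(cut, len(v1) * len(v2))


def min_ratio_cut(vertices, edges):
    """Exhaustive search over all partitions; V1 is the smaller side."""
    vertices = sorted(vertices)
    best_value, best_side = None, None
    for r in range(1, len(vertices) // 2 + 1):
        for subset in combinations(vertices, r):
            v1 = set(subset)
            v2 = set(vertices) - v1
            value = ratio_cut_value(edges, v1, v2)
            if best_value is None or value < best_value:
                best_value, best_side = value, v1
    return best_value, best_side


# Two triangles joined by a single net between modules 3 and 4.
edges = [(1, 2, 1), (2, 3, 1), (1, 3, 1),
         (4, 5, 1), (5, 6, 1), (4, 6, 1), (3, 4, 1)]
best_value, best_side = min_ratio_cut(range(1, 7), edges)
```

The search returns the partition that separates the two triangles, i.e., the cut through the weakly connected part of the graph.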

Multi-way Partitioning
For multi-way partitioning, we discuss a k-way partitioning with fixed size constraints and a cluster ratio cut. These two problems are the extensions of the min-cut with fixed size constraints and the ratio cut from two-way to multiway partitioning, respectively.

K-way Partitioning
For multi-way partitioning, we separate vertex set $V$ into $k$ disjoint subsets $(V_1, V_2, \ldots, V_k)$, where $k > 2$. There is an upper bound $S_u$ and a lower bound $S_l$ on the size of each subset $V_i$, i.e., $S_l \le S(V_i) \le S_u$.
There are different ways to formulate the cut cost because of the different criteria used to count the cost of multiple pin nets. In the following we list a few possible objective functions.
(i) Minimize the cut count,
$$C(V_1, V_2, \ldots, V_k) = \sum_{e_i \in E(V_1, V_2, \ldots, V_k)} c_i \tag{10}$$
(ii) Minimize the sum of cut counts of all vertex sets. Let us denote the cut count of vertex set $V_i$ to be $C(V_i) = \sum_{e_j \in E(V_i)} c_j$. The sum of cut counts of all subsets can be expressed as
$$\sum_{i=1}^{k} C(V_i)$$
Thus, the cost of a net connecting three subsets is more expensive than the same net connecting two subsets.
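The two objective functions can be compared on a small example. The sketch below assumes that each module is labelled with the index of its subset; a net that spans $p$ subsets is counted once in objective (i) and $p$ times in objective (ii), because it is external to each of the subsets it touches. The names `kway_costs` and `part_of` are illustrative.

```python
from typing import Dict, Set, Tuple


def kway_costs(nets: Dict[str, Set[str]],
               connectivity: Dict[str, int],
               part_of: Dict[str, int]) -> Tuple[int, int]:
    """Return (cut count, sum of per-subset cut counts)."""
    cut_count = 0
    sum_of_cuts = 0
    for name, pins in nets.items():
        parts = {part_of[v] for v in pins}   # subsets touched by this net
        if len(parts) > 1:
            c = connectivity.get(name, 1)
            cut_count += c                    # objective (i)
            sum_of_cuts += c * len(parts)     # objective (ii)
    return cut_count, sum_of_cuts


# Hypothetical 3-way partition of four modules.
nets = {"e1": {"a", "b"}, "e2": {"a", "c"}, "e3": {"b", "c", "d"}}
part_of = {"a": 0, "b": 0, "c": 1, "d": 2}
costs = kway_costs(nets, {}, part_of)
```

Here `e2` spans two subsets and `e3` spans three, so objective (ii) charges `e3` more than `e2` while objective (i) charges them equally.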

Cluster Ratio Cut
Cluster ratio cut is an extension of ratio cut from two-way partitioning to multiway partitioning. There is no bound on the size of each subset. Furthermore, the number of partitions, $k$, is not fixed; instead it is part of the objective function:
$$R_c = \min_{k > 1} \frac{C(V_1, V_2, \ldots, V_k)}{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} S(V_i) \times S(V_j)}$$
If the number of partitions is one, the denominator becomes zero. Thus, $k$ is restricted to be larger than one.
Example Figure 6 shows a fifteen module circuit. The modules are of unit size and the nets are of unit connectivity. The square dot in the figure represents a hypernet. The partition shown by the dashed line is a minimum cluster ratio cut. The cost of the cut is
$$\frac{4}{\frac{1}{2}\left[4(15-4) + 3(15-3) + 4(15-4) + 4(15-4)\right]} = \frac{1}{21} \tag{15}$$
The physical intuition of cluster ratio can be explained using a random graph model [10]. Let $G$ be a uniformly distributed random graph. We construct the nets connecting each pair of modules with identical independent probability $f$. Since the nets are uniformly distributed, the probability of finding a subgraph which is significantly denser than the rest of the graph is very small, meaning that there is no distinct cluster structure in $G$.
The expected cut count of a partition is $f \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |V_i| \times |V_j|$, and the expected value of cluster ratio equals
$$\mathrm{Expec}(R_c) = \mathrm{Expec}\left( \frac{C(V_1, V_2, \ldots, V_k)}{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |V_i| \times |V_j|} \right) = f \tag{17}$$
Since $f$ is a constant, all cuts have the same expected cluster ratio value. Therefore, if we use cluster ratio as the metric, all cuts would be equally favored, which is consistent with the fact that $G$ has no distinct clusters. However, in a general circuit, different cuts generate different ratio values. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their cluster ratio values defines the cluster structure of the circuit, since this cut deviates the most from the cuts of a uniformly distributed graph.
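The cluster ratio of a given partition is the cut count divided by the sum of the pairwise products of the subset sizes. The following sketch evaluates it for subset sizes 4, 3, 4 and 4, as in the fifteen module example, with a cut count of 4 that is assumed here for illustration. The function name is hypothetical and integer sizes are assumed so that exact fractions can be used.

```python
from fractions import Fraction
from itertools import combinations
from typing import Sequence


def cluster_ratio(cut_count: int, sizes: Sequence[int]) -> Fraction:
    """Cut count over the sum of S(Vi) * S(Vj) for all pairs i < j."""
    if len(sizes) < 2:
        raise ValueError("cluster ratio needs at least two subsets")
    denominator = sum(a * b for a, b in combinations(sizes, 2))
    return Fraction(cut_count, denominator)


sizes = [4, 3, 4, 4]
total = sum(sizes)
# The pairwise sum equals half the sum of S(Vi) * (S(V) - S(Vi)).
half_sum = sum(s * (total - s) for s in sizes) // 2
ratio = cluster_ratio(4, sizes)
```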

Multi-level Partitioning
In multi-level partitioning [4, 23, 47, 58, 67, 68, 109, 110], the final result is represented by a tree structure. All the modules are assigned to the leaves of the tree. The tree is directed from the root toward the leaves. The level of a node is defined to be the maximum number of nodes to traverse to reach the leaves. Thus, the leaves are ranked level zero. Each node is one level above the maximum level of its children. When the level of the root is only one, the problem degenerates to two-way or multiway partitioning.
Each net $e_i$ spans a set of leaves. Given a set of leaves, there is a unique lowest common ancestor. The level of the lowest common ancestor is defined to be the level $l(e_i)$ of the net.
The cost of a net $e_i$ is defined to be the product of its connectivity $c_i$ and the weight $w(l(e_i))$ of level $l(e_i)$ for net $e_i$ to communicate, i.e., $c_i \times w(l(e_i))$. The cost of the multi-level partition is the sum of the cost of all nets, i.e., $\sum_{e_i \in E} c_i\, w(l(e_i))$.
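The cost of a multi-level partition can be sketched as follows. The tree is given as a hypothetical `children` map, the module-to-leaf assignment as `leaf_of`, and the level weight as a function `w`; all of these names, and the choice $w(l) = 2^l$ in the example, are assumptions of this illustration. A net whose pins all lie in one leaf has that leaf as its lowest common ancestor and therefore level zero.

```python
from typing import Callable, Dict, Hashable, List, Set


def node_levels(children: Dict[Hashable, List[Hashable]],
                root: Hashable) -> Dict[Hashable, int]:
    """Leaves are level 0; a node is one level above its highest child."""
    levels: Dict[Hashable, int] = {}

    def visit(node):
        kids = children.get(node, [])
        levels[node] = 0 if not kids else 1 + max(visit(k) for k in kids)
        return levels[node]

    visit(root)
    return levels


def multilevel_cost(children, root, leaf_of: Dict[str, Hashable],
                    nets: Dict[str, Set[str]], connectivity: Dict[str, int],
                    w: Callable[[int], int]) -> int:
    levels = node_levels(children, root)
    parent = {k: p for p, kids in children.items() for k in kids}

    def ancestors(node):
        chain = [node]              # ordered from the node up to the root
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain

    total = 0
    for name, pins in nets.items():
        leaves = sorted({leaf_of[v] for v in pins})
        common = ancestors(leaves[0])
        for leaf in leaves[1:]:
            on_path = set(ancestors(leaf))
            common = [n for n in common if n in on_path]
        # common[0] is the lowest common ancestor of the spanned leaves.
        total += connectivity.get(name, 1) * w(levels[common[0]])
    return total


children = {"root": ["A", "B"], "A": ["L0", "L1"], "B": ["L2", "L3"]}
leaf_of = {"m0": "L0", "m1": "L0", "m2": "L1", "m3": "L3"}
nets = {"n1": {"m0", "m1"}, "n2": {"m0", "m2"}, "n3": {"m1", "m3"}}
cost = multilevel_cost(children, "root", leaf_of, nets, {"n3": 3},
                       lambda level: 2 ** level)
```

In the example, `n1` stays inside a leaf (level 0), `n2` meets at node `A` (level 1), and `n3` meets at the root (level 2), so the weighted costs are 1, 2 and 12.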

J-level K-way Partitioning
When the root of the partitioning tree is at level $j$ and the number of branches of each node is no more than $k$, we call it a $j$-level $k$-way partition. We can set different communication weights for each level. Usually, the function is monotone, i.e., $w(l)$ is larger when level $l$ increases. The vertex set $V_i$ of each leaf has its size bounded by $S_l \le S(V_i) \le S_u$.
For electronic packaging, the tree is bounded by the number of external connections. We say a leaf is covered by a node if there is a directed path from the node to the leaf in the tree representation. For each node $n_i$, we define $T_i$ to be the union of the modules in the leaves covered by node $n_i$. Let $E(T_i)$ be the external nets of $T_i$, i.e., $E(T_i) = \{e_j \mid 0 < |e_j \cap T_i| < |e_j|\}$. The cut count of each node should not exceed the capacity of the external connection of the packaging, i.e.,
$$C(T_i) = \sum_{e_j \in E(T_i)} c_j \le \mathrm{Cap}(l(n_i)) \tag{18}$$
where $\mathrm{Cap}(l(n_i))$ is the capacity of the external connection of level $l(n_i)$.
Example Figure 7 shows an example of a 3-level 5-way partitioning structure. The leaves are at level 0 and the root is at level 3. Each node has at most five children. Net $e_i = \{v_1, v_2, v_3\}$ is covered by node $n_a$ at level $l(n_a) = 2$.

Generic Binary Tree
A generic binary tree structure [110] is proposed to simplify the multi-level partitioning. There is only one constant, $S_u$, to set in the binary tree. Thus, it is much easier to make a fair comparison between different algorithms.
In a generic binary tree, each internal node has exactly two children. The weight of each level is defined to be $w(l) = 2^l$. Thus, we have the objective function
$$\min \sum_{e_i \in E} c_i\, 2^{l(e_i)}$$
subject to the constraint on the capacity of the leaves, i.e., $S(V_i) \le S_u$, where $V_i$ is the vertex set of leaf $i$. The level of the root is adjusted according to the minimization of the objective function.
Example Figure 8 illustrates a generic binary tree for partitioning. In this figure, the root is at level three. Each node has at most two children.

3.4. Replication Cut
In the replication cut problem, a subset of the circuit may be replicated to reduce the cut count of a partition [54, 64, 82]. In this section, we use a two-way partition to illustrate the problem. We fix two modules $v_s$ and $v_t$ at two sides of the cut.
We use three vertex sets to represent the partition, $V_1$, $V_2$, and $R$, where $V_1$, $V_2$, and $R$ are disjoint and $V_1 \cup V_2 \cup R = V$, $v_s \in V_1$, $v_t \in V_2$. Subsets $V_1$ and $V_2$ are separated by the cut and subset $R$ is replicated on both sides (Fig. 9).
Each copy of $R$ needs to collect a complete set of input signals in order to compute the function properly. Thus, the nets from $V_1$ to $R$ and from $V_2$ to $R$ are duplicated. However, the output signals of $R$ can be obtained from either copy of $R$. For example, nets from the right side $R$ to $V_1$ in Figure 9(b) are not duplicated because $V_1$ gets inputs from the left side $R$. For the same reason, we do not replicate the nets from the left side $R$ to $V_2$. Given two disjoint sets $V_1$ and $V_2$, let a replication cut $R(V_1, V_2)$ denote the cut set of a partitioning with $R = V - V_1 - V_2$ being duplicated. From Figure 9(b), we can see that $R(V_1, V_2)$ is the union of four directed cuts, that is,
$$R(V_1, V_2) = E(V_1 \to V_2) \cup E(V_2 \to V_1) \cup E(V_1 \to R) \cup E(V_2 \to R).$$
Let $S_l$ and $S_u$ denote the size limits on the two partitioned subsets. We state the Replication Cut Problem as follows: Given a directed circuit $G$, we want to find a replication cut $R(V_1, V_2)$ with an objective
$$\min C_R(V_1, V_2) = \sum_{e_i \in R(V_1, V_2)} c_i$$
subject to the size constraints $S_l \le S(V_1 \cup R) \le S_u$ and $S_l \le S(V_2 \cup R) \le S_u$, and the feasible condition $V_1 \cap V_2 = \emptyset$, $R = V - V_1 - V_2$.
Interpretation of the Replication Cut Suppose we rewrite the replication cut in the format
$$R(V_1, V_2) = E(V_1 \to \bar{V}_1) \cup E(V_2 \to \bar{V}_2),$$
where $\bar{V}_1$ and $\bar{V}_2$ denote the complementary sets of $V_1$ and $V_2$, i.e., $\bar{V}_1 = V - V_1$ and $\bar{V}_2 = V - V_2$.
The cut set becomes the union of $E(V_1 \to \bar{V}_1)$ and $E(V_2 \to \bar{V}_2)$. We can interpret the cut set of the replication cut $R(V_1, V_2)$ as two directed cuts on the original circuit $G$ as shown in Figure 10.
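The replication cut can be computed directly from directed cuts. In the sketch below each net is a pair (source pins, sink pins); the helper names and the five-module example are assumptions. The example also evaluates the form with complementary sets, $E(V_1 \to V - V_1) \cup E(V_2 \to V - V_2)$, to show that it gives the same net set as the union of four directed cuts.

```python
from typing import Dict, Set, Tuple

Net = Tuple[Set[str], Set[str]]   # (source pins, sink pins)


def directed_cut(nets: Dict[str, Net], x: Set[str], y: Set[str]) -> Set[str]:
    """Nets with a source pin in X and a sink pin in Y."""
    return {name for name, (src, snk) in nets.items() if src & x and snk & y}


def replication_cut(nets: Dict[str, Net], vertices: Set[str],
                    v1: Set[str], v2: Set[str]) -> Set[str]:
    """Union of the four directed cuts, with R = V - V1 - V2 replicated."""
    r = vertices - v1 - v2
    return (directed_cut(nets, v1, v2) | directed_cut(nets, v2, v1)
            | directed_cut(nets, v1, r) | directed_cut(nets, v2, r))


vertices = {"s", "a", "r1", "b", "t"}
v1, v2 = {"s", "a"}, {"b", "t"}
nets = {
    "n1": ({"s"}, {"a"}),    # inside V1
    "n2": ({"a"}, {"r1"}),   # V1 -> R, must reach both copies of R
    "n3": ({"r1"}, {"b"}),   # R -> V2, served by the copy beside V2
    "n4": ({"b"}, {"a"}),    # V2 -> V1
    "n5": ({"r1"}, {"a"}),   # R -> V1, served by the copy beside V1
}
with_replication = replication_cut(nets, vertices, v1, v2)
two_cut_form = (directed_cut(nets, v1, vertices - v1)
                | directed_cut(nets, v2, vertices - v2))
# Without replication, R placed on the V2 side: an ordinary two-way cut.
side2 = v2 | {"r1"}
without_replication = directed_cut(nets, v1, side2) | directed_cut(nets, side2, v1)
```

In this example, replicating `r1` removes `n5` from the cut, reducing the number of cut nets from three to two.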

Performance Driven Partitioning
The goal of performance driven partitioning is to generate a partition that satisfies some timing constraints. Due to the physical geometric distance and interface technology limitations, inter-partition delay contributes the dominant portion of signal propagation delay. Consequently, instead of minimizing the number of crossing nets as the only objective during partitioning, we should take into account the interpartition delay to satisfy the timing constraints.
Clock period is a major measurement for circuit performance. It is determined by the longest signal propagation delay between registers. Each crossing net is associated with an interpartition delay $\delta$ determined by VLSI technologies. Given a path $p$ from one register to another register with no interleaving registers, let $d_p$ be the sum of combinational block delays and $\delta_p$ be the sum of interpartition delays along path $p$. The longest delay $d_p + \delta_p$ among all paths $p$ should not exceed the clock period $T$, i.e.,
$$\max_p\, (d_p + \delta_p) \le T. \tag{20}$$
Now we state the performance-driven partitioning problem as follows: Given hypergraph $H(V, E)$, clock period $T$, two bounds of sizes $S_l$ and $S_u$, and interpartition delay $\delta$, find a partition $(V_1, V_2)$ with the minimum cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, and $\max_p\, (d_p + \delta_p) \le T$.
Example In Figure 11, path $p$ starts at register $v_i$ and ends at register $v_j$. The path crosses between the partition $(V_1, V_2)$ three times. Thus, the interpartition delay is $\delta_p = 3\delta$.
Replication can improve the performance of the partitioned results [83]. In Figure 12(a), vertex set $R$ is located on the side of $V_2$. Path $p$ crosses between the partition $(V_1, R \cup V_2)$ three times. By replicating vertex set $R$ (Fig. 12(b)), path $p$ needs to cross the partition only once.

Retiming
Retiming shifts the locations of the registers to improve the system performance [76].It is an effective approach to reduce the clock period.
Moreover, the process also reduces the primary input to primary output latency, which is another important measurement for circuit performance.
FIGURE 11 An illustration of performance driven partitioning.
As in [85], we assume that the combinational blocks are fine-grained. A module is called fine-grained if it can be split into several smaller modules. Alternatively, if a module cannot be split, it is called coarse-grained. The interpartition delay $\delta$ on crossing nets is inherently coarse-grained and cannot be split.
Given a path $p$, we use $r_p$ to denote the number of registers on the path. Let $W(i, j)$ denote the minimum $r_p$ among all possible paths $p$ from $v_i$ to $v_j$, i.e., $W(i, j) = \min\{r_p \mid p \in P_{ij}\}$, where $P_{ij}$ is the set of all paths from module $v_i$ to $v_j$. We define a path $p$ from $v_i$ to $v_j$ as a W-critical path if $r_p$ equals $W(i, j)$; a W-critical path $p$ is also called an IO-W-critical path if modules $v_i$ and $v_j$ are the primary input and output, respectively.
(i) Iteration Bound While retiming can reduce the clock period of a circuit, there is a lower bound imposed by the feedback loops in the hypergraph [92]. Given a loop $l$, let $d_l$, $\delta_l$ and $r_l$ be the sum of combinational block delays, the sum of interpartition delays, and the number of registers in loop $l$, respectively. The delay-to-register ratio of a loop is equal to $(d_l + \delta_l)/r_l$. The iteration bound is defined as the maximum delay-to-register ratio, i.e.,
$$J(V_1, V_2) = \max\left\{ \frac{d_l + \delta_l}{r_l} \;\middle|\; l \in L \right\}, \tag{21}$$
where $L$ is the set of all loops. Note that the iteration bound of a given circuit yields a lower bound on the clock period achievable by retiming.
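Once the loops of the circuit are known, the iteration bound is a simple maximum of delay-to-register ratios. The sketch below assumes the loops have already been enumerated and are given as triples (combinational delay, interpartition delay, number of registers); the loop values are hypothetical and loop enumeration itself is not shown.

```python
from typing import Iterable, Tuple

Loop = Tuple[float, float, int]   # (d_l, delta_l, r_l)


def iteration_bound(loops: Iterable[Loop]) -> float:
    """Maximum of (d_l + delta_l) / r_l over all loops."""
    return max((d + delta) / r for d, delta, r in loops)


# Two hypothetical loops: one inside a partition, one crossing the cut twice
# with an interpartition delay of 4 per crossing.
loops = [(8.0, 0.0, 4), (18.0, 2 * 4.0, 8)]
bound = iteration_bound(loops)
```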
(ii) Latency Bound Let $p$ denote the IO-W-critical path with maximum path delay among all IO-W-critical paths from $v_i$ to $v_j$. Since the number of registers in path $p$ is equal to $W(i, j)$, the IO latency (i.e., $(W(i, j) - 1) \times T$) between $v_i$ and $v_j$ is not less than $d_p + \delta_p$, where $T$ denotes the clock period, and $d_p$ and $\delta_p$ are the sum of combinational block delays and the sum of interpartition delays on path $p$, respectively. Thus, we define latency bound $M$ as follows [85, 86]: [...]
Latency bound also imposes a lower bound on the system latency achieved by using retiming. An all-pair shortest-path algorithm can be used to calculate the latency bound.
We have two reasons to use the iteration and latency bounds. (i) It is faster to calculate these bounds. (ii) The iteration and latency bounds stand for the lower bounds of the clock period and system latency achieved by adopting retiming, respectively. The partition with lower iteration and latency bounds can achieve better clock period and system latency by using retiming. Therefore, we want to generate a partition with small iteration and latency bounds.
Statement of the Problem Now we state the performance-driven partitioning problem as follows: Given hypergraph $H(V, E)$, two numbers $\bar{J}$ and $\bar{M}$, bounds of sizes $S_l$ and $S_u$, and interpartition delay $\delta$, find a partition $(V_1, V_2)$ with the minimum cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, $J(V_1, V_2) \le \bar{J}$, and $M(V_1, V_2) \le \bar{M}$.
Example Figure 13 illustrates the effect of replication on the iteration bound. Let us assume that the interpartition delay is $\delta = 4$. Before replication, the iteration bound is dominated by loop $l_1$. After replication [85], the bound contributed by loop $l_1$ is equal to
$$\frac{d_{l_1} + \delta_{l_1}}{r_{l_1}} = \frac{8}{4} = 2. \tag{24}$$
The iteration bound now is dominated by the union of loops $l_1$ and $l_2$,
$$\frac{d_{l_1 + l_2} + \delta_{l_1 + l_2}}{r_{l_1 + l_2}} = \frac{18 + 2 \times 4}{8} = 3.25, \tag{25}$$
which is smaller than the iteration bound before replication.

Clustering
Clustering [6] is similar to multiway partitioning in that the process groups modules into k subsets.
However, for clustering the number of subsets is usually much greater than for a typical multiway partitioning problem, e.g., $k \ge 10$.
Often, a clustering process is used as part of a divide and conquer approach. Thus, it is important to choose an objective function that fits the target application. If the goal is to reduce problem complexity, we set the objective function to be
$$\min \sum_{i=1}^{k} \frac{C(V_i)}{C_I(V_i)} \tag{26}$$
where the $V_i$ are disjoint vertex sets and their union is equal to $V$. Function $C(V_i)$ is the external cut count of cluster $V_i$ and $C_I(V_i)$ is the count of nets connecting vertex set $V_i$, i.e., the sum of the connectivities $c_j$ of those nets.
FIGURE 13 Illustration of replication and its effect on iteration bound.
For performance driven clustering, the objective function is to minimize the number of cuts between registers.

MULTIPLE PIN NET MODELS
The handling of multiple pin nets strongly depends on the partitioning approach [102]. A proper model is needed to reflect the correct cut count and improve the efficiency. In this section, we first introduce a shift model, which is used for iterations of shifting a module or swapping a pair of modules. We then describe a clique model, which is used to replace a multiple pin net with two pin nets. The star and loop models are also two pin net models, but with less complexity than the clique model. Finally, a flow model is introduced for network flow approaches.

Shift Model
The shift model [101] for multiple pin nets is useful when we perturb the partition by shifting one module to a different vertex set or by swapping two modules between different vertex sets. Let us simplify the description by assuming only one module is shifted to a different vertex set. A swap of a pair of modules can be treated as two steps of module shifting.
For each shift, we want to update the cut count. We also want to update the potential change in cost for each module if it were to be shifted, so that we can rank the modules for the next move. Such cost revision can be expensive if the circuit has large nets which contain huge numbers of pins, e.g., hundreds of thousands of pins.
The shift model reduces the complexity of the cost revision by utilizing the property that for huge nets most shifts of its pins do not change the cost of the other pins in the net.
Let us simplify the description by considering a two way partitioning. The model can be extended to multiple way partitioning according to the choice of objective functions. Let module $v_j$ be shifted from vertex set $V_1$ to $V_2$. The configuration of nets $e_i \in E(\{v_j\})$ connecting module $v_j$ is revised. For each net $e_i$, we denote $k_i$ to be the number of pins of $e_i$ in $V_1$ and $|e_i| - k_i$ the number of pins of $e_i$ in $V_2$ (Fig. 14). With respect to net $e_i$, we update the pin numbers $k_i$ and $|e_i| - k_i$ after module $v_j$ is shifted. We also update the cost of modules in these nets.
1. If the revised $k_i \ge 2$, the potential cost of pins due to net $e_i$ is zero. For the case that $|e_i| - k_i = 1$, we increase the cut count by $c_i$ and set the potential cost of pins in $e_i$. Otherwise, the move has no effect on the cut count and potential cost.
2. If the revised pin count $k_i = 1$, the shift of the last pin of $e_i$ in $V_1$ will decrease the cut count by $c_i$. We then update the potential cost of this last pin.
3. If $k_i = 0$, the cut count reduces by $c_i$. However, the shift of any pin $v \in e_i$ from $V_2$ to $V_1$ will increase the cut count. Thus, in this case, we reflect the cost of potential shift on the pins of $e_i$, which takes $O(|e_i|)$ operations.

Clique of Two Pin Nets
Some researchers use cliques of two pin nets to model multiple pin nets. Given a multiple pin net $e_i$, we construct a clique of $(1/2)|e_i|(|e_i| - 1)$ two pin nets to connect all pairs of pins in the net. The clique model maintains the symmetric relation of the modules of the same net in the sense that the order of the pins in the net has no effect on the cost. The weight of two pin nets in the clique model is adjusted by some factor. One approach is to use $2/|e_i|$ to scale down the connectivity. The total weight of all the nets in the clique is $(|e_i| - 1)c_i$. Note that it takes $|e_i| - 1$ two pin nets to form a spanning tree of $|e_i|$ modules.
Other factors have been proposed, such as $1/(|e_i| - 1)$, which is based on a different probability model. However, no factor can exactly reflect the cost of a multiple pin net model.
Complexity of the Clique Model. The complexity of the clique model is high. There are $O(|e_i|^2)$ two pin nets in a clique model. Suppose the process of each two pin net takes a constant time. It takes $O(|e_i|^2)$ operations to process a multiple pin net $e_i$. Therefore, in practice, if the pin number is larger than a threshold, the net is ignored in the process.
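The clique construction with the $2/|e_i|$ scaling can be sketched as follows. The helper names `clique_edges` and `cut_weight` are hypothetical; the sketch only illustrates that the total clique weight equals $(|e_i| - 1)c_i$ and that the modeled cut cost varies with how the pins are split, whereas the true cut count of the net is $c_i$ for any split.

```python
from itertools import combinations


def clique_edges(pins, c=1.0):
    """Replace one multiple pin net by a clique of two pin nets.

    Each two pin net gets weight 2*c/|e|, the scaling factor mentioned above.
    Returns a list of (u, v, weight) with |e|(|e|-1)/2 entries.
    """
    w = 2.0 * c / len(pins)
    return [(u, v, w) for u, v in combinations(pins, 2)]


def cut_weight(edges, left):
    """Total weight of two pin nets with exactly one end in the set `left`."""
    return sum(w for u, v, w in edges if (u in left) != (v in left))


edges = clique_edges(["a", "b", "c", "d", "e"], c=1.0)
total = sum(w for _, _, w in edges)          # (|e| - 1) * c
one_vs_four = cut_weight(edges, {"a"})       # 4 clique edges are cut
two_vs_three = cut_weight(edges, {"a", "b"}) # 6 clique edges are cut
```

For a five pin net of unit weight, the 1-versus-4 split costs 1.6 and the 2-versus-3 split costs 2.4 in the clique model, while the exact cut count is 1 in both cases.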

Loop Model of Two Pin Nets
A loop model reflects the exact cut count [22]; however, it is sensitive to the order of the pins. We can derive a heuristic ordering of the pins using a linear placement. Modules are sequenced according to their x coordinates in the placement. We find the partition by collecting the modules according to the sequence. Following the order of the modules in the x coordinates, we link the modules of a multiple pin net with two pin nets into a loop. We link the pins in a sequence (Fig. 15) alternating on every other module. The loop is formed by the two connections at the two ends.
A factor of (1/2) is assigned to the two pin nets so that the cut count separating modules according to the sequence is one. The model remains correct even if any two consecutive modules in the sequence swap their order.
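One way to read the loop construction is sketched below: pins sorted by x coordinate are linked to the pin two positions ahead, and the loop is closed by linking the first two and the last two pins. This indexing is our interpretation of "alternating on every other module", and the names `loop_edges` and `cut_weight` are hypothetical.

```python
def loop_edges(ordered_pins, c=1.0):
    """Model one multiple pin net as a loop of two pin nets of weight c/2.

    ordered_pins: pins sorted by their x coordinate in a linear placement.
    Assumed construction: link pin i to pin i+2, then close the loop with
    one link at each end of the sequence.
    """
    p = ordered_pins
    n = len(p)
    if n < 3:
        raise ValueError("the loop model needs at least three pins")
    edges = [(p[i], p[i + 2], c / 2.0) for i in range(n - 2)]
    edges.append((p[0], p[1], c / 2.0))          # closing link at the left end
    edges.append((p[n - 2], p[n - 1], c / 2.0))  # closing link at the right end
    return edges


def cut_weight(edges, left):
    """Total weight of two pin nets with exactly one end in the set `left`."""
    return sum(w for u, v, w in edges if (u in left) != (v in left))


pins = [0, 1, 2, 3, 4, 5]
edges = loop_edges(pins)
prefix_costs = [cut_weight(edges, set(pins[:t + 1])) for t in range(len(pins) - 1)]
swapped_cost = cut_weight(edges, {0, 1, 3})  # pins 2 and 3 exchanged across the cut
```

Every cut that follows the sequence crosses exactly two links of weight 1/2, so the modeled cost is one; the same holds in this example when two consecutive pins exchange positions.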

Flow Model
For the network flow approach, we consider each net $e_i$ as a pipe. A set of saturated pipes forms a bottleneck of the flow. The union of the saturated pipes becomes the cut of the circuit. In such a model, we set the capacity of the pipe equal to the corresponding connectivity $c_i$ [52].
Let $x_{iu}$ be the amount of flow from pin $v_i$ to net $e_u$ and $x_{uj}$ be the amount of flow from net $e_u$ to pin $v_j$ (Fig. 16). The total flow injected into the net should be smaller than or equal to its capacity and the incoming flow is equal to the outgoing flow, i.e.,

$$\sum_{v_i \in e_u} x_{iu} \le c_u, \qquad (27)$$

$$\sum_{v_i \in e_u} x_{iu} - \sum_{v_j \in e_u} x_{uj} = 0. \qquad (28)$$

Star of Two Pin Nets
A star model introduces less complexity than a clique model. Given a net $e_i$, we create a dummy module for it. The dummy module connects to every pin in $e_i$ with a two pin net. This module maintains the symmetry of the net. However, we need only $|e_i|$ two pin nets.
For the clique and star models, the cost of the partition depends on the number of pins on the two sides of the partition. The cost is higher when the pins are distributed more evenly on the two sides of the cut. Thus, these models discourage even partitioning of the pins in the nets.

APPROACHES
In this section we introduce several approaches to partitioning. We first discuss two methods for optimal solutions: a branch and bound method and a dynamic programming algorithm. The branch and bound method is effective in searching exhaustively for the optimal solution for small circuits. The dynamic programming method presented runs in polynomial time and finds an optimal partition for a special class of circuits.
We then explain a few heuristic algorithms: group migration, network flow, nonlinear programming, Lagrangian, and clustering methods. The group-migration approach is a popular method in practice due to its flexibility and effectiveness. The network flow method gives us a different view of the partitioning problem by transforming the minimization of the cut count into the maximization of the flow via a duality in linear programming. This approach derives excellent results with respect to certain objective functions. The nonlinear programming method provides a global view of the whole problem. The Lagrangian method is a useful approach for performance driven problems. Finally, we depict a clustering method for the partitioning.
In most cases, we illustrate the method in question using two-way partitioning as the target problem. However, many methods can be extended to other problems or different objective functions. For example, we can apply group migration to multiway [98, 99] or multiple level partitioning problems [67, 68] with modification to the cost of the moves. Furthermore, some methods may be combined to solve a problem. For example, we can use clustering to reduce the size of an input circuit and then use group migration to find a partition of the reduced circuit with much greater efficiency [24, 59]. In fact, this strategy derives the best results in terms of CPU time and cut count in a recent benchmark [2].

Branch and Bound Method
The branch and bound method is an exhaustive search technique that may be effectively applied to the min-cut problem with size constraints for small cases. In the branch and bound process, the modules are first ordered in a sequence. For each module, we try placing it on either side of the cut.
The process can be represented by a complete binary tree with $|V|$ levels. The root of the tree is the first module in the sequence. The nodes in the $k$th level of the tree correspond to the $k$th module in the sequence. The two branches at each node represent the two trials where the $k$th module is placed on each of the two different sides. A path in the tree from the root to a leaf corresponds to one assignment for the partition.
We use a depth first search approach to traverse the binary tree. We prune the search space according to the size constraint and a partial cut count. In the binary tree, a node at level $k$ along with the path from the root to the node represents a partition assignment of the first $k$ modules. Let $V_1$ and $V_2$ be the two vertex sets of the partitions of the first $k$ modules. If $S(V_i) > S_u$ for $i = 1$ or 2, the size constraint is violated, and there is no need to proceed. Thus, we prune the branches below.
We also use a partial cut count to prune the binary tree. The cut of the partial partition is expressed as $E(V_1, V_2) = \{e_i : |e_i \cap V_1| > 0 \text{ and } |e_i \cap V_2| > 0\}$. The partial cut count is described as $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$. If the partial cut count $C(V_1, V_2)$ is larger than the cut count of a known solution, the partition results below this node are going to be worse than the existing solution. We prune the branches of such a node.
Complexity of the Method. Suppose the circuit has unit size $s_i = 1$ on each module and the constraint requires an even size $S_l = S_u = |V|/2$ (assuming that $|V|$ is even). Applying Stirling's approximation [63], we have the number of possible partitions:

FIGURE 17 Construction of serial and parallel graphs.
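To give a feel for the size of the search space in the complexity estimate, the sketch below compares the exact number of even splits with a Stirling-based estimate. It assumes the two sides are distinguishable, so that the count is the central binomial coefficient, and uses the standard estimate $\sqrt{2/(\pi |V|)}\, 2^{|V|}$; both function names are hypothetical.

```python
import math


def exact_even_splits(n):
    """Number of ways to choose which n/2 of n unit-size modules go left."""
    return math.comb(n, n // 2)


def stirling_estimate(n):
    """Stirling-based estimate of the central binomial coefficient."""
    return math.sqrt(2.0 / (math.pi * n)) * 2.0 ** n


n = 60
ratio = stirling_estimate(n) / exact_even_splits(n)
```

For 60 modules the estimate is within one percent of the exact count, which is on the order of $10^{17}$; this is why pruning is essential.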
Although the number of combinations is huge, we have found that the application to small circuits is practical. We improve the efficiency of the pruning by ordering the modules according to their degrees, i.e., the number of nets connecting to the modules, in a descending order. With an elegant implementation, we can find optimal solutions when the number of modules is small, e.g., $|V| \le 60$.
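The search described in this subsection can be sketched as follows for unit-size modules. This is a minimal illustration rather than a tuned implementation: the partial cut count is recomputed at every node, pruning uses "not better than" instead of "larger than", and the extra lower-bound check on sizes is our addition. The name `branch_and_bound` and the `(pins, weight)` net format are hypothetical.

```python
def branch_and_bound(modules, nets, size_lo, size_hi):
    """Exhaustive two-way min-cut with size constraints (unit module sizes).

    nets: list of (pins, weight). Returns (best cut count, {module: side}).
    """
    degree = {v: 0 for v in modules}
    for pins, _ in nets:
        for v in pins:
            degree[v] += 1
    # Order modules by descending degree to make the pruning bite early.
    order = sorted(modules, key=lambda v: -degree[v])
    n = len(order)
    best = {"cost": float("inf"), "side": None}
    side = {}

    def partial_cut():
        # A net is cut once its already-assigned pins lie on both sides.
        cost = 0
        for pins, weight in nets:
            if len({side[v] for v in pins if v in side}) == 2:
                cost += weight
        return cost

    def search(k, n1, n2):
        if n1 > size_hi or n2 > size_hi:
            return  # size constraint violated
        if n1 + (n - k) < size_lo or n2 + (n - k) < size_lo:
            return  # lower size bound can no longer be met (our addition)
        cost = partial_cut()
        if cost >= best["cost"]:
            return  # cannot beat the known solution
        if k == n:
            best["cost"], best["side"] = cost, dict(side)
            return
        v = order[k]
        for s in (1, 2):
            side[v] = s
            search(k + 1, n1 + (s == 1), n2 + (s == 2))
            del side[v]

    search(0, 0, 0)
    return best["cost"], best["side"]


nets = [(("a", "b"), 3), (("c", "d"), 3), (("b", "c"), 1), (("a", "d"), 1)]
cost, assignment = branch_and_bound(["a", "b", "c", "d"], nets, 2, 2)
```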

Dynamic Programming for a Serial and Parallel Graph
For the special case where the circuit can be represented by a serial and parallel graph of unit module size, we can find a minimum two way partition $(V_1, V_2)$ with size constraints in polynomial time. In this section, we first describe the serial and parallel graph. We then depict a dynamic programming algorithm that solves the partitioning problem on this class of graphs.
We assume that all modules are of unit size, i.e., $s_i = 1$.
A serial and parallel graph can be constructed from smaller serial and parallel graphs by serial or parallel process. Each serial and parallel graph has a source module $v_s$ and a sink module $v_t$. A graph $G(V, E)$ with two modules, $V = \{v_s, v_t\}$, and one edge, $E = \{e\}$, $e = \{v_s, v_t\}$, is a basic serial and parallel graph. A serial and parallel graph is constructed from the basic graph by a series of serial and parallel processes.
Serial Process. Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the sink module $v_{t1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ (Fig. 17(a)). The source module $v_{s1}$ of graph $G_1$ becomes the source module of graph $G$, i.e., $v_s = v_{s1}$. The sink module $v_{t2}$ of graph $G_2$ becomes the sink module of graph $G$, i.e., $v_t = v_{t2}$.
Parallel Process. Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the source module $v_{s1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ and by merging the sink module $v_{t1}$ of $G_1$ and the sink module $v_{t2}$ of $G_2$ (Fig. 17(b)).
The merged source module and merged sink module become the source module $v_s$ and the sink module $v_t$ of graph $G$, respectively.
Dynamic Programming. The dynamic programming algorithm performs a bottom up process according to the construction of the serial and parallel graph. It starts from the basic serial and parallel graph. For each graph $G(V, E)$, we derive two tables.
$a(i,j)$: the minimum cut count with $i$ modules on the left hand side and $j$ modules on the right hand side under the condition that source module $v_s$ is on the left hand side and sink module $v_t$ is on the right hand side.
$b(i,j)$: the minimum cut count with $i$ modules on the left hand side and $j$ modules on the right hand side under the condition that both source module $v_s$ and sink module $v_t$ are on the left hand side.
Let graph $G(V, E)$ be constructed with $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ by one of the serial and parallel processes. Let $a_1$, $b_1$ be the tables of graph $G_1$ and $a_2$, $b_2$ be the tables of graph $G_2$.
We construct the tables $a$, $b$ of graph $G(V, E)$ as follows.
For table $a(i,j)$, we try all combinations of tables $a_1$ and $a_2$ with the constraint that the number of modules on the left hand side is $i$ and the number of modules on the right hand side is $j$.
Note that the extra addition in the index is used to compensate the merging of the two source modules or the sink modules. For table $b(i,j)$, we try all combinations of tables $b_1$ and $b_2$ with the same size constraint.
Table Formula for Serial Process

$$a(i,j) = \min\Big( \min_{k+m=|V_2|} \big[a_1(i-k,\, j+1-m) + b_2(m,k)\big],\ \min_{k+m=|V_2|} \big[b_1(i+1-k,\, j-m) + a_2(k,m)\big] \Big)$$

$$b(i,j) = \min\Big( \min_{k+m=|V_2|} \big[a_1(i-k,\, j+1-m) + a_2(m,k)\big],\ \min_{k+m=|V_2|} \big[b_1(i+1-k,\, j-m) + b_2(k,m)\big] \Big)$$

For table $a(i,j)$, we try all combinations of tables $a_1$ and $b_2$ and all combinations of tables $b_1$ and $a_2$. For the combinations of tables $a_1$ and $b_2$, the merged module (by merging $v_{t1}$ and $v_{s2}$) is on the right hand side. For the combinations of tables $b_1$ and $a_2$, the merged module is on the left hand side. For table $b(i,j)$, we try all combinations of tables $a_1$ and $a_2$ and all combinations of tables $b_1$ and $b_2$. For the combinations of tables $a_1$ and $a_2$, the merged module is on the right hand side. In terms of $G_2$, its source module $v_{s2}$ is on the right hand side and its sink module $v_{t2}$ is on the left hand side. Thus, the indices of table $a_2$ are reversed, i.e., $a_2(m,k)$ instead of $a_2(k,m)$. For the combinations of tables $b_1$ and $b_2$, the merged module is on the left hand side.
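The table construction can be sketched in code as follows. This is our own reading of the description above, not a transcription of the original formulas: tables are stored as dictionaries keyed by $(i, j)$, every pair of sub-table entries is combined and the minimum is kept, and the index offsets account for the merged modules being counted once. The names `basic`, `serial`, `parallel`, and `min_cut` are hypothetical.

```python
INF = float("inf")


def basic(c=1):
    """Basic graph: source and sink joined by one net of weight c."""
    return {"a": {(1, 1): c}, "b": {(2, 0): 0}}


def _keep_min(table, key, value):
    if value < table.get(key, INF):
        table[key] = value


def serial(g1, g2):
    """Merge the sink of g1 with the source of g2."""
    a, b = {}, {}
    for (i1, j1), c1 in g1["a"].items():    # merged module on the right
        for (x, y), c2 in g2["b"].items():  # g2 terminals both right: indices reversed
            _keep_min(a, (i1 + y, j1 + x - 1), c1 + c2)
        for (x, y), c2 in g2["a"].items():  # g2 source right, sink left: indices reversed
            _keep_min(b, (i1 + y, j1 + x - 1), c1 + c2)
    for (i1, j1), c1 in g1["b"].items():    # merged module on the left
        for (x, y), c2 in g2["a"].items():
            _keep_min(a, (i1 + x - 1, j1 + y), c1 + c2)
        for (x, y), c2 in g2["b"].items():
            _keep_min(b, (i1 + x - 1, j1 + y), c1 + c2)
    return {"a": a, "b": b}


def parallel(g1, g2):
    """Merge the two sources and the two sinks."""
    a, b = {}, {}
    for (i1, j1), c1 in g1["a"].items():
        for (x, y), c2 in g2["a"].items():
            _keep_min(a, (i1 + x - 1, j1 + y - 1), c1 + c2)
    for (i1, j1), c1 in g1["b"].items():
        for (x, y), c2 in g2["b"].items():
            _keep_min(b, (i1 + x - 2, j1 + y), c1 + c2)
    return {"a": a, "b": b}


def min_cut(g, size_lo, size_hi):
    """Best cut count over both tables with each side within the size bounds."""
    costs = [c for t in (g["a"], g["b"]) for (i, j), c in t.items()
             if size_lo <= i <= size_hi and size_lo <= j <= size_hi]
    return min(costs) if costs else INF


triangle = parallel(serial(basic(1), basic(2)), basic(5))
path = serial(serial(basic(1), basic(1)), basic(1))
```

The `triangle` example is a source, a middle module, and a sink with net weights 1, 2, and 5; the `path` example is a chain of four modules with unit weights.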

Group Migration Algorithms
The group migration algorithm was first proposed by Kernighan and Lin [60] in 1970. Since then, many variations [15, 26, 27, 33, 39, 45, 49, 84, 97-99, 108, 111, 116] have been reported to improve the efficiency and effectiveness of the method. Today, it is still a popular method in practice. The probability of finding the optimum solution in a single trial drops exponentially as the size of the circuit increases [60]. Using the original version, Kernighan and Lin showed that the probability of obtaining an optimal solution is a function of the problem size, $p(|V|) = 2^{-n/30}$.
In other words, if the circuit size is large, then the heuristic Kernighan-Lin algorithm is unlikely to jump out of local minima, and so the optimum solution will not be found. The progress made by researchers on the method has definitely pushed the envelope further.
In this section, we concentrate on two-way min-cut with size constraints. The method is flexible and can be extended to other partitioning problems with modifications of the moves and the cost function.
The algorithm performs a series of passes. At the beginning of a pass, each module is labeled unlocked. Once a module is shifted, it becomes locked in this pass. The group migration algorithm iteratively interchanges a pair of unlocked modules or shifts a single module to a different side with the largest reduction (gain) of the cost function. This continues until all modules are locked. The lowest cost along the whole sequence of swapping is recorded. The group migration takes the subsequence that produces the lowest cut count and undoes the moves after the point of the lowest cost. This partitioning result is then used as the initial solution for the next pass. The algorithm terminates when a pass fails to find a result with a cost lower than the cost of the previous pass.

FIGURE 18 Cost of a sequence of moves and subsequence selection.
Input: Hypergraph H(V, E) and an initial partition. Cost function and size constraints.
1. One pass of moves.
1.1 Choose and perform the best move.
1.2 Lock the moved modules.
1.3 Update the gain of unlocked modules.
1.4 Repeat Steps 1.1-1.3 until all modules are locked or no move is feasible.
1.5 Find and execute the best subsequence of the moves. Undo the rest of the sequence.
2. Use the previous result as an initial partition.
3. Repeat the pass (Steps 1 and 2) until there is no more improvement.
Figure 18 illustrates the cost of a sequence of moves. This algorithm escapes from local optima by a whole sequence of the moves even when a single move may produce a negative gain. In the following, we discuss variations of several parts in the process: basic moves (Step 1.1), data structure, gains (Steps 1.1 and 1.3). At the end of this subsection, we introduce a net based move and a simulated annealing approach.

Basic Moves
Basic moves cover the shifting of a single module and the swapping of a pair of modules. A swapping can be conceived as two consecutive shifts, however, with consideration of the mutual effect between the two shifts.
Module Shifting. For each unlocked module, we check its gain: the cost function reduction by shifting the module to a different side assuming that the rest of the modules are fixed. To select the best module to shift, we order on each side the modules according to their shift gains. If the size constraints are violated after the shift, the move is not feasible.
We search for the best feasible module to move [40].
Pairwise Swapping. We exchange two modules in two vertex sets of the partition. Note that the gain of the swap is not equal to the sum of the gains of two shifts. The mutual effect between the two modules needs to be included when we derive the gain. Thus, the best pair may not be the two modules on the top of the two sides. The search of all pairs takes $O(|V_1||V_2|)$ operations. In practice, we order modules according to their shift gain. The search of the best pair is limited to the top $k$ modules on each side, e.g., $k = 3$. Thus, the complexity is actually $O(k^2)$.
Pairwise swapping is a natural choice when the size constraint is tight. When no single shift is feasible, we can use swapping to balance the size of the partition.
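The pass structure of the group migration algorithm (choose the best feasible move, lock, keep the best prefix of the move sequence, undo the rest) can be sketched with single-module shifts as follows. This is a deliberately simple illustration for unit-size modules: gains are recomputed from scratch by re-evaluating the cut count, so it does not have the efficiency of the data structures discussed in this section. All function names are hypothetical.

```python
def cut_count(nets, side):
    """Total weight of nets with pins on both sides."""
    return sum(w for pins, w in nets if len({side[v] for v in pins}) == 2)


def one_pass(nets, side, size_lo, size_hi):
    """One pass of module shifts; returns the best partition seen and its cost."""
    side = dict(side)
    size = {1: 0, 2: 0}
    for s in side.values():
        size[s] += 1
    cost = cut_count(nets, side)
    best_cost, best_len = cost, 0
    locked, moves = set(), []
    while len(locked) < len(side):
        best = None
        for v in side:
            if v in locked:
                continue
            src, dst = side[v], 3 - side[v]
            if size[src] - 1 < size_lo or size[dst] + 1 > size_hi:
                continue  # shift would violate the size constraints
            side[v] = dst
            gain = cost - cut_count(nets, side)
            side[v] = src
            if best is None or gain > best[0]:
                best = (gain, v)
        if best is None:
            break  # no feasible move is left
        gain, v = best  # the gain may be negative: this is how local optima are escaped
        src = side[v]
        side[v] = 3 - src
        size[src] -= 1
        size[3 - src] += 1
        locked.add(v)
        moves.append(v)
        cost -= gain
        if cost < best_cost:
            best_cost, best_len = cost, len(moves)
    for v in moves[best_len:]:  # undo the moves after the point of lowest cost
        side[v] = 3 - side[v]
    return side, best_cost


def group_migration(nets, side, size_lo, size_hi):
    """Repeat passes until a pass fails to lower the cost."""
    cost = cut_count(nets, side)
    while True:
        new_side, new_cost = one_pass(nets, side, size_lo, size_hi)
        if new_cost >= cost:
            return side, cost
        side, cost = new_side, new_cost


nets = [(("a", "b"), 3), (("c", "d"), 3), (("b", "c"), 1), (("a", "d"), 1)]
start = {"a": 1, "b": 2, "c": 1, "d": 2}
final, final_cost = group_migration(nets, start, 1, 3)
```

Here the size bounds are loosened to between 1 and 3 modules per side so that single shifts are feasible; with a tight bound one would use pairwise swaps instead.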

Data Structure
The choice of data structure strongly depends on the cost functions, gains, and the characteristics of VLSI circuitry. A sorting structure such as a heap or AVL tree is a natural choice to sort for the top modules. However, when the gain takes only a limited number of distinct values, an array structure can simplify the coding and reduce the complexity.
(i) Heap or AVL Tree. We can use a heap or AVL tree to sort the modules according to their shift gain. Each side of the partition keeps a heap. The top of the heap is the module of the maximum gain. Sorting all the modules takes $O(|V| \log |V|)$ operations.
(ii) Array (Bucket) of Linked Lists. Figure 19 illustrates a bucket list data structure. The gain is transformed to the index of the bucket [40].
Modules of the same gain are stored in the same bucket by a linked list. A bucket is an effective data structure when the objective function is the cut count. The gain of cut count is limited by the maximum degree of the modules, i.e., $\deg_{\max} = \max_{v_i \in V} |E(\{v_i\})|$. Thus, the dimension of the bucket is set to be $2 \deg_{\max}$.
For VLSI applications, the degree of modules is much smaller than the number of modules. Thus, the dimension of the bucket is small. It is very efficient to search and revise the module order in the bucket structure. In fact, it is proven that using the bucket structure and cut count as the objective function, it takes linear time proportional to the total number of pins to perform each pass [40].
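A bucket structure of this kind can be sketched as below. The sketch is an assumption-laden illustration: it uses one Python dictionary per bucket in place of a linked list, allocates $2\deg_{\max} + 1$ buckets so that gain zero has its own slot, and tracks the highest non-empty bucket. The class name `GainBuckets` and its methods are hypothetical.

```python
class GainBuckets:
    """Array of buckets indexed by gain, for integer gains in [-max_degree, max_degree]."""

    def __init__(self, max_degree):
        self.offset = max_degree
        self.buckets = [dict() for _ in range(2 * max_degree + 1)]
        self.gain_of = {}
        self.top = -1  # index of the highest non-empty bucket, -1 when empty

    def insert(self, module, gain):
        idx = gain + self.offset
        self.buckets[idx][module] = None
        self.gain_of[module] = gain
        self.top = max(self.top, idx)

    def remove(self, module):
        idx = self.gain_of.pop(module) + self.offset
        del self.buckets[idx][module]
        while self.top >= 0 and not self.buckets[self.top]:
            self.top -= 1  # walk down to the next non-empty bucket

    def update(self, module, delta):
        """Move a module to the bucket of its revised gain."""
        gain = self.gain_of[module] + delta
        self.remove(module)
        self.insert(module, gain)

    def pop_max(self):
        """Remove and return (module, gain) of a maximum-gain module, or None."""
        if self.top < 0:
            return None
        module = next(iter(self.buckets[self.top]))
        gain = self.gain_of[module]
        self.remove(module)
        return module, gain


buckets = GainBuckets(max_degree=3)
buckets.insert("a", 2)
buckets.insert("b", -1)
buckets.insert("c", 2)
first = buckets.pop_max()
buckets.update("b", 4)
second = buckets.pop_max()
third = buckets.pop_max()
fourth = buckets.pop_max()
```

In a partitioner, one such structure would be kept per side, and `update` would be called only for the modules whose gains change after a move.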

Gains
In this subsection, we use cut count as the objective function. The extension to other cost functions is possible. However, we may lose efficiency.
(i) Shift Gain. We use the shift model for multiple pin nets. Given a module $v_i$, we check the set $E(\{v_i\})$ of nets connecting to this module. The contribution of each net $e \in E(\{v_i\})$ by shifting module $v_i$ is the gain $g_e(v_i)$ of the net with respect to module $v_i$. The gain $g(v_i)$ of module $v_i$ is the total gains of all its adjacent nets, i.e., $g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i)$.
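For the cut count objective in a two-way partition, the shift gain can be sketched as follows. A net contributes $+c_e$ when the module is its only pin on the current side (the net becomes uncut), $-c_e$ when all of its pins are on the current side (the net becomes cut), and zero otherwise. The function names and the `(pins, weight)` net format are hypothetical.

```python
def net_gain(pins, weight, side, v):
    """Cut-count reduction contributed by one net if module v changes sides."""
    same = sum(1 for u in pins if side[u] == side[v])  # pins on v's side, v included
    other = len(pins) - same
    if other == 0:
        # The net is currently uncut; it becomes cut unless v is its only pin.
        return -weight if same > 1 else 0
    if same == 1:
        return weight  # v is the last pin on this side; the net becomes uncut
    return 0


def shift_gain(nets, side, v):
    """g(v): total gain over all nets connected to v."""
    return sum(net_gain(pins, w, side, v) for pins, w in nets if v in pins)


nets = [(("a", "b"), 1), (("a", "c", "d"), 2), (("a", "e"), 1)]
side = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1}
gain_a = shift_gain(nets, side, "a")
gain_b = shift_gain(nets, side, "b")
```

Shifting module a uncuts one net of weight 1 but cuts nets of total weight 3, so its gain is -2; shifting module b uncuts its only net, so its gain is 1.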
(iii) Weights of Multipin Nets. The sequence of the moves depends much on the gain calculation. For a circuit of 1,000,000 modules, suppose the degree of most modules is less than 100 and each net is of unit weight. We have roughly 1,000,000 modules / 200 gain levels = 5,000 modules per gain level. To differentiate these 5,000 modules, we have to adjust the weight of multiple pin nets.

FIGURE 19 Bucket list.
(iii) (a) Levels with Priority. The first level gain is identical to the shift gain of cut count. The second level gain is equal to the number of nets that have one more pin on the same side. Thus, the $k$th level gain is equal to the number of nets that have $k$ more pins on the same side [65]. The pins on the other side will increase by one after the module is shifted. Thus, the negative gain of level $k$ is contributed by the nets with $k-1$ pins on the other side.
Let us assume that module $v_i$ is in vertex set $V_1$ to simplify the notation. For each net $e_j \in E(\{v_i\})$, we denote $k_j = |e_j \cap V_1|$ the number of pins in $V_1$.
Let us define $E(+,i,k)$ to be the set of nets $e_j \in E(\{v_i\})$ with $k_j = k+1$ pins in $V_1$ (the extra one is used to count module $v_i$ itself) and nonzero pins in $V_2$, i.e., $|e_j| > k_j$, and $E(-,i,k)$ to be the set of nets $e_j \in E(\{v_i\})$ with no other pins in $V_1$ and $k-1$ pins in $V_2$, i.e., $|e_j| = k$ and $k_j = 1$. Then, the $k$th level gain of module $v_i$, $g_i(k)$, is the weight difference of the two sets, $E(+,i,k)$ and $E(-,i,k)$:

$$g_i(k) = \sum_{e \in E(+,i,k)} c_e - \sum_{e \in E(-,i,k)} c_e \qquad (34)$$

$$E(+,i,k) = \{e_j \mid e_j \in E(\{v_i\}),\ k_j = k+1,\ |e_j| > k_j\} \qquad (35)$$

We compare the modules with a priority on the lower level gain. In other words, we compare the first level first. If the modules are equal at the first level gain, we then compare the second level and so on. In practice, we limit the number of levels by a threshold, e.g., $k \le 3$.
(iii) (b) Probabilistic Gain. In the probabilistic gain model [37], each module $v_i$ is assigned a weight $p(v_i)$. The weight $p(v_i)$ is a function of the gain $g(v_i)$ of module $v_i$ to reflect the belief level (potential) that the shift of module $v_i$ will be executed at the end of the pass. Thus, if module $v_i$ is unlocked, $p(v_i) = f(g(v_i))$. (37) Otherwise, $p(v_i) = 0$. Figure 20 illustrates function $f$, which increases monotonically. The slope within $g_o$ and $g_{up}$ amplifies the difference of gains.
The slope is clamped at two ends $p_{max}$ and $p_{min}$ ($0 \le p_{min} < p_{max} \le 1$) which represent the maximum potential that the module will shift or stay.
For each net $e \in E(\{v_i\})$, its contribution $g_e(v_i)$ to the gain of module $v_i$ is the tendency that the whole net will shift with module $v_i$ to the other side. To simplify the notation, let us assume that module $v_i$ is in $V_1$. Thus, we have the following expression.

$$g_e(v_i) = c_e \Big( \prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j) - \prod_{v_j \in e \cap V_2} p(v_j) \Big)$$

where $\prod_{v_j \in s} p(v_j) = 1$ if $s$ is an empty set. The first term $\prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j)$ in the parentheses is the potential that all the pins will shift with module $v_i$ to $V_2$. Hence, $c_e \times \prod_{j \ne i,\ v_j \in e \cap V_1} p(v_j)$ is the expected gain if module $v_i$ is shifted. The second term $\prod_{v_j \in e \cap V_2} p(v_j)$ is the potential that the pins in $V_2$ will shift to $V_1$. Thus, $c_e \times \prod_{v_j \in e \cap V_2} p(v_j)$ is the expected loss if module $v_i$ is shifted. The gain of a module $v_i$ is the total gains of the adjacent nets with respect to this module, i.e., $g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i)$. Net gain $g_e(v_i)$ and module potential $p(v_i)$ are mutually dependent. We derive the values via iterations. Initially, we use the plain shift gain (by cut count) to derive the potential $p(v_i) = f(g(v_i))$. From these initial potentials, we derive the probabilistic net gain. The net gain is then used to derive the module gain. In practice, we stop after a limited number of cycles, e.g., two iterations [37]. Note that there is no guarantee that the iteration will converge.
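The probabilistic net gain can be sketched as follows. The clamped linear shape of `potential` is our assumption about function $f$ based on its description, and the threshold and clamp values (`g_lo`, `g_up`, `p_min`, `p_max`) are hypothetical placeholders, not values taken from [37]. Locked modules are expected to carry potential zero in the dictionary `p`.

```python
def potential(gain, g_lo=-1.0, g_up=2.0, p_min=0.4, p_max=0.95):
    """Assumed form of f: linear between g_lo and g_up, clamped outside."""
    if gain <= g_lo:
        return p_min
    if gain >= g_up:
        return p_max
    return p_min + (p_max - p_min) * (gain - g_lo) / (g_up - g_lo)


def prob_net_gain(pins, weight, side, p, v):
    """c_e * (potential that v's side empties - potential that the other side empties)."""
    follow = 1.0  # product over the other pins on v's side (1 if there are none)
    oppose = 1.0  # product over the pins on the opposite side (1 if there are none)
    for u in pins:
        if u == v:
            continue
        if side[u] == side[v]:
            follow *= p[u]
        else:
            oppose *= p[u]
    return weight * (follow - oppose)


def prob_gain(nets, side, p, v):
    """Module gain: sum of probabilistic net gains over the nets connected to v."""
    return sum(prob_net_gain(pins, w, side, p, v) for pins, w in nets if v in pins)


side = {"a": 1, "b": 1, "c": 2}
p = {"a": 0.5, "b": 0.5, "c": 0.25}
g_three_pin = prob_net_gain(("a", "b", "c"), 1.0, side, p, "a")
g_locked = prob_net_gain(("a", "c"), 1.0, side, {"a": 0.5, "c": 0.0}, "a")
```

When the only other pin is locked on the opposite side, the expression reduces to the plain shift gain of the net.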
After each move, the associated module potential and probabilistic net gains are updated and the plain cut count is recorded. Exact cut count is used when we select the subsequence of moves to execute.
It has been shown via benchmarks released by ACM/SIGDA that the probabilistic gain model produces excellent partitioning results; it outperforms the other gain models by wide margins.

Net-based Move
The net based process [32, 115] is similar to the module based approach except that all operations are based on the concept of the critical and complementary critical sets. The main differences are:
(1) Instead of a single module, each move now shifts one critical or complementary critical set, depending on the type of objective function. For convenience, we say a move is initiated by a net $e_u$ if this move is composed of shifting the critical or complementary critical set associated with $e_u$.
(2) The locking mechanism is operated on a net, that is, if the critical or complementary critical set of a net has been moved then all the moves initiated by this net will be prohibited thereafter.
Given a net $e_u$ and a vertex set $V_b$, let us define the critical set of net $e_u$ with respect to set $V_b$ as

$$S_{ub} = e_u \cap V_b, \qquad (40)$$

and the complementary critical set of $e_u$ with respect to set $V_b$ as

$$\bar{S}_{ub} = e_u - V_b. \qquad (41)$$

For a move associated with a net $e_u$, we can either place the critical set $S_{ub}$ into a partition other than $V_b$, or the complementary critical set $\bar{S}_{ub}$ into the partition $V_b$. The gain of each move is then computed by evaluating the change of the cost due to the move of the critical or complementary critical set.
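Definitions (40) and (41) translate directly into set operations. The sketch below also shows a hypothetical helper `move_set` that applies a net based move to a partition stored as a dictionary of named blocks; the representation and all names are our own choices for illustration.

```python
def critical_set(net, block):
    """Pins of the net that lie inside the block (definition (40))."""
    return set(net) & set(block)


def complementary_critical_set(net, block):
    """Pins of the net that lie outside the block (definition (41))."""
    return set(net) - set(block)


def move_set(blocks, moved, target):
    """Return a new partition with every module of `moved` placed in block `target`."""
    return {name: (set(members) | moved if name == target else set(members) - moved)
            for name, members in blocks.items()}


blocks = {"V1": {"a", "b", "x"}, "V2": {"c", "y"}}
net = {"a", "b", "c"}
crit = critical_set(net, blocks["V1"])
comp = complementary_critical_set(net, blocks["V1"])
pulled_in = move_set(blocks, comp, "V1")   # complementary critical set into V1
pushed_out = move_set(blocks, crit, "V2")  # critical set out of V1
```

Either move removes the net from the cut in this two-block example; they differ in how many modules are displaced and hence in their effect on the block sizes.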
Usage of Basic Module Moves. Although the net-based move model provides a different process to improve the current partition, it is more expensive than the module-based move model because more modules are involved in each move.
We can mimic the net based move by adding weights to the connectivity of desired nets [38].
The basic move is still based on the modules. However, after module $v_i$ is moved, we add more weights on the nets connecting to $v_i$, i.e., $E(\{v_i\})$.
These extra weights encourage the adjacent modules to go along with module $v_i$ and thus achieve the effect of a net based move. Empirical study finds improvement on the partitioning results.

Simulated Annealing Approach
For simulated annealing [14, 20, 56, 62, 81], we can adopt the basic moves such as module shifting and pairwise swapping. There is no need for a lock mechanism. To allow a larger searching space, we incorporate the size constraints into the objective function, e.g.,

$$C(V_1, V_2) + \alpha \big(S(V_1) - S(V_2)\big)^2 \qquad (42)$$

where $\alpha$ is a coefficient. We can adjust it according to the annealing temperature. As temperature drops, we gradually increase $\alpha$ to enforce the size balance.
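Objective (42) and a standard Metropolis acceptance test can be sketched as follows. The sketch assumes unit module sizes; the acceptance rule is the usual simulated annealing criterion and is not specific to the cited works. The function names are hypothetical.

```python
import math
import random


def annealing_cost(nets, side, alpha):
    """Cut count plus the size-balance penalty alpha * (S(V1) - S(V2))^2."""
    cut = sum(w for pins, w in nets if len({side[v] for v in pins}) == 2)
    s1 = sum(1 for s in side.values() if s == 1)
    s2 = len(side) - s1
    return cut + alpha * (s1 - s2) ** 2


def accept(delta, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept uphill moves."""
    return delta <= 0 or rng.random() < math.exp(-delta / temperature)


nets = [(("a", "b"), 1), (("b", "c"), 1)]
side = {"a": 1, "b": 1, "c": 2}
cost_loose = annealing_cost(nets, side, alpha=0.5)
cost_tight = annealing_cost(nets, side, alpha=4.0)
```

Raising `alpha` as the temperature drops makes the same imbalance progressively more expensive, which steers the final partition toward the size constraint.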

Flow Approaches
In this section, we assume that the circuit can be represented by a graph $G(V, E)$ with unit module size, i.e., $s_i = 1$, and all nets are two pin nets. The flow approach can be extended to multiple pin nets using a flow model.
We first go through maximum flow minimum cut [1, 73] to introduce the duality [30] and the concept of shadow price. The derivation is then extended to a weighted cluster ratio cut and a replication cut. Finally, we introduce heuristic algorithms that accelerate the flow calculation. The flow approach can derive excellent results. Furthermore, exploiting its duality formulation, we can derive a tight bound of the optimal solutions.

Maximum Flow Minimum Cut
In the maximum flow minimum cut formulation, the flow injects into module $v_s$ and drains from module $v_t$. The flow is conservative at all other modules. The capacity of the net $e_{ij}$ is equal to its connectivity, $c_{ij}$. We set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. The notation $x_{ij}$ denotes the amount of flow from module $v_i$ to module $v_j$ and $x_{ji}$ denotes the amount of flow from module $v_j$ to module $v_i$ on net $e_{ij}$. The objective is to maximize the flow injection $f$ into $v_s$.
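As a concrete illustration of the formulation, the following sketch computes a maximum flow with the Edmonds-Karp augmenting-path method (our choice of algorithm for the example) and reads off a minimum cut from the modules still reachable from $v_s$ in the residual graph. Modules are numbered from 0, nets are undirected two pin nets with capacity $c_{ij}$, and the function name is hypothetical.

```python
from collections import deque


def max_flow_min_cut(n, edges, s, t):
    """edges: list of (i, j, c_ij) on modules 0..n-1.

    Returns (flow value, set of modules on the source side, list of cut nets).
    """
    cap = [[0] * n for _ in range(n)]
    for i, j, c in edges:
        cap[i][j] += c  # an undirected net can carry flow in either direction
        cap[j][i] += c
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:  # breadth first search in the residual graph
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break  # no augmenting path: the saturated nets form a bottleneck
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
    source_side = {v for v in range(n) if parent[v] != -1}
    cut = [(i, j) for i, j, _ in edges if (i in source_side) != (j in source_side)]
    return flow, source_side, cut


edges = [(0, 1, 3), (0, 2, 2), (1, 3, 2), (2, 3, 3), (1, 2, 1)]
flow, source_side, cut = max_flow_min_cut(4, edges, s=0, t=3)
cut_capacity = sum(c for i, j, c in edges if (i in source_side) != (j in source_side))
```

The total capacity of the nets in the returned cut equals the flow value, which is the duality that the rest of this section builds on.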
Figure 21 illustrates the formulation. As we increase the flow, certain nets are going to saturate, i.e., the two sides of inequality expression (44) become equal. Once the saturated nets become a bottleneck of the flow, the set of nets forms a cut $E(V_1, V_2)$ with $v_s \in V_1$ and $v_t \in V_2$. In duality, the potential of modules in $V_2$ increases to one, and the potential of modules in $V_1$ remains to be zero, i.e., $\lambda_i = 1, \forall v_i \in V_2$ and $\lambda_i = 0, \forall v_i \in V_1$. The distance of nets in the cut is one, while the distance of nets outside the cut is zero, i.e., $d_{ij} = 1, \forall e_{ij} \in E(V_1, V_2)$ and $d_{ij} = 0, \forall e_{ij} \notin E(V_1, V_2)$.

The Weighted Cluster Ratio Metric and a Uniform Multi-commodity Flow Problem
In a uniform multi-commodity flow problem [74, 75], the demand of flow between each pair of modules is equal to an identical value $f$. As we keep increasing $f$, some of the nets become saturated. These saturated nets form a bottleneck of communication and thus prescribe a potential clustering of the communication system [71].
We simplify the notation by assuming a graph model $G(V, E)$. From each module $v_p$, we inject flow $f/2$ to each of the rest of the modules. Summing up the flow in two directions, the flow between each pair of modules is $f$. We define the flow originated from module $v_p$ as commodity $p$. Let $x^{(p)}_{ij}$ be the flow for commodity $p$ on net $e_{ij}$. The objective is to maximize $f$ subject to the flow demand from module $v_p$ to the other modules,

$$\sum_{j} \big(x^{(p)}_{ij} - x^{(p)}_{ji}\big) = \begin{cases} -f/2 & \text{if } i \ne p, \\ (|V| - 1) f/2 & \text{if } i = p, \end{cases} \qquad 1 \le i, p \le |V|, \qquad (53)$$

and the net capacity constraint,

$$\sum_{p=1}^{|V|} x^{(p)}_{ij} + \sum_{p=1}^{|V|} x^{(p)}_{ji} \le c_{ij}. \qquad (54)$$

We transform the above linear programming problem to its dual expression by assigning dual variables $\lambda_i^{(p)}$ to module $v_i$ with respect to commodity $p$, Eq. (53), and distance $d_{ij}$ to net $e_{ij}$, Eq. (54). Then we have:

$$\text{Obj: } \min \sum_{e_{ij} \in E} c_{ij} d_{ij} \qquad (55)$$

subject to

$$d_{ij} \ge \big|\lambda_i^{(p)} - \lambda_j^{(p)}\big|, \qquad 1 \le i, j, p \le |V|, \qquad (56)$$

$$\sum_{p=1}^{|V|} \sum_{i=1,\, i \ne p}^{|V|} \big(\lambda_i^{(p)} - \lambda_p^{(p)}\big) \ge 1. \qquad (57)$$

The Properties of Shadow Prices. The shadow price $d$ can be viewed as bidirectional, i.e., $d_{ij} = d_{ji}$.
It represents the distance of net $e_{ij}$, which corresponds to the cost to transmit flow through $e_{ij}$.
Variable $\lambda_i^{(p)}$ is the potential of module $v_i$ with respect to commodity $p$.
From constraints (56), (57), we can derive two properties for distance function $d_{ij}$ and potential $\lambda_i^{(p)}$.
Property I: Triangular Inequality. The distance metric $d$ satisfies the triangular inequality: $d_{ij} + d_{jk} \ge d_{ik}, \ \forall v_i, v_j, v_k \in V$. (58)
Property II: Potential Function. The term $\lambda_i^{(p)} - \lambda_p^{(p)}$ in expression (56) is equal to the shortest distance between modules $v_i$ and $v_p$ based on net distances $d_{ij}$. In fact, from the triangular inequality, we obtain $\lambda_i^{(p)} - \lambda_p^{(p)} = d_{ip}$.
We normalize the objective function (55) with the left hand side terms of inequality (57). The objective function can be expressed as:
In the solution of linear programming problem (52)-(56), the nets with positive $d_{ij}$ values partition $V$ into vertex sets $V_1, V_2, \ldots, V_k$. More specifically, nets connecting modules in different sets, $V_i$, $V_j$, $i \ne j$, have the same distance $d_{ij}$ values (we use $d_{ij}$ to denote the distance between vertex sets $V_i$ and $V_j$ when this does not cause confusion), while nets connecting only modules in the same subgraph have zero distance, $d_{ij} = 0$ (Fig. 22). We can rewrite the denominator of the objective function and state the problem as follows.
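Statement of Weighted Cluster Ratio Cut [103]. Find the distance $d_{ij}$ and the number of partitions $k$ with an objective function of weighted cluster ratio:

$$\min_{d,k} W_c(V_1, V_2, \ldots, V_k) = \min_{d,k} \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} C(V_i, V_j)\, d_{ij}}{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} d_{ij}\, S(V_i)\, S(V_j)} \qquad (60)$$

where distance $d_{ij}$ is subject to the property of triangular inequality.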
According to the mechanism of the duality, the objective functions of the primal and dual formulations are equal when the solution is optimal [25].
THEOREM 5.1 For feasible solutions, we have the inequality $f \le W_c(V_1, V_2, \ldots, V_k)$. The equality holds when the solution is optimal, i.e., the maximum uniform multicommodity flow equals the minimum weighted cluster ratio of any cut, $\max_{x} f = \min_{d,k} W_c(V_1, V_2, \ldots, V_k)$.
Expression (60), weighted cluster ratio [103], is similar to cluster ratio with a weighted metric $d_{ij}$. In general, the solution for the minimum weighted cluster ratio does not directly correspond to the partition of optimum cluster ratio. However, if distance $d_{ij}$ is a constant value between all pairs of vertex sets $V_i$ and $V_j$, then the weighted cluster ratio provides the solution for cluster ratio.
When the nets with positive distance $d_{ij}$ form a two-way partition, we can show that the partition defines the ratio cut. When the nets with positive distances form a $k$-way partition with $k \le 4$, we also find that there exists a two-way partition that again defines the ratio cut [28].
THEOREM 5.2 Let net set $D = \{e_{ij} \mid d_{ij} > 0\}$ define a cut that separates the circuit into $k$ disconnected subsets. If $k \le 4$, then there exists a ratio cut that is a subset of $D$.

A Replication Cut for Two-way Partitioning
We adopt the linear programming formulation of the network flow problem [1, 30], where each module is assigned a potential and a cut is represented by the difference of module potentials as shown in Figure 23. With respect to the directed cut $E(V_1 \to \bar{V}_1)$, we use $w_{ij}$ to denote the potential difference between the cut from module $v_i \in V_1$ to module $v_j \in \bar{V}_1$. The potential of each module $v_i$ is denoted by $p_i$. For modules $v_i$ in $V_1$, $p_i = 1$, and for modules $v_i$ in $\bar{V}_1$, $p_i = 0$. Thus all nets $e_{ij} \in E(V_1 \to \bar{V}_1)$ have $w_{ij} = 1$. The remaining nets have $w_{ij} = 0$.

FIGURE 23 p potential and q potential of each module.

With respect to the directed cut $E(\bar{V}_2 \to V_2)$, we use $u_{ji}$ with a reversed subscript $ji$ to denote the potential difference between the cut from module $v_i \in \bar{V}_2$ to module $v_j \in V_2$ (Fig. 23). The potential of each module $v_i$ is denoted by $q_i$. For modules $v_i$ in $\bar{V}_2$, $q_i = 1$, and for modules $v_i$ in $V_2$, $q_i = 0$. The potential difference $u_{ji}$ has a reverse direction with net $e_{ij}$ because we set the potential on the $\bar{V}_2$ side high and the potential on the $V_2$ side low. All nets $e_{ij} \in E(\bar{V}_2 \to V_2)$ have $u_{ji} = 1$. The remaining nets have $u_{ji} = 0$.
Expression (64) demands that potential $q_i$ be not less than potential $p_i$ for any module $v_i \in V$. Since high potential $p_i$ corresponds to set $V_1$, and high potential $q_i$ corresponds to set $\bar{V}_2$, inequality (64) enforces that $V_1$ be a subset of $\bar{V}_2$. Consequently, the requirement that $V_1 \cap V_2 = \emptyset$ is satisfied. Constraints (65)-(68) set the potentials of modules $v_s$ and $v_t$. Constraint (69) requires the potential differences $w_{ij}$ and $u_{ji}$ to be nonnegative. Figure 23 shows one ideal potential configuration of the solution.
Dual Linear Programming Formulation If we assign dual variables (Lagrangian multipliers) $x_{ij}$ to inequality (62) with respect to each net, $x'_{ij}$ to inequality (63), $\lambda_i$ to inequality (64) with respect to module $v_i$, and $a_s$, $b_s$, $a_t$, $b_t$ to inequalities (65)-(68), respectively, then we have the dual formulation.
We can view G(V, E) as a network flow problem and interpret $c_{ij}$ as the flow capacity and $x_{ij}$ as the flow of net $e_{ij}$. Constraint (71) requires that the flow $x_{ij}$ be not larger than the flow capacity $c_{ij}$ on each net $e_{ij}$. In constraint (72), the nets are in a reversed direction, and the flow $x'_{ij}$ is not larger than the capacity $c_{ji}$ of net $e_{ji}$ in E. Corresponding to G(V, E), we use G'(V', E') to denote the reversed graph. Constraint (73) has the total flow $x_{ij}$ injected from module $v_i$ into G be equal to $-\lambda_i$. On the other hand, constraint (74) has the total flow $x'_{ij}$ injected from module $v'_i$ into G' be equal to $\lambda_i$. Combining Eqs. (73) and (74), we have

$$-\sum_j x_{ij} + \sum_j x_{ji} = \lambda_i = \sum_j x'_{ij} - \sum_j x'_{ji} \quad (81)$$

This means that the amount of flow $\lambda_i$ which emanates from module $v_i$ in G enters its corresponding module $v'_i$ in G'.
Constraints (75)-(78) indicate that $a_s$ and $b_s$ are the flow injections to module $v_s$ in G and in its reversed circuit G'; $a_t$ and $b_t$ are the flow ejections from module $v_t$ in G and in its reversed circuit G', respectively. Combining circuits G and G' together, the maximum total flow, $a_s + b_s$, is the optimum solution of the minimum replication cut problem.

The Optimum Partition
In this subsection, we describe the construction of the replication graph and illustrate it with an example. We then apply the maximum flow algorithm on the constructed replication graph to derive an optimum replication cut. The optimality of the derived replication cut is proved by using a network flow approach.
Construction of Replication Graph Given a circuit G(V, E) and modules $v_s$ and $v_t$, we construct another circuit G'(V', E') where $|V'| = |V|$ with each module $v'_i$ in V' corresponding to a module $v_i$ in V, and $|E'| = |E|$ with each directed net $e'_{ij}$ in E' in the reverse direction of net $e_{ij}$ in E. We create super modules $v^*_s$ and $v^*_t$ and nets $(v^*_s, v_s)$, $(v^*_s, v'_s)$, $(v_t, v^*_t)$, and $(v'_t, v^*_t)$ with infinite capacity as shown in Figure 24. From every module $v_i$ in V except $v_s$ and $v_t$, we add a directed net of infinite capacity to the corresponding module $v'_i$ in V'. We refer to the combined circuit as G*.

FIGURE 24 The replication graph G*.
Polynomial-time Algorithm The optimum replication cut problem with respect to module pair $v_s$ and $v_t$ and without size constraints can be solved by a maximum-flow minimum-cut solution of the circuit G* with $v^*_s$ as the source and $v^*_t$ as the sink of the flow (Fig. 24). Suppose the maximum-flow minimum-cut finds partition $(X, \bar{X})$ of V with $v_s \in X$ and $v_t \in \bar{X}$, and partition $(X', \bar{X}')$ of V' with $v'_s \in X'$ and $v'_t \in \bar{X}'$. Then a replication cut $(V_1, V_2)$ of the original circuit with $V_1 = X$, $V_2 = \{v_i \mid v'_i \in \bar{X}'\}$, and $R = V - V_1 - V_2$ is an optimum solution. Note that $V_2$ is derived from the cut in vertex set V'. To simplify the notation, we shall use $(X, \bar{X}')$ to denote the derived replication cut of G.
Example Given a circuit in Figure 25, its replication graph G* is constructed as shown in Figure 26. The maximum-flow minimum-cut of G* derives $(X, \bar{X}')$ with a flow amount of 5 (Fig. 26). Thus the sets $V_1 = \{v_s, v_a\}$ and $V_2 = \{v_t\}$ define an optimum replication cut $R(V_1, V_2)$ with $R = \{v_b, v_c\}$ and a cut cost equal to 5 (Fig. 27).
The network flow approach leads to the optimality of the solution as stated in the following theorem.
THEOREM 5.3 The replication cut $R(X, \bar{X}')$ derived from the transformed circuit G* generates the minimum replication cut count (expression (19)).

Heuristic Flow Algorithms
We introduce heuristic approaches that accelerate the flow calculation and take advantage of the optimality properties of the flow methods.
We first introduce an approach that utilizes the maximum flow minimum cut method for the min cut with size constraints. We then explain a shortest path method for multiple commodity flow calculation.
(i) Usage of Maximum Flow Minimum Cut We adopt a heuristic approach [113] to get around the unbalanced partition of the maximum flow and minimum cut method. First, we find two seeds as the source and the sink modules, $v_s$ and $v_t$. We then use the maximum flow and minimum cut method to find partition $(V_1, V_2)$ with $v_s \in V_1$ and $v_t \in V_2$. If the size $S(V_1)$ of $V_1$ is larger than the size $S(V_2)$ of $V_2$, we find from $V_1$ a module $v_i$ to merge with $V_2$ and shrink set $V_2$ into a new sink module. Otherwise, we find from $V_2$ a module $v_i$ to merge with $V_1$ and shrink set $V_1$ into a new source module. We repeat the maximum flow minimum cut process on the graph with the new source or sink module until the size of the partition fits the size constraint.

Two Way Partitioning using Maximum Flow Minimum Cut
1. Find two seeds as $v_s$ and $v_t$.
2. Call Maximum Flow Minimum Cut to find partition $(V_1, V_2)$.
3. If $S(V_1) > S(V_2)$, find a seed $v_i \in V_1$, merge $\{v_i\} \cup V_2$ into a new sink module $v_t$.
4. Else find a seed $v_i \in V_2$, merge $\{v_i\} \cup V_1$ into a new source module $v_s$.
5. Repeat Steps 2-4 until $S_l \le S(V_1) \le S_u$ and $S_l \le S(V_2) \le S_u$.

We can apply a parametric flow approach to the successive maximum flow minimum cut problems (Step 2). The total complexity is then equivalent to that of a single maximum flow minimum cut.
The seeds are chosen according to their connectivity to the vertex set on the other side. The result is sensitive to the choice of the seeds. We can make multiple trials and choose the best result. Other methods such as the programming approach can serve as a guideline for the choice of the seeds [79, 80]. The method has been shown to derive excellent results with reasonable running time.
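The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [113]: modules have unit size, nets are undirected two-pin nets, merging a module into the source or sink is emulated by tying it to a super terminal with an infinite-capacity arc, the seed picked in Steps 3-4 is the module with the largest connectivity to the other side, and each cut is recomputed from scratch with a plain Edmonds-Karp routine rather than with a parametric flow. The names `min_cut`, `balanced_bipartition`, `S*`, and `T*` are hypothetical.

```python
from collections import deque

INF = float("inf")

def min_cut(cap, s, t):
    """Edmonds-Karp max flow. Returns (flow value, nodes on the source side)."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:                       # add zero-capacity reverse arcs
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:     # breadth-first augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                  # no path left: parent = source side
            return flow, set(parent)
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck

def balanced_bipartition(edges, seed_s, seed_t, lower, upper):
    """Repeat max-flow min-cut, growing the terminal of the smaller side."""
    nodes = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    adj = {u: {} for u in nodes}
    for u, v, c in edges:
        adj[u][v] = adj[u].get(v, 0) + c
        adj[v][u] = adj[v].get(u, 0) + c
    src, snk = {seed_s}, {seed_t}
    while True:
        cap = {u: dict(nb) for u, nb in adj.items()}
        cap["S*"] = {u: INF for u in src}    # merged source module
        for u in snk:
            cap[u]["T*"] = INF               # merged sink module
        cut, side = min_cut(cap, "S*", "T*")
        v1 = side - {"S*"}
        v2 = nodes - v1
        if lower <= len(v1) <= upper and lower <= len(v2) <= upper:
            return v1, v2, cut
        if len(v1) > len(v2):                # Step 3: grow the sink
            cand, other, grow = v1 - src, v2, snk
        else:                                # Step 4: grow the source
            cand, other, grow = v2 - snk, v1, src
        if not cand:
            raise ValueError("size constraint cannot be met")
        pick = max(sorted(cand),
                   key=lambda u: sum(c for w, c in adj[u].items() if w in other))
        grow.add(pick)

edges = [("a", "b", 1), ("b", "c", 3), ("c", "d", 2), ("d", "e", 3), ("e", "f", 3)]
print(balanced_bipartition(edges, "a", "f", 3, 3))
```

On this chain the first minimum cut isolates module a alone, so the source is grown by one module before the second cut yields a three-by-three partition.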
(ii) Approximation of Multiple Commodity Flow Based on the multicommodity flow formulation [103], we try to solve a multiple way partitioning by deriving an approximate multiple commodity flow with a stochastic process [13, 55, 114, 117].
Given a circuit H(V, E), the flow increment $\Delta$, and the distance coefficient $\alpha$, the algorithm starts with procedure Saturate-Network to saturate the circuit with flows. A stochastic flow injection algorithm is adopted to reduce the computational complexity. Then, Select-Cut is activated to select a set of nets by their flow values to constitute a cut. The conversion from weighted ratio cut to cluster ratio cut is performed by the Select-Cut routine, which selects a subset of the cut derived from Saturate-Network with a greedy approach.
Procedure Saturate-Network (H, $\Delta$, $\alpha$)
1. Set the distance of each net e to be one.
2. Repeat Steps 2.1-2.3 until a pair of modules is chosen for which all paths between them are saturated.
2.1. Randomly pick two distinct modules $v_s$ and $v_t$.
2.2. Find the shortest path between $v_s$ and $v_t$.
2.3. For each net e on the shortest path, let $f(e)$ and $d_e$ be the flow and distance of net e.
2.3.1. If e is not saturated, increase $f(e)$ by $\Delta$ and set $d_e = \exp((\alpha \times f(e))/c_e)$.
2.3.2. If e is saturated, set $d_e$ to be $\infty$.
3. Output E with flow information.
The initial distance of each net is one since no flow has been injected (see the distance formulation in Step 2.3.1).
Step 2.1 uses a random process with even distribution over all modules to pick two distinct modules, and Steps 2.2-2.3 inject $\Delta$ amount of flow along the shortest path between the modules. In Steps 2.3.1-2.3.2, the distances of the nets whose flow has been increased are recomputed using an exponential function $d_e = \exp((\alpha \times f(e))/c_e)$ to penalize the congested nets, where $d_e$ and $f(e)$ are the distance and flow of net e, respectively. Steps 2.1-2.3 are iteratively executed until a pair of modules is chosen for which all possible paths between them are saturated by flows. These saturated nets identify a partition of the circuit.
Figure 28 shows a sample circuit saturated by flows after executing Saturate-Network with $\Delta = 0.01$ and $\alpha = 10$. The flow values are shown by the numbers beside each net. The dashed lines indicate the cut lines along the set of saturated nets that form the three clusters. These saturated nets define an approximate weighted cluster ratio cut, which is the candidate set of nets for the selection of a cluster ratio cut.
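As an illustration of procedure Saturate-Network, the following Python sketch injects flow along shortest paths between random module pairs and penalizes congested nets exponentially. It is a simplified reading of the procedure, not the code of [103]: nets are two-pin, a net counts as saturated once its flow reaches its capacity (within a small tolerance), and Select-Cut is not implemented. The function names, the `seed` parameter, and the sample circuit are hypothetical.

```python
import heapq
import math
import random

def shortest_path(adj, dist, s, t):
    """Dijkstra over net distances. Returns the nets on the path, or None."""
    best, prev, heap = {s: 0.0}, {}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best.get(u, math.inf):
            continue
        if u == t:
            break
        for v in sorted(adj[u]):
            nd = d + dist[frozenset((u, v))]   # infinite for saturated nets
            if nd < best.get(v, math.inf):
                best[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if t not in prev:
        return None                            # every path is saturated
    path = []
    while t != s:
        path.append(frozenset((prev[t], t)))
        t = prev[t]
    return path

def saturate_network(capacity, delta, alpha, seed=0):
    """capacity: {(u, v): c_e}. Returns (flow per net, set of saturated nets)."""
    cap = {frozenset(net): c for net, c in capacity.items()}
    adj = {}
    for net in cap:
        u, v = tuple(net)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    modules = sorted(adj)
    rng = random.Random(seed)
    flow = {net: 0.0 for net in cap}
    dist = {net: 1.0 for net in cap}           # Step 1
    while True:
        s, t = rng.sample(modules, 2)          # Step 2.1
        path = shortest_path(adj, dist, s, t)  # Step 2.2
        if path is None:
            break
        for net in path:                       # Step 2.3
            flow[net] += delta
            if flow[net] >= cap[net] - 1e-9:
                dist[net] = math.inf           # Step 2.3.2
            else:
                dist[net] = math.exp(alpha * flow[net] / cap[net])
    return flow, {net for net in cap if dist[net] == math.inf}

# Two triangles joined by one low-capacity net (hypothetical circuit).
circuit = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5, ("c", "d"): 1,
           ("d", "e"): 5, ("e", "f"): 5, ("d", "f"): 5}
flow, saturated = saturate_network(circuit, delta=0.05, alpha=10)
print(saturated)
```

In this sample the bridging net carries the traffic of every pair that spans the two triangles, so it is the net that saturates and separates the two clusters.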

Programming Approaches
For programming approaches [7, 18, 35, 41, 44, 46], we adopt two way minimum cut with size constraints as the target problem. We assume that the nets are two pin nets and thus the circuit can be described as a graph G(V, E). We also assume the modules are of unit size, i.e., $s_i = 1$. The two way partition $(V_1, V_2)$ is represented by a linear placement with only two slots at coordinates $-1$ and $1$. For an even sized partition, half of the modules are assigned to each slot. Let $x_i$ denote the coordinate of module $v_i$: $x_i = -1$ if $v_i \in V_1$, else $x_i = 1$ for $v_i \in V_2$. The cut count can be expressed as follows.

$$C(V_1, V_2) = \frac{1}{4} \sum_{e_{ij} \in E} c_{ij} (x_i - x_j)^2 = \frac{1}{4} X^T B X \quad (82)$$

where X is a vector of $x_i$, and $X^T$ is the transpose of vector X. Matrix B has its entry $b_{ij} = -c_{ij}$ if $i \ne j$, else $b_{ii} = \sum_{1 \le j \le |V|} c_{ij}$.

FIGURE 28 The flow and partition generated by Saturate-Network.

Suppose we relax the slot constraint by enforcing only the rules of the gravity center and the norm. The constraints on vector X can be expressed as:

$$\mathbf{1}^T X = 0 \quad (83)$$

$$X^T X = |V| \quad (84)$$

Matrix B is symmetric and diagonally semidominant. Thus, it is positive semidefinite, i.e., all eigenvalues are nonnegative, and its eigenvectors are orthogonal. Let us order its eigenvalues from small to large, i.e., $\lambda_0 \le \lambda_1 \le \cdots \le \lambda_{|V|-1}$. The smallest eigenvalue is $\lambda_0 = 0$ with its eigenvector $X_0 = \mathbf{1}$. The second eigenvalue $\lambda_1$ is nonnegative with its eigenvector orthogonal to the first eigenvector, i.e., $X_1^T X_0 = \mathbf{1}^T X_1 = 0$. Therefore, the second eigenvector $X_1$ is an optimal solution to objective function (82) with constraints (83) and (84) [46]. Since $X_1^T X_1 = |V|$ (Eq. (84)), the solution is

$$\frac{1}{4} X_1^T B X_1 = \frac{\lambda_1}{4} X_1^T X_1 = \frac{\lambda_1 \times |V|}{4} \quad (85)$$

which is a lower bound of the min-cut problem.
To push for a higher lower bound, we can adjust the diagonal terms of matrix B by adding constants $d_i$. Let

$$C(V_1, V_2) = C(V_1, V_2) + \frac{1}{4} \sum_{1 \le i \le |V|} d_i x_i^2 - \frac{1}{4} \sum_{1 \le i \le |V|} d_i = \frac{1}{4} X^T \hat{B} X - \frac{1}{4} \sum_{1 \le i \le |V|} d_i \quad (86)$$

where matrix $\hat{B}$ has its entry $\hat{b}_{ij} = b_{ij}$ if $i \ne j$, else $\hat{b}_{ii} = b_{ii} + d_i$. Since either $x_i = -1$ or $x_i = 1$, the two added terms cancel each other. The modification thus does not alter the optimal partition solution.
The new nonlinear programming problem is to find the assignment of $d_i$ that maximizes the objective function of expression (87) [11], where $\hat{\lambda}_1$ is the second smallest eigenvalue of matrix $\hat{B}$. The solution is a lower bound of the cut count of the partition. It is no smaller than the bound derived from $\lambda_1$, in the sense that $\lambda_1$ can serve as an initial feasible solution to maximize expression (87).
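The eigenvector relaxation described above can be tried on a small example. The sketch below is illustrative only and rests on several assumptions: the second eigenvector of B is approximated by a shifted power iteration with the all-ones component projected out (any eigen-solver could be swapped in), the relaxed coordinates are rounded to slots by splitting at the median, and the diagonal adjustment by $d_i$ is not included. The names `fiedler_vector`, `spectral_bipartition`, and `cut_count` are hypothetical.

```python
import math

def fiedler_vector(c, iters=500):
    """Approximate (lambda_1, X_1) of B, where b_ij = -c_ij and b_ii = sum_j c_ij."""
    n = len(c)
    deg = [sum(row) for row in c]
    shift = 2.0 * max(deg)                      # upper bound on eigenvalues of B
    x = [math.sin(i + 1.0) for i in range(n)]   # deterministic starting vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]             # enforce gravity center: 1^T X = 0
        y = [(shift - deg[i]) * x[i] + sum(c[i][j] * x[j] for j in range(n))
             for i in range(n)]                 # y = (shift * I - B) x
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    mean = sum(x) / n
    x = [xi - mean for xi in x]
    bx = [deg[i] * x[i] - sum(c[i][j] * x[j] for j in range(n)) for i in range(n)]
    lam = sum(a * b for a, b in zip(x, bx)) / sum(a * a for a in x)
    return lam, x

def spectral_bipartition(c):
    """Split the modules into two equal halves by their relaxed coordinate."""
    n = len(c)
    _, x = fiedler_vector(c)
    order = sorted(range(n), key=lambda i: x[i])
    return set(order[: n // 2]), set(order[n // 2:])

def cut_count(c, x):
    """(1/4) * sum over nets of c_ij (x_i - x_j)^2, which equals (1/4) X^T B X."""
    n = len(c)
    return sum(c[i][j] * (x[i] - x[j]) ** 2
               for i in range(n) for j in range(i + 1, n)) / 4.0

# Two triangles {0,1,2} and {3,4,5} joined by the net (2,3).
c = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
lam, _ = fiedler_vector(c)
v1, v2 = spectral_bipartition(c)
x = [-1 if i in v1 else 1 for i in range(len(c))]
print(v1, v2, cut_count(c, x), lam * len(c) / 4)
```

For this circuit the rounded partition separates the two triangles with a cut count of 1, while the relaxed bound $\lambda_1 |V| / 4$ stays below that value, as expression (85) requires.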
Remarks The programming approach finds a global view of the problem [9, 79, 80, 118]. However, the formulation is very restricted. The extension to multiple pin nets and the incorporation of fixed modules destroy the nice structure on which the optimality of the eigenvalue and eigenvector solutions rests. Therefore, it is difficult to utilize the approach recursively.
For a general case, we can view the problem as nonlinear programming with a Boolean quadratic objective function. Nonlinear programming techniques are adopted to derive the results [16, 107].

5.6. A Lagrange Multiplier Approach for Performance Driven Partitioning
The Lagrange multiplier is one useful tool for performance optimization. In this section, we demonstrate the usage of the Lagrange multiplier for performance driven partitioning. The problem is to optimize the performance of a two-way partition $(V_1, V_2)$ with retiming [86].
We first introduce a vector of binary variables to represent a partition. The performance-driven partitioning problem is thus represented by a Boolean quadratic programming formulation with nonlinear constraints. We then absorb the nonlinear constraints into the objective function as a Lagrangian. We use primal and dual subproblems to decompose the Lagrangian and derive the partitions. The Lagrange multiplier is adjusted in each iteration via a subgradient method to monitor the timing criticality and improve the performance.

We assume that the circuit can be represented by a graph G(V, E) with two pin nets and unit module size. The two-way partition is described by a vector $x = (x_{1,1}, \ldots, x_{1,n}, x_{2,1}, \ldots, x_{2,n})$, where $x_{b,i}$ is 1 if module $v_i$ is assigned to vertex set $V_b$, otherwise $x_{b,i}$ is 0. If modules $v_i$ and $v_j$ are in different vertex sets, the value of the term $x_{1,i}x_{2,j} + x_{2,i}x_{1,j}$ is equal to 1. This contributes one interpartition delay $\delta$ to the delay of the net $e_{ij}$. Let $g_l(x)$ denote the delay to register ratio of loop $l$.
Delay ratio $g_l(x)$ can be written as the following formula:

$$g_l(x) = \frac{d_l + \delta \sum_{e_{ij} \in l} (x_{1,i}x_{2,j} + x_{2,i}x_{1,j})}{r_l} \quad (88)$$

Given a path $p$, the total delay $h_p(x)$ of $p$ is as follows:

$$h_p(x) = d_p + \delta \sum_{e_{ij} \in p} (x_{1,i}x_{2,j} + x_{2,i}x_{1,j}) \quad (89)$$

To formulate the problem, we use an objective function of cut count:

$$\min \sum_{e_{ij} \in E} c_{ij} (x_{1,i}x_{2,j} + x_{2,i}x_{1,j}) \quad (90)$$

Actually, we do not need to consider all loops in C3. Because all loops are composed of simple loops, we have the following lemma:

LEMMA Given a number $J$, if $g_l(x)$ is less than or equal to $J$ for any simple loop $l$, then $g_l(x)$ is less than or equal to $J$ for all loops $l$.

Let $\pi_c$ and $\pi_p$ represent the number of the simple loops and the number of I/O-critical paths, respectively. Let $\lambda$ denote the vector $(\lambda_{g_1}, \ldots, \lambda_{g_{\pi_c}}, \lambda_{h_1}, \ldots, \lambda_{h_{\pi_p}})$. Using Lagrangian Relaxation [104], we absorb the constraints (93) and (94) into the objective function (90). The Lagrangian-relaxed problem is as follows.

$$\max_{\lambda \ge 0} \min_x L(x, \lambda) \quad (95)$$

subject to constraints C1 and C2, where

$$L(x, \lambda) = \sum_{e_{ij} \in E} c_{ij} (x_{1,i}x_{2,j} + x_{2,i}x_{1,j}) + \sum_{\forall \text{ simple loop } l} \lambda_{g_l} (g_l(x) - J) + \cdots \quad (96)$$

(i) The Dual Problem Given vector $x$, we can represent (96) as a function of variable $\lambda$, i.e., $L_x(\lambda)$. Thus, the dual problem can be written as:

$$\max_{\lambda \ge 0} L_x(\lambda)$$
(ii) The Primal Problem Let $F_{ij}$ and $Q_{ij}$ denote the sets of the simple loops and I/O-critical paths passing through the net $e_{ij}$. The cost $a_{ij}$ of net $e_{ij}$ is composed of the connectivity $c_{ij}$ and the penalty of the timing constraints.
Given vector $\lambda$, we can represent (96) as a function of vector $x$, i.e., $L_\lambda(x)$. Thus, the primal problem can be rewritten as:

$$\min L_\lambda(x) = \min \sum_{e_{ij} \in E} a_{ij} (x_{1,i}x_{2,j} + x_{2,i}x_{1,j}) + \beta \quad (99)$$

subject to C1 and C2, where $\beta$ represents the constant contributed by $\lambda$. We solve the partitioning problem through primal and dual iterations on the Lagrangian. A Quadratic Boolean Programming (QBP) method [16] is used to solve the primal problem and generate a solution $x$ (Step 2).
For the dual problem based on $x$, we select the set of loops and paths that violate the timing constraints as active loops and paths. The nets contained in the active loops or paths are termed active nets.
Active Loops and Paths Given a solution $x$, a loop $l$ is called active if $g_l(x)$ is not less than $J$. A path $p$ is called active if $h_p(x)$ is not less than the given path delay limit.
Active Nets Given a net e, we define e to be an active net if net e is covered by an active loop or an active path.
We call a minimum cycle mean algorithm [57] and an all-pairs shortest-paths algorithm to mark all the nets on active loops and paths, respectively (Step 3). For every net $e_{ij}$ on active paths, we record $q_{ij}$: the maximum path delay among all paths passing through $e_{ij}$. For every net $e_{ij}$ on active loops, we record $p_{ij}$: the maximum delay-to-register ratio among all loops passing through $e_{ij}$. We then calculate the subgradient on the marked nets and update the constants $a_{ij}$ for the next primal dual iteration (Steps 4-5). We increase the costs of active nets using the subgradient approach [104]. The iteration proceeds until the bounds of all loops and paths are within the given limits.

Clustering Heuristics
We first discuss the usage of clustering heuristics. We then discuss top down clustering and bottom up clustering approaches. Finally, we discuss some variations of clustering metrics.

Usage of Clustering Heuristics
The usage of clustering heuristics plays an important role in determining the quality of the final results.In the following, we discuss the issue in different topics.We use a two-way partitioning with size constraints as the target problem.
1. Top Down Clustering versus Bottom Up Clustering: The top down clustering approach provides a global view of the solution. The operations are consistent with the target problem. However, it is more time consuming because the clustering operates on the whole circuit [29]. Bottom up clustering is efficient. However, because the process operates locally, the target solution is sensitive to the clustering heuristics [59].
2. The Level of the Clustering: Suppose we represent the clustering results with a hierarchical tree structure. Let the root correspond to the whole circuit, the leaves correspond to the smallest clusters, and the internal nodes correspond to the intermediate clusters. Hence, the size of the clusters grows with the level of the nodes. Top down clustering creates clusters corresponding to nodes in high levels, while bottom up clustering creates clusters corresponding to nodes in low levels.
For example, in [60], Kernighan and Lin proposed a top down clustering approach, which divides the whole circuit into four clusters only. In [59], Karypis et al. used a bottom up clustering which starts with clusters of two modules or a net. If we continue the application of bottom up clustering on intermediate clusters, the quality of the clusters degenerates as the size of the clusters grows bigger.
3. Iteration of Clustering and Unclustering: We go through the iterations of clustering and unclustering to improve the quality of the results.
At each level of the hierarchical tree, we derive an intermediate target solution, e.g., a two-way partition. In unclustering, we go down the levels of the tree hierarchy to find an expanded circuit with more modules. In clustering, we go up the levels of the tree hierarchy with a circuit of a smaller number of modules. The previous partitioning result becomes the initial solution of the new partitioning problem. Note that the hierarchical tree is constructed dynamically. For each clustering, the modules can be grouped based on the current partitioning configuration.
4. The Clustering Operations and the Target Solution: The clustering operation has to be consistent with the target solution. For example, suppose the target is finding a two-way min-cut with size constraints. Then, it is natural to cluster modules based on net connectivity because the probability that a net is in an optimal cut set is small (see the subsection of min-cut with size constraints in problem formulations). Moreover, it is important that the clustering follows the current partitioning results, i.e., only modules in the same partition are clustered.

We use an application to two-way cut with size constraints to illustrate the top down clustering approach [24, 29]. The partitioning of huge designs is complicated and the results can be erratic. Our strategy (Fig. 29) is to reduce the circuit complexity by constructing a contracted hypergraph. The clusters for the contracted hypergraph are searched via a recursive top down partitioning method. The number of modules is much reduced after we contract the clusters. Hence, a group migration approach can derive excellent two way cut results on the contracted hypergraph with much efficiency. Furthermore, since the clusters are grouped via a top down partitioning, conceptually a minimum cut on the hypergraph can take advantage of the previous results and generate better solutions.
In this section, we describe a top down clustering algorithm. A ratio cut is adopted to perform the top down clustering process. Other partition approaches can also be used to replace the ratio cut. A group migration method is used to find a minimum cut of the contracted hypergraph with size constraint. Finally, we apply a last run of the group migration algorithm to the original circuit to fine tune the result.
Input a hypergraph H(V, E), an integer k for the number of expected clusters, an integer num_of_reps for repetition, and $S_l$, $S_u$ for the size constraints of the two resultant subsets.
5. Construct a contracted hypergraph $H_r(V_r, E_r)$.
6. Apply a group migration algorithm num_of_reps times to $H_r$ with the size constraints $S_l$, $S_u$.
7. Use the best result from Step 6 as an initial partition of the circuit H. Apply a group migration algorithm once to H with the size constraints $S_l$, $S_u$.

The choice of cluster number k It was shown [24] that the cut count versus cluster number k is a concave curve. When k is small, the quality is not as good because the clusters are too coarse. When k is large, there are too many clusters and we lose the benefit of the clustering.
For the case that the circuit is large, we may need to adopt multiple levels of clustering to push for the performance and efficiency [58,66].

Bottom Up Clustering Approaches
In this section, we discuss bottom up clustering [90] with two applications: linear placement and performance driven designs. We then show two strategies to perform the clustering: maximum matching and maximum pairing. We will demonstrate via examples the advantage of maximum pairing over maximum matching.
(i) Linear Placement For linear placement, we reduce the complexity of the problem by a bottom up clustering approach [53, 96, 100]. The clustering is based on the result of a tentative placement. We adopt a heuristic approach to generate tentative placements throughout the iterations. In each iteration, we cluster modules only when they are in consecutive order of the placement. We then construct a contracted hypergraph. In the next iteration, the heuristic approach generates the placement of the contracted hypergraph. For each iteration, we either grow the size of the clusters or construct new clusters adaptively.
Inspired by the property of the minimum cut separating two modules (Theorem 3.1), we use a density as a measure to find the cluster.A density d(i) at a slot of a linear placement is the total connectivity of nets connecting modules on the different sides of the slot.The following algorithm describes the clustering using a given placement.Each cluster size is between L and U.
Input placement P, two parameters L and U.
2. Scan placement P from slot p toward the right end. Find slot i such that $p + L \le i \le p + U$ and density $d(i)$ is minimum among $d(p + L), \ldots, d(p + U)$.
3. Cluster modules between slots p and i. Set $p = i + 1$.
4. Repeat Steps 2, 3 until the scan reaches the right end.
Remark The proposed clustering process and the criteria are consistent with the target linear placement application.The whole process depends on an efficient and effective linear placement.
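The scan can be illustrated with a short Python sketch. It is a simplified reading of the steps above under explicit assumptions: positions are zero-based, the density is measured at the boundary to the right of each position, the scan starts at the left end, the window is chosen so that every cluster holds between L and U modules, the right end of the placement counts as a boundary of density zero, and a tail shorter than L becomes one final cluster. The function names and the sample nets are hypothetical.

```python
def slot_densities(placement, nets):
    """dens[i]: total connectivity of nets crossing the boundary after position i."""
    pos = {m: k for k, m in enumerate(placement)}
    dens = [0.0] * len(placement)        # last entry is the right end: no crossing
    for pins, weight in nets:
        lo = min(pos[m] for m in pins)
        hi = max(pos[m] for m in pins)
        for i in range(lo, hi):          # boundaries spanned by the net
            dens[i] += weight
    return dens

def cluster_by_density(placement, nets, L, U):
    """Cut the placement at minimum-density boundaries; cluster sizes in [L, U]."""
    n = len(placement)
    dens = slot_densities(placement, nets)
    clusters, p = [], 0
    while p < n:
        first = p + L - 1                # smallest allowed cluster ends here
        last = min(p + U - 1, n - 1)     # largest allowed cluster ends here
        if first > last:                 # fewer than L modules remain
            i = n - 1
        else:
            i = min(range(first, last + 1), key=lambda s: dens[s])
        clusters.append(placement[p:i + 1])
        p = i + 1
    return clusters

placement = ["a", "b", "c", "d", "e", "f", "g", "h"]
nets = [(("a", "b"), 3), (("b", "c"), 3), (("c", "d"), 1), (("d", "e"), 3),
        (("e", "f"), 1), (("f", "g"), 3), (("g", "h"), 3)]
print(cluster_by_density(placement, nets, L=2, U=4))
```

With L = 2 and U = 4 the cuts fall on the two weakly connected boundaries, so the strongly connected neighbors stay in the same cluster.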
(ii) Performance Driven Clustering For performance driven clustering [31, 112], nets which contribute to the longest delay are termed critical nets. Pins of the critical nets are merged to form clusters.
For a special case that the circuit is a directed tree, we can find optimal solution in polynomial time.Let us assume the tree has its leaves at the input and its root at the output.We use a dynamic programming approach to trace from the leaves toward the root.Each module is not traced until all its input modules are processed.For each module, we treat it as a root of a subtree and find the optimal clustering of the subtree.Since all the modules in the subtree except its root have been processed, we can derive an optimal solution of the root in polynomial time.
(iii) Maximum Matching The maximum matching pairs all modules into $|V|/2$ groups simultaneously. Given a measurement of pairing modules, we can find a matching that maximizes the total pairing measurement in polynomial time.
We can call maximum matching recursively to create clusters of equal sizes. However, this strategy may force unrelated pairs to merge, which sacrifices the quality of the final clustering results.
Example Figure 30 illustrates the clustering behavior of maximum matching. The circuit contains twelve modules of equal size. The first level maximum matching pairs modules (a, b), (d, e), (g, h), (j, k), (c, l), and (f, i). Modules in the first four pairs are strongly connected with their partners. However, the last two are not. Modules c and l have no common nets but are merged because their choices are taken by others.
Furthermore, as we proceed to the next level of maximum matching, the merge of pairs (c, l) and (f, i) will force the grouping of modules into cluster {a, b, c, j, k, l} and cluster {d, e, f, g, h, i}. If we measure the quality of the results with cluster cost (expression (26)), the cost of the two clusters is 4/12 + 4/12 = 2/3. For this case, we can find a better solution of clusters {a, b, c, d, e, f} and {g, h, i, j, k, l}, of which the cluster cost is equal to zero. Figure 31 shows another example of twelve modules with connectivities attached to the nets.
The connectivity is 1 if not specified. Figure 31(a) shows an optimum cut with cut count 6.6. If a maximum matching [61] criterion is adopted in the bottom up clustering approach, then modules with a net of weight 1.1 between them will be merged. A minimum cut on the merged modules yields a cut count of 18 (Fig. 31(b)). In general, a 2n module circuit having a symmetric configuration as in Figure 31 will have a cut count of $n^2/2$ if the maximum matching criterion is applied to perform the clustering, while the optimum solution will have a cut weight of $1.1 \times n$. From this extreme case, we can claim the following theorem:

THEOREM 5.4 There is no constant factor of error bound of the cut count generated by the maximum matching approach, from the cut count of a minimum cut.
Proof As shown in the above example, the factor of error bound is $(n^2/2)/(1.1 \times n) = n/2.2$, which is not a constant. Q.E.D.

(iv) Maximum Pairing The maximum pairing is similar to maximum matching, except that it does not enforce the matching of all modules. Only the top q percent of the modules are paired. Thus, we can avoid the enforced pairing of unrelated modules.
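A small Python sketch can make the contrast with maximum matching concrete. It is only one possible reading of maximum pairing, with assumptions stated plainly: candidate pairs are taken greedily in decreasing order of their pairing measurement (a true maximum weight matching is not computed), and pairing stops once q percent of the modules are paired. The function name `maximum_pairing` and the pairing weights, loosely modeled on the twelve module example, are hypothetical.

```python
def maximum_pairing(weights, q):
    """Greedily pair modules by decreasing measurement until q percent are paired.

    weights: {(u, v): pairing measurement}; q: percentage between 0 and 100.
    """
    modules = {m for pair in weights for m in pair}
    limit = int(len(modules) * q / 100)       # number of modules allowed in pairs
    paired, pairs = set(), []
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), _ in ranked:
        if len(paired) + 2 > limit:
            break                             # quota of paired modules reached
        if u not in paired and v not in paired:
            pairs.append((u, v))
            paired.update((u, v))
    return pairs

# Four strongly connected pairs; c, f, i, l only touch already strong partners.
weights = {("a", "b"): 3, ("d", "e"): 3, ("g", "h"): 3, ("j", "k"): 3,
           ("b", "c"): 1, ("e", "f"): 1, ("h", "i"): 1, ("k", "l"): 1}
print(maximum_pairing(weights, q=70))
```

Here the four strong pairs are formed and modules c, f, i, and l stay single, whereas a maximum matching over all twelve modules would have to pair them with unrelated partners.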
However, this strategy may cause certain modules to keep on growing and produce very uneven cluster results. Thus, we need to choose a proper cost function that discourages unlimited growth of the cluster size, e.g., cost function (26).

In order to identify good clusters, we need to look beyond the direct adjacency between modules. It is useful if we can also extract the relation between the neighbors' neighbors, or even several levels of neighbors' neighbors. The probabilistic gain model of the group migration approach is one good example of such an approach [37, 42].
In this section, we will discuss a few different clustering metrics.For the case of k connectivity, we count the number of k-hop paths between two modules.Or, we use an analogy of a resistive network to check the conductance between the modules.Furthermore, we check beyond the hypergraph and use other information such as the module functions, pin locations, and control signals.
(i) kth Connectivity The number of k-hop paths between two modules provides a different aspect of information on the adjacency. Suppose the circuit has only two-pin nets. We can derive the kth connectivity with sparse matrix multiplication. Let C be the connectivity matrix with connectivity $c_{ij}$ as its element at row i column j and at row j column i, and with diagonal entries $c_{ii} = 0$.
Note that we set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. Let $c^{(2)}_{ij}$ be the element of the square of matrix C ($C^2$), and $c^{(k)}_{ij}$ be the element of the kth power of matrix C ($C^k$). Then $c^{(k)}_{ij}$ represents the number of distinct k-hop paths connecting modules $v_i$ and $v_j$.
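As a small worked example, the kth connectivity can be computed by repeated matrix multiplication. The sketch below uses dense lists for brevity, whereas a real circuit would call for a sparse representation; the function names are hypothetical. One caveat worth keeping in mind: the entries of $C^k$ count k-hop walks weighted by connectivity, so routes that revisit a module are included.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def kth_connectivity(c, k):
    """Return C^k, whose (i, j) entry is the kth connectivity of v_i and v_j."""
    result = [row[:] for row in c]
    for _ in range(k - 1):
        result = mat_mul(result, c)
    return result

# Chain of three modules: v0 - v1 - v2, each net with connectivity 1.
c = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(kth_connectivity(c, 2))
```

In this chain, modules v0 and v2 share no net, yet their second order connectivity is 1 because of the single two-hop route through v1.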
(ii) Conductivity We use a resistive network analogy [21, 93] to derive the relation between modules. Suppose the circuit has only two pin nets. We replace each net $e_{ij}$ with a resistor of conductance $c_{ij}$. Hence, we can view the whole system as a resistive network and derive the conductance between modules. The system conductance between two modules $v_i$ and $v_j$ reveals the adjacency relation between the two modules.
The network conductance can be derived using circuit analysis. We can also approximate the conductance with a random walk approach. In a random network model, we start walking from a module $v_i$. At each module $v_k$, the probability to walk via net $e_{kl}$ to module $v_l$ is proportional to the connectivity, i.e., $c_{kl} / \sum_m c_{km}$. We can derive the relation between the random walk and the conductivity [89]:

$$h_{ij} + h_{ji} = \frac{2 \sum_{e \in E} c_e}{\sigma_{ij}} \quad (100)$$

where $h_{ij}$ denotes the expected number of hops to walk from module $v_i$ to $v_j$, and $\sigma_{ij}$ denotes the conductance between $v_i$ and $v_j$.

(iii) Similarity of Signatures We can use certain features beyond connectivity for the clustering metric [88, 91]. For example, the index of data bits, sequence of the pins, function of logic, and relation with common control signals can serve as signatures of function blocks in data path designs. All these features form the first level adjacency. We can extend the relation to multiple levels. For example, two modules connecting to a set of modules with strong similarity become similar themselves.
Example As shown in Figure 32, modules A and B are similar in signature because they are of the same OR function, connected to consecutive bit numbers at the same pin location, and controlled by the same control signal at the same pin location.
Modules C and D become similar because module C obtains its signal from A, module D obtains its signal from B, and modules A and B are similar.

RESEARCH DIRECTIONS
Partitioning remains an important research problem. Many applications such as floorplanning, engineering change orders, and performance driven emulation demand effective and efficient partitioning solutions.
Recent efforts released benchmarks with reasonable complexity [3]. However, more design cases are still needed to represent the class of huge circuitry with details of functions and timing.
In this section, we touch on a few interesting research problems regarding the correlation between the partition of logic and physical designs, the manipulation of hierarchical tree structure, and the performance driven partitioning.

It is desired to correlate the logic hierarchy with the physical design hierarchy. The main reason is the control of timing for huge designs. Currently, the design turnaround takes 2-8 months for ASIC and much longer for custom designs. Throughout the design process, designs keep on changing. We do not want to lose control of timing as the design changes. A tight correlation of logic and physical hierarchies makes timing predictable. Without this kind of mechanism, the timing characteristics of a floorplan may become erratic after iterations of design changes.

Manipulation of Hierarchical Partitioning Structure
One main issue in mapping a huge hierarchical circuit is the utilization of the hierarchy to reduce the mapping complexity. We can drastically improve the efficiency of the mapping process if we properly exploit the structure of the design hierarchy. The generic binary tree is a good formulation to start with.
The handling of a hierarchy tree gives rise to many fundamental research problems. For example, finding k shortest-paths or exploring the maximum-flow minimum-cut of the whole circuit [51] embedded in a hierarchical tree can be useful for interconnect analysis and optimization. Such research can also benefit many different fields which have to handle huge hierarchical systems.

Performance Driven Partitioning
For performance driven partitioning, we need a fast evaluation on the hierarchical tree structure. The analysis needs to be incremental with incorporation of signal integrity.
The network flow method is a potential approach for the partitioning with timing constraints. More efforts are needed to improve the speed and derive desired results.

FIGURE 4 Four possible configurations of net $e_i = \{a, b\}$ in a random placement.

FIGURE 7 An example of a 3-level 5 way partitioning tree structure.

FIGURE 8 An example of a generic binary tree.

FIGURE 12 Illustration of replication and its effect on partitioning. The figure shows path p (a) before and (b) after vertex set R is replicated.

FIGURE 14 Multiple pin net model of shifting process.

FIGURE 15 A loop model of multiple pin net where modules are placed on an x axis.

FIGURE 16 A flow model with respect to net $e_u$.

FIGURE 21 Illustration of maximum flow minimum cut formulation.

FIGURE 22 Distance between clusters.

FIGURE 23 p potential and q potential of each module.

FIGURE 29 Strategy of top down clustering.

FIGURE 30 Clustering of two module circuit.

FIGURE 31 A twelve module example to demonstrate maximum matching.

FIGURE 32 Signature identifies data structure.

6.1. Correlation of Hierarchical Partitioning Structure Between Logic Synthesis and Physical Layout

Table Formulas for
Parallel Process