Reprints available directly from the publisher. Photocopying permitted by license only. (c) 2000 OPA (Overseas Publishers Association) N.V.

Iterative Partitioning with Varying Node Weights*

The balanced partitioning problem divides the nodes of a [hyper]graph into groups of approximately equal weight (i.e., satisfying balance constraints) while minimizing the number of [hyper]edges that are cut (i.e., adjacent to nodes in different groups). Classic iterative algorithms use the pass paradigm [24] in performing single-node moves [16, 13] to improve the initial solution. To satisfy particular balance constraints, it is usual to require that intermediate solutions satisfy the constraints. Hence, many possible moves are rejected. Hypergraph partitioning heuristics have been traditionally proposed for and evaluated on hypergraphs with unit node weights only. Nevertheless, many real-world applications entail varying node weights, e.g., VLSI circuit partitioning where node weight typically represents cell area. Even when multilevel partitioning [3] is performed on unit-node-weight hypergraphs, intermediate clustered hypergraphs have varying node weights. Nothing prevents the use of conventional move-based heuristics when node weights vary, but their performance deteriorates, as shown by our analysis of partitioning results in [1]. We describe two effects that cause this deterioration and propose simple modifications of well-known algorithms to address them. Our baseline implementations achieve dramatic improvements over previously reported results (by factors of up to 25); explicitly addressing the described harmful effects provides further improvement. Overall results are superior to those of the PROP-REXest algorithm reported in [14], which addresses similar problems.


Introduction
Given a hyperedge- and node-weighted hypergraph H = (V, E), a k-way partitioning P^k assigns the nodes in V to k disjoint nonempty partitions. The k-way partitioning problem seeks to minimize a given objective function c(P^k) whose argument is a partitioning. A standard objective function is net cut, i.e., the number of hyperedges (signal nets) whose nodes are not all in a single partition. Constraints are typically imposed on the partitioning solution and make the problem difficult. For example, limits on the total node weight in each partition (balance constraints) result in an NP-hard formulation [17]; certain nodes can also be fixed in particular partitions (fixed constraints).

*This research was supported by a grant from Cadence Design Systems, Inc.
A key driver for hypergraph partitioning research in VLSI CAD has been the top-down global placement of standard-cell designs. Key attributes of real-world instances include:
- size: number of nodes up to one million or more; all instance sizes equally important;
- sparsity: number of hyperedges very close to the number of nodes, average node degrees typically between 3 and 5 in gate- and cell-level netlists, and average hyperedge degrees typically between 3 and 5;
- a small number of extremely large nets (e.g., clock, reset);
- wide variation in node weights (cell areas), due to the drive range of deep-submicron cell libraries and the presence of complex cells and large macros in the netlist;
- tight balance tolerances, i.e., the sum of actual cell areas assigned to each partition must be very close (e.g., within 2%) to the requested target area.
In this application, scalability, speed, and solution quality are all important criteria. To achieve speed and flexibility in addressing variant formulations, move-based heuristics are typically used, notably the Fiduccia-Mattheyses (FM) heuristic [16, 8]. We note that reporting in the research literature has centered on hypergraphs with unit node weights; in particular, the original works of Kernighan and Lin [24] and Fiduccia and Mattheyses [16], as well as many others, evaluate new partitioning heuristics on such graphs. Prior works that address variable node weights have typically used ACM SIGDA benchmarks [6], where hypergraph node weights vary little compared, e.g., to the size variance in modern VLSI cell libraries, and the netlist topology has relatively low node degrees (up to 10). Alpert [2] noted that many of these circuits no longer reflect the complexity of modern partitioning instances. Accordingly, the ISPD98 Circuit Benchmark Suite, consisting of 18 larger benchmarks arising in the physical implementation flow of internal IBM designs, was released in early 1998 [2, 1]. Many of the ISPD98 benchmarks have nodes with area bigger than 10% of the total and node degrees in the several hundreds; however, these instances have no large nets. By contrast, ACM SIGDA benchmarks have only low-degree nodes with nearly uniform areas, but can have nets of degree greater than 1,000.
Akin to [14], this work addresses the differences between partitioning with varying node weights and unit node weights. Section 2 critically reviews iterative partitioning heuristics, including the popular LIFO and CLIP algorithms, and demonstrates, using partitioning results published in [1], that varying node areas indeed cause performance deterioration of those heuristics. Section 3 describes a particular effect caused by heavy nodes that afflicts iterative partitioners, especially CLIP. The best of the proposed "fixes" to LIFO and CLIP appear to be quite effective.
In Section 4, we develop a type of temporary tolerance relaxation to counter the immobility of heavy nodes. Our technique is somewhat different from that in [14] and easier to implement. Calibration of runtimes to results reported in [14] and subsequent "best of n" tests suggest that our approach is more effective. Section 5 concludes with closing remarks.

Move-based partitioning
Today, competitive partitioning algorithms (e.g., [22, 3]) are overwhelmingly based on iterative heuristics [24, 16, 13] that perform single-node moves in passes in order to improve the initial solution. It is typically the case that improvements in these classic heuristics will also improve leading-edge heuristics. Furthermore, advances in classic heuristics often provide very immediate returns, since there is a large base of users in real-world settings, as well as a more comprehensive body of results and implementations available for calibration.

Satisfying balance constraints
The need to satisfy tight balance constraints is motivated by applications in, e.g., top-down VLSI placement, where hypergraph partitioning is used to reduce large problems to smaller ones. Physical layout considerations for sub-problems translate into size (area) constraints for partitioning; see [14, 9] for more details.
Turning to [1], we compare results in Table 5 for unit-weight partitioning with 2% tolerance to those in Table 6, where nodes are assigned varying actual weights. While the lowest solution costs are comparable for both cases (e.g., 274 vs. 297 for IBM01), the average performance of FM and CLIP on IBM benchmarks differs by factors of at least 5-10. Moreover, comparing average cuts in the FM and CLIP columns of Table 6 against the "Nets" column of Table 2, we see that the two iterative heuristics essentially failed on many benchmarks: more than 50% of nets are cut on average in IBM02-IBM04 and IBM06-IBM13 (11 out of 18), and over 25% in several others, whereas solutions exist with only several percent of nets cut. This motivates further analysis of how balance constraints are treated in move-based partitioners.
To satisfy particular balance constraints, it is common to generate an initial solution that satisfies the constraints (is "legal") and require that all intermediate solutions be legal as well. Thus, moves leading to illegal solutions are rejected regardless of the gain they provide. Nodes that are heavier than the balance tolerance can never move in a typical implementation, even though such nodes often have very large degrees and the solution cost strongly depends on their assignment. Given an "unfortunate" initial assignment of several heavy nodes, a move-based partitioner is never able to recover low-cost solutions. For many instances, e.g., ISPD-98 benchmarks, heavy nodes are assigned similarly in most low-cost solutions, which means that a random assignment of heavy nodes is most likely "unfortunate".
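The legality test behind this rejection can be sketched as follows (a minimal illustration; the function name and interface are our own, not from the paper):

```python
def is_legal_move(part_weights, node_weight, src, dst, target, tol):
    """Return True iff moving a node of weight node_weight from partition
    src to partition dst keeps both partitions within target * (1 +/- tol).
    Moves failing this test are rejected regardless of their gain."""
    lo, hi = target * (1 - tol), target * (1 + tol)
    new_src = part_weights[src] - node_weight
    new_dst = part_weights[dst] + node_weight
    return lo <= new_src <= hi and lo <= new_dst <= hi

# With a 2% tolerance around a target of 100, a light node can move...
print(is_legal_move([100, 100], 1, 0, 1, target=100, tol=0.02))   # True
# ...but a node heavier than the tolerance window never can:
print(is_legal_move([100, 100], 15, 0, 1, target=100, tol=0.02))  # False
```

This makes the immobility argument concrete: a node whose weight exceeds the tolerance window pushes the destination partition past its upper bound no matter how the rest of the solution looks.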
In particular algorithms such as FM and CLIP, immobile nodes may impair the ability of other nodes to move, trapping FM- and CLIP-based iterative partitioners in high-cost local minima; this "corking" effect is described and addressed in Section 3. Such phenomena are magnified by tight balance tolerances (e.g., 2%) and the presence of heavy nodes, e.g., in the instances of the ISPD-98 benchmark suite (see [1], Table 2, p. 81).

The FM algorithm
As is well known, the FM heuristic [16, 8] iteratively improves an initial partitioning solution by moving nodes one by one between partitions. FM starts with a (possibly random) solution and applies a sequence of moves organized as passes. At the beginning of a pass, all nodes are free to move (unlocked), and each possible move is labeled with the immediate change in total cost it would cause; this is called the gain of the move (positive gains reduce solution cost, while negative gains increase it). Iteratively, a move with highest gain is selected and executed, and the moving node is locked, i.e., not allowed to move again during that pass. Since moving a node changes the gains of adjacent nodes, all affected gains are updated after each move. Selection and execution of a best-gain move, followed by gain update, are repeated until every node is locked or no legal move is available. Then, the best solution seen during the pass is adopted as the starting solution of the next pass. The algorithm terminates when a pass fails to improve solution quality.
The FM algorithm can easily be seen [8] to have three main operations: (1) the computation of initial gain values at the beginning of a pass; (2) the retrieval of the best-gain feasible move; and (3) the update of all affected gain values after a move is made. The contribution of Fiduccia and Mattheyses lies in observing that circuit hypergraphs are sparse, so that the gain of any move is bounded by plus or minus the maximal node degree in the hypergraph (times the maximal edge weight, if edge weights are used). This allows hashing of moves by their gains: any update to a gain value requires constant time, yielding overall linear complexity per pass. In [16], all moves with the same gain are stored in a linked list representing a "gain bucket".
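The gain-bucket idea can be sketched compactly (an illustration under our own naming; [16] uses doubly linked lists where this sketch uses Python deques):

```python
from collections import defaultdict, deque

class GainBuckets:
    """Bucket-array ("gain bucket") structure in the style of
    Fiduccia-Mattheyses: moves are hashed by integer gain, so insertion
    and gain updates are cheap, and retrieval of a best-gain move scans
    down from the highest gain seen. Gains are bounded by the maximal
    node degree, so the bucket array has bounded size."""

    def __init__(self, max_gain):
        self.max_gain = max_gain
        self.buckets = defaultdict(deque)   # gain -> nodes with that gain
        self.gain_of = {}                   # node -> current gain
        self.best = -max_gain               # upper bound on nonempty gains

    def insert(self, node, gain):
        self.gain_of[node] = gain
        self.buckets[gain].append(node)
        self.best = max(self.best, gain)

    def update(self, node, delta):
        # Rehash a node whose gain changed because a neighbor moved.
        # (deque.remove is linear in the bucket; a production version
        # uses a doubly linked list for true O(1) removal.)
        old = self.gain_of[node]
        self.buckets[old].remove(node)
        self.insert(node, old + delta)

    def pop_best(self):
        # Scan down from the best-seen gain to the first nonempty bucket.
        for g in range(self.best, -self.max_gain - 1, -1):
            if self.buckets[g]:
                self.best = g
                return self.buckets[g].popleft(), g
        return None

gb = GainBuckets(max_gain=5)
gb.insert("a", 2)
gb.insert("b", -1)
gb.update("b", 4)          # a neighbor's move raises b's gain to 3
print(gb.pop_best())       # ('b', 3)
```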
To guarantee that the output solution is balanced, moves that cause violations of balance constraints are typically ignored. Furthermore, in a typical implementation, if the first move in a bucket is ignored, then, for CPU time considerations, the entire bucket is ignored for choosing moves (it is extremely time-consuming to traverse a bucket's entire list in the hope that one of its nodes can be legally moved). Note that moves are examined in priority order, so the first legal move found is the best. We believe that current practice is not only motivated by speed, but is also partly a historical legacy from partitioners being tuned for unit-area, exact-bisection benchmarking. Recent work of Dutt and Theny [14] is notable for addressing the issue of partitioning with tight balance constraints, and a comparison of results is given further below. However, our techniques are orthogonal in the sense that [14] changes the structure of a pass in a sophisticated way, while we simply show how to fix a classic FM implementation in the context of tight balance constraints and uneven node weights.

The CLIP Algorithm
The actual gain of a node in the classic FM algorithm can be viewed as the sum of its initial gain (i.e., the gain at the beginning of the pass) and the updated gain due to nodes moved since. The CLIP algorithm of [13] uses updated gain instead of actual gain to prioritize moves. At the beginning of the pass, all moves have zero updated gain, and ties are broken by total initial gain. The authors of CLIP report very impressive experimental results [13], and CLIP has been cited as enabling within a recent multilevel partitioner implementation [3]. The method has also been the basis of such extensions as [15].
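The CLIP prioritization can be expressed as a two-part sort key (a sketch; the function name is ours):

```python
def clip_key(actual_gain, initial_gain):
    """CLIP move-selection key: the primary criterion is the *updated*
    gain accumulated since the start of the pass (actual minus initial);
    the initial gain itself only breaks ties. Classic FM, by contrast,
    would sort on actual_gain alone."""
    return (actual_gain - initial_gain, initial_gain)

# At the start of a pass, actual == initial, so every move has primary
# key 0 and ordering is decided purely by initial gain:
print(clip_key(4, 4))    # (0, 4)
# Once neighbors move, a node whose gain rose from -2 to 1 outranks a
# node that started at gain 4 and has not changed:
print(clip_key(1, -2) > clip_key(4, 4))  # True
```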

The Corking E ect
As noted above, CLIP begins any pass by placing all node moves into buckets corresponding to zero updated gain. The nodes with highest initial gain are placed at the heads of these zero-gain buckets. Hence, if the move at the head of each bucket at the beginning of a CLIP pass is not legal, the whole pass terminates without making any moves. Particularly when starting from a random initial solution, the nodes with highest gain will tend to be the nodes of highest degree, which correspond to the heaviest nodes. Furthermore, even if the first move is legal, CLIP is still vulnerable to termination soon afterward: without enough time for the moves to "spread out", nearly all moves will still be in the zero-gain bucket when it is revisited, and may then be ignored due to an illegal move, ending the pass. We call this the corking effect: the heavy node at the head of the bucket acts as a cork. Our traces of CLIP executions show that corking occurs quite often with the more modern ISPD98 benchmarks. This is because these benchmarks contain very heavy nodes whose weight approaches or exceeds typical balance tolerances (see Table 2 in [2]). We have developed three uncorking techniques to counteract the corking effect.
Explicit uncorking. Continue to look beyond the first move in a bucket if the first move is illegal.
LIFO pass before starting CLIP. Execute a single LIFO FM pass [19] before starting CLIP passes. This greatly reduces the likelihood of large-degree nodes having the highest total gain and corking the CLIP gain buckets. This technique should not noticeably increase CPU time, as CLIP typically makes dozens of passes and an additional LIFO pass will not significantly affect runtime.
Fixing heavy nodes. At the beginning of the pass, do not place any node whose weight is greater than the balance tolerance into the gain structure. This technique has essentially zero overhead.

We find the first technique to be too time-consuming, and it moreover appears to have a harmful effect on solution quality. Independently applying or not applying the two remaining techniques (L-uncorking, by adding an initial LIFO pass, and F-uncorking, by fixing heavy nodes) yields four different CLIP implementations: generic (corked) CLIP, L-Uncorked CLIP, F-Uncorked CLIP, and LF-Uncorked CLIP. Tables 1 and 2 show the cutsize results for these variants on ISPD98 benchmarks. We report the best and average cutsize obtained over 100 independent single-start trials for each benchmark, and we also report the average CPU time (seconds on a 300MHz Sun Ultra-10 workstation with 128MB RAM) required by a single-start trial.
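F-uncorking amounts to a one-line filter at pass setup. A minimal sketch (the weight limit shown, a fraction tol of the total weight, is our assumption; implementations may derive it differently from the balance constraint):

```python
def nodes_for_gain_structure(node_weights, total_weight, tol):
    """F-uncorking sketch: skip any node heavier than the balance
    tolerance, since it can never move legally and would only sit at
    the head of a zero-gain bucket, 'corking' it. Cost: one comparison
    per node at pass setup, i.e., essentially zero overhead."""
    limit = tol * total_weight
    return [v for v, w in node_weights.items() if w <= limit]

# With a 2% tolerance, a node holding 15% of total area is excluded:
movable = nodes_for_gain_structure({"a": 1.0, "b": 15.0, "c": 0.5},
                                   total_weight=100.0, tol=0.02)
print(movable)  # ['a', 'c']
```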
The experimental data clearly reveal the correlation between the corking effect, early CLIP termination (small runtimes), and inferior solution quality. There are substantial performance differences between the corked and uncorked CLIP variants, and we believe that the F-Uncorked CLIP variant is the most useful in practice. We also reproduce the best and average cutsizes for CLIP published by Alpert in [2]. Our uncorked CLIP implementation obtains stunning improvements over Alpert's CLIP implementation, up to factors of 25 reduction in average cutsize.

Temporary tolerance relaxation
A brief examination of the recent ISPD98 Circuit Benchmark Suite [1] reveals cells (nodes) whose area (weight) takes more than 10% of the total. Such cells are guaranteed to always be immobile during move-based partitioning with tolerance less than 10%, and likely to be immobile even with larger tolerance. As explained in Section 2, this prevents move-based algorithms from achieving low-cost solutions from most initial solutions.
Temporarily relaxing the partitioning tolerance in order to move otherwise immobile nodes is a natural idea; it has been explored in [14], where high-gain nodes could be moved within a pass even when this caused illegal solutions. Such temporary illegalities were resolved in the same pass upon reaching a certain threshold. The proposed algorithms appear difficult to tune, are far from conventional FM or CLIP, and can take up to four times longer to run. A different type of temporary tolerance relaxation appears more successful and easier to implement.

Proposed metaheuristic
We perform two or more "chained" calls to a black-box iterative partitioner; each subsequent call uses a smaller partitioning tolerance. The tolerance for the first call is large enough for every node to be movable, while the last call uses the originally requested tolerance. Solutions produced by a preceding partitioner call are used by the next call. A solution that is illegal with respect to the smaller tolerance is greedily "legalized" before the next partitioner call. To do this, nodes are moved from overfilled to underfilled partitions, always choosing a highest-gain move first. In practice, a separate greedy legalization step is unnecessary because reasonable FM and CLIP implementations, if given an illegal initial solution, automatically perform greedy legalization whenever necessary. A similar technique is used in the Metis package of Karypis et al. and, likely, in hMetis [22, 23] as well, which implements multilevel partitioning heuristics. However, we are not aware of any works exploring it for flat partitioning.
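The chained-calls metaheuristic can be sketched as follows. This is an illustration under assumptions: `partition_once` stands for any black-box FM/CLIP-style partitioner that greedily legalizes an illegal initial solution, and the halving schedule is our own; the paper's two-stage variant jumps directly from the relaxed tolerance to the requested one.

```python
def chained_partition(partition_once, solution, node_weights, final_tol):
    """Chain calls to a black-box partitioner with shrinking tolerances.
    The first tolerance is large enough that every node is movable;
    the last call uses the originally requested tolerance."""
    total = sum(node_weights.values())
    # First-stage tolerance, in the spirit of the paper's setting:
    # the larger of 3x the heaviest node (as a fraction) and 20% of total.
    tol = max(3.0 * max(node_weights.values()) / total, 0.20)
    while tol > final_tol:
        solution = partition_once(solution, tol)
        tol = max(final_tol, tol / 2.0)   # shrink toward the target tolerance
    return partition_once(solution, final_tol)  # final call at requested tol

# Dummy partitioner that records the tolerance schedule:
calls = []
def fake_partitioner(sol, tol):
    calls.append(tol)
    return sol

chained_partition(fake_partitioner, [0, 1, 0], {"a": 5.0, "b": 5.0}, 0.02)
print(calls[0], calls[-1])   # 1.5 0.02
```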
Two implementation details are useful but, strictly speaking, unnecessary: (a) tie-breaking on balances, and (b) the ability to limit the number of passes. First, if during a pass the current solution has the best-seen cost, it is preferred over the previous best solution if and only if it is closer to being exactly balanced. Second, the last several passes in a given partitioning call often produce very little improvement. Given that the resulting solution will be processed by another partitioner call with a different tolerance, it may not be useful to wait for a non-improving pass. Therefore, the number of passes may be limited; alternatively, one can require minimal improvement in a pass.

Empirical evaluation
To juxtapose the performance of the two proposed approaches to partitioning with varying node weights, we compare the best uncorking variants of LIFO and CLIP (LIFO_U and CLIP_U, described in Section 3) to their further improvements (LIFO_2 and CLIP_2) with a simple-minded two-stage temporary tolerance relaxation. For the first stage, the tolerance is set to the larger of (a) three times the maximal node weight, and (b) 20% of the total. We also limited the number of passes in the first stage to 10 and, in CLIP_2, used CLIP only at the first partitioning stage. Appropriate experiments have suggested this particular combination from among a number of similar settings.
We analyze algorithm performance in the context of "average best of n" for n = 1, 2, 4, 8. This technique, advocated in [10], allows detailed analyses of runtime-versus-quality tradeoffs and is also representative of important application contexts, e.g., VLSI placement. The results are presented in Table 3 and suggest that two-stage tolerance relaxation indeed improves solution costs without considerably increasing runtime.
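"Average best of n" can be estimated from recorded single-start costs by resampling (a sketch of the metric's meaning; the resampling estimator is our own illustration, not necessarily the procedure of [10]):

```python
import random

def average_best_of_n(costs, n, samples=10000, seed=0):
    """Estimate the expected minimum cut over n independent single-start
    runs: repeatedly draw n costs (with replacement) from the recorded
    trials and average the per-draw minima."""
    rng = random.Random(seed)
    return sum(min(rng.choices(costs, k=n)) for _ in range(samples)) / samples
```

Larger n shifts the estimate toward the best recorded cost, which is exactly the runtime-versus-quality tradeoff the metric is meant to expose: n starts cost roughly n times the runtime of one.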
As can be seen in Table 3, the LIFO_2 and CLIP_2 algorithms provide substantial improvements over even the "uncorked" LIFO and CLIP partitioners. The two-stage algorithms actually improve per-start runtime for some examples, and improve results given equal runtimes for nearly all the testcases. It is unclear whether LIFO_2 or CLIP_2 is the superior algorithm; this is surprising, given the clear dominance of CLIP over LIFO. The IBM04 testcase is a particularly striking example, as it shows an improvement of 25% from CLIP_U (the best uncorked result) to CLIP_2 (the best two-stage result) and a reduction in single-start runtime of 65%. Of the eight examples presented, IBM04 contains the largest number of nodes heavier than the tolerance of 2%. Thus, it is encouraging to see that the two-stage approach addresses this difficult problem so well.
Next, we compare our two-stage temporary tolerance relaxation to PROP-REXest, a leading algorithm from [14] which employs temporary illegalities within passes to address similar issues. PROP-REXest is the best of the several algorithms reported in [14]. These experiments were performed on a 140MHz Sun Ultra-1 workstation. To calibrate our runtimes to those reported in [14], we ran our plain FM implementation on the ACM SIGDA benchmarks [6] used in that work. The overall performance ratio of approximately 1.7 was fairly consistent, and our FM implementation produced very similar average solution costs.
The results of our comparisons to PROP-REXest are given in Table 4. They suggest that four starts of our two-stage CLIP variant (CLIP_2) achieve superior solution costs in comparable amounts of time, improving upon the performance of PROP-REXest by up to 31%. A single start of CLIP_2 produces results similar to those of PROP-REXest, while requiring much less runtime (up to 86% less for the avq_small testcase). At the same time, CLIP_2 is only a "fix" to CLIP and is rather simple to implement.

Conclusions
From analysis of partitioning results from [1], we notice that the performance of FM and CLIP partitioners deteriorates when node areas are allowed to vary. We describe two general effects that cause such performance deterioration and that are likely to affect a wide variety of iterative partitioners. In addition, we describe the previously unknown corking effect, which is particularly harmful to the popular CLIP algorithm [13], notably within CLIP's motivating context of top-down standard-cell VLSI placement. We propose easy-to-implement, low-overhead techniques to counteract the latter problem, and demonstrate notable improvements in solution quality. We speculate that the CLIP corking effect was not diagnosed earlier because of the tendency to compare partitioners according to unit-area bisection results, and because of a reliance on older benchmarks that have only uniformly-sized cells. We also propose a simple technique of temporary tolerance relaxation, different from and more successful than the best of the techniques presented in [14].
Our results suggest that prospective advances in algorithm technology should be evaluated with respect to a full range of applicable instances and contexts (i.e., use models). Furthermore, the substantial performance differences between our CLIP implementation and, e.g., that reported by Alpert [2] suggest that the partitioning research community can still benefit from improved understanding of the iterative heuristics upon which new methods are based.

Table 1 :
Comparison of generic (Corked), L-Uncorked, F-Uncorked, and LF-Uncorked CLIP results for ISPD98 benchmark test cases. Results shown are minimum / average netcut / average CPU seconds (on a Sun Ultra-10) obtained over 100 independent single-start trials, with actual node weights and a 2% balance constraint. We also show the CLIP FM results reported by Alpert in [2] ("Other CLIP").

Table 2 :
Comparison of generic (Corked), L-Uncorked, F-Uncorked, and LF-Uncorked CLIP results for ISPD98 benchmark test cases. Results shown are minimum / average cutsize / average CPU seconds (on a Sun Ultra-10) obtained over 100 independent single-start trials, with actual node weights and a 10% balance constraint. We also show the CLIP FM results reported by Alpert in [2] ("Other CLIP").

Table 4 :
Comparison of reported PROP-REXest results with those produced by the two-stage LIFO and CLIP methods. Nodes were assigned actual cell areas. Solutions are constrained to be within 0.5% of bisection (partitions must contain between 49.75% and 50.25% of total cell area). Data expressed as average cut / average CPU time. CPU times were normalized to those reported in [14].