An Effective Solution to the Linear Placement Problem

We present an effective solution to the linear placement problem, which has several applications in the physical design of integrated circuits. Our approach belongs to the class of iterative improvement heuristics. The important difference between this new technique and previous ones lies in its moves and in the order in which these moves are applied. A phase of the algorithm begins with simple moves and gradually shifts toward more complex moves. Phases are repeated as long as further improvement is possible. Our experimental results show that nearly optimal solutions can be achieved. For a number of examples collected from the literature, our algorithm generated optimal solutions.


INTRODUCTION
The linear placement problem (LPP) occurs in different forms in the physical design automation of integrated circuits. Various names have been used to describe LPP: linear ordering [1], board permutation [2], back-board ordering [3, 4, 5], string placement [6], and 1-dimensional gate assignment [7]. According to Yamada et al. [7], LPP is regarded by some researchers as a fundamental and significant problem in the layout of VLSI circuits.
A linear placement is an arrangement of a number of interconnected circuit elements in a row so that a certain objective is met. Various applications of LPP have been reported in the literature. In [1], Kang uses LPP in a constructive initial placement for standard cells. The cells are initially arranged in one long row, which is then folded into a 2-dimensional placement. Another approach to the placement of standard cells is to assign the cells to rows in a first step. In a second step, the order of the cells in their respective rows is determined by solving a linear placement problem in each of the rows. The latter approach has been used by Cho and Kyung [8]. In the gate array design style, Chowdhury [9] mentions that a good placement may be achieved by initially assigning the gates to rows and columns. Then, the placement is improved by permuting the columns to minimize the horizontal wire length.
The linear placement problem is indeed a special case of the more general 2-dimensional placement problem. Therefore, in principle, techniques for the latter problem can be used to solve LPP. However, there is rarely any mention in the literature of such approaches. To the contrary, techniques for LPP have been used to solve the more difficult 2-dimensional placement problem [1, 8]. Because LPP is an NP-hard problem [14], most algorithms in the literature are heuristic in nature [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13]. In [4], Goto et al. provide a branch-and-bound algorithm which generates a solution whose cost is no more than (1 + ε) times the optimal cost. In [1], Kang presents a constructive algorithm, which places the circuit elements one at a time. The most lightly connected circuit element is placed in the first position. Subsequently, based on a selection rule, an unplaced element is chosen and is placed in the next vacant position. This process is repeated until all the circuit elements have been placed. In [11], Schuler and Ulrich give an algorithm which divides the circuit elements into several clusters. Subsequently, the circuit elements are arranged in a row such that elements of the same cluster are kept close to each other. In [12], Cheng obtains a linear placement method using the max-flow min-cut theory [15]. Cheng's algorithm requires for its input a circuit which has the topology of a graph. For parallel graphs, Cheng proves that his algorithm finds an optimal solution in polynomial time. Previously, Adolphson and Hu [13] used network flow theory to obtain an algorithm which finds an optimal solution in O(n log n) operations if the input graph is a rooted tree. In [9], Chowdhury uses nonlinear programming techniques to solve LPP. His algorithm starts at a point deep inside the infeasible region and proceeds gradually towards the feasible region along trajectories in which the cost tends to be minimized. In [7], Yamada et al. offer a hierarchical algorithm for LPP based on contraction of nets. In this approach, the original problem is partitioned into a number of levels in a bottom-up contraction of multiterminal nets, such that nets with fewer terminals are given first priority. After that, the elements of the circuit are assigned to positions in the linear order in a top-down manner.
The algorithm described in this paper is iterative and can be thought of as an incremental clustering approach. The early stages of each phase of this algorithm form and rearrange small clusters in the circuit, while later stages rearrange larger clusters that were formed during the previous stages. In the algorithm, any set of consecutive nodes in the linear placement can be a cluster, i.e., no other criterion is used in the definition of a cluster. The hypothesis is that as the linear placement improves, larger clusters appear as sequences of consecutive nodes in the placement. Further improvement of the linear placement can then be achieved by the rearrangement of these large clusters. Phases of the algorithm are repeated as long as further improvement can be achieved. One key feature of this algorithm is that nodes of large clusters are allowed to break away from each other in later stages. This feature should be compared to the approach used by Schuler and Ulrich [11]. The algorithm of Schuler and Ulrich consists of two steps: clustering and linear placement. In the clustering step, larger clusters are recursively formed from previous smaller clusters. This step results in the so-called clustering tree. Each vertex in this tree corresponds to a cluster of nodes of the circuit. The cluster of an internal vertex of the tree is the union of the clusters of its left and right children. After the formation of the clustering tree, a linear placement can be obtained by a left-to-right reading of the nodes corresponding to the leaves of the tree. However, this placement may not be good.
Therefore, Schuler and Ulrich consider improving this placement by rotating subtrees of the clustering tree around their roots. The rotation of subtrees can be done in two ways: top-down or bottom-up. In the top-down approach, larger clusters are placed first, and then the placement is improved by rearranging nodes within the same cluster. In the bottom-up approach, the sub-clusters of larger clusters are arranged before the arrangement of the larger clusters. Note that in both the top-down and bottom-up approaches, nodes of the same cluster must stay together. This is a disadvantage, since the number of distinct placements that can be reached from a given clustering tree is 2^(n−2), which is a small fraction of all n!/2 different placements. In comparison with the Schuler and Ulrich algorithm, each phase of our algorithm can be considered as a bottom-up approach in which the clustering and the linear placement steps are intermixed. The advantage of our approach over the approach of Schuler and Ulrich is that nodes of large clusters are allowed to break away from each other. This removes the limitation on the number of distinct placements that can be explored. In fact, later in this paper, it will become clear that any two linear placements are reachable from each other through the moves used in our algorithm.

PRELIMINARIES
An electrical circuit consists of a set V of n elements and a multiset E, which consists of subsets of V. In the literature, the elements of V are given different names by different authors. Modules, nodes, gates, and cells are just a few of the names that are used to describe the elements of V. Each element of E is a subset of two or more elements of V, and it represents a physical electrical connection of its constituent elements. The elements of E are formally called hyperedges, but are commonly called nets. In this paper, we refer to the elements of V as nodes and to the elements of E as nets.
To simplify the notation, the set V is taken to be the set of the first n positive integers, i.e., the integer i, 1 ≤ i ≤ n, is the name of the i-th node in V. It should be clear from the context whether an integer is used as the name of a node in V or as a numerical value.
The j-th net in E is denoted by N_j, 1 ≤ j ≤ m, where m = |E| is the number of nets. With this notation, a permutation is a one-to-one mapping π: V → V. If π(i) = j, then we say that node j is in position i with respect to the permutation π. Given a permutation π of V, define:

(1) l(π, j) (h(π, j)) to be the minimum (maximum) integer in the set {i: π(i) ∈ N_j}, i.e., l(π, j) (h(π, j)) is the leftmost (rightmost) position of a node in net N_j.
(2) cut(π, i) to be the number of nets in the set {N_j: l(π, j) ≤ i < h(π, j)}.
(3) D(π) to be the maximum element of the set {cut(π, i): 1 ≤ i < n}. The parameter D(π) is the density of the permutation π.

(4) L(π) to be the sum of the elements of the set {cut(π, i): 1 ≤ i < n}. The parameter L(π) is the length of the permutation π. Note that L(π) = Σ_{j=1}^{m} (h(π, j) − l(π, j)).
Consider the circuit consisting of 8 nodes shown in Figure 1. The nodes of this circuit have been arranged in a row according to a permutation π. In this permutation π, the position of node 6 is 3 (π(3) = 6). The number of nets crossing the dashed vertical line is cut(π, 4) = 3. The leftmost (rightmost) position of a node in N_1 is l(π, 1) = 3 (h(π, 1) = 8). The density of π is D(π) = 3, and the length of π is L(π) = 16.

Given a circuit (V, E), a linear placement problem (LPP) seeks a permutation π which minimizes a certain cost function. One possible cost function is the density D(π) of the permutation. The number of tracks required to route the nets is at least equal to D(π). Therefore, the minimization of D(π) leads to a smaller number of routing tracks. Another cost function is the length L(π) of the permutation. The total wire length required to route the nets is at least equal to L(π). Consequently, the minimization of L(π) leads to a shorter wire length. A third cost function is Σ_{j=1}^{m} (h(π, j) − l(π, j))². This cost function is similar to the second one. However, it places more weight on the minimization of longer wires. According to Yamada et al. [7], a better layout can be achieved if shorter wires are given precedence. Taking this point of view, the third cost function may not be a good choice. The advantage of the third cost function is that it is mathematically well-behaved, and is therefore suitable for analytical methods [9]. In this paper, the length L(π) of the permutation is used as the cost function. However, our algorithm does not depend on the particular cost function used, and any other cost function may be used instead. In [7], it is shown that the layout area is linearly dependent on the length of the permutation. Therefore, the minimization of L(π) leads to a smaller overall area.
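The definitions above can be sketched directly in code. The 4-node circuit and its nets below are hypothetical illustrations of ours (this is not the circuit of Figure 1); the permutation is a list in which pi[p − 1] is the node placed in position p.

```python
# A sketch of l, h, cut, D, and L on a hypothetical 4-node circuit.

def net_span(pi, net):
    """(l(pi, j), h(pi, j)): leftmost and rightmost positions of net j."""
    positions = [pi.index(v) + 1 for v in net]
    return min(positions), max(positions)

def cut(pi, i, nets):
    """cut(pi, i): number of nets with l(pi, j) <= i < h(pi, j)."""
    return sum(1 for net in nets if net_span(pi, net)[0] <= i < net_span(pi, net)[1])

def density(pi, nets):
    """D(pi): maximum of cut(pi, i) over 1 <= i < n."""
    return max(cut(pi, i, nets) for i in range(1, len(pi)))

def length(pi, nets):
    """L(pi): sum of cut(pi, i), which equals the sum of h - l over all nets."""
    return sum(cut(pi, i, nets) for i in range(1, len(pi)))

# Hypothetical circuit: V = {1, 2, 3, 4}, nets N1 = {1, 2}, N2 = {2, 3, 4}, N3 = {1, 4}.
nets = [{1, 2}, {2, 3, 4}, {1, 4}]
pi = [1, 2, 3, 4]
print(density(pi, nets))  # -> 2
print(length(pi, nets))   # -> 6, equal to (2-1) + (4-2) + (4-1)
```

Note that L(π) computed as a sum of cuts agrees with the sum of net lengths h − l, as stated in definition (4).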
THE ALGORITHM

An iterative improvement algorithm starts with some initial solution, which is then improved by a local transformation called a move. The new solution is then used as the initial one, and the process is iterated until no further improvement is possible.
Several iterative improvement algorithms for LPP have been reported in the literature [2, 6, 10]. In [16], Goto introduces the notion of λ-optimality. For LPP, one may call a λ-move any move that involves the rearrangement of λ nodes of the circuit. If there are no λ-moves that can reduce the cost, the solution is called λ-optimal. Clearly, an n-optimal solution is optimal. For a given λ > 1, there are C(n, λ)·λ! possible λ-moves. Consequently, if it is desired to guarantee λ-optimality, the computational effort of an iterative improvement algorithm increases exponentially with λ. Hence, the most commonly used value for λ is 2 [9]. In this case, the iterative improvement algorithm is better known as a two-exchange strategy. Two-exchange algorithms for LPP are easily trapped in local minima. Consider a circuit of n nodes and n − 1 nets, such that net N_i connects the nodes i and i + 1. Let us call such a circuit a chain. Clearly, up to a reversal of order, the only optimal permutation for a chain is the permutation σ given by σ(i) = i, 1 ≤ i ≤ n. For the chain of n nodes initially placed as in Figure 2, and for large enough n, no two nodes can be exchanged such that the cost is strictly reduced. The cost of the permutation in Figure 2 is equal to n/2 + 2(n/2 − 1) = 3n/2 − 2, while the optimal cost is n − 1. This illustrates that even for a simple chain circuit, a two-exchange algorithm may be trapped in a local minimum which is far away from the globally optimal solution. Moreover, even if moves which increase the cost are occasionally accepted, as in simulated annealing [17] or stochastic evolution [10, 18], a two-exchange algorithm may spend a long computation time before reaching the optimal solution. The exchange of the positions of two nodes is an example of a simple move that changes the permutation only slightly. Formally, if one defines a distance measure on the space of all permutations, then a new permutation generated by a simple move is only a short distance away from the previous one.
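The chain costs above can be checked directly: since net N_i joins nodes i and i + 1, L(π) for a chain reduces to the sum of |position(i) − position(i + 1)|. The following sketch (our own illustrative code, with an arbitrary poor placement rather than the specific one of Figure 2) verifies that the identity permutation attains the optimal cost n − 1.

```python
def chain_cost(pi):
    """Total wire length L(pi) of the n-node chain circuit,
    where pi[p - 1] is the node placed in position p."""
    pos = {node: p for p, node in enumerate(pi, start=1)}
    n = len(pi)
    return sum(abs(pos[i] - pos[i + 1]) for i in range(1, n))

n = 10
identity = list(range(1, n + 1))                    # the optimal placement
print(chain_cost(identity))                         # -> 9, i.e. n - 1
print(chain_cost([2, 4, 6, 8, 10, 9, 7, 5, 3, 1]))  # -> 45, a poor placement
```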
Suppose that π_1, π_2, …, π_k is a sequence of permutations in order of increasing distance from a reference permutation π_0. Suppose also that this sequence is such that π_i is obtained from π_{i−1} by a simple move. Let c_1, c_2, …, c_k be the sequence of costs corresponding to the sequence of permutations. We can now plot a sample cost-versus-permutation curve using the sequence of coordinates (d_1, c_1), (d_2, c_2), …, (d_k, c_k), where d_i is the distance of π_i from the reference permutation π_0. If the cost-versus-permutation curve is as in Figure 3, then a simple move is a downhill walk along this curve from the starting permutation. Consequently, given the shape of the curve, it is clear that a simple move may not go a long way before it is stuck at a local minimum. This discussion suggests that if a near-optimal solution is sought, then an iterative improvement algorithm should use moves other than simple ones. The kind of moves that are suggested here are illustrated in Figure 4, in which the arrow represents a possible move. These moves are not simply a walk along the cost curve. Rather, they constitute jumps from one point on the curve to another. Such moves can be called compound moves, and they involve the rearrangement of a large number of nodes in the circuit.
Here, a conflict is reached. In order to find near-optimal solutions, one has to consider compound moves in which a large number of nodes are rearranged. However, if all possible moves that involve the rearrangement of a large number of nodes are considered, the computational effort of the algorithm grows exponentially. To remedy this situation, the algorithm should only consider a few compound moves that are meant to produce large improvements in return for the computational expense of using them.
The method used by our algorithm can best be called a limited exhaustive search strategy (LESS), since an effort is made to choose a small, polynomial number of moves out of the exponentially many possible ones. The algorithm itself is also called LESS.
The idea which led to the development of LESS is now explained. Given a permutation π of the nodes of the circuit, any set of nodes occupying consecutive positions is called a block. Clearly, any two integers i and j such that 1 ≤ i ≤ j ≤ n determine the block B_ij = {π(k): i ≤ k ≤ j}. Hence, in a permutation there are (n² + n)/2 blocks. Suppose that an iterative improvement algorithm for LPP has been used to determine a permutation π. If any improvement at all has been achieved, it would be that the permutation π contains a number of blocks which should also appear in an optimal permutation. If π is not optimal, it is likely that these good blocks are either placed in the wrong positions and/or placed in the wrong order. Our problem now is that it is not known which set of consecutive nodes in π is a good block. One strategy for improving π is to consider one block of nodes at a time and try to place it in a position which yields the maximum reduction of cost. The length of a block is defined to be the number of nodes in that block. Clearly, the longer the good blocks are, the better the permutation π is. The strategy in LESS is to consider the initial permutation as one which is far away from an optimal solution. Therefore, the only possible good blocks at this stage are those of length 1, i.e., every node constitutes a block. Hence, initially, LESS considers every node and tries to place it in the position which yields the maximum improvement. After all nodes have been considered, the permutation has improved considerably, so that it may contain many good blocks of length 2. Consequently, blocks of length 2 are considered next. In general, at the i-th stage of LESS, the blocks that are considered have length i. This process continues until all blocks have been considered. This completes a phase of LESS. Phases are repeated until no further improvement is possible.

FIGURE 2 A possible arrangement for a chain of n nodes.
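The block count above is easy to verify with a short sketch (our own illustrative code): every pair of positions i ≤ j yields one block, so an n-node permutation has n + (n − 1) + … + 1 = (n² + n)/2 blocks.

```python
def blocks(pi):
    """All blocks B_ij of a permutation: the nodes in positions i..j."""
    n = len(pi)
    return [pi[i - 1:j] for i in range(1, n + 1) for j in range(i, n + 1)]

pi = [4, 1, 3, 2]
bs = blocks(pi)
print(len(bs))  # -> 10, i.e. (4**2 + 4) // 2
print(bs[0])    # -> [4], the first length-1 block
```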
FIGURE 4 An illustration of a compound move.

There are three types of moves used by LESS: FLIP, TRANSFER, and TRANSFER_FLIP. The move FLIP(i, j, π) takes the block B_ij and reverses the order of the nodes in it. More precisely, this move transforms the current permutation π into a new permutation π' such that π'(l) = π(i + j − l) if i ≤ l ≤ j and π'(l) = π(l) otherwise. The move TRANSFER(i, j, k, π) takes the block B_ij and places it after position k in the permutation. If k < i − 1, then the nodes in positions k + 1 up to i − 1 are shifted to the right to make room for the nodes in B_ij. If k > j, then the nodes in positions j + 1 up to k are shifted to the left to make room for the nodes in B_ij. If k = i − 1, then TRANSFER(i, j, k, π) leaves the permutation unchanged. Note that k can be equal to 0.
If k = 0, then the block B_ij is placed in front of the remaining nodes in the permutation. Without loss of generality, assume k > j. The new permutation π' created by TRANSFER(i, j, k, π) is such that π'(l) = π(l) if l < i or l > k, π'(l) = π(l + j − i + 1) if i ≤ l < i + k − j, and π'(l) = π(l − k + j) if i + k − j ≤ l ≤ k. The move TRANSFER_FLIP(i, j, k, π) is similar to TRANSFER(i, j, k, π) except for the reversal of the order of the nodes of B_ij. The move TRANSFER_FLIP(i, j, k, π) takes the block B_ij and places it in reversed order after position k in the permutation. Without loss of generality, assume k > j. The new permutation π' created by TRANSFER_FLIP(i, j, k, π) is such that π'(l) = π(l) if l < i or l > k, π'(l) = π(l + j − i + 1) if i ≤ l < i + k − j, and π'(l) = π(i + k − l) if i + k − j ≤ l ≤ k. It should be understood that after any of the above moves the permutation π is updated to reflect the change. Now, the algorithm LESS can be described as follows:

Step 1. Generate an initial permutation π.

Step 2. Set length = 1 and set improve = false.
Step 3. Set i = 1.

Step 4. Set j = i + length − 1.
Step 5. Find a position k ∉ [i − 1, j] such that the move TRANSFER(i, j, k, π) yields a maximum reduction gain in cost.
Step 6. If gain > 0 then perform the move TRANSFER(i, j, k, π) and set improve = true.
Step 7. If j < n then set i = i + 1 and go to Step 4.
Step 8. If length = 1 then set length = length + 1 and go to Step 3 (the moves FLIP and TRANSFER_FLIP do not make sense unless length > 1).
Step 9. Set i = 1.

Step 10. Set j = i + length − 1.
Step 11. Find a position k ∉ [i − 1, j] such that the move TRANSFER_FLIP(i, j, k, π) yields a maximum reduction gain in cost.
Step 12. If gain > 0 then perform the move TRANSFER_FLIP(i, j, k, π) and set improve = true.
Step 13. If j < n then set i = i + 1 and go to Step 10.
Step 14. Set i = 1.

Step 15. Set j = i + length − 1.
Step 16. If the move FLIP(i, j, π) reduces the cost then perform the move FLIP(i, j, π) and set improve = true.
Step 17. If j < n then set i = i + 1 and go to Step 15.
Step 18. If length ≤ n − 3 then set length = length + 1 and go to Step 3 (moves for lengths n − 1 and n are simulated by earlier moves).
Step 19. If improve = true then go to Step 2.
Step 20. Output the permutation π and stop.
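The three moves can be sketched with list slicing (our own helper code, not the authors' implementation; the permutation is a Python list with pi[p − 1] the node in position p, and each function returns the new permutation rather than updating π in place):

```python
def flip(pi, i, j):
    """FLIP(i, j, pi): reverse the order of block B_ij (positions i..j)."""
    return pi[:i - 1] + pi[i - 1:j][::-1] + pi[j:]

def transfer(pi, i, j, k):
    """TRANSFER(i, j, k, pi): remove block B_ij and reinsert it
    immediately after position k of the original permutation."""
    block = pi[i - 1:j]
    rest = pi[:i - 1] + pi[j:]
    # After removing the block, original position k corresponds to index k
    # in `rest` when k < i, and to k - len(block) when k > j.
    insert_at = k if k < i else k - len(block)
    return rest[:insert_at] + block + rest[insert_at:]

def transfer_flip(pi, i, j, k):
    """TRANSFER_FLIP(i, j, k, pi): TRANSFER with the block reversed."""
    return transfer(flip(pi, i, j), i, j, k)

pi = [1, 2, 3, 4, 5, 6]
print(flip(pi, 2, 4))              # -> [1, 4, 3, 2, 5, 6]
print(transfer(pi, 2, 3, 5))       # -> [1, 4, 5, 2, 3, 6]
print(transfer(pi, 2, 3, 1))       # -> [1, 2, 3, 4, 5, 6], k = i - 1 is a no-op
print(transfer_flip(pi, 2, 3, 5))  # -> [1, 4, 5, 3, 2, 6]
```

The second example matches the formula for k > j: nodes in positions j + 1 up to k shift left, and the block lands in positions i + k − j through k.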
The complexity of LESS can be assessed as follows. In Step 5 of LESS, a search is conducted for the best position k for the block B_ij. There are O(n − length) distinct possible values for k. For each such value, a simulation of the move TRANSFER(i, j, k, π) must be performed in order to calculate the reduction in cost gain. If one assumes that each node has a bounded number of nets connected to it, and each net connects a bounded number of nodes, then the number of nets m = O(n) and the length of each net in a given permutation can be computed in constant time. Thus, the complexity of one simulation of the move TRANSFER is at most O(n), since at most O(n) nodes change positions and the length of at most O(n) nets must be updated. Consequently, the complexity of Step 5 is O(n(n − length)). Hence, the complexity of the loop embodied by Steps 4-7 is O(n(n − length)²). A similar analysis shows that the complexities of the loops embodied by Steps 10-13 and by Steps 15-17 are O(n(n − length)²) and O(n(n − length)), respectively. Finally, the complexity of one phase of LESS (Steps 3-18) is O(Σ_{i=1}^{n−2} n(n − i)²) = O(n⁴). LESS executes a few phases before it stops (in our experiments, the number of phases never exceeded 15), so that in fact the overall complexity of LESS is also O(n⁴).

EXPERIMENTAL RESULTS AND FINE TUNING OF LESS
All our experiments were performed on a Sun Sparc 1+ workstation. LESS was first tested on small circuits collected from the literature. In some cases, the nets have heavy integer weights. Since LESS assumes that all the nets have unit weights, every net of weight w was replaced by w nets identical to it. The results are summarized in Table I. The initial permutation was generated randomly using the following procedure:

Step 1. Set π(i) = i for each 1 ≤ i ≤ n.
Step 3. Return the permutation π.

The results indicate the high quality of the solutions found by LESS in comparison with the best previously known solutions, which are given in the last column of Table I. The running time of LESS in seconds is given for each circuit in Table I. For the data collected from the literature in Table I, the running time is not available except for circuit fig10 [7], for which a cost of 1470 was obtained by an algorithm described in [7] in 6.1 seconds.
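Step 2 of the initial-permutation procedure above is missing from the extracted text. A standard way to realize it, offered here only as an assumption and not necessarily the authors' exact rule, is a Fisher-Yates shuffle of the identity permutation produced by Step 1:

```python
import random

def random_permutation(n, seed=None):
    """Steps 1-3 of the initial-permutation procedure. Step 2 is assumed
    to be a Fisher-Yates shuffle (the paper's exact rule is not shown)."""
    rng = random.Random(seed)
    pi = list(range(1, n + 1))       # Step 1: pi(i) = i for 1 <= i <= n
    for i in range(n - 1, 0, -1):    # Step 2 (assumed): random swaps
        j = rng.randint(0, i)
        pi[i], pi[j] = pi[j], pi[i]
    return pi                        # Step 3: return the permutation

print(sorted(random_permutation(8, seed=1)))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```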
LESS was then used on two larger circuits in order to get a better idea about its running time as well as its performance. The performance was estimated by comparing the results of LESS to the results of the SE algorithm [10] on these two circuits. In [10], the SE algorithm has been empirically shown to be more effective than simulated annealing [17] and Kang's constructive algorithm [1]. Table II shows, for each of the two circuits, the number of nodes and nets, the cost of the solution found by LESS, the cost of the solution found by SE, the execution time of LESS in hours, and the number of phases of LESS. To make our comparison meaningful, the SE algorithm was given the same running time as LESS. For both circuits, the solution found by LESS is better than the solution found by SE. Unfortunately, it took LESS fourteen and a half hours to solve circuit CKT2.
While the results of LESS are excellent in terms of the quality of the solutions obtained, the running time of LESS may be impractical, especially if LESS is to be used to solve much larger circuits. Therefore, in order to make LESS more practical, its running time must be reduced while maintaining the quality of its solutions. One way of achieving that is by limiting LESS to execute only the moves that are likely to reduce the cost by a considerable amount. Notice that each phase of LESS can be considered as the application of a sequence of n − 2 iterative improvement algorithms parameterized by the variable length. The i-th improvement algorithm tries to improve the cost by performing the moves TRANSFER, TRANSFER_FLIP, or FLIP on every block of length i. In a phase of LESS, call a length i useful if the i-th improvement algorithm was capable of reducing the solution cost. To understand which of the n − 2 lengths considered during a phase are useful, LESS was programmed to output the cost after every length, i.e., the cost of the current solution is written into an output file before LESS executes Step 18. It was observed that very few lengths are useful in a phase of LESS. For example, the lengths which were useful in the first phase of LESS while working on circuit CKT2 are: 1-10, 15, 17-20, 28-29, 57-58, 155, 178, 180. It is clear that there are large gaps between periods of useful lengths. For instance, the lengths 59 through 154 were useless. Because no improvement occurs during periods of useless lengths, the time spent on trying to execute moves at these lengths is wasted. Moreover, notice that there are more useless lengths than useful ones: out of 181 lengths, only 22 lengths are useful. This means that much of the time spent by LESS is wasted.
The sequence of costs corresponding to the above useful lengths is: (1, 7729), (2, 3900), (3, 3276), …, (180, 1550). For convenience, this sequence is given in the form (useful length, cost). The cost of the initial solution is 15106. One should notice now that the majority of useful lengths are short. Furthermore, the largest reductions in cost occur at short lengths. Therefore, much of the quality of the final solution is retained if only the first few lengths are considered in LESS. This can be achieved by changing Step 18 in LESS to:

New_Step 18. If length ≤ n/α then set length = length + 1 and go to Step 3,

where α > 1 is a parameter chosen by the user. The computation time can be reduced by choosing a large value for α. Call the modified algorithm LESS(α). LESS(10) was used on circuits CKT1 and CKT2, and its results were compared to those of LESS in Table III. It can be seen that for each of the two circuits CKT1 and CKT2, the quality of the solution of LESS(10) is slightly worse than the solution of LESS. However, LESS(10) is much faster than LESS.
It is also possible to improve the running time of LESS by cutting down on the computation of Steps 5 and 11. Here, it is shown how Step 5 can be improved.
Step 11 can be modified similarly. The computation of Step 5 consists of a search for a position k ∉ [i − 1, j] such that if the block B_ij is transferred to this position, then the reduction in the cost of the permutation is maximum. Therefore, if block B_ij has length l, then O(n − l) positions are tested. If the set of candidate positions is restricted to a small subset of the set of all possible positions, then the speed of the algorithm can be improved considerably. The first observation is that the reduction in cost due to the transfer of block B_ij to a new position k should not be very different from the reduction due to the transfer of B_ij to a position adjacent to k. The second observation is that as the candidate positions are tested from left to right (if k_1 < k_2, then we say that position k_1 is to the left of k_2), the reduction in cost changes noticeably at critical positions. To illustrate this second observation, suppose all the nodes of the circuit except one node (say node y) have been arranged in some order. To insert node y in this order, there is a position k such that if node y is squeezed in this position, then the cost of the resulting permutation is no more than the cost of the permutation which results from squeezing y in any other position.
By squeezing y in position k we mean that the nodes in positions greater than k are shifted one unit to the right and then y is put in position k + 1. Thus, the range of k is [0, n − 1]. Let C be the cost of the current permutation of all the nodes but node y, which ignores the length of the nets connected to y (given a permutation π, the length of a net N_j is equal to h(π, j) − l(π, j)). Let CUT(k) be the set of nets that are not connected to y and that have nodes in positions less than or equal to k as well as in positions greater than k. Let c_k be the number of nets in CUT(k). Let l_k be the total length of the nets connected to y when y is squeezed in position k. Then, the cost of the resulting permutation of all the nodes is equal to C + l_k + c_k. The term c_k is the squeeze factor, which is due to the fact that the length of every net in CUT(k) increases by 1 if node y is squeezed in the k-th position. It is clear now that the best position for node y in the existing order of the other nodes is the one that minimizes l_k + c_k. If variations in c_k are negligible, then the best position k for node y is the one that minimizes l_k. Now notice that as k increases from 0 to n − 1, l_k first decreases steadily up to a certain position m, then stays at the same value until some position M (M can be the same as m) is reached, and increases steadily after that. Such a function l_k of k is called a convex function. It can be verified that the sum of convex functions is convex.
Therefore, to show that l_k is convex, it suffices to show that the length of every net connected to y is convex. Let w_k be the length of some net N_i connected to y if y is squeezed in position k. Let m_i (M_i) be the minimum (maximum) position of the nodes other than y in N_i in the existing order. Clearly, we have:

w_k = M_i − k if k < m_i, w_k = M_i − m_i + 1 if m_i ≤ k < M_i, and w_k = k + 1 − m_i if k ≥ M_i.

Therefore, w_k is a convex function of k. The minimum of w_k is achieved at k = m_i and at k = M_i. Therefore, if Y is the set of all nets connected to y and S = {m_i, M_i: N_i ∈ Y}, then the minimum of l_k must be achieved at some k ∈ S. In general, let Y_ij be the set of all nets connected to nodes in the block B_ij, and let S_ij = {m_x, M_x: N_x ∈ Y_ij} ∪ {i − 1, i, j + 1}. Now, Step 5 can be restricted to search for a best k ∈ S_ij.
Step 11 can be similarly modified.
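The restricted search can be sketched as follows (our own illustrative code with hypothetical net spans): the squeeze length w_k of a single net is the piecewise function derived above, and the candidate set S is built from the net endpoints, so minimizing over S gives the same minimum as an exhaustive scan of all positions.

```python
def w(k, m, M):
    """Length of one net connected to y if y is squeezed in position k;
    m and M are the min and max positions of the net's other nodes."""
    if k < m:
        return M - k          # y lands to the left of the net's span
    if k < M:
        return M - m + 1      # y lands inside the span
    return k + 1 - m          # y lands to the right of the span

def candidates(spans):
    """S = {m_i, M_i : net N_i connected to y}."""
    S = set()
    for m, M in spans:
        S.update((m, M))
    return S

spans = [(2, 5), (4, 7)]      # hypothetical nets connected to y
l = lambda k: sum(w(k, m, M) for m, M in spans)
full_min = min(l(k) for k in range(0, 8))            # exhaustive scan
restricted_min = min(l(k) for k in candidates(spans))  # scan of S only
print(full_min, restricted_min)  # -> 8 8: the restricted search suffices
```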
Call FAST(α) the algorithm LESS(α) in which Steps 5 and 11 have been modified as discussed above. FAST(α) was used on a number of circuits. For every one of these circuits, α was set equal to n/20 so that the maximum length of a block is no more than 20.
The results of FAST(n/20) were compared to those obtained by the SE algorithm [10]. The same initial random permutation was used by both algorithms. Table IV shows the results. The time shown in Table IV is the running time of FAST(n/20) in hours. The SE algorithm was given the same running time. A comparison of the running times of LESS, LESS(10), and FAST(n/20) on circuit CKT2 shows that FAST(n/20) is indeed faster than both LESS and LESS(10). The solutions found by FAST(n/20) are comparable in quality to the corresponding solutions found by the SE algorithm. This illustrates that even under the restrictions imposed on the basic algorithm, it is still possible to obtain good quality solutions.
Another possibility for improving the running time of the basic algorithm is to start with a good initial permutation. First, the cost of a good initial permutation is expected to be less than the cost of a random permutation, so that fewer iterations may be needed by the algorithm. Second, in a good permutation, nodes that are heavily connected are expected to be close to each other. Consequently, the moves TRANSFER and TRANSFER_FLIP should be less time consuming (the computations of TRANSFER(i, j, k, π) and TRANSFER_FLIP(i, j, k, π) are proportional to j − k if k < i and to k − i if k > j). Clearly, however, starting from a constructive initial permutation need not lead to better solutions. The heuristic CLO [10] was used to generate initial permutations for both FAST(n/20) and SE. The new results of both algorithms are shown in Table V. Again, SE was given the same running time as FAST(n/20). A comparison of the results in Tables IV and V shows that for most circuits it was advantageous to start with a good initial permutation. The improved performance is obvious for larger circuits. For circuit CKT7, for example, the solution of FAST(n/20) starting from the initial permutation generated by CLO has about half the cost of the solution of FAST(n/20) starting from a random initial permutation. In addition, for circuit CKT7, the use of a good initial permutation made FAST(n/20) about three times as fast. The results in Tables IV and V show a dependency on the initial permutation. However, the quality of the final solution of the original algorithm LESS is much less sensitive to the initial permutation, as indicated by the empirical results. The restriction on blocks to be of short length and the limitations imposed on Steps 5 and 11 in FAST(n/20) enhance the sensitivity of the algorithm to the initial permutation.
In the appendix, we give a procedure for the construction of test circuits for which the optimal cost is known. The input to the construction procedure consists of the desired number of nodes n, the desired number of nets m, and a desired bound B > 2 on the maximum number of nodes that can be connected by a single net. The construction procedure outputs the constructed circuit along with its optimal cost. Using this procedure, a number of circuits were generated. Then, FAST(n/20) and SE were tried on these circuits using a random initial permutation and an initial permutation generated by CLO. The results obtained by initially using a random permutation and the permutation generated by CLO are shown in Table VI and Table VII, respectively. A comparison of the results in Tables VI and VII reveals that better results can be achieved in a much shorter time if a good initial permutation is used. This is true for both FAST(n/20) and the SE algorithm. When a permutation generated by CLO is used as an initial permutation, the FAST(n/20) results in Table VII are close to optimal. The speedup over the use of a random initial permutation is also clear. For circuit C8, for example, FAST(n/20) starting from a good initial permutation generated by CLO achieved a solution whose cost is less than 1/3 of the cost obtained when a random initial permutation is used, with a speedup of 13.71. As can be observed in Table VI, when the initial permutation is random, the solution of FAST(n/20) moves farther and farther from the optimal solution as the size of the circuit grows. This is mostly due to limiting the computation to blocks of length less than or equal to 20. When the size of the circuit is small (circuit C1, for example), an optimal solution can still be achieved. However, for larger circuits, length 20 becomes too small in comparison with the number of nodes. Consequently, FAST(n/20) becomes more susceptible to being trapped in local minima. To avoid this situation, longer blocks need to be considered. This will certainly increase the computation time enormously, since the majority of possible block lengths are useless, as was mentioned earlier in this section. When longer blocks are allowed, the algorithm may perform many useless computations before it reaches a useful length during each phase. Note that SE was capable of obtaining better solutions than FAST(n/20) in some cases in Tables IV and V, but the results of SE are consistently worse than the results of FAST(n/20) in Tables VI and VII.
While it is not possible to give a conclusive justification of this observation, it seems that SE does not perform as well on the artificial circuits used in Tables VI and VII. This may be partly due to the fact that SE uses only two-exchange moves.
The above results show that FAST(n/20) is effective, and both more practical and faster than LESS. Therefore, it is natural to check the performance of FAST(n/20) on the same circuits in Table I, which were first used to test the effectiveness of LESS. The results of FAST(n/20) on these circuits are summarized in Table VIII (the last column shows the best previously known solutions). These results are comparable to those obtained by LESS in Table I, but they were obtained in a much shorter time. Also, for Data V, FAST(n/20) obtained a slightly better solution than LESS.
To illustrate that FAST(α) can indeed produce near-optimal solutions if larger blocks are considered (i.e., if a small value of α is used), FAST(2) was used on the smaller circuits in Tables VI and VII (up to 600 nodes). For larger circuits, FAST(2) uses more computation time than we can afford on a Sun SPARC 1+ workstation. Again, two different choices for the initial permutation were used: a random one, and a constructive one generated by CLO. The results are shown in Table IX. It can be seen that near-optimal solutions can be achieved. In fact, optimal solutions were generated for circuits C1, C2, and C5 regardless of the initial permutation. The worst cost was generated for circuit C4 when the initial permutation was randomly generated. Nevertheless, even in this case, the cost of the solution generated is within 10% of the optimal cost. The running time and the cost were consistently lower when the initial permutation was generated by CLO. The saving in running time can be significant. For example, for circuit C5, FAST(2) was five times faster when the permutation generated by CLO was used as the initial permutation.

CONCLUDING REMARKS
The notion of a "block" in a permutation is the fundamental concept used in our linear placement algorithm. A good block should be a set of consecutive nodes which form a cluster (i.e., a set of heavily interconnected nodes). If there were a simple criterion for determining whether a sequence of consecutive nodes is a good block, then LESS could be made much faster. The lack of such a criterion forces us to consider all (n² + n)/2 subsequences of consecutive nodes in a permutation as blocks. As part of future work, we intend to develop such a criterion. Even though part of the computation time is unproductive, LESS can be modified into the effective algorithm FAST(α). The algorithm FAST(α) restricts the length of a block to be at most n/α. Thus, the computation time is inversely proportional to α. However, the cost of the final solution is proportional to α. Consequently, α must be chosen appropriately so that the computation time is sharply reduced without damaging the quality of the final solution.
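The effect of the length restriction can be made concrete. The following Python sketch (our own illustration; the function name and interface are not from the paper) enumerates the candidate blocks of a permutation of n nodes, optionally capped at length n/α as in FAST(α). With no cap it yields all (n² + n)/2 blocks considered by LESS.

```python
def candidate_blocks(n, alpha=None):
    """Enumerate (start, end) index pairs of blocks of consecutive nodes
    in a permutation of n nodes (0-based, inclusive). Without a cap there
    are (n*n + n)/2 such blocks; FAST(alpha) keeps only those of length
    at most n/alpha, cutting the work roughly by a factor of alpha."""
    max_len = n if alpha is None else max(1, n // alpha)
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len, n))]
```

For n = 1000, LESS would consider 500,500 blocks per pass, while FAST(n/20) considers only those of length at most 20, about 20,000 blocks.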
Our experiments indicate that only a small fraction of possible block lengths are useful. The smallest lengths (1-10) were useful most of the time. However, larger useful lengths (>10) are separated by gaps of many useless lengths. These gaps are traps that prevent heuristics from achieving good solutions. LESS had to step over many of these traps in order to discover the useful lengths. The existence of these gaps may be an indication that probabilistic algorithms such as simulated annealing [17] and stochastic evolution [18] may need a long time to generate near-optimal solutions if only simple moves (e.g., transpositions) are used. The artificial circuits constructed in the appendix, whose optimal cost is known, can be used to estimate the effectiveness of a heuristic. In particular, unless a heuristic is poor, it should perform reasonably well on these circuits.
The idea behind our construction method is very simple: if there is a permutation π of the nodes of a circuit such that the cost of π is equal to a lower bound on the optimal cost, then this permutation is an optimal solution to the linear placement problem. Clearly, the length of any net in any permutation is at least one less than the number of nodes connected by that net. Therefore, if a circuit has m nets, where the number of nodes connected by net N_i is x_i, then the cost L(π) of any permutation π must satisfy

L(π) ≥ Σ_{i=1}^{m} (x_i − 1).   (*)

The input to the algorithm consists of the number of nodes n, the number of nets m, and a bound B > 2 on the maximum number of nodes that can be connected by a single net. The algorithm generates nets connecting at most B nodes. However, the algorithm can easily be modified so that a fraction of the nets connect only 2 nodes each, 3 nodes each, etc. The algorithm begins with a random permutation π and constructs the nets so that equality holds in (*). Random permutations are generated as described in Section 4 of this paper. Here is the outline of the algorithm:

Step 1. Read n, m, and B.
Step 2. Generate a random permutation π of the set of nodes {1, 2, ..., n}.
Step 3. Set C_0 = 0 and k = 1.
Step 4. Let i be a random integer in the interval [1, n]. Let j be a random integer in the interval [1, n] ∩ [i − B + 1, i + B − 1]. Suppose without loss of generality that i ≤ j. Construct the net N_k = {π(l): i ≤ l ≤ j}. Set C_0 = C_0 + j − i.
Step 5. Increment k. If k ≤ m then go to Step 4.
Step 6. Output the circuit and the optimal cost C_0.

Clearly, all the nets are constructed such that their nodes are consecutive in the random permutation π. Therefore, for this permutation π, equality holds in (*). Consequently, C_0, which at the end of the algorithm is the cost of π, is indeed the optimal cost for the constructed circuit. It should be mentioned here that the initial random permutation π need not be the only optimal solution.
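The outline above can be rendered as a short program. The following Python sketch is our own (the function names, the seed parameter, and the 0-indexed cost routine are assumptions, not from the paper); it builds a circuit whose nets are consecutive in a random permutation π and returns that permutation together with the optimal cost C_0.

```python
import random

def construct_circuit(n, m, B, seed=None):
    """Construct a test circuit with a known optimal linear-placement cost.

    Each net connects nodes that are consecutive in a random permutation
    pi, so the cost of pi meets the lower bound sum(|N_k| - 1) over all
    nets, and pi is therefore an optimal placement.
    """
    rng = random.Random(seed)
    pi = list(range(1, n + 1))
    rng.shuffle(pi)                      # Step 2: random permutation
    nets, c0 = [], 0                     # Step 3: C_0 = 0
    for _ in range(m):                   # Steps 4-5: build the m nets
        i = rng.randint(1, n)
        j = rng.randint(max(1, i - B + 1), min(n, i + B - 1))
        if i > j:
            i, j = j, i                  # w.l.o.g. i <= j
        nets.append([pi[l - 1] for l in range(i, j + 1)])
        c0 += j - i                      # length of this net in pi
    return pi, nets, c0                  # Step 6: circuit and optimal cost

def placement_cost(perm, nets):
    """Cost of a placement: total span (rightmost minus leftmost position)
    over all nets, matching the wire-length objective of LPP."""
    pos = {node: p for p, node in enumerate(perm)}
    return sum(max(pos[v] for v in net) - min(pos[v] for v in net)
               for net in nets)
```

Because every net occupies consecutive positions in the generating permutation, `placement_cost(pi, nets)` equals both `c0` and the lower bound (*), confirming optimality of π for the constructed instance.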

FIGURE 3 A hypothetical cost-versus-permutation curve.

TABLE I  Performance of LESS on Small Circuits Collected From the Literature

TABLE IV  Comparison of FAST(n/20) and SE Starting From a Random Initial Permutation

TABLE V  Comparison of FAST(n/20) and SE Starting From an Initial Permutation Generated by CLO

TABLE VI  Comparison of FAST(n/20) and SE on Artificial Circuits Using a Random Initial Permutation

TABLE VII  Comparison of FAST(n/20) and SE on Artificial Circuits Using an Initial Permutation Generated by CLO
