A Fast Algorithm for Performance-Driven Module Implementation Selection*

We develop an O(p log n) time algorithm to obtain optimal solutions to the p-pin n-net single channel performance-driven implementation selection problem in which each module has at most two possible implementations (2-PDMIS). Although Her, Wang and Wong [1] have also developed an O(p log n) algorithm for this problem, experiments indicate that our algorithm is twice as fast on small circuits and up to eleven times as fast on larger circuits. We also develop an O(pn^{c-1}) time algorithm for the c-channel, c > 1, version of the 2-PDMIS problem.


INTRODUCTION
In the channel routing problem, we have a routing channel with modules on the top and bottom of the channel; the modules have pins, and subsets of pins define nets. The objective is to route the nets while minimizing channel height. Several algorithms have been proposed for channel routing (see, for example, [2]).
When the modules on either side of the channel are programmable logic arrays, we have the flexibility of reordering the pins in each module; any pin permutation may be used. The ability to reorder module pins adds a new dimension to the routing problem. Channel routing with rearrangeable pins was studied by Kobayashi and Drozd [3]. They proposed a three step algorithm: (1) permute pins so as to maximize the number of aligned pin pairs (a pair of pins on different sides of the channel is aligned iff they occupy the same horizontal location and they are pins of the same net), (2) permute the nonaligned pins so as to remove cyclic constraints, and (3) while maintaining an acyclic vertical constraint graph, permute unaligned pins so as to minimize channel density. Lin and Sahni [4] developed a linear time algorithm for Step (1), and Sahni and Wu [5] showed that Steps (2) and (3) are NP-hard. Tragoudas and Tollis [6] presented a linear time algorithm to determine whether there is a pin permutation for which a channel is river routable. They also showed that the problem of determining a pin permutation that results in minimum density is NP-hard in general, and they developed polynomial time algorithms for the special case of channels with two terminal nets and channels with at most one terminal of each net being in each module.
†Corresponding author, e-mail: {yccheng, sahni}@cise.ufl.edu
Variants of the channel routing with permutable pins problem have also been studied [7][8][9][10]. In these variants, restrictions are placed on the allowable pin permutations for each module. Restrictions may arise, for example, because the module library contains only a limited set of implementations of each module [7]. Another variant, considered by Cai and Wong [8], permits the shifting of modules and pins so as to minimize channel density. Extensions to the case when over the cell routing is permitted are considered in [9] and [10].
The variant of the channel routing with permutable pins problem that we consider in this paper is the performance-driven module implementation selection (PDMIS) problem formulated by Her, Wang and Wong [1]. In the k-PDMIS problem, we are given two rows of modules with a routing channel in between, up to k possible implementations for each module (different implementations of a module differ only in the location of pins; the module size and pin count are the same), and a set of net span constraints (the span of a net is the distance between its leftmost and rightmost pins). A feasible solution to a k-PDMIS instance is a selection of module implementations so that all net span constraints are satisfied. An optimal solution is a feasible solution with minimum channel density.
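The definitions above can be made concrete with a small exhaustive reference solver. This is only a sketch for tiny instances, not the algorithm of this paper or of [1]: it runs in time exponential in the number of modules. The data layout (each implementation is a list of `(net, column)` pins), the function names, and the convention that a net confined to a single column contributes nothing to the density are all assumptions made for illustration.

```python
from itertools import product

def net_intervals(pins):
    """Map each net to its (leftmost, rightmost) column over (net, column) pins."""
    intervals = {}
    for net, col in pins:
        lo, hi = intervals.get(net, (col, col))
        intervals[net] = (min(lo, col), max(hi, col))
    return intervals

def channel_density(intervals):
    """Largest number of nets crossing any single column.

    Assumption: a net whose pins share one column needs no horizontal track.
    The maximum is always reached at some interval endpoint.
    """
    wires = [(lo, hi) for lo, hi in intervals.values() if lo < hi]
    columns = {c for wire in wires for c in wire}
    return max((sum(lo <= c <= hi for lo, hi in wires) for c in columns), default=0)

def brute_force_pdmis(modules, span_limit):
    """Try every implementation selection; return (density, selection) or None.

    modules[j] is the list of implementations of module j.
    span_limit maps a net to its maximum allowed span (unlisted nets are unbounded).
    """
    best = None
    for selection in product(*(range(len(impls)) for impls in modules)):
        pins = [pin for impls, s in zip(modules, selection) for pin in impls[s]]
        intervals = net_intervals(pins)
        if any(hi - lo > span_limit.get(net, float("inf"))
               for net, (lo, hi) in intervals.items()):
            continue  # some net span constraint is violated
        density = channel_density(intervals)
        if best is None or density < best[0]:
            best = (density, selection)
    return best

# Two modules with two implementations each (a hypothetical 2-PDMIS instance).
modules = [
    [[(1, 0), (2, 1)], [(2, 0), (1, 1)]],  # top module, columns 0-1
    [[(1, 1), (2, 2)], [(2, 1), (1, 2)]],  # bottom module, columns 1-2
]
result = brute_force_pdmis(modules, {1: 2, 2: 1})
```

In this instance, selecting the second implementation of the top module together with the first implementation of the bottom module violates the span bound of net 2, so it is never considered when choosing the minimum density.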
Figure 1(a) shows a routing channel with two modules on either side of the routing channel. Assume that each module has two implementations and that the pin locations for the second implementation [...]

Her, Wang and Wong show that the k-PDMIS problem is NP-hard for every k ≥ 3. For the 2-PDMIS problem, they develop an O(p log n) algorithm to find an optimal solution. In this paper, we develop an alternative O(p log n) algorithm to find an optimal solution to the 2-PDMIS problem. Experiments indicate that our algorithm is twice as fast on small circuits and up to eleven times as fast on larger circuits.
We begin, in Section 2, by providing an overview of the O(p log n) algorithm of [1]. Then, in Section 3, we describe our O(p log n) algorithm. In Section 4, we develop an O(pn^{c-1}) algorithm for the c-channel, c > 1, 2-PDMIS problem. Experimental results using the single channel 2-PDMIS algorithm are presented in Section 5.

[...] C_span ∧ C_den is satisfiable. As shown in [1], the size of C_span ∧ C_den is O(p); Steps 1-3 take O(p) time; and the overall complexity is O(p log n).

OUR O(p log n)-TIME ALGORITHM
Our algorithm is a two stage algorithm that does not construct a 2-SAT formula. In the first stage, we construct a set of 2m "forcing lists", where m is the number of modules. L[i] is a list of module implementation selections that get forced if the first implementation of module i, 1 ≤ i ≤ m, is selected; L[m + i] is the corresponding list for module i when the second implementation of module i is selected. By forced, we mean that unless the module implementations on L[i] (L[m + i]) are selected whenever the first (second) implementation of module i is selected, we cannot have a feasible solution that also satisfies the given density constraint. In the second stage, we use the limited branching method of [12] and the forcing lists constructed in Stage 1 to obtain a module implementation selection that satisfies the net span and density constraints (provided such a selection is possible). To find an optimal solution, we use binary search to determine the smallest density constraint for which a feasible solution exists.

[...] violates the net span constraint for net i, then each of the remaining three selection pairs also violates the net span constraint for this net. We have the following possibilities: [...]

Once we have constructed the forcing lists for the net span constraints, we proceed to augment these lists to account for the channel density constraint. Of course, this augmentation is to be done only when we haven't already determined that the given 2-PDMIS instance is infeasible. Our strategy to augment the forcing lists to account for the density constraint begins by partitioning the routing channel into regions such that no module boundary falls inside of a region (see Fig. 4). [...]

In the limited branching method, we start with a module whose implementation is yet to be selected. For this module, we try out both implementations, in parallel, following the forcing lists L[i] and L[m + i], respectively. This is equivalent to running Assign (L, C, i) and Assign (L, C, m + i) in parallel and terminating when either (a) both return with value False or (b) one (or both) return with value True. When (a) occurs, the instance is infeasible.
When (b) occurs, the selections made by the branch that returns True are used. Note that the parallel execution of Assign (L, C, i) and Assign (L, C, m + i) is actually done via simulation by a single processor; this processor alternates between performing one step of Assign (L, C, i) and one step of Assign (L, C, m + i) and stops when one of the two conditions (a) or (b) occurs. In case of (b), we proceed with the next module with unselected implementation.
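The alternating simulation of the two branches can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: modules are numbered from 0, the selection "implementation v of module j" is encoded as the integer `j + v*m`, `L[x]` lists the selections forced by `x`, and the forcing lists are closed under contrapositives (whenever x forces y, the opposite of y forces the opposite of x). Each branch records its tentative selections in a private dictionary instead of a full copy of the selection array, and Python generators stand in for the single processor that alternates between the branches.

```python
from collections import deque

def branch(L, C, m, start):
    """Follow the forcing lists from `start`, one forced selection per step.

    Yields None after each step, False on a conflict, and finally the dict
    {module: implementation} of new selections if the branch succeeds.
    """
    delta, queue = {}, deque()

    def select(x):
        j, v = x % m, x // m
        current = delta.get(j, C[j])
        if current is None:          # module j undecided: take implementation v
            delta[j] = v
            queue.append(x)
            return True
        return current == v          # already decided: must agree

    if not select(start):
        yield False
        return
    while queue:
        x = queue.popleft()
        for forced in L[x]:
            if not select(forced):
                yield False
                return
            yield None
    yield delta

def limited_branching(L, m):
    """Return a selection list C (0 or 1 per module), or None if infeasible."""
    C = [None] * m
    for i in range(m):
        if C[i] is not None:
            continue
        branches = [branch(L, C, m, i), branch(L, C, m, m + i)]
        alive, winner = [True, True], None
        while winner is None and any(alive):
            for b in (0, 1):         # one step of each live branch in turn
                if not alive[b]:
                    continue
                result = next(branches[b])
                if result is False:
                    alive[b] = False
                elif result is not None:
                    winner = result
                    break
        if winner is None:           # case (a): both branches failed
            return None
        for j, v in winner.items():  # case (b): commit the successful branch
            C[j] = v
    return C
```

Because the losing branch is abandoned as soon as the other one succeeds, the work wasted on it is bounded by the work done on the winner, which is the property the time analysis relies on.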

Implementation Details
To implement Stage 2, we need two copies of the implementation selection array C, one copy for each parallel execution branch. Call these copies C1 and C2. Although both are identical at the start of Assign (L, C1, i) and Assign (L, C2, m + i), C1 and C2 may differ later. When the execution of these two branches terminates, we need to set the copy corresponding to the unselected branch equal to that of the selected branch. This is done efficiently by maintaining two lists A1 and A2 of changes made to C1 and C2 since the start of the two branches.
Then, if C1 is selected, we can use A2 to first convert C2 back to its initial state and then use A1 to convert it from the initial state to C1. If C2 is selected, a similar process can be used to convert C1 to C2. The time needed for this is |A1| + |A2| rather than |C1| = |C2| = m (as would be the case if we simply copied C1 to C2 or C2 to C1). Further, since the forcing lists are shared by two branches, these branches should not modify the forcing lists. Therefore, the simulation of Assign omits the steps that remove forcing lists. Finally, to efficiently simulate two parallel executions of Assign, we need to convert the recursive version of Figure 3 into an iterative version. Our iterative code, which simulates the parallel execution of two Assign branches, employs two queues Q1 and Q2.
A high level description of the code is given in
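Independently of the authors' own figure, the change-list synchronization described in this subsection can be illustrated with a short sketch. The encoding is an assumption: each copy of C is a Python list holding `None` for an undecided module and 0 or 1 for a selected implementation, and each change list holds the indices of the modules a branch has decided. The function name `sync_loser` is hypothetical.

```python
def sync_loser(C_win, A_win, C_lose, A_lose):
    """Make C_lose identical to C_win in O(|A_win| + |A_lose|) time."""
    for j in A_lose:              # roll the losing copy back to the common start state
        C_lose[j] = None
    for j in A_win:               # replay the changes made by the winning branch
        C_lose[j] = C_win[j]

# Common start state: module 0 already uses its first implementation.
C1 = [0, None, None, None]
C2 = list(C1)
A1, A2 = [], []

C1[1] = 1; A1.append(1)           # branch 1 decides module 1
C2[1] = 0; A2.append(1)           # branch 2 decides modules 1 and 3
C2[3] = 1; A2.append(3)

sync_loser(C1, A1, C2, A2)        # suppose branch 1 is the selected branch
```

After the call both copies agree, and only the three entries recorded in the change lists were touched rather than all m entries.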

Time Complexity
To construct the net span constraints' portion of the forcing lists, we must identify the up to four critical modules of each net and establish the forcing constraints for each of the up to four critical module pairs that determine the net span.

Algorithm Undo (C1, A1, C2, A2)
/* make C1 = C2 by using delta lists */
for each x ∈ A1 do Mark x undecided in C1;
for each x ∈ A2 do Mark x selected in C1;

To construct the portion of the forcing lists that results from the channel density constraint, we partition the channel into regions by performing a left to right sweep of the modules and using the module end points as region boundaries. The number of channel regions is, therefore, O(m). In our implementation, we scan the channel four times to compute the maximum density of each region for each of the four possible implementations of the module pair that bounds the region. This takes Θ(p) time.
Once we have the densities of each region we can, given the density constraint, construct the forcing lists L[1..2m] in O(m) time. Notice that on succeeding iterations of the binary search for an optimal solution, only the contribution to L from the density constraint may change. The new contribution to L can be determined without recomputing the densities of each region.
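One way the density contribution to the forcing lists might be derived from precomputed region densities is sketched below. The details are assumptions, since the construction is not spelled out here: each region is taken to be bounded by one top module `a` and one bottom module `b`, `dens[ia][ib]` is the region's maximum density when `a` uses implementation `ia` and `b` uses `ib`, modules are numbered from 0, and the selection "implementation v of module j" is stored at index `j + v*m`. Any implementation pair whose density exceeds the bound d is ruled out by making each half of the pair force the opposite implementation of the other module.

```python
def density_forcing(regions, m, d):
    """Build the density part of the forcing lists in time linear in len(regions)."""
    L = [[] for _ in range(2 * m)]
    for a, b, dens in regions:
        for ia in (0, 1):
            for ib in (0, 1):
                if dens[ia][ib] > d:
                    # (ia, ib) is too dense: choosing one half forces the
                    # other module onto its alternative implementation.
                    L[a + ia * m].append(b + (1 - ib) * m)
                    L[b + ib * m].append(a + (1 - ia) * m)
    return L

# One hypothetical region bounded by modules 0 and 1.
regions = [(0, 1, [[3, 5], [4, 2]])]
loose = density_forcing(regions, m=2, d=4)
tight = density_forcing(regions, m=2, d=3)
```

Since the region densities do not depend on d, only this cheap pass has to be repeated when the density bound changes between binary search iterations.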
The limited branching method of Stage 2 uses two queues Q1 and Q2. The time needed to add (EnQueue) or delete (DeQueue) an element to/from a queue is O(1) [13]. In each iteration of the for loop of Figure 5, the time spent following the successful branch equals that spent following the unsuccessful branch, and the time needed to make C1 and C2 identical (i.e., the cost of the Undo operation) is, asymptotically, no more than the time spent following the successful branch.
The time spent following all successful branches is no more than the size of the forcing lists because no forcing list is examined twice. Therefore, the Stage 2 time is O(p).
The binary search for the minimum density solution iterates O(log n) times. Therefore, our algorithm finds an optimal solution to the 2-PDMIS problem in O(p log n) time.
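The outer binary search can be sketched generically. The sketch assumes that a feasibility test `feasible(d)` is available (for example, Stage 1 followed by Stage 2 under density bound d), that feasibility is monotone in d, and that no solution needs a density larger than the number of nets n. The function name is hypothetical.

```python
def min_feasible_density(feasible, n):
    """Smallest d in [0, n] with feasible(d), or None if even d = n fails."""
    if not feasible(n):
        return None               # the net span constraints alone cannot be met
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid              # a solution of density <= mid exists
        else:
            lo = mid + 1          # every solution needs density > mid
    return lo
```

Each iteration halves the candidate range, so the test is invoked O(log n) times.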
Comparing our algorithm to that of [1], we note that our algorithm has the potential of identifying infeasible 2-PDMIS instances quite early; that is, during the construction of the forcing lists. Although infeasibility resulting from the critical modules of a single net being too far apart is detected immediately by both algorithms, our algorithm can also quickly detect infeasibility resulting from forced selections during Stage 1. The algorithm of [1] does not do this. Because of the calls to Assign made during Stage 1, the size of the forcing lists to be processed in Stage 2 is often significantly reduced. As a result, the limited branching operation is often applied to much smaller data sets than the 2-SAT graph on which the strongly connected component algorithm is applied in [1]. These factors contribute to the observed speedup provided by our algorithm relative to that of [1].

MULTICHANNEL 2-PDMIS PROBLEM
In the multichannel 2-PDMIS problem, we have c + 1, c > 1, rows of modules. Each module has pins on its upper and lower boundaries, each module has two possible implementations, there is a routing channel between every pair of adjacent rows, and net span bounds are provided for every channel [1]. Although Her, Wang and Wong [1] develop a heuristic for the general multichannel PDMIS problem, they do not consider polynomial time algorithms for the multichannel 2-PDMIS problem.
For any fixed channel density tuple (d1, d2, ..., dc) for the c routing channels, we can develop the forcing lists in O(p) time, where p is the total [...]
We can reduce this time to O(pn^{c-1}) as follows. When c = 2, first determine the least y such that ((n/2), y) is a feasible channel density tuple. This is done using a binary search on d2 and takes O(log n) feasibility tests, each test taking O(p) time. We can ignore tuples (d1, d2) with d1 < (n/2) and d2 < y because these tuples are infeasible, and we can ignore tuples (d1, d2) with d1 > (n/2) and d2 ≥ y because these are inferior to ((n/2), y). Therefore, the search for a better tuple than ((n/2), y) may be limited to the regions d1 < (n/2) and d2 ≥ y, and d1 > (n/2) and d2 < y. These two regions (Fig. 8) may now be searched recursively. For example, to find the best tuple in the region d1 < (n/2) and d2 > y, find the least z such that ((n/4), z) is feasible. Now search the two regions d1 < (n/4) and d2 > z, and d1 > (n/4) and d2 < z, for a better tuple than ((n/4), z).

FIGURE 8 The two regions to be searched recursively after the binary search.
The worst-case number of feasibility tests for the above search strategy is given by the recurrence $N(n) = 2N(n/2) + \log n$ for $n \ge 2$, and $N(1) = 1$. The solution to this recurrence is $N(n) = O(n)$. Since each feasibility test takes O(p) time, the 2-channel 2-PDMIS problem can be solved in O(pn) time.
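The stated bound can be checked by unrolling the recurrence. The derivation below assumes that n is a power of two, n = 2^k, that the logarithm is taken base 2, and that the recurrence applies for all n ≥ 2:

```latex
\begin{aligned}
N(2^k) &= 2^k N(1) + \sum_{i=0}^{k-1} 2^i \log_2\!\left(2^{k-i}\right)
        = 2^k + \sum_{i=0}^{k-1} 2^i (k-i) \\
       &= 2^k + 2^k \sum_{j=1}^{k} \frac{j}{2^j}
        = 2^k + \left(2^{k+1} - k - 2\right)
        = 3n - \log_2 n - 2 = O(n).
\end{aligned}
```

For instance, N(2) = 3, N(4) = 8 and N(8) = 19, in agreement with the closed form. The logarithmic cost at each level is dominated by the doubling number of subproblems, which is why the total stays linear in n.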
By doing an exhaustive search on the densities of c-2 channels and using the above technique for the remaining 2 channels (i.e., for each choice of densities for c-2 channels, find the overall best choice for the c channels as above), we can solve the c-channel 2-PDMIS problem in O(p · n^{c-2} · n) = O(pn^{c-1}) time.
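The recursive two-channel search can be sketched as follows. Several points are assumptions rather than part of the text: `feasible(d1, d2)` stands for the O(p) feasibility test and is taken to be monotone (raising either density bound never makes a feasible tuple infeasible), densities range over 1..n, and "better" is interpreted as a smaller value of a cost function that does not decrease in either coordinate, here d1 + d2 by default.

```python
def least_feasible(feasible, d1, lo, hi):
    """Smallest d2 in [lo, hi] with feasible(d1, d2), or None (binary search)."""
    if lo > hi or not feasible(d1, hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(d1, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def best_tuple(feasible, n, cost=lambda d1, d2: d1 + d2):
    """Cheapest feasible (d1, d2) with 1 <= d1, d2 <= n, or None."""
    best = None

    def search(lo1, hi1, lo2, hi2):
        nonlocal best
        if lo1 > hi1 or lo2 > hi2:
            return
        mid = (lo1 + hi1) // 2
        y = least_feasible(feasible, mid, lo2, hi2)
        if y is None:
            # Nothing with d1 <= mid is feasible in this rectangle.
            search(mid + 1, hi1, lo2, hi2)
            return
        if best is None or cost(mid, y) < cost(*best):
            best = (mid, y)
        search(lo1, mid - 1, y, hi2)      # d1 < mid forces d2 >= y
        search(mid + 1, hi1, lo2, y - 1)  # d1 > mid only helps if d2 < y

    search(1, n, 1, n)
    return best
```

The two recursive calls correspond to the two regions that remain after the binary search; the other two quadrants are skipped because they are either infeasible or no better than the tuple just found.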

EXPERIMENTAL RESULTS
We implemented our algorithm as well as that of Her, Wang and Wong [1] in C and measured the run time performance of the two algorithms on a SUN SPARCstation 5. Our first data set consists of benchmark channels used in [1]. We partitioned the top row and bottom row of the channel into intervals, considered these intervals as "modules", and assumed that each module has two implementations. Table I gives the characteristics of these circuits as well as the time, in seconds, taken by the two algorithms. The optimal densities given in Table I differ from those reported in [1] because the partitioning of the top and bottom rows of pins used by us is different from that used in [1]. The speedup provided by our algorithm ranges from 1.67 to 2.20. Our second data set consists of circuits designed to minimize the size of the forcing lists constructed in Stage 1. The characteristics of these circuits as well as the performance of the two algorithms on these circuits are given in Table II. Our algorithm is 9 to 11 times as fast on these circuits.

[...] [1]. The heuristic proposed in [1] for the k-PDMIS problem, k > 2, uses the algorithm for the 2-PDMIS problem. By using our 2-PDMIS algorithm, the k-PDMIS heuristic of [1] will also run faster.