A Fast Algorithm for Transistor Folding

Transistor folding reduces the area of row-based designs that employ transistors of different sizes. Kim and Kang [1] have developed an $O(m^2 \log m)$ algorithm to optimally fold $m$ transistor pairs. In this paper we develop an $O(m^2)$ algorithm for optimal transistor folding. Our experiments indicate that our algorithm runs 3 to 60 times as fast for $m$ values in the range 100 to 100,000.


INTRODUCTION
In high-performance circuit design, the transistor sizing problem has been investigated widely (for example, [7][8][9][10]). The objective of transistor sizing is to reduce the circuit delay by increasing the area of transistors. One by-product of transistor sizing is the generation of layouts with transistors of widely varying sizes. In row-based layout synthesis ([3][4][5][6]), pMOS and nMOS transistors are grouped together and placed in rows. Layout area in these designs is wasted because of nonuniform cell heights. The required layout area can be reduced by folding large transistors so that their height is reduced. Transistor folding to optimize layout area has been considered in [1] and [2]. Her and Wong [2] show that the area of row-based designs can be reduced by as much as 30% by performing transistor folding. In this paper, we consider the row-based-design transistor-folding problem considered in [1] and develop an $O(m^2)$ algorithm to minimize area. We also report on experiments which show that our algorithm actually runs much faster than the algorithm of [1]. The test circuits used in our experiments have between 100 and 100,000 transistor pairs, so our tests are similar to those conducted in [1], where the circuits had from 192 to 88,258 transistor pairs.

This research was supported, in part, by the Army Research Office under grant DAA H04-95-1-0111. E-mail: yccheng@cise.ufl.edu. Corresponding author e-mail: sahni@cise.ufl.edu.

PROBLEM FORMULATION
We are given a CMOS circuit with a row of $m$ transistor pairs. Each transistor pair consists of a pMOS transistor and its dual nMOS transistor. Let $p_i$ and $n_i$, respectively, be the heights of the pMOS and nMOS transistors in the $i$th pair, $1 \le i \le m$. $p_i$ and $n_i$ are integers that give transistor height in multiples of the minimum resolution $\lambda$. The area occupied by the folded transistor pair is shown by a shaded box in Figure 2. In practice, the height of the layout area is slightly larger than the sum of the pMOS and nMOS folding heights, and the layout width is slightly larger than the number of transistor columns because of overheads.
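Let $h_p$ and $h_n$ be the folded heights of the pMOS and nMOS transistors, respectively. The height of the folded layout is $h_p + h_n + c_v$ and the width is $\sum_{i=1}^{m} \max(\lceil p_i/h_p \rceil, \lceil n_i/h_n \rceil) + c_h$, where $c_v$ and $c_h$ are, respectively, the vertical and horizontal overheads. The area of the folded layout is [1]

$$(h_p + h_n + c_v)\left(\sum_{i=1}^{m} \max\left(\lceil p_i/h_p \rceil, \lceil n_i/h_n \rceil\right) + c_h\right). \qquad (1)$$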
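The area expression above can be evaluated directly for any pair of folding heights. The following is a minimal sketch, assuming the width is the total column count plus the horizontal overhead; the function name `folded_area` and its argument order are illustrative and do not come from [1].

```python
def folded_area(p, n, hp, hn, cv, ch):
    """Area of a folded row: (hp + hn + cv) * (total columns + ch).

    p, n   : lists of pMOS / nMOS transistor heights (positive integers)
    hp, hn : folding heights for the pMOS and nMOS rows (positive integers)
    cv, ch : vertical and horizontal overheads
    """
    if len(p) != len(n):
        raise ValueError("p and n must describe the same number of pairs")
    # Pair i needs max(ceil(p_i/hp), ceil(n_i/hn)) columns;
    # -(-a // b) is integer ceiling division.
    columns = sum(max(-(-pi // hp), -(-ni // hn)) for pi, ni in zip(p, n))
    return (hp + hn + cv) * (columns + ch)
```

For example, with p = [14, 6], n = [7, 4], cv = 2 and ch = 1, folding at hp = 7 and hn = 4 uses three columns and gives an area of 52, whereas the unfolded heights hp = 14 and hn = 7 use two columns and give an area of 69.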
The second algorithm given in [1] works in two phases. In the first phase, the algorithm constructs a subset $S_P$ of [PMIN, max(P)] and another subset $S_N$ of [NMIN, max(N)] with the property that the optimal $h_p$ is in $S_P$ and the optimal $h_n$ is in $S_N$. The basic observation used to arrive at $S_P$ and $S_N$ is that if the heights $h_i$ and $h_i + k$ divide a transistor into the same number of columns, then $h_i$ is preferred over $h_i + k$ (for example, if $p_1 = 14$, then folding heights 7, 8, 9, 10, 11, 12 and 13 all fold the transistor into two columns; 7 is preferred over the remaining choices). In the second phase, the optimal folding heights are selected from $S_P$ and $S_N$.

Our algorithm is also a two-phase algorithm. The first phase of our algorithm is identical to the first phase of Kim and Kang's algorithm [1]. We compute the subsets $S_P$ and $S_N$ using the code of Figure 3. The arrays SPL and SNL are initialized to zero in the first two for loops. Then we determine the members of $S_P$ and $S_N$; we set SPL[i] if and only if $i \in S_P$ and SNL[i] if and only if $i \in S_N$. Finally, $S_P$ and $S_N$ are computed in compact form from SPL and SNL, respectively. Note that we can easily compute $S_P$ and $S_N$ in either ascending or descending order by controlling the direction of traversal of the SPL and SNL arrays, respectively, in the last two for loops. The algorithm presented in Figure 3 computes $S_P$ in ascending order and $S_N$ in descending order.
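The flag-array construction described above can be sketched as follows. This is not the code of Figure 3; it is an illustrative version that assumes a candidate height is either the lower bound of the range or the smallest height that folds some transistor into a given number of columns, that is, a value of the form $\lceil s/k \rceil$. The name `candidate_heights` and its parameters are hypothetical, and the sketch assumes 1 <= h_min <= max(sizes).

```python
def candidate_heights(sizes, h_min, ascending=True):
    """Candidate folding heights in [h_min, max(sizes)].

    A height is flagged when it is the smallest height that folds some
    transistor into a given number of columns, i.e. ceil(s / k) for a
    transistor of size s and a column count k. The lower bound h_min is
    always flagged (an assumption of this sketch).
    """
    h_max = max(sizes)
    flags = [False] * (h_max + 1)   # plays the role of SPL / SNL
    flags[h_min] = True
    for s in sizes:
        for k in range(1, s + 1):
            h = -(-s // k)          # ceil(s / k)
            if h < h_min:
                break               # larger k only gives smaller heights
            flags[h] = True
    # Compact the flag array; the traversal direction fixes the order.
    heights = [h for h in range(h_min, h_max + 1) if flags[h]]
    return heights if ascending else heights[::-1]
```

For a single transistor of height 14 and a lower bound of 7, the sketch returns [7, 14]: heights 8 through 13 are dropped because each gives the same two columns as height 7.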

Phase II
Assume that the transistor pairs have been reordered so that $p_i/n_i \le p_{i+1}/n_{i+1}$; also assume that $p_0/n_0 = 0$ and $p_{m+1}/n_{m+1} = \infty$. The formula, Eq. (1) [...]

[Figure 4: algorithm whose step 4 computes the optimal $h_p$ and $h_n$.]

Since $S_P$ and $S_N$ can be computed in ascending and descending order, respectively, by Algorithm Phase I of Figure 3, no sorting is needed to evaluate the members of $S_P$ and $S_N$ in the specified order. We can sort the transistors into increasing (actually nondecreasing) order of $p_i/n_i$ in [...]

[Figure 5: Algorithm Refined Phase II (P, N, SP, SN, Ch, Cv), in which $S_P$ is in ascending order and $S_N$ is in descending order.]

Although the worst-case complexity of the algorithm of Figure 5 is the same as that of Figure 4, it is expected to run faster in practice.
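To make concrete what the second phase must compute, the sketch below evaluates the area for every pair of candidate heights and keeps the minimum. It is a plain reference search, not the algorithm of Figure 4 or Figure 5, and it does not use the ordering by $p_i/n_i$; the names `area` and `best_folding` are hypothetical, and the area is assumed to be the layout height times the column count plus horizontal overhead.

```python
from itertools import product


def area(p, n, hp, hn, cv, ch):
    """(hp + hn + cv) * (total columns + ch) for folding heights hp, hn."""
    columns = sum(max(-(-pi // hp), -(-ni // hn)) for pi, ni in zip(p, n))
    return (hp + hn + cv) * (columns + ch)


def best_folding(p, n, sp, sn, cv, ch):
    """Return (min_area, hp, hn) over all hp in sp and hn in sn.

    sp and sn are any iterables of candidate heights. The cost is one
    O(m) area evaluation per candidate pair, so this is only a baseline
    for checking faster methods on small inputs.
    """
    return min((area(p, n, hp, hn, cv, ch), hp, hn)
               for hp, hn in product(sp, sn))
```

On small inputs such a search can be used to check that restricting the heights to the candidate sets loses nothing: for p = [14, 9] and n = [7, 5] with lower bounds 3 and 2, searching the candidates {3, 4, 5, 7, 9, 14} and {2, 3, 4, 5, 7} gives the same minimum area as searching every height in [3, 14] and [2, 7].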

ALGORITHMIC VARIANTS
Although the use of the preprocessing phase, Phase I, dramatically reduces the run time when processing is completed using the technique of Kim and Kang [1], it is possible that the cost of this preprocessing, when used in conjunction with our refined Phase II algorithm of Figure 5, may exceed the benefits that accrue from the preprocessing. Therefore, we formulate another version of our algorithm which does not employ the Phase I preprocessing algorithm of Figure 3. The new version, Figure 6, is just the algorithm of Figure 5 with the while loops replaced by for loops. Figure 7 shows a variant of the algorithm of Figure 6 in which only one of the arrays $L_P$ and $L_N$ is precomputed; the values of the other array are computed as needed. This variant reduces the space requirements of the algorithm and, as we shall see, also reduces the time requirements.

EXPERIMENTAL RESULTS
The phase 1 algorithm of Figure 3, the phase 2 algorithm of Figure 5, the algorithmic variants of Figures 6 and 7, and the two algorithms of [1] were programmed. The same methodologies were used to develop the codes for our algorithms and those of [1]. As a result, we expect that almost all of the performance difference exhibited in our experiments is due to algorithmic rather than programming differences. Since we were unable to obtain the test data used in [1], we generated random data. We ignore any possible correlation between pMOS and nMOS transistors. For our test data, the number of transistor pairs ranged from 100 to 100,000. This covers the range in transistor numbers (192 to 88,258) in the circuits of [1]. For our first test set, the sizes of the pMOS and nMOS transistors were generated using a uniform random number generator with range [30, 90] for pMOS and [20, 60] for nMOS. These size ranges correspond to those for the circuit fract that was used in [1]; the circuit fract has 598 transistors. Since all of the tested algorithms generate optimal solutions, run time is the only comparative factor. This time is provided in Table I. Other than in the column labeled "Phase 1", all times are the times needed for the entire area minimization process (i.e., phase 1 plus phase 2).
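Test data of the kind described for the first test set can be generated along the following lines. This is a sketch only: the function name `random_pairs`, the fixed seed, and the use of Python's `random` module are assumptions, and the generator actually used in the experiments is not specified.

```python
import random


def random_pairs(m, p_range=(30, 90), n_range=(20, 60), seed=0):
    """Generate m transistor pairs with uniformly distributed integer sizes.

    pMOS and nMOS sizes are drawn independently, so any correlation
    between the two transistors of a pair is ignored. Both range
    endpoints are inclusive. The seed makes the data reproducible.
    """
    rng = random.Random(seed)
    p = [rng.randint(*p_range) for _ in range(m)]
    n = [rng.randint(*n_range) for _ in range(m)]
    return p, n
```

Passing p_range=(30, 180) and n_range=(20, 120) would give data in the ranges used for the second test set.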
The exhaustive search algorithm was not run for m > 10,000 as its run time becomes prohibitive.
In the case of the algorithm proposed in [1], the phase 2 time is significantly larger than the phase 1 time. Our algorithm for phase 2 (Fig. 5) has brought this time down to approximately the phase 1 time. Equally interestingly, on the data sets used by us, the preprocessing of phase 1 is no longer of any use. When this preprocessing is eliminated (as in Fig. 6), the run time reduces by an additional 30%. The variant of Figure 7 provides a modest run-time improvement (due mainly to not having to reference a two-dimensional array to get one of the L values), but provides a 50% reduction in space needs.
For small circuits ($m \le 10,000$), our algorithm of Figure 7 provides a speedup of 3.8 to 7.1 over the algorithm of [1]. On larger circuits, the speedup is more dramatic. For instance, when $m = 100,000$, our algorithm of Figure 7 is almost 27 times as fast as that of [1].
We experimented with two other data sets. Table II reports the run times for circuits in which the range of the uniform random number generator was set to [30,180] for pMOS transistor sizes and [20, 120] for nMOS sizes and Table III gives the run times when the transistor sizes are

CONCLUSION
We have developed a transistor folding algorithm that is both theoretically and practically faster than the algorithm proposed in [1]. Our algorithm is also simpler to code. Experiments suggest that our algorithm runs 3 to 60 times as fast as the algorithm of [1] on circuits with 100 to 100,000 transistor pairs. These circuit sizes are comparable to those used in [1].