Circuit Partitioning for FPGAs by the Optimal Circuit Reduction Method

The mathematically most difficult partitioning problem, packaging, is considered. Its purpose is to minimize the number of partitions while satisfying the constraints on the number of constituent elements and external nets. To solve the problem, the Optimal Circuit Reduction Method, suggested by R. Bazylevych, is used. In the first step, the optimal reduction tree, which reflects the hierarchical entrance of smaller clusters into bigger ones, is built. In the second step, we select one or more tree vertices that best meet the given constraints; the first partitions are generated from them. After creating every new partition we eliminate its elements from the circuit and repeat the procedure until all partitions are complete. In the last stage, optimization strategies that exchange some elements between the partitions are used. Better or equivalent results on known tests confirm the effectiveness of this method.


INTRODUCTION
Packaging of complex circuits into a minimal number of FPGAs is one of the most important tasks in the field of contemporary partitioning problems. Because FPGAs are expensive, there is a clear benefit in reducing the number of FPGAs needed to implement the functionality of a given circuit. The challenge involves multiway partitioning with constraints, the most difficult being those on the number of constituent elements and the number of external IO terminals. Packaging belongs to the intractable NP-complete combinatorial optimization problems.
FPGAs are widely used in rapid prototyping, and also in place of ASICs. These problems have been the focus of a number of authors [1][2][3]; however, at present, the solutions produced by current techniques are of inadequate quality. In this study we propose a new approach, which utilizes the Optimal (Parallel) Circuit Reduction (OCR) Method [4]. Fundamentally, the method solves the problem in three stages. In the first stage, we build an Optimal Reduction Tree (ORT), which represents the hierarchical entry of smaller circuit clusters into larger ones. In the second stage, the method splits the circuit into the individual FPGAs by using the ORT. This type of approach substantially reduces the computational complexity while assuring a high quality solution. In the third stage, new optimizing procedures are used to reduce the number of partitions. The OCR Method provides high quality solutions to a wide class of partitioning problems with various constraints.

PREVIOUS WORKS
We concentrate our discussion on studies dedicated to dividing a circuit into strongly connected groups of elements, since the Optimal Circuit Reduction Method solves this problem in its first stage. Most likely the first investigation that provides for the separation of such groups is [5].
The authors identify all minimal groups, which have the basic distinction that each subgroup of such a group has more external nets than the group itself. This approach is interesting; however, it has not been applied to circuits of high complexity. In [6] individual blocks are partitioned at the logic level. This approach assumes previous determination of all initial nets for each group, and every nonprimary block must have exactly one output net. This restriction substantially reduces the number of problems to which the method can be applied, since it requires manual entry of additional information. In [7] an algorithm was presented for cluster recognition, which efficiently identifies the "core" of functional modules. The algorithm has O(nk-t) complexity, where k is the number of edge-disjoint paths connecting two vertices such that each of these paths has length at most l. Such time complexity may be prohibitive for large circuits, and it is difficult to choose k and l for a given netlist. The authors in [8] presented a circuit clustering method based on a random walk in the netlist graph. They proposed a method for extracting clusters from the random walk via the concept of a cycle. A bottom-up algorithm based on recursive collapsing of small cliques in a graph was suggested in [9]. This approach requires considerable computational time. In [10] the algorithm constructs an ordering by adding vertices one at a time and then splits the ordering into clusters. The authors in [3] generated an initial partitioning by using the gain principle of the Fiduccia-Mattheyses heuristic to select the vertices to be moved. Starting from the initial partition, they used the same heuristic to improve every pair of devices. Very good results from recursively combining clustering with an iterative improvement method were presented in [11]. In [12] the authors proposed a new cluster metric called the cluster ratio.
A two-phase algorithm that attempts to combine the merits of the bottom-up and top-down approaches was proposed in [1]. The authors devised a local ratio-cut clustering scheme to reduce the circuit complexity, and then used set covering partitioning to reduce the number of FPGAs. Some additional materials can be found in [13-17].

PROBLEM FORMULATION
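The initial data to solve the problem are the set of elements P, the set of nets E, and the system of sets $P(e_i)$, one for each net $e_i \in E$ (3). Every set $P(e_i)$ contains those elements of P that are incident to $e_i \in E$.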
It is easy to obtain a system of sets of nets incident to every element from the previous system: $E(P) = \{E(p_1), \ldots, E(p_n)\}$; $(\forall E(p_i) \in E(P),\ i = 1, \ldots, n)\ [E(p_i) = \{e \in E \mid e \text{ is incident to } p_i\}]$. (4) Every set $E(p_i)$ contains those elements of E that are incident to $p_i \in P$. Additional characteristics usually include, for elements, the physical area, power dissipation, etc., and, for electrical nets, the signal propagation delay and others. The most significant contributor to the propagation delay is the break that occurs when a signal passes from one FPGA to another; the propagation delay is large in this case. It is very important to take into account the constraints that absolutely must be satisfied; otherwise the desired circuit will either not be feasible or will not function according to the desired specifications.
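As an illustration of how the system of sets E(p_i) can be obtained from the system of sets P(e_i), the following minimal Python sketch inverts a net-to-elements mapping. The function name `nets_per_element` and the toy netlist are hypothetical and are used only for this example.

```python
from collections import defaultdict


def nets_per_element(elements_per_net):
    """Invert {net: set of incident elements} into {element: set of incident nets}."""
    nets_of = defaultdict(set)
    for net, elements in elements_per_net.items():
        for element in elements:
            nets_of[element].add(net)
    return dict(nets_of)


# Hypothetical toy netlist: three nets over four elements.
P_of_E = {
    "e1": {"p1", "p2"},
    "e2": {"p2", "p3", "p4"},
    "e3": {"p1", "p4"},
}
E_of_P = nets_per_element(P_of_E)
for element in sorted(E_of_P):
    print(element, sorted(E_of_P[element]))
```

Each net is visited once per incident element, so the inversion is linear in the total number of pins.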
We formulate the problem statement assuming the following: elements may be replicated, which may improve some partitioning characteristics.
To solve the problem we utilize the OCR Method [4]. The main idea of the approach is to build an Optimal Reduction Tree by a bottom-up strategy and then to partition the circuit by applying top-down strategies. A combinatorial analysis of hierarchically built clusters is done instead of an analysis of the original elements. This drastically reduces the computational complexity and improves the quality of solutions. Based on this method, the problem is solved in three stages: 1. Optimal Reduction Tree Generation, 2. Initial Packaging, 3. Packaging Optimization.

The Optimal Reduction Tree
The Optimal Reduction Tree $T^R$ is a binary rooted tree whose leaves (level 1) correspond to the elements of set P and whose root (level H) corresponds to the whole grouped circuit (Fig. 1). The reasoning behind this approach is to create a tree in which intermediate vertices correspond to circuit clusters (element groups) with more internal nets than others. In addition, it is desirable to build a tree in which the individual clusters are treated independently and in parallel, in contrast to the well-known greedy approaches. In the latter, clusters are treated serially; the first clusters are therefore created under more favorable conditions, and the clusters created last are much worse than those created earlier. Our approach removes this disadvantage by developing conditions for the parallel formation of groups, which favors the creation of natural circuit clusters. To remove the greedy approach completely is almost impossible; therefore we attempt to weaken it. We develop a bottom-up parallel-serial approach in which the smallest clusters are created in the initial steps and grow in the following steps at the cost of other elements or clusters. In the later steps it is possible either to create new clusters from the initial elements or to enlarge those formed earlier.
A physical analogy is the crystallization of a liquid, where there are only certain centers of crystallization, which later grow at the cost of free liquid or other crystals. Our approach tends to identify natural clusters in the circuit. This means that the tree $T^R$ could be considered as the Hierarchical Cluster Tree. It illustrates a natural hierarchy for the entrance of smaller clusters into greater ones. During the tree generation the most important aspect is the grouping of elements into clusters, i.e., finding the criteria for merging elements and clusters. A natural functional tree of the electrical circuit is good for this purpose. Obviously it is helpful to take advantage of this information; however, in this case we build the cluster tree based on certain formal criteria.
FIGURE 1 The Optimal Reduction Tree.
We have already mentioned the similarities between the cluster tree generation and the freezing process (transformation from the liquid to the solid state). It should be noted that there are significant differences between our process and simulated annealing, although the latter is widely used for combinatorial problems. First, the starting temperature is very important for simulated annealing, since the result strongly depends on it. Our process has no such dependence. The temperature of the freezing process is constant. The process begins when all elements are free and not grouped with others (liquid state). This state corresponds to an unlimitedly high temperature in simulated annealing. Secondly, in simulated annealing the cooling schedule is also important.
Here the group formation process has some elements of analogy with the cooling schedule. However, the principal distinction between the two techniques is that the proposed one could be considered a constructive method, while the simulated annealing process is iterative. In our case, the final state is constructed from the smallest original elements by the parallel-serial grouping of clusters. At every level, pairs of clusters merge only when they best match the grouping criteria. All other clusters and free elements rise from one level to the next without modification. The passage from one level to the next means a discharge of energy and the creation of some new crystals (clusters) and/or the growth of old ones that appeared at the previous levels. The process terminates when all elements are grouped (solid state).

Cluster Definition
Every cluster C at any level could be defined as a pair of a set of elements and a set of nets: $C = (P(C), E(C))$. The set of cluster nets consists of a set of internal and a set of external nets (Fig. 2): $E(C) = E^{in}(C) \cup E^{ex}(C)$. Every internal net has at least two nodes in the cluster. Every external net has at least one node outside the cluster. Some nets could be simultaneously internal and external. We refer to these nets as combinational: $E^{cmb}(C) = E^{in}(C) \cap E^{ex}(C)$. We also distinguish a set of pure internal nets $E^{in*}$, all nodes of which are inside the cluster, and a set of pure external nets $E^{ex*}$, with only one node inside the cluster.
When two clusters $C_s$ and $C_t$ are merged, we have for the cluster $C_s$ three mutually independent sets $E^{in*}(C_s)$, $E^{ex*}(C_s)$, $E^{cmb}(C_s)$, and for the cluster $C_t$ also three mutually independent sets $E^{in*}(C_t)$, $E^{ex*}(C_t)$, $E^{cmb}(C_t)$. Now we describe an algorithm to form all sets of the new cluster at minimal cost.
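The three net classes defined above can be illustrated with a minimal Python sketch. It assumes that every net is given as a set of its incident elements; the function name `classify_cluster_nets` and the toy data are hypothetical, and the sketch does not reproduce the incremental merging algorithm of the method.

```python
def classify_cluster_nets(cluster, elements_per_net):
    """Split the nets touching `cluster` into pure internal, pure external
    and combinational nets, following the definitions in the text."""
    pure_internal, pure_external, combinational = set(), set(), set()
    for net, elements in elements_per_net.items():
        inside = len(elements & cluster)
        if inside == 0:
            continue  # the net does not touch the cluster at all
        if inside == len(elements):
            pure_internal.add(net)   # all nodes are inside the cluster
        elif inside == 1:
            pure_external.add(net)   # exactly one node is inside
        else:
            combinational.add(net)   # at least two nodes inside, at least one outside
    return pure_internal, pure_external, combinational


# Hypothetical toy netlist over elements 1..5.
toy_nets = {
    "a": {1, 2},
    "b": {2, 3, 4},
    "c": {1, 5},
    "d": {4, 5},
    "e": {1, 2, 3},
}
in_star, ex_star, cmb = classify_cluster_nets({1, 2, 3}, toy_nets)
# External nets of the cluster are the pure external and the combinational ones.
external_count = len(ex_star) + len(cmb)
```

The external net count derived this way is the quantity that is checked against the IO constraint of an FPGA.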
We need to obtain three mutually independent sets $E^{in*}(C_{st})$, $E^{ex*}(C_{st})$, and $E^{cmb}(C_{st})$ for the new cluster $C_{st}$. First of all we have to determine some auxiliary sets of nets (Fig. 4).
There are several possibilities for choosing the pairs of clusters to merge at each level. In the best case we merge only the maximum number of independent pairs with the best value of the chosen criterion. This could cause a large tree height and consequently take a lot of CPU time. One possible way to reduce the running time is to take all independent pairs whose criterion value is within a given decrease from the best one. The second way is to merge the first fraction A of all possible independent pairs, where A ($0 < A \le 1$) is a reduction parameter. In the limiting case (A = 1) we take the maximal number of all possible independent pairs from the list L(r). Here the height of the ORT is at a minimum and therefore it takes the minimal CPU time, but the results could be worse. This case corresponds to forced circuit reduction, which might not generate good natural clusters.
Form the new (i + 1)-th level of the tree $T^R$ by including the set of new clusters defined by merging at the previous step and the set of remaining clusters from level i. Form the sets and all cluster parameters of the (i + 1)-th level.
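The level-by-level construction with the reduction parameter A can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the merging criterion used here (the number of nets common to two clusters) is only a stand-in, since the criterion of the method is not reproduced in this text, and the names `common_net_count`, `next_level` and `build_levels` are hypothetical.

```python
import math
from itertools import combinations


def common_net_count(a, b, nets):
    """Assumed stand-in criterion: number of nets touching both clusters."""
    return sum(1 for pins in nets if pins & a and pins & b)


def next_level(clusters, nets, A=0.2):
    """Form level i+1 from level i by merging the first fraction A of the
    independent pairs, ordered by decreasing criterion value."""
    scored = []
    for a, b in combinations(clusters, 2):
        score = common_net_count(a, b, nets)
        if score > 0:  # only connected pairs are candidates
            scored.append((score, a, b))
    scored.sort(key=lambda item: item[0], reverse=True)

    used, independent = set(), []
    for _, a, b in scored:
        if a not in used and b not in used:  # pairs must not share a cluster
            independent.append((a, b))
            used.update((a, b))

    keep = max(1, math.ceil(A * len(independent))) if independent else 0
    merged_pairs = independent[:keep]
    merged_members = {c for pair in merged_pairs for c in pair}
    new_level = [a | b for a, b in merged_pairs]
    # All other clusters rise to the next level without modification.
    new_level += [c for c in clusters if c not in merged_members]
    return new_level


def build_levels(elements, nets, A=0.2):
    """Return the list of tree levels, from single elements up to the root."""
    levels = [[frozenset([p]) for p in elements]]
    while len(levels[-1]) > 1:
        upper = next_level(levels[-1], nets, A)
        if len(upper) == len(levels[-1]):
            break  # no connected pairs left (disconnected circuit)
        levels.append(upper)
    return levels


# Hypothetical toy circuit: four elements, four nets given as pin sets.
toy_nets = [{1, 2}, {1, 2}, {3, 4}, {2, 3}]
levels_forced = build_levels([1, 2, 3, 4], toy_nets, A=1.0)
levels_slow = build_levels([1, 2, 3, 4], toy_nets, A=0.2)
```

With A = 1 all independent pairs are merged at every level, which gives the smallest tree height; a smaller A yields a taller tree in which fewer, better pairs are merged per level.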

PACKAGING ALGORITHMS
There are a number of strategies to solve the problem. The choice of technique depends on the required solution quality and on the available CPU time and memory. In the first technique, every partition is apportioned by generating a separate ORT. The tree generation continues until the first vertex that violates the constraint on the maximum number of elements is reached. A violation of the constraint on the maximum number of external nets could be reached earlier; however, in some cases it is possible to reduce this number by increasing the total number of elements, which could improve the solution. By the following steps we obtain from this vertex the largest subcircuit, i.e., the one with the maximum number of elements that does not violate the constraints.
These steps are carried out by a top-down strategy. Here it is possible to apply a multivariable combinatorial analysis to a reduced number of objects, considering only the clusters of several high levels instead of all initial elements. This drastically reduces the computational complexity. The quality of the solution depends on the breadth and depth of the analysis. For example, the initial procedure could involve examining the two previous vertices in the tree and identifying the largest cluster, which is taken as the basis of the partition. Next, by examining the remaining vertices of the second (smaller) cluster, we try to add the maximum number of them to the first one without violating the constraints.
After the first partition is formed, it is cut from the overall circuit and for the next partition the generation of the tree begins anew. Since the size of the circuit is smaller, the running time for the step is reduced. The process repeats until all partitions are generated.
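The top-down search for the largest subcircuit that satisfies both constraints can be sketched as follows. This is a simplified illustration, not the full procedure: it only descends the tree and returns the largest feasible vertex, and it omits the subsequent step of adding vertices of the smaller sibling cluster. The class `Vertex`, the helper names and the toy tree are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A vertex of the reduction tree: a cluster and its child vertices."""
    elements: frozenset
    children: list = field(default_factory=list)


def external_nets(cluster, nets):
    """Count nets with at least one node inside and one node outside the cluster."""
    return sum(1 for pins in nets if pins & cluster and pins - cluster)


def feasible(cluster, nets, max_elements, max_external):
    return (len(cluster) <= max_elements
            and external_nets(cluster, nets) <= max_external)


def largest_feasible(vertex, nets, max_elements, max_external):
    """Descend from `vertex` and return the largest cluster meeting both constraints."""
    if feasible(vertex.elements, nets, max_elements, max_external):
        return vertex.elements
    best = frozenset()
    for child in vertex.children:
        candidate = largest_feasible(child, nets, max_elements, max_external)
        if len(candidate) > len(best):
            best = candidate
    return best


# Hypothetical toy tree over elements 1..4.
leaf = {i: Vertex(frozenset([i])) for i in (1, 2, 3, 4)}
v12 = Vertex(frozenset({1, 2}), [leaf[1], leaf[2]])
v123 = Vertex(frozenset({1, 2, 3}), [v12, leaf[3]])
root = Vertex(frozenset({1, 2, 3, 4}), [v123, leaf[4]])
toy_nets = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]

partition = largest_feasible(root, toy_nets, max_elements=2, max_external=2)
```

In the first strategy the selected partition would then be cut from the circuit and a new tree built for the remaining elements.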
Other strategies involve forming two or more partitions from one ORT, which reduces the processing time. The tree grows until two or more vertices with violated constraints are reached. Then the algorithm above generates a partition from each of these subcircuits. It is desirable for the subcircuits to be minimally interconnected, i.e., to have a minimal number of common nets.
For this reason the ORT generation is carried out to completion. Then we compare all pairs of the first vertices that violate the constraints with respect to the number of common nets and choose the pair with the smallest number.
To reduce the running time it is possible to use an algorithm in which the maximum number of partitions is formed from one ORT. In this case the tree is built until all vertices have violated constraints. Once a vertex has reached this point, its further growth is terminated. The rest of the vertices continue to grow until a constraint violation appears. A partition is formed from each such vertex. Then these partitions are cut away from the overall circuit. To generate the next partitions from the remaining circuit, a new ORT is built, and the process is repeated until the entire circuit is divided into individual partitions.
The minimal running time is achieved by an algorithm similar to the previous one, except that the ORT is generated only once. From this tree the partitions are formed by combinatorial analysis of the generated clusters (vertices of the tree). There are a number of possible strategies. It is expedient to form the first and subsequent partitions from the largest clusters, adding smaller ones to them as far as possible.
We implemented the first strategy with some additional features in our experiments. The ORT is constructed completely. The tree is searched for the cluster that exceeds the constraints with the minimal deviation from them. The next task is to extract from the selected cluster the largest subcircuit that satisfies both constraints. For this purpose we estimate all vertices by the ratio of the number of external nets to the number of elements and consider the vertices with the best criterion value. If the constraints are not met after that, the procedure is repeated. Otherwise we try to attach the cluster with the largest number of elements and the best net criterion to the removed elements.
The purpose of optimization is to reduce the number of partitions. Some small partitions are united, provided that the resulting number of elements does not exceed the constraint. If the external net constraint is exceeded for the united partitions, we apply an optimization technique that exchanges some elements between partitions. If all constraint violations are eliminated, the process terminates and the final results are obtained. Otherwise, subcircuits with exceeded constraints are split into the smallest number of subcircuits without violations by the algorithm described above.

PARALLELIZATION APPROACH
The described algorithms are parallel-serial by nature, so some of their procedures can be parallelized. Here we consider the main stages. The most computationally expensive step, and the one that must be efficiently parallelized, is the ORT generation. At every level, the set of all pairs of connected clusters for which the criterion must be calculated is divided into a number of subsets equal to the number of processors, and the task is solved in parallel. The calculation of the criterion takes a lot of time, and its parallelization could significantly speed up the process.
The criterion calculation for each pair could additionally be divided into four tasks according to Figure 4 and solved on separate processors. For N accessible processors, the set of all pairs of connected clusters is divided into N/4 parts, and four processors are assigned to every pair of clusters. These parallelization approaches are used in all cases where it is necessary to build the ORT. The optimization of the initial solution could be carried out in parallel for separate pairs of partitions.
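The division of the cluster pairs among processors can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch under stated assumptions: a thread pool stands in for the separate processors (a process pool or another mechanism would be needed for a real speedup of CPU-bound work in Python), the criterion is again the assumed count of common nets, and the names `chunk`, `score_pairs` and `parallel_scores` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, parts):
    """Split `items` into at most `parts` contiguous subsets of near-equal size."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_pairs(pairs, nets):
    """Assumed stand-in criterion: number of nets common to both clusters."""
    return [(a, b, sum(1 for pins in nets if pins & a and pins & b))
            for a, b in pairs]


def parallel_scores(pairs, nets, workers=4):
    """Evaluate the criterion for all pairs, one subset of pairs per worker."""
    if not pairs:
        return []
    subsets = chunk(pairs, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(score_pairs, subsets, [nets] * len(subsets)))
    return [item for part in parts for item in part]


# Hypothetical toy data: clusters as frozensets, nets as pin sets.
toy_nets = [{1, 2}, {1, 2}, {3, 4}, {2, 3}]
c = [frozenset([i]) for i in (1, 2, 3, 4)]
toy_pairs = [(c[0], c[1]), (c[1], c[2]), (c[2], c[3]), (c[0], c[3]), (c[0], c[2])]
scores = parallel_scores(toy_pairs, toy_nets, workers=2)
```

Because `map` preserves the order of the subsets, the combined result is identical to the serial evaluation of the same pairs.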

ALGORITHM COMPLEXITY
We express the total algorithm complexity of obtaining the partitioning system P as $Q(P) = \sum_{j=1}^{k} Q(P_j)$, where $Q(P_j)$ is the complexity for the j-th partition, which comprises the complexity of the ORT generation and the complexity of apportioning the partition from one tree vertex: $Q(P_j) = Q(T_j^R) + Q_A(P_j)$.
The apportioning complexity $Q_A(P_j)$ depends only weakly on the size of the full circuit and mainly on the size of the FPGA, so we can consider it a constant. An approximation of the complexity of the ORT generation is the average complexity of the data determination for one level multiplied by the height of the tree. The complexity for one arbitrary level of the ORT is expressed in terms of $n_{ji}$, the number of clusters at level i, the average number of adjacent clusters for one cluster, and the average complexity of determining the criterion.
We remark that the number of clusters decreases from one level to the next higher one according to the reduction parameter A mentioned above, while the average number of adjacent clusters increases at every next level. Nevertheless, the full complexity decreases from one level to another, since the total number of nets we have to consider at every next level is less than at the previous one, because some nets become pure internal and are no longer taken into account: $Q_{i+1}(T_j^R) < Q_i(T_j^R)$. If the number of elements at the first level is equal to n ($n_1 = n$), then the number of elements at the second level is $n_2 = (1 - A/2)n$, and consequently the number of elements at the last level is $n_H = (1 - A/2)^{H-1} n$. Since at the last level this number is equal to 1, we have $1 = (1 - A/2)^{H-1} n$.
So $H - 1 = -\log n / \log(1 - A/2)$, and for large height $H \approx |\log n / \log(1 - A/2)|$. The full complexity of the ORT generation could be expressed by the formula $Q(T_j^R) = \sum_{i=1}^{H-1} Q_i(T_j^R)$. It is obvious that $Q(T_j^R) < Q_1(T_j^R) H$, i.e., $Q(T_j^R) < Q_1(T_j^R)\,|\log n / \log(1 - A/2)|$. So $Q(T_j^R) = O(n \log n)$, and therefore we have the same estimate for $Q(P_j)$. Our experiments show that the complexity of ORT generation is close to linear in the number of elements. For apportioning every next partition, the size of the circuit we have to consider is less than for the previous partition, so $Q(P_{j+1}) < Q(P_j)$. For this reason we can conclude that $Q(P) < k Q(P_1)$. This holds when we generate a new ORT for every next partition.
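The height estimate can be checked numerically with a short Python sketch. It assumes the idealized model used in the derivation, in which every level keeps exactly the fraction (1 - A/2) of the clusters of the previous level; the function names are hypothetical.

```python
import math


def estimated_height(n, A):
    """Height from the closed form: H - 1 = -log n / log(1 - A/2)."""
    return 1 + math.log(n) / -math.log(1 - A / 2)


def simulated_height(n, A):
    """Count levels while applying n_{i+1} = (1 - A/2) * n_i until one cluster remains."""
    level, count = 1, float(n)
    while count > 1:
        count *= 1 - A / 2
        level += 1
    return level


# A = 0.2 is the value used in the experiments; n = 1000 is an arbitrary example size.
print(estimated_height(1000, 0.2), simulated_height(1000, 0.2))
print(estimated_height(1000, 1.0), simulated_height(1000, 1.0))
```

The simulated height is the closed-form estimate rounded up to a whole level, which illustrates how strongly a small A increases the tree height compared with A = 1.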
As we have mentioned above, it is possible to apportion more than one partition from one ORT, which reduces the overall complexity.

EXPERIMENTAL RESULTS
The circuits from [1,2] were taken for experiments. In our experiments we chose the criterion f]7--trtcom ex and at every level of the ORT -mc generation we merged only 20% (A = 0.2) of the best independent pairs of the list L(r). The test results (#FPGAs) from [1,2] and for our OCR methods (initial and optimized) are shown in Tables I and II. We used constraints of 64 CLBs and 58 IOs (FPGA Xilinx XC2064) for the tests in Table I and 320 CLBs and 144 IOs (FPGA
TABLE I Packaging results for FPGAs with 64 CLB and 58 IO (Xilinx XC2064)