Sorting on Reconfigurable Meshes: An Irregular Decomposition Approach

Most algorithms for reconfigurable meshes (R-meshes) are based on the divide-and-conquer (DAC) strategy. Although the strategy per se does not require the subproblems to be equal in size, existing DAC algorithms for R-meshes do divide the problem approximately evenly. This paper demonstrates that dividing a problem evenly is not necessarily a good way to decompose it; on some occasions an irregular decomposition scheme is preferable. We take this approach and obtain a new sorting algorithm. Our sorting algorithm has several strengths: it is simple, scalable, and as broadcast-efficient as the best known result.


INTRODUCTION
A reconfigurable mesh (R-mesh) is a two- or three-dimensional array of processing elements (PEs) connected by reconfigurable buses. Though the PEs are interconnected as a regular mesh, the internal connection between the I/O ports of each PE can be individually reconfigured during the execution of algorithms [16]. Computations on reconfigurable meshes have recently received a great deal of attention from researchers. The challenge here is to solve a problem on an R-mesh in a constant number of steps using as few PEs as possible. Efficient algorithms have been developed for sorting [8,9,14,18], image processing [5,11,12], arithmetic problems [6], computer vision [21] and geometric problems [7,13,19]. Almost all of these algorithms are based on the well-known divide-and-conquer (DAC) strategy: decompose a given problem into several subproblems, solve them individually, and then combine the partial results into a solution to the original problem.
Although the divide-and-conquer method per se does not require subproblems to be equal or approximately equal in size, in practice almost all existing DAC algorithms follow this rule and divide a problem approximately evenly. A divide-and-conquer algorithm is said to follow a regular decomposition scheme if it always divides a problem (approximately) evenly in size. Otherwise, it is based on an irregular decomposition scheme.
With regular decompositions, it is relatively easy to reduce problem size from n to o(n) in O(1) recursion steps, which is crucial in obtaining O(1) time algorithms. By comparison, there seems to be no easy control on subproblem size in irregular decompositions. Thus, intuitively, irregular decompositions seem unnatural and difficult to implement on R-meshes. In this paper, we demonstrate that contrary to intuition, irregular decomposition can be a useful design paradigm.
We apply the irregular decomposition scheme to sort a set of n elements on an n x n R-mesh. The resulting algorithm is conceptually simpler and computationally more efficient than previous sorting algorithms (e.g. [8,9,14,18]). The irregular decomposition scheme can be tuned to use the same number of broadcasts as the best known result [22]. Our algorithm also has the desirable property of being scalable. The purpose of this paper is thus twofold: (1) to demonstrate the usefulness of irregular decomposition, and (2) to report efficient sorting algorithms for R-meshes.
The next section reviews various models for reconfigurable meshes. The scheme of irregular decomposition is discussed in Section 3. It is then applied to obtain a new sorting algorithm. We also compare the new sorting algorithm with previous ones and present its scalability. All of the above can be found in Section 4.

The Model
A two-dimensional reconfigurable mesh (R-mesh for short) is a two-dimensional array of processors connected by a reconfigurable bus system. Each processor has four I/O ports, denoted N (north), S (south), E (east), and W (west), respectively. An R-mesh with n rows and m columns of processors is denoted an n x m R-mesh. (A 6 x 6 R-mesh is shown in the left portion of Fig. 1.) The rows of an R-mesh are numbered 0 to n-1, with 0 referring to the northernmost row. Similarly, columns are numbered 0 to m-1, from west to east. PE(i, j) denotes the processor located at the intersection of row i and column j. To simplify notation, we use PE(i, *) (resp. PE(*, j)) to denote all processors in row i (resp. column j), and PE(*, *) to denote all processors in the R-mesh. Each processor has a constant number of registers of O(log n) bits. Data are represented in binary form and stored in these registers. Buses are capable of carrying O(log n) bits of data.
The internal connection between the four ports of each processor can be reconfigured during the execution of an algorithm. There are 15 possible patterns of internal connection at each node, one for each partition of the four ports (we use the notation {xy} to mean that port x and port y are connected to each other). There are various models for reconfigurable meshes [1,15,16,27,28,29]. They differ principally in the patterns of connection allowed at each processor: the processor array with a reconfigurable bus system (PARBUS) [27,28] allows arbitrary connections between the ports; the reconfigurable mesh with buses (RMESH) [15,16], or the content addressable array processor (CAAP) [29], allows all connection patterns except the two-pair connections; the mesh restriction in reconfigurable network (MRN) [1] allows all but the three-port and four-port connections.
We will use R-mesh as a generic term to refer to any of the above models. When we describe an R-mesh algorithm, we will indicate which model of R-mesh we have in mind.
There are two models for bus arbitration in R-meshes [16]: exclusive write and common write. The exclusive write model does not allow bus conflict, i.e., at most one processor can send data to a bus at any time. In this respect, it resembles the exclusive write PRAM (EW PRAM). The common write model allows multiple processors to broadcast simultaneously to the same bus as long as the same data value is broadcast. In this paper, we assume the exclusive write model.
In order to measure the time complexity of an R-mesh algorithm, two models have been proposed in [16]: the unit-time delay model and the log-time delay model. In the unit-time delay model, the computation operates synchronously, and each of the following operations is counted as one step: (i) evaluating an arithmetic or logic operation; (ii) reconfiguring the internal connections between the ports of a processor; (iii) broadcasting on a configured bus.
In the log-time delay model, broadcasting on a configured bus is assumed to take log s time, where s is the diameter of the bus on which the data are broadcast [16]. Any algorithm taking T(n) steps on an n x n R-mesh in the unit-time delay model will run in O(T(n) log n) steps in the log-time delay model. The unit-time delay model is more popular than the log-time delay model. In fact, almost all papers in the literature on R-mesh algorithms have assumed the unit-time delay model, and we adopt it in this paper as well.

To satisfy these conditions, a problem must be carefully decomposed. A decomposition of a problem into subproblems A0, A1, ..., Am-1 is said to be regular if the subproblems are approximately equal in size (i.e., |A0| ≈ |A1| ≈ ... ≈ |Am-1|). Otherwise, it is an irregular decomposition. A DAC algorithm is said to adopt a regular decomposition scheme if it requires every decomposition to be regular. It adopts an irregular decomposition scheme if regular decomposition is not a requirement. Regardless of its regular or irregular status, a decomposition is said to be natural if a solution to the original problem can be easily obtained from the solutions to the subproblems (i.e., the "combine" step of the algorithm is trivial). For instance, the well-known (sequential) quicksort adopts an irregular decomposition scheme, while mergesort adopts a regular one. Quicksort always produces natural decompositions (since its "combine" step takes only O(1) sequential time); mergesort does not (it takes O(n) time to merge two sorted subarrays).
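The quicksort/mergesort contrast can be made concrete with a short sequential sketch (illustrative Python, not R-mesh code): quicksort's pivot step yields subproblems of unpredictable sizes but a trivial combine (concatenation), while mergesort's halves are equal in size but require an O(n) merge.

```python
def quicksort(a):
    """Irregular decomposition: subproblem sizes depend on the pivot,
    but the combine step is mere concatenation (a 'natural' decomposition)."""
    if len(a) <= 1:
        return a
    pivot = a[0]
    left = [x for x in a[1:] if x < pivot]       # size unpredictable
    right = [x for x in a[1:] if x >= pivot]     # size unpredictable
    return quicksort(left) + [pivot] + quicksort(right)   # O(1)-style combine

def mergesort(a):
    """Regular decomposition: halves are (approximately) equal in size,
    but combining them requires an O(n) merge."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = mergesort(a[:mid]), mergesort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):      # the non-trivial combine
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```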

Regular Decomposition
With a regular decomposition scheme on an R-mesh, a problem of size n is typically divided into a number m of subproblems, each of size approximately n/m; correspondingly, the mesh is divided into submeshes, each assigned one such task. Most existing algorithms for R-meshes use m ≥ 2 equal-sized submeshes in the "combine" step to achieve C(n) = O(1).

Irregular Decomposition
In an irregular decomposition scheme, the subproblems Ai are not necessarily of the same size, nor are the submeshes Mi. When both regular and irregular decomposition schemes are capable of producing natural decompositions, but the irregular scheme is computationally easier because it places fewer restrictions on subproblem sizes, irregular decompositions may be preferable. In this case, a problem of size n is divided into m subproblems of sizes n0, ..., nm-1, with max{n0, n1, ..., nm-1} ≤ n^ε for some ε, 0 < ε < 1. If C(n) = D(n) = O(1) and T(n^ε') = O(1) for some ε', 0 < ε' < 1, then T(n) = O(1).
We shall demonstrate this case with sorting algorithms in the next section.

SORTING BY IRREGULAR DECOMPOSITION
Several optimal sorting algorithms for R-meshes are available in the literature, all based on the divide-and-conquer strategy with regular decompositions [9,14,18]. In this section, we apply the irregular decomposition scheme and obtain a new sorting algorithm that is simple and requires fewer bus broadcasts than previous algorithms. Our algorithm is an improvement on the one by Lin et al. [14], so we start with a brief review of the latter algorithm.

A Regular Decomposition Approach
Let S be a set of n integers to be sorted. For ease of presentation, we assume the n elements are distinct, so as to avoid tedious bookkeeping details. Also, we assume that the number +∞ is added to S, so that S now contains n + 1 numbers. The sorting algorithm presented in [14] divides the given set S into m subsets Sk, 0 ≤ k ≤ m-1, of the same size, such that all elements in Si are less than all elements in Si+1. This is a "natural" decomposition, because once each Si is sorted, the whole set S is sorted. To sort the Si's, the given n x n mesh is divided into m submeshes, and each Si is sorted on a dedicated submesh. These ideas are stated more precisely in the following.
1. Select m + 1 splitters s0 < s1 < ... < sm from S such that the resulting subsets are of equal size. 2. Partition S into m subsets Sk, 0 ≤ k ≤ m-1, such that Sk = {s ∈ S | sk ≤ s < sk+1}. 3. Directly sort each Sk on an n x (n/m) submesh Mk. 4. Concatenate the sorted subsets into a single sorted list.
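Sequentially, the four steps above can be sketched as follows (illustrative Python; the multiselection of Step 1 is simulated here by a full sort, which stands in for exactly the expensive machinery the R-mesh algorithm of [14] must provide):

```python
import bisect

def regular_decomposition_sort(S, m):
    """Sketch of the regular decomposition of Lin et al.: choose splitters
    so that the m buckets are equal in size (Step 1, multiselection --
    simulated here by sorting), partition S by the splitters (Step 2),
    sort each bucket (Step 3), and concatenate (Step 4).  Elements are
    assumed distinct, as in the text."""
    n = len(S)
    order = sorted(S)                               # stand-in for multiselection
    splitters = [order[(k * n) // m] for k in range(1, m)]
    buckets = [[] for _ in range(m)]
    for x in S:
        # bucket k receives the elements in [s_k, s_{k+1})
        buckets[bisect.bisect_right(splitters, x)].append(x)
    return [x for b in buckets for x in sorted(b)]
```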
The algorithm employs a sophisticated multiselection algorithm to carry out Step 1. The multiselection algorithm, while clever and interesting in its own right, is undesirably complex and involved for the sorting problem, especially on an R-mesh.

An Irregular Decomposition Scheme
The above sorting algorithm requires the subproblems to be of equal size and, therefore, needs to solve a relatively difficult multi-selection problem. We observe that dividing the problem evenly among the subproblems is not crucial for the above algorithm to work correctly. If we relax this requirement and allow the subproblems to differ in size, then we do not need to solve the hard multi-selection problem and thus obtain a simpler and more efficient algorithm. Of course, we still need some restriction on subproblem sizes, so as not to partition a problem into one very large subproblem and several very small ones.
Our algorithm is recursive, so we describe it as one that sorts n' elements on an n x n' R-mesh. (Initially we are given n elements to sort on an n x n R-mesh.) The algorithm divides a problem of size n' into m subproblems of sizes n0, ..., nm-1, respectively, such that max{n0, n1, ..., nm-1} ≤ f(n'). Let T(n') be the time complexity of our algorithm in sorting n' elements on an n x n' R-mesh. The algorithm is specified in the following. New Sorting Algorithm: Sort(S) Input: a set S of n' elements stored, one element per processor, in the northernmost row of an n x n' R-mesh.
Output: the set S, sorted in ascending order, again stored one element per processor in the northernmost row of the same R-mesh.
We define some variables for ease of reference: q (the number of samples selected from each block; its value is fixed in the analysis), p = ⌈n'/f(n')⌉, and m = p · q.
1. If n' ≤ √n, then directly sort the elements by the method of [28]. (This takes O(1) time.) 2. Select a set S' of m + 1 elements s0, s1, ..., sm from S such that the number of elements x in S satisfying sk ≤ x < sk+1 is at most f(n'). To carry out Step 2 of the algorithm, we perform the following procedure.
Select(S)
1. Divide the mesh evenly into p submeshes Ml, 0 ≤ l ≤ p-1, where the size of each Ml is n x ⌈f(n')⌉ except for the last one, Mp-1, which may be narrower than the others. This correspondingly divides S into p subsets Wl, 0 ≤ l ≤ p-1, with Wl consisting of the elements of S that are stored in Ml. Note that the elements in Wl are not necessarily smaller than those in Wl+1.
2. Recursively sort each Wl. (This recursive step takes T(f(n')) time.)
3. For each sorted Wl, select wl,0, wl,1, ..., wl,q-1 such that these elements evenly partition Wl into subsets of size f(n')/q. These selected elements are marked and stored in the northernmost row of Ml. (This step takes O(1) time.)
4. Let S' = {sk : 0 ≤ k ≤ m-1} be the set of all elements selected in the preceding step. Include sm = +∞ in S' for convenience of presentation.
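A sequential sketch of Select may help fix the idea (illustrative Python; on the R-mesh the block sorts are parallel recursive calls, and the parameter names f, p, q follow the text):

```python
import math

def select_splitters(S, f, q):
    """Sketch of Select(S): cut S into p = ceil(len(S)/f) blocks of at most
    f consecutive (stored) elements, sort each block, and take q evenly
    spaced samples per block.  The union of samples is the splitter set S';
    a sentinel +infinity is appended, as in the text."""
    n = len(S)
    p = math.ceil(n / f)
    samples = []
    for l in range(p):
        block = sorted(S[l * f:(l + 1) * f])     # recursive sort of W_l
        step = max(len(block) // q, 1)
        samples.extend(block[::step][:q])        # q evenly spaced samples
    samples.sort()
    samples.append(float('inf'))                 # sentinel s_m
    return samples
```

A quick check of the intended guarantee: between two consecutive splitters there can be at most f elements of S, which is what makes the resulting subproblems small without requiring them to be equal.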
There exists an l such that wl,0 < ... < wl,l' < sk < sk+1 ≤ wl,l'+1 < ... < wl,q-1; in all cases, at most one of the subsets defined by the samples of a block falls strictly between two consecutive splitters. Once S' is computed, S is partitioned by the elements of S' into m subsets Sk, 0 ≤ k ≤ m-1, such that Sk = {s ∈ S | sk ≤ s < sk+1}; each Sk is then moved to a dedicated n x |Sk| submesh. This can be done as follows.
Partition(S, S') 1. For each element of S', compute its rank in S.
(This can be done in parallel on an n x n' R-mesh in O(log n'/(log n' - log m)) time [14].) 2. If sk has rank π(k), then route sk to PE(0, π(k)). After the reconfiguration, let each PE(0, j) that holds a marked aj broadcast aj to its N-port. The value will be received by a PE in the westernmost column.
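Sequentially, Partition amounts to ranking each splitter in S and bucketing every element between consecutive splitters (illustrative Python; on the R-mesh the ranks are computed in parallel and sk is routed to PE(0, π(k))):

```python
import bisect

def partition_by_splitters(S, splitters):
    """Sketch of Partition(S, S'): pi(k) is the rank of splitter s_k in S.
    buckets[k+1] holds the subset [s_k, s_{k+1}); buckets[0] holds any
    elements below s_0.  The last splitter is the sentinel +infinity."""
    sp = sorted(splitters)                               # s_0 < ... < s_m = +inf
    ranks = [sum(1 for x in S if x < s) for s in sp]     # pi(k)
    buckets = [[] for _ in range(len(sp))]
    for x in S:
        buckets[bisect.bisect_right(sp, x)].append(x)
    return ranks, buckets
```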
A statement analogous to Lemma 2 also holds. Solving the recurrence equations yields T(n) = O(1). We summarize the above results as a theorem.

THEOREM 1
The New Sorting Algorithm sorts a set of n elements on an n x n R-mesh in O(1) time.
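The constant-time claim rests on the recurrence T(n') = T(f(n')) + O(1): each level shrinks the subproblem size by a fixed root, so the base case n' ≤ √n is reached after a constant number of levels, independent of n. A small numerical check (Python; the choice f(n') = n'^(3/4) is ours, for illustration only):

```python
import math

def recursion_depth(n, f=lambda m: int(math.isqrt(m))):
    """Count the levels of T(n') = T(f(n')) + O(1) until the base case
    n' <= sqrt(n).  With f(n') ~ sqrt(n') a single level suffices; with
    f(n') ~ n'**(3/4) (an illustrative choice, not taken from the text)
    the depth is still a small constant, independent of n."""
    base = math.isqrt(n)
    depth, m = 0, n
    while m > base:
        m = f(m)
        depth += 1
    return depth
```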
The R-mesh algorithm described above works on the PARBUS model. It can also be adapted to the more restrictive models, RMESH and MRN. In the following, we discuss issues in implementing this sorting algorithm on RMESH and MRN. First note that our algorithm employs an existing algorithm [28] to sort √n elements on an n x √n R-mesh. The algorithm in [28] does not use the crossing connection {WE, NS}, three-port connections, or four-port connections. It does use the non-crossing two-pair connections {WN, ES} and {WS, EN}.
These connections can be simulated using four PEs in the RMESH model [6]. Thus, a set of √n elements can be sorted in O(1) time on n x √n R-meshes (PARBUS, RMESH, and MRN). Similarly, the same simulation can be applied to Step 4 of the partition procedure.
Another implementation issue is the ranking algorithm of [14] (Step 1 of the partition procedure). That algorithm needs the crossing connection {WE, NS}. Fortunately, computing the sum of a 0/1 sequence of length n on an n x 2n R-mesh (PARBUS, RMESH, or MRN) can be done in O(1) time [5]. Using this algorithm, we can compute the rank in S (|S| = n') of every element of S' (|S'| = m) on an n x n' R-mesh (PARBUS, RMESH, or MRN) in O(log n'/(log n' - log m)) time.

Comparison with Existing Sorting Algorithms
The model employed in our analysis of time complexity takes into account both computation time and communication time. Each primitive computation as well as each reconfiguration of the bus system and the subsequent broadcast of data, is assumed to be completed in a unit time.
However, the constant involved in a reconfiguration and broadcast is normally much larger than that involved in a computation step. In light of this, Nigam and Sahni [18] proposed the number of bus broadcasts as a measure of the efficiency of O(1) time R-mesh algorithms. They showed that the number of broadcasts required by their sorting algorithms is much smaller than that of [14] and that of [9], as summarized in Table I.

Scalability of the New Sorting Algorithm
Although the Jang-Prasanna [9] algorithm is scalable, it works only on a PARBUS. Ours works for all the R-mesh models considered (PARBUS, RMESH, and MRN). Our algorithm is also easier to implement: we need only one algorithm for the whole range 1 ≤ K ≤ √n, while [9] needs two algorithms, one for each of two subranges of K. Our algorithm divides a problem of size n' into m subproblems of sizes n0, n1, ..., nm-1, respectively, such that max{n0, n1, ..., nm-1} ≤ f(n'), where f(n') = n/√K if n' > n/√K, and f(n') = n/K otherwise. To sort the original set of n elements, it divides the n elements recursively until the size of every subproblem is at most n/K. The subproblems are then solved in O(1) time by Theorem 1.
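The effect of this size bound can be checked numerically (illustrative Python; it assumes the piecewise form of f(n') described in the text, with the function names being ours): shrinking by a factor √K while the subproblem is large reaches the size-(n/K) base case in a constant number of levels.

```python
import math

def size_bound(n_prime, n, K):
    """Bound f(n') on the largest subproblem when sorting on an
    (n/K) x (n/K) R-mesh: shrink by a factor sqrt(K) while the problem
    is large, and stop subdividing once pieces fit the n/K base case."""
    return n // math.isqrt(K) if n_prime > n // math.isqrt(K) else n // K

def levels_to_base(n, K):
    """Count how many times the bound is applied before every piece
    has size at most n/K (a constant, here 2, for any K > 1)."""
    levels, m = 0, n
    while m > n // K:
        m = size_bound(m, n, K)
        levels += 1
    return levels
```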
We also use amortization [4] to prove the time complexity of this scalable algorithm. To the best of our knowledge, this is the first use of amortization in the design and analysis of R-mesh algorithms. The details of this algorithm and its time complexity analysis are given in Appendix C.
A scalable sorting algorithm can sort a set of n elements on an n/K x n/K R-mesh in O(K) steps for 1 ≤ K ≤ √n. This also leads to an AT-optimal design in the word model of VLSI [26]. The Jang-Prasanna [9] algorithm is scalable. The Nigam-Sahni [18], Lin-Olariu-Schwing-Zhang [14], and Olariu-Schwing [22] algorithms appear not to be scalable, or at least hard to scale. Our algorithm, while close to the Lin et al. algorithm, is easy to scale.
The R-mesh algorithm presented here (using irregular decomposition), unlike previous R-mesh algorithms (using regular decomposition), may look a little complicated at first sight. Contrary to this impression, it is very simple to implement on R-meshes, and it is as broadcast-efficient as the best known result, obtained by [22].
We have applied irregular decomposition to obtain simple and efficient sorting algorithms for R-meshes. The main idea behind the notion of irregular decomposition is to divide a problem along the "natural" boundaries between subproblems. Not all problems have such natural boundaries. This paper provides evidence that, when a problem has natural boundaries between its subproblems, irregular decomposition may be a more effective way to design a divide-and-conquer algorithm for R-meshes.

For example, if P(a0) = P(a2) = P(a3) = P(a6) = 0 and P(a1) = P(a4) = P(a5) = P(a7) = 1, then the resulting R-mesh configuration is illustrated in Figure 3.

iv. For each submesh Mk, each PE(i, j) of Mk has aj, sk, and sk+1. Extract those aj's in the northernmost row of Mk satisfying sk ≤ aj < sk+1 to the westernmost column of Mk. Also mark those processors in the westernmost column that receive a new element. By Lemma 1, the number of aj's such that sk ≤ aj < sk+1 is no more than f(n') in each Mk. Also note that f(n') ≤ √n for √n < n' ≤ n. Thus, the computation requires one broadcast by Lemma 2.
v. Broadcast from the westernmost column of Mk by row buses so that PE(i, j) knows whether PE(i, 0) is marked or not. vi. Extract the new elements received at the marked processors in the westernmost column of Mk into the northernmost row of Mk. Note that the number of processors marked in the westernmost column of Mk is at most n. Thus, this is computed in one broadcast by Lemma 2. Suppose PE(i, 0) is connected to PE(0, Π(i)) in the extraction computation. For any pair of marked processors PE(i, 0) and PE(i', 0), if i < i' then Π(i) < Π(i'). Then a list of n' bits, b0, b1, ..., bn'-1, is computed, where bi = 1 if the number in PE(i, 0) is less than C. Count the number of 1's in this list of n' bits on each Mk; we apply the parallel prefix sum algorithm of [14,22]. We calculate the number of broadcasts in each step of Recursive_Sort(S) and conclude that Recursive_Sort(S) requires g(n') = 1 + (g(f(n')) - 1) + h(n') + g(f(n')) broadcasts.
Thus, Recursive_Sort(S) requires g(n') = 1 + 3 + 8 + 4 = 16 broadcasts when the recursive call falls directly into the base case, and g(n') = 1 + 15 + 10 + 16 = 42 broadcasts when n' is large enough that a further recursive level is needed. Recently, Olariu and Schwing presented a deterministic sampling scheme that further reduces the number of broadcasts required for the sorting problem [22]. We tune the parameters of our algorithm and obtain the same broadcast-efficiency. Note that Lemma 1 still holds under a suitable alternative choice of f(n') and q. The analysis then gives h(n') = 8 for the smaller range of n' and h(n') = 17 for the larger range, with g(n') = 9 (see [22]). We summarize the results of this section as a theorem.
THEOREM 3 There exists an R-mesh algorithm that takes at most 35 bus broadcasts to sort n numbers on an n x n PARBUS.

APPENDIX B: NUMBER OF BROADCASTS IN THE RMESH MODEL
To analyze the number of broadcasts of the new sorting algorithm in the RMESH model, we note some necessary changes to the PARBUS implementation of Section 5.
Broadcasting the elements of S by column and row buses, so that PE(*, j) has aj and PE(i, *) has ai, requires only one broadcast on the PARBUS. This step is simulated by two broadcasts on the RMESH, one using column buses and the other using row buses, as illustrated in Figure 5.
The connections {WN, ES} and {WS, EN} in the PARBUS model can be simulated using four PEs in the RMESH model [6]. A broadcast initiated after these configurations can be simulated by two broadcasts, one for the processors with odd indices and the other for those with even indices. Counting the number of 1's in a list of n' bits on an n x n' RMESH: Consider n' ≤ √n. The implementation is similar to that of Nonrecursive_Sort. It takes 2 iterations (one for the odd bits, the other for the even bits), with each iteration using 2 broadcasts (one for counting bits, the other for broadcasting to the top-left corner of the mesh). Consider n' > √n. We use only 12 broadcasts. (a) Compute the row-major order of each element of Sk as follows.
We now analyze the time complexity of our new scalable sorting algorithm.
f(n') = n/√K if n/√K < n', and f(n') = n/K otherwise; m = ⌈n'/f(n')⌉ if n/K < n', and m = O(1) otherwise. In the analysis, R1(|Sk|) = g·|Sk|/n and R2(|Sk|) = g·|Sk|/n.

Ming-Jye Sheng received his Ph.D. degree in computer science from the Ohio State University in 1994. He was involved in telecommunication projects with AT&T during 1994-1995. He also served as the director of R&D of CyberExpress Inc. and initiated several Internet projects during 1995-1996. He is currently a member of technical staff at Lucent Technologies. His current research and development interests are new technology and performance analysis of wireless systems.