Fast Digit-Index Permutations

We introduce a tensor sum which is useful for the design and analysis of digit-index permutation (DIP) algorithms. Using this operation we obtain a new high-performance algorithm for the family of DIPs. We discuss an implementation in the applicative language Sisal and show how different choices of parameters yield different DIPs. The efficiency of the special case of digit reversal is illustrated with performance results on a Cray C-90. © 1996 John Wiley & Sons, Inc.


INTRODUCTION
Kronecker's matrix product is a powerful tool for designing and adapting fast Fourier transform (FFT) algorithms to different multiprocessing environments [6,9,10]. This is especially true for implementations in an applicative language. Starting with an FFT algorithm expressed in tensor notation, one can use the laws of tensor algebra to obtain tensor formulas that take into account a given architecture and are easily translated into the language. The Kronecker product also provides a unifying framework for seemingly different FFT algorithms. Under Kronecker's formulation, for instance, Pease's parallel algorithm and the Korn-Lambiotte vector FFT are simple variants of the original Cooley-Tukey FFT. In [2] the authors have exploited these ideas to obtain a high-performance FFT written in the applicative language Sisal.
The Cooley-Tukey FFT, as well as the Pease and Korn-Lambiotte algorithms, consists of a "combine" phase and a separate "ordering" phase in which the components of a vector need to be written in bit-reversed order. An efficient way to implement bit reversal on a vector machine such as a Cray is by computing a vector of bit-reversed indices and then using gather/scatter to effect the reordering. Programmers have used various algorithms to compute the index vector; however, the development of these algorithms has been somewhat ad hoc, and there is no general tool comparable to the Kronecker product, which is so useful for the combine phase, for designing and analyzing bit-reversal algorithms.
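As a concrete illustration of the index-vector approach, here is a minimal Python sketch (the algorithms in this article are expressed in Sisal; the function names below are ours):

```python
def bit_reverse(j, k):
    """Reverse the k low-order bits of the index j."""
    result = 0
    for _ in range(k):
        result = (result << 1) | (j & 1)  # shift in the lowest bit of j
        j >>= 1
    return result

def bit_reversal_index(k):
    """Index vector of bit-reversed indices for n = 2**k."""
    return [bit_reverse(j, k) for j in range(2 ** k)]

def gather(x, index):
    """Reorder x by gathering along the index vector."""
    return [x[i] for i in index]

# Reordering an 8-point vector into bit-reversed order:
print(gather(list(range(8)), bit_reversal_index(3)))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

On a vector machine the reordering itself is a single hardware gather over the precomputed index vector; the question addressed in this article is how to compute that index vector fast.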
In this article we develop a tensor sum whose role in the ordering phase is similar to the role of the tensor product in the combine phase. Just as the tensor product serves as a unifying framework for the combine phase of various FFT algorithms, the tensor sum serves as a unifying framework for digit-reversal algorithms. We show how different known digit-reversal algorithms can be expressed in terms of the tensor sum. This observation allows us to obtain a new algorithm for the general class of digit-index permutations (DIPs) whose parallel-time complexity is O(log log n). We give a Sisal implementation for this algorithm and illustrate how different choices of parameters yield different members of the DIP family.
The results presented here provide a complete set of tools for designing, modifying, and implementing radix-r FFT permutations in a functional language such as Sisal. In other work, which will appear, the authors exploit the tensor product to develop a similar set of tools for the combine phases of radix-r FFTs. In this latter work we show that there are just four basic functions, in the sense that the combine phase of each of the radix-r FFTs can be expressed as the composition of some of the basic functions. The totality of these results allows us to develop high-performance Sisal implementations for the entire family of radix-r FFTs.

AN OVERVIEW OF SISAL
Sisal (Streams and Iteration in a Single-Assignment Language) is an applicative language developed primarily by Lawrence Livermore National Laboratory and Colorado State University for large-scale scientific programming. It is a "single-assignment" language: a name can receive a value only once in each name scope. Sisal has been successfully implemented on shared-memory machines such as the Cray family, and work is currently underway to implement it on distributed-memory machines.
Sisal is a very convenient language for prototyping, developing, and evaluating parallel algorithms since it allows the programmer to write programs that are very close to their original mathematical formulations. One can program and test individual functions and compose them with the guarantee of not introducing race conditions and side effects. Furthermore, there is growing evidence [3] that Sisal can perform on a par with traditional languages such as Fortran.
The authors [2, 5] have successfully used Sisal to develop FFTs that outperform their Fortran counterparts on Crays. We express the algorithms developed in this work in Sisal, not only because these algorithms complement the ideas in [2], but also because the use of Sisal greatly simplifies the expression of these algorithms as compared with an imperative language such as Fortran.
In this section we present only sufficient detail of the language to understand the algorithms presented in this work. A much more complete and very readable account of Sisal is given in [8].
Sisal comments begin with a % character and continue to the end of the line. The most important data type in Sisal for the ideas considered here is the array, which is a one-dimensional collection of homogeneous values indexed by simple integer expressions. Multidimensional arrays are arrays of arrays. The size of an array is dynamic. Array operations include array construction, replacement, selection, and operations to redefine the lower bound, array_setl(A,lo), to add a new element at the beginning of the array, array_addl(A,v), and to add a new element at the end of the array, array_addh(A,v).
There are two types of loop expressions in Sisal: for initial and for. The former is an iterative construct having four parts: initialization, body, test, and returns clause. The initialization clause is the loop's first iteration and defines loop constants and the initial values of all loop-carried names. The body redefines the values of all loop-carried names. The values of the names on the previous iteration can be accessed as old name. Thus, the for initial expression supports iteration while preserving single-assignment semantics. The test clause may be either while Boolean expression or until Boolean expression. The test clause can appear either before or after the body. The returns clause determines the value of the for initial expression. The returns clause has various forms, including: returns value of expression, which returns the expression's value on the last iteration; returns array of expression, which returns an array of the expression's value on each iteration; and returns value of catenate expression, which returns the catenation of the arrays defined by expression.
The syntax of a for expression has three parts: generator, body, and returns clause. An instance of the body is executed for every element in the "generator." The generator can have the form for name in lo, hi, which specifies that name takes on the values lo to hi, inclusive, and an instance of the body is executed for each value. Another useful form of the generator is for name in array_name.
Here, name takes on the values of the array's elements, and an instance of the body is executed for each element.
The semantics of the for expression are such that the loop bodies are data independent and may be executed in parallel. The reduction operations specified in the returns clause are implemented deterministically by the run-time system. The for expression is the major source of parallelism and vectorization in Sisal programs on most commercial computer systems.

A TENSOR ADDITION
We shall develop a tensor addition whose role in the ordering phase of an FFT is similar to that of the tensor product in the combine phase. Although occasional references to tensor products can be found in early work on FFTs, it is only in the last several years [6,9,10] that their importance to FFTs, especially in matching specific FFT algorithms to specific architectures, has been fully appreciated.
Here we remark only briefly on its role.
Let A = [a_{kl}] and B be matrices of arbitrary sizes. The tensor product of A by B is defined to be the block matrix A ⊗ B = [a_{kl} B]. Different sparse-matrix factorizations of the N-point discrete Fourier transform (DFT) matrix F_N, formed by different Kronecker product decompositions, yield different computational schemes for the DFT. The main use of these Kronecker factors in matching algorithms and architectures comes from the identity A ⊗ B = (A ⊗ I_n)(I_m ⊗ B), where I_s is the s × s identity matrix, and A and B are m × m and n × n matrices, respectively. The factor I_m ⊗ B is then regarded as m parallel multiplications of B on n-point vectors, while the factor A ⊗ I_n is regarded as A multiplying a set of m-point vectors in a pipeline of length n. The so-called commutation formula P(nm, n)(A ⊗ B)P(nm, m) = B ⊗ A, where P(nm, s) is the stride-s permutation, allows the change from a parallel to a pipelined interpretation and vice versa. For example, the decomposition F_{2^{2r}} = (F_{2^r} ⊗ I_{2^r}) T (I_{2^r} ⊗ F_{2^r}) P(2^{2r}, 2^r), where T is a diagonal matrix of the so-called "twiddle factors," is turned into a parallel algorithm by commuting the leftmost factor. The same formula can be turned into a vector algorithm by commuting the rightmost factor. The Kronecker formulation thus provides a mathematical framework for FFT dataflow modification and identification of core operations.
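Both identities are easy to check numerically. The following self-contained Python sketch (our own helper names; matrices stored as lists of rows) verifies the factorization and the commutation formula for a small pair A, B:

```python
def kron(A, B):
    """Kronecker product of matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def stride_perm(N, s):
    """Stride-s permutation matrix P(N, s): output position t gathers input
    component j + i*s, enumerating j = 0..s-1 (outer), i = 0..N/s-1 (inner)."""
    idx = [j + i * s for j in range(s) for i in range(N // s)]
    return [[int(c == src) for c in range(N)] for src in idx]

m, n = 2, 3
A = [[1, 2], [3, 4]]                    # m x m
B = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # n x n

# A (x) B = (A (x) I_n)(I_m (x) B): a pipelined factor times a parallel factor
assert kron(A, B) == matmul(kron(A, identity(n)), kron(identity(m), B))

# Commutation formula: P(nm, n)(A (x) B)P(nm, m) = B (x) A
lhs = matmul(stride_perm(m * n, n), matmul(kron(A, B), stride_perm(m * n, m)))
assert lhs == kron(B, A)
print("both identities hold")
```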
We shall see that our tensor sum is a very natural companion to the tensor product. We define the tensor sum u ⊕ v of u and v as follows. For any scalars u and v, define u ⊕ v = u + v. For a vector u = [α_0, α_1, ..., α_{r-1}] and a scalar β, define u ⊕ β = [α_0 + β, α_1 + β, ..., α_{r-1} + β], and symmetrically β ⊕ u = [β + α_0, β + α_1, ..., β + α_{r-1}]. For any vector u = [α_0, α_1, ..., α_{r-1}] and any vector v = [β_0, β_1, ..., β_{s-1}], define u ⊕ v to be the vector of length rs obtained by concatenating the subvectors u ⊕ β_i:

u ⊕ v = [u ⊕ β_0, u ⊕ β_1, ..., u ⊕ β_{s-1}].

For our purposes here, we assume that all of the scalars in the definition are integers.
The subvectors u ⊕ β_i in the above definition can be computed in parallel. On a vector machine each of these operations is a vector operation.
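In Python the definition reads as follows (a sketch with our own naming; the paper's implementations are in Sisal):

```python
def tensor_sum(u, v):
    """Tensor sum u ⊕ v: scalars add; otherwise the result concatenates
    the subvectors u ⊕ β_i, one per component β_i of v."""
    if not isinstance(u, list) and not isinstance(v, list):
        return u + v                      # scalar ⊕ scalar
    u = u if isinstance(u, list) else [u]
    v = v if isinstance(v, list) else [v]
    return [a + b for b in v for a in u]  # [u + β_0, u + β_1, ...]

print(tensor_sum([0, 3, 6], [0, 1, 2]))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
print(tensor_sum([0, 1], 4))             # [4, 5]
```

Each subvector u + β_i is one vector addition, so a vector machine computes u ⊕ v in s vector operations.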
Associativity: (v ⊕ u) ⊕ z = v ⊕ (u ⊕ z), for all vectors or scalars v, u, and z.

Distributivity: a(v ⊕ u) = av ⊕ au, for v and u vectors or scalars and a a scalar.

Also, the tensor sum is:

Noninvertible: unless both terms are scalars.

Noncommutative: unless one of the terms is a scalar.

For u_0, ..., u_{k-1} vectors of length r, we use the notation ⊕_{i=0}^{k-1} u_i = u_0 ⊕ u_1 ⊕ ... ⊕ u_{k-1}. As an example of this notation, note that V_{r^2} = V_r ⊕ rV_r, where V_r = [0, 1, ..., r-1], and by induction on k we have V_{r^k} = ⊕_{i=0}^{k-1} r^i V_r.

DIPs
We shall sometimes describe permutations in terms of cyclic permutations, which are permutations of the form p(a_1) = a_2, p(a_2) = a_3, ..., p(a_{m-1}) = a_m, p(a_m) = a_1. Such a permutation is denoted by the cycle (a_1, a_2, ..., a_m). By a well-known result from elementary group theory, every permutation can be written as a product of disjoint cycles (i.e., no two cycles alter the same symbol).

Let p be a permutation on N_k = {0, 1, ..., k-1}. The permutation P on N_{r^k} = {0, 1, ..., r^k - 1} defined by

P(Σ_{i=0}^{k-1} j_i r^i) = Σ_{i=0}^{k-1} j_i r^{p(i)},  0 ≤ j_i < r,  (5)

is said to be the DIP on N_{r^k} induced by p. Thus, the image of an integer j ∈ N_{r^k} is obtained by expressing j in base r and using p to permute the base-r digits of j.
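The digit-permuting action of Equation 5 on an individual index can be sketched in a few lines of Python (our own naming; p is given as a list with p[i] the image of digit position i):

```python
def dip(j, r, k, p):
    """Image of j under the DIP on N_{r^k} induced by p: write j in base r
    and move the digit at position i to position p[i]."""
    digits = [(j // r ** i) % r for i in range(k)]
    return sum(d * r ** p[i] for i, d in enumerate(digits))

# Digit reversal for r = 3, k = 2 is induced by p = [1, 0]:
print([dip(j, 3, 2, [1, 0]) for j in range(9)])  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```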
The following result provides the basis for a generic algorithm for the DIPs.

Theorem. Let P be a DIP on N_{r^k} induced by p. Then

P(V_{r^k}) = ⊕_{i=0}^{k-1} r^{p(i)} V_r.  (7)

Proof. We show that P(j) is equal to the j-th element of ⊕_{i=0}^{k-1} r^{p(i)} V_r. Let us first define what we mean by the antilexicographical order on a Cartesian product of ordered sets. For a single ordered set A, the antilexicographical order on A is simply the order on A. If A and B are ordered sets with respective cardinalities m and n, and if a_i and b_i are the i-th elements of A and B, respectively, then the antilexicographical order on A × B is defined as follows: The first element of A × B is (a_1, b_1). If (a_r, b_s) is the i-th element, where 1 ≤ i < mn, then the (i+1)-th element is (a_{r+1}, b_s) if r < m and is (a_1, b_{s+1}) if r = m. The antilexicographical order on ×_{i=1}^{m} A_i, where each A_i is an ordered set, is the antilexicographical order on (×_{i=1}^{m-1} A_i) × A_m, where we identify each ((e_1, ..., e_{m-1}), e_m) with (e_1, ..., e_m). Now, if j = Σ_{i=0}^{k-1} j_i r^i, then the digit tuple (j_0, j_1, ..., j_{k-1}) is the j-th element in the antilexicographical order of ×_{i=0}^{k-1} N_r. Thus, the j-th element of ⊕_{i=0}^{k-1} r^{p(i)} V_r is Σ_{i=0}^{k-1} j_i r^{p(i)} = P(j).

It is shown in [1] that the set of all DIPs on N_{r^k} is precisely the group of permutations generated by the generalized "shuffles," i.e., the permutations GS_{r^k,i} induced by the cycles c_i = (i, i+1, ..., k-1). Thus, a permutation on N_{r^k} is a DIP if and only if it can be expressed as a product of GS_{r^k,i}'s. In actual practice, the most convenient description of a DIP on N_{r^k} is in terms of the permutation p on N_k that induces it. In what follows we examine some examples of DIPs of special interest and give their vector representations based on Equation 7.
Example 1. Digit Reversal These are perhaps the best known DIPs. Such a permutation R_{r^k}: N_{r^k} → N_{r^k} is induced by p(i) = k - 1 - i. In terms of Equation 7, digit reversal on V_{r^k} is expressed as

R_{r^k}(V_{r^k}) = ⊕_{i=0}^{k-1} r^{k-1-i} V_r.

For example, digit reversal for r = 3, k = 2 is induced by the permutation (0, 1) and its vector representation is

R_9(V_9) = 3V_3 ⊕ V_3 = [0, 3, 6, 1, 4, 7, 2, 5, 8].

An important special case arises when r = 2, i.e., "bit reversal," which we denote simply by B_k.
Example 2. Stride Permutations A stride-r^i permutation on N_{r^k} is induced by a left cyclic shift of length i on N_k. We consider some special cases in the examples that follow.

Example 3. Even-Odd Sort
This is the stride-2 permutation, E_k, on N_{2^k}. Thus, E_k groups all even indices and all odd indices and is induced by (0, 1, 2, ..., k-1). By the theorem we have

E_k(V_{2^k}) = 2V_2 ⊕ 4V_2 ⊕ ... ⊕ 2^{k-1}V_2 ⊕ V_2.

For example, E_3 is induced by (0, 1, 2) and its vector representation is

E_3(V_8) = 2V_2 ⊕ 4V_2 ⊕ V_2 = [0, 2, 4, 6, 1, 3, 5, 7].

Example 4. Perfect Shuffle The perfect shuffle S_{r^k} is the stride-r^{k-1} permutation on N_{r^k} and is induced by the cycle (0, k-1, k-2, ..., 1). For example, if r = 2 and k = 3, then S_8 is induced by (0, 2, 1) and

S_8(V_8) = 4V_2 ⊕ V_2 ⊕ 2V_2 = [0, 4, 1, 5, 2, 6, 3, 7].

Example 5. Matrix Transpositions Given an m × n matrix A which is stored in row-major order (as in Sisal), the transpose of A can be regarded as a stride-n permutation on V_{mn}. For square matrices of order, say, n, transposition can be represented as the DIP on N_{n^2} induced by the permutation (0, 1). Thus, if A is a 4 × 4 matrix, the one-dimensional array a occupied by A can be envisioned by the indexing vector V_4 ⊕ 4V_4. The first row of A is indexed by V_4, the second by V_4 ⊕ 4, and, in general, the j-th row is indexed by V_4 ⊕ 4(j-1). So, V_4 is associated with the rows while 4V_4 is associated with the columns. Thus, although V_4 ⊕ 4V_4 is a one-dimensional indexing vector, it preserves, by means of the tensor sum decomposition, the two-dimensional features of the original array. The indexing set 4V_4 ⊕ V_4 of the transpose can also be interpreted as a two-dimensional indexing set by associating 4V_4 with rows and V_4 with columns. The resulting array is then the transpose of A. From its definition it follows that the 2D-DFT of size M × L can be computed in terms of one-dimensional transforms, as follows:

Step 1. Compute the L-point transforms of each of the M rows.

Step 2. Compute the M-point transforms of each of the L columns of the result of Step 1.

Assuming that x is stored in row-major order, as in Sisal, a common approach to this problem is to perform a transpose after Step 1, then again compute transforms of rows, and then another transpose.
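The 4 × 4 indexing sets above can be reproduced directly; the sketch below (Python, our own naming) gathers a row-major array along 4V_4 ⊕ V_4 to produce its transpose:

```python
def tensor_sum(u, v):
    """u ⊕ v = [u + β_0, u + β_1, ...], concatenated (vectors of integers)."""
    return [a + b for b in v for a in u]

V4 = [0, 1, 2, 3]
row_major = tensor_sum(V4, [4 * j for j in V4])  # V_4 ⊕ 4V_4 = [0, 1, ..., 15]
transpose = tensor_sum([4 * j for j in V4], V4)  # 4V_4 ⊕ V_4

print(transpose)  # [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]

# Gathering a row-major 4x4 array A along `transpose` yields A^T:
A = list(range(16))
At = [A[i] for i in transpose]
assert all(At[4 * i + j] == A[4 * j + i] for i in range(4) for j in range(4))
```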
An efficient way to effect the transposes in the above method would be to combine them with the ordering phases of Steps 1 and 2. For example, suppose that the Gentleman-Sande algorithm (where digit reversal follows the combine phase) is used to compute the row transforms. Then the four separate permutations, two row-wise digit reversals and two transposes, could be replaced by two single DIPs, each consisting of the composition of row-wise digit reversal with transposition.

ALGORITHM DESIGN
Several algorithms for computing B_k(V_{2^k}) have been designed. Among them, the better known are the "Double-Add-One" (DAO) and Elster's algorithm (EA) [4]. Both algorithms, as well as their generalizations to R_{r^k}, can be interpreted as simple variants of Equation 7. For this purpose let us first observe that, for any scalar a, a(u ⊕ v) = au ⊕ av. The distributivity of ⊕ thus allows the "Horner-like" nesting of Equation 7:

B_k(V_{2^k}) = 2B_{k-1}(V_{2^{k-1}}) ⊕ V_2.

This nested computation of B_k(V_{2^k}) constitutes the DAO algorithm.
For example, since B_2(V_4) = 2V_2 ⊕ V_2 = [0, 2, 1, 3], we have B_3(V_8) = 2B_2(V_4) ⊕ V_2 = [0, 4, 2, 6] ⊕ V_2 = [0, 4, 2, 6, 1, 5, 3, 7]. EA, in turn, is based on associating to the left in Equation 7:

B_k(V_{2^k}) = (⋯((2^{k-1}V_2 ⊕ 2^{k-2}V_2) ⊕ 2^{k-3}V_2)⋯) ⊕ V_2.

For example, since 4V_2 ⊕ 2V_2 = [0, 4, 2, 6], we have B_3(V_8) = (4V_2 ⊕ 2V_2) ⊕ V_2 = [0, 4, 2, 6, 1, 5, 3, 7].

Let us compare these algorithms within the PRAM framework. We assume that such a model consists of p processors connected to a shared random access memory M. The input for an algorithm is assumed to be in some designated memory cells and the output is to be placed in some other designated cells. All processors run the same program. Each time step has three phases: the read phase, in which each processor may read from a memory cell; the computation phase, in which each processor does a computation; and the write phase, in which each processor may write to a memory cell. We assume the "CREW" model, which allows concurrent reads but only exclusive writes. The parallel complexity, or parallel time, is the total number of steps executed, expressed as a function of n, to solve a problem of size n.
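The two nestings described above translate into short iterative sketches (Python; our own function names). `dao` doubles the current vector and appends the odd values; `elster` associates to the left, appending a shifted copy at each step:

```python
def dao(k):
    """Bit-reversal vector B_k(V_{2^k}) by Double-Add-One:
    B_k = 2*B_{k-1} ⊕ V_2, i.e. [2b, 2b + 1]."""
    b = [0]
    for _ in range(k):
        doubled = [2 * x for x in b]
        b = doubled + [x + 1 for x in doubled]
    return b

def elster(k):
    """The same vector by left association: at step j append b + 2^(k-1-j)."""
    b = [0]
    for j in range(k):
        b = b + [x + 2 ** (k - 1 - j) for x in b]
    return b

print(dao(3))             # [0, 4, 2, 6, 1, 5, 3, 7]
assert dao(10) == elster(10)
```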
A PRAM can compute u ⊕ v in one time step using rs processors, where r and s are the respective sizes of u and v, as follows. Let u = [α_0, α_1, ..., α_{r-1}] and v = [β_0, β_1, ..., β_{s-1}]. For each i = 0, 1, ..., s-1, each of r processors p_{ij} reads an α_j, j = 0, 1, ..., r-1, and β_i; each of the rs processors p_{ij} then adds its α_j and β_i and writes its result in a distinct memory cell.
It is easy to see that both DAO and EA require O(log n) parallel steps and O(n) processors. However, by performing successive pairwise sums in Equation 7, we can asymptotically reduce the complexity to (O(log log n), O(n)), much in the same way that the parallel time of computing an associative sum of n = 2^k elements is reduced from O(n) to O(log n) in the familiar prefix sums algorithm. For example, for k = 4 the sum u_0 ⊕ u_1 ⊕ u_2 ⊕ u_3 can be computed as (u_0 ⊕ u_1) ⊕ (u_2 ⊕ u_3), a balanced binary tree in which each node represents a (tensor addition) time step. This gives an algorithm of parallel complexity O(log k) = O(log log n), where n = r^k. The maximum number of processors used, which occurs in the last step, is n.
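The pairwise scheme can be sketched as follows (Python, our own naming); when a level has an odd number of vectors, the last one is simply carried up to the next level:

```python
def tensor_sum(u, v):
    return [a + b for b in v for a in u]

def tree_dip(r, k, p):
    """Compute ⊕_{i=0..k-1} r**p[i] * V_r by repeated pairwise tensor
    sums: ceil(log2 k) sweeps instead of k - 1 sequential ⊕ steps."""
    level = [[r ** p[i] * x for x in range(r)] for i in range(k)]
    while len(level) > 1:
        nxt = [tensor_sum(level[2 * i], level[2 * i + 1])
               for i in range(len(level) // 2)]
        if len(level) % 2:          # odd vector left over: carry it up
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Bit reversal for k = 3 (p[i] = k - 1 - i):
print(tree_dip(2, 3, [2, 1, 0]))  # [0, 4, 2, 6, 1, 5, 3, 7]
```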
One of the problems of implementing the O(log log n)-parallel time algorithm is the representation of the variable number of vectors in each of the iterative steps. One possibility is to represent each such set of vectors by their concatenation, but this vector is of variable length. Furthermore, keeping track of the indices of the components of the individual vectors within the large vector is very difficult. This problem can be completely eliminated in Sisal by representing the set of vectors by a dynamically sized array of dynamically sized arrays.

SISAL IMPLEMENTATIONS
The following Sisal function T ("tree") computes the tensor sum (7) by repeated pairwise applications of ⊕. We assume the data type OneDim as previously defined. The function T provides a generic algorithm for computing the vector representation of any DIP P in parallel time O(log log n), n = r^k. One only needs to correctly supply the values of the permutation p on N_k that induces P. For digit reversal we use

function digit_reversal(r,k: integer returns OneDim)
  let
    p := for i in 1, k returns array of k - i end for
  in
    T(r, k, p)
  end let
end function

DIGIT REVERSAL
We thus see that the function T constitutes an O(log log n) parallel time algorithm for the whole family of DIPs. In particular, we obtain a digit-reversal algorithm, which we term "tree digit reversal" ("TDR"), by computing T with p(i) = k - i, i = 1, 2, ..., k.
In previous work [7] we developed another O(log log n) time algorithm, "recursive divide and conquer" ("RDC"), for computing digit reversal. We conclude with a comparison between these two algorithms.
It follows from Equation 7 that

R_{r^k}(V_{r^k}) = r^{k_1} R_{r^{k_2}}(V_{r^{k_2}}) ⊕ R_{r^{k_1}}(V_{r^{k_1}}),  (27)

where k_1 = k_2 = k/2 if k is even, and k_1 = (k+1)/2, k_2 = (k-1)/2 if k is odd. Algorithm RDC results by repeated application of this recurrence. For example, R_{2^8} = B_8 can be computed in three steps. In general, R_{r^k} can be computed by applying recurrence (27) ⌈log k⌉ times. Thus, RDC has the same parallel complexity and uses the same number of processors as TDR. However, RDC uses slightly fewer scalar operations than TDR. This is because RDC makes use of the fact that in each of the sequential steps the vectors r^{k_1}R_{r^{k_2}} and R_{r^{k_1}} can be obtained from the previous step. This is true even in case k_1 ≠ k_2, since r^{k_1}R_{r^{k_2}} consists of the first r^{k_2} components of r^{k_2}R_{r^{k_1}}.
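Recurrence (27) translates directly into a short recursive sketch (Python; our own naming). Unlike the paper's iterative Sisal version, this sketch simply recomputes R_{k1} rather than reusing the previous step:

```python
def tensor_sum(u, v):
    return [a + b for b in v for a in u]

def rdc(r, k):
    """Digit-reversal vector R_{r^k}(V_{r^k}) by recurrence (27):
    R_k = r**k1 * R_{k2} ⊕ R_{k1}, with k1 = ceil(k/2), k2 = floor(k/2)."""
    if k == 1:
        return list(range(r))
    k1, k2 = (k + 1) // 2, k // 2
    return tensor_sum([r ** k1 * x for x in rdc(r, k2)], rdc(r, k1))

print(rdc(2, 3))  # [0, 4, 2, 6, 1, 5, 3, 7]
print(rdc(3, 2))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```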
Let us examine the relationship between the number N_TDR(n) of scalar operations used by TDR and the number N_RDC(n) used by RDC in a little more detail. This is easiest to see when k is a power of 2, say k = 2^m. The computation done by TDR can be visualized by a complete binary tree T of height m where each node represents a tensor sum. Numbering the levels 1, 2, ..., m from leaves to root, on level i TDR computes 2^{m-i} tensor sums of vectors, each of length r^{2^{i-1}}, and hence

N_TDR = Σ_{i=1}^{m} 2^{m-i} r^{2^i}.  (28)

Actually, we could reduce the number of scalar operations used in both RDC and TDR. This is because each application of the tensor sum in each of the algorithms involves vectors whose first components are zero. For two such vectors of length r, we could save r operations alone by just noting that

u ⊕ [0, v_1, ..., v_{r-1}] = [u, u ⊕ v_1, ..., u ⊕ v_{r-1}].
Taking advantage of this economy requires a slightly more complicated version of the Sisal function oplus. However, such a change has little noticeable effect. First, the change has very little effect on the total number of scalar operations needed. Recalculating Equations 28 and 29 we obtain

N'_TDR = Σ_{i=1}^{m} 2^{m-i}(r^{2^i} - r^{2^{i-1}})  and  N'_RDC = Σ_{i=1}^{m} r^{2^i}.

N'_TDR is still slightly greater than N'_RDC, but the difference is even smaller than the difference between N_TDR and N_RDC. For example, N'_TDR(3^16) = 43053456 and N'_RDC(3^16) = 43053372. Second, and most importantly, experiments confirm that this change has virtually no effect on the running times of either RDC or TDR, nor does it have any noticeable effect on EA.
The following Sisal function implements RDC. We have eliminated recursion by using a function ks to precompute vectors that give the values of k_1 and k_2, respectively.

function rdc(r,k: integer returns OneDim)

PERFORMANCE
TDR and RDC are both excellent parallel algorithms. The last step (i = m in Equations 28 and 29) comprises almost all the work. In the experiments we ran, 99% of the execution time was spent on the last step in function oplus. The overhead of the tree reduction and management of the data structures is inconsequential.
Table 1 gives the execution times in seconds for the Elster (EA), TDR, and RDC algorithms on a 16-processor Cray C-90 with 2GB of main memory. For r = 2, TDR and RDC are approximately twice as fast as EA, and the speedup for all three algorithms is nearly linear. RDC is slightly faster than TDR, but the difference is insignificant. Both algorithms execute approximately 285 million integer operations per second per processor. Since the Sisal version of TDR comprises a triply-nested for expression,
  for each pair of rows in Bj
    for each element in the second row of the pair
      for each element in the first row of the pair

and there is only one pair of rows on the last step, we compiled the code with the -n 2 flag to force the compiler to parallelize the first two levels of the nested expression. Note that the Cray Sisal compiler always vectorizes the innermost for expression or slices thereof. For r = 3, performance of the three algorithms is similar to r = 2; however, TDR performs poorly for 3^17. The algorithm's poor speedup on two and four processors for this data size is an effect of static loop slicing. For k = 17, the algorithm builds five instances of the Bj array with 17, 9, 5, 3, and 2 rows, respectively. On the last step, the first row has 3^16 elements and the second row has three elements. Since we instructed the compiler to parallelize only the first two levels of the triply-nested for expression, concurrency is realized only across the three elements of the second row; thus, the code suffers from poor load balance on two processors and insufficient parallelism on four processors. We can improve the performance by instructing the compiler to parallelize all three levels, but the correct general solution to this problem is for the Sisal compiler to support dynamic loop slicing.

SUMMARY AND CONCLUSIONS
We have defined a vector operation, the tensor sum, that plays a very important role in the design and analysis of DIPs and in particular of digit-reversal permutations. We derived a new O(log log n) parallel-time algorithm for the family of DIPs and developed a Sisal implementation for this algorithm. Different choices of the parameter p of the Sisal function T give different DIPs. For the special case of bit reversal (TDR with r = 2) we showed that on a Cray C-90 our algorithm runs up to twice as fast as Elster's bit-reversal algorithm and differs only very slightly from the authors' RDC algorithm.
Work is currently underway to apply the methods presented here to a more general class of permutations, which includes mixed-radix digit-reversal permutations and prime-factor FFT permutations.

For a perfect shuffle, or stride r^{k-1}, we compute T with p = [k-1, 0, 1, ..., k-2]:

function shuffle(r,k: integer returns OneDim)
  let
    p := for i in 1, k-1 returns array of i - 1 end for
  in
    T(r, k, array_addl(p, k-1))
  end let
end function

To obtain the "even-odd" sort we need p = [1, 2, ..., k-1, 0]:

function even_odd(r,k: integer returns OneDim)
  let
    p := for i in 0, k-2 returns array of i + 1 end for
  in
    T(r, k, array_addh(p, 0))
  end let
end function

On the other hand, RDC computes the product of a scalar by a vector of length r^{2^{i-1}} and just one tensor sum of vectors of length r^{2^{i-1}} on each level of the tree, and hence

N_RDC = Σ_{i=1}^{m} (r^{2^{i-1}} + r^{2^i}).  (29)

It is straightforward to show that for fixed r, both N_TDR and N_RDC are O(n). The exact values are very nearly equal. For example, N_TDR(3^16) = 2^3·3^2 + 2^2·3^4 + 2·3^8 + 3^16 = 43060239, while N_RDC(3^16) = 43060026.

Table 1. Performance: Execution Times on the C-90