Functional Implementations of the Jacobi Eigensolver

In this article we describe the systematic development of two implementations of the Jacobi eigensolver and give initial performance results for the MIT/Motorola Monsoon dataflow machine. Our study is carried out using MINT, the MIT Monsoon simulator. The functional semantics of array updates, which cause excessive array copying, led us to an implementation of a parallel "group rotations" algorithm first described by Sameh. Our version of this algorithm requires O(n^3) operations, whereas Sameh's original version requires O(n^4) operations. The convergence of the group algorithm is briefly treated. We discuss the issues involved in rewriting the algorithm in Sisal, a strict, purely functional language without explicit I-structures.


INTRODUCTION
A fundamental strength of functional languages is their ability to express the implementation of parallel algorithms in a machine-independent fashion, closely following their mathematical formulation. This, combined with their ability to express parallelism at the function, loop, and instruction levels, provides strong arguments for the use of functional languages and the development of functional algorithms for parallel computing. As functional languages provide a gateway to novel multithreaded machine architectures [6], they are of interest to computational scientists and designers of numerical algorithms for these machines. Id [11] and Sisal [9] are languages with such potential.
An ultimate goal of parallel programming language design should be the efficient mapping of programs onto parallel hardware. Functional languages have yet to provide general evidence that they can achieve sufficient levels of performance. Some recent results of optimizing compilation for Sisal are beginning to demonstrate this capability [3].
In this article, we present the design and analysis of a numerical algorithm, the Jacobi eigensolver, initially written in the functional dataflow language Id [1, 8, 10]. The programming constructs are functional, but we use explicit I-structures in implementing array computations, exploiting the extra expressive power that I-structures bring to a declarative language. I-structures are data structures with built-in element-level synchronization, implemented with tag bits. Each element of an I-structure can be written only once. Element reads before writes are deferred until the write occurs. We have not used Id features that allow nondeterminism in parallel programs. In particular, mutable arrays (called M-structures in Id), which allow parallel updating of array elements and thereby introduce time dependence and hence nondeterminism, are not used. We discuss the issues involved in rewriting the algorithm in Sisal, a strict functional language without explicit I-structures.
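The write-once, deferred-read behavior of an I-structure element can be illustrated outside Id. The following Python sketch is our own illustration, not part of the Id implementation; the class name IVar is hypothetical. It models a single I-structure slot with an event providing element-level synchronization:

```python
import threading

class IVar:
    """A write-once cell with deferred reads, mimicking one I-structure slot."""
    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def write(self, value):
        # an I-structure element can be written only once
        if self._event.is_set():
            raise RuntimeError("I-structure slot written twice")
        self._value = value
        self._event.set()          # wake up any deferred readers

    def read(self):
        self._event.wait()         # a read before the write is deferred here
        return self._value

# A reader started before the write is deferred until the write occurs.
cell = IVar()
results = []
reader = threading.Thread(target=lambda: results.append(cell.read()))
reader.start()
cell.write(42)
reader.join()
```

The real I-structure implementation uses hardware tag bits rather than events, but the observable semantics are the same: deferred reads and a run-time error on a second write.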
Algorithms for eigensolvers represent an important class of numerical software typically found in standard Fortran system libraries. The Jacobi algorithm exhibits an interesting matrix calculation in which the ordered update of each matrix element is governed by a sequence of previously computed updates. From the description of this algorithm given below, the computational use and organization of the data initially appear to be a challenging task for implementation in a functional language.
We show that a functional implementation taken directly from the specification of the numerical algorithm is marred by an intolerable amount of work caused by useless data copying required to maintain functional semantics. A second implementation, which avoids useless copying by performing more real work in one step, is of the same order of total work complexity as the original sequential algorithm and provides a higher degree of parallelism. This turns out to be an improved version of an algorithm that was designed for the ILLIAC-IV [13].

THE JACOBI EIGENSOLVER
Given a symmetric N × N matrix A, the eigenvalue problem is the determination of eigenvectors x and eigenvalues λ defined by

    A x = λ x.                                        (1)

Any standard reference on numerical methods [12] will provide several methods for determining the solution to this problem. One such method, known as the Jacobi algorithm, uses two-dimensional rotations applied successively to each off-diagonal element of the matrix A. When the rotations are done systematically, A converges to a diagonal matrix, thereby producing both the eigenvectors and the corresponding eigenvalues. The "plane" or Jacobi rotation is described by an orthogonal transformation matrix R_pq, in which all diagonal elements are unity except for the two elements c located at R_pp and R_qq, and all off-diagonal elements are zero except for s and −s located at R_pq and R_qp, respectively. The rotation is defined by the values c (cosine) and s (sine) of a free angular parameter φ. A rotation is performed by the matrix product

    A' = R_pq^T A R_pq.                               (2)

This rotation can be shown to preserve the eigenvalues of A and to allow for a simple recovery of the eigenvectors. A Jacobi rotation (p,q) changes only the p and q rows and columns of A. Solving Equation 2 we get, for r ≠ p, r ≠ q:

    a'_rp = c a_rp − s a_rq
    a'_rq = c a_rq + s a_rp
    a'_pq = 0.

The Jacobi method chooses the free angular parameter φ such that a'_pq = 0, which gives

    cot 2φ = θ = (a_qq − a_pp) / (2 a_pq),

where t = tan φ is the smaller root of t² + 2θt − 1 = 0, and τ (= tan(φ/2)) is defined by τ = s/(1 + c). In terms of s, t, and τ the updates become

    a'_rp = a_rp − s (a_rq + τ a_rp)                  (8)
    a'_rq = a_rq + s (a_rp − τ a_rq)                  (9)
    a'_pp = a_pp − t a_pq                             (10)
    a'_qq = a_qq + t a_pq                             (11)

Using the property that the matrix A is symmetric, the pattern of element updates induced by the similarity transformation R_{3,5}^T A R_{3,5} is depicted in Figure 1. These updated elements are denoted by a', with the (3,5) element zeroed by the appropriate choice of φ. When elements are zeroed in a strict cyclic order, the convergence of this method is quadratic for nondegenerate eigenvalues (i.e., eigenvalues that are not identical). Because the matrix A is symmetric, one sweep of the Jacobi method is applied to n(n − 1)/2 distinct off-diagonal elements. Furthermore, each rotation requires O(n) operations, so that the total computational complexity is of order n^3 for each sweep.
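As a concrete illustration of the update equations, the following Python sketch (our own, not the paper's Id code; the function name is hypothetical, and indices are 0-based rather than the 1-based indices of the text) applies a single Jacobi rotation to a symmetric matrix, using the numerically friendlier form of the updates with s and τ:

```python
import math

def jacobi_rotate(a, p, q):
    """Apply one Jacobi rotation zeroing a[p][q]; a is a symmetric matrix
    stored as a list of lists. Returns a new matrix (functional style)."""
    n = len(a)
    b = [row[:] for row in a]          # copy, as a functional language would
    if a[p][q] == 0.0:
        return b
    theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
    # t = tan(phi) is the smaller root of t^2 + 2*theta*t - 1 = 0
    t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
    c = 1.0 / math.hypot(t, 1.0)       # cosine
    s = t * c                          # sine
    tau = s / (1.0 + c)                # tan(phi/2)
    for r in range(n):
        if r != p and r != q:
            arp, arq = a[r][p], a[r][q]
            b[r][p] = b[p][r] = arp - s * (arq + tau * arp)
            b[r][q] = b[q][r] = arq + s * (arp - tau * arq)
    b[p][p] = a[p][p] - t * a[p][q]
    b[q][q] = a[q][q] + t * a[p][q]
    b[p][q] = b[q][p] = 0.0
    return b
```

A single call zeroes the chosen off-diagonal pair while preserving the trace and Frobenius norm, the invariants of an orthogonal similarity transformation.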

IMPLEMENTATIONS

A Row Major Order Implementation
In the following implementations of the Jacobi algorithm, A stands for the input matrix, D for the diagonal elements that will be converted into eigenvalues by a number of rotations, and V for the matrix that will be converted from an identity matrix into the matrix of eigenvectors.
A sequential implementation of Jacobi's algorithm performs sweeps of rotations around points in the upper triangle in row major order, until the sum of the absolute values of the upper triangle of the matrix is sufficiently small. In the sketch of the main program, some of the details concerned with not rotating around a point that is relatively small are left out. The while and for loops in this Id code have the standard meaning. A nextified value, such as next A, receives a new value for A in every loop iteration. The finally construct yields the last value produced in a loop. The function Rotate does the actual work. Rotate computes the values for s = sin φ, t = sin φ/cos φ, and τ = s/(1 + c), as defined in the previous section, and with these values it creates next versions of A, V, and D. In the following sketch of the function Rotate, only the creation of the next value of A using Equations 8 and 9 is considered (Equations 10 and 11 are used in the computation of next D), and again complications concerning small values are left out:

    def Rotate A p q =
      { % compute s and tau
        def e8 p1 p2 = p1 - s*(p2 + tau*p1);
        ... };

The above Id code uses an array comprehension in which the dimensionality and bounds of the array are first defined, followed by a number of region definitions. This first implementation closely follows the mathematics of the Jacobi transformation and allows for the natural exploitation of parallelism. The problem is that this algorithm is too inefficient.
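The overall sweep structure can be mirrored in Python. The sketch below is our stand-in for the Id main program, not a translation of it: names are ours, the small-element test is again omitted, and the convergence test is the sum of absolute values of the upper triangle, as described above.

```python
import math

def jacobi_eigenvalues(a, tol=1e-12, max_sweeps=50):
    """Cyclic row-major Jacobi: sweep over the upper triangle until the
    sum of |off-diagonal| entries is below tol. Returns sorted eigenvalues."""
    a = [row[:] for row in a]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(abs(a[p][q]) for p in range(n) for q in range(p + 1, n))
        if off < tol:
            break
        for p in range(n):
            for q in range(p + 1, n):        # row major order within the sweep
                if a[p][q] == 0.0:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.hypot(t, 1.0)
                s = t * c
                tau = s / (1.0 + c)
                for r in range(n):
                    if r != p and r != q:
                        arp, arq = a[r][p], a[r][q]
                        a[r][p] = a[p][r] = arp - s * (arq + tau * arp)
                        a[r][q] = a[q][r] = arq + s * (arp - tau * arq)
                a[p][p] -= t * a[p][q]
                a[q][q] += t * a[p][q]
                a[p][q] = a[q][p] = 0.0
    return sorted(a[i][i] for i in range(n))
```

Unlike the Id version, this sketch updates the matrix in place; the functional version must instead produce a fresh next A in every rotation, which is precisely the copying cost discussed next.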
To update O(n) elements in A, Rotate performs O(n^2) work, most of which is just copying. This makes a sweep [involving O(n^2) rotations] an O(n^4) operation, which is one order of magnitude too high. A nonfunctional solution to this copying problem would be to use updatable (mutable) structures (M-structures in Id). This forces the programmer to leave the functional model of computation and rely on time-dependent and nondeterministic constructs. It is our view that these constructs should be used as much as possible at the compiler level, and as little as possible at the source code level. Examples of this approach are the build-in-place and update-in-place optimizations performed by the Sisal compiler [3].

Sameh's Parallel Group Rotations
A more parallel and at the same time more space-efficient implementation of the Jacobi algorithm allows several rotations to be performed concurrently. A group of rotations (p1,q1), ..., (pk,qk) is valid if each point (pi,qi) occupies its own row and column in the upper triangle of A. Clearly there cannot be more than ⌊n/2⌋ rotation points in a group. In a parallel rotation based on all rotation points of such a group, each element in the resulting matrix is affected by at most two rotation points. A set of groups partitions the upper triangle of a matrix if all points in the upper triangle are included in exactly one group. In [13] Sameh defines a minimal number of 2n − 1 groups of maximal size ⌊n/2⌋. These groups are essentially antidiagonals which wrap around the matrix boundaries. Sameh's group definition can be directly translated into the following function:

    def MakePQs n =
      { ...
        p = if ... ((2*m-2*k) < q) and (q <= (2*m-k-1)) then ((4*m-2*k) - q)
            else n;
        (i,j) = if p < q then (p,q) else (q,p);
        PQs[k,i] = (i,j); PQs[k,j] = (i,j);
        ...
        p = if ... then n
            else if ((2*m-k+1) <= q) and (q <= (4*m-2*k-1)) then ((4*m-2*k) - q)
            else ((6*m-2*k-1) - q);
        (i,j) = if p < q then (p,q) else (q,p);
        PQs[k,i] = (i,j); PQs[k,j] = (i,j) }

The following are Sameh's groups for n = 5 and n = 6:

    n = 5:  group 1: (1,4) (2,3)        n = 6:  group 1: (1,4) (2,3) (5,6)
            group 2: (1,2) (3,5)                group 2: (1,2) (3,5) (4,6)
            group 3: (1,5) (2,4)                group 3: (1,5) (2,4) (3,6)
            group 4: (1,3) (4,5)                group 4: (1,3) (4,5) (2,6)
            group 5: (2,5) (3,4)                group 5: (1,6) (2,5) (3,4)

Observe that for odd n, in the example n = 5, Sameh's group numbers start at ⌊(n − 1)/2⌋ and increment (modulo n) with ⌊(n − 1)/2⌋ in both row and column directions, and that for even n a last column is added with groups n − 1 down to 1. We have a proof of this observation in general [2], which is beyond the scope of this article. Thus MakePQs can be simplified, especially if we separate the cases for odd and even n:

    def MakeOddPQs n =
      { m = div (n-1) 2;
        PQs = 2D_I_array ((1,n), (1,n))
        in {for p <- 1 to n do
              {for q <- p+1 to n do
                 h = mod (m*p + m*q - 2*m) n;
                 k = if (h == 0) then n else h;
                 PQs[k,p] = (p,q); PQs[k,q] = (p,q)};
              PQs[p,n+1-p] = (0,0)}
           finally PQs};

    def MakeEvenPQs n =
      { m = div (n-1) 2; dn = n-1;
        PQs = 2D_I_array ((1,dn), (1,n))
        in {for p <- 1 to dn do
              {for q <- p+1 to dn do
                 h = mod (m*p + m*q - 2*m) dn;
                 k = if (h == 0) then dn else h;

                 PQs[k,p] = (p,q); PQs[k,q] = (p,q)};
              ...}
           finally PQs}

We have left the definition of PQs in MakeOddPQs and MakeEvenPQs in the form of a side-effecting loop, as an array comprehension would force us to compute k twice, once for the index expression [k,p] and once for [k,q].
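The group construction can be sketched in Python. This is our reconstruction of the simplified scheme (the function name make_groups is ours): the odd-sized core of the matrix is partitioned by the same (m·p + m·q − 2m) mod ngroups formula, and for even n each group's one free index is paired with the last column.

```python
def make_groups(n):
    """Rotation groups for an n x n symmetric matrix, mirroring the
    simplified MakeOddPQs/MakeEvenPQs: points are 1-based (p, q) pairs."""
    ngroups = n if n % 2 == 1 else n - 1    # ngroups is odd in both cases
    m = (n - 1) // 2
    groups = {k: [] for k in range(1, ngroups + 1)}
    for p in range(1, ngroups + 1):
        for q in range(p + 1, ngroups + 1):
            h = (m * p + m * q - 2 * m) % ngroups
            k = ngroups if h == 0 else h
            groups[k].append((p, q))
    if n % 2 == 0:
        # each group misses exactly one of the indices 1..n-1;
        # pair that free index with the last column n
        for pts in groups.values():
            (free,) = set(range(1, n)) - {i for pt in pts for i in pt}
            pts.append((free, n))
    return groups
```

The two defining properties are easy to check mechanically: the groups partition the upper triangle, and the points of one group occupy pairwise disjoint rows and columns, so their rotations can run concurrently.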

Implementing the Group Rotations
Sameh uses the groups defined in the previous section to create an orthogonal transformation Qk for each group, consisting of the sines and cosines of the various φs, which occupy disjoint elements of the transformation matrix, and then performs the transformation using a matrix product as in Equation 2. As there are 2n − 1 groups, this method requires O(n) matrix multiplications, which renders the complexity of one sweep O(n^4).
We now present a new, demand-driven implementation of the parallel group rotations algorithm that requires only O(n^3) operations. Instead of forming a transformation matrix and performing a matrix product, we register with each element of the transformed matrix A' the two corresponding rotations that affect it, and perform those rotations in row major order, guaranteeing that for the two elements modified by the same rotations, the rotation orders are the same.
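A Python sketch of a group-based sweep conveys the idea; it is ours, not a translation of the Id RotDemand code, and it uses antidiagonal residue classes as groups, a simplification of Sameh's numbering. The rotation angles of a whole group are computed from the current matrix before any of its updates are applied. This is safe because the group's points occupy disjoint rows and columns: no rotation in the group disturbs another group member's pivot element.

```python
import math

def _groups(n):
    """Antidiagonal wraparound groups (1-based points); within a group all
    points have disjoint rows and columns, so their rotations do not conflict."""
    ngroups = n if n % 2 == 1 else n - 1    # ngroups is always odd
    gs = {c: [] for c in range(ngroups)}
    for p in range(1, ngroups + 1):
        for q in range(p + 1, ngroups + 1):
            gs[(p + q) % ngroups].append((p, q))
    if n % 2 == 0:                          # pair each group's free index with column n
        for pts in gs.values():
            (free,) = set(range(1, n)) - {i for pt in pts for i in pt}
            pts.append((free, n))
    return list(gs.values())

def group_jacobi_eigenvalues(a, tol=1e-12, max_sweeps=100):
    a = [row[:] for row in a]
    n = len(a)
    groups = _groups(n)
    for _ in range(max_sweeps):
        if sum(abs(a[p][q]) for p in range(n) for q in range(p + 1, n)) < tol:
            break
        for group in groups:
            # compute all angles of the group from the current matrix ...
            rots = []
            for (p1, q1) in group:
                p, q = p1 - 1, q1 - 1       # 1-based points -> 0-based indices
                if a[p][q] != 0.0:
                    theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                    t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                    c = 1.0 / math.hypot(t, 1.0)
                    s = t * c
                    rots.append((p, q, t, s, s / (1.0 + c)))
            # ... then apply them; disjointness keeps each pivot a[p][q]
            # unchanged until its own rotation runs, so it is zeroed exactly
            for (p, q, t, s, tau) in rots:
                for r in range(n):
                    if r != p and r != q:
                        arp, arq = a[r][p], a[r][q]
                        a[r][p] = a[p][r] = arp - s * (arq + tau * arp)
                        a[r][q] = a[q][r] = arq + s * (arp - tau * arq)
                a[p][p] -= t * a[p][q]
                a[q][q] += t * a[p][q]
                a[p][q] = a[q][p] = 0.0
    return sorted(a[i][i] for i in range(n))
```

Each group performs O(n) rotations of O(n) work each, and a sweep visits O(n) groups, so a sweep costs O(n^3) without any matrix products.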

We will derive this algorithm in two transformation steps from the row major implementation. In the first transformation step we turn the row major implementation function Rotate into the demand-driven form RotDemand, in which the effect of one rotation point (p,q) is computed for each element of the result array.

PRELIMINARY MONSOON PERFORMANCE
This section provides some preliminary performance results. A more complete study of the performance of the algorithms, written in Id and Sisal and running on parallel platforms, still needs to be done. The run-time behavior of Jacobi algorithms is highly data dependent. Also, the convergence rate depends on the order in which the rotations take place. Section 5 discusses convergence issues further. The order of the rotations differs in the two implementations, and thus the number of rotations and sweeps needed to converge may differ. This is exemplified by Table 1, which contains simulation results of both Jacobi implementations, run for a matrix A with 1.0 on the diagonal and A[i,j] = i + j off the diagonal. "Instr" stands for the number of instructions executed, "Rots" for the number of rotations performed, and "Sweeps" for the number of sweeps performed. The diagonal elements in this particular type of input matrix are smaller than the off-diagonal elements, which gives rise to relatively slow convergence. Notice that the row major order algorithm performs many fewer rotations than the group rotation algorithm, but that the group rotation algorithm still executes fewer instructions most of the time (except in the cases n = 8 and n = 10). Clearly, a more thorough study of the convergence characteristics of the two algorithms is needed; however, that is beyond the scope of this article. The next section provides an outline of the convergence proof for these algorithms.

NUMERICAL CONVERGENCE
The convergence of the Jacobi method is dependent on the ordering of the rotations applied to each of the off-diagonal elements of the matrix.
With (n(n − 1)/2)! ways of choosing the updating order for the Jacobi method, it is important that a particular class of rotation orderings can be guaranteed to converge. One such ordering is the cyclic row major ordering, in which the first rotation in the sweep is (1,2). Forsythe and Henrici [4] have proved the convergence of the cyclic row major ordering algorithm via the following theorem:

THEOREM: Let a sequence of Jacobi transformations be applied to a symmetric matrix A. Further, let the angle φ_k be restricted to |φ_k| ≤ π/4. If the off-diagonal elements are annihilated using a cyclic row major ordering, then this Jacobi method converges.

Building on this theorem, Shroff and Schreiber [14] introduce the notions of "cyclic wavefront" orderings and "weak equivalent ordering" to prove that a modulus ordering of the form

    1 2 3 4
      3 4 5
        5 1
          2

converges. In the modulus ordering, rotations are sequenced according to their group number l(p,q), where l(p,q) = ((p + q − 3) mod n) + 1 for odd n, and l(p,q) = ((p + q − 3) mod (n − 1)) + 1 for even n. Inside a group, the rotations are sequenced according to row major ordering. Shroff and Schreiber show that, if an ordering X can be transformed into the modulus ordering by a permutation of the array indices, then X gives rise to a converging Jacobi algorithm. As an example, for N = 6, the permutation of the array indices (1 2 3 4 5 6) → (1 3 5 2 4 6) transforms Sameh's ordering, when symmetry is taken into account, into the modulus ordering.
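The modulus ordering is easy to state in code. The following Python helpers (the names are ours) compute l(p,q) and the induced rotation sequence, with rotations inside a group ordered row-major:

```python
def modulus_group(p, q, n):
    """Group number l(p,q) of the modulus ordering, 1-based indices."""
    base = n if n % 2 == 1 else n - 1
    return ((p + q - 3) % base) + 1

def modulus_ordering(n):
    """All upper-triangle points sequenced by (group, row-major within group)."""
    pts = [(p, q) for p in range(1, n) for q in range(p + 1, n + 1)]
    return sorted(pts, key=lambda pq: (modulus_group(pq[0], pq[1], n), pq))
```

For n = 5 this reproduces the triangular table above, e.g. l(1,2) = 1 and l(3,5) = 1, so the sequence opens with the rotations (1,2) and (3,5).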

A STRICT FUNCTIONAL IMPLEMENTATION OF GROUP ROTATIONS
In Id's array comprehensions, the target of a certain array element expression can be explicitly specified in a region definition. Moreover, Id's I-structures allow array-element assignments A[i] = u to be scattered over the program. Thus, in Id there is no relation between the process defining an array element and the placement of that element. As Id data structures are nonstrict, it is not even necessary that all elements of an array be defined. It is in general not decidable whether an Id array is uniquely defined, i.e., all elements are defined at most once, or completely defined, i.e., all elements are defined exactly once. For array comprehensions with linear expressions defining the array dimensions, array-element targets, and bounds of the generators, uniqueness and completeness can be checked [5].
In contrast, the strict functional programming language Sisal allows array creation in a loop only monolithically and in an implicitly defined order: loop body (i1, ..., in) of an n-deep nested loop defines element (i1, ..., in) of an n-dimensional array. For example,

    for i in l1,u1 cross j in l2,u2
      returns array of f(i,j)
    end for

creates a two-dimensional array with bounds ((l1,u1), (l2,u2)) and f(i,j) at position (i,j). We call this the strict loop order property of Sisal; it statically guarantees that arrays are uniquely and completely defined. This property allows efficient array allocation and creation at the cost of some loss in expressiveness. As an example, translating the Id functions GroupRot, rot1, and rot2 into Sisal is straightforward, whereas translating the Id function MakePQs into Sisal is not, as it relies on a nonmonolithic computation of one I-structure (in this case scattered over three loops), and on side-effecting assignments with computation of the target index.
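The strict loop order property can be mimicked with a nested comprehension. This Python analogue (our illustration, with a hypothetical function name) of the Sisal example produces element (i,j) in iteration (i,j), so every element is defined exactly once by construction:

```python
def monolithic_2d(l1, u1, l2, u2, f):
    """Analogue of Sisal's
        for i in l1,u1 cross j in l2,u2 returns array of f(i,j) end for
    -- element (i,j) is produced by iteration (i,j), so the array is
    uniquely and completely defined by construction."""
    return [[f(i, j) for j in range(l2, u2 + 1)] for i in range(l1, u1 + 1)]
```

There is no way to express, in this form, a computation whose target index k is itself computed in the loop body, which is exactly the difficulty MakePQs poses.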
There are four approaches in Sisal to create arrays such as PQs, where there is no direct relation between the order of creation and the place in the array.
1. Devise a new implementation of the algorithm that adheres to the strict loop ordering. This often requires a more thorough understanding of the algorithm being implemented, and often helps the programmer to come to a more elegant solution of the problem. It turns out that it is possible to change MakePQs so that it adheres to the strict loop order.

2. Create the array in a sequential loop by subsequent array updates of the form

    PQs := old PQs[k, i: tuple][k, j: tuple]

relying on the update-in-place optimization of the Sisal compiler to create only one array and perform destructive updates. In this form the creation of the array is completely sequential.
3. The second approach can sometimes be made more parallel when rows (or, more generally, subarrays) can be created in parallel and concatenated together, exploiting the build-in-place optimization. This can be done for MakePQs, as the elements are placed in the correct rows, but not in the strict loop order inside a row.

4. Create an intermediate array of (value, index) tuples in any order, followed by a forall loop that places the values in the array by searching in the intermediate array. This very general approach exploits all available parallelism, but is costly in instruction count and space usage, as a search is involved and an intermediate array is built. In our case the intermediate array would have the correct row order but not the correct element order per row. Also, it would not be necessary to store the row index i with the element (p,q), as an element (p,q) of the intermediate array must be placed in both positions (k,p) and (k,q) of the final array PQs.

(A region definition of the form [target] = expression || generators is equivalent to the loop for generators do array[target] = expression. A generator of the form ... indicates a nested loop.)

The design of the Sisal function MakePQs is based on the following observations.

1. If n is odd, then the branches assigning p = n are never taken, as 2m − k + 1 = n − k and 4m − n − k = 2m − k − 1.

2. All other expressions in the different conditional branches assigning a value to p are equal modulo n.

To perform the rotations of one of Sameh's groups we define a table PQs, where row PQs_k defines the k-th group rotation, such that PQs[k,i] and PQs[k,j] contain the points affecting A'[i,j]. A tuple (0,0) in PQs[k,i] signifies that there is no rotation in row or column i in group k. The assignments to the array elements of PQs in the function MakePQs, and also in MakeOddPQs and MakeEvenPQs, accomplish the creation of PQs, which is constant throughout the computation and is created once. The PQs arrays for n = 5 and n = 6 are:

    n = 5:
      k = 1:  (1,4) (2,3) (2,3) (1,4) (0,0)
      k = 2:  (1,2) (1,2) (3,5) (0,0) (3,5)
      k = 3:  (1,5) (2,4) (0,0) (2,4) (1,5)
      k = 4:  (1,3) (0,0) (1,3) (4,5) (4,5)
      k = 5:  (0,0) (2,5) (3,4) (3,4) (2,5)

    n = 6:
      k = 1:  (1,4) (2,3) (2,3) (1,4) (5,6) (5,6)
      k = 2:  (1,2) (1,2) (3,5) (4,6) (3,5) (4,6)
      k = 3:  (1,5) (2,4) (3,6) (2,4) (1,5) (3,6)
      k = 4:  (1,3) (2,6) (1,3) (4,5) (4,5) (2,6)
      k = 5:  (1,6) (2,5) (3,4) (3,4) (2,5) (1,6)

    def GroupRot A V D PQs k N =
      { Ts, Taus, Ss = MakeTsTausSs A D PQs k N
        in { matrix ((1,N), (1,N)) of
             ...
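The PQs table itself can be reconstructed in Python from the group formula used in MakeOddPQs and MakeEvenPQs. This is our sketch (make_pqs is a hypothetical name); it returns, for each group k and index i, the rotation point of group k touching row or column i, or (0,0) when there is none:

```python
def make_pqs(n):
    """Build PQs: pqs[k][i] is the rotation point (p,q) of group k whose
    row or column is i, or (0,0) when group k has no rotation touching i.
    All indices are 1-based, as in the Id code."""
    ngroups = n if n % 2 == 1 else n - 1
    m = (n - 1) // 2
    pqs = {k: {i: (0, 0) for i in range(1, n + 1)} for k in range(1, ngroups + 1)}
    for p in range(1, ngroups + 1):
        for q in range(p + 1, ngroups + 1):
            h = (m * p + m * q - 2 * m) % ngroups
            k = ngroups if h == 0 else h
            pqs[k][p] = pqs[k][q] = (p, q)
    if n % 2 == 0:
        # pair each group's single free index with the last column n
        for k in range(1, ngroups + 1):
            (free,) = {i for i in range(1, n) if pqs[k][i] == (0, 0)}
            pqs[k][free] = pqs[k][n] = (free, n)
    return pqs
```

The table is constant throughout the computation and is built once; a transformed element A'[i,j] then looks up its at most two affecting rotations as pqs[k][i] and pqs[k][j].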