A Linear Algebra Framework for Static High Performance Fortran Code Distribution*

High Performance Fortran (HPF) was developed to support data parallel programming for single-instruction multiple-data (SIMD) and multiple-instruction multiple-data (MIMD) machines with distributed memory. The programmer is provided with a familiar uniform logical address space and specifies the data distribution by directives. The compiler then exploits these directives to allocate arrays in the local memories, to assign computations to elementary processors, and to migrate data between processors when required. We show here that linear algebra is a powerful framework to encode HPF directives and to synthesize distributed code with space efficient array allocation, tight loop bounds, and vectorized communications for INDEPENDENT loops. The generated code includes traditional optimizations such as guard elimination, message vectorization and aggregation, and overlap analysis. The systematic use of an affine framework makes it possible to prove the compilation scheme correct.


INTRODUCTION
Distributed memory multiprocessors can be used efficiently if each local memory contains the right pieces of data, if local computations use local data, and if missing pieces of data are quickly moved at the right time between processors. Macro packages and libraries are available to ease the programmer's burden, but the level of detail still required transforms the simplest algorithm, e.g., a matrix multiply, into hundreds of lines of code. This fact decreases programmer productivity and jeopardizes portability, as well as the economical survival of distributed memory parallel machines.
Manufacturers and research laboratories, led by Digital and Rice University, decided in 1991 to shift part of the burden to compilers by providing the programmer a uniform address space to allocate objects and a (mainly) implicit way to express parallelism.
Numerous research projects [1-3] and a few commercial products had shown that this goal could be achieved, and the High Performance Fortran (HPF) Forum was set up to select the most useful functionalities and to standardize the syntax. The initial definition of the new language, HPF, was frozen in May 1993 and corrections were added in November 1994 [4].
    ! A non-independent A in the rhs may induce RW dependences:
    !   FORALL (i=1:n, j=1:m, MASK(i,j)) A(i,j) = f(A, ...)
    ! An array TMP is declared and mapped as A; A is first copied
    ! into TMP because of the potential RW dependences.
    INDEPENDENT(j,i)
    do j=1, m
      do i=1, n
        TMP(i,j) = A(i,j)
      enddo
    enddo
    ! possible synchronization ...
    INDEPENDENT(j,i)
    do j=1, m
      do i=1, n
        if (MASK(i,j)) A(i,j) = f(TMP, ...)
      enddo
    enddo

FIGURE 1 Masked FORALL to INDEPENDENT loops.
This article deals with this HPF static subset and shows how changes of basis and affine constraints can be used to relate the global memory and computation spaces seen by the programmer to the local memory and computation spaces allocated to each elementary processor. These relations, which depend on the HPF directives added by the programmer, are used to allocate local parts of global arrays and the temporary copies which are necessary when nonlocal data is used by local computations. These constraints are also used in combination with the owner-computes rule to decide which computations are local to a processor, and to derive loop bounds. Finally, they are used to generate the send and receive statements required to access nonlocal data.
These three steps, local memory allocation, local iteration enumeration, and data communication, are put together as a general compilation scheme for parallel loops, known as INDEPENDENT in HPF, with affine bounds and subscript expressions. HPF's FORALL statements or constructs, as well as a possible future ON extension to advise the compiler about the distribution of iterations onto the processors, can be translated into a set of independent loops by introducing a temporary array, mapped as required, to store the intermediate results. These translations are briefly outlined in Figures 1 and 2. The resulting code is a pair of loops which can be compiled by our scheme, following the owner-computes rule, if the ON clause is put into the affine framework. The FORALL translation requires a temporary array due to its single-instruction multiple-data (SIMD)-like semantics. However, if the assigned array is not referenced in the rhs, the FORALL loop is independent and should be tagged as such to fit our scheme directly. Such necessary temporary arrays are not expected to cost much, from both the compilation and execution points of view: the allocated memory is reusable (it may be allocated on the stack), and the copy assignment on local data should be quite fast.
This compilation scheme directly generates optimized code which includes techniques such as guard elimination [1], message vectorization, and aggregation [2, 3]. It is compatible with overlap analysis [1].
There are no restrictions either on the kind of distribution (general cyclic distributions are handled) or on the rank of array references (the dimension of the referenced space: for instance, the rank of A(i,i) is 1). The memory allocation part, whether based on overlap extensions or dealing with temporary arrays needed to store both remote and local elements, is independent of parallel loops and can always be used. The relations between the global programmer space and the local processor spaces can also be used to translate sequential loops with a run-time resolution mechanism or with some optimizations. The reader is assumed knowledgeable in HPF directives [4] and optimization techniques [1, 3].
The article is organized as follows. Section 2 shows how HPF directives can be expressed as affine constraints and normalized to simplify the compilation process and its description. Section 3 presents an overview of the compilation scheme and introduces the basic sets Own, Compute, Send, and Receive that are used to allocate local parts of HPF arrays and temporaries, to enumerate local iterations, and to generate data exchanges between processors. Section 4 refines these sets to minimize the amount of memory space allocated, to reduce the number of loops whenever possible, and to improve the communication pattern. This is achieved by using different coordinates to enumerate the same sets. Examples are shown in Section 5 and the method is compared with previous work in Section 6.

HPF DIRECTIVES
The basic idea of this work was to show that the HPF compilation problem can be put into a linear form, including both equalities and inequalities, then to show how to use polyhedron manipulation and scanning techniques to compile an HPF program from this linear form.The linear representations of the array and HPF declarations, the data mapping, and the loop nest accesses are presented in this section.
HPF specifies data mappings in two steps. First, the array elements are aligned with a template, an abstract grid used as a convenient way to relate different arrays together. Each array element is assigned to at least one template cell through the ALIGN directive. Second, the template is distributed onto the processors, an array of virtual processors. Each template cell is assigned to one and only one processor through the DISTRIBUTE directive. The template and processors are declared with the TEMPLATE and PROCESSORS directives, respectively.
Elements of arrays aligned with the same template cell are allocated on the same elementary processor. Expressions using these elements can be evaluated locally, without interprocessor communications. Thus, the alignment step mainly depends on the algorithm. The template elements are packed in blocks to reduce communication and scheduling overheads without increasing load imbalance too much. The block sizes depend on the target machine, while the load imbalance stems from the algorithm. Templates can be bypassed by aligning an array on another array and by distributing arrays directly on processors. This does not increase the expressiveness of the language but implies additional checks on HPF declarations. Templates are systematically used in this article to simplify algorithm descriptions. Our framework deals with both stages and could easily tackle direct alignment and direct distribution. The next sections show that any HPF directive can be expressed as a set of affine constraints.

Notations
Throughout this article, a lower case letter such as v denotes a vector of integers, whose components may be variables or constants. v_i (i ≥ 0) is the ith element or variable of vector v. Subscript 0, as in v_0, denotes a constant integer vector. As a convention, a denotes the variables which describe the elements of an array, t is used for templates, and p for processors. An upper case letter such as A denotes a constant integer matrix. Constants are implicitly expanded to the required number of dimensions: for instance, 1 may denote a vector of 1's. |A| denotes the determinant of matrix A.

Declarations
The data arrays, templates, and processors are declared as Cartesian grids in HPF. If a is the vector of variables describing the dimensions of array A(3:8, -2:5, 7), then the following linear constraints are induced on a:

    3 ≤ a_1 ≤ 8,   -2 ≤ a_2 ≤ 5,   1 ≤ a_3 ≤ 7    (1)

These may be translated into the matrix form D_A a ≤ d, where D_A stacks an identity and a negated identity matrix, and d the corresponding upper and negated lower bounds.
Any valid array element must verify the linear constraints, i.e., A(a_1, a_2, a_3) is a valid array element if Equation 1 is verified by vector a. In the rest of the article it is assumed, without loss of generality, that the dimension lower bounds are equal to 0. This assumption simplifies the formulae by deleting a constant offset at the origin. Thus, the declaration constraints for A(0:5, 0:7, 0:6) can be written intuitively as 0 ≤ a < D·1, where D is a diagonal matrix composed of the sizes of A. The rationale for storing the sizes in a diagonal matrix instead of a vector is that this form is needed for the distribution formula (see the Distribution section below). Likewise, the template declaration constraints are 0 ≤ t < T·1 and the processor constraints 0 ≤ p < P·1.
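The declaration constraints and their normalization can be sketched in a few lines of Python (used here only for illustration; the helper names `valid` and `normalize` are ours, not the article's):

```python
bounds = [(3, 8), (-2, 5), (1, 7)]            # A(3:8, -2:5, 7)
sizes = [u - l + 1 for l, u in bounds]        # diagonal of D: [6, 8, 7]

def valid(a):                                 # Equation 1: l_i <= a_i <= u_i
    return all(l <= x <= u for x, (l, u) in zip(a, bounds))

def normalize(a):                             # shift each lower bound to 0
    return [x - l for x, (l, u) in zip(a, bounds)]

a = [5, 0, 7]
assert valid(a)
assert all(0 <= x < s for x, s in zip(normalize(a), sizes))  # 0 <= a < D.1
```

After normalization only the diagonal size matrix D remains, which is the form used by the distribution formula below.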

Alignment
The ALIGN directive is used to specify the alignment relation between one object and one or many other objects through an affine expression. The alignment of an array A on a template T is an affine mapping from a to t. Alignments are specified dimension-wise with integer affine expressions as template subscript expressions. Each array index can be used at most once in a template subscript expression in any given alignment, and each subscript expression cannot contain more than one index [4]. Let us consider the following HPF alignment directive, where the first dimension of the array is collapsed and the most general alignment subscripts are used for the other dimensions:

    align A(*, i, j) with T(2*j-1, -i+7)

It induces the following constraints between the dimensions of A and T, represented respectively by vectors a and t:

    t_1 = 2 a_3 - 1,   t_2 = -a_2 + 7

Thus the relation between A and T is given by t = A a + s_0, where A (for alignment) is a matrix with at most one nonzero integer element per column (dummy variables must appear at most once in the alignment subscripts) and per row (no more than one dummy variable in an alignment subscript), and s_0 is a constant vector that denotes the constant shift associated to the alignment. Note that since the first dimension of the array is collapsed, the first column of A is null, and that the remaining columns constitute an antidiagonal matrix since i and j are used in reverse order in the subscript expressions. However, this representation cannot express the replication of elements on a dimension of the template, as allowed by HPF. The former relation is generalized by adding a projection matrix R (for replication) which selects the dimensions of the template that are actually aligned:

    R t = A a + s_0    (2)

    template T(0:99), T'(0:99)
    processors P(0:4)
    distribute T(block(20)), T'(cyclic(1)) onto P

FIGURE 3 Example 1.
The same formalism can also deal with a chain of alignment relations (A is aligned to B which is aligned to C, etc.) by composing the alignments. To summarize, a "*" in an array reference induces a zero column in A, and a "*" in a template reference a zero column in R. When no replication is specified, R is the identity matrix. The number of rows of R and A corresponds to the number of template dimensions that are actually aligned, thus leading to equations linking template and array dimensions. Template dimensions with no matching array dimensions are removed through R.
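The alignment of the example above can be checked mechanically. The following Python sketch (illustrative only; the names `A_align`, `s0`, and `align` are ours) encodes t = A a + s_0 for `align A(*, i, j) with T(2*j-1, -i+7)`:

```python
A_align = [[0, 0, 2],            # t1 = 2*a3 - 1; null first column (collapsed)
           [0, -1, 0]]           # t2 = -a2 + 7; antidiagonal remainder
s0 = [-1, 7]

def align(a):
    """Map an array element a = (a1, a2, a3) to its template cell t."""
    return [sum(c * x for c, x in zip(row, a)) + s
            for row, s in zip(A_align, s0)]

assert align([0, 1, 1]) == [1, 6]   # A(*,1,1) sits on T(1,6)
assert align([9, 1, 1]) == [1, 6]   # the collapsed first index is ignored
```

Varying the first index leaves t unchanged, which is exactly what the zero column expresses.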

Distribution
Once optional alignments are defined in HPF, objects or aligned objects can be distributed on Cartesian processor arrangements, declared like arrays and templates, and represented by the inequality 0 ≤ p < P·1.
Each dimension can be distributed by block or in a cyclic way on the processor set, or even collapsed on the processors. Let us first consider Examples 1 and 2 in Figures 3 and 4:

Example 1
In Example 1, the distributions of the templates create a link between the template elements t and the processors p, so that each processor owns some of the template elements. These links can be translated into linear formulae with the addition of new variables. For the block distribution of template T, introducing the local offset l within the size-20 block, 0 ≤ l < 20, the formula simply is:

    t = 20 p + l

For a fixed processor p and the allowed range of offsets l within a block, the formula gives the corresponding template elements that are mapped on that processor. There is no constant in the equality due to the assumption that template and processor dimensions start from 0. For the cyclic distribution of template T', a cycle variable c counts the number of cycles over the processors for a given template element. Thus, the formula is:

    t' = 5 c + p

The cycle number c generates an initial offset on the template for each cycle over the processors. Then the p translation associates the processor's owned template element for that cycle. The general cyclic(n) multidimensional case necessitates both cycle c and local offset l variable vectors, and can be formulated with a matrix expression to deal with all dimensions at once.

    template T(0:99,0:99,0:99)
    processors P(0:9,0:9)
    distribute T(*,cyclic(4),block(13)) onto P

FIGURE 4 Example 2.
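Both formulas of Example 1 can be verified against the usual division/modulo view of the distributions. A small Python sketch (illustrative; the helper names are ours):

```python
def block20(t):
    """T distributed block(20) onto P(0:4): t = 20 p + l, 0 <= l < 20."""
    return t // 20, t % 20              # (p, l)

def cyclic1(t):
    """T' distributed cyclic(1) onto P(0:4): t = 5 c + p."""
    return t // 5, t % 5                # (c, p)

for t in range(100):
    p, l = block20(t)
    assert t == 20 * p + l and 0 <= l < 20 and 0 <= p < 5
    c, p = cyclic1(t)
    assert t == 5 * c + p and 0 <= p < 5
```

The point of the affine formulation is precisely to avoid such modulo and division operators in the generated code, replacing them by linear equalities and bound constraints.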

General Case
HPF allows a different block size for each dimension.
The extents (n in BLOCK(n) or CYCLIC(n)) are stored in a diagonal matrix C. P is a square matrix with the sizes of the processor dimensions on the diagonal. Such a distribution is not linear according to its definition [4], but may be written as a linear relation between the processor coordinate p, the template coordinate t, and two additional variables, l and c:

    Π t = C P c + C p + l,   0 ≤ l < C·1    (3)

Vector l represents the offset within one block on one processor, and vector c represents the number of wraparounds (the sizes of which are given by C P) that must be performed to allocate blocks cyclically on processors, as shown in Figure 5 for the example in Figure 7.
The projection matrix Π is needed when some dimensions are collapsed on processors, which means that all the elements on such a dimension are on the same processor. These dimensions are thus orthogonal to the processor parallelism and can be eliminated. The usual modulo and integer division operators dealing with the block size are replaced by predefined constraints on p and additional constraints on l. Vector c is implicitly constrained by array declarations and/or (depending on the replication) by the TEMPLATE declaration.
Specifying BLOCK(n) in a distribution instead of CYCLIC(n) is equivalent, but tells the compiler that the distribution will not wrap around, so that Equation 3 can be simplified by zeroing the c coordinate for this dimension. This additional equation reduces the number of free variables and, hence, removes a loop in the generated code. CYCLIC is equivalent to CYCLIC(1).
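In one dimension, Equation 3 reduces to t = nPc + np + l with 0 ≤ l < n. The Python sketch below (our own illustration, with a hypothetical `distribute` helper) checks this against the division/modulo formulation, and checks the BLOCK simplification c = 0:

```python
def distribute(t, n, P):
    """CYCLIC(n) onto P processors in one dimension:
    t = n*P*c + n*p + l with 0 <= l < n, 0 <= p < P."""
    block = t // n                       # index of the size-n block
    return block % P, block // P, t % n  # (p, c, l)

n, P = 4, 10                             # CYCLIC(4) onto 10 processors
for t in range(100):
    p, c, l = distribute(t, n, P)
    assert t == n * P * c + n * p + l and 0 <= l < n and 0 <= p < P

# BLOCK(13) onto 10 processors never wraps around on a 0:99 template,
# so the c coordinate can be zeroed, as noted above:
assert all(distribute(t, 13, 10)[1] == 0 for t in range(100))
```

The numbers match the second and third dimensions of Example 2.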

Example 2
Let us consider the second, more general example, in Figure 4. In this example, the distribution is of type BLOCK in the last dimension, hence the added equation c_2 = 0. If a dimension of a template is collapsed on the processors, the corresponding coordinate is useless: even if it is used to distribute the dimension of an array, this dimension is collapsed on the processors. Thus, it can be discarded from the equations. This means that it can be assumed that Π = I in Equation 3, provided these useless dimensions are removed by a normalization process.

Iteration Domains and Subscript Functions
Although iteration domains and subscript functions do not directly concern the placement and the distribution of the data in HPF, they must be considered for the code generation. Since loops are assumed INDEPENDENT with affine bounds, they can be represented by a set of inequalities on the iteration vector i. The original enumeration order is not kept by this representation, but the independence of the loop means that the result of the computation does not depend on this very order. The array reference links the iteration vector to the array dimensions.
Let us consider for instance the example in Figure 6. The iteration domain can be translated into a parametric form (n_1 and n_2 may not be known at compile-time), where the constant vector is a function of the parameters. The set of constraints for the loop nest iterations is obtained with the additional change of variable j = 3 i_2 + 1 to normalize the loop [18]. Moreover, the array reference can also be translated into a set of linear equalities. In the general case, a set of constraints is derived for the iteration domain of the loop nest (Equation 5, where n stands for the vector of parameters of the loop nest) and for the array references (Equation 6).
The array mapping of A is represented by the "o" and "•" marks in Figure 8, together with the written elements. As can be seen on this two-dimensional (2D) representation of a 3D space, the mapping and the accesses follow a regular pattern, namely a lattice. This is exploited to translate the compilation problem into a linear algebra framework.

OVERVIEW OF THE COMPILATION SCHEME
The compilation scheme is based on the HPF declarations, the owner-computes rule, and the single-program multiple-data (SPMD) paradigm, and deals with INDEPENDENT loops. Loop bound and array subscript expressions are assumed affine. Such a parallel loop independently assigns elements of some left-hand side X with some right-hand side expressions f on variables Y, Z, etc. (Fig. 9). The loop is normalized to use unit steps, as explained in the previous section.
Since the loop is independent, S_X must be injective on the iteration domain for the parallel loop to be deterministic. Vector n models external run-time parameters that may be involved in some subexpressions, namely the loop bounds and the reference shifts.
The HPF standard does not specify the relationship between the HPF processors and the actual physical processors. Thus, efficiently compiling loop nests involving arrays distributed onto different processor sets without further specification is beyond the scope of this article. But if the processor virtualization process can be represented by affine functions, which would probably be the case, we can encompass these different processor sets in our framework by remaining on the physical processors. In the following, we assume for clarity that the interfering distributed arrays in a loop nest are distributed on the same processor set.

Own Set
The subset of X that is allocated on processor p according to the HPF directives must be mapped onto an array of the output program. Let Own_X(p) be this subset of X and X' the local array mapping this subset.

FIGURE 8 Example of an array mapping and writing. (From Chatterjee et al. [17], with permission.)
Each element of Own_X(p) must be mapped to one element of X', but some elements of X' may not be used. Finding a good mapping involves a tradeoff between memory space usage and access function complexity. This is studied in Section 4.3. Subset Own_X(p) is derived from the HPF declarations, expressed in the affine formalism of Section 2, as:

    Own_X(p) = {a | ∃t, ∃c, ∃l s.t. R t = A a + s_0, Π t = C P c + C p + l,
                0 ≤ l < C·1, 0 ≤ a < D·1, 0 ≤ t < T·1}

Subset Own_X(p) can be mapped onto a Fortran array by projection on c and l, or on any equivalent set of variables, i.e., up to an injective mapping. Although Own is not a dense polyhedron as defined here, it can be seen as such in the higher dimensional (a, c, l) space. Thus, this view is used in the following, although our interest is just to represent the array element a. Note that for a given distributed dimension a_i there is one and only one corresponding (p_i, c_i, l_i) triplet.
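For a one-dimensional array, Own_X(p) can be enumerated by brute force. This Python sketch (illustrative sizes, identity alignment assumed, names ours) checks that the Own sets partition the array:

```python
def own(p, size=100, n=4, P=5):
    """Own(p) for a 1D array aligned identically with its template,
    distributed CYCLIC(4) onto 5 processors (illustrative numbers)."""
    return {a for a in range(size) if (a // n) % P == p}

parts = [own(p) for p in range(5)]
assert set().union(*parts) == set(range(100))   # every element is owned
assert sum(len(s) for s in parts) == 100        # by exactly one processor
```

In the compiler, the same set is of course described symbolically by the polyhedron above, not enumerated.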

Compute Set
Using the owner-computes rule, the set of iterations local to p, Compute(p), is directly derived from the previous set, Own_X(p). The iterations to be computed by a given processor are those of the loop nest for which the lhs elements are owned by the processor:

    Compute(p) = {i | S_X i + a_X0 ∈ Own_X(p)}

Note that Compute(p) and Own_X(p) are equivalent sets if the reference is direct (S_X = I, a_X0 = 0) and if the iteration and array spaces conform. This is used in Section 4.3 to study the allocation of local arrays as a special case of local iterations. According to this definition of Compute, when X is replicated, its new values are computed on all processors having a copy. Depending on a tradeoff between communication and computation speed, the optimal choice may be to broadcast the data computed once in parallel rather than to compute each value locally.
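The owner-computes rule can likewise be simulated on a small case. The sketch below (hypothetical sizes, 1D BLOCK distribution, names ours) checks that every iteration is assigned to exactly one processor, the owner of its lhs element:

```python
def owner(a):
    """Owner of X(a) for X(0:99) distributed BLOCK(20) onto 5 processors."""
    return a // 20

def compute(p):
    """Compute(p) for the lhs reference X(2*i+1) in a loop i = 0..49."""
    return [i for i in range(50) if owner(2 * i + 1) == p]

# every iteration is executed once, on the owner of its lhs element
assert sorted(sum((compute(p) for p in range(5)), [])) == list(range(50))
```

Because the subscript 2i+1 is injective, the Compute sets are disjoint, as required for a deterministic INDEPENDENT loop.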

View Set
The subset of Y (or Z, etc.) used by the loop that computes X is induced by the set of local iterations:

    View_Y(p) = {a_Y | ∃i ∈ Compute(p) s.t. a_Y = S_Y i + a_Y0}

Note that, unlike S_X, matrix S_Y is not constrained and cannot always be inverted.
If the intersection of this set with Own_Y(p) is nonempty, some of the elements needed by processor p for the computation are fortunately already on processor p.
Then, in order to simplify the access to View_Y(p) without having to deal differently with remote and local elements in the computation part, the local copy Y' may be enlarged so as to include its neighborhood, covering View_Y(p). The neighborhood is usually considered not too large when it is bounded by a small numerical constant, which is typically the case if X and Y are identically aligned and accessed on at least one dimension up to a translation, as encountered in numerical finite difference schemes. This optimization is known as overlap analysis [1]. Once remote values are copied into the overlapping Y', all elements of View_Y(p) can be accessed uniformly in Y' with no overhead. However, such an allocation extension may be very rough and expensive in some cases (e.g., a matrix multiplication) because of the induced memory consumption. A reusable temporary array might be preferred, and locally available array elements must then be copied. Such tradeoffs to be considered in the decision process are not discussed in this article, but we present the techniques for implementing both solutions.
When Y (or X, or Z, etc.) is referenced many times in the input loop or in the input program, these references must be clustered according to their connections in the dependence graph [19]. Input dependences are taken into account as well as the usual ones (flow, anti, and output dependences). If two references are independent, they access two distant areas in the array, and two different local copies should be allocated to reduce the total amount of memory allocated: a unique copy would be as large as the convex hull of these distant regions. If the two references are dependent, only one local copy should be allocated to avoid any consistency problem between copies. If Y is a distributed array, its local elements must be taken into account as a special reference and be accessed with (p, c, l) instead of i.
The definition of View is thus altered to take array regions into account. These regions are the result of a precise program analysis which is presented in [20-27]. An array region is a set of array elements described by equalities and inequalities defining a convex polyhedron. This polyhedron may be parameterized by program variables. Each array dimension is described by a variable. The equations due to the subscript expression S_Y are replaced by the array region, a parametric polyhedral subset of Y which can be automatically computed. For instance, the set of references to Y performed in the loop body of the example above is automatically summarized by a parametric region on a_Y = (a_Y1, a_Y2). Subset View_Y is still polyhedral, and array bounds can be derived by projection, up to a change of basis. If regions are not precise enough because convex hulls are used to summarize multiple references, it is possible to use additional parameters to exactly express a set of references with regular holes [28]. This might be useful for red-black SOR.
As mentioned before, if Y is a distributed array and its region includes local elements, it might be desirable to simply extend the local allocation so as to simplify the addressing and to allow a uniform way of accessing these array elements for a given processor p_Y and cycle c_Y. The required extension is computed simply by adding the distribution and alignment equations to the View_Y description. For the example above, assuming that X and Y are aligned and thus distributed the same way, the overlap is expressed on the block offset l_Y. Vectors c_Y and p_Y are constrained as usual by the loop, the processor declaration, and the template declaration of X, but vector l_Y is no longer constrained by the block size (C): it is indirectly constrained by a_Y and the Y region. This leads to a -1 ≤ l_Y lower bound instead of the initial 0 ≤ l_Y, expressing the size-one overlap on the left.
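The size-one left overlap can be observed on a small simulation (Python, illustrative sizes and names, assuming a shift reference Y(i-1)):

```python
def own(p):
    """Own(p) for a 1D array distributed BLOCK(20) onto 5 processors."""
    return set(range(20 * p, 20 * (p + 1)))

def view(p):
    """Y elements read when computing X(i) = f(Y(i-1)) on Own_X(p)."""
    return {i - 1 for i in own(p) if i - 1 >= 0}

for p in range(5):
    extra = view(p) - own(p)      # the nonlocal part of the view
    assert extra <= {20 * p - 1}  # at most the left neighbor's last element
```

Allocating Y' with one extra cell on the left (the -1 ≤ l_Y bound) is enough to hold the whole view uniformly.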

Send and Receive Sets
The set of elements that are owned by p and used by other processors, and the set of elements that are used by p but owned by other processors, are both intersections of previously defined sets:

    Send_Y(p, p') = Own_Y(p) ∩ View_Y(p')
    Receive_Y(p, p') = View_Y(p) ∩ Own_Y(p')

These two sets cannot always be used directly to derive the data exchange code. First, a processor p does not need to exchange data with itself. This cannot be expressed directly as an affine constraint and must be added as a guard in the output code. (This could be avoided by exchanging data first with processors p' such that p' < p and then with processors p' such that p' > p, using the lexicographic order; but this additional optimization is not likely to decrease the execution time much, because the loop over p' is an outermost loop.) Second, replicated arrays
would lead to useless communications: each owner would try to send data to each viewer. An additional constraint should be added to restrict the communications to the needed ones. Different techniques can be used to address this issue: (1) replication allows broadcasts and/or load balancing, which is simply translated into linear constraints as described in [29];
(2) the assignment of owners to viewers can also be optimized in order to reduce the distance between communicating processors, for instance by minimizing a cost function over this distance.

Output SPMD Code
The generic output SPMD code, parametric on the local processor p, is shown in Figure 10.

FIGURE 10 The output SPMD code.
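The overall structure of the output SPMD code can be sketched by a small Python simulation (our own illustration, under simplifying assumptions: 1D BLOCK(20) distribution, shift reference X(i) = Y(i-1), dictionaries standing for messages): each processor sends its part of Send_Y, receives into its local copy, then runs its Compute(p) iterations on purely local data.

```python
B, P = 20, 5                                  # block size, processors
Y = list(range(100))                          # global array (for checking)

def own(p):                                   # Own(p): the pth block
    return set(range(B * p, B * (p + 1)))

def view(p):                                  # View_Y(p) for X(i) = Y(i-1)
    return {i - 1 for i in own(p) if i - 1 >= 0}

mailbox = {p: {} for p in range(P)}           # stands for the network
for p in range(P):                            # send phase on each processor
    for q in range(P):
        if q != p:                            # the p /= p' guard
            for a in view(q) & own(p):        # Send_Y(p, q)
                mailbox[q][a] = Y[a]

X = {}
for p in range(P):                            # receive + compute phase
    local = {a: Y[a] for a in own(p)}         # locally owned Y elements
    local.update(mailbox[p])                  # Receive_Y(p, .)
    for i in own(p):                          # Compute(p)
        if i - 1 >= 0:
            X[i] = local[i - 1]               # purely local accesses now

assert all(X[i] == Y[i - 1] for i in range(1, 100))
```

Only one element crosses each block boundary, which is exactly what message vectorization batches into a single message per processor pair.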

REFINEMENT
The pseudocode shown in Figure 10 is still far from Fortran. Additional changes of coordinates are needed to generate proper Fortran declarations and loops.

Enumeration of Iterations
A general method to enumerate local iterations is described below. It is based on (1) solving the HPF and loop equations, both equalities (Equations 2, 3, 6) and inequalities (Equations 4, 5); (2) searching for a good lattice basis to scan the local iterations in an appropriate order, using p as a parameter since each processor knows its own identity; and (3) using linear transformations to switch from the user-visible frame to the local frame.
In this section, a change of frame is computed, based on the available equalities, to find a dense polyhedron that can be scanned efficiently with standard techniques. The change-of-frame computation uses two Hermite forms to preserve the order of the (l, c) variables, which are related to the allocation scheme, in order to benefit from cache effects.

Simplifying Formalism
The subscript function S and loop bounds are affine.
All object declarations are normalized with 0 as the lower bound. The HPF array mapping is also normalized: Π = I. Under these assumptions and according to Section 2, HPF declarations and Fortran references are represented by the following sets of equations:

    distribution (3):        t = C P c + C p + l    (7)
    affine reference (6):    a = S i + a_0          (8)

where p is the processor vector, l the local offset vector, c the cycle vector, a the accessed element, i the iteration vector, and n the parameters.

Problem Description
Let x be the vector gathering the variables needed to describe the array references and iterations: the iteration, array, template, cycle, and block offset coordinates. The order of the l and c variables chosen here should be reflected by the local allocation in order to benefit from cache effects. This order best suits cyclic distributions, as discussed in [33], because there should be more cycles (along c) than block elements (along l) in such cases. Putting the block offset after the cycle number for a given dimension would lead to larger inner loops for more block distributions. Note that the derivations discussed in this section are independent of this order. Equation 7, together with the alignment and reference equations, can be rewritten in matrix form as

    F x = b_0(n, p)    (9)

For the Chatterjee et al. [17] example in Figure 7, Equation 10 gives the corresponding instance of this system. The set of interest is the lattice

    ℒ = {x | F x = b_0(n, p)}

The goal of this section is to find a parametric solution for each component of x. This allows each processor p to scan its part of ℒ with minimal control overhead.
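The lattice-scanning idea can be made concrete in one dimension. For a CYCLIC(4) distribution onto 5 processors, with x = (t, c, l) and parameter p, each processor enumerates the free variables (c, l) and solves the equality for t; no modulo operation appears in the scan. A Python sketch (illustrative sizes and names):

```python
def scan(p, T=100, n=4, P=5):
    """Scan processor p's part of the lattice for a 1D CYCLIC(n)
    distribution onto P processors: t = n*P*c + n*p + l, 0 <= l < n."""
    owned = []
    for c in range(T // (n * P) + 1):        # free variable: cycle
        for l in range(n):                   # free variable: block offset
            t = n * P * c + n * p + l        # solve the equality for t
            if t < T:                        # declaration constraint
                owned.append(t)
    return owned

# agreement with the modulo/division definition of CYCLIC(4)
for p in range(5):
    assert scan(p) == [t for t in range(100) if (t // 4) % 5 == p]
```

The generated Fortran loops have exactly this shape: one loop per free variable, with affine bounds and bodies.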
It is important to note that F is of full row rank. The rows contain distinct variables that ensure their independence from one another. First, the alignment equations are composed of independent equalities (they differ in their a variables); this is also the case for the distribution and reference equations because of the l and t variables. Second, the equalities composing the distribution equations are the only ones to contain l variables, thus they are independent from the alignment and reference equations. Third, the alignment and reference equations are separated by the template variables through R. Since all equalities are independent, F is of full row rank. Note that the HPF restrictions about alignment are not exploited.
If additional equalities are taken into account, such as those arising from simple block (c_i = 0) or cyclic (l_i = 0) distributions, they can be used to remove the corresponding variables from the equations and do not actually change this property. Other degenerate cases may arise (for instance, when there is only one processor on a given dimension), but they are not discussed here. Our general scheme can directly take advantage of such extra information to optimize the generated code by including it as additional equalities and inequalities.

Parametric Solution of Equations
Lattice ℒ is implicitly defined, but a parametric definition is needed to enumerate its elements for a given processor p. There are two kinds of parameters that must be set apart. First, the constant unknown parameters n and the local processor identity p, the values of which are known at run-time on each processor. Second, the parameters of interest to us, i.e., those that have to be enumerated or instantiated on each processor to scan the integer solutions of the HPF equations: the variables in vector x.
A Hermite form [34] of the integer matrix F is used to find the parameters. This form associates to F (an n × m matrix with m ≥ n) three matrices H, P, and Q, such that H = PFQ. P is a permutation (a square n × n matrix), H an n × m lower triangular integer matrix, and Q an m × m unimodular change of basis. Since F is of full row rank, no permutation is needed: P = I and H = FQ (a). By definition, H is a lower triangular matrix, and thus can be decomposed as H = (H_1 0), where H_1 is an n × n integer triangular square matrix. We know that |H_1| ∈ {-1, 1}. Indeed, H_1 is of full rank (as F is), and the column combinations performed by the Hermite form computation put unit coefficients on the H_1 diagonal. This is ensured since independent unit coefficients appear in each row of F.
Thus H₁ is an integer triangular unimodular matrix, and has an integral inverse.
Now we can use Q as a change of basis between new variables v and x, with v = Q⁻¹x (b). Vector v can be decomposed like H into two components: v = (v₀, v'), where |v₀| is the rank of H. Using (a) and (b), Equation 9 can be rewritten as H₁v₀ = b(n, p), where v₀ is the parametric solution at the origin, which depends on the run-time values of the n and p parameters; thus v₀(n, p) = H₁⁻¹b(n, p). By construction, H does not constrain v', and Q can be decomposed like v as Q = (Q₀ F'). Lattice ℳ can then be expressed in a parametric linear way, with x₀(n, p) = Q₀v₀(n, p) = Q₀H₁⁻¹b(n, p):

ℳ = {x | ∃v' s.t. x = x₀(n, p) + F'v'}   (11)

We have switched from an implicit description of a lattice on x (Equation 9) to an explicit one (Equation 11) through F' and v'. Note that F' is of full column rank.

Cache-Friendly Order
As mentioned earlier, the allocation is based on the (l, c) variables. The order used for the enumeration should reflect as much as possible the one used for the allocation, so as to benefit from memory locality. Equation 11 is a parametric definition of x. However, the new variables v' are not necessarily ordered as the variables of interest to us. The aim of this paragraph is to reorder the variables as desired, by computing a new transformation based on the Hermite form of F'. Let H' = P'F'Q' be this form, and let Q' define a new basis u through v' = Q'u:

x = x₀(n, p) + F'Q'u   (12)
x = x₀(n, p) + P'⁻¹H'u   (13)

If P' = I, the new generating system of ℳ is based on a triangular transformation between x and u (Equation 13). Since H' is a lower triangular integer matrix, the variables of u and of x simply correspond one to the other. Knowing this correspondence allows us to order the u components so as to preserve locality of accesses. If P' ≠ I, the variables are shuffled and some cache effect may be lost. However, we have never encountered such an example.
Algorithms presented in [28] or elsewhere [30, 35-43] can be used to generate the loop nest enumerating the local iterations. When S is of rank |a|, optimal code is generated because no projections are required. Otherwise, the quality of the control overhead depends on the accuracy of the integer projections [44], but the correctness does not.

Correctness
The correctness of this enumeration scheme stems from (1) the exact solution of the integer equations using the Hermite form to generate a parametric enumeration, (2) the unimodularity of the transformation used to obtain a triangular enumeration, and (3) the independent parallelism of the loop, which allows any enumeration order.

Symbolic Resolution
The previous method can be applied in a symbolic way if the dimensions are not coupled and thus can be dealt with independently, as for the array sections in [17]. Consider one dimension,
where π is the number of processors (a diagonal coefficient of P) and γ the block size (i.e., a diagonal coefficient of C). In order to simplify the symbolic resolution, variables a and t are eliminated. With a section subscript a·i + a₀ and an alignment of stride α and offset t₀, the matrix form of the system is then b = (αa₀ + t₀ − γp), x = (l, c, i), and F = (1 πγ −αa).
The Hermite form is H = (1 0 0) = PFQ. Let g, μ, and ω be such that g = gcd(πγ, αa) and g = πγμ − αaω is the Bezout identity. The Hermite form H' of the two rightmost columns of Q, noted F', is then computed.

The resulting parametric solution at the origin links the two unconstrained variables u to the elements x of the original lattice ℳ. Variables a and t can be retrieved using Equation 14.
The translation of the constraints on x to u gives a way to generate a loop nest to scan the polyhedron. Under the assumptions a > 0 and α > 0, the constraints hold for every p ∈ [0 .. π − 1] and can be translated into constraints on u. The resulting generic SPMD code for an array section is shown in Figure 11. As expected, the loop nest is parameterized by p, the processor identity. Integer divisions with positive remainders are used in the loop bounds.
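The Bezout-based resolution above is easy to sanity-check numerically. The sketch below is only illustrative (plain Python; all parameter values are made up): it brute-forces the one-dimensional system l + πγc − αa·i = αa₀ + t₀ − γp by decomposing each template cell as t = γ(πc + p) + l, and verifies that every local triple (l, c, i) satisfies the matrix form of the system.

```python
def owner(t, pi_, gamma):
    """Decompose template cell t for a CYCLIC(gamma) distribution over
    pi_ processors: t = gamma*(pi_*c + p) + l with 0 <= l < gamma."""
    l = t % gamma
    block = t // gamma
    return block % pi_, block // pi_, l    # (p, c, l)

def local_triples(n, pi_, gamma, alpha, a, a0, t0):
    """For each index i of a section subscript a*i + a0, aligned with
    stride alpha and offset t0 onto a CYCLIC(gamma) template, record
    the owner p and the local coordinates (l, c)."""
    triples = {p: [] for p in range(pi_)}
    for i in range(n):
        t = alpha * (a * i + a0) + t0      # template cell referenced by i
        p, c, l = owner(t, pi_, gamma)
        triples[p].append((l, c, i))
    return triples
```

Each processor p thus receives exactly the (l, c, i) triples it owns, which is what the generated loop nest of Figure 11 enumerates in closed form instead of by brute force.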

HPF Array Allocation
The previous two sections can also be used to allocate the local parts of HPF distributed arrays. A loop nest referencing a whole array through an identity subscript function (S = I, a₀ = 0) serves as a basis for the allocation. The dense polyhedron obtained by the changes of bases for enumeration purposes can be used to store the required elements, since local iterations and local elements are strictly equivalent. Thus, the constraints on local iterations can be reused as constraints on local elements. However, Fortran array declarations are based on Cartesian sets and are not as general as Fortran loop bounds. General methods have been proposed to allocate polyhedra [47]. For our particular case, it is possible to add another change of frame to better fit the Fortran array declaration constraints and to reduce the amount of allocated memory, at the expense of the access function complexity.
The local array elements are packed into local memories dimension by dimension. The geometrical intuition of the packing scheme for the [17] example is shown in Figure 12. The basic idea is to remove the regular holes due to the alignment stride by allocating the dense u space on each processor. The "○" and "●" marks support the geometrical intuition of the change of frame. The first grid is the part of the template local to the processor, in the cycle and offset coordinate system. The second grid shows the same grid through the transformation from x to u. The third one is the final packed form, which could be used if space must not be wasted, but at the expense of a complex access function.
An array dimension can be collapsed or distributed.
If the dimension is collapsed, no packing is needed, so the initial declaration is preserved. If the dimension is aligned, there are three corresponding coordinates (p, l, c). For every processor p, the local (l, c) pairs have to be packed onto a smaller area. This is done by first packing the elements along the columns, then removing the empty ones. Of course, a dual method is to pack first along the rows, then remove the empty ones. This last method is less efficient for the example shown in Figure 12, since it would require 16 (8 × 2) elements instead of 12 (3 × 4). The two packing schemes can be chosen between according to this criterion. Other issues of interest are the induced effects on cache behavior and the enumeration costs. Formulae are derived below to perform these packing schemes.
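The column-first versus row-first criterion can be modeled in a few lines. The sketch below is only illustrative (plain Python; the cell set is invented, not the Figure 12 example): it computes the allocated rectangle for both packing orders as (number of nonempty lines) × (maximum line occupancy), which is one straightforward reading of the packing described above.

```python
def packed_size(cells, along_columns=True):
    """Size of the rectangle allocated after packing a set of local
    (l, c) cells: compress each line (a column c if along_columns,
    else a row l), drop the empty lines, and allocate a dense
    (#nonempty lines) x (max line occupancy) rectangle."""
    key = (lambda lc: lc[1]) if along_columns else (lambda lc: lc[0])
    lines = {}
    for cell in cells:
        lines.setdefault(key(cell), []).append(cell)
    if not lines:
        return 0
    return len(lines) * max(len(v) for v in lines.values())
```

On a made-up cell set with an alignment stride of 3 on blocks of 8 offsets, column packing yields 4 × 3 = 12 elements while row packing needs 6 × 3 = 18, so the column scheme would be selected here.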

Packing of the Symbolic Resolution
Let us consider the result of the above symbolic resolution when the subscript expression is the identity (a = 1, a₀ = 0). The equation between u and the free variables of x is obtained by selecting the triangular part of H', i.e., its first rows. If H'' = (I 0)H' is the selected submatrix, we have x' − x₀ = H''u. Variables a and a₀ were substituted by their values in the initial definition of H'. Variables (u₁, u₂) could be used to define an efficient packing, since holes are removed by the first change of basis (Fig. 12). In order to match the l, c space simply and closely, the sign of u₁ (linked to l) can be changed, and the vector should be shifted so that the first local array element is mapped near (0, 0). Some space may be wasted at the beginning and end of the allocated space. For a contrived example (with very few cycles), the wasted space can represent an arbitrary proportion of each dimension. Let us assume that two thirds of the space is wasted for a given dimension. Then the memory actually used for an array with three such dimensions is 1/27 of the allocated memory, and 26/27 is wasted. If such a case occurs, the allocation may be skewed to match a rectangle as closely as possible. This may be done if space is at a premium and if more complex, nonaffine access functions are acceptable. The improved and more costly scheme is described in the next section.

Allocation Basis
Let M be the positive diagonal integer matrix composed of the absolute values of the diagonal coefficients of H''.
M provides the right parameters to perform the proposed packing scheme. To every u a vector u' is associated through Equation 15. This formula introduces an integer division. Let us show why u' is correct and induces a better mapping of array elements onto local memories than u. Since H'' is lower triangular, Equation 15 can be rewritten as Equation 16.

FIGURE 13 Local new declaration (array A', with per-dimension extents derived from the bounds min(c) and max(c)).
Function alloc(u) is bijective. It is injective: if uᵃ and uᵇ are different vectors and i is the first dimension for which they differ, Equation 16 shows that u'ᵃᵢ and u'ᵇᵢ also differ. The function is also surjective, since the same property allows the construction of a vector u that matches any u', by induction on i.

Array Declaration
The packing scheme induced by u' is better than the one induced by u because there are no nondiagonal coefficients between u' and x' that would introduce a waste of space, and u' is as packed as u because contiguity is preserved by Equation 16. The row packing scheme would have been obtained by choosing (c, l) instead of (l, c) for x'. Different choices can be made for each dimension. The access function requires an integer division for the reduced memory allocation. Techniques have been suggested to handle divisions by invariant integers efficiently [48] that could help reduce this cost. Also, because contiguity is preserved, only one division per column is required to compute the base location. These packing schemes define two parameters (u'₁, u'₂) to map one element of one dimension of the HPF array to the processor's local part.
The local array declaration can be linearized, with the u' dimensions first, if the Fortran limit of seven dimensions is exceeded.

Properties
The proposed iteration enumeration and packing scheme has several interesting properties. It is compatible with efficient cache exploitation and with overlap analysis. Moreover, some improvements can statically enhance the generated code.
According to the access function, the iteration enumeration order and the packing scheme in Figure 11 can be reversed via the u₂ loop direction so that accesses to the local array are contiguous. Thus, the local cache and/or prefetch mechanisms, if any, are efficiently used. The packing scheme is also compatible with overlap analysis techniques [1]. Local array declarations are extended to provide space for border elements that are owned by neighbor processors, and to simplify accesses to nonlocal elements. The overlap is induced by relaxing constraints on l, which is transformed through the scheme into relaxed constraints on u₂. This allows overlaps to be handled simply by the scheme.
Moreover, a translated access in the original loop leading to an overlap is transformed into a translated access in the local SPMD generated code. The generic code proposed in Figure 11 can be greatly improved in many cases. Integer division may be simplified, performed efficiently with shifts, or even removed by strength reduction. Node splitting and loop invariant code motion should be used to reduce the control overhead. Constraints may also be simplified, for instance if the concerned elements exactly match a cycle. Moreover, it is possible to generate the loop nest directly on u' when u is not used in the loop body.
For the main example in [17], such transformations produce the code shown in Figure 15.
In the general resolution (Section 4.1), the cycle variables c were put after the local offsets l. The induced inner loop nest is then on c. It may be interesting to exchange l and c in x' when the block size is larger than the number of cycles: the loop with the larger range would then be the inner one. This may be useful when elementary processors have some vector capability.

Allocation of Temporaries
Temporary space must be allocated to hold the nonlocal array elements accessed by local iterations. For each loop L and array X, this set is inferred from Compute_L(p) and from the subscript function S_X. For instance, the references to array Y in Figure 16 require local copies.
Consider the generic loop in Figure 9 and assume that the local elements of Y cannot be efficiently stored using overlap analysis techniques. First, if all or most local elements of the lhs reference are defined by the loop nest, and if S_Y has the same rank as S_X, temporaries can be stored like X. If, furthermore, the access function S_X uses one and only one index in each dimension, the resulting access pattern is still HPF-like, so the result of Section 4.3 can be used to allocate the temporary array. Otherwise, another multistage change of coordinates is required to allocate a minimal area.
The general idea of the temporary allocation scheme is to reduce the number of necessary dimensions for the temporary array via a Hermite transformation, then to use this new basis for the declaration, or its compressed form to reduce further the allocated space.
The set of local iterations, Compute(p), is now defined by a new basis and new constraints, such that x − x₀ = P'⁻¹H'u (Equation 13) and K'u ≤ k'. Some rows of P'⁻¹H' define the iteration vector i, which is part of x (Equation 8). Let H_i be this submatrix: i = H_i u.
Let H_Y = P_Y S_Y Q_Y be S_Y's Hermite form. Let v = Q_Y⁻¹ i be the new parameters; then S_Y i = P_Y⁻¹ H_Y v. Vector v can be decomposed as (v', v''), where v' contributes to the computation of S_Y i and v'' belongs to S_Y's kernel. If H_Y1 is the selection of the nonzero columns of H_Y = (H_Y1 0), then we have S_Y i = P_Y⁻¹ H_Y1 v', and by substituting i, v' is obtained as a function of u. It is sometimes possible to use x' instead of v'. For instance, if the alignments and the subscript expressions are translations, i is an affine function of x' and v' simply depends on x'. The allocation algorithm is based on v but can also be applied when x' is used instead.
Since the local allocation does not depend on the processor identity, the first |p| components of u should not appear in the access function. This is achieved by decomposing u as (u', u'') with |u'| = |p|, and Q_Y⁻¹H_i accordingly into (F'_Y F_Y), so that the local part v'' = F_Y u'' is introduced. Then a first solution is to compute the amount of memory to allocate by projecting the constraints K' onto v''. This is always correct but may lead to a waste of space, because periodic patterns of accesses are not recognized and holes are allocated. In order to reduce the amount of allocated memory, a heuristic based on the large coefficients in F_Y and on the constraints K' is suggested. Each dimension, i.e., each component of v'', is treated independently. Let f be a row of F_Y and y the corresponding component of v'': y = f u''. A change of basis G (not related to the previous allocation operation) is applied to u'' to reduce f to a simpler equivalent linear form f', where each coefficient appears only once and the coefficients are sorted in increasing order, for instance f' = (3 4 16); u'' is replaced by w = Gu''. The constraints K' on u are rewritten as constraints W on w by removing the constraints due to processors and by using projections and G.
Linear form f' is then processed by the heuristic shown in Figure 17, whose input is {y = f'·w, Ww ≤ 0}. It reduces the extent at each step g_i if the constraints W show that there is a hole. A hole exists when g_i is larger than the extent of the partial linear form being built under the constraints W.
Linearity of access to the temporary elements is preserved. This scheme is correct if and only if a unique location y' is associated to each y. Further insight on this problem, the minimal covering of a set by interval congruences, can be found in [49, 50].

Data Movements
The relationships between the bases and frames defined in the previous sections are shown in Figure 18.
Three areas are distinguished. The top area contains the user-level bases for iterations, i, and for array elements, a_X, a_Y, and so on. Solid arrows denote affine functions between the different spaces and dashed arrows denote possibly nonaffine functions built with integer divide and modulo operators. A quick look at these arrows shows that u, which was defined in Section 4.1, is the central basis of the scheme, because every other coordinate can be derived from u along one or more paths.
Two kinds of data exchanges must be generated: updates of overlap areas and initializations of temporary copies. Overlap areas are easier to handle because the local parts are mapped in the same way on each processor, using the same alloc function. Consider array X in Figure 18. The send and receive statements are enumerated from the same polyhedron, up to a permutation of p and p', with the same basis u. To each u corresponds only one element in the user space, a_X, by construction* of u. To each a_X corresponds only one a'_X on each processor.† As a result, data exchanges controlled by a loop on u enumerate the same element at the same iteration.
Because the allocation scheme is the same on each processor, the inner loop may be transformed into a vector message if the corresponding dimension is contiguous and/or the send/receive library supports constant but nonunit strides. Block copies of larger areas are also possible when alloc is affine, which is not the general case from a theoretical point of view but should very often happen in real applications.
Temporary arrays like Y'' and Y''' are more difficult to initialize because there is no such identity between their allocation function as the local part of an HPF array, alloc_Y, and their allocation functions as temporaries, Lalloc_Y'' and Lalloc_Y'''. However, basis u can still be used.
When the temporary is allocated exactly like the master array, e.g., Y''' and X', any enumeration of the elements u of the local set enumerates every element a'''_Y once and only once, because the input loop is assumed independent. On the receiver side, a'''_Y is directly derived from u. On the sender side, a_Y is computed as S_Y F_i u and a nonaffine function is used to obtain x' and a'_Y. Vector messages can be used when every function is affine, because a constant stride is transformed into another constant stride, and when such transfers are supported by the target machine. Otherwise, calls to packing and unpacking routines can be generated.

*As explained in Section 4.3, allocation is handled as a special case.
When the temporary is allocated with its own allocation function M, it is more difficult to find a set of u enumerating its elements exactly once. This is the case for copy Y'' in Figure 18. Elements a''_Y of Y'' must be used to compute one related u among the many possible ones. This can be achieved by using a pseudoinverse of the access matrix M (Equation 17): vector u is taken as the rational solution of equation a''_Y = Mu which has minimal norm.
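The minimal-norm rational solution u = M^T (M M^T)⁻¹ a''_Y can be computed exactly. The sketch below is only illustrative (plain Python over exact fractions; the matrices are made up, and the paper instead scales Equation 17 by det(MM^T) so as to stay in integer arithmetic): it computes the pseudoinverse solution for a full row-rank M by Gaussian elimination.

```python
from fractions import Fraction

def min_norm_solution(M, a):
    """Exact minimal-norm rational solution u of a = M u, for an
    integer matrix M of full row rank: u = M^T (M M^T)^-1 a."""
    m, n = len(M), len(M[0])
    # G = M M^T (an m x m invertible Gram matrix); solve G y = a exactly
    G = [[Fraction(sum(M[i][k] * M[j][k] for k in range(n)))
          for j in range(m)] for i in range(m)]
    y = [Fraction(v) for v in a]
    for i in range(m):                       # forward elimination, pivoted
        piv = next(r for r in range(i, m) if G[r][i] != 0)
        G[i], G[piv] = G[piv], G[i]
        y[i], y[piv] = y[piv], y[i]
        for r in range(i + 1, m):
            f = G[r][i] / G[i][i]
            for c in range(i, m):
                G[r][c] -= f * G[i][c]
            y[r] -= f * y[i]
    for i in reversed(range(m)):             # back substitution
        for c in range(i + 1, m):
            y[i] -= G[i][c] * y[c]
        y[i] /= G[i][i]
    return [sum(Fraction(M[i][j]) * y[i] for i in range(m)) for j in range(n)]
```

The returned u reconstructs a''_Y exactly, and among all solutions it has the smallest Euclidean norm, which is the property the enumeration of Y'' relies on.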

EXAMPLES
The different algorithms presented in the previous section were used to distribute the contrived piece of code of Figure 16, using functions of a linear algebra library developed at École des mines de Paris.
This is an extension of the example in [17], showing that the allocation of HPF arrays may be nontrivial. The reference to X in the first loop requires an allocation of X' and the computation of new loop bounds. It is not a simple case because the subscript function is not the identity and because a CYCLIC(4) and a BLOCK distribution are specified. The two references to Y in the first loop are similar, but they imply some data exchange between processors and the allocation of an overlap area in Y'. Furthermore, the values of I and J are used in the computation, and the iteration domain is nonrectangular and parameterized by m.
The second loop shows that the Compute(p) set may have fewer dimensions than the array referenced, and that fewer loops then have to be generated. The output code is too long to be printed here; interesting excerpts are shown in Figures 19 and 20 and commented on below. The other coordinates of u are the local array coordinates. Note that since we have a block distribution in the second dimension, the corresponding cycle coordinate is 0. The Fortran declaration of X' is shown in Figure 19 as X_hpf. 10 × 7 elements are allocated while 7 × 7 would be enough. The efficiency factor may be better in the (p, l, c) basis. It would also be better with less contrived alignments and subscript expressions and/or larger blocks.

Sending Y'
Before the first loop nest is executed, each processor (p₁, p₂) must send some elements of Y to its right neighbor (p₁ + 1, p₂), with a wraparound for the rightmost processors (3, p₂). The pairs of communicating processors are not accurately represented by convex constraints, and the dest_p1 and dest_p2 loop is not as tight as it could be. This suggests that central and edge processors should be handled differently when overlap is observed. A linearized view of the processors could also help fix this issue.
The bounds for the inner loops are complicated, but they could mostly be evaluated iteratively if node splitting and loop invariant code motion are used. Also, the test on u₇ can be simplified into a modulo operation. Finally, w5 and w6 can also be evaluated iteratively.

Local Iterations for the Second Loop
As for the first loop, the second loop cannot be represented as an array section assignment, which makes it harder to handle than the first one. The set of local iterations is defined by a system in which some dimensions are coupled with the other ones. The guard may as well be hidden in the inner loop bounds. Experiments are needed to find out the best approach.
Note that an extra loop is generated. The Y diagonal can be enumerated with only two loops, yet three are generated. This is due to the use of an imprecise projection algorithm, but it does not endanger correctness [28]. Further work is needed in this area.

Integer Divide
One implementation of the integer division is finally shown. The divisor is assumed strictly positive, as is the case at all call sites. This implementation is necessary because the Fortran remainder is not positive for negative dividends; it was added to ensure the semantics of the output code. Nevertheless, if we can prove that the dividend is positive, we can use the Fortran division directly.
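The fix-up can be stated in a few lines. The sketch below is only illustrative (Python standing in for the generated Fortran helper; Python's own // already floors, so the truncating division is emulated explicitly): it derives the positive-remainder quotient from a truncating division, as the generated code must.

```python
def divide(a, b):
    """Integer division with a nonnegative remainder, assuming b > 0.
    Fortran's '/' truncates toward zero, so its remainder is negative
    for a negative dividend; the fix-up below restores 0 <= r < b."""
    q = abs(a) // b if a >= 0 else -(abs(a) // b)   # truncating quotient
    r = a - q * b
    if r < 0:          # negative dividend: shift down by one divisor
        q -= 1
        r += b
    return q, r
```

For example, divide(-7, 3) yields quotient −3 and remainder 2, whereas the Fortran truncating division would give −2 and −1, breaking the loop-bound formulae.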

RELATED WORK
Techniques to generate distributed code from sequential or parallel code using a uniform memory space were first based on run-time resolution: each instruction is guarded by a condition which is only true for the processors that must execute it. Each memory address is checked before it is referenced, to decide whether the address is local and the reference is executed, whether it is remote and a receive is executed, or whether it is remotely accessed and a send is executed. This rewriting scheme is easy to implement [7], but is very inefficient at run-time because guards, tests, sends, and receives are pure overhead. Moreover, every processor has to execute the whole control flow of the program, and even for a parallel loop, communications may sequentialize the program at run-time [51].
Many optimization techniques have been introduced to handle specific cases. Gerndt introduced overlap analysis in [1] for block distributions. When local array parts are allocated with the necessary overlap and when parallel loops are translated, the instruction guards can very often be moved into the loop bounds, and the send/receive statements are globally executed before the local iterations; i.e., the loop nest with run-time resolution is distributed into two loop nests, one for communications and one for computations. The communication loops can be rearranged to generate vector messages.
Tseng [3] presents many additional techniques (message aggregation and coalescing, message and vector message pipelining, computation replication, and collective communication). He assumes affine loop bounds and array subscripts to perform most optimizations. He only handles BLOCK and CYCLIC(1) distributions, and the alignment coefficient must be 1.
Recent publications tackle any alignment and distribution but often restrict references to array sections, so that each dimension is independent of the others, as was assumed in Section 4. Chatterjee et al. [17] developed a finite state machine (FSM) approach to enumerate local elements. No memory space is wasted, and local array elements are ordered in Fortran lexicographic order exactly like user array elements. They are sequentially accessed by while loops executing the FSM, which may be a problem if vector units are available. Moreover, accesses to an auxiliary data structure, the FSM transition map, add to the overhead. Note that the code generated in Figure 11 may be used to compute the FSM: in fact, the lower iteration of the innermost loop is computed by the algorithm that builds the FSM. Kennedy et al. [69-72] and others [17] have suggested improvements to this technique, essentially to compute the automaton transition map faster at run-time; multidimensional cases also need many transition maps to be handled. Local arrays are densely allocated as in [17] and the initial order is preserved, but no formulae are given. Stichnoth uses the dual method for array allocation as in [7], i.e., blocks are first compressed, and the cycle number is used as a second argument. In [19, 77], polyhedron-based techniques are presented to generate transfer code for machines with distributed memory. In [37, 78], advanced analyses are used as input to a code generation phase for distributed memory machines. Polyhedron scanning techniques are used for generating the code. Two families of techniques have been suggested for that purpose: first, Fourier elimination-based techniques [28, 36, 37, 40-43, 79], and second, parametric integer programming-based methods [30, 35, 38, 39]. In [80], a twofold Hermite transformation is also used to remove modulo indexing from a loop nest: first, variables are added to make the modulo computation explicit; then the Hermite computations are used to regenerate simple new loop bounds. While the aim is different, the transformations are very similar to those presented here.

CONCLUSION
The generation of efficient SPMD code from an HPF program is not a simple task and, up to now, many attempts have provided partial solutions and many techniques. A translation of HPF directives into affine constraints, avoiding integer division and modulo, was presented in Section 2 to provide a unique and powerful framework for HPF optimizations. A normalization step is applied to reduce the number of objects used in the compilation scheme overviewed in Section 3. New algorithms are presented to (1) enumerate local iterations, (2) allocate local parts of distributed arrays, (3) generate send and receive blocks, and (4) allocate temporaries implied by rhs references. They are based on changes of basis and enumeration schemes. Section 4 shows that problem (2) can be cast as a special case of problem (1) by using an identity subscript function. Problem (3) is an extension of problem (1): a unique reference to an array is replaced by a set of references, and the equation used to express the reference is replaced by a set of inequalities. Problem (4) is an extension of problem (2): the set of elements to allocate is no longer the image of a rectangle but the image of an iteration set, which can have any polyhedral shape. This shows that all these problems are closely linked.
Although the usual affine assumptions are made for loop bounds and subscript expressions, our compilation scheme simultaneously lifts several restrictions: Array references are not restricted to array sections.General HPF alignments and distributions are supported, and the same algorithms also generate efficient codes for classical block distributions, similar to the ones produced by classical techniques.Memory allocation is almost 100% efficient on large blocks and performs quite well on small ones when strange alignments are used.We believe that this slight memory waste is more than compensated by the stride-1 vector load, store, send and receive which can be performed on the copies and which are necessary for machines including vector units.These contiguous accesses also perform well with a cache.The overlap analysis and allocation are integrated to the basic allocation scheme.Finally, most computations are performed at compile-time and no auxiliary data structures are used.
Our scheme can also be extended to cope with processor virtualization if the virtualization scheme is expressed with affine constraints.Such a scheme could reuse HPF distribution to map HPF processors on physical processors.
Many partial optimization techniques are integrated in our direct synthesis approach: message vectorization, aggregation [2], and overlap analysis [1]. A new storage management scheme is also proposed. Moreover, other optimization techniques may be applied to the generated code, such as vectorization [18], loop invariant code motion [81], and software pipelining [82, 83].
This technique uses algorithms, directly or indirectly, that may be costly, such as Fourier elimination or the simplex algorithm, which have exponential worst-case behaviors. They are used for array region analysis, in the set manipulations, and in the code generation for polyhedron scanning. However, our experience with such algorithms is that they remain practical for our purpose. Polyhedron-based techniques are widely implemented in the PIPS framework [84], where HPFC, our prototype HPF compiler, is developed. First, for a given loop nest, the number of equalities and inequalities is quite low, typically a dozen or less. Moreover, these systems tend to be composed of independent subsystems on a dimension-per-dimension basis, resulting in a more efficient practical behavior. Second, efficient and highly tuned versions of such algorithms are available, for instance in the Omega library. Third, potentially less precise but faster program analyses [85-87] can also be used in place of the region analysis.
Polyhedron-based techniques are already implemented in HPFC, our prototype HPF compiler [88], to deal with I/O communications in a host/nodes model [89] and also to deal with dynamic remappings [29] (realign and redistribute directives). For instance, the code generation times for arbitrary remappings are in the 0.1-2 s range. Future work includes the implementation of our scheme in HPFC, experiments, and extensions to optimize sequential loops, to overlap communication and computation, and to handle indirections.

FIGURE 4 These directives can be translated into the following matrix form.

Two of the three components of x', namely p and l, are explicitly bounded in K'. Implicit bounds for the third component, c, are obtained by projecting K' on c. These three pairs of bounds, divided by M, are used to declare the local part of the HPF array. Figure 13 shows the resulting declaration for the local part of the array, in the general symbolic case, for one dimension of the array. Bounds min(c) and max(c) are computed by Fourier projection.
FIGURE 16 Code example.
Stichnoth et al. [46, 74] on the one hand and Gupta et al. [45, 75] and Kaushik et al. [76] on the other hand present two similar methods based on closed forms for this problem. They use array sections but compute some of the coefficients at run-time. Gupta et al. [43, 73] and Kaushik et al. [76] solve the block distribution case and use processor virtualization to handle cyclic distributions. Send_Y and Receive_Y can be parametrically enumerated in basis (p, u), where u is a basis for Y', the local part of Y, the allocation and addressing of which are discussed in Section 4.3. Send and Receive are polyhedral sets, and the algorithms in [28] can be used. If the last component of u is allocated contiguously, vector messages can be generated.
U represents any local array generated according to the dependence graph. Communications between processors can be executed in parallel when asynchronous send/receive calls and/or parallel hardware is available, as in transputer-based machines or in the Paragon. The guards Send_U, etc. can be omitted if unused HPF processors are not able to free physical processors for other useful tasks. The loop-bound correctness ensures that no spurious iterations or communications occur. The parallel loop on U can be exchanged with the parallel inner loop on p' included in the forall; this other ordering makes message aggregation possible. This code, presented in Figure 10, is quite different.

*The Manhattan norm of a vector is the sum of the absolute values of its coordinates, i.e., the l₁ norm. †That is, the closest on the first dimension, and then the closest on the second dimension, and so on, dimension per dimension, until one processor is determined. ‡For a vector v, |v| denotes its dimension.
FIGURE 18 Relationships between frames. The top area holds the user-level bases a_X, a_Y, and so on. The middle area contains the bases and frames used by the compiler to enumerate local iterations, u, and to allocate the local parts of HPF arrays, a'_X, a'_Y, and so on, as well as the universal bases used to define the lattices of interest. The bottom area shows the new bases used to allocate temporary arrays, a''_Y and a'''_Y, as defined in Section 4.5.
The minimal-norm solution may well not belong to the polyhedron of meaningful u linked to a user iteration. However, M was built to have the same kernel as S_Y F_i, so the same element a_Y is accessed as would be by a regular u. Since only linear transformations are applied, these steps can be performed with integer computations, by multiplying Equation 17 by the determinant of MM^T and by dividing the results by the same quantity. The local address a''_Y is then derived as above. This proves that sends and receives can be enumerated by scanning the elements a''_Y.