An Algebraic Machinery for Optimizing Data Motion for HPF

This paper describes a general compiler optimization technique that reduces communication overhead for FORTRAN-90 (and High Performance Fortran) implementations on massively parallel machines. The main sources of communication, or data motion, for the parallel implementation of a FORTRAN-90 program are array assignments, array operators (e.g., CSHIFT, TRANSPOSE, etc.), and array arguments passed to and from subroutines. Coupled with the variety of ways arrays can be distributed, a FORTRAN-90 implementor faces a rich space of possibilities by which data motion can be organized. We propose a unified framework for optimizing intra- and inter-procedural data motion. The central idea of this framework is algebraic analysis of data motion. We give an algebraic representation for each of HPF's array intrinsics and data distribution specifications. An array reference extracted from the source FORTRAN-90 program, given a particular data distribution specification, is represented as a communication expression, which in turn can be simplified according to a communication algebra. Fast communication is uncovered by pattern matching with a set of communication idioms. Experimental results on the Connection Machine CM-5 demonstrating the effectiveness of this approach are reported.


INTRODUCTION
Massively parallel machines offer the potential of teraflop computing power. Such power cannot be fully utilized until such machines are made easy to program. A major difficulty with this class of machines is the need to distribute data and manage interprocessor communication explicitly. In recent years, much research effort has been devoted to providing suitable programming tools. One subject on which research focuses is the provision of appropriate high-level, data-parallel programming languages to ease parallel programming. In 1992, a coalition of researchers from academia, industry, and government labs formed the High Performance Fortran Forum to develop a standard set of language extensions to Fortran 90. The forum has produced a proposal for a language, called High Performance Fortran (HPF) [24], which extends the Fortran 90 standard with data distribution directives for high-performance target machines such as massively parallel machines and workstation clusters.
One of the main factors in achieving high performance for parallel programs on massively parallel machines is the reduction of communication overhead. While the development of optimizing compilers for super-scalar architectures is becoming commonplace in industry, work on optimizing data movement and node code performance is mostly done in the context of specialized, hand-crafted code written in assembly code, if not microcode, for specific target machines (e.g., TMC's Convolution Compiler for stencil computation [7]). Automatic transformations to reduce data movement have become an important issue for HPC architectures, where the time spent communicating can easily outweigh the time spent performing actual arithmetic.
In this paper, we describe a general compiler optimization technique that reduces communication overhead for Fortran-90/HPF implementations on massively parallel machines. Movement of distributed data in HPF can occur in two ways: (1) passing of distributed arrays in procedure calls, where the actual and dummy arguments may have different data layouts; and (2) assignments on distributed arrays within a procedure body, where the LHS (left-hand side) and the RHS (right-hand side) of an assignment statement may likewise have different data layouts.
There are thus short-range (between LHS and RHS), medium-range (between an LHS or RHS and the alignment and distribution in force in the current scoping block), and long-range (actual to dummy across the procedure calling sequence) effects of data layout to consider. This makes compilation of HPF into efficient target code a complex task.
We propose an algebraic transformation and runtime support technique for reducing intra- and inter-procedural data movement in HPF programs. The theoretical framework within which we designed the program transformations is algebraic analysis of data movement. We give each of HPF's array operations and data distribution directives an algebraic representation. We then formalize data distribution and intra- and inter-procedural data movement using communication expressions. We have developed a communication algebra and associated heuristics to simplify communication expressions, and a hand-coded, optimized runtime communication library to carry out aggregate communication. Fast communication is uncovered by pattern matching against a set of communication idioms. Calls to fast communication routines are generated if pattern matching succeeds; otherwise, a general communication routine using direct send/receive is used.
The algebraic transformations are done abstractly in the logical, global space defined by the program. They do not require the notion of processor IDs or local memory offsets. As a result, although demonstrated in the context of an HPF compiler for the Connection Machine CM-5, the optimization technique is applicable to other data-parallel languages and other massively parallel machines.
We have conducted experiments on the Connection Machine CM-5 to evaluate the effectiveness of this optimization. Our experimental results demonstrate significant performance improvements on benchmark codes.
The rest of the paper is organized as follows. Section 2 reviews the data mapping model of HPF and the associated data movement problem. Section 3 gives an overview of the algebraic transformation framework. Section 4 presents the communication algebra in more detail. Section 5 reports our experimental results on the Connection Machine CM-5. Finally, Section 6 reviews related work and Section 7 gives the conclusions.

Data Mapping
There is a two-level mapping of array elements to logical processors. An HPF user can align array elements to a template, which is then partitioned and distributed onto an array of logical processors. The mapping of logical processors to physical processors is implementation dependent and may be specified by optional physical mapping directives. We will use the term "data layout" as a generic term for the composition of the alignment and distribution (and the physical mapping, if given explicitly).
Figure 1 shows an HPF program segment and a corresponding graphical representation of the effect of data mapping. Two arrays A and B are related by the FORTRAN-90 array intrinsic CSHIFT, which shifts array A toward the negative direction by one element in wraparound fashion. Assuming the number of processors is two and the logical processors are mapped in the obvious way to the physical processors (that is, the first block is assigned to processor P1 and the second block to processor P2), the alignment of the two arrays is intended for the eventual reduction of interprocessor communication (only one element out of four references in the statement B = CSHIFT(A, dim=1, shift=1) requires communication).
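The effect described above can be sketched in a few lines of Python. This is an illustrative model of CSHIFT semantics and a contiguous BLOCK distribution, not the compiler's code; the array contents and the helper names are our own.

```python
# Model of the CSHIFT intrinsic on a 1-D array, plus a count of the
# references B(i) = A(i+1) whose source element lives on a different
# processor than the destination under a two-processor BLOCK layout.

def cshift(a, shift):
    """Circular shift toward the negative direction by `shift` elements."""
    n = len(a)
    return [a[(i + shift) % n] for i in range(n)]

def owner(i, n, nprocs):
    """Owner of element i under a contiguous BLOCK distribution."""
    block = n // nprocs
    return i // block

def remote_references(n, shift, nprocs):
    """Count assignments whose source and destination owners differ."""
    return sum(1 for i in range(n)
               if owner(i, n, nprocs) != owner((i + shift) % n, n, nprocs))

a = [10, 20, 30, 40, 50, 60, 70, 80]
print(cshift(a, 1))                # [20, 30, 40, 50, 60, 70, 80, 10]
print(remote_references(8, 1, 2))  # 2: one boundary element per processor
```

With eight elements over two processors, each processor makes four references and only the one at its block boundary requires communication, matching the one-out-of-four ratio in the text.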

Procedure Interfaces
HPF provides a rich set of procedure interface specifications for distributed array arguments. A dummy argument has a data layout that is either explicitly specified via user directives or implicitly inherited from the caller's actual argument. Similarly, an actual argument may have an explicitly prescribed data layout, or may itself inherit a data layout from yet another calling context. Consider the possibility of extended calling chains ('A' calls 'B' calls 'C', and so forth), each level of which may add directives to the data layout specification and inherit the rest. This makes compilation of efficient data movement a complex task. Figure 2 shows three levels of procedure calls (ALPHA calls BETA calls FOO calls BAR). For the procedure call statement in ALPHA, both the actual (array A) and the dummy (array B) have explicit alignments. Array A is aligned to the template T with an offset of one element. The template T, by default, is partitioned into contiguous blocks. In procedure BETA, array B (the dummy argument of BETA) is aligned to T with an offset of two elements. Therefore, data realignment is required for passing array A between ALPHA and BETA. In procedure FOO, the dummy argument (array C) inherits the alignment directive of the actual argument (array B). By aligning to the dummy argument C, the local array D defined in procedure FOO also inherits the actual argument's alignment specification, resulting in a total offset of five data elements with respect to template T.
This layout effect is propagated to the procedure interface when calling procedure BAR. In the procedure body of BAR, the dummy argument E inherits its actual argument's (array D's) alignment, which differs from the local array L's alignment specification. When the calling chain is long, a systematic approach is desirable to automate the process of optimizing data movement for passing array arguments across procedure boundaries, as well as for moving data elements in array assignment statements within a procedure body.

Optimization for Data Movement
Consider the case where both the actual and dummy arguments of a single-level procedure call have explicit data layouts (Figure 3). Array A is cyclically partitioned in procedure ALPHA, and should be redistributed using block partitioning within procedure BETA. Layout conversion for array B is complex due to a change of both alignment (between an offset of two elements and an offset of one element) and distribution (between cyclic and block partitioning). In the procedure body of BETA, the assignment statement shifts array C toward the negative direction by one element and assigns the result to array D (i.e., C(i+1) is assigned to D(i)). We call this the logical data movement defined by the program. Due to the effect of the alignment directives, the actual data movement for executing the assignment statement may differ from the logical data movement as it appears. In order to satisfy the directives as given by the user, the compiler must combine all the layout requests that are in force.
FIGURE 3 An example for data movement optimization where both the actual and dummy arguments have explicit data layouts.
Assuming the "owner-compute" rule for the compilation of data movement, the compiler has two roles. First, the compiler should minimize the time spent moving A and B to C and D's preferred layouts, and moving them back when returning from the call. For instance, the compiler should optimize communication for BLOCK and CYCLIC data redistribution. It should also determine the order of data movement for complex conversions. For instance, layout conversion for passing array B to procedure BETA involves an offsetting alignment change and a conversion from CYCLIC distribution to BLOCK distribution. There are two alternatives for arranging the data movement: offsetting realignment under CYCLIC distribution followed by CYCLIC-to-BLOCK redistribution, or CYCLIC-to-BLOCK redistribution followed by offsetting realignment under BLOCK distribution. The latter is more efficient because offsetting realignment requires much less communication under BLOCK distribution. Second, the compiler should minimize the time spent moving array D's data for the execution of the assignment statement. For instance, with compiler optimization, the logical data movement specified by the EOSHIFT operation can be turned into local memory accesses as a result of the alignment directives for arrays C and D.
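The claim that offsetting realignment is cheaper under BLOCK than under CYCLIC distribution can be checked with a rough counting model. The sketch below is ours, not the paper's; it simply counts how many elements change owner when shifted by one position under each distribution.

```python
# Rough model: communication volume of a one-element offset realignment
# under BLOCK versus CYCLIC distribution of n elements over p processors.

def block_owner(i, n, p):
    """Contiguous BLOCK distribution: n/p consecutive elements each."""
    return i // (n // p)

def cyclic_owner(i, n, p):
    """CYCLIC distribution: element i lives on processor i mod p."""
    return i % p

def moved(owner, n, p, offset):
    """Elements whose owner changes when shifted by `offset` (no wraparound)."""
    return sum(1 for i in range(n - offset)
               if owner(i, n, p) != owner(i + offset, n, p))

n, p = 16, 4
print(moved(block_owner, n, p, 1))   # 3: one element per block boundary
print(moved(cyclic_owner, n, p, 1))  # 15: every shifted element moves
```

Under BLOCK only the boundary elements cross processors, whereas under CYCLIC essentially every element does, which is why the compiler prefers to redistribute first and realign second.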
A simple but naive approach is to use general communication or array copying through temporary storage whenever non-canonical* data layouts are in force. This approach not only causes excessive data copying but also ignores many opportunities for fast communication.
A better approach is to analyze data movement systematically. This can be achieved if we capture and manipulate data movement algebraically.
In this paper, we propose an algebraic analysis framework for optimizing data movement. Data layouts, data movement caused by passing distributed arrays across procedure boundaries, and array intrinsic traffic within procedure bodies can all be captured algebraically. Data movement can be reduced by simplifying the corresponding algebraic expressions according to a set of rules. The manipulation of algebraic expressions can be carried out in the global, logical space defined by the program. Detailed, machine-dependent aspects of data movement can be postponed to runtime through a set of communication library routines.
*By "canonical" we mean that an array is only aligned to itself identically and is distributed to the processors using the default distribution strategy. For instance, if the array size is 8 x 8 and the number of processors is four, the CM Fortran compiler partitions the array contiguously into four equal-sized (4 x 4) subarrays and assigns one subarray to each processor.

ALGEBRAIC ANALYSIS FRAMEWORK
Figure 4 gives an overview of the framework. Data layouts and data movement in HPF codes are extracted and formalized as communication expressions. Naively executing the terms of such an expression one by one from right to left does not eliminate unnecessary communication. Under the "owner-compute" rule, the amount of communication is determined by the number of operators in a communication expression; i.e., a communication expression is simplified if its operator count is decreased. Our goal is to minimize the operator count. We designed an algebraic engine that contains a set of rules for simplifying communication expressions and the associated heuristics that guide the engine in applying the rules. A communication expression is reduced to an expression to which no further rules apply under the particular set of heuristics.
In the following, we describe the algebraic framework in more detail. We first describe the algebraic representations of data layouts and array intrinsics. We then show what a communication expression might look like. Next, we give an overview of the communication algebra and outline the set of communication idioms we have collected.

Algebraic Representations
General integer matrix notations are commonly used for affine alignment functions, which lead to a general transformation technique, called affine transformation, for optimizing data-parallel programs [3,46]. General matrix notations and affine transformations are insufficient for transforming non-linear alignment operators such as CSHIFT (cyclic shift), CSKEW (cyclic skew), and replication. A solution is to separate boundary array elements from interior elements using explicit loop structures, followed by index function transformation within the loop bodies. However, this approach may increase program size rapidly when a large number of non-linear operations occur in the source program.
Our approach is to associate each HPF alignment/distribution operation with a special algebraic representation. Each representation has a name and a matrix/vector/integer parameter as appropriate. The name characterizes the value of the parameter, and facilitates the design of the algebraic rules as well as efficient pattern matching for communication idioms. Formally speaking, these algebraic representations are functions with dependent types, which will become clear later.
We divide these algebraic representations into several classes according to the dimensionalities of their arguments and results. ALIGN1 operators capture alignments within a single dimension, including offsetting, strided, and reflection alignments. ALIGN2 and ALIGN3 operators capture alignments across multiple dimensions. The argument and result of an ALIGN2 operator have identical dimensionality; typical examples are transposing an array (i.e., dimension permutation) and skewing an array dimension with respect to others. An ALIGN3 operator has different shapes for its argument and result; array reshaping, replication, and embedding into a higher-dimensional space all belong to ALIGN3. Figure 5 gives graphical representations of these operators. Table 1 outlines these classes, the HPF operators, and the corresponding algebraic representations and their definitions. In the following, we explain some of the definitions.
The CSHIFT operator performs offsetting operations in a wrap-around fashion. In ALIGN2, both the TRANSPOSE array intrinsic (e.g., TRANSPOSE(A, (1, 3, 2)), which exchanges the second and the third dimensions of array A) and the "transpose" alignment directives (e.g., ALIGN A(i,j,k) WITH T(i,k,j)) are denoted by TRANS(d)(M), which, given a d-dimensional index domain D, permutes the dimensions of D according to M (the matrix representation of the permutation vector and the alignment index expression) and results in another index domain M(D). The SKEW(d)(M) operator, given a d-dimensional index domain D, skews D (i.e., performs an affine mapping on domain D) according to the coefficient matrix M. For instance, the SKEW(2) operator with coefficient matrix ((1, 0), (1, 1)) skews the second dimension of a two-dimensional domain (i.e., it maps index (i, j) to (i, i + j)). The CSKEW(d)(M) operator performs skewing operations in a wrap-around fashion.
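The SKEW and CSKEW definitions above can be made concrete with a small sketch. This is our own illustration of the index mapping, not the compiler's representation; the extents used for wraparound are assumed.

```python
# Sketch of SKEW(2) with coefficient matrix M = [[1, 0], [1, 1]],
# which maps index (i, j) to (i, i + j); CSKEW additionally wraps
# each coordinate around the extent of its dimension.

def skew(index, M):
    """Apply an integer coefficient matrix to an index vector."""
    return tuple(sum(M[r][c] * index[c] for c in range(len(index)))
                 for r in range(len(M)))

def cskew(index, M, extents):
    """Skew in wrap-around fashion over the given dimension extents."""
    return tuple(x % e for x, e in zip(skew(index, M), extents))

M = [[1, 0], [1, 1]]
print(skew((2, 3), M))           # (2, 5): second coordinate becomes i + j
print(cskew((3, 3), M, (4, 4)))  # (3, 2): 3 + 3 wraps modulo 4
```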
In ALIGN3, both the array intrinsic SPREAD(A, dim=2, ncopies=n) and the alignment directive ALIGN A(i) WITH T(i,*)† (assuming the size of template T at the second dimension is n) duplicate n copies of A at the second dimension. Both are denoted by REPLICATE(2)(V, Interval(1, n)), where the nonzero element (the second element) of the vector argument V indicates that the replication takes place at the second dimension. The directive ALIGN A(i) WITH T(i,1) aligns A to the first column of T. The RESHAPE operator reshapes, for example, a 2 by 6 domain into a 3 by 4 domain, where the argument is enumerated in column-major fashion (denoted by the first matrix) and the result is enumerated in row-major fashion (denoted by the second matrix).
In the notation of Table 1, T is the template that array A is aligned to; V denotes an integer vector of length rank(T), and V(k) denotes the kth element of V; g(d)(M) denotes an operator that takes a d-dimensional integer matrix M as argument, and M·I denotes the multiplication of matrix M with the index vector I; D1 and D2 are the index domain representations for shape specifications S1 and S2, and O1 and O2 are permutation matrices representing the storage ordering in reshaping.
The standard distribution directives in HPF are the BLOCK, CYCLIC, and generic BLOCK_CYCLIC partitioning strategies. We collect them in the class DIST. MULTID operations capture multi-dimensional array intrinsics and data layouts that can be formulated as a "product" of one-dimensional operators. For instance, a two-dimensional shift operation is denoted by the product of the two ALIGN1 operators CSHIFT(c1) and CSHIFT(c2).
†If template T is partitioned at the second dimension, the effect of the directive ALIGN A(i) WITH T(i,*) is to duplicate the same data of A onto all processors.
The five classes of algebraic representations capture most of HPF's array intrinsics and alignment/distribution directives, which can be formalized as index domain morphisms (functions that map one index domain to another). General array references using index expressions are transformed to corresponding algebraic representations (or compositions of algebraic representations) whenever possible, according to a set of simple transformation rules.

Communication Expressions
We can formalize data layout and data movement as communication expressions using "product" and "composition" operators. The product of f: D1 → E1 and g: D2 → E2 is the function f × g: D1 × D2 → E1 × E2 defined by (f × g)(i, j) = (f(i), g(j)). The composition of two functions g: D2 → D3 and f: D1 → D2 is the function g ∘ f: D1 → D3 defined by (g ∘ f)(i) = g(f(i)). In order to formalize the relationship between a data element and a concrete store within a processor, it is necessary to formalize the three stages of data mapping (alignment to a template, partitioning of the template onto logical processors, and the mapping of logical processors to physical processors). Let α denote the alignment operator that aligns array D to template E, β denote the partition operator that partitions template E into a pair of a logical processor domain L and a local index domain M, and γ denote the operator that maps logical processor-memory pairs (L × M) to physical processor-memory pairs (P × M).

Data Layout
Data layout is simply the composition of the three stages of data mapping (α, β, and γ). The layout of an array D can be defined as a communication expression g = γ ∘ β ∘ α, as shown in the commuting diagram of Figure 6a.
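The three-stage composition g = γ ∘ β ∘ α can be sketched directly as function composition. The concrete offset, block size, and identity physical mapping below are assumptions chosen for illustration, not values from the paper.

```python
# Minimal sketch of a layout as the composition gamma ∘ beta ∘ alpha:
# alignment to a template, partitioning into (logical processor, local
# offset), and an identity logical-to-physical mapping.

def alpha(i, offset=1):
    """Align array index i to template cell i + offset."""
    return i + offset

def beta(t, block=4):
    """Partition template cell t into (logical processor, local index)."""
    return (t // block, t % block)

def gamma(pair):
    """Map logical processor-memory pairs to physical ones (identity here)."""
    return pair

def layout(i):
    return gamma(beta(alpha(i)))

print(layout(6))  # template cell 7 -> (processor 1, local offset 3)
```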
The formalization of data layout is used to derive communication expressions for data movement. Intra-procedural data movement refers to array references within a procedure body. Inter-procedural data movement, also called layout conversion, refers to array copying due to a change of data layouts between the actual and dummy arguments in procedure calls; it is a special case of intra-procedural data movement in which the array reference is the identity.

Intra-Procedural Data Movement
Consider an assignment statement where the left-hand-side array D2 and the right-hand-side array D1 are related by an array reference a. Let the data layouts of D1 and D2 be g1 and g2, respectively. The data movement induced by the reference a can be formalized by the communication expression e = g2 ∘ a ∘ g1⁻¹, as shown in Figure 6b.

Inter-Procedural Data Movement
Let the data layouts of array D in procedure B1 and procedure B2 be g1 and g2, respectively. The data movement required to move array D from B1 to B2 is given by the communication expression e = g2 ∘ g1⁻¹, as shown in Figure 6c. Most of the HPF alignment operators are reshape morphisms [13], which essentially are bijective functions defined over index domains. For an operator f, let f⁻¹ denote the inverse of f. Let g = fn ∘ ⋯ ∘ f1 be the composition of n bijective functions fi, i = 1, ..., n. The inverse of g is then g⁻¹ = f1⁻¹ ∘ ⋯ ∘ fn⁻¹.
One exception is replication operations, which are relations rather than functions. For instance, the replication operator maps (i) ∈ D1 to {(i, j) | j = lb(D2), ..., ub(D2)} ⊆ D1 × D2. We abuse the notation for its inverse, which is the relation that maps each (i, j) ∈ D1 × D2 to (i) ∈ D1. The composition of two relations A and B is defined as B ∘ A = {(x, z) | (x, y) ∈ A and (y, z) ∈ B for some y}. If both A and B are functions, this definition coincides with function composition.
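The inversion rule for bijective operators can be exercised with a small sketch. The concrete shift and stride operators below are our own illustrative stand-ins for alignment operators, not the paper's definitions.

```python
# For bijective alignment operators, the inverse of a composition is
# the reversed composition of inverses:
#   g = fn ∘ ... ∘ f1   implies   g⁻¹ = f1⁻¹ ∘ ... ∘ fn⁻¹.

def compose(*fs):
    """compose(f, g)(x) == f(g(x))."""
    def h(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return h

shift = lambda i: i + 2          # an offsetting alignment
shift_inv = lambda i: i - 2
stride = lambda i: 3 * i         # a strided alignment
stride_inv = lambda i: i // 3    # valid on the image of `stride`

g = compose(stride, shift)               # g = stride ∘ shift
g_inv = compose(shift_inv, stride_inv)   # inverses in reverse order

for i in range(10):
    assert g_inv(g(i)) == i      # g⁻¹ ∘ g is the identity
print(g(5), g_inv(21))           # 21 5
```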

Examples
Figure 7 shows a communication expression for inter-procedural layout conversion. In order to formalize the data movement between procedures ALPHA and BETA, it is necessary to construct the layouts of the actual and dummy arguments and to use these constructions in the application of the Inter-Procedural Rule (Figure 6c). We explicitly indicate the domain D and codomain E of each operator g (written as g: D → E) because the operators are now bound to the index domain of array A. The actual argument A is aligned with the template T1, which is partitioned into columns of blocks, by offsetting one element at the first dimension and two elements at the second dimension (refer to Figure 6a). The dummy argument B is aligned with template T2 by a transposition followed by an offsetting alignment with distance two at the first dimension and distance one at the second dimension.

The communication expression for changing from A's layout to B's can be constructed by composing B's layout with the inverse of A's layout. A communication expression for intra-procedural data movement is shown in Figure 11 of Appendix A.
FIGURE 7 A communication expression for inter-procedural layout conversion: changing from the actual argument's layout to the dummy's. In order to formalize the data movement between procedures ALPHA and BETA, it is necessary to construct the layouts of the actual and dummy arguments and to use these constructions in the application of the inter-procedural rule.

Algebraic Simplification
A generic method for simplifying a communication expression is functional transformation [6,47]. Since we only deal with the set of standard HPF data distribution directives (rather than general functions) in the context of optimizing data movement, we look for a simpler and more efficient solution. We have designed a communication algebra to serve this purpose. The communication algebra manipulates communication expressions at the algebraic representation level only.
Recall that we divide HPF's array operators and layout operators into several classes according to the dimensionality of their domains and ranges. A communication expression may be the composition of operators from any or all of the classes. Based on this classification, we design sub-algebras for each class to manipulate the interaction of operators within that class, as well as bridging sub-algebras to manipulate the interaction of operators from different classes.
Each sub-algebra contains three kinds of rules: (1) the inverse rules, which compute the inverse of an operator; (2) the reduction rules, which reduce two adjacent operators to one or zero new operators; and (3) the exchange rules, which make two operators adjacent to each other by exchanging them with the operators in between, so that the reduction rules may be applied later to simplify them.
The communication algebra will be presented in more detail in Section 4.

Example
Consider the communication expression e in Figure 7.
By exchanging the TRANS operator with one of the two-dimensional EOSHIFT operators (using one of the exchange rules in the bridging sub-algebra for ALIGN2 and MULTID), the two EOSHIFT operators become adjacent and cancel each other (using one of the reduction rules in the ALIGN1 sub-algebra). The result is a TRANSPOSE operation on a two-dimensional index domain which is partitioned at the second dimension.

Communication Idiom Matching
A simplified communication expression describes the actual data movement that needs to be performed. A naive approach is to use general communication for all cases; this ignores any opportunity for fast communication. A better approach is to uncover frequently occurring data movement and use specialized, fast communication whenever possible. For instance, the simplified communication expression e shown in the example in Section 3.3 is a transposition of a two-dimensional matrix which is partitioned one-dimensionally, resulting in so-called all-to-all personalized communication [21], in which every processor exchanges distinct data with every other processor. Due to the uniform communication pattern, communication overhead may be reduced by carefully scheduling messages to avoid contention in the network.
Since the advent of massively parallel machines, many researchers have developed specialized communication routines to facilitate direct programming of distributed-memory machines (e.g., [5, 21-23, 26, 36, 37, 39, 42]). In building compilers, we can take advantage of these hand-crafted, highly optimized routines, which become part of the runtime system for the language. In the Crystal compiler developed at Yale University [30-32], this approach is used to generate intra-procedural communication. We extend that work to include communication routines for converting data layouts between subprograms.
We have collected a set of frequently occurring communication patterns and distilled their communication expressions into communication idioms. They include most of the array intrinsics and frequently occurring layout conversions, such as conversion between BLOCK and CYCLIC partitioning and conversion between column partition (*, BLOCK) and row partition (BLOCK, *). These idioms may or may not have specialized, fast communication, perhaps microcoded or otherwise hand-optimized, depending on the target machine. A list of communication idioms is shown in Table 2. The optimization procedure simply goes through this list of idioms and pattern matches against the communication expression. If a simplified communication expression contains more than one alignment operator, the compiler will either expand the expression to match multiple idioms or unfold and collapse the expression into a general communication, depending on the characteristics of the target machine.
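The matching procedure described above amounts to a linear scan over an idiom table with a general fallback. The following is a hypothetical sketch; the idiom names and routine names are illustrative, not the paper's runtime library.

```python
# Hypothetical idiom matcher: walk a list of (pattern, routine) pairs
# and fall back to general send/receive when no specialized routine
# applies. A communication expression is modeled as a tuple of
# operator names.

IDIOMS = [
    (("TRANS", "DIST_BLOCK"), "all_to_all_transpose"),
    (("CSHIFT",), "nearest_neighbor_shift"),
    (("DIST_CYCLIC", "DIST_BLOCK"), "cyclic_to_block_remap"),
]

def match_idiom(expr):
    """Return the fast routine for a simplified communication
    expression, or the general fallback routine."""
    for pattern, routine in IDIOMS:
        if expr == pattern:
            return routine
    return "general_send_receive"

print(match_idiom(("TRANS", "DIST_BLOCK")))  # specialized transpose
print(match_idiom(("SKEW", "REPLICATE")))    # no idiom: general routine
```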

COMMUNICATION ALGEBRA
In this section we present the communication algebra in more detail. Recall that the alignment/distribution operators are functions with matrix/vector/integer parameters. Although the simplification structure is not closed under composition, we abuse the name "algebra" to reflect that the algebraic properties are preserved in combining the matrix/vector/integer parameters across compositions of operators. Based on our classification of array operators and layout operators, we design subalgebras for each class to manipulate the interaction of operators within that class, as well as bridging subalgebras for manipulating the interaction of operators from different classes. Figure 8 shows the organization of these subalgebras. Each subalgebra contains three kinds of rules: (1) the inverse rules, which compute the inverse of an operator; (2) the reduction rules, which reduce two adjacent operators to one or zero new operators; and (3) the exchange rules, which make two operators adjacent to each other by exchanging positions with the operators in between.
Given a communication expression, the algebraic engine applies these rules according to a simple heuristic until no rule can be applied. In the following, we present the subalgebras, the bridging subalgebras, and the simplification procedure for communication expressions. Each subalgebra is denoted by a triple (S, Q, R), where S is the set of sorts, Q is the set of operators, and R is the set of rules. The sorts of each subalgebra define the appropriate set of index domains over which the operators of the particular class are defined. The set R comprises the rules that either directly or indirectly reduce the length of a communication expression. Proofs of the rules are included in Appendix B.
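The rule-application loop can be sketched as a rewriting engine that runs to a fixed point. This is an assumed structure, not the paper's implementation, and only the CSHIFT reduction rule is modeled; the expression is a list of (operator name, parameter) pairs.

```python
# Illustrative rewriting loop: repeatedly apply reduction rules to
# adjacent operators in a communication expression until none applies.

def reduce_pair(g1, g2):
    """Try to reduce two adjacent operators; return the replacement
    list, or None if no rule applies. Only the CSHIFT reduction rule
    (adjacent shifts combine; zero shifts vanish) is modeled here."""
    if g1[0] == "CSHIFT" and g2[0] == "CSHIFT":
        c = g1[1] + g2[1]
        return [] if c == 0 else [("CSHIFT", c)]
    return None

def simplify(expr):
    """Apply reduction rules until a fixed point is reached."""
    changed = True
    while changed:
        changed = False
        for k in range(len(expr) - 1):
            r = reduce_pair(expr[k], expr[k + 1])
            if r is not None:
                expr = expr[:k] + r + expr[k + 2:]
                changed = True
                break
    return expr

e = [("CSHIFT", 2), ("CSHIFT", -2), ("CSHIFT", 3)]
print(simplify(e))  # the first two operators cancel: [("CSHIFT", 3)]
```

Each successful reduction decreases the operator count, which is exactly the cost measure the engine minimizes.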

Subalgebra for ALIGN1
The sorts of the subalgebra for ALIGN1 comprise the integer set N and the set of interval domains Dom. The operators include the four ALIGN1 operators and one composition operation (∘). The composition operator is a higher-order function that takes two ALIGN1 operators (D1 → D2, D2 → D3) as arguments.

Subalgebra for ALIGN3
The sorts of the subalgebra for ALIGN3 comprise the integer set N, the set of interval domains Dom, the set of m x n integer matrices Matm,n, and the set of length-d integer vectors V with V(k) ≠ 0 (the kth element of V nonzero). The operators include the three ALIGN3 operators and one composition operation (∘). The composition operator is a higher-order function that takes two ALIGN3 operators as arguments.
Inverses of ALIGN3 operators can be devised without appeal to standard matrix/vector/integer algebra (for a discussion of inverses of ALIGN3 operators, please refer to Section 3.2). Therefore, no algebraic rules are needed for finding their inverses. We abuse the notation a⁻¹ to denote the inverse of an ALIGN3 operator. The inverse-cancellation rule (Rule 8) cancels out an ALIGN3 operator with its inverse. Rule 9 reduces two adjacent ALIGN3 operators into one by standard matrix-vector algebra. The purpose of this rule is to simplify a chain of compositions of ALIGN3 operators, which may result from propagating inherited alignments across multiple levels of procedure calls.

Example
A replication along the second dimension followed by a replication of the new index domain along the third dimension can be combined into a multi-dimensional replication of the original domain.
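The replication example above can be demonstrated with a small sketch of index domains as lists of tuples. This is our own model, not the paper's notation; the copy counts (three and two) are arbitrary.

```python
# Replicating along dimension 2 and then along dimension 3 yields the
# same index domain as a single combined multi-dimensional replication.

def replicate(indices, dim, ncopies):
    """Replicate each index tuple ncopies times along a new dimension
    inserted at 1-based position `dim`."""
    return [idx[:dim - 1] + (k,) + idx[dim - 1:]
            for idx in indices for k in range(ncopies)]

base = [(0,), (1,)]  # a one-dimensional domain of two elements

# Two single-dimension replications, applied one after the other.
two_step = replicate(replicate(base, 2, 3), 3, 2)

# The combined replication along dimensions 2 and 3 at once.
one_step = [idx + (j, k) for idx in base for j in range(3) for k in range(2)]

assert two_step == one_step
print(len(two_step))  # 2 * 3 * 2 = 12
```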

Subalgebra for MULTID
The purpose of this subalgebra is to distribute composition operations through product operations so that compositions of MULTID operators can be simplified.

Bridging Subalgebra for ALIGN2 and MULTID
This subalgebra exchanges the positions of adjacent ALIGN2 and MULTID operators, so that either one of them may be reduced with other operators in the communication expression.
Rule 13 Reduction of ALIGN2 and ALIGN3 Operators

Bridging Subalgebra for ALIGN2 and ALIGN3
The reduction rules in this subalgebra collapse adjacent ALIGN2 and ALIGN3 operators into an ALIGN3 operator.

Subalgebras for Distribution
Since changing data distribution and physical mapping strategies is a task handed over to the runtime system, the compiler's job is simply to detect whether data movement is required and to identify the appropriate communication routine for it. Applying the Product-composition-exchange rule (in the subalgebra for MULTID), followed by inverse-cancellation (g cancelled with g⁻¹) for each product component, is sufficient for this purpose.

Simplification Procedure
The compiler simplifies a communication expression by applying a sequence of the algebraic rules. The length of a communication expression (and therefore the complexity of the algebraic simplification) depends only on the number of levels of nested procedure calls, which is usually a small constant (less than four or five in most application programs). Nevertheless, for compiler efficiency it is desirable to minimize the execution time of the simplification procedure. We use a simple heuristic to do so. Before describing the heuristic, we first define some relevant terminology.
Definition. The composition of two operators e = g1 ∘ g2 is immediately reducible if e can be reduced to a basic operator by the reduction rules. A basic operator is either an identity function or a single alignment operator.
For example, the composition CSHIFT(c1) ∘ CSHIFT(c2) is immediately reducible: it can be reduced to the basic operator CSHIFT(c2 + c1).
Definition. A redcom is a subexpression e = g1 ∘ [⋯] ∘ g2 in a communication expression such that either e is immediately reducible or the two operators g1 and g2 can be reduced to a basic operator by applying a sequence of exchange rules and reduction rules.
Definition. If a communication expression contains no redcoms, then the communication expression is said to be in its final form.
The compiler simplifies a communication expression by repeatedly reducing the redcoms in the expression until the expression reaches its final form. A communication expression may contain multiple redcoms, so there may be multiple choices of simplification order.
Compiler execution time may vary with the order of simplification, and we use a simple heuristic to minimize simplification time. Under the "owner-computes" model, the amount of communication is determined by the number of operators in a communication expression; that is, a communication expression is simplified whenever its operator count decreases. Our current heuristic employs a greedy algorithm that reduces immediately reducible operators as early as possible, because doing so always lowers the operator count. The algorithm also avoids infinite looping by adjusting the starting pointer after each application of an exchange rule. Note that MULTID operators are denoted by products of one-dimensional operators (e.g., products of ALIGN1, DIST, etc.); the composition of MULTID operators is simplified by simplifying each product term independently.
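The greedy heuristic just described can be sketched as follows. This is an illustrative skeleton, not the paper's code; the rule functions are placeholders standing in for the actual reduction and exchange rules:

```python
# Sketch of the greedy heuristic (illustrative; rule functions are
# placeholders, not the paper's rule set). Immediately reducible pairs
# are reduced first, since that always lowers the operator count; after
# an exchange, the starting pointer moves right to avoid infinite looping.
def simplify_chain(ops, try_reduce, try_exchange):
    """ops: operators outermost-first (g1 o g2 o ... o gn), as a list.
    try_reduce(g1, g2)   -> basic operator, or None if not reducible.
    try_exchange(g1, g2) -> (g2', g1'), or None if no exchange applies."""
    start = 0
    changed = True
    while changed:
        changed = False
        i = start
        while i < len(ops) - 1:
            basic = try_reduce(ops[i], ops[i + 1])
            if basic is not None:          # immediately reducible pair
                ops[i:i + 2] = [basic]
                start = 0                  # reductions may enable new ones
                changed = True
                continue
            swapped = try_exchange(ops[i], ops[i + 1])
            if swapped is not None:        # exchange, then advance pointer
                ops[i:i + 2] = list(swapped)
                start = i + 1
                changed = True
            i += 1
    return ops

# Toy rule set: only shifts, which reduce by adding their offsets.
def reduce_shifts(g1, g2):
    if g1[0] == g2[0] == 'shift':
        return ('shift', g1[1] + g2[1])
    return None

print(simplify_chain([('shift', 2), ('shift', 3), ('spread', 1)],
                     reduce_shifts, lambda g1, g2: None))
# [('shift', 5), ('spread', 1)]
```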
Recall that a communication expression has the form γ1 ∘ β1 ∘ α1 ∘ a ∘ α2⁻¹ ∘ β2⁻¹ ∘ γ2⁻¹, where a denotes the array operation, the αᵢ are alignment operators, the βᵢ are distribution operators, and the γᵢ are physical mapping operators (please refer to Figure 6). Following elaboration, we simplify a communication expression in this order: the alignment subexpression α1 ∘ a ∘ α2⁻¹ first (which is machine independent), then the distribution subexpression β1 ∘ β2⁻¹ (which is somewhat machine dependent), and then the physical mapping subexpression γ1 ∘ γ2⁻¹ (which is completely machine dependent), as shown in Procedure simplify-communication (Figure 9). Since in most cases the length of a communication expression is dominated by the number of alignment operators and array intrinsics in the expression, our heuristic is mainly applied to the simplification of alignment subexpressions, as shown in Procedure simplify-alignment (Figure 9). The procedure for simplifying distribution and physical mapping simply cancels g with g⁻¹ in each dimension.

Discussion on Algebraic Simplification
In the communication algebra, each reduction rule results in a basic operator, decreasing the length of the communication expression.The inverse rules and the exchange rules do not increase the number of operators in the expression.Consequently, the algebraic simplification procedure converges.
The simplification procedure can be carried out efficiently. The length of a communication expression depends only on the number of levels of nested procedure calls, which is usually a small constant (less than four in most application programs). Let the length of a communication expression be l and the number of array references in the program be N. Then at most ½ l(l − 1) N time steps are required to simplify the communication expressions for the entire program.
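For concreteness, plugging representative values into the ½ l(l − 1) N bound (illustrative numbers, not from the paper's experiments):

```python
# Worked instance of the (1/2)*l*(l-1)*N bound from the text
# (integer arithmetic; l and N are illustrative values).
l, N = 4, 1000                 # expression length, number of array references
steps = l * (l - 1) * N // 2
print(steps)  # 6000
```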

Optimization of Composition Order
The simplification procedure reduces a communication expression to its final form. The compiler then pattern matches the final form against the set of communication idioms. If the final form contains more than one alignment operator (i.e., it cannot be further simplified by the current set of rules), it will either be pattern matched with multiple communication idioms or be unfolded and collapsed into a general communication function, depending on the characteristics of the target machine.
Note that the ordering of the operators in the composition in the final form does not affect the correctness of the target program. It does, however, affect the cost of communication. For example, suppose a communication expression is reduced to a final form that contains a shift operation and a replication operation. There are two alternative orders of composition:

Procedure simplify-communication
Input: a communication expression γ1 ∘ β1 ∘ α1 ∘ α2⁻¹ ∘ β2⁻¹ ∘ γ2⁻¹, where β1, β2, γ1, γ2 are d-dimensional operators denoted by products of d one-dimensional operators. We use gᵢ to refer to the ith product component of a multi-dimensional operator g (i.e., gᵢ is the one-dimensional operator in dimension i). Output: a simplified communication expression.
1. Call Procedure simplify-alignment to simplify the alignment subexpression α1 ∘ α2⁻¹.

2. If the alignment subexpression is reduced to an identity operator, then simplify the distribution subexpression β1 ∘ β2⁻¹ by reducing the composition of β1ᵢ and β2ᵢ⁻¹ in each dimension i. Otherwise, return the simplified expression (it will then be pattern matched with the communication idioms).
3. If the distribution subexpression is reduced to an identity operator, then simplify the physical mapping subexpression γ1 ∘ γ2⁻¹ by reducing the composition of γ1ᵢ and γ2ᵢ⁻¹ in each dimension i. Otherwise, return the simplified expression (it will then be pattern matched with the communication idioms for layout conversion).
4. If the physical mapping subexpression is reduced to an identity operator, then no data movement is required. Otherwise, return the simplified expression for pattern matching with the communication idioms for changing physical mapping.
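The staged control flow of Procedure simplify-communication (try alignment; only if it vanishes, try distribution; only if that vanishes, try physical mapping) can be sketched as below. The helper functions and the toy integer model of subexpressions are hypothetical, purely to make the control flow runnable:

```python
# Sketch of the staged control flow of Procedure simplify-communication
# (hypothetical helpers; stage subexpressions are opaque values here).
# Each stage is tried only if the previous one reduced to the identity.
def simplify_communication(stages, simplify_stage, is_identity):
    """stages: ordered list of (name, subexpression) pairs, e.g. alignment,
    distribution, physical mapping. Returns the first stage whose
    simplified subexpression is not the identity (for idiom matching),
    or ('none', None) when no data movement is required."""
    for name, expr in stages:
        expr = simplify_stage(expr)
        if not is_identity(expr):
            return name, expr
    return "none", None

# Toy model: subexpressions are integers, simplification is reduction
# modulo 5, and 0 plays the role of the identity operator.
simp = lambda e: e % 5
ident = lambda e: e == 0
print(simplify_communication(
    [("alignment", 10), ("distribution", 7), ("mapping", 0)], simp, ident))
# ('distribution', 2)
```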

Procedure simplify-alignment
Input: an alignment subexpression. Output: a simplified alignment subexpression. Repeat steps 1 to 4 until no rules can be applied.
1. Reduce immediately reducible operators, except adjacent MULTID operators, in the expression by the inverse rules and reduction rules.
2. Apply the appropriate exchange rule to the first pair of operators in the expression that cannot be reduced immediately. Then move the starting pointer right by one operator.
3. Perform Product-composition-exchange on adjacent MULTID operators in the expression, so that the composition of MULTID operators becomes a product form whose components are compositions of ALIGN1 operators.
4. Simplify each of the product components produced in step 3 using the rules in the ALIGN1 subalgebra.

Expression (1), which performs replication in dimension two followed by an end-off shift in dimension one, costs more than expression (2), which performs a shift in dimension one followed by replication in dimension two: the communication volume for replication is the same in both expressions, but expression (1) requires communicating sizeof(D1) * sizeof(D2) elements in the EOSHIFT operation, while expression (2) only requires communicating sizeof(D1) data elements.
To further optimize communication volume, we order the communication idioms in the final form, from right to left, as follows: message-reducing operators (e.g., reduction) first, then message-preserving operators (e.g., shift, transpose, affine transform), and finally message-broadcasting operators (e.g., spread).
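The volume argument behind this ordering can be made concrete with a small cost model (illustrative assumptions: |D1| = n1, |D2| = n2, and each moved element counts once):

```python
# Communication-volume model for the two composition orders discussed in
# the text, with |D1| = n1 and |D2| = n2 (illustrative cost model only).
def volume_shift_then_spread(n1, n2):
    # Shift in dimension one on D1 (n1 elements), then replicate along
    # dimension two (n1 * n2 elements moved by the spread).
    return n1 + n1 * n2

def volume_spread_then_shift(n1, n2):
    # Replicate first (n1 * n2 elements), then EOSHIFT on the full
    # replicated domain (another n1 * n2 elements).
    return n1 * n2 + n1 * n2

n1, n2 = 256, 64
print(volume_shift_then_spread(n1, n2))   # 16640
print(volume_spread_then_shift(n1, n2))   # 32768
```

Shifting before spreading wins whenever n2 > 1, which is why message-broadcasting operators are pushed to the left (last).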

Example
Figure 10 shows the transformation result for the inter-procedural layout conversion given in Figure 7. The simplified communication expression matches the idiom for matrix transposition where the matrix is partitioned one-dimensionally. In the transformed program, the size of array A is expanded to match the template size according to the alignment directives. Calls to a communication routine matrix-transpose-1d-partition are inserted to move array A to the proper layout before calling BETA and to restore array A's layout after returning from BETA.
Figure 12 of Appendix A illustrates the transformation for intra-procedural data movement.

EXPERIMENTAL STUDIES
Experiments were conducted to evaluate the effectiveness of the optimizations. Three synthetic codes (ALIGN1, ALIGN2, ALIGN3) were used to evaluate the benefit of algebraic simplification, and two benchmark codes (ADI: a PDE solver using the Alternating Direction Implicit method, and FFT: Fast Fourier Transform using hybrid block and cyclic distribution) were used to demonstrate the impact of idiom matching for fast data layout conversion. The benchmark codes are listed in Appendix C.

FIGURE 10 Algebraic simplification for inter-procedural layout conversion (procedure ALPHA of Figure 7; by Rule 11, exchange of MULTID and ALIGN2 operators, the expression matches the matrix-transposition idiom). Source program: real A(2,4), template T1(3,6), align A(i,j) with T1(i+1,j+2), call BETA(A). Transformed program: real A'(2:3, 3:6), TEMP(3:6, 2:3), with call matrix-transpose-1d-partition(TEMP, A') before call BETA(TEMP) and call matrix-transpose-1d-partition(A', TEMP) afterwards.
We report our experimental results on the Connection Machine CM-5 located at the AHPCRC at the University of Minnesota. The CM-5 has a total of 896 processing nodes (PNs), configured as various-sized partitions. Each processing node is a SPARC processor with four optional vector units that together deliver a peak rate of 128 Mflops [14].
In our experiments, we wrote the five benchmark codes in CM-Fortran syntax. Two versions were generated for each benchmark code: a CM-Fortran version without any of the algebraic optimizations (the unoptimized version), and a CM-Fortran version that was the transformed result of algebraic optimization. We then compiled the CM-Fortran codes with the vector-unit option enabled.

Algebraic Simplification
Table 3 shows the performance of the three synthetic codes on 64 processors. In ALIGN1, without optimization, array a was copied to a canonical heap temporary using shift communication, and the two-dimensional EOSHIFT was then carried out on the temporary. With algebraic simplification, the EOSHIFT intrinsic traffic in the assignment statement was cancelled out with the data alignment, and therefore all data movement became local memory accesses. This optimization yields speedup factors of 1.3 to 1.9. The speedup factors for FFT are consistent with the results for ADI.

Data Layout Conversion
Table 5 shows the scaled speedups (fixing per-processor problem size and varying the number of processors) of ADI and FFT. When per-processor size is fixed, speedup factors for ADI increase with the number of processors (from a factor of 2.15 on 32 processors to a factor of 2.86 on 512 processors), because the number of messages increases linearly with machine size and message contention becomes a more serious problem on larger machines. For larger machine sizes, speedup factors for FFT also increase with machine size.

Summary
The results from the three synthetic codes all show a positive impact of algebraic simplification on program performance, because it is always beneficial to reduce away redundant layout conversions between procedure calls and unnecessary local copying through canonical temporary storage.
The results from the two benchmark codes ADI and FFT show that optimized layout conversion (compile-time idiom matching plus specialized runtime communication routines) can reduce communication time significantly. The reason is that even on a regular communication architecture like the CM-5, message contention may cause inefficiency. Both benchmark codes involve all-to-all communication, which produces heavy message traffic in the network. By carefully scheduling these messages, contention can be reduced greatly. Research has shown that the message contention problem is present on many massively parallel machines [4,5,36,37,39]. Consequently, those machines will also profit from this optimization (perhaps with different implementations of the runtime communication routines).

RELATED WORK
A number of prototype compilers for Fortran 90/HPF have been developed in the past few years. We first briefly review the communication optimization techniques used in some of these compilers. The Fortran D compiler [20,44] performs various optimizations (message vectorization, message pipelining) to reduce communication overhead. However, the Fortran D compiler currently handles only a small subset of HPF's data layouts (canonical alignment and one-dimensional data partitioning), while our framework is applicable to more general cases. The Fortran 90D compiler [6] optimizes data movement for subscripted array references in parallel loops using linear index-function transformation and pattern matching for collective communication. Because it formulates data movement using linear transformations, optimization of nonlinear alignment, such as CSHIFT and replication, and of data redistribution is not possible. Vienna Fortran and Vienna Fortran-90, based on the parallelizing system SUPERB, extend Fortran/Fortran-90 with alignment and distribution specifications. The Vienna Fortran compiler [8,48] currently supports only arbitrary rectilinear block distributions. The ADAPT system [35], developed at the University of Southampton, compiles Fortran 90 for execution on MIMD distributed-memory machines. The ADAPT system also simplifies the data movement problem by restricting itself to a universal communication model. The SUIF compilation system [3,46], developed at Stanford University, uses integer matrix notation and affine transformations for optimizing data movement in data-parallel programs. This approach is insufficient for handling HPF's nonlinear alignment operations (e.g., CSHIFT (cyclic shift), CSKEW (cyclic skew), and SPREAD (replication)) and data redistribution.
On the industry side, the CM Fortran compiler [15] uses a simple but naive copy-in/copy-out strategy for inter-procedural data movement, and copying via a canonical temporary for intra-procedural data movement. To our knowledge, many commercial compilers either support only a subset of HPF's standard alignment and distribution specifications or, although they support full HPF data distributions, do not tackle complex data movement optimization issues as we do (e.g., APR's Forge90 compiler and the IBM xlHPF compiler).
Our algebraic transformative framework also relates to other more specific research efforts. The technique for generating collective communication, pioneered by the Crystal compiler [32,33], has had great influence on our work. The major differences are: (1) the Crystal compiler finds optimal (or near-optimal) data alignment automatically, while our framework optimizes data movement in the presence of user-provided data layout specifications; and (2) the Crystal compiler does not optimize inter-procedural data movement as we do.
The array synthesis scheme proposed by Hwang [25] employs index function transformation for HPF's alignment directives and array intrinsics.Array operations are translated to nested loops with explicit index subscript expressions.Non-linear array operations are handled by duplicating multiple copies of the loop nest, each corresponding to a particular boundary condition.Our concern with this approach is that the program size may increase rapidly with the number of non-linear operations.
Another important research area is generating communication sets for array section movement. Several approaches have addressed the efficient execution of array statements involving block-cyclically distributed array sections (e.g., the address generation framework of [1]). We would like to point out that the algebraic transformative framework presented in this paper does not compete with that work. The main focus of this paper is a framework for high-level optimization of data movement; that is, the optimization, including algebraic simplification and idiom matching for collective communication, is performed purely in the global, logical space defined in the program, working at the algebraic representation level only. A lower-level framework, which is not presented in this paper, handles all the details of generating send/receive pairs. Internally, the lower-level framework employs techniques similar to those in the existing literature cited above.
There is also work on global optimization of data movement. Gilbert and Schreiber [10,17] designed a dynamic programming algorithm for optimizing temporary storage use for Fortran 90 array statements. Chatterjee et al. [11,12] extended that work to allow loop nests. Ju et al. [27] (and later Hwang et al. [25]) proposed a synthesis scheme for combining consecutive data reference patterns to reduce communication. Another line of work on global optimization of communication is based on dataflow analysis, e.g., the work of Amarasinghe and Lam [2], Gong et al. [18], and Gupta et al. [19]. The current implementation of our optimization framework assumes the owner-computes rule for compilation of data movement. We plan to incorporate global optimization techniques in the near future.

CONCLUSIONS
In this paper, we have described the theoretical aspects and experimental results of the algebraic transformation framework. We expect the effectiveness of this optimization technique to be even more significant for larger application programs, which usually contain many program modules and may involve abundant use of array operations.
Two major optimization primitives in our framework are algebraic simplification of communication expressions and idiom matching for fast communication. Most of the communication idioms we have collected are not architecture-specific; they may or may not have specialized, fast implementations, depending on the target machine. Specialized implementations of communication routines may or may not have significant impact on more regular communication architectures, so idiom matching may not be crucial to achieving high performance on such machines. On the other hand, algebraic simplification is a high-level, abstract transformation technique that reduces data movement within the purely logical, global space defined in the program. Any redundant layout conversions between procedure calls and any unnecessary local copying through canonical temporary storage are reduced away abstractly by algebraic simplification. Consequently, even on a very balanced, regular communication architecture, communication overhead can still be reduced by high-level pattern matching and algebraic simplification.
The algebraic transformation framework (including an algebraic representation of data movement, alignment, and distribution; an algebraic engine and associated heuristics; and runtime layout conversion and communication services) allows a compiler to reduce data movement at the abstract level and leave machine-dependent details to the runtime system. Although presented in the context of an optimizing compiler, the algebraic transformation framework can also be used as part of a runtime system for optimizing data movement that is dynamic or dependent on runtime-computed values. By modeling the different stages of data mapping (alignment, distribution, physical mapping) and data movement using communication expressions, and by providing algebraic rules to simplify each stage of data movement, the algebraic framework is conceptually clean and portable across target architectures.
Recently, there has been some progress in industry toward applying simple kinds of layout optimization in commercial HPF compilers. For example, TMC's CM-Fortran compiler version 2.2 includes some similar optimization techniques for a very restricted subset of layout directives (shift/shift combination). We hope that eventually most commercial compiler groups will adopt our algebraic transformative strategy, or something similar to it, to ensure that users can write HPF code without worrying about compiler blind spots (like "copy in-copy out" calling sequences, or redundant sequences of copies through heap temporaries whenever non-trivial alignments are in force).

FIGURE 1
FIGURE 1 An example of data mapping and data movement in HPF. The alignment of the two arrays is intended for the eventual reduction of the communication in the CSHIFT array intrinsic.

FIGURE 4
FIGURE 4Overview of the algebraic analysis framework.
For ALIGN1, both the array intrinsic EOSHIFT(A, dim, shift=-c) and the alignment directive ALIGN A(i) WITH T(i+c) offset array A in the positive direction by distance c; therefore both are denoted by EOSHIFT(c), which, when given an index domain D with lower bound lb(D) and upper bound ub(D), maps D to a new index domain with lower bound lb(D) + c and upper bound ub(D) + c.

FIGURE 5
FIGURE 5  Graphical illustration of some alignment operators and MULTID operators.
The operator, when given index domain D, maps D to Domain(T) according to the coefficient matrix (b). The RESHAPE operator maps index domain D1 to D2, according to the mapping orderings given in permutation matrices O1 and O2, respectively. For instance, RESHAPE(Interval(1, 2) × Interval(1, 6), …).
communication expression e = g2 ∘ a ∘ g1, as shown in the commuting diagram of Figure 6b.
The expression γ1 ∘ γ2⁻¹ in the first row of Table 2 indicates a change of physical mapping strategy (e.g., a change from Gray code encoding to binary code encoding on hypercube architectures), because the alignment and partition operators have all been reduced away. The expression ×ᵢ₌₁..d (γ1ᵢ ∘ β1ᵢ ∘ β2ᵢ⁻¹ ∘ γ2ᵢ⁻¹) in the second row of the table indicates a change of data partition. For instance, the idiom CYCLIC(1, p) ∘ BLOCK(b)⁻¹ indicates conversion from BLOCK partition to CYCLIC partition; another idiom indicates conversion from column partition to row partition.
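The BLOCK-to-CYCLIC conversion idiom implies real data movement because the two partitions assign most elements to different owners. A toy one-dimensional model (illustrative only; processor counts and sizes are assumptions) makes this concrete:

```python
# Sketch of why CYCLIC(1, p) o BLOCK(b)^-1 implies data movement: the
# owner of each element differs between the two partitions (toy model,
# one dimension, p processors, block size b).
def block_owner(i, b):
    return i // b      # BLOCK(b): element i lives on processor i // b

def cyclic_owner(i, p):
    return i % p       # CYCLIC(1, p): elements dealt round-robin

n, b, p = 16, 4, 4
moved = [i for i in range(n) if block_owner(i, b) != cyclic_owner(i, p)]
print(len(moved), "of", n, "elements change owners")  # 12 of 16 ...
```

Only the elements on the "diagonal" (where i // b equals i % p) stay put; everything else must be communicated, which is why this idiom dispatches to a specialized repartitioning routine.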

FIGURE 8
FIGURE 8 Organization of the communication algebra.
For convenience, we use the notation a*D + c for an index domain with lower bound a*lb(D) + c and upper bound a*ub(D) + c. Rule 1 computes the inverse of an ALIGN1 operator. Rule 2 reduces two adjacent ALIGN1 operators into one by integer addition and multiplication on their arguments. Rule 3 exchanges the positions of two adjacent ALIGN1 operators that cannot be directly reduced. Non-adjacent operators are reduced by applying a sequence of Rule 3 and Rule 2.

FIGURE 12
FIGURE 12 Algebraic simplification for intra-procedural data movement.

Figure 12
Figure 12 illustrates the algebraic transformations for these expressions. Data movement in the first statement is reduced to a one-dimensional CSHIFT operation, while the second statement requires an additional reflection operation (REFLECT) due to the effect of the data alignment directives.

Table 3 .
Execution Time in Seconds on 64 Processors

Table 4 .
Execution Time in Seconds on 64 Processors (ADI: 10 iterations, double precision floating points, FFT: double precision complex, with block-cyclic layout conversion)