Compiler Technology for Parallel Scientific Computation

There is a need for compiler technology that, given a source program, will generate efficient parallel code for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with the equational programming language (EPL). Our approach is based on program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.


INTRODUCTION
As scientific and engineering problems grow ever more complex, there is an increasing need for software tools that provide solutions with minimal user involvement.
Parallel computation is becoming indispensable in the solution of the large-scale problems that arise in science and engineering. Although the use of parallel computation has been increasing, its widespread application has been hampered by the level of effort required to develop and implement the needed software. Parallel software often must be tuned to a particular parallel architecture to execute efficiently; thus, it often requires costly redesign when ported to new machines. Parallel program correctness requires the results to be independent of the number and speed of the processors. This requirement can be satisfied only if the parallel tasks are independent of each other or properly synchronized when a dependence exists. Designing proper synchronization is a major source of difficulty in ensuring parallel program correctness. Different categories of parallel architectures have led to a proliferation of dialects of standard computer languages. Varying parallel programming primitives for different parallel language dialects greatly limit parallel software portability. Poor portability of parallel programs has resulted in a duplication of efforts and has limited the use of developed systems.
Parallel computation can be viewed as an interwoven description of operations that are to be applied to data distributed over the processors, and of data mapping and synchronization that dictate the data movements and the computation order. The traditional programming languages, such as Fortran, C, or C++, cope well with the task of prescribing operations to be performed. However, the description of data mapping and synchronization in such languages is often introduced by ad hoc architecture-dependent extensions. Examples are the various synchronization constructs such as busy-wait, locks, or barriers used in programs for shared-memory machines; send and receive with different semantics employed by programs for message-passing architectures; and dimension projection and data broadcast popular in programs for single instruction multiple data (SIMD) computers. To avoid such architecture-dependent language definitions, we propose to separate the description of operations to be performed on the data values from the definition of data mapping and synchronization needed to supply these data values to the proper processor at the proper instance of the program execution.
With this goal in mind, we developed tools [1, 2] that (i) decompose, at least partially, the parallel program into the two (nearly) orthogonal parts described above, (ii) translate the necessary data movements into an optimal form customized for the target architecture, and (iii) synthesize an overall parallel computation. Using these tools the user can describe high-level features of a program and synthesize parallel computation from numerical algorithms, program fragments, and data structures that are implemented separately. The tools support (i) parallel task generation and their allocation to the processors, (ii) distribution of data to the processors, (iii) run-time optimization, and (iv) rapid prototyping of different parallel implementations.
Through the application of transformation techniques, different versions of the same program can be generated from decomposed components. The synthesized computation uses load assignment, data distribution, and synchronization appropriate to the size and type of the target parallel architecture. The computation synthesis is guided by conditional dependence graphs that represent externally accessible information in each of the synthesized fragments. Usage of conditional information in data flow analysis and parallelization significantly increases the efficiency of the generated parallel code.
The summary view of our approach is given in Figure 1. Program components are created by annotating source programs in Fortran or in the functional parallel equational programming language (EPL) [3]. Fortran programs are transformed into an equational form before decomposition. The configuration definition guides the synthesis of the components into a parallel computation. The synthesized computation together with the architecture description is used by the code generator to produce an object code customized for the target architecture. In Figure 1, continuous lines describe system paths that have been implemented, broken lines represent paths currently under development, and dotted lines correspond to paths at an early stage of investigation.
A brief description of the EPL language, its annotations, and configurations is given in Section 2. The relationship of EPL constructs and tools to different levels of parallelism is discussed in Section 3. The EPL compiler is discussed in Section 4; in particular, Section 4.4 includes an overview of our approach to scalable parallel code generation. A dynamic load management strategy for adaptive scientific computation on SIMD architectures is the topic of Section 5. Finally, conclusions and a comparison to other approaches are given in Section 6.

OVERVIEW OF THE EPL LANGUAGE
EPL is a simple nonstrict functional language with type inference designed for scientific computation. Although computationally vast, scientific computations are typically quite regular both in terms of control flow patterns and employed data structures. The data structures used are usually some variations of multidimensional arrays (sparse matrices, grids, jagged-edge arrays, and even some hierarchical structures can be viewed as such). Correspondingly, the EPL language is defined in terms of just a few constructs: generalized arrays and subscripts for data structures, recurrent equations for program specification, ports for process communication, and virtual processors to facilitate mapping of computation onto processors and load balancing.
A computation is viewed in EPL as a collection of cooperating processes. A process is described by an EPL program that consists of only data declarations and annotated conditional equations. The canonical data structure is a tree with nodes that can repeat and with leaves of primitive types. In its simplest form such a tree can be viewed as a multidimensional array, with each level of the tree corresponding to a new dimension of the corresponding array. Structured files are provided for communication with an external environment (in records) and with other processes (through ports). EPL enforces a single-assignment rule, i.e., each data element should be defined exactly once (the EPL compiler, however, is free to produce multiple-assignment object codes). Thus equations, though syntactically reminiscent of assignment statements, are best viewed as assertions of equality.
The EPL programmer also defines the process interconnection network (the graph obtained by representing processes as nodes and port interconnections as edges) in the configuration file. Processes along with the configuration files are provided by the user to assist the compiler in extracting the coarse grain parallelism in the computation by generating processes and interprocess communication constructs. Configurations also allow the programmer to reuse the same process in different computations.

Iterations
An iteration is a staple of scientific computing. In EPL, iterations are programmed using subscripts. A subscript assumes a range of integers as its value. Subscripts give EPL a dual flavor. In the definitional view, they may be treated as universal quantifiers and equations can be viewed as logical predicates. In the operational view, they can be seen as loop control variables and each equation can be seen as a statement nested in the loops implied by its subscripts.
There is a special class of indirect indexes, called sublinear subscripts, that is used in scientific applications so often that a special construct devoted to it has been introduced in EPL. Formally, an indirect index s defined over the subscript i is sublinear to this subscript if it satisfies the following property: s[1] ∈ {0, 1} and 0 ≤ s[i+1] − s[i] ≤ 1 for all i. It immediately follows from this definition that the sublinear subscript s[i] starts with the value of either 1 or 0 and then, with each increase of i, it is either incremented by 1 or kept unchanged. Typically, there is a condition associated with each sublinear subscript. The condition dictates when the subscript increases. This is the way a sublinear subscript is defined in EPL. For example, a sparse matrix S that is a row-major representation of a matrix D can be defined in EPL using a sublinear subscript col[j] that is incremented exactly when D[i, j] ≠ 0. Sublinear subscripts have an implicit range determined by the number of times the defining condition yields true.
The sublinear subscripts are convenient in expressing such operations as creating a list of selected elements, operating on sparse matrices, or defining a subset of a given set. Even more important is the fact that in the implementation of a process no new iteration has to be created for computation associated with the sublinear subscripts. Instead, all necessary computation can be nested in the iterations created for the subscripts in terms of which the considered sublinear subscript has been defined. Sublinear subscripts are also useful in defining dynamic distribution of data to processors at run-time. An example of such a definition is given in Section 5.2.
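To make the behavior of a sublinear subscript concrete, the following sketch (Python used here only as executable pseudocode, not EPL; the dense matrix D is an invented example) mimics a sublinear subscript col that advances only when the defining condition D[i][j] ≠ 0 holds, thereby compacting each row of a sparse matrix.

```python
# Illustrative sketch (not EPL): a sublinear subscript as a conditional counter.
# For each row of a dense matrix D, col advances only when the condition
# "D[i][j] != 0" holds, so it enumerates the compacted (sparse) positions.

def sparse_rows(D):
    S, cols = [], []          # values and original column indices, row by row
    for row in D:
        s_row, c_row = [], []
        col = 0               # the sublinear subscript: starts at 0 ...
        for j, v in enumerate(row):
            if v != 0:        # ... and is incremented only when the condition holds
                s_row.append(v)
                c_row.append(j)
                col += 1
        S.append(s_row)
        cols.append(c_row)
    return S, cols

D = [[0, 3, 0, 5],
     [7, 0, 0, 0]]
print(sparse_rows(D))   # ([[3, 5], [7]], [[1, 3], [0]])
```

The final value of col in each row is exactly the implicit range of the sublinear subscript for that row.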

Reduction
A computation that frequently occurs in scientific applications is to apply a binary operation over an entire vector and store the result in the last element of the vector. For example, in scientific computation there is often a need to apply an associative operator (such as +, *, max, min, etc.) selectively on the elements of an array. Scan and reduce are language constructs in EPL and other parallel languages that allow such operations to be succinctly written. Reduce applied to a vector of values produces a scalar result, whereas scan results in a vector of partial results. For example, consider a matrix A multiplied by a vector x with the result placed in a vector r; this operation can be written in EPL as a single recurrent equation or, even more succinctly, with the built-in reduce form. Such operations result in references of the form V[..., range.i, ...], where range.i indicates the range of the reduced/scanned dimension of a multidimensional array V. (In general, the EPL range variable prefix denotes the size of its suffix.) The presence of such references in the program is exploited by memory optimization and scheduling, which are discussed later. A more detailed description of the language is given by Szymanski [3].
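The distinction between reduce and scan can be illustrated with a small sketch (Python as executable pseudocode; the EPL one-line formulations are not reproduced here). Both forms are applied to the same matrix-vector product; reduce keeps only the total per row, scan keeps all partial sums.

```python
# Illustrative sketch: reduce produces a scalar per row, scan the vector of partial sums.
from itertools import accumulate

A = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 1, 1]

# reduce over the column dimension: r[i] = sum_j A[i][j] * x[j]
r = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# scan over the column dimension: partial[i][j] = sum_{k<=j} A[i][k] * x[k]
partial = [list(accumulate(a_ij * x_j for a_ij, x_j in zip(row, x))) for row in A]

print(r)        # [6, 15]
print(partial)  # [[1, 3, 6], [4, 9, 15]]
```

The last element of each scanned row equals the reduced value, which is what a reference of the form V[..., range.i, ...] selects.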

Configurations
In our approach a parallel computation is viewed as a collection of cooperating processes, which are defined as functional programs. Process cooperation is described by a simple macro data flow specification, called a configuration. Configurations support programming-in-the-large. The user can experiment with various configurations to find the one that results in the most efficient code.
The configurator uses the dependence graph created during configuration analysis to generate an architecture-independent parallel description that is fed to the code generator. Configurations define processes (and their aggregates) and ports. Statements of the configuration represent relations between ports in different processes. They are supplied by the user to direct the integration of the processes into a parallel computation. Processes created dynamically can communicate with ports located at parent, child, and sibling processes; each of those processes is just a copy of the same program, except the parent process, which can be arbitrary.
Consider as an example an iterative solver of the linear equations Ax = b that uses a recursion whose first part is a matrix-vector multiplication; this part may form a separate process, as shown in Figure 2.
Note that there are no explicit input/output statements and the order of equations is irrelevant because all variables are singly valued. If we assume that a separate process, let's call it XC, calculates the new approximation of the vector x and monitors convergence, and that a third process, MAIN, provides final input/output, then the corresponding configuration is shown in Figure 3. The textual definition lists data flow paths that cover the configuration graph; for this example it reads:

Configuration file:
    Input:  P:MAIN -> inf -> P:MVM
    Output: P:XC -> xf -> P:MAIN
    P:XC -> inf -> P:MVM -> ouf -> P:XC

The graphical definition is built from process boxes and file edges. It is augmented with file structure information provided by the EPL system (see Fig. 3b).

Program Decomposition Through Annotations
Annotations provide an efficient way of introducing the user's directives that assist the compiler in program parallelization. Annotations have been proposed in many systems by various researchers [11-15] and are used mainly as compiler directives. In our approach annotations limit the feasible mappings of computation onto the processors. Hence, they are used only during the decomposition of a process into smaller fragments. This kind of annotation is similar to the ON clause as used in the Kali compiler [11], Fortran D [12], or Vienna Fortran [13].
Annotation does not have any effect on the result computed by a program. Consequently, sequential programs that have manifested their correctness over many years of usage are good candidates for parallelization through annotations. Being orthogonal to the program description, annotations support rapid prototyping of different parallel solutions for the same problem, which can be helpful in performance tuning.
In EPL, each equation can be annotated with the name of an array of virtual processors on which it is to be mapped. Virtual processors can be indexed by the equation's subscripts to identify instances of equations assigned to individual virtual processors. Such instances constitute the smallest granule of parallel computation. For example, for the process MVM, an annotation mapping the result equation onto virtual processors P[i], indexed by the subscript i, will cause the compiler to consider only the tasks that define a sequence of vectors r[*, i]. Each task will locally store one row of array A, but the vectors x[k, *] must be broadcast to all of those tasks.
The above partitioning allocates a slice of the equation defined by a single subscript value. The resultant granularity may be too fine for a target architecture. However, when an annotation is indexed by a sublinear subscript, the corresponding sublinear expression dictates how the annotated equations are clustered onto the virtual processors. For example, let p be a sublinear subscript of i, and let range.p be the number of physical or virtual processors. (This number may be a system constant not even known explicitly to the user; it may depend on the architecture or the system load, or it may be defined by the user or a compiler directive.) Considering again the previous example of a matrix-vector multiplication, an annotation indexed by p will distribute (or partition) one dimension of r and A over range.p processors in a block fashion (each processor will hold either ⌊range.i/range.p⌋ or ⌈range.i/range.p⌉ columns of r and rows of A). In Section 5.2 there is an example in which a different distribution is achieved using a sublinear subscript in an annotation; this distribution balances the load on the processors.
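The block clustering implied by such an annotation can be sketched as follows (Python as executable pseudocode; the names block_owner and range_p are illustrative, not EPL identifiers). Each of range_p processors receives either ⌊n/range_p⌋ or ⌈n/range_p⌉ consecutive rows, mirroring a sublinear subscript p that advances after each filled block.

```python
# Illustrative sketch: block distribution of n rows over range_p processors,
# mimicking a sublinear subscript p that advances after every block of rows.

def block_owner(n, range_p):
    base, extra = divmod(n, range_p)         # block sizes: base or base + 1
    owner, p, filled = [], 0, 0
    for i in range(n):
        owner.append(p)
        filled += 1
        block = base + (1 if p < extra else 0)
        if filled == block:                  # condition under which p increases
            p, filled = p + 1, 0
    return owner

print(block_owner(10, 4))   # [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
```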
There are similarities as well as differences between the EPL annotations and the Fortran language extensions that have been introduced in many systems, e.g., Vienna Fortran [13, 16, 17], Fortran D [12, 18, 19], and SUPERB [20]. Fortran D directives are similar to those of Vienna Fortran; however, distribution is separated from alignment. In Fortran D, first the DECOMPOSITION statement is used to declare a problem domain for each computation. The ALIGN statement is then used to describe problem mapping that defines the alignment of arrays with respect to each other.
Finally, the DISTRIBUTE statement is used to map the problem and its associated arrays to the physical machine.
In EPL, by subscripting the annotated virtual process names and defining the appropriate ranges for the subscripts, the user can distribute the arrays in blocks, columns, or rows. The arrays can also be transposed by permuting the subscripts of annotated virtual processors. Unlike Vienna Fortran and Fortran D, EPL does not provide the user with directives to do manual alignment of data. Instead, data alignment algorithms have been developed to facilitate this task automatically (see "Data Alignment" in Section 4.4). Hence alignment directives embedded in a source program are not necessary.

PARALLELISM EXTRACTION IN EPL
In EPL, compile-time parallelism is sought on three levels: coarse grain, medium grain, and fine grain. Figure 4 shows the tools that have been developed and their correspondence to various models of parallel computation. The control-parallel model assumes that there are separate, relatively independent processes or functions that can be executed simultaneously. This model requires the user to handle the error-prone and difficult task of synchronizing these independent processes. The configurator eases the burden of programming for control parallelism by automating the definition of interprocess coordination.

Data parallelism, popular in massively parallel systems, assumes that there are large data structures to be processed and that each element of every structure can be assigned to a single processor (either virtual or real). The same sequence of instructions is applied simultaneously to all elements of the processed structures. It is also necessary to decide which elements of the different structures should be placed on the same processor to minimize the cost of fetching arguments for operations involving those elements. Data alignment tools described in this article can find suboptimal solutions to this problem without user involvement.
Annotations, relevant mainly to loop parallelism, provide the user with the means of rapid prototyping of alternative parallelizations of the program. For example, by supplying proper annotations, the user can experiment with various combinations of column- and row-wise parallelizations of the matrix operations in a program.
A load-balancing problem surfaces at all three levels of parallelism. In Section 5 we describe how the partitioning tools developed for the presented compiler can be used to do either static or dynamic load balancing on linear or rectangular arrays of processors. The partitioning tool is applicable to irregular computations that result from using adaptive solvers of partial differential equations on either homogeneous or heterogeneous processors.
In EPL, the programmer can assist the compiler in extracting coarse- and medium-level parallelism. As described earlier, coarse grain parallelism is obtained by creating tasks from the processes and their interconnection network as specified in the configuration files. The programmer can help in determining the medium grain parallelism by annotating the equations in the source program. After determining the coarse and medium grain parallelism, the parallel program is synthesized with the help of the configurator.

EPL COMPILER
The basic techniques used in EPL compilation are data-dependence and data-attribute propagation. In a single program, the dependencies are represented in a compact form by the conditional array graph. A similar dependence graph is also created for a configuration. It shows the data dependencies among processes of the computation and is used for scheduling processes and mapping them onto the processors. Figure 5 depicts the structure of the EPL compiler by showing part of Figure 1 in more detail. In particular, all components of annotation processing, the precompiler, and the scalable code generator are explicitly shown. The major stages of the EPL compilation are:

1. Array graph construction, which transforms the source code into its intermediate form. The main components of this form are the array graph and the symbol table. The array graph nodes represent the variables and the equations. Each array graph edge represents the dependence between the nodes and is labeled by its attributes, such as the associated subscript expressions, dependence type, and conditions under which the dependence holds (a sketch of such a representation follows this list).

2. Dimension propagation, which checks correctness and assigns dimensionality to each EPL variable.

3. Type checking, which verifies that all variables and expressions have or can be assigned consistent types.

4. Completeness verification, which performs various semantic checks and verifies that each variable is defined over its entire domain.

5. Range propagation, which finds equivalences between ranges of variables and equations. The EPL compiler uses the concept of a range set as an object to which all equivalent ranges are linked. Range propagation links all dimensions that share a common limit into a range set.

6. Condition analysis, which establishes equivalence or exclusiveness of predicates used in conditional equations. The found relations of predicates are used in scheduling and verification.

7. Scheduling, which finds an array graph evaluation order that is minimal among all orders preserving the program semantics. The scheduler also defines the scopes and nesting of the loops in the object program. The output generated by the scheduler is used by the schedule optimizer and the code generator.

8. Schedule optimization, which is an architecture-dependent step that customizes the generated schedule to the target architecture (see McKenney and Szymanski [10] for SIMD-specific optimizations).

9. Annotation processing and configuration processing, which are discussed in the following subsections.
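As an illustration of the intermediate form produced in stage 1, the sketch below (Python as executable pseudocode; the class and field names are invented and do not reflect the compiler's internal data structures) records, for each dependence edge, the attributes that the later stages consult: the subscript expression, the dependence type, and the guarding condition.

```python
# Illustrative sketch of a conditional array graph: nodes are variables and
# equations, edges carry subscript expressions, dependence type, and conditions.
from dataclasses import dataclass, field

@dataclass
class Edge:
    source: str           # node the data comes from
    target: str           # node that uses the data
    subscript: str        # e.g. "i-1" for a recurrence on the previous element
    dep_type: str         # e.g. "flow"
    condition: str = "true"

@dataclass
class ArrayGraph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_dependence(self, source, target, subscript, dep_type, condition="true"):
        self.nodes |= {source, target}
        self.edges.append(Edge(source, target, subscript, dep_type, condition))

# The equation "x[i] = x[i-1] + b[i] if i > 1" contributes two edges:
g = ArrayGraph()
g.add_dependence("x", "eq1", "i-1", "flow", condition="i > 1")
g.add_dependence("b", "eq1", "i", "flow")
print(len(g.nodes), len(g.edges))   # 3 2
```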

Single Assignment Fortran
Through extensions and annotations, imperative languages, particularly Fortran, have maintained their dominance in scientific computation over such nontraditional languages as data flow, logic, or functional languages. Nevertheless, languages based on the single assignment rule have proven to be a convenient basis for developing sophisticated program optimizations. EPL research has centered its program optimization techniques on the array graph representation of recurrence equations. We believe that by transforming Fortran programs to the array graph representation, a wider spectrum of program optimization and parallel code generation methods can be applied to the transformed programs than to their Fortran source. An important step towards an efficient parallelization of Fortran programs with the help of the EPL compiler involves an equational transformation during which the equational equivalent of the program is generated [2]. The transformed programs obey the single assignment rule and do not contain any control statements [22]. The transformation is done in the following two steps:

1. Program expansion, during which the variables are expanded to enforce the single assignment rule. In particular, reassignment elimination involves replacing the reassigned variables by vectors (an additional dimension) inside loops and by variants in "if" branches and basic blocks.

2. Program optimization, which consists of:

Condition analysis: Conditions in the transformed program are analyzed using a Sup-Inf inequality prover [4] and the Kaufl variable elimination method [23] to find pairwise equivalent or exclusive conditions.

Variable variant elimination: Variants created under equivalent and exclusive conditions are merged into a single variable.

Additional dimension elimination: During scheduling and code generation for individual processes, memory optimization is performed to replace entire dimensions by windows of a few elements for multidimensional variables [7]. This step restores the memory efficiency of the original program.
The transformed Fortran program is then compatible with the programs produced by annotating EPL programs.
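The effect of program expansion on a reassigned variable can be pictured with a small sketch (Python as executable pseudocode; no actual Fortran front end is involved). A scalar accumulator that is overwritten in every loop iteration becomes a vector indexed by the iteration, so that every element is defined exactly once; the additional dimension elimination step later shrinks this vector back to a small window.

```python
# Illustrative sketch: a reassigned accumulator ...
#   s = 0
#   do i = 1, n:  s = s + a(i)
# ... becomes, after expansion to single-assignment form, the recurrence
#   s[0] = 0;  s[i] = s[i-1] + a[i]
# in which every element is defined exactly once.

def expanded_sum(a):
    n = len(a)
    s = [0] * (n + 1)                # the extra "temporal" dimension
    for i in range(1, n + 1):
        s[i] = s[i - 1] + a[i - 1]   # single assignment: each s[i] defined once
    return s

a = [2, 5, 1]
print(expanded_sum(a))   # [0, 2, 7, 8] -- s[n] is the original final value of s
```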

Annotation Processing
Each virtual processor produces data typically used by other virtual processors and in turn consumes data produced by others. By performing data-dependence analysis in the style of PTRAN [24], the annotation processor can find the dependencies local to each virtual processor as well as the data structures produced and consumed by this processor. All data produced by the processor become local to it and are placed in its local memory. The created parallel tasks are supplied with the communication statements needed to move nonlocal data. Parallel tasks associated with virtual processors at the bottom of the block hierarchy are the smallest components used in the program synthesis. Hence, annotation processing includes the steps described below.

Let G(V, E) be a task communication graph with a set of nodes V representing tasks and a set of edges E ⊆ V × V representing intertask communication. Each edge e(i, j) ∈ E has an associated cost c(i, j) that represents the volume of data being sent from task i to task j. In a spanning tree T, the distance d_T(i, j) is the length of the minimum path from task i to task j in T. Using these definitions, the cost of the spanning tree T can be defined as C(T) = Σ_{(i,j)∈E} c(i, j) · d_T(i, j). To minimize the total communication cost, a proper cut-tree must be found. This can be done in O(|V| − 1) steps [25] by solving |V| maximal flow problems.
To embed the tree, we developed a heuristic that selects the embedding using the following criteria:

1. Dimension nesting: If two tasks with different dimensionalities are connected in the task communication graph, the task with more dimensions should be located lower in the spanning tree.

2. Range nesting: Whenever possible, tasks sharing the same range should be clustered together in the spanning tree. Variables that share ranges usually appear in the same equations. Thus, clustering such variables together decreases the number of interprocess references to distributed variables.

3. Data flow: The total communication cost of the selected spanning tree should be the smallest among all spanning trees satisfying the above two criteria (a sketch of this cost computation follows the list).
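The spanning tree cost defined above can be evaluated directly from its definition, as in the sketch below (Python as executable pseudocode; the traffic volumes and the candidate tree are invented for the example). For every communicating task pair, the traffic volume is multiplied by the length of the path connecting the pair in the tree, which is the quantity compared when candidate embeddings are ranked under criterion 3.

```python
# Illustrative sketch: the cost of a spanning tree T is the sum, over all
# communication edges (i, j), of c(i, j) times the path length from i to j in T.
from collections import deque

def tree_distance(tree, a, b):
    """BFS path length between a and b in an adjacency-list tree."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

def tree_cost(tree, traffic):
    return sum(c * tree_distance(tree, i, j) for (i, j), c in traffic.items())

# Invented example: traffic volumes between three tasks and one candidate tree.
traffic = {("MAIN", "MVM"): 4, ("MVM", "XC"): 10, ("XC", "MAIN"): 1}
tree = {"MAIN": ["MVM"], "MVM": ["MAIN", "XC"], "XC": ["MVM"]}
print(tree_cost(tree, traffic))   # 4*1 + 10*1 + 1*2 = 16
```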
Trees created from an annotation of the matrix-vector multiplication program are shown in Figure 6. The double outgoing arrows indicate scattering of data from a task to a group of tasks. The double incoming arrows represent the inverse operation of gathering the data. For example, process MAIN scatters the vector x[0] among processors P[i]. On the other hand, process XC gathers the vector r[k] by collecting individual elements r[k, i] from processors P[i].

FIGURE 6 Communication tree for matrix-vector multiplication.

Program Synthesis with the Configurator
The goal of configuration processing is to establish scheduling constraints for the overall computation. In the parallel computation, individual process correctness is a necessary but not sufficient condition for the correctness of the entire computation. If a task has input and output ports that belong to a cycle in the configuration graph, then this task's input messages are dependent on the output messages. Such dependencies (in addition to dependencies imposed by the statements of a task) have to be taken into account in generating the object program for individual tasks; otherwise, loss of messages, process blocking, or even a deadlock can arise.
Tasks that belong to a cycle in the task communication graph can execute concurrently only if they are all enclosed in the same loop, including the respective send and receive statements. Such tasks are called atomic because they cannot be broken into parts without splitting the loop. For example, if a send statement is executed in a separate loop from the matching receive statement, then all messages will be sent before any one can be received, and the successors of such nonatomic tasks cannot start until their predecessors in the task communication graph finish sending messages.
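Whether tasks must be treated as atomic can be decided by a simple cycle test on the task communication graph, as in the sketch below (Python as executable pseudocode; the three-task example graph is invented and only loosely mirrors Figure 3). A task belongs to a cycle exactly when it can reach itself again by following communication edges.

```python
# Illustrative sketch: tasks that lie on a cycle of the task communication
# graph must stay inside one loop (they are "atomic" in the sense above).
from collections import deque

def reachable(graph, start, target):
    seen, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        for nxt in graph.get(node, []):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def cyclic_tasks(graph):
    """Tasks that can reach themselves again, i.e. belong to some cycle."""
    return {t for t in graph if reachable(graph, t, t)}

# Invented example: MVM and XC exchange messages in a cycle, MAIN only feeds input.
graph = {"MAIN": ["MVM"], "MVM": ["XC"], "XC": ["MVM"]}
print(sorted(cyclic_tasks(graph)))   # ['MVM', 'XC']
```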
The algorithm for finding external data dependencies has been presented by Spier and Szymanski [6]. The analysis starts by inspecting all atomic processes and then propagates transitive dependencies along the paths of the task communication graph restricted to atomic processes. As a result, a configuration-dependence file is created and later used by the synthesizer and the code generator. This file contains a list of the additional externally imposed data dependencies (edges and their dimension types) that need to be added to the task array graph. One task may have several such files, each associated with a different configuration in which this task participates.
Each edge in the configuration-dependence file may have the following effects on the program generated from the array graph:

1. An additional constraint is imposed by an edge if there is no equal or stronger internal dependency between the considered nodes.

2. An error is discovered when there are internal dependencies incompatible with the edge.
Hence, as shown in Figure 7, the dependence analysis for the synthesized computation has to be done in two stages.

Scalable Parallel Code Generation
Program execution can be seen as an evaluation of the arrays at various index points (elements of the index domain I). The order of execution is restricted only by data dependencies, which rarely impose a total order. Figure 8 shows the conceptual stages of mapping the index domain of a variable to the Cartesian product of the processor domain, the local memory domains, and the time domain. The goal is to find a mapping that results in the minimum execution time. In Figure 8, A represents a virtual architecture. It is defined by the computer interconnection network. For example, in a k-dimensional mesh-connected architecture of size N, processors can be thought of as arranged in a k-dimensional array with N = n_1 × n_2 × ... × n_k. The processor p[l_1, l_2, ..., l_k] is connected with processors p[l_1, ..., l_j ± 1, ..., l_k], 1 ≤ j ≤ k, provided that processor p[l_1, ..., l_j ± 1, ..., l_k] exists (l_j ± 1 mod n_j in the case of a torus-connected architecture). To facilitate data alignment and time scheduling, we assume that the virtual architecture A is compatible with the domain I. The local memory domain L can be viewed as a multidimensional cube with volume equal to the actual local memory available on each processor. The virtual memory domain M has the same structure as the domain L, except that it has unlimited memory size. The execution time steps are represented by the time domain T = [1, t_max], where t_max is the total number of time steps needed to complete the computation.
In such a view, there are three major problems that need to be solved for generating optimized code for massively parallel architectures: data alignment, time scheduling, and memory optimization.
Data alignment is discussed in some detail in the next section. Time scheduling of iterative computations is usually done either through data-driven scheduling or through wavefront determination. Both methods exploit the fact that iterative computations often allow the simultaneous evaluation of many array elements. Data-driven scheduling starts the execution of an index point as soon as all data that this point depends on become available. However, data dependencies often hold under conditions that involve input data and therefore can be resolved only at run-time. Consequently, data-driven scheduling typically relies on run-time distributed synchronization. In the case of functional programs with single assignment and recurrent relations, compile-time data-driven scheduling is decidable [26]. Such a scheduler has been implemented in the compiler for the EPL language [7] and is not discussed here. Wavefront determination is presented below.
Programs written in EPL or transformed from Fortran obey the single assignment rule. A variable that is reassigned in a procedural language is seen as a vector of values with a different subscript value for each assignment. This extra temporal dimension allows the program to be specified without any reassignments but, unless optimized, may require an exorbitant amount of memory. The EPL compiler can often reduce the memory requirement of a program by replacing an entire dimension of an array by a few elements [7]. However, Sinharoy and Szymanski [27] have proven that the problem of finding the optimum replacement is equivalent to the well-known NP-hard maximum weight clique problem. Consequently, the EPL compiler uses heuristics to determine a good loop arrangement for memory optimization.
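The window replacement can be pictured on a one-dimensional recurrence (Python as executable pseudocode; the Newton iteration is an invented example, not taken from the article). Because no equation looks further back than one step, the entire temporal dimension of x can be stored in a window of two cells.

```python
# Illustrative sketch: replacing a whole temporal dimension by a window.
# Single-assignment view:  x[0] = x0;  x[t] = 0.5 * (x[t-1] + a / x[t-1])
# Only x[t-1] is ever referenced, so a window of two cells suffices.

def sqrt_iteration(a, x0, steps):
    window = [x0, 0.0]                 # two cells replace the whole array x[0..steps]
    for t in range(1, steps + 1):
        prev = window[(t - 1) % 2]     # x[t-1] is the only element ever referenced
        window[t % 2] = 0.5 * (prev + a / prev)   # x[t]
    return window[steps % 2]

print(round(sqrt_iteration(2.0, 1.0, 6), 6))   # ~1.414214
```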

Data Alignment
In a distributed memory parallel computer, a significant speedup can be achieved by distributing (or mapping) data structures in a program on to the processors.One processor is allocated (at least conceptually) to each array element or composite data structure.Operations on elements of two data structures can be performed entirely locally if the elements are allocated to the same processor; otherwise, processor communication has to be involved.The cost of communication depends on the relative position of the two processors involved and the architecture under consideration.One of the major challenges in programming distributed memory parallel computers is to distribute data structures among the processors so that the communication cost is minimized.
The problem is particularly acute when the communication is synchronous, as in the case of SIMD machines. In addition, different alignments of multidimensional arrays on a grid-connected SIMD architecture result in different communication patterns during parallel program execution. The usual approach to this problem [28, 29] is to select the best alignment for each array in the program independently of other arrays. Hence, such an approach does not succeed when the independently found alignments conflict with each other. Similarly, the algorithm presented by Gilbert and Schreiber [30] finds the minimum communication cost of evaluating an expression over a distributed processor array, but only for a single expression. Given the initial allocation of data, the algorithm determines the processors at which the temporary variables should reside and a subexpression evaluation should take place to minimize the communication cost.
Szymanski and Sinharoy [31] have shown that the data alignment problem for an entire program is NP-hard for all communication cost metrics. They [8] proposed a heuristic that starts with an integer approximation of the rational minimum of the cost function when the distance is defined by the second (Euclidean) norm. The initial solution is then iteratively improved by following the steepest decline direction of the cost function. Results of using this algorithm on random graphs are encouraging [8].
Here, we focus on the definition of the problem and its impact on code generation. Consider an equation e_{l_1,...,l_k} defined over k subscripts l_1, ..., l_k (such an equation corresponds to a statement nested in k iterations):

v[s_1, ..., s_k] = F(..., u[f_1, ..., f_k], ...)

where each simple indexing expression s_i on the left side of the equation is an affine function of the corresponding subscript l_i, and each indexing expression f_j on the right side is a function of possibly many subscripts. A large class of parallel scientific computations can be expressed as regular iterative algorithms (RIA) [32] in which all indexing expressions are of the form "l + c", where l is a subscript and c is an integer constant.
To generate efficient code for SIMD machines, one or two dimensions of a data array should be projected along the processor array [10]. For the i-th projected dimension of each array (and each equation), we define an alignment function a_i that maps the index of that dimension into the position of the virtual processor that stores (executes) its value. We consider the simplest but also the most useful form of the alignment function, defined as a constant shift, e.g., for variable v_j, a_j(l_i) = l_i + α_j, where α_j is an integer constant. Hence, the equation e can be rewritten with the alignment shifts applied to its indexing expressions. The equation then incurs a communication cost equal to γ times the sum of the distances d between the processor executing an instance of the equation and the processors storing its operands, where d is a distance metric, γ denotes the time needed for sending a unit message between two directly connected processors, and n is the dimensionality of the communication network. The distance metric is defined by the interconnection of the processors in the considered parallel architecture. Thus, the problem is to find alignment functions a for each of the variables and equations such that the communication cost C for the given set of assignments is minimal. Figure 9 shows the communication among the processors executing the i-th instance of Equation 1 along a single dimension. Contrary to the well-known Owner Computes rule, to minimize communication costs, the processor executing the i-th instance of the equation may be different from the processor that stores the i-th element of the array defined by this equation.
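The flavor of the constant-shift alignment problem can be shown on a tiny one-dimensional example (Python as executable pseudocode; the equation r[i] = a[i] + b[i-1], the shift range, and the hop-count metric are assumptions made for the illustration, and the equation instance i is pinned to processor i). The search reports the shift combination with the lowest per-instance communication.

```python
# Illustrative sketch: choosing constant alignment shifts for the equation
#     r[i] = a[i] + b[i-1]
# executed by processor i on a linear array, with hop-count distance.
from itertools import product

def hops(shift_r, shift_a, shift_b):
    # per-instance communication: fetch a[i], fetch b[i-1], store r[i]
    return (abs((0 + shift_a) - 0)       # a[i]   is stored on processor i + shift_a
            + abs((-1 + shift_b) - 0)    # b[i-1] is stored on processor i - 1 + shift_b
            + abs((0 + shift_r) - 0))    # r[i]   is stored on processor i + shift_r

best = min(product(range(-2, 3), repeat=3), key=lambda s: hops(*s))
print(best, hops(*best))   # (0, 0, 1) 0 -- align b one position ahead of a and r
```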

Array Operator Placement
Proper assignment of array operators to processors in large scientific computations executed on a distributed memory machine can reduce total computation time significantly. For example, consider the following computation† evaluated over a rectangular stencil. Let n_1, n_2 stand for the lengths of the sides of the stencil and let p_1, p_2 be the offsets (measured from the lower left corner of the stencil) of the desired position of the result.
† This example is based on the computation arising in modeling an ecosystem on the MasPar [33].
Let s_{i,j} be a data structure distributed over the two-dimensional processor array and let (m, q) be the coordinates of the processor that should receive the result. The computation is defined as:

result = Σ_{i=m−p_1}^{m+n_1−p_1} Σ_{j=q−p_2}^{q+n_2−p_2} f(s_{m,q}, s_{i,j})
The above computation is evaluated repeatedly for each rectangular stencil in the processor array. Hence, it is likely to dominate the total execution time. The above computation is an example of a reduction evaluated simultaneously over many overlapping continuous sections of an array. Other examples of usage of such operations are likely to be found in cluster recognition, fractal dimension computation in biological modeling [34], or in modeling physical phenomena (e.g., solvers of partial differential equations characterizing fluid flow). Simultaneous reduction is evaluated over a one-dimensional consecutive section of an array, called here an array interval; each array element is used as an operand to many reductions evaluated simultaneously over different overlapping intervals. This is distinct from what is usually referred to as parallel reduction, which involves the parallel evaluation of a single reduction [35] or its variants. An algorithm for standard parallel reduction that uses a balanced binary tree implementation for mesh-connected architectures has been presented [36]. Another standard parallel reduction algorithm has been introduced [37] for tree topologies of arbitrary but bounded fan-in and arbitrary tree depth. The segmented prefix problem is a variant of parallel reduction that subdivides a single dimension of processors into nonoverlapping contiguous regions of varying size. A multiple prefix algorithm that reduces noncontiguous regions simultaneously for this variant has been presented by Sanz and Cypher [38]. None of the published algorithms copes with the overlapping of the regions being reduced.
The efficiency of the simultaneous reduction has been discussed elsewhere [39]. It can be expressed as a function of (i) operation count, i.e., the number of required reduction operation steps, (ii) communication cost, i.e., a function of the number of messages sent (message count), the distances traveled by messages (hop count), and the length of the messages (message size), and (iii) memory count, i.e., the number of memory locations used to store intermediate results at each processor. The lower bounds for the above counts are ⌈log₂ n⌉ for the operation, message, and memory counts, n − 1 for the hop count, and 1 for the message size. For an interval of size n = 2^k and an arbitrary offset p, a modification of the well-known parallel prefix algorithm [35] achieves the above bounds. The modification defines the direction of the message transfer in each step by the corresponding bit of the binary representation of the offset p.
For an arbitrary interval size n and an arbitrary offset p we have designed an algorithm called intersect, which achieves the lower bound of the communication and memory counts and is within a factor of 1.5 of the lower bound of the operation count. For an arbitrary interval size n and an arbitrary offset p, we have also designed an algorithm called split, which produces the result with the memory, hop, and message counts equal to their lower bounds. The operation count and the message size are at most twice the value of the corresponding lower bound. Depending on the relative cost of the increased message size and operation count versus the smaller hop count, this algorithm may or may not outperform intersect for the given interval and offset.
For an arbitrary interval size we have designed two algorithms that require asymptotically small operation and message counts: both counts are log₂ n + 2 if the reduction's binary operator has an inverse, and log₂ n + 2(log₂ n)^c + o((log₂ n)^c), where c = log₁₂ 6 ≈ 0.721057, otherwise.
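For an operator with an inverse, such as addition, the overlapping-interval reduction reduces to prefix sums, which is the source of the log₂ n + 2 bound quoted above. The sketch below (sequential Python as executable pseudocode; the data, interval length, and offset are invented) computes every overlapping window of length n with result offset p from a single prefix pass.

```python
# Illustrative sketch: reducing every overlapping interval of length n with
# offset p.  For an invertible operator (+), one prefix pass suffices.
from itertools import accumulate

def overlapping_sums(values, n, p):
    prefix = [0] + list(accumulate(values))        # prefix[k] = sum of values[:k]
    total = len(values)
    out = []
    for q in range(total):
        lo, hi = q - p, q - p + n                  # interval assigned to position q
        lo, hi = max(lo, 0), min(hi, total)        # clip at the array boundary
        out.append(prefix[hi] - prefix[lo] if lo < hi else 0)
    return out

print(overlapping_sums([1, 2, 3, 4, 5], n=3, p=1))   # [3, 6, 9, 12, 9]
```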

Wavefront Determination
One of the most common forms of parallelism available in a scientific computation is data parallelism, in which the same operation is performed on many elements of an n-dimensional data array. In computation over such an array, a wavefront of computation can be defined as an (n − 1)-dimensional subarray whose elements are all evaluated simultaneously. Different wavefronts result in different performance, so the question arises how to determine the wavefront that results in the minimum computation time. Wavefront determination should also define which wavefront elements are to be computed by each processor at every execution step. This type of scheduling is appropriate for single program multiple data (SPMD) [40, 41] implementation on distributed memory architectures or for data parallelism on SIMD architectures. SPMD implementation, in general, requires larger parallel granules than SIMD implementation; therefore, it is more efficient provided that the computations at each index point are fairly complex (i.e., involve computationally intensive function evaluation).
Figure 10 illustrates how the choice of a particular wavefront can affect the performance of an algorithm. A two-dimensional array E is to be evaluated on a (logically) one-dimensional processor array. The elements are defined by an equation whose dependence vectors are given below (elements that are beyond the array boundary are considered to be zero). A data-dependence vector of an equation is any vector that connects two index points. The end point of this vector is an index point at which the equation is executed, and the starting point of the vector is an index point at which some value used in the definition is evaluated. For RIAs [32] expressed in EPL, the dependence vectors are defined by the differences between the corresponding subscript expressions used on the left and right sides of the equation. In the above computation, there are just two dependence vectors: OA = (4, 2) and OB = (2, −2).
In general, let D = {d_1, d_2, ..., d_k} be the set of dependence vectors in a program (i.e., the set of dependence vectors of all equations in the EPL program). Variables can be evaluated simultaneously at all index points on a wavefront h if and only if h · d_i > 0 for all dependence vectors d_i.
Intuitively, this condition requires that all index points reachable from a wavefront h are known at the time of execution of this wavefront or, in other words, that all array elements on the appropriate side of the wavefront have already been evaluated. In Figure 10 all dependence vectors are on one side of the lines EH, E′H′, and E″H″, so all of them are wavefronts. Evidently, any line between OB and OA (traversed clockwise) in Figure 10 may be a wavefront, because for these and only these lines are the dependence vectors on one side of the line. However, such a wavefront does not always exist. For example, when data dependences are different in different regions of the index domain, there may be no single wavefront with the required property in the entire index domain.
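The wavefront condition is easy to check mechanically, as the sketch below shows (Python as executable pseudocode, using the two dependence vectors of the example; the candidate normals are chosen arbitrarily).

```python
# Illustrative sketch: a candidate wavefront, given by its normal vector h,
# is legal iff h . d > 0 for every dependence vector d of the program.

def is_wavefront(h, deps):
    return all(h[0] * d[0] + h[1] * d[1] > 0 for d in deps)

deps = [(4, 2), (2, -2)]              # the vectors OA and OB from the example

print(is_wavefront((1, 0), deps))     # True : sweeping along the first dimension
print(is_wavefront((0, 1), deps))     # False: (2, -2) lies on the wrong side
print(is_wavefront((1, 1), deps))     # False: (2, -2) lies exactly on the front
```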
Two parallel wavefronts form a strip of computation that can be divided among a number of processors for execution. The separation between the wavefronts can be made such that once all packets (containing array elements evaluated by other processors) reach their destination, no more communication is needed to complete the evaluation of all the array elements between the two wavefronts. In Figure 10, EFGH, E′F′G′H′, and E″F″G″H″ are three such strips. Because EFGH covers a bigger area than E″F″G″H″, computation along this wavefront results in less frequent communication and synchronization. Wavefront EH can be preferred to E″H″ for another reason; namely, the smaller distance that data must travel (compare the projection of OA on E″H″ with the projection of OA on EH). Wavefront EH can be partitioned into more sections than E″H″ with a similar computation-to-communication ratio, leading to a higher degree of parallelism.

FIGURE 10 Different wavefronts to evaluate array E.
Even if there are no restrictions on the number of available processors, it is not straightforward to determine how the wavefronts should be optimally partitioned and mapped to the processors.A small partition increases communication time because most of the input array elements needed to evaluate a particular index point may reside outside the evaluating processor's local memory.For certain dependence vectors and the sizes of the partitions, input array elements may be quite a few processors away.On the other hand, the processors may be underutilized if a large partition of the wavefront is assigned to a single processor.
The wavefront approach to finding the set of index points at which evaluation can proceed simultaneously was originally proposed by Lamport [42]. However, to find the wavefront minimizing the total execution time, an NP-hard integer programming problem has to be solved. This original result has been extended by many researchers over the years [43-46]; however, the proposed solutions are still NP-hard because they can be reduced to an instance of the integer-programming problem. Assuming that the space-time representation of an algorithm is a continuous domain, we can determine the wavefront h with the minimum execution time with polynomial complexity. This result holds for two-dimensional arrays processed on a linear, arbitrarily large array of processors. It is valid for two different models of communication: (i) individual element transfer and (ii) packet transfer. In the first case, we have proven, under the above simplifying assumptions, that the only wavefronts that can be optimal are those that are either perpendicular to one of the dependence vectors or parallel to the y-axis. This property leads to a simple but efficient procedure for finding an optimal wavefront by just inspecting all potentially optimal wavefronts (the complexity of this procedure is linear in the size of the input).
For the example in Figure 10, there are only three angles of a wavefront to consider: γ₁ = π/2, γ₂ = arctan(−2), γ₃ = π/4. The wavefronts with γ₁ and γ₂ are shown in Figure 11. In a single execution step with the wavefront defined by γ₁, each processor evaluates four index points and needs to receive eight values from the neighboring processors. However, for the γ₂ wavefront, the number of evaluated points and received messages is at most three. The number of steps needed is also different for these two wavefronts because they move in different directions. If we assume that the computation is defined over a rectangle with corners at the points (0, 0), (0, Y), (X, 0), (X, Y), with X = 100 and Y = 10, then the number of steps made by the first wavefront is 50 and by the second one is 105. The corresponding total computation times for the three discussed wavefronts will be T₁ = 315e + 315c, T₂ = 200e + 400c, T₃ = 630e + 630c, where e is the cost of execution at each index point and c is the cost of communicating one datum between neighboring processors. Depending on the value of c/e, the first or the second angle should be selected (see Fig. 11).

FIGURE 11 Optimal wavefronts for array E.
Usually, array elements are not passed individually; instead, several of them are grouped together and sent in a single packet. This method is commonly used in the communication model known as block SIMD. In this model, off-processor values required to compute a designated block of parallel code are obtained immediately before the beginning of the block, and all off-processor values generated within the block are communicated immediately after the end of the block [47]. Typically, packets of values are formed for communication and transferred between nonneighboring processors by means of hopping.
The wavefront strip is partitioned among the processors and the width of each partition impacts the total computation time.With too small a width, processors spend less time computing and more time communicating because less relevant information is available in the local memory.On the other hand, a large width enables processors to spend more time computing between data transfers, resulting in a smaller communication cost.Beyond a certain width, the communication cost does not decrease any further with an increase in the partition width.If the partitions are too large, the available parallelism may not be exploited fully.
As in the previous case, we have proved that the optimal wavefront can only be at certain angles to the dependence vectors (the number of possible angles is limited by the square of the number of dependence vectors).Once again the proof leads to an efficient enumeration procedure.
In our analysis we have assumed a continuum of data elements in an array. In reality, arrays are discrete, so the analysis is approximate. For example, in mapping a computation on to a linear array of processors, the algorithm provides a good wavefront when the longest projections (on each side) of the data-dependence vectors on the selected wavefront are much larger than the length of the packets sent along the wavefront.

The methods described here can be applied to any set of uncoupled recurrence equations. To decrease the communication cost, a good alignment of all arrays in the program should be determined first [8, 48]. Many methods described in the literature [43-46] determine the actual mapping of the computation on to the processors, once the wavefront is determined, by solving an integer programming optimization problem. These algorithms can be used for the wavefronts obtained by our method.
There are many open problems in this area. One major issue concerns finding an efficient algorithm to determine a good wavefront when a set of recurrence equations involving m-dimensional arrays is to be computed on an n-dimensional array of processors (m ≥ n). Another important question is how to generate packets of convenient sizes and shapes efficiently, once their size and orientation are known.

RUN-TIME SUPPORT
As discussed earlier the main problem of efficient parallelization is to properly map addresses of values being computed on to the computer processors.Pure compiler techniques have been successful in cases when the data addresses are input independent and can be established at compile time.However, many important applications involve sparse matrix computations, adaptive numerical algorithms, or computations over irregular meshes and therefore do not belong to this category.
Traditionally supported compiler optimizations for parallel computation involve subscript analysis or directives for regular problem decompositions and distribution. Language and software tools for dealing with irregularity in parallel computation rely either on user-provided partitioning algorithms, e.g., dynamic distributions in Vienna Fortran [17], or on the tracing of sample executions, e.g., the Kali compiler [11, 49] and the PARTI library [50, 51]. Communication patterns of many advanced parallel computations are rarely known at compile time. However, transferring individual data is expensive because of the usually large latency of multiple instruction multiple data (MIMD) architecture communication. Fortunately, communication patterns often change with each input data set but remain constant inside the loops at run-time. Therefore, both the Kali compiler and the PARTI library attempt to group messages. Entire blocks of data that must be sent to a single processor are assembled into a single message during loop preprocessing done at run-time [49, 50].
In adaptive computation, run-time support is needed because the workload distribution among the subregions of the computational domain changes during run-time. Therefore, there is a need for run-time task reallocation of adaptive computation executed on massively parallel distributed memory machines. Such task reallocation requires different methods than the large grain, few-processor approaches discussed in the literature [52]. We have proposed a new type of so-called density workload problems appropriate for such environments [5].

Run-Time Task Distribution
One of the most challenging problems encountered while implementing adaptive scientific computations on distributed memory machines is the run-time mapping of a dynamically changing computational load on to the parallel processors. In Nicol [53], the following rectilinear partitioning problem (RPP) has been proposed and solved: partition the given n × m workload matrix into (N + 1) × (M + 1) rectangles with N + M rectilinear cuts in such a way that the maximum workload among the rectangles is minimized. Such optimization is appropriate for adaptive finite element computations on architectures with local communication that is faster than the global one. Because balanced partitions tend to increase the volume of local versus global communication, the overall communication cost can be decreased by using the optimum rectilinear partition.
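For a fixed placement of the cuts, the RPP objective is simply the largest rectangle workload; the sketch below (Python as executable pseudocode, with an invented 4 × 4 workload matrix) evaluates that objective, which an RPP solver minimizes over all placements of the N + M cuts.

```python
# Illustrative sketch: evaluating one rectilinear partition of a workload matrix.
# row_cuts / col_cuts list the first row/column of each new rectangle strip.

def max_rectangle_load(W, row_cuts, col_cuts):
    rows = [0] + row_cuts + [len(W)]
    cols = [0] + col_cuts + [len(W[0])]
    loads = []
    for r0, r1 in zip(rows, rows[1:]):
        for c0, c1 in zip(cols, cols[1:]):
            loads.append(sum(W[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return max(loads)

W = [[1, 1, 1, 1],
     [1, 9, 9, 1],
     [1, 9, 9, 1],
     [1, 1, 1, 1]]
print(max_rectangle_load(W, row_cuts=[2], col_cuts=[2]))   # 12 (each quadrant)
```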
Ozturan et al. [5] investigated the balancing of an adaptive scientific computation on SIMD machines; this problem has a similar motivation and applications as the RPP. In the RPP, the sum of the weights is taken as the cost of a rectangle, whereas in our problem the cost is expressed as the workload density, i.e., the ratio of the workload to the area with which this workload is associated. The area is proportional to the number of processors active in it. Such a cost definition is motivated by the mesh refinement techniques used in adaptive numerical methods. Each entry in the workload matrix represents the solution error obtained by an error estimation procedure [54]. The high-error regions need recomputing, and the needed work is proportional to the magnitude of the error. In adaptive solutions of partial differential equations, parallel tasks perform basically the same computation over different spatial subdomains (intervals for one-dimensional problems) and with a different discretization parameter Δx.
Let K denote the number of such tasks. It is important to keep this number small for the following reasons. The subdomain interactions are proportional to the number of existing subdomains, and in higher dimensions such interactions require expensive global communication. In each time step of the subdomain computation, a fraction of the executed code is subdomain specific (e.g., in hyperbolic equations the time step has to be set differently in each subdomain). For purely SIMD machines, execution of this code fraction has to be done in K consecutive stages. In each stage, processors in one subdomain are executing while processors belonging to the remaining K − 1 subdomains remain idle.‡ Therefore, each subdomain associated with a parallel task should represent a localized structure in the solution domain. Figure 13a shows an example of the more difficult two-dimensional case in which a coarse mesh is trivially mapped to the processor mesh. In regions A and B, the mesh must be refined due to the presence of high errors. Hence, we have to spread subdomains A and B over bigger rectangular subsets of processors to improve load balancing, as in Figure 13b,c.

‡ For more general MIMD architectures that support coordinated parallelism (e.g., the CM-5), all K subdomains can execute this fraction of code in parallel.
If mesh-movement or static rezone techniques are used, the mesh elements are moved into the high-error regions. A global solution strategy will refine the high-error regions and repeat the entire step of the iteration; consequently, a reassignment of processors is needed. A local solution strategy, on the other hand, repeats the iteration only where it is needed. Hence, local refinement results in less direct computation and enables more processors to be assigned to regions A and B. However, local refinement requires more interactions between the local and global solutions. Such interactions involve global communication that can outweigh the benefits of an adaptive procedure. Global solutions and mesh-movement techniques require fewer interactions of this kind. Careful buffering of the high-error regions can increase the number of iterations executed before regridding or mesh movement is needed; this will, in turn, decrease the frequency of the needed load balancing. It was these global mesh-refinement and mesh-movement techniques, executed on a mesh-connected architecture, that motivated us to develop density-type partitioning.
It should be noted that applying RPP partitioning to the example shown in Figure 13d results in assigning unnecessary processors to regions C and D. To avoid such waste, the partitioning methodology cannot be restricted to rectilinear cuts extending across the whole domain in both dimensions. Hence, in our problem definition and solution [5], we require that the selected rectangles cover the whole domain. The heuristic for the two-dimensional case projects the weights onto one dimension and results in rectilinear cuts extending across the whole domain in one direction. Figure 13b shows an example of this kind of partition.
The problem of load balancing for adaptive PDE solvers on machines where the number of processors exceeds the number of tasks is obtained by putting ⊕ = min, ⊗ = max, and f(k) = (x_k - x_{k-1} + 1), i.e., the sum of the workloads in each partition is divided by the interval length (that is, by the number of processors). If there are K heterogeneous processors, each with a different speed s_k, k = 1, . . ., K, then f(k), ⊕, and ⊗ can be instantiated according to the fourth row of Table 1. The RPP algorithm can handle only the case with the monotonically increasing cost function Σ_{i=x_{k-1}}^{x_k} w_i. In contrast, our algorithm can solve more complicated problems with an arbitrary cost function Σ_{i=x_{k-1}}^{x_k} w_i / f(k) in O(Kn^3) steps.
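To make this cost model concrete, the following minimal dynamic-programming sketch in C (hypothetical names; not necessarily the O(Kn^3) algorithm of [5]) splits a one-dimensional workload array into K contiguous intervals so that the largest interval density, i.e., the interval's workload sum divided by its length f(k), is minimized.

/* Minimal DP sketch (hypothetical names): split workloads w[0..n-1]
 * into K contiguous intervals so that the largest interval cost is
 * minimized, where the cost of the cells (a, b] is
 * (w[a] + ... + w[b-1]) / f(length) and f(len) = len models workload
 * density (one processor per cell).  Assumes 1 <= K <= n. */
#include <float.h>
#include <stdlib.h>

static double interval_cost(const double *prefix, int a, int b)
{
    double len = (double)(b - a);           /* f(k): interval length */
    return (prefix[b] - prefix[a]) / len;   /* workload density of (a, b] */
}

double min_max_density(const double *w, int n, int K, int *cut /* size K */)
{
    double *prefix = malloc((n + 1) * sizeof *prefix);
    double *dp     = malloc((size_t)(K + 1) * (n + 1) * sizeof *dp);
    int    *from   = malloc((size_t)(K + 1) * (n + 1) * sizeof *from);

    prefix[0] = 0.0;
    for (int i = 1; i <= n; i++)
        prefix[i] = prefix[i - 1] + w[i - 1];

    /* dp[k][b]: best achievable maximum cost when the first b cells
     * are covered by exactly k intervals. */
    for (int k = 0; k <= K; k++)
        for (int b = 0; b <= n; b++)
            dp[k * (n + 1) + b] = DBL_MAX;
    dp[0] = 0.0;                             /* zero cells, zero intervals */

    for (int k = 1; k <= K; k++)
        for (int b = k; b <= n; b++)
            for (int a = k - 1; a < b; a++) {
                double prev = dp[(k - 1) * (n + 1) + a];
                if (prev == DBL_MAX)
                    continue;
                double c = interval_cost(prefix, a, b);
                double cand = (c > prev) ? c : prev;   /* "max" over intervals */
                if (cand < dp[k * (n + 1) + b]) {      /* "min" over partitions */
                    dp[k * (n + 1) + b] = cand;
                    from[k * (n + 1) + b] = a;
                }
            }

    double best = dp[K * (n + 1) + n];
    for (int k = K, b = n; k >= 1; k--) {    /* recover right endpoints x_k */
        cut[k - 1] = b;
        b = from[k * (n + 1) + b];
    }
    free(prefix);
    free(dp);
    free(from);
    return best;
}

For the two-dimensional heuristic mentioned above, the workload matrix would first be projected onto one dimension (e.g., by summing each column) and the resulting vector handed to such a routine.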
There is a similarity between the weighted independent set problem for interval graphs and our problem [55]. The interval graph for our problem can be created as follows. Each possible subinterval (x_{k-1}, x_k) is represented by a node of the interval graph. The weight of the node representing (x_{k-1}, x_k) is set to Σ_{i=x_{k-1}}^{x_k} w_i / f(k). In such a graph, an independent set of size K that covers the whole interval 1, . . ., n gives the solution to the original problem. The interval graph can be converted to a directed acyclic graph (DAG). The shortest-path algorithm applied to this DAG finds the minimum-weight dominating set [56]. This approach yields the optimal algorithm for the one-dimensional case and also leads to a heuristic algorithm that can be easily generalized to two dimensions (by projecting the workloads onto one dimension).

Run-Time Array Distribution in EPL
To illustrate the run-time support provided by the EPL compiler, consider sparse matrix-vector multiplication. This operation lies at the heart of many numerical algorithms, such as the conjugate gradient algorithm for the solution of linear systems of equations. The corresponding computation is:

r = Ax
Let S be the row-major representation of the sparse n x n matrix A, colend[i] be the number of nonzero entries in each row (i ≤ n), and colmap[i, j] be the column number for each nonzero entry. The variable-sized rows of S must be mapped onto P processors, where P < n. The total execution time of such a computation is determined by the execution time on the processor with the largest number of nonzero elements (because processors synchronize after each multiplication step in an iterative solver). Hence, it is important that the rows of S are distributed in such a way that the processors are load balanced, i.e., each has about the same number of nonzero elements to evaluate. The corresponding EPL program is shown in Figure 14.
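For reference, a sequential C sketch of this computation under the representation just described is shown below; the padded per-row width MAXNZ and the exact array shapes are illustrative assumptions, and the actual EPL and ITPACK codes appear in Figures 14 and 15a.

/* Sequential sketch of r = A x for the row-oriented sparse layout
 * described in the text: row i holds colend[i] nonzero values
 * S[i][0..colend[i]-1], and colmap[i][j] is the column index of the
 * j-th nonzero of row i.  MAXNZ is a hypothetical per-row bound. */
#define MAXNZ 32

void sparse_matvec(int n,
                   const double S[][MAXNZ],
                   const int colmap[][MAXNZ],
                   const int colend[],
                   const double x[],
                   double r[])
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < colend[i]; j++)
            sum += S[i][j] * x[colmap[i][j]];  /* x entry may be remote in the parallel version */
        r[i] = sum;
    }
}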
The load-balancing scheme can be implemented solely on the basis of the ranges of rows in S. The scheduler implemented in the EPL compiler [3] detects that the ranges of the rows in S must be available before the matrix-vector multiplication loop starts. Hence, the last two statements in the above EPL program, which explicitly implement a simple load-balancing algorithm, will always be scheduled before the loop body. The rows of S are then distributed accordingly.

The program for distributing arrays was run on several benchmarks, including meshes originally used by Hammond [57] and test cases from the Harwell-Boeing sparse matrix collection [58]. The characteristics of the tests are given in Table 2. The first test case is an unstructured triangular mesh around a three-component airfoil, whereas the second test is a portion of a larger mesh representing an unstructured tetrahedral mesh about a Lockheed S-3A Viking aircraft. The third test case arises from a mixed kinetics-diffusion problem (specifically, the study of ionization in the stratosphere with 38 chemical species). The fourth mesh is derived from a model of a gas-cooled nuclear reactor core, and the fifth test was generated using a package for reservoir modeling. The most straightforward implementation of the sparse matrix-vector multiplication, used in the ITPACK library [59], is shown in Figure 15a. It multiplies each nonzero element by the corresponding vector element, which is fetched through communication if necessary. The results of several runs of the sparse matrix-vector multiplication are given in Table 2. The rows labeled "block" and "load balance" give times for runs of the multiplication with a standard block distribution and with the block distribution adjusted by the load-balancing step, respectively. The executions presented in Table 2 showed up to a 21% cost reduction on the MP-1.
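One simple way to realize such a load-balancing step is sketched below in C (hypothetical names; not the exact statements of the EPL program in Figure 14): scan the rows once and close a processor's block whenever its share of the total nonzero count is reached.

/* Illustrative sketch: choose contiguous row ranges so that each of
 * the P processors owns roughly nnz_total / P nonzero elements.
 * Processor p receives rows rowstart[p] .. rowstart[p+1] - 1. */
void balance_rows(int n, int P, const int colend[], int rowstart[] /* size P+1 */)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += colend[i];

    long target = (total + P - 1) / P;   /* nonzero share per processor */
    long acc = 0;
    int p = 0;
    rowstart[0] = 0;
    for (int i = 0; i < n; i++) {
        acc += colend[i];
        /* close the current block once its share is reached, keeping
         * at least one row for every remaining processor */
        if (acc >= target && p < P - 1 && n - (i + 1) >= P - (p + 1)) {
            rowstart[++p] = i + 1;
            acc = 0;
        }
    }
    while (p < P)                        /* remaining block boundaries end at n */
        rowstart[++p] = n;
}

This greedy pass keeps the row blocks contiguous, which matches the block distribution adjusted by the load-balancing step reported in Table 2.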

Architecture Independence
The same source code is used by the EPL compiler to produce different parallel executables for different architectures. Currently, the EPL compiler includes code generators for the MPL and C* languages for SIMD architectures (MasPar and CM-200), Dynix C for the shared-memory Sequent Balance, and message-passing C for the Stardent computer. There is ongoing work on C code generators for the CM-5 and SP1 architectures. Nevertheless, the user may still prefer to use different annotations or even different EPL programs for different architectures to achieve optimal performance.

Parallelism Specification
A high-level language should shield the user from having to specify each and every detail of parallel execution. Below we discuss the level of user involvement in defining the parallel execution of EPL programs.
1. Specifying data and program decomposition: Only partial specification is expected from the user. An EPL computation consists of cooperating functional processes that define an initial decomposition of the program. Parallel tasks are created by the EPL system through merging and splitting EPL processes based on the communication-to-computation ratio on the target architecture. The programmer can use explicit annotations to define the part of the EPL process that is to be assigned to a single virtual processor. The annotations define the lower limit on the granularity of decomposed tasks to improve the efficiency of generating the program decomposition. If, during the process decomposition, a task is created that includes all computation designated to some virtual processor, this task will not be further divided by the EPL system.
2. Specifying mapping: Mapping of the parallel tasks (created from processes by the EPL system) to the physical processors is done entirely by the EPL system. However, the quality of the mapping is decided by the quality of the decomposition which, in turn (see the point above), is partially defined by the user who defines the EPL processes.
3. Defining communication: At the level of a process description there is no difference between communication and regular input/output; both are seen as externally provided input to the process. The necessary communication code is generated by the EPL compiler.
4. Defining synchronization: Again, the user is shielded from this aspect of parallel programming. The synchronization generated by the EPL compiler is derived from the data dependencies imposed by the EPL processes.

Software Development Methodology
EPL relies on functional decomposition of the computation into processes. Processes are described in an equational language and their cooperation is described as a configuration. Programs describing processes are compiled by the EPL compiler, and a configuration is processed by the configurator, i.e., the compiler for the configuration language. Hence, there is a separation of programming-in-the-large from programming-in-the-small. The process written as a functional program may be refined by user-supplied annotations. The parallel code is generated through a series of transformations. First, the flow of control is established and the minimum synchronization necessary for preserving correctness is found (in EPL terms, a schedule of a process is created); the result is still architecture independent. Then, the decomposition and mapping take place (creating another, equivalent form of the source program). Finally, input/output and communication statements specific to the target architecture are generated and the final parallel code is produced.
1. Structure of the development process: In EPL, the equational program for a process is written largely independently from the programs of other processes. Only clearly defined interfaces (data structures exchanged with the environment) are of concern to the writer of a process program.
2. Exposition of the decision points: Preparing a configuration for the overall computation forces the user to decide on the method of writing the program at the global level without considering low-level details.
3. Record of constructs: Thanks to their conciseness and lack of implementation details (i.e., input/output, communication, flow of control), the computation configuration and the equational programs for its processes form a good basis for program documentation.
4. Preservation of correctness: The parallel code is produced in three major transformations that were designed to be correctness preserving.
5. Limiting proofs to the derivation system: The correctness-preserving properties of the EPL transformations have not yet been proved formally; however, these properties strongly influenced their design and implementation.

Cost Measures
There is a part of the system, called the timer, that provides the user with execution time estimates for equational programs. As shown previously [61], the timer relies on a set of architecture measurements that can be established by running the timer's initiation programs on the given architecture. However, we do not yet have a mechanism for determining the overall computation execution cost (i.e., the execution cost at the level of a configuration). For SPMD models the timer is sufficient; in a more general setting, however, a better tool is needed. The timer drives the transformations of equational programs into schedules, as well as the decomposition and mapping stage.

No Preferred Scale of Granularity
There is no upper or lower limit on the grain size in EPL, with the exception of the statement instance; i.e., EPL does not explore parallelism at the level of expressions and below.

Efficiently Implementable
Our experience with the current EPL implementation indicates that the EPL-generated code is no more than 20-50% slower than equivalent hand-written code. However, we have not yet measured the efficiency of larger applications (or even of a large number of smaller ones).
Program decomposition through annotations and computation synthesis through configurations can support efficient parallel code generation for domain-specific computation. Annotations support rapid prototyping and performance tuning of a parallel program. Adaptivity, with its associated error estimates and the shrewd use of computational resources only in regions where accuracy requirements are not satisfied, can provide the numerical reliability and efficiency needed for parallel computation. In the EPL system, adaptivity is supported through run-time task distribution.
There are several premises underpinning our approach; among the most important ones are:
1. Annotations provide an easy and efficient way to parallelize existing codes.
2. Large parallel programs consist of interconnected processes that represent logical partitions of the program.
3. The absence of control statements simplifies program analysis and increases the compiler's ability to produce efficient parallel code.
4. Most parallel code optimization problems are NP-hard; hence, the development of proper heuristics is important.
5. A hierarchical view of parallel computation is helpful in extracting functional parallelism.
Our research on scalable program synthesis has left many interesting issues unexplored. Future work on program synthesis that we intend to undertake includes parallelization of dynamic task distribution and run-time support for irregular computation. The efficiency of our methods will be measured on large applications, such as finite difference and finite element formulations for various scientific computations.

FIGURE 1 Software tools and their uses.

FIGURE 2 Matrix-vector multiplication in EPL.

FIGURE 3 Configuration for an iterative solver in (a) textual and (b) graphical form.

FIGURE 4 Developed tools and their relationships to issues in parallel scientific computation.

FIGURE 5 The structure of the EPL compiler.
Data structures used in scientific computation can be viewed as a function v from an index domain I to a value domain V. An index domain, in general a set of tuples of integers (i1, i2, . . ., in), is often a subset of the Cartesian product of integer intervals for regular n-dimensional arrays. For example, I = I1 x I2 x . . . x In, where Ij = [1, Imax,j]. Often an inverse function v^-1 does not exist. Following the standard higher-level programming language notation, we denote the value of the function v at point (i1, . . ., in) as v[i1, . . ., in].

FIGURE 9 Communication cost of executing equation e.

FIGURE 12 Example of one-dimensional partitioning.

Table 2. Mesh Characteristics and Execution Times for Test Runs on the MasPar for 1,000 Iterations