Pattern-Driven Automatic Parallelization

This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.


INTRODUCTION
Parallel computers with distributed memory are known to be difficult to program. Even more problematic is the automatic parallelization for such machines. The most challenging problems that a parallelizing compiler is faced with are the following: 1. Parallel code must contain explicit messagepassing statements. But explicit programming of message passing is complex, tedious, and error prone. 2. The efficiency of the target program depends heavily on choosing a suitable distribution (and sometimes, even redistribution) of the arrays occurring in the source program. For larger applications, this is a difficult global optimization problem. 3. A single-program multiple data (SP:MD) program generated semiautomatically from a sequential source program by adapting it to given array distributions must in general be transformed to be efficient-just applying the owner-computes-rule will usually not suffice. There is no guidance on which optimizing transformations to choose, and in which order to apply them. :Moreover, there is no possibility to exploit explicitly parallel algorithms that have been developed over the last decades for various problems on various target architectures. 4. Run-time prediction for nontrivial codes on real machines is a very complex issue, due to network contention, message protocols, buffering, undocumented hardware features, and other problems. But reliable runtime prediction is essential to estimate the quality of array distribution schemes or of program transformations.
:vlessage-passing statements can be generated automatically today by semiautomatic parallelization [ 10. 55]. The user has to provide array distributions and optimizing transformations manually. either in the form of interactive commands, as in SCPERB [55], or in the form of language constructs or compiler directives in an explicitly parallel programming language such as Fortran D [29], Vienna Fortran [ 11], High-Performance Fortran (HPF) [28], and others. "\evertheless, there remains the hard problems involved in automatic data distribution and redistribution, automatic guidance on optimizing transformations, and in suitably accurate performance prediction.
The problems involved in generating good parallel code for distributed memory multiprocessors (or other complex supercomputer architectures) arise from the fact that there is often not sufficient knowledge available of the source program and on the target machine characteristics. Thus. an automatic parallelizer for such target architectures must be able to acquire and access as much of this knowledge as possible. Thi,; does not work for all programs.
:Vlany numerical programs are, however, particularly suitable for this purpose. As a result of considering numerical algorithms in books and cour,;es. and studying a large number of typical application codes that are reasonable candidates to be ported to distributed memory systems, we have observed [33] that there is onlv a rather limited number of typical operations. called patterns. that often occur in these programs, in particular in the time-consuming inner loops. These patterns are mostly data parallel operations like elementwise operations on vectors and matrices, various kinds of reductions and linear recurrences. difference stars. grid relaxation sweeps. convolutions. and others. A pattern is considered to be a primitive with respect to mathematical properties, data structures of operands, nwmory access structLLre, array alignment preferences, and run-time behavior. We have collected about 150 patterns in a basic pattern library. We have also recorded typical implementation prototypes (syntactic variations) of these patterns that are used in sequential source codes [33].
Based on this observation, we constructed an automatic parallelization system called PARA-MAT (PARallelize Automatically by pattern J;JATching) with the following key ideas: 1. The first step of parallelization must contain a pattern recognition tool that works fast and reliably. Code pieces in the source program that are recognized as an occurrence of one of our patterns are replaced by an instance of that pattern, looking similar to a call to an externally defined function. The input language is structured C without pointers. 2. Once the system knows what the source program locally does. it can infer additional knowledge using mathematical properties and efficient implementations of the patterns on the target machine and access offline-generated information on favorable data distributions and run-time behavior of the pattern implementations on the target machine. The parallelization system can then easily use this knowledge to guide a sophisticated parallelization process with high-level program transformations including local algorithm replacement.
The remainder of this article is organized as follows: Section 2 describes pattern recognition in numerical codes and summarizes our list of patterns. Section 3 presents the main ideas and definitions of our pattern recognition method and gives several examples. Section 4 summarizes the PAR-AYIAT pattern recognition tool. gives results, and discusses some extensions. Section 5 shows how the information supplied by pattern recognition is used to guide automatic parallelization. Section 6 lists some related approaches to pattern recognition and pattern-driven automatic parallelization.

PATTERNS IN SCIENTIFIC PROGRAMS
To promote the pattern-recognition approach, we examined many sequential numerical algorithms that are typical and well-suited candidates to be run on distributed memory multiprocessors. e.g .. some "Numerical Recipes" [48] or algorithms considered in numerical textbooks like [ 3] or in a numerical math course: Basic linear algebra subroutines (see also [15,40] ). direct solvers for linear equation systems (such as Gaussian Elimination. LC. QR. or Cholesky decomposition), Simplex. iterative linear equation solvers ( such as Jacobi. Gauss-SeideL JOR. SOR. and Conjugate-Gradient solver), fixpoint iterations (e.g .. square-rooting a matrix), grid relaxations (used for numerical solution of partial differential equations), interpolation problems. numerical integration and differentia- tion. and multigrid algorithms. These algorithms are the basic building blocks of many numerical applications.
Considering these numerical algorithms in numerics books and courses, and studying a large number of typical application codes as the Purdue Set benchmark ( [ 46]. Table 1 ), the Livermore Loops ( [44]. see Table 4), and others [see 33].
which are reasonable candidates to be ported to distributed memory systems, we have observed that there is only a rather limited number of typicaL mostly data parallel operations, called patterns, that often occur in these programs, in particular in the time-consuming inner loops. A pattern is considered to be a primitive with respect to mathematical properties, data structures of operands, memory access structure, array alignment preferences. and run-time behavior. We have collected around 150 patterns in a basic pattern library [see 33]. Chapter 5 for the complete specification; Table 2 gives an overview]. We have also recorded typical implementation prototypes (syntactic variations) of these patterns that are used in the sequential source codes considered; they are specified in Appendix B of [33].
Our observations are backed up by other empirical investigations on large FORTRAN codes [521 and by the typical sets of numerical routines contained in numerical linear algebra packages, which are either supplied by hardware vendors, or offered by numerical software companies, or distributed as public domain software. So far, we have focused on algorithms operating on rectangular dense rPal matrices because these are the most reasonable candidates to be ported to distributed memory parallel supercomputers: nevertheless, our approach may easily be extended to other matrix types (e.g., banded. block-banded; complex). We are currently investigating operations on sparse matrices [33].

Overview
PARAMAT's pattern recognizer works on the intermediate representation of the source program as an abstract svntax tree. A well-structured and statically analyzable source language is assumed. The goal is to annotate as many nodes as possible with a so-called pattern instance, a summary structure that describes which function is computed in the subtree rooted at that node, together with the parameter objects of that function. Speed and robustness of this method mainly result from exploiting the natural semantic hierarchy of the patterns in the library.
The algorithm traverses the abstract syntax tree from left to right in postorder. For a leaf node (a variable or a constant), determining its pattern is  2 ) and related patterns Forward and backward substitution (FSUBST 2 . BSUBST' 2 ') 2D reductions: total sum of matrix elements (MSUM 2 '). total product (MPROD 2 '). concurrent row/ col-vector sum (VVSUM' 2 ) or product (VVPROD' 2 ) 2D reductions: matrix maximizations/minimizations (total or row/col-wise) value ( [18]. This procedure can be extended for pattern matching along "horizontal" data flow edges, such that several (matched) instructions in the same block that belong to the same pattern may be contracted to a single pattern instance. Several instructions may belong to the same computation only if their operands are involved in at least one of several types of data flow relations. We denote important data flow relations by data flow edges (cross edges). Computation of these edges (i.e., computing exact array data flow) is generally hard, but in our case, we can profit from the simple array access structures that are characteristic for dense matrix computations and that are present in all our patterns. We will consider this problem in Subsection 3. 5.

Preparing Code Transformations
Before starting pattern recognition, we apply several important normalizing transformations to make the program as explicit as possible by

Patterns, Templates, and the Pattern Hierarchy Graph
Each nontrivial pattern m is a pair (fm, lm) consisting of a specificationfm of a (mathematical) operation, and a list /m of specifications of the types and the data structures of the parameters occurring in fm. For instan~e, the MVI 2 ) pattern represents the ... operation y = Ab + x, with the parameters y, A, b, and x being real (sub )arrays (x may also be a constant). For each nontrivial pattern, m, we usually know several implementation prototypes (for sequential C code). Because of the wide variety of semantics preserving code transformations, the number of such prototypes can be large for more complex patterns (such as matrix-matrix-multiplication), expanding the size of an automatically generated tree automaton dramatically. For this reason, we formulate the prototypes as far as possible by using instances of (other) patterns. An implementation of matrix-vector-multiplication (MVI 21 ) can be written as a single loop based on a dot product computation With such domain information it becomes straightforward to formulate templates, which are the rules to determine a node's pattern m (and pattern instance I) given the node's operator and all its children's pattern instances.
Recognizing leaf nodes in the syntax tree as variables or constants is trivial. Now consider a subtree Tw rooted at a node w with several children v 1 , ... , vk. The operator op of w is either a for loop header, an if header, an assignment, or a unary or binary expression operator. The children of w, respectively, correspond to the loop body, the then or else branch, the left-hand side variable or the right-hand side expression of the assignment, or the operand expressions.

Definition
Let h be the function computed by Tw, as defined by the semantics of the programming language used. Let the children v 1 , v 2 , . . . , vk of node w already being annotated by pattern instances / 1 , / 2 , . . . , Ik of (potentially, trivial) patterns m 1 , m 2 , ... , mk from the library. Let g denote a function.
Let i E {1, ... , k}. We call the k + 2-tuple S = (g, m 1 , . . . , mk, i) a template of m, if g(fm 1 , • • • , fm) = fm = h. We call m 1 a trigger pattern; i is, depending on op, determined according to Table  3. Moreover, we call m 1 , . . . , mk (potential) subpatterns of m. For each pattern, we realize only the most important templates (typically, we have one to three realized templates per pattern), see [33]. . We denote by SP(m;) the set of all superpatterns of m;. Usually, a pattern has onlv a small number of superpatterns (see [33]). Let w be as above, then the set of possible candidate patterns that may match w is  The PHG of matrix-matrix-multiplication. Solid edges mean realized templates for vertical pattern recognition; dashed edges for horizontal pattern recognition along cross edges. Solid cycles mean templates for unblocking or eliminating semantically invariant conditionals: dashed cycles represent templates for loop rerolling or integration of initializers. and the set of templates of these patterns that are to be tried out at w is determined analogously. Thus, pattern recognition becomes a path finding problem in the PHG. Different paths toward a pattern m correspond to different implementations of the functionality of m. This means that a linearsized PHG (and thus, pattern recognizer) represents exponentially many implementation variations of the same pattern.

Definition
The PHG has a second important advantage: It serves as a hash table that can be inspected by the pattern recognition algorithm, because it yields all the possible superpatterns that could be matched from a given trigger pattern. Often, the trigger pattern together with the operator of the node to be matched suffices to select a single possible template to match that node. If there are several templates admissible, these are tested concurrently; the result is deterministic. Failing templates abort as soon as possible.

Matrix-Matrix Multiplication
We demonstrate the pattern recognition algorithm using a simple example. Ylatrix-matrix-multiplication is well suited because its functionality and subpatterns are widely known. Its PHG is given in Figure 1.
Suppose the programmer has coded matrixmatrix-multiplication as follows: The pattern recognition algorithm travArses the abstract syntax tree from left to right in postorder. Then, the algorithm considers the assignment S2 and annotates it by AADDMUL ( c ) (accumulative addition of a product). Following the suitable PHG edge, this yields a dot product for the k ) . The accesses to the arravs a and b have become vectors. As the accumulating scalar c [ i] [ j ] has not been initialized so far, it has to be entered into the initialization slot of the ssp(1: instance to keep data access information consistent. In the next step, the do j loop around the ssp(ll instance is recognized as an instance of matrix-vector-multiplication. Also in this Abstract syntax tree of the Matrix-matrix-multiplication example. } At this stage, we can continue pattern recognition only if we take care of data flow. Exact array data flow analysis, although generally a ver·y hard problem [ 17,43], is dramatically simplified by the exact data access information supplied with the pattern instances. In this example, we find that the vector c[i] [l:m] is written in the VINITI 11 instance, and read and overwritten by the MV 12 l instance, symbolized by a so-called cross edge of type FLOW. Thus, we have exact information that data flow between these two instances in an expected way. This situation can be tested by a realization of another template for pattern recognition along cross edges. As the template matches, we can merge these two instances into a single MV

Elimination of Semantically Redundant Conditionals
The following fragment is taken from the MATMUL routine of the DYFESM program from the Perfect Club Benchmark Suite [ 4] : The programmer has added the condition IF (B (J, K). NE. 0. 0) to avoid unnecessary multiplications and additions by 0.0. Since ;he program's semantics is not changed by this optimization. we realized a new template for the vector triad VAADDSV!l! (and for several similar patterns) that follows a self-cycle in the PHG to remove the condition, just by copying the V AADDSV 11 ; instance at the I loop header to its parent node, the IF header. Pattern recognition then proceeds as above.

Unblocking Loops
Blocked loops are very common in dusty deck programs that have been optimized for other target architectures with caches or vector registers. In the following example, the i loop has been blocked bv a factor of k: for (i=1; 1<=n; i+=k) for (j=i; j<=min(n,i+k-1); j++) The inner loop is recognized as a VAADDSV ( 1  Similar unblocking templates exist for many otherelementwise vector and matrix operations and for many reductions. The normalizing transformation "loop unblocking" has thus been integrated into the pattern recognizer as a realization that is shared by all these templates. This integration is possible because the syntax tree structure is not modified. However, this does not hold for loop distribution (a loop transformation important for pattern recognition) which has to be called separately before each recognition step.

Exploiting the Cross Edges
Cross edges in the syntax tree represent particular, loop-independent data flow relations among the operands of pattern instances within the same block. Pattern instances interconnected by a cross edge may, even if textually separated, belong to the same thread of computation, and thus, to the same superpattern. Therefore, cross edges are well suited to guide horizontal pattern recognition.
In [34], we have devised a compact array access descriptor that supports fast realizations of the important query operations equality, inclusion, disjointness, and (direct) neighborhood of array access shapes. A descriptor is computed for each operand of a pattern instance just after generating it. Thus, only one loop level has to be considered at a time. Furthermore, each operand has one of four possible access modes: I (ignore), R (read), W (write), RW (read and overwrite). For nonrecognized code fragments, worst case assumptions have to be made. From this information, we easily compute five different types of cross edges that are important for pattern recognition. A cross edge connects an instance / 1 to an instance / 2 located textually behind / 1 within the same block, and has type 1. FLOW if / 1 writes an object that is read by 1 2 , and this data flow is not killed by another instance Is located between 1 1 and / 2 that writes to this object; this corresponds to a loop-independent data flow dependence from / 1 to / 2 . 2. ANTI if / 1 reads an object that is written by / 2 , and this data flow is not killed by another instance Is located between / 1 and / 2 that writes to this object; this corresponds to a loop-independent data antidependence from / 1 to / 2 .
3. INPuT ifboth/ 1 and lz read the same object that is not written to by another instance / 3 located between / 1 and l 2 . 4. NEIGHBOUT if / 1 and 1'2 write neighbored sections of the same object that are not read or written by another instance I 3 located between / 1 and / 2 • 5. 1\"EICHBIN if / 1 and / 2 read neighbored sections of the same object that are not written to by another instance 1, 1 located between 1 1 and lz.
In general, the eross edges of a block form a directed acyclic graph. Only pattern instances connected by cross edges are considered for a potential merge in pattern recognition. Selection of suitable templates is guided by the type of the cross edge and by the (trigger) pattern name of the last pattern instance (/ 2 ). If several templates should be admissible, then thev can be tried out concurrentlv: at most one of the~ may really match, thus, de~erminism is preserved.
Cross edges of type Ai\"TI are used at recognition of VSWAP(l) from three single vcopy\1l (vector copy) instances.

KESSLER
The interleaving of the instances does not prohibit the recognition process because it is guided by the cross edges. We obtain The following special cases of pattern matching along cross edges are particularly important for us:

Loop rerolling:
Loop unrolling is a common program optimization. It occurs (1) as replication on the expression level (within the same expression) and (2)  function stmtdescend (node) if node is alreadv visited then return fi Immediately after recogmuon of scopy'O and computation of cross edges, the FLOW cross edge from the SSP' 1 ' to the SCQpy'O instance selects a ssp'l; template that replaces the temporary temp by x [ i 1 and removes the (now useless) SCOPY 0 · instance.
Due to the one-pass nature of the pattern recognition algorithm, we do not know at this point whether the last value of temp (i.e., the nth component of vector x) may be used later on. Thus. to maintain consistencv. we insert a correcting scopy:O) instance. After loop distribution and one further pattern recognition step, we have The scopy(o: instance may later be removed as useless code if temp is not used anymore.

The PaHern Recognition Algorithm
The function stmtdescend() traverses the syntax tree in postorder; exprdescend() does the same for expression trees (where, however, no cross edges can occur).
if node is not an assignment statement then forall children s of node do stmtdescend(s) od fi forall expressions e occurring in node do exprdescend(e) od /*Now all subtrees of node are visited and (perhaps) recognized* I if node is an IF header then tryJF_distribution(node) fi if node is a for loop header then try _/oop_distribution (node) fi forall admissible vertical superpatterns m for node in the PHG ( cf. formula 1) do test by the vertical template match(m,node), if there is an instance I of m matching node od if not. return fi /*FAILED* I annotate node with I; compute access descriptors and cross edges to I repeat forall direct cross predecessors x of node (in the same block) do /* x has alreadv been visited earlier* I test by admissible cross templates if the sequence x: node is an incarnation of a superpattern m' if yes, merge x and node, call the result node and annotate node with an instance I' of m'; break; od until there are no mergeable cross predecessors of node left. end stmtdescend() The routine try_f}'_distribution tries to distribute a masked block of statements; try_/oop_distribution tries, for a loop over a block of statements, to perform scalar and vector expansion and thereafter distribute the loop as far as possible [ cf. 56]. IF and loop distribution modify the structure of the syntax tree: node gets several "younger" brothers (copies of node) and moves some of the statements from its body to theirs. After node, pattern recognition visits the new brother nodes as if these would exist already from the beginning; but revisiting their children (that were children of node before and thus are already matched) is not required.
Because of the deterministic nature of the method, each node is visited only once. Because selection of admissible templates is vety fast due to PHG inspection, run-time cost is dominated by the linear tree traversal time. Data flow is computed by need, i.e., only for the current loop level. Loop distribution uses Tatian's algorithm for strongly connected components; its pseudocode can be found in [56].*

Recognition of Data Structure Concepts
Beyond annotating nodes with pattern instances, pattern recognition offers the possibility to keep track of static relations of single program objects. An illustrative example is the identification of statically known grid hierarchies in multigrid programs. Detection of such grid hierarchies is especially important when data are stored in a one-dimensional workspace array. Then, the additional information allows reconstruction of the different two-dimensional grids, supporting array partitioning and load balancing.

Transformations after Pattern Recognition
After pattern recognition, we must eliminate useless code that may emanate from conservative * Computation of the data dependency graph for a block of k statements takes, depending: on the dependence tests used, in the worst case at least time O(k 2 ): the data dependency graph itself may require space O(k 2 ) which is then the input size for Tarjan's linear-time algorithm. This works fast for blocks of moderate size. buc of course, ruins the otherwise linear runtime of our algorithm. \\'e tolerate this because blocks tend to be small compared with the size of the entire source program, and because loop distribution is crucial for the robustness of our method. cross matching and certain transformations. Useless code computes variables that are not consumed or output before being recomputed.
Instances of so-called unstable patterns are decomposed into their basic patterns' instances, e.g., theinstanceSVSUM(i,c, a, b[l:n], 0.0) is split into the sequence VSUM ( i, temp, b[l:n], O.O);MUL(c, a, temp). This extraction of a loop-invariant multiplication is a target machine-independent optimization. Furthermore, the number of patterns that is visible for the eode generation phase is additionally reduced.

Implementation
A prototype of the pattern recognition tool (see Fig.  2) has been implemented and tested. The current implementation consists of around 12,000 lines of C code and reliably recognizes 91 nontrivial patterns with about 150 nontrivial templates. Each template is implemented as a C routine of around 20 to 50 lines that tests syntactic and semantic conditions and, if successful, generates the pattern instance and fills in the slot entries. Because many useful syntactic and semantic predicates have been predefined, writing code for templates is handy and straightforward. More patterns can easily be added. The high degree of robustness against loop interchange, loop distribution, loop unrolling, and statement reordering has been exemplified in practice.

Results
A phase [ cf. 6] is a minimal set of loops around some assignment statements such that all indexing variables occurring in these statements are bound by loop variables. Ideally, all phases of a program have been recognized completely as incamations of our patterns. The pattern recognition tool recognizes nearly all phases in 16 of the 24 Livermore Loops ( Table 4). The recognition times are pretty fast although measured on a low-end Sun SLC, including the time for parsing the source and printing the result. Further encouraging results have been obtained for many other source programs: most of them are listed in the appendix of [33].

Discussion
A possible alternative to our syntax tree-based approach may be pattern recognition on the control Table 4. Livermore Loops [ 44] flow graph (CFG). We state: 1. The syntax tree representation is supplied by the front end. Because we only admit C statements that produce well-structured control flow, the syntax tree contains all required control dependency information. 2. The CFG is much less structured than the abstract syntax tree. By converting the syntax tree in a CFG, we would lose information about the loop structure (loop variables). Pattern recognition would be harder, less clear, and slower. 3. The CFG may be more useful if the source program contains many jumps ("spaghetti code"). For our pattems, however, jumps are rarely required, and can always be replaced by structuring constructs like IF-THEN-ELSE or WHILE.
Future extensions to the pattern recognizer could address interprocedural matching which would handle recursive functions (that are encountered in many FFT programs) and indirect array references and pointers (that are required for recognition of operations on sparse matrices). Pattern instances could also be written directly by the programmer in the source text (very similar to Fortran 90's array operations and intrinsic array manipulation functions), thus locally bypassing pattem recognition. Note: Sixteen of the 24 kernels are (mostly completely) recognized. The fourth column indicates how many loops (counted after applying loop distribution) were matched. The fifth column gives the number of nodes of the abstract syntax tree: the last column the overall times for parsing. recognition and output, measured on a low-end Sun SLC. that are quite encouraging.

PATTERN-DRIVEN PARALLEL CODE GENERATION
The matched intermediate representation is rnachine independent and opens access to very sophisticated program transformations. Instances of r~cognized patterns can now be replaced by their best known parallel implementation. These implementations are machine dependent and are parameterized by problem sizes and data distributions of the operand arrays occurring in the instance. Thev should be written in C with inline assembler for ~ptimal usage of local processor features. Because we want to optimize each pattern implementation only once, off-line at compiler generation time, we assume that the following machine parameters are known at compiler generation time: 1. The number of processors. In principle, there are now two possibilities to generate parallel code for a recognized subtree of the abstract syntax tree: The first alternative is the generatioU: of a standard parallelization according to well-known techniques that we shortly revisit in Subsection 5.1 and modify for our purpose in Subsection 5.2. The second option, considered in Subsection 5.3, addresses the selection of an alternative parallel implementation that computes the same function as the standard parallelization but applies a different parallel algorithm.

Generation of Standard Parallel Implementations
For given array distributions, a standard parallel implementation is generated according to the following, well-known techniques [cf. 55].

Splitting
If the target machine has a host that handles all 110 operations, then a host program is generated that performs all 110 operations, starts the node programs on each processor, sends portions of read operands to the node processors that need them, and collects the result values from the node processors that generate them.

Adaptation
The node program maintains, in principle, the program structure of the sequential version. For a given partitioning of the arrays, each assignment statement will be masked by a condition depending on the node processor's lD number that ensures that a node processor only execute~ this statement if it ownst the variable on the left-hand side of the assignment. Furthermore. interprocessor communication (EXCH-statements. cf. [55]) must be generated to ensure that nonlocal operands are available when the statement is executed. There is no explicit synchronization needed if blocking receive statements are used.

Optimization
The masks can often be integrated into the bounds of a surrounding loop. thus avoiding much of the overhead due to the condition evaluation. Interprocessor communication is moved to the topmost loop level (loop distribution) that is still possible without violating data dependencies. Communication is vectorized as far as possible. . ] ) instead of EXCH as described above. In contrast to an explicitly parallel algorithm, the standard parallelization preserves the structure of the sequential program.  Thus, if the problem size of I is known at compile time and if it is small, P ARAMAT will decide to prohibit parallelization if sequential execution will be faster, thus avoiding slow-down of the target program. If the problem size is not known at compile time, a suitable run-time test is inserted into the generated code. Similar run -time tests can be inserted if PARA-MAT is not really sure about the value of certain important program values. An example is the following situation that is often encountered in multigrid applications: The programmer uses a large linear workspace array to store all (e.g., two-dimensional) grids and indexes each single grid by using an offset pointer which is, in general, an array reference itself. Such indirect array accesses cannot be handled by compile-time data dependence analysis, and, even worse, a standard decomposition scheme for this linear work array will result in bad load balancing and unnecessary communication. However, from the indexing schemes in recognized patterns of interpolation or restriction operations from one grid to the next one, PARthen generate 'l'[m] for L-L' ( node processor i860 can only be used if the program is written in machine language-the C compiler does not vectorize. Fried [23] shows how impressive performance improvements can be reached by exploiting hardware features like arithmetic pipelines, dual operation mode. or dual instruction mode that are just ignored by the standard compilers.

Selection of Parallel Implementations
II This may also be handled by a compiler option included in the program text. but as we focus on fully automatic parallelization. this is not a viable alternative for us.
AMAT is able to detect the (potentially) different grid parts by treating the offset array accesses as symbolic parameters. To make sure that the offset values implement the workspace concept, a suitable run-time test on the offset values must be generated that, if successful, treats each single grid as a unique (two-dimensional) array that can be aligned and partitioned individually, thus avoiding the performance decrease mentioned above. As the number of different grids (and thus, the number of offsets) is usually small, this run -time test does not involve much run-time overhead. As the potential benefit from a positive test result is greaL this optimization is sensible. If the assumption of a workspace grid hierarchy has been confirmed at run-time, the workspace array is decomposed into the single grids, and program control branches to an alternative implementation with separate array distributions for each grid.

Examples for Nonstandard Parallel Implementations
This section gives some examples for parallel implementations that may differ completely from the original sequential program structure, or that introduce useful transformations of the corresponding standard implementation. The latter can be regarded as automatic program transformation which is hidden from the user. There is no need for a cyclic approximation scheme of successively applying some program optimizations, observing the results, and choosing better ones [301. The disadvantage is that for each pattern a separate implementation driver is required. We claim that this can be taken into account, given that there would be a large intellectual effort devoted to the development of numerical software libraries for any real machine. In any case, we have finally the chance to get rid of the owner-computes rule.
The implementations are code skeletons where the slot entries are entered in an appropriate way. They already contain message-passing statements and register allocation. In the sequel, we sketch some of them. For a more complete survey of parallel algorithms for matrix computations, see [21] or [24].

Reduction Operations
For instances of specific common reduction operations (cf. Table 2) like global sum, global product, global OR, global maximum etc., we can make optimal use of optimized routines that are, in general, already supplied with the run-time environment of the target machine. Here the nonstandard parallel implementation mainly consists of a runtime svstem call.

Grid Relaxations
A single grid relaxation step represents one update of all elements of a two-dimensional grid. A sequence of such steps, e.g., a step-counting loop around them, offers additional potential for optimizations.
Algorithm replacement must always be conservative with respect to numerical stability and convergency properties. As the recognized pattern's names are available, we can access mathematical background information on convergency properties. This information allows-if not explicitly forbidden by the user-the replacement of, for instance, a Gauss-Seidel Wavefront relaxation by its red-black variant or by two steps of Jacobi relaxation which are much better suited for parallel execution (depending on the target machine). The basic motivation for this "aggressive" local replacement of implementations is that the average user just wants to get the actually fastest parallel implementation on this target machineindependent of, for instance, a particular relaxation coding.

Linear Recurrences
Simple linear recurrences are a classical example for algorithm replacement. Csually it appears as a sequential loop like which is serialized due to a loop carried data dependence as long as standard parallelization is used. For recognized linear recurrences (here FOLR' 1 l) we can apply a suitable number of recursive doubling steps [37] to gain some parallelism while taking care of growing communication overhead. The optimal number of recursive doubling steps (up to min(log p, log n) are possible for p processors) depends on the problem size n and the time required for interprocessor communication on the target machine. For smaller problem sizes, the sequential variant will be faster.

Matrix-Vector and Matrix-Matrix Multiplication
For matrix-vector multiplication (MV( 2 l), the standard method can be implemented as the ij variant (the inner loop is a dot product) or as the j i variant (the inner loop is a 1 1 daxpy 1 ' vector update). The latter variant seems to be preferable on vector node architectures. Alternatively, we might use a systolic algorithm; this seems at most appropriate for transputer arrays with comparably low communication overhead and node performance. For rnatrix-matrix multiplication (MM( 3 l), the standard method expands to one of six possible variants (ijk. ikj, etc.) since all three loops are interchangeable. An alternative would be a systolic implementation [see [22]. Similar systolic methods are also applicable to LU decomposition (LUD; 3 l Discussion Algorithm replacement must be conservative with respect to numerical stability and convergency properties of the recognized patterns. For each pattern rn, the nonstandard implementation II[m] must guarantee that its numerical stability is not worse than that of'¥[ m]. Where this is not possible, the user receives a warning, and thus can force PARAMAT to choose the standard implementation by setting NoREPLACE [p ]. Algorithm replacement is the most complex and strongest program transformation of all. Safe algorithm exchange is enabled only by the availability of pattern instances. It includes all other machinespecific optimizing transformations. The implementation library can be optimized off-line by expert parallel programmers, until optimum performance is reached. Some optimizations may even be reintroduced which have been removed at the pattern recognition phase (e.g., loop blocking, semantically redundant IFs, etc.). The suitable communication routines, either simple SE~D and RE-CEIVE instructions or higher-order communication primitives like COMBll"E, REDUCE, BROADCAST, GATHER, and SCATTER that are typically supplied with the parallel environment, are a basic component of the parallel pattern implementations and need not be further optimized afterward. Such optimizations would usually be required for semiautomatically parallelized code, e.g., by vectorization of messages [25] or by the general message-passing optimization technique proposed in [ 42].
Algorithm replacement enables local deviation from the owner-computes rule; it forms a framework to include all useful parallel algorithms that are known so far for the corresponding class of target machines (topology, granularity, communi-cation properties). All experts' knowledge becomes available for the average user, although they do not need to be concerned about these algorithms or machine parameters.

PaHern-Driven Data Distribution
To simplify the system design a given hardware environment is regarded as fixed: in particular, hardware resources like the number p and the speed of the processors, the network topoloe,ry, the cache size and caching strategy, and the memory size are regarded as constant. This corresponds to a "dedicated" target machine. In the following, we need not consider these hardware parameters further. Nevertheless, scalability of parallel pattern implementations (in a more general sense) is still an important issue since local problem granularity still depends on the problem size.
Each parallel pattern implementation accesses data in an individual manner. Thus, for each pattern implementation, there is (at least) one favorite alignment (to minimize communication) and one favorite distribution (to maximize parallelism) of all the arrays for this pattern. The programmer knows these favorite alignment and distribution strategies for each pattern implementation. This information is stored in a table and can be accessed bv the data distribution driver for each instance. S~me examples of anay alignment and distribution recommendations for standard parallel implementations are given in Table 5.
A second requirement for on-line optimization of array distributions is that the parallel implementations are specified in a data-distribution-independent way. This may be technically arranged either by conditionals depending on the distribution parameters of one or several arrays, or by replication of parallel implementations, one for each possible distribution configuration. In each of these cases, it would be advisable to limit the possible distribution alternatives, instead of admitting arbitrary block-cyclic distributions of any For an m X n matrix, we admit the following distributions: 1. Contiguous row distribution [block size is mn/p, block shape is (mlp) X n] 2. Contiguous column distribution ·block size is mn/ p, block shape is m X (nip)] 3. Cyclic row distribution (block shape ism X 1) 4. Cyclic column distribution (block shape is 1 x n) 5. Contiguous quadratic blocks [block size is mnlp, block shape is (miYP) X (n/YP))] 6. Total replication (no distribution, block shape ism X n) This limitation of array distribution alternatives is supported by the fact that for all our patterns [33], a locally optimal distribution for each array operand is contained in this list. We are aware of the fact that a globally optimal data distribution configuration may be made up of only locally suboptimal array distributions, although we believe that this scenario hardly appears in practice. Quadratic contiguous block distributions are optimal for grid relaxation sweeps, because they minimize the surface-to-volume ratio of the arrav partitions and thus the amount of data to be ex~ changed. In our framework, they are the only distribution scheme that distributes processors along more than one array axis. For quadratic distributions, however, we must add in this case the following constraint: The array (grid) A accessed by a matrix m must be two-dimensional. Otherwise, imagine the following situation: Let A be three dimensional, with axes A 1 , A", and A : 1 , being distributed into quadratic blocks along, say, axes A 2 and A:~· Let m be a matrix access along the first and second axis of A. The number of processors along axis A 2 is Vp, the number of processors along axis A 1 is 1 (not distributed). Thus, m has only YP partitions, which limits parallelism unnecessarily, and, worse still, the overall number of working processors is no longer constant for each call to the corresponding relaxation routine. Because we do not want to do everything nvice, with one extra routine version forp and one for only YP processors, we generally admit quadratic block distributions only for, arrays of dimensionality equal to 2.
The alignment and distribution recommendations for different pattern instances in a given program will usually conflict with each other. The problem of resolving this conflict by determining globally optimal data alignment and distribudon is well known to be NP-complete [ 41], thus automatic partitioning may take exponential time in the worst ease. Dierstein et al. [12] propose a branchand-bound algorithm for automatic partitioning. To help with the combinatorial complexity, we make use of our knowledge on favorite local partitionings as starting configurations when performing a global search for the optimal data distribution.
Dierstein et al. [12] also cover static anay redistribution which is a NP-complete problem itself [39]. The main problem in static redistribution is that a globally optimal distribution scheme involving redistribution may even be made up of suboptimal data distributions for all phases of the program. However, [6] shows that for application programs of moderate size (800 lines) represented as a sequence of phases, an optimal data distribution scheme can be found within a few CPU seconds using a fast 0-1 integer programming tool. This method matches our approach well, because the pattern instances supply the required phase representation, and the run-time tables (see the next section) deliver suitably accurate cost estimates.

Pattern-Driven Run-Time Prediction
Many performance prediction approaches [ 12,16,26] work analytically by estimating the program's run-time bottom-up through the abstract syntax tree, starting at the leaves of the expression trees, with an idealized model of the target machine in mind. Specific hardware features like caches or network traffic yield actual run-times that significantly differ from the prediction. For this reason, we follow a synthetic performance prediction approach that has been proposed in [2] and [20].
For each pattern implementation, PARAMAT provides a mn-time prediction driver that inspects a table of previously measured nm-times of that implementation with varying problem sizes and varying anay distribution schemes on the target machine. The table entries for each pattern are indexed in different data distribution configurations, in the problem sizes (logarithmic scale), and in the NoREPLACE flag. They also depend on the NoPARALLEL predicate. The restriction of data distribution altematives given above keeps thetable sizes moderate. In addition, we require some table entries for the communication routines that may be generated due to array redistribution, see [6].
As a consequence, run-time prediction considerably gains accuracy because now actual runtimes of high-level implementations on the target architecture are available which reflect hardware properties (traffic on the network, message buffer sizes. message protocols. undocumented communication behavior, overlapping of computation and communication etc.) better than theoretical, idealized estimation functions.
This synthetic run-time prediction has another important advantage over the analytical approaches: It is faster because table lookup suffices where otherwise complex intermediate representations have to be traversed and analyzed. For instance, the ADDAP [12] system's automatic data distribution engine suffers mainly from slow analytical performance estimation.
Problems with performance prediction generally arise if the target machine has a cache. Then, runtime also depends on whether operands (arrays or parts thereof) already reside in the cache due to a previous operation, or whether they must be reloaded first. This scenario may be influenced by previous operations. With a synthetic approach, however, the larger the problem sizes are, the less this effect changes the actual run-times compared with the table entries. For small problem sizes, the run-time prediction drivers may be augmented by some correction term addressing the cache effect. This issue is left to future research.
Problem sizes (corresponding to vector lengths or matrix extents) need to be considered only in a specific interval [Nmin ... Nmaxl of interest, e.g., from 8 to 16384. The parameter extent of that problem size axis thus contains D = log N maxlog N min + 1 entries. With these guidelines and with the limitation of array distribution altematives given in the previous section, the parameterization space (and thus, the run-time table size) for a pattern implementation with x vector operands, y rnatrix operands, and z problem sizes contains 3x · 6-'" · Dz entries. For the MV matrix vector product, we obtain an (uncompressed) table size of 54D 2 .
Of course, this does not mean that we have to implement a matrix vector product once for each of these configurations. Generally, several entries can be handled as a whole block by taking array alignment relations [35,36,41] into account, or ranges of problem sizes with similar run-time be-havior. The run-time tables can also be compressed according to this hierarchical parametrization structure of the parallel implementation. For run-time prediction, we consider a parallel implementation ('l'[m] and II[m]) of a pattem mas a black box. We are not concerned with the issue of how their run-times should behave in theorv. but how thev actuallv behave on the concrete hardware . . configuration, which can substantially differ from the former.
The synthetic performance prediction treats greater code portions as units where analytical methods estimate the program's run-time bottom up, starting at the expression level. Synthetic prediction models (at least partially) the cache behavior due to the localitv relations that are inherent to the parallel implementation, the overlapping of computation and communication, and the characteristic network traffic induced bv the access structures inherent to the parallel implementation.
For true parallel algorithms (II [ m]) the analytical methods (like [12,16,26]) often fail because they rely on standard parallelizations within a specific compilation environment. Synthetic performance prediction works also for all nonstandard parallelizations. As a byproduct, the nm-time tables will provide an extensive performance spectrum of the target machine. Furthermore, they will show which parallel algorithms are feasible in practice, and in which range of problem sizes and for which data distributions they are superior to others or to standard implementations.

System Overview
There remains the technical problem of how to code a parallel implementation in a data distribution independent way while maintaining explicit formulae for iteration and communication sets and avoiding the overhead involved in evaluating complicated parametrization formulae at the target program's run-time. We do this in two steps. First, P ARAMAT specifies the parallel implementation in a target machine-specific language like C plus in-line assembler. This specification, however, allows complicated parameterization formulae or, if unavoidable, excessive replication of implementation code. Once the data distribution engine has determined a global distribution configuration for all array operands, we can derive the proper parallel implementation subroutines (comparable to those in the previous section) from that data distribution-independent specification by partial evaluation [32] and dead code elimination. We obtain small and efficient message-passing C sources that are data distribution dependent, and we need to extract only those routines from the specification library that are called by the matched user program. These arc then compiled and linked with the matched user program that has been produced by a suitable code driver (cf. Fig. 3).
These routines extracted from the specification are also used to produce the run-time tables. As this is a tedious procedure, we plan to automatize table construction. 1'\ote that the time-consuming generation of the run-time tables can be performed off-line (at compiler generation time). We intend to develop an automatic benchmarking tool that does this tedious job.
For nonrecognized code portions, P ARAMA T generates standard parallelizations. The difference from standard parallelizations of recognized code portions is only that there are no corresponding entries in the data distribution/ alignment recommendation and run-time tables available; thus, these code portions do not (yet) influence the global determination of array distributions.

RELATED WORK
Several automatic program comprehension techniques have been developed over the years. They vary considerably in their application domain. method, and status of implementation.

Earlier Work Targeted Toward Automatic Code Optimization, Vectorization, or Parallelization
Snyder [53] addresses idiom recognition in APL codes. His algorithm is an extended depth-first traversal of the abstract svntax tree with linear expected run-time. He applies dynamic programming techniques to select the most profitable idiom in the presence of overlapping idioms, which appears to be common in APL programs.
Brandes and Sommer [9] suggest (non constructively) to apply pattern matching techniques for the detection of reductions and recurrences within the framework of a formal system for automatic shared memory parallelization. EAVE [7,8] is an expert system for interactive vectorization of Fortran programs. It contains a simple pattern matching tool that can discover order 1 patterns (vector operations, reductions).
The pattern matcher of [ 4] works on a modified program dependence graph (PDG, see [19]) that has been extended in a special way to match certain loop structures with the goal of replacing them by parallel algorithms. The cost of recognition is higher because the rewrite rules form a graph grammar. Normalization of the PDG has to be provided interactively by the user.
By abstract interpretation of the sequential source, [1] computes a sequential memory access map (abstract store) that assigns to each array element referenced in a loop the corresponding symbolic representation of its content. Thereafter, loops are, where possible, replaced by their explicit representation (closed form), comparable to our pattern instances. They recognize some patterns of order $1, namely equivalents of POWER, VSUM(l J, VPROD( 1 l, PREVSUM( 1 l, SSP( 1 1. Based on the closed forms, they implemented recognition of induction variables. The method fails at unroHed or blocked loops.
Red on and F eautrier [ 49] propose a special approach for recurrence detection. While this method offers, at considerable computational effort, the recognition of rather general and multidimensional recurrences, a number of assumptions are made that are hardly met by real applications. As complicated recurrences are rare in real programs, the computational effort of this approach seems unjustified.
CMAX [51] is the only commercial application of pattern matching with regard to parallelization. It translates Fortran 77 programs to CM-Fortran, a parallel vendor-specific Fortran dialect :;irnilar to Fortran 90. It recognizes syntactically several common loop constructs (vector operations, reductions, matrix-matrix multiply), but does not distinguish between patterns and templates. The recognition power is slightly weaker than PARA-MAT's, but the main advantage of CMAX is its ability to recognize Fortran-specific storage conventions and to transform them in order to make the program machine independent and more suitable to distribution of data at that point.
Program comprehension for algorithm replacement should not be confused with pattern matching that optimizes communication statements, e.g., in [31] and [42]. These approaches do not try to understand program semantics but apply pattern matching to (implicit) message-passing code to exploit higher-order communication routines like global combine, reduction, or broadcast, which are supplied by most parallel run-time systems. Note that such optimizations are contained in PARA.c\1AT's algorithm replacement strategy.

Other Current Research Proiects
Bhansali et al. [5] conclude, from a case analysis, that current tools for automatic parallelization are not powerful enough and recommend pattern recognition as the solution. Some general ideas are sketched, but there is no implementation.
DiMartino and lanello ( 13] build from the PDG a database of PROLOG facts, formulates templates as PROLOG clauses, and uses PROLOG's inference engine for pattern matching. This approach, although slightly more general than ours, forbids intermediate restructuring, relies on backtracking, and takes exponential run-time in the worst case. The information derived is used in an interactive system for automatic array alignment and distribution [30]; algorithm replacement is not straightforward as in PARAMAT. A detailed comparison of this approach with PARA:\1AT's pattern recognizer is given in (14]. A program comprehension system for Fortran programs sketched in [ 45 J is currently being implemented for a list of over 500 idioms of common loop nests, which corresponds roughly to an uncompressed version of our PHG. The method works on the PDG; it is a top-down approach that partly uses the algorithm from [53].

Other Problem Domains
Some systems for program comprehension in a nonnumerical domain are targeted toward automatic documentation and support of software maintenance. Transfmmation or replacement of code is not considered. Plan Calculus [50] represents code and patterns (called "cliches") with graph structures whose nodes correspond to subconcept instances and whose arcs capture control and data flow relationships among them. Cliches recognition becomes thus a graph parsing process using a set of graph grammar rules. It produces a parse tree representjng a hierarchical description of plausible concepts of the program.
The PAT approach [27] and following work [38] uses an abstract, object-oriented representation for syntactic and semantic concepts composing a (COBOL) source program. Each concept is an instance of a concept class, and the classes are hierarchically structured. Our templates are roughly comparable to their '"plans": a plan's representation consists of a description of the syntactical components and a description of the constraints to be satisfied by components. An inferential pattern-directed engine derives new higher-level concepts from the existing ones, utilizing plans as inference rules.

CONCLUSION
The PARAMAT approach to automatic parallelization consists of three basic ideas: First, we observe that we can cover large parts of many numerical codes by a small set of typical programming patterns. Second, we devise a recognition algorithm similarto bottom-up pattern matching which tries to locally recover the semantics of the program, while being robust against many common code modifications such as loop distribution, loop interchange, loop blocking, or loop unrolling. Third, we use the restored program semantics infornlation to guide sophisticated optimizing code transformations including local algorithm replacement.
In this article, we have presented a powerful framework for the detection of the patterns in scientific programs. We applied our knowledge on the semantical correlations between the patterns for speed and space economy. We used data access description and data flow information to compute cross edges which guide recognition of delocalized code portions. Our prototype implementation shows (1) that pattern recognition is robust against many common code transformations, (2) that writing code for template realizations is rather easy, and (3) that pattern recognition is very fast.
We have presented a framework for patterndriven generation of parallel code. For each pattern we can-as an alternative to standard parallelization of some loops according to given array distributions-also select a conceptually different parallel algorithm, for instance, highly optimized system routines supplied with the hardware environment. Safe algorithm replacement, though, is only guaranteed by the availability of pattern instances. It provides a universal framework to integrate all known parallel algorithms, library routines. and program transformations. Treating larger code parts as atomic building blocks of a parallel program also supports faster and more accurate performance prediction. Thus, PARA-MAT makes the experience of parallel programming and optimization experts accessible to all scientific programmers and thus avoids reinventing the wheel for each program parallelization project.
PARAMAT is not interactive. This is not necessary either because the user does not have to recognize his/her code during and after parallelization for selecting transforn1ations or further tuning by hand. On the other hand, this "non-WYSIWYG" system offers many more possibilities for aggressive optimizations and hides the parallelization details from the user.
The PARAMAT system is open for extensions. The pattern library can be extended by adding more pattern modules according to individual application areas. The computation of the run-time approximation functions can be automatized by a universal benchmarking tool. Changing the hardware platform only requires the loading of another base of parallel implementations, their default distributions, and their run-time functions. Thus, the PARAMA T system can always be up to date with the latest available hardware environments. The P ARAMAT system could also be modified to output HPF source programs instead of target machine code. As HPF programs (especially distribution and mapping directives and explicitly transformed code) are target machine (and compiler) specific, generating HPF output for each pattern by the implementation drivers and distribution recommendations by the distribution drivers is, in principle, possible. This, however, would only work if the same HPF target compiler is used to generate the machine code, since this compiler must then also be used to generate the run-time tables for the pattern implementations written in HPF. On the one hand, this would supply a Fortran 77 (Fortran 90, C) to an HPF converter for a specific target machine; on the other hand, it is likely that this indirect approach of generating HPF code and later compiling it again will result in a performance degradation of the final target program, compared with direct machine code generation bv PARAMAT.