Scientific Programming with High Performance Fortran: A Case Study Using the xHPF Compiler

Recently, the first commercial High Performance Fortran (HPF) subset compilers have appeared. This article reports on our experiences with the xHPF compiler of Applied Parallel Research, version 1.2, for the Intel Paragon. At this stage, we do not expect very High Performance from our HPF programs, even though performance will eventually be of paramount importance for the acceptance of HPF. Instead, our primary objective is to study how to convert large Fortran 77 (F77) programs to HPF such that the compiler generates reasonably efficient parallel code. We report on a case study that identifies several problems when parallelizing code with HPF; most of these problems affect current HPF compiler technology in general, although some are specific to the xHPF compiler. We discuss our solutions from the perspective of the scientific programmer, and present timing results on the Intel Paragon. The case study comprises three programs of different complexity with respect to parallelization. We use the dense matrix-matrix product to show that the distribution of arrays and the order of nested loops significantly influence the performance of the parallel program. We use Gaussian elimination with partial pivoting to study the parallelization strategy of the compiler. There are various ways to structure this algorithm for a particular data distribution. This example shows how much effort may be demanded from the programmer to support the compiler in generating an efficient parallel implementation. Finally, we use a small application to show that the more complicated structure of a larger program may introduce problems for the parallelization, even though all subroutines of the application are easy to parallelize by themselves. The application consists of a finite volume discretization on a structured grid and a nested iterative solver.
Our case study shows that it is possible to obtain reasonably efficient parallel programs with xHPF, although the compiler needs substantial support from the programmer. © 1997 John Wiley & Sons, Inc.


INTRODUCTION
Our objective is the conversion of Fortran 77 (F77) programs to High Performance Fortran (HPF) programs; the problems and solutions that we have found are valuable to other programmers and also to compiler developers. HPF, which is an extension to Fortran 90 (F90), offers a uniform programming interface for parallel computers that abstracts from the architecture of the machine. It is a data-parallel language where parallelization is addressed by means of explicit array distribution using directives, the forall statement, implicit loop parallelization, array syntax, as well as intrinsic parallel functions and standard library routines. The programmer specifies the distribution of arrays and the alignment between arrays (the distribution of one array relative to the distribution of another array) by annotating the program with directives. An important aspect of HPF directives is that they are safe: they do not change the semantics of the program, but serve as a hint to the compiler. The programmer specifies (potential) parallelization by using parallel loop statements, F90 array syntax, and several new intrinsic functions and standard library functions. These routines provide both parallel computational functions as well as inquiry functions, which allow explicit adaptation of the program to the run-time configuration. With these parallel constructs, the compiler is expected to generate reasonably efficient parallel code automatically. This especially involves the distribution of arrays, the efficient implementation of communication and synchronization (typically by inserting communication operations in the code), the division of work among the processors, and the use of temporary data structures. One of the most important objectives of HPF is so-called performance portability; an HPF program should display high performance over as large a range of computers and compilers as possible. The HPF language is defined in [2], and a more detailed discussion is given in [3]. A comprehensive introduction is given in [4]. HPF contains many features for which it is still hard to generate efficient programs automatically, because this would require significantly increased quality of interprocedural analysis and optimization. Therefore, to support early implementations of parts of the language and maintain portability, an official subset of the language has been defined as the minimal implementation. We will refer to this subset as the HPF Subset [2].
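As a concrete illustration of these constructs, the following fragment combines distribution and alignment directives with a forall statement and an intrinsic. This sketch is ours, not code from the paper; the array names and sizes are arbitrary.

```fortran
      PROGRAM example
      INTEGER i, j
      REAL a(1000, 1000), b(1000, 1000), x(1000)
!HPF$ DISTRIBUTE a(*, BLOCK)       ! distribute the columns of a in blocks
!HPF$ ALIGN b(i, j) WITH a(i, j)   ! align b element-wise with a
!HPF$ ALIGN x(j) WITH a(*, j)      ! replicate x across the rows of a
! Directives are safe hints: the statements below keep their meaning
! whether or not the compiler honors the requested mapping.
      FORALL (i = 1:1000, j = 1:1000) b(i, j) = a(i, j) + 1.0
      x = SUM(a, DIM=1)            ! intrinsic with parallel potential
      END
```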
The xHPF compiler offers the HPF Subset with some exceptions and several additional APR-specific directives. The most important use of the APR directives is to force the compiler to make certain performance optimizations, like forced loop parallelization, or to omit communication or synchronization that would be generated otherwise. In contrast to HPF directives, these directives are dangerous: they may change the semantics of the program, which makes their use undesirable. We refer to these directives collectively as forced optimizations. A more detailed description of the xHPF compiler is given in Section 2.2.
The expectations of the automatic parallelization capabilities of HPF compilers are often high. However, we needed an extensive learning phase until we were able to obtain reasonable performance, and even then with high programming effort. In order to identify the problems that occur when converting an F77 program to an HPF program, we did a case study. The case study is based on three programs that reveal several problems that will confront the novice HPF programmer. First, we use a dense matrix-matrix product to show that the distribution of arrays and the order of nested loops significantly influence the performance of the parallel program. This result is important, because loop parallelization and array distribution are the cornerstones of HPF. Second, we use Gaussian elimination with partial pivoting to study the parallelization strategy of the compiler. There are various ways to structure this algorithm for a particular data distribution. This example provides some insight into the analysis and optimization capabilities of the compiler, and shows how much effort may be required from the programmer to support the compiler in generating an efficient parallel implementation. Finally, we use a small application to show that the more complicated structure of a larger program may introduce problems for the parallelization, even though all subroutines of the application are easy to parallelize by themselves. The application consists of a finite volume discretization on a structured grid and a nested iterative solver.
Our case study reveals three main results: (1) It is possible to obtain reasonably efficient parallel programs with xHPF. (2) A detailed analysis of the F77 program, and sometimes of the HPF compiler-generated program, is required, and partial rewriting/restructuring of F77 programs seems necessary. (3) The learning effort is substantial, because the HPF compiler needs significant support from the programmer.

MULTIPROCESSOR ARCHITECTURE AND HPF ENVIRONMENT
All experiments presented in this article have been carried out on an Intel Paragon with the APR HPF Subset compiler xHPF [1].

2.1 Intel Paragon
The target architecture for our HPF programs is an Intel Paragon XP/S5+. The Paragon is a distributed-memory multiple-instruction multiple-data (MIMD) multiprocessor, which scales with respect to the number of nodes. Each node consists of two i860XP RISC processors, running at a 50-MHz clock frequency. Each processor accommodates two floating-point pipelines that deliver a peak performance of 75 Mflop/s for double-precision arithmetic. One of the processors functions as a compute processor, and the other as a communication processor. The communication processors are connected via a network with a two-dimensional mesh topology. The main task of the communication processors is the implementation of cut-through routing, which reduces the overhead of routing such that message-transfer times are almost independent of the number of hops.
The Intel Paragon at ETH Zurich [5] consists of 112 nodes, 96 of which are available for running user programs. Each node has a 32-Mbyte memory. The data cache size of the i860 is 16 kbyte. The network has a theoretical bidirectional bandwidth of 200 Mbyte/s. Each node runs a Mach 3.0 microkernel and the OSF/1 operating system. Intel's NX message-passing library manages communication among user processes. The optimizing if77 compiler, version 4.5.2, was used to compile all the programs presented in this article, including the xHPF-generated, single-program multiple-data (SPMD)-style programs for parallel execution.

2.2 xHPF
The xHPF compiler offers the HPF Subset with some exceptions and several additional features. The xHPF compiler is a source-to-source compiler, which translates HPF programs (possibly with additional APR-specific directives) into F77 programs with subroutine calls to the APR run-time library. The compiler itself does not offer any support for debugging at the HPF level; debugging is typically done at the F77 level (which is cumbersome). APR does, however, offer additional tools that facilitate debugging.
The two most notable deficiencies of the xHPF compiler with respect to the HPF Subset are that (1) the implementation of the DISTRIBUTE and ALIGN directives for dummy arguments does not conform to the HPF (Subset) language specification and is much less powerful, and (2) the PROCESSORS directive and all directives that refer to a PROCESSORS declaration are not implemented. If the xHPF compiler encounters a DISTRIBUTE or ALIGN directive for a dummy argument, it will either make the required distribution for all actual arguments in the routines where they are declared, or it will ignore the directive. This leads to problems such as directive-based implicit alignments, which are discussed later, and it obstructs the implementation of the required distribution for arguments that are not an entire data object, e.g., array segments.
Apart from the HPF Subset, the APR xHPF compiler offers several APR-specific directives. The most important use of these APR directives is to force the compiler to make certain optimizations, like forced loop parallelization, or to omit otherwise included communication or synchronization. Although these directives are dangerous (they may change the semantics of the program), the use of these directives can lead to substantial performance improvements. The xHPF compiler offers the following additional directives:

1. CAPR$ DO PAR {ON array element reference} selects a loop for parallelization. The ON clause allows the programmer to let a distributed array (reference) determine the parallelization of the loop iteration over the processors. Using this directive, the programmer can relate the work distribution to the distribution of a data set.

2. CAPR$ DO NOPAR indicates that the following loop should not be parallelized; this is important when using the automatic parallelization features of the compiler.

3. CAPR$ IGNORE ALL INHIBITORS forces the compiler to parallelize the following loop even if its analysis indicates that this is not possible. This is important since the analysis is often insufficient, and the compiler will not parallelize the loop under the default assumption that it is not safe.

With these directives the programmer can distribute an array at any place in the program according to a given scheme. They also allow the distribution of a segment of an array. If a "partitioning" is done inside a subroutine, the new distribution will also persist outside the subroutine, so returning to the original distribution must be done explicitly. The NOMOVE variant indicates that no communication of data is necessary (e.g., for scratch arrays).
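A hedged sketch of how these APR directives might be placed in a program follows; the loop bodies and names are ours, and the exact directive spelling should be checked against the APR documentation.

```fortran
      SUBROUTINE scale_cols(a, piv, n)
      INTEGER n, i, j
      REAL a(n, n), piv(n)
!HPF$ DISTRIBUTE a(*, CYCLIC)
CAPR$ DO PAR ON a(1, j)
! Parallelize the j loop over the owners of the columns of a.
      DO j = 1, n
CAPR$   DO NOPAR
! Keep the inner loop sequential on the owning processor.
        DO i = 1, n
          a(i, j) = a(i, j) / piv(j)
        END DO
      END DO
CAPR$ IGNORE ALL INHIBITORS
! Force parallelization even though the compiler's dependence
! analysis cannot prove it safe (the programmer takes the risk).
      DO j = 1, n
        piv(j) = a(j, j)
      END DO
      END
```

Note that, as the text stresses, these directives are forced optimizations: unlike HPF directives, a wrong CAPR$ annotation can change the meaning of the program.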
The compiler offers automatic parallelization in essentially two ways. One way is to make a profile with sequential or preliminary parallel code. The compiler can use this information to determine a parallelization strategy. The other way is to provide some initial array distributions or loop parallelizations (using the APR directives). The compiler will then iteratively try to improve the parallelization by trying to distribute any nondistributed array that is referenced in a parallel loop, and by trying to parallelize any nonparallel loop that references distributed arrays. The programmer can set a maximum number of steps for this iterative improvement strategy. The two strategies can also be mixed.

CASE STUDY
We analyze the HPF Subset as implemented by APR using three programs. (1) A dense matrix-matrix product, which shows that the distribution of arrays and the order of nested loops in which the distributed arrays are referenced determine the performance of the parallel program. Specifically, the loop order affects the place of subroutine calls for communication and run-time checks generated by the HPF compiler, which determines the cost of communication. Furthermore, these subroutine calls influence the opportunities of the native compiler for optimization. (2) Gaussian elimination with partial pivoting and backward substitution. We examine the parallelization strategies of the compiler and describe data-parallel implementations that support the compiler in generating an efficient parallel implementation. (3) A small application, which shows that the more complicated structure of a larger program may impede efficient parallelization, even though all subroutines of the application are easy to parallelize individually.
This section is structured as follows: Section 3.1 introduces the programs and their implementation using F77. Section 3.2 presents the HPF versions of the programs, the background for the conversion from F77 programs, and the development of the HPF programs. An experimental analysis of the three programs is presented in Section 3.3. Section 3.4 summarizes the results and observations obtained from each specific parallelization.

Dense Matrix-Matrix Multiplication
Dense matrix-matrix multiplication is one of the basic building blocks of linear algebra. F90 offers an intrinsic matrix-matrix multiplication function (MATMUL). This function is also in the HPF Subset, but
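The alternative to the intrinsic is the explicit triple loop, whose six possible orderings are what the loop-order experiments refer to. As a sketch of ours (not the paper's code), two of the orderings for c = c + a*b:

```fortran
      SUBROUTINE mm_ijk(a, b, c, n)
! ijk order: for fixed i and j, the inner loop sweeps a row of a
! and a column of b.
      INTEGER n, i, j, k
      REAL a(n, n), b(n, n), c(n, n)
      DO i = 1, n
        DO j = 1, n
          DO k = 1, n
            c(i, j) = c(i, j) + a(i, k)*b(k, j)
          END DO
        END DO
      END DO
      END

      SUBROUTINE mm_jki(a, b, c, n)
! jki order: for fixed j and k, the inner loop stays within single
! columns of a and c.
      INTEGER n, i, j, k
      REAL a(n, n), b(n, n), c(n, n)
      DO j = 1, n
        DO k = 1, n
          DO i = 1, n
            c(i, j) = c(i, j) + a(i, k)*b(k, j)
          END DO
        END DO
      END DO
      END
```

Which loops the HPF compiler parallelizes, and where it places the generated communication calls and run-time checks, differs between such orderings for a given distribution of a, b, and c; that is the effect the timing experiments measure.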

Gaussian Elimination
Our second example is the solution of a dense system of linear equations by Gaussian elimination with partial pivoting and backward substitution. We start our investigation with a kji-forward-looking Gaussian elimination and a ji-version of the backward substitution [7]. The sequential code is shown in the Appendix, instrumented with HPF directives. In order to include the impact of partial pivoting on the performance, we use an artificial, square system of order 1,000 that yields the worst case of 999 row exchanges.
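The Appendix code is not reproduced in this excerpt; as a hedged stand-in, the elimination phase with partial pivoting looks roughly like the following (names ours):

```fortran
      SUBROUTINE gauss(a, b, n)
! Sketch of Gaussian elimination with partial pivoting; a stand-in
! for the Appendix code, not a copy of it.
      INTEGER n, i, j, k, ppiv
      DOUBLE PRECISION a(n, n), b(n), t, abspiv
      DO k = 1, n - 1
! pivot search in column k
        ppiv = k
        abspiv = ABS(a(k, k))
        DO i = k + 1, n
          IF (ABS(a(i, k)) .GT. abspiv) THEN
            abspiv = ABS(a(i, k))
            ppiv = i
          END IF
        END DO
! row exchange
        IF (ppiv .NE. k) THEN
          DO j = 1, n
            t = a(k, j)
            a(k, j) = a(ppiv, j)
            a(ppiv, j) = t
          END DO
          t = b(k)
          b(k) = b(ppiv)
          b(ppiv) = t
        END IF
! elimination factors, then update of the remaining submatrix
! and the right-hand side
        DO i = k + 1, n
          a(i, k) = a(i, k) / a(k, k)
        END DO
        DO j = k + 1, n
          DO i = k + 1, n
            a(i, j) = a(i, j) - a(i, k)*a(k, j)
          END DO
        END DO
        DO i = k + 1, n
          b(i) = b(i) - a(i, k)*b(k)
        END DO
      END DO
      END
```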
The partial pivoting complicates the elimination code so that a complete empirical analysis of all possible parallel implementations is not sensible. We therefore concentrate on the implementation decisions of the parallelizing compiler and the effort the programmer has to invest in order to support the generation of efficient parallel code. For the matrix-matrix product, the implementation follows immediately from the data distribution and the loop order, even if this does not necessarily produce the desired efficiency. In this respect the parallelization of the dense matrix-matrix product is straightforward compared to the parallelization of the Gaussian elimination with partial pivoting, where several choices need to be made beyond the specification of the data distribution. These choices affect the communication cost of the program significantly. This example illustrates the process of performance tuning that eventually leads to a good choice.

Flow-Simulation
The third example involves finite volume discretization of a simple flow problem on a regular grid and a fixed number of steps of the nested iterative (linear) solver described in [8].
The finite volume discretization computes the set of linear equations that needs to be solved. This involves integration over a volume in the grid or its boundary using data that is defined over this volume and neighboring volumes. The algorithm is embarrassingly parallel. Since we use a regular grid, load balancing is straightforward.
The iterative solver involves matrix-vector products, inner products, and vector updates. For simplicity we did not use any preconditioning at this stage. Of course, for realistic simulations, preconditioning is of great importance. However, for reasonable preconditioners (with respect to both mathematical properties and parallel properties), we need an inquiry function for the actual distribution of the matrix. Such functions were not yet available in the version of xHPF that we used; the current version has a limited, nonconforming, distribution inquiry function. The matrix is stored in diagonal format, which allows straightforward parallelization. The vector updates, using the daxpy routine, are in principle embarrassingly parallel, but some unexpected problems are described in Section 3.2, subsection "Flow-Simulation". The inner products are also simple to parallelize; however, the global communication necessary for the reduction add may have a negative influence on the scalability of the solver, especially because this particular solver relies heavily on orthogonalization [9]. In principle, this problem is inherent to the iterative solver and not to HPF. However, the techniques that can be exploited in message-passing programming to reduce the loss of efficiency, like latency hiding, cannot be used explicitly in HPF, nor does it seem that such techniques are currently used automatically in HPF compilers.
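Diagonal storage keeps each nonzero diagonal of the matrix in a column of a rectangular array, so the matrix-vector product decomposes into independent vector operations over the grid. A hedged sketch (the names and the offset convention are ours, not the paper's):

```fortran
      SUBROUTINE matvec_dia(n, ndiag, diag, off, x, y)
! y := A*x for a matrix stored in diagonal format:
! diag(i, d) holds A(i, i + off(d)) wherever that entry exists.
      INTEGER n, ndiag, off(ndiag), i, d, j
      DOUBLE PRECISION diag(n, ndiag), x(n), y(n)
      DO i = 1, n
        y(i) = 0.0D0
      END DO
      DO d = 1, ndiag
! restrict i so that j = i + off(d) stays within 1..n
        DO i = MAX(1, 1 - off(d)), MIN(n, n - off(d))
          j = i + off(d)
          y(i) = y(i) + diag(i, d)*x(j)
        END DO
      END DO
      END
```

Each of the inner loops is a simple vector operation with a fixed offset, which is why the text calls the parallelization of this format straightforward.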

Parallel Programming with xHPF
We now discuss the parallelization issues of our programs. Due to the varying degree of complexity, the number of topics discussed with each program varies.

Gaussian Elimination
We first consider the parallelization of the Gaussian elimination and backward substitution algorithms without any algorithmic changes to assist an HPF compiler (see the Appendix). We choose a column cyclic distribution for the matrix, and replicate the right-hand side on all processors; the HPF distribution directives at lines 18 and 19 request this. Running the solver with these HPF directives only (without the capr$ directives) produces the execution times given in Table 1 (distribution only), which are clearly unsatisfactory. The execution time increases proportionally to the number of processors. This poor parallelization results from the fact that the compiler inserts communication and run-time checks inside the loops, thereby generating tremendous overhead.
We first try to improve the performance using APR-specific forced optimizations (ignore directives) to move communication out of the loop (line 30), to distribute a loop such that its body is executed on the processor that owns the required elements (lines 29, 34, 70, and 73), and to prevent superfluous communication and run-time checks (lines 71 and 76). We regard this approach as the APR-specific way to resolve parallelization problems. The execution times of this version, given in Table 1 (forced optimization), are encouraging. However, this version still contains unnecessary communication. The reason is that the scope of automatic parallelization and optimization is limited to single loops; optimizations over several consecutive loops are not considered. Also, the reuse of data that have been fetched previously is ignored.
In order to indicate better implementations and optimizations to the compiler, we have to provide it with programs that fit an HPF-like data-parallel programming style (see, e.g., [10], [11]).
We will now discuss the general design decisions for a data-parallel implementation. Since we distribute the columns of the matrix cyclically, the row exchange for pivoting can be performed in parallel without communication. Furthermore, after the elimination factors have been computed and are available on all processors, the distributed update of the submatrix rows and the replicated right-hand side with the pivot row is also local. Therefore, communication is necessary only to provide the index of the pivot row and the elimination factors to all processors. Hence, the pivot column (the pivot element and the elements below) and/or the elimination factors must be broadcast to the other processors from the processor that owns the pivot column. 3. (This scheme has not been implemented.) The processor that owns the pivot column does the pivot search locally and broadcasts the result. Next, the row exchange is carried out in parallel. Then, the pivot column is broadcast, and the computation of the elimination factors is replicated on all processors. Afterward, the update of the submatrix is carried out in parallel, and the update of the right-hand side is replicated on all processors.
The three variants can be implemented with approximately the same efficiency, because they exhibit similar communication. The second variant should be the most efficient, because it needs slightly less communication than the others and it requires the least synchronization. We consider two possible changes to the source program that establish a data-parallel programming style:

1. Local pivot search: We replace the scalars ppiv and abspiv by arrays of a size that equals the number of columns, and align these arrays with the columns of the matrix. This way, for each column col of the matrix, we use the local vector elements ppiv(col) and abspiv(col) in the pivot search instead of (replicated) scalars that incur communication inside the loop.* Now, only the final result must be communicated to the other processors. This implements the first scheme.

2. All local: We declare a replicated (dummy or automatic) array (pcol) of the size of a matrix column, and copy the pivot column into that array; see Appendix, loop 400. This copying leads to the broadcast of the pivot column. Furthermore, we replace all references to the pivot column by corresponding references to the replicated array (pcol). This leads to strictly local work for the pivot search, the row exchange, the computation of the elimination factors, and the update of the rows and the right-hand side. This implements the second scheme.
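A hedged sketch of the "all local" change for one elimination step follows; the variable names follow the text, but the Appendix code itself is not reproduced in this excerpt.

```fortran
      SUBROUTINE elim_step(a, pcol, n, k)
! One elimination step in the "all local" scheme: copy the pivot
! column into the replicated buffer pcol, then reference only pcol
! in the factor computation and update (sketch, not the paper's code).
      INTEGER n, k, i, j
      DOUBLE PRECISION a(n, n), pcol(n)
!HPF$ DISTRIBUTE a(*, CYCLIC)
! pcol carries no distribution directive, so it is replicated; the
! copy below broadcasts the pivot column (cf. loop 400 in the text).
      DO 400 i = k, n
        pcol(i) = a(i, k)
  400 CONTINUE
! elimination factors, computed redundantly on every processor
      DO i = k + 1, n
        pcol(i) = pcol(i) / pcol(k)
      END DO
! update of the submatrix: each processor touches only the columns
! it owns, and all pivot-column references go through pcol
      DO j = k + 1, n
        DO i = k + 1, n
          a(i, j) = a(i, j) - pcol(i)*a(k, j)
        END DO
      END DO
      END
```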
The experimental results obtained with these data-parallel implementations are presented in Section 3.3, subsection "Gaussian Elimination."

Flow-Simulation
In a large application, a single array may be passed to and updated by many different subroutines with different access patterns. Conversely, a single subroutine may be called at many places in the program with array arguments that differ in size, shape, and/or distribution and alignment. This leads to a complex set of requirements on the implementation of subroutines (especially the loops over distributed array arguments) and the distribution and alignment of their dummy arguments if efficient code is to be produced. None of the specific parts of the program plays a role by itself; indeed, each of the routines used in this program parallelizes well by itself. It is the global parallelization that causes the problems.
First, we describe some concepts that underlie the parallelization of loops and the distribution of arrays. For simplicity, we will not include replication in this description; for extensions see [12]. Different compilers will address these concepts in different ways. However, since the distribution of arrays and parallelization of loops lie at the heart of data parallelism, the concepts themselves must be addressed in some way. Following a general introduction, we describe the problems we found for the APR compiler. Two of these problems seem to be xHPF specific: the directive-based implicit alignment and the interprocedural propagation.
* According to the data-parallel programming paradigm, scalars are replicated on all processors and therefore are broadcast if they are updated on some processor.

With each array we can associate an index set that consists of the valid index references to that array. With each loop we can associate an index set that consists of the values that are assigned to the loop index during the execution of the loop. The distribution of an array a is described by a function from the array index set I_a into a processor index set P, M_{a,P}: I_a → P.† Likewise, we can describe the distribution of a loop l by a function from the loop index set I_l to a processor index set P, M_{l,P}: I_l → P. We assume that all statements within one iteration of the loop are carried out on the same processor. This is the case for the APR xHPF compiler and for the Oxygen compiler [13]; however, it is not defined in the HPF language specification [2]. Through this use of index sets, the distribution of data and the parallelization of loops are described in the same way.
The programmer can indicate to the compiler that the distribution of the elements of one array should depend on the distribution of the elements of another array through the alignment directive [2]. This directive describes a linear function from the index set of one array, the alignee, into the index set of another array, the align target. Given arrays a and b with the index sets I_a and I_b, the alignment α: I_a → I_b, a processor set P, and the distribution M_{b,P} of the array b, we can describe the interpretation of the alignment indicated by α. For the processor set P, this alignment leads to the distribution M_{a,P}: I_a → P such that M_{a,P}(i) = M_{b,P}(α(i)). In order to minimize data movement for the execution of a distributed loop, an optimizing compiler will try to assign the execution of each loop iteration to the processor that owns the array elements referenced in that loop iteration. This implies functions from the loop index set to the index sets of the arrays that are referenced inside the loop, which leads to certain desired "alignments" between the loop and the array index sets. In fact, for the APR xHPF compiler, the alignment of the loop index set with one array index set can be indicated explicitly through an APR-specific directive.
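The induced distribution M_{a,P}(i) = M_{b,P}(α(i)) can be written down directly. As a sketch of ours (assuming p processors, 1-based indices, 0-based processor numbers, a CYCLIC distribution of b, and an example alignment α(i) = 2i + 1):

```fortran
      INTEGER FUNCTION owner_cyclic(i, p)
! Owner map for a CYCLIC distribution: element i of b lives on
! processor mod(i-1, p).  (HPF compilers compute such maps internally.)
      INTEGER i, p
      owner_cyclic = MOD(i - 1, p)
      END

      INTEGER FUNCTION owner_aligned(i, p)
! Owner map induced by aligning a(i) with b(2*i+1): a(i) lives
! wherever b(alpha(i)) lives, i.e., M_a(i) = M_b(alpha(i)).
      INTEGER i, p, alpha, owner_cyclic
      EXTERNAL owner_cyclic
      alpha = 2*i + 1
      owner_aligned = owner_cyclic(alpha, p)
      END
```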
In this article we will refer to alignments that are indicated explicitly by directives as explicit alignments. We will refer to all other alignments that an optimizing compiler may make to reduce communication as implicit alignments. Such implicit alignments come, for example, from references to different distributed arrays in a single distributed loop. Several loop parallelization strategies are used in data-parallel compilers that lead to different implicit alignments, e.g., owner computes and almost owner computes [14].

† We use the term processor for simplicity; one may use instead process, thread, and so on. These have to be mapped to physical processors in turn.
However, they all suffer from serious limitations with respect to optimization when applied rigorously, so in practice relaxed versions are used. APR uses a so-called owner sets strategy [1]. Obviously, implicit alignments can have a much more complicated form than explicit alignments, and they may be only a relation on part of the index set of the alignee. Implicit alignments may also arise indirectly by aligning several arrays explicitly with the same array, or through an explicit alignment in combination with a distributed loop referencing more than one distributed array. Although implicit alignments indicate potential compiler optimizations, they can create problems as well, as we discuss below.
We now describe four problem classes that occurred in the parallelization of our application.

Directive-Based Implicit Alignment. For other compilers this would at most constitute a potential optimization problem. Most compilers will just distribute the associated dummy argument on entry to the subroutine according to the given directive and redistribute upon exit from the subroutine. Possible inefficiencies are then entirely the programmer's responsibility, and do not form a problem for the compiler.
Consider a scenario where subroutine sub(x, y) aligns array x with array y on subroutine entry by means of an align directive. Then, two calls to this subroutine of the form call sub(a, b) and call sub(c, b) lead to an implicit alignment of the arrays a and c. The compiler may want to propagate the two alignments to the calling routine to prevent redistribution inside the subroutine (as the APR xHPF compiler must do). This will only be possible if the potential distributions (as indicated by the programmer, for example) of the index sets of a and c satisfy the requirements from this implicit alignment.
We call the implicit alignment of arrays that arises in this way directive-based implicit alignment, because it arises out of alignments that are indicated by a directive. Problems occur, for example, when the arrays involved differ in distribution, size or shape, and so on. In the small application under consideration we had to resolve such problems several times.
A problem that will arise frequently in converting existing F77 programs to HPF, and which can easily be avoided, is the implicit alignment of unrelated arrays due to the use of workspace arrays passed to a subroutine. Consider the program fragment in Figure 3. The programmer wants array a to be block distributed (line 10) and array b to be cyclically distributed (line 11). If the parallelizing compiler obeys these directives, any distribution for array h will lead to large communication overhead. Typically, the parallel program will redistribute h upon entry to and exit from the subprogram.
The alignment directive does not solve the problem; it only indicates to the compiler that it is probably better to align (redistribute) the array dummy_vector at the start of the subroutine than to do the communication in the subroutine element-wise. The solution is to use the dynamic allocation features of HPF (derived from F90), i.e., to use automatic arrays as workspace.
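A hedged sketch of this fix (the names are ours; Figure 3 is not reproduced in this excerpt): instead of passing a workspace array in, the subroutine declares an automatic array and aligns it with its distributed argument.

```fortran
      SUBROUTINE smooth(a, n)
! An F90/HPF automatic array replaces the workspace array that the
! F77 version received as an argument, so no unrelated array in the
! caller is implicitly aligned with a.
      INTEGER n, i
      REAL a(n)
!HPF$ DISTRIBUTE a(BLOCK)
      REAL work(n)                 ! automatic array: sized on entry
!HPF$ ALIGN work(:) WITH a(:)     ! workspace follows a; no redistribution
      DO i = 1, n
        work(i) = a(i)
      END DO
      DO i = 2, n - 1
        a(i) = (work(i-1) + work(i) + work(i+1)) / 3.0
      END DO
      END
```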
Loop-Based Implicit Alignment. Second, we look at a problem associated with loop-based implicit alignment. In the daxpy routine given in Figure 2, we replace the prescriptive (desired) alignment [2] at line 9 by a descriptive (asserted) alignment [2]. The descriptive alignment (chpf$ align dx with *dy) asserts to the compiler that the actual arguments are (already) aligned on entry to the subroutine. Now, consider the use of the daxpy in the program fragment of Figure 4.
For each call to daxpy, the compiler knows that the actual arguments are aligned and that no communication is necessary. However, the daxpy loop from the first call (line 7) should be distributed due to the block-distribution directive in line 5, whereas the daxpy loop from the second call (line 8) should be replicated due to the replicated-distribution directive in line 6. There are ways around this (see below), but with current compilers (at least the three mentioned above) one has to compromise to obtain reasonable performance.
It is interesting to note that if the daxpy program were not in a subroutine but inlined by the programmer, there would be no problem at all. The compiler could have implemented the best choice for each loop. For the implementation of a simple routine like the daxpy, this may be a good idea; the entire (simplified) daxpy can be implemented in a single forall- or array-syntax statement. The same problem also appears for more complicated kernel routines that are not suitable for inlining at the source code level.
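For reference (our sketch; Figures 2 and 4 are not reproduced in this excerpt), a simplified daxpy with a descriptive alignment, and its one-line array-syntax equivalent, might look like:

```fortran
      SUBROUTINE daxpy(n, da, dx, dy)
! Simplified daxpy: dy := dy + da*dx (unit strides, unlike the
! general BLAS routine with incx/incy arguments).
      INTEGER n, i
      DOUBLE PRECISION da, dx(n), dy(n)
!HPF$ ALIGN dx(:) WITH *dy(:)     ! descriptive: arguments already aligned
      DO i = 1, n
        dy(i) = dy(i) + da*dx(i)
      END DO
      END

! Inlined, the same operation is a single array-syntax statement,
! which the compiler can specialize for each call site:
!     dy(1:n) = dy(1:n) + da*dx(1:n)
```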
It is not defined in the HPF standard that a subroutine must have only one object module for all invocations. However, given that the three mentioned compilers do this, it seems to be the "standard" implementation method. Having different subroutine sources for different argument distributions, alignments, or shapes is undesirable because it means that the programmer has to keep track of the different uses, and has to determine the different uses for an existing program. This is difficult or even impossible since the distributions of arrays may not be known before the program is compiled (the compiler may choose distributions other than indicated in the program) or even before run-time.
Aliasing. Now we consider parallelization and optimization problems that arise from aliases. In F77, programmers often address parts of arrays as independent items. For example, we may consider a two-dimensional array as a matrix and access its "columns," or we may consider a three-dimensional array as a three-dimensional grid and access planes or lines (e.g., in a three-dimensional FFT). If we pass two (or more) different parts of a distributed array to a subroutine and modify at least one part in a parallelized loop, potential aliases occur and the compiler has to determine whether they are real or not. If the compiler cannot establish that no alias exists, it will typically execute the loop sequentially. It may be possible to solve this problem by passing segments of arrays to the subroutine, so that the compiler can check that no actual alias occurs.
In our application, alias problems arise in the generation of orthogonal bases in the iterative solver. A two-dimensional array is used to represent a sequence of vectors; these vectors are generated by multiplying the last generated vector by a matrix and then orthogonalizing the result against the previously generated vectors (see Fig. 5). In the subroutines for the matrix-vector product and the vector update, two (disjoint) parts of the same array are used and one is updated. If the compiler cannot check that these parts are disjoint, it cannot parallelize the loops in these subroutines.
Using dynamic allocation, we constructed an implementation of the program that enables the compiler to check that the referenced parts of the array are always disjoint; however, the compiler still refused to parallelize the routines. We could solve the problem only with forced optimizations. An alternative implementation in (full) HPF may be to represent the sequence of vectors by an array of structures, each of which contains a pointer to a vector. However, pointers are not part of the HPF Subset and the construction is rather cumbersome. Moreover, this approach does not provide a solution if it is necessary to traverse a higher-dimensional array in more than one way. Consider, for example, a three-dimensional FFT over a three-dimensional array. We would like to consider such an array first as a collection of x-vectors, then y-vectors, and finally z-vectors. This cannot be done with pointers in HPF.
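The idea of passing explicit array sections so that disjointness is visible at the call site can be sketched as follows (the routine names and interfaces are illustrative, not those of our solver):

```fortran
c     v(1:n, 1:m) stores the sequence of basis vectors column by column.
c     Passing explicit, disjoint sections makes the absence of aliasing
c     checkable at the call site (names and interfaces are illustrative).
      call matvec(n, a, v(1:n, j), v(1:n, j+1))
c     matvec reads column j and writes column j+1; the sections overlap nowhere.
      call ortho(n, j, v(1:n, 1:j), v(1:n, j+1))
c     ortho reads columns 1..j and updates column j+1, again disjoint.
```

In principle this gives the compiler everything it needs to prove that no alias exists; as reported above, the xHPF compiler nevertheless did not parallelize the corresponding loops without forced optimizations.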

Dense Matrix-Matrix Multiplication
All timings listed in this section give the run-time of the multiplication kernel, i.e., loop 40 in Figure 1 or one of its permutations. These permutations are defined by changing the order of lines 7-9. Tables 2 and 3 show measurements with all six permutations of the loop order, Table 2 with one processor and Table 3 with 32 processors. We use three square matrices of dimension 500.
Comparing the sequential run-times with little (-0) and full compiler optimization (-full) in Table 2, we make two observations. First, for the original sequential code (seq) of the F77 program, the loop order has a large influence on the performance, unless the native compiler (if77) is allowed to restructure the loop order. The compiler only restructures the loop order when full compiler optimization is enabled. This improves the computational speed by more than a factor of 4 (with the ikj- and kij-variants up to a factor of 26) and removes the performance dependence on the loop order. Second, for the parallel code executed on one processor (par -1), we see that even with full optimization, the native compiler is no longer able to restructure the loop order, because of the code (subroutine calls) inserted by the xHPF compiler (run-time checks, calls to communication routines, etc.). Therefore, the loop order affects the run-time efficiency of the parallel program. Comparing the sequential and the one-processor run-times of the xHPF-generated parallel code, both fully optimized, we can interpret the difference as parallelization overhead. This overhead is significant. Table 3 shows execution times of the parallel code executed on 32 processors. The C and R extensions of the loop-order triplets represent the type and location of interprocessor communication inserted by xHPF: C denotes pre-loop communication and R denotes a reduction add operation. This information originates from the xHPF-generated F77 program, which is instrumented with calls to the APR run-time library. The run-times with the two different optimization levels hardly differ, indicating that communication overhead is the reason for the variation of execution time. Surprisingly, the best loop order for the parallel code on one processor (jki) in Table 2 gives the worst run-time on 32 processors in Table 3, whereas the best code on 32 processors (ijk) performs poorly on one processor. To investigate this effect, we measured a series of speedup curves for all permutations of the loop nesting. The results with fully optimized native compilation are presented in Figure 6. These results indicate that in general no single program will be best, or even good, for all possible numbers of processors.
The superlinear speedups are caused by cache effects, not by swapping. Since the size of all three matrices is only 500 x 500, each processor could hold the three matrices simultaneously: each matrix requires 2 Mbyte, whereas each Paragon processor has 32 Mbyte of main memory [5]. The jki-variant also behaves irregularly in another way: the run-time of the two-processor execution (154 s) is about five times larger than that of the one-processor execution of the parallel code (cf. Table 2). We do not know why this happens only with the jki-variant. The lack of knowledge of functions in the parallelized code that are supplied with the APR run-time library makes the interpretation of such effects difficult or even impossible.
An additional concern is that for the ijk- and ikj-variants, all communication is performed in one operation outside the outermost loop (cf. Table 3). This indicates that at least one of the matrices is replicated on all processors, and the compiler potentially reintroduces the scalability problem with respect to memory requirements that we tried to avoid by distributing all matrices. It is not clear whether this code is scalable with respect to the matrix dimensions.
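For reference, the kernel under discussion has the following shape (a sketch of the multiplication kernel of Figure 1; the distribution directives shown are illustrative, since the actual directives in our program may differ):

```fortran
      double precision a(500,500), b(500,500), c(500,500)
c     all three matrices distributed (illustrative column-block layout):
chpf$ distribute a(*,block)
chpf$ distribute b(*,block)
chpf$ distribute c(*,block)
c     ijk-variant; permuting the three do statements yields the other
c     five variants (jik, ikj, kij, jki, kji) measured in Tables 2 and 3.
      do 40 i = 1, n
         do 40 j = 1, n
            do 40 k = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
   40 continue
```

Which variant minimizes communication depends on how the parallelized loop index interacts with the distributed dimension, which is precisely why the best order differs between one and 32 processors.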

Gaussian Elimination
To demonstrate the influence of the loop parallelization match and the optimization gap, we measure the execution times of two sequential variants: a kij-ordered, forward-looking Gaussian elimination (gelim) combined with an ij-backward substitution (backsb), and a kji-ordered, forward-looking Gaussian elimination with ji-backward substitution [7]. The run-times are listed in Table 4; the large variation in the run-times shows the sensitivity of the performance with respect to compiler optimizations and loop ordering. They show that the kji-elimination and ji-backward substitution lead to substantially better processor utilization, and that they have a higher potential for compiler optimizations. However, we will not elaborate these aspects further. Instead, we focus on the analysis of the two data-parallel programs developed in Section 3.2, subsection "Gaussian Elimination." We continue with the kji-ordered, forward-looking Gaussian elimination because of its better processor utilization, and with the ij-backward substitution because of its lower communication complexity for the chosen distribution (see the end of this section).
The implementation of the different strategies was hindered by several xHPF-specific problems. Unfortunately, these problems obscure the usefulness and importance of the data-parallel implementation, which is our main point in discussing Gaussian elimination.
The local pivot search version does not improve the performance. The xHPF compiler still broadcasts all intermediate results of the distributed array elements ppiv(piv) and abspiv(piv) in the pivot search. Because these elements are local to the processor that executes the loop, and only the resulting values after the loop execution matter to the rest of the algorithm and the other processors, we did not expect the compiler to insert this communication inside the loop. We can avoid the communication inside a loop with the xHPF parallelizer by using a forced optimization that inhibits this communication. However, this leads to a segmentation violation in one of the APR run-time library routines if the program is executed on more than one processor. With several other "forced optimizations" and algorithmic changes we could get the program to run, but we never obtained reasonable timings for the local pivot search version. We would like to emphasize that such an implementation, with many forced optimizations, is highly undesirable.
The all local version does not lead to the desired efficiency immediately with the xHPF compiler. If array pcol is declared as an automatic array inside the subroutine gelim, the copying of the pivot vector (Loop 400 in the Appendix, "Data-Parallel Linear Equation Solver, All Local Variant") is implemented by broadcasting each vector element separately instead of broadcasting the whole vector once after the copy. Furthermore, a subroutine call for computing the local indices for the replicated array pcol is inserted inside the loop. Hence, the xHPF compiler not only creates unnecessary communication, but also introduces unnecessary work. This obviously leads to very poor performance, which is completely unnecessary (the PGI compiler does not have these problems). The execution times are shown in Table 5 in the "Automatic Array" column.
To improve the performance we used the following tactic. If array pcol is allocated in the calling routine and passed as a dummy array to gelim, the pivot column is broadcast as a whole, and the index calculation is moved out of the loop. This also shows that there is no good reason for the poor implementation by the xHPF compiler when an automatic array is used. The implementation with the dummy replicated array is given in the Appendix (see "Data-Parallel Linear Equation Solver, All Local Variant"), and the (much better) timings are given in Table 5 in the "Dummy Array" column. This type of compiler anomaly is obviously highly undesirable.
After Gaussian elimination we compute the solution by backward substitution. For the given distribution of the matrix, there are two straightforward implementations: a row-oriented (ij-version) and a column-oriented (ji-version) backward substitution, with the solution vector x aligned with the columns of the matrix. The column-oriented version generates more communication than the row-oriented version. The column-oriented version requires a broadcast of the array element x(col) and a one-to-all scatter of a(1:row-1, col) in each step, whereas the row-oriented version needs only a reduction add over the local products a(row,col)*x(col), col = row+1, n, which costs about the same as a broadcast.
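In outline, the row-oriented variant looks as follows (a sketch; our actual backsb routine differs in its HPF directives and surrounding details):

```fortran
c     Row-oriented (ij) backward substitution for an upper-triangular
c     system a*x = b. With x aligned with the columns of a, the inner
c     sum is a reduction add over local products a(row,col)*x(col).
      do 10 row = n, 1, -1
         s = b(row)
         do 20 col = row+1, n
            s = s - a(row,col)*x(col)
   20    continue
         x(row) = s / a(row,row)
   10 continue
```

Each step thus needs one reduction per row, instead of the broadcast plus scatter required by the column-oriented (ji) variant.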

Flow-Simulation
The execution times for the discretization and a fixed number of steps of the iterative solver in the final HPF program are given in Table 6 for three problem sizes and various numbers of processors. For one processor, Table 6 gives the run-time for the original sequential code (seq) and for the HPF code (hpf) executed on one processor. For the first two problem sizes the program fits in the memory of a single processor; for the largest problem size the program does not fit in the memory of a single processor (so some paging is done) but fits in memory for two processors or more. So, considering the size of the machine, 96 processors, this is still a small problem.
It is clear from a comparison between the sequential execution times of the original program and the HPF program that the latter induces a substantial amount of overhead. Since no special measures are taken in the solver to improve the communication effects of a large number of inner products, which need global communication, the speedup deteriorates rapidly for smaller problems [9], [16]. However, for the largest problem the speedup on 32 processors is approximately eight. Moreover, if we look at the three problem sizes for a specific number of processors (say 64), we see that the execution time increases by 25% to 35%, whereas the problem size increases by a factor of 2.5. So for very large problems the overhead induced by the compiler becomes less noticeable, and the (relative) performance improves.

Dense Matrix-Matrix Multiplication
The performance study in Section 3.3, subsection "Dense Matrix-Matrix Multiplication," leads to the identification of two efficiency-related issues of the parallel program generated by HPF:

1. Loop parallelization match: consistency of the data distribution with the loop order and loop parallelization. Depending on the loop nesting order and the given data distribution, an HPF source-to-source compiler inserts run-time checks, index transformations, and communication primitives. The location of this code determines the number of messages generated, and hence affects the run-time efficiency. Given the data distribution, the appropriate order of the three nested loops in the sequential program is essential to ensure minimal communication overhead, and vice versa. We call this aspect the loop parallelization match to characterize the match of the data distribution with the loop order and loop parallelization.

2. Optimization gap: loss of optimization potential of the parallel program generated by an HPF source-to-source compiler. Efficiency aspects related to the processor architecture, in particular register allocation, pipelining, and caching, are coupled with the loop order. Complex RISC processors such as the i860 of the Paragon rely heavily on machine-specific optimizing compilers for generating efficient sequential code [17].
In our case, the quality of the sequential code for the individual processors depends on the ability of the native compiler to optimize the program generated by the HPF compiler. We call the interface problems between the two compilers the optimization gap.
Although the optimization gap is not HPF-specific, there are two reasons to mention this aspect explicitly. First, there is a large difference in performance between the HPF programs and the sequential program on one processor. Second, there is also a difference in performance between the HPF programs with different loop orders on multiple processors. Furthermore, the loop parallelization match and optimization gap are closely coupled, which complicates the analysis. The optimization gap is illustrated in Table 2, the loop parallelization match is illustrated in Table 3, and an overview of the joint effects can be obtained from Figure 6.
An important problem with the analysis of the timing results is the lack of knowledge of vendor- or compiler-specific subroutine and function calls to run-time libraries inserted by HPF compilers. Often the only feasible way to analyze the peculiarities will be performance monitoring or empirical methods. For the simple matrix-matrix multiplication code, it is possible, although time-consuming, to obtain an empirical performance analysis by gathering data for the complete set of loop permutations. These experiments show that the loop parallelization match and the optimization gap strongly influence the parallel efficiency. Moreover, Figure 6 indicates that the optimal loop order depends on the number of processors. Hence, there is no variant that is optimal for all numbers of processors. Experimenting and profiling may be the only way to find the best variant for a particular number of processors, since it seems difficult to predict.

Gaussian Elimination
The programming experience with Gaussian elimination illustrates the kind of considerations that are necessary in general to obtain an efficient parallel implementation. One should not confuse these considerations with the xHPF-specific implementation problems.
The programmer has to present a data-parallel program to the HPF compiler. In general, this requires restructuring the original F77 program, adding or modifying data structures, and making (some) algorithmic changes. This should, and in general will, guarantee good performance irrespective of the compiler. It is highly undesirable that certain constructs lead to good performance on one compiler and poor performance on another (performance portability). Also, it should be clear which constructs work and why, and the difference between good and bad performance should not depend on trivial changes to the code, which is sometimes the case with the xHPF compiler (as discussed previously). The xHPF compiler, however, even with a data-parallel programming style, may not generate an efficient parallel program. In such a case, the programmer has to study carefully the parallelization report and the xHPF-generated program, and either provide APR-specific directives that force the compiler to insert or omit communication code, or make (small) changes to the (HPF) source code that prevent erroneous or poor implementations. Using forced optimizations should be viewed as supporting the xHPF compiler to overcome the limitations in its optimization capabilities, especially at this (early) stage of parallelizing compiler technology. However, the use of forced optimizations is dangerous; it may change program semantics, and it is not portable to other HPF compilers.

Flow-Simulation
Before we discuss our conclusions concerning the four problem types discussed in Section 3.2, subsection "Flow-Simulation," we consider the type of changes in declarations that are important when converting F77 to HPF, including some issues that do not arise in an F77 program at all. HPF incorporates F90 [18], and hence it supports several dynamic allocation features. This removes the need for many practices in F77 that make automatic parallelization difficult, apart from the problems that they pose for programming in general.
It will no longer be necessary to declare arrays larger than needed in the actual invocation of a program or subroutine. If the required size is not known in advance, the array can be made allocatable and allocated once the required size is known. If the required size is passed through the argument list of a subroutine, the array can be allocated inside the subroutine with this size (automatic array). Therefore, dummy workspace arrays are no longer necessary. Furthermore, dummy arguments can have undefined shape, which means that they will take their shape from the actual argument (assumed-shape array). Finally, it is possible to define (declare) the distribution/alignment attributes of an array before the array has actually been allocated.
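These declaration features can be sketched in a single routine (illustrative names; note that an assumed-shape dummy requires an explicit interface at the call site):

```fortran
      subroutine solve(n, rhs)
      integer n
      double precision rhs(:)                ! assumed shape: taken from actual arg
      double precision work(n)               ! automatic array: sized by argument n
      double precision, allocatable :: tmp(:)
c     distribution attributes may be declared before allocation:
chpf$ distribute work(block)
chpf$ distribute tmp(block)
      allocate(tmp(2*n))                     ! allocatable: size known at run time
c     ... use rhs, work, and tmp ...
      deallocate(tmp)
      end
```

None of these arrays needs to be oversized or passed in as workspace, which removes a common obstacle to automatic parallelization.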
Also common blocks, a major source of problems in HPF (but often in F77 also), are no longer necessary. Global variables can be created and managed (for access) more cleanly using MODULEs and the USE statement [18]. Having a single subroutine perform the same (or similar) function on a range of different objects, e.g., matrices in different storage formats, can be achieved either by using optional keywords or by using generic interfaces (overloading) [18]. Stricter definitions/declarations of data and specifying exact sizes will make compiler optimizations easier and communications cheaper, and will remove load-balancing problems.
An important point, especially with the F90 (dynamic) allocation features, is where arrays should be declared. From a software engineering point of view, data should be declared at the (highest) level where it is meaningful and where it makes sense to make this data visible and allow manipulation. From a parallelization point of view this is also generally best. However, for the xHPF compiler, large arrays that are often referenced together should sometimes preferably be declared together, because this makes their mutual alignment/distribution easier. This may mean that an array is declared at a higher level than necessary (meaningful), because this will typically reduce directive-based implicit alignment problems.
Directive-based implicit alignment problems (which only occur with the xHPF compiler) can often be removed by writing out small (often used) subroutines explicitly. This is not necessarily as bad as it sounds, because HPF (F90) is a much more powerful language than F77, and offers more flexibility and expressiveness in addressing parts of arrays. For example, most level-1 BLAS operations can be written in a single statement, and with the triplet notation in array syntax all the complicated increment (stride) issues can be avoided.
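As a sketch, a strided daxpy, which in F77 needs a loop with explicit index bookkeeping, reduces to one array-syntax statement (assuming positive increments incx and incy for brevity):

```fortran
c     y := y + a*x with strides incx and incy, as a single statement;
c     the triplet notation handles the increment bookkeeping.
      dy(1:1+(n-1)*incy:incy) = dy(1:1+(n-1)*incy:incy)
     &                        + da * dx(1:1+(n-1)*incx:incx)
```

Written this way, the operation no longer needs a separate subroutine, so the implicit alignment problem at the subroutine boundary disappears.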
Another important issue is that the potential optimization of moving (re)distributions of arrays on entry to and exit from a subroutine out of the subroutine is not always useful. The redistribution may be useful only inside that subroutine, and if the (re)distribution takes place on entry and the original distribution is restored on exit, no implicit alignments outside the subroutine can arise. We note that the APR xHPF compiler has no choice but to move distributions out of subroutines (i.e., force the appropriate distribution to be performed where the array is declared), because it does not support restoring the original distribution on exit from a subroutine. This is a severe drawback and the main reason for the directive-based implicit alignment problems. Moreover, this implementation does not conform to the official HPF Subset definition.
The main way to prevent loop-based implicit alignment problems is cloning: where a subroutine may be called in a loop with distributed arrays of different shape, size, or distribution as actual arguments, it may be better to have two routines that perform the same or similar actions but place different requirements on their arguments. Of course, it would be much better if the compiler handled such problems transparently. Furthermore, using arrays with different distributions in a single loop should be avoided if possible. Writing out small, often used, subroutines may also relieve problems. Both explicit cloning and inlining should be considered as temporary work-arounds to cope with the problems of currently available compilers. Eventually, compilers should take care of these problems transparently. Another important issue is to reduce the complexity of loop structures and index expressions. Finally, one should be careful with nested loop structures.
The aliasing problems can often be avoided using the derived types and POINTER features of F90/HPF (not in the HPF Subset). Careful declaration will avoid several problems. Finally, using array syntax (segments and triplet notation) may avoid aliasing problems. Better compiler analysis, on the other hand, will also improve the situation.
Where the analysis capabilities of the compiler are insufficient, interprocedural analysis should be abandoned or limited, because it causes more problems than it solves. Once more, it is important to note that in several cases the xHPF compiler has to rely on interprocedural analysis, because its handling of distributed data across subroutine boundaries is nonstandard and can create problems if the distribution of actual arguments is not moved to the subroutine where the arrays are declared. This approach has the additional drawback that if distribution of the actual arguments is not possible, then the desired distribution must be ignored. In cases where the interprocedural analysis fails, it is much more useful to use inquiry routines to determine the run-time configuration and data layout, and to use this information to implement the desired strategy explicitly. (xHPF does not provide the HPF inquiry routines; in the new release only one APR-specific and limited distribution inquiry routine is provided.)

DISCUSSION
In this section we discuss some general problems related to the performance portability of the HPF programs and summarize possible solutions. Our recommendations are based on the work with APR's xHPF compiler, although we try to generalize them by considering their background. Furthermore, we encountered several other problems typical of the APR compiler, which we will not discuss.

Performance Portability
Currently, only an HPF language specification has been defined, which (of course) does not include a specification of the optimization capabilities of an HPF compiler. This allows a wide range of possible HPF compiler implementations, which poses a threat to the goal of performance portability. Preliminary testing of some of our programs using the PGI* compiler confirmed our suspicion that for the same program different compilers generate code delivering very different performance; i.e., the PGI compiler sometimes creates poor code where the xHPF compiler creates efficient code, and vice versa. The HPF language specification allows the programmer to make few assumptions about the generated code. At the main program level, all directives (particularly distributions and alignments) are so-called prescriptive directives, i.e., they indicate a desired feature but they may be ignored. Strictly speaking, the programmer has no control over the actual distribution, although in practice the situation will not be so bad. Nevertheless, any distributions that the compiler will use or "grant" the programmer, and any optimizations it will make, depend on the analysis capabilities of the compiler.

As an example of the range of possible compiler implementations, consider the following case. In a subroutine the programmer asserts the alignment of two dummy arguments (a descriptive alignment). It is therefore the programmer's responsibility that this alignment holds upon each subroutine entry.† However, in the main program, the programmer cannot insist on any particular distribution, because only prescriptive directives can be used, which may be ignored by the compiler. Recognizing this problem, the authors of the HPF Language Specification [2] provide the following statement: "All this is under the assumption that the language processor has observed all other directives. While a conforming HPF language processor is not required to obey mapping directives, it should handle descriptive directives with the understanding that their implied assertions are relative to this assumption."

We can distinguish two entirely different ways of handling this problem at the compiler level. First, when the compiler does not observe even a single directive, or encounters even a single analysis problem, it simply ignores all descriptive directives. Second, the compiler performs global interprocedural analysis and propagates all requirements on dummy arguments to the actual arguments in the calling routine(s). In this way the compiler can check whether decisions it has made elsewhere in the program have an influence on this particular descriptive directive. The directive may even influence distribution decisions taken at higher levels in the program. Obviously, there are many possibilities between the two extremes. These possibilities are determined by the quality and the type of the analysis being performed, and by the optimizations that the compiler supports.

* Trademark of Portland Group Inc.
† It is the programmer's responsibility that the asserted statement holds.

Hints for Programming with HPF
In order to support novice users of HPF, we provide a summary of those points of the parallelization process that appear most important to us. One should always start with a sound analysis of the program with respect to parallelization, the algorithms involved, and the purpose of the program.‡ From this analysis, a parallelization strategy should be developed that involves the distribution of both data and work. The distribution of data is done through declarations and directives (distribute and align). The distribution of work is done implicitly by using a data-parallel programming style, by using statements and constructs that indicate independence of loop iterations (forall and independent, and array syntax), and by using routines from the HPF_LIBRARY module and HPF intrinsics (not in the HPF Subset). In order to assist the compiler in generating an efficient parallel program, one should reduce as much as possible the complexity of nested loop structures and index expressions. Furthermore, declaring arrays only where they are needed and using the dynamic allocation features generally improve the generated parallel code.
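A minimal illustration of these work-distribution constructs (the declarations are illustrative, not taken from our application):

```fortran
      double precision x(n), y(n), z(n)
chpf$ distribute x(block)
chpf$ align y(:) with x(:)
chpf$ align z(:) with x(:)
c     assert that the iterations of the following loop are independent:
chpf$ independent
      do 10 i = 1, n
         x(i) = y(i) + z(i)
   10 continue
c     the same work distribution expressed with array syntax or forall:
c       x(1:n) = y(1:n) + z(1:n)
c       forall (i = 1:n) x(i) = y(i) + z(i)
```

The directives distribute the data; the independent directive, forall, and array syntax then let the compiler distribute the work over the processors that own the data.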
However, for the xHPF compiler, one should be aware that sometimes it is better to declare large, distributed arrays at a higher level in the program to facilitate the alignment with other arrays.
Taking care of the loop parallelization match means programming loop nestings and array distributions such that communication can be moved out of the loops as much as possible. Decreasing the number of subroutine calls to the vendor-supplied run-time library for communication and run-time checks inside the loops will also increase the optimization opportunities for the native compiler, which may reduce the optimization gap. The optimization gap may be further reduced by adjusting the loop order in the (original) F77 or HPF program so as to improve locality, vector length, or other features in the program generated by the HPF compiler. However, one should be aware that such changes may improve the performance on one parallel machine but may reduce it on another parallel machine. In general, such optimizations require a thorough understanding of the HPF compiler itself, or at least extensive experience with its black-box behavior. In many cases, loop parallelization match and optimization gap requirements will conflict. One possibility to resolve this is to change the algorithm and/or the data structures. In numerical simulations, different algorithms can often be used to obtain the same result, and the choice of the current algorithm may have been based on assumptions that do not hold in an HPF setting.

‡ We mention this obvious fact because we feel that recently HPF has been oversold as being very easy to use, and as involving nothing more than inserting a few directives at the start of a program or subroutine. This, however, is not our experience.
Another important consideration is that on a machine with a small number of powerful processors, optimizations by the native compiler might be more effective than minimizing the cost of communication, whereas on a machine with many relatively slow processors the opposite might be true. Changing the algorithms and data structures, or replacing one algorithm by another, is useful when simple "tricks" do not work.
With the xHPF compiler, it is important to prevent problems with directive-based implicit alignments. Directives for dummy (array) arguments in subroutines should be used carefully, and careful handling of the actual arguments in the subroutine calls is recommended to avoid unnecessary redistribution. Often alignments can be made at a higher level to prevent some of the implicit alignment problems.
To resolve problems with loop-based implicit alignments one could consider the inlining of small and frequently called subroutines. Another way to support the compiler is to provide different versions of certain subroutines for array arguments with different distributions. These suggestions should be considered as temporary solutions to cope with current compiler inadequacies. Eventually, compilers should resolve (potential) problems of this type transparently. In general, one should prevent loops referencing arrays that cannot be aligned.
Alias problems can sometimes be prevented by using automatic arrays or array segments; one should not pass workspace arrays. Another possibility is to use pointers, although this only helps in simple cases. With the xHPF compiler, one can use forced optimizations if changing the algorithms and data structures does not help.
To avoid interprocedural propagation problems with the xHPF compiler, one could, at the end of the parallel program development, consider the use of forced optimizations to improve the efficiency of the parallel code by removing unnecessary communication and run-time checks.

CONCLUSIONS
Our case study revealed several problems with the conversion of F77 programs to HPF programs: the loop parallelization match, the optimization gap, program changes to support the compiler with program sections whose parallelization is not clear from the data distribution, directive-based implicit alignment (xHPF specific), loop-based implicit alignment, alias problems, and propagation problems introduced by the interprocedural analysis (xHPF specific). From our experience with the PGI compiler and from discussions with several other people involved in HPF [T. Brandes, Personal Communication; L. Meadows, Personal Communication; R. Rühl, Personal Communication], it appears that, except for the directive-based implicit alignment and the propagation problems introduced by the interprocedural analysis, the problems that we indicated are of general importance. Here, "general importance" does not mean that these problems are inherent to the HPF language, but that they arise in several currently available HPF compilers and are not easy to resolve within the compiler. For most of these problems, we have indicated possible solutions at the program level (although they do not always work with the xHPF compiler). The solution of the identified problems at the compiler level would enhance the performance of HPF programs and substantially reduce the effort of the programmer. We obtained reasonable performance for the programs in our case study, but only with several changes in the programs and algorithms.

Concerning the APR xHPF compiler, we observed the following. In general, an HPF Subset program must be (re)written in a data-parallel programming style. Furthermore, a rather detailed understanding of the compiler is necessary to obtain good results. This requires patience and time in order to gather sufficient experience. The compiler is sometimes overconservative (which probably indicates insufficient analysis capabilities), so that considerable time must be devoted to removing unnecessary communication calls and run-time checks when parallelizing a particular program. This is the sequential implementation with HPF directives for data distribution and with APR-specific forced optimizations (see Table 1 for timing results).

Gaussian elimination with partial pivoting

FIGURE 1 Program fragment of matrix-matrix multiplication.

umn.* This leads to the following three parallelization variants for one iteration of the outermost loop (loop 200 in the Appendix):

1. (See program in the Appendix.) The processor that owns the pivot column does the pivot search (lines 27-36) locally and broadcasts the result. Next, the row exchange (lines 58-66) is carried out in parallel. Then, the processor that owns the pivot column computes the elimination factors (lines 72-74) and broadcasts the column which contains these. Afterward, the update of the submatrix (lines 77-82) is carried out in parallel, and the update of the right-hand side (line 81) is replicated on all processors.

2. (See program in the Appendix.) First, the pivot column is broadcast (lines 33-35) and the pivot search (lines 39-46) is replicated. Then the row exchange (lines 58-79) is carried out in parallel. The computation of the elimination factors (lines 83-85) is also replicated on all processors, and the update of the submatrix (lines 86-93) is carried out in parallel. The update of the right-hand side (line 92) is replicated on all processors.

* It is also possible to distribute the pivot column and the computation of the elimination factors. We will not consider this possibility because it will be very difficult for the compiler to analyze whether this improves efficiency or not.
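The structure of variant 1 can be sketched as follows, with the matrix distributed by columns (the routine name, labels, and loop bounds are illustrative and do not correspond to the line numbers of the Appendix):

```fortran
      subroutine gelim(a, n)
      integer n, i, j, k, p
      double precision a(n,n), t
chpf$ distribute a(*,cyclic)
      do 200 k = 1, n-1
c        pivot search in column k: local to its owner, result broadcast
         p = k
         do 10 i = k+1, n
            if (abs(a(i,k)) .gt. abs(a(p,k))) p = i
 10      continue
c        row exchange: each processor swaps its part of rows k and p
chpf$    independent, new (t)
         do 20 j = 1, n
            t      = a(k,j)
            a(k,j) = a(p,j)
            a(p,j) = t
 20      continue
c        elimination factors: computed by the owner of column k,
c        then broadcast together with the pivot column
         do 30 i = k+1, n
            a(i,k) = a(i,k)/a(k,k)
 30      continue
c        rank-1 update of the trailing submatrix: parallel over columns
chpf$    independent
         do 50 j = k+1, n
            do 40 i = k+1, n
               a(i,j) = a(i,j) - a(i,k)*a(k,j)
 40         continue
 50      continue
 200  continue
      end
```

In variant 2, the pivot search and the computation of the elimination factors would instead be replicated on all processors after broadcasting the pivot column.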
We illustrate the effects of loop parallelization by means of an example. Consider a loop in which two arrays are referenced once. The loop index set is given by I_l, the index sets of the arrays are I_a and I_b, and the alignments of the loop with the array index sets are given by the functions A_a: I_l → I_a and A_b: I_l → I_b.
In order to keep all array references local in this loop, the distributions M_{I_a,P}: I_a → P and M_{I_b,P}: I_b → P must fulfil

    M_{I_a,P}(A_a(i)) = M_{I_b,P}(A_b(i))   for all i ∈ I_l,

for all possible P. This defines an implicit alignment of I_a and I_b in the following sense.
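A concrete instance of this condition (the arrays and bounds are illustrative): in the loop below, A_a(i) = i and A_b(i) = i+1, so keeping both references local requires a shifted alignment of b with a.

```fortran
c     Schematic: align b(j) with a(j-1), so that b(i+1) lives on the
c     same processor as a(i) for every loop iteration i.
chpf$ align b(j) with a(j-1)
      do 10 i = 1, n-1
         a(i) = a(i) + b(i+1)
 10   continue
```

Without the shifted alignment, every iteration would reference one local and one potentially remote element, forcing communication inside the loop.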

FIGURE 4 The daxpy routine and its use lead to undesired loop-based implicit alignment.

FIGURE 5 Parallelization problems due to aliases created by passing parts of vv as different vectors.

It is not included in the xHPF version discussed here. However, our interest is not in the matrix-matrix multiplication by itself, but in the effects of loop ordering, communication, and processor utilization. Although the sequential algorithm comprises a straightforward threefold loop,* optimizations for better sequential performance as well as for parallelization are far from trivial. Structuring optimized code manually for RISC machines with virtual memory hierarchies does not only complicate the code, but may also prevent the compiler from recognizing potential optimizations apparent in the simple code. We experienced this anomaly when we compared the simple matrix-matrix multiplication with the dgemm level 3 BLAS routine, a well-known optimized code for dense matrix-matrix multiplication [6]. The optimizing if77 compiler generates code that is approximately three times faster with the straightforward threefold loop than with the manually optimized routine.

* All sequential run-times reported in Section 3.3, subsection "Dense Matrix-Matrix Multiplication," have been measured with the straightforward threefold loop version. The ijk-version is shown in Figure 1.
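The straightforward threefold loop mentioned above can be sketched as follows, here in ijk order; the other five orderings are obtained by permuting the three DO statements (moving the initialization of c accordingly):

```fortran
c     ijk-order matrix-matrix multiplication: c = a*b.
      do 30 i = 1, n
         do 20 j = 1, n
            c(i,j) = 0.0d0
            do 10 k = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
 10         continue
 20      continue
 30   continue
```

Which ordering performs best depends on the array distributions and on which loop the compiler parallelizes, which is precisely the effect studied in this example.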

2. Loop-based implicit alignment. As explained, in the case of a distributed loop containing references to a distributed array, an optimizing compiler will try to align the loop index set with the index set of the distributed array so as to minimize communication. We refer to this type of implicit alignment as loop-based implicit alignment. If such a loop occurs inside a subroutine, the compiler typically generates only one object module for the subroutine. This holds, for example, for the APR xHPF compiler, for the PGI compiler [L. Meadows, Personal Communication] (we tested this ourselves), and for the ADAPTOR tool [15]. A fixed distribution mapping of the loop index set, however, leads to the implicit alignment of all the actual arguments in separate calls to the subroutine. The effect is the same as if all the actual arguments were referenced in this loop at the same time. The daxpy routine as defined in the BLAS has different increments for the two vectors dx and dy; this makes alignment impossible in the general case.
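A simplified sketch of the BLAS daxpy (positive increments only; the full routine also handles negative increments and an early return for da = 0) makes the difficulty concrete: with incx ≠ incy, the loop references dx and dy at unrelated strides, so no single alignment of the loop index set with both arrays can keep all references local.

```fortran
      subroutine daxpy(n, da, dx, incx, dy, incy)
      integer n, incx, incy, i, ix, iy
      double precision da, dx(*), dy(*)
c     dy(iy) and dx(ix) advance with different strides, so the loop
c     index cannot be aligned with both index sets at once.
      ix = 1
      iy = 1
      do 10 i = 1, n
         dy(iy) = dy(iy) + da*dx(ix)
         ix = ix + incx
         iy = iy + incy
 10   continue
      end
```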

Table 4. Sequential Execution Times (Seconds) of Gaussian Elimination with Partial Pivoting and Backward Substitution

Optimization   Gelim kji   Backsb ij   Gelim kji   Backsb ji

Table 6. Execution Times (Seconds) for the Finite-Volume Discretization and 25 Iterations in Total of the Nested Iterative Solver for Three Problem Sizes: The Sequential Program and the HPF Program for Several

With the xHPF compiler, one should try to remove the problem that is the source of the propagation. To remove remaining problems, one should again consider changes to algorithms and/or data structures. Finally, one could use forced optimizations.