PGHPF-An Optimizing High Performance Fortran Compiler for Distributed Memory Machines

High Performance Fortran (HPF) is the first widely supported, efficient, and portable parallel programming language for shared and distributed memory systems. HPF is realized through a set of directive-based extensions to Fortran 90. It enables application developers and Fortran end-users to write compact, portable, and efficient software that will compile and execute on workstations, shared memory servers, clusters, traditional supercomputers, or massively parallel processors. This article describes a production-quality HPF compiler for a set of parallel machines. Compilation techniques such as data and computation distribution, communication generation, run-time support, and optimization issues are elaborated as the basis for an HPF compiler implementation on distributed memory machines. The performance of this compiler on benchmark programs demonstrates that high efficiency can be achieved executing HPF code on parallel architectures. © 1997 John Wiley & Sons, Inc.


INTRODUCTION
Recently there have been major efforts to develop programming language and compiler support for distributed memory machines. Based on the initial work on projects like Fortran D [1] and Vienna Fortran [2], the High Performance Fortran Forum produced HPF version 1.0 [3]. HPF incorporates a data-mapping model and associated directives which allow the programmer to specify how data are logically distributed in an application. An HPF compiler interprets these directives to map the data onto the processors. However, naive compilation of a global-name-space HPF program for distributed memory machines under these distributions may cause unacceptable address calculation overhead. A unique approach is described here which eliminates address calculations for block distributions and dramatically simplifies address calculations for cyclic distributions.
A set of communication primitives has been developed and is used by the HPF compiler to communicate nonlocal data in an efficient, convenient, and machine-independent manner. These primitives take advantage of distribution information provided in HPF directives. For example, primitives move data very efficiently among arrays aligned to the same template. These primitives have been designed to minimize communication overhead. They are treated as true primitives within the compiler, allowing optimizations, such as code motion, to take place which minimize costly interprocessor communication.
The rest of this article is organized as follows. Section 2 presents the HPF language features. Section 3 describes the architecture of the HPF compiler. Section 4 discusses data and computation distribution. Section 5 introduces the communication primitives and describes compiler algorithms for generating calls to the primitives and creating temporary arrays used in communication. Algorithms are given for some communication primitives. Section 6 discusses the run-time support system for intrinsics and subroutine interfaces. Section 7 presents optimization techniques. Section 8 presents experimental performance results using the pghpf compiler on several benchmark programs. Section 9 summarizes related work and Section 10 provides conclusions.

HPF FEATURES
The HPF approach is based on two fundamental observations. First, the overall performance of a program can be increased if operations are performed concurrently by multiple processors. Second, the efficiency of a single processor is highest if the processor performs computations on data elements stored locally. Based on these observations, the HPF extensions to Fortran 90 provide a means for the explicit expression of parallelism and data mapping. An HPF programmer can express parallelism explicitly, and using this information the compiler may be able to tune data distribution accordingly to control load balancing and minimize communication. Alternatively, given an explicit data distribution, an HPF compiler may be able to identify operations that can be executed concurrently. Typically, the programmer will write HPF code that incorporates both data-mapping directives and explicit parallel constructs in order to maximize the information available to the compiler during code generation.
An HPF program has essentially the same structure as a Fortran 90 program, but is enhanced with data distribution and alignment directives. When writing a program in HPF, the programmer specifies computations in a global data space. Array objects are aligned to abstract arrays with no data, called templates. Templates are distributed according to distribution directives in block and cyclic fashion. All arrays aligned to such templates are implicitly distributed.
Parallelism can be explicitly expressed in HPF using the following language features: Fortran 90 array assignments, masked array assignments, WHERE statements and constructs, HPF FORALL statements and constructs, HPF INDEPENDENT directives, intrinsic functions, the HPF library, and EXTRINSIC functions [4]. The compiler uses these parallel constructs and distribution information to produce code suitable for parallel execution on each node of a parallel computer.
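As an illustration (a minimal sketch with hypothetical array names and sizes, not taken from the article's figures), a fragment combining data-mapping directives with an explicit parallel construct might look like this:

!HPF$ TEMPLATE T(1000)
!HPF$ DISTRIBUTE T(BLOCK)
      REAL A(1000), B(1000)
      INTEGER I
!HPF$ ALIGN A(I) WITH T(I)
!HPF$ ALIGN B(I) WITH T(I)
      FORALL (I = 2:999) A(I) = 0.5 * (B(I-1) + B(I+1))

Here the directives describe how A and B are mapped, while the FORALL expresses the parallelism of the update over the global index space.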

ARCHITECTURE OF THE COMPILER
The major phases of the HPF compiler are shown in Figure 1. The first phase semantically analyzes all HPF directives and stores the resulting information on templates, distribution, alignment, and processor arrangements in the array descriptor portion of the symbol table. This information is used throughout compilation. For variables that are not explicitly mapped (and compiler-created temporaries), the compiler chooses a default distribution and alignment. Figure 2 shows a typical sequence of HPF directives used to align and distribute arrays. The compiler identifies alignment chains and determines that no PROCESSORS directive is present. In the absence of a PROCESSORS directive, the compiler generates code to dynamically determine the number of available processors and uses a default processor arrangement.
The compiler stores the distribution information as shown in Figure 2. It creates a default template T for array C and collapses all alignments to C into alignments to T. The template T is distributed by default onto P, the default processor arrangement. The next phase of the compiler transforms all parallel constructs (DO loops, array assignments, WHERE statements/constructs, and FORALL statements/constructs) into one internal representation which is very similar to a FORALL statement [5]. After transformation, subsequent phases of the compiler (such as optimization) process parallel constructs using only the internal representation.
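Figure 2 itself is not reproduced here; a directive sequence of the kind described above (array names hypothetical) would be:

      REAL A(100), B(100), C(100)
!HPF$ ALIGN A(I) WITH B(I)
!HPF$ ALIGN B(I) WITH C(I)
!HPF$ DISTRIBUTE C(BLOCK)

The compiler follows the alignment chain from A through B to C, creates a default template T for C, rewrites all alignments as alignments to T, and, since no PROCESSORS directive appears, distributes T onto a default processor arrangement P whose size is determined at run-time.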
The communication phase of the compiler selects the communication primitives. It inserts code for allocation of buffers as well as calls to communication primitives. This phase also partitions computations by modifying the bounds of parallel loops and inserting conditional statements which restrict execution of statements to the appropriate processors. Numerous communication optimizations are also performed during this pass. This phase is discussed further in Section 4.
The last phase of the compiler builds descriptors for each processor arrangement, template, and array so that all information available at compile-time is also available at run-time. Processor descriptors include information on the shape and mapping of the processor arrangement. A template data structure describes the shape and distribution of the template. An array descriptor contains all the information necessary to determine the shape of the array, the template to which it is aligned, and how the alignment is specified.
Finally, the code generator produces loosely synchronous SPMD code [6]. The generated code is structured as alternating phases of local computation and calls to communication primitives. Communication primitives are synchronization points. Most of the time the compiler does not know until run-time which groups of processors will communicate for a given parallel construct, so it guarantees that communication primitives will be called by all processors, but only sender and receiver processors will perform the communication.

Equation 2 is based on the observation that array element A(i) will be owned by processor p if and only if processor p owns T(i'), the template element to which A(i) is aligned.

PARTITIONING
The nice thing about the presented calculation is that the local elements of array A on processor p are represented with a lower bound and upper bound pair. This can easily be implemented with the Fortran 90 ALLOCATE statement. As shown in the equations, all calculations use the array's global index space. The ALLOCATE statement allocates the aligned arrays with respect to their global index space without allocating the entire array on a processor. This approach eliminates the overhead of address calculation between global-to-local and local-to-global index spaces, since the local indices are identical to the global indices.
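A minimal sketch of this allocation scheme (the block bounds shown are hypothetical values for one processor):

      PROGRAM global_bounds
        IMPLICIT NONE
        INTEGER, PARAMETER :: n = 1000, nprocs = 4, myrank = 1  ! hypothetical 0-based rank
        INTEGER :: lb, ub
        REAL, ALLOCATABLE :: a_local(:)
        ! lower/upper bounds of the block owned by this processor, in GLOBAL indices
        lb = myrank * (n / nprocs) + 1
        ub = (myrank + 1) * (n / nprocs)
        ALLOCATE (a_local(lb:ub))     ! only the owned block is allocated,
                                      ! but it is indexed with global indices
        a_local(lb:ub) = 0.0          ! no global-to-local translation is needed
      END PROGRAM global_bounds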

Computation Partitioning
The source of parallelism in the data-parallel model is the partitioning of the computation among the processors. One of the most common methods of computation partitioning is the owner-computes rule [5, 8, 9]. This implies that assignment statements involving distributed arrays are executed exclusively by those processors which own the left-hand side (lhs) variable. The set of elements of an assignment which have to be computed on a particular processor p is referred to as an execution set, denoted by exec(p).
The execution set calculation is shown in Figure 5 for a block(m) distribution. For an assignment to a regular array section A(L:U:S) on a particular processor p, the execution set can be determined by intersecting (L:U:S) with the local set of A on this processor, local_A(p). Again, the calculation of the execution set is performed in the global index space, which eliminates global-to-local and local-to-global conversion overhead. This calculation can be applied to each dimension of the lhs independently of all other dimensions.
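The per-dimension calculation can be sketched as follows (a hedged example; the bounds are hypothetical, and the stride-alignment adjustment, which Figure 5 leaves implicit, is spelled out here):

      PROGRAM exec_set
        IMPLICIT NONE
        INTEGER, PARAMETER :: lba = 251, uba = 500     ! local block of A on processor p
        INTEGER, PARAMETER :: l = 100, u = 900, s = 3  ! assigned section A(L:U:S)
        INTEGER :: el, eu, i
        el = MAX(lba, l)              ! lower bound of exec_A(p)
        eu = MIN(uba, u)              ! upper bound of exec_A(p)
        ! move el up to the next index of the form l + k*s inside the local block
        IF (MOD(el - l, s) /= 0) el = el + s - MOD(el - l, s)
        DO i = el, eu, s
           ! compute A(i) here; i is a global index owned by this processor
        END DO
      END PROGRAM exec_set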

COMMUNICATION
In HPF, data distribution can be specified so that corresponding elements of arrays aligned to the same template are allocated on the same processor. In this way, the programmer can use templates to reduce communication and scheduling overheads without increasing the load imbalance significantly. Pghpf uses a set of communication primitives designed to take advantage of cases where arrays are aligned to the same template. There are several communication possibilities. Arrays aligned to the same template may be located on the same processor depending on their alignment and subscript relations [10]. The compiler pairs a dimension of a source array with a dimension of the destination array if they are aligned to the same dimension of the template; let the pair be (lhs, rhs). The compiler applies an affine alignment function to this pair. If the result is (i, i) or (s, s), where i is a parallel loop index and s is a scalar, then the compiler marks that dimension of the source array as no_communication. All replicated and collapsed dimensions are also marked as no_communication. If all dimensions are marked as no_communication, the compiler will not generate any communication for the source array. This is the optimal case: no communication and no temporary arrays are needed.
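For example, when two arrays are identically aligned to the same template, an elementwise assignment between them needs no communication at all, since every dimension pair yields (i, i); a minimal sketch:

!HPF$ TEMPLATE T(1000)
!HPF$ DISTRIBUTE T(BLOCK)
      REAL A(1000), B(1000)
!HPF$ ALIGN (I) WITH T(I) :: A, B
      A = B     ! corresponding elements live on the same processor: no_communication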

overlap_shift
The compiler takes advantage of the overlap concept [11] whenever possible. It creates an overlap area around the local segment of a source array to store nonlocal segments of the source array received during communication. The primitive that corresponds to the overlap concept is called overlap_shift. To detect overlap_shift, the compiler applies the affine alignment function to a source/destination pair as defined above.
If the result is (i, i+c), where c is a compile-time constant, then the overlap concept can potentially be applied. In theory, c could be any scalar variable, but allowing arbitrary scalars could lead to significant memory usage problems, so c is restricted to a small set of values. If all dimensions of the source array are marked no_communication or overlap_shift, the compiler generates an overlap_shift call. This primitive is very efficient since its overhead is low and there is no unnecessary local copying. In this case, the only temporary storage used is that required for the overlap areas.
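A typical pattern that is mapped to overlap_shift is a nearest-neighbor stencil whose offsets are compile-time constants, for example (a sketch):

      REAL A(1000), B(1000)
      INTEGER I
!HPF$ DISTRIBUTE (BLOCK) :: A, B
      FORALL (I = 2:999) A(I) = B(I-1) + B(I+1)

The references B(I-1) and B(I+1) produce alignment results of the form (i, i+c) with c = -1 and c = +1, so each processor only needs a one-element overlap area on each side of its local block of B, filled by a single overlap_shift call.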

collective_communication
This primitive handles the cases where the result of the affine alignment function is (i, i+s), (i, d), or some other form that cannot be handled by overlap_shift. The compiler generates a call to the collective_communication primitive if all dimensions are marked with some combination of no_communication, overlap_shift, shift, multicast, or transfer. This primitive always requires that temporary arrays be allocated to hold the communicated source subarrays; however, scalar dimensions are eliminated to conserve memory. The primitive copies data from the source arrays into each temporary before computation. If there is only one source array on the right-hand side of the parallel loop, the source is copied directly to the destination without storing into a temporary.

If the compiler cannot generate any of the above primitives, it attempts to detect whether the copy_section primitive can be used. This primitive copies a source array section to a destination array section, as shown in Figure 6. The source and destination array sections can be aligned to distinct templates in this case, which implies that they can have distinct distributions.

Additional communication primitives are available to handle transpose, diagonal, and indirect accesses [12, 13]. However, these primitives are not applicable for cases in which multiple loop indices are associated with a particular array subscript or when a subscript is a higher-order function of a parallel index variable. These cases are currently handled by scalarization of the parallel loop.

RUN-TIME SUPPORT SYSTEM
The HPF language allows the programmer to write an HPF program with imprecise compile-time information. This situation arises when PROCESSORS, ALIGN, or DISTRIBUTE directives, or ALLOCATE statements generated to support compiler-created temporaries, depend on run-time variables.
An example would be the directive !HPF$ DISTRIBUTE templ(CYCLIC(k)), where k is a run-time parameter.
Therefore, the HPF compiler postpones distribution and address calculation until run-time. During the compilation process, the compiler analyzes the global name space. This simplifies the logic in the compiler and shifts the complexity to the run-time support software.
A set of compact data structures was designed to store information on array sections, alignment, templates, and processors. These data structures are called distributed array descriptors (DADs) [14]. DADs pass compile-time information to the run-time system and pass information between run-time primitives. The run-time primitives query alignment and distribution information from a DAD and act on that information.
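The exact layout of a DAD is internal to the run-time [14]; a purely hypothetical sketch of the kind of per-dimension information such a descriptor carries might be:

      ! Hypothetical illustration only; not the actual pghpf run-time layout.
      TYPE dad_dim
         INTEGER :: glb, gub              ! global bounds of this dimension
         INTEGER :: dist_kind             ! e.g. block, cyclic(k), collapsed, replicated
         INTEGER :: block_size            ! the m in BLOCK(m) or CYCLIC(m)
         INTEGER :: tmpl_dim              ! template dimension this dimension is aligned to
         INTEGER :: align_str, align_off  ! affine alignment i -> align_str*i + align_off
      END TYPE dad_dim

      TYPE dad
         INTEGER :: rank                       ! number of dimensions
         TYPE (dad_dim), POINTER :: dims(:)    ! one entry per dimension
      END TYPE dad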
Many basic data-parallel operations in Fortran 90 are supported through intrinsic functions which rely on the run-time software. The intrinsics not only provide a concise means of expressing operations on arrays, but they also identify parallel computation patterns that may be difficult to detect automatically. Fortran 90 provides intrinsic functions for operations such as shift, reduction, transpose, reshape, and matrix multiplication. Pghpf parallelizes these intrinsics. Arrays may be redistributed across subroutine boundaries. A dummy argument which is distributed differently from its actual argument in the calling routine is automatically redistributed upon entry to the subroutine by the compiler, and is automatically redistributed back to its original distribution at subroutine exit. These operations are performed by the copy_in and copy_out primitives. These primitives also copy noncontiguous memory spaces specified in array-section actual arguments into a contiguous dummy argument. The compiler takes advantage of INTENT(IN) and INTENT(OUT) specifiers by not calling copy_out and copy_in, respectively, where appropriate.
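A sketch of the situation (names hypothetical, and an explicit interface for the subroutine is assumed): the actual argument is CYCLIC while the dummy is BLOCK, so copy_in is inserted at entry; because the dummy is INTENT(IN), no copy_out is generated at exit.

      SUBROUTINE smooth(x)
        REAL, INTENT(IN) :: x(1000)
!HPF$   DISTRIBUTE x(BLOCK)
        ! reads x only
      END SUBROUTINE smooth

      ! in the caller:
      REAL a(1000)
!HPF$ DISTRIBUTE a(CYCLIC)
      CALL smooth(a)   ! redistributed to BLOCK on entry; no copy back on exit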
The data-parallel programming model naturally results in the creation of many arrays with identical distributions and alignments. The HPF run-time support system is designed to take advantage of data-parallel problems. Run-time primitives share run-time data structures globally. The compiler provides information to the run-time through interprocedural analysis (IPA). By sharing the run-time data structures, the overhead of creating many identical data structures is eliminated. This analysis extends even to the communication primitives: the overhead of scheduling communications is reduced by reusing scheduling data structures for identical communications. Certain portions of the run-time, in particular those that perform the actual message passing and synchronize the processors, are optimized for the underlying hardware architecture.

OPTIMIZATION TECHNIQUES
Several types of communication and computation optimizations can be performed to generate more efficient code. In terms of computation optimization, the scalar node compiler performs a number of classic scalar optimizations within basic blocks. These optimizations include common subexpression elimination, copy propagation (of constants, variables, and expressions), constant folding, useless assignment elimination, and a number of algebraic identities and strength-reduction transformations. To exploit parallelism within a single node (e.g., using an attached vector unit), the compiler propagates information to the node compiler using node directives. Since there are no data dependences between different loop iterations in the original data-parallel constructs, such as the FORALL statement, vectorization can easily be performed by the node compiler. Pghpf performs several optimizations to reduce the total cost of communication [10, 15, 16]. This section lists these optimizations.

Message Aggregation
One of the important considerations for message passing on distributed memory machines is the setup time required for sending a message. Typically, this cost is large relative to the per-byte transfer cost, so it pays to send fewer, longer messages. Messages with an identical destination processor can therefore be collected into a single communication operation, as shown in Figure 7. The gain from message aggregation is similar to communication vectorization in that multiple communication operations can be eliminated at the cost of increasing message length.
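A sketch of the idea, using MPI message passing purely for illustration (pghpf's actual transport layer is internal and may differ): the elements bound for one destination are packed into a buffer and sent in a single message, so the per-message setup cost is paid once rather than once per element.

      SUBROUTINE send_aggregated(a, idx, nelems, dest, tag)
        IMPLICIT NONE
        INCLUDE 'mpif.h'
        INTEGER, INTENT(IN) :: nelems, dest, tag, idx(nelems)
        REAL, INTENT(IN)    :: a(*)
        REAL    :: buf(nelems)
        INTEGER :: k, ierr
        DO k = 1, nelems
           buf(k) = a(idx(k))          ! pack all elements destined for DEST
        END DO
        CALL MPI_SEND(buf, nelems, MPI_REAL, dest, tag, MPI_COMM_WORLD, ierr)
      END SUBROUTINE send_aggregated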

Evaluating Expression
A drawback to blindly applying the owner-computes rule is that the amount of data motion could be much greater than necessary. For example, consider a program fragment of the kind sketched below (see the example following this paragraph). In a compiler that blindly applies the owner-computes rule, A, B, C, and D will be redistributed into temporaries that match the distribution of E, and then the computation will be performed locally. This may cause four different communications with four different temporaries.
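The program fragment itself did not survive extraction; a minimal example consistent with the surrounding discussion (assuming A, B, C, and D share one distribution and E another) is:

      INTEGER, PARAMETER :: N = 1000
      REAL A(N), B(N), C(N), D(N), E(N)
!HPF$ DISTRIBUTE (CYCLIC) :: A, B, C, D
!HPF$ DISTRIBUTE (BLOCK)  :: E
      E = (A + B) * (C + D)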
A more intelligent approach might perform the sum of A and B locally into a cyclically distributed temporary, then perform the sum of C and D locally into another cyclically distributed temporary. Next, the two temporaries are multiplied locally into a cyclically distributed third temporary. Finally, the result is redistributed (communicated) into E. This approach may cause one communication with three temporaries (shown in Fig. 8b).
To apply this optimization, the compiler has to evaluate the expression according to the partial order induced by the expression tree (shown in Fig. 8a). However, Li and Chen [17] show that the problem of determining an optimal static alignment between the dimensions of distinct arrays is NP-complete. Chatterjee et al. [18] and Bouchitté et al. [19] propose heuristic algorithms to evaluate the expression tree with minimal cost. Thus, this is a worthwhile optimization to implement.

Communication Parallelization
HPF allows arrays to be replicated in one or more dimensions. When a block of source data is replicated, any or all of the processors owning the block can take part in the communication. Alternatively, one of the source processors is chosen to send to all processors owning the destination block. Ideally, the sends should be spread out over as many source processors as possible to utilize the available communication bandwidth (Fig. 9). The basic idea of this optimization is to divide the set of destination processors among the set of source processors. Each source does a multicast to its assigned subset of destinations. The source and destination sets are computable from information in the template and processor data structures.
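A trivial way to divide the destinations among the replicated sources (a sketch only; the actual pghpf assignment is computed from the template and processor descriptors) is a round-robin mapping:

      PROGRAM spread_sends
        IMPLICIT NONE
        INTEGER, PARAMETER :: nsrc = 4, ndst = 16  ! replicas of the source block / destinations
        INTEGER :: j
        DO j = 0, ndst - 1
           ! destination j is served by source copy MOD(j, nsrc), so the sends are
           ! spread over all replicas instead of funneled through a single source
           PRINT *, 'destination', j, ' <- source copy', MOD(j, nsrc)
        END DO
      END PROGRAM spread_sends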

Communications Union
In many cases, the communication required for two different operands can be replaced by their union [10].

Reuse of Scheduling Information
The communication routines perform send and receive set calculations according to a scheduling data structure. The schedule, once computed, can be used to carry out identical patterns of data exchange on several different, but identically distributed, arrays or array sections [12, 20, 21]. The same schedule can be reused repeatedly to carry out a particular pattern of data exchange. In this case, the cost of generating the schedule can be amortized by executing the schedule generation only once. This analysis can be performed at compile time. Hence, if the compiler recognizes that a schedule can be reused, it does not need to generate code for scheduling but rather passes a pointer to the already existing schedule. Furthermore, the scheduling computation can be moved up as much as possible by analyzing definition-use chains [22]. The reduction in communication overhead can be significant if the scheduling code can be moved out of one or more nested loops by this analysis. In addition, pghpf performs invariant communication hoisting, reuse of communicated data, and loop fusion.

EXPERIMENTAL RESULTS
Benchmark results from five programs are presented to illustrate the performance obtained using the HPF compiler. All of these benchmarks were run on the same machine. The hydflo benchmark is a hydrodynamics program with 2,000 lines. Figure 14 shows the performance of hydflo. The data are distributed in block fashion in one dimension (*, *, BLOCK). Good scalability is exhibited. The communication mostly consists of copy_section and collective_communication.
A significant advantage of coding in HPF is the ability to specify different distribution directives and measure the resulting performance differences without extensive recoding. Some experimentation along these lines was performed on the Gauss benchmark (gauss), a short program designed to measure the performance of a Gaussian elimination algorithm.
Figure 10 gives the main factorization loop of gauss, which converts matrix a to upper triangular form. This Gaussian elimination algorithm is suboptimal due to a mask in the inner loop which prevents vectorization.
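Figure 10 is not reproduced here; a sketch of the kind of masked factorization loop described (hypothetical sizes, and not necessarily the benchmark's exact code) is:

      INTEGER, PARAMETER :: n = 512
      INTEGER :: i, j, k
      REAL a(n, n)
!HPF$ DISTRIBUTE a(*, CYCLIC)
      DO k = 1, n - 1
         WHERE (a(k+1:n, k) /= 0.0) a(k+1:n, k) = a(k+1:n, k) / a(k, k)
         ! the mask below is the inner-loop mask that inhibits vectorization
         FORALL (i = k+1:n, j = k+1:n, a(i, k) /= 0.0)
            a(i, j) = a(i, j) - a(i, k) * a(k, j)
         END FORALL
      END DO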
Figure 11a shows the updated values of matrix a in the shaded region after the factorization loop. Since the compiler uses the owner-computes rule to assign computations, only owners of data in the shaded region will participate in the computation. The remaining processors are masked out of the computation. Figures 11b and 11c show the computation distribution on four processors in block and cyclic fashions, respectively. In this particular benchmark, cyclic distribution results in better load balancing than block distribution. Figure 15 presents the performance using cyclic as well as block distributions. As expected, the cyclic distribution exhibits better performance because of the improved load balancing. The communication requirements for these distributions are identical; both use a multicast communication primitive.
As shown by the data, benchmark programs written in HPF can achieve reasonable efficiency given a problem of reasonable size. The figures show good scalability as the number of processors is increased.

SUMMARY OF RELATED WORK
The programming language Fortran D [1] proposes a Fortran language extension in which the programmer specifies the distribution of data by aligning each array to a virtual array, known as a decomposition, and then specifying a distribution of the decomposition onto a virtual machine. Fortran 77D [15] and Fortran 90D/HPF are compilers based on this model. The CM Fortran language [31, 32] implemented a subset of Fortran 77 extended by Fortran 8x array syntax to support a data-parallel programming style for the CM computer systems. Sabot [33] describes techniques used by the CM compilers to map the fine-grained array parallelism of languages such as Fortran 90 and C* onto the CM architectures. The PREPARE HPF compiler [?] is implemented on an engine basis: engines can concurrently work on a common internal program representation called the PREPARE intermediate representation (PIR). The HPF program to be compiled is transformed into the PIR by the HPF front-end engine; later, the parallelization engine transforms the PIR into a form that can be executed efficiently on the target parallel machine. The PTRAN II compiler [34] is a prototype compiler for HPF developed as a testbed for experimenting with distributed memory compilation techniques, including automatic data partitioning and parallelization, cost modeling, and global communication optimization.

The PARTI primitives, a set of run-time library routines to handle irregular computations, have been developed by Saltz and coworkers [27]. These primitives have been integrated into the ARF compiler for irregular computations [28] as well as into other systems to handle a wide range of irregular problems in scientific computing.

FIGURE 1 The architecture of the HPF compiler.

FIGURE 5 The execution set calculation:

    exec_A(p) = local_A(p) ∩ (L : U : S)
    exec_A(p) = (L_A(p) : U_A(p)) ∩ (L : U : S)
    exec_A(p) = (max(L_A(p), L) : min(U_A(p), U) : S)

where L_A(p) and U_A(p) are the lower and upper bounds of the block of A owned by processor p.

FIGURE 11 Load balancing for gauss.

CONCLUSIONS
Compiling HPF for distributed memory machines presents a significant challenge to HPF compiler implementors. Here, we have explored some of the issues and outlined our compilation and execution techniques. As shown by our experimental results, HPF benchmark programs compiled by pghpf can achieve reasonable efficiency given a problem of reasonable size on distributed memory parallel systems. The figures show good scalability when increased numbers of processors are used. The loosely synchronous SPMD codes produced by pghpf are adaptable to a variety of parallel system architectures. PGI is committed to making a commercially viable HPF compiler product. The results we have been able to achieve thus far, and the potential for further performance we see as we continue our development project, give us confidence in this endeavor.