Kemari : A Portable High Performance Fortran System for Distributed Memory Parallel Processors

We have developed a compilation system which extends High Performance Fortran (HPF) in various aspects. We support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives both give additional control to the user and simplify the compilation process. For the support of unstructured problems, we provide directives for dynamic data distribution through user-defined mappings. The compiler also allows integration of message-passing interface (MPI) primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, we describe the language extensions and related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes. © 1997 John Wiley & Sons, Inc.


INTRODUCTION
The two most popular paradigms for programming distributed memory parallel processors (DMPPs) are data parallelism and message passing. For both paradigms, standards were defined, namely the data-parallel High Performance Fortran (HPF) [1] and the message-passing interface (MPI) [2]. These standards have been accepted by most vendors, which will enable programming DMPPs in a portable way. With MPI primitives, the programmer explicitly controls communication in a portable way at high efficiency. An HPF program may run less efficiently, but communication is implicit and programming is less cumbersome and error prone. Several HPF compilers are already commercially available (for instance from Digital Equipment, Applied Parallel Research, and the Portland Group), and similar research systems have been developed [3-10]. However, the user of such compilers still faces problems: HPF is a complex language and the sophisticated technology required to generate efficient parallel programs is far from being established.

Received May 1995; revised January 1996. The current address for A. Müller and R. Rühl is ISE Integrated Systems Engineering AG, Technoparkstr. 1, 8005 Zurich, Switzerland. © 1997 by John Wiley & Sons, Inc. Scientific Programming, Vol. 6, pp. 41-58 (1997). CCC 1058-9244/97/010041-18
Ideally, the compiler hides communication from the user. However, when parallelizing an existing Fortran application, the average user often does not achieve the performance expected. To reduce communication costs, the source code must be restructured manually, guessing which message-passing code will be generated. This is even more difficult since appropriate debugging and performance monitoring tools are often not available.
In addition, HPF only poorly supports some important application classes, namely scientific applications which require unstructured computations. Only well-structured computations can be efficiently parallelized by using the static HPF BLOCK and CYCLIC distributions. In the framework of the Joint CSCS/NEC Collaboration in Parallel Processing [11], we have developed a compilation system called "Kemari"* to overcome the above-mentioned problems. Kemari is integrated into a tool environment [12] which supports symbolic debugging and performance monitoring of both high-level data-parallel and low-level message-passing programs. The whole tool environment, and in particular the compiler and its language extensions, are continuously evaluated by a team of application developers.
We extend HPF so that the user can explicitly define the mapping of computation to processors. The mapping of loop nests can be specified via template, distribute, and align directives, in a similar way to the distribution of arrays. With these directives, the user gains more control over the compilation process and the compiler can generate optimized communications.
We have also extended HPF with directives to support unstructured computations via dynamic data distributions and run-time preprocessing of critical code segments.
A description of the debugger, the performance monitor, and the analyzer, as well as more details on Kemari, can be found in [13-17]. In this article, we first provide a global view of our system. We describe the compilation process of well-structured and unstructured applications in more detail. Then we outline how the compiler is integrated into the tool environment and demonstrate the compiler's efficiency using a suite of benchmark programs compiled for an NEC Cenju-3 DMPP.

GLOBAL SYSTEM VIEW
Kemari consists of two parts: an HPF compiler built by NEC, and the parallelization support tool (PST), a compiler built by CSCS-ETH. The NEC compiler accepts subset HPF plus language extensions for comprehensive computation mapping. PST extends HPF for the support of unstructured computations. The two systems currently exist as two separate binaries and are combined by a common shell script driver which also accepts C and Fortran message-passing sources, and which controls the back-end compilation and the linking processes. While all regular HPF code is translated by the HPF compiler, PST compiles only subroutines declared EXTRINSIC (PST_LOCAL). As shown in Figure 1, the message-passing platform for both compilers is MPI.

*Kemari is a traditional Japanese ball game.

Integration of HPF and PST Sources
Both the HPF compiler and PST maintain information about distributed arrays in array descriptions, which are passed to subroutines as additional arguments.

Integration of Extended HPF and MPI Sources
We believe that it is important to enable the integration of message-passing code into high-level extended HPF sources for two reasons. An EXTRINSIC (NONE) procedure can be called inside parallel loops. Such a procedure might be implemented in a foreign language and cause side effects. Side effects can be specified with an INTENT clause extended to include an access range. The compiler generates necessary communications before and after the procedure call.
We explicitly define that sequential code is replicated both by the HPF compiler and by PST. This allows the user to apply MPI collective communication primitives to replicated data. Both the HPF compiler and PST use dedicated MPI communicators in their run-time libraries such that library communications do not interfere with user-defined communication primitives.

COMPILATION OF WELL-STRUCTURED PROBLEMS
We provide a set of HPF extensions for explicit computation mapping and communication optimization. By default, we map computation via the owner-computes rule [20].
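The owner-computes rule can be illustrated with a minimal Python sketch (the names and the one-dimensional BLOCK distribution are illustrative, not the compiler's actual implementation): each assignment to A(i) is executed by the processor that owns A(i).

```python
# Owner-computes rule: each assignment A(i) = ... is executed by the
# processor that owns A(i) under the chosen data distribution.
N, P = 16, 4            # array size, number of processors (illustrative)
block = N // P          # BLOCK distribution: contiguous equal chunks

def owner(i):
    """Processor owning element i of a BLOCK-distributed array."""
    return i // block

# Iterations each processor executes for the loop  DO i = 0, N-1: A(i) = ...
work = {p: [i for i in range(N) if owner(i) == p] for p in range(P)}
```

Each processor thus executes exactly the iterations whose left-hand-side element it owns, and remote right-hand-side operands must be communicated.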
For cases where this strategy leads to inefficient code [21], the following directives can be used. The ITEMPLATE directive declares the shape of an array which represents the iteration space of a loop. The ITERATIONS directive associates an iteration template with the loop nest defined by the loop indices (index1, index2, ...). The IDISTRIBUTE and IALIGN directives define the distribution of an iteration template. Their syntax is identical to the HPF DISTRIBUTE and ALIGN directives. The target of a loop alignment can be an HPF template, an array, or an iteration template.
As an example, in Figure 3 we show how a two-dimensional triangular loop nest (a) is distributed with an iteration template on the one hand (b) and a Vienna Fortran ON clause on the other hand (c). Iteration templates achieve better load balance. Note that the loops could also be distributed equally using a CYCLIC distribution. However, in many cases (for instance stencil-based computations) this could induce additional communication.
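The load-balance argument can be checked with a small Python sketch that counts inner-loop iterations per processor for the triangular loop DO i = 1, n; DO j = 1, i, when the rows i are distributed BLOCK versus CYCLIC (sizes are illustrative):

```python
# Per-processor iteration counts for the triangular loop
#   DO i = 1, n ; DO j = 1, i
# when rows i are distributed BLOCK vs CYCLIC over P processors.
n, P = 16, 4

def counts(assign):
    c = [0] * P
    for i in range(1, n + 1):
        c[assign(i)] += i          # inner loop runs i times for row i
    return c

blk = n // P
block_counts  = counts(lambda i: min((i - 1) // blk, P - 1))
cyclic_counts = counts(lambda i: (i - 1) % P)
# BLOCK is badly imbalanced; CYCLIC is nearly even.
```

The BLOCK assignment loads the last processor almost six times as heavily as the first, while CYCLIC spreads the triangle nearly evenly; the trade-off, as noted above, is that CYCLIC can induce extra communication for stencil codes.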

OVERLAP Directive
The OVERLAP directive is placed after a DISTRIBUTE or IDISTRIBUTE directive, and it specifies the overlapping of iteration or data distributions. The overlapping depth for the i-th dimension is specified in Fortran 90 notation, as also shown in Figure 4.
The array region owned by each processor is extended with the OVERLAP directive.
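A minimal Python sketch of this effect, assuming a one-dimensional BLOCK distribution and a symmetric overlap depth (names and sizes are illustrative):

```python
# Effect of an OVERLAP directive on a BLOCK distribution: each
# processor's owned index range is widened by the overlap depth, so
# stencil reads of neighbouring elements become local.
N, P, depth = 16, 4, 1          # depth 1 on each side (illustrative)
blk = N // P

def owned(p):
    """Indices owned by processor p without overlap."""
    return range(p * blk, (p + 1) * blk)

def with_overlap(p):
    """Owned indices extended by the overlap depth, clipped at the ends."""
    lo = max(0, p * blk - depth)
    hi = min(N, (p + 1) * blk + depth)
    return range(lo, hi)
```

With depth 1, a processor holding elements 4..7 also allocates shadow copies of elements 3 and 8, which are refreshed by communication rather than fetched element by element.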

SINGLE/OWNER
The program fragment enclosed in the following two directives is executed by a single processor. The OWNER directive is similar to the SINGLE directive, except that the multiple processors which own a specified array element execute the computation.

PARALLEL/SERIAL
!DIR$ PARALLEL
!DIR$ SERIAL

These directives can be placed in front of a DO or FORALL loop, and specify whether it should be executed sequentially or in parallel. The SERIAL directive is useful to inhibit automatic parallelization of loops which would cause too much run-time overhead.

NO_PRE_MOVE/NO_POST_MOVE
The following directives

!DIR$ NO_PRE_MOVE data-object-list
!DIR$ NO_POST_MOVE data-object-list

control communication. For remote accesses to objects included in the data-object-list, the communication is suppressed. With these directives a programmer can assist the compiler in reducing communication overhead.

INTENT Attribute with Access Range
The INTENT clause is extended so that the access range of an array can be specified to define a more precise interface of a procedure.

INTENT (IN | OUT | INOUT) var(access-range-specifier)

Section 6.1 contains an example code segment which illustrates the use of our INTENT directive.

Communication Generation
Kemari generates required communications for regular loop computations based on the alignment mechanism. In order to generate a communication for a distributed array, first the data alignment which requires no remote accesses for the array is computed. We call this alignment the optimal data distribution. A library routine call is generated to realign the original with the optimal data distribution.
The default mappings are determined by the following rules:

Mapping Generation
The communication analysis for an access to array A is carried out as follows:

1. If data distribution information for A is not available at compile-time and the data access region can be represented by an RSD (regular section descriptor), the compiler computes the optimal data distribution for A using the RSD and the loop alignment. An LREALIGN library call (described in Section 3.4) for the region described by the RSD is generated.

Communication Optimization
The compiler recognizes four regular communication patterns: SHIFT, REPLICATE, GATHER, and SCATTER. If the distributions of the original and the target mapping are the same, and the strides of both align triplets are the same, a SHIFT communication is generated. The generation is made for each dimension in which the upper or lower boundaries of the align triplets differ. REPLICATE is used for generating communications for REALIGN directives which require data replication. GATHER and SCATTER are used when distributed data are accessed in a sequential execution part such as a SINGLE or OWNER region. GATHER collects the data to the executing processor, while SCATTER writes the data back to the owner processors.
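As an illustration of the SHIFT condition, the following Python sketch checks one dimension's align triplets (lo, hi, stride); it is a simplified stand-in for the compiler's actual pattern recognition, with all names illustrative:

```python
# A realignment qualifies as a SHIFT in one dimension when the source
# and target align triplets (lo, hi, stride) have equal strides, the
# distributions match, and the bounds differ by one uniform offset.
def is_shift(src, dst, same_distribution=True):
    (slo, shi, sstride), (dlo, dhi, dstride) = src, dst
    return (same_distribution
            and sstride == dstride
            and (dlo - slo) == (dhi - shi))   # uniform offset

# Example: A(1:8:1) realigned to A(2:9:1) is a SHIFT by one element.
```

When the test succeeds, each processor exchanges only a boundary strip with its neighbour instead of performing a general redistribution.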

Reduction of Memory Requirements
The current HPF compiler is implemented under the all-replicate strategy, where the whole global array space is allocated for each distributed array. This strategy has advantages: Address translation from global into local is not needed, and memory management is not required to remap arrays at subroutine boundaries or at REALIGN/REDISTRIBUTE directives. On the other hand, a program using large distributed arrays can possibly not be executed. With the next version of the compiler we plan to implement a more sophisticated allocation strategy.

Dynamic Data Distributions
Arrays can be distributed using all regular HPF distribution and alignment directives.
The second directive defines the distribution of an array via a mapping array (MapArray). This array has as many elements as the distributed array, each element defining the processor owning the respective element of var. MapArray must be allocated and initialized explicitly by the user in the program. The third directive uses mapping functions as an alternative to mapping arrays, which introduce significant memory overhead.
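A minimal Python sketch of a DYNAMIC(MapArray)-style owner lookup, combining a column-major ("reverse lexicographic") linearization with a user-initialized mapping array (the shape and the toy mapping are illustrative):

```python
# Owner lookup via a mapping array: a global multidimensional index is
# linearized in column-major order and used to index MapArray.
def linearize(idx, shape):
    """Column-major linearization of a zero-based global index."""
    k, stride = 0, 1
    for d, n in zip(idx, shape):
        k += d * stride
        stride *= n
    return k

shape = (4, 3)                            # a 4x3 distributed array
map_array = [k % 2 for k in range(12)]    # toy mapping: alternate owners

def owner(idx):
    return map_array[linearize(idx, shape)]
```

The memory overhead mentioned above is visible here: the mapping array has one entry per array element, which is what the mapping-function variant avoids.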

Communication Generation
Depending on the nesting of distributed-array accesses, the inspector consists of one or more slices of an EXTRINSIC (PST_LOCAL) procedure. Such a preprocessing strategy was first implemented and described with Oxygen [25], and has also been investigated recently by other research groups [26, 27].
An executor consists of computational chunks and communication checkpoints in between. Remote data are fetched or updated in the checkpoints, and the computational chunks operate on buffers to access remote data. To define the ordering of computation chunks, a virtual time stamp (the so-called serial time) is introduced. By default this time stamp is initialized to zero and increased by one after every checkpoint.
The executor consists of the parallel code as specified by the user (including explicit parallelism in the form of aligned DO loops), but with all references to nonlocal elements of distributed arrays replaced by references to dynamically allocated buffers. These buffers store data communicated in previous checkpoints (for remote data fetches), or updates of nonlocal data to be communicated in future checkpoints.
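The checkpoint machinery can be sketched minimally in Python; the Envelope fields mirror the four-part envelope description given with the dynamic-distribution directives (time stamp, peer processor, fetch/update flag, symbolic data identification), and all names are illustrative:

```python
# Sketch of an executor "envelope": the symbolic description of one
# remote data transfer, interpreted at a communication checkpoint.
from dataclasses import dataclass

@dataclass
class Envelope:
    serial_time: int     # checkpoint in which the envelope is used
    peer: int            # destination or source processor
    is_fetch: bool       # True: fetch remote data; False: update it
    array: str           # symbolic array identification, not an address
    index: int           # linearized global array index

# Serial time is a counter incremented after every checkpoint.
serial_time = 0
env = Envelope(serial_time, peer=3, is_fetch=True, array="A", index=42)
serial_time += 1         # next checkpoint
```

Because the identification is symbolic (array name plus linearized index) rather than an address, saved envelopes remain valid across invocations with different but same-shaped arguments.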
For the computation of nonlocal data dependencies, the data flow in the program must be analyzed. This can be done at run-time by adding tags (or guards) to each data element which store the number of inspector iterations required for the complete computation of a respective envelope. Alternatively, a less accurate data-flow analysis can be performed at compile-time. Inspectors that are generated using compile-time analysis are typically faster and more memory efficient, but there are also cases where the guard-based run-time mechanism is superior.
PST supports both inspector mechanisms and allows the user to choose (with a command line flag) between the more precise (guard-based) dynamic inspector and the faster and less memory-consuming static inspector, generated with compile-time data-flow analysis. For the generation of static inspectors, PST uses the same algorithms as Oxygen; for a detailed explanation, see [9]. The generation of dynamic inspectors has been described in [17].

Reduction of Memory Requirements
We call the buffers mentioned above communication caches because of their faint structural resemblance to hardware caches. Also the problematic nature is similar: A compromise has to be made between memory consumption and run-time overhead. Figure 6 depicts the four alternatives. In the simplest organization (0), we completely replicate the allocation of distributed data on all processors. With organization (1), the global index of a distributed array element is hashed by first computing the element's owner (i.e., the processor which maintains a consistent copy of the element), and then by allocating a complete copy of that processor's part of the distributed array. Note that the main concept behind our communication caches is not new, and a similar mechanism with simpler buffer organizations was already described in 1991 by J. Saltz and his colleagues [28].
In [29] we summarized the impact the choice between the four cache organizations can have on performance and memory consumption of a full application.
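A minimal Python sketch of organization (1), assuming a one-dimensional BLOCK distribution: the first access to a remote element caches a complete copy of the owner's block, and later accesses to that block are served locally (all names are illustrative):

```python
# Communication-cache organization (1): on first remote access, compute
# the element's owner and cache a complete copy of that owner's block.
N, P = 16, 4
blk = N // P
global_array = list(range(100, 100 + N))   # owners' authoritative data

cache = {}                                  # owner -> copy of its block

def read(i):
    o = i // blk                            # the element's owner
    if o not in cache:                      # miss: copy whole block once
        cache[o] = global_array[o * blk:(o + 1) * blk]
    return cache[o][i % blk]                # local index within the block
```

The trade-off stated above is visible directly: one bulk transfer per touched owner (low run-time overhead) at the price of caching whole blocks (higher memory consumption than organizations (2) and (3)).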

Run-Time Library
The PST run-time library consists of two parts: a set of m4 [18] preprocessor macros and the actual library routines written in either C or C++. We use m4 macros whenever calling a procedure would be too expensive, for instance when creating symbol-table entries, or for the compilation of complicated control-flow constructs.
The library routines manage a great deal of the run-time preprocessing mechanism, like the management of multiple communication patterns for the same PUBLIC EXTRINSIC (PST_LOCAL) subroutine.

INTEGRATION INTO THE TOOL ENVIRONMENT ANNAI
The software described in this report has been integrated into the multicomputer programming environment Annai [12]. Since Kemari accepts as input not only extended HPF, but also Fortran and C with message-passing primitives, it serves as the main language processor for Annai. Annai also includes a parallel debugging tool (PDT), a performance monitor and analyzer (PMA), and a common graphical user interface (UI). The main design objectives of Annai can be summarized as follows:

1. Design and implementation of tools for the development of parallel programs in a high-level data-parallel multiple-instruction multiple-data (MIMD) language and also with low-level message passing.

Application of Static Compilation Techniques
Table 2 lists HPF benchmark programs and their characteristics. The array RED is initialized so that the value of RED(i,j,k) is …

Extrinsic Interface
Non-HPF procedures with side effects can be invoked in parallel loops using the EXTRINSIC (NONE) interface. With this feature, programmers can specify procedure-level parallelism. The parallelization of Baro, Scalgam, and EP can take advantage of that capability. For example, in the case of Baro, there is a subroutine CMSLOW which processes a column of two-dimensional arrays. The main loop (DO 100) in subroutine COMP can be parallelized without inlining by specifying an interface block with INTENT directives as follows: SUBROUTINE COMP(...). Scalgam is a typical Monte-Carlo particle code. In each iteration of the main loop, a particle generated by random numbers is traced, and contributions of the particle are pushed to the global data. The trace computation is very complicated and it contains several subroutine calls. A new subroutine EXEC is introduced, which performs the particle trace. For each variable which is used for the reduction operations (e.g., MAX and SUM), global variables are defined to push local results. After the parallel loop, the value of the local variables is replaced by the value of corresponding global variables, such that the following computations do not have to be changed. The loop can be parallelized using the EXTRINSIC (NONE) interface with little modification to the loop body. Moreover, the subroutines called from EXEC require no change. These subroutines and EXEC can be compiled just with a back-end Fortran compiler.

Iteration Mapping
PDE1 is a three-dimensional Poisson solver using red-black relaxation. The best performance can be achieved using the (BLOCK,BLOCK,BLOCK) distribution. For the following nested loop, the iteration mapping can be specified using an iteration template.

The NAS MG Kernel
The !\"AS MG kernel [:33] calculates an approximate solution to the discrete Poisson problem using four iterations of the V -cyclf' multigrid algorithm on an n Xn Xn grid with periodic boundary conditions.MG is known to be a difficult problem for data-parallel We parallclized the original Fortran source using IIPF/PST directives and compared performance with an optimized C/MPl version of the benchmark.For both versions.the cubic grid is partitioned into ''matchsticks'•: All grid points in one dinwnsion are on one processor and distributed block-wise in the other two dimensions.in other words an HPF (*,BLOCK, BLOCK) distribution.
In the PST version, user-defined mapping functions are used to distribute one long "work array" which keeps all levels of the grid, and the amount of data for each subsequent grid level decreases exponentially. Each grid level in the work array is distributed regularly in the matchstick fashion. Loop distribution directives are applied to all of the n×n×n loops according to the data distribution.
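The layout of such a work array can be sketched in Python: the levels of an n³ grid are packed back to back, each coarser level holding (n/2)³ points. This is a simplified illustration; the actual PST mapping functions additionally encode the matchstick distribution of each level.

```python
# Offsets of multigrid levels inside one long "work array": level sizes
# decrease exponentially (n^3, (n/2)^3, ...) and levels are packed
# back to back.
def level_offsets(n, levels):
    offs, pos = [], 0
    for _ in range(levels):
        offs.append(pos)      # start offset of this level
        pos += n ** 3         # level holds n^3 grid points
        n //= 2               # next level is coarser by a factor 2
    return offs, pos          # per-level offsets, total array length

offs, total = level_offsets(8, 3)    # 8^3 + 4^3 + 2^3 elements
```

A user-supplied mapping function can then translate any global work-array index into a (level, grid point) pair before applying the per-level matchstick distribution.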
As an example of the integration of PST with PDT, and of the PDT distributed data visualization capabilities, consider Figure 9.

CONCLUSIONS AND FUTURE WORK
Although the HPF standard (version 1.0) was finalized more than 2 years ago, only a few HPF and subset HPF compilers are commercially available. These compilers have not been accepted too well by scientific programmers, mostly because the compilers lack robustness and perform unpredictably on real-world applications. One of the reasons for this situation is the sophisticated compilation technology required for a language as complex as HPF. The compiler is responsible for both the computation mapping and the communication generation and optimization. Another reason is that some of the major research institutions which contributed to the definition of HPF, and which have the required technology and expertise, continue to develop systems know-how and keep on working on systems adapted to similar but different languages such as Fortran D and Vienna Fortran.
We believe that part of this acceptance problem can be overcome by relieving the compiler from the computational mapping task, at least for performance-critical code segments to be specified by the user. For increased acceptance we integrated our compiler in a tool environment featuring debugging and performance monitoring support, and we added directives for the support of unstructured problems. We have built a compiler prototype which is currently being used by pilot users. Their feedback and the performance results shown in Section 6 confirm that we are on the right way to develop an acceptable system which generates efficient code and is indeed easier to use than plain message passing.
Besides steadily increasing the robustness of our system, future work will include the consideration of other target platforms, in particular machines with shared memory and a larger number of vector processors. When compiling well-structured problems, the memory allocation strategy of our compiler is rather simple: Distributed arrays are simply replicated. We hope to change this allocation strategy to a more efficient buffer management in the future.
Both the HPF compiler and the PST share the same parser and first intermediate language (IL). The first-level IL consists of a parse tree of the input program. It is a single-threaded image of the input source, with a one-to-one representation of parallel language constructs. A first analysis phase generates the information necessary for the translation of the single-threaded representation into a multithreaded program image. Such information includes a data-dependence graph and results of the program flow analysis. Several optimization phases are then applied to the multithreaded representation. For instance, temporary variables are allocated and message aggregation is performed. A final phase generates Fortran 77 code with calls to the HPF compiler's run-time library. In PST, first an abstract syntax tree is generated from the first-level IL, then the control-flow graphs of the EXTRINSIC (PST_LOCAL) subroutines are generated. These graphs also store information about the nesting of control-flow constructs and about the implicit and explicit control-flow dependencies of accesses to nonlocal array elements. Several optimization phases precede the code-generation phase. These optimizations mainly exploit loop-invariant index clauses to avoid expensive run-time global-to-local index translations and owner computations. PST outputs C code with m4 [18] postprocessor macros and calls to the PST run-time library.

FIGURE 3 Parallelization of an example loop nest (a) with iteration templates (b) and the ON clause (c).

2. If the access to A appears in the left-hand side and the corresponding RSD is not available, a POST_WRITE is generated after the update of A. In addition, a SWEEP is generated after the loop to receive the data sent by POST_WRITE.

3. If the access to A appears in the right-hand side and the corresponding RSD is not available, the whole region of A is copied to all the abstract processors using a REPLICATE_ALL before the loop.

Figure 5 shows an example of the computation of an optimal data distribution involving an array B and an iteration template.

We have designed a high-level run-time communication library with communications optimized inside the library. The compiler can be ported to other target systems just by porting (and optimizing) this library.
In addition, the user has the choice to distribute an array var dynamically with the following three directives:

!PST$ DISTRIBUTE var(BLOCK_GENERAL(BGMapArray))
!PST$ DISTRIBUTE var(DYNAMIC(MapArray))
!PST$ DISTRIBUTE var(DYNAMIC(G2L, L2G, G2PE, Sz))

The first directive defines a block-general distribution. That is, the array var is partitioned into contiguous blocks of possibly different sizes. In contrast to a similar distribution described by Chapman et al. [24], PST BLOCK_GENERAL distributions permit gaps, i.e., extra space is added when such an array is allocated to provide dynamic support for an increasing number of array elements during program execution. Array BGMapArray contains 2 × p integers (where p is the number of processors) which define the start and size of each processor's block. Global Fortran indices of multidimensional arrays are mapped (or "linearized") into a single integer by using reverse lexicographic ordering (for instance column-major order for two-dimensional arrays).

If required, users can explicitly set the serial time with a directive. This feature can be used to parallelize loops with data dependencies. The inspector preprocesses code segments to determine where nonlocal data are accessed and prepares executor data transfers on both data-requesting and owning processors. So-called envelopes are set up, which are later interpreted in executor checkpoints to perform actual data exchanges. An envelope consists of (1) a logical time stamp which identifies the checkpoint in which the envelope is used, (2) a destination- or source-processor identification, (3) a flag specifying whether remote data are fetched or updated, and (4) an identification of the data item to be communicated. This identification is not an address but symbolic information, i.e., an array symbol and a (linearized) array index.

Organization (2) differs from organization (1) in the fact that a processor's local array segment is allocated in blocks of fixed size (b) rather than as a whole. The index of an array element in such a block is equal to the element's local index modulo b. When remote data are accessed, organization (3) inserts remote values of a distributed array in a sorted buffer. The buffer includes both data values and global array indices.
Figure 9 shows different PDT representations of a slice (with a given x-coordinate) of the MG residual at four different iterations. Both data values and data distribution of the two-dimensional array representing the slice are depicted. The three-dimensional graphs in the picture show data values of the slice after 1, 2, 4, and 8 V-cycles of the MG iteration, and depict how the algorithm converges. The performance of the C/MPI and HPF/PST versions is compared in Table 5. The C/MPI version is faster because all communication is concentrated into one MPI message exchange, whereas in the HPF/PST version communication is dispersed, reflecting the structure of the sequential code. For the HPF/PST version we chose communication cache organization (2) (as described in Section 4.3).
Kemari consists of an HPF compiler and PST, a tool which extends HPF for the support of unstructured computations. Both compilers share the same front end and the same intermediate language (IL).
Because the run-time systems of the two compilers have different requirements, different descriptor data structures are used. HPF subroutines can call PST subroutines and vice versa. If necessary, descriptors are automatically converted and data movement is performed.

When an HPF procedure is called, all processors have to execute the call statement and must enter the called procedure, because it may contain global communications. Therefore, under the current HPF specification, parallel loops or statement blocks which should be executed by a specific processor must not include procedure calls, except for calls to PURE functions.
First, with current technology, the generation of efficient parallel programs from HPF sources is based on a large number of heuristics, and one should not expect any HPF compiler to generate optimal code for any input source. For tuning selected program regions, the user should have access to the underlying message-passing platform. Second, it should be possible to integrate calls to standard libraries within HPF sources. Several parallel numerical libraries are currently being developed to ease scientific computing on DMPPs. One prominent example is ScaLAPACK [19], which is based on MPI.

Table 1. List of Communication Library Routines: The Compiler System Can Generate Any Communication Pattern Using These Routines

Libraries are identified as EXTRINSIC routines. Inside EXTRINSIC (PST_LOCAL) routines, the PST programming paradigm can be used. The following directives are part of the declarations in subroutine heads: An EXTRINSIC (PST_LOCAL) subroutine can be declared to be start-time schedulable. As a consequence, run-time generated communication patterns are saved and re-used in later invocations of the routine. The communication patterns are saved symbolically, i.e., even if different routine arguments are used, the execution will be correct, as long as the arguments have the same shape (i.e., same size and distribution) as the arguments used for the routine's first invocation. The integer expression key is used for the generation of multiple communication patterns for the same subroutine.
The routines listed in Table 1 include the following: gather a data object to the specified logical processors from the owners; scatter a data object to the owners from specified logical processors; multicast a data object to other logical processors; transfer a data element to the specified logical processor (used for unstructured communications); and sweep data elements sent by POST_WRITE.

G2L, L2G, G2PE, and Sz are integer-valued functions which, respectively, map global to local indices, map local to global indices, map global indices to processors, and define the size of the local array, which may be different on each processor. Nested Fortran DO loops can be parallelized by aligning them with a distributed variable of the same dimensions as the loop nest. Loops aligned with dynamically distributed arrays are transformed into loops over the local index range, and the first statement in their body computes the respective global indices using the local-to-global index mapping.
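For the simple case of a plain BLOCK distribution, the four user-supplied mapping functions can be written out as follows (a minimal Python sketch; real G2L/L2G/G2PE/Sz functions may implement arbitrary distributions):

```python
# The four mapping functions of DYNAMIC(G2L, L2G, G2PE, Sz), written
# out for a plain BLOCK distribution over P processors.
N, P = 16, 4
blk = N // P

def G2PE(g):      return g // blk     # global index -> owning processor
def G2L(g):       return g % blk      # global index -> local index
def L2G(l, p):    return p * blk + l  # local index on processor p -> global
def Sz(p):        return blk          # local array size on processor p

# Consistency check: every global index survives the G2L/L2G round trip.
ok = all(L2G(G2L(g), G2PE(g)) == g for g in range(N))
```

A compiler-transformed loop then iterates l = 0 .. Sz(p)-1 locally and recovers the global index with L2G(l, p) in its first statement, as described above.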
The compiler also changes some of the original symbols, a mechanism which is kept transparent to the user of PDT and PMA. Annai's UI includes a program structure browser (PSB), a display which allows depiction of performance statistics close to related program structures such as subroutines and DO loops. For the PSB, Kemari generates tables with information about the program structure. These tables are stored in the generated object, and can be read by both PDT and PMA at run-time.

Table 2. These Benchmark Programs Are All Well Structured: The Shape of the Main Data, Applied Data Distribution, and Applied Extended Directives for Them Are Described

The "Distribution" column describes the way the main data for each program are distributed onto abstract processors. The corresponding entries are checked when the respective directives were used. Table 3 shows the measurement results of HPF benchmark programs on several Cenju-3 configurations. The figures inside parentheses show the speedup ratio, where the sequential execution performance of the corresponding original program is 1.00. The shapes of the abstract processors onto which PDE1's main …

Table 5. Performance of MG when Parallelized with PST on 64 and 128 Processor Cenju-3 Configurations

Execution times of the first and second V-cycle are shown, and MFlops for the second cycle (when the iteration enters steady state and PST-generated communication patterns are reused) are compared to what is achieved with a manually parallelized C code.