Programming In Vienna Fortran

Exploiting the full performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna Fortran is a language extension of Fortran which provides the user with a wide range of facilities for such mapping of data structures. In contrast to current programming practice, programs in Vienna Fortran are written using global data references. Thus, the user has the advantages of a shared memory programming paradigm while explicitly controlling the data distribution. In this paper, we present the language features of Vienna Fortran for FORTRAN 77, together with examples illustrating the use of these features.


Introduction
The continued demand for increased computing power during the last decade has led to the development of computing systems in which a large number of processors are connected, so that their execution capabilities may be combined to solve a single problem.
Several distributed-memory processing systems (such as Intel's hypercube series and the nCUBE) have come onto the market and are slowly gaining user acceptance.Other systems are under development or have been announced in recent months.These architectures are relatively inexpensive to build, and are potentially scalable to very large numbers of processors.Hence their share of the market is likely to increase in the near future.
The most important single di erence between these and other computer architectures is the fact that the memory is physically distributed among the processors; the time required to access a non-local datum may be an order of magnitude higher than the time taken to access locally stored data.This has important consequences for program performance.In particular, the management of data, with the twin goals of both spreading the computational workload and minimizing the delays caused when a processor has to wait for non-local data, becomes of paramount importance.
A major di culty with the current generation of distributed memory computing systems is that they generally lack programming tools for software development at a suitably high level.The user is forced to deal with all aspects of the distribution of data and work to the processors, and must control the program's execution at a very low level.This results in a programming style which can be likened to assembly programming on a sequential machine.It is tedious, time-consuming and error prone.This has led to particularly slow software development cycles and, in consequence, high costs for software.
Thus much research activity is now concentrated on providing suitable programming tools for these architectures.One focus is on the provision of appropriate high-level language constructs to enable users to design programs in much the same way as they are accustomed to on a sequential machine.Several proposals (including ours) have been put forth in recent months for a set of language extensions to achieve this 5, 16,22,8,30], in particular (but not only) for Fortran, and current compiler research is aimed at implementing them.
Research in compiler technology has so far resulted in the development of a number of prototype systems, such as Kali 12], SUPERB 9,35], and the MIMDizer 19].In contrast to the current programming paradigm, these systems enable the user to write code using global data references, as on a shared memory machine, but require him or her to specify the distribution of the program's data.This data distribution is then used to guide the process of restructuring the code into an SPMD (Single Program Multiple Data) program for execution on the target distributed memory multiprocessor.The compiler analyzes the source code, translating global data references into local and non-local data references based on the distributions speci ed by the user.The non-local references are satis ed by inserting appropriate message-passing statements in the generated code.Finally, the communication is optimized where possible, in particular by combining messages and by sending data at the earliest possible point in time.
In this paper, we present a machine-independent language extension to Fortran 77, Vienna Fortran, which allows the user to write programs for distributed-memory multiprocessor systems using global addresses.The Vienna Fortran language extension to Fortran 90 is described in a separate paper 3].Since the performance of an SPMD program is profoundly in uenced by the distribution of its data, most of the extensions proposed here are geared towards allowing the user to explicitly control such distribution of data.Vienna Fortran provides the exibility and expressiveness needed to permit the speci cation of parallel algorithms and to carry out the complex task of optimization.Despite this fact, there are relatively few language extensions.A simple algorithm can be parallelized by the addition of just a few constructs which distribute the program's data across the machines.
This paper is organized as follows.In the next section we describe current programming practice on distributed memory MIMD architectures by means of a simple example.Then the programming model assumed by Vienna Fortran is introduced and an overview of the language elements is provided.This paper does not attempt to give a systematic introduction to the whole language, but rather describes some of the most important features by way of simple example codes.These form the body of the subsequent section.Section 5 outlines two more complex problems relevant to real applications, discusses the features of Vienna Fortran which may be used to implement them, and brie y discusses some important features of Fortran programs and how we handle them.Finally, we conclude with a discussion of related work and the implementation status of the Vienna Fortran Compilation System. 2 Programming Distributed Memory Systems: The State of the Art The current generation of distributed-memory multiprocessors is particularly di cult to program: the time taken to adapt existing sequential codes and to develop new applications is prohibitive in comparison to conventional machines, including vector supercomputers.Further, the low level at which programs must be written is the source of both frequent errors and of particularly in exible codes.Consider the brief example described below.
The Jacobi iterative procedure may be used to approximate the solution of a partial di erential equation discretized on a grid.At each step, it updates the current approximation at a grid point by computing a weighted average of the values at the neighboring grid points.An excerpt from a Jacobi relaxation code for execution on a sequential machine is shown in Figure 1.
When this code is parallelized by hand, the programmer must distribute the program's work and data to the processors which will execute it.One of the common approaches to do so makes use of the regularity of most numerical computations.This is the so-called SPMD (Single Program Multiple Data) or data parallel model of computation.With this method, the data arrays in the original program are each partitioned and mapped to the processors.This is known as distributing the arrays.The speci cation of the mapping of the elements of the arrays to the set of processors is called the data distribution of that program.A processor is then thought of as owning the data assigned to it; these data elements are stored in its local memory.Now the work is distributed according to the data distribution: computations which de ne the data elements owned by a processor are performed by it -this is sometimes known as the owner computes paradigm.The processors then execute essentially the same code in parallel, each on the data stored locally.
It is, however, unlikely that the code on one processor will run entirely without requiring data which is stored on another processor.Accesses to non-local data must be explicitly handled by the programmer, who has to insert communication constructs to send and receive data at the appropriate positions in the code.This is called message passing.The details of message passing can become surprisingly complex: bu ers must be set up, and the programmer must take care to send data as early as possible, and in economical sizes.Several issues arise which do not have their counterpart in sequential programming.New types of errors, such as deadlock and livelock, must be avoided.The programmer must decide when it is advantageous to replicate computations across processors, rather than send data.Moreover, for code which is explicitly parallel, debugging is a serious problem.
A major characteristic of this style of programming is that the performance of the resulting code depends to a very large extent on the data distribution selected by the programmer.The data distribution determines not only where computation will take place.It is also the main factor in deciding what communication is necessary.The total cost incurred when non-local data is accessed involves not only the actual time taken to send and receive data, but also the time delay when a processor must wait for nonlocal data, or for other processors to reach a certain position in the code.Note that the performance of a program can no longer be estimated solely by the amount of computation it comprises: extra computation is not necessarily costly, and the communication delay inherent in a particular data distribution could be prohibitive.
The message-passing programming style requires that the communication statements be explicitly hardcoded into the program.But these statements are based upon the chosen data distribution, and as a result, the data distribution is also implicitly hardcoded.It will generally require a great deal of reprogramming if the user wants to try out di erent data distributions.
To illustrate this, we reproduce in Figure 2 the above section of code, rewritten to run on a set of P 2 processors using message passing code of the kind described.We  have simpli ed matters by assuming that the processors have been organized into a twodimensional array PROC(P,P) and that the processor array elements may be addressed for the purpose of exchanging data items: normally, a structure of this kind would have to be set up by the user rst, and references would have to be converted to those provided by the environment.Further, we assume that the array sizes are multiples of P. Optimization of communication has been performed insomuch as messages have been extracted from the loops and organized into vectors for sending and receiving.When communication and computation are overlapped, as could be done here by carefully arranging the order in which local data is updated, the resulting code is considerably longer.In this version of the Jacobi relaxation, each processor has been assigned a square subblock of the original arrays.The programmer has declared local space of the appropriate size for each array on every processor.Array U has been declared in such a way that space is reserved not only for the local array elements, but also for those which are used in local computations, but are actually owned by other processors.This extra space surrounding the local elements is known as the overlap area.Values of UNEW on the local boundaries require elements of U stored non-locally for their computation.These must be received, and values from local boundaries must be sent to the processors which need them.Care is taken that the processors whose segments of U are on the original grid boundaries do not attempt to read from or send to non-existent processors.
The result of this low level style of programming is that the user spends a great deal of time organizing the storage and communication of data.In consequence, the time taken to produce a program is considerably longer than for comparable codes on shared-memory machines.Moreover, once written, the code is hard to modify or improve to run in some other way, even on the same machine.For example, if instead of dividing into square subblocks, the user wanted to experiment with blocking in only one dimension, e.g., blocks of rows or columns, most of the code dealing with speci cation and communication would have to be modi ed.We will see below how easily this code can be parallelized in Vienna Fortran. 3 The Vienna Fortran Language

The Programming Model
Vienna Fortran assumes that a program will be executed by a machine with one or more processors according to the SPMD programming model as described above.This model requires that each participating processor execute the same program; parallelism is obtained by applying the computation to di erent parts of the data domain simultaneously.
The generated code will store the local parts of arrays and the overlap areas locally and use message passing, optimized where possible, to exchange data.It will also map logical processor structures declared by the user to the physical processors which execute the program.These transformations are, however, transparent to the Vienna Fortran programmer.

The Language Features
The Vienna Fortran language extensions provide the user with the following features: The processors which execute the program may be explicitly speci ed and referred to.It is possible to impose one or more structures on them.
The distributions of arrays can be speci ed using annotations.These annotations may use processor structures introduced by the user.
{ Intrinsic functions are provided to specify the most common distributions.{ Distributions may be de ned indirectly via a map array.{ Data may be replicated to all or a subset of processors.{ The user may de ne new distribution functions.
An array may be aligned with another array, providing an implicit distribution.
Alignment functions may also be de ned by the user.
The distribution of arrays may be changed dynamically.However, a clear distinction is made between arrays which are statically distributed and those whose distribution may be changed at runtime.

In procedures, dummy array arguments may
{ inherit the distribution of the actual argument, or { be explicitly distributed, possibly causing some data motion.
A forall loop permits explicitly parallel loops to be written.Intrinsic reduction operations are provided, and others may be de ned by the user.Loop iterations may be executed { on a speci ed processor, { where a particular data object is stored, or { as determined by the compiler.
Arrays in common blocks may be distributed.Allocatable arrays may be used in much the same way as in Fortran 90.Array sections are permitted as actual arguments to procedures.
Assertions about relationships between objects of the program may be inserted into the program.
Vienna Fortran does not introduce a large number of new constructs, but those it does have are supplemented by a number of options and intrinsic functions, each of which serves a speci c purpose.They enable the user to exert additional control over the manner in which data is mapped or moved, or the code is executed.An overview of the Vienna Fortran language extensions for Fortran 77 is given below.
We use terminology and concepts from the de nition of Fortran 77 (and, occasionally, Fortran 90) freely throughout.

The Language Extensions: An Overview
Vienna Fortran includes all of the following language extensions to Fortran 77.Many of them will be discussed in the examples below, where their use is further described in an informal manner.For a complete and precise description of the language, see 36].The reader is also referred to 4] for further examples of the use of these extensions and demonstration of their expressiveness.
The PROCESSORS statement The user may declare and name one or more processor arrays by means of the PROCESSORS statement.The rst such array is called the primary processor array; others are declared using the keyword RESHAPE.They refer to precisely the same set of processors, providing di erent views of it: a correspondence is established between any two processor arrays by the column-major ordering of array elements de ned in Fortran 77.Expressions for the bounds of processor arrays may contain symbolic names, whose values are obtained from the environment at load time.Assertions may be used to impose restrictions on the values that can be assumed by these variables.This allows the program to be parameterized by the number of processors.This statement is optional in each program unit.For example: PROCESSORS MYP3(NP1, NP2, NP3) RESHAPE MYP2(NP1, NP2*NP3) Processor References Processor arrays may be referred to in their entirety by specifying the name only.Array section notation, as introduced in Fortran 90, is used to describe subsets of processor arrays; individual processors may be referenced by the usual array subscript notation.Dimensions of a processor array may be permuted.
Processor Intrinsics The number of processors on which the program executes may be accessed by the intrinsic function $NP.A one dimensional processor array, $P(1:$NP), is always implicitly declared and may be referred to.This is the default primary array if there is no processor statement in a program.The index of an executing processor in $P is returned by the intrinsic function $MY PROC.
Distribution Annotations Distribution annotations may be appended to array declarations to specify direct and implicit distributions of the arrays to processors.Direct distributions consist of the keyword DIST together with a parenthesized distribution expression, and an optional TO clause.The TO clause speci es the set of processors to which the array(s) are distributed; if it is not present, the primary processor array is selected by default.A distribution expression consists of a list of distribution functions.There is either one function to describe the distribution of the entire array, which may have more than one dimension, or each function in the list distributes the corresponding array dimension to a dimension of the processor array.The elision symbol \:" is provided to indicate that an array dimension is not distributed.If there are fewer distributed dimensions in the data array than there are in the processor array, the array will be replicated to the remaining processor dimensions.Both intrinsic functions and user-de ned functions may be used to specify the distribution of an array dimension.
REAL A(L,N,M), B(M,M,M) DIST ( BLOCK, CYCLIC, BLOCK ) REAL C(1200) DIST ( MYOWNFUNC ) TO $P Another way to specify a distribution is to prescribe that the same distribution function be employed as that which was used to distribute a dimension of another array.For example, REAL D(100,100) DIST (=A.1, =A.3) TO MYP2 will distribute D by BLOCK in both dimensions to the processor array MYP2.\A.1" refers to dimension 1 of array A while \=A.1" extracts the distribution of the rst dimension of the array A. Note that both the extents of the array dimensions being distributed and the set of processors may di er from those of A.
Implicit distributions begin with the keyword ALIGN and require both the target array and a source array (so called because it is the source of the distribution).An element of the target array is distributed to the same processor as the speci ed element of the source array, which is determined by evaluating the expressions in the source array description for each valid subscript of the target array.Here, II and JJ are bound variables in the annotation, and range in value from 1 through 80.

INTEGER IM(80,80) ALIGN IM(II,JJ) WITH D(JJ,II+10)
As is the case with direct distributions, the user may de ne functions to describe more complex alignments.
By default, an array which is not explicitly distributed is replicated to all processors.
Distribution Intrinsics Direct distributions may be speci ed by using the elision symbol, as described above, and the BLOCK and CYCLIC intrinsic functions.The BLOCK function distributes an array dimension to a processor dimension in evenly sized segments.The CYCLIC (or scatter) distribution maps elements of a dimension of the data array in a round-robin fashion to a dimension of the processor array.If a width is speci ed, then contiguous segments of that width are distributed in a round-robin manner.
The linear expressions which specify an alignment may contain, in addition to the usual arithmetic operators \+", \-" and \*", the intrinsic functions MAX, MIN, MOD, LBOUND, UBOUND, and SIZE.The latter three are intrinsic functions similar to Fortran 90, and refer to the lower bound, upper bound and size of an array (dimension), respectively.
The INDIRECT distribution intrinsic function enables the speci cation of a mapping array which allows each array element to be distributed individually to a single processor.The mapping array must be of the same size and shape as the array being distributed.
The values of the given array are processor numbers (in $P): INTEGER IAPROCS(1000) REAL A(1000) DIST ( INDIRECT(IAPROCS) ) Thus, for example, the value of IAPROCS( 60) is the number of the processor to which A(60) is to be mapped.Note that IAPROCS must be de ned before it is used to specify the distribution of A, and that each element of A can be mapped to only one processor.
Dynamic Distributions and the DISTRIBUTE Statement By default, the distribution of an array is static.Thus it does not change within the scope of the declaration to which the distribution has been appended.The keyword DYNAMIC is provided to declare an array distribution to be dynamic.This permits the array to be the target of a DISTRIBUTE statement.A dynamically distributed array may optionally be provided with an initial distribution in the manner described above for static distributions.A range of permissible distributions may be speci ed when the array is declared by giving the keyword RANGE and a set of explicit distributions.If this does not appear, the array may take on any permitted distribution with the appropriate dimensionality during execution of the program.Finally, the distribution of such an array may be dynamically connected to the distribution of another dynamically distributed array in a speci ed xed manner.This is expressed by means of the CONNECT keyword.Thus, if the latter array is redistributed, then the connected array will automatically also be redistributed.The distribute statement begins with the keyword DISTRIBUTE and a list of the arrays which are to be distributed at runtime.Following the separator symbol \::", a direct, implicit or indirect distribution is speci ed using the same constructs as those for specifying static distributions.It has an optional NOTRANSFER clause; if it appears, then it speci es that the arrays to which it applies are to be distributed according to the speci cation, but that old data (if there is any) is not to be transferred.Thus only the access function is modi ed.For example: DISTRIBUTE A, B :: ( CYCLIC(10) ) NOTRANSFER (B) in the above statement, both arrays A and B are redistributed with the new distribution CYCLIC (10), however for the array B only the access function is changed, the old values are not transferred to the new locations.Whenever an array is redistributed via a distribute statement, then any arrays connected to it are also automatically redistributed to maintain the relationship between their distributions.

CASE DEFAULT END SELECT
The distributions of two di erent arrays may be compared in a similar manner within an IF statement.
Allocatable Arrays An array may be declared with the allocatable attribute by specifying the keyword ALLOCATABLE as in Fortran 90.The declaration de nes the rank of the array, but not the bounds of any dimension.The array may be statically or dynamically distributed.The ALLOCATE statement is provided to allocate an instance of the array with speci ed bounds in each dimension.This instance is deallocated by means of the DEALLOCATE statement.An allocatable array may not be accessed unless it is currently allocated and a distribution has been associated with it.The allocatable attribute should be used wherever the size of an array is not known at compile time; the user is thus able to distribute the array with its actual bounds, rather than distributing the largest array which is permitted.Further, it may remove the need for work arrays in some situations.
Common Blocks Common blocks in which no data is explicitly distributed have the same semantics as in Fortran 77.The common block storage sequence is de ned for them.Individual arrays which occur in a named common block may also be explicitly and individually distributed just as other arrays are.However, they may not be dynamically distributed.Once storage space has been determined for a named common block, then it may not change during program execution.Note that, in accordance with Fortran 90, allocatable arrays may not be in common blocks.
Procedures Dummy array arguments may be distributed in the same way as other arrays.If the given distribution di ers from that of the actual argument, then redistribution will take place.If the actual argument is dynamically distributed, then it may be permanently modi ed in a procedure; if it is statically distributed, then the original distribution must be restored on procedure exit.This can always be enforced by the keyword RESTORE.While argument transmission is generally call by reference, there are situations in which arguments must be copied.The user can suppress this by specifying a NOCOPY.
Dummy array arguments may also inherit the distribution of the actual argument: this is speci ed by using an \*" as the distribution expression: Array sections may be passed as arguments to subroutines using the syntax of Fortran 90.
Intrinsic Functions A number of intrinsic functions from Fortran 90 are very useful for writing programs on distributed memory machines.They include the functions SIZE, LBOUND, UBOUND, COUNT, ANY, and ALL, which may be used in Vienna Fortran programs.
The FORALL Loop The FORALL loop enables the user to assert that the iterations of a loop are independent and can be executed in parallel.A precondition for the correctness of this loop is that a value written in one iteration is neither read nor written in any other iteration.There is an implicit synchronization at the beginning and end of such a loop.Private variables are permitted within forall loops; they are known only in the forall loop in which they are declared and each loop iteration has its own copy.The iterations of the loop may be assigned explicitly to processors if the user desires, or they may be performed by the processor which owns a speci ed datum.Only tightly nested forall loops are permitted.

END FORALL
A reduction statement may be used within forall loops to perform such operations as global sums (cf.ADD below); the result is not available until the end of the loop.The user may also de ne reduction functions for operations which are commutative and associative in the mathematical sense.The intrinsic reduction operators provided by Vienna Fortran are ADD, MULT, MAX and MIN.The following statement results in the values of the array A being summed and the result being placed in the variable X.

REDUCE ( ADD, X, A(I) )
Input/Output Files read/written by parallel programs may be stored in a distributed manner or on a single storage device.We provide a separate set of I/O operations to enable individual processor access to data stored across several devices.

Writing Programs in Vienna Fortran
In this section we introduce many of the language extensions of Vienna Fortran by showing how they may be used to produce parallel code for some simple problems.We discuss several di erent issues related to programming in general.The ideas in this section could, in principle, be applied to other programming languages which use similar data structures.

Distributing Data to Processors
In Section 2 above, we saw how a Jacobi relaxation might be parallelized manually, under certain simplifying assumptions.We present two versions of this same code in Vienna Fortran.The rst version tends to run faster on machines with a high communication latency, whereas the second version will often be preferred for its overall communication behavior.
All that has been added to the sequential code to produce the rst parallel Jacobi relaxation, shown in Figure 3  second dimension of all three arrays by block to all processors: the compiler will generate code to place the data accordingly.It is also responsible for inserting the necessary communication.
Note that no reference has been made to the processors executing the program in this example.Thus the data is mapped implicitly to a one-dimensional processor array consisting of the processors available at run time.The elision symbol was used to ensure that only one dimension of the arrays is distributed.
An alternative implementation of the Jacobi relaxation requires that the arrays be mapped to a two-dimensional processor grid.It begins with the following declarations: C PARALLEL CODE version 2 ASSERT(NP .GE .4) PROCESSORS P(NP,NP) REAL UNEW(1:N,1:N), U(1:N,1:N), F(1:N,1:N) DIST ( BLOCK, BLOCK ) : : : The rest of the code is the same as shown in Figure 3.This Vienna Fortran program rst declares a square processor array, whose size will be determined at load time.
The programmer requires at least four processors in each dimension and expresses this by making an appropriate assertion.The array declaration includes an annotation to distribute the arrays by block in both dimensions: this maps them in square blocks to the processors.The code has been written so as to be independent of the number of processors it will execute on, and does not need to be recompiled each time it runs on a di erent con guration.(But, if it is to be run on a xed number of processors every time, then a processor array may naturally be declared with xed bounds { it is likely to result in faster code).This is the data distribution used in manually parallelized version of the code and the compiler must distribute the data and organize the communication to produce the code similar to that shown in Figure 2.
This version of the code will thus result in compiled code which is markedly di erent from the rst version and may exhibit di erent behavior at run time.When the rst version is executed, the data is to be distributed in blocks of columns to the processors.
To compute local values of UNEW, a processor will require a vector of values from the two neighboring processors.The second version distributes data in squares.As a result, a processor will require values from four neighboring processors to compute its local values.In general, the second version requires fewer data items to be sent and received, however the number of messages per iteration increases from two to four.Thus, the actual performance of the codes will be dependent not only on the message latency of the underlying hardware but also on the start-up time per message.It is an easy matter to implement both versions in Vienna Fortran and compare their performance.

Other Ways to Distribute Arrays
We have already seen the intrinsic functions provided by Vienna Fortran to specify the most common kinds of distributions: BLOCK and CYCLIC map a dimension of an array to a dimension of a processor array.The following are further examples of Vienna Fortran array declarations annotated by a distribution: PROCESSORS P2(NP,MP) REAL XX(1000,100) DIST( CYCLIC(50), BLOCK ) REAL YY(10000) DIST ( BLOCK ) TO $P INTEGER KK(500,50,5) DIST ( BLOCK, CYCLIC, : ) TO P2/2,1/ Arrays XX and KK are distributed to P2: however, the dimensions have been permuted in the second case, so that the the rst dimension of KK is distributed by block to the second dimension of P2, and the second dimension of KK is scatter distributed to the rst dimension of P2.YY is distributed to $P, which has NP*MP elements in this case.Remember that the standard ordering of array elements de ned in Fortran 77 may be applied to processor arrays, so that there is a well-de ned relationship between the elements of $P and those of P2.

CONTINUE
The elements of arrays X and Y are aligned with the elements of array ZX in the example above: for each I from 1 through N, X(I) is mapped to the processor that owns element ZX(I+10).The $ symbol is merely a placeholder, indicating that multiple arrays are being aligned.Note that the scalar variables are replicated.
In practice, alignments can be used whenever there is a xed relationship between two arrays that is of a very speci c nature.In other situations it will generally su ce, or be more appropriate, to specify that data items are to be distributed \in the same way".In the above, for example, the distribution of X and Y could have been expressed by giving them the same distribution function as ZX: This distributes X and Y by block, with the appropriate block sizes.In this case, X and Y would be distributed evenly by block across the processors.Since they have fewer elements than ZX, the length of their blocks may be slightly smaller than the length of the blocks of ZX.When they are aligned with ZX as above, then the lengths of the rst blocks of X and Y will be identical to those of ZX.However, the last processor will contain fewer elements of these arrays.For example, if N = 100 and the data is distributed to 4 processors, then the second distribution would distribute 25 elements of X and Y to each processor, whereas the alignment with ZX would result in the mapping of 28 elements of X and Y to the rst three processors, and only 16 elements to the last of them.Thus the elements of X and Y are not spread evenly over the processors.It will depend very much on the nature of our computation which of these distributions performs better.
Note that if we choose to distribute X and Y in the same way as ZX, we could actually distribute them all by one single declaration in this case.But that would not be true in a subroutine when, say, ZX is a dummy argument whose distribution is not known.Both alignment and the referral to the distribution of other arrays are important in subroutines where information on the distribution of dummy arguments is incomplete.
Rather more complex distributions and alignments are required in many real applications.Many of them, such as arbitrary rectilinear block distributions are useful to the programmer and can be e ciently implemented.We will see an example of a user-de ned distribution function in Section 5.

Using Subroutines in Vienna Fortran
We discuss the main issues which arise when subroutines are invoked with distributed arguments by, again, looking at a very simple example.This permits us to ignore the computational problem and concentrate on the situations a programmer will need to be able to deal with.
It is common practice to write subroutines for such operations as matrix multiplication, which are used frequently.In this section we consider how this is done in Vienna Fortran.
When a distribution annotation is appended to a declaration in Vienna Fortran, then that distribution has the same scope as the declaration itself.In a subroutine, both local arrays and dummy array arguments may be given an explicit distribution when they are declared.As we will see below, this makes the mechanism of appending We will not examine functions separately; they can be written similarly.
distribution annotations to array declarations a very powerful tool, enabling a controlled redistribution of data.
One version of a subroutine to multiply matrices in Vienna Fortran is as follows:

CONTINUE RETURN END
In this routine we employ the additional method for specifying distributions which can be used for dummy array arguments only.If a \*" is used to specify the distribution, then the dummy argument inherits the distribution of the actual array.This means that each time the above routine is called, the actual arguments may be distributed di erently to the processors.Interprocedural distribution analysis will often reveal the distribution functions which reach the subroutine, and the compiler is then able to generate code based on that information.This is a exible way to write subroutines.But an unfortunate consequence of using inherited distributions is that the compiler may not always have precise (or, if it is separately compiled, any) information on the actual distributions which may reach the dummy arguments.In cases where this analysis fails, there is a way of providing extra help.If the user knows that only a few distributions will occur, then this information may be provided in a RANGE clause which is appended to the distribution.
For example, the speci cation: declares that only the distributions, (BLOCK,BLOCK) and (BLOCK,CYCLIC(100)) are allowed for the dummy argument A.
Further, the e ciency of the computation within the subroutine may depend very heavily on the actual distributions of the arguments, thus yielding good performance in some cases and very poor performance in others.
An alternative implementation might distribute the dummy array arguments explicitly.We may write, for example: SUBROUTINE MATMUL(A,B,C,N,M,L) REAL A(N,M), C(N,L) DIST ( BLOCK,: ) TO $P REAL B(M,L) DO 30 I = 1, N : : : Now this subroutine also has three dummy argument arrays, two of which, A and C, are distributed by block in the rst dimension to all available processors whereas the third, B, is replicated.The dummy arguments are explicitly distributed in order to eliminate communication during the computation of the result.However, the actual arguments may not have the same distribution as the dummy arguments with which they are associated.When their distributions di er, they must be redistributed on entry to the subroutine to match the speci ed distribution.In general, their original distribution must also be restored on exit from the subroutine.Thus the e cient implementation of the computation within the subroutine has a price: the redistribution of actual arguments may sometimes be very costly.
We have seen the apparent di culty in resolving two legitimate demands of a general purpose subroutine: that it handle a variety of di erent arguments, which may be di erently distributed, on the one hand, and that it handle them e ciently on the other hand.Redistribution may be costly, yet we may want to implement the routine in a way that is handled optimally on the target machine.Vienna Fortran provides a construct which may be used in this situation: the DCASE construct, which is modeled along the lines of the CASE construct in Fortran 90.It enables the selection of a block of statements according to the actual distribution of one or more arrays.
The third subroutine for matrix multiplication begins as follows:

CASE DEFAULT END SELECT
In the above, the matrix operation is handled in a speci c way depending on how the actual argument arrays are distributed.In this way, we can insert appropriate code or call further subroutines as required.The compiler has precise information on the distribution functions of the selected arrays for the block of statements within the cases.Only one of the case alternatives is executed; if none of the other speci cations match, then the default (if present) is selected.Here, the cases are examined in the order in which they occur textually.The rst distribution expression is compared with the actual distribution of C, and the second with that of A. If C is distributed by block in the rst dimension and not at all in the second, and A likewise, then the rst case is selected and its code executed.Otherwise, the distribution of C is then compared with the next case: if it is distributed by block in both dimensions, then if A is distributed by block in the rst dimension, this case is selected.An \*" matches any distribution whatsoever.

Applications in Vienna Fortran
In this section we look at the structure of two frequent kinds of codes that are used to handle a variety of applications.The rst of them shows how a particular numerical method might be expressed in Vienna Fortran; the second code shows how one could approach problems which cannot be e ciently distributed at compile time.We then brie y discuss some issues which arise with certain Fortran constructs and programming styles.

ADI Iteration
One well known and e ective method for solving partial di erential equations in two or more dimensions is known as ADI (Alternating Direction Implicit) 17].It is widely used in computational uid dynamics, and other areas of computational physics.The name ADI derives from the fact that \implicit" equations, usually tridiagonal systems, are solved in both the x and y directions at each step.In terms of data structure access, one step of the algorithm can be described as follows: an operation (a tridiagonal solve here) is performed independently on each x-line of the array and the same operation is then performed, again independently, on each y-line of the array.We present two versions of a step of the ADI algorithm here.The rst version is shown in Figure 4. Here, the current solution, U, the right hand sides, F, and the temporary array, V , are all distributed by blocks of columns to the implicit one-dimensional array of processors, $P.
In this version, the sweep over the columns (representing x-lines) is performed by the rst loop while the sweep over the rows (representing y-lines) is performed via a call to the routine YSWEEP.In each case, the subroutine sequential TRIDIAG (not shown here) is given a right hand side and overwrites it with the solution of a constant coe cient tridiagonal system.The array V is redistributed when subroutine YSWEEP is invoked; thus it is distributed in blocks of columns when the rst loop is executed, and is distributed in blocks of rows when the second loop is performed.This makes it possible to use a sequential tridiagonal solver in each of these since neither x-lines in the rst loop nor the y-lines in the second loop cross processor boundaries.Note that the redistribution of V is a \transpose" of the array with respect to the set of processors and requires each processor to exchange data with each of the other processors.The communication here is contained implicitly in the subroutine call and the tridiagonal solvers themselves do not require interprocessor communication.Since the distribution of a statically distributed array has to be restored on return to the calling unit, the array V is redistributed at subroutine exit to be distributed by columns.Hence, the assignment of the values of V to U in the last loop does not cause any communication.

PARAMETER(NX
We had presented another version of this algorithm in our earlier paper 4]; we reproduce the code here in Figure 5.In this second version, we do not call a subroutine to enforce a redistribution of V. Instead, V is declared to have a dynamic distribution, and is initially distributed by block in the second dimension.The range attribute speci es that the only distributions allowed for V are blocks of rows or columns.Thus, the situation for the rst loop remains the same, i.e., the columns do not cross processor boundaries and hence the sequential tridiagonal solver can be employed.loop we explicitly redistribute the array V to be blocked by rows via a DISTRIBUTE statement.Now, the second loop ranges over the rows of V again using the sequential tridiagonal solver.In this code, the nal assignment of the array V to the array U will also induce communication similar to the \transpose" at the subroutine boundary above since U and V are distributed in di erent dimensions.Thus in the rst case we performed the communication implicitly, by passing the array to a subroutine where the dummy argument has an explicit distribution, and in the second case we executed a statement to do the same work.There are many ways in which the ADI algorithm may be formulated.For example, another formulation would declare array V with a static distribution and not redistribute it at all.A parallel tridiagonal solver would then be called in the second loop; the communication would take place within the solver.Similarly, one could declare a twodimensional processor structure and distribute the arrays by block in both dimensions: a parallel tridiagonal solver would then be used for both the x-and the y-lines.All versions of this algorithm are equally easy to express in Vienna Fortran: which of these performs the best may be dependent on various factors including message startup and transfer times of the underlying architectures.The point is that it is a trivial matter to change the distributions, or to substitute the calls to the sequential tridiagonal solver used here by calls to a parallel tridiagonal solver and thus experiment with the di erent versions.In marked contrast, such changes will typically induce weeks of reprogramming in a message-passing language.

Irregular Distributions
There are a number of scienti c codes where an e cient distribution of some of the major data structures is not possible at compile time.The distribution of an array may depend, for example, on the values of another array -or even on its own values, as in the example given below.Examples of such codes include, but are not limited to, particle-incell methods, sparse linear algebra, and PDE solvers using unstructured and/or adaptive meshes.
Here, we look at an abstraction of a two-dimensional unstructured mesh Euler solver.The mesh is represented by triangles and the ow variables are stored at the vertices of the mesh.We reproduce only one part of the computation, which consists of accumulating at each node the contribution from each of the edges incident upon it.The computation is implemented as a loop over the edges: the contribution of each edge is subtracted from the value at one node and added to the value at the other node.
Figure 6 shows one way in which this computation may be speci ed in Vienna Fortran.
The mesh is represented by the array EDGE, where EDGE(I; 1) and EDGE(I; 2) are the node numbers at the two ends of the Ith edge.The arrays X and Y represent the values at each of the NNODE nodes.
Consider the distribution of the data across the (implicit) one-dimensional array of processors.Since the mesh must be distributed at runtime, in order to balance the computational load across the processors, each of the arrays has to be dynamically distributed.
The array X, representing a data value at each node, is declared to be dynamically distributed with an initial block distribution.Further below, this array is explicitly distributed via the indirect distribution mechanism provided by Vienna Fortran.The indirection is based on the mapping array MAP, whose values are dependent on the structure of the mesh and are de ned in the user speci ed routine PARTITION (the code for PARTITION has not been shown here).The value of the Ith element of the array MAP, which must be declared with the same size as X, is the number of the processor in $P to which the Ith element of the array X is distributed.
Y is also declared with the keyword DYNAMIC and is assigned the same distribution as X; its distribution is, however, connected with that of X by the CONNECT attribute.This means that when X is redistributed, Y is automatically redistributed with exactly the same distribution function.The DISTRIBUTE statement for array X speci es the NOTRANSFER attribute for array Y .This means that when the two arrays are redistributed, only the values of X are to be transferred to the new locations; the old values of Y are not moved.
The array EDGE is also declared with a dynamic distribution and is initially distributed by block.Given the structure of the computation, it would be useful to distribute EDGE in such a way that the values at one or both of its nodes are on the same processor.We have chosen to distribute the elements of EDGE to the processor which owns the values for the rst of its nodes.Such a distribution cannot be described by the intrinsic functions, so it is speci ed by the user-de ned distribution function (DFUNCTION) FDIST in Figure 6.
DFUNCTIONs are similar to regular Fortran functions, but have a special implicit argument declared with the keyword TARGET.It represents the array that is being distributed.Here, the distribution function FDIST takes as arguments the arrays MAP and EDGE and the special argument A. The function body then speci es that the Ith row of the array A is to be distributed to the processor whose number is given by MAP(EDGE(I; 1)).Thus, when the distribution function FDIST is accessed in the distribute statement, the special argument A is associated with the array being distributed, i.e., EDGE, so that EDGE is distributed as required.
The computation is speci ed using a FORALL loop, with an ON clause to specify where each iteration is to be performed.Thus the iterations of the loop, over the edges in this case, can be executed in parallel.In Figure 6, the ON clause speci es that the Ith iteration should be performed on the processor that owns the (I; 1)th element of EDGE.
Non-local values which are read can be gathered before the execution commences.
The variables N1, N2 and DELTAX declared within the FORALL loop are private variables.Thus assignments to these variables do not cause ow dependencies between iterations of the loop.For each edge, the X values at the two incident nodes are read and used to compute the contribution DELTAX for the edge.This contribution is then accumulated into the values of Y for the two nodes.
Since multiple iterations will accumulate Y values at each node, di erent iterations write to the same array elements, which is not permitted within a FORALL.statements which allow accumulations across the iterations of a FORALL loop.The reduction operator ADD is used here to accumulate the contribution of the edge to the values at the nodes on which it is incident.The results cannot be accessed within the FORALL loop, and hence the accumulations can be easily performed by the system after all iterations are completed.This code makes use of the reduction operator ADD.
The most important feature of this code as far as its compilation is concerned is that the values of X and Y are accessed via the edges, hence a level of indirection is involved.We distributed the arrays in such a way that the values at the rst node of an edge are always local to a loop iteration, but the values at the second node may not be.
The data distribution of each of the arrays is determined at run time; thus the compiler cannot detect which references are local and which are not.In such situations, runtime techniques such as those developed in 12, 28] are needed to generate and exploit the communication pattern.

Some Fortran Issues
There are several important features of Fortran codes which have not been dealt with in the sections above.We discuss just a few of them.
Common Blocks Common blocks are used in Fortran 77 to enable di erent program units to de ne and reference the same data without using arguments, and to share storage units.In Vienna Fortran, the user may retain full Fortran 77 semantics for a common block by not explicitly distributing any of the objects within it at any place in the program.In this case, there is conceptually one copy of the common block, and conventional storage association holds for it.Note that, in accordance with the rules of Fortran 90, allocatable arrays may not be in common blocks.Vienna Fortran also permits explicit distribution of arrays within named common blocks.However, their distribution may not be dynamic.If distributions are given at more than one place in the program for objects in common blocks with the same name, then they must be identical except for the names of the objects.The common block storage sequence holds for those parts of a common block which are not explicitly distributed { we refer to these as replicated sections below.The above common block does not contain any data explicitly distributed by the user.As a consequence, these data may be used in common blocks with the same name in the usual Fortran77 manner.In contrast, several objects in the following common block are explicitly distributed: The array R is not declared separately in the subprogram; it will be associated with the six variables of the replicated section above.The arrays S and T are declared such that they inherit their distributions from the distributed common objects, named A and B above, respectively, with whom they are associated by storage.
However, the following declaration of COM2 in a subroutine is not permitted: REAL E( 6) DIST( BLOCK ) REAL Z(2,5,2) DIST (:, CYCLIC, : ) C THIS IS NOT PERMITTED COMMON /COM2/ E, X(8), Y(4), Z Here, the replicated section of COM2 has been associated with an explicitly distributed object.Secondly, an attempt has been made to associate both arrays X and Y with the rst distributed common object.Finally, the second distributed common object of COM2 is redistributed by the explicit distribution of array Z.All three manipulations are not permitted.
Equivalence Association Some restrictions should be placed on the use of the Fortran EQUIVALENCE statement when data objects are distributed.In Vienna Fortran, we do not permit an implicit distribution by equivalencing.Further, no distributed array may be associated by equivalence with any other distributed object.Thus equivalence association is permitted between replicated data only.
Work Arrays Fortran 77 does not permit dynamic storage allocation.It is thus common programming practice to declare arrays with a maximum size and use them with some other, smaller, size during the computation.Further, a large work array is often declared, parts of which are then used as individual arrays with the size and shape required by the computation.So that arrays may be declared as they are used, Vienna Fortran includes the concept of allocatable arrays as de ned in Fortran 90.An individual array with unknown size may be declared with the ALLOCATABLE attribute.Once its bounds are known, it can be allocated using the ALLOCATE statement.An allocatable array may also be annotated with distribution expressions to specify the distribution of the array.This distribution expression can be completely evaluated only after the allocation of the array.For example: REAL A(:) ALLOCATABLE DIST(BLOCK) ...

READ (*,*) LEN ALLOCATE ( A(LEN))
Here we have declared A to be a one-dimensional allocatable array.Thus the distribution expression (BLOCK in this case) will be evaluated for A with length LEN, and the LEN elements of A distributed evenly across the processors.Without allocatable arrays, A would have to be declared with some maximum size (greater than LEN) and distributed by BLOCK with respect to this maximum size.Since only the rst LEN elements of A are to be used, some of the processors might not have any of the elements of A which are actually used in the computation.By using allocatable arrays, we make sure that all processors are involved in the computation.
As noted above, many Fortran applications are characterized by the fact that runtime data determines the size of the underlying data objects.In many applications, the actual number of objects involved is also unknown at compile time or may vary during computation.Such situations require the work array to be distributed dynamically, since the actual distribution of the objects may be dependent on runtime data.Such an array is declared with the ALLOCATABLE and DYNAMIC attributes.One strategy for distributing such a work array is to distribute each of these objects independently to all processors, by BLOCK for example.Another strategy would be to distribute each of these objects to a subset of processors.This kind of distribution must be handled by a user-de ned distribution function in Vienna Fortran.

Related Work
We discuss some of the related research in both language development for parallel machines and compilation techniques brie y below.
A number of parallel programming languages have been proposed, both for use on speci c machines and as general languages supporting some measure of portability (e.g.OCCAM 23]).Languages for coordinating individual threads of a parallel program, such as LINDA 1] and STRAND 7], have been introduced to enable functional parallelism.Most manufacturers have extended sequential languages, such as Fortran and C, with library routines to manage processes and communication.In most explicitly parallel languages, the user performs many of the tasks which a compiler is expected to handle for a Vienna Fortran program.
The concept of de ning processor arrays and distributing data to them was rst introduced in the programming language BLAZE 13] in the context of shared memory systems with non-uniform access times.This research was continued in the Kali programming language 18] for distributed memory machines, which requires that the user specify data distributions in much the same way that Vienna Fortran does.It permits shape and weighting the dimensions.Several methods for distributing iterations of loops are provided.
The Cray programming model assumes that initial execution is sequential and the user speci es the start and end of parallel execution explicitly.Many of the features of shared memory parallel languages have been retained: these include critical sections, events and locks.New instructions for node I/O are provided.In addition, there are a number of intrinsic functions to access parts of arrays local to a processor, and reduction and parallel pre x operations are included.
The implementation of Vienna Fortran and similar languages requires a particularly sophisticated compilation system, which not only performs standard program analysis but also, in particular, analyzes the program's data dependences 37].In general, a number of code transformations must be performed if the target code is to be e cient.The compiler must, in particular, insert all messages -optimizing their size and their position wherever possible.
The compilation system SUPERB (University of Vienna) 35] takes, in addition to a sequential Fortran program, a speci cation of the desired data distribution and converts the code to an equivalent program to run on a distributed memory machine, inserting the communication required and optimizing it where possible.The user is able to specify arbitrary block distributions.It can handle much of the functionality of Vienna Fortran with respect to static arrays.
The Kali compiler 12] was the rst system to support both regular and irregular computations, using an inspector/executor strategy to handle indirectly distributed data.It produces code which is independent of the number of processors.
The MIMDizer 19] and ASPAR 11] (within the Express system) are two commercial systems which support the task of generating parallel code.The MIMDizer incorporates a good deal of program analysis, and permits the user to interactively select block and cyclic distributions for array dimensions.ASPAR performs relatively little analysis, and instead employs pattern-matching techniques to detect common stencils in the code, from which communications are generated.
Pandore 2] takes a C program annotated with a user-declared virtual machine and data distributions to produce code containing explicit communication.Compilers for several functional languages annotated with data distributions (Id Nouveau 25], Crystal 15]) have also been developed which are targeted to distributed memory machines.
Quinn and Hatcher 10], and Reeves et al. 6,24] compile languages based on SIMD semantics.These attempt to minimize the interprocessor synchronizations inherent in SIMD execution.The AL compiler 33], targeted to one-dimensional systolic arrays, distributes only one dimension of the arrays.Based on the one dimensional distribution, this compiler allocates the iterations to the cells of the systolic array in a way that minimizes inter-cell communications.
The PARTI primitives, a set of run time library routines to handle irregular computations, have been developed by Saltz and coworkers 28, 29].These primitives have been integrated into the Vienna Fortran Compilation System and are also being implemented in the context of the FORTRAN D Programming environment being developed at Rice University.Similar strategies to preprocess DO loops at runtime to extract the communication pattern have also been developed within the context of the Kali language by Koelbel and Mehrotra 12,14].Explicit run-time generation of messages is also considered by 6, 15, 25], however, these do not save the extracted communication pattern to avoid recalculation.

Implementation Status
The Vienna Fortran Compilation System is currently being developed at the University of Vienna.It is based upon previous work performed by several groups, but, in particular, upon the experience gained with the parallelization system SUPERB ( 35]).It currently generates code for the Intel iPSC/860, the GENESIS architecture, and SUPRENUM.
The implementation of a substantial subset of Vienna Fortran has already been completed.This includes The current compilation system is a full implementation of Fortran 77.Among other things, it permits the user to distribute work arrays, sections of which may be individually distributed; it also handles equivalencing.It performs extensive data dependence analysis and interprocedural analysis to determine the correctness of all transformations applied to the program code.
Implementation of further features of Vienna Fortran, in particular the dynamic distributions, is under way.There is still an amount of research to be done in this area, including methods for the e cient handling of user de ned distribution and alignment functions.

Conclusions
In view of the increasing importance of distributed memory parallel computing systems, it is vital that the task of writing new programs and converting existing (sequential) code to these machines be greatly simpli ed.An approach which may substantially reduce the cost of developing codes is to provide a set of language extensions for existing sequential languages (in particular, Fortran and C) that are not bound to any speci c existing system but can be used across a wide range of architectures.These extensions should be as simple as possible, but they should also be broad enough to permit the expression of a wide variety of algorithms at a high level.In particular, since the data distribution has a critical impact on the performance of the program at runtime, tight programmer control of the mapping of data to the system's processors must be possible.
We believe that Vienna Fortran is a signi cant step on the path towards a standard in this area.

Figure 6 :
Figure 6: Code for Unstructured Mesh PROGRAM MAIN REAL A(12) DIST( BLOCK ) REAL B(4,5) DIST( CYCLIC, : ) COMMON /COM2/ CC, DD, EE, FF, GG, HH, A, B Arrays A and B are distributed explicitly and thus determine the distribution of these two storage areas in the common block.The variables in the common block before them comprise a replicated section of the common block and they will be stored contiguously.In a subroutine of the same program, a common block with the same name may be declared with: REAL S(4,3) DIST(*) REAL T(2,5,2) DIST(*) C THIS IS PERMITTED COMMON /COM2/ R(6), S(4,3), T(2,5,2) Static array distributionsArbitrary rectilinear block distributionsInherited distributions for dummy array argumentsForall loopsSpecial consideration has been given to optimizing the generated code.In particular, the following analysis and optimization methods have been implemented:Interprocedural communication analysisCommunication optimization: matching access patterns to aggregate communication routines, elimination of redundant communication, fusion of communication statements Interprocedural dynamic distribution analysis Interprocedural distribution propagation Procedure Cloning Optimization of parallel loop scheduling Optimization of irregular access patterns, based on the PARTI routines ( 28]) Distribution Queries and The DCASE Construct The DCASE construct enables the selection of a block of statements for execution depending on the actual distribution of one or more arrays.It is modeled after the CASE construct of Fortran 90.The keywords \SELECT DCASE" are followed by one or more arrays whose distribution functions are queried.The individual cases begin with the keyword \CASE" together with a distribution expression for each of the selected arrays.The distribution expressions consist of one or more distribution functions (which may contain arguments such as a length), or a \*" which matches any distribution.The distribution of an array is matched only if it is matched in all dimensions.The rst case which satis es the actual distributions of the selected arrays is chosen and its statements executed.No more than one case may be chosen.