Parallel Array Classes and Lightweight Sharing Mechanisms

We discuss a set of parallel array classes, MetaMP, for distributed-memory architectures. The classes are implemented in C++ and interface to the PVM or Intel NX message-passing systems. An array class implements a partitioned array as a set of objects distributed across the nodes, a "collective" object. Object methods hide the low-level message-passing and implement meaningful array operations. These include transparent guard strips (or sharing regions) that support finite-difference stencils, reductions and multibroadcasts for support of pivoting and row operations, and interpolation/contraction operations for support of multigrid algorithms. The concept of guard strips is generalized to an object implementation of lightweight sharing mechanisms for finite element method (FEM) and particle-in-cell (PIC) algorithms. The sharing is accomplished through the mechanism of weak memory coherence and can be efficiently implemented. The price of the efficient implementation is memory usage and the need to explicitly specify the coherence operations. An intriguing feature of this programming model is that it maps well to both distributed-memory and shared-memory architectures. © 1994 by John Wiley & Sons, Inc.


INTRODUCTION
This article describes MetaMP, a programming environment for distributed-memory computers. MetaMP stands for "meta-message-passing," and this should give the reader the clue that it has evolved from the standpoint of message-passing programming. The environment consists of a preprocessor, a set of C++ classes, and a run-time system that maps onto a few well-known message-passing systems: PVM, NX, and Express. The attempt is to abstract SPMD, message-passing programming to a more usable level without sacrificing much performance. The focus is on scalable, data-parallel applications, and only a subset of such applications is currently well supported.
The fundamental idea of the system is to support important parallel data structures such as (partitioned) multidimensional arrays and (partitioned) unstructured meshes. Beyond just the data structure, important mechanisms associated with these data structures are implemented as object methods. The mechanisms provided are those that seem to be natural for each data structure; examples of such primitives, for arrays, are given in the overview below. Several C++ language features have proved useful in building these classes:

1. Function overloading. An object method such as elem() (single-element access of an array) is consistently called this even though there are many different kinds of element access in the system.

2. Self-describing data structures. All the information that describes the shapes, sizes, strides, etc., of a particular instance of a data structure is stored within that data structure itself. This makes it very easy to pass around things like a "2-D (block, block-cyclic)" array and let the methods themselves inquire as to what particular type of partitioned array we have.

3. Parameterized types ("templates" in C++). This language feature allows creation of generic container classes independent of the particular data types that are to be put inside the structure. For example, once we have implemented 3-D partitioned arrays as a generic container class, we can simply re-use that same class to store a 3-D array of ints or a 3-D array of triples of doubles. (A simplified sketch of how these features combine follows this list.)
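As a rough, serial stand-in (the class layout and member names below are assumptions made for the example, not the actual MetaMP class), the three features combine along these lines:

#include <vector>

template <class T>
class Array2D {
public:
  Array2D(int M, int N)
    : gdim{M, N}, stride{N, 1}, data(static_cast<size_t>(M) * N) {}

  // Single-element access is always called elem(), whatever the element type;
  // here it simply indexes the (unpartitioned) local storage.
  T& elem(int i, int j) { return data[i * stride[0] + j * stride[1]]; }

  // The instance describes its own shape, so methods (and I/O) can inquire.
  int dim(int d) const { return gdim[d]; }

private:
  int gdim[2];          // global shape
  int stride[2];        // storage strides
  std::vector<T> data;  // element storage
};

// The same container is re-used for different element types:
Array2D<float> phi(512, 512);
Array2D<int>   flags(512, 512);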
Data abstraction, in turn, promotes parallel programming that is independent of specific implementation choices such as:

1. Partitioning choice (e.g., block vs. block-cyclic). Abstracted iterators and object methods allow one to conveniently write programs that continue to function correctly when the partitioning choice is changed.

2. Data types. The "all-meet-all" parallel algorithm looks the same whether the elements are triples of floating-point numbers (N-body gravity) or row, column combinations of a matrix (Jacobi eigensolver).

3. Details of I/O. The array "read" method knows how to read from a striped, parallel file. Because the file was originally constructed by "write," information was written into the file which makes the file self-describing. For instance, if the file was written with a block-cyclic partitioned array on 8 processors, and is being read into a block partitioned array on 32 processors, the read method can know this and load in the data correctly.
Conversely, we have not found all C++ features to be useful. For example, we do not use operator overloading at all. The allowable operators and their syntax are very limited; we are also not sure a parallel APL-like language would be very easy to read.
In some parts of the system, we have implemented constructs in the preprocessor even though one could do it via methods. A good example of this is iterators. To make sure that array element access within loops is efficient, we take control with the preprocessor: we implement the loop and then rewrite the body of the loop for efficient element access.
Finally, deep inheritance trees can cause problems. It is certainly useful to add a feature to a MetaMP class by using it as a base class and then inserting an extra method or two. Heavy use of inheritance, however, can easily cause one to violate some assumption made somewhere in the hierarchy. For example, many MetaMP methods must be called "loosely synchronously," i.e., all processors must eventually get around to calling the method. If this is not satisfied, the likely result is deadlock.
We will discuss the details of the MetaMP array class library. The C++ classes of the library implement the partitioned array as a "collective" [2, 3] object within an SPMD programming model. This means that there is a separate instance of an array object on each processor and that these objects work together to give a consistent view of a single, global data structure such as an array. Parameters such as those that describe the global shape of the array are replicated across the object instances; those parameters that describe the local features of the sub-data structure (e.g., the subarray on this processor) are unique to the local object instance. The data of the data structure are not replicated unless explicitly aliased by a guardStrip or alias modifier at construction time.
The object methods provide important operations that access or modify the distributed data structure. Many methods contain communication or synchronization operations, implying that they must be called in a loosely synchronous fashion. A compiler could (often) check whether or not one is satisfying this requirement, and this is an enhancement to the system being considered.
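As a small illustration of the loose-synchrony requirement (the method name guardUpdate() is assumed here for the example; any MetaMP method that contains communication behaves the same way):

// WRONG -- only processor 0 enters the collective call; the other processors
// never participate in the underlying message exchange, and the likely result
// is deadlock:
//
//   if (myRank == 0)
//       phi.guardUpdate();

// Correct -- every processor eventually makes the same call, though not
// necessarily at the same instant:
phi.guardUpdate();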
Below is an overview of the MetaMP array class features and object methods.

Global Indexing

Array elements are accessed via global indexing.
There is no local index of an array element on a processor provided to the user. The choice of global indexing provides a consistent picture of what each array element is named, and it is implemented without additional run-time cost. If array elements are accessed from within loops governed by the abstract iterators, then one is guaranteed to access only on-processor elements of the array.
In the applications we have developed, we have achieved access to off-processor array elements by the mechanisms of guard strips, aliasing regions, and bulk rolls or shifts of the array. That is, we employ the usual message-passing techniques, but they are expressed at the level of the array and not in terms of explicit communication buffers.
The question arises as to whether one can easily provide for simple access of off-processor array elements. In the DPC system of Hatcher and Quinn [4], off-processor references are handled by sending the requests and receiving the replies at the synchronization points implied by the DPC language. We are trying to do this within the more asynchronous framework of SPMD programs. In this case, however, it seems that one needs access to an active-message layer in the message-passing system [5]. Baber's hypertasking system [6] provided an off-processor capability. Due to the high degree of nonportability among message-passing systems in their support of active messages, we do not have off-processor access in the current MetaMP array classes. Another comment to be made is that one can often achieve the desired effect in a more efficient manner through the use of aliasing. We will return to this point later in the article.

Partitioning Syntax
For describing the partitioning of multidimensional arrays, we have borrowed from the syntax of High Performance Fortran (HPF) [7]. The preprocessor for MetaMP understands a syntax for the creation of "templates" (an abstract array of virtual processors to which real arrays align), a distribute statement for templates (map the template onto the processors in block or cyclic fashion), and an align statement for the real array to the template (arrays are aligned with templates; the templates are mapped onto the physical processors).
This was done so as to provide a rich syntax, allowing the creation of many different mappings of arrays onto processors, but also a syntax that is becoming standardized.

Iterators
Iterators provide, for each processor, looping over all the array elements it owns. As such, the iterator takes an array class instance as an argument. There can be multiple statements within an iteration. All elements on a processor need not be looped over; the global index domain can be restricted and the strides need not be 1. We again specify this by borrowing from Fortran 90 notation: [lower:upper:step], i.e., the lower bound, the upper bound, and the step between successive members of the set (in Fortran 90, for example, [2:97:5] denotes the indices 2, 7, 12, ..., 97). These numbers are in terms of the global dimensions of the array. The iterator will convert these to the starting and ending values for each processor.
As was mentioned previously, iterators are implemented by our preprocessor. We do this in the preprocessor (rather than as a method) so as to ensure that we get the same efficient array access within loops as that done by Fortran compilers. The preprocessor rewrites the body of the loop (in C++), performing induction-variable creation and strength-reduction optimizations.

Locality-Based Access
Potential off-processor access of a local nature (limited distance in terms of the array indices, but not necessarily limited distance in memory) is provided by integrated guard strips. One can think of guard elements as being read-only copies of array elements on neighboring processors. There are object methods to update the guard values (all directions are done at once), and "corner values" are obtained without additional run-time cost. That is, separate messages do not need to be sent to a processor in a diagonal direction.

Bulk Data Movement
For writing efficient pipelining algorithms such as matrix multiplication and direct N-body gravity, there are object methods that transfer the entire subarray from one processor to the next processor along some dimension of the array. All processors do this in parallel and the top processor wraps around, leading to a rotated version of the partitioned array. The communications are performed efficiently, and the object instances are updated by the roll() method so that they "know" that they now contain a different section of the array. This is essential so that other methods, such as iterators, continue to function correctly. Another example of a bulk data-movement operation is matrix transpose, although the current array classes do not have this.
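A sketch of the pipelining pattern this supports; roll() is the method described above, while its dimension argument, the array names, the local compute routine, and the process count are assumptions for illustration:

// Each processor computes with the subarray of B it currently holds, then all
// processors roll B one step along dimension 0 (the top processor wrapping
// around).  After numProcs steps every processor has seen every section of B.
for (int step = 0; step < numProcs; ++step) {
  multiplyLocalBlocks(A, B, C);   // use the section of B resident here
  B.roll(0);                      // bulk shift; B updates its own bookkeeping
}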

Broadcasting Primitives
These are primitives that capture the idea of broadcasting along some direction of an array. They are clearly useful for linear algebra, where row and column broadcasts are common. Actually specifying a row broadcast from a dynamic origin (as occurs in partial pivoting) is quite involved and uses the owner() method so that processors can independently compute the processor that owns a certain row. The actual broadcast is done with the pointToSubArray() method (so that a vector can alias to a row of a matrix) and the copy() method.
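A sketch of how these methods might combine for a pivot-row broadcast; owner(), pointToSubArray(), and copy() are the methods named above, but their exact signatures, and the names A, pivotRow, and k, are assumptions for illustration:

// Each processor independently computes which processor owns global row k,
// aliases a 1-D vector onto that row (no data motion), and then copies the
// row from its owner so that every processor holds the pivot row.
int src = A.owner(k);
pivotRow.pointToSubArray(A, k);
pivotRow.copy(src);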

I/O-Bulk Read/Write Operations
Dumping an entire array to a (possibly parallel) file is provided as an object method. Consider a read(). The method knows the shape, partitioning, number of processors, and so on of the array desired in memory; it also reads similar information written in the file (the file is self-describing).
Using this, it is easy to convert layouts, and to efficiently read the file.
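A brief usage sketch; write() and read() are the methods described above, while the file name and array names are assumptions for illustration:

// Written from a block-cyclic array on 8 processors ...
Aold.write("field.dat");

// ... and later read back into a block-distributed array on 32 processors;
// the self-describing header tells read() how to convert between the layouts.
Anew.read("field.dat");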

Visualization-Drawing 2-D Arrays
A set of methods that gives a convenient interface to draw a 2-D array in an X-window is being developed.

Aliasing Mechanisms for Unstructured Meshes
This can be thought of as a generalization of the guard strip idea to that of a "writable" guard strip. The writable guard can also be described as a weakly coherent, shared memory. The "weak coherence" means that the programmer must specify when the shared memory is to be made consistent. This will be discussed in more detail later in the article. There are object methods for set-up of the alias groups and for the coherence operations.

Other Features
To make a programming environment usable, there are some other features that are helpful to have. These are not tied to any particular data structure and are not object methods. They include:

1. Portable timing primitives: a simple interface to a clock
2. Global broadcast, global reduction, and barrier functions
3. Coordination primitives for writing asynchronous, master-slave parallel programs
4. Load balance display as an X-window
5. X-window library for simple drawing

EXAMPLE APPLICATIONS
We show usage and explain more points about the array classes through some example parallel programs. The examples used in this section are a Laplace solver that is then extended to a multigrid solver. The next section will discuss aliasing for finite element method (FEM) and particle-in-cell (PIC) applications.

Laplace Solver
This is a program that solves a 2-D Laplace or Poisson problem via simple Jacobi relaxation. Although this is far from an optimal elliptic solver, we will turn it into a powerful one in the next section. Dirichlet boundary conditions can be set at any point on the regular 2-D mesh that describes the space. We won't show all the code, but merely the interesting features for parallelism.
We begin by declaring a 2-D mesh of processors and a 2-D template that is block-distributed across the processor mesh, written in our HPF-like notation of $-delimited statements. The $'s alert the preprocessor that these are MetaMP statements. They do not actually produce executable code, but instead cause the preprocessor to treat the set of processors as a p1 x p2 mesh and to map the template array Two (regard this as a virtual set of M x N memory cells) in a block fashion onto the processor mesh.
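A rough sketch of what such declarations look like (the exact MetaMP keywords here are assumptions, modeled on the $Align statement shown in the next paragraph):

$Processors P(p1, p2) $
$Template Two(M, N) $
$Distribute Two(block, block) onto P $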
The actual 2-D arrays are made by C++ constructors, modified with an HPF-like align statement:

array2D(float) phi(M,N); $Align phi(i,j) with Two(i,j) $

The align statement lines the actual array phi up with the template array Two. It has the effect of adding several arguments to the C++ constructor, which is why it must appear on the same line. All of this is nothing but a convenient syntax that follows HPF.
To make things a bit more concrete, we can say that the iterator is expanded out to:

for (i = phi.start(1); i < phi.end(1); ++i) {
  for (j = phi.start(0); j < phi.end(0); ++j) {
    ... body
  }
}

for block decompositions. Actually, it is a bit more complicated than this due to the fact that the iterator allows general [l:u:s] iteration sets. The starting and ending points on a given processor may not be simply given by the array starting location. For block distributions, the elem() method accesses a contiguous set of memory cells in the usual C fashion and is inlined for efficiency:

float& elem(int i, int j) { return *(basePtr + i*stride[1] + j); }
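Putting the pieces together, one Jacobi relaxation sweep over phi can then be sketched roughly as follows. The names guardUpdate() and phiNew are assumptions for illustration, and the loop is written in the expanded form above rather than in the iterator notation:

// Refresh the read-only guard copies of boundary elements held by neighboring
// processors, then relax every locally owned element.  (The global Dirichlet
// boundary would be excluded via a restricted [l:u:s] iteration set.)
phi.guardUpdate();
for (i = phi.start(1); i < phi.end(1); ++i) {
  for (j = phi.start(0); j < phi.end(0); ++j) {
    phiNew.elem(i,j) = 0.25f * ( phi.elem(i-1,j) + phi.elem(i+1,j)
                               + phi.elem(i,j-1) + phi.elem(i,j+1) );
  }
}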

Multigrid Extension, Graphics
We now extend the Laplace program to a multigrid solver [8]. Fundamental to multigrid techniques are array refinement and coarsening operations. These are illustrated in Figure 1.
The multigrid program can be expressed with just the classes and methods presented in the previous section. We could statically define arrays of size (M,N), (M/2,N/2), (M/4,N/4), ... and then copy values from one array to another. We have taken the slightly further step and defined resize() object methods, which are used to reinterpret the array mapping and effectively change the size of the array. The same could be done by destructing and constructing the array, but resize() is lightweight in the sense that memory is not reallocated. resize() remaps the array; this means that as the array shrinks it remains properly distributed across the processors; it does not, for example, all shrink into processor 0. Finally, as in a conventional program for multigrid, the programmer still explicitly copies values, element by element, from an array of size (M,N) to one of size (M/2,N/2).
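A sketch of the coarsening step under this scheme; resize() is the method described above, while the array names coarse and fine, the resize() argument list, and the 2x2 averaging stencil are assumptions for illustration:

// After resize(), the coarse array is half the size in each dimension but is
// still distributed across all the processors.  Values are then copied down
// element by element, averaging each 2x2 block of the fine grid into one
// coarse cell (fine-grid values resident on a neighboring processor would be
// reached through guard copies).
coarse.resize(M/2, N/2);
for (i = coarse.start(1); i < coarse.end(1); ++i) {
  for (j = coarse.start(0); j < coarse.end(0); ++j) {
    coarse.elem(i,j) = 0.25f * ( fine.elem(2*i,   2*j  ) + fine.elem(2*i+1, 2*j  )
                               + fine.elem(2*i,   2*j+1) + fine.elem(2*i+1, 2*j+1) );
  }
}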
Figure 2 shows the multigrid solver running on a PVM installation. The solution is shown as a contour plot. This plot is obtained by simply opening a window with the openPlot() method:

phio.openPlot("phi field");

and drawing the field into the window via:

phio.draw();

LIGHTWEIGHT SHARING
In this section we will discuss our second effort at a useful set of classes for parallel programming. We generalize the idea of guard strips to the notion of a writable guard strip. When this idea is pursued, the guard region can be thought of as forming sets of memory cells that are shared between processors. In keeping with our philosophy of explicitly stating when communications occur, this shared memory is a weakly coherent shared memory. This means that the program explicitly states when the various copies of the shared memory are to be made coherent. Finally, in contrast to usual shared-memory models, no single copy of the memory "wins" and overwrites the other copies. Instead, the multiple copies are merged with some operator (usually a "+") in a symmetric way, and all copies are replaced with the merged values.
The ideas are explained using two important scientific applications: unstructured mesh FEM and PIC algorithms.

Unstructured Meshes (FEM)
Central to iterative solvers for finite element problems are matrix-vector products. If the operator or matrix is stored in a sparse form, we can think of such computations as being represented by the uniprocessor code:

do i = 1, numberOfNodes
  do j = 1, n[i]
    y[i] += A[i][j] * x[nbrs[i][j]]

Here, the vectors y[] and x[] represent fields that live on the nodes of the unstructured mesh; the index i is the node label. n[i] is the number of neighbors of node i, and nbrs[i][j] is the label of the jth neighbor of node i. The matrix is stored such that A[i][j] is the matrix element between node i and the jth neighbor of i. The computation is illustrated in Figure 3.

FIGURE 2 The multigrid application running on PVM. The two large windows show the solution (on the left) and a picture of the residual field. The third window shows the load balance display. This function is called loosely synchronously and is given an estimate of the achieved Mflops on that processor. In the situation shown we are running on two processors and achieving 4.3 Mflops out of an estimated max of 5.24, for an efficiency of .82.

FIGURE 3 The sparse matrix-vector product. The matrix is nonzero only on nodes connected by edges; the arrows denote the neighbors of node i that are "pulled in" and summed to give a contribution to the vector element at i.

One suggested approach for parallelizing this type of computation is discussed by Koelbel et al. [9]. In this method, we partition the mesh nodes among the processors in a unique manner, i.e., each node is stored in only one processor. The computation of the matrix-vector product then proceeds in an "owner-computes" fashion: the sum that will be stored at node i is done by the processor that owns node i (Fig. 4).

FIGURE 4 Owner computes for the parallel sparse matrix-vector product. The processor boundary is shown by a dotted line. Arrows crossing the boundary represent interprocessor communications.

The code for the matrix-vector product gets broken into two parts: one set of loops sums up contributions from on-processor vector elements, the second set of loops sums contributions to all the nodes connected to an off-processor node. The pseudo-code becomes something like:

loopOverAllMyPoints i
  do j = 1, n_local[i]
    y[i] += A[i][j] * x[nbrs[i][j]]

Communication -- bring non-local x values into local buffer

loopOverBndyPoints i
  do j = n_local[i]+1, n_non_local[i]
    y[i] += A[i][j] * buff[buf_index[i][j]]

The owner-computes method has led us into an unnatural split into a local and a nonlocal computation. A similar thing occurs in regular (structured mesh) computations. We have seen earlier how the integration of the guard strip into the fundamental data structure itself can lead to a cleaner parallel program that is still efficient. We pursue a similar idea here, with the extension that one can define the guard strip to be writable, as opposed to the read-only guard strips used in the Laplace and multigrid solvers.

FIGURE 5 Alias groups for unstructured mesh computation. The double-ended arrows show each alias group. Mesh points along a processor boundary are replicated. The copies of a single point form an alias group and should be thought of as a weakly coherent shared memory. The aliases are all writable, and the results of write operations are merged at some point specified by the program.
When integrating the guard strip into the unstructured mesh, we follow some of the ideas of Williams [10] and create multiple aliases of mesh nodes that are on processor boundaries. A set of alias groups for the previous unstructured mesh partition is drawn in Figure 5.
Again we partition the unstructured mesh, but with overlap: the nodes along the processor boundary are replicated onto two (or more) processors. The copies of a single point form an alias group and should be thought of as a weakly coherent, shared memory. The aliases are all writable, and the results of write operations are merged at some point specified by the program. This is where aliases differ from conventional shared memory. Instead of one copy being the current "true" copy that will overwrite the other copies, we take the symmetric choice and say that the values coming from the separate writes must be merged with a generalized "+" operator (i.e., a commutative, associative binary function).
The pseudo-code now takes on the pleasing structure:

loopOverAllMyPoints i
  do j = 1, n[i]
    y[i] += A[i][j] * x[nbrs[i][j]]

Communication -- merge aliased y values with + function
Aliasing has formally promoted the communication buffer of the owner-computes approach to portions of the vectors x[] and y[] that are shared. As is the case for the owner-computes technique [9], the alias approach has a useful inspector/executor optimization strategy. The alias groups are created only once at the beginning of the program, and tables are constructed so as to efficiently perform the message passing, i.e., we can coalesce all the alias communications between any pair of processors into one large message.
The status of our unstructured mesh work is that the author and T. Kubaska of Intel SSD have working C codes for both the case of assembled and unassembled matrices [11]. We are currently measuring speeds on an iPSC/860 and are reworking the software to provide a C++ interface. This interface will have the following features:

1. Vectors and sparse matrices are automatically partitioned collective objects
2. Methods are called to define the alias points and to do the initial set-up (or inspector) portion of the computation
3. An abstract loop over this processor's mesh nodes is provided
4. Merging of aliases is again an object method

A sketch of what such an interface might look like follows this list.
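As a sketch only, the four features might surface in C++ roughly as follows; the class names, method names, and the merge constant are assumptions, not the actual interface:

// 1. Collective, automatically partitioned objects.
SparseMatrix A(mesh);             // sparse operator, partitioned over processors
DistVector   x(mesh), y(mesh);    // nodal fields, boundary nodes replicated

// 2. Inspector: define the alias points once and build the communication
//    tables used by every later merge.
A.defineAliases();

// 3. Executor: each processor loops over the mesh nodes it holds (including
//    its copies of boundary nodes), exactly as in the uniprocessor loop.
for (int i = y.firstLocalNode(); i <= y.lastLocalNode(); ++i)
  for (int j = 0; j < A.numNeighbors(i); ++j)
    y.elem(i) += A.elem(i, j) * x.elem(A.neighbor(i, j));

// 4. One coherence call merges the aliased y values with "+" and replaces
//    every copy with the merged result.
y.mergeAliases(PLUS);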
FIGURE 6 The particles in a cell "scatter" their position, velocity, and charge information to the j field located on the mesh sites. The "+" indicates that contributions from multiple particles are (vector) summed.

Aliases can be thought of as an extension of guard strip ideas, but in the next section we will give an example of a computation in which the aliases are not merely along the processor boundaries. In addition, we will make stronger use of the fact that the aliases are writable. This will be essential for the efficiency of the calculation.

PIC Algorithms
In this section we discuss some aspects of PIC algorithms [12] and how they might be implemented cleanly and efficiently using the framework of lightweight aliases. The applications of interest are plasma simulations, where one has charged particles moving about in space, and electromagnetic fields defined on a regular mesh.
The four major phases for each time step of a PIC algorithm are:

1. Scatter: accumulate current densities (j) at mesh points from the particles in neighboring cells (see Fig. 6).
2. Field solve: advance the electromagnetic fields (E, B) one time step. Like j, these fields are located on the mesh sites.
3. Gather: accumulate and interpolate field values to particle locations (Fig. 7).
4. Push particles: compute the force on each particle, and move it forward one time step.

Parallel PIC (Version 1)
Aliasing techniques make it easy to arrive at efficient parallel algorithms. In a first version of a parallel PIC program, we can alias the j, E, B fields along processor boundaries as shown in Figure 8. As before, the aliases have the effect of integrating guard strips or communication buffers into the fundamental data structure. A strong case can be made that the alias approach is necessary for an efficient implementation. Suppose there are several particles located in a cell along the processor boundary. Then, with the owner-computes approach, all of the particle data would have to cross the processor boundary. With aliasing, only the accumulated values (the partial sums) need cross.

FIGURE 7 Field values are accumulated and interpolated to the particle positions.

Parallel PIC (Version 2)
A second version of parallel PIC is motivated by observations of Walker [12]. One of the features of real PIC simulations is that particles tend to clump up and so, in order to load balance, we actually want to employ different distributions of the particles and the mesh. Figure 9 shows such a nonconformant distribution: for the mesh we chose a normal, block distribution, but for the particles we chose a hierarchical (in dimensions), adaptive distribution. Each distribution comes into play during a different phase of the PIC algorithm, and each is effective at balancing the load during its phase.
We define an entire region to be aliased between processors. Figure 10 shows the two distributions superimposed, and a region of aliasing is shaded. This region is written to and read from by both processors and, as before, a merge operation with a + operator makes the shared memory coherent. Only one of the sharing regions is shown in Figure 10; there are similar sharing regions between other pairs of processors.
FIGURE 10 Alias region of the j, E, B fields between two processors. We show the two distributions superimposed. The shaded area is shared by the processors above and below it.
The same basic structure applies to both the version 1 and the nonconformant (version 2) PIC algorithms. For each version, the parallel algorithm is as follows.
1. Scatter from particles.
   loopOver particles
     scatter from particle to local copy of j field
   make aliased j field coherent

2. Field solve.
   run conventional, parallel field solve using field regions
   make aliased E, B fields coherent

3. Gather to particles.
   loopOver particles
     gather, interpolate field values to particle location

4. Push particles.
   loopOver particles
     move particle one time step
   need to migrate some particles

Gaussian Elimination
We have written a parallel Gaussian elimination program using the array classes. The program performs partial pivoting. The search for the next pivot and the multicast of the pivot row are done through well-defined object methods. As is well known, Gaussian elimination has potential load imbalance due to the effect shown in Figure 11. As the computation proceeds, the active region of the matrix shrinks towards the lower right corner.
One known attack on the load imbalance is to use a cyclic or block-cyclic distribution of the matrix. Then, even as the active region shrinks, many processors stay involved in the computation. We wish to support cyclic or block-cyclic distributions transparently through our abstracted iterators. To keep efficient access of the array elements, one again needs compiler support of the iterator. Sharp and Otto [1] are producing a C++ to C++ translator that will provide this efficient access.

N-Body Gravity, Jacobi Eigensolver
Parallel versions of N-body gravity and the cyclic Jacobi eigensolver both rely on an efficient all-meet-all primitive. When we say N-body gravity we mean the direct, N² algorithm; the Jacobi method for eigenvalues is within a constant factor of the best known algorithms in the sequential case, and is scalable (in contrast to the others).
The all-meet-all primitive can be represented by the sequential, triangular, double loop:

for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    if (i < j)
      interact(&obj[i], &obj[j]);

All-meet-all is a parallel version of this, using algorithms such as those found in Fox et al. [13]. It is implemented as an object method, where the object is a 1-D array of elements that can be an arbitrary datatype, and it also takes the interact() function as an argument. interact() is user defined, and must take pointers to a pair of elements. For N-body gravity, it is the function that computes the force between the ith and jth particles; for Jacobi, it takes rows and columns of a pair of matrices and does a transformation on them. All-meet-all is an efficient and easy-to-use abstraction for the situation of a distributed list of items, where each item must "meet" every other item.

FIGURE 11 Gaussian elimination. The active region of the matrix shrinks as the algorithm proceeds down the diagonal.
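A sketch of how the primitive might be used for N-body gravity; interact() taking pointers to a pair of elements follows the text, while the Body fields, the array class name, and the allMeetAll() method name are assumptions for illustration:

struct Body {
  double pos[3], vel[3], mass;
  double force[3];
};

// User-defined pairwise interaction: compute the gravitational force between
// the two bodies and accumulate it into both (details of the force law omitted).
void interact(Body* a, Body* b) {
  // ... add the pair force into a->force, subtract it from b->force ...
}

// A distributed 1-D array of bodies "meets" itself: every pair (i < j) is
// handed to interact() exactly once, using the parallel all-meet-all schedule.
const int numBodies = 100000;
array1D<Body> bodies(numBodies);
bodies.allMeetAll(interact);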

PERFORMANCE RESULTS
The following table shows speed-up results for the Laplace and multigrid programs on an iPSC/860.
The absolute speed for one processor for all three programs was approximately 1.8 Mflops. We are currently measuring speeds for our FEM/aliasing program, also on an iPSC/860 machine.
For our PVM implementation, we measure a speed of 2.2 Mflops per HP-720 processor for the multigrid program. Figure 2 shows this program running on two HP-720 workstations for a total speed of 4.3 Mflops. Relatively tightly coupled applications such as multigrid do not scale up well on this LAN-based system. Future networks, or PVM running on scalable parallel machines, will alleviate this problem.

FIGURE 12 The parallel GP solver running with MetaMP/PVM. The many windows show the current solution that is being searched on each processor. This run was executed on seven workstations, a mix of Hewlett Packard and Sun Microsystems machines. The load balance display on the upper left shows that at this point, the parallel search is running at an efficiency of .91. The program is computing the mincut bipartition of an FEM mesh.

ASYNCHRONOUS SEARCH PROGRAMS
Before closing, we would like to mention our asynchronous, loosely coupled parallel search programs. These run on PVM and scale up quite successfully. They are a parallel implementation of a powerful search heuristic for solving combinatoric optimization problems [14, 15]. Figure 12 shows the parallel graph partitioning solver running on seven workstations, achieving an efficiency of 91%. Note that in a heterogeneous situation such as this, speed-up is difficult to define, but efficiency (what fraction of all the possible cycles was captured?) is still a valid concept. A related program for solving the traveling salesman problem shows similar behavior.

CONCLUSIONS
The partitioned array support that is provided by our array classes certainly seems useful. The basis of C++ has made it easier to write reusable software, and we have successfully applied the fundamental object methods to several distinct parallel programs.
Unstructured mesh applications and parallel programs such as PIC seem to be naturally expressed in terms of a weakly coherent shared-memory system. We are finding that these map well down to message-passing systems. Equally intriguing is the fact that these methods may map to large-scale shared-memory architectures in quite a nice way. Aliased variables are precisely those that should be replicated in a shared-memory implementation so as to avoid effects such as cache-line thrashing. The coherence calls are those places at which the replicated variables must be brought together.
Elaborate parallel applications demand carefully constructed parallel data structures and mechanisms. Although it is true that some of the ideas discussed here are finding their way into HPF compilers, this seems to us to stretch the idea of what a compiler should do. Compilers preclude extension; a mixed approach of compiling and class definitions may be more workable.

We are in the process of making the MetaMP preprocessor, array classes, and run-time system available. Application programs are also available.

FIGURE 1 Coarsening and refinement operations for multigrid. In coarsening, values from a fine grid are averaged and written to a coarse grid of fewer cells. In refinement, values from a coarse grid are interpolated to a fine grid of more cells.
Also shown in Figure 2 are a picture of the residual field and a usage of the load balance display utilities. The residual field display was easily added by creating another partitioned array with the same alignment as phi, setting it proportional to the residual during the convergence check phase of the program, and drawing it using the draw() method. The load balance display is used by calling it with each processor's estimate of its Mflops.

FIGURE 9 Nonconformant distributions. Due to particle clumping, we use an adaptive distribution that balances the (particle) load amongst the processors. During the field solve phase, we still want a conventional block distribution.