Experiences in Data-Parallel Programming

To efficiently parallelize a scientific application with a data-parallel compiler requires certain structural properties in the source program, and conversely, the absence of others. A recent parallelization effort of ours reinforced this observation and motivated this correspondence. Specifically, we have transformed a Fortran 77 version of GROMOS, a popular dusty-deck program for molecular dynamics, into Fortran D, a data-parallel dialect of Fortran. During this transformation we have encountered a number of difficulties that probably are neither limited to this particular application nor do they seem likely to be addressed by improved compiler technology in the near future. Our experience with GROMOS suggests a number of points to keep in mind when developing software that may at some time in its life cycle be parallelized with a data-parallel compiler. This note presents some guidelines for engineering data-parallel applications that are compatible with Fortran D or High Performance Fortran compilers.


Introduction
One concern often not foremost in a scienti c programmer's mind at the outset of software development is parallelization.Yet, even for scienti c applications developed for sequential execution, it is not unlikely that someone at sometime will parallelize the software.As it turns out, some programming styles are easier to parallelize than others.Moreover, for programs to yield to the data-parallel approach of compilers for Fortran D 5] or High Performance Fortran (HPF) 8], certain structural properties must be present in the software.Elements of a program style congruous with an HPF compiler include, for example, consistent distribution and use of data arrays and structured ow of control.It appears that writing such programs from the outset largely embraces good software engineering techniques (such as the structured-programming approach advocated by Dijkstra 4]).In this correspondence, we discuss some of these requirements and provide guidelines for engineering data-parallel applications to be compatible with compilers for data-parallel languages.Alternatively, the observations presented here could also serve as guidelines for making an existing program suitable for data-parallel compilation.In the following, we will rst outline some Corresponding author; supported in part by the National Science Foundation, award numbers ASC-921734 (which includes funds from DARPA) and MCB-9202918.y The work described in this paper was performed while in residence at the CRPC, in part supported by an IBM fellowship.
z Support provided in part by the National Aeronautics and Space Administration/the National Science Foundation under grant #ASC-9349459.
x Texas Center for Advanced Molecular Computation { Center for Research on Parallel Computation One underlying idea of data-parallel languages, such as Fortran D or HPF, is that the user does not explicitly specify the parallelism inherent in a program, but instead annotates the program with directives on how to distribute the data, and lets the compiler work from there.Performance and investment-preserving independence from environment-speci c details are two key objectives.The art of data-parallel programming might be de ned as achieving the former without compromising the latter.For this correspondence, this also means that while the programmer should understand certain characteristics of data-parallelism, he or she should not have to develop a style which will have an adverse e ect on code readability, maintainability or e ciency in non-data-parallel environments.
Performance depends on the degree of parallelism and on overheads, such as synchronization and communication costs. 1 Compilers for distributed memory architectures typically try to achieve parallelism by distributing loop iterations across processors.The rst guideline for designing dataparallel programs should therefore be: The ow of control should be structured; e.g., DO-loops are preferable to GOTOs.Distributing loop iterations is commonly driven by some heuristic for minimizing communication, such as the owner-computes rule 1] or variations thereof.This heuristic may fail even for \embarrassingly parallel" algorithms if the program expressing the algorithm obscures the parallelism by some (perhaps apparently unrelated) means.Even though there are many other issues crucial for achieving successful parallelization, realizing these points and designing program and data structures accordingly should in themselves go a long way towards good performance for many applications.
The perhaps most important guideline that should guide design decisions is: Loops and arrays should match.That is, in computationally intensive code regions, array subscripts and loop indices should be related to each other in a simple manner, allowing the compiler to derive a loop parallelization directly from an array distribution.To justify this guideline, let us brie y digress into the workings of a data-parallel compiler for distributed memory machines, such as the Fortran D compiler prototype at Rice University.
Assume the compiler is given a simple code segment as shown in Figure 1 on the left, Good1 FortD .The Fortran D compiler will generate a node program Good1 Node , shown in Figure 1 on the right, which in turn will be compiled by the native compiler of the target machine. 3This program will be written in local name space, as opposed to the single, global name space of the Fortran D program.It will contain the instructions for an individual processor, and it may contain communication statements.Here we will focus on the distribution of computation; for a discussion of communication generation and other compilation issues we refer the reader to other publications 7, 12].The compiler tries to parallelize the i-loop in Good1 FortD by applying the owner-computes rule to the distributed array reference, x(i).Assuming no sequentializing dependences on the rhs of the assignment, the owner-computes rule works ne here since induction variable i and array subscript i are in a simple relationship { they are identical.The loop and the array match.The compiler can fully apply the owner-computes rule at compile-time and perform loop-bounds reduction: assuming that there are N proc = 4 processors, each processor will perform only a quarter of the total number of iterations.Now consider the program BAD1 FortD in Figure 2. Similar to Good1 FortD , the array and the loop match in size and we can parallelize the loop, assuming again no dependences.However, the loop index i and the arrays subscript j do not match; the compiler cannot apply loop-bounds reduction, but instead has to apply the owner-computes rule at run time with a guard.The core of the computation, which we assume to be the assignment to x(j), will still be executed in parallel, but scalability is likely to be limited due to the fully replicated loop iteration set.
In some cases, it will be di cult to establish a simple relationship between loops and sub-scripts, for example in irregular access patterns of distributed arrays.However, subscript-analysis complications are often avoidable artifacts of programming styles that obscure compiler analysis in general, not only in the data-parallel context.For example, consider the Fortran D fragments in Figure 3 (from now on we will omit the FortD subscript in the examples).Bad2a and Bad2b illustrate the popular practice of linearizing arrays, for example by storing a set of coordinate triplets into a 1-D array.Such linearizations are used for example to eliminate a loop nest, facilitating vectorization of the resulting single loop 9].For data-parallel programming, linearization often results in blurring the distinction between distributed and replicated subscript components.For example, in Bad2a or Bad2b one would typically want to distribute x along the triplets, but keep each individual triplet on a single processor; i.e., the i loop should be parallelized, while the d loop should be replicated.This, however, is obscured to the compiler by the way the array is indexed, and, in Bad2b, by the arti cial self-dependence in incrementing the counter j.
In the fairly clean and simple case of Bad2a and Bad2b one might still be able to teach a compiler to correctly apply loop-bounds reduction to the outer loop; in the general case, however, it is likely that the compiler will resort to replicating both loops and inserting guards similar to Bad1 Node .It is much more desirable to clearly re ect the programmer's intent by splitting the subscript of x into the distributed index i and the replicated dimension index d, as shown in Good2.In general, a guideline for making the compiler's life easier is: Arrays should not be linearized.

Guidelines applied to a molecular dynamics program
Before turning to our experience with parallelizing Gromos 6] using the Fortran D compiler, we rst give a brief introduction into the underlying application for the examples presented here; for more details, we refer the reader to the literature 2, 10].Molecular dynamics (MD) is a classical mechanics approach typically used to determine the motion of large molecular systems.At the core of the simulation, a force is calculated for each atom from the analytic derivative of a potential energy function.This force displaces the atom from its position in the previous time step.The MD program iterates over some number of time steps in the course of calculating a molecular dynamics trajectory.Since each atom interacts with other atoms in some spatial neighborhood, dependences arise between atoms in so far as the potential energy function for each atom is evaluated with the positions of surrounding atoms from the previous time step.A pair list indicating which atoms interact with each other is computed every t steps, where t usually ranges from 10 to 50.Since there are typically tens to hundreds of interaction partners for each atom, the data structures representing the pair list tend to be the most space consuming in the program.Within a time step, the computation for each atom is independent from the computation for all other atoms and therefore is inherently parallel. 4We base this report on the replicated approach 3], where we distribute the pairlist data structures, inb and jnb, while replicating the other principal arrays, which includes the coordinate and velocity arrays, x and v, and the forces, f.For studys on more aggressive distributions we refer the interested reader to the literature 2].

Delinearizing the non-bonded force calculation
The non-bonded force (NBF) constitutes the main component of the molecular dynamics computation performed by Gromos and other MD programs.Since a force is computed for each atom, one natural implementation of the NBF algorithm loops over atoms and their partners, computes Figure 4: NBF kernel with linearized (left) and delinearized (middle) atom-based pair list, and with delinearized charge-group-based pair list (right).Natom is the number of atoms, MaxAllP the maximum total number of partners, and MaxAtomP the maximum number of partners per atom.Nchg is the number of charge groups, and MaxChgP the maximum number of partners per charge group.firstAt and lastAt give the range of atoms for a charge group.the force between them, and accumulates it according to Newton's Third Law, as shown in the (highly abstracted) Gromos subroutine NBF Lin At() in Figure 4.The underlying computation is inherently parallel; however, the compiler will fail to parallelize this loop nest.The reason is the loop-carried dependence on cnt, which is similar to the dependence on j in Bad2b (Figure 3).Here, however, even advanced compiler analysis cannot identify a simple relationship between the array subscript cnt and the loop indices i and p.The problem is that in order to retrieve the list of partners for some atom i, one has to calculate the correct o set into jnb, which in turn depends on inb(i') for 1 i' < i; i.e., to determine the range of j for some iteration i, one has to iterate through all previous jnb segments.The advantage of linearizing jnb this way is space conservation; instead of having to reserve an equal amount of storage for each atom's pairlist, we only need to reserve enough storage to accommodate the sum of partners.However, we can expect the storage savings in distributing jnb across processors to more than compensate for this increased space requirement, and to do so requires a delinearization as shown in NBF Delin At() in Figure 4.This will also allow parallelization, since now the distributed and replicated array dimensions are separated, and they directly correspond to the surrounding parallel and sequential loops.

Delinearizing the pairlist construction
The force calculation in NBF Delin At() now corresponds to the pattern in Good2, so, in itself, it can easily be parallelized.However, we must also consider the construction of the pair list; Pairs Delin At() in Figure 5 shows a simpli ed version of the corresponding Gromos routine, with jnb delinearized according to NBF Delin At().It turns out that the criterion for including pairs of atoms in the pairlist actually depends on which charge group each atom belongs to, where a charge group is a collection of atoms treated collectively by the MD model.(Two atoms are considered \close" if their respective charge groups are \close.")Pairs Delin At() determines interacting pairs by looping over charge groups ii, determining for each charge group which other charge groups jj are close to it, storing the charge-group partners of ii in an array isPair.Then looping over the atoms i of charge group ii, inb(i) and jnb(1:inb(i),i) are constructed accordingly.Distributed and replicated array dimensions are cleanly separated; however, we again have unmatched loop and data structures.The distributed dimensions of inb and jnb are both indexed by atom index, whereas the enclosing loops iterate over charge groups (ii) and atoms within each charge group (i).The problem is that the granularity of the pair list computation is not the atom, but the charge group.We therefore switch to a charge-group-based representation, as in Pairs Delin Chg(); this not only allows parallelization by loop-bounds reduction, but also preserves memory. 5To nalize the data-parallelization (at the level presented here), we now have to also modify the NBF calculation to use the charge-group-based pair list.The result is NBF Delin Chg() shown in Figure 4.

Discussion
We have stressed the importance of matching array and loop structures for data-parallel programming.However, there are many other issues in uencing the quality of a compiler's analysis and the performance of the resulting code.We list, without further elaboration, other guidelines which may be of particular signi cance to dusty-decks.
Do not use distributed arrays as work space for other, non-distributed (or di erently distributed) data.Keep unrelated computations separate.Keep array uses consistent across procedure boundaries.
As a general rule of thumb, one may say that programs that are hard to parallelize by hand will be even harder to parallelize for the compiler; data-parallel compilers provide only limited help with the high-level task of extracting exploitable parallelism from an application.However, that is not to say that a compiler can be of little use for parallelization.High-level data-parallel languages such as Fortran D and HPF remove from the programming task the tedious, error-prone, machine-speci c, low-level work that has traditionally accompanied parallel computing.This note intends to help programmers harness that power to its fullest potential.

Figure 1 :
Figure 1: Simple example loop with matching array access; the Fortran D program (left) can be compiled into a node program (right) with reduced loop bounds (assuming N proc = 4).Distributed array and loop indices are framed.

Figure 2 :Figure 3 :
Figure 2: Example loop not matching the array access; the Fortran D program (left) will be compiled into a node program (right) with full loops and a guard (assuming N proc = 4).The id of an individual processor is denoted by my$p, which ranges from 0 to N proc ? 1.