Simulation of Compressible Flow on a Massively Parallel Architecture

This article describes the porting and optimization of an explicit, time-dependent, computational fluid dynamics code on an 8,192-node MasPar MP-1. The MasPar is a very fine-grained, single instruction, multiple data parallel computer. The code uses the flux-corrected transport algorithm. We describe the techniques used to port and optimize the code, and the behavior of a test problem. The test problem used to benchmark the flux-corrected transport code on the MasPar was a two-dimensional exploding shock with periodic boundary conditions. We discuss the performance that our code achieved on the MasPar, and compare its performance on the MasPar with its performance on other architectures. The comparisons show that the performance of the code on the MasPar is slightly better than on a CRAY Y-MP for a functionally equivalent, optimized two-dimensional code.


INTRODUCTION
We have ported a computational fluid dynamics application to the MasPar MP-1. The application uses the flux-corrected transport (FCT) algorithm [1], and consists of a set of Fortran subroutines that are collectively referred to by the name of the main subroutine, LCPFCT. In addition to the main subroutine, several auxiliary subroutines can be used to define the geometry, source terms, and boundary conditions. The code solves the coupled sets of multidimensional nonlinear conservation laws that describe reactive and nonreactive gas dynamics.
LCPFCT itself can handle Cartesian, cylindrical, spherical, or user-defined coordinate systems. Alternate sets of boundary conditions can be selected by making the appropriate choice of arguments to the subroutine calls. Besides inflow, outflow, or reflecting wall conditions in any coordinate system, LCPFCT can also handle periodic boundary conditions. Multidimensional problems are solved using the method of fractional steps [5]. The computational grid can be nonuniform and can move during a time step, allowing the user to perform Lagrangian or sliding rezone calculations.
This article describes our porting and optimization of FCT on the MasPar. First we give a general description of the flow model and the FCT algorithm. Second, we give a brief overview of the MasPar architecture. Third, we briefly describe the two-dimensional blastwave problem with periodic or solid-wall boundary conditions which we use as a benchmark. Fourth, we discuss the hurdles involved in adapting the basic LCPFCT compressible fluid dynamics module to a form compatible with and efficient on the MasPar. Finally, we summarize the performance of our code on the MasPar. We compare the results of running our two-dimensional code on the MasPar to equivalent benchmarks on other architectures, including the CRAY Y-MP and the Connection Machine CM-2.

THE FLOW MODEL AND ALGORITHM
Our flow model is based on the inviscid, time-dependent, multidimensional Euler equations of gas dynamics. This includes three conservation laws, for mass, momentum, and energy. In the inviscid model, shocks appear as mathematical discontinuities. The solution that we seek is the weak solution to the differential equations that satisfies the integral form of the equations. From a physical standpoint, in the presence of shocks, the integral form is the proper form of the conservation statement. The FCT algorithm is used to solve each of the coupled conservation equations defining the gas dynamic system.
FCT is an explicit, nonlinear, monotone method designed to enforce positivity and causality on the numerically computed solution. These constraints ensure that the numerical solution approximates the solution of the conservation laws in integral form. Positivity and causality would be lost in the numerical approximation if one did not ensure that the finite-difference scheme is conservative. FCT satisfies the monotonicity requirement: it implements profile-dependent nonlinear corrections to the truncation-error terms, which ensure that no new numerically produced maxima or minima occur near shocks or contact discontinuities. The FCT algorithm ensures fourth-order phase accuracy in smooth regions of the flow and guarantees conservation, monotonicity, and positivity in regions with steep gradients. An extensive discussion and analysis of FCT has been published by Boris and Book [1].
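The structure described above, a monotone low-order transport step plus a limited antidiffusive correction, can be sketched for 1D linear advection on a periodic grid. This is a minimal illustration in the Boris and Book style, not the LCPFCT implementation; the function name, grid size, and Courant-number convention are our own:

```python
import numpy as np

def fct_step_1d(u, c):
    """One flux-corrected transport step for 1D linear advection on a
    periodic grid; c is the Courant number, assumed 0 < c < 1 with
    positive velocity. Fluxes at interface i+1/2 are stored at index i."""
    # Low-order (donor-cell) flux: monotone but diffusive.
    f_low = c * u
    # High-order (Lax-Wendroff) flux: accurate but dispersive near shocks.
    up1 = np.roll(u, -1)
    f_high = 0.5 * c * (u + up1) - 0.5 * c**2 * (up1 - u)
    # Transported-diffused solution from the low-order fluxes.
    td = u - (f_low - np.roll(f_low, 1))
    # Antidiffusive flux, limited so that the corrected solution can
    # produce no new maxima or minima (the "flux correction").
    a = f_high - f_low
    s = np.sign(a)
    dup = np.roll(td, -1) - td  # td[i+1] - td[i]
    a_lim = s * np.maximum(0.0, np.minimum(np.abs(a),
                np.minimum(s * np.roll(dup, -1), s * np.roll(dup, 1))))
    # Conservative update with the limited antidiffusive fluxes.
    return td - (a_lim - np.roll(a_lim, 1))
```

Because both flux sets are applied in conservation form, the cell sum is preserved exactly, and the limiter keeps a step profile free of new overshoots.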
FCT is second order accurate in space. Second order accuracy in time is achieved by splitting the time step. First a half time step computation is executed, and then the intermediate, time-centered values of the physical variables are used to evaluate the source terms for the full time step. To ensure second order accuracy, the time step must be small enough so that the cell-averaged values of the physical variables do not change appreciably during the time step. Direction splitting [6] allows for multidimensional FCT calculations. In a two-dimensional problem this is accomplished by separating each gas dynamic equation into its respective x and y parts. First each y-direction column in the grid is integrated, and then the x-direction rows are integrated. Direction splitting creates a bias that will eventually break the symmetry of a solution, depending on which direction is integrated first. This can be eliminated by performing two calculations, x-y then y-x, and averaging the results.
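The symmetrized splitting can be sketched with whole-array operations. The 1D pass here is a simple donor-cell stand-in for the FCT sweep, and all names are illustrative:

```python
import numpy as np

def sweep(u, c, axis):
    """Donor-cell advection of every row (or every column) of a 2D
    periodic field in one array operation; a stand-in for a 1D FCT pass."""
    f = c * u
    return u - (f - np.roll(f, 1, axis=axis))

def split_step(u, cx, cy):
    """Symmetrized direction splitting: average the x-then-y and
    y-then-x sweeps to suppress the directional bias of splitting."""
    u_xy = sweep(sweep(u, cx, axis=1), cy, axis=0)
    u_yx = sweep(sweep(u, cy, axis=0), cx, axis=1)
    return 0.5 * (u_xy + u_yx)
```

With equal Courant numbers, an initial condition that is symmetric under transposition stays symmetric after the averaged step, which a single x-y sweep alone would not guarantee.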
FCT is a "uniform"' algorithm in which.at each time step every cell undergoes the ,.;ame numerical operations regardless of the values of the physical variables in that cell.It is therefore ideallv suited for massively parallel processing on a single instruction, multiple data ISI\ID) architt>cture ,uch as the \lasPar.

THE MASPAR
The MasPar system architecture includes a processor element array, an array control unit, and a front end. There are three types of communication on the MasPar. First, there are communications from the array control unit to the processor array, where the array control unit broadcasts data or instructions to all processors in the array simultaneously. Second, nearest-neighbor data communications are carried out by the X-net. The X-net is an eight-way, two-dimensional toroidal mesh that allows a processor to communicate with its nearest neighbors. Finally, communication between arbitrary processors is carried out by a hierarchical crossbar called the global router.
Interprocessor communication is required when a processor requires data that are not resident in its local memory. X-net communication will be used if the data reside on a neighboring processor. In this case the performance penalty is low because X-net communication is fast. If the data do not reside on a neighboring processor, global router communication will be used. While router communication is efficient, it is much slower than X-net communication: the X-net has approximately 16 times the bandwidth of the router.
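The distinction matters when laying out data. In array terms, a one-cell shift is the kind of access the X-net serves, while an arbitrary gather must go through the router; the following NumPy analogy (our own, not MasPar code) makes the two patterns concrete:

```python
import numpy as np

u = np.arange(16.0).reshape(4, 4)

# Nearest-neighbor access (X-net-like): every cell reads its east
# neighbor, with wraparound matching the X-net's toroidal mesh.
east = np.roll(u, -1, axis=1)

# Arbitrary-processor access (router-like): a general gather through an
# index map; on the MasPar this pattern pays the ~16x bandwidth penalty.
idx = np.random.default_rng(0).permutation(16)
gathered = u.ravel()[idx].reshape(4, 4)
```

Both patterns move the same amount of data; only the communication topology they exercise differs, which is why the code optimizations below aim to turn gathers into shifts wherever possible.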
Another, more costly, type of communication is array sloshing between the front end and the processor array. Array sloshing has a profound effect on performance and occurs in several circumstances. First, if an array that has been allocated on the processor array is accessed in a serial (Fortran 77) manner, it will be sloshed to the front end. To avoid this situation, serial access on the processor array should be avoided. Likewise, when an array has been allocated on the front end and a subroutine is called that uses this array in a parallel context, the array will be sloshed from the front end to the processor array and back again when the routine exits. To avoid this, compiler directives can be used to force allocation on the processor array at declaration time. Even more costly than the latter two cases is the sloshing of a COMMON block of arrays. This can be avoided in the same manner as for individual arrays.

TWO-DIMENSIONAL BLASTWAVE COMPUTATION
The problem used to benchmark FCT on the MasPar is a two-dimensional blastwave computation with periodic boundary conditions. This computation involves both supersonic and subsonic flows with interacting shocks and a high degree of symmetry. The blastwave problem may not be a significant real-world problem, but it is a good benchmark for computational fluid dynamics (CFD) codes. It is a good benchmark because the solution is very symmetric, it maintains this symmetry for a long time, and a good CFD code should resolve the shockwaves within a few cells while maintaining the symmetry of the solution. Also, this problem was previously used by Oran et al. [5], and using the same problem allowed us to compare our results.
The computation is initialized with a high-density, high-pressure square of fluid that is 32 cells on each side.The square is situated in a doubly periodic mesh 128 cells on each side.The square region in the center of the domain begins with a density 15 times the background density and a pressure 30 times the background pressure.The contact surface that defines the interface between the initial high pressure material and the low pressure material is tracked by using an additional species variable in the computations.This computation is a good benchmark for a CFD model because it should retain its symmetry for a long time, and because periodic or solid wall boundary conditions should give the same solution as long as the symmetry is maintained.
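The initial state described above is simple to write down. A sketch of the setup (variable names are our own; LCPFCT's actual data layout differs):

```python
import numpy as np

N, S = 128, 32                  # doubly periodic grid and initial square size
rho = np.ones((N, N))           # background density = 1
p = np.ones((N, N))             # background pressure = 1
species = np.zeros((N, N))      # marker species tracking the contact surface

lo, hi = N // 2 - S // 2, N // 2 + S // 2
rho[lo:hi, lo:hi] = 15.0        # 15x background density in the central square
p[lo:hi, lo:hi] = 30.0          # 30x background pressure
species[lo:hi, lo:hi] = 1.0     # flag the initially high-pressure material
```

Centering the square on the grid gives the fourfold symmetry that makes the problem such a sensitive benchmark: any asymmetry that develops is numerical error.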
Figures 1 and 2 show contours of pressure and the location of the contact surface for several time steps during the simulation. During the simulation, the up-down symmetry is eventually broken by round-off error arising from the limited precision of the floating-point calculations. Symmetry across the 45° diagonals is eventually broken by the truncation errors in the time step splitting. These errors appear to be larger than the round-off errors.
When the unconfined high-pressure gas is released, a shock forms that races out from the edges of the initial square. The contact surface closely follows behind the shock. Figure 1 shows the development of the pressure contours for six time steps. By step 500 the initially square shock has progressed through a circular phase and continues to change shape. By step 1,500 the shock has reflected from the ends of the computational domain and has begun to recompress the material inside the contact surface. As time proceeds to step 10,000 the shocks become progressively weaker and become oriented parallel with the sides of the domain.
Whereas the shock patterns become simpler as time progresses, the vorticity caused by the shock interactions with the contact surface warps the interface into increasingly complex patterns. The vorticity is generated by the baroclinic source term in the vorticity equation. This term is nonzero when the gradients of density and pressure are not aligned. The misalignment is created by the shock reflecting from the corners of the domain, which weakens the shock. The shocks reflect back through the interface, creating more vorticity.

PORTING AND OPTIMIZING FCT
The MasPar can be programmed in either MPL, a parallel extension to C, or MPF, an implementation of a subset of Fortran 90. We used version 3.012 of MasPar's high-performance Fortran compiler, "mpfortran" or MPF. In MPF, parallel operations are expressed with the Fortran 90 array extensions. Arrays are treated as unitary objects rather than being iterated through one element at a time, as in standard Fortran 77. MPF generates code for both the front end and the processor array, effectively making the details of the architecture transparent to the programmer. However, programming style will directly affect performance.
Optimization of FCT on the MasPar required two main issues to be addressed: making effective use of the processor array and minimizing communication cost. In SIMD architectures, operations performed on a subset of the processor array, such as single lines or columns, or boundary nodes, cost as much in cycles as operations on the whole array. Thus, constructs dealing, for instance, with the boundaries separately from the main arrays can easily double the computation time, and they are advantageously concatenated with the main operations. In addition, unwanted interprocessor communication can occur if arrays are not correctly allocated on the processor array.
It was critical to ensure that all arrays were properly aligned on the processor array in order to minimize communication overhead.
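The cost argument above suggests folding the boundary treatment into the whole-array operation instead of issuing a separate pass for the edges. A schematic example with illustrative arithmetic (not the actual FCT update):

```python
import numpy as np

N = 8
u = np.arange(float(N * N)).reshape(N, N)

# Separate treatment: one pass for the interior, another for the boundary
# rows. On a SIMD machine each pass costs a full array cycle even though
# the second touches only 2*N cells.
separate = u.copy()
separate[1:-1, :] = 2.0 * separate[1:-1, :]
separate[0, :] = 0.0
separate[-1, :] = 0.0

# Concatenated treatment: a single masked whole-array operation computes
# the interior update and the boundary values in one pass.
interior = np.zeros((N, N), dtype=bool)
interior[1:-1, :] = True
combined = np.where(interior, 2.0 * u, 0.0)
```

The masked form corresponds to Fortran 90's WHERE construct in MPF: one instruction stream over the whole array, with processors outside the mask simply assigned the boundary value.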
Porting the code involved two phases: getting the code to run on the processor array, and then optimizing it there. First, since the subroutines were originally written in Fortran 77, they were converted to Fortran 90 using the VAST translator. Because elements of the COMMON blocks were used in Fortran 90 array constructs, the whole block had to be sloshed to the processor array. By removing all COMMON blocks, and passing the block elements as parameters to the subroutines, we improved the temporal performance by a factor of 2. This was still unsatisfactory.
Stage two consisted of altering the implementation of direction splitting. The original LCPFCT subroutines were one dimensional. Two-dimensional problems were handled by calling the one-dimensional subroutines row by row, and then column by column. This minimizes scratch memory, a desirable feature on a vector machine, but it leads to poor performance on a SIMD architecture. For example, for a 128 X 128 grid, a row or column array will contain only 128 elements. When the one-dimensional subroutines are called with a 128-element array, only 128 processors will be used at a time, resulting in a performance that is less than 1.0% of the peak performance of the machine. The code was rewritten such that all the rows were dealt with in a single operation, and then all the columns. This can be done because adjacent rows or columns are independent of one another during this computation. This stage improved the temporal performance to approximately 3 tstep/s.
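The rewrite changes only the shape of the computation: looping a 1D routine over rows gives the same numbers as one whole-grid array operation, but only the latter keeps the full processor array busy. A sketch using a donor-cell update of our own (not LCPFCT):

```python
import numpy as np

def sweep_row_by_row(u, c):
    """Original structure: call the 1D update once per row, so only one
    row's worth of processors would be active per call on a SIMD machine."""
    out = np.empty_like(u)
    for j in range(u.shape[0]):
        f = c * u[j]
        out[j] = u[j] - (f - np.roll(f, 1))
    return out

def sweep_all_rows(u, c):
    """Rewritten structure: one array operation updates every row at once,
    exploiting the independence of adjacent rows during the sweep."""
    f = c * u
    return u - (f - np.roll(f, 1, axis=1))
```

The two routines are numerically identical; the payoff is purely in processor utilization.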
The profiler was used in stage three to find out which parts of the code used the most CPU time. The profiles showed that approximately 70% of the time was spent in the LCPFCT subroutine, and a large percentage of that time was spent in calculating the boundary conditions. The VAST version of the code used two one-dimensional arrays to store the results of the boundary calculations. These arrays could not be properly aligned with the main two-dimensional arrays. To eliminate the communication penalty that results from this situation, the code that dealt with the boundary calculations was rewritten, and the results of the boundary calculations were stored in the main two-dimensional arrays. In addition, in the original implementation, periodic boundary conditions were controlled by setting the value of a double precision flag to either zero or one. The code was modified to use a logical flag and an IF-THEN-ELSE block instead. As a result of the modifications performed in this stage the temporal performance improved to 4.55 tstep/s.
Stage four consisted of replacing some of the subroutine array parameters with COMMON blocks. As a result of eliminating the COMMON blocks in stage two of the optimization, a significant amount of overhead had been created from passing a large number of array parameters to the subroutines. The original COMMON blocks contained both array and scalar data; COMMON blocks that contain only arrays and no scalar data will be properly allocated on the processor array. Replacing some of the subroutine array arguments with COMMON blocks reduced the overhead. This stage further improved the temporal performance to 5.56 tstep/s. In stage five, analysis of the code profiles indicated that some overhead was being incurred due to router communication, because some of the subroutine calls included arguments that were parts of arrays. This situation is handled by copying into a temporary array through use of the router, which is time consuming. The code was modified to eliminate the router operations, and temporal performance increased to 5.88 tstep/s.
Further improvement was obtained in stage six by aligning all array allocations. The current compiler implementation maps arrays directly onto the processor array based upon declaration indices, without any optimization attempt. Some of our arrays needed to include boundary nodes, while some others did not, and we had declared them with their minimal size. This caused unnecessary shifting of whole arrays, which is costly and can be avoided by proper array declaration. By changing the declaration indices we aligned the array allocations and improved the temporal performance to about 7 tstep/s. Table 2 shows the benchmark performance achieved with the optimized version of the code. The first row of Table 2 (128 X 64) corresponds to a problem with an array size that can be mapped optimally onto the processor array. In this case there is a minimum in communication overhead because none of the arrays have to be mapped into multiple layers of processor memory. For the 128 X 128 case a drop of about 2.3% in single precision, and about 10% in double precision, performance is observed. This is likely due to the communication overhead created by one extra layer of memory having to be allocated. Table 2 also shows the percentage of the peak performance attained by FCT. Since we have implemented two-dimensional FCT computations using direction splitting, the peak performance that our code could attain is reduced by at least a factor of two. This is because the rows and columns of the grid are dealt with separately. With a fully two-dimensional version of FCT this problem could be fixed, but that would require substantial rewriting of our original code. Taking this into consideration, our results are quite good.

RESULTS AND DISCUSSION
Figure 3 shows the dependence of benchmark performance on the problem size. After the initial drop in benchmark performance when changing from a 128 X 64 to a 128 X 128 grid, the performance increases at an approximately logarithmic rate as the problem size increases. This is because the ratio of floating-point operations to communication overhead is increasing. The incremental changes in benchmark performance are much more dramatic for single precision than for double precision, because communication overhead is independent of the precision. Timings and floating-point performance for FCT on various architectures are presented in Table 3, which includes the current figures and data from Oran et al. [5]. Timings are shown for two typical vector supercomputers, the CRAY Y-MP and the Fujitsu VPX240. The timing for the MasPar as compared to the Cray is competitive, with the MasPar approximately 7% faster. The Fujitsu is a much faster machine than either the Cray or the MasPar, with a theoretical peak performance of 2.5 Gflop/s. The timings shown in Table 3 were obtained without any attempt to optimize the code on the Fujitsu, and could probably be improved relatively easily.
Another parallel architecture was also included for comparison. The Connection Machine CM-2 has a SIMD architecture similar to the MasPar's. The MasPar timing is 43% faster than the Connection Machine's. FCT on the Connection Machine was written in a parallel implementation of C called C* (C star) by Oran et al. [5], while the code was written in Fortran 90 on the MasPar. The final two platforms presented for comparison are both single-processor RISC workstations.
Further optimization of our code could be obtained by writing a fully two-dimensional version of LCPFCT, but this would require starting over from scratch. Due to our limited time on the MasPar we did not perform this step. Now that we have a fully optimized version of LCPFCT that uses direction splitting, there are several possibilities for using the code to perform real-world simulations. We now have the ability to perform large-scale direct simulations of both nonreacting and reacting flows. A combustion model is currently being added to the code so that we can simulate detonations and reacting flows in combustors. Our results only considered rectangular domains. To avoid the drastic performance degradation involved with simulating flow through more complex domains, techniques such as domain decomposition and mapping of rectangular subdomains onto the processor array topology would be required. In any event, finite-difference algorithms are arguably not the most suitable for complex geometries.

CONCLUSIONS
The MasPar's SIMD architecture is well suited for explicit finite-difference Euler solvers because the problem can be optimally mapped onto the machine topology, and the algorithm requires that the same operations be performed at all cells at all time steps. Optimization required that we rewrite the code to calculate the direction-split rows or columns simultaneously (in parallel), which can be done since adjacent rows or columns are independent of one another during such a computation. Apart from syntax modifications, further optimization was carried out by modifying the boundary condition calculations so that they were aligned with the main arrays. We have improved performance by a factor of 240 through the described optimizations.
In general, Fortran 90 parallel code is easier to write and work with, and shorter than the corresponding scalar code. The performance of FCT on the MasPar is slightly better than on the CRAY Y-MP (1 CPU), and is also faster than on the Connection Machine CM-2. The MasPar has been relatively user friendly and easy to program, and the

FIGURE 1 Pressure contours at six time steps for the square blastwave problem.

FIGURE 3 Dependence of benchmark performance on the problem size.

Hockney defines temporal performance as the inverse of the execution time, R_T = T^{-1}(N; p), where N is the problem size and p is the number of processors. Temporal performance is measured in solutions per second (sol/s) or time steps per second (tstep/s). It is a good metric for comparing different algorithms on a given benchmark problem because it indicates which algorithm solves the problem fastest.
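Under that definition, computing the metric is a one-liner; the function name and the numbers below are illustrative only:

```python
def temporal_performance(exec_time_s, n_steps=1):
    """R_T = T^{-1}(N; p): units of work (here, time steps) completed per
    second of execution time, for a fixed problem size N on p processors."""
    return n_steps / exec_time_s

# A run completing 10,000 time steps in 1,400 s of execution time
# achieves 10000 / 1400, i.e. roughly 7 tstep/s.
rate = temporal_performance(1400.0, n_steps=10000)
```

Because the problem is held fixed, the metric compares algorithms and machines directly: whichever posts the higher tstep/s solves that benchmark fastest.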

Table 1 presents a summary of the effects that each stage of optimization had upon the performance of FCT. The benchmark performance was calculated as suggested by Hockney [4].