Parallelisation of a Three-dimensional Shallow Water Estuary Model on the KSR-1

Flows in estuarial and coastal regions may be described by the shallow-water equations. The processes of pollution transport, sediment transport, and plume dispersion are driven by the underlying hydrodynamics. Accurate resolution of these processes requires a three-dimensional formulation with turbulence modeling, which is very demanding computationally. A numerical scheme has been developed which is both stable and accurate - we show that this scheme is also well suited to parallel processing, making the solution of massive complex problems a practical computing possibility. We describe the implementation of the numerical scheme on a Kendall Square Research KSR-1 multiprocessor, and present experimental results which demonstrate that a problem requiring 600,000 mesh points and 6,000 time steps can be solved in under 8 hours using 32 processors.


INTRODUCTION
Environmental impact studies relating to estuarial or coastal regions invariably involve computational flow simulation with additional simulation for the transport of pollution, sediment, or thermal plumes. The equations to be solved are known as the shallow-water equations, which are based on the Navier-Stokes and continuity equations with the assumption that the pressure everywhere in the flow is simply hydrostatic. The formulation may be simplified further by making the "depth-averaged" assumption; however, the vertical velocity profile will not be typical of a steady unidirectional current when flow curvature effects and eddy shedding are significant. This has obvious implications for predicting the transport of pollution, usually released near the sea bed, where the vertical distribution of velocity and turbulence (mixing) processes has an important influence. For sediment transport the near-bed velocity and turbulence characteristics are also of vital importance. When buoyant plumes are released from power station outfalls, vertical motion is clearly significant to plume dispersion. Overall it can be seen that computation of the shallow-water equations in three-dimensional form is highly desirable.
Casulli and Cheng [1] developed a semi-implicit, Lagrangian finite-difference scheme as an alternative to a Eulerian, alternating direction implicit scheme [10], avoiding the need for upwind differencing to give stability and the time step limitation of the Courant condition due to convective terms. Casulli and Cheng applied their scheme to tidal flows in the San Francisco Bay and the Venice Lagoon, reporting good results.
Stansby and Lloyd [8] refined this scheme and applied it to the less spectacular, but probably more hydrodynamically demanding, case of flow around a circular island with sloping sides generating vortex shedding (see Fig. 1). The choice of this simple geometry was motivated by the desire to validate the model before applying it to real-world estuaries (see Section 8). Hence, the output of the program was compared to detailed measurements obtained from a laboratory tank, resulting in good agreement [7].
Typical simulations require of the order of 10^6 mesh points and several thousand time steps. On scalar computers this would be computationally prohibitive. Even on a modest vector processor, the Cray EL-98, the code required excessive computer time (days) for large problems. In this article we investigate the use of parallel processing for producing such simulations within practical time scales.

THE THREE-DIMENSIONAL SHALLOW-WATER METHOD
In the laboratory experiment described in the previous section, the water is initially stationary and the water level horizontal. The inlet flow rate is then increased with time as a quarter sinusoid and maintained after a specific time step at a constant value to represent a steady current. At the outlet boundary, the velocities u and v are given zero normal gradients and the water depth is fixed. At the two side walls, v and the normal gradients of u and η are set to zero.
The original formulation of the numerical scheme proposed by Casulli and Cheng [1] used a uniform mesh in the vertical direction, constant vertical and horizontal mixing coefficients, and the Chezy coefficient to give bed boundary conditions. In order to give an accurate representation of bed and water-surface conditions, Stansby and Lloyd [8] introduced the σ-coordinate system σ = (z - η)/(η - z0) for the vertical direction, defining the bed surface by its roughness height. This enables a turbulence model for the vertical direction to be incorporated: Stansby and Lloyd proposed a simple two-layer mixing-length model for rough turbulent flow. Furthermore, they introduced for horizontal mixing a mixing coefficient proportional to depth and friction velocity.
The finite-difference mesh used in the numerical computation is a staggered rectangular system with a "wet/dry" boundary crossing the horizontal mesh obliquely (giving wet and dry cells). This is not a severe limitation since velocities close to the shoreline with gently sloping beds tend to be quite small. The σ-coordinate system entails a fixed number of vertical cells at each horizontal mesh point. We will refer to the number of mesh points in each spatial direction by nx, ny, and nz, and to the corresponding coordinates by xi (i = 1, ..., nx), yj (j = 1, ..., ny), and zk (k = 1, ..., nz).
An important feature of the numerical scheme is the Lagrangian treatment of the convective terms. This avoids the need in conventional Eulerian schemes (e.g., TRISULA [10]) to generate stability through upwind differencing with some inevitable numerical viscosity. The terms involving the surface elevation gradient and vertical mixing are handled implicitly for stability, whereas the terms involving horizontal mixing are handled explicitly. The equations are solved as fully coupled in both horizontal directions, producing at each time step a pentadiagonal system of equations for the new values of η at each grid point in the horizontal plane. Schemes which involve uncoupling (alternating direction schemes) require smaller time steps to be used for equivalent accuracy.

THE APPLICATION PROGRAM: SW3D
In this section we describe the structure of a Fortran 77 program, SW3D, which implements the three-dimensional shallow-water method described in Section 2. The version of SW3D which forms the starting point for the parallelization process had previously been run on a Cray EL-98 system. The main computational effort of SW3D is contained within a subroutine called LXY, which is sketched in the pseudo code shown in Figure 2. We distinguish between actual array elements (written in typewriter font) and mathematical objects and operations (using standard notation).
For instance, A(i,j) denotes an nz × nz tridiagonal matrix which depends on the index pair (i, j), while u(i,j,k) represents the (i,j,k)-th element of the array storing the values of u. The vectors b1 and b2 in BU_CU and BV_CV are fixed, and n_ij in SETUP implies a numbering scheme for the rows of the pentadiagonal matrix P, whose dimension is nx·ny.
Most of the work in LXY is devoted to setting up the matrix P and the right-hand side r of the linear system Pe = r which is solved for the new surface elevation. For each time step the sequence of operations is as follows. First, code segments FU and FV evaluate, for every grid point, the finite-difference operator arising from the explicit terms for convection and horizontal mixing, and store the values into arrays fu and fv, respectively. Next, segments BU_CU and BV_CV each solve, for every (i, j), two tridiagonal linear systems of dimension nz: a matrix A(i,j) is set up from the values of u (respectively v), and the coefficients bu(i,j) = b1^T (A(i,j))^-1 b2 and cu(i,j) = b1^T (A(i,j))^-1 b(i,j) are computed, where b(i,j) is an nz-dimensional vector set up from fu (respectively fv; the analogous quantities for v are bv and cv). SETUP then computes, for every (i, j), the five non-zero entries of row n_ij of P from bu(i,j), bu(i+1,j), bv(i,j), and bv(i,j+1), and the corresponding element n_ij of r from cu(i,j), cu(i+1,j), cv(i,j), and cv(i,j+1). PENTA solves the pentadiagonal system Pe = r and stores the result in array e. COPY copies u into uold and v into vold. Finally, UPDATE-U and UPDATE-V solve, for every (i, j), a further tridiagonal system of dimension nz, set up from uold (respectively vold), fu (respectively fv), and e, whose solution gives the new values of u(i,j,k) (respectively v(i,j,k)) for k = 1, ..., nz.

[Figure 2: pseudo code for the main segments of subroutine LXY: FU, FV, BU_CU, BV_CV, SETUP, PENTA, COPY, UPDATE-U, UPDATE-V, and UPDATE-W.]

Figure 2 contains only the most computationally significant segments of LXY. Further code, such as the calculation of the new water depth using the new surface elevation or the flooding of dry points, is not included. This code will, however, not be neglected when analyzing the runtime.
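Each of these tridiagonal systems is small (of dimension nz) and independent of all the others, so it can be solved directly, for instance with the Thomas algorithm. The following sketch is illustrative only; the routine and its array names are not taken from SW3D, and the diagonals are assumed to allow elimination without pivoting.

c     Illustrative Thomas-algorithm solve of one nz x nz tridiagonal
c     system (not SW3D code).  sub = sub-diagonal, dia = diagonal,
c     sup = super-diagonal, rhs = right-hand side; dia and rhs are
c     overwritten, and rhs returns the solution.
      subroutine trisol(nz, sub, dia, sup, rhs)
      integer nz, k
      double precision sub(nz), dia(nz), sup(nz), rhs(nz), fac
c     forward elimination
      do 10 k = 2, nz
         fac    = sub(k) / dia(k-1)
         dia(k) = dia(k) - fac * sup(k-1)
         rhs(k) = rhs(k) - fac * rhs(k-1)
   10 continue
c     back substitution
      rhs(nz) = rhs(nz) / dia(nz)
      do 20 k = nz-1, 1, -1
         rhs(k) = (rhs(k) - sup(k) * rhs(k+1)) / dia(k)
   20 continue
      end

Because one such solve is required for every (i, j) column, and the solves do not interact, the horizontal plane is a natural unit of decomposition for the parallelization described later.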
To determine the memory requirements we introduce the following notation.

THE KENDALL SQUARE RESEARCH KSR-1
The KSR-1 is a virtual shared memory multiprocessor. The machine consists of processor-memory pairs (cells) arranged in a hierarchy of search groups, each group containing 32 cells. The virtual memory is implemented on the physically distributed memories by a combination of operating system software and hardware support through the KSR ALLCACHE search engine. The OS manages page migration and fault handling in units of 16 Kbyte. The ALLCACHE engine manages movement of 128-byte subpages within the system. Movement of subpages is therefore cheap compared to the movement of pages. The implementation described in this work is for the 64-cell, double search group KSR-1 installed at Manchester University. Each cell is a 20 MHz, super-scalar, RISC chip with a peak 64-bit floating-point performance of 40 Mflop/s (achieved with a multiply-add instruction) and 32 Mbyte of memory. Two instructions may be issued per cycle: the instruction pair consists of one load/store or I/O instruction and one floating-point or integer instruction. The cells in a single group are connected by a unidirectional slotted ring network with a bandwidth of 1 Gbyte/s. The two search groups of the Manchester machine are connected by a further unidirectional slotted ring network with a bandwidth of 4 Gbyte/s, to which up to 34 groups can be attached.
The ALLCACHE memory system is a directory-based system which supports full cache coherency in hardware. Data movement is request driven: a memory read operation which cannot be satisfied by a cell's own memory generates a request which traverses the hierarchy of rings and returns a copy of the data item to the requesting cell. A memory write request which cannot be satisfied by a cell's own memory results in that cell obtaining exclusive ownership of the data item, so the data item moves to the requesting cell. In the process, as the request traverses the memory system, all other copies of the data item are invalidated, thus maintaining cache coherence through an invalidate-on-write policy.
The machine has a Unix-compatible distributed operating system, the Mach-based OSF/1, allowing multiuser operation. The programming model supported is primarily that of program directives placed in the user code (Fortran 77 and, to some extent, C [5]). The directives may be placed manually or automatically (by a pre-processor, KAP). A run-time support system, PRESTO, and an underlying Posix-based threads model support the user directives. The run-time system and threads are also directly accessible through a standard library interface.

KSR-1 Memory Latencies
The KSR-1 processor has a level 1 cache, known as the subcache. The subcache is 0.5 Mbyte in size, split equally between instructions and data. The data subcache is two-way set associative with a random replacement policy. The cache line of the data subcache is 64 bytes (half a subpage).
This value has to be multiplied by a factor of 3 if the request is satisfied by a cell of the second ring. A request for data not currently cached in any cell's memory results in a traditional, high-latency page fault to disk.

Memory System Behavior: Alignment and Padding
In order for a thread to access data on a subpage, the page in which the subpage resides must be present in the cache of the processor on which the thread executes. If the page is not present, a page miss occurs and the operating system and ALLCACHE system combine to make the page present. If a new page causes an old page in the cache to be displaced, the old page is moved to the cache of another cell if possible. If no room can be found for the page in any cache, the page is displaced to disk. Moving a page to the cache of another cell is much cheaper than paging to disk.
Performance of applications on virtual memory systems can suffer from the phenomenon of false sharing: if two threads, running on different cells, request separate data items which reside on the same subpage, that subpage may continually thrash back and forth between cells. Most virtual memory systems have to contend with false sharing at the OS page level, which is typically several kilobytes in size. On the KSR-1 the unit of movement around the system is the relatively small 128-byte subpage. At this size, ensuring that data structures accessed by several threads do not cause thrashing can be achieved simply by ensuring that the structures are padded out to a subpage boundary and that they are aligned so as to begin on a subpage boundary. This is most simply achieved through suitable declaration of data structures, e.g., padding the inner dimension of multidimensional arrays.
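As a concrete illustration (with hypothetical names and sizes, not taken from SW3D), a subpage holds 16 double precision words, so two shared arrays can be kept on separate subpages by rounding their extents up to a multiple of 16:

c     Hypothetical example: pad two shared arrays to 128-byte
c     subpage multiples (16 double precision words) so that data
c     written by different threads cannot share a subpage.
      integer n, npad
      parameter (n = 100)
c     round the extent up to the next multiple of 16 words
      parameter (npad = ((n + 15)/16)*16)
      double precision a(npad), b(npad)
      common /shared/ a, b
c     provided the common block itself starts on a subpage
c     boundary, a and b each begin on a subpage boundary.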

SEQUENTIAL OPTIMIZATIONS
The original code consisted of nearly 1,000 lines of Fortran code, handling PENTA through ITPACK [3], a 9,000-line Fortran package which offers seven iterative methods to solve sparse linear systems with symmetric positive definite or mildly nonsymmetric coefficient matrices. The Jacobi conjugate gradient (JCG) method was chosen because of its convergence properties. As proposed by Casulli and Cheng [1], the code was developed for vector processors, running initially on a Cray EL-98 with an optimized version of ITPACK. The code was transferred to the KSR-1 and compiled without any change. We always used the highest optimization level of the compiler (the -O2 option). Furthermore, ITPACK was compiled on the KSR-1 with the -r8 option; otherwise all floating-point variables, which are declared as DOUBLE PRECISION, would be handled as 128-bit values.
It is important to note that one cell does not have enough memory to cope with the required 80 Mbyte, causing a considerable amount of data to be placed in the memory of other cells. Hence, the sequential program suffers communication overhead since it has to perform some remote data accesses. Analysis of the code led to the following optimizations.
1. Reducing memory requirements: From Figure 2 we can see that the most natural loop orders are ijk (i.e., i outermost, k innermost) or jik, where i runs over the x dimension, j over the y dimension, and k over the z dimension. Because the algorithm is applied to shallow-water problems, the index space of k is much smaller than that of i or j. Casulli and Cheng [1] suggest that the proposed algorithm is suited for vectorization; efficient vectorization would require flipping the loop order to make the innermost loops the longest, i.e., kij or kji order. This "unnatural" loop ordering was implemented in the original code only in FU and FV; the remaining computations used loop order ijk. As the Cray vectorizing compiler reported that the loop bodies in FU and FV were too long to vectorize, the loop bodies were split in two. This involved storing intermediate data in six arrays of type (nx, ny, nz). By reversing this splitting we avoided the temporary arrays, reducing the number of three-dimensional arrays to seven and the total memory required to around 43 Mbyte. This in turn reduced the number of remote accesses.
2. Avoiding bad stride: All loops over i, j, and k were converted to ijk order. Since Fortran arrays are stored column-wise, the seven three-dimensional arrays were declared of type (nz, ny, nx), thus achieving a correlation between loop nest order and the layout of the arrays in memory. This is vital for achieving a high rate of data reuse in a hierarchical memory system. A minor side effect of k being the innermost array index is that the solution of the tridiagonal systems in UPDATE-U and UPDATE-V can be stored directly into u and v, respectively, rather than having to use an intermediate vector.
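To make the stride argument concrete, the fragment below (with illustrative names, not the SW3D code) uses the (nz, ny, nx) layout, in which the innermost k loop walks through consecutive memory words; with the original (nx, ny, nz) layout and the same ijk loop order, successive k values would be nx*ny words apart.

c     Stride illustration (names are hypothetical, not SW3D's).
c     Fortran stores arrays column-wise, so with u declared as
c     u(nz,ny,nx) the reference u(k,j,i) has stride 1 in k, whereas
c     with u(nx,ny,nz) the reference u(i,j,k) has stride nx*ny in k.
      subroutine zero3d(nx, ny, nz, u)
      integer nx, ny, nz, i, j, k
      double precision u(nz, ny, nx)
      do 30 i = 1, nx
         do 20 j = 1, ny
            do 10 k = 1, nz
               u(k, j, i) = 0.d0
   10       continue
   20    continue
   30 continue
      end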

3. Stripping ITPACK:
As a first step, the path followed by the JCG call through the library routines was identified and isolated: almost 7,500 lines of unnecessary code were deleted. Furthermore, the routines SCAL and UNSCAL were modified. ITPACK calls the former before the first iteration to scale the matrix, the right-hand side, and the initial solution. After convergence, the scaling is reversed. In LXY the unscaling of the matrix and right-hand side is not necessary since they are not used after PENTA. Hence UNSCAL was reduced to a single loop, which was inlined, to unscale the solution. The modification of SCAL was motivated retroactively by the necessity to parallelize ITPACK. The matrix is scaled by ITPACK such that all diagonal elements have the value 1. To perform the unscaling, the original diagonal elements are stored at the beginning of the one-dimensional array containing all nonzero elements of the sparse matrix. This implies shifting the off-diagonal elements, an operation that is inherently sequential. Therefore, the sparse matrix structure is constructed accordingly in SETUP, i.e., the diagonal elements are stored at the beginning and the rest afterwards. The consequences for SCAL are the avoidance of the shifting and the simplification of the search for the diagonal elements.
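A minimal sketch of the inlined unscaling loop is given below, assuming the usual symmetric scaling D^(-1/2) A D^(-1/2) (consistent with unit diagonal elements after scaling) and assuming the original diagonal entries are kept in the first N positions of the coefficient array; this illustrates the idea only and is not the actual ITPACK or SW3D code.

c     Illustrative unscaling of the solution of the symmetrically
c     scaled system; coef(1..n) is assumed to hold the original
c     (positive) diagonal entries.  Not actual ITPACK/SW3D code.
      subroutine unscl(n, coef, e)
      integer n, i
      double precision coef(n), e(n)
      do 10 i = 1, n
         e(i) = e(i) / sqrt(coef(i))
   10 continue
      end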
The correctness of these code transformations was confirmed in separate runs by dumping test data to a file after each time step and comparing them with the values from the original version of the code. A new version was only accepted if the files were identical.
The effort invested in these sequential changes has a significant payoff: the elapsed time for five time steps was reduced from nearly 2,730 seconds in the original code to about 600 seconds. Nearly 86% of this enhancement is a result of the avoidance of bad stride by declaring the three-dimensional arrays as (nz, ny, nx). On a vector processor (on which the code was developed) stride has little impact, since the memory on such an architecture is basically "flat". In a hierarchically structured memory, however, ensuring maximum reuse of data is vital to obtain efficient code. The reduction of data and the stripping of ITPACK resulted in 13% and 1% improvements in execution time, respectively. The total amount of work invested in these optimizations, including the time required to become acquainted with the algorithm and the code, was about 7 person-days (we consider 1 person-day to be 8 hours of dedicated work).

PARALLELIZATION
The version containing all sequential optimizations proposed in Section 5 was the starting point for parallelization. Table 1 shows in detail the contribution of the different parts to the total runtime of LXY; computations of similar structure have been grouped together since they can be parallelized in a similar way. REST accounts for all minor computations scattered throughout LXY, including COPY.
LXY was parallelized stepwise, the sequence, reflected in the order of the following subsections, being determined by the magnitude of the execution times given in Table 1. The only exception was PENTA, which was left to the end because the loops in ITPACK have a different structure to all the others in LXY. The parallelization strategy for loops not in PENTA is already implied in Figure 2.
The obvious and successful approach is to split the x,y plane evenly among all threads and let each one work independently on its portion of the plane. This is achieved by tiling the loops over i and j, thus:

c*ksr* tile (i, j, strategy=slice, private=(k))

Here the index space of i and j is partitioned by PRESTO (the KSR run-time system) into contiguous tiles which are distributed between all threads such that each gets exactly one (slice strategy). All variables are shared (i.e., just one copy exists which can be accessed by all threads) except the tiled index variables and those explicitly listed as private. A detailed analysis showed that the proposed approach is indeed valid. The shared data consist basically of all arrays of size at least nx × ny (e.g., fu, v, and e). Tiling of the loops results in correct execution since only the thread "owning" an index pair (i, j) updates the corresponding element of any shared array, and the tile statements impose the necessary synchronization points which prevent threads from starting the execution of a subsequent loop nest until all other threads have completed the current loop nest.
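As an illustration, the COPY segment (which saves u into uold and v into vold) would be tiled in this way. The loop nest below is a sketch rather than the actual SW3D code, and any closing directive required by KSR Fortran is omitted.

c     Sketch of a tiled loop nest: each thread executes the body
c     only for the (i,j) pairs in its own tile; k is private.
c*ksr* tile (i, j, strategy=slice, private=(k))
      do 30 i = 1, nx
         do 20 j = 1, ny
            do 10 k = 1, nz
               uold(k, j, i) = u(k, j, i)
               vold(k, j, i) = v(k, j, i)
   10       continue
   20    continue
   30 continue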
Note that the island is mapped onto the threads depending on the chosen partitioning of the x,y plane. This could lead, for some partitionings, to load imbalance, since no computation is performed on dry grid points (see Section 3). Because the size of the island is small compared to the estuary, the load imbalance is negligible. For a complex estuary with irregular wet/dry boundaries a more sophisticated load-balancing strategy is required (see Section 8).
Figure 2 shows that some communication of data between threads is necessary in SETUP, caused when a thread owning index pair (i, j) but not (i+1, j) accesses, for instance, bu(i+1, j). The set up of the tridiagonal matrices (and, to a lesser extent, some of the right-hand sides) as well as the computation in UPDATE-W cause similar, regular communication patterns. In FU and FV, however, the communication pattern depends on the data and is therefore unpredictable and irregular (see Section 6.1). The situation in PENTA will be described in Section 6.6.
The performance results presented in Section 7 confirm the validity of this approach. The following subsections report insights and experiences gained during the process of parallelizing the various segments of LXY. This parallelization process required approximately 10 person-days.

FU and FV
The main difficulty encountered in adding the tile directives (to these and all other loops) was the identification of the private variables. Having done this for FU and FV, we discovered, using PRESTO information, a significant amount of load imbalance caused by an uneven assignment of indices to threads. We therefore decided to take manual control of the size and distribution of tiles in order to improve load balance. In Section 7 we give more details and report on the results obtained.
We would like to stress the ease of parallelizing FU and FV on a virtual shared memory architecture like the KSR-1. As we have already mentioned, the communication pattern in these steps is unpredictable: for each (i, j, k) we need the velocity at that grid point (xi, yj, zk) and at the point (xi - a, yj - b, zk - c), where a, b, and c depend on the actual values of u(i,j,k), v(i,j,k), and w(i,j,k). Since (xi - a, yj - b, zk - c) is usually not a grid point, its velocity is obtained by interpolating the velocities of the eight cell corners containing it (the two-dimensional analog is shown in Fig. 3).
On a message-passing architecture each process would have to find out where the information concerning (xi - a, yj - b, zk - c) is stored, send a message to the corresponding process, and wait for the data to arrive. Note that this protocol is complicated by the fact that the "owner" of (xi - a, yj - b, zk - c) does not know who is going to contact him, or when. Alternatively, the processes could exchange "halo" data, but significant amounts of the communicated data would be unused. On the KSR-1, the remote accesses to array elements at (xi - a, yj - b, zk - c) are automatically handled by the ALLCACHE memory system.
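The following two-dimensional sketch (with hypothetical names; SW3D works in three dimensions and in sigma coordinates, and handles boundaries) shows why the accessed elements depend on the flow field: the indices of the cell containing the departure point, and hence the array elements read, are computed from the back-tracked coordinates.

c     2-D analogue of the Lagrangian interpolation (illustrative
c     only): locate the cell containing the departure point
c     (xd, yd), assuming grid points at x(i) = (i-1)*dx and
c     y(j) = (j-1)*dy, and interpolate u bilinearly from the four
c     surrounding corners.  Boundary checks are omitted.
      subroutine uintrp(nx, ny, dx, dy, u, xd, yd, ud)
      integer nx, ny, i, j
      double precision dx, dy, u(ny, nx), xd, yd, ud, s, t
      i = int(xd / dx) + 1
      j = int(yd / dy) + 1
c     local coordinates of the departure point within the cell
      s = xd / dx - dble(i - 1)
      t = yd / dy - dble(j - 1)
      ud = (1.d0 - s) * (1.d0 - t) * u(j,   i  )
     &   +         s  * (1.d0 - t) * u(j,   i+1)
     &   + (1.d0 - s) *         t  * u(j+1, i  )
     &   +         s  *         t  * u(j+1, i+1)
      end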

BU_CU, BV_CV, UPDATE-U, UPDATE-V
Since the tridiagonal systems in these four segments are solved in parallel, each thread needs its own copy of the coefficient and right-hand side arrays. Because KSR Fortran does not support private arrays, a technique called array expansion had to be applied. Hereby, an array gains an additional dimension whose extent is the number of threads, so that thread i (i = 1, ..., p) uses the memory locations starting at index (1, ..., 1, i).
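Schematically (with illustrative sizes), the nz × 3 coefficient array of a single tridiagonal solve gains a trailing thread dimension:

c     Array expansion (sizes are illustrative): each of the p
c     threads uses its own slice of the expanded array.
      integer p, nz
      parameter (p = 32, nz = 20)
      double precision a(nz, 3, p)
c     thread number it (1 <= it <= p) works only on a(*, *, it)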
Having done this, the measured run-time of the parallel version was disappointing. Using GIST, a tool for logging and visualizing events, serious load imbalance was detected.
Information from PRESTO revealed that some loops over i run from 1 to nx and others from 2 to nx. Tiling the former causes the first thread to start with index 1 and finish with some m1, while in the latter case the same thread handles the range 2, ..., m1 + 1. Hence data locality is not preserved, a situation which can lead to a significant number of remote data accesses. This was avoided by enclosing all loops in LXY in a KSR Fortran affinity region [4], which ensures that for different tiled loops the same values of the indices are scheduled to the same threads, even though the loop bounds may be different. Although this measure did not solve the load imbalance problem and showed almost no performance benefit, we retained it to ease experimenting with different tile sizes and strategies.
Finally we turned to PMON, a tool that gathers hardware monitoring statistics. This information showed that some threads had an extremely high rate of cache subpage misses, that is, they were accessing a large number of data items located on other cells. Since this did not occur in BU_CU or BV_CV, we searched for differences in the code which might account for this, and identified the "privatized" arrays as the source of the problem. The expansion of arrays is an example where the effect of false sharing (see Section 4.2) can be severe. For instance, the elements A(nz, 3, i) and A(1, 1, i+1) lie in consecutive memory locations, the first being written to by thread i, the second by thread i+1. If these two 8-byte data items happen to be on the same 128-byte subpage, this subpage is moved back and forth, causing unnecessary communication. The smaller the original array is (in our case it has only 3nz elements), the higher the degree of false sharing in the expanded array. By padding all expanded arrays to subpage boundaries we eliminated this effect and achieved much better load balancing.
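The fix can be sketched as follows (illustrative sizes, not the SW3D declarations): the per-thread block of 3*nz words is rounded up to a multiple of 16 double precision words, so that blocks belonging to different threads never share a 128-byte subpage.

c     Padding an expanded array to subpage multiples so that each
c     thread's block occupies an integral number of 128-byte
c     subpages.  Sizes and names are illustrative only.
      integer p, nz, nwrds, npad
      parameter (p = 32, nz = 20)
c     words per thread, rounded up to a multiple of 16
      parameter (nwrds = 3*nz, npad = ((nwrds + 15)/16)*16)
      double precision a(npad, p)
c     thread it stores its nz x 3 coefficients in a(1..nwrds, it);
c     the remaining npad - nwrds words are padding.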

UPDATE-W
After the experience gained in the previous steps, tiling this loop, including identifying the private variables and expanding some arrays, was trivial.

REST
The REST segments consist mainly of smaller loops scattered throughout LXY. Although they account for very little of the sequential execution time, it was important to parallelize them, as otherwise significant data movement would occur. Most of these loops are similar to COPY and were tiled trivially. Some other loops cover only the boundaries of the domain. The bodies of these loops should ideally be performed by the threads that own the corresponding x, y index pairs. However, this is not easy to achieve on the KSR-1, and since the performance gain would not justify the effort, we did not parallelize them. At the end of LXY the flooding and drying of cells in the horizontal plane is handled. Introducing parallelism in these final loops would result in several threads writing to the same memory location, making the use of locks or critical regions necessary, and again any performance gain would not justify the effort involved.

SETUP
Due to the steps undertaken when optimizing the sequential code, it was straightforward to set up the pentadiagonal matrix in parallel maintaining the correct (sequential) order of the rows.

PENTA
In the optimized sequential version the call of the JCG routine in ITPACK took only 6.7% of the total time, but this proportion increased as the parallelization steps progressed. Eventually, having carried out the parallelization of all other segments, about half of the elapsed time (using 16 cells) was spent in PENTA.
ITPACK handles vector operations through level 1 BLAS-like routines, while matrix-vector multiplications are adapted to the structure of the data type containing the sparse matrix. These subroutines contain a single loop of length N (where N is the dimension of the linear system, in our case N = nx·ny) as opposed to the two outermost loops of length nx and ny encountered in the previous sections. Therefore we tiled the ITPACK loops, specifying that they should not be part of the enclosing affinity region.
As a consequence, some data movement will occur at the beginning and the end of the iterative procedure. After the first iteration, most data are local and do not move to other threads. Communication takes place in each iteration due to the scatter and gather of vectors in GAXPY-like operations (Ax + b) with a sparse matrix A, and the reduction phase in the parallel execution of dot products.
1\"ote that the influence of rounding errors can change the result of parallelized floatinf(-point vector sums.Therefore we maintai1wd one sequential and one parallel version of the dot product.The former was used to check the correctness of all changes (as mentioned in Section 5 ). the latter for run-time mea~urements.
Finally, it is important to note that we are coping with some load imbalance in the parallelized version of JCG. Tile sizes produced by PRESTO are by default a multiple of 16 (see Section 4.3). Changing PRESTO's default (for instance to a multiple of one) leads to better load balance, but results in false sharing. Our experiments showed that in this trade-off between load imbalance and false sharing, the former caused the smaller performance penalty. This can be explained by noting that most operations in JCG are level 1 BLAS-like routines, where the ratio of accessed data to computation is high. We therefore maintained the original PRESTO default value.
Note that nz × ny × 8 bytes = 16 Kbyte, which is the size of a page. Thus, the access, for instance, of u(k0, j0, i0) will cause the page containing the elements u(k, j, i0) (k = 1, ..., nz; j = 1, ..., ny) to become resident on the requesting cell. With the modulo tiling strategy we can expect that each thread will use almost every value of i. Hence almost every page is requested by every thread, even though only a few of its subpages are actually used by any one thread. Note that this false sharing of pages is different from the false sharing of subpages encountered in Section 6.2. To avoid this problem we let each thread work on exactly one tile of size ⌈nx/p⌉ × ny (equivalent to tiling over i in a slice fashion), so that each of the p threads requires only a pth of all pages and accesses all subpages within them. This does, however, result in some load imbalance (see below).
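The slice tile boundaries amount to the following simple computation (a sketch; the thread number it and the variable names are hypothetical, since on the KSR-1 the tiling is in fact expressed through the tile directive rather than explicit bounds):

c     Slice tiling over i: thread it (counted from 0) handles a
c     contiguous block of ceil(nx/p) values of i, so it touches
c     only its own share of the 16 Kbyte pages.  Illustrative only.
      integer nx, p, it, its, ilo, ihi
c     tile size: nx/p rounded up
      its = (nx + p - 1) / p
      ilo = it*its + 1
      ihi = min(nx, (it + 1)*its)
c     thread it then loops over i = ilo, ..., ihi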
PERFORMANCE RESULTS

Unless stated otherwise, we ensured that all the cells used in a run were on the same ring.
Note the discrepancy between the sequential time and the one-cell parallel time. This can be explained by the increase in memory requirements caused by array expansion. Note also that the four-cell time is less than half that on two cells. When using one or two cells, there is not enough memory to hold all of SW3D's data, and it is necessary for the operating system to place some data in other cells' memories. With four or more cells, however, there is enough memory, and the number of remote accesses is considerably reduced. We also see that the use of more than 48 cells gives almost no further reduction of the run-time.
Figure 4 shows the simulation performance in time steps per second. The naive ideal performance is the reciprocal of the naive ideal time, which is computed by simply dividing the execution time of the optimized sequential code by the number of cells. To check that extrapolation of our results to long runs is indeed valid, we ran 6,000 time steps on 32 cells (using both rings). This simulation took 7 hours 47 minutes, which corresponds to 0.21 time steps per second, the same value as we obtained from measuring five time steps.

We have performed an analysis of the parallel overheads in order to identify the major factors causing the discrepancy between the naive ideal and the actual performance. We define the total overhead as the difference between the actual measured time and the naive ideal time. We then apportion the total overhead into four categories as follows:

1. Unparallelized code. This is the overhead incurred due to the parallel version containing sections of sequential code.
2. Load imbalance. This is the overhead that results from processors having to wait at a synchronization point for other processors to finish their parallel tasks.
3. Synchronization and tile scheduling. This is the overhead incurred in starting and finishing each parallel loop.
4. Remote memory accesses. This is the time spent communicating data between cells.

Note that this analysis is somewhat complicated by the fact that the sequential version makes a substantial number of remote memory accesses, because there is too much data to fit in the memory of a single cell. This makes the naive ideal time somewhat pessimistic and results in negative values of memory access overhead. In such a situation the time spent by the parallel version in making remote data accesses is a useful additional statistic, as it gives a better impression of the performance loss resulting from communication of data between processors.

The results of this analysis for 16, 32, and 48 cells are displayed in Table 3. The analysis shows that the total overheads increase as the number of cells increases, such that on 48 processors the overhead accounts for nearly 30% of the measured execution time. This suggests that with the present parallelization strategy it is unlikely that the execution time could be reduced below 3 seconds per time step for a problem of this size, no matter how many processors are used.
It is clear that load imbalance is the most significant source of overhead. About half of this load imbalance can be attributed to the uneven assignment of indices to threads, since 16, 32, and 48 are not divisors of nx. For example, using 32 processors the tile size is 11, which results in two processors being idle. This could be ameliorated by transforming all doubly nested loops over i and j into one loop from 1 to nx × ny. The remaining load imbalance cannot be explained by uneven assignment of indices to threads in PENTA (see Section 6.6). Using hardware monitoring information we discovered that in many of the tiled loops some threads are stalled waiting for data almost twice as long as others, although all have the same workload. This additional stalling is not caused by remote accesses but by subcache misses. The implication of this is that for certain values of i and j there is considerably more overhashing (and hence displacement) of subcache lines than for others. The precise cause of this is currently not clear, but we believe it may be a side effect of having a number of large arrays with a variety of sizes. More research is needed to understand and overcome this overhead source, since it causes significant degradation of performance with increasing numbers of cells.
The next most important source of overhead is the unparallelized code. Most of this code is concerned with setting boundary conditions, and again with some more effort it may be possible to parallelize some of these sections, although, of course, doing so may increase the overheads from other sources. There is little that can be done to reduce the cost of synchronization and tile scheduling. In each time step around 200 parallel loops are executed (there is some variation depending on the number of steps required in the conjugate gradient solver). Experiments have shown that each parallel loop incurs a synchronization and scheduling overhead of 1 ms for small numbers of cells, rising to 1.4 ms on 56 cells. Finally, we note that remote data accesses are the least significant source of performance loss; hence there is no point in attempting to reduce the number of remote accesses before the other, more significant, sources of overhead have been addressed.

FUTURE WORK
After the validation of the code with a simple geometry, we intend to apply it to a real-world estuary, Bideford Bay (southwest United Kingdom), which is used as a benchmark to allow a standardized approach to the testing and comparison of modeling software used for hydrodynamic and bacterial dispersion modeling [11]. This will require a more sophisticated load-balancing strategy in order to cope efficiently with highly irregular wet/dry boundaries which may change markedly with time. One possible strategy consists of allowing the boundaries between the partitions of the horizontal plane to change as the simulation proceeds, in such a way as to equalize the time spent by each processor [2].
Section 7 showed that further analysis is necessary to understand and overcome some overhead sources, particularly the load imbalance and the remaining unparallelized code. Note that the load imbalance problems should be solved by a dynamic load-balancing strategy. It would also be interesting to verify the parallelization strategy on other virtual shared memory architectures, such as the Cray T3D and Convex Exemplar.
There are several planned enhancements to the computational model, including pollution transport, sediment transport, and plume dispersion. A more sophisticated turbulence model involving the transport of Reynolds stresses directly is also planned, and algorithms for the biological and chemical behavior of pollutants are desirable. Since the governing equations for each of these additions are of essentially similar form to those studied here, and are assumed to be largely uncoupled from the hydrodynamic equations, we can expect to apply the same algorithmic structure and parallelization strategy.
We also envisage some enhancements to the numerical scheme, including the avoidance of the time step limitation due to the explicit treatment of horizontal mixing. This could be achieved by an implicit treatment, possibly as a fractional step process. A greater challenge, especially for parallelization, will arise from the introduction of adaptive mesh refinement. For example, we might adopt a strategy where mesh sizes are successively halved in proportion to the inverse of water depth and spatial flow gradients, e.g., vorticity [9].

CONCLUSIONS
We have described the parallelization of a three-dimensional shallow-water estuary model on the Kendall Square Research KSR-1. Although the semi-implicit Lagrangian scheme was initially described as an algorithm well suited for vectorization [1], we have found that its parallelization is natural and easy to perform, resulting in exceptionally efficient execution.
Recall that the time stepping solution process revolves around the solution of a pentadiagonal system of equations describing the evolution of the surface elevation. This system of equations can itself be solved in parallel, but parallelism can also be exploited in all other major computation segments, such as the set up of the matrix and right-hand side coefficients for the system describing the new surface elevation. These matrix coefficients and right-hand side terms result from the solution of a set of independent tridiagonal systems of equations, one at each grid point in the horizontal plane. The parallel algorithm partitions the horizontal plane equally between threads, each one setting up and solving a group of independent tridiagonal systems. This partitioning approach is used for all other code segments, except for the conjugate gradient solver itself.
In practical terms, we have demonstrated that a simulation which would require several days of CPU time on a powerful workstation or a modest vector processor can be run overnight on 32 cells of a KSR-1. We have also found that the development process, consisting of sequential optimizations followed by an incremental parallelization strategy, has given very good performance without an excessive amount of programmer effort. We have performed an analysis of the sources of overhead in the parallel version of the code, which has allowed us to identify the aspects of the parallelization strategy most in need of attention should it prove desirable to further reduce the run-time by using more processors.

Table 1 .
Elapsed Times in Seconds for Five Time Steps for the Optimized Sequential Version of LXY

Table 3 .
Overhead Analysis per Time Step