Investigating Solution Convergence in a Global Ocean Model Using a 2048-Processor Cluster of Distributed Shared Memory Machines

Up to 1920 processors of a cluster of distributed shared memory machines at the NASA Ames Research Center are being used to simulate ocean circulation globally at horizontal resolutions of 1/4, 1/8, and 1/16-degree with the Massachusetts Institute of Technology General Circulation Model, a finite volume code that can scale to large numbers of processors. The study aims to understand physical processes responsible for skill improvements as resolution is increased and to gain insight into what resolution is sufficient for particular purposes. This paper focuses on the computational aspects of reaching the technical objective of efficiently performing these global eddy-resolving ocean simulations. At 1/16-degree resolution the model grid contains 1.2 billion cells. At this resolution it is possible to simulate approximately one month of ocean dynamics in about 17 hours of wallclock time with a model timestep of two minutes on a cluster of four 512-way NUMA Altix systems. The Altix systems’ large main memory and I/O subsystems allow computation and disk storage of rich sets of diagnostics during each integration, supporting the scientific objective to develop a better understanding of global ocean circulation model solution convergence as model resolution is increased.


Introduction
In this article we describe technical aspects of global ocean model configurations with resolutions up to 1/16 • (≈ 5 km) that exploit a testbed 2048 Itanium-2 processor SGI Altix system at the NASA Ames Research Center.The size of this system makes possible global ocean simulations at resolutions that until now have been impractical.The calculations we describe employ an Adams-Bashforth time-stepping procedure on a finite volume grid with up to 1.25 billion grid cells.This workload is spread evenly over 1920 of the Altix processors so that each individual processor is respon-sible for simulating around 586,000 grid cells (corresponding to a surface region roughly 210 × 210 km in extent).Decomposing the workload over this many processors yields a setup that, with extensive diagnostics and analysis options included, uses about 870 MB of main memory per processor and can integrate forward at a rate of around 5 µs/timestep/gridcell.With a timestep of two minutes this performance allows a year of simulation to be completed in under ten days.
The model configurations we employ are significant in that, at the resolutions the Altix system makes possible, numerical ocean simulations begin to truly represent the key dynamical process of oceanic meso-scale turbulence.Meso-scale turbulence in the ocean is, in some ways, the analog of synoptic weather fronts in the atmosphere.However, because of the density characteristics of seawater, the length scale of turbulent eddy ISSN 1058-9244/07/$17.00 2007 -IOS Press and the authors.All rights reserved Fig. 1.Atmosphere and ocean eddies.Eddies occur in both the atmosphere and ocean but with vastly different length scales.The left panel shows a polar view of the atmospheric jet stream meandering as it circulates the northern hemisphere.The length scales of the meanders are several thousand kilometers.The right panel shows satellite imagery of Gulf Stream meanders off the U.S. East Coast.In places the length scales of the meanders are as small as ten kilometers.
phenomena in the ocean is around ten or less kilometers.In contrast, in the atmosphere, where the same dynamical process occurs, it has length scales of thousands of kilometers -see Fig. 1.Although it has been possible to resolve ocean eddy processes well in regional ocean simulations [12] and process studies [7,14,28] for some time, global scale simulations that resolve or partially resolve the ocean's energetic eddy field are still rare [17,23] because of the immense computational challenge they represent.
Physically ocean meso-scale eddies contribute to regulating large-scale ocean transports of momentum, heat and fresh water, all of which play an important role in creating the Earth's climate.Eddies also interact, non-linearly, with the larger scale general circulation thereby modifying an ocean simulation's mean state and altering its transient response to perturbations.Through these processes, and through the processes by which ocean energy is dissipated at basin margins, meso-scale eddies play a significant, complex, and nonlinear role in climate.Process studies [14] suggest that resolutions of a few kilometers are needed to adequately resolve eddy processes.Consequently ocean modelers are continually on the look out for technological means to increase the resolution of global-scale integrations in order to converge on an adequate representation of these non-linear behaviors.The work we describe here creates the technological machinery for examining how much resolution is needed for eddies, and their impacts, to be accurately represented in a global ocean model.
The article is organized as follows.We first review prior work in Section 2. A description of the underlying mathematical and numerical basis of our model together with a discussion of its computational formulation is in Section 3. Section 4 provides details on how the 2048 shared memory Altix cluster is configured, looks at how our parallel algorithm maps to the system, and describes the resultant performance.In Sections 5 and 6, we summarize the prospects for using a large Altix system for optimal representation of eddy terms in a global ocean model.

Prior work
Eddying calculations first started to appear around 1990 [2,4].These initial calculations were facilitated by the availability of vector machines such as the Cray YMP with sustained memory bandwidths of several gigabytes/second per processor.Global eddying calculations were first carried out [30] around this time too.During this period computational resources supported simulations that permitted eddies, but did not allow models to resolve the scales shown in Fig. 1.More recently [12,25,29,31] have undertaken large-scale regional simulations that are termed eddy-resolving in that they resolve the first baroclinic Rossby radius of deformation in the ocean [5], a physical estimate of a typical ocean eddy size.In these regional studies authors find that the transition from eddy-permitting to eddy-resolving produces qualitative improvements in the fit between simulations and observations.Following these and other studies [17] reported on a global 1 10 • simulation on an IBM Power 3 system and [23] described a global 1   10   • simulation on the Earth Simulator in Japan.These studies further reinforce the notion that resolving eddy processes provides a significant boost in model skill.In this paper we report on the computational aspects of a study that uses a cluster of four 512way NUMA Altix systems for global eddy resolving simulations at up to 1   16   • resolution.The study is aimed at developing a clearer understanding of processes underlying skill improvements that eddy resolving models show.To develop this understanding we perform a series of heavily instrumented simulations that produce more than a terabyte of output for each year simulated and that repeatedly process more than one billion grid cells in a time-stepping procedure.In doing this we gain insights into the performance of a clustered Altix system with 2048 CPUs for ocean simulation.

Algorithm
The Masschusetts Institute of Technology General Circulation Model (MITgcm) algorithm is rooted in the incompressible form of the Navier-Stokes equations for fluid motion in a rotating frame of reference [11,20,21].The model supports different numerical configurations including hydrostatic and non-hydrostatic simulations.A generic, height like, vertical coordinate allows the fluid equations to be solved in either height or pressure coordinates, making the code amenable to both ocean and atmospheric simulations [9,19].Adjoint forms of the code are automatically maintained [8] supporting advanced assimilation [32] and dynamical analysis [10,18].The configuration used here is hydrostatic and the underlying continuous equations can be written as partial differential equations, in an Eulerian form, that govern the three-dimensional temporal and spatial evolution of velocity u, potential temperature θ, salinity s and pressure gρ ref η + p hyd within the fluid In Eq. ( 1) gρ ref η and p hyd are the surface pressure (due to fluid surface elevation η) and the internal hydrostatic pressure (due to the mass of the fluid above a particular depth), respectively.The subscript h denotes operations in the horizontal plane.Hydrostatic pressure relates to fluid potential temperature, θ, salinity, s and depth z according to the function ρ in (4) multiplied by gravity g.The term G u h in Eq. ( 1) incorporates sources and sinks of momentum, due, for example, to surface winds or bottom drag together with terms that represent coriolis forces and the transport of momentum by the fluid flow.Equation ( 2) captures the time evolution of the ocean heat content, θ, and salt, s and tracers, λ, respectively.Right-hand-side terms, G θ,s,λ , include source terms due to surface fluxes and transport by the fluid flow.
To evaluate Eq. ( 1) we require knowledge of the pressure field terms in the fluid, gρ ref η + p hyd .The hydrostatic part of the pressure field can be found by integrating the fluid density relation ρ(θ, s, z) down from the surface, where p hyd = 0.The surface pressure is the product of the fluid free surface elevation η multiplied by gravity g and a reference density ρ ref .Invoking the vertically integrated form of the continuity condition for an incompressible fluid with a free surface, ∇ • u + ∂η ∂t = 0, yields a diagnostic relation Eq. (3) for η.Here φ indicates a vertical integral from the bottom of the ocean to the surface.The vertical velocity w is found by integrating Eq. ( 5) vertically from the bottom of the ocean.
Equations ( 1) and ( 2) are stepped forward explicitly in time using an Adams-Bashforth procedure that is second order accurate.The equations are discretized in space using a finite volume based technique [1] yielding a solution procedure that requires at each time step explicitly evaluated local finite volume computations and an implicit two-dimensional elliptic inversion.
To solve Eqs (1)-( 5) numerically we write them in terms of discrete time levels n, n where γ = θ, s, λ (7) The n+ 1 2 terms in Eqs ( 6)- (10) denote centered in time values that are explicitly evaluated from known values at time levels n and n − 1 using Adams-Bashforth extrapolation, . Time levels are separated by a dimensional time ∆t.The computational algorithm then proceeds as a series of n final − n initial time steps according to algortihm 6.
Algorithm 6 shapes the computational formulation of MITgcm.Lines 2: and 5: of the algorithm involve integrals along the vertical axis of the discrete domain.Typically these integrals contain dependencies that prevent efficient parallel decomposition along this axis.Accordingly the parallel formulation takes a global finite volume domain with N x × N y × N z cells in the two horizontal directions, x and y, and the vertical direction, z, respectively, and decomposes it into N sx × N sy sub-domains each of size O y values are overlap region finite volume cells that are added to the boundaries of the subdomains to hold replicated data from neighboring subdomains.The values of O x and O y are chosen such that the explicit steps 2: and 4: in algorithm 6 can be evaluated locally by each subdomain independent of others.Each computational process integrating forward the MITgcm is then given a static set of one or more subdomains.The process is then responsible for evaluating all the terms required by algorithm 6 for its subdomains and for cooperating with other processes to maintain coherent data in the overlap region finite volume cells.This yields a decomposed computational procedure outlined in algorithm 6.In algorithm 6 a single time-step is split into a series of compute, exchange, and sum phases.Compute phases contain only local computation (predominantly arithmetic and associated memory loads and stores) and I/O operations.Performance of compute phases is sensitive to the volume of I/O and computation involved, local CPU and memory capabilities of the underlying system, and to the system I/O capacity.Exchange phases involve point-to-point communication between neighbor processes.Their performance hinges on the performance of the systems interconnect and inter-process communication software stack.Sum phases involve all subdomains collectively combining locally calculated 8-byte floating point values to yield a single global sum.The sum phases are sensitive to how system performance for collective communication scales with processor count.

Running on the Altix system
The system we are using consists of four interconnected 512 processor SGI Altix machines each with a separate OS image.Each Altix system uses 1.6 GHz Itanium 2 processors arranged in pairs and is interconnected with SGI's BX2 NUMALink 4 technology through a front side bus interface.Within each 512way system NUMAlink supports non-uniform access global shared memory [33] through memory coherence technology rooted in the distributed shared memory approaches of the Stanford DASH project [16].The network fabric within an Altix system is a fat-tree arrangement [3,15] based on 8-port router chips.This provides for 64 bisection links on a 512-way system giving a total bisection bandwidth of 400TB/s.Each CPU has memory banks populated with 2 GB of 166MHz DDR memory, so that the core memory of a 512-way unit is 1TB.On the Stream benchmark [24] a 512-way system achieves a sustained memory bandwidth for the triad value of just over 1TB/s [6].Each 512 way system runs a version of the Linux 2.4 kernel with specific SGI modifications that supports scaling to large numbers of shared memory processors [27].
We scale our runs to 1920 CPUs using a cluster of four 512-way systems interconnected with a single NU-MAlink fat-tree.The SGI Array Services and Message Passing Tookit systems are available on the Altix system, providing optimized MPI implementations that can take advantage of the coherent shared memory within each 512-way unit and transparently adapt to non-coherent memory communication, still over NU-MAlink, between systems.The exchange and sum phases in the MITgcm algorithm are, therefore, configured to use the systems native MPI implementation for the runs we describe.
The most challenging numerical experiment we have undertaken is a 1 16 • near global simulation.In this section we examine the performance of that simulation on the Altix system.The analysis makes use of operation counts based on code inspection as well as runtime processor statistics provided by the pfmon utility [13] which reads Itanium hardware counters and per MPI process communication statistics available when the -stats option of the Message Passing Toolkit is enabled.The grid spacing for this model configuration is • longitudinally while latitude circle spacing is • cos(latitude).The model domain extends from 78.7 • S to 80 • N.This gives a global finite volume grid with grid cells that are 6.9 km × 6.9 km at the equator and 1.2 km × 1.2 km at the northern boundary.The number of surface grid cells is just over 25 million, and the configuration has 50 vertical levels, making the total number of cells in all three dimensions just over 1.25 billion.Each of the three dimensional fields that make up the description of the simulation domain and its time evolving state therefore requires 10 GB of storage.With the full set of runtime diagnostics that are required for a scientifically useful calculations there are about 180 active three dimensional fields, so that the memory footprint of our running calculation is around 1.8TB.Each time step of algorithm 6 requires computing approximately 3600 floating point operations for each grid point.Aggregated over a total of 1.25 billion grid cells a single time step entails approximately 4.5 × 10 12 arithmetic operations.
At each timestep communication is necessary to update overlap regions in the sub-domains on which we compute (see Section 3).We use a sub-domain size of 96×136 with overlap regions O x = O y = 3.A total of 1920 sub-domains are needed to cover the full domain at this size so that a single three-dimensional exchange operation entails transferring a total of 1 GB of data.For the configuration used each timestep involves eighteen such operations, corresponding to steps 4: and 15: in algorithm 6.
Additionally an average of forty iterations in the conjugate gradient solver are required at each timestep.Each iteration entails a pair of two-dimensional exchange operations (each involving 20 MB of data transfer) and at least one global sum of an 8-byte floating point value from each sub-domain.
A large number of diagnostics are saved to allow analysis.In addition to basic state information highfrequency outputs of two-dimensional sea-surface elevation and bottom pressure that can be related to satellite measurements are saved along with threedimensional product terms that cannot be derived from their separate components.The list of possible outputs is application dependent but at a minimum the I/O requirement is 300 GB per month of integration.The 1.8 TB memory footprint of our calculation is larger than the main memory of a single 512-way box so for this study we examine processor counts of 960, 1440 and 1920 which correspond to running across two, three and four 512-way Altix boxes.We make use of a dedicated sets of CPUs and employ the Message Passing Toolkit MPI DSM DISTRIBUTE flags to ensure that when an MPI process is assigned to an Altix node, the CPU and memory on that node bind to that process.
Figure 2 shows the scaling behavior of two key communication primitives used in MITgcm, the sum operation and the two-dimensional form of the exchange operation.These times were obtained from an isolated kernel benchmark, taken from the MITgcm code, because the full, 1   16   • , configuration will not fit on smaller numbers of processors.The times, shown in microsec- onds, increase with processor count for both the sum operation and the two-dimensional exchange.However, it is encouraging to note that the increase in time when going between different numbers of Altix boxes is comparable to the increase associated with going from 1 to 512 CPUs.The exchange times stay relatively flat because although more data is exchanged between processors as the number of CPUs increases the available network bandwidth increases in proportion with CPU count.
Figure 3 shows the overall scaling and performance of a full simulation.Two scenarios are shown one with a modest I/O load (the upper line) and one with a very high I/O load (the lower line).The communication costs in Fig. 2 are only a small fraction of the total time in both cases.For the modest I/O case the peak performance at 1920 processors is sufficient to achieve 45 days of simulation per wall-clock day with an aggregate performance of 1.39 TFlop/s.The modest I/O case also scales well from 960 processors.For the intensive I/O scenario performance drops so that maximum throughput on 1920 processors is around 17 simulated days per wall-clock day with an aggregate performance of around 460 GFlop/s.For the intensive I/O scenario the scaling from 960 processors falls below the linear relation (the dotted line).However, at present, the MITgcm I/O routines being used have not been optimized for the Altix systems and so we expect the performance of the intensive I/O scenario to improve significantly.Over- all these results are encouraging and are sufficient to enable us to proceed with some preliminary numerical experiments.

Preliminary results
Using the Altix configuration described earlier, we undertook a series of numerical simulations at 1

Conclusions
At first glance the three different resolution runs show significant differences.There does, however, seem to be smaller change between the 1 8 • and 1 16 • simulations.A next step is to undertake a fourth series of runs at even higher resolution, 1   20   • or 1   32   • .Formally quantifying the changes between these runs would provide important information on whether ocean models are reaching numerically converged solutions.
Performance on the Altix shows it is well suited for addressing these questions.We monitored our code to be achieving about 722 Mflop/s/cpu when running on 1920 processors.This is 14% of the per CPU Top500 number achieved on the system [26].Our code consists of predominantly BLAS1 class operations and cannot exploit the level of cache reuse that the Top500 Linpack benchmark achieves.The scaling we found across multiple Altix systems is encouraging and suggests that configurations that span eight or more Altix boxes, and that would therefore support 1   20   • and higher resolution, are today within reach.

Fig. 2 . 16 •
Fig. 2. Performance of key primitives used on the 1 16 • resolution simulation.The exchange times are for a sub-domain of size 96 × 136 with

16 • 1 4• 8 • and 1 16 •
resolutions.The three cases are all configured as described in section 4 and, apart from differing resolutions, the model configurations are the same.Figures 4, 5 and 6 show significant changes in solution with resolution.Each plot shows the change in simulated sea-surface height over the same month.The plots capture changes due to eddy activity over a single month.Changes with resolution occur in the regions of the oceans where eddies are prevalent (such as the Gulf Stream, the Kurishio, the Aghullas, the Drake Passage and the Antarctic Circumpolar Current).For example in the Gulf Stream region the upper panel of figure4( ) shows a relatively small area of vigorous seasurface height changes.The middle and lower panels ( 1 respectively) show more extensive areas of changes but there are big shifts between the 1

Figure 5
Figure 5 and 6 are closeups of the Gulf Stream and Drake Passage.Key behaviors like how tightly waters "stick" to the coast, or how far energetic eddies pen-