Application of OpenMP to weather, wave and ocean codes

Weather forecast limited area models, wave models and ocean models run commonly on vector machines or on MPP systems. Recently shared memory multiprocessor systems with ccNUMA architecture (SMP-ccNUMA) have been shown to deliver very good performances on many applications. It is important to know that the SMP-ccNUMA systems perform and scale well even for the above mentioned models and that a relatively simple effort is needed to parallelize the codes on these systems due to the availability of OpenMP as standard shared memory paradigm. This paper will deal with the implementation on a SGI Origin 2000 of a weather forecast model (LAMBO -- Limited Area Model Bologna, the NCEP ETA model adapted to the Italian territory), a wave model (WA.M. -- Wave Model, on the Mediterranean Sea and on the Adriatic Sea) and an ocean model (M.O.M. -- Modular Ocean Model, used with data assimilation). These three models were written for vector machines, so the paper will describe the technique used to port a vector code to a SMP-ccNUMA architecture. Another aspect covered by this paper are the performances that these models have on these systems.


Introduction
In recent years it has been a common perception that only vector processor machines are appropriate for running limited area weather forecast models, wave models and ocean models.Few recent models have been developed on MPP systems, while others have been ported using the message passing paradigm.This paper will deal with the implementation, the porting and the performances on a SGI Origin 2000 of a weather forecast model, a wave model and an ocean model.After the description of the memory and communication architecture of the SGI Origin 2000 in section 2, section 3 describes a weather forecast model (LAMBO -Limited Area Model Bologna), a wave model (WA.M. -Wave Model) and an ocean model (M.O.M. -Modular Ocean Model).Section 4 describes the parallelization techniques adopted at CINECA to port these array-native codes from a vector to a shared memory multiprocessor machine using OpenMP [DM98].In order to evaluate the parallel execution time performance, results are compared with those theoretically predicted.Section 5 also shows how different scalability curves can be experimentally obtained varying the size of input data; this will be done on data coming from a realistic case using three different spatial resolutions.Moreover the paper will point out the effects on the efficiency curve of the cpu upgrade without upgrading the interconnection network.Numerical representation effects on weather forecast will be briefly observed.Finally the paper provides some conclusions.

Programming environment
The SGI Origin 2000 [LL97] is a scalable sharedmemory multiprocessing architecture.It provides global address spaces for memory and for the I/O subsystem.The communication architecture is much more tightly integrated than in other recent commercial distributed shared memory (DSM) systems, with the stated goal of treating a local access as simply an optimization of a general DSM memory reference.The two processors within a node do not work as a snoopy SMP cluster but operate separately over the single multiplexed physical bus and are governed by the same, on-level directory protocol.Less snooping keeps low both absolute memory latency and the ratio of remote to local latency, and provides remote memory bandwidth equal to local memory bandwidth (780 MB/s each).The two processors within a node share a hardwired coherence controller, called hub, that implements the directory based cache coherence protocol.The Origin includes other architectural features for good performance, including support for dynamic page migration and prefetching, a highperformance local and global interconnect design, coherence protocol features to minimize latency and bandwidth needs per access, and synchronization primitives like LL/SC and atmemory fetch-and-op to reduce the serialization for highly contended syncronization events.Whithin a node each processor has separate 32 KB first-level instruction and data caches (L1 cache) and a unified 4 MB second-level cache (L2 cache) with 2-way associativity on R10000 processor (from now on indicated by R10K); second level cache is 8 MB on R12000 processor (R12K).

LAMBO
LAMBO, Limited Area Model BOlogna, is a grid-point primitive equations model, based on the 1989 and 1993 versions of the ETA model, operationally used at the National Centre for Environmental Prediction of Washington.The model, originally developed in its former adiabatic version during the early seventies, has been consistently improved during the years, both with regards to numerical schemes, related to the adiabatic part of the model, and also with respect to the parametrization of the physical processes [MJN + 88, Bla88, Jan90].LAMBO has been running operationally since 1993 [Pac94] at Agenzia Regionale Prevenzione Ambiente -Servizio Meteorologico Regionale where it has been almost completely reformulated in its pre-and post-processing sections.As mentioned earlier, LAMBO is a grid-point, primitive equations limited-area model: in such models the only basic approximation, which is well justified by the scale analysis of the vertical component of the momentum equation, is the hydrostatic approximation, which assumes that the pressure at any point is simply equal to the weight of the unit cross-section column of air above that point.In general, a primitive equations model is a model in which, assuming that the atmosphere is in hydrostatic equilibrium, the motion is predicted by applying the principles of conservation of momentum, energy and mass (separately for dry air and moisture) and using the law of ideal gases.Such a set of differential equations constitutes the initial and boundary value problem, the solution of which provides the future state of the atmosphere.The equations of motion are solved in practice using finite difference methods and all model variables are defined on the so-called Arakawa E-type grid.Particular numerical schemes were developed to integrate on the E-grid the part of the equations related to adiabatic processes and precisely to horizontal advection [Jan84] and geostrophic adjustment [Mes73].

WA.M.
The Wave Model, WA.M., has been developed by a group of international scientists with the aim of producing a tool for the forecast of the waves based only on physical principles.The WA.M. describes the sea state at a certain time in a certain position as the overlapping of many sinusoidals with different frequencies and directions.The energy distribution on these components is called the "sea spectrum".The model integrates numerically the "energy balance equation", that expresses the equilibrium between the energy associated to the sea state in a fixed position, its advection energy and the local velocity to produce and dissipate the undulatory motion.This includes the generation from wind, energy exchange between the wave components, dissipation phenomena (as white-capping and sea bottom friction), shoaling, refraction from the bottom and interaction with the streams.The equations are solved on all the grid points and for each spectrum component.

M.O.M.
Any Ocean Data Assimilation system consists of three components: the dynamical model, the data and quality control procedures and the insertion technique.The numerical model is a modified version of the Modular Ocean Model, M.O.M., implementation in the global ocean [Cox84,RM88].M.O.M. solves the primitive equations under hydrostatic, Boussinesq and rigid lid approximations using finite difference methods.All variables are defined on the so called "B-grid" of Arakawa and Lamb [AL77].The horizontal resolution is 1x1 degree almost everywhere except in the tropical area where the north-south resolution is increased to 1/3 of a degree.There are 15 levels unevenly spaced down to 3000 meters and the first 11 levels are confined in the first 250m.The vertical diffusion and horizontal viscosity are parameterized with the Mellor-Yamada [MY82] turbulence closure scheme and Smagorinsky nonlinear viscosity [Sma97], respectively.At the surface the ECMWF atmospheric reanalysis fields are used to compute momentum and heat fluxes with the method implemented by Rosati & Miyakoda [MRG77].The surface salinity boundary condition is still a relaxation to climatological monthly mean values.The data set assimilated into the ocean model consists of both XBT and CTD temperature profiles contained in the World Ocean Data Bank-94 [BL94] and the Reynolds weekly sea surface temperature analyses [RSS94].The preassimilation procedure has been implemented and checked in order to ensure the most effective use of the observations.The assimilation scheme consists of the univariate variational optimal interpolation scheme developed by Derber and Rosati [DR89].

Porting techniques
All the three codes described were running on CINECA CRAY C90 (from now on indicated with C90) and had to be ported on a Origin with 16 R10K processors at 195 MHz, with 8 MB global shared memory (from now on indicated with Origin-R10K).Later CINECA's Origin was upgraded to a 64 R12K processors at 300 MHz, with 32 GB of global shared memory (from now on indicated with Origin-R12K).The migration of these codes to the parallel Origin system has been structured in four major steps: • porting; • single processor tuning; • parallelization; • performance analysis.In order to obtain good MFLOP performance from porting a code written for a vector machine to a RISC processor, it is necessary to use well all memory hierarchies (expecially the L1 and L2 caches) to minimize data movement from RAM and feed the cpu registers; for this reason single processor tuning is a crucial step for the scalability of these codes.The parallel version of these codes has been written in a shared memory programming model, which is by far the most natural and efficient way to implement parallel code on Origin systems.Thus the parallelism has been achieved by the insertion of OpenMP standard directives and by exploiting the auto-parallelizing compiler features.OpenMP permits the use of different parallelization schemes inside the same code; this flexibility is not present with other programming models (e.g.message passing).Another advantage given by OpenMP is the possibility of using an incremental code parallelization approach: at the beginning the parallelization effort has been applied to the most time consuming routines, incrementally considering other routines to reach the desired parallelization level.SGI's ssrun tool has been fundamental in recognizing the most time consuming subroutines.Unless specified the experiments were run using SGI's miser (or equivalent tool), so that the cpus were dedicated to the application.

LAMBO
The purpose of this work was to follow two basic criteria: • the parallel version of LAMBO had to run on the Origin-R10K at least in the same time as the serial C90 version; • to retain code readability and portability the code modifications had to be kept to a minimum.The porting process has been straightforward: the only important issue was related to the numerical precision required, due to the different default variable size on C90 (64 bits) and on Origin-R10K (32 bits).Single processor tuning: in order to run efficiently the LAMBO vector code on the cache-based Origin architecture, aggressive optimization compiler flags had to be turned on, in particular for loop nesting and cache prefetching analysis.The major problem arose when considering the code parallelization: the original version of LAMBO made a large use of EQUIVALENCEd variables, to save memory, but the presence of an EQUIVALENCEd variable in a loop inhibits its parallelization.In order to achieve a significant level of parallelism, it has been necessary to remove most of the EQUIVALENCE statements, thus reducing the code readability for the original authors.Different parallelization schemes have been applied to different subroutines, always choosing the best approach according to the algorithm implemented: as an example, in the horizontal diffusion subroutine, HDIFF, the vertical level outer loop has been chosen for parallelization, while in the vertical advection subroutine, VTADV, the parallelization has been applied to the horizontal inner loop.In the end it turned out that 10 subroutines were manually parallelized by the insertion of OpenMP directives and 6 were automatically parallelized by the compiler.In the case of the radiation package, the parallelization has been achieved at a higher level, by parallelizing the main loop in the driver routine which calls the other radiation routines.The experiment was done on a 125x111x31 grid, with a 60 seconds timestep, for 20 timesteps, and time redistribution between LAMBO subroutines is shown in Table 1.

WA.M.
Porting: this code needed slight code modifications, such as the modularization of some PARAMETERs and EQUIVALENCEs and the substitution of some CRAY proprietary subroutines.This model was tested with a 1/8 of a degree configuration on the Mediterranean Sea, so the grid is made by 337 longitudinal points by 129 latitudinal points; moreover for each grid point 25 frequencies and 12 angles have been considered.Single processor tuning: this model works with 64 bit numerical precision (both real and integer arithmetic) and doesn't show any numerical instabilities.For a 6 hour time integration run Table 2 shows that cpu time decreases when the optimization level grows; the code is fastest when the optimization is refined asking for an arithmetic not compliant to the IEEE-754 standard, a code optimized for the R12K processor, with aggressive prefetching and loop fusion.Time redistribution between WA.M. subroutines for a 72 hours run is summarized in Table 3

M.O.M.
Porting: initially the model was run with a 32 bit arithmetic and numerical representation to obtain higher execution speed but losing numerical precision.This test, however, didn't give the expected benefits, because the model has some numerically instable kernels and it explodes when a 32 bit arithmetic is used.The model instability has been shown also for the 64 bit numerical representation when the high optimization level (that implies a non standard IEEE-754 arithmetic) has been used, in particular some transformations, such as x/y = x*1/y, introduced the presence of NaN (Not a Number) quantities.Single processor tuning: due to numerical instability a non highly aggressive optimization level was selected for the numerical point of view and performance of the model was improved using the software pipelining option.The introduction of a flag that switches on the data caches prefetch (both primary and secondary) has slightly lowered the execution time, while loop fusion and loop fission flags didn't give any substantial benefit; the same happened for the interprocedural analysis option.Profiling tools have been useful in analyzing the model behaviour and in locating a numerical kernel where most of the execution time is spent.This kernel is a nested do loop containing an instruction similar to: A floating point analysis shows that this instruction can't exceed the 45 Mflop/s on the Origin-R12K.The performance tools showed that with all the optimization options turned on the compiler doesn't reach this upper bound.Loop fission, array automatic padding, array grouping (automatic and manual) inside a common block didn't give any performance gain as well as re-writing the code using FORTRAN 90 for grouping the data structures involved in the loop -so to use better the primary data cache.The cpu time redistribution of M.O.M. subroutines has been studied varying the number of timesteps, to find out a number of timesteps sufficiently small to be a good and representative sample of the model for long integrations and to minimize the execution time.The behaviour of the seven most time consuming subroutines has been observed for 4, 8, 16, 32, 64, 128 and 256 timesteps configurations.
The redistribution time in the two last configurations is almost the same.For previous configurations only a high evolution (when the configuration passes from 4 to 8, from 8 to 16 and from 16 to 32 timesteps) can be observed, then a smooth evolution (when the configuration passes from 32 to 64 and from 64 to128 timesteps) up to reach an arrangement when the configuration passes from 128 to 256 timesteps, as shown in Figure 1.

Experimental results
In order to evaluate the parallel performance, the results are compared with those predicted by the so-called Amdhal's law which represents the parallel execution time T(p) as a function of the number of processor p and of the parallel fraction f p of the serial time T s .According to Amdhal's law: where S(p) is the speed-up function definition.Speed-up is used to evaluate the parallel performance; in the ideal case, when all the code is perfectly parallel (f p =1) , the speed-up function is the linear function S(p)=p.

LAMBO performance analysis
Since only routines that together account for the 96% of the total execution time were considered for parallelization, the Amdhal's curve corresponding to f p =0.92 should be considered.
Due to imperfect load balancing, cache misses and data contention between processors that fraction is positioned between 90% and 92%.Table 4 reports the parallel execution time obtained running the experiment described previously in section 4.1 for 20 timesteps on Origin-R10K.Table 4 reports also the real speedup between the theoretical speedup calculated for f p =0.90 and for f p =0.92.LAMBO has been operative on CINECA's Origin-R10K since the 1 st July 1998.Using 10 R10K the first run takes about 5 minutes while the second takes about 32 minutes.This should be compared with the 10 minutes and 50 minutes, respectively required by the previous C90 runs.

WA.M. performance analysis
Parallel execution time obtained running the experiment described before in section 4.2 for an integration of 72 hours on Origin-R12K are shown in Table 5.
Performance tools demonstrate that the subroutines that have been parallelized sum up to the 92% of the serial execution time.The asymptotic behaviour of the two curves is similar so the parallelization can be considered satisfactory.( )

performance analysis
The elapsed time on Origin-R12K for one month integration shown in the Table 6 were obtained with a not completely unloaded machine without using miser.
The higher execution time when the number of processor passes from 12 to 14 is probably due to a high machine load.Another cause that can generate such behaviour is a bad process workload distribution or an higher number of processors is not exploited in terms of memory use.Finally the usage of another router on the comunication network can increase the comunication overhead because of the growth of the bisection bandwidth, in other terms the maximum number of hops that a message has to do to reach another node grows.

Input size effects
Some experiments taking as input the dataset relating to the south Ticino flood (Sept.1995) have been done to understand the impact on the model scalability when the resolution is changed.Under the IRIX 6.5.2 environment LAMBO has been compiled with MIPSpro 7.3 and has been run using miser on Origin-R12K.The experiments had these configurations: • Father: 65x65x32 grid, timestep 120 seconds • Son:129x129x32 grid, timestep 60 seconds • Grandson: 197x197x32 grid, timestep 30 seconds Figure 2 summarizes the efficencies coming from experimental results together with the efficencies predicted by the Amdhal's law when the parallel fraction f p is 70%, 80% and 90%.It's easy to see that LAMBO scalability (and in general all model scalability) is strongly related to the configuration of the experiment or better to the input size.This observation leads to two other considerations: to obtain models that scale well on shared memory machines, data structures have to be large enough to be distributed among the processors that have to be used, otherwise the overhead that comes from remote memory access will drop down the performances; OpenMP implementation overhead (always present) can be percentually reduced increasing the input size so to enlarge the computational part of the model.All these three configurations scales up to 16 processors with quite different efficencies but when more cpus are added there is no gain in time performances due to overhead (synchronization and remote accesses).LAMBO scales up to 32 processors if the configuration is greater than Grandson but, as mentioned earlier, LAMBO is a hydrostatic model and this kind of model can not be used for very fine grids.It has been noticed that the thread control overhead explodes over 16 processors: the experiment configuration has to be very big so that the computational part hides the overhead.Moreover it is possible to observe that the Son configuration on Origin-R12K is larger than the one that has been used to port the code on the Origin-R10K but the former configuration has a poorer scalability: this behaviour is due to the change of cpu.Passing from R10K, 195 MHz, 4MB L2 cache to R12K, 300 MHz, 8MB L2 cache leads the model to run faster on a single cpu and to fit in a secondary data cache using a smaller number of processors.The same behaviour can be observed in figure 3: this picture shows the WA.M. cpu time obtained on the CINECA's Origin-R12K and on CINECA's Onyx equipped with 8 processors R10K, 275 MHz, 64 KB L1 cache, 4 MB L2 cache, 4 GB of memory.The experiment configuration in this case was 1/12 of a degree on the Adriatic Sea represented as a 97x73 grid, for each grid point 25 frequencies and 12 angles have been considered.The model integrates 6 hours forecast.WA.M. for this configuration scales up to 4 processors then there is no more benefit in adding cpus, infact the model is quite small.WA.M. on the Mediterranean Sea, instead, scales up to 12 processors.

Effect on numerical representation
Some experiments has been done in order to evaluate the relation between LAMBO output and numerical representation.The input dataset chosen is still the one related to south Ticino flood.This meteorological situation has been chosen because it is characterized by intense phenomena in order to have extreme values for some model variables and to enlight numerical differences and eventual fatal errors; moreover, with such situation, the convective scheme has been frequently used during the computation.
The variable chosen for the comparision is the total precipitation (which is the most interesting variable for the end users).The comparison has been done on the last snapshot released by the model (after 72 hours integration) because numerical differences between two different computations grow with integration time.Figures 4 and 5 shows respectively the total precipitation for a run with 32 bit numerical representation and a 64 bit one: the areas where total precipitation is intensive have the same structure in both the figures, while some differences are present where total precipitation is light.This qualitative analysis has been completed with a statistical analysis and also other model variables have been examined (relative humidity at 850 hPa and mean sea level pressure).A similar comparision has been done between a scalar and a parallel run in order to test the numerical impact of parallelization on the output.Also in this case there were not main quantitative and qualitative differences in test variables.

Conclusions
During this paper it has been shown that: • these kinds of models does not seem to be suited only for vector machines: it has been observed that cpu time performances for all these three configurations can be better than the respective ones obtained on a CRAY C90; • the possibility of using an incremental code parallelization approach and of using different parallelization schemes inside the same code are two big advantages given by OpenMP.A relatively simple parallelization effort has been done for porting these code: a much greater effort would have been necessary in the case of a message passing implementation; • single processor tuning is a crucial step: the porting of a vector code requests particular attention, especially loop nesting, prefetching and cache optimization; • scalability is strictly dependent by the input size, the processor, the L1 and L2 cache sizes; • these models can perform well on scalable shared memory parallel computers providing satisfactory operational forecasts also with 32 bit numerical representation.

Figure 1 -
Figure 1 -Time redistribution between M.O.M. subroutines varying the number of timesteps Parallelization: the code has been passed in the Power Fortran Accelerator (pfa) for obtaining a parallel version of the code.
Future work: experiment new commercial systems.LAMBO serial version has been run on IBM Power3 processor at 200 MHz with 128 KB of L1 cache and 4 MB of L2 cache: a 6 hour integration takes 1823 and 4500 seconds respectively for south Ticino flood Son and Grandson configuration, instead 2207 and 7247 second were necessary for the same configurations on a R12K.

Table 4 -
LAMBO parallel execution: cpu time, theoretical speedup for f p =0.90, real speedup, theoretical speedup for f p =0.92.