Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

Introduction
Background. Legacy software is old software that serves a useful purpose. In high-performance computing (HPC), a code becomes "old" when it no longer effectively exploits current hardware. With the proliferation of multicore processors and many-core accelerators, one might reasonably label any serial code as "legacy software." Software that has proved its utility over many years, however, has typically earned the trust of its user community.
Any successful strategy for modernizing legacy codes must honor that trust. This paper presents two strategies for parallelizing a legacy Fortran code while bolstering trust in the result: (1) a test-driven approach that verifies the numerical results and the performance relative to the original code and (2) an evolutionary approach that leaves much of the original code intact while offering a clear path to execution on multicore and many-core architectures in shared and distributed memory.
The literature on modernizing legacy Fortran codes focuses on programmability issues such as increasing type safety and modularization while reducing data dependencies via encapsulation and information hiding. Achee and Carver [1] examined object extraction, which involves identifying candidate objects by analyzing the data flow in Fortran 77 code. They define a cohesion metric that they use to group global variables and parameters, and they then extract methods from the source code. In a 1500-line code, for example, they extract 26 candidate objects.
Norton and Decyk [2], on the other hand, focused on wrapping legacy Fortran with more modern interfaces. They then wrap the modernized interfaces inside an object/abstraction layer. They outline a step-by-step process that ensures standards compliance, eliminates undesirable features, creates interfaces, adds new capabilities, and then groups related abstractions into classes and components. Examples of undesirable features include common blocks, which potentially facilitate global data-sharing and aliasing of variable names and types. In Fortran, giving procedures explicit interfaces facilitates compiler checks on argument type, kind, and rank. New capabilities they introduced included dynamic memory allocation.
Greenough and Worth [3] surveyed tools that enhance software quality by helping to detect errors and to highlight poor practices. The appendices of their report provide extensive summaries of the tools available from eight vendors with a very wide range of capabilities. A sample of these capabilities includes memory leak detection, automatic vectorization and parallelization, dependency analysis, call-graph generation, and static (compile-time) as well as dynamic (run-time) correctness checking.
Each of the aforementioned studies explored how to update codes to the Fortran 90/95 standards. None of the studies explored subsequent standards, and most did not emphasize performance improvement as a main goal. One recent study, however, applied automated code transformations in preparation for possible shared-memory, loop-level parallelization with OpenMP [4]. We are aware of no published studies on employing the Fortran 2008 coarray parallel programming model to refactor a serial Fortran 77 application. Such a refactoring for parallelization purposes is the central aim of the current paper.
Case Study: PRM. Most commercial software models for turbulent flow in engineering devices solve the Reynolds-averaged Navier-Stokes (RANS) partial differential equations. Deriving these equations involves decomposing the fluid velocity field, u, into a mean part, \bar{u}, and a fluctuating part, u':

u = \bar{u} + u'.  (1)

Substituting (1) into a momentum balance and then averaging over an ensemble of turbulent flows yield the following RANS equation:

\rho \left( \frac{\partial \bar{u}_i}{\partial t} + \bar{u}_j \frac{\partial \bar{u}_i}{\partial x_j} \right) = -\frac{\partial \bar{p}}{\partial x_i} + \frac{\partial}{\partial x_j} \left( \mu \frac{\partial \bar{u}_i}{\partial x_j} - \rho \overline{u'_i u'_j} \right),  (2)

where \mu is the fluid's dynamic viscosity; \rho is the fluid's density; t is the time coordinate; \bar{u}_i and \bar{u}_j are the ith and jth cartesian components of \bar{u}; and x_i and x_j are the ith and jth cartesian components of the spatial coordinate x. The term -\rho \overline{u'_i u'_j} in (2) is called the Reynolds stress tensor. Its presence poses the chief difficulty at the heart of Reynolds-averaged turbulence modeling; closing the RANS equations requires postulating relations between the Reynolds stress and other terms appearing in the RANS equations, typically the velocity gradient \partial \bar{u}_i / \partial x_j and scalars representing the turbulence scale. Doing so in the most common ways works well for predicting turbulent flows in which the statistics of u' stay in near-equilibrium with the flow deformations applied via gradients in \bar{u}. Traditional RANS models work less well for flows undergoing deformations so rapid that the fluctuating field responds solely to the deformation, without time for the nonlinear interactions with itself that are the hallmark of fluid turbulence. The Particle Representation Model (PRM) [5, 6] addresses this shortcoming. Given sufficient computing resources, a software implementation of the PRM can exactly predict the response of the fluctuating velocity field to rapid deformations.
A proprietary in-house software implementation of the PRM was developed initially at Stanford University, and development continued at the University of Cyprus. The PRM uses a set of hypothetical particles over a unit hemisphere surface. The particles are distributed on each octant of the hemisphere in bands, as shown in Figure 1 for ten bands. With N_b bands, band k holds 4k particles (consistent with the index arithmetic in Listing 4), so the total number of particles is

N_p = \sum_{k=1}^{N_b} 4k = 2 N_b (N_b + 1).  (3)

The computational time therefore scales quadratically with the number of bands used.
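The quadratic growth of the particle count can be checked with a short sketch (Python here purely for illustration; the 4k particles-per-band count is inferred from the band-end index arithmetic shown in Listing 4, not stated explicitly in the original source):

```python
def particles_in_band(k):
    """Particles in band k: Listing 4 places the last particle of band
    (k - 1) at global index k**2 + (k - 1)**2 - 1, so consecutive
    band-end indices differ by (k + 1)**2 - (k - 1)**2 = 4k."""
    return 4 * k

def total_particles(n_bands):
    """Total particles: sum of 4k for k = 1..n_bands = 2*n*(n + 1)."""
    return sum(particles_in_band(k) for k in range(1, n_bands + 1))

# Quadratic growth: 1024 bands give about 2.1 million particles,
# matching the upper limit quoted in the Scalability section.
print(total_particles(1024))  # 2099200
```

For 1024 bands this yields 2,099,200 particles, which matches the "approximately 2.1 million particles" figure quoted later in the paper.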
Each particle has a set of assigned properties that describe the characteristics of an idealized flow. Assigned particle properties include vector quantities such as velocity and orientation as well as scalar quantities such as pressure. Thus, each particle can be thought of as representing the dynamics of a hypothetical one-dimensional (1D), one-component (1C) flow. Tracking a sufficiently large number of particles and then averaging the properties of all the particles (as shown in Figure 2), that is, all the possible flows considered, yields a representation of the 3D behavior in an actual flowing fluid. Historically, a key disadvantage of the PRM has been costly execution times, because a very large number of particles are needed to accurately capture the physics of the flow. Parallelization can reduce this cost significantly. Previous attempts to develop a parallel implementation of the PRM using MPI were abandoned because the development, validation, and verification times did not justify the gains. Coarrays allowed us to parallelize the software with minimal invasiveness, and the OO test suite facilitated a continuous build-and-test cycle that reduced the development time.

Modernization Strategy.
Test-Driven Development (TDD) grew out of the Extreme Programming movement of the 1990s, although the basic concepts date as far back as the NASA space program in the 1960s. TDD iterates quickly toward software solutions by first writing tests that specify what the working software must do and then writing only enough application code to pass the tests. In the current context, TDD serves the purpose of ensuring that our refactoring exercise preserves the expected results for representative production runs. Table 1 lists the 17 steps employed in refactoring and parallelizing the serial implementation of the PRM. They have been broken down into groups that address various facets of the refactoring process. The open-source CTest framework that is part of CMake was used for building the tests. Our first step, therefore, was to construct a CMake infrastructure that we used for automated building and testing and to set up a code repository for version control and coordination.
The next six steps address Fortran 77 features that have been declared obsolete in more recent standards or have been deprecated in the Fortran literature. We did not replace continue statements with end do statements, as these did not affect the functionality of the code.
The next two steps were crucial in setting up the build and testing infrastructure. We automated the initialization by replacing the keyboard inputs with default values. The next step was to construct extensible tests based on these default values, which are described in Section 3.
The next three steps expose optimization opportunities to the compiler. One exploits Fortran's array syntax. Two exploit Fortran's facility for explicitly declaring a procedure to be "pure," that is, free of side effects, including input/output, modifying arguments, halting execution, or modifying nonlocal state. Other steps address type safety and memory management.
Array syntax gives the compiler a high-level view of operations on arrays in ways the compiler can exploit with various optimizations, including vectorization. The ability to communicate functional purity to compilers also enables numerous compiler optimizations, including parallelism. The final steps directly address parallelism and optimization. One unrolls a loop to provide for more fine-grained data distribution. The other exploits the co_sum intrinsic collective procedure that is expected to be part of Fortran 2015 and is already supported by the Cray Fortran compiler. (With the Intel compiler, we write our own co_sum procedure.) The final step involves performance analysis using the Tuning and Analysis Utilities (TAU) [7].

Extensible OO Test Suite
At every step, we ran a suite of accuracy tests to verify that the results of a representative simulation did not deviate from the serial code's results by more than 50 parts per million (ppm). We also ran a performance test to ensure that the single-image runtime of the parallel code did not exceed the serial code's runtime by more than 20%. (We allowed for some increase with the expectation that significant speedup would result from running multiple images.) Our accuracy tests examine tensor statistics that are calculated using the PRM. In order to establish a uniform protocol for running tests, we defined an abstract base tensor class, as shown in Listing 1.
The base class provides the bindings for comparing tensor statistics, displaying test results to the user, and exception handling. Specific tests take the form of three child classes, reynolds_stress, dimensionality, and circulicity, that extend the tensor class and thereby inherit a responsibility to implement the tensor's deferred bindings compute_results and expected_results. The class diagram is shown in Figure 3.
The tests then take the form

if (.not. stress_tensor%verify_result(when)) &
  error stop 'Test failed.'

where stress_tensor is an instance of one of the three child classes shown in Figure 3 that extend tensor; "when" is an integer time stamp; error stop halts all images and prints the shown string to standard error; and verify_result is the pure function shown in Listing 1 that invokes the two aforementioned deferred bindings to compare the computed results to the expected results.
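The design is language-agnostic: an abstract base class owns the comparison logic, and each concrete test supplies only its computed and expected statistics. A minimal Python analog (the class names mirror the Fortran ones, but the stand-in statistic values and the relative-tolerance comparison are illustrative assumptions, not taken from the original suite):

```python
from abc import ABC, abstractmethod

TOLERANCE = 50e-6  # 50 parts per million, as in the accuracy tests

class Tensor(ABC):
    """Abstract base: children supply computed and expected statistics."""

    @abstractmethod
    def compute_results(self, when):
        ...

    @abstractmethod
    def expected_results(self, when):
        ...

    def verify_result(self, when):
        """Compare computed to expected statistics within 50 ppm."""
        computed = self.compute_results(when)
        expected = self.expected_results(when)
        return all(abs(c - e) <= TOLERANCE * abs(e)
                   for c, e in zip(computed, expected))

class ReynoldsStress(Tensor):
    # Hypothetical stand-in values; the real class runs a PRM simulation.
    def compute_results(self, when):
        return [0.3333334, 0.3333333, 0.3333333]

    def expected_results(self, when):
        return [1.0 / 3.0] * 3

stress_tensor = ReynoldsStress()
assert stress_tensor.verify_result(when=0), 'Test failed.'
```

Because the comparison lives in the base class, adding a new statistic (e.g., dimensionality) requires implementing only the two deferred methods, which is what makes the suite extensible.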

Coarray Parallelization
Modern HPC software must be executed on multicore processors or many-core accelerators in shared or distributed memory. Fortran provides for such flexibility by defining a partitioned global address space (PGAS) without referencing how to map coarray code onto a particular architecture. Coarray Fortran is based on the Single Program Multiple Data (SPMD) model, and each replication of the program is called an image [8]. Fortran 2008 compilers map these images to an underlying transport network of the compiler's choice. For example, the Intel compiler uses MPI for the transport network, whereas the Cray compiler uses a dedicated transport layer.
A coarray declaration of the form

real, allocatable :: a(:,:,:)[:]

facilitates indexing into the variable "a" along three regular dimensions and one codimension. Where necessary, adding coindices facilitated the construction of collective procedures to compute statistics.
In the legacy version, the computations of the particle properties were done using two nested loops, as shown in Listing 2.
Distributing the particles across the images and executing the computations inside these loops in parallel can reduce the execution time. This can be achieved in two ways.
Method 1 works with the particles directly, splitting them as evenly as possible across all the images and allowing image boundaries to occur in the middle of a band. This distribution is shown in Figure 4(a). To achieve this distribution, the two nested do loops are replaced by one loop over the particles, and the indices for the two original loops are computed from the global particle number, as shown in Listing 3. In this case, however, the code becomes complex and sensitive to precision. Method 2 works with the bands, splitting them across the images to make the particle distribution as even as possible. This partitioning is shown in Figure 4(b). Method 2, as shown in Listing 4, requires fewer changes to the original code shown in Listing 2 but is suboptimal in load balancing.
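The trade-off between the two schemes can be sketched as follows (a Python sketch under the assumption of 4k particles in band k, inferred from Listing 4; the function names and the greedy band-assignment heuristic are illustrative, not from the source):

```python
def split_particles(n_bands, n_images):
    """Method 1: split the global particle list as evenly as possible,
    allowing image boundaries to fall in the middle of a band."""
    total = 2 * n_bands * (n_bands + 1)      # sum of 4k over all bands
    base, extra = divmod(total, n_images)
    return [base + (1 if i < extra else 0) for i in range(n_images)]

def split_bands(n_bands, n_images):
    """Method 2: assign whole bands to images, advancing to the next
    image once its share of the particles is roughly filled."""
    sizes = [4 * k for k in range(1, n_bands + 1)]
    target = sum(sizes) / n_images
    counts, acc, image = [0] * n_images, 0.0, 0
    for s in sizes:
        # move to the next image when its target share would be exceeded
        if acc + s / 2 > target * (image + 1) and image < n_images - 1:
            image += 1
        counts[image] += s
        acc += s
    return counts

print(split_particles(200, 8))  # even to within one particle
print(split_bands(200, 8))      # even only to within one band's size
```

Method 1 is always balanced to within one particle per image, while Method 2 can only balance to band granularity, which is why the paper calls it suboptimal in load balancing despite its smaller code impact.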

Source Code Impact.
We applied our strategy to two serial software implementations of the PRM. For one version, the resulting code was 10% longer than the original: 639 lines versus 580 lines, with no test suite. In the second version, the code expanded 40%, from 903 lines to 1260 lines, not including new input/output (I/O) code and the test code described in Section 3. The test and I/O code occupied an additional 569 lines.

Ease of Use: Coarrays versus MPI.
The ability to drop the coindex from the notation was a big help in parallelizing the program without making significant changes to the source code. Much of the bookkeeping is handled behind the scenes by the compiler, making the parallelization more abstract but also easier to follow. For example, Listing 5 shows the MPI calls necessary to gather the local arrays into a global array on all the processors. The equivalent calls using the coarray syntax are shown in Listing 6.
Reducing the complexity of the code also reduces the chances of bugs in the code. In the legacy code, two arrays carried the information about the state of the particles. By using the coarray syntax and dropping the coindex, we were able to reuse all the original algorithms that implemented the core logic of the PRM. This made it significantly easier to ensure that the refactoring did not alter the results of the model. The main changes were to add codimensions to the declarations of these two state arrays and to update them when needed, as shown in Listing 6.

Scalability.
We intend for the PRM to serve as an alternative to turbulence models used in routine engineering design of fluid devices. There is no significant difference in the PRM results when more than 1024 bands (approximately 2.1 million particles) are used to represent the flow state, so this was chosen as the upper limit on the size of our data set. Most engineers and designers run simulations on desktop computers. As such, the upper bound on what is commonly available is roughly 32 to 48 cores on two or four central processing units (CPUs), plus additional cores on one or more accelerators. We also looked at the scaling performance of the parallel implementation of the PRM using Cray hardware and the Cray Fortran compiler, which has excellent support for distributed-memory execution of coarray programs.
Figure 5 shows the speedup obtained for 200 and 400 bands with the Intel Fortran compiler using the two particle-distribution schemes described in the Coarray Parallelization section. The runs were done using up to 32 cores on the "fat" nodes of ACISS (http://aciss-computing.uoregon.edu/). Each node has four Intel X7560 2.27 GHz 8-core CPUs and 384 GB of DDR3 memory. We see that the speedup was very poor as the number of processors was increased.
We used TAU [7] to profile the parallel runs to understand the bottlenecks during execution. Figure 6 shows the TAU plot of the runtime share of the dominant procedures for different numbers of images. Figure 7 shows the runtimes of the different functions on the different images. The heights of the columns show the runtime of different functions on the individual cores. There is no significant difference in the heights of the columns, indicating that the load balancing is very good across the images. We achieved this mainly by using the one-sided communication protocols of CAF, as shown in Listing 6, and by restricting the sync statements to the collective procedures, as shown in Listings 7 and 8. Looking at the runtimes in Figure 6, we identified the chief bottlenecks to be the two collective co_sum procedures, which sum values across a coarray by sequentially polling each image for its portion of the coarray. The time required for this procedure is O(n_images). The unoptimized co_sum routine for adding a vector across all images is shown in Listing 7. There is an equivalent subroutine for summing a matrix as well.
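The sequential pattern of Listing 7 amounts to a linear reduction: one accumulation step per image. A Python simulation of the communication cost (the image slices here are hypothetical stand-ins for the coarray sections):

```python
def co_sum_sequential(slices):
    """Simulate the unoptimized co_sum: the result is accumulated by
    polling every image in turn, so the number of serial accumulation
    steps grows linearly with the number of images: O(n_images)."""
    total = [0.0] * len(slices[0])
    steps = 0
    for local in slices:               # one poll per image
        total = [t + x for t, x in zip(total, local)]
        steps += 1
    return total, steps

slices = [[float(i), 2.0 * i] for i in range(1, 9)]   # 8 images
total, steps = co_sum_sequential(slices)
print(total, steps)  # [36.0, 72.0] 8
```

With 32 images this costs 32 serial steps, which is consistent with the collective procedures dominating the profiles in Figure 6 as the image count grows.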
Designing an optimal co_sum algorithm is a platform-dependent exercise best left to compilers. The Fortran standards committee is working on a co_sum intrinsic procedure that will likely become part of Fortran 2015. To improve the parallel performance of the program in the meantime, we rewrote the collective co_sum procedures using a binomial tree algorithm that is O(log n_images) in time. The optimized version of the co_sum routine is shown in Listing 8.
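A minimal sketch of the binomial-tree idea behind Listing 8 (a Python simulation of the pairwise-exchange rounds; the actual routine exchanges coarray slices between images with sync statements):

```python
def co_sum_binomial(values):
    """Simulate a binomial-tree co_sum across n images: in each round,
    image i accumulates the partial sum held by image i + step, and the
    stride doubles every round, so ceil(log2(n)) rounds suffice."""
    vals = list(values)
    n = len(vals)
    rounds, step = 0, 1
    while step < n:
        for i in range(0, n - step, 2 * step):
            vals[i] += vals[i + step]   # partner contributes its partial sum
        step *= 2
        rounds += 1
    return vals[0], rounds

total, rounds = co_sum_binomial(range(1, 33))  # 32 images
print(total, rounds)  # 528 5
```

For 32 images the linear scheme needs 32 serial accumulation steps, while the tree needs only 5 rounds, which is the source of the improved scaling reported below.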
The speedup obtained with the optimized co_sum routine is shown in Figure 8. We see that the scaling performance of the program becomes nearly linear with the implementation of the optimized co_sum routine. We also see that the scaling efficiency increases when the problem size is increased. This indicates that the poor scaling at smaller problem sizes is due to communication and synchronization [9].
The TAU profile analysis of the runs using different numbers of images is shown in Figure 9. While there is a small increase in the co_sum computation time when increasing the number of images, it is significantly lower than the increase in time for the unoptimized version.
To fully understand the impact of the co_sum routines, we also benchmarked the program using the Cray compiler and hardware. Cray provides native support for co_sum in its compiler. Cray also uses its own communication library on Cray hardware instead of building on top of MPI, as is done by the Intel compiler. As we can see in Figure 10, the parallel code showed very good strong scaling on the Cray hardware, up to 128 images for the problem sizes that we tested.
We also looked at the TAU profiles of the parallel code on the Cray hardware, shown in Figure 11. The profile analysis shows that the time is spent mainly in the time advancement loop when the native co_sum implementation is used.
We hope that, with the development and implementation of intrinsic co_sum routines as part of the 2015 Fortran standard, the Intel compiler will also improve its strong scaling performance with larger numbers of images. Table 2 shows the raw runtimes for the different runs using 128 bands, whose TAU profiles are shown in Figures 6, 9, and 11. The runtimes for one to four images are very close, but they quickly diverge as we increase the number of images because of the impact of the collective procedures.
Table 3 shows the weak scaling performance of the program using the optimized co_sum procedures with the Intel compiler. The number of particles, as shown in Figure 1, scales as the square of the number of bands. Therefore, when doubling the number of bands, the number of processors must be quadrupled to achieve the same execution time. The scaling efficiency for the larger problem drops because of memory requirements; the objects no longer fit in the heap and must be swapped out as needed, increasing the execution time.

Figure 6: TAU profiling analysis of function runtimes when using the unoptimized co_sum routines with 1, 2, 4, 8, 16, and 32 images. The .TAU application is the main program wrapped by TAU for profiling, and .TAU application => refers to functions wrapped by TAU. This notation is also seen in Figures 7 and 9.
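The quadrupling rule follows directly from the quadratic particle count. A quick sketch (the 2n(n+1) count is inferred from the band index arithmetic in Listing 4, and the helper names are illustrative):

```python
def particles(n_bands):
    """Particle count for n bands: 2 * n * (n + 1), i.e. O(n**2)."""
    return 2 * n_bands * (n_bands + 1)

def images_for_constant_work(n_bands, base_bands=128, base_images=1):
    """Images needed to keep particles-per-image (and hence the time
    per image) roughly fixed when growing the problem from base_bands."""
    return round(base_images * particles(n_bands) / particles(base_bands))

# Doubling the bands roughly quadruples the particles, so the image
# count must roughly quadruple for the same execution time.
print(images_for_constant_work(256), images_for_constant_work(512))  # 4 16
```

This is the arithmetic underlying the rows of Table 3: each doubling of the band count pairs with a quadrupling of the image count.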

Conclusions and Future Work
We demonstrated a strategy for parallelizing legacy Fortran 77 codes using Fortran 2008 coarrays. The strategy starts with constructing extensible tests using Fortran's OOP features.
The tests check for regressions in accuracy and performance.
In the PRM case study, our strategy expanded two Fortran 77 codes by 10% and 40%, respectively, exclusive of the test and I/O infrastructure. The most significant code revision involved unrolling two nested loops that distribute particles across images. The resulting parallel code achieves even load balancing but poor scaling. TAU identified the chief bottleneck as a sequential summation scheme.
Based on these preliminary results, we rewrote our co_sum procedure, and the speedup showed marked improvement. We also benchmarked the native co_sum implementation available in the Cray compiler. Our results show that the natively supported collective procedures give the best scaling performance, even when using distributed memory. We hope that future native support for collective procedures in other compilers will yield similar improvements.

Figure 1 :
Figure 1: Distribution of particles in bands in one octant.

Figure 2 :
Figure 2: Results of a PRM computation. The particles are colored based on their initial location. The applied flow condition, shear flow along the  -direction, causes the uniformly distributed particles to aggregate along that axis.

Figure 3: Class diagram of the testing framework. The deferred bindings are shown in italics, and the abstract class is shown in bold italics.

Listing 1: Abstract base tensor class.

Listing 2: Legacy serial nested loops over the particle properties.

Figure 4: Two different partitioning schemes were tried for load balancing.

Listing 3: Parallel loop over the particles, with the band indices computed from the global particle number.

Listing 4: Parallel loop by splitting bands.

! Loop over the bands
do k = my_first_band, my_last_band
   ! Global number of last particle in (k - 1) band
   l = k**2 + (k - 1)**2 - 1
   ! Loop over the particles in band
   do m ...

Figure 5 :
Figure 5: Speedup obtained with the sequential co_sum implementation using multiple images on a single server.

Figure 7 :
Figure 7: TAU analysis of load balancing and bottlenecks for the parallel code using 32 images.

Table 1: Modernization steps (horizontal lines indicate partial ordering of the steps).

1. Set up automated builds via CMake and version control via Git.
2. Convert fixed- to free-source format via "convert.f90" by Metcalf.
3. Replace goto with do while for main loop termination.
4. Enforce type/kind/rank consistency of arguments and return values by wrapping all procedures in a module.
14. Expose greater parallelism by unrolling the nested loops in the particle set-up.
15. Balance the work distribution by spreading particles across images during set-up.
16. Exploit a Fortran 2015 collective procedure to gather statistics.
17. Study and tune performance with TAU.
Figure 8: Speedup obtained with the parallel co_sum implementation using multiple images on a single server.

Figure 9: TAU profiling analysis of function runtimes when using the optimized co_sum routines with 1, 2, 4, 8, 16, and 32 images.

Figure 10: Speedup obtained with the parallel co_sum implementation using multiple images on a distributed-memory Cray cluster.

Figure 11: TAU profiling analysis of function runtimes when using the Cray native co_sum routines with 1, 2, 4, 8, 16, and 32 images.

Table 2 :
Runtime in seconds for parallel runs using 128 bands and different collective sum routines.

Table 3 :
Weak scaling performance of the coarray version. Columns: number of images, number of bands, number of particles, particles per image, time in seconds, runtime per particle, and efficiency.