This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared-memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the performance bottleneck was our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure; the Intel compiler offers no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
Any successful strategy for modernizing legacy codes must honor the trust that users have built up in their results. This paper presents two strategies for parallelizing a legacy Fortran code while bolstering trust in the result: (1) a test-driven approach that verifies the numerical results and the performance relative to the original code and (2) an evolutionary approach that leaves much of the original code intact while offering a clear path to execution on multicore and many-core architectures in shared and distributed memory.
The literature on modernizing legacy Fortran codes focuses on programmability issues such as increasing type safety and modularization while reducing data dependencies via encapsulation and information hiding. Achee and Carver [
Norton and Decyk [
Greenough and Worth [
Each of the aforementioned studies explored how to update codes to the Fortran 90/95 standards. None of the studies explored subsequent standards and most did not emphasize performance improvement as a main goal. One recent study, however, applied automated code transformations in preparation for possible shared-memory, loop-level parallelization with OpenMP [
The term
A proprietary in-house software implementation of the PRM was developed initially at Stanford University, and development continued at the University of Cyprus. The PRM uses a set of hypothetical particles over a unit hemisphere surface. The particles are distributed on each octant of the hemisphere in bands, as shown in Figure
Distribution of particles in bands in one octant.
Each particle has a set of assigned properties that describe the characteristics of an idealized flow. Assigned particle properties include vector quantities such as velocity and orientation as well as scalar quantities such as pressure. Thus, each particle can be thought of as representing the dynamics of a hypothetical one-dimensional (1D), one-component (1C) flow. Tracking a sufficiently large number of particles and then averaging the properties of all the particles (as shown in Figure
Results of a PRM computation. The particles are colored based on their initial location. The applied flow condition, shear flow along the
Panels at Time = 0, 2, 4, and 6 seconds.
Historically, a key disadvantage of the PRM has been costly execution times because a very large number of particles are needed to accurately capture the physics of the flow. Parallelization can reduce this cost significantly. Previous attempts to develop a parallel implementation of the PRM using MPI were abandoned because the development, validation, and verification times did not justify the gains. Coarrays allowed us to parallelize the software with minimal invasiveness and the OO test suite facilitated a continuous build-and-test cycle that reduced the development time.
Test-Driven Development (TDD) grew out of the Extreme Programming movement of the 1990s, although the basic concepts date as far back as the NASA space program in the 1960s. TDD iterates quickly toward software solutions by first writing tests that specify what the working software must do and then writing only enough application code to pass those tests. In the current context, TDD serves the purpose of ensuring that our refactoring exercise preserves the expected results for representative production runs.
Table
Modernization steps: horizontal lines indicate partial ordering.
Step | Details
---|---
1 | Set up automated builds via CMake and version control via Git.
---|---
2 | Convert fixed- to free-source format via “convert.f90” by Metcalf.
3 | Replace
4 | Enforce type/kind/rank consistency of arguments and return values by wrapping all procedures in a module.
5 | Eliminate implicit typing.
6 | Replace
7 | Replace write-access to
---|---
8 | Replace keyboard input with default initializations.
9 | Set up automated, extensible tests for accuracy and performance via OOP and CTest.
---|---
10 | Make all procedures outside of the main program pure.
11 | Eliminate actual/dummy array shape inconsistencies by passing array subsections to assumed-shape arrays.
12 | Replace static memory allocation with dynamic allocation.
---|---
13 | Replace loops with array assignments.
14 | Expose greater parallelism by unrolling the nested loops in the particle set-up.
15 | Balance the work distribution by spreading particles across images during set-up.
16 | Exploit a Fortran 2015 collective procedure to gather statistics.
17 | Study and tune performance with TAU.
The next six steps address Fortran 77 features that have been declared obsolete in more recent standards or have been deprecated in the Fortran literature. We did not replace
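As an illustration of steps 4 and 5, wrapping a legacy procedure in a module gives every caller an explicit interface, which lets the compiler enforce type/kind/rank consistency, while explicit declarations replace implicit typing. The sketch below uses invented names rather than actual PRM routines:

module particle_update
   implicit none
contains
   ! Formerly an external subroutine with implicitly typed arguments;
   ! the module wrapper provides an explicit interface to all callers.
   subroutine advance(p, dt)
      real, intent(inout) :: p(:)
      real, intent(in)    :: dt
      p = p + dt        ! placeholder body for illustration only
   end subroutine advance
end module particle_update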
The next two steps were crucial in setting up the build and test infrastructure. We automated the initialization by replacing the keyboard inputs with default values. The next step was to construct extensible tests based on these default values, which are described in Section
The next three steps expose optimization opportunities to the compiler. One exploits Fortran’s array syntax. Two exploit Fortran’s facility for explicitly declaring a procedure to be “
Array syntax gives the compiler a high-level view of operations on arrays in ways the compiler can exploit with various optimizations, including vectorization. The ability to communicate functional purity to compilers also enables numerous compiler optimizations, including parallelism.
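As a sketch (with variable names of our own choosing, not taken from the PRM source), an explicit loop over particle properties can collapse into an array assignment, and a side-effect-free calculation can be declared pure:

! Legacy-style loop:
do i = 1, n
   p(i) = p(i) + dt * dpdt(i)
end do

! Equivalent array assignment that the compiler is free to vectorize:
p(1:n) = p(1:n) + dt * dpdt(1:n)

! A pure function has no side effects, enabling further optimization:
pure function kinetic_energy(v) result(ke)
   real, intent(in) :: v(:)
   real :: ke
   ke = 0.5 * sum(v**2)
end function kinetic_energy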
The final steps directly address parallelism and optimization. One unrolls a loop to provide for more fine-grained data distribution. The other exploits the co_sum intrinsic collective procedure that is expected to be part of Fortran 2015 and is already supported by the Cray Fortran compiler. (With the Intel compiler, we write our own co_sum procedure.) The final step involves performance analysis using the Tuning and Analysis Utilities [
At every step, we ran a suite of accuracy tests to verify that the results of a representative simulation did not deviate from the serial code’s results by more than 50 parts per million (ppm). We also ran a performance test to ensure that the single-image runtime of the parallel code did not exceed the serial code’s runtime by more than 20%. (We allowed for some increase with the expectation that significant speedup would result from running multiple images.)
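A check of this kind reduces to a single array expression; the sketch below assumes the serial baseline and refactored statistics are available as arrays, with illustrative names and sizes:

integer, parameter :: n = 6                 ! illustrative number of tensor components
real, parameter    :: tolerance = 50.0e-6   ! 50 parts per million
real               :: baseline(n), refactored(n)
logical            :: within_tolerance
within_tolerance = all( abs(refactored - baseline) <= tolerance * abs(baseline) )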
Our accuracy tests examine tensor statistics that are calculated using the PRM. In order to establish a uniform protocol for running tests, we defined an abstract base tensor class as shown in Listing
computed_results, expected_results
all_tests_passed = all( abs(this%computed_results() - this%expected_results()) <= tolerance )
The base class provided the bindings for comparing tensor statistics, displaying test results to the user, and exception handling. Specific tests take the form of three child classes, reynolds_stress, dimensionality, and circulicity, that extend the tensor class and thereby inherit a responsibility to implement the tensor’s deferred bindings computed_results and expected_results. The class diagram is shown in Figure
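A minimal sketch of such a base class is shown below; the interface details (argument lists, the tolerance, and the verify_result binding name) are our assumptions rather than the exact PRM test code:

module tensor_class
   implicit none
   real, parameter :: tolerance = 50.0e-6    ! 50 ppm acceptance criterion
   type, abstract :: tensor
   contains
      procedure(results_interface), deferred :: computed_results
      procedure(results_interface), deferred :: expected_results
      procedure :: verify_result
   end type tensor
   abstract interface
      function results_interface(this) result(values)
         import :: tensor
         class(tensor), intent(in) :: this
         real, allocatable :: values(:)
      end function results_interface
   end interface
contains
   logical function verify_result(this) result(all_tests_passed)
      class(tensor), intent(in) :: this
      all_tests_passed = all( abs(this%computed_results() - this%expected_results()) &
                              <= tolerance * abs(this%expected_results()) )
   end function verify_result
end module tensor_class

A concrete test such as reynolds_stress then extends tensor and supplies the two deferred functions.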
Class diagram of the testing framework. The deferred bindings are shown in italics, and the abstract class is shown in bold italics.
where stress_tensor is an instance of one of the three child classes shown in Figure
Modern HPC software must be executed on multicore processors or many-core accelerators in shared or distributed memory. Fortran provides for such flexibility by defining a partitioned global address space (PGAS) without referencing how to map coarray code onto a particular architecture. Coarray Fortran is based on the Single Program Multiple Data (SPMD) model, and each replication of the program is called an image [
A coarray declaration of the form

real, allocatable :: a(:,:,:)[:]

facilitates indexing into the variable “a” along three regular dimensions and one codimension, so that

a(1,1,1) = a(1,1,1)[2]
copies the first element of image 2 to the first element of whatever image executes this line. The ability to omit the coindex on the left-hand side (LHS) played a pivotal role in refactoring the serial code with minimal work; although we added codimensions to existing variables’ declarations, subsequent accesses to those variables remained unmodified except where communication across images is desired. When necessary, adding coindices facilitated the construction of collective procedures to compute statistics.
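As a self-contained illustration of this pattern (hypothetical names, not PRM code), each image can publish a local value that image 1 then gathers through coindexed reads:

program gather_demo
   implicit none
   real :: local_value[*]                  ! scalar coarray: one copy per image
   real, allocatable :: everything(:)
   integer :: i
   local_value = real(this_image())        ! each image computes its own value
   sync all                                ! make remote values safe to read
   if (this_image() == 1) then
      allocate(everything(num_images()))
      do i = 1, num_images()
         everything(i) = local_value[i]    ! coindexed read from image i
      end do
      print *, everything
   end if
end program gather_demo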
In the legacy version, the computations of the particle properties were done using two nested loops, as shown in Listing
l = 0
Distributing the particles across the images and executing the computations inside these loops can speed up the execution time. This can be achieved in two ways.
Method 1 works with the particles directly, splitting them as evenly as possible across all the images, allowing image boundaries to occur in the middle of a band. This distribution is shown in Figure
Two different partitioning schemes were tried for load balancing.
Partitioning of the particles to achieve even distribution of particles
Partitioning of the bands to achieve nearly even distribution of particles
Method 2 works with the bands, splitting them across the images to make the particle distribution as even as possible. This partitioning is shown in Figure
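For Method 1, the bookkeeping amounts to simple index arithmetic; a sketch with invented names follows, where nparticles is the total particle count:

integer :: me, nimg, chunk, remainder, my_first, my_last
me = this_image()
nimg = num_images()
chunk = nparticles / nimg
remainder = mod(nparticles, nimg)
! The first 'remainder' images each take one extra particle.
my_first = (me - 1) * chunk + min(me - 1, remainder) + 1
my_last  = my_first + chunk - 1
if (me <= remainder) my_last = my_last + 1

Method 2 applies the same arithmetic to whole bands instead of particles, accepting a slightly uneven particle count per image in exchange for keeping each band on a single image.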
We applied our strategy to two serial software implementations of the PRM. For one version, the resulting code was 10% longer than the original: 639 lines versus 580 lines with no test suite. In the second version, the code expanded 40% from 903 lines to 1260 lines, not including new input/output (I/O) code and the test code described in Section
The ability to drop the coindex from the notation, as explained in Section
integer, allocatable :: my_first(:), my_last(:), counts(:), displs(:)
allocate( my_first(num_procs), my_last(num_procs), counts(num_procs), displs(num_procs) )
my_first(my_rank + 1) = ...
my_last(my_rank + 1) = ...
The equivalent operations using coarray syntax are shown in Listing
my_first = ...
my_last = ...
do l = 1, num_images()
   cr_global(:, my_first[l]:my_last[l]) = cr(:,:)[l]
   sn_global(:, my_first[l]:my_last[l]) = sn(:,:)[l]
end do
Reducing the complexity of the code also reduces the chance of introducing bugs. In the legacy code, the arrays
We intend for PRM to serve as an alternative to turbulence models used in routine engineering design of fluid devices. There is no significant difference in the PRM results when more than 1024 bands (approximately 2.1 million particles) are used to represent the flow state, so this was chosen as the upper limit of the size of our data set. Most engineers and designers run simulations on desktop computers. As such, the upper bound on what is commonly available is roughly 32 to 48 cores on two or four central processing units (CPUs) plus additional cores on one or more accelerators. We also looked at the scaling performance of the parallel implementation of the PRM using Cray hardware and the Cray Fortran compiler, which has excellent support for distributed-memory execution of coarray programs.
Figure
Speedup obtained with sequential co_sum implementation using multiple images on a single server.
We used TAU [
vector(:)
step = 2
temp = vector + vector[
temp = vector
vector = temp
step = step * 2
TAU profiling analysis of function runtimes when using the unoptimized co_sum routines with 1, 2, 4, 8, 16, and 32 images. The
TAU analysis of load balancing and bottlenecks for the parallel code using 32 images.
Designing an optimal co_sum algorithm is a platform-dependent exercise best left to compilers. The Fortran standards committee is working on a co_sum intrinsic procedure that will likely become part of Fortran 2015. But to improve the parallel performance of the program, we rewrote the collective co_sum procedures using a binomial tree algorithm that is
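For concreteness, a minimal binomial-tree reduction over images, assuming a power-of-two number of images and variable names of our own choosing (not the exact production routine), might read:

subroutine co_sum_binary(vector)
   ! Tree reduction followed by a broadcast; assumes num_images() is a power of two.
   real, intent(inout) :: vector(:)[*]
   integer :: step, me
   me = this_image()
   step = 2
   do while (step <= num_images())
      sync all
      if (mod(me - 1, step) == 0) then
         ! Fold in the partner image's partial sum, doubling the stride each round.
         vector(:) = vector(:) + vector(:)[me + step / 2]
      end if
      step = step * 2
   end do
   sync all
   ! Image 1 now holds the global sum; copy it back to every image.
   vector(:) = vector(:)[1]
   sync all
end subroutine co_sum_binary

With compiler support for the proposed intrinsic, this entire routine collapses to a single statement, call co_sum(vector), which is the form the Cray compiler already accepts.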
The speedup obtained with the optimized co_sum routine is shown in Figure
Speedup obtained with parallel co_sum implementation using multiple images on a single server.
TAU profiling analysis of function runtimes when using the optimized co_sum routines with 1, 2, 4, 8, 16, and 32 images.
The TAU profile analysis of the runs using different numbers of images is shown in Figure
To fully understand the impact of the co_sum routines, we also benchmarked the program using the Cray compiler and hardware. The Cray compiler natively supports co_sum as a collective procedure. Cray also uses its own communication library on Cray hardware instead of building on top of MPI, as the Intel compiler does. As we can see in Figure
Speedup obtained with parallel co_sum implementation using multiple images on a distributed-memory Cray cluster.
We also looked at the TAU profiles of the parallel code on the Cray hardware, shown in Figure
TAU profiling analysis of function runtimes when using the Cray native co_sum routines with 1, 2, 4, 8, 16, and 32 images.
We hope that, with the development and implementation of intrinsic co_sum routines as part of the 2015 Fortran standard, the Intel compiler will also improve its strong scaling performance with larger numbers of images. Table
Runtime in seconds for the parallel code using 128 bands and different collective sum routines.
co_sum implementation | 1 image | 2 images | 4 images | 8 images | 16 images | 32 images
---|---|---|---|---|---|---
Intel Serial co_sum | 35.55 | 19.80 | 11.69 | 9.73 | 18.71 | 66.82
Intel Parallel co_sum | 37.30 | 19.33 | 10.00 | 6.17 | 4.62 | 5.41
Cray Native co_sum | 46.71 | 23.68 | 11.88 | 6.06 | 3.06 | 1.73
Table
Weak scaling performance of coarray version.
Number of images | Number of bands | Number of particles | Particles per image | Time in seconds | Runtime per particle | Efficiency
---|---|---|---|---|---|---
1 | 128 | 33024 | 33024 | 44.279 | | 1.000
4 | 256 | 131584 | 32896 | 44.953 | | 0.978
16 | 512 | 525312 | 32832 | 49.400 | | 0.893
 | | | | | |
2 | 256 | 131584 | 65792 | 101.03 | | 1.000
8 | 512 | 525312 | 65664 | 102.11 | | 0.987
32 | 1024 | 2099200 | 65600 | 129.75 | | 0.777
We demonstrated a strategy for parallelizing legacy Fortran 77 codes using Fortran 2008 coarrays. The strategy starts with constructing extensible tests using Fortran’s OOP features. The tests check for regressions in accuracy and performance. In the PRM case study, our strategy expanded two Fortran 77 codes by 10% and 40%, exclusive of the test and I/O infrastructure. The most significant code revision involved unrolling two nested loops that distribute particles across images. The resulting parallel code achieves even load balancing but poor scaling. TAU identified the chief bottleneck as a sequential summation scheme.
Based on these preliminary results, we rewrote our co_sum procedure, and the speedup showed marked improvement. We also benchmarked the native co_sum implementation available in the Cray compiler. Our results show that natively supported collective procedures deliver the best scaling performance, even with distributed memory. We hope that future native support for collective procedures in Fortran 2015 by all the compilers will bring such performance to all platforms.
The authors declare that there is no conflict of interest regarding the publication of this paper.
The initial code refactoring was performed at the University of Cyprus with funding from the European Commission Marie Curie ToK-DEV grant (Contract MTKD-CT-2004-014199). Part of this work was also supported by the Cyprus Research Promotion Foundation’s Framework Programme for Research, Technological Development and Innovation 2009-2010 (ΔEΣMH 2009-2010) under Grant TΠE/ΠΛHPO/0609(BE)/11. This work used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract no. DE-AC02-05CH11231. This work also used hardware resources from the ACISS cluster at the University of Oregon acquired by a Major Research Instrumentation grant from the National Science Foundation, Office of Cyber Infrastructure, “MRI-R2: Acquisition of an Applied Computational Instrument for Scientific Synthesis (ACISS),” Grant no. OCI-0960354. This research was also supported by Sandia National Laboratories, a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the National Nuclear Security Administration under Contract DE-AC04-94-AL85000. Portions of the Sandia contribution to this work were funded by the New Mexico Small Business Administration and the Office of Naval Research.