High-resolution numerical methods and unstructured meshes are required in many applications of Computational Fluid Dynamics (CFD). These methods are computationally expensive and hence benefit from parallelization. The Message Passing Interface (MPI) has traditionally been used as the parallelization strategy. However, the inherent complexity of MPI adds to the existing complexity of CFD scientific codes. The Partitioned Global Address Space (PGAS) parallelization paradigm was introduced in an attempt to improve the clarity of the parallel implementation. We present our experience of converting an unstructured high-resolution compressible Navier-Stokes CFD solver from MPI to PGAS Coarray Fortran, including the challenges, methodology, and performance measurements of our approach. With the Cray compiler, we find Coarray Fortran to be a viable alternative to MPI. We are hopeful that Intel and open-source implementations could be utilized in the future.
While it is the dominant communication paradigm, Message Passing Interface (MPI) has received its share of criticism in the High-Performance Computing (HPC) community. It provides a complex interface to parallel programming, which is mostly underutilised by researchers whose primary skill is not software development. Maintenance and modernization of parallel codes written with MPI also require more person-hours and associated funding costs compared to the serial counterpart [
In parallel programming, the details of communication strategies should not overburden researchers or shift their focus away from the core research objective. Unfortunately, it has been observed that hardware advancements do not automatically translate into better performance: scientific codes utilizing the MPI paradigm have to be modified in order to achieve the best possible performance gains. With the goal of Exascale computing, both the underlying hardware and the available software tools should support scientific numerical codes so that they can be adapted efficiently to future computing platforms. The efficiency of a scientific code is a twofold matter: it involves both the effort spent during the development or reengineering phase and the performance gains observed later.
Recently, Partitioned Global Address Space (PGAS) based parallel programming languages have been gaining popularity. Several languages such as Unified Parallel C (UPC), Coarray Fortran, Fortress, Chapel, and X10 are based on the PGAS paradigm. In comparison to many of its competitors, Coarray Fortran is relatively mature and has undergone considerable research [
Coarray Fortran was originally a small syntactic extension (F−−) to the Fortran programming language that enabled parallel programming. It has been part of the Fortran language since the adoption of the Fortran 2008 standard. Some features, such as collective intrinsic routines, teams, and error handling of failed images, were left out of the Fortran 2008 standard. With the acceptance of the technical specification document, they will become standard in Fortran 2015 [
Like other PGAS languages, Coarray Fortran provides language constructs equivalent to one-sided communication at run time. This feature improves productivity and can also harness the communication features of the underlying hardware. Some studies have quantified the effort and performance of such PGAS languages, most notably the PRACE-PP (Partnership for Advanced Computing in Europe-Preparatory Phase) project. It involved the development of three benchmark cases by different researchers and the collection of feedback on development time (effort) and performance [
Coarray Fortran is today supported by Cray with extended features and by Intel with compatibility with Fortran standards [
Over the years many benchmark studies have been performed [
Computational Fluid Dynamics (CFD) studies of complex flows in a wide range of applications benefit from parallelization because of the high computational cost of the numerical methods employed. Recent performance studies in the literature have focused only on numerical codes with structured meshes. These codes have natural, geometry-driven grid partitioning and regular communication patterns. In many scientific domains where complex geometries are involved, unstructured meshes are the norm. These meshes lead to nonintuitive mesh partitionings, have greater load imbalances, and suffer from irregular communication patterns. When higher-order numerical schemes are required, the complexity of the communication patterns and associated data structures increases even more. In our study, we present our experience of converting a scientific numerical CFD code with unstructured meshes and higher-order numerical schemes from MPI to Coarray Fortran for parallel communication.
Coarray Fortran is based on the Single Program Multiple Data (SPMD) model of parallel programming [
A
An allocatable
Also, the same allocate statement should be executed by all
Data is remotely accessed using
Note that a
Similarly, for an allocatable, derived data type
As the
Our CFD solver is an unstructured mesh, finite volume Navier-Stokes solver for compressible flows, supporting mixed element meshes. In certain situations, the compressible nature of the fluid results in shock waves, with a sharp interface between regions of distinct properties such as density and pressure. These flows are commonly encountered in aerospace applications. To avoid prediction of a diffused interface and to predict the shock strength accurately using a CFD solver, higher-order numerical schemes are essential [
In a cell-centered finite volume solver, such as ours, the cell volume averaged solution (with either conserved or characteristic variables) is stored at the center of the cells in the mesh. If these cell-averaged values are used directly for the intercell flux calculations in the iterative solver to determine the solution at the next time step or iteration, then first-order spatial accuracy is achieved. For greater accuracy, a conservative, higher-order reconstruction polynomial is used. The neighboring cells that are used for calculating the reconstruction polynomial define the zone of influence and are collectively known as the stencil. The order of accuracy of the reconstruction depends on the size of the stencil. While the reconstruction provides greater accuracy in regions with smooth solutions, near sharp discontinuities such polynomials are inherently oscillatory [
In the traditional Total Variation Diminishing (TVD) schemes, the oscillatory nature of the polynomial near a discontinuity is kept under control by using slope or flux limiters. The resulting schemes, such as the MUSCL scheme (Monotonic Upstream-Centered Scheme for Conservation Laws), have higher-order accuracy in regions with smooth solutions, while accuracy is lowered in regions with sharp or discontinuous solutions.
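The role of a slope limiter can be illustrated with a minimal one-dimensional sketch. It is not taken from the solver described here: it assumes a uniform 1D grid and the minmod limiter, and the function names `minmod` and `muscl_face_values` are hypothetical.

```python
def minmod(a, b):
    """Return the smaller-magnitude slope if a and b share a sign, else 0."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def muscl_face_values(u):
    """Limited (left, right) face values for the interior cells of a 1D grid.

    u holds cell averages on a uniform grid (an assumption of this sketch);
    the first and last cells are skipped because they lack a neighbor.
    """
    faces = []
    for i in range(1, len(u) - 1):
        # Limit between the backward and forward differences.
        slope = minmod(u[i] - u[i - 1], u[i + 1] - u[i])
        faces.append((u[i] - 0.5 * slope, u[i] + 0.5 * slope))
    return faces


smooth = [0.0, 1.0, 2.0, 3.0]  # linear data: the slope is kept
jump = [0.0, 0.0, 1.0, 1.0]    # step data: the slope is limited to zero
print(muscl_face_values(smooth))
print(muscl_face_values(jump))
```

For the linear data the reconstructed face values follow the slope, whereas next to the step the limiter returns zero and the reconstruction falls back to the first-order cell average, which is the loss of accuracy near discontinuities described above.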
The WENO (Weighted Essentially Non-Oscillatory) scheme aims to provide higher-order accuracy throughout the domain by using multiple reconstruction polynomials with solution-adaptive nonlinear weighting. The WENO scheme uses one central stencil and several directional stencils to construct the reconstruction polynomial. Among the directional stencils, higher weighting is given to the smoother reconstruction polynomials, and the highest weighting is given to the central stencil. The nonlinear weights are thus solution adaptive.
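The solution-adaptive weighting can be sketched as follows. This is a generic WENO-style formula, not necessarily the one used in the solver: each stencil gets an unnormalized weight equal to its linear weight divided by a power of its smoothness indicator, and the weights are then normalized. The function name `weno_weights` and the values of `eps`, `power`, and the linear weights are illustrative assumptions.

```python
def weno_weights(smoothness, linear_weights, eps=1e-6, power=4):
    """Nonlinear weights from per-stencil smoothness indicators.

    smoothness[m] is large when stencil m crosses a discontinuity.
    eps avoids division by zero; eps and power are assumed values.
    """
    alphas = [
        lam / (eps + s) ** power
        for lam, s in zip(linear_weights, smoothness)
    ]
    total = sum(alphas)
    return [a / total for a in alphas]


# Index 0 is the central stencil; it gets a much larger linear weight
# (the value 1000 is an assumption made for this sketch).
linear = [1000.0, 1.0, 1.0, 1.0]

# Smooth region: all stencils equally smooth, so the central stencil dominates.
print(weno_weights([1.0, 1.0, 1.0, 1.0], linear))

# Near a shock: only the last directional stencil avoids the discontinuity.
print(weno_weights([10.0, 10.0, 10.0, 1e-3], linear))
```

In the smooth case the central stencil keeps almost all of the weight, while near the discontinuity the weight shifts to the one directional stencil that does not cross it.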
Details of the implementation of the CFD solver are provided in [
The MPI version of the code uses different derived data types to store the values of the solution variables and the associated mesh data. Since the code uses unstructured meshes and the WENO scheme, it has inherent load imbalance due to stencils of varying lengths. To accommodate the imbalanced memory storage, derived data type
For simplicity, a generic naming scheme is used in the following text to explain the modifications required in the code to incorporate communication with Coarray Fortran.
Let us say, in an
In the MPI version, every process has its
Schematic of data exchange process using MPI.
In the MPI version, respective
To incorporate the Coarray Fortran communication, with minimal changes to the original data structure and to avoid any additional memory copies before and after communication, an additional communication array was created.
Since push communication was needed in the Coarray Fortran version as well, the location of the
An initialization subroutine is called once before the communication subroutine to find the
Schematic of data exchange process using Coarray Fortran.
Figure
Working of CommArr in the Coarray Fortran version of the code.
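The idea of a one-time initialization followed by push communication into a communication array can be mimicked in a short serial sketch. It is only an illustration under stated assumptions: images are modeled as entries of a Python list, the actual layout of the communication array in the solver is not known here, and the names `init_offsets`, `push_exchange`, `send_counts`, and `halo_data` are hypothetical.

```python
def init_offsets(send_counts):
    """One-time setup: decide where each sender writes on each receiver.

    send_counts[src] maps dst -> number of values image src pushes to dst.
    Returns (offsets, sizes): offsets[src][dst] is the start position inside
    dst's communication array, and sizes[dst] is the length of that array.
    """
    n = len(send_counts)
    offsets = [dict() for _ in range(n)]
    sizes = [0] * n
    for dst in range(n):
        for src in range(n):
            count = send_counts[src].get(dst, 0)
            if count:
                offsets[src][dst] = sizes[dst]
                sizes[dst] += count
    return offsets, sizes


def push_exchange(halo_data, offsets, sizes):
    """Each image writes its halo values straight into the receiver's array."""
    comm_arr = [[None] * size for size in sizes]
    for src, outgoing in enumerate(halo_data):
        for dst, values in outgoing.items():
            start = offsets[src][dst]
            # Stands in for a remote write (a "put") on image dst.
            comm_arr[dst][start:start + len(values)] = values
    # In Coarray Fortran, a synchronization such as "sync all" would be
    # needed here before any image reads its own communication array.
    return comm_arr


# Three images; image 1 neighbors both image 0 and image 2.
halo_data = [{1: [1.0, 1.1]}, {0: [2.0], 2: [2.5, 2.6]}, {1: [3.0]}]
send_counts = [{dst: len(v) for dst, v in out.items()} for out in halo_data]
offsets, sizes = init_offsets(send_counts)
print(push_exchange(halo_data, offsets, sizes))
```

Because the offsets are computed once, each later exchange needs no packing or unpacking step on the receiving side, at the cost of one extra level of indirection when the received values are read.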
A 2D external flow test case was chosen for validation and performance measurements. In this test case, airflow over the RAE 2822 aerofoil in steady, turbulent conditions was modeled in the transonic regime. The computational domain boundaries were fixed 300 chord lengths away, and an unstructured mixed mesh was created, containing quadrilateral elements in the boundary layer near the aerofoil and triangular elements away from it. The resultant mesh had 52378 cells: 39120 quadrilateral cells and 13258 triangular cells. The free-stream conditions at the inlet correspond to
Here,
The third-order WENO scheme, denoted as WENO-3, was used for achieving higher-order accuracy. For the WENO-3 scheme, the central and the directional stencils for a triangular mesh element are shown in Figure
Central (stencil 1) and directional stencils (stencil 2, stencil 3, and stencil 4) for a triangular mesh element.
Combined stencil for the triangular mesh element.
Two HPC facilities were used in our study, ASTRAL and ARCHER. ASTRAL is an SGI cluster based on Intel processors and owned by Cranfield University. ARCHER, the UK’s national supercomputing facility, is a Cray XC30 system.
ASTRAL has 80 physical compute nodes. Each compute node has two 8-core E5-2260 series processors and 8 GB RAM per core (i.e., 128 GB per node). Hyperthreading is disabled. A 34 TB parallel file storage system (Panasas) is connected to all the nodes. Infiniband QDR connectivity exists among all nodes and to the storage appliance. The operating system is Suse Linux 11.2.
ARCHER has a total of 4920 compute nodes. Each standard compute node (4544 nodes out of the total 4920 nodes) contains two 12-core E5-2697 v2 (Ivy Bridge) series processors and a total of 64 GB RAM per node. Each processor core can support two hyperthreads, but they were not used. The compute nodes are connected to a parallel Lustre filesystem. The Cray Aries interconnect links all the compute nodes. A stripped-down version of the Cray Linux Environment (CLE), Compute Node Linux (CNL), is run on the compute nodes to reduce the memory footprint and overheads of the full OS.
Intel Fortran compiler 15.0.3 and Intel MPI version 5.0 Update 3 were used on ASTRAL and Cray compiling environment 8.3.3 was used on ARCHER.
The compiler flags used on ASTRAL were
The compiler flags used on ARCHER were
Coarray Fortran uses a simpler, more user-friendly syntax, which results in cleaner code in comparison to MPI.
To demonstrate the clarity obtained with the Coarray Fortran, we have presented one communication subroutine from our code in Listing
This subroutine is used for communicating the reconstructed, boundary extrapolated values of each Gaussian quadrature point of the neighboring halo cells. The derived data type (
The resultant code is also much easier to understand, while preserving the functionality.
To validate the numerical predictions from the CFD code the experimental measurements [
The pressure coefficient profile over the aerofoil for the WENO-3 scheme along with the reference results is shown in Figure
Pressure coefficient over the RAE 2822 airfoil surface for the code compared with experimental and CFD results from literature. Note: some points in CFD results are omitted for clarity.
The sharp dip in pressure coefficient, near
Mach number contour over the RAE 2822 airfoil.
It can be observed that the WENO-3 scheme predicts the shock location more accurately than the WIND code. The WIND code uses a second-order finite difference scheme; thus greater errors may be expected. The MPI and Coarray Fortran versions of our code provide the same predictions, which confirms that no errors were introduced during the conversion process.
To compare the performance of MPI and Coarray Fortran communication in the code, the validation test case was run for 1000 iterations. These tests were performed on ASTRAL (Intel compiler) and ARCHER (Cray compiler), and the elapsed time was measured for the iterative calculations excluding any initialization and savefile outputs. Since the most time-consuming part of the simulation is the iterative solver, the initialization and savefile output time can be neglected.
Figure
Performance results for the solver.
On ASTRAL with the Intel compiler, the Coarray Fortran version of the code is slower than MPI. Since Intel's Coarray Fortran implementation is based on MPI-3 remote memory access calls, it is subject to overheads over MPI. These overheads are so large that any performance gains from replacing the blocking send and receive commands in the MPI version with nonblocking remote access calls in the Coarray Fortran version are wiped out. An interesting result to note is that a sudden performance degradation occurs when communication takes place among
Haveraaen et al. [
In contrast, on ARCHER with the Cray compiler, the performance of the Coarray Fortran version of the code is mostly similar to (up to 96 cores) or in some cases (192 and 384 cores) better than that of the MPI version. Also, the execution time is lower on ARCHER owing to its faster architecture compared to ASTRAL. For shorter messages, Coarray Fortran has lower overheads than MPI; this translated into better performance at higher core counts with the Coarray Fortran version. Also, the extra redirection due to the communication array did not adversely affect the results in comparison to the other gains.
Open-source compilers such as GCC (with OpenCoarrays) and OpenUH have been used in other benchmark studies to demonstrate the performance of their Coarray Fortran implementation in comparison to the MPI implementations. During our study, we found that the code featured in Listing
Coarray Fortran provides a simpler and more productive alternative to MPI for parallelization. With minimal code modifications, even codes with unstructured meshes can be parallelized with Coarray Fortran. The increased readability of the resultant code enhances productivity. Based on the performance and the current level of development, we found the Cray compiler to be suitable for development with Coarray Fortran. The performance with the Cray compiler was similar to or better than that of MPI in the tests. With the Intel compiler, significant performance degradation was observed, especially in internode communication. This may be attributed to an inefficient implementation and possible bugs. While commercial compilers support all the features of Coarray Fortran specified in the Fortran 2008 standard, some limitations still exist in the open-source implementations, such as OpenCoarrays (GCC) and OpenUH. We are hopeful that these limitations will be resolved in future versions.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work used the ARCHER UK National Supercomputing Service (