Scalability of OpenFOAM Density-Based Solver with Runge–Kutta Temporal Discretization Scheme

Compressible density-based solvers are widely used in OpenFOAM, and the parallel scalability of these solvers is crucial for large-scale simulations. In this paper, we report our experiences with the scalability of OpenFOAM's native rhoCentralFoam solver and show the degree to which its scalability can be improved by a small number of modifications. The main modification is to replace the first-order accurate Euler scheme in rhoCentralFoam with a third-order accurate, four-stage Runge-Kutta (RK4) scheme for the time integration. The scaling test we used is the transonic flow over the ONERA M6 wing, a common validation case for compressible flow solvers in aerospace and other engineering applications. Numerical experiments show that, for the same spatial discretization, our modified solver, referred to as rhoCentralRK4Foam, achieves as much as a 123.2% improvement in scalability over the rhoCentralFoam solver. As expected, the better time resolution of the Runge-Kutta scheme makes it more suitable for unsteady problems such as the Taylor-Green vortex decay, where the new solver showed a 50% decrease in the overall time-to-solution compared to rhoCentralFoam to reach the final solution with the same numerical accuracy. Finally, the improved scalability can be traced to the improvement of the computation-to-communication ratio obtained by substituting the RK4 scheme in place of the Euler scheme. All numerical tests were conducted on the Cray XC40 parallel system Theta at Argonne National Laboratory.


Introduction
With the continuous development of hardware and software supporting high-performance computing, eventually leading to exascale computing capabilities in the near future, high-fidelity simulations of complex flows that were out of reach until a decade ago are now becoming feasible on supercomputer infrastructures. Direct numerical simulations (DNS) and large-eddy simulations (LES) are natural candidates for high-fidelity simulations, as they can capture all relevant spatial and temporal characteristic scales of the flow. Indeed, there is consensus in the computational fluid dynamics (CFD) community that DNS/LES codes incorporating high-order time integration and spatial discretization methods are preferable for ensuring minimal influence of numerical diffusion and dispersion on the flow physics. While these numerical constraints have traditionally been applied to the simulation of academic flows in simple geometries, they are also being considered for industrial and more complex applications where accurate prediction of local or instantaneous flow properties is required (e.g., in combustion, multiphase, and reacting flows).
In this context, the OpenFOAM package [1] is a popular open-source software package originally designed for use in CFD. Its operation is composed of solvers, dictionaries, and domain manipulation tools. It provides several different kinds of solvers for different partial differential equations (PDEs) and a framework to implement third-party solvers for customized PDEs. The solvers integrated in the standard distributions of OpenFOAM are robust, but they generally lack precision, being at most second-order accurate in both space and time. Therefore, there is a growing interest in the CFD community in developing and implementing higher-order methods in OpenFOAM for transient flow calculations. These include, for example, numerical algorithms for DNS/LES of incompressible flows [2], compressible flows [3][4][5], and reacting flows [6].
In addition to high-order numerical schemes, parallel efficiency and scalability are of course crucial for high-performance computing of complex flows. Parallelism in OpenFOAM is implemented by means of MPI (Message Passing Interface), although previous scalability analyses of incompressible OpenFOAM solvers showed limited speedup [7, 8]. The improvement of scaling performance in OpenFOAM has been a concern of many recent studies. In order to optimize performance, it is crucial to first conduct a performance analysis to find the bottleneck of the simulation process. Culpo [7] found that communication is the scalability bottleneck of OpenFOAM solvers in large-scale simulations. Duran et al. [9] studied the speedup of the icoFoam solver for different problem sizes and showed that there is large room for scalability improvement. Lin et al. [10] proposed a communication optimization method for a multiphase flow solver in OpenFOAM. Ojha et al. [11] applied optimizations to the geometric algebraic multigrid linear solver and showed improved scaling performance.
In this work, we are primarily interested in compressible flow solvers for aeronautical applications. In the standard distribution of OpenFOAM, rhoCentralFoam is the only density-based solver for transient, compressible flows [12]. It is built upon a second-order nonstaggered central scheme based on the KT (Kurganov and Tadmor [13]) and KNP (Kurganov, Noelle, and Petrova [14]) methods. There are a few studies that have tried to improve the numerical algorithm of density-based solvers. Heyns et al. [15] extended the rhoCentralFoam solver through the implementation of alternative discretization schemes to improve its stability and efficiency. More recently, Modesti and Pirozzoli [4] developed a solver named rhoEnergyFoam relying on the AUSM scheme by Liou and Steffen Jr. [16]. By using a low-storage third-order, four-stage Runge-Kutta algorithm for time advancement, their solver showed reduced numerical diffusion and better conservation properties than rhoCentralFoam in both steady and unsteady turbulent flows.
In this work, we present a new OpenFOAM solver, named rhoCentralRK4Foam, which is derived from rhoCentralFoam by replacing its first-order time advancement scheme with a third-order, four-stage Runge-Kutta scheme. The aim of developing this solver is twofold: (i) to improve the scaling performance of rhoCentralFoam, especially in large-scale simulations; (ii) to improve the time accuracy and the overall time-to-solution using a high-order Runge-Kutta scheme [17]. Instead of trying to optimize the parallelism embedded in OpenFOAM, which has been shown to be very hard to achieve (see, e.g., Culpo [7]), our approach is to select a different numerical integration scheme that shows improved CPU and scalability performance with minimal modification of the standard distribution. The cases are configured to solve the PDEs with both the rhoCentralFoam and rhoCentralRK4Foam solvers. We investigate the parallel performance of rhoCentralFoam and rhoCentralRK4Foam on the Cray XC40 system Theta [18]. The two solvers are benchmarked on two cases, an inviscid transonic flow over the ONERA M6 wing and a supersonic flow over a forward-facing step, to validate the new solver's shock-capturing capability.
The TAU (Tuning and Analysis Utilities) Performance System analyzer [19] is used to collect the hotspot profiles of the two solvers. The strong and weak scaling tests of the benchmark problems are conducted on Theta on up to 4,096 cores. For the time-to-solution analysis, we considered the Taylor-Green vortex problem and determined the time-to-solution needed by the two solvers to obtain a solution with the same accuracy (same numerical error) with respect to the analytical solution. The present study also aims at providing users a handle on parameter choices for performance and scalability. The rest of the paper is organized as follows: in Section 2, a description of the numerical methods of the two solvers as well as the hardware and profiling tools is presented. The benchmark test cases and the results of the scalability analysis are presented in Section 3. In Section 4, we present the results of the time-to-solution analysis. Section 5 presents the conclusions.

Governing Equations and Spatial Integration.
The solver rhoCentralFoam is the most widely used compressible flow solver in the baseline distribution of OpenFOAM. It relies on the full discretization of the convective fluxes through the central total variation diminishing (TVD) KT and KNP schemes. rhoCentralFoam solves the governing fluid equations in an Eulerian frame of reference for three conservative variables, namely, density (ρ), momentum density (ρu), and total energy density (ρE):

∂ρ/∂t + ∇·(ρu) = 0, (1)
∂(ρu)/∂t + ∇·(ρu ⊗ u) + ∇p − ∇·τ = 0, (2)
∂(ρE)/∂t + ∇·[(ρE + p)u] − ∇·(τ·u) + ∇·J = 0. (3)

In the above equations, E ≡ e(T) + (1/2)u·u, where e(T) is the internal energy and T is the temperature; p is the thermodynamic pressure, related to the temperature and density through the equation of state for ideal gases, p = ρRT, where R is the gas constant; τ = 2μ[(1/2)(∇u + ∇ᵀu) − (1/3)(∇·u)I] is the viscous stress tensor under the assumption of a Newtonian fluid with dynamic viscosity μ, and I is the unit tensor. Finally, J = −λ∇T is the heat flux vector, where λ is the thermal conductivity. The governing equations are discretized using a finite volume method in which equations (1)-(3) are integrated over a control volume represented by a grid cell of volume V. Using the Gauss theorem, all fluxes can be transformed into surface integrals across the cell boundaries, which are estimated by summing up the flux contributions from all faces f of the cell. The volume average of the state vector of conserved variables Ψ ≡ [ρ; ρu; ρE] is given by Ψ̄ = (1/V)∫_V Ψ dV, (4) so the integral form of equations (1)-(3) can be expressed as dΨ̄/dt = L(Ψ̄), (5) where L(Ψ) denotes the operator returning the right-hand side of equations (1)-(3) containing all inviscid and viscous fluxes. These fluxes have to be estimated numerically using the volume-averaged state variables Ψ̄ of adjacent cells.
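To make the face-based assembly concrete, here is a minimal, self-contained sketch (hypothetical data structures, not the OpenFOAM API) of how the right-hand-side operator L(Ψ) is accumulated for one scalar conserved quantity from precomputed face fluxes:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical face record: the surface flux F_f . S_f is outward-positive
// for the owner cell; neighbour is -1 on boundary faces.
struct Face {
    double flux;
    int owner;
    int neighbour;
};

// Accumulate L(Psi) = -(1/V) * sum_f F_f . S_f for every cell.
std::vector<double> rhs(const std::vector<Face>& faces,
                        const std::vector<double>& volume)
{
    std::vector<double> L(volume.size(), 0.0);
    for (const Face& f : faces) {
        L[f.owner] -= f.flux;         // flux leaves the owner cell
        if (f.neighbour >= 0)
            L[f.neighbour] += f.flux; // and enters the neighbour cell
    }
    for (std::size_t i = 0; i < L.size(); ++i)
        L[i] /= volume[i];            // divide by the cell volume V
    return L;
}
```

Note that, by construction, what leaves one cell through an internal face enters its neighbour, so the scheme is discretely conservative.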
In particular, the convective fluxes are obtained as Σ_f ϕ_f Ψ_f, (6) where subscript f identifies the faces of the cell volume, whereas the terms Ψ_f, u_f, S_f, and ϕ_f ≡ u_f·S_f represent the state vector, velocity, surface area vector, and volumetric flux at the interface between two adjacent cells, respectively. The product ϕ_f ρ_f is obtained by applying the KT scheme, ϕ_f ρ_f = (1/2)[ϕ_f⁺ρ_f⁺ + ϕ_f⁻ρ_f⁻ − α_max(ρ_f⁻ − ρ_f⁺)], (7) where the + and − superscripts denote surface values interpolated from the owner cell and the neighbour cell, respectively, while α_max is an artificial diffusive factor (see [12] for details).
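A sketch of this face flux follows, assuming the central-plus-diffusive KT form with α_max as the diffusive coefficient described above (function and variable names are our own, not the solver's):

```cpp
// Hedged sketch of the KT central flux for the mass flux phi_f * rho_f.
// '+' values are interpolated from the owner cell, '-' from the neighbour;
// alphaMax is the artificial diffusive factor of the KT scheme [12, 13].
double ktMassFlux(double phiPlus, double rhoPlus,
                  double phiMinus, double rhoMinus,
                  double alphaMax)
{
    // Central average of the two one-sided fluxes ...
    double central = 0.5 * (phiPlus * rhoPlus + phiMinus * rhoMinus);
    // ... minus an upwind-like diffusive correction scaled by alphaMax.
    double diffusive = 0.5 * alphaMax * (rhoMinus - rhoPlus);
    return central - diffusive;
}
```

When the owner and neighbour interpolants agree, the diffusive correction vanishes and the flux reduces to the plain product ϕ_f ρ_f, as it should.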
In the implementation of rhoCentralRK4Foam, the Eulerian fluxes are first calculated explicitly as shown in Algorithm 1, where "phi" represents ϕ_f ρ_f and "phiUp" represents the corresponding convective momentum flux. The conserved variables are then advanced in time with the multistage scheme Ψ^(k) = Ψⁿ + α_k Δt L(Ψ^(k−1)), k = 1, …, N_s, (8) where Ψ^(0) = Ψⁿ and Ψ^(N_s) = Ψⁿ⁺¹. The implementation in OpenFOAM is reported in Algorithm 2, and the C++ code is publicly available at https://github.com/SiboLi666/rhoCentralRK4Foam. Note that the original Euler scheme implemented in rhoCentralFoam can be simply recovered from equation (8) by setting N_s = 1 and α_1 = 1.
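The multistage update of equation (8) can be sketched as follows for a scalar ODE; the stage coefficients {1/4, 1/3, 1/2, 1} are a commonly used Jameson-style choice for a low-storage four-stage scheme and are an assumption here, not necessarily the exact coefficients of the released solver:

```cpp
#include <cmath>
#include <functional>

// Low-storage multistage Runge-Kutta step in the form of equation (8):
// Psi^(k) = Psi^n + alpha_k * dt * L(Psi^(k-1)), k = 1..Ns.
// Coefficients {1/4, 1/3, 1/2, 1} are an assumed Jameson-style choice;
// Ns = 1 with alpha_1 = 1 recovers the Euler scheme of rhoCentralFoam.
double rkStep(double psiN, double dt,
              const std::function<double(double)>& L)
{
    const double alpha[4] = {0.25, 1.0 / 3.0, 0.5, 1.0};
    double psi = psiN;
    for (double a : alpha)
        psi = psiN + a * dt * L(psi); // each stage restarts from Psi^n
    return psi;
}
```

For a linear operator this update reproduces the Taylor expansion of the exponential through fourth order, which is why it tracks decaying solutions far more accurately than a single Euler step of the same size.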

Hardware and Profiling Tools.
The simulations were run on the supercomputer platform "Theta" at Argonne National Laboratory. Theta is a Cray XC40 system with second-generation Intel Xeon Phi (Knights Landing) processors and the Cray Aries proprietary interconnect. The system comprises 4,392 compute nodes. Each compute node is a single Xeon Phi chip with 64 cores, 16 GB of Multi-Channel DRAM (MCDRAM), and 192 GB of DDR4 memory. To avoid any possible memory issues with OpenFOAM, the simulations were run using 32 cores of each compute node (half of its maximum capacity).
We assess the scalability and parallel efficiency of the two solvers, rhoCentralFoam and rhoCentralRK4Foam, by analyzing their speedup on the same grid and by monitoring the portions of CPU time spent in communication and in computation, respectively. These measurements are performed with performance tools adapted to the machine. In the strong scaling tests, the tools are set to start counting after the initial setup stage is completed, in our case after 20 time steps, and to last for an extra 75 time steps. The strong scaling tests are performed on Theta. In order to measure the time spent in communication, we rely on the TAU Performance System [19]. The TAU performance tool suite is open-source software developed at the University of Oregon and offers a diverse range of performance analysis capabilities. On Theta, the OpenFOAM makefile is modified to use the TAU wrapper functions for compilation. TAU can automatically parse the source code and insert instrumentation calls to collect profile and/or trace data, which allows us to measure the total time spent in communication during the targeted 75 time steps and to identify the hotspots of the two solvers.

Test Cases Description.
For the scalability analysis, we tested the new solver rhoCentralRK4Foam on two different benchmark cases: (i) the three-dimensional ONERA M6 transonic wing; (ii) the two-dimensional supersonic forward-facing step. The two cases are steady flows; they are first used to validate the new solver's shock-capturing capability and then for a detailed parallel performance analysis of the solver in the next section. For both cases, we used the decomposePar tool in OpenFOAM to decompose the generated mesh; the scotch decomposition method [20] is used to partition the mesh into subdomains. A brief description of the two cases is presented next.
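For reference, a minimal decomposeParDict of the kind used here might look as follows (the subdomain count is illustrative; only the choice of the scotch method is prescribed by the text):

```
// system/decomposeParDict (illustrative fragment)
numberOfSubdomains  128;    // one subdomain per MPI rank
method              scotch; // graph partitioning; needs no geometric input
```

The scotch method balances cell counts across subdomains while minimizing the number of faces cut, which directly controls the inter-rank communication volume.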

Transonic M6 Wing.
In the ONERA M6 wing case, the mean aerodynamic chord is c = 0.53 m, the semispan is b = 1 m, and the computational domain extends to 20c. The angle of attack is 3.06°, and the freestream Mach number is Ma = 0.8395. The Reynolds number in the original experiment was Re = 11.72 × 10⁶, so the flow is certainly turbulent; however, in order to capture the pressure distribution and the shock location along the wing, inviscid calculations can be safely employed and are indeed customary (see, for example, Modesti and Pirozzoli [4]). The geometry is meshed with hexahedral elements, and three grids of different sizes are generated: Grid1 has 1 million cells (shown in Figure 1(a)), Grid2 has 5 million cells, and Grid3 has 43 million cells. The flow-field analysis was conducted on Grid1 (which is sufficient for grid convergence), while Grid2 and Grid3 were used for the scaling analysis. Figure 1(b) shows the pressure distribution on the wing surface computed using rhoCentralFoam and rhoCentralRK4Foam. In Figure 1(c), the pressure coefficient at the inner wing section (20% span) computed by the two solvers is compared with the experimental data (blue circle symbols [21]). We can observe that the newly developed solver captures the main flow features: it generates a pressure distribution on the wing surface similar to that obtained with rhoCentralFoam and captures the shock location precisely.

Supersonic Forward-Facing Step.
Computations of inviscid flow over a forward-facing step are used to further verify the shock-capturing capability of the rhoCentralRK4Foam solver. The flow configuration of Woodward and Colella [22] is considered in this work. The freestream Mach number is Ma∞ = 3. The grid has N_x × N_y = 240 × 80 cells in the coordinate directions. The shock pattern, represented by the density distribution shown in Figure 2, confirms that rhoCentralRK4Foam is capable of capturing strong shocks in supersonic flows.

Strong Scaling Tests.
In the strong scaling tests, we ran the rhoCentralFoam and rhoCentralRK4Foam solvers on three different mesh sizes of the ONERA M6 wing. Here, we present the scalability results by showing the speedup as well as the speedup increment percentage of rhoCentralRK4Foam: for Grid1 up to 1024 ranks in Table 1, for Grid2 up to 2048 ranks in Table 2, and for Grid3 up to 4096 ranks in Table 3. Note that the scaling is presented as the CPU time normalized by the CPU time obtained with 128 ranks, which is the smallest rank count considered.

The measured speedups can be interpreted with Amdahl's law, S(N_r) = 1/((1 − f) + f/N_r), (9) where f ≡ t_p/(t_s + t_p) is the parallel fraction of the code, and t_p and t_s are the CPU times spent in the parallel and serial (i.e., not parallelizable) sections of the solver, respectively. From the previous equation, we have S_∞ = 1/(1 − f), (10) where S_∞ ≡ S(N_r ⟶ ∞) is the maximum theoretical speedup obtained for N_r ⟶ ∞. By measuring S_∞, we can directly evaluate f from equation (10) for the cases featuring an asymptotic speedup: for rhoCentralFoam (Grid3), this yields f ≃ 99.88%. We can also estimate f by best fitting S(N_r) to Amdahl's formula in equation (9) for rhoCentralRK4Foam, which yields f ≃ 99.98%. Of course, these results can be affected by other important factors, such as the latency and bandwidth of the machine, that are not taken into account in Amdahl's model. We also observe that the scalability improves as the problem size increases, because the ratio of communication to computation workload decreases. For example, with 2048 ranks, the speedup of rhoCentralRK4Foam is 6.005 for Grid2 (5 million cells), while it becomes 9.115 for Grid3 (43 million cells). Moreover, as the grid size increases, the maximum speedup increment percentage also increases, corresponding to 16%, 32%, and 123% for Grid1, Grid2, and Grid3, respectively. As a final exercise, the TAU performance tool was applied to profile the code for Grid3 with 2048 ranks. Figure 4 also shows the communication and computation time percentages for rhoCentralFoam and rhoCentralRK4Foam on Grid3 from 128 to 4,096 ranks. We can observe that the rhoCentralRK4Foam solver spends less time in communication, which leads to the improved parallel performance found in the previous tests. In addition, among the MPI subroutines that are relevant for communication (i.e., called every time step in the simulation), MPI_Waitall() was the one that consumed most of the time (see Figure 5), which is in line with previous profiling (see, for example, Axtmann and Rist [24]).
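The speedup model used in this analysis can be restated compactly; the following is a plain encoding of Amdahl's formulas with f the parallel fraction:

```cpp
#include <cmath>

// Amdahl's law: speedup at nRanks ranks for a code whose
// parallel fraction is f (0 <= f < 1).
double amdahlSpeedup(double f, double nRanks)
{
    return 1.0 / ((1.0 - f) + f / nRanks);
}

// Inverse of the asymptotic relation S_inf = 1/(1 - f): recover the
// parallel fraction f from a measured asymptotic speedup.
double parallelFraction(double sInf)
{
    return 1.0 - 1.0 / sInf;
}
```

For instance, f = 99.88% caps the speedup at 1/(1 − 0.9988) ≈ 833 regardless of the rank count, which is why even small serial fractions dominate at thousands of ranks.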

Weak Scaling Tests.
In order to further analyze the parallel scalability of the rhoCentralRK4Foam solver, we conducted a weak scaling analysis based on the ONERA M6 wing case and the forward-facing step case. The grid size ranges from 700,000 to 44,800,000 cells in the ONERA M6 wing case and from 1,032,000 to 66,060,000 cells in the forward-facing step case. The configurations of the weak scaling test cases are the same as in the previous test cases. The number of ranks ranges from 16 to 1024 and from 64 to 4096, respectively. The number of grid points per rank stays constant at around 43 × 10³ and 16 × 10³ for the two cases, which keeps the computational workload per rank fixed at every scale. Denoting by τ the CPU time for a given run, the relative CPU time is obtained by normalizing τ with the CPU time of the run at the smallest rank count, τ_rel(N_r) = τ(N_r)/τ(N_r,min), so that perfect weak scaling corresponds to τ_rel = 1. The results are summarized in Tables 4 and 5. The comparison of the two solvers is also plotted in Figures 6 and 7. The weak scaling tests give an indication of how well the two solvers scale when the problem size is increased in proportion to the process count. From Tables 4 and 5 and Figures 6 and 7, it can be observed that at lower MPI task counts (16 to 64), both solvers scale reasonably well. However, at higher MPI task counts (128 to 1024), the rhoCentralRK4Foam solver scales better. Remarkably, we can observe a distinguishable relative time difference between the two solvers as the number of ranks increases from 512 to 1024. The profiling results from TAU show that in the test case with 1024 ranks, rhoCentralRK4Foam spends around 40% of its time on computation, while rhoCentralFoam spends only around 20%. As for the forward-facing step case, we were able to conduct the tests at larger core counts (up to 4096 cores), and it can be confirmed that the rhoCentralRK4Foam solver has better scaling performance than the rhoCentralFoam solver.
Indeed, it scales better at larger grid sizes: the relative time of the rhoCentralRK4Foam solver remains at 1.063 with 4096 cores (the relative time is 1.148 with 1024 cores in the ONERA M6 wing case). Generally, the rhoCentralRK4Foam solver outperforms the rhoCentralFoam solver at large rank counts due to communication hiding.

Results for the Time-to-Solution Analysis
The scaling analysis is an important exercise to evaluate the parallel performance of the code on multiple cores; however, it provides insight into neither the time accuracy of the numerical algorithm nor the time-to-solution needed to evolve the discretized system of equations (equations (1)-(3)) up to a final state within an acceptable numerical error. For example, in the scaling study conducted on the M6 transonic wing, the number of iterations was fixed for both solvers, and the calculations were performed in the inviscid limit so that the implicit solver used in rhoCentralFoam to integrate the viscous fluxes was not active. In such circumstances, comparing the two solvers essentially amounts to comparing the first-order (Euler) and the third-order, four-stage Runge-Kutta time-advancement algorithms. Hence, it is not surprising that the CPU time for rhoCentralRK4Foam is about four times larger than for rhoCentralFoam (for the same number of iterations), since it requires four evaluations of the right-hand side of equation (5) per time step instead of one. For a sound evaluation of the time-to-solution, however, we need to compare the two solvers over the same physical time at which the errors with respect to an exact solution are comparable. Because the time accuracy of the solvers is different, the time step is also different, and so is the number of iterations required to reach the final state.

Test Case Description.
For the time-to-solution analysis, we consider the numerical simulation of the Taylor-Green (TG) vortex, a benchmark problem in CFD [25] for validating unsteady flow solvers [26, 27] that requires a time-accurate integration scheme such as the Runge-Kutta scheme implemented in rhoCentralRK4Foam. No turbulence model is applied in the current simulation. The TG vortex admits an analytical time-dependent solution, which allows for a precise definition of the numerical error of the algorithm with respect to a given quantity of interest. The flow is initialized in a square domain 0 ≤ x, y ≤ L as follows:

u(x, y, 0) = −U₀ cos(kx) sin(ky), (12)
v(x, y, 0) = U₀ sin(kx) cos(ky), (13)
p(x, y, 0) = p₀ − (ρ₀U₀²/4)(cos 2kx + cos 2ky), (14)

where u and v are the velocity components along the x and y directions, k ≡ 2π/L is the wave number, and U₀, ρ₀, and p₀ are the arbitrary constant velocity, density, and pressure, respectively. The analytical solution of the initial-value problem defined by equations (11)-(14) is

u(x, y, t) = −U₀ cos(kx) sin(ky) e^(−2νk²t),
v(x, y, t) = U₀ sin(kx) cos(ky) e^(−2νk²t),

which shows the velocity decay due to viscous dissipation, with ν the kinematic viscosity. As quantities of interest, we chose to monitor both the velocity profile u at a selected location and the overall kinetic energy in the computational box, which can be readily obtained from the decaying velocity field as ε(t) = ε(0) e^(−4νk²t).
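As a reference implementation, the decaying analytical solution can be evaluated directly (a sketch based on the expressions above; with the paper's parameters L = 2π and Re = 6.28, the kinetic energy ratio at t = 0.58 s comes out at about 10%, consistent with the text):

```cpp
#include <cmath>

// Analytical Taylor-Green decay: the initial field of equations (12)-(13)
// decays as exp(-2 nu k^2 t), so the box-integrated kinetic energy decays
// as exp(-4 nu k^2 t). All member names are our own.
struct TaylorGreen {
    double U0, L, nu;
    double k() const { return 2.0 * M_PI / L; }
    double u(double x, double y, double t) const {
        return -U0 * std::cos(k() * x) * std::sin(k() * y)
                   * std::exp(-2.0 * nu * k() * k() * t);
    }
    double v(double x, double y, double t) const {
        return U0 * std::sin(k() * x) * std::cos(k() * y)
                  * std::exp(-2.0 * nu * k() * k() * t);
    }
    // Kinetic energy normalized by its initial value.
    double energyRatio(double t) const {
        return std::exp(-4.0 * nu * k() * k() * t);
    }
};
```

Comparing a numerical field against these closed-form profiles is what defines the "same numerical error" criterion used in the time-to-solution comparison below.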

Comparison of Time-to-Solution.
For the analysis of the rhoCentralFoam and rhoCentralRK4Foam solvers, we consider a fixed box size L = 2π m and U₀ = 1 m/s and two values of the Reynolds number, Re = LU₀/ν, to test the effect of viscosity. The computation is done on a 160 × 160 point uniform grid. The simulations run up to t = 0.58 s for Re = 6.28 and t = 11.6 s for Re = 125.6, which corresponds to the same nondimensional time t/t_ref = 0.0147 with t_ref ≡ L²/ν. At this time, the kinetic energy ε has decreased to 10% of its initial value. As an illustrative example, Figure 8 shows the velocity magnitude contours for Re = 6.28. Figure 9 presents the computed velocity profile u(y) in the middle of the box, at x = L/2; rhoC stands for the rhoCentralFoam solver and rhoCRK4 for the rhoCentralRK4Foam solver. It can be observed that rhoCentralRK4Foam shows excellent agreement with the analytical profile for Δt = 10⁻⁴ s (reducing the time step further does not provide any further improvement on the grid used here). In order to achieve the same level of accuracy with rhoCentralFoam, the time step has to be reduced by a factor of 5, to Δt = 2 × 10⁻⁵ s, compared to rhoCentralRK4Foam. The same behavior is observed for the evolution of ε, as shown in Figure 10.
In order to calculate the time-to-solution τ, we first evaluated the CPU time τ₀ per time step per grid point per core for both solvers. We used an Intel Xeon E5-2695v4 processor installed on the Theta and Bebop platforms at Argonne National Laboratory and report the results of the tests in Table 6. Even though τ₀ depends on the specific processor used, the ratio τ₀,rhoCentralRK4/τ₀,rhoCentral does not, and so it is a good indication of the relative performance of the two solvers. In the present case, this ratio is consistently found to be 3 for all Reynolds numbers; we made additional tests at much higher Reynolds numbers, up to Re ∼ 10⁵, to confirm this (see Table 6). Denoting by t_f the physical final time of the simulation, by N_iter = t_f/Δt the number of iterations, and by N_points the number of grid points, the time-to-solution for the two solvers can be obtained as τ = τ₀ N_iter N_points and is reported in Table 7. It is concluded that, to achieve the same level of accuracy, τ_rhoCentralRK4 is about 1.5 to 1.6 times smaller than τ_rhoCentral. Additionally, as discussed in the scaling analysis, rhoCentralRK4Foam can achieve up to a 123% improvement in scalability over rhoCentralFoam when using 4096 ranks. Thus, the reduction in time-to-solution for rhoCentralRK4Foam is expected to be even larger for large-scale, time-accurate simulations utilizing thousands of parallel cores.
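The time-to-solution estimate can be sketched as follows; the τ₀ values below are illustrative placeholders (only their ratio of 3 and the factor-5 difference in time step come from the text), chosen to reproduce the reported ratio of roughly 5/3:

```cpp
// Time-to-solution estimate: tau = tau0 * N_iter * N_points, where
// N_iter = t_f / dt and tau0 is the measured CPU time per time step
// per grid point (per core).
double timeToSolution(double tau0, double tFinal, double dt, double nPoints)
{
    double nIter = tFinal / dt;
    return tau0 * nIter * nPoints;
}
```

The arithmetic behind the 1.5-1.6x figure: the RK4 step costs about 3x an Euler step, but the Euler time step must be 5x smaller for the same accuracy, so τ_Euler/τ_RK4 ≈ 5/3 ≈ 1.67.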

Conclusion
In this study, we presented a new solver, called rhoCentralRK4Foam, for the integration of the Navier-Stokes equations in the CFD package OpenFOAM. The novelty consists in the replacement of the first-order time integration scheme of the native solver rhoCentralFoam with a third-order Runge-Kutta scheme, which is more suitable for the simulation of time-dependent flows. We first analyzed the scalability of the two solvers for a benchmark case featuring transonic flow over a wing on a structured mesh and showed that rhoCentralRK4Foam can achieve a substantial improvement (up to 123%) in strong scaling compared to rhoCentralFoam. We also observed that the scalability improves as the problem size grows. Hence, even though OpenFOAM scalability is generally poor or decent at best, the new solver can at least alleviate this deficiency. We then analyzed the performance of the two solvers in the case of the Taylor-Green vortex decay and compared the time-to-solution, which gives an indication of the workload needed to achieve the same level of accuracy of the solution (i.e., the same numerical error). For the problem considered here, we obtained a reduction of the time-to-solution by a factor of about 1.5 when using rhoCentralRK4Foam compared to rhoCentralFoam, owing to the larger time steps allowed for the same numerical accuracy. This factor will eventually be even larger for larger-scale simulations employing thousands or more cores, thanks to the better speedup of rhoCentralRK4Foam. The proposed solver is potentially a useful alternative for the simulation of compressible flows requiring time-accurate integration, as in direct or large-eddy simulations. Furthermore, the implementation in OpenFOAM can easily be generalized to other schemes of the Runge-Kutta family with minimal code modification.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.