In this paper, we develop and parallelize a CFD solver that supports overlapped meshes on multiple MIC architectures using multithreading techniques. We optimize the solver through several considerations, including vectorization, memory arrangement, and an asynchronous strategy for data exchange on multiple devices. Different vectorization strategies are compared, and the performance of the core functions of the solver is reported. Experiments show that a speedup of about 3.16x can be achieved for the six core functions on a single Intel Xeon Phi 5110P MIC card, and a speedup of 5.9x can be achieved using two cards, compared to an Intel E5-2680 processor, for a case with two ONERA M6 wings.
Funding: National Natural Science Foundation of China (61702438, 11502267); Key Research Project of Institutions of Higher Education of Henan Province (17B520034); Nanhu Scholar Program of XYNU.
1. Introduction
Computing with accelerators such as the graphics processing unit (GPU) [1] and the Intel Many Integrated Core (MIC) architecture [2] has become attractive in computational fluid dynamics (CFD) in recent years because it gives researchers the possibility of accelerating or scaling their numerical codes with various parallel techniques. Meanwhile, the fast development of computer hardware and emerging techniques requires researchers to explore suitable parallel methods for engineering applications. The Intel MIC architecture consists of processors that inherit many key features of Intel CPU cores, which makes code migration less expensive and has made the architecture popular for the development of parallel algorithms.
Many CFD-based codes or solvers have been studied on the Intel MIC architecture. Gorobets et al. [3] used various accelerators including AMD GPUs, NVIDIA GPUs, and Intel Xeon Phi coprocessors to conduct direct numerical simulation of turbulent flows and compared the results from these accelerators. Farhan et al. [4] utilized the native and offload modes of the MIC programming model to parallelize the flux kernel of PETSc-FUN3D, and they obtained about 3.8x speedup with offload mode and 5x speedup with native mode by exploring a series of shared memory optimization techniques. Graf et al. [5] ran their PDE codes on a single Intel Xeon Phi Knights Landing (KNL) node and on multiple KNL nodes using the MPI + OpenMP programming model with different thread affinity types. Cai et al. [6] calculated nonlinear dynamic problems on Intel MIC by employing offload mode and an overlapped data transfer strategy, and they obtained a 17x speedup on MIC over the sequential version on the host for the simulation of a bus model. Saini et al. [7] investigated various numerical codes on MIC to seek performance improvement by comparing different host-coprocessor computing modes, and they also presented a load-balancing approach for symmetric use of MIC coprocessors. Wang et al. [8] reported the large-scale computation of a high-order CFD code on the Tianhe-2 supercomputer, which consists of both CPUs and MIC coprocessors. Other CFD-related works on the Intel MIC architecture can be found in references [9–12]. Working as coprocessors, GPUs have also been popular in CFD. Many researchers [13–17] have studied GPU computing on structured meshes, which involved coalesced computation techniques [13], heterogeneous algorithms [15, 17], numerical methods [16], etc. Corrigan et al. [18] investigated an Euler solver on GPU by employing an unstructured grid and obtained a significant speedup over CPUs.
Subsequent work on unstructured meshes on the GPU platform covered data structure optimization [19, 20], numerical techniques [21], and applications [22]. For GPU simulations on overlapped (overset) meshes, Soni et al. [23] developed a steady CFD solver on unstructured overset meshes using the GPU programming model, which accelerated both the grid (mesh) assembly procedure and the numerical calculations. They then extended their solver to unsteady flows [24, 25] to make their GPU implementation capable of handling dynamic overset meshes. More CFD-related computing on GPUs can be found in an overview reference [26].
However, most of the existing works either used consistent structured or unstructured meshes without mesh overlapping over the computational domain on the MIC architecture or studied overlapped meshes on GPUs. Over the past several decades, the majority of CFD simulations involving overlapped meshes were implemented on distributed systems through the message passing interface (MPI) [27] without using coprocessors. Specifically, Djomehri and Jin [28] reported the parallel performance of an overset solver using a hybrid programming model with both MPI [27] and OpenMP [29]. Prewitt et al. [30] conducted a review of parallel implementations using overlapped mesh methods. Roget and Sitaraman [31] developed an effective overlapped mesh assembler and investigated unsteady simulations on nearly 10000 CPU cores. Zagaris et al. [32] discussed a range of problems regarding parallel computing for moving body problems using overlapped meshes and presented preliminary performance results for parallel overlapped grid assembly. Other overlapped mesh-related works can be found in [33–36]. Although the work in [7] conducted tests using a solver that is compatible with overlapped meshes on MIC coprocessors, it accessed multiple MIC coprocessors through the native mode or symmetric mode of MIC. In this paper, we focus on the offload mode of MIC and investigate the parallelization of a solver with overset meshes on a single host node with multiple MIC coprocessors. The contributions of this work are as follows:
We parallelize an Euler solver using overlapped meshes and propose an asynchronous strategy for calculations on multiple MIC coprocessors within a single host node.
We investigate the performance of the core functions of the solver by employing offload mode with different thread affinity types on the MIC architecture, and we make detailed comparisons between the results obtained by Intel MIC vectorization and those obtained by Intel SSE vectorization.
A speedup of 5.9x is obtained on two MIC coprocessors over a single Intel E5-2680 processor for a case with two M6 wings.
The remainder of the paper is organized as follows. Section 2 introduces the MIC architecture and programming model. Section 3 presents the equations and numerical algorithms that have been implemented in the solver. In Section 4, we discuss implementation and optimization aspects including data transfer, vectorization, and the asynchronous data exchange strategy on multiple MIC coprocessors. The performance of the core functions under different thread affinity types is reported and compared in Section 5. The last section summarizes our work.
2. MIC Architecture and Programming Model
The Many Integrated Core (MIC) architecture [2] integrates many x86 cores on a single processor, providing highly parallel computing power. The architecture used in the first Intel Xeon Phi product is called Knights Corner (KNC). A KNC coprocessor can have up to 61 dual-issue, in-order x86 computing cores. Each core has a 512-bit vector processing unit (VPU), which supports 16 single-precision or 8 double-precision floating point operations per cycle, and each core is able to launch 4 hardware threads. A 32 KB L1 code cache, a 32 KB L1 data cache, and a 512 KB L2 cache are available to each core. The coprocessor used in this work is the Intel Xeon Phi 5110P [37], which consists of 60 cores, each running at 1.05 GHz. It can launch a total of 240 threads simultaneously, and 8 GB of GDDR5 memory is available on it.
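The lane and thread counts quoted above follow from simple arithmetic. The short Python sketch below recomputes them from the stated vector width, core count, and threads per core; the constant and function names are illustrative only.

```python
# Recompute the KNC figures quoted in the text for the Xeon Phi 5110P.
VPU_BITS = 512          # vector width per core, in bits
CORES = 60              # cores on the 5110P
THREADS_PER_CORE = 4    # hardware threads per core


def simd_lanes(vpu_bits: int, element_bits: int) -> int:
    """Number of elements processed by one vector instruction."""
    return vpu_bits // element_bits


single_lanes = simd_lanes(VPU_BITS, 32)   # single precision
double_lanes = simd_lanes(VPU_BITS, 64)   # double precision
total_threads = CORES * THREADS_PER_CORE
print(single_lanes, double_lanes, total_threads)
```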
Anyone who is familiar with the C, C++, or Fortran programming language can develop codes on MIC coprocessors without major revision of their source codes. MIC provides very flexible programming models, including native host mode, native MIC mode, offload mode, and symmetric mode [38]. Coprocessors are not used in native host mode, and programmers can run their codes on CPUs just as they did before the MIC architecture was introduced. By contrast, codes run only on coprocessors in native MIC mode when they are compiled with the “-mmic” option. Symmetric mode allows programmers to run codes on both CPU cores and coprocessors. Offload mode is most commonly used on a single coprocessor or on multiple coprocessors within a single host node. The basic use of offload mode is to write offload directives that make a code segment run on MIC coprocessors. To take full advantage of the computational resources on MIC, the code segment can be executed in parallel by employing multithreading techniques such as OpenMP [29].
3. Equations and Numerical Algorithms
3.1. Compressible Euler Equations
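The three-dimensional time-dependent compressible Euler equations over a control volume $\Omega$ can be expressed in integral form as
$$\frac{\partial}{\partial t}\int_{\Omega} W \, d\Omega + \oint_{\partial\Omega} F_c \, dS = 0, \tag{1}$$
where $W = [\rho, \rho u, \rho v, \rho w, \rho E]^T$ represents the vector of conservative variables and $F_c = [\rho V, \rho u V + n_x p, \rho v V + n_y p, \rho w V + n_z p, \rho H V]^T$ denotes the vector of convective fluxes with $V = n_x u + n_y v + n_z w$.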
3.2. Numerical Algorithms
The flux-difference splitting (FDS) [39] technique is employed to calculate the spatial derivative of the convective fluxes. In this method, the flux at a cell interface (for example, in the i direction), denoted by $F_{i\pm 1/2,j,k}$, can be computed by solving an approximate Riemann problem as
$$F_{i+1/2,j,k} = \frac{1}{2}\left[F(W_L) + F(W_R) - |A_{inv}|\,(W_R - W_L)\right], \tag{2}$$
where the left and right states, $W_L$ and $W_R$, are constructed by the Monotonic Upstream-Centered Scheme for Conservation Laws (MUSCL) [40] with the min-mod limiter.
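To make the structure of equation (2) concrete, the following Python sketch applies the same formula to a scalar model problem with flux $f(w) = a w$, for which the dissipation matrix reduces to $|a|$. It is a minimal illustration rather than the solver's implementation: the second-order, fully limited form of the MUSCL reconstruction and all function names are assumptions.

```python
def minmod(a: float, b: float) -> float:
    """Return the smaller-magnitude slope when the signs agree, else zero."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def muscl_states(w, i):
    """Left and right states at interface i+1/2 from cell values w.

    Assumed form: piecewise-linear reconstruction with min-mod limited slopes.
    """
    slope_i = minmod(w[i] - w[i - 1], w[i + 1] - w[i])
    slope_ip1 = minmod(w[i + 1] - w[i], w[i + 2] - w[i + 1])
    w_left = w[i] + 0.5 * slope_i
    w_right = w[i + 1] - 0.5 * slope_ip1
    return w_left, w_right


def fds_flux(w_left: float, w_right: float, a: float) -> float:
    """Scalar analogue of equation (2) for f(w) = a*w."""
    return 0.5 * (a * w_left + a * w_right - abs(a) * (w_right - w_left))


# Smooth data: both limited slopes equal 1, so the two states coincide.
smooth_left, smooth_right = muscl_states([0.0, 1.0, 2.0, 3.0, 4.0], 1)
# A jump between cells 1 and 2: the limiter switches the slopes off.
jump_left, jump_right = muscl_states([0.0, 0.0, 1.0, 1.0], 1)
```

For a positive wave speed the formula returns the flux of the left state, and for a negative speed that of the right state, which is the upwind behavior expected from a flux-difference splitting scheme.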
Equation (1) is solved in time in this work by employing an implicit approximate-factorization method [41], which achieves first-order accuracy in steady-state simulations.
Euler wall boundary conditions (also called inviscid surface conditions), inflow/outflow boundary conditions, and symmetry plane boundary conditions are applied to equation (1) by using the ghost cell method [42].
3.3. Mesh Technique
In this work, we aim to solve 3D Euler equations on multiple MIC devices by using overlapped mesh technique. In this technique [33, 42], each entity or component is configured by a multiblock mesh system, and different multiblock mesh systems are allowed to overlap with each other. During the numerical calculations, overlapped regions need to receive interpolated information from each other by using interpolation methods [43, 44]. The process of identifying interpolation information among overlapped regions, termed as mesh (grid) assembly [31–34, 45], has to be employed before starting numerical calculations. This technique has reduced the difficulty of generating meshes for complex geometries in engineering areas because an independent mesh system with high mesh quality can be designed for a component without considering other components. However, it adds the complexity of conducting parallel calculations (via MPI [27], for example) on this kind of mesh system. As mesh blocks are distributed on separated processors where data can not be shared directly, more detailed work should be done on data exchange among mesh blocks. There are two types of data communication for overlapped mesh system when calculations are conducted on distributed computers. One is the data exchange on the shared interfaces where one block connects to another, and the other is the interpolation data exchange where one block overlaps with other blocks. And the discussion of both types of communication is going to be covered in this work.
4. Implementation
4.1. Calculation Procedure
The procedure of solving equation (1) depends mainly on the mesh technique employed. In the overlapped mesh system, the steps of the calculation are organized as in Listing 1. As described in Section 3.3, performing mesh assembly is a necessary step (Listing 1, line 1) before conducting numerical computing on an overlapped mesh system. Mesh assembly identifies the cells that need to receive interpolation data (CRIs) as well as the cells that provide interpolation data (CPIs) for CRIs in overlapped regions, and it creates a map that records where the data associated with each block should be sent to or received from. This data map stays unchanged during the steady-state procedure and is used for data exchange (Listing 1, line 8) within each iteration. Then, the mesh blocks that connect with other blocks need to share the solutions at their common interfaces. When the data from blocks that provide interpolation and from neighbouring blocks are ready, a loop (Listing 1, lines 10–13) is launched to compute fluxes and update solutions on each block one by one.
Listing 1: Procedure of solving equation (1) on overlapped mesh system.
Conduct mesh assembly to obtain interpolation information over the mesh system (MS)
Read flow configuration file and overlapped mesh data
Initialization
repeat
  for each iblock in MS do
    Apply boundary conditions to iblock
  end for
  $W_s^{\mathrm{CRI}} = f(W_t^{\mathrm{CPI}})$: exchange interpolated data between overlapped block pair $(s, t)$, where CRI: cells receiving interpolation data and CPI: cells providing interpolation data
  $W_i^{\mathrm{ghost\,cells}} \leftarrow W_j$ and $W_j^{\mathrm{ghost\,cells}} \leftarrow W_i$: exchange each one-to-one block data at block interface connecting block $i$ and block $j$
  for each iblock in MS do
    Calculate the convective fluxes $F_c$ in the i, j, and k directions on iblock
    AF time advancement and update solutions on iblock
  end for
until convergence
Output flow data
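The interpolation exchange in line 8 of Listing 1 can be pictured as applying a fixed data map produced by mesh assembly. The Python sketch below is one possible layout, given only for illustration: the weighted-sum form of the interpolation function $f$, the tuple-based map, and all names are assumptions and do not describe the solver's actual data structures.

```python
from typing import Dict, List, Tuple

# One donor contribution: (donor block id, donor cell index, interpolation weight).
Donor = Tuple[int, int, float]
# One map entry: (receiver block id, receiver cell index, list of donors).
MapEntry = Tuple[int, int, List[Donor]]


def exchange_interpolated(solution: Dict[int, List[float]],
                          data_map: List[MapEntry]) -> None:
    """Overwrite each CRI with a weighted sum of its CPIs, in place."""
    # Gather every donor value first so the result does not depend on entry order.
    new_values = []
    for _, _, donors in data_map:
        new_values.append(sum(wgt * solution[blk][cell] for blk, cell, wgt in donors))
    for (blk, cell, _), value in zip(data_map, new_values):
        solution[blk][cell] = value


# Two blocks with three cells each; the map is built once and reused every cycle.
solution = {0: [1.0, 2.0, 3.0], 1: [10.0, 20.0, 30.0]}
data_map = [
    (0, 2, [(1, 0, 0.5), (1, 1, 0.5)]),    # block 0, cell 2 receives from block 1
    (1, 0, [(0, 0, 0.25), (0, 1, 0.75)]),  # block 1, cell 0 receives from block 0
]
exchange_interpolated(solution, data_map)
```

Because the map stays unchanged during a steady-state run, only the donor values need to move between blocks in each cycle.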
Figure 1 illustrates how we exploit multithreading to perform all the operations in Listing 1 on a single computing node with multiple MIC coprocessors. Even though more than one MIC coprocessor can be installed on the same node, the coprocessors are unable to communicate with each other directly. They must communicate through the host node, and it is expensive to move data between MIC devices and the host node because of the limited PCI-E bandwidth. Since two types of communication occur in every single iteration, a data transfer strategy is needed in order to achieve better performance on MIC coprocessors.
Figure 1: Flow chart of calculations on multiple MIC coprocessors.
Our first consideration is to place each mesh block cluster (MBC) that is associated with a component or entity on a specific MIC coprocessor. This takes advantage of a natural feature of the overlapped mesh system (Section 3.3) employed in this work. Since an MBC consists of one-to-one matched blocks only, no interpolation is needed among the mesh blocks it contains. This way of distributing the computational workload to MIC coprocessors therefore avoids one-to-one block data exchange across MIC coprocessors, which reduces the frequency of host-device communication. However, the data transfer from one block to all the blocks it overlaps across MIC devices is inevitable. For data transfer over overlapped regions across different MIC devices, we propose a communication optimization algorithm, which is introduced and discussed in Section 4.4.
As shown in Figure 1, a group of OpenMP threads is created on the host node, and each thread is associated with a MIC coprocessor and is responsible for the data movement and computation of at least one MBC. The physical boundary conditions (lines 5–7), the data exchange on one-to-one block interfaces (line 9) over an MBC, and the most time-consuming part (lines 10–13) in Listing 1 are executed on the corresponding MIC device. For the steady calculations in this work, the host calls the mesh assembly process [46, 47] only once to identify the CRIs and CPIs for each mesh block and to prepare the data structure in overlapped regions. At the end of each steady cycle, the updated solutions of the CPIs are used to calculate the interpolation data, which are then copied to the host memory. The host collects the data, redistributes them for the CRIs, and copies the new interpolated values of the CRIs back into the MIC memory. When the interpolated data are ready, a cycle of spatial and temporal calculations can be performed on the mesh blocks one by one without any data dependence.
4.2. Memory Arrangement
A MIC coprocessor has its own memory space, and independent memory allocation and arrangement have to be performed for the computations that are offloaded to a MIC device. More importantly, manipulating memory frequently and dynamically might have a negative effect on the overall efficiency, and it is the programmer’s responsibility to control and optimize the memory usage.
In the computational procedure of solving equation (1), the core calculations, including boundary conditions, spatial discretizations, and advancements in time, only update the values in arrays that are shared by different subroutines during the repeated cycles. Therefore, the whole procedure conducts calculations on allocated variables during the cycle and then outputs them at the end of the process. This fact inspires us to use !dec$ offload begin target(mic:i) and !dec$ end offload to enclose all the variables that need to be allocated on MIC or copied from the host before the cycle starts. The clause in(varname:length(len)) is used with the alloc_if(.true.) free_if(.false.) options to allocate memory for a variable and initialize it with the values on the host by memory copy. However, many variables are used as temporary space and do not have to be initialized. In this case, we use the nocopy clause, nocopy(tmp:length(len_tmp) alloc_if(.true.) free_if(.false.)), to avoid extra memory copies between the host and a MIC coprocessor. When all the calculations are completed, an out clause is declared with the alloc_if(.false.) free_if(.true.) options and placed between the offload begin and end offload directives to copy the newest conservative flow variables to the host memory and free the variables in the device memory. The memory arrangement described above is illustrated in Listing 2.
Listing 2: Memory arrangement on MIC.
!dec$ offload begin target(mic:n) in(mbc_n:length(len) alloc_if(.true.) free_if(.false.)) …
!dec$ offload begin target(mic:n) nocopy(mbc_n:length(len) alloc_if(.false.) free_if(.false.))
!dec$ end offload
Inside the cycle, many MIC kernels are launched by the host, and the nocopy keyword is used again to keep the data declared out of the cycle persistent across offloading procedure. For example, as shown in Listing 2, Fluxj_mic is declared as a function that can be called on MIC directly, and it reuses the array mbc_n which is declared and allocated outside the cycle without any extra memory operations during the cycle. And mbc_n can either be copied out to host memory or just deallocated on the coprocessor at the end point of the cycle. This concept of memory arrangement, which has been applied in other applications [6, 7], prevents frequent data copies inside the cycle from affecting the overall efficiency.
4.3. Vectorization
Vectorization is an effective way to improve computing speed on the CPU by using Streaming SIMD (single instruction, multiple data) Extensions (SSE). Similarly, MIC coprocessors support 512-bit wide Knights Corner instructions, which allow 16 single-precision or 8 double-precision floating point operations at the same time. In this section, we explore vector optimization to make full use of the MIC computational resources.
Generally, there are two different ways to implement the flux calculations. The first method is to perform a nested loop over all mesh cells (or mesh nodes). It calculates all the fluxes that go in or out of a cell’s interfaces and accumulates them for the cell. Vector operations can be applied to the innermost loop by using Intel autovectorization or manual SIMD directives. However, this algorithm involves redundant computations, because the flux of an interface that is shared by two cells is computed twice. The literature [18] has shown that redundant computing is not harmful to a GPU implementation because it can hide the latency of global memory accesses on GPUs. An alternative way of flux computation, which is more popular for CPU computing and is employed in this work, is to evaluate the flux of every edge only once by performing three nested loops along the three directions. The fluxes of the edges are then accumulated into the cells that contain them. However, there are still different coding strategies for the latter technique. We investigate two of them and compare their effects on vectorization.
One of the coding strategies (CS-1) is to conduct the flux computation along each direction separately. For example, each thread is responsible for visiting a piece of the solution array along the i direction, denoted as $w(j, k, 1{:}id, n)$, when $F_i$ is considered. As discontinuous memory accesses occur when higher dimensions of $w$ are accessed, loading the piece of $w$ into a one-dimensional temporary array at the beginning is an effective way to reduce cache misses in the subsequent process. The remaining flux calculations can then operate entirely on the temporary array, and either of two directives, !dir$ ivdep or !dir$ simd, can be declared to vectorize all the loops over the temporary array. Figure 2(a) shows the pseudocode of this strategy. The advantage of this approach is that the code of the flux computation along different directions stays almost the same, which reduces the workload of redevelopment. Another method (CS-2) performs the $F_j$, $F_k$, and $F_i$ fluxes along the lowest dimension of $w$. Fluxes along different directions are computed by nested loops with the same innermost loop to guarantee that memory accesses remain contiguous no matter which direction is under consideration. However, the loop bodies vary from one direction to another. In this case, vectorization directives are either applied to the innermost (lowest dimension) loop while the two outer nested loops are unpacked manually to increase the parallelism [7], or applied to the two inner loops after merging them into one loop to make full use of the VPU on MIC. The method merging the two inner loops is employed in the present paper, and the pseudocode is shown in Figure 2(b). It should be noted that the Knights Corner instructions [38], which can be applied to perform vectorization more elaborately, are not used in this work, because this assembly-level optimization may result in poor code readability and increase the overhead of code maintenance.
Figure 2: Two different vectorization implementations.
Except for the content of the code inside the loops, the process of time advancement has a nested-loop structure similar to that of the flux computation. Therefore, we merge the two inner loops to provide a larger data set for vectorization on MIC.
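The effect of merging the two inner loops can be illustrated with plain index arithmetic. The Python sketch below differences a flat, column-major array along its slowest dimension, once with three nested loops and once with the two inner loops merged into a single unit-stride loop. Python does not vectorize these loops; the point is only that the merged loop has a trip count of n1*n2 instead of n1 while touching the same contiguous memory. The array layout, sizes, and function names are assumptions made for illustration.

```python
def idx(a: int, b: int, c: int, n1: int, n2: int) -> int:
    """Column-major (Fortran-order) offset of element (a, b, c); a is contiguous."""
    return a + n1 * (b + n2 * c)


def diff_nested(w, n1, n2, n3):
    """Difference along the slowest dimension using three nested loops."""
    out = [0.0] * (n1 * n2 * (n3 - 1))
    for c in range(n3 - 1):
        for b in range(n2):
            for a in range(n1):  # innermost trip count is only n1
                out[idx(a, b, c, n1, n2)] = (w[idx(a, b, c + 1, n1, n2)]
                                             - w[idx(a, b, c, n1, n2)])
    return out


def diff_merged(w, n1, n2, n3):
    """Same result with the two inner loops merged into one contiguous loop."""
    plane = n1 * n2  # merged trip count per outer iteration
    out = [0.0] * (plane * (n3 - 1))
    for c in range(n3 - 1):
        base = c * plane
        for m in range(plane):  # unit stride over a whole (a, b) plane
            out[base + m] = w[base + plane + m] - w[base + m]
    return out


n1, n2, n3 = 4, 3, 5
w = [float((7 * p) % 11) for p in range(n1 * n2 * n3)]
nested_result = diff_nested(w, n1, n2, n3)
merged_result = diff_merged(w, n1, n2, n3)
```

A longer unit-stride loop gives the vectorizer more full vector iterations and fewer remainder iterations, which matters when n1 alone is not a multiple of the vector length.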
4.4. Host-Device Communication Optimization
As described in Section 4.1, our task allocation strategy is to distribute workload by MBCs, which avoids the data exchange on one-to-one interfaces across MIC coprocessors but inevitably leads to communication across MIC coprocessors because of interpolation among MBCs.
Although workloads located on different MIC devices are performed in parallel, workloads on the same coprocessor are still dispatched one by one. As an MBC corresponds to a specific MIC device, the mesh blocks in the MBC are calculated and updated one by one. Therefore, a mesh block that contains CPIs does not have to wait until the computation of the other mesh blocks has completed before it copies data to the host. More specifically, once the solution of a mesh block has been updated during a block loop (Listing 1, line 10), it can immediately start a communication request to copy the interpolation data provided by its CPIs to the host. The overhead of copying data from a mesh block to the host can then be hidden by the flow computation of other mesh blocks on the same device. To implement this idea, we take advantage of the asynchronous transfer of the offload mode. Because an overlapped mesh that consists of 2 MBCs is the most common case, we present the pseudocode of our communication optimization algorithm in Algorithm 1 using 2 MBCs and 2 MICs.
Outside the main procedure of numerical calculations, a loop over all MIC devices is performed. In order to make each device responsible for a part of the whole computation, we use OpenMP directive omp parallel do before the loop to start a number of threads, each of which maps to a device. According to the memory arrangement discussed in Section 4.2, memory allocation and initialization have been done on devices before the computation starts. Therefore, by using this thread-device mapping, each device can access its own device memory, start computing, and use shared host memory for necessary data exchange. At the end point of time advancement, the clause offload_transfer is used with the signal option to put a request of data transfer from the device to the host, and it returns immediately and allows the device to keep running the next MIC kernels. So, the data transfer and the block computing can be conducted at the same time. Once all the MIC kernels in a steady step have finished, the CPU is required to check the signals by using offload_wait to make sure all the transfers have been done before executing the process of exchanging interpolation data. Then, the CPU uses the master thread to distribute the interpolation data from CPIs to the corresponding positions of CRIs. Vectorization can be employed in this part because the operations involve assignment operations only. After that, the CPU starts data transfer requests to copy data back to each device respectively. At the same time, the next cycle begins, and the MIC kernel of set_boundary_condition is expected to run on MIC overlapping with the overhead of transferring the interpolation data back to MIC devices. In some cases, the last mesh block residing on a MIC device may have the largest number of CPIs, and it is difficult to overlap the data transfer from device to the host with computation because there are no more blocks that need to be computed. 
As the overlapped region involves only some of the mesh blocks, we sort the mesh blocks in descending order of their number of CPIs, so that the blocks with the most data to transfer are computed first and their transfers can be overlapped with the computation of the remaining blocks.
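The overlap of transfers with computation and the block ordering can be mimicked on the host with a background worker. In the Python sketch below, submitting a task plays the role of an asynchronous transfer request tagged with a signal, and waiting on the returned future plays the role of the subsequent wait. This is an analogy only, not Intel offload code: the sleep-based stand-ins for computation and copying, the block names, and the CPI counts are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def order_blocks(cpi_counts):
    """Return block ids sorted by descending number of CPIs."""
    return sorted(cpi_counts, key=lambda blk: cpi_counts[blk], reverse=True)


def run_cycle(cpi_counts):
    """Update blocks one by one and start each block's copy right after its update."""
    host_buffer = {}

    def transfer(blk):
        # Stand-in for the device-to-host copy of interpolation data.
        time.sleep(0.001 * cpi_counts[blk])
        host_buffer[blk] = cpi_counts[blk]

    def compute(blk):
        # Stand-in for flux evaluation and time advancement on one block.
        time.sleep(0.002)

    signals = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        for blk in order_blocks(cpi_counts):
            compute(blk)
            # The request returns immediately; the next block is computed meanwhile.
            signals.append(copier.submit(transfer, blk))
        for sig in signals:
            sig.result()  # wait until every transfer has completed
    return host_buffer


cpi_counts = {"A": 3, "B": 8, "C": 0, "D": 5}  # hypothetical CPI counts per block
order = order_blocks(cpi_counts)
host_buffer = run_cycle(cpi_counts)
```

With this ordering, the block processed last has the fewest CPIs, so the transfer that cannot be hidden behind further computation is also the smallest one.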
5. Experiments
5.1. Testing Environment
Our experiments were conducted on the YUAN cluster at the Computer Network Information Center of the Chinese Academy of Sciences. The cluster has a hybrid architecture that consists of both MIC and GPU nodes. Each MIC node has two Intel E5-2680 V2 (Ivy Bridge, 2.8 GHz, 10 cores) CPUs and two Intel Xeon Phi 5110P MIC coprocessors. The memory capacity of the host and of each coprocessor is 64 GB and 8 GB, respectively. Two environment variables, MIC_OMP_NUM_THREADS and MIC_KMP_AFFINITY, were set to investigate the performance under different numbers of OpenMP threads and different affinity types. The Intel Fortran compiler (version 2013_sp1.0.080) was used with the highest optimization level -O3 and the -openmp option to compile the solver.
5.2. Results
We used two ONERA M6 wings (shown in Figure 3), each of which was configured with four 129×113×105 subblocks. The lower wing and its mesh system were formed by translating the upper wing down along the Y-axis by the length of the wing, so that the two mesh systems overlapped with each other. Figure 4 shows a closer view of the overlapped mesh system at the symmetry plane before the mesh assembly. Figure 5 shows a closer view of the mesh system after the mesh assembly was performed. The mesh assembly procedure automatically formed an optimized region, shown between the upper and lower wings, where the interpolation relationship between the two mesh systems can be identified. The flow conditions for this case are as follows: Mach number $M = 0.84$, angle of attack $\alpha = 3.06°$, and slip angle $\beta = 0°$. As the host node has two Intel MIC coprocessors, we made each MIC coprocessor responsible for the workload of one mesh system. The results presented below start with the performance of six kernels (three spatial and three temporal) running on MIC.
Figure 3: Two M6 wings.
Figure 4: Closer view of two overlapped mesh systems before mesh assembly.
Figure 5: Closer view of two overlapped mesh systems after mesh assembly.
Tables 1–3 show the wall clock times for one cycle obtained with different combinations of thread counts and thread affinity types, together with those obtained on CPUs. The wall clock time for one cycle was evaluated by averaging the times of the first 100 cycles, and the second coding strategy (CS-2) for the vectorization of the flux computation was employed in this test. In each table, the first column lists the number of threads from 4 to 118 and is followed horizontally by two three-column blocks. Each block corresponds to a specified MIC kernel and lists the wall clock times under the different thread affinity types. A column in such a block gives the wall clock times on MIC as the number of threads varies under a given thread affinity type. In order to observe the scalability of the MIC kernels as the number of threads increases, we report the relative speedup $t_4/t_p$, where $t_4$ is the wall clock time measured with 4 threads and $t_p$ is that measured with $p$ threads. The relative speedups are given in parentheses to the right of the wall clock times. Furthermore, to compare the performance on the MIC architecture with that on CPUs, we also show the corresponding fully vectorized CPU time for each kernel in the last row of each table and calculate the absolute speedup by dividing the CPU time by the minimum of the three values obtained with 118 threads under the different affinity types. The results in all three tables show that the wall clock times obtained in scatter and balanced modes have an advantage over those obtained in compact mode. This can be explained by the fact that the compact mode places adjacent threads on the same physical core as much as possible in order to maximize the utilization of the L2 cache when adjacent threads need to share data with each other, but this can result in load imbalance among physical cores.
Our implementations have made each thread load independent data into a temporary array and operate the array locally, which is more suitable for both scatter and balanced modes because the local operations in each thread make better use of L1 cache on a processor without any intervention from other threads.
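The two speedup definitions can be checked directly against the reported numbers. The Python sketch below recomputes the relative speedup $t_4/t_p$ for $F_j$ in scatter mode at 118 threads and the absolute speedup for $F_j$, using the wall clock times listed in Table 1; the function names are illustrative.

```python
def relative_speedup(t4: float, tp: float) -> float:
    """Scaling relative to the 4-thread run under the same affinity type."""
    return t4 / tp


def absolute_speedup(cpu_time: float, mic_times_118) -> float:
    """CPU time divided by the best 118-thread MIC time over all affinity types."""
    return cpu_time / min(mic_times_118)


# Fj, scatter mode: 1.80 s with 4 threads and 0.144 s with 118 threads.
fj_relative = round(relative_speedup(1.80, 0.144), 1)
# Fj: CPU time 0.52 s; 118-thread times for scatter, balanced, and compact.
fj_absolute = round(absolute_speedup(0.52, [0.144, 0.136, 0.160]), 2)
print(fj_relative, fj_absolute)
```

Note that the absolute speedup uses the best affinity type (balanced, 0.136 s, for this kernel), so it is not tied to a single column of the table.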
Table 1: Wall clock times for $F_j$ and $F_k$ (in seconds).
| NT | $F_j$, scatter | $F_j$, balanced | $F_j$, compact | $F_k$, scatter | $F_k$, balanced | $F_k$, compact |
|---|---|---|---|---|---|---|
| 4 | 1.80 (1.0x) | 1.81 (1.0x) | 3.256 (1.0x) | 0.680 (1.0x) | 0.680 (1.0x) | 1.668 (1.0x) |
| 8 | 0.932 (1.93x) | 0.935 (1.94x) | 1.680 (1.94x) | 0.352 (1.93x) | 0.352 (1.93x) | 0.856 (1.95x) |
| 16 | 0.520 (3.46x) | 0.519 (3.49x) | 0.880 (3.7x) | 0.192 (3.54x) | 0.194 (3.51x) | 0.468 (3.56x) |
| 32 | 0.288 (6.25x) | 0.288 (6.28x) | 0.488 (6.67x) | 0.160 (4.25x) | 0.161 (4.22x) | 0.28 (5.96x) |
| 59 | 0.196 (9.18x) | 0.196 (9.23x) | 0.296 (11.0x) | 0.144 (4.72x) | 0.144 (4.72x) | 0.224 (7.45x) |
| 118 | 0.144 (12.5x) | 0.136 (13.3x) | 0.160 (20.35x) | 0.148 (4.59x) | 0.148 (4.59x) | 0.196 (8.51x) |
| CPU time | 0.52 (3.82x) | | | 0.54 (3.64x) | | |
NT: number of threads. The CPU time is given once per kernel.
Table 2: Wall clock times for $F_i$ and $TA_j$ (in seconds).
| NT | $F_i$, scatter | $F_i$, balanced | $F_i$, compact | $TA_j$, scatter | $TA_j$, balanced | $TA_j$, compact |
|---|---|---|---|---|---|---|
| 4 | 1.22 (1.0x) | 1.22 (1.0x) | 2.288 (1.0x) | 2.132 (1.0x) | 2.132 (1.0x) | 6.80 (1.0x) |
| 8 | 0.636 (1.91x) | 0.636 (1.91x) | 1.196 (1.91x) | 1.160 (1.83x) | 1.161 (1.84x) | 3.44 (1.98x) |
| 16 | 0.352 (3.47x) | 0.353 (3.46x) | 0.660 (3.47x) | 0.632 (3.37x) | 0.630 (3.38x) | 1.832 (3.71x) |
| 32 | 0.268 (4.55x) | 0.266 (4.59x) | 0.432 (5.29x) | 0.360 (5.92x) | 0.362 (5.89x) | 0.960 (7.08x) |
| 59 | 0.232 (5.26x) | 0.231 (5.28x) | 0.272 (8.41x) | 0.296 (7.2x) | 0.296 (7.2x) | 0.684 (9.94x) |
| 118 | 0.216 (5.65x) | 0.212 (5.75x) | 0.260 (8.8x) | 0.296 (7.2x) | 0.288 (7.4x) | 0.48 (14.17x) |
| CPU time | 0.676 (3.19x) | | | 0.72 (2.5x) | | |
NT: number of threads. The CPU time is given once per kernel.
Table 3: Wall clock times for $TA_k$ and $TA_i$ (in seconds).
| NT | $TA_k$, scatter | $TA_k$, balanced | $TA_k$, compact | $TA_i$, scatter | $TA_i$, balanced | $TA_i$, compact |
|---|---|---|---|---|---|---|
| 4 | 0.988 (1.0x) | 0.988 (1.0x) | 2.424 (1.0x) | 1.404 (1.0x) | 1.404 (1.0x) | 2.736 (1.0x) |
| 8 | 0.508 (1.94x) | 0.508 (1.94x) | 1.256 (1.93x) | 0.716 (1.96x) | 0.714 (1.97x) | 1.436 (1.91x) |
| 16 | 0.280 (3.53x) | 0.282 (3.50x) | 0.664 (3.65x) | 0.408 (3.44x) | 0.407 (3.45x) | 0.804 (3.4x) |
| 32 | 0.164 (6.02x) | 0.166 (5.95x) | 0.368 (6.59x) | 0.260 (5.4x) | 0.264 (5.32x) | 0.464 (5.90x) |
| 59 | 0.140 (7.06x) | 0.139 (7.11x) | 0.232 (10.4x) | 0.200 (7.02x) | 0.202 (6.95x) | 0.283 (9.67x) |
| 118 | 0.156 (6.33x) | 0.152 (6.5x) | 0.196 (12.4x) | 0.200 (7.02x) | 0.199 (7.06x) | 0.228 (12.0x) |
| CPU time | 0.56 (3.68x) | | | 0.572 (2.87x) | | |
NT: number of threads. The CPU time is given once per kernel.
Because the balanced mode distributes threads to cores in the same way as the scatter mode when NT ≤ 59, the results obtained with these two modes are expected to be the same. The slight differences between them in Tables 1–3 come from timing variations between runs. When NT = 118, each core hosted two threads in both the balanced and scatter modes, but the placements differed: the balanced mode kept two adjacent threads on the same core, whereas the scatter mode placed them on different cores. Nevertheless, the results obtained with the two modes showed no obvious gap in this case. This might be caused by cache contention among threads on a single MIC core when it hosts more than one thread.
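The placement behaviour described above can be illustrated with a simplified model. The sketch below is an assumption-laden approximation of the three affinity types, not the OpenMP runtime's actual algorithm: it assumes 59 usable cores with 4 hardware threads each, and `place_threads` and `cores_used` are hypothetical helper names.

```python
def place_threads(num_threads, mode, cores=59, hw_threads_per_core=4):
    """Return a list whose entry t is the core index assigned to thread t."""
    if num_threads > cores * hw_threads_per_core:
        raise ValueError("more threads than hardware thread contexts")
    if mode == "compact":
        # Fill all hardware threads of a core before moving to the next core.
        return [t // hw_threads_per_core for t in range(num_threads)]
    if mode == "scatter":
        # Deal threads out round-robin, one core at a time.
        return [t % cores for t in range(num_threads)]
    if mode == "balanced":
        # Spread threads evenly, keeping consecutive thread ids on the same core.
        base, extra = divmod(num_threads, cores)
        placement = []
        for core in range(cores):
            count = base + (1 if core < extra else 0)
            placement.extend([core] * count)
        return placement
    raise ValueError("unknown mode: " + mode)


def cores_used(placement):
    """Number of distinct cores that host at least one thread."""
    return len(set(placement))


for mode in ("scatter", "balanced", "compact"):
    print(mode, cores_used(place_threads(118, mode)), place_threads(118, mode)[:4])
```

Under this model, scatter and balanced give identical placements for up to 59 threads, while compact packs 4 threads onto a single core, which is consistent with the roughly doubled compact-mode times at small thread counts in the tables.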
Except for F_{j}, the wall clock times of the other five MIC kernels no longer decreased effectively when more than 59 threads were used in the balanced and scatter modes. This can be explained by the fact that the OpenMP directives in these kernels were applied to the outermost loop, where the maximum number of workload pieces equals 128 (or 112 or 104); it was therefore difficult to balance the workload among threads when 118 threads were launched. For F_{j}, we vectorized along the innermost direction and, by slightly modifying the code, manually merged the two nested loops into one loop to increase the parallelism. As a result, F_{j} scaled better than F_{k} and F_{i} as the number of threads increased, whereas F_{k} and F_{i} have an advantage in wall clock time because they maximize the use of the VPU on the MIC, at the expense of scalability.
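The load-balance argument can be made concrete with a small estimate. The sketch below assumes an even static split of equal-cost workload pieces among threads, and it pairs the extents 128 and 112 for the merged loop purely for illustration; the actual loop extents and scheduling of each kernel are not stated here.

```python
import math


def static_efficiency(num_pieces, num_threads):
    """Fraction of thread-time doing useful work under an even static split.

    The busiest thread receives ceil(num_pieces / num_threads) pieces, and
    every other thread waits for it at the end of the parallel loop.
    """
    max_load = math.ceil(num_pieces / num_threads)
    return num_pieces / (num_threads * max_load)


outer_only = 128       # pieces when only the outermost loop is parallelized
merged = 128 * 112     # pieces after merging two loops (hypothetical extents)

for threads in (59, 118):
    print(threads,
          round(static_efficiency(outer_only, threads), 3),
          round(static_efficiency(merged, threads), 3))
```

With 128 pieces and 118 threads, ten threads receive two pieces while the rest receive one, so nearly half of the thread-time is idle under this model; merging the loops yields enough pieces that the remainder becomes negligible.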
A speedup of 2.5x was obtained on the MIC coprocessor for TA_{j}, which is the lowest speedup among the three TA kernels. This was mainly caused by cache misses and poor vectorization when each thread loaded the j−k plane of the solution array w_{j,k,i,n} into the temporary space and reshaped it along the k direction. The same reason explains the results shown in Figure 6, which compares two code strategies (CS) for the flux computation of F_{k} and F_{i} in balanced mode. The wall clock times for the four kernels were evaluated by performing the first 100 steps of the solver. CS-1 performed 10 more discontinuous array loads than CS-2, which made vectorization far from efficient on the MIC coprocessor. Although the discontinuous memory accesses were forced to vectorize with the simd clause, they turned out to be less effective on the MIC coprocessor. We can also observe clear efficiency drops in CS-1 when more than 16 threads were used, because CS-1 applies 2D vectorization, which makes it hard for the MIC to keep the workload balanced among threads.
Figure 6: Comparisons among different implementations in vectorization.
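To illustrate why loading a j−k plane and reshaping it along k involves discontinuous memory accesses, the following sketch computes the offsets touched when walking along k. It assumes that the first index of w_{j,k,i,n} varies fastest in memory, which is an assumption about the layout rather than a documented property of the solver, and all function names are hypothetical.

```python
def flat_index(j, k, i, n, nj, nk, ni):
    """Offset of w[j, k, i, n] when the first index varies fastest (assumed)."""
    return j + nj * (k + nk * (i + ni * n))


def k_line_offsets(j, i, n, nj, nk, ni):
    """Offsets touched when walking along k directly in the 4D array."""
    return [flat_index(j, k, i, n, nj, nk, ni) for k in range(nk)]


def gather_plane_k_fastest(w, i, n, nj, nk, ni):
    """Copy one j-k plane into a temporary list stored with k varying fastest."""
    return [w[flat_index(j, k, i, n, nj, nk, ni)]
            for j in range(nj) for k in range(nk)]


nj, nk, ni, nn = 4, 3, 2, 1          # tiny illustrative extents
w = list(range(nj * nk * ni * nn))   # each entry stores its own offset

offsets = k_line_offsets(j=1, i=0, n=0, nj=nj, nk=nk, ni=ni)
strides = [b - a for a, b in zip(offsets, offsets[1:])]
tmp = gather_plane_k_fastest(w, i=0, n=0, nj=nj, nk=nk, ni=ni)
print(strides)          # stride between consecutive k entries in the original array
print(tmp[nk:2 * nk])   # the same k line, now contiguous in the temporary plane
```

Under this layout, consecutive k entries lie nj elements apart in the original array, so the gather itself is a strided load; only after the copy does the k line become contiguous and suitable for unit-stride vector operations.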
Next, we split each original block into 4 subblocks along both the i and j directions, forming a mesh system of 16 blocks for each wing, and report the wall clock times for 500 steps obtained on the CPU and on two MIC devices. Figure 7 compares the wall clock times for the different block sizes. Solving equation (1) on 8 overlapped mesh blocks with two Intel Xeon Phi 5110P MIC cards achieves a 5.9x speedup over the fully optimized sequential calculation, whereas solving the same problem on 32 overlapped mesh blocks with two MIC cards achieves only 3.6x. This is to be expected because, as stated above, threads are unlikely to stay busy with a balanced workload when the block dimensions are relatively small, and the six core functions achieve only about 2x speedup on a single MIC device in this case. Furthermore, the sequential calculation with 32 mesh blocks took about 10% more time than that with 8 mesh blocks, which can be explained by the fact that larger mesh blocks make better use of Intel SSE vectorization. To show the correctness and accuracy of the parallel solver on the MIC architecture, we plot the pressure contours at the symmetry plane in Figure 8 and compare the pressure coefficients on the airfoils calculated by the MICs with those calculated by the CPUs in Figure 9. Figure 9 shows clearly that our solver running on the MICs captures the shockwave expected on each airfoil for this transonic problem and produces pressure coefficients in agreement with those calculated by the CPUs.
Figure 7: Wall clock times for 500 steps on CPU and MIC.
Figure 8: Pressure contours at the symmetry plane.
Figure 9: Pressure coefficients comparisons on the airfoils at the symmetry plane.
6. Summary
We have developed and optimized a CFD solver that supports overlapped meshes on multiple MIC coprocessors. We demonstrated and compared different coding strategies for vectorization using a case with two M6 wings and analysed the results in detail. An asynchronous method was employed to keep the data exchange from interfering with the overall efficiency. Calculations on this case achieved a 5.9x speedup using two MIC devices compared with an Intel E5-2680 processor. Our future work includes extending the solver to unsteady flow calculations that involve relative motion among overlapped mesh blocks on multiple MIC devices.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (grant nos. 61702438 and 11502267), the Key Research Project of Institutions of Higher Education of Henan Province (no. 17B520034), and the Nanhu Scholar Program of XYNU.