Scalable Parallel Algorithm of Multiple-Relaxation-Time Lattice Boltzmann Method with Large Eddy Simulation on Multi-GPUs

The lattice Boltzmann method (LBM) has become an attractive and promising approach in computational fluid dynamics (CFD). In this paper, a parallel algorithm of the D3Q19 multiple-relaxation-time LBM with large eddy simulation (LES) is presented to simulate 3D flow past a sphere on multiple GPUs (graphics processing units). To handle complex boundaries, a judgement method for identifying boundary lattices is devised. The 3D domain decomposition method is applied to improve scalability on a cluster, and an overlapping mode is introduced to hide the communication time by dividing each subdomain into two parts: an inner part and an outer part. Numerical results show good agreement with the literature, and 12 Kepler K20M GPUs perform about 5100 million lattice updates per second, which indicates considerable scalability.


Introduction
Driven by the market demand for real-time, high-definition 3D graphics over large graphics data sets, the graphics processing unit (GPU) was originally developed for rendering tasks. Due to its high computational power and the emergence of the compute unified device architecture (CUDA) [1], the GPU has drawn much attention as an accelerator for nongraphics computing [2][3][4][5].
Large eddy simulation (LES) is a mathematical model for turbulence applied in CFD [13]. It has become an effective way to simulate physical flow. In order to capture the turbulent flow physics, Hou et al. combined SRT-LBM with the Smagorinsky model [14], and Krafczyk et al. incorporated the Smagorinsky model into MRT-LBM [15]. The improved MRT-LBM with LES still retains the inherent parallelism of the method. In this paper, a parallel algorithm of MRT-LBM-LES on multiple GPUs is studied.
Due to its algorithmic simplicity and suitability for parallel computing, numerous implementations of the LBM on homogeneous and heterogeneous clusters have been reported. Before the advent of the GPU, much research had already been carried out to obtain high performance and scalability with the LBM. Kandhai et al. [16] presented different kinds of domain decomposition methods and parallel strategies for the D2Q9 SRT-LBM. Velivelli and Bryden [17] examined cache optimization for the two-dimensional LBM in both serial and parallel implementations. Pan et al. [18] investigated five domain decomposition methods and the parallel efficiency of simulating single- and multiphase flow in porous medium systems using the D3Q15 LBM on various parallel computing platforms. Wu and Shao [19] simulated the lid-driven cavity flow using MRT-LBM compared with SRT-LBM in a parallel implementation. Schepke et al. [20] presented a blocked parallel implementation of the D2Q9 and D3Q19 SRT-LBM, dividing the computational domain into subdomains to reduce the MPI communication time. Since the GPU has many computational units and the localized communication mode of each lattice matches the data-parallel character of the GPU well, Tölke [21] used shared memory to propagate parts of the particle distribution functions (PDFs), avoiding costly misaligned access to the global memory on a single GPU, and Habich et al. [22] extended this strategy from D2Q9 to D3Q19 LBM. Owing to improvements in NVIDIA hardware, the use of shared memory is no longer necessary. Obrecht et al. [23] found that the cost of a misaligned access was nearly equal to that of an aligned global memory access; they presented two schemes, a split one and a revised one, without using shared memory. Zhou et al.
[24] simulated flows with curved boundaries, giving a detailed implementation for dealing with curved boundaries in the D2Q9 SRT-LBM. For large-scale simulations, a single GPU cannot meet the computational requirements, so schemes on multiple GPUs were presented. Xian and Takayuki [25] introduced an overlapping technique between computation and communication, hiding the communication time with two simultaneous streams based on 3D domain decomposition. Building on that work, Hong et al. [3] presented schemes to simulate 3D cavity flow based on the D3Q19 MRT-LBM and evaluated the performance of the various schemes. Mawson and Revell [26] examined and analysed previous optimization strategies; the implementation including misaligned accesses to the global memory was found to be more efficient. Ye et al. [27] proposed a hybrid CPU-GPU computing environment for the D3Q19 ELBM on a single-node multi-GPU platform; their method not only takes advantage of the GPU but also exploits the CPU's computing power through model-based optimal load balancing. Tran et al. developed a high-performance parallelization of the LBM on a GPU by reducing the overheads associated with uncoalesced memory accesses and improving the cache locality using tiling optimization with a data layout change [28].
The rest of the paper is organized as follows. First, a short overview of MRT-LBM with LES is given. In Section 3, the multi-GPU architecture is introduced. Next, the MRT-LBM with LES parallel strategy is presented in Section 4. Then the numerical experiments are shown in Section 5. Finally, the major conclusions are summarized in Section 6.

MRT-LBM with LES
The LBM, based on the lattice Boltzmann equation (LBE), can be described as [12]

f_α(x + e_α δ_t, t + δ_t) = f_α(x, t) + Ω_α(x, t),

where f_α(x, t) represents the PDFs at site x and time t, Ω_α denotes the collision operator, δ_t is the time step, and e_α expresses the particle velocity in the αth direction.
In the MRT model, Ω can be expressed as

Ω = −M⁻¹ S (m − m^eq),

where the transformation matrix M linearly transforms the PDFs f and the equilibrium distribution functions (EDFs) f^eq into the velocity moments m = M f and m^eq = M f^eq, respectively. S is the diagonal relaxation matrix, in which s₉ and s₁₃ satisfy

s₉ = s₁₃ = 1 / (3ν + 0.5),

and the remaining relaxation frequencies can be found in [29]. The EDF f_α^eq(x, t) is given by

f_α^eq = w_α ρ [1 + (e_α · u)/c_s² + (e_α · u)²/(2c_s⁴) − u²/(2c_s²)],

where c_s is the speed of sound, c_s = 1/√3, and the weighting factors w_α of the D3Q19 lattice are w₀ = 1/3, w_α = 1/18 for α = 1, …, 6, and w_α = 1/36 for α = 7, …, 18. The macroscopic density and velocity are obtained from the PDFs as

ρ = Σ_α f_α,  ρu = Σ_α e_α f_α.

Based on (1), the LBM can be expressed in two steps: collision,

f_α⁺(x, t) = f_α(x, t) − [M⁻¹ S (m − m^eq)]_α,

and propagation,

f_α(x + e_α δ_t, t + δ_t) = f_α⁺(x, t),
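As a concrete illustration, the EDF and the moment sums above can be sketched in Python. The D3Q19 velocity ordering below is one common convention and is an assumption of this sketch; the paper does not list its own ordering.

```python
# D3Q19 velocities and weights (assumed ordering; the paper's is not given).
E = ([(0, 0, 0)]
     + [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
     + [(1, 1, 0), (-1, -1, 0), (1, -1, 0), (-1, 1, 0),
        (1, 0, 1), (-1, 0, -1), (1, 0, -1), (-1, 0, 1),
        (0, 1, 1), (0, -1, -1), (0, 1, -1), (0, -1, 1)])
W = [1/3] + [1/18] * 6 + [1/36] * 12
CS2 = 1.0 / 3.0  # speed of sound squared, c_s = 1/sqrt(3)

def feq(rho, u):
    """Equilibrium distributions f_alpha^eq for a single lattice node."""
    usq = sum(c * c for c in u)
    out = []
    for e, w in zip(E, W):
        eu = sum(a * b for a, b in zip(e, u))  # e_alpha . u
        out.append(w * rho * (1.0 + eu / CS2 + eu * eu / (2.0 * CS2 * CS2)
                              - usq / (2.0 * CS2)))
    return out

f = feq(1.0, (0.05, 0.02, 0.0))
# The moment sums recover the macroscopic fields exactly:
# sum_a f_a^eq = rho and sum_a e_a f_a^eq = rho * u.
```

The zeroth and first moments of the sketch reproduce the density and velocity relations stated above, which is a quick consistency check on any EDF implementation.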
where f_α⁺(x, t) expresses the postcollision state of the PDF. To simulate turbulent flow at high Reynolds numbers, Krafczyk et al. incorporated MRT-LBM with LES [15]. The total viscosity ν_total is the sum of the molecular viscosity ν and the turbulent eddy viscosity ν_t:

ν_total = ν + ν_t,  ν_t = (C_s Δ)² |S̄|,
where C_s is the Smagorinsky constant, set to 0.16 [15], Δ is the filter width (equal to the lattice spacing), and S̄_ij (i, j ∈ {x, y, z}) is the strain rate tensor in Cartesian coordinates; its magnitude |S̄| can be computed directly from the nonequilibrium moments. The corresponding relaxation frequencies are then modified in terms of (5), with ν replaced by ν_total.
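The LES modification of the relaxation frequency can be sketched as follows, assuming the relation s₉ = 1/(3ν + 0.5) stated above; the numeric inputs are illustrative, not values from the paper.

```python
CS_SMAG = 0.16  # Smagorinsky constant used in the paper
DELTA = 1.0     # filter width = lattice spacing (assumed unit spacing)

def s9_les(nu, strain_mag):
    """Relaxation frequency s9 with the eddy viscosity added to nu."""
    nu_t = (CS_SMAG * DELTA) ** 2 * strain_mag  # turbulent eddy viscosity
    return 1.0 / (3.0 * (nu + nu_t) + 0.5)

# With zero strain rate the laminar frequency is recovered; stronger local
# strain lowers the frequency, i.e. raises the effective viscosity.
```

Because the frequency is recomputed per node from the local nonequilibrium moments, this step stays embarrassingly parallel, which is why MRT-LBM-LES keeps the inherent parallelism noted in the introduction.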
In the present work, a nonequilibrium extrapolation method is used for the no-slip static straight wall [30], and a unified boundary treatment is applied to model the no-slip condition on curved boundaries [31]. In Figure 2, the solid circles are solid nodes, the hollow circles are fluid nodes, and a curved wall separates the solid nodes from the fluid nodes. The lattice node on the fluid side of the boundary is denoted x_f and the one on the solid side x_b; x_ff is the adjacent fluid node of x_f. The filled small rhombus on the boundary wall, x_w, denotes the intersection of the wall with a lattice link, and the boundary velocity at x_w is denoted u_w. The curved boundary scheme of [31] then reconstructs the unknown PDF f_ᾱ(x_f, t + δ_t), where ᾱ is the opposite direction of α, using the wall fraction

q = |x_f − x_w| / |x_f − x_b|,  0 ≤ q ≤ 1.

When dealing with a curved boundary, it is necessary to judge the lattice type (solid, fluid, or boundary lattice) and the value of q. Algorithm 1 shows the judgement of the lattice type and how q is obtained. The ray method is employed to generate the lattice type: a ray is cast from the lattice node, and when the node lies inside the geometry, the number of intersections with the surface is odd.
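A minimal sketch of the ray-method judgement for the sphere case follows. The analytic crossing test stands in for the ray-triangle intersections one would perform against a general surface grid file; names and sizes are illustrative.

```python
import math

def classify_lattice(n, center, radius):
    """Mark each node of an n^3 lattice as solid or fluid with the ray
    method: cast a ray along +x from the node and count crossings with
    the surface; an odd count means the node lies inside the geometry.
    For a sphere the crossings are found analytically; with a grid file
    one would count ray-triangle intersections instead."""
    cx, cy, cz = center
    solid = {}
    for j in range(n):
        for k in range(n):
            d2 = (j - cy) ** 2 + (k - cz) ** 2
            if d2 > radius ** 2:
                continue                     # the ray misses the sphere
            h = math.sqrt(radius ** 2 - d2)  # surface at x = cx -/+ h
            for i in range(n):
                crossings = sum(1 for x in (cx - h, cx + h) if x > i)
                if crossings % 2 == 1:
                    solid[(i, j, k)] = True
    return solid

solid = classify_lattice(40, (20.0, 20.0, 20.0), 8.0)
# Nodes absent from `solid` are fluid; boundary lattices are then the
# fluid nodes with at least one solid neighbour along a lattice link,
# and q = |x_f - x_w| / |x_f - x_b| is evaluated along each cut link.
```

The same crossing-parity rule carries over unchanged when the geometry comes from a triangulated surface, which is how Algorithm 1 generates the solid lattice from the grid file.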

Multi-GPUs Architecture
In the earliest days, computers consisted only of central processing units (CPUs). Over the past decade, however, many-core accelerators (GPUs and the Intel Xeon Phi) have become prevalent, and GPUs have grown ever more powerful and generalized for general-purpose parallel computing tasks with excellent performance and high efficiency [1,32]. The GPU architecture consists of a scalable array of Streaming Multiprocessors (SMX). Each SMX supports the concurrent execution of hundreds of threads, so thousands of threads can execute concurrently on a single GPU. CUDA employs the Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. The threads in a warp perform the same instruction at the same time, while each thread keeps its own instruction address counter and register state and executes the current instruction on its own data. In this paper, the K20M, based on the GK110 architecture, is tested. It contains 13 SMXs, 1.25 MB of L2 cache, 5 GB of memory, and six 64-bit memory controllers. Each SMX includes 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFUs), and 32 load/store (LD/ST) units. Each SMX also contains 4 warp schedulers and 8 instruction dispatchers, which can issue and execute 4 warps concurrently. The theoretical peak single-precision performance is 3.5 TFLOPS, and the memory bandwidth reaches 208 GB/s.
Algorithm 1: (1) Generate the solid lattice based on the grid file using the ray method. (2) Obtain the curved boundary lattice according to the solid lattice and (2).

Owing to the scale of the problems treated with the LBM, a single GPU cannot meet the computing-power and memory requirements, so the problem has to be solved on multiple GPUs. As shown in Figure 3, the CPU transfers data to and from the GPU device with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, and the CPU processes communicate with MPI_Isend and MPI_Irecv.

Domain Decomposition Method and Data Exchange.
Owing to the local nature of the LBM, it is a good choice to adopt a domain decomposition method that partitions the flow domain into subdomains. For three-dimensional flow simulation, there are 1D, 2D, and 3D (see Figure 4) domain decomposition methods [20], in which the fluid domain is divided along one, two, or three directions. It can be seen from [20, 25] that the 3D method is the best option when running on a cluster. In the 3D method, each MPI process deals with one part of the computational domain, and the process can be denoted by a triple (p_x, p_y, p_z) derived from the MPI rank r:

p_x = r mod P_x,  p_y = ⌊r / P_x⌋ mod P_y,  p_z = ⌊r / (P_x P_y)⌋,

where P_x and P_y denote the number of subdomains along x and y, respectively. The computational range along the x direction of process p_x is then

Beg_x = p_x · num_x / P_x,  End_x = (p_x + 1) · num_x / P_x,

where num_x is the lattice node count along the x direction, Beg_x is the start lattice node along x, and End_x is the end one. The ranges along the y and z directions are calculated similarly. Based on (9) and (10), the only communication occurs in the propagation step. After collision, each MPI process needs to exchange its interface data with the neighboring MPI processes. In 3D, each MPI process (p_x, p_y, p_z) has to send and receive the data of 6 surfaces (see Figure 5) and 12 edges (see Figure 6). When transferring the postcollision PDFs f_α⁺, there is no need to transfer all of them; only the components expected by the propagation step along the transfer direction are required. For example, only the data of f₁⁺, f₇⁺, f₉⁺, f₁₁⁺, and f₁₃⁺ are needed when transferring from MPI process (p_x, p_y, p_z) to (p_x + 1, p_y, p_z).
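The decomposition bookkeeping can be sketched as follows. The x-fastest rank ordering and the even distribution of leftover nodes are assumptions of this sketch; the paper does not spell out either convention.

```python
def rank_to_triple(rank, px, py):
    """Map an MPI rank to its (p_x, p_y, p_z) position in the process grid,
    assuming the x index varies fastest."""
    return rank % px, (rank // px) % py, rank // (px * py)

def subrange(coord, parts, num):
    """Half-open node range [beg, end) owned along one axis, distributing
    `num` lattice nodes as evenly as possible over `parts` processes."""
    base, extra = divmod(num, parts)
    beg = coord * base + min(coord, extra)
    end = beg + base + (1 if coord < extra else 0)
    return beg, end

# Example: 12 processes as a 3 x 2 x 2 grid over a 512-node x axis.
ranges = [subrange(c, 3, 512) for c in range(3)]
```

Applied per axis, the three ranges tile the full domain with no gaps or overlaps, which is exactly the property the interface exchange relies on.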

Data Structure.
Grouping threads into warps is relevant not only to computation but also to global memory access [32]. The GPU coalesces the global memory loads and stores of the threads of a warp to minimize DRAM bandwidth consumption. In a GPU implementation of the LBM, it is a simple and efficient strategy to assign one lattice node to one thread. Each thread needs to store f, f⁺, ρ, u, and the lattice type. Figures 7 and 8 illustrate the memory layouts of the Array of Structures (AoS) and the Structure of Arrays (SoA), respectively. Storing the PDFs in AoS format on the GPU and executing an operation that only requires f₀ would result in an 18/19 loss of bandwidth, as the other PDFs are implicitly loaded in each 32-byte segment or 128-byte cache line. Meanwhile, the AoS format also wastes L2 cache space on unneeded PDF values. Storing the data in SoA format makes full use of the GPU memory bandwidth by providing coalesced memory access. In order to obtain coalesced global memory access and efficient global memory utilization, it is better to choose the SoA layout rather than AoS.
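The difference between the two layouts comes down to the index function; a small sketch (the node count is an illustrative value):

```python
Q = 19  # PDFs per node in D3Q19

def aos_index(node, alpha):
    """Array of Structures: the 19 PDFs of one node sit side by side."""
    return node * Q + alpha

def soa_index(node, alpha, n_nodes):
    """Structure of Arrays: component alpha of all nodes is contiguous,
    so consecutive threads (consecutive nodes) touch consecutive
    addresses, which the GPU coalesces into few memory transactions."""
    return alpha * n_nodes + node

# Addresses touched by one 32-thread warp all reading f_0:
soa = [soa_index(t, 0, 1 << 20) for t in range(32)]  # unit stride -> coalesced
aos = [aos_index(t, 0) for t in range(32)]           # stride 19 -> uncoalesced
```

The SoA warp touches 32 consecutive elements, while the AoS warp scatters across 19-element strides, which is the 18/19 bandwidth loss described above.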

Framework of the GPU Implementation of MRT-LBM-LES.
There are two propagation schemes: out-of-place propagation (see Figure 9), in which the collision step is carried out before the propagation step, and in-place propagation (see Figure 10), in which the collision step is executed after the propagation step [23]. Hong et al. [3] compared the two schemes; out-of-place propagation, which we adopt, is more efficient than in-place propagation on multiple GPUs. According to [3], the reversed scheme can improve the performance greatly. The out-of-place scheme is illustrated in Algorithm 2.
Based on the out-of-place scheme, the framework of the GPU implementation of MRT-LBM with LES can be described as Algorithm 3 and Figure 11. Compared with [3], it is not necessary to execute the collision step on the lattices stored in the buffers when the MPI communication occurs after the collision step. In order to improve the parallel efficiency, the overlapping mode is applied to hide the communication time. The subdomain is divided into two parts: an inner part and an outer part [25]. The inner part does not need to be exchanged with other MPI processes, but the PDFs of the outer part of the subdomain are needed by the neighboring MPI processes after propagation. In our implementation, two streams are designed: one for the inner part and the other for the boundary part (see Figure 12). While the computation of the inner part is executing, the MPI communication of the boundary part is processed simultaneously.
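The inner/outer split behind the two-stream schedule can be sketched as follows; a one-lattice-thick outer shell is assumed here, while the actual thickness depends on the propagation stencil.

```python
def split_inner_outer(nx, ny, nz):
    """Split the local subdomain into the outer shell, whose postcollision
    PDFs must be exchanged with neighbours, and the inner block that one
    CUDA stream can process while another handles the exchange."""
    inner, outer = [], []
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                shell = i in (0, nx - 1) or j in (0, ny - 1) or k in (0, nz - 1)
                (outer if shell else inner).append((i, j, k))
    return inner, outer

inner, outer = split_inner_outer(6, 5, 4)
# Schedule per time step: launch the inner-part kernels on stream 0 while
# stream 1 packs the `outer` nodes, runs the MPI exchange, and unpacks.
```

Because the inner block dominates the node count for realistic subdomain sizes, its kernel time can fully cover the communication of the thin shell, which is what makes the overlap effective.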

Numerical Results and Discussion
The numerical experiments are carried out on a three-node GPU cluster; each node is equipped with four NVIDIA Kepler K20M GPUs and two Intel Xeon E5-2680 CPUs. The nodes are connected by a 56 Gb/s FDR InfiniBand network.

Validation of Flow Past a Sphere.

In this section, the simulation of 3D incompressible flow past a sphere with a constant inflow velocity u = u_∞ = (0.2, 0, 0) serves as a validation test. The sphere radius is taken as r = 20 lattice units. Figure 13 shows the flow geometry, the coordinate system, and the computational domain. The length of the computational domain is 25.6r, the width is 6.4r, and the height is 6.4r, so the lattice scale is 512 × 128 × 128.
Figure 14 shows the sphere surface grid used for the test. Figures 15 and 16 show the lattices around the sphere on slices through the domain. The drag force on the sphere, F_d, is computed with the momentum-exchange method [33], and the drag coefficient C_d [34] can be expressed as

C_d = F_d / (0.5 ρ u_∞² π r²).

Figure 17 shows C_d compared with the theoretical value. Figures 18 and 19 depict the vortical structure at Re = 1000 and Re = 3700, respectively.
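A sketch of the drag coefficient evaluation, assuming the standard definition for a sphere of radius r (the paper cites its formula from [34] without reproducing it); the force value below is illustrative, not a measurement.

```python
import math

def drag_coefficient(fd, rho, u_inf, r):
    """C_d = F_d / (0.5 * rho * u_inf^2 * pi * r^2), with pi*r^2 the
    projected frontal area of the sphere."""
    return fd / (0.5 * rho * u_inf ** 2 * math.pi * r ** 2)

# Illustrative call with the paper's inflow u_inf = 0.2 and radius r = 20;
# fd would come from the momentum-exchange sum over cut boundary links.
cd = drag_coefficient(12.0, 1.0, 0.2, 20.0)
```

In the simulation, F_d is accumulated each step from the momentum exchanged across every boundary link, then normalized once per output interval with this formula.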
Figures 20 and 21 show the strong scaling and the weak scaling, respectively. According to Figure 20, the ideal MLUPS is proportional to the number of GPUs, and the measured MLUPS of our algorithm is nearly equal to the ideal value. The weak scaling test in Figure 21 shows that the MLUPS per GPU remains almost the same when an equal grid size is assigned to each GPU, regardless of the number of GPUs. Thus the multi-GPU computation of MRT-LBM-LES has considerable strong and weak scalability for a fixed grid size of 512 × 128 × 128. A decomposition labeled P_x × P_y × P_z indicates that the x direction of the fluid domain is divided into P_x parts, the y direction into P_y parts, and the z direction into P_z parts. Comparing the overlapping and nonoverlapping modes, Figure 22 shows that the overlapping mode obtains a 66% improvement and better scalability when the present algorithm is executed on 12 GPUs.

Conclusion
In this paper, a scalable parallel algorithm of the D3Q19 MRT-LBM with LES on a platform of 12 NVIDIA Kepler K20M GPUs is presented to simulate three-dimensional flow past a sphere. The numerical results for flow past a sphere are compared with the literature and show good agreement. In order to simulate complex geometry, a method to judge the lattice type is also described. The 3D domain decomposition method and the overlapping mode are employed to improve the scalability on a heterogeneous cluster, obtaining a 66% improvement. Numerical results show that the present algorithm achieves 5100 MLUPS on 12 Kepler K20M GPUs, which indicates that it is efficient and scalable.

Figure 15: Lattices on a slice plane.

Parallel Performance.

The parallel performance is examined in terms of MLUPS (million lattice updates per second), obtained by

MLUPS = (lattice scale × number of iterations × 10⁻⁶) / total iterative time.
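In code, the metric is a one-line computation; the timing value below is illustrative, not a measurement from the paper.

```python
def mlups(lattice_scale, iterations, seconds):
    """Million lattice updates per second, as defined above."""
    return lattice_scale * iterations * 1e-6 / seconds

# For the 512 x 128 x 128 validation grid, one timed run of `iterations`
# steps plugs straight into the formula.
rate = mlups(512 * 128 * 128, 1000, 2.0)
```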

Figure 16: Lattices on a slice plane.

Figure 17: Drag coefficient compared with the theoretical value.

Figure 20: MLUPS of overlapping multi-GPUs with different domain decomposition methods.

Figure 22: Performance comparison of overlapping and nonoverlapping modes with different domain decomposition methods.