Direct Numerical Simulation and Large Eddy Simulation on a Turbulent Wall-Bounded Flow Using Lattice Boltzmann Method and Multiple GPUs

Direct numerical simulation (DNS) and large eddy simulation (LES) were performed on the wall-bounded flow at Re_τ = 180 using the lattice Boltzmann method (LBM) and multiple GPUs (Graphics Processing Units). In the DNS, 8 K20M GPUs were adopted. The maximum number of meshes is 6.7 × 10^7, which results in a nondimensional mesh size of Δ^+ = 1.41 for the whole solution domain. It took 24 hours for the GPU-LBM solver to simulate 3 × 10^6 LBM steps. The aspect ratio of the solution domain was tested to obtain accurate DNS results. Both the mean velocity and turbulent quantities, such as the Reynolds stress and velocity fluctuations, agree perfectly with the results of Kim et al. (1987) when the aspect ratios in the streamwise and spanwise directions are 8 and 2, respectively. As for the LES, the local grid refinement technique was tested and then used. Using 1.76 × 10^6 grids and a Smagorinsky constant C_s = 0.13, good results were obtained. The ability and validity of the LBM for simulating turbulent flow were verified.


Introduction
The lattice Boltzmann method has been regarded as a promising alternative for the simulation of fluid flows with complex physics over the past two decades. Unlike conventional numerical schemes based on discretizations of macroscopic continuum equations, the LBM is based on microscopic models and mesoscopic kinetic equations. It has many advantages, including easy implementation of boundary conditions, easy programming, and fully parallel algorithms [1]. As a result, the LBM has been applied in many fields, such as biofluids and porous media [2]. However, its feasibility and validity in both direct numerical simulation and large eddy simulation of turbulent flow still require further validation [3,4].
Fully developed turbulent wall-bounded flow is a classic problem: seemingly simple, yet complicated. DNS and LES of this flow, including the boundary layer flow, have become important tools for investigating the mechanism of turbulence. For a long time, the simple geometry has attracted many experimental and theoretical investigations of the complex turbulence interactions in the vicinity of walls. Among them, the direct numerical simulation by Kim et al. [5,6] is the most cited for comparison. Their DNS results at $\mathrm{Re}_\tau = u_\tau\delta/\nu = 180$ ($u_\tau$ and $\delta$ are the friction velocity and channel half-width, respectively) have been considered the benchmark for numerical simulations. Because of the large computational cost of DNS and LES, published results on wall-bounded flow remain rare in spite of the increasing capability of high performance computing (HPC) in the recent decade.
Nowadays, GPUs make large-scale DNS and LES for turbulence prediction feasible. The GPU on the graphics board of a computer was originally designed to process large graphics data sets quickly. Recently, with the emergence and development of computing platforms such as CUDA [7] and OpenCL, the use of GPUs to accelerate nongraphics computations has drawn more and more attention [8,9]. Due to its high floating-point performance, wide memory bandwidth, and improving programmability, the GPU has been further used for general-purpose computation, that is, GPGPU. Grid-based computational fluid dynamics (CFD) is one of the important applications. According to our experience, solving the incompressible Navier-Stokes equations with a marker and cell (MAC) solver on a single GPU is 30-40 times faster than heavily optimized CPU-based implementations, and GPU-based LBE calculations can obtain speedups of more than 100 [10-13]. Aoki's group [14] has been devoted to GPU high-performance computing since 2006. The team has performed various numerical simulations using hundreds or even thousands of GPUs, such as weather prediction, crystal growth processes, and air flow in the city, achieving performance at the PFLOPS level [15-18]. At present, GPU applications for engineering purposes in fields such as energy, environment, and aerospace have become hotspots.
In the present work, DNS and LES of the wall-bounded flow at $\mathrm{Re}_\tau = 180$ were performed using the D3Q19 model of the lattice Boltzmann method and multiple GPUs. There are two objectives. One is to present high performance computing with the LBM and multiple GPUs, which may provide a feasible way to predict turbulent flow. The other is to verify the validity and ability of the LBM for simulating turbulent flow.
This paper is organized as follows. In Section 2, the lattice Boltzmann equations (LBE) for DNS and LES with a forcing term are described. Section 3 introduces the model system of the simulation and the data transfer in CUDA-MPI. The results of DNS and LES of the fully developed turbulent wall-bounded flow are discussed in Sections 4 and 5, respectively. The summary and conclusions are presented in Section 6.

Lattice Boltzmann Equation (LBE)
Formulation for DNS and LES of Turbulence
In the standard LBGK scheme, the evolution of the particle distribution function is split into two steps.

Collision step:
$$f_i^{*}(\mathbf{x}, t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t)\right].$$

Streaming step:
$$f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) = f_i^{*}(\mathbf{x}, t),$$

where $f_i$ and $f_i^{*}$ denote the pre- and postcollision states of the distribution function, respectively.
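As an illustration of how these two steps map onto a GPU, the following is a minimal CUDA sketch of the BGK collision step for a D3Q19 lattice. It is not the authors' solver; the kernel name, the structure-of-arrays memory layout, and the single-precision types are our assumptions.

```cuda
#include <cuda_runtime.h>

#define Q 19                      // number of D3Q19 lattice directions

__constant__ float w[Q];          // lattice weights (filled from the host)
__constant__ int   e[Q][3];       // discrete velocities (filled from the host)

// One thread per lattice node: compute rho and u, then relax every
// distribution toward its equilibrium with relaxation time tau.
__global__ void collideBGK(float* f, int n, float tau)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;         // n = total number of lattice nodes

    // Macroscopic density and velocity from the 19 distributions.
    float rho = 0.f, ux = 0.f, uy = 0.f, uz = 0.f;
    for (int i = 0; i < Q; ++i) {
        float fi = f[i * n + idx];            // SoA layout: direction-major
        rho += fi;
        ux  += fi * e[i][0];
        uy  += fi * e[i][1];
        uz  += fi * e[i][2];
    }
    ux /= rho; uy /= rho; uz /= rho;

    // BGK relaxation: f_i <- f_i - (f_i - f_i^eq)/tau.
    float usq = ux * ux + uy * uy + uz * uz;
    for (int i = 0; i < Q; ++i) {
        float eu  = e[i][0] * ux + e[i][1] * uy + e[i][2] * uz;
        float feq = w[i] * rho * (1.f + 3.f * eu + 4.5f * eu * eu - 1.5f * usq);
        f[i * n + idx] -= (f[i * n + idx] - feq) / tau;
    }
}
```

The streaming step is then a pure memory move of each postcollision value to the neighboring node in direction $\mathbf{e}_i$.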

LBE for LES.
The filtered form of the LBE for LES keeps the structure of the LBGK equation but uses a total relaxation time [21-23]:

$$\bar{f}_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) - \bar{f}_i(\mathbf{x}, t) = -\frac{1}{\tau_{*}}\left[\bar{f}_i(\mathbf{x}, t) - \bar{f}_i^{\mathrm{eq}}(\mathbf{x}, t)\right],$$

where $\bar{f}_i$ and $\bar{f}_i^{\mathrm{eq}}$ represent the distribution function and the equilibrium distribution function of the resolved scales, respectively. Moreover, $\tau_0$ and $\tau_t$ are the relaxation times corresponding to the molecular viscosity $\nu_0$ and the eddy viscosity $\nu_t$, respectively, so that $\tau_{*} = \tau_0 + \tau_t$. Accordingly, $\nu_t$ is given by the Smagorinsky model [21-23]:

$$\nu_t = (C_s \Delta)^2 |\bar{S}|.$$

Here, $C_s$ is the Smagorinsky constant. In addition, a finite-difference approximation was chosen to compute the strain rate tensor $\bar{S}$ in the present work.
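For concreteness, the per-node relaxation-time update implied by these relations could look like the following device-function sketch (the function name and arguments are ours; smag stands for the finite-difference magnitude of the resolved strain rate $|\bar{S}|$ and delta for the filter width):

```cuda
// Smagorinsky eddy viscosity and the resulting total relaxation time,
// using the lattice relation tau = 3*nu + 1/2 (lattice units, c_s^2 = 1/3).
__device__ float totalRelaxationTime(float nu0, float Cs, float delta,
                                     float smag)
{
    float nut = (Cs * delta) * (Cs * delta) * smag;  // nu_t = (Cs*Delta)^2 |S|
    return 3.f * (nu0 + nut) + 0.5f;                 // tau* = tau_0 + tau_t
}
```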

External Force.
For the external forcing term, there are many implementation methods. In the present wall-bounded flow simulation, the modified velocity method [24] was applied; that is, a modified velocity is used to compute the equilibrium distribution function:

$$\mathbf{u}^{\mathrm{eq}} = \mathbf{u} + \frac{\tau \mathbf{G}}{\rho},$$

where $\mathbf{G}$ is the external force. As the wall-bounded flow simulated here is driven by a constant body force, the uniform pressure drop, we define $G$ as

$$G = \frac{\rho u_\tau^2}{\delta}.$$

Here, $u_\tau$ and $\delta$ represent the friction velocity and channel half-width, respectively. The equilibrium distribution function $f_i^{\mathrm{eq}}$ is then calculated by (4) with $\mathbf{u}^{\mathrm{eq}}$ instead of $\mathbf{u}$, which transforms (4) into (9). On the boundaries, $f_i^{\mathrm{eq}}$ was computed by (4) from the macroscopic values $\mathbf{u}$ and $\rho$; the nonequilibrium part cannot be computed directly and was assumed to be equal to the value at the neighboring inner node, where $f_{\mathrm{neighbor}}$ and $f^{\mathrm{eq}}_{\mathrm{neighbor}}$ were computed by (9) and (4), respectively, with the corresponding $\mathbf{u}$ and $\rho$.
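A minimal sketch of this forcing in device code, under the formulas above (the helper name is hypothetical, and the force is assumed to act only in the streamwise $x$ direction):

```cuda
// Modified-velocity forcing: the uniform body force G = rho*u_tau^2/delta
// replaces the streamwise pressure gradient and enters the scheme only
// through the velocity used to evaluate the equilibrium distribution.
__device__ float equilibriumVelocityX(float ux, float rho, float tau,
                                      float u_tau, float delta)
{
    float G = rho * u_tau * u_tau / delta;  // constant driving force
    return ux + tau * G / rho;              // u_eq = u + tau*G/rho
}
```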

Model System and Parameters
Figure 2 shows the model system. The calculation domain lengths are $L_x$, $L_y$, and $L_z$, corresponding to the streamwise $x$, normal $y$, and spanwise $z$ directions, respectively. $\delta$ is the half-width in the direction normal to the walls, which is used to define the Reynolds number. The standard Reynolds number can be given by $\mathrm{Re}_c = U_c\delta/\nu$ with the mean centerline velocity $U_c$. In addition, it is common to define the Reynolds number in such a flow model using the "wall unit," or viscous length, $\delta_\nu = \nu/u_\tau$. The friction velocity is calculated as $u_\tau = \sqrt{G\delta/\rho}$, and the corresponding "wall unit" time is $t_\nu = \delta_\nu/u_\tau$. Thereby, the friction Reynolds number is obtained as

$$\mathrm{Re}_\tau = \frac{u_\tau\delta}{\nu} = \frac{\delta}{\delta_\nu} = \delta^{+}.$$

As usual, a superscript (+) indicates quantities measured in the "wall units" $\delta_\nu$ and $t_\nu$. In a fully developed turbulent wall-bounded flow, the Reynolds number is beyond the transient range.
To carry out a numerical simulation of turbulence, the nondimensional physical parameters describing the problem must be defined first. For wall-bounded flow, the spatial extent of the computational domain plays an important role in obtaining accurate results. In particular, the aspect ratios are $\alpha_x = L_x/\delta$ and $\alpha_z = L_z/\delta$. Referring to the study in [26], Spasov et al. [27] performed a direct numerical simulation at $\mathrm{Re}_\tau = 180$ with $\alpha_x = 4$ and $\alpha_z = 1$. The mean velocities were overpredicted compared with the results obtained by Kim et al. [5], and the Reynolds stress did not agree well either. In Section 4, we test various aspect ratios for the DNS.
For DNS, the grid size cannot be larger than the local Kolmogorov length $\eta$. In fully developed turbulent flow, $\eta^{+} = 1.5$ in the region near the wall [28], and $\eta^{+}$ increases with increasing distance from the wall. Owing to the uniform mesh used in the LBM, the grid scale in the whole field should satisfy $\Delta^{+} < \eta^{+} = 1.5$. For the LBM, $\Delta x = 1$, so $\Delta^{+} = \Delta x/\delta_\nu = \mathrm{Re}_\tau/\delta < 1.5$, which requires $\delta > \mathrm{Re}_\tau/1.5$. Accordingly, with the friction Reynolds number $\mathrm{Re}_\tau = 180$ in the present work, the half-width $\delta$ of the computational domain should be larger than 120.
As for the initial conditions, $\rho = \rho_0 = 1$, and the centerline velocity $U_c$ was given by the logarithmic friction law $U_c/u_\tau = (1/\kappa)\ln \mathrm{Re}_\tau + B$, in which the von Kármán constant $\kappa = 0.4$, $B = 6$, and $u_\tau$ was set to 0.1. The velocity field was initialized as $u = \bar{u} + u'$, $v = \bar{v} + v'$, and $w = \bar{w} + w'$, where the fluctuating terms were generated as $u' = v' = w' \propto (\mathrm{rand} - 0.5)$, $\mathrm{rand} \in [0, 1]$, with $\bar{v} = \bar{w} = 0$, and $\bar{u}$ was based on a frequently used engineering approximation of the universal mean velocity profile near the wall:

$$u^{+} = \begin{cases} y^{+}, & y^{+} < y_c^{+}, \\ \dfrac{1}{\kappa}\ln y^{+} + B, & y^{+} \geq y_c^{+}. \end{cases}$$

The dimensionless velocity $u^{+} = \bar{u}/u_\tau$ is measured in the "wall units." The particular values chosen, $y_c^{+} = 11.6$ for the velocity profile inflection point, $\kappa = 0.4$ for the von Kármán constant, and $B = 5.2$, are not expected to have an impact on the eventual flow profile taken for the computation of developed turbulence statistics.

The current parallel computation used 1D domain partitioning on 8 GPUs. Figure 3 shows the 1D partitioning executed in the present study; each rank exchanges data with its top and bottom neighbors. For detailed information on the data transfer, refer to our previous work [12]. Data communication among GPUs is done by the Message Passing Interface (MPI) and cudaMemcpy. First, the data are copied from GPU to CPU by the CUDA API call cudaMemcpy(..., cudaMemcpyDeviceToHost). Then the data are exchanged between the corresponding CPUs using a parallel tool such as MPI. Finally, the exchanged data are copied from CPU to GPU by cudaMemcpy(..., cudaMemcpyHostToDevice).
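The exchange described above can be sketched as follows (a hypothetical host-side helper; the real solver exchanges the boundary slices of the distribution functions between neighboring ranks each step):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Exchange one boundary slice with a neighboring rank: GPU -> host buffer,
// MPI send/receive, host buffer -> GPU ghost layer.
void exchangeHalo(const float* d_boundary, float* d_ghost,
                  float* h_send, float* h_recv, size_t sliceBytes,
                  int sendTo, int recvFrom, MPI_Comm comm)
{
    int count = (int)(sliceBytes / sizeof(float));

    // 1) cudaMemcpy(..., cudaMemcpyDeviceToHost): stage the outgoing slice.
    cudaMemcpy(h_send, d_boundary, sliceBytes, cudaMemcpyDeviceToHost);

    // 2) Exchange the slices between the corresponding CPUs via MPI.
    MPI_Sendrecv(h_send, count, MPI_FLOAT, sendTo,   0,
                 h_recv, count, MPI_FLOAT, recvFrom, 0,
                 comm, MPI_STATUS_IGNORE);

    // 3) cudaMemcpy(..., cudaMemcpyHostToDevice): fill the ghost layer.
    cudaMemcpy(d_ghost, h_recv, sliceBytes, cudaMemcpyHostToDevice);
}

// Called twice per LBM step in the 1D partitioning: once for the top
// neighbor and once for the bottom neighbor of each rank.
```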

DNS for Wall-Bounded Flow
The DNS was performed on the turbulent wall-bounded flow with $\delta = 128$ and various aspect ratios. The corresponding nondimensional mesh size is $\Delta^{+} = 1.41$ for the whole solution domain. Figure 4 shows (a) the velocity pattern at a constant streamwise location and (b) the isosurface of the second invariant $Q$ of the velocity gradient tensor when the aspect ratio is $\alpha_x = 4$, $\alpha_z = 1$.

Performance of Computation by GPUs.
For this DNS work, it took about 24 hours to simulate $3 \times 10^{6}$ LBM steps on $6.7 \times 10^{7}$ LBM grids with 8 K20M GPUs. The performance reaches 2333 MLUPS; that is, $2.333 \times 10^{9}$ meshes are processed per second. Compared with the similar DNS in [27] performed on 36 CPUs, where the performance was 5.6 MLUPS, the present GPU performance is about 416 times higher. With the LBM on GPUs, high performance can be achieved without great programming effort.
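The quoted throughput is consistent with the reported grid count, step count, and wall time:

$$\text{MLUPS} = \frac{6.7 \times 10^{7}\ \text{grids} \times 3 \times 10^{6}\ \text{steps}}{24 \times 3600\ \text{s} \times 10^{6}} \approx 2333.$$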

LES for Wall-Bounded Flow
Since the DNS approach resolves all relevant spatial and temporal scales, it can predict all possible fluid motions with high fidelity but at a large cost. In this part, we simulated the wall-bounded flow by applying the standard Smagorinsky model in LES. Moreover, the local grid refinement technique of the LBM was tested and then used.

Local Grid Refinement.
Local grid refinement is usually applied to regions where large changes of the solution are expected. For the LBM, locally refined patches are superposed on the global coarse grid, saving computational time to some extent while enabling high resolution where needed. Grid refinement is performed by dividing the space step by a refinement factor $n$. The kinematic viscosity, defined in the frame of the LBGK model, depends on the step size through

$$\nu = \frac{2\tau - 1}{6}\,\frac{\delta_x^{2}}{\delta_t}.$$

To keep a constant viscosity, and hence the same Reynolds number, on the coarse grid ($\delta_{x,c}$) and the fine grid ($\delta_{x,f} = \delta_{x,c}/n$), the relaxation time $\tau$ in (8) has to be redefined as [29]

$$\tau_f = \frac{1}{2} + n\left(\tau_c - \frac{1}{2}\right). \quad (20)$$

Here $\tau_f$ and $\tau_c$ represent the relaxation times on the fine and coarse grids, respectively. Taking into account that the LBGK scheme becomes unstable when $1/\tau_c$ is very close to 2, together with the requirement $1/\tau_f > 1$, we estimate from (20) that the upper limit of $n$ is about 50. In the present study the refinement factor is $n = 2$. Accordingly, the time step on the fine grid is halved, $\delta_{t,f} = \delta_{t,c}/n = \delta_{t,c}/2$, where $\delta_{t,c}$ is the time step on the coarse grid. Because of the continuity of the hydrodynamic variables and their derivatives on the interface between the two grids, the postcollision distribution functions on the coarse and fine grid nodes are related by the rescaling relations (21) and (22), obtained from (4), (9), and (20).

The strategy of the numerical realization with refinement factor $n = 2$ is shown in Figure 7. First, carry out a simulation on the coarse grid up to time $t_1$. Then, map the solution to the fine grid using second-order interpolation in space and time from the coarse-grid results at $t_1$, according to (22). Finally, advance the dependent variables on the coarse and fine grids in time, proceeding with the obtained fine-grid boundary conditions.

The validity of the adopted local grid refinement was verified by simulating the classical problem of lid-driven cavity flow at Re = 1000 with and without local grid refinement. Local grid refinement was used with a coarse grid (64 × 32 nodes) in the middle part and fine grids (16 × 127 nodes) in the top and bottom quarters of the computational region, respectively. Figure 8 compares the NSE results of Ghia et al. [25] with our LBE results with and without grid refinement, in terms of the velocity profile on the centerline in the normal direction. It is clearly shown that the LBE results with grid refinement coincide better with the results of Ghia et al. [25] near the boundaries, where the grids are refined, than those of the uniform coarse grid system.
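The relaxation-time rescaling at the grid interface reduces to a one-line helper; a minimal sketch under relation (20) (the function name is ours):

```cuda
// Relaxation time on the fine grid for refinement factor nref, from (20):
// tau_f = 1/2 + nref * (tau_c - 1/2). This keeps the viscosity, and hence
// the Reynolds number, identical on the coarse and fine grids.
__host__ __device__ float fineRelaxationTime(float tauCoarse, int nref)
{
    return 0.5f + nref * (tauCoarse - 0.5f);
}
```

For $n = 2$ and, for example, $\tau_c = 0.6$, this gives $\tau_f = 0.7$.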

Results and Discussion
For LES, the number of grid points was first decreased by a factor of 4 in each direction relative to the DNS, keeping the aspect ratio at $\alpha_x = 8$, $\alpha_z = 2$; that is, the total number of grids for the LES is 1/64 of that for the DNS. Figure 9 displays the profile of the averaged velocity in the normal $y$-direction using LBM-LES at various $C_s$ for a 256 × 64 × 64 grid system, whose grid scale is $\Delta^{+} = 5.64$. The results at different $C_s$ are the same, which means that the subgrid scale model does not take effect at such a large grid scale. Considering the grid scale and the computational cost simultaneously, the grids within $\delta/4$ of the top and bottom walls were refined by a factor of 2. Figure 10 shows the mean velocity distribution in the wall-normal direction at various $C_s$ computed on this locally refined grid system, where the $\delta/4$ parts adjacent to the top and bottom walls were refined to 512 × 15 × 127 ($\Delta^{+} = 2.82$) and the middle part was kept at 256 × 48 × 64 ($\Delta^{+} = 5.64$). As observed, the subgrid model takes effect, and the results match best when $C_s = 0.13$, which is less than the typical value of $C_s = 0.17$ used in NS-based LES [28].
Figure 11 shows the profile of the Reynolds stress in the normal $y$-direction using LBM-LES at $C_s = 0.13$ for the same locally refined grid system mentioned above. It can be observed that the values deviate from the LBM-DNS results near the walls, which means that the grid is not fine enough to capture the very fine eddies in the boundary layer. In addition, the SGS model with a constant $C_s$ is less accurate near walls.

Conclusion
In the present work, DNS and LES were performed on the wall-bounded flow at $\mathrm{Re}_\tau = 180$ using the lattice Boltzmann method and multiple GPUs. In the DNS, 8 K20M GPUs were adopted. The effect of the aspect ratio was examined, and the number of grids is $6.7 \times 10^{7}$ for the largest aspect ratio. The results show that, although periodic boundaries are used in the spanwise and streamwise directions, the dimensions in both directions affect the DNS results. For the case of $\alpha_x = 8$, $\alpha_z = 2$, both the mean velocity and the turbulent statistics, such as the Reynolds stress and velocity fluctuations, agree perfectly with the results of Kim et al. [5]. At the same time, a high performance of 2333 MLUPS was obtained. As for the LES, the local grid refinement technique was tested and then used. Using $1.76 \times 10^{6}$ grids and $C_s = 0.13$, good results for the mean values were obtained. However, the turbulent statistics deviate from the LBM-DNS data near the wall, which suggests that more grids should be used near the walls and that a dynamic Smagorinsky constant should be utilized to obtain more accurate results. Moreover, multi-GPU computing, whose supercomputing power matches perfectly with the good parallelism of the LBM, delivered a remarkably high performance in this work.

Figure 6: Turbulent statistics calculated by LBM-DNS compared with the DNS data of Kim et al. [5]: (i) the Reynolds stress in the wall-normal direction and (ii) the root-mean-square (rms) streamwise, spanwise, and wall-normal velocity fluctuations, at (a) $\alpha_x = 4$, $\alpha_z = 2$ and (b) $\alpha_x = 8$, $\alpha_z = 2$, respectively. Our LBM-DNS results and Kim's data [5] are generally in good agreement for both aspect ratios. There are slight discrepancies in the rms values at $\alpha_x = 4$, $\alpha_z = 2$, especially in the streamwise velocity fluctuation $u_{\mathrm{rms}}$. Compared with the mean velocity and the Reynolds stresses, the DNS results for the velocity fluctuations are more sensitive to the streamwise length chosen for the simulation.

Figure 8: Comparison between the NSE results of Ghia et al. [25] and our LBE results with and without grid refinement for the lid-driven cavity flow at Re = 1000: velocity profile on the centerline in the normal direction.

Figure 10: The profile of the averaged velocity in the normal $y$-direction using LBM-LES at various $C_s$ for a locally refined grid system: the $\delta/4$ parts adjacent to the top and bottom walls were refined to 512 × 15 × 127 ($\Delta^{+} = 2.82$), and the middle part was kept at 256 × 48 × 64 ($\Delta^{+} = 5.64$).

Figure 11: The profile of the Reynolds stress in the normal $y$-direction using LBM-LES at $C_s = 0.13$ for a locally refined grid system: the $\delta/4$ parts adjacent to the top and bottom walls were refined to 512 × 15 × 127 ($\Delta^{+} = 2.82$), and the middle part was kept at 256 × 48 × 64 ($\Delta^{+} = 5.64$).