High Performance Computation of a Jet in Crossflow by Lattice Boltzmann Based Parallel Direct Numerical Simulation

Direct numerical simulation (DNS) of a round jet in crossflow based on lattice Boltzmann method (LBM) is carried out on multiGPU cluster. Data parallel SIMT (single instruction multiple thread) characteristic of GPU matches the parallelism of LBM well, which leads to the high efficiency of GPU on the LBM solver. With present GPU settings (6 Nvidia Tesla K20M), the present DNS simulation can be completed in several hours. A grid system of 1.5 × 10 is adopted and largest jet Reynolds number reaches 3000. The jet-to-free-stream velocity ratio is set as 3.3. The jet is orthogonal to the mainstream flow direction. The validated code shows good agreement with experiments. Vortical structures of CRVP, shear-layer vortices and horseshoe vortices, are presented and analyzed based on velocity fields and vorticity distributions. Turbulent statistical quantities of Reynolds stress are also displayed. Coherent structures are revealed in a very fine resolution based on the second invariant of the velocity gradients.


Introduction
A jet in crossflow (JICF), also known as transverse jet, normally describes a jet of fluid which enters and interacts with the crossflow and the resulting flow field.JICF has wide applications in many engineering fields, such as gas turbine impingement cooling and film cooling, fuel injection in combustors, thrust vectoring system in turbojet propulsion, and reaction control for missiles.For a traditional JICF problem, near the flow field of the transverse jet, four types of vortical structures can be observed [1] due to the interaction between mainstream and jet as shown in Figure 1: (1) the counter-rotating vortex pair (CRVP), also known as kidney vortices; (2) the horseshoe vortices system; (3) the jet shearlayer vortices due to Kelvin-Helmholtz instabilities; (4) the wake vortices.The CRVP and the horseshoe vortices are normally defined by mean flow even though they may have unsteady part.The shear-layer vortices and the wake vortices are naturally unsteady.
Those vortical structures are very important to fluid flow and heat transfer behaviors of those fields where JICF is applied.If gas turbine film cooling is taken as an example, film coverage and the cooling effectiveness are closely relevant to those large structures, no matter whether they are defined as steady or naturally unsteady.The "steady" CRVP lifts the coolant up and mixes with the mainstream, weakening the film coverage and cooling effectiveness.The "unsteady" shear-layer vortices have different directions determined by the shape of the ejection hole and consequently either strengthen or suppress the CRVP [2].To accurately predict those large structures in which unsteadiness and anisotropy are inherent, a time-resolved scheme is required in the flow field calculation with fine spatial discretization.
The flow field of JICF is characterized by both anisotropic large-scale structures which break down to smaller sizes.These resultant small scale structures are isotropic and dissipation is dominant.The typical energy spectrum for the JICF inherently requires capturing the flow structures across the spectrum of all scales and thus a time and space accurate calculation is needed.Direct numerical simulation (DNS) and large eddy simulation (LES) are commonly used to resolve the turbulent characteristics of JICF in largescale parallel computation devices.However, inherent features of Navier-Stokes equation or transport equation make the parallel processing with low efficiency.Thus, a highly parallel scheme is extremely helpful to those large-scale simulations of turbulent flow field.Lattice Boltzmann method (LBM) has been regarded as a promising candidate for years due to its major merits of fully parallel algorithm.Unlike conventional numerical schemes based on assumption and discretizations of macroscopic continuum, the LBM adopts microscopic models [3].Other advantages of LBM include (1) easy treatment of boundary conditions and (2) easy programming.As a result, the LBM has its wide application in scientific and engineering research, such as Biofluid and porous medium [4][5][6].However, to resolve the turbulent flow, LBM-based DNS and LES require a very high grid resolution and thus huge computation resource is needed.
Graphic Processing Unit (GPU) brings excellent hardware conditions for these simulations.With the development of computing platforms, for example, CUDA [7] and OpenCL, the use of GPU to accelerate nongraphic computations has drawn much more attention [8,9].Due to its high performance of floating-point arithmetic operation, together with wide memory bandwidth and good programmability, general-purpose GPU (GPGPU) has its huge advantage over CPU in turbulent flow simulation [10][11][12] and thus has been applied in fields like weather prediction, crystal growth process, and air flow in the city [13][14][15].
The feasibility for LBM-based DNS has some preliminary validation for turbulent flows between parallel plates [16]; further verification for JICF is still empty.In this paper, the lattice Boltzmann based DNS on JICF model is performed using D3Q19 model in multi-GPUs.With further validation of this code, this paper aims to resolve the JICF flow field, including the time-averaged and instantaneous velocities, vorticities, and turbulent quantities of Reynolds stress.The characteristic coherent hairpin structures are also revealed and discussed.
Collision step is as follows: streaming step is as follows: where   and f denote the pre-and postcollision states of the distribution function.Please note that in the collision step f is calculated absolutely locally and in streaming step   is only relevant to its neighboring nodes.Thus, the LBM code itself is highly suitable for parallel processing and capable of obtaining high efficiency.Macroscopic quantities, such as  and, can be calculated as follows: The lattice speed of sound is   = 1/ √ 3, and pressure can be obtained from the state equation of the ideal gas: Considering boundaries of f in transverse direction (front and back boundaries), periodic boundary conditions are used in this paper.For solid walls, the following boundary conditions are applied [19]: in which  eq  is calculated by macroscopic u and  on boundaries by (4).The nonequilibrium distribution on boundaries  neq  is assumed to be of the same value as that in the neighboring inner node  neq -neighbor .f-neighbor and  eq -neighbor on the inner node are computed by ( 6) and (4), respectively.

GPU Settings
A well-known inherent advantage of lattice Boltzmann method is its parallelism.Previous study [20] compared the acceleration performance between Navier-Stokes solver and current LBM solver, in which incompressible flow around a cylinder was simulated on GPU (GeForce GTX280) and CPU (Xeon E5420, 2.5 GHz) separately.For the Navier-Stokes solver in which Red-Black scheme and multi-grids were applied, the GPU acceleration over CPU was 13.7.Comparatively, for LBM on the same grid scale, the acceleration was 87.4.In addition to the acceleration performance of LBM on GPU as mentioned, the difficulty of coding is much smaller for LBM than Navier-Stokes on GPU.
However, the expense to LBM's parallelism is also obvious due to substantial variables and consequent huge memory consumption.For the D3Q19 model in single precision calculation, the theoretical GPU memory requirement on 1.5 × 10 8 grids (1024 × 256 × 600) is over 30 G. Besides, data transfer between GPUs is realized by MPI (message passing interface) and CudaMemcpy() sentences in CUDA.From the previous experience [11], if the number of GPUs is less than 10, the 1 domain partitioning is more efficient than 2 and 3.In the current research, 6 GPUs of Nvidia Tesla K20M are used   and the current 1 partitioning of the GPUs is illustrated in Figure 3.

Physical and Numerical Models
The physical model of the jet in cross flow problem is shown in Figure 4.The computational domain is hexahedral in shape.The mainstream flow is in -direction, while the jet is orthogonal to the mainstream in -direction and ejected from the bottom plane uniformly with diameter .Origin of the coordinates is located at the center of the round jet exit.The flow domain is zero-gradient in pressure and no-slip boundary conditions are applied on the bottom plane excluding the jet exit.On transverse planes (with normal in -direction), periodic boundary conditions are applied.The inlet turbulent profile of velocity is given through a calculation procedure similar to the so-called "rescaling process" as described in [21].The results presented in this paper are based on flow parameters and boundary conditions listed as follows.
(i) Ejecting hole diameter () = 75.Reynolds number Re  is defined based on jet velocity   and film hole diameter  as Re  =   /.The largest Reynolds number Re  reaches 3,000 on grids (1024 × 256 × 600), and several calculations are performed with different dimensions, grids, and jet locations trialed as shown in Table 1.The discussion is based on the 1st case in the table.The flow domain is kept in the "low-speed" range, and velocity magnitude is smaller than 0.3 times of the lattice sound speed   .Thus, the LBGK model used in current study is suitable [22].

Model Validation
Nondimensional time-averaged streamwise velocity  + = /  versus  + =   / is presented in Figure 5.The location is selected before the jet at  = −2 and  = /2 to eliminate the influence from the jet.A turbulent boundary layer profile is obtained, consisting of laminar sublayer and log layer, which is consistent with the Von Karman mixing length theory.
Further comparisons with experiments are made in regions where the mainstream is disturbed by the jet as shown in Figure 6.The experiments were done by Meyer  et al. [23,24] using stereoscopic PIV in a JICF problem.Time-averaged streamwise and spanwise velocities (/ ∞ and / ∞ ) are presented in spanwise direction (with respect to /) at locations of  = 0.5,  = /2 and  = 3,  = /2.Results show good agreement with experimental data.
In current simulation, a uniformly distributed velocity profile is set at the jet exit, while, in experiment, the jet is supplied through a tube and a turbulent profile has been developed inside it.Thus, an overestimation of the jet's momentum near its boundary of the jet hole can be expected and seen in Figure 6.

Results and Analysis
6.1.Energy Spectra.Fast Fourier transformation (FFT) is performed to a time series of turbulent energy  tur at the sample frequency of 2000 at  = 6,  = 4, and  = /2.The turbulent energy is defined as ).The corresponding turbulent power spectrum is presented versus frequency as shown in Figure 7.The calculated turbulence decay rate is close to the theoretical canonical value of −5/3 and no obvious spectrum buildup is observed, which indicate the current mesh resolution is adequate.b) and 8(c), respectively.Generation and shedding of Kelvin-Helmholtz vortices are introduced by the shear layers between the mainstream and the jet on both leading edge and trailing edge.The shed vortices are mixing with the mainstream and dissipated in the downstream region quickly.In the wake region of the jet, wake vortices are observed through those nontrivial vorticity values; however, no vortex tube is directly observed in this cross section view.

Velocity Field and
To compare the time-averaged fields with their instantaneous counterparts, the Kelvin-Helmholtz vortices are not available after averaging, which proves the shear-layer induced vortices are inherently unsteady and periodic.The 2 instantaneous velocity (, V) and time-averaged velocity (, ) fields are displayed at locations = 0.5, 1, and 2, respectively, in Figure 9.The corresponding vorticity distributions are also displayed in the figure.In Figure 9(a), the instantaneous velocity vectors and vorticity distributions at time step 80000 are shown.Separation of mainstream occurs near the mid-chord portion.Further, vortex shedding and dissipation are observed in the downstream area.Wake region is formed at the backside of the jet.A characteristic flow pattern of the wake by the transverse jet is obtained which is significantly different from that by a solid cylinder as described in [1].In Figures 9(b) and 9(c), the time-averaged velocity vectors and vorticity distributions are displayed.Similar to flow passing a solid cylinder, along the streamwise direction, positive and negative values appear when the mainstream starts to separate.In downstream region, the vorticity recovers to trivial values due to the dissipation of those periodically-shed vortices.
The 2 time-averaged velocity (, ) fields are presented in several streamwise planes at locations  = 1, 1.5, and 3 in Figures 10(a), 10(b), and 10(c), respectively, with vorticity distributions.As shown in Figures 10(a) and 10(b), the CRVP system consists of two layers, the upper one and the lower one.For a round hole, as used in the current study, the rotating directions of the two pairs are the same.This "two-deck" vortices structure is very similar to that observed in [2] by laser-induced fluorescence.By comparing among Figures 10(a the lower layer vortices lift in vertical direction and weaken in strength.Finally, in far downstream region ( = 3), the -component is at the same level of -component since the jet is almost fully bended and those two layers fully emerge with strength further weakened.For both     and     , the highest values appear in the leading and trailing portions of the jet due to large velocity gradients.Right behind the jet in the recirculation zone, both components maintain some nontrivial values, especially     , indicating the wrapping motion around the jet by the fluid flow.

Coherent Structure.
The isosurfaces of second invariant of velocity gradient  at time steps 80,000 and 120,000 with domain height of /2 are plotted in Figure 12.The definition of  is given by (11) in tensor form: where   and   denote the rate of rotation and rate of shearing, respectively.The hair-pin-shaped coherent structures are originated from the side and trailing edge of the ejecting hole and wrapping around the jet.In the downstream region, the hair-pin structure is weakened and dissipated into the mainstream.

Conclusion
A direct numerical simulation of jet in crossflow based on lattice Boltzmann method is performed on multi-GPUs.The current simulation takes about 10 hours based on largest grid of 1.5 × 10 8 .Several points can be summarized and concluded as follows.(2) The turbulent energy spectrum is plotted versus frequency at  = 6,  = 4, and  = /2.A decay rate close to the theoretical value of −5/3 is observed.
(3) 2 velocity vectors and vorticity fields are plotted in -, -, and -direction planes.Unsteady shear-layer vortices are formed and shed along the leading and trailing side of the jet trajectory.Horseshoe vortex is generated in the plate boundary layer before the jet.Two-deck structure of the CRVP is retained and the lower pair is observed to rise along the streamwise direction.
(4) The normal components (    and     ) and shear component (    ) of the Reynolds stress are revealed in the  = /2 plane.Strong anisotropy is observed for the flow field due to the shear layer and wake.
(5) Coherent structure represented by the second invariant of velocity gradient () is plotted at different time steps.The characteristic hair-pin vortices are observed.
Vorticity.The 2 time-averaged velocity (, ) and instantaneous velocity (, ) fields in the middle plane of -direction ( = /2) are plotted in Figures 8(a), 8(b), and 8(c), respectively.The corresponding vorticity distributions based on the time-averaged and instantaneous velocities are also shown in the figure.In Figure 8(a), the vertical jet is bended toward exit by the mainstream flow and phenomenon like flow passing a solid obstacle is observed in the wake region.As shown in the enlarging plot of flow field, the approaching wall boundary layer meets with adverse pressure gradient before the jet and consequently separates and forms the horseshoe vortex.Considering the vorticity field, negative and positive values are observed in leading edge and trailing edge along the jet trajectory and trivial values exist in the wake region.Instantaneous velocity and vorticity fields at time steps 80000 and 120000 are shown in Figures 8(

6. 3 .
Turbulent Statistics.Reynolds stress distributions of     ,     , and     are plotted in the midplane of direction ( = /2) in Figures 11(a), 11(b), and 11(c), respectively.The shear stress component of the Reynolds stress tensor is shown in Figure 11(a).Strong anisotropy of the flow field can be observed in the flow field.Negative values of     appear in the leading edge of the jet with positive values in the trailing portion.In the wake region, both positive and negative values exist with big nonuniformity.In Figures 11(b) and 11(c), the normal Reynolds stress is displayed.

Figure 12 :
Figure 12: Isosurfaces of second invariant of velocity gradient.