Parallelization of an Unsteady ALE Solver with Deforming Mesh Using

This paper presents a parallel, GPU-based, deforming mesh-enabled unsteady numerical solver for solving moving body problems by using OpenACC. Both the 2D and 3D parallel algorithms based on spring-like deforming mesh methods are proposed and then implemented through OpenACC programming model. Furthermore, these algorithms are coupled with an unstructured mesh based, implicit time scheme integrated numerical solver, which makes the full GPU version of the solver capable of handling unsteady calculations with deforming mesh. Experiments results show that the proposed parallel deforming mesh algorithm can achieve over 2.5x speedup on K20 GPU card in comparison with 20 OpenMP threads on Intel E5-2658 V2 CPU cores. And both 2D and 3D cases are conducted to validate the efficiency, correctness, and accuracy of the present solver.


Introduction
Many unsteady fluid flow simulations, such as fluid-structure interaction (FSI) [1][2][3] and wing-store separation [4], involve slight deformation of wall boundaries or relative motion between components in Computational Fluid Dynamics (CFD).This kind of problems is termed as dynamic mesh problems in unsteady calculations.So far various kinds of dynamic mesh approaches [3,[5][6][7][8] have been proposed and validated.
According to the present literatures, the dynamic mesh approaches can be divided into two categories.The first class refers to the overset grid technique [7,[9][10][11][12] in which mesh movements can be modeled independently for the involved boundary that is supposed to be moved.The meshes of different moving parts overlap with each other to form the complete overset grid system.This technique, applied in both structured [10,11] and unstructured [7] grids, alleviates the complexity of mesh generation and improves the mesh quality of local domains.However, it adds the difficulty of the mesh preprocessing work [7,9,11,13] in which the interpolated information of each pair of overlapped meshblock needs to be determined from each one.An alternative category, termed as deforming mesh technique, is about using single, consistent mesh and deforming it to conform the new geometric shape without changing the connectivity of the whole computational mesh.In this category, the node connectivity-based methods consider each two neighboring mesh nodes and treat the whole mesh domain as a spring network.The representative methods involve 2D springanalogy [14], 2D torsional spring [15], 3D ball-vertex [16], and 3D torsional spring [17,18] approaches.Contrary to the node connectivity-based methods, the nonconnectivity methods move the mesh nodes point-by-point by calculating the relative position with respect to the given reference position.RBF-(radial basis functions-) based algorithm [19] and DGM (Delaunay grid mapping) [20] algorithm are two popular approaches in nonconnectivity methods.
Parallel algorithm design and implementation (high performance computing, HPC) for CFD applications have become popular and necessary in recent decades due to the ever-growing computational scale.And the flow fluid simulations based on the traditional parallel techniques, such as MPI (message passing interface) [21] and OpenMP [22], have been studied in [3,4,6,9,11].Since Nvidia's CUDA (Compute Unified Device Architecture) [23] was 2 Scientific Programming published in 2007, GPU computing has attracted more and more attention due to its high memory bandwidth and lightweighted threads of parallelism.CUDA also provides both hardware configurations and software supports for one to develop programs in parallel by using C/C++ language.And recent years have witnessed a lot of numerical solvers [24][25][26][27][28][29][30] that involve various CFD areas on the GPU device.
However, the GPU-based developed solvers are most of the steady ones, which do not involve moving mesh.Even in unsteady calculations [31][32][33] on GPU device, either no mesh deformation or dynamic overset grid technique is studied.Furthermore, the developed GPU-based solvers are implemented by CUDA programming model in which the core functions of the source code require the programmers to redesign in CUDA-kernel subroutines.This code rewriting procedure is a time-consuming job that leads to low efficiency of GPU code development.With this in mind, the objective of this paper is to develop an accelerated, unstructured mesh based, unsteady solver that is capable of handling deforming mesh for moving body problems on the GPU platform by employing the new directive-based parallel programming interface-OpenACC.The proposed solver requires minor code revision effort but can show well parallel performance on the GPU device.The present paper is organized as follows: Section 2 gives a brief introduction of the OpenACC programming model; Section 3 states the unsteady numerical algorithms of ALE formulation.Both 2D and 3D parallel deforming mesh algorithms (PDMA) and OpenACC implementation are proposed in Section 4. The implementation of an unsteady solver that is coupled with the proposed algorithm for moving body problems is presented in Section 5; Section 6 reports the experiments result of the ALE unsteady solver with deforming mesh; and the last section is the summary of the work.

OpenACC Programming Model
The past several years have witnessed the development of heterogeneous architectures, such as DSPs (digital signal processors), FPGA (Field-Programmable Gate Array), and GPU (Graphics Processing Unit), capable of providing great computational power as an auxiliary accelerator for various applications in scientific areas.The programming interfaces, meanwhile, including OpenCL [34], CUDA [23], and C++AMP [35], emerge for developers to design the corresponding algorithm on each architecture.However, in order to explore the potential of computing of theses architectures thoroughly, the programming models require users to understand the hardware-level framework well and put large amount of effort on rewriting the codes, which makes the code migration work with low efficiency.OpenACC [36], published in late 2011, is designed as a programming model that allows researchers to accelerate their scientific codes on different processors and platforms by using highlevel compiler directives.The design concept of OpenACC resembles OpenMP [22]: one needs to firstly identify the areas that should be parallelized and then accelerate them by declaring appropriate directives or calling runtime library routines.This directive-based programming model not only makes one parallelize application codes with minor effort but also enable developers to run a single source code on different accelerating platforms, such as many-core CPUs, graphics processing units (GPU).The OpenACC standard is currently supported by four commercial compiler vendors, CAPS [37], Cray [38], PGI [39], and Nvidia [40].It is quickly becoming a popular, effective, attractive programming model that offers scientists and researchers an efficient way to develop platform independent, portable and parallel scientific codes.Anyone who is familiar with scientific language, such as C, C++, and Fortran, is capable of porting the codes on an accelerator.From the point of view of architecture, the OpenACC provides an abstract model by defining host and accelerator device with a hierarchy of memories [41].The accelerator device could be the same device as a CPU core, a GPU card, or a future accelerator device.The interactions of different devices, such as memory management, data transfers, and asynchronous operations, can be achieved by compiler and runtime library routines.

ALE Formulation for Unsteady Simulation
The numerical model for unsteady flows with moving or deforming mesh employed in present paper is Arbitrary Lagrangian Eulerian (ALE) formulation for Navier Stokes equations, which can be written in integral form as where Ω is control volume,  is the integral surface element, W = [, , V, , ]  denotes five conservative variables in three dimensions, and F   = F inv  − F ALE  , F V represent the convective fluxes on moving or deforming mesh and viscous fluxes, respectively.These two vectors are given by where ,  denote the density and pressure, respectively, U = [, V, ]  represents the velocity components in each direction, N = [  ,   ,   ]  stands for the unit normal vector of the integral surface, and  = ()  (,  = , , ) denotes the viscous stress tensor.Two velocity vectors, V  and V  , satisfy   = U  ⋅ N and   = U ⋅ N −   , where the vector is the velocity of the mesh.The component of Θ, denoted as Θ  =   + V  +   + (/) ( = , , ), stands for the work of viscous stresses and the heat conduction.By introducing two additional equations,  =  and  = ( − 1)( − ( 2 + V 2 + 2 )/2), the complete ALE equations contain seven unknown variables to be solved.
In the case of unsteady simulations involving deforming or moving mesh, Geometric Conservation Law (GCL) [42] By moving the second term to the right side of (3) and performing second-order time discretization on the left term of (3), the variation of control volumes in different time levels can be expressed as

Two-Dimensional Parallel Deforming Mesh Algorithm.
The spring-like deforming mesh algorithms [14,15] have been applied widely due to its robustness in very complex configurations.The main idea of these approaches is to view the unstructured grid of the computational domain as a network of springs which are constructed by the connectivity of nodes.The equilibrium equations of the spring network are satisfied when the grid is static and the equations are satisfied again after the grid movement by providing the displacements of moving nodes as boundary conditions.Then the displacements of inner nodes can be determined through solving a linear system of equations.For the convenience of the following statement, the classical two-dimensional torsional spring deforming mesh approach [15] is reviewed first.
As illustrated in Figure 1, the torsional stiffness    is defined as 1/sin 2    , which is used as collapse control parameters to avoid grid crossover problems during the grid movement.Then the moments, expressed as a vector M ijk , can be written as M ijk = C ijk Δ ijk , where C ijk is a 3 × 3 diagonal matrix with    ,    ,    in diagonal positions, respectively, and Δ  = [Δ  , Δ  , Δ  ]  = R  q  represents the rotation increment vector with a 3×6 matrix R  and the displacements of , ,  nodes in horizontal and vertical directions.Finally, the effect of torsional springs in triangle   can be calculated by using discrete forces as where ) ( ̸ = , ,  = , , ).The above linear system of equations is generally solved by employing traditional iterative methods, such as Jacobi, Gauss-Seidel, SOR, and SSOR.These methods do not require explicit matrix format and work with lower memory cost.However, they generally require a lot of iterations to convergence.The spring-like mesh deformation algorithms produce symmetric torsional stiffness matrices, but the matrices are not always positive definite.Therefore, a fast Krylov subspace based preconditioned BiCGSTAB [43] (P-BiCGSTAB) method is employed in this paper.And a parallel coefficient matrix assembly (PCMA) algorithm is proposed to generate explicit format of the sparse matrix required by P-BiCGSTAB method.
As listed in Algorithm 1, 2DPCMA process fills the CSR (Compressed Sparse Row) format of the spring coefficient matrix which is constructed by a row array _, a column array _, and a value array V_ through launching many threads concurrently.Each thread calculates two rows of the sparse matrix independently by accumulating the corresponding values in torsional coefficient matrix.Specifically, given an inner node , each triangle that contains  contributes to V 2,2 (value at 2 th row, 2 th column of the torsional coefficient matrix with nonsparse format), Input: (1) Unstructured grid metrics (UGM) including the coordinates of the mesh nodes (X, Y) and the indices of the nodes that form the elements; adjacent data structure of grid (2) Row array, _, which stores the starting positions of rows.
(3) Inner node identifier  and the GPU thread identifier .Output: (1) Column array, _, which stores the positions of the no-zero values in each row.
(2) Value array, V_, which stores the no-zero values in each row.
(3) Right hand vector of the linear system of equations, b.Procedure: Thread  is responsible for the following tasks (1) initialize an integer counter:  = 0; (2) initialize the elements of vector b that  is to calculate: b IF is a moving boundary node Then (10) For each triangle   that contains  edge Do (11) b End For (13) Else (14)  + +; For each triangle   that contains  edge Do (18) V_ End For (21) End IF (22) End For Algorithm 1: Two-dimensional parallel coefficient matrix assembly (2DPCMA) algorithm.
The subsequent lines in Algorithm 1 fill the remaining positions of the 2 th and 2 + 1 th rows of the torsional coefficient matrix and the nonzero positions of vector b.This filling process is accomplished by looping all neighbors of node .When node  encounters a boundary node  that is tagged as a moving point, then the torsional coefficients of the triangles which contain  edge contribute to 2 th position of vector b; otherwise, they are accumulated to V 2,2 , V 2,2+1 , V 2+1,2 , V 2+1,2+1 .To explain this, Figure 2 shows an example of unstructured mesh that contains 7 nodes, 6 triangles, and two boundary nodes.V_ is one-to-one mapped and the sequence of neighbor looping has no impact on the content of CSR format matrix.
To implement the 2DPCMA algorithm efficiently, basic grid data structure has to be arranged optimally.Figure 2 shows the anticlockwise neighbor storage strategy in the implementation that is capable of containing all neighboring information implicitly.As depicted in Figure 2, for example, the storing of the neighbors of node () starts with node () and ends with node (); then it is easy to visit the previous and subsequent nodes of () ( and , resp.) when the data information of triangles that contain edge  is required in line (10) of Algorithm 1.
The implementation of 2DPCMA by using OpenACC directives on the GPU accelerator is shown as Listing 1.This implementation consists of two core sections.One part is responsible for arranging memory on GPU via employing #pragma acc data clause.The first three lines list two groups of variables with different data attributes.The memories of V_ and _ variables are created by using creat clause while the memories of the remaining variables are both created and initialized with the corresponding values on CPU through copyin clause.Another segment of Listing 1 is a function in which the workload can be distributed automatically to independent thread launched through #pragma acc parallel loop directive.The data used in this function is allocated and initialized in previous step; thus the present keyword is used to indicate that the variables declared in the following brackets are already created on the GPU accelerator.During the OpenAcc_2DPCMA procedure, the workload which is related to an inner node  is implicitly distributed to a specific GPU thread  to execute and these two variables are treated as input parameters in 2DPCMA listed in Algorithm 1.

Three-Dimensional Parallel Deforming Mesh Algorithm.
Degand and Farhat's work [17] extended the two-dimensional torsional springs deforming mesh algorithm to threedimensional ones.The main idea of their work is to split the quality constraints of a tetrahedron into the quality constraints of several triangles and then transform each 2D 6 × 6 stiffness matrix to 3D 9 × 9 matrix via the coordinate transformation formula.Burg [18] proposed a universal formulation of the torsional spring idea where two-dimensional formulation can be directly extendible to three-dimensional case by analyzing the equations of volume and area for a mesh cell.As the three-dimensional formulation is more straightforward in Clarence's work, the parallel algorithm of this formulation is considered as follows.
As demonstrated in Figure 5, considering face  and the opposite point  of an tetrahedron cell   , the stiffness matrix is expressed as  The complete stiffness matrix with respect to corner  can be obtained by accumulating the stiffness matrix formed from every face containing point .Thus, all inner points ( inner ) can produce 12 inner × 12 inner matrix, but there are only 3 inner unknown variables that need to be solved, which indicates that the 12 inner row of the matrix is not linearly independent.In fact, the first three rows of the stiffness matrix with respect to node  and cell   , denoted as  , respectively, when the same cell   and , ,  node are considered.
Algorithm 2 lists the pseudocode for 3DPCMA algorithm that is capable of assembling the final stiffness matrix in parallel.Much like 2DPCMA algorithm, the notion of 3DPCMA algorithm is to split the task of filling the CSR format matrix into each subpart of three rows and distribute the subtask to independent GPU thread.Nevertheless, unlike 2DPCMA algorithm, to complete filling the values of the neighbors of node  may require more than one accumulation while the same work is finished in one loop process (lines ( 17)-( 20) in Algorithm 1) in 2DPCMA.This is to increase cache space utilization rate because a lot of variables such as X  , Y  , Z  need to read when an element   that contains  starts to be visited (line (4) in Algorithm 2) and these variables could be reused to compute as many related data as possible.However, this causes a problem that the position of each neighbor of node  which is already operated could be visited again in the next loop.So extra data structure is required to be constructed to record the relative address of each neighbor with respect to node  for possible revisiting.Moreover, more complicated than 2DPCMA, 3DPCMA procedure requires the adjacent data to provide the information with which three points along with node  make up a tetrahedron because the linear arrangement of the neighbors, unlike the strategy illustrated in Figure 2, can not embrace the configuration of a tetrahedron implicitly.
It should be noted that a temporary matrix-variable  is declared in Algorithm 2 to store the 3 × 3 submatrix of Input: (1) Unstructured grid metrics (UGM), including the coordinates of the mesh nodes (X, Y) and the indices of the nodes that form the elements; adjacent data structure of grid (2) Row array, _, which stores the starting positions of rows.
(3) Inner node identifier  and the GPU thread identifier .Output: (1) Column array, _, which stores the positions of the no-zero values in each row.
(2) Value array, V_, which stores the no-zero values in each row.
(3) Right hand vector of the linear system of equations, b.Procedure: Thread  is responsible for the following tasks (1) initialize vector b that  is to calculate: b similar steps are performed like previous ( 9)-( 13) lines for node  (18) End For Algorithm 2: Three-dimensional parallel coefficient matrix assembly (3DPCMA) algorithm.

𝐾
,  .In fact, the temporary variable  can be substituted by a single variable to store each value of  ,  and be used to update the corresponding position of V_ in turn because the value in each position of  ,  can be explicitly expressed and calculated by  ,  .The implementation that contains less temporary variables is preferable as the registers are costful hardware resource of GPU.However, this lowlevel consideration, termed as register usage optimization, is less popular but very helpful for efficiency improvement in directive-based OpenACC implementation.

Unsteady Solver Parallelized by OpenACC
In this section, the proposed algorithms are integrated into an unstructured, vertex-centered finite volume method based numerical solver to form a full GPU version of an ALE solver that is able to simulate unsteady flows with deforming mesh.By demonstrating Figure 6, the vertex-centered finite volume Ω  around a node  on two-dimensional triangle meshes, the basic numerical schemes employed in this solver, are structured as follows: (i) Roe scheme, in which the convective fluxes are computed at a face of the control volume from the left and right state by solving the Rieman problem, is employed for convective flux discretization.The mathematical formulation for Roe scheme is expressed as where the right and left state, W  , W  , are constructed by using Venkatakrishnan limiter [44] and the gradients of convective variables.(ii) The central scheme which evaluates the flow quantities at a face of the control volume by averaging the quantities of left and right volumes is applied to discretize both ALE flux F ALE  and viscous flux F V in (1).(iii) The implicit schemes of dual time-stepping based on the second-order time [42] accuracy is employed to discrete (1) in time.By introducing a pseudo-time variable  * and unsteady residual variable  * , the dual time-stepping scheme can be formulated as

Compute inviscid & viscous fluxes
where (iv) The Spalart-Allmaras (SA) one-equation turbulence model [45] is adopted.It can be written in integral form as The convective flux and viscous flux are given by  , = ] and  V, = ⃗  ⋅ ⃗  with  being the contravariant velocity, ⃗  being the outward facing unit normal vector of the control surface, and ⃗  being the normal viscous stresses.The source term   is introduced in [45].The first-order upwind scheme and central scheme are adopted to discretize the convective and viscous flux and the implicit time scheme [42] is employed for advancing SA equation in time.
The detailed steps of the solver are demonstrated as Figure 7 in which the procedures of PCMA algorithm, P-Bicgstabsolver, updates nodes info.are to be executed between each two physical steps.Listing 2 gives an overview of the Ope-nACC implementation of the unsteady solver.It starts with alloc_mem_init subroutine to allocate and initialize variables on the host.Then the variables which are used as intermediate variables for computation are explicitly created on GPU memory by using #pragma acc data create clause.And the variables which are created and initialized by alloc_mem_init are copied from CPU to GPU by using copyin keyword.The previous two steps of data arrangement operation are followed by a pair of curly braces where the main part of GPU computation is placed.This data transfer consideration avoids extra data exchange between the host and the device during the whole computational process.By considering (1), (10), GCL law (3), and mesh deformation equations, there are 8 unknown variables (, , V, , , , ) to be solved in (1) and (10) at each pseudo-time step and 4 unknown variables (GCL volume, movements of each node in , ,  directions) between each two physical time steps, where  denotes the number of mesh nodes.
In the main computation section, each step is encapsulated as a subroutine that accepts necessary parameters.These functions have no difference with the serial implementation.At this point in the OpenACC implementation, it requires no effort except two directive-based data declarations out of the curly braces.Even in each function, the modification of the CPU code is concentrated only in for loop regions and data segments where read-write conflicts exist.Take Set_Time_Step function, for example (shown in Listing 2), #pragma acc parallel loop clause, is declared to make the for loop parallel and this clause is followed by present keyword to declare that the global arrays used in the following for loop are already present on GPU memory.The variables which are not mentioned in the data statement are treated as private variables during the loop.Edge-based data structure for parallel implementation will encounter read-write conflict inevitably because the quantities of a common node shared

Experiments
The following test cases were conducted on a heterogeneous platform YUAN at supercomputing center of Chinese Academy of Sciences.YUAN consists of 270 blades nodes, 30 GPU nodes, and 40 MIC (Intel Many Integrated Core Architecture) nodes.Each GPU node has two Intel E5-2658 V2 (Ivy Bridge, 2.8 GHz) CPUs and two Nvidia Tesla K20 GPU cards.The configuration parameters of K20 GPU are list in Table 1.

Performance of Parallel Mesh Deformation.
In this part, both triangular and tetrahedral meshes with variety of computational scales are available to capture the performance of the parallel deforming mesh algorithm on the GPU device.The evaluation of performance is speedup which is defined as the ratio of wall clock times running on Intel E5-2658 V2 CPUs and K20 Card.In order to make fair comparison, the serial version of deforming mesh code is compiled by using GCC compiler with highest optimization option -O3.Table 2 lists two 2D and two 3D mesh configurations performed in this test.Table 3 gives the CPU times, GPU times for each part of mesh deforming procedure, and the speedup performances.The fourth column in Table 3 shows the speedup for matrix assembly running on K20 GPU card against Intel E5-2658 CPUs.It is shown that higher speedup can be achieved on the first two 2D meshes compared to that on the following two 3D meshes.This can be explained by the fact that the workload each GPU thread undertakes varies owing to neighbor sweeping procedure in Algorithms CTMA: CPU time of matrix assembly; GTMA: GPU time of matrix assembly; SPMA: speedup for matrix assembly; CTB: CPU time of p-bicgstab solver; GTB: GPU time of p-bicgstab solver; SPB: speedup for p-bicgstab solver; OSP: overall speedup.
1 and 2 and a 3D tetrahedral node has a larger disparity in the number of neighbors than that of a 2D triangle node, which leads to more load imbalance of workload on GPU threads for 3D meshes.The fourth column data also indicates that larger mesh size results in higher speedup for the same mesh type.This is to be expected, because larger mesh size is more conductive to make best use of the GPU resources and this trend can be sustained before the saturation of the available GPU resources is achieved.The fifth and sixth column of Table 3 list the CPU time and GPU time of the sparse linear system of equations obtained by PCMA algorithm, respectively.Six iterations are performed in the P-BiCGSTAB procedure and all of the four cases attain an accuracy of 1.0 × 10 −8 within these 6 iterations.In order to make the GPU implementation of P-BiCGSTAB solver more efficient, cuBLAS and cuSPARSE [46], the optimized, parallel GPU-accelerated libraries that provide efficient matrix-vector, sparse matrix-vector, and vector-vector multiplications, are introduced.However, the API functions in cuBLAS or cuSPARSE require explicit data pointers in the GPU memory while the general OpenACC model uses unified data pointers and switches on CPU and GPU platform implicitly.Fortunately, OpenACC model provides a way to invoke the third-party CUDA library via OpenACC directive declaration with #pragma acc host_data use_device (pointerA, pointerB), which returns the corresponding device pointers for the following CUDA-based API calls.cusparseDcsrilu0() was used to compute the incomplete-LU factorization as the preconditioner and cusparseDcsrmv() was called to conduct SpMV during the process of P-BiCGSTAB.The last column lists the overall speedups by combining the wall clock times of the previous two subprocedures.And about 8.78x∼14.56xspeedup can be obtained for both 2D and 3D meshes.
Figures 8 and 9 show the speedup trend of parallel deforming mesh algorithm for 2D mesh and 3D mesh, respectively, with the size of mesh nodes increases.Nine meshes whose numbers of mesh nodes were within [2.7 × 10 5 , 56.7 × 10 5 ] interval were generated by Triangle [47] for 2D test.And seven meshes whose numbers of mesh nodes were within interval [3.2 × 10 5 , 41.2 × 10 5 ] were generated by TetGen [48] for 3D PDMA test.An average of 10 iterations were performed for P-BiCGSTAB process.The results indicate that larger mesh size produces higher speedup.That is expected, because larger meshes make better use of the available GPU resources, such as GPU memory, stream processors, and registers.However, the increment of speedup is not proportional to the increment of mesh size as the mesh size increases; this can be explained by the fact that the available resources of GPU are gradually saturated as the processing data of the mesh increases.Figures 8 and  9 demonstrate that the speedups of both 2D and 3D case become stable at around 30x.
The performance of the present algorithms was also investigated by OpenMP directives which is a popular shared-memory programming mode.The 2D test mesh was configured by 5672268 nodes and 11114646 triangles, and the 3D test mesh consisted of 4127686 nodes and 24023132 tetrahedrons.As the test node has two Intel E5-2658 V2 CPUs  with each having 10 cores, total 20 threads can be launched.Table 4 shows the comparisons of multicore CPU and GPU results on the two test cases for mesh deformation algorithm.It is shown in Table 4 that a speedup of 2.6x can be obtained on K20 GPU in comparison with 20 OpenMP threads for both of 2D and 3D cases, which demonstrates the computing power of GPU through light-weighted threads.

Simulation Cases.
The first simulating case used by the presented solver is about the unsteady Euler solutions induced by the NACA0012 airfoil.The varying of the angle of attack by pitching harmonically about the quarter chord with respect to physical time is given as where  0 = 0.016 ∘ ,   = 2.51 ∘ are the initial and amplitude of angle of attack, respectively,  = 0.0814 denotes the reduced frequency based on semichord, and  represents the physical time.The initial Mach number is  = 0.755.Figures 10 and 11 show the unstructured mesh that contains 94085 nodes and 186306 triangles in a far and close view, respectively.Steady state was first calculated at  0 = 0.016 ∘ and the obtained flow is used as the initial flow field for the following unsteady simulation.Figure 12 shows the Mach contours around the NACA0012 airfoil at the steady state.A physical time step of △ = 0.2 was employed for the unsteady simulation.In the pseudo-time process, a    To obtain stable unsteady solutions, three cycles of periodic motion were computed and comparisons were conducted between the GPU results, experiment data, and Batina's results stated in [49], which are shown in Figures 13  and 14.It is noted that the calculated results show a slight deviation when comparing with experiment data but agree well with Batina's work.The validation through comparing  instantaneous pressure coefficient with experiment data has also been conducted at a specified angle of attack, which is shown in Figure 15 for  = 2.34 ∘ (.. = 69 ∘ ).Table 5 lists the wall clock times executing on CPU and GPU for five primary procedures of the unsteady solver.It is shown that solving linear system equations, inviscid residual computing, gradients, and limiter calculations are the most time-consuming part in the solver.In Implicit_Time_Advance procedure, 3 iterations in BiCGSTAB scheme make the iteration convergence to the given solver_Error.And All three parts can be obtained over 10x speedup on GPU with respect to Intel E5-2658 V2 CPU core.According to statistics, advancing one pseudo-step within a physical step requires 1.8876 seconds on CPU and 0.1620 seconds on GPU for this case, and conducting a deforming mesh step requires 1.2886 on CPU and 0.1208 on GPU, in which the numerical solver and mesh deforming parts are both accelerated.
The second example for validating the unsteady computing with deforming mesh on GPU involves the transonic simulation induced by the ONERA M6 wing.The initial conditions are configured as Mach = 0.84, Re = 21.66 × 10 6 , the motion of wing on different spans is expressed as (, ) =  0 +   (/) sin(), where  0 = 0 ∘ and   = 10 ∘ denote the average and the amplitude of angle of attack for this periodic movement,  is the wing span and  represents the specified station along the span direction, and  = 0.27 is the reduced frequency based on mean chord.The hybrid mesh used for this case contains 450634 nodes, 961146 tetrahedrons, and 548688 prisms.Figure 16 shows the hybrid mesh at the symmetrical plane.Steady calculations were performed at  0 = 0 ∘ at first to provide initial solution of the flow for subsequent unsteady simulations.And the steady state Mach contours at the symmetrical plane are shown in Figure 17.In order to deform the hybrid mesh, Liu's method [20] was employed by providing a coarse tetrahedral mesh as a background grid.The reason why tetrahedral mesh is considered as the background mesh is that it is more robust in the treatment of complex geometries than that of Delaunay graph mesh.However, it requires more CPU time to finish mesh deforming at each step.The present parallel algorithm for deforming tetrahedral mesh was first applied in Liu's scheme and then the interpolation process [20] was conducted to deform the foreground mesh.In this case, the background mesh provided contains 108396 nodes and 582752 tetrahedrons.△ = 0.12 was selected for the global physical time step and 5 periods of simulation were computed to lead to stable periodic results.Figures 18 and 19 demonstrate the instantaneous pressure distributing on 20% wing span along  direction in a quarter and half of the last unsteady cycle, respectively.It took about 0.31 seconds to deform the auxiliary tetrahedral mesh in the background and 0.05 seconds to interpolate the movement of the background mesh into the computational mesh through DGM interpolation process   [20].And 0.19 seconds was required to update the quantities that are associated with the computational mesh on K20 GPU card.Figure 20 gives the CPU and GPU time comparisons for each part of the solver.It is shown in Figure 20 that the calculation of gradients and limiters, the computing of upwind residual, and the advancement of one-step flow solution are the three most time-consuming parts in advancing ALE (1) equations.However, the process of implicit time stepping, which is the most time-consuming procedure, does not achieve the highest speedup among the three procedures.
That is mainly because an unsymmetrical linear system of equations which contains a lot of memory intensive operations such as SpMV (Sparse Matrix Multiply Vector), cross product of vectors, is required to be solved and accelerating the operations of this type on GPU is generally not as easy as that of compute intensive type.Another factor that affects the parallel efficiency of implicit time advance is that the GPU resources are not saturated due to relatively small-scale of the equations and there is not enough computational workload to switch during memory access.Memory intensive operations also account for the relatively low speedup obtained by 16 OpenMP threads on the procedure of implicit time advance.Another reason is that the OpenMP directives for solving equations were implemented within the outer loop and the frequent creating, switching, and destroying of OpenMP threads increased the extra overhead.Overall, higher speedup was obtained on GPU platform in comparison with multicore CPU environment.This result also demonstrates that GPU has great abilities in accelerating fine-grained computational task by utilizing its light weighed threads.These threads can be launched and switched with extremely low overhead.Two procedures, involving gradient and limiter computation, achieved the lowest speedup on GPU; one of the reasons is that the two procedures contain more atomic addition (or subtraction) operations than SetTimeStep and SA Turbulence procedures.Actually, the procedure of upwind residual involves the maximum number of atomic operations among the procedures and these atomic operations had much impact on its acceleration.The atomic operations were eliminated by storing edge related values into global memory individually first and accumulating these values to nodes via looping the surrounding edges of these nodes.However, this strategy was not applied on the other procedures that contain atomic operations because it requires extra code revision effort and much more memory usage.The time spent on atomic operations accounts for nearly 15% of the total wall clock time of this case, which benefits from the optimized atomic operations of the K-series GPU card.And this reduces the effort on code revision and makes parallel code developing more efficient.Overall, it took about 14.8 seconds on a single CPU core, 0.9457 seconds on K20 GPU card, and 1.56955 seconds on 16 CPU cores for advancing one pseudo-step in time.About 15.6x speedup can be achieved on K20 GPU card relative to a single Intel E5-2568 V2 CPU core and about 1.66x speedup relative to 16 OpenMP threads.

Summary and Future Work
The parallel deforming mesh algorithm is present and implemented by using directive-based programming model-OpenACC.And the proposed algorithms are integrated into an OpenACC-parallelized unstructured ALE solver which can conduct unsteady simulation with deforming mesh completely on GPU.The OpenACC implementation of the unsteady solver requires minor revision of the original serial codes, which makes code migration easier and efficient.The effort is underway to extend the present unsteady solver to CPU + GPU heterogeneous computing platform by considering combining MPI and OpenACC and the new GPU-Direct communication feature will be used to reduce overhead while transferring data from one GPU to a remote GPU.

Figure 4 :
Figure 4: The process of filling data for an inner neighbor () and a boundary neighbor () of , respectively.
by looping every face that contains  while the remaining rows are the same as the first three rows of

1 Figure 11 :
Figure 11: Close view of the unstructured grids around NACA0012 airfoil.

Figure 13 :
Figure 13: Lift coefficient corresponds to instantaneous angle of attack.

Figure 14 :
Figure 14: Moment coefficient corresponds to instantaneous angle of attack.

Figure 16 :Figure 17 :
Figure 16: View of the hybrid mesh around ONERA M6 wing.

Figure 18 :Figure 19 :
Figure 18: One-quarter of last cycle: Instantaneous pressure coefficient distributing on 20% wing span along  direction.

Figure 20 :
Figure 20: CPU and GPU time comparisons for each part of the solver.
Figure 3: Fill the corresponding segments of V_, _, and , respectively, for the  th inner node.

Table 2 :
Four mesh configurations for test.

Table 3 :
CPU times, GPU times, and speedups on four test meshes.

Table 4 :
Performance comparisons of multicore CPU and GPU results on mesh deformation algorithm.

Table 5 :
Wall clock time of each procedure for the NACA0012 case.