A Computational Realization of a Semi-Lagrangian Method for Solving the Advection Equation

. A parallel implementation of a method of the semi-Lagrangian type for the advection equation on a hybrid architecture computation system is discussed. The difference scheme with variable stencil is constructed on the base of an integral equality between the neighboring time levels. The proposed approach allows one to avoid the Courant-Friedrichs-Lewy restriction on the relation between time step and mesh size. The theoretical results are confirmed by numerical experiments. Performance of a sequential algorithm and several parallel implementations with the OpenMP and CUDA technologies in the C language has been studied.


Introduction
Many physical phenomena in transport processes are modeled by time-dependent hyperbolic conservation laws [1][2][3][4].The finite volume method (FVM) is a standard conservative method to construct numerical approximations for solving hyperbolic conservation problems.Modern modifications of FVM [5][6][7][8] provide well-established conservative methods for solving the governing advection equations.Moreover, some of them were developed to treat high gradients and discontinuities of a solution [7,8].In spite of their advances for hyperbolic equations, these methods have the limitation consisting in a time step restriction, mainly for stability sake.On the other hand, during the last three decades the idea of applying the method of characteristics to advective quantities forward in time has rapidly developed and has gained popularity in many areas [9][10][11][12][13].In contrast to traditional Eulerian schemes, semi-Lagrangian algorithms provide unconditional stability and allow using large time steps.Despite unconditional stability, these methods are explicit and therefore they look well suited for parallelization.Now semi-Lagrangian methods are intensively studied and their efficiency for convection-dominated problems is proved.For a more detailed discussion about the comparison of traditional Eulerian and semi-Lagrangian schemes for hyperbolic conservation laws, see [6,14,15].
Initially semi-Lagrangian algorithms, as methods of characteristics, were developed with application in climate prediction [16][17][18][19][20][21].The simplest schemes use the approximation of a trajectory (or curvilinear characteristic) by a straight line and employ a low-order interpolation to compute a numerical solution.Nowadays, simplicity and efficiency of these schemes make them quite popular in different fields of numerical modeling like fluid dynamics applications [9,12,22], shallow water equations [10], fiber dynamics described by the Fokker-Planck equation [11], heatconduction equation [23], and so forth.Now modern semi-Lagrangian algorithms involve a higher-order approximation of a curvilinear characteristic and employ a higher-order interpolation; see, for example, [22].Recently, considerable efforts have been made to construct the conservative semi-Lagrangian methods [9,20,[24][25][26][27][28].For instance, Scroggs and Semazzi [9] presented a semi-Lagrangian finite volume
For the unknown function (, , ) the following initial and boundary conditions are specified:

Numerical Scheme
Subdivide the time segment [0, ] into  time levels   = ,  = 0, . . ., , with the time step  = /.Let Ω be a closed quadrangle at the time level   .For each of its points on the segment  ∈ [ −1 ,   ] we construct the characteristics defined by the system of ordinary differential equations with the initial value at level  =   as a parameter: With the help of these characteristics the edges of Ω generate four surfaces   ,  = 1, . . ., 4, with the edges   at  =  −1 (Figure 1).
If Ω is located near the inflow boundary  = 0, surfaces   can cross the plane  = 0.In this case we get an additional curvilinear polygon  on the plane  = 0 (Figure 2).
Generally speaking,  and  can be triangular, pentagonal, or empty domains.If one of them is empty, then the integral over an empty domain is supposed to be equal to zero.Since there is no fundamental difference, we consider only the most common case with quadrangular domains.For Ω, , and  the following statement is valid.Then Here (  ,   ,   ) is the outer normal to Γ and sign "⋅" means the scalar product.The normal (  ,   ,   ) equals (1, 0, 0) on Ω, (−1, 0, 0) on , and (0, −1, 0) on .For any   the normal (  ,   ,   ) is orthogonal to all tangent directions of   including the tangent of characteristics Therefore (1, , V) ⋅ (  ,   ,   )  = 0. Taking into account this reasoning in (10), we get the equality that is equivalent to the statement of theorem.Now construct the uniform mesh  ℎ with mesh-size ℎ = 1/,  ≥ 2:  ℎ = {(  ,   ) :   = ℎ,   = ℎ; ,  = 0, . . ., } .(12) We will find an approximate solution  ℎ (, , ) at each time level  =   ∀ = 0, . . .,  as a grid function with values ρ  , =  ℎ (  ,   ,   ) ∀,  = 0, . . ., , unlike the values of the exact solution To construct the difference scheme with a variable stencil, we suppose that the function  ℎ at time level  −1 is already known and we need to find it at level   .To compute ρ  , for some ,  = 1, 2, . . .,  − 1, we take the square Ω , with four vertices (  ±ℎ/2,   ±ℎ/2) and apply Theorem 1.To determine ρ  , on the boundary of  we use the rectangles Ω , which are adjoined to this boundary inside  (Figure 3).
Without loss of generality we describe the construction of the difference equations for inner nodes with ,  = 1, 2, . . .,  − 1 only.Thus, due to Theorem 1 we get Here the curvilinear polygons  −1 , and  −1 , are formed by the characteristics (6) that issue out of the edges of square Ω , .To compute the second integrals in (15) numerically, first we replace the exact function ( −1 , , ) by its bilinear interpolant with the help of the basis functions To compute the integral over the domain  −1 , , we also use the bilinear interpolant in (  ,   )  (  , 0,   )  , (, ) (18) with the basis functions otherwise, ∀ = 0, . . ., ,  = 0, . . ., .
The left-hand side of ( 15) is approximated by the midpoint quadrature rule with second-order accuracy: So, instead of the exact equality (15) we get the approximate one: (, 0, ) . ( To simplify the numerical computation of the right-hand side in (21), we approximate the domains  −1 , and  −1 , by simpler ones.Since in the more general case both domains are curvilinear quadrangles, we demonstrate the approximation only for quadrangular  −1 , .Introduce four additional points (  ± ℎ/2,   ) and (  ,   ± ℎ/2) on the square Ω , at time level   and denote each of the eight nodes by   = (x  , ŷ ),  = 1, . . ., 8. Out of each   we pass the corresponding characteristic to the time level  −1 which gives the point   = (  ,   ) (Figure 4).
To evaluate the order of convergence, we use the discrete analogue of the  1 ()-norm: For the numerical solution computed by ( 26) the following theorem is valid.
Theorem 3. Let the solution (, , ) of the problem (1)-( 5) be sufficiently smooth and let the discrete solution  ℎ be computed by (26).Assume that  = cℎ.Then we have the following estimate: with the constants  1 and  2 independent of , ℎ, , and c.
Proof.Use the induction on .For  = 0 inequality ( 29) is valid because of the exact initial condition (4).Suppose the estimate ( 29) is valid for some  − 1 ≥ 0 and prove it for .

Corollary 4.
Let the conditions of Theorem 3 be valid.Then for   =  one has the estimate Remark 5. Let the functions  init (, ) and  in (, ) be greater than zero in the initial and boundary conditions (4)- (5).Then the interpolants ρ ℎ ( 0 , , ) and ()   (, 0, ) due to (3) are nonnegative.The integration of them results in nonnegative values in (26).Thus, by induction we can prove the inequality  ℎ (  ,   ,   ) ≥ 0 ∀ = 1, . . ., , ,  = 0, . . ., . (43) Remark 6.The strategy of the domain approximation with 8 nodes is not optimal, of course.In fact it is enough to use 4 nodes for rectangles.But a theoretical justification becomes much more complicated.We did not demonstrate it here since the main purpose of the paper is to verify parallel properties of the proposed algorithm.Of course, such an optimization reduces the number of arithmetical operations but has no influence on the parallel properties of the algorithm.The same applies to the difference schemes of higher order.

The Numerical Algorithm and Its Parallel Implementations
The constructed algorithm is implemented in the following way.
(2.3) Compute according to (26) where the integrals are calculated over each nonempty intersection The end of the space loop.
The end of the time loop.
The algorithm is explicit with respect to time, since to calculate  ℎ (  , , ) at each time level   the data are used only from the previous time level  −1 .
In the shared memory case for the OpenMP-technology it is sufficient to parallelize the general space loop at each time level using an OpenMP directive like the following one: #pragma omp parallel for collapse (2) . . .In order for paralleling to be correct, the data-sharing attributes of all variables to intermediate outcomes of items (2.1)-(2.4)have to be private for each thread.
Another justified approach to paralleling the algorithm is the usage of the NVIDIA CUDA technology for GPU.
The main aspects of the parallel implementation related to various features of general-purpose GPU programming are briefly discussed below.
All functions used in the numerical calculations on a CPU must be recompiled for a GPU.If we use NVIDIA CUDA these functions must be declared with the special qualifier host device which indicates that the NVCC compiler creates two versions of its executable code for a CPU (host) and for a GPU (device) separately.GPU will call a device version of a function while CPU will call its host version.
The principles of efficient CUDA programming are as follows: (1) the maximal use of inherent parallelism of the problem and (2) the optimization of memory access.
The first version of our parallel CUDA-algorithm is based on the inherent parallelism of our numerical explicit approach.Every thread treats only one cell Ω , ∈  ℎ ; hence, the space loop body (items (2.1)-(2.4) of Algorithm 7) is the general computation kernel.
While programming for GPU, the correct defining of the kernel configuration is important.The kernel configuration includes two parameters, namely, the number of blocks (blockCount) in a grid and the number of threads (blockSize) per block.There is a limit in 1024 threads per block for our GPU NVIDIA hardware.In the first CUDA-algorithm no threads amount optimization is used.
Consequently, the simplest parallel version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.
(2) Allocate host (CPU) and device (GPU) memory; copy initial data from host to device.
( The end of the time loop. (4) Copy the results from device to host.
In order to decrease the execution time of Algorithm 8, items (3.5)-(3.6)must be performed as rare as possible, for instance, only at time levels where accuracy control, data for drawing, and so forth are needed.
Algorithm 8 has two general disadvantages: (1) small speedup in comparison to the sequential version (Figure 8) and ( 2) impossibility of execution with a fine computational mesh (Table 2).What is the matter with these problems?
First, the general loop has a lot of selection statements.The main selection is between two different ways of catching in item (3.1.1) of Algorithm 8 (to be more exact, in the item (2.3) of sequential Algorithm 7).The cells whose trajectories intersect the boundary and the internal cells are processed in different ways.We can use two different kernels which execute in parallel.
We use data parallelism only in Algorithm 8.However, there is yet another class of parallelism to be exploited on NVIDIA GPU.This parallelism is similar to the task parallelism that is found in multithreaded CPU applications.NVIDIA CUDA task parallelism is based on CUDA streams.A CUDA stream represents a queue of GPU operations such as kernel launches, memory copies, and event starts and stops.The order in which operations are added to the stream specifies the order in which they will be executed.Each stream may be considered as a certain task on the GPU, and there are opportunities for these tasks to execute in parallel [31].Thus we apply CUDA streams to our two kernels to improve parallelism and total GPU utilization.
There are many selections in item (2.3) for determining mutual arrangement of  −1 , and cells of the mesh of the previous time level.Unfortunately, we cannot avoid these selections.
Secondly, the CUDA kernel in (3.3) of Algorithm 8 idles in the context of computation.We apply loop unrolling in order to eliminate this kernel.
Thirdly, the basic CUDA kernel of Algorithm 8 is not optimal for memory access.To optimize concurrent read access global memory of simultaneously running threads, constant memory is preferable to use.Applying this approach in our program we allocate all invariable values in constant memory GPU.
Moreover, for the optimization of parameters of kernels launch we use the CUDA Occupancy Calculator.It calculates optimal streaming multiprocessor (SM) utilization taking into account GPU compute capability, CUDA device properties, the number of blocks in a grid, the number of threads per block, the size of the shared-memory space, and the number of registers per thread.
Consequently, the second CUDA-version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.
(2) Allocate host (CPU) and device (GPU) memory, copy initial data from host to device, and copy constant data from host to constant memory of device.(5) Copy results from device to host.

Numerical Experiments
Specify the velocities and take the initial and boundary conditions in the following form: Numerical experiments were performed with the ICM SB RAS FLAGMAN computation system of the following configuration.One of the purposes for the numerical experiments was to check the convergence order in  and ℎ.Therefore the computations were performed on the sequence of  ×  regular square grids,  = 10 ⋅ 2  ,  = 0, . . ., 6.The number of time steps is defined by  = ℎ/5.
Assume that {  ℎ } 6 =0 is the set of solutions found on the sequence of square grids.The expression log as a function in  can be considered as the order of convergence (Figure 6).The corresponding exact values of   , were computed by the characteristics method directly.In Figure 6 we can see the first order of convergence in ℎ.
In our sequential and OpenMP computational experiments we compare the GCC and Intel C++ compilers.The execution time for the better Intel compiling code is on average by 15% less than the better GCC one.All presented numerical results were compiled with −O2 optimization level.Let us remark that we try to use other compiler options (−O3, −parallel, −AVX), but the performance increases only slightly or sometimes even decreases.For the CUDA-version we used the NVCC compiler.
The results of computation speedup of the OpenMPversion are presented in Table 1 and in Figure 8.The first line of the table shows speedup (or rather slowing down) of one thread that executed the OpenMP-code as compared with the sequential code.Generally, the compiler optimization under the OpenMP library and overhead related to the synchronization of OpenMP-threads can be estimated for these data.As we can see in our case, overhead is small.
One of the purposes of the studies is to assess influence of the HyperThreading (HT) technology on the parallelism.As long as only 12 physical cores are available on the CPU, we have access to 24 logical cores when HT is enabled and to 12 logical cores when HT is disabled, respectively.Experiments show that when HT is enabled the execution time with 24 threads is about 14% less than that with 12 threads when HT is disabled (Figure 7).As our code is compute-intensive, probably, the advantage of HT may be related to optimization of memory access.The execution time of the sequential, OpenMP, and two versions of CUDA codes is given in Table 2.The ≪ * ≫ symbol signifies that the result has not been reached in reasonable time; for the CUDA-versions the ≪ * * ≫ symbol means that the kernel launch failed due to registers bottleneck.For OpenMP the results on 12 threads when HT is disabled and on 24 threads when HT is enabled are demonstrated.Comparative information on a possibility of code optimization by the GCC compiler is also presented in Table 2.The execution time of the code compiled without optimization and with −O2 and −O3 optimization levels, respectively, was measured.It is clear from Table 2 that compiler optimization considerably (more than two times) reduces the execution time of the sequential code.As the compiler does not optimize the CUDA-code, the execution time of the CUDA-versions does not depend on compiling options.
Table 3 and Figure 8 present the data on computation speedup of the best OpenMP-version and two CUDAversions in comparison with the sequential program compiled with −O2.Speedup of the second CUDA-version in comparison with the OpenMP-version is given as well.Numerical experiments show that on fine grids in comparison with the sequential program the best Open-MP program with 24 threads gives more than 12 times speedup and the second CUDA-version shows about 16 times speedup.

Discussion
Nowadays there are a lot of algorithms of the family of semi-Lagrangian methods.As we mentioned above, the presented method is based on square grid only and uses accessory algorithms that makes computation more resource-intensive.Nevertheless, such a complication allows us to prove firstorder convergence.Furthermore, the theorem which allows one to take into a volume of substance passed through a boundary is presented.This enables us to prove the balance equation.Numerical experiments completely confirm the theoretical convergence results.
As for parallel implementation of our approach, we can note the following.
Though the algorithm is explicit we are not satisfied with the results of the CUDA-versions.
First of all, there is a problem with feasibility of CUDAcode on fine grids (more than 640 × 640 nodes).Profilefeedback analysis shows at least two causes: (1) assumed computational domain decomposition and (2) the problem of hardware constraint on the amount of registers on streaming multiprocessors.
Concerning the first item, it should be noted that in our approach a 2D computational domain is mapped to a 1D array of cells, and every thread treats one cell (e.g., see Pseudocode 1).
For the parameters of kernels launch to be readily adaptable, we can apply "thread reuse"; namely, every thread treats some uncoupled cells (see Pseudocode 2).
Besides, we can use several video adapters.In this case we should resolve the problem of the distribution of computational cells between devices under the following restriction: we do not know in advance how many cells of the previous time level are required to calculate an actual value (e.g., varying-width shadow line).
The problem with registers in our case is related to deep nesting level of functions calls in the item (2.3) for determining the mutual arrangement of  −1 , and cells of a mesh of the previous time level.Indeed, this is a bottleneck of our sequential algorithm.To resolve this issue, we should modify the initial sequential algorithm, unfortunately.
In addition, the item (2.3) has many flow control instructions ("if " statements, mainly), but we cannot say that this branching creates some significant performance penalty in terms of SIMT architecture.Indeed, in a CUDA kernel, any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge [32].If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp.When all the different execution paths have completed, the threads converge back to the same execution path.
Fortunately, this rule works in the cases where the control flow depends on the thread ID only.However, in our case we have many branches which do not depend on thread ID.Inside our computational kernel, the main part of branches looks like Pseudocode 3.
As we can see, our "IF" statement does not depend on a thread ID.Thus, this type of branch does not affect the performance of the CUDA kernel.

Figure 5 :
Figure 5: Approximation of nodes and edges.

of Algorithm 7 ( 3 . 6 )
in parallel for each internal cell Ω , .Synch point: wait for calculations of both kernels to be completed.(3.7)If necessary, copy results from device to host.(3.8)If necessary, calculate the norms of a solution, an error, and other statistic data with respect to an actual time step.The end of the time loop.(4) If  is odd then repeat items (3.1)-(3.2).

Figure 6 :
Figure 6: The order of convergence.

Figure 7 :
Figure 7: Execution time of OpenMP-code (the main vertical axis) and speedup in comparison with the sequential code (additional vertical axis).The comparison of results when HyperThreading is switched On or Off.

Table 2 :
Execution time of all versions of program.

Table 3 :
Speedup of parallel versions.Number of mesh points in one space dimension (number of the time steps)