A parallel implementation of a semi-Lagrangian method for the advection equation on a hybrid architecture computation system is discussed. The difference scheme with a variable stencil is constructed on the basis of an integral equality between neighboring time levels. The proposed approach allows one to avoid the Courant-Friedrichs-Lewy restriction on the relation between the time step and the mesh size. The theoretical results are confirmed by numerical experiments. The performance of a sequential algorithm and of several parallel implementations with the OpenMP and CUDA technologies in the C language has been studied.
1. Introduction
Many physical phenomena in transport processes are modeled by time-dependent hyperbolic conservation laws [1–4]. The finite volume method (FVM) is a standard conservative method to construct numerical approximations for solving hyperbolic conservation problems. Modern modifications of FVM [5–8] provide well-established conservative methods for solving the governing advection equations. Moreover, some of them were developed to treat high gradients and discontinuities of a solution [7, 8]. In spite of their advantages for hyperbolic equations, these methods suffer from a time step restriction, imposed mainly for the sake of stability. On the other hand, during the last three decades the idea of applying the method of characteristics to advect quantities forward in time has rapidly developed and has gained popularity in many areas [9–13]. In contrast to traditional Eulerian schemes, semi-Lagrangian algorithms provide unconditional stability and allow using large time steps. Despite their unconditional stability, these methods are explicit and therefore well suited for parallelization. Semi-Lagrangian methods are now intensively studied, and their efficiency for convection-dominated problems has been demonstrated. For a more detailed comparison of traditional Eulerian and semi-Lagrangian schemes for hyperbolic conservation laws, see [6, 14, 15].
Initially semi-Lagrangian algorithms, as methods of characteristics, were developed for applications in climate prediction [16–21]. The simplest schemes approximate a trajectory (or curvilinear characteristic) by a straight line and employ a low-order interpolation to compute a numerical solution. Nowadays, the simplicity and efficiency of these schemes make them quite popular in different fields of numerical modeling such as fluid dynamics applications [9, 12, 22], shallow water equations [10], fiber dynamics described by the Fokker-Planck equation [11], the heat-conduction equation [23], and so forth. Modern semi-Lagrangian algorithms involve a higher-order approximation of a curvilinear characteristic and employ a higher-order interpolation; see, for example, [22]. Recently, considerable efforts have been made to construct conservative semi-Lagrangian methods [9, 20, 24–28]. For instance, Scroggs and Semazzi [9] presented a semi-Lagrangian finite volume method that used a rectangular grid for a system of conservation laws and satisfied the discrete conservation relation, but the numerical results demonstrate some violation of full conservation. Early modifications of the semi-Lagrangian approach use a rectangular grid which is fixed throughout the simulation [9, 17, 24, 25]. Semi-Lagrangian schemes allow the spatial grids at different time levels to be chosen independently of one another. As a result, adaptive grids are widely used in modern versions of this approach [20, 28]. In spite of the progress in semi-Lagrangian methods, for most of them [9–12, 20, 24, 28] convergence has not been theoretically proved.
In this paper, we present a sketch of the theoretical proof and a numerical justification for the difference scheme of the semi-Lagrangian family. We start with the theorem about an exact equality that involves two spatial integrals over domains at neighboring time levels and the third integral over an inflow boundary. To prove convergence theoretically, we use a square grid, the bilinear interpolation, and the Runge-Kutta method for the fourth-order approximation of characteristics. This allows us to prove first-order convergence in a discrete analogue of the L1-norm. The theoretical convergence estimates are confirmed by numerical results.
In the remaining part of the paper, several parallel implementations of this method are studied. We discuss the design subtleties of the parallel implementations of the algorithm with the OpenMP technology for shared-memory computational systems and with the CUDA technology for general-purpose GPU programming. In addition, the influence of the HyperThreading technology on the performance of our OpenMP code is studied. Moreover, the implementation difficulties and the performance on hybrid architecture computation systems are discussed for our CUDA codes.
2. Formulation of the Problem
Let D=[0,1]×[0,1]. In the closed domain [0,T]×D consider the two-dimensional advection equation
(1) ∂ρ/∂t + ∂(uρ)/∂x + ∂(vρ)/∂y = 0,
where u(t,x,y) and v(t,x,y) are known and sufficiently smooth in [0,T]×D. We suppose for simplicity that for all t∈[0,T] the coefficients satisfy the no-slip conditions at the upper and lower sides of D:
(2) u(t,x,y)|y=0 = u(t,x,y)|y=1 = 0, v(t,x,y)|y=0 = v(t,x,y)|y=1 = 0
and the flow conditions at the left and right sides of D:
(3) u(t,x,y)|x=0 ≥ 0, u(t,x,y)|x=1 ≥ 0.
For the unknown function ρ(t,x,y) the following initial and boundary conditions are specified:
(4) ρ(0,x,y) = ρinit(x,y) ∀(x,y)∈D,
(5) ρ(t,0,y) = ρin(t,y) ∀(t,y)∈[0,T]×[0,1].
3. Numerical Scheme
Subdivide the time segment [0,T] into K time levels tk=kτ, k=0,…,K, with the time step τ=T/K. Let Ω be a closed quadrangle at the time level tk. For each of its points on the segment t∈[tk-1,tk] we construct the characteristics defined by the system of ordinary differential equations
(6) x-′(t) = u(t, x-(t), y-(t)), y-′(t) = v(t, x-(t), y-(t)), t∈[tk-1,tk],
with the initial value at level t=tk as a parameter:
(7)x-(tk)=x0,y-(tk)=y0,(x0,y0)∈Ω.
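Numerically, a characteristic of (6)-(7) is traced backward in time from tk to tk-1. The following C sketch performs one step of the classical fourth-order Runge-Kutta method with a negative step; the velocity field here is a hypothetical placeholder satisfying the no-slip conditions (2), not the field used in the paper's experiments.

```c
#include <math.h>

/* Hypothetical smooth velocity field satisfying the no-slip conditions (2);
   a placeholder for illustration only. */
static double u_vel(double t, double x, double y) { (void)t; (void)x; return y * (1.0 - y); }
static double v_vel(double t, double x, double y) { (void)t; return 0.1 * x * (1.0 - x) * y * (1.0 - y); }

/* One classical 4th-order Runge-Kutta step tracing the characteristic (6)
   backward from (x0, y0) at time tk to time tk - tau. */
static void trace_back(double tk, double tau, double x0, double y0,
                       double *xb, double *yb)
{
    double h = -tau; /* negative step: integrate backward in time */
    double k1x = u_vel(tk, x0, y0);
    double k1y = v_vel(tk, x0, y0);
    double k2x = u_vel(tk + h/2, x0 + h/2*k1x, y0 + h/2*k1y);
    double k2y = v_vel(tk + h/2, x0 + h/2*k1x, y0 + h/2*k1y);
    double k3x = u_vel(tk + h/2, x0 + h/2*k2x, y0 + h/2*k2y);
    double k3y = v_vel(tk + h/2, x0 + h/2*k2x, y0 + h/2*k2y);
    double k4x = u_vel(tk + h, x0 + h*k3x, y0 + h*k3y);
    double k4y = v_vel(tk + h, x0 + h*k3x, y0 + h*k3y);
    *xb = x0 + h/6 * (k1x + 2*k2x + 2*k3x + k4x);
    *yb = y0 + h/6 * (k1y + 2*k2y + 2*k3y + k4y);
}
```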
With the help of these characteristics the edges of Ω generate four surfaces Sn, n=1,…,4, with the edges Cn at t=tk-1 (Figure 1).
Curvilinear quadrangle Q.
If Ω is located near the inflow boundary x=0, surfaces Sn can cross the plane x=0. In this case we get an additional curvilinear polygon I on the plane x=0 (Figure 2).
Appearance of the boundary quadrangle I.
Generally speaking, I and Q can be triangular, pentagonal, or empty domains. If one of them is empty, then the integral over an empty domain is supposed to be equal to zero. Since there is no fundamental difference, we consider only the most common case with quadrangular domains. For Ω, Q, and I the following statement is valid.
Theorem 1.
For a smooth solution of the problem (1)–(5) we have the equality
(8)∫Ωρ(tk,x,y)dΩ=∫Qρ(tk-1,x,y)dQ+∫I(ρu)(t,0,y)dI.
Proof.
Denote the volume bounded by Ω,Q,I, and surfaces Sn by V and its boundary by Γ. Apply the Gauss-Ostrogradsky theorem to the left-hand side of the equality
(9) ∫V (∂ρ/∂t + ∂(uρ)/∂x + ∂(vρ)/∂y) dV = 0.
Then
(10) ∫V (∂ρ/∂t + ∂(uρ)/∂x + ∂(vρ)/∂y) dV = ∫Γ (ρ,ρu,ρv)·(nt,nx,ny)T dΓ = ∫Γ ρ(1,u,v)·(nt,nx,ny)T dΓ = 0.
Here (nt,nx,ny) is the outer normal to Γ and “·” denotes the scalar product. The normal (nt,nx,ny) equals (1,0,0) on Ω, (-1,0,0) on Q, and (0,-1,0) on I. For any Sn the normal (nt,nx,ny) is orthogonal to all tangent directions of Sn, including the tangent of the characteristics (1+u2+v2)-1/2(1,u,v). Therefore (1,u,v)·(nt,nx,ny)T=0. Taking this reasoning into account in (10), we get the equality
(11)∫Ωρ(tk,x,y)dΩ-∫Qρ(tk-1,x,y)dQ-∫I(ρu)(t,0,y)dI=0
which is equivalent to the statement of the theorem.
Now construct the uniform mesh Dh with mesh-size h=1/N, N≥2:
(12)Dh={(xi,yj):xi=ih,yj=jh;i,j=0,…,N}.
We seek an approximate solution ρh(t,x,y) at each time level t=tr, r=0,…,K, as a grid function with values
(13)ρ~i,jr=ρh(tr,xi,yj)∀i,j=0,…,N,
unlike the values of the exact solution
(14)ρi,jr=ρ(tr,xi,yj).
To construct the difference scheme with a variable stencil, we suppose that the function ρh at time level tk-1 is already known and we need to find it at level tk. To compute ρ~i,jk for some i,j=1,2,…,N-1, we take the square Ωi,j with four vertices (xi±h/2,yj±h/2) and apply Theorem 1. To determine ρ~i,jk on the boundary of D we use the rectangles Ωi,j which are adjoined to this boundary inside D (Figure 3).
Boundary rectangles.
Note that ρ~0,jk=ρ0,jk are known from the boundary condition (5).
Without loss of generality we describe the construction of the difference equations for inner nodes with i,j=1,2,…,N-1 only. Thus, due to Theorem 1 we get
(15)∫Ωi,jρ(tk,x,y)dΩ=∫Qi,jk-1ρ(tk-1,x,y)dQ+∫Ii,jk-1(ρu)(t,0,y)dI.
Here the curvilinear polygons Qi,jk-1 and Ii,jk-1 are formed by the characteristics (6) that issue out of the edges of the square Ωi,j. To compute the integrals on the right-hand side of (15) numerically, we first replace the exact function ρ(tk-1,x,y) by its bilinear interpolant
(16)ρhI(tk-1,x,y)=∑q=0N∑p=0Nρp,qk-1φp,q(x,y)
with the help of the basis functions
(17) φp,q(x,y) = (1-|xp-x|/h)(1-|yq-y|/h) for (x,y)∈[xp-1,xp+1]×[yq-1,yq+1] and φp,q(x,y) = 0 otherwise, ∀p,q=0,…,N.
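For illustration, evaluating the interpolant (16) at a point reduces to the four nodes of the cell containing that point. The following C sketch assumes a row-wise array layout (our assumption, not the paper's data structure):

```c
#include <math.h>

/* Evaluate the piecewise bilinear interpolant (16) at a point (x, y) in
   [0,1]x[0,1] on a uniform (N+1)x(N+1) grid with mesh size h = 1/N.
   rho is stored row-wise: rho[q*(N+1) + p] holds the value at (p*h, q*h).
   Only the four nodes of the cell containing (x, y) contribute. */
static double bilinear(const double *rho, int N, double x, double y)
{
    double h = 1.0 / N;
    int p = (int)(x / h);      /* cell index along x */
    int q = (int)(y / h);      /* cell index along y */
    if (p > N - 1) p = N - 1;  /* clamp points lying on the upper boundaries */
    if (q > N - 1) q = N - 1;
    double sx = x / h - p;     /* local coordinates in [0, 1] */
    double sy = y / h - q;
    return rho[q*(N+1) + p]         * (1 - sx) * (1 - sy)
         + rho[q*(N+1) + p + 1]     * sx       * (1 - sy)
         + rho[(q+1)*(N+1) + p]     * (1 - sx) * sy
         + rho[(q+1)*(N+1) + p + 1] * sx       * sy;
}
```

Bilinear interpolation reproduces linear functions exactly, which gives a convenient correctness check.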
To compute the integral over the domain Ii,jk-1, we also use the bilinear interpolant
(18)(ρu)τI(t,y)=∑q=0N∑r=0Kρin(tr,yq)u(tr,0,yq)ψr,q(t,y)
with the basis functions
(19) ψr,q(t,y) = (1-|tr-t|/τ)(1-|yq-y|/h) for (t,y)∈[tr-1,tr+1]×[yq-1,yq+1] and ψr,q(t,y) = 0 otherwise, ∀r=0,…,K, q=0,…,N.
The left-hand side of (15) is approximated by the midpoint quadrature rule with second-order accuracy:
(20)∫Ωi,jρ(tk,x,y)dΩ≈mes(Ωi,j)ρi,jk∀i,j=0,1,…,N.
So, instead of the exact equality (15) we get the approximate one:
(21)mes(Ωi,j)ρi,jk≈∫Qi,jk-1ρhI(tk-1,x,y)dQ+∫Ii,jk-1(ρu)τI(t,0,y)dI.
To simplify the numerical computation of the right-hand side of (21), we approximate the domains Qi,jk-1 and Ii,jk-1 by simpler ones. Since in the most general case both domains are curvilinear quadrangles, we demonstrate the approximation only for the quadrangular Qi,jk-1. Introduce four additional points (xi±h/2,yj) and (xi,yj±h/2) on the square Ωi,j at time level tk and denote each of the eight nodes by An=(x^n,y^n), n=1,…,8. Out of each An we pass the corresponding characteristic to the time level tk-1, which gives the point Bn=(x-n,y-n) (Figure 4).
Approximation of a curvilinear quadrangle.
To compute the coordinates of the point Bn numerically, we solve the system of ordinary differential equations (6) with the initial condition
(22) x~(tk)=x^n, y~(tk)=y^n,
by the fourth-order Runge-Kutta method [29]. Thus, we find the approximation Bnh=(x~n(tk-1),y~n(tk-1)) of the point Bn. The nodes Bnh, n=1,…,8, define the polygon Pi,jk-1 which is considered as a quadrangle with four parabolic edges (Figures 4-5). The constructed domain Pi,jk-1 approximates Qi,jk-1. In the same way we construct the polygon Li,jk-1 which approximates Ii,jk-1. For the above approximation the following statement is valid [30].
Approximation of nodes and edges.
Lemma 2.
Let the coordinates of the nodes Bn, n=1,…,8, be computed within fourth-order accuracy: x-n - x~n(tk-1) = O(τ4), y-n - y~n(tk-1) = O(τ4). Assume that the ratio between τ and h is fixed: τ=c~h. Then for all i,j=0,…,N
(23) mes(Qi,jk-1∖Pi,jk-1) + mes(Pi,jk-1∖Qi,jk-1) = O(h4), mes(Ii,jk-1∖Li,jk-1) + mes(Li,jk-1∖Ii,jk-1) = O(h4),
where the notation mes(Ω) means the measure of the domain Ω.
Thus, the replacement of Qi,jk-1 by Pi,jk-1 and Ii,jk-1 by Li,jk-1, i=1,…,N, j=0,…,N, reduces the approximate equality (21) to another one:
(24)mes(Ωi,j)ρi,jk≈∫Pi,jk-1ρhI(tk-1,x,y)dP+∫Li,jk-1(ρu)τI(t,0,y)dL.
Divide it by mes(Ωi,j) and replace ρhI by the interpolant ρ~hI of the known grid function ρh at the level tk-1:
(25)ρ~hI(tk-1,x,y)=∑q=0N∑p=0Nρ~p,qk-1φp,q(x,y).
As a result, we get the equation for finding ρ~i,jk as an approximation of ρi,jk:
(26) ρ~i,jk = (1/mes(Ωi,j)) ∫Pi,jk-1 ρ~hI(tk-1,x,y)dP + (1/mes(Ωi,j)) ∫Li,jk-1 (ρu)τI(t,0,y)dL ∀i=1,…,N, j=0,1,…,N,
(27) ρ~0,jk = ρin(tk,yj) ∀j=0,…,N.
To compute the integrals numerically, we decompose the domain Pi,jk-1 (or Li,jk-1) into several triangles, each of which lies in only one cell [xp,xp+1]×[yq,yq+1], p,q=0,1,…,N-1, and has only one parabolic edge and two straight-line edges parallel to the coordinate axes. Then we replace the integral over the domain Pi,jk-1 (or Li,jk-1) by a sum of integrals over these triangles. Thus, the integrals are computed exactly, without any quadrature rule.
To evaluate the order of convergence, we use the discrete analogue of the L1(D)-norm:
(28)∥ρh∥L1h=∑i,j=0N|ρi,jh|mes(Ωi,j).
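A direct C sketch of the norm (28), assuming (as suggested by the boundary rectangles of Figure 3) that boundary cells are half-size and corner cells quarter-size; the function name is ours:

```c
#include <math.h>

/* Discrete L1 norm (28): each node carries the measure of its cell
   Omega_{i,j}; boundary cells are half-size and corner cells quarter-size,
   so the cell measures sum to 1 on the unit square (an assumption matching
   Figure 3). rho is stored row-wise, rho[j*(N+1) + i]. */
static double l1h_norm(const double *rho, int N)
{
    double h = 1.0 / N, sum = 0.0;
    for (int j = 0; j <= N; ++j) {
        for (int i = 0; i <= N; ++i) {
            double wx = (i == 0 || i == N) ? 0.5 * h : h;
            double wy = (j == 0 || j == N) ? 0.5 * h : h;
            sum += fabs(rho[j*(N+1) + i]) * wx * wy;
        }
    }
    return sum;
}
```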
For the numerical solution computed by (26) the following theorem is valid.
Theorem 3.
Let the solution ρ(t,x,y) of the problem (1)–(5) be sufficiently smooth and let the discrete solution ρh be computed by (26). Assume that τ=c~h. Then we have the following estimate: for all k=0,1,…,K,
(29) ∥ρ(tk,·)-ρh(tk,·)∥L1h ≤ k(c1h2+c2c~2h3),
with the constants c1 and c2 independent of k, h, τ, and c~.
Proof.
Use the induction on k. For k=0 inequality (29) is valid because of the exact initial condition (4). Suppose the estimate (29) is valid for some k-1≥0 and prove it for k.
From Theorem 1, for all i=1,…,N, j=0,…,N, we get
(30)∫Ωi,jρ(tk,x,y)dΩ=∫Qi,jk-1ρ(tk-1,x,y)dQ+∫Ii,jk-1ρ(t,0,y)u(t,0,y)dI.
For the first integral we have the equality
(31)ρ(tk,xi,yj)mes(Ωi,j)=∫Ωi,jρ(tk,x,y)dΩ+δi,jkmes(Ωi,j),
where
(32) |δi,jk| ≤ c~1h2 ∀i=1,…,N-1, |δi,jk| ≤ c~2h when i=N.
In the second and third integrals we replace the polygons Qi,jk-1 and Ii,jk-1 by Pi,jk-1 and Li,jk-1, respectively, according to the foregoing approximation. Then we replace ρ(tk-1,x,y) and ρ(t,0,y)u(t,0,y) by their piecewise bilinear interpolants. Due to Lemma 2 and boundedness of the functions ρ(t,x,y) and u(t,x,y), formulae (30)-(31) are represented in the following way:
(33)ρi,jkmes(Ωi,j)=∫Pi,jk-1ρhI(tk-1,x,y)dP+∫Li,jk-1(ρu)τI(t,0,y)dL+δi,jkmes(Ωi,j)+γi,jk-1+ηi,jk-1mes(Pi,jk-1)+θi,jk-1mes(Li,jk-1),
where |γi,jk-1|≤c~3h4, |ηi,jk-1|≤c~4h2, |θi,jk-1|≤c~5τh. Now multiply (26) by mes(Ωi,j) and subtract it from (33). Then we have
(34) (ρi,jk-ρ~i,jk)mes(Ωi,j) = ∫Pi,jk-1 ∑p,q=0N (ρp,qk-1-ρ~p,qk-1)φp,q(x,y)dP + δi,jkmes(Ωi,j) + γi,jk-1 + ηi,jk-1mes(Pi,jk-1) + θi,jk-1mes(Li,jk-1).
Now let us sum the absolute values of both sides of the last equation over all i=1,…,N, j=0,…,N and use the decomposition
(35)ρ~i,jk-1=ρi,jk-1+ξi,jk-1
at level tk-1 with a grid function ξk-1(x,y) satisfying the estimate
(36)∥ξk-1∥L1h≤(k-1)(c1h2+c2c~2h3),
due to the induction hypothesis. Then we get
(37) ∥ξk∥L1h ≤ ∑i,j=0N (∫Pi,jk-1 ∑p,q=0N |ξp,qk-1|φp,q(x,y)dP) + ∑i,j=0N |δi,jk|mes(Ωi,j) + ∑i,j=0N (|γi,jk-1| + |ηi,jk-1|mes(Pi,jk-1) + |θi,jk-1|mes(Li,jk-1)).
Since
(38)∑i,j=0N|δi,jk|mes(Ωi,j)≤(c~1+2c~2)h2,∑i,j=0Nmes(Pi,jk-1)≤1,∑i,j=0Nmes(Li,jk-1)=τ,
we have
(39) ∥ξk∥L1h ≤ ∑i,j=0N (∫Pi,jk-1 ∑p,q=0N |ξp,qk-1|φp,q(x,y)dP) + (c~1+2c~2+c~3+c~4)h2 + c~5hτ2.
Finally consider the transformations
(40) ∑i,j=0N (∫Pi,jk-1 ∑p,q=0N |ξp,qk-1|φp,q(x,y)dP) = ∑p,q=0N (∑i,j=0N ∫Pi,jk-1 |ξp,qk-1|φp,q(x,y)dP) ≤ ∑p,q=0N (|ξp,qk-1| ∫D φp,q(x,y)dD).
It leads to the inequality
(41)∥ξk∥L1h≤∑p,q=0N|ξp,qk-1|mes(Ωp,q)+(c~1+2c~2+c~3+c~4)h2+c~5hτ2.
Denote c1=(c~1+2c~2+c~3+c~4) and c2=c~5. Thus due to the relation τ=c~h we get inequality (29).
Corollary 4.
Let the conditions of Theorem 3 be valid. Then for tk=T one has the estimate
(42) ∥ρ(T,·)-ρh(T,·)∥L1h ≤ T(c1h/c~ + c2c~h2).
Remark 5.
Let the functions ρinit(x,y) and ρin(t,y) in the initial and boundary conditions (4)-(5) be positive. Then the interpolants ρ~hI(t0,x,y) and (ρu)τI(t,y) are, due to (3), nonnegative. Integrating them yields nonnegative values in (26). Thus, by induction we can prove the inequality
(43)ρh(tr,xi,yj)≥0∀r=1,…,K,i,j=0,…,N.
Remark 6.
The strategy of the domain approximation with 8 nodes is, of course, not optimal. In fact, 4 nodes are enough for rectangles, but then the theoretical justification becomes much more complicated. We do not demonstrate it here, since the main purpose of the paper is to study the parallel properties of the proposed algorithm. Such an optimization reduces the number of arithmetical operations but has no influence on the parallel properties of the algorithm. The same applies to difference schemes of higher order.
4. The Numerical Algorithm and Its Parallel Implementations
The constructed algorithm is implemented in the following way.
Algorithm 7 (sequential).
Set ρh(0,xi,yj)=ρinit(xi,yj),i,j=0,…,N, as the initial data (4).
Time loop: for each time step k=1,…,K do:
Space loop: for each cell Ωi,j,i=1,…,N,j=0,…,N do:
For each node An=(x^n,y^n), n=1,…,4, solve the system (6) with the initial conditions (22) and determine the corresponding vertex coordinates Bnh=(x~n(tk-1),y~n(tk-1)) of the polygon Pi,jk-1.
If a certain characteristic AnBnh intersects the plane x=0 then the coordinates of this cross-point are determined.
Compute the sum Ji,jk-1 of the integrals according to (26), where the integrals are calculated over each nonempty intersection Pi,jk-1∩{[xp,xp+1]×[yq,yq+1]} and Li,jk-1∩{[tk-1,tk]×[yq,yq+1]} separately.
Compute ρh(tk,xi,yj)=Ji,jk-1/mes(Ωi,j).
The end of the space loop.
Put ρh(tk,0,yj)=ρin(tk,yj) for all j=0,…,N.
If necessary, calculate the norms of the solution, the error, and other statistics at the current time step.
The end of the time loop.
Note that items (2.1)–(2.3) are compute-intensive, especially the procedure of determining the mutual arrangement of Pi,jk-1 and the cells {[xp,xp+1]×[yq,yq+1]}p,q=0N-1 at the previous time level in item (2.3).
The algorithm is explicit with respect to time, since to calculate ρh(tk,x,y) at each time level tk the data are used only from the previous time level tk-1.
Another advantage of Algorithm 7 is data independence in the general space loop; that is, the items (2.1)–(2.4) are carried out for any pair (i,j),i=1,…,N,j=0,…,N, independently. In this connection the data parallelism is used.
In the shared memory case for the OpenMP-technology it is sufficient to parallelize the general space loop at each time level using an OpenMP directive like the following one:
#pragma omp parallel for collapse(2)
…
For the parallelization to be correct, the data-sharing attribute of every variable holding intermediate results of items (2.1)–(2.4) has to be private for each thread.
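As an illustration, the parallelized space loop might look as follows in C. Here `cell_update` and the mesh size are hypothetical placeholders, not the paper's actual code; variables declared inside the loop body are automatically private to each thread, and the pragma is simply ignored by a compiler without OpenMP support.

```c
#define N 8   /* mesh size for the sketch; the real runs use much larger N */

/* Hypothetical stand-in for items (2.1)-(2.4): here it just averages the
   cell with its left neighbor. */
static double cell_update(const double *prev, int i, int j)
{
    return 0.5 * (prev[j*(N+1) + i] + prev[j*(N+1) + i - 1]);
}

/* Sketch of the general space loop of Algorithm 7 parallelized with OpenMP.
   The inflow column i = 0 is skipped; it is set from the boundary condition. */
static void space_loop(const double *prev, double *next)
{
    #pragma omp parallel for collapse(2)
    for (int i = 1; i <= N; ++i) {
        for (int j = 0; j <= N; ++j) {
            double value = cell_update(prev, i, j); /* private intermediate */
            next[j*(N+1) + i] = value;
        }
    }
}
```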
Another natural approach to parallelizing the algorithm is to use the NVIDIA CUDA technology for GPUs. The main aspects of the parallel implementation related to various features of general-purpose GPU programming are briefly discussed below.
All functions used in the numerical calculations on a CPU must be recompiled for a GPU. With NVIDIA CUDA, such a function must be declared with the special qualifiers __host__ __device__, which instruct the NVCC compiler to create two versions of its executable code, for the CPU (host) and for the GPU (device) separately. The GPU calls the device version of a function, while the CPU calls its host version.
The principles of efficient CUDA programming are as follows: (1) the maximal use of inherent parallelism of the problem and (2) the optimization of memory access.
The first version of our parallel CUDA-algorithm is based on the inherent parallelism of our numerical explicit approach. Every thread treats only one cell Ωi,j∈Dh; hence, the space loop body (items (2.1)–(2.4) of Algorithm 7) is the general computation kernel.
When programming for a GPU, correctly defining the kernel configuration is important. The kernel configuration includes two parameters, namely, the number of blocks (blockCount) in a grid and the number of threads (blockSize) per block. There is a limit of 1024 threads per block for our NVIDIA GPU hardware. In the first CUDA-algorithm, no optimization of the number of threads is used.
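The block count for a given block size follows the usual ceiling-division pattern; a host-side C sketch (the helper name is ours, not from the paper's code):

```c
/* Number of blocks needed so that blockCount * blockSize covers all cells:
   ceiling division, the standard way to compute a CUDA launch grid size.
   blockSize must respect the 1024-threads-per-block hardware limit. */
static int block_count(int cells, int blockSize)
{
    return (cells + blockSize - 1) / blockSize;
}
```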
Consequently, the simplest parallel version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.
Algorithm 8 (CUDA parallel, version 1).
Calculate the kernel configuration (blockSize, blockCount) using data about a computational domain.
Allocate host (CPU) and device (GPU) memory; copy initial data from host to device.
Time loop: for each time step k=1,…,K do:
Call the first CUDA kernel (basic):
For each cell Ωi,j, i=1,…,N, j=0,…,N, execute items (2.1)–(2.4) of Algorithm 7 in parallel.
Synch point: wait for calculations to be completed.
Call the second CUDA kernel (assistive):
Copy data from an actual time level array to the previous one in parallel.
Synch point: wait for copying to be completed.
If necessary, copy results from device to host.
If necessary, calculate the norms of the solution, the error, and other statistics at the current time step.
The end of the time loop.
Copy the results from device to host.
In order to decrease the execution time of Algorithm 8, items (3.5)-(3.6) must be performed as rarely as possible, for instance, only at time levels where accuracy control, data for plotting, and so forth are needed.
Algorithm 8 has two general disadvantages: (1) small speedup in comparison with the sequential version (Figure 8) and (2) the impossibility of execution on a fine computational mesh (Table 2). What causes these problems?
First, the general loop has a lot of selection statements.
The main selection is between two different processing paths in item (3.1.1) of Algorithm 8 (to be more exact, in item (2.3) of the sequential Algorithm 7): the cells whose trajectories intersect the boundary and the internal cells are processed in different ways. We can therefore use two different kernels which execute in parallel.
We use data parallelism only in Algorithm 8. However, there is yet another class of parallelism to be exploited on NVIDIA GPU. This parallelism is similar to the task parallelism that is found in multithreaded CPU applications. NVIDIA CUDA task parallelism is based on CUDA streams. A CUDA stream represents a queue of GPU operations such as kernel launches, memory copies, and event starts and stops. The order in which operations are added to the stream specifies the order in which they will be executed. Each stream may be considered as a certain task on the GPU, and there are opportunities for these tasks to execute in parallel [31]. Thus we apply CUDA streams to our two kernels to improve parallelism and total GPU utilization.
There are many selections in item (2.3) for determining mutual arrangement of Pi,jk-1 and cells of the mesh of the previous time level. Unfortunately, we cannot avoid these selections.
Secondly, the assistive CUDA kernel in item (3.3) of Algorithm 8 performs no useful computation, only copying. We apply loop unrolling to the time loop in order to eliminate this kernel.
Thirdly, the basic CUDA kernel of Algorithm 8 is not optimal with respect to memory access. To optimize concurrent read access to global memory by simultaneously running threads, it is preferable to use constant memory. Applying this approach, we allocate all invariant values in the GPU constant memory.
Moreover, to optimize the kernel launch parameters, we use the CUDA Occupancy Calculator. It computes the optimal streaming multiprocessor (SM) utilization taking into account the GPU compute capability, CUDA device properties, the number of blocks in a grid, the number of threads per block, the size of the shared-memory space, and the number of registers per thread.
Consequently, the second CUDA-version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.
Algorithm 9 (CUDA parallel, version 2).
Calculate kernel configuration (blockSize, blockCount) using data about a computational domain.
Allocate host (CPU) and device (GPU) memory, copy initial data from host to device, and copy constant data from host to constant memory of device.
Time loop: for each time step k=1,3,5,…,K-1 do:
Call the first CUDA kernel (boundary cells) in the first CUDA stream:
Execute items (2.1)–(2.4) of Algorithm 7 in parallel for each cell Ωi,j whose characteristics intersect the boundary.
Call the second CUDA kernel (inner cells) in the second CUDA stream:
Execute items (2.1), (2.3)-(2.4) of Algorithm 7 in parallel for each internal cell Ωi,j.
Synch point: wait for the calculations of both kernels to be completed.
Call the first CUDA kernel (boundary cells) in the first CUDA stream for the second half of the unrolled time step:
Execute items (2.1)–(2.4) of Algorithm 7 in parallel for each cell Ωi,j whose characteristics intersect the boundary.
Call the second CUDA kernel (inner cells) in the second CUDA stream for the second half of the unrolled time step:
Execute items (2.1), (2.3)-(2.4) of Algorithm 7 in parallel for each internal cell Ωi,j.
Synch point: wait for the calculations of both kernels to be completed.
If necessary, copy results from device to host.
If necessary, calculate the norms of the solution, the error, and other statistics at the current time step.
The end of the time loop.
If K is odd then repeat items (3.1)-(3.2).
Copy results from device to host.
5. Numerical Experiments
Specify the velocities
(45) u(t,x,y) = 100y(1-y)[π/2 - arctan(x)], v(t,x,y) = arctan(x(1-x)y(1-y)(1+t)10)
and take the initial and boundary conditions in the following form:
(46)∀(x,y)∈Dρ(0,x,y)=ρinit(x,y)=1,∀(t,y)∈[0,T]×[0,1]ρ(t,0,y)=ρin(t,y)=1.
Numerical experiments were performed on the ICM SB RAS FLAGMAN computation system with the following configuration.
Software. OS: UBUNTU 11.04; C/C++: GCC 4.5.2, INTEL C++ Compiler 13.1.0; CUDA C/C++: NVCC 5.0; NVIDIA CUDA 5.0; BOOST 1.53; NVIDIA CUDA-GDB 5.0.
One of the purposes for the numerical experiments was to check the convergence order in τ and h. Therefore the computations were performed on the sequence of N×N regular square grids, N=10·2n, n=0,…,6. The number of time steps is defined by τ=h/5.
Assume that {ρhn}n=06 is the set of solutions found on the sequence of square grids. The expression log2(∥ρ-ρhn∥L1h/∥ρ-ρhn+1∥L1h) as a function of n can be considered as the observed order of convergence (Figure 6). The corresponding exact values of ρi,jK were computed directly by the method of characteristics. In Figure 6 we can see the first order of convergence in h.
The order of convergence.
In our sequential and OpenMP computational experiments we compare the GCC and Intel C++ compilers. The execution time of the best Intel-compiled code is on average 15% less than that of the best GCC-compiled one. All presented numerical results were obtained with the −O2 optimization level. Let us remark that we tried other compiler options (−O3, −parallel, −AVX), but the performance increased only slightly or sometimes even decreased. For the CUDA-version we used the NVCC compiler.
The results on the computation speedup of the OpenMP-version are presented in Table 1 and in Figure 8. The first line of the table shows the speedup (or rather the slowdown) of one thread executing the OpenMP-code as compared with the sequential code. These data allow estimating the overhead related to the OpenMP runtime and the synchronization of OpenMP-threads. As we can see, in our case the overhead is small.
Table 1: Speedup of OpenMP-code (HyperThreading switched Off/On). Column headers give the number of mesh points in one space dimension and, in parentheses, the number of time steps.

| Threads | 80×80 (400) | 160×160 (800) | 320×320 (1600) | 640×640 (3200) | 1280×1280 (6400) |
|---|---|---|---|---|---|
| 1 | 0.94/1.00 | 0.99/1.00 | 0.99/1.00 | 0.99/1.00 | 0.99/1.00 |
| 4 | 3.86/3.89 | 3.90/3.93 | 3.92/3.93 | 3.93/3.95 | 3.93/3.96 |
| 8 | 7.40/7.35 | 7.48/7.37 | 7.53/7.57 | 7.55/7.59 | 7.56/7.60 |
| 12 | 10.86/10.86 | 11.08/11.04 | 11.17/11.12 | 11.21/11.14 | 11.20/10.86 |
| 16 | 4.15/8.76 | 4.77/8.94 | 6.32/8.96 | 7.54/8.98 | 7.60/8.88 |
| 20 | 4.69/10.63 | 5.82/10.85 | 6.55/10.94 | 7.26/10.96 | 7.89/10.73 |
| 24 | 5.71/10.98 | 6.17/12.80 | 7.30/12.55 | 7.97/12.83 | 8.45/12.85 |
Table 2: Execution time of all versions of the program. Column headers give the number of mesh points in one space dimension and, in parentheses, the number of time steps.

| Version | 80×80 (400) | 160×160 (800) | 320×320 (1600) | 640×640 (3200) | 1280×1280 (6400) |
|---|---|---|---|---|---|
| Sequential, −O0 | 20.16 | 159.40 | 1268.46 | 10107.80 | * |
| Sequential, −O2 | 9.99 | 78.97 | 626.72 | 4980.61 | 39598.90 |
| Sequential, −O3 | 9.87 | 78.08 | 619.80 | 4936.25 | 39202.91 |
| OpenMP(12), −O0, HT Off | 1.98 | 12.52 | 103.45 | 819.06 | 6519.87 |
| OpenMP(12), −O2, HT Off | 0.92 | 7.13 | 56.10 | 444.15 | 3535.53 |
| OpenMP(24), −O2, HT On | 0.91 | 6.17 | 49.93 | 388.20 | 3080.91 |
| CUDA version 1 | 2.55 | 13.06 | 74.02 | ** | ** |
| CUDA version 2 | 3.20 | 8.27 | 42.60 | 308.82 | ** |
One of the purposes of the studies is to assess the influence of the HyperThreading (HT) technology on the parallelism. Since only 12 physical cores are available on the CPU, we have access to 24 logical cores when HT is enabled and to 12 logical cores when HT is disabled. Experiments show that with HT enabled the execution time with 24 threads is about 14% less than that with 12 threads with HT disabled (Figure 7). As our code is compute-intensive, the advantage of HT is probably related to the optimization of memory access.
Execution time of OpenMP-code (the main vertical axis) and speedup in comparison with the sequential code (additional vertical axis). The comparison of results when HyperThreading is switched On or Off.
Speedup of parallel versions.
The execution time of the sequential, OpenMP, and two versions of CUDA codes is given in Table 2. The “*” symbol signifies that the result was not obtained in reasonable time; for the CUDA-versions, the “**” symbol means that the kernel launch failed due to a register bottleneck. For OpenMP, the results on 12 threads with HT disabled and on 24 threads with HT enabled are shown.
Comparative information on the possibilities of code optimization by the GCC compiler is also presented in Table 2. The execution time of the code compiled without optimization and with the −O2 and −O3 optimization levels, respectively, was measured. It is clear from Table 2 that compiler optimization considerably (more than two times) reduces the execution time of the sequential code. As GCC does not optimize the CUDA code, the execution time of the CUDA-versions does not depend on the compiler options.
Table 3 and Figure 8 present the data on the computation speedup of the best OpenMP-version and of the two CUDA-versions in comparison with the sequential program compiled with −O2. The speedup of the second CUDA-version in comparison with the OpenMP-version is given as well. Numerical experiments show that on fine grids, in comparison with the sequential program, the best OpenMP program with 24 threads gives more than 12 times speedup and the second CUDA-version shows about 16 times speedup.
Table 3: Speedup of parallel versions. Column headers give the number of mesh points in one space dimension and, in parentheses, the number of time steps.

| Versions compared | 80×80 (400) | 160×160 (800) | 320×320 (1600) | 640×640 (3200) |
|---|---|---|---|---|
| OpenMP(24), −O2, HT On / sequential | 10.98 | 12.80 | 12.55 | 12.83 |
| CUDA version 1 / sequential | 3.92 | 6.05 | 8.47 | ** |
| CUDA version 2 / sequential | 3.12 | 9.55 | 14.71 | 16.13 |
| CUDA version 2 / OpenMP(12), −O2, HT Off | 0.29 | 0.86 | 1.32 | 1.44 |
| CUDA version 2 / OpenMP(24), −O2, HT On | 0.28 | 0.75 | 1.17 | 1.26 |
6. Discussion
Nowadays there are many algorithms in the family of semi-Lagrangian methods. As mentioned above, the presented method is based on a square grid only and uses auxiliary algorithms that make the computation more resource-intensive. Nevertheless, this complication allows us to prove first-order convergence. Furthermore, we present a theorem that takes into account the volume of substance passing through a boundary, which enables us to prove the balance equation. Numerical experiments completely confirm the theoretical convergence results.
As for parallel implementation of our approach, we can note the following.
Though the algorithm is explicit, we are not fully satisfied with the results of the CUDA-versions.
First of all, there is a problem with running the CUDA code on fine grids (more than 640×640 nodes). Profile-feedback analysis shows at least two causes: (1) the assumed decomposition of the computational domain and (2) the hardware constraint on the number of registers per streaming multiprocessor.
Concerning the first item, it should be noted that in our approach a 2D computational domain is mapped to a 1D array of cells, and every thread treats one cell (e.g., see Pseudocode 1).
For the kernel launch parameters to be readily adaptable, we can apply “thread reuse”: every thread treats several non-adjacent cells (see Pseudocode 2).
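The “thread reuse” pattern is essentially a grid-stride loop. The following plain-C sketch simulates it on the host (the names and the per-cell update are hypothetical, not the paper's kernel):

```c
/* "Thread reuse" (a grid-stride loop): each of nthreads virtual threads
   processes cells tid, tid + nthreads, tid + 2*nthreads, ..., so a fixed
   kernel configuration covers a mesh of any size. Simulated on the host;
   in a real kernel, tid would be blockIdx.x*blockDim.x + threadIdx.x and
   nthreads would be gridDim.x*blockDim.x. */
static void grid_stride(int tid, int nthreads, int ncells, int *visited)
{
    for (int c = tid; c < ncells; c += nthreads)
        visited[c] += 1;           /* stand-in for the per-cell update */
}
```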
Besides, we can use several video adapters. In this case we should solve the problem of distributing the computational cells between the devices under the following restriction: we do not know in advance how many cells of the previous time level are required to calculate an actual value (e.g., a “shadow line” of varying width).
The register problem is related to the deep nesting of function calls in item (2.3), which determines the mutual arrangement of P_{i,j}^{k-1} and the cells of the mesh at the previous time level. Indeed, this is the bottleneck of our sequential algorithm. Unfortunately, to resolve this issue we would have to modify the initial sequential algorithm.
In addition, item (2.3) contains many flow control instructions (mainly "if" statements), yet we cannot say that this branching causes a significant performance penalty on the SIMT architecture. In a CUDA kernel, any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge [32]. If this happens, the different execution paths must be serialized, since all threads of a warp share a program counter; this increases the total number of instructions executed for the warp. When all the execution paths have completed, the threads converge back to the same path.
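A toy host-side model (our illustration, not code from the paper) makes the serialization cost concrete: a branch costs one pass over the warp when all 32 threads agree and two passes when they diverge:

```c
#include <assert.h>

/* Toy model of warp divergence.  taken[lane] records which path each of
   the 32 lanes of a warp takes at a branch.  If every lane takes the
   same path, the warp executes one pass; if the paths differ within the
   warp, both paths are serialized, i.e., two passes. */
static int warp_passes(const int taken[32]) {
    int any_true = 0, any_false = 0;
    for (int lane = 0; lane < 32; ++lane) {
        if (taken[lane]) any_true = 1;
        else             any_false = 1;
    }
    return any_true + any_false;
}
```

For example, a condition like (threadIdx.x % 2 == 0) splits every warp and doubles the pass count, whereas a condition that is uniform across the warp costs a single pass.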
Fortunately, this penalty arises only when the control flow depends on the thread ID. In our case, however, many branches do not depend on the thread ID. Inside our computational kernel, most branches look like Pseudocode 3.
<bold>Pseudocode 3</bold>
if ((indCurSqOx[1] >= 0) && (indCurSqOy[1] >= 0)) {
    /* both indices of the current square are valid */
    do_smth();
} else {
    /* at least one index is negative */
    do_smth_else();
}
As we can see, this "if" statement does not depend on the thread ID, so branches of this type do not degrade the performance of the CUDA kernel.
7. Conclusion
We have presented an unconditionally stable semi-Lagrangian scheme of first-order accuracy. The numerical experiments confirm the theoretical results. The performance of a sequential algorithm and of several parallel implementations with the OpenMP and CUDA technologies in the C language has been studied. The optimization potential of the CUDA version is not yet exhausted.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work was supported by Project no. 14-11-00147 of the Russian Science Foundation.
References
S. K. Godunov, A. V. Zabrodin, M. Y. Ivanov, A. N. Kraiko, and G. P. Prokopov
S. Godunov
J. D. Anderson
M. Lentine, J. T. Grétarsson, and R. Fedkiw, "An unconditionally stable fully conservative semi-Lagrangian method"
D. Levy, G. Puppo, and G. Russo, "Central WENO schemes for hyperbolic systems of conservation laws"
R. J. LeVeque
S. Clain, S. Diot, and R. Loubère, "A high-order finite volume method for systems of conservation laws with Multi-dimensional Optimal Order Detection (MOOD)"
M. Käser and A. Iske, "ADER schemes on adaptive triangular meshes for scalar conservation laws"
J. S. Scroggs and F. H. M. Semazzi, "A conservative semi-Lagrangian method for multidimensional fluid dynamics applications"
J. Behrens, "A parallel adaptive finite-element semi-Lagrangian advection scheme for the shallow water equations"
A. Klar, P. Reuterswärd, and M. Seaïd, "A semi-Lagrangian method for a Fokker-Planck equation describing fiber dynamics"
O. Pironneau, "On the transport-diffusion algorithm and its applications to the Navier-Stokes equations"
E. Carlini, M. Falcone, and R. Ferretti, "A time-adaptive semi-Lagrangian approximation to mean curvature motion"
D. R. Durran
K. W. Morton
A. Staniforth and J. Cote, "Semi-Lagrangian integration schemes for atmospheric models—a review"
A. Priestley, "A quasi-conservative version of the semi-Lagrangian advection scheme"
H. Ritchie, "Semi-Lagrangian advection on a Gaussian grid"
A. Robert, T. L. Yee, and H. Ritchie, "A semi-Lagrangian and semi-implicit numerical integration scheme for multilevel atmospheric models"
J. Behrens and L. Mentrup, "A conservative scheme for 2D and 3D adaptive semi-Lagrangian advection," Tech. Rep. TUM M0411, Technische Universität München, Fakultät für Mathematik, 2004
A. Wiin-Nielson, "On the application of trajectory methods in numerical forecasting"
V. V. Shaidurov, G. I. Shchepanovskaya, and V. Yakubovich, "Numerical simulation of supersonic flows in a channel"
H. Chen, Q. Lin, V. V. Shaidurov, and J. Zhou, "Error estimates for triangular and tetrahedral finite elements in combination with a trajectory approximation of the first derivatives for advection-diffusion equations"
T. N. Phillips and A. J. Williams, "Conservative semi-Lagrangian finite volume schemes"
J. P. R. Laprise and A. Plante, "A class of semi-Lagrangian integrated-mass (SLIM) numerical transport algorithms"
T. Arbogast and W. Wang, "Convergence of a fully conservative volume corrected characteristic method for transport problems"
K. Takizawa, T. Yabe, and T. Nakamura, "Multi-dimensional semi-Lagrangian scheme that guarantees exact conservation"
A. Iske and M. Käser, "Conservative semi-Lagrangian advection on adaptive unstructured meshes"
E. A. Novikov
V. I. Krylov
J. Sanders and E. Kandrot