Parallel Algorithms of Well-Balanced and Weighted Average Flux for Shallow Water Model Using CUDA

Compute Unified Device Architecture (CUDA) implementations are presented of a well-balanced finite volume method for solving a shallow water model. The CUDA platform allows programs to run parallel on GPU. Four versions of the CUDA algorithm are presented in addition to a CPU implementation. Each version is improved from the previous one. We present the following techniques for optimizing a CUDA program: limiting register usage, changing the global memory access pattern, and using loop unroll. The accuracy of all programs is investigated in 3 test cases: a circular dam break on a dry bed, a circular dam break on a wet bed, and a dam break flow over three humps. The last parallel version shows 3.84x speedup over the first CUDA implementation. We use our program to simulate a real-world problem based on an assumed partial breakage of the Srinakarin Dam located in Kanchanaburi province, Thailand. The simulation shows that the strong interaction between massive water flows and bottom elevations under wet and dry conditions is well captured by the well-balanced scheme, while the optimized parallel program produces a 57.32x speedup over the serial version.


Introduction
Shallow water equations are derived from the conservation of mass and momentum under hydrostatic approximation. This model can be used to describe fluid flow under various scenarios over time. One can apply the shallow water equations to simulate dam break problems, the flood plain over wet and dry areas, or even tsunami waves in some senses.
Shallow water equations can be solved numerically using the finite volume method. The computational domain is discretized by rectangular [1], triangular [2,3], tetrahedral [4], or unstructured cells. The applications of shallow water equations using Roe-type solutions can be found in [5,6]. The flux gradient at each cell interface can be approximated by the TVD-WAF (total variation diminishing-weighted average flux) or HLL (Harten-Lax-van Leer) methods [7]. A well-balanced scheme is introduced later for balancing the flux gradients and source term approximation at each cell interface; see [1] and compare the results in [8,9]. Due to the simplicity of the rectangular cell, we implement a well-balanced scheme based on the weighted average flux (WAF) of the finite volume method extended from [8][9][10][11] to simulate various complex flow problems. The data elevation model (DEM) is available in rectangular format, so it is easy to utilize as an input in our program for simulating real-world problems. For a very large domain, the computational time is massive; thus, several approaches are proposed to reduce the computational time. One concept uses parallel computation. Our main study is to develop parallel programs on GPU where the parallel version is developed from the serial version proposed in [8,9]. The most challenging is the implementation using unstructured meshes since the data structure and hence the memory access is not straightforward. Several types of research implement unstructured grid; see [2,3], but the rectangular domain is simple and not much difficult to develop parallel programming.
We implemented a program that runs on GPU. The program is written using the CUDA (Compute Unified Device Architecture) programming model. Several parallel implementations of the shallow water equations on CUDA can be found in the literature [17][18][19][20][21][22]. Most of them report a huge speedup in the computational time compared to serial implementation running on CPU. However, comparing the CUDA program with a serial program might not provide a meaningful result as there are several issues involved in the comparison. The first is the selection of the CPU and GPU models to be compared. It is arguable whether the selection should be based on factors such as the number of cores, number of transistors, clock speed, or price. The second is the effort required to optimize a program for both CPU and GPU implementation. For a balanced comparison, a baseline CPU implementation must be highly optimized (using an efficient data structure, taking advantage of special instructions, or taking advantage of a multicore architecture). Third, the size of the main memory is normally much larger than the size of the graphics card memory. As a result, the program that runs on the CPU can solve a larger problem than the one running on GPU. Thus, both the CPU and GPU implementations should solve the large problem.
From these reasons, our paper focuses on optimizing a CUDA implementation of the shallow water equation [8,9]. Nsight, a CUDA profiler, is used to determine an area for computational improvement. We implement several programs based on Nsight reports.

Numerical Method
2.1. Finite Volume Method. The finite volume method is a popular method for solving shallow water equations, for examples, see [5,8,10,23]. The two-dimensional shallow water equation can be written as where h is the water depth, u and v are the flow velocities in the x and y directions, respectively, g is the acceleration due to gravity, and z is the bottom elevation. The shallow water equations (1)-(3) can be written in a conservation form as where U = ðh, hu, hvÞ T , F = ðhu, hu 2 + ð1/2Þgh 2 , huvÞ T , G = ðhv, huv, hv 2 + ð1/2Þgh 2 Þ T , and Applying the finite volume method, we obtain the discretized scheme as where △x =x i+1/2 − x i−1/2 , △y =y j+1/2 − y j−1/2 , and S ij is the approximation of source term at cell I ij . Here,F i+1/2,j and G i+1/2,j are numerical fluxes at the right and upper interfaces, respectively. The notations are similar to the other interfaces of a cell. In this work, we apply the weighted average flux method to approximate numerical fluxes. The reconstruction steps are implemented and the well-balanced scheme is also applied based on [8,9].
2.2. Weighted Average Flux (WAF). Weighted average flux was first proposed by [11]. Intercell flux, F∧ WAF at the cell interface, x i+1/2,j , is defined as an integral average of the flux function FðUÞ at the half-time step aŝ It can be written in the form of wave structure aŝ where N c is the number of waves in the solution of the Riemann problem and F ðkÞ i+1/2,j is the k th flux in the solution of the Riemann problem. In this work, we will apply the numerical flux in equation (7) to equation (5).

Linear Reconstruction.
Linear reconstruction is applied before calculating the numerical fluxes in equations (6) and (7). U ij is the approximation of U which is defined by the cell averages over cell I ij = ðx i−ð1/2Þ , x i+ð1/2Þ Þ × ðy j−ð1/2Þ , y j+ð1/2Þ Þ as The approximate solution using the cell average in equation (8) yields just the first-order scheme; see [11]. We can obtain the second-order scheme by applying linear reconstruction to all conservative variables. For example, in the x direction, we reconstruct variables in the right and the left limits at interfaces as where σ ij is a slope limiter that has various versions. Here, we apply the minmod slope limiter given by 2 Modelling and Simulation in Engineering where Similar calculations can be performed in the y direction. Slope limiters can avoid spurious oscillations in the numerical solution for high-gradient free-surface flows; see [8,11].
2.4. Well-Balanced Finite Volume Method. The hydrostatic well-balanced scheme proposed by Audusse et al. [24] is applied in this work. The concept is designed for balancing between the flux gradient and bottom slope approximations at a cell interface. The advantage of this scheme is the conservation properties of water at rest. Before calculating numerical flux, it is required to reconstruct water depth h at interfaces in both the x and y directions. The wellbalanced scheme can preserve the solution at a steady state and ensure nonnegativity of the water height using the following approximations where z i+ð1/2Þ,j = max ðz − i+ð1/2Þ,j , z + i+ð1/2Þ,j Þ and z i,j+ð1/2Þ = max ðz − i,j+ð1/2Þ , z + i,j+ð1/2Þ Þ. The objective of this reconstruction is to balance the conservative fluxes and the bed slope source term with frictionless conditions. Finally, all of the presented techniques are combined in the finite volume method. The numerical scheme is called the well-balanced finite volume with WAF scheme [8]. The discretization form is The updated solution can be obtained by the forward method. To improve the stability of the finite volume scheme, the source term including the roughness coefficient will be approximated by the splitting implicit method; see more details in [8]. Briefly, we solve the shallow water system by considering the ordinary differential equation where n is the time step. For example, the implicit forward time in the x direction is approximated by This can be rewritten in an explicit formula as where the implicit coefficient D x is defined by where n m is the Manning coefficient. This technique provides a more stable finite volume method when dealing with a strong nonlinear friction term if we compare it with the standard explicit method.
For the present step, we can calculate cell by cell sweeping along the x direction until all numbers of cells are calculated. The computation by forwarding time and sweeping cell by cell allows us to modify the serial computation to a parallel computation. This is the main objective of our present work. We will design and give some concepts in the next section.

CUDA
GPU has a large number of cores compared with a multicore CPU, for example, NVIDIA QUADRO M6000 has 3072 cores. As a result, a GPU can process efficiently 3D graphics data such as texture shading. Therefore, GPU is suitable for massive parallel computations. Utilizing a GPU to run a general-purpose parallel program would greatly improve the program's running time. Compute Unified Device Architecture (CUDA) is a parallel model that allows a program to run on a GPU. CUDA was created by NVIDIA, a company that manufactures GPUs.

Thread.
A thread is a sequence of programmed instructions that can be executed independently. A group of threads on the CUDA model is called a block. A group of blocks is called a grid. All threads in a grid can be launched from the same kernel, and each thread in the same block has a unique thread ID. Furthermore, every block in a grid has a unique block ID as shown in Figure 1.
We can create 1-, 2-, or 3-dimensional CUDA thread blocks. The information for 2-dimensional grid and thread blocks can be obtained using the variables of gridDim.x and gridDim.y. The variables of blockDim.x, blockDim.y, and blockDim.z are the numbers of threads in the x, y, and z dimensions, respectively, on a block. The variables block-Idx.x, bloxkIdx.y, and bloxkIdx.z are the block IDs in the x , y, and z dimensions, respectively. Finally, the variables of 3 Modelling and Simulation in Engineering threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs in the x, y, and z dimensions, respectively.
A unique ID can be made for each thread per kernel based on the block ID and thread ID and is called a global thread ID. Figure 2 shows the design of the global thread ID in the two-dimensional thread in a block.

3.2.
Kernel. The CUDA program can be implemented in many languages including C, C++, or Fortran. The host and device refer to the CPU and GPU systems, respectively. A kernel is a function that can be parallel executed in a GPU. A kernel is declared by the __global__ keyword. For example, a kernel that receives a pointer can be written as A kernel is launched using the <<<grid_size, block_    for i = 0 to row 4: for j = 0 to column 5: for q in ½0, 1 6: Calculation flux i + q in x direction using equations (6) and (7)  7: Calculation flux j + q in y direction using equations (6) and (7)

5
Modelling and Simulation in Engineering Table 1). An example of launching 3 blocks with 256 threads per block can be written as 3.3. Memory Hierarchy. GPUs have various memory types: global memory, texture memory, constant memory, shared memory, local memory, and register. The CUDA threads can access data from different memory spaces. Each thread has a private local memory and register. All threads in the same block can access the same shared memory. All threads in all grids can access global, constant, and texture memory. Figure 3 shows that the host can access global, constant, and texture memory. Data can be transferred between host and device memories. Local memory can read and write per thread that can be declared in a kernel function. Shared memory can read and write per block which is allocated by the __shared__ qualifier. The performances of memory on GPUs differ because shared memory is based on chip memory. It will be able to read and write faster than the local or global memory. Shared memory has a latency that is lower than global memory and also reduces the blank conflicts between the thread in a block.
Our experiments are performed on GeForce GTX 1050 Ti GPU, which has CUDA Compute Capability 6.1 (see Table 1).

Serial and Parallel Implementations
4.1. Serial: Cell-Based Calculation. The first of our implementations is a serial program based on the wellbalanced finite volume method for equation (5). The computational domain is divided into finite volume cells defined by indices of rows and columns. A solution (h, hu, hv) at the present time step is stored for computing a solution for the next time step. We compute four sides of numerical fluxes which have two values in the x direction and two values in the y direction; see Figure 4 for these flux calculations.
To obtain an updated solution, we compute fluxesF    (6) and (7)  4: Store flux to memory 5: end for 6: end for Algorithm 3: Calculate flux in the x or y direction. 6 Modelling and Simulation in Engineering computations over cells will be executed again for the next time step. This is a serial version as shown in Algorithm 1.    4.3. Parallel V1. When CUDA threads are used to compute on GPU, we can identify the thread ID in the global thread ID. We need to transfer data between CPU and GPU at the start and the end of our program. Data can be transferred to GPU using the 1-dimensional array. We will apply this technique to store data on the GPU as shown in Figure 6. Numerical values at the cell interfaces and the cell center will be stored in the GPU by the 1-dimensional array of Cell, Flux x, and Flux y, respectively. Figure 7 shows an outline of our first parallel implementation that uses edge-based calculation. The initial input data are water depth h, flow velocity in the x direction, flow velocity in the y direction, and bottom elevation. These inputs are initialized in a host using CPU and transferred to a device by a graphics adapter. Then, CUDA kernels or global functions are launched. We implement functions cudaRunKernel-Fluxy and cudaRunKernelFluxx for computing fluxes in both directions. Each thread will get data ðh, u, v, zÞ on cell edges from the device and then call a CUDA device function for solving equations (6) and (7). We use the straight array to store fluxes in the x and y directions on GPU. New solutions are updated using equations (8)-(11) based on compu-teNewFVM that executes in parallel on the CUDA device. Finally, the bottom slopes and the friction terms from equations (12) and (13) can be computed using the computeMN kernel.

Serial
The performance of this algorithm can be evaluated using the running time metric. Our first CUDA implemen-tation is 6.5 times faster than the serial edge-based calculation for the rectangular dam break flow problem. All results will be summarized again in Results.

Parallel Occupancy.
To optimize the performance of the parallel V1 program, we will identify the hotspot using NVI-DIA Nsight which is a CUDA profiler. Since our CUDA implementations have 4 kernels, the profiler results of these kernels are shown in Figure 8. The kernels that take cudaR-unKernelFluxy and cudaRunKernelFluxy are the hotspot. The occupancy of these two kernels is very low (12.50%). An occupancy is equal to the number of active warps per max warp per streaming multiprocessor (SM). This SM is a part of the GPU that runs the CUDA kernel. Each SM contains several streaming processors (SP) that can execute one instruction at a time. SM can be limited due to various reasons. Occupancy is a metric that is related to the number of active warps on a multiprocessor. Low occupancy always interferes with the ability to hide memory latency, resulting in performance; see [25,26]. Figure 9 shows that our first parallel implementation has low performance for register usage.
We use the __launch_bounds__ directive to limit the usage of the register. The directive has two parameters: MAX_THREADS_PER_BLOCK and MIN_BLOCKS_PER_ SM, and the result from varying those parameters is shown in Figure 10. The best result is not at 100% occupancy but at only 75% occupancy and is obtained when MAX_ THREADS_PER_BLOCK is 512 and MIN_BLOCKS_PER_ SM is 3. We find these results by testing different experiments as shown in Figure 11. Our second parallel implementation called parallel occupancy is 2.42x speedup compared with the parallel V1. This finding shows the computational benefits in that we can obtain better speedup without refactoring the code.   Modelling and Simulation in Engineering The previous implementation of parallel occupancy uses a 2D thread block which is natural for the 2D domain. In this proposed version, we will transfer the 2D data to the device using a 1D array. Therefore, we refactor the code to launch both hotspot kernels using a 1D thread block. We assume that threads can access memory consecutively and coalesce. Unexpectedly, this implementation is slower than the previous one as shown by the Nsight report in Figure 12(a). The computation of flux is slower only in the y directions. We launch both kernels using 64 threads in one dimension of the thread block. When we launch the cudaRunKernelFluxy kernel that  computes flux in the y direction using ð1, 64Þ thread block as shown in Figure 12(b), and the kernel runs faster. We design to transfer data from CPU to GPU using cudaMemcopy that can transfer data in all cells from CPU to 1-dimensional array on GPU consecutively (see Figure 6). According to the Nsight report in Figure 13, we found that the control flow efficiency is improved after we change the block dimension from ð64, 1Þ to ð1, 64Þ in the kernel cudaRunKernelFluxy. Control flow efficiency is defined by the ratio of the active thread to the maximum number of threads per warp. Figure 13 shows the difference in the control flow efficiency for the same launch configuration block dimension in both the x and y dimensions; see Figure 13(a). The difference in the launch configuration between the x and y directions is shown in Figure 13(b). When the block dimension is changed to ð1, 64Þ, all threads in the same block are executing the code and can access memory consecutively at the same padding. Due to improving the control flow efficiency, our third parallel implementation (called parallel memory pattern) has 1.3x speedup  11 Modelling and Simulation in Engineering compared with the parallel occupancy for a rectangular dam break problem. All results will be summarized again in Results.
4.6. Parallel Unroll. From our various experiments, we found from the profiler results that about 78% of the computations involve calculating flux. This is performed in the kernels cudaRunKernelFluxy and cudaRunKernelFluxx. Both kernels call a device function computefluxxhydro. Inside the function, there is a loop for computing TVD-WAF flux when the number of waves in the Riemann problem (N c ) is set to be 2; for details, see equation (7). Thus, we have a small loop to compute the speed of waves as shown in Figure 14(a) for every finite volume cell. To optimize the computational time of this point, we propose to unroll this loop and calculate it separately. The idea is shown in Figure 14(b).
The loop of the wave speed is unrolled, and we call this fourth parallel implementation a parallel loop unroll and its computational time is 1.2 times faster than the parallel memory pattern for a rectangular dam break problem.

Accuracy Tests
Before applying our various programs to solve real flow problems, the accuracy of our implementations is verified by comparing the output with the previous serial implementations in [8,9]. Accuracy tests are performed for 3 problems: a circular dam break on a wet bed, a circular dam break on a dry bed, and dam break flows over three humps.

Circular Dam Break on Wet
Bed. The domain has 200 × 200 rectangular grid cells. Inside the domain, there is a cylindrical dam located at the center of the domain. The radius of the dam is 50 meters. The calculation is performed without source terms. The initial water depth inside the dam is 10 meters and outside the dam is 1 meter. The surface and contour plots of both serial and parallel implementations are shown in Figure 15. Both serial implementations (cell-and edge-based) provide the same results as shown in Figures 15(a) and 15(c), while the results from the parallel program are shown in Figures 15(b) and 15(d). The agreement for all our implementations. This small discrepancy may be from the difference in floating-point architecture between GPU and CPU as reported in [27].

Circular Dam Break on Dry
Bed. In this problem, the water depth inside a dam is 10 meters and outside the dam is 0 meters. Figure 16 compares the surface and contour plots of both the serial and parallel implementations. Both serial implementations (cell-and edge-based) provide the same results. All parallel implementations give the same output as well. Both serial and parallel programs provide nearly equal outputs at time t = 3 with the RMSE = 0:0050. Since     (15). The computational domain is divided into 128 × 128 uniform grid cells with the Manning coefficient set at 0.018. This problem demonstrates a strong interaction between the high gradient water depth and the friction bottoms for the wet and dry areas. Figure 17 compares the surface and contour plots from both the serial and parallel implementations. The visual results are consistent, and they agree well with the results in [8]. Both serial implementations provide the same result. All parallel implementations also provide very similar results with RMSE = 0:0006.

Performance of Implementations
The performance for implementations is compared based on their individual running times. The calculations can be divided into 3 parts: initialization, numerical computation, and visualization. For the serial programs, initialization includes reading input parameters and allocating and initializing data in the main memory. For parallel programs, initialization adds the overhead of transferring the data from the host memory (main memory) to the device memory (graphics card memory). However, we do not take initialization and visualization times into account. We compare only the numerical computation of the serial and parallel implementations because the initialization and visualization times are small compared with the numerical computation (about 0.003%). Another reason is that the overall running time is varied by some parameters such as the number of iterations or grid cells. Therefore, comparing only numerical computation gives a better view of the code optimization effect. All implementations are tested with four dam break problems as shown in Table 2. The domain sizes are 128 × 128 (16k) and 1024 × 1024 (1M) cells. The CPU specifications in our experiments are AMD Ryzen 5 1500x Quad-Core Processor 3.5 GHz with 16 GB memory, and for the GPU, the specifications are GeForce GTX 1050Ti, with 4 GB memory (see Table 1). Since the running time is not equal for every run, we run the program 10 times and report the average. The size of the thread block is 16 × 16, 64 × 1, or 1 × 64. Table 2, all parallel versions are faster than the serial versions and the improved parallel version is faster than the previous version. As shown in Table 3, the parallel unroll for the 1024 × 1024 domain shows the best performance (63.35x) for the rectangular dam break problem. The ratio of computational time for the serial to the parallel program is close to the ratio of the domain. The large domain achieves a better speedup than the small domain. The optimal number of finite volume cells in each direction also affects the speedup of parallel algorithms.

As shown in
It should be noted that the speedup from CPU to GPU is depended on both hardware architecture and computer code. If one compares the same code on different hardware, speedup will be different. Then, our objective is to implement the CUDA program using Nsight report to develop better parallel algorithms from the first serial program.

Srinakarin Dam Break Simulation
In this section, we will apply our developed algorithms to simulate a real-world problem. We study the Srinakarin Dam located on the KkwaeYai River, Kanchanaburi province, Thailand. The Electricity Generating Authority of Thailand (EGAT) constructed this dam for water supply and power generation over forty years ago. Its height from the ground is about 140 meters, and it has an embankment dam 610 meters long. The study area covers 419 km 2 from 14.125000°N to 14.300000°N and from 99.000000°E to 99.142300°E; see Figure 18 for the location.
The experiment is set up for a partial dam break simulation. The bottom elevation was obtained from the NASA Shuttle Radar Topographic Mission (SRTM) in a Digital Elevation Model (DEM) with a resolution of 90 × 90 m (http:// srtm.csi.cgiar.org). The domain size is 512 × 320 cells. From the EGAT report, we set the initial water height in the Srinakarin Dam at 60 meters and the flow velocities in the x and y directions are set at zero. This implies that the water is stationary over the dam, and a mass of water is released suddenly to the KkwaeYai River downstream. The Manning coefficient is assumed to be zero, since we consider a very strong interaction between water depth and bottom elevation on the KkwaeYai River. The sensitivity in time step size is tested and found that dt < 0:2 provides the same results in our simulation. Then, we set dt = 0:1. The better choice is the application of adaptive time step to satisfy the CFL condition during time integration; see [11]. But for easy control of the measurement of speedup between parallel and serial algorithms, we fix the value of time step in all programs to avoid a floating-point error for calculating time step that will result in different computational time for the same problem.
The simulation after 90 minutes shows that the water in the Srinakarin Dam flows along the river and reaches the Tha Thung na Dam about 27 kilometers downstream from the Srinakarin Dam. We compare the results of the serial and all of our parallel programs and find they all provide the same result. Figures 19(a)-19(c) show the water flow and flood plain along the river at starting times of 60 and 90 minutes, respectively. The water height at the Tha Thung na Dam is about 4 meters (see Figure 20). The water height depends on the initial conditions and the dam break type.
We compare the performance of each implementation. The running time of the serial programs on the AMD Ryzen 5 1500x Quad-Core Processor 3.5 GHz with 16 GB memory is more than the real time (5400 sec), while the running time of parallel programs on the GeForce GTX 1050Ti with 4 GB memory is faster. Table 4 shows the results from the optimized parallel program. The parallel unroll program provides the best performance, being 57.3x faster than the baseline of the serial program. Furthermore, it is faster than the first parallel program (parallel V1) at 3.32x.

Conclusion
Serial and parallel programs for simulating shallow water flows are presented in this work. We develop two serial programs which are cell-based and edge-based versions. A cellbased program is a logical implementation of the governing equations. The edge-based application is 1.9x faster than the cell-based program. Our first CUDA parallel program is 6.55 times faster than the edge-based program. We use Nsight to find an area of improvement. The hotspot has low occupancy which can be improved by decreasing the number of registers per thread. This is counterintuitive because using more registers can increase speedup for a serial program.
However, for the CUDA architecture, registers are the sharing resources between threads in the same block. Lowering the number of registers causes more threads to be run at the same time. Our second parallel implementation is aimed at better memory coalescence. However, coalescence access is 1.33 times faster due to the improvement in control flow efficiency. Finally, the last CUDA program uses loop unrolls in the main computation, resulting in being 3.84 times faster than our first CUDA implementation. The loop unroll approach is proposed to reduce the computational time for calculating fluxes in the Riemann problem of the weighted average flux (WAF) method. Generally, the WAF method contains a small loop of wave components in each cell and we unroll this approximation separately to provide optimized performance using the last parallel program. This concept can be extended to other problems that apply the WAF method to obtain numerical flux at cell interfaces.

Data Availability
The data that support the findings of this study are openly available in http://srtm.csi.cgiar.org.

Disclosure
This study extends our preliminary work presented at the 11th International Conference on Computer Science & Education (ICCSE) 2016 according to the following link: https:// ieeexplore.ieee.org/document/7581572.