Solving the Caputo Fractional Reaction-Diffusion Equation on GPU

We present a parallel GPU solution of the Caputo fractional reaction-diffusion equation in one spatial dimension with an explicit finite difference approximation. The parallel solution, which is implemented with the CUDA programming model, consists of three procedures: preprocessing, parallel solver, and postprocessing. The parallel solver involves parallel tridiagonal matrix-vector multiplication, vector-vector addition, and constant vector multiplication. The most time consuming loop, which combines vector-vector addition and constant vector multiplication, is optimized, and an impressive performance improvement is obtained. The experimental results show that the GPU solution compares well with the exact solution. The optimized GPU solution on an NVIDIA Quadro FX 5800 is 2.26 times faster than the optimized parallel CPU solution on a multicore Intel Xeon E5540 CPU.


Introduction
The idea of fractional derivatives dates back to the 17th century. A fractional differential equation is an equation that uses fractional derivatives. Fractional equations can describe some physical phenomena more accurately than classical integer order differential equations [1]. Reaction-diffusion equations play an important role in dynamical systems in mathematics, physics, chemistry, bioinformatics, finance, and other research areas. Some analytical methods have been proposed for fractional differential equations [2,3]. The stability of Cauchy fractional differential equations has been studied [4,5], and more attention should be paid to the interesting Ulam's type stability [6]. A wide variety of numerical approximation methods have been proposed for fractional equations [7,8], for example, the finite difference method [9], the finite element method, and spectral techniques [10]. Interest in fractional reaction-diffusion equations has increased [11]. In 2000, Henry and Wearne [12] derived a fractional reaction-diffusion equation from a continuous-time random walk model with temporal memory and sources. The fractional reaction-diffusion system with activator and inhibitor variables was studied by Gafiychuk et al. [13]. Haubold et al. [14] developed a solution in terms of the H-function for a unified reaction-diffusion equation. The generalized differential transform method [15] was presented for fractional reaction-diffusion equations. Saxena et al. [16] investigated a closed form solution of a generalized fractional reaction-diffusion equation.
Parallel computing is used to solve computation intensive applications [17][18][19]. In recent years, computing accelerators such as the graphics processing unit (GPU) have provided a new way of accelerating computation intensive simulations [20][21][22]. General purpose use of the GPU has been made possible by advances in programming models and hardware resources. GPU programming models such as NVIDIA's compute unified device architecture (CUDA) [23] have become more mature than before and simplify the development of nongraphics applications. The GPU presents an energy efficient architecture for computation intensive domains like particle transport [24,25] and molecular dynamics [26].
It is time consuming to numerically solve fractional differential equations for high spatial dimensions or long time integrations. The short memory principle [27,28] and parallel computing [29][30][31][32] can be used to overcome this difficulty. Parallel algorithms for one- and two-dimensional time fractional equations have been studied, and good parallel scalability was obtained [31,32]. An optimization of the sum of constant vector multiplications was presented, yielding a twofold speedup [31]. The parallel implicit iterative algorithm was studied for the two-dimensional time fractional problem for the first time [32].
Gong et al. [29] presented a parallel algorithm for Riesz space fractional equations. The parallel efficiency of that algorithm with 64 processes is up to 79.39% compared with 8 processes on a distributed memory cluster system. Diethelm [30] implemented the fractional version of the second-order Adams-Bashforth-Moulton method for fractional ordinary equations on a parallel computer. Domain decomposition is regarded as the basic mathematical background for many parallel applications. A domain decomposition algorithm for the time fractional reaction-diffusion equation with an implicit finite difference method was proposed [33]. The domain decomposition algorithm keeps the same parallelism but needs much fewer iterations in each time step, compared with Jacobi iteration. Until now, however, nothing has been recorded on accelerating the numerical solution of the Caputo fractional reaction-diffusion equation on GPU.
This paper focuses on the Caputo fractional reaction-diffusion equation (1) on a finite domain 0 ≤ x ≤ L and 0 ≤ t ≤ T. The diffusion and reaction coefficients in (1) are constants. If the fractional order α equals 1, (1) is the classical reaction-diffusion equation. The fractional derivative is in the Caputo form.
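The displayed equation (1) did not survive extraction, so a representative form is sketched here; the symbol names (u, α, K, C, f) are our assumptions, while the Caputo derivative definition itself is standard:

```latex
% Representative form of the Caputo time-fractional reaction-diffusion
% equation (symbol names are assumptions; the paper's display was lost):
\frac{\partial^{\alpha} u(x,t)}{\partial t^{\alpha}}
  = K \frac{\partial^{2} u(x,t)}{\partial x^{2}} + C\,u(x,t) + f(x,t),
\qquad 0 \le x \le L,\quad 0 \le t \le T,\quad 0 < \alpha \le 1,
% where the Caputo fractional derivative of order 0 < alpha < 1 is
\frac{\partial^{\alpha} u(x,t)}{\partial t^{\alpha}}
  = \frac{1}{\Gamma(1-\alpha)} \int_{0}^{t}
    \frac{\partial u(x,s)}{\partial s}\,(t-s)^{-\alpha}\, ds .
```

For α = 1 the first display reduces term by term to the classical reaction-diffusion equation, consistent with the remark above.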

GPU Architecture and CUDA Programming Model.
The architecture of the GPU is optimized for rendering real-time graphics, a computation and memory access intensive problem domain with enormous inherent parallelism. Unlike the CPU, a much larger portion of the GPU's resources is devoted to data processing rather than to caching or control flow. The NVIDIA Quadro FX 5800 GPU has 30 multiprocessor units, called streaming multiprocessors (SMs). Each SM contains 8 SIMD CUDA cores, and each CUDA core runs at 1.30 GHz. The multiprocessors create, manage, and execute concurrent threads in hardware with near zero scheduling overhead. The single instruction multiple thread (SIMT) unit, which is akin to SIMD vector organizations, creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps [23].
The programming model is a bridge between hardware and application. CUDA comes with a software environment that allows developers to use C as a high-level programming language. There are three key abstractions in CUDA: a hierarchy of thread execution (grid or kernel, thread block, and thread), shared memory, and barrier synchronization. These abstractions provide fine-grained data level parallelism and thread parallelism, nested within coarse-grained data level parallelism and task level parallelism. Each CUDA kernel launch creates a single grid, which consists of many thread blocks. Each thread block schedules groups of threads that can share data through on-chip shared memory and synchronize with one another using barrier synchronization. Threads within a block are organized into warps, which run in SIMT fashion. CUDA threads may access data from multiple memory spaces during their execution: global, texture, and constant memory for the grid, on-chip shared memory for the thread block, and private register files for each thread. The access time to the different memory spaces varies from several to hundreds of cycles. Global memory accesses must be coalesced to achieve good performance.

Details of GPU Solution
The parallel solution consists of three parts. The first part is preprocessing, which prepares the initial matrices, vectors, and so on. The second part is the parallel solver, which focuses on the iteration of time steps with (9). The third part is postprocessing, which outputs the final results.
The preprocessing includes initialization of the parallel environment, distribution of the computing task, allocation of local memory space, and initialization of variables and arrays. The coefficient matrices are prepared before the computation of (9); for example, the tridiagonal matrix can be constructed according to (10). The postprocessing is simple. The exact solution is evaluated, and the maximum absolute error between the exact and parallel solutions is computed and output. Both the exact and parallel solutions are saved in files for plotting. Other postprocessing operations include freeing memory space and stopping the parallel environment.
In order to obtain the solution at each time step, the right-hand side of (9) must be computed. This computation mainly involves one tridiagonal matrix-vector multiplication, many constant vector multiplications, and many vector-vector additions.
The parallel solution uses the data level parallelism of the GPU architecture. The parallel solution with CUDA for (1) is described in Algorithm 1. The preprocessing involves lines 1 to 3. The parallel solver, which is the most time consuming procedure, involves lines 4 to 12. The postprocessing involves lines 13 to 14; other additional operations are not shown in Algorithm 1. BS stands for the CUDA thread block size, and the problem size divided by BS is the number of CUDA thread blocks; BS is a predefined constant such as 16, 32, 64, and so forth. Because the time spent on the preprocessing and postprocessing is trivial when the number of time steps is big enough, the preprocessing and postprocessing time is omitted from the measured time. Two timers record the measured time of the parallel CPU and GPU solutions.
Except for the initialization of variables and arrays, there are four CUDA kernels. The first computes the initial condition from the initial function in (1). The second performs the tridiagonal matrix-vector multiplication. The third performs the constant vector multiplication combined with the vector-vector addition. The fourth performs the constant vector multiplication alone. The initial-condition kernel and the constant vector multiplication kernel are simple and will not be described in detail.

Implementation.
The CUDA kernel for constant vector multiplication and vector-vector addition is shown in Algorithm 2. Algorithm 2 adds a constant multiple of one vector to another and saves the final vector. Most elements of the tridiagonal matrix are zero. The most common data structure used to store a sparse matrix for sparse matrix-vector multiplication is the compressed sparse row (CSR) format [34], shown in (11). So, in the following parts of this paper, the matrix is stored in the format of (11), not the format of (10). With the format of (11), accesses to global memory are coalesced, which improves the performance of global memory access.
The CUDA kernel for tridiagonal matrix-vector multiplication is shown in Algorithm 3. Each thread handles the multiplication of one row of the matrix with the solution vector of the previous time step, and the final vector is saved. Shared memory is used to improve the memory access speed, and the synchronization function __syncthreads() is used to ensure the correctness of the logic of the algorithm.
In Algorithm 3, each GPU thread deals with the multiplication of one row of the tridiagonal matrix with the previous time-step vector. Each thread needs three elements of that vector, and neighboring threads reuse two of the same elements. Shared memory can therefore improve the memory access performance: the elements of the vector that will be used by the threads of a thread block are first staged into shared memory, shown between lines 6 and 16 of Algorithm 3. The actual tridiagonal multiplication is shown between lines 18 and 21. Finally, the results are stored back into global memory.

Algorithm 3: CUDA kernel for tridiagonal matrix vector multiplication.

Optimization. In Algorithm 1, the kernels for tridiagonal matrix-vector multiplication, constant vector multiplication with vector-vector addition, and constant vector multiplication are each invoked in every time step. In time step k, the kernel for constant vector multiplication and vector-vector addition is invoked n (n = 1, 2, ..., k) times, so the total number of invocations of this kernel grows quadratically with the number of time steps. The most time consuming part of Algorithm 1 is therefore the loop of line 9. This loop can be combined into one CUDA kernel, as shown in Algorithm 4, with the coefficient array of (8) kept in global memory. The optimized parallel solution for the Caputo fractional reaction-diffusion equation is thus the same as Algorithm 1, except that the loop (lines 9-10) in Algorithm 1 is replaced with the optimized CUDA kernel of Algorithm 4.

Experiment Platforms.
The experiment platforms consist of one GPU and one CPU, listed in Table 1. For the purpose of fair comparison, we measure the performance provided by the GPU against the MPI code running on the multicore CPU. The test case of (13) uses a zero initial condition on the spatial domain (0, 2) with both constant coefficients equal to 1, and its exact solution is known.

4.3. Accuracy of the GPU Implementation. The GPU solution compares well with the exact solution of the time fractional reaction-diffusion equation in the test case of (13), as shown in Figure 1.
The time step and the space step for the GPU solution are the time interval divided by 2048 and 2.0/16, respectively. The maximum absolute errors for fractional orders 0.3, 0.5, and 0.7 are 3.29 × 10⁻⁵, 1.07 × 10⁻⁴, and 2.30 × 10⁻⁴. In fact, the accuracy and convergence of the GPU solution are the same as those of the serial and parallel MPI solutions on the CPU [31].
4.4. Total Performance Improvement. In this section, the performance of the optimized GPU solution presented in this paper is compared with that of the parallel CPU solution [31]. The parallel CPU solution makes full use of the four cores of the E5540. The optimized GPU solution is presented in Section 3.2.
With one problem dimension fixed at 128, the performance comparison between the GPU and the multicore CPU is shown in Table 2. The thread block size is 32.
With the other dimension fixed at 122880, the performance comparison between the optimized GPU solution and the parallel CPU solution is shown in Table 3; the thread block size is 32. The impact of the optimized CUDA kernel for constant vector multiplication and vector-vector addition with one dimension fixed at 128 is shown in Table 4, and the performance improvement with the dimension fixed at 491520 is shown in Table 5; in both cases, all thread block sizes are 64. The basic GPU solution is Algorithm 1, and the optimized GPU solution uses the optimized CUDA kernel of Algorithm 4. The thread block size (BS) is a key parameter for parallel GPU algorithms. The impact of BS is shown in Table 6, from which we can see that a thread block size of 32 is the best choice.

Conclusions and Future Work
In this paper, the numerical solution of the Caputo fractional reaction-diffusion equation with an explicit finite difference method is accelerated on GPU. The iteration of each time step consists of tridiagonal matrix-vector multiplication, constant vector multiplication, and vector-vector addition. The details of the GPU solution and some basic CUDA kernels are presented. The most time consuming loop (constant vector multiplication and vector-vector addition) is optimized. The experimental results show that the GPU solution compares well with the exact analytic solution and is much faster than the parallel CPU solution. The power of parallel computing on GPU for solving fractional applications should therefore be recognized.
As part of future work, first, the stability, such as Ulam's type stability, of different fractional equations deserves attention [35][36][37]. Second, parallelizing implicit numerical methods for fractional differential equations is challenging.

Table 1 :
Technical specifications of experiment platforms.

Table 2 :
Performance comparison between the optimized GPU solution on Quadro FX 5800 and the parallel CPU solution on E5540, with one dimension fixed at 128.

Table 4 :
Performance improvement for a fixed dimension with the optimization of constant vector multiplication and vector-vector addition.

Table 5 :
Performance improvement for a fixed dimension with the optimization of constant vector multiplication and vector-vector addition.

Table 6 :
Impact of thread block size (BS).