Discrete Dynamics in Nature and Society, Volume 2014, Article ID 820162. DOI: 10.1155/2014/820162

Research Article: Solving the Caputo Fractional Reaction-Diffusion Equation on GPU

Jie Liu (1), Chunye Gong (1, 2, 3), Weimin Bao (2, 3), Guojian Tang (3), and Yuewen Jiang (4)

(1) School of Computer Science, National University of Defense Technology, Changsha 410073, China
(2) Science and Technology on Space Physics Laboratory, Beijing 100076, China
(3) College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
(4) Department of Engineering Science, University of Oxford, Oxford OX2 0ES, UK

Received 1 April 2014; Accepted 27 May 2014; Published 17 June 2014

Academic Editor: Dorian Popa

Copyright © 2014 Jie Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present a parallel GPU solution of the Caputo fractional reaction-diffusion equation in one spatial dimension with an explicit finite difference approximation. The parallel solution, implemented with the CUDA programming model, consists of three procedures: preprocessing, the parallel solver, and postprocessing. The parallel solver involves parallel tridiagonal matrix-vector multiplication, vector-vector addition, and constant vector multiplication. The most time consuming loop, which performs constant vector multiplication and vector-vector addition, is optimized, and an impressive performance improvement is achieved. The experimental results show that the GPU solution compares well with the exact solution. The optimized GPU solution on an NVIDIA Quadro FX 5800 is 2.26 times faster than the optimized parallel CPU solution on a multicore Intel Xeon E5540 CPU.

1. Introduction

The idea of fractional derivatives dates back to the 17th century. A fractional differential equation is an equation that involves fractional derivatives. Fractional equations can describe some physical phenomena more accurately than classical integer-order differential equations. Reaction-diffusion equations play an important role in dynamical systems in mathematics, physics, chemistry, bioinformatics, finance, and other research areas.

Some analytical methods have been proposed for fractional differential equations [2, 3]. The stability of Cauchy fractional differential equations has been studied [4, 5], and more attention should be paid to the interesting Ulam's type stability. A wide variety of numerical approximation methods have been proposed for fractional equations [7, 8], for example, the finite difference method, the finite element method, and spectral techniques. Interest in fractional reaction-diffusion equations has increased. In 2000, Henry and Wearne derived a fractional reaction-diffusion equation from a continuous-time random walk model with temporal memory and sources. The fractional reaction-diffusion system with activator and inhibitor variables was studied by Gafiychuk et al. Haubold et al. developed a solution in terms of the H-function for a unified reaction-diffusion equation. The generalized differential transform method was presented for fractional reaction-diffusion equations. Saxena et al. investigated a closed-form solution of a generalized fractional reaction-diffusion equation.

Parallel computing is used to solve computation intensive applications simultaneously. In recent years, computing accelerators such as the graphics processing unit (GPU) have provided a new way to accelerate computation intensive simulations. General purpose GPU computing has been made possible by advances in programming models and hardware resources. GPU programming models such as NVIDIA's compute unified device architecture (CUDA) have become more mature than before and simplify the development of nongraphic applications. The GPU presents an energy efficient architecture for computation intensive domains like particle transport [24, 25] and molecular dynamics.

It is time consuming to numerically solve fractional differential equations with high spatial dimension or long time integration. The short memory principle [27, 28] and parallel computing can be used to overcome this difficulty. Parallel algorithms for one- and two-dimensional time fractional equations have been studied, and good parallel scalability has been achieved [31, 32]. An optimization of the sum of constant vector multiplications has been presented, achieving about a twofold speedup. The parallel implicit iterative algorithm was studied for the two-dimensional time fractional problem for the first time.

Gong et al. presented a parallel algorithm for Riesz space fractional equations. The parallel efficiency of that algorithm with 64 processes is up to 79.39% compared with 8 processes on a distributed memory cluster system. Diethelm implemented the fractional version of the second-order Adams-Bashforth-Moulton method for fractional ordinary equations on a parallel computer. The domain decomposition method is regarded as the basic mathematical background for many parallel applications. A domain decomposition algorithm for the time fractional reaction-diffusion equation with an implicit finite difference method was proposed; it keeps the same parallelism as Jacobi iteration but needs much fewer iterations in each time step. Until now, however, nothing has been recorded on accelerating the numerical solution of the Caputo fractional reaction-diffusion equation on GPU.

This paper focuses on the Caputo fractional reaction-diffusion equation:
(1) ${}_0D_t^{\alpha} u(x,t) + \mu u(x,t) = \dfrac{\partial^2 u(x,t)}{\partial x^2} + K f(x,t)$, $0 < \alpha < 1$,
$u(x,0) = \phi(x)$, $x \in [0, x_R]$,
$u(0,t) = u(x_R,t) = 0$, $t \in [0, T]$,
on a finite domain $0 \le x \le x_R$ and $0 \le t \le T$. Here $\mu$ and $K$ are constants. If $\alpha$ equals 1, (1) is the classical reaction-diffusion equation. The fractional derivative is in the Caputo form.

2. Background 2.1. Numerical Solution

The fractional derivative of $f(t)$ in the Caputo sense is defined as
(2) ${}_0^C D_t^{\alpha} f(t) = \dfrac{1}{\Gamma(1-\alpha)} \displaystyle\int_0^t \dfrac{f'(\xi)}{(t-\xi)^{\alpha}}\, d\xi$, $0 < \alpha < 1$.

If $f(t)$ has continuous bounded derivatives in $[0,T]$ for every $T > 0$, we can get
(3) ${}_0^C D_t^{\alpha} f(t) = \lim_{\xi \to 0,\; n\xi = t} \xi^{-\alpha} \displaystyle\sum_{i=0}^{n} (-1)^i \binom{\alpha}{i} f(t - i\xi) = \dfrac{f(0)\, t^{-\alpha}}{\Gamma(1-\alpha)} + \dfrac{1}{\Gamma(1-\alpha)} \displaystyle\int_0^t \dfrac{f'(\xi)}{(t-\xi)^{\alpha}}\, d\xi$.

Define $\tau = T/N$, $h = x_R/(M+1)$, $t_n = n\tau$, and $x_i = ih$ for $0 \le n \le N$, $0 \le i \le M+1$. Define $u_i^n$, $\varphi_i^n$, and $\phi_i$ as the numerical approximations to $u(x_i, t_n)$, $f(x_i, t_n)$, and $\phi(x_i)$. We can get
(4) $\left. {}_0^C D_t^{\alpha} u(x,t) \right|_{x_i, t_n} = \dfrac{1}{\tau \Gamma(1-\alpha)} \left[ b_0 u_i^n - \displaystyle\sum_{k=1}^{n-1} (b_{n-k-1} - b_{n-k}) u_i^k - b_{n-1} u_i^0 \right] + O(\tau^{2-\alpha})$,
where $1 \le i \le M$, $n \ge 1$, and
(5) $b_l = \dfrac{\tau^{1-\alpha}}{1-\alpha}\left[(l+1)^{1-\alpha} - l^{1-\alpha}\right]$, $l \ge 0$.
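Since the weights $b_l$ of (5) depend only on $\alpha$ and $\tau$, they can be tabulated once before time stepping begins. A minimal Python sketch (illustrative helper name, not the paper's code):

```python
def b_weights(alpha, tau, n):
    """Memory weights b_l of (5):
    b_l = tau^(1-alpha)/(1-alpha) * ((l+1)^(1-alpha) - l^(1-alpha))."""
    c = tau ** (1.0 - alpha) / (1.0 - alpha)
    return [c * ((l + 1) ** (1.0 - alpha) - l ** (1.0 - alpha)) for l in range(n)]
```

The weights are positive and strictly decreasing in $l$, which makes the differences $b_l - b_{l+1}$ appearing in the scheme positive as well.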

Using the explicit center difference scheme for $\partial^2 u(x,t)/\partial x^2$, we can get
(6) $\left. \dfrac{\partial^2 u(x,t)}{\partial x^2} \right|_{x_i, t_n} = \dfrac{1}{h^2}\left(u_{i+1}^{n-1} - 2u_i^{n-1} + u_{i-1}^{n-1}\right) + O(h^2)$.

The explicit finite difference approximation for (1) is
(7) $\dfrac{1}{\tau\Gamma(1-\alpha)}\left[b_0 u_i^n - \displaystyle\sum_{k=1}^{n-1}(b_{n-k-1}-b_{n-k})u_i^k - b_{n-1}u_i^0\right] + \mu u_i^n = \dfrac{u_{i+1}^{n-1} - 2u_i^{n-1} + u_{i-1}^{n-1}}{h^2} + K\varphi_i^n$.

Define $s = b_0 + \mu\tau\Gamma(1-\alpha)$, $a_1 = \tau\Gamma(1-\alpha)/(sh^2)$, $a_2 = K\tau\Gamma(1-\alpha)/s$, $U^n = (u_1^n, u_2^n, \ldots, u_M^n)^T$, $F^n = (\varphi_1^n, \varphi_2^n, \ldots, \varphi_M^n)^T$, and $r_l$ as
(8) $r_l = \dfrac{b_l - b_{l+1}}{s}$.

Equation (7) evolves as
(9) $U^n = \displaystyle\sum_{k=1}^{n-1} r_{n-1-k} U^k + \dfrac{b_{n-1}}{s} U^0 + A U^{n-1} + a_2 F^n$,
where $A$ is the tridiagonal matrix
(10) $A_{M\times M} = \begin{pmatrix} -2a_1 & a_1 & & \\ a_1 & -2a_1 & \ddots & \\ & \ddots & \ddots & a_1 \\ & & a_1 & -2a_1 \end{pmatrix}$.
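One time step of the recurrence (9) can be sketched serially in Python (illustrative code, not the paper's CUDA or Fortran implementation; zero Dirichlet boundary values are assumed, as in (1)):

```python
def step(U_hist, b, s, a1, a2, phi_n, n):
    """Compute U^n from U^0..U^{n-1} via (9), serially.
    U_hist[k] is the vector U^k; b holds the weights of (5);
    s, a1, a2 are the scalars defined before (8); phi_n holds
    the source values at time level n (K is folded into a2)."""
    M = len(U_hist[0])
    Un = [0.0] * M
    for i in range(M):
        # history sum: sum_{k=1}^{n-1} r_{n-1-k} u_i^k, r_l = (b_l - b_{l+1})/s
        acc = sum((b[n - 1 - k] - b[n - k]) / s * U_hist[k][i]
                  for k in range(1, n))
        # initial-value term (b_{n-1}/s) u_i^0
        acc += b[n - 1] / s * U_hist[0][i]
        # tridiagonal part A U^{n-1}, zero Dirichlet boundaries
        left = U_hist[n - 1][i - 1] if i > 0 else 0.0
        right = U_hist[n - 1][i + 1] if i < M - 1 else 0.0
        acc += a1 * (left - 2.0 * U_hist[n - 1][i] + right)
        # source term a_2 phi_i^n
        acc += a2 * phi_n[i]
        Un[i] = acc
    return Un
```

Note that the whole history $U^0, \ldots, U^{n-1}$ enters every step, which is exactly the memory effect that makes the time loop expensive.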

2.2. GPU Architecture and CUDA Programming Model

The architecture of a GPU is optimized for rendering real-time graphics, a computation and memory access intensive problem domain with enormous inherent parallelism. Unlike a CPU, a much larger portion of a GPU's resources is devoted to data processing rather than to caching or control flow. The NVIDIA Quadro FX 5800 GPU has 30 multiprocessor units, called streaming multiprocessors (SMs). Each SM contains 8 SIMD CUDA cores, and each CUDA core runs at 1.30 GHz. The multiprocessors create, manage, and execute concurrent threads in hardware with near-zero scheduling overhead. The single instruction multiple thread (SIMT) unit, which is akin to SIMD vector organizations, creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

3. Details of GPU Solution

The parallel solution consists of three parts. The first is preprocessing, which prepares the initial matrices, vectors, and so on. The second is the parallel solver, which performs the time-step iteration of (9). The third is postprocessing, which outputs the final results.

The preprocessing includes initialization of the parallel environment, distribution of the computing task, allocation of local memory space, and initialization of variables and arrays. The matrices $A_{M \times M}$ and $F_{M \times N}$ are prepared before the computation of (9); for example, matrix $A$ is constructed according to (10). The postprocessing is simple: the exact solution is evaluated, and the maximum absolute error between the exact and parallel solutions is computed and output. Both the exact and parallel solutions are saved to files for plotting. Other postprocessing operations include freeing memory space and stopping the parallel environment.

In order to get $U^n$, the right-hand side of (9) must be computed. It mainly involves one tridiagonal matrix-vector multiplication, many constant vector multiplications, and many vector-vector additions.

The tridiagonal matrix-vector multiplication is $A U^{n-1}$.

The constant vector multiplications are $V^k = r_{n-1-k} U^k$, $a_2 F^n$, and so on.

The vector-vector additions are $\sum_{k=1}^{n-1} V^k$ and so on.
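The three operation types above can be sketched as plain Python functions (a serial sketch with illustrative names, not the paper's CUDA kernels; the tridiagonal matrix is passed as its three diagonals):

```python
def trimvm(A3, x):
    """Tridiagonal matrix-vector product y = A x. A3 = (lower, main, upper)
    holds the three diagonals; zero Dirichlet boundaries outside the domain."""
    lower, main, upper = A3
    M = len(x)
    return [(lower[i] * x[i - 1] if i > 0 else 0.0)
            + main[i] * x[i]
            + (upper[i] * x[i + 1] if i < M - 1 else 0.0)
            for i in range(M)]

def cvm(c, x):
    """Constant vector multiplication c * x."""
    return [c * xi for xi in x]

def vva(x, y):
    """Vector-vector addition x + y."""
    return [xi + yi for xi, yi in zip(x, y)]
```

All three operations are elementwise or nearly elementwise, which is why they map naturally onto the data level parallelism of the GPU: one thread per vector element.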

The parallel solution uses the data level parallelism of the GPU architecture. The parallel solution with CUDA for (1) is described in Algorithm 1. The preprocessing involves lines 1 to 3. The parallel solver, which is the most time consuming procedure, involves lines 4 to 12. The postprocessing involves lines 13 to 14; additional operations are not shown in Algorithm 1. Because the time spent on preprocessing and postprocessing is trivial when the number of time steps is big enough, it is omitted from the measured time. The timestamps $T_1$ and $T_2$ are used to measure the runtime of the parallel solver.

Algorithm 1: Parallel solution for the Caputo fractional reaction-diffusion equation with CUDA.

(1) Init CUDA environment
(2) Allocate GPU global memory for A, U, F, r, b
(3) Init variables and arrays on GPU
(4) record time T_1
(5) call kernel initU0<<<M/BS, BS>>>()
(6) for n = 1 to N by Step 1 do
(7)     call kernel trimvmCU<<<M/BS, BS>>>()
(8)     call kernel cvm_vvaCU<<<M/BS, BS>>>()
(9)     for k = 1 to n by Step 1 do
(10)        call kernel cvmaCU<<<M/BS, BS>>>()
(11)    call kernel cvmCU<<<M/BS, BS>>>()
(12) record time T_2
(13) output T_2 - T_1 and U^N
(14) free GPU memory and stop CUDA environment

BS stands for the CUDA thread block size, and M/BS is the number of CUDA thread blocks. BS is a predefined constant such as 16, 32, or 64.

Besides the initialization of variables and arrays, there are five CUDA kernels. The first is initU0, which computes the initial condition according to $\phi(x)$ in (1). The second is trimvmCU, which performs the tridiagonal matrix-vector multiplication. The third is cvm_vvaCU, which performs a constant vector multiplication and a vector-vector addition. The fourth is cvmaCU, which also performs a constant vector multiplication and a vector-vector addition. The last is cvmCU, which performs a constant vector multiplication. The kernels initU0 and cvmCU are simple and will not be described in detail.

3.1. Implementation

The CUDA kernel cvmaCU for constant vector multiplication and vector-vector addition is shown in Algorithm 2. Algorithm 2 computes $U_{colOut} + a_2 U_{colIn}$ and saves the resulting vector into $U_{colOut}$.

Algorithm 2: CUDA kernel for constant vector multiplication and vector-vector addition.

input: colOut, colIn, a2, U
output: U
(2) local variables gid, ξ
(3) gid ← threadId.x + blockId.x · blockDim.x
(4) ξ ← a2 × U[M · colIn + gid]
(5) U[M · colOut + gid] ← U[M · colOut + gid] + ξ
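A serial Python equivalent of Algorithm 2 may clarify the indexing (illustrative sketch; the real kernel runs one GPU thread per index gid, while here a loop stands in for the thread grid):

```python
def cvma_cu_serial(U, M, col_out, col_in, c):
    """Serial equivalent of Algorithm 2: U[:, col_out] += c * U[:, col_in].
    U is stored flat, one column of length M per time level: U[M*col + i]."""
    for gid in range(M):          # one loop iteration per GPU thread
        xi = c * U[M * col_in + gid]
        U[M * col_out + gid] += xi
```

Each "thread" touches only its own element gid, so there are no data races and no synchronization is needed in this kernel.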

Most elements of the tridiagonal matrix $A$ are zero. The most common data structure used to store a sparse matrix for sparse matrix-vector multiplication is the compressed sparse row (CSR) format. Here the three diagonals of $A$ are stored compactly as
(11) $A_{3\times M} = \begin{pmatrix} 0 & a_1 & \cdots & a_1 & a_1 \\ -2a_1 & -2a_1 & \cdots & -2a_1 & -2a_1 \\ a_1 & a_1 & \cdots & a_1 & 0 \end{pmatrix}$.
So in the following parts of this paper, matrix $A$ stands for the format of (11), not the format of (10). With the format of (11), global memory accesses are coalesced, which improves the performance of global memory access.
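Building the $3 \times M$ storage of (11) as one flat row-major array matches the indexing A[row · M + gid] used later in Algorithm 3. A sketch (illustrative helper name):

```python
def build_A3(a1, M):
    """Flat row-major 3 x M diagonal storage of (11):
    row 0 = sub-diagonal (0 padded at the left),
    row 1 = main diagonal,
    row 2 = super-diagonal (0 padded at the right)."""
    row0 = [0.0] + [a1] * (M - 1)
    row1 = [-2.0 * a1] * M
    row2 = [a1] * (M - 1) + [0.0]
    return row0 + row1 + row2
```

With this layout, the threads of a warp (consecutive gid) read consecutive addresses within each row, which is what makes the accesses coalesce.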

The CUDA kernel trimvmCU for tridiagonal matrix-vector multiplication is shown in Algorithm 3. One thread block deals with the multiplication of one row of matrix $A$ and one column of $U$. Algorithm 3 computes $A U^{n-1}$ and accumulates the result into $U^n$. Shared memory is used to improve the memory access speed. The synchronization function __syncthreads() is used to ensure the correctness of the logic of the algorithm.

Algorithm 3: CUDA kernel for tridiagonal matrix-vector multiplication.

input: n, M, BS, A, U
output: U
(1) shared memory sm[BS + 2]
(3) local variables tid, gid, ξ
(4) tid ← threadId.x
(5) gid ← threadId.x + blockId.x · blockDim.x
(6) sm[tid + 1] ← U[M · (n - 1) + gid]
(7) if 0 == tid then
(8)     if 0 == gid then
(9)         sm[0] ← 0
(10)    else
(11)        sm[0] ← U[M · (n - 1) + gid - 1]
(12) if BS - 1 == tid then
(13)    if M == gid then
(14)        sm[BS + 1] ← 0
(15)    else
(16)        sm[BS + 1] ← U[M · (n - 1) + gid + 1]
(17) __syncthreads()
(18) ξ ← 0.0
(19) ξ ← ξ + sm[tid + 0] · A[0 · M + gid]
(20) ξ ← ξ + sm[tid + 1] · A[1 · M + gid]
(21) ξ ← ξ + sm[tid + 2] · A[2 · M + gid]
(22) U[M · n + gid] ← U[M · n + gid] + ξ

In Algorithm 3, each GPU thread deals with the multiplication of one row of the tridiagonal matrix $A$ and the vector $U^{n-1}$. Each thread needs three elements of $U^{n-1}$, and adjacent threads share two of them, so shared memory can be used to improve memory access performance. The elements of $U^{n-1}$ needed by the threads of a thread block are staged into shared memory in lines 6 to 16 of Algorithm 3. The actual tridiagonal multiplication is performed in lines 18 to 21. Finally, the results are stored into the global memory of $U^n$.
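The arithmetic of Algorithm 3 (lines 18 to 21, with the boundary handling of lines 6 to 16 folded into index checks) can be verified with a serial Python equivalent; this is a sketch, not the authors' kernel, and A is the flat $3 \times M$ storage of (11):

```python
def trimvm_cu_serial(n, M, A, U):
    """Serial equivalent of Algorithm 3: U[M*n + i] += (A u^{n-1})_i.
    A is the flat 3 x M diagonal storage of (11); U holds one column
    of length M per time level (U[M*level + i])."""
    for gid in range(M):                                       # one iteration per GPU thread
        left = U[M * (n - 1) + gid - 1] if gid > 0 else 0.0    # plays the role of sm[tid]
        mid = U[M * (n - 1) + gid]                             # sm[tid + 1]
        right = U[M * (n - 1) + gid + 1] if gid < M - 1 else 0.0  # sm[tid + 2]
        xi = left * A[0 * M + gid] + mid * A[1 * M + gid] + right * A[2 * M + gid]
        U[M * n + gid] += xi
```

The boundary padding of (11) (the zeros in rows 0 and 2) makes the out-of-range neighbors contribute nothing, exactly as the sm[0] ← 0 and sm[BS + 1] ← 0 branches do on the GPU.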

3.2. Optimization

In Algorithm 1, the kernels cvm_vvaCU, cvmCU, and trimvmCU are invoked $N$ times each. In time step $n$, kernel cvmaCU is invoked $n$ times ($n = 1, 2, \ldots, N$). Because
(12) $1 + 2 + \cdots + N = \dfrac{N(N+1)}{2}$,
the total number of invocations of kernel cvmaCU is $N(N+1)/2$. The most time consuming part of Algorithm 1 is the loop at line 9. This loop can be combined into one CUDA kernel, cvmaOptCU, as shown in Algorithm 4. The array gpuR holds the coefficients of (8) in global memory.

Algorithm 4: Optimized CUDA kernel for constant vector multiplication and vector-vector addition.

input: n, gpuR, U
output: U
(2) local variables gid, ξ, k
(3) gid ← threadId.x + blockId.x · blockDim.x
(4) ξ ← 0
(5) for k = 1 to n - 1 by Step 1 do
(6)     ξ ← ξ + gpuR[k] · U[M · (n - 1 - k) + gid]
(7) U[M · n + gid] ← U[M · n + gid] + ξ

So the optimized parallel solution for Caputo fractional reaction-diffusion equation is similar to Algorithm 1 except that the loop (lines 9-10) in Algorithm 1 is replaced with the optimized CUDA kernel of Algorithm 4.
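The equivalence between the repeated cvmaCU invocations of Algorithm 1 and the fused loop of Algorithm 4 can be illustrated serially (hypothetical helper names; R plays the role of gpuR, and loops over gid stand in for the thread grid):

```python
def cvma_loop(n, M, R, U):
    """n-1 separate cvmaCU-style calls: for each k, one full pass over
    global memory, U[:, n] += R[k] * U[:, n-1-k]."""
    for k in range(1, n):
        for gid in range(M):
            U[M * n + gid] += R[k] * U[M * (n - 1 - k) + gid]

def cvma_opt(n, M, R, U):
    """Fused kernel of Algorithm 4: each thread accumulates the whole
    history sum in a register (xi) before a single global-memory update."""
    for gid in range(M):
        xi = 0.0
        for k in range(1, n):
            xi += R[k] * U[M * (n - 1 - k) + gid]
        U[M * n + gid] += xi
```

Both variants compute the same sum; the optimized one replaces $n-1$ kernel launches and $n-1$ read-modify-write passes over $U^n$ with one launch and one write, which is where the observed speedup plausibly comes from.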

4. Experimental Results 4.1. Experiment Platforms

The experiment platforms consist of one GPU and one CPU, listed in Table 1. For a fair comparison, we measure the performance of the GPU code against the MPI code running on the multicore CPU. Both codes use double precision floating point operations. The CPU code is compiled by the mpif90 compiler with level three optimization. The GPU code is compiled by the NVCC compiler provided by CUDA version 3.0, also with level three optimization.

Table 1: Technical specifications of the experiment platforms.

CPU: Intel Xeon E5540, 4 cores, 2.53 GHz
GPU: NVIDIA Quadro FX 5800, 240 SPs, 1.30 GHz
Operating system: Kylin server version 3.1
CPU compiler: mpif90, Intel Fortran version 11.1
GPU compiler: NVCC, CUDA version 3.0
Communication: MPICH2 version 1.3rc2
4.2. Numerical Example

The following Caputo fractional reaction-diffusion equation was considered:
(13) ${}_0D_t^{\alpha} u(x,t) + \mu u(x,t) = \dfrac{\partial^2 u(x,t)}{\partial x^2} + K f(x,t)$, $0 < \alpha < 1$,
$u(x,0) = 0$, $x \in (0,2)$,
$u(0,t) = u(2,t) = 0$,
with $\mu = 1$, $K = 1$, and
(14) $f(x,t) = \dfrac{2}{\Gamma(2.3)}\, x(2-x)\, t^{1.3} + x(2-x)\, t^2 + 2t^2$.
The exact solution of (13) is
(15) $u(x,t) = x(2-x)\, t^2$.
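The whole scheme can be checked against the exact solution (15) with a serial Python sketch of (7)/(9). This is illustrative code, not the paper's implementation; $\alpha = 0.7$ is an assumption here (it is the value for which the Caputo derivative of $t^2$ produces the $t^{1.3}$ term in (14)), and the grid sizes are smaller than the paper's benchmarks:

```python
import math

def solve(alpha, mu, K, xR, T, M, N, f, u0):
    """Serial sketch of the explicit scheme (7)/(9).
    Returns the grid points and U^N at the M interior points."""
    tau, h = T / N, xR / (M + 1)
    g = tau * math.gamma(1.0 - alpha)
    b = [tau ** (1 - alpha) / (1 - alpha)
         * ((l + 1) ** (1 - alpha) - l ** (1 - alpha)) for l in range(N)]
    s = b[0] + mu * g
    a1, a2 = g / (s * h * h), K * g / s
    x = [(i + 1) * h for i in range(M)]
    hist = [[u0(xi) for xi in x]]                # hist[k] = U^k
    for n in range(1, N + 1):
        prev, Un = hist[n - 1], [0.0] * M
        for i in range(M):
            acc = b[n - 1] / s * hist[0][i]       # (b_{n-1}/s) u_i^0
            for k in range(1, n):                 # history sum
                acc += (b[n - 1 - k] - b[n - k]) / s * hist[k][i]
            left = prev[i - 1] if i > 0 else 0.0
            right = prev[i + 1] if i < M - 1 else 0.0
            acc += a1 * (left - 2.0 * prev[i] + right) + a2 * f(x[i], n * tau)
            Un[i] = acc
        hist.append(Un)
    return x, hist[N]

# Illustrative run against the exact solution u = x(2-x)t^2 of (15)
f = lambda x, t: (2.0 / math.gamma(2.3) * x * (2.0 - x) * t ** 1.3
                  + x * (2.0 - x) * t ** 2 + 2.0 * t ** 2)
x, U = solve(0.7, 1.0, 1.0, 2.0, 0.3, 15, 512, f, lambda x: 0.0)
err = max(abs(U[i] - x[i] * (2.0 - x[i]) * 0.3 ** 2) for i in range(15))
```

The maximum absolute error err should be small (the spatial error vanishes for this quadratic-in-x exact solution, leaving only the $O(\tau^{2-\alpha})$ temporal error).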

4.3. Accuracy of the GPU Implementation

The GPU solution compares well with the exact solution of the Caputo fractional reaction-diffusion equation for the test case (13), as shown in Figure 1. The $\tau$ and $h$ for the GPU solution are $T/2048$ and $2.0/16$. The maximum absolute errors for $T = 0.3$, $0.5$, and $0.7$ are $3.29 \times 10^{-5}$, $1.07 \times 10^{-4}$, and $2.30 \times 10^{-4}$. In fact, the accuracy and convergence of the GPU solution are the same as those of the serial and parallel MPI solutions on CPU.

Figure 1: Comparison of the exact solution and the parallel GPU solution at $T = 0.3, 0.5, 0.7$.

4.4. Total Performance Improvement

In this section, the performance of the optimized GPU solution presented in this paper is compared with that of the parallel CPU solution. The parallel CPU solution makes full use of the four cores of the E5540. The optimized GPU solution is described in Section 3.2.

For fixed N = 128 , the performance comparison between GPU and multicore CPU is shown in Table 2. The thread block size is 32.

Table 2: Performance comparison between the optimized GPU solution on the Quadro FX 5800 and the parallel CPU solution on the E5540 with fixed N = 128.

M CPU time (s) GPU time (s) Speedup
245760 0.58 0.46 1.26
491520 1.23 0.90 1.36
737280 1.95 1.27 1.54
983040 3.27 1.89 1.73
1228800 4.66 2.06 2.26

For fixed M = 122880 , the performance comparison between optimized GPU solution and parallel CPU solution is shown in Table 3. The thread block size is 32.

Table 3: Performance comparison between the optimized GPU solution on the Quadro FX 5800 and the parallel CPU solution on the E5540 with fixed M = 122880.

N CPU time (s) GPU time (s) Speedup
128 0.29 0.25 1.14
256 1.06 0.84 1.26
512 3.96 2.82 1.40
1024 15.20 9.33 1.63
2048 60.16 33.59 1.79
4.5. Performance Issues of GPU Solution

With M = 491520, N = 128, and thread block size 64, the runtime of Algorithm 1 on the Quadro FX 5800 is 1.228 seconds. Without the loop at line 9 in Algorithm 1, the runtime is only 0.032 seconds. That is to say, about 97.39% of the runtime is spent in the loop at line 9. This is why we develop the optimized GPU solution with the optimized CUDA kernel of Algorithm 4.

The impact of the optimized CUDA kernel on constant vector multiplication and vector-vector addition with fixed N = 128 is shown in Table 4. The performance improvement with fixed M = 491520 is shown in Table 5. All thread block sizes are 64. The basic GPU solution is Algorithm 1 and the optimized GPU solution uses the optimized CUDA kernel of Algorithm 4.

Table 4: Performance improvement for fixed N = 128 with the optimization of constant vector multiplication and vector-vector addition.

M Original (s) Optimized (s) Speedup
245760 0.64 0.53 1.21
491520 1.26 1.05 1.20
737280 1.85 1.56 1.19
983040 2.53 2.10 1.20
1228800 3.06 2.59 1.18

Table 5: Performance improvement for fixed M = 491520 with the optimization of constant vector multiplication and vector-vector addition.

N Original (s) Optimized (s) Speedup
256 0.72 0.53 1.36
512 2.86 1.75 1.63
1024 11.36 5.60 2.03
2048 45.32 18.65 2.43
4096 181.01 66.71 2.71

The thread block size (BS) is a key parameter for parallel GPU algorithms. The impact of BS is shown in Table 6. From Table 6, we can see that thread block size 32 is the best choice.

Table 6: Impact of thread block size (BS).

BS Runtime (s) BS Runtime (s)
4 13.5 64 3.35
8 6.84 128 3.61
16 3.59 256 3.55
32 2.82 512 3.47
5. Conclusions and Future Work

In this paper, the numerical solution of the Caputo fractional reaction-diffusion equation with an explicit finite difference method is accelerated on GPU. The iteration of each time step consists of a tridiagonal matrix-vector multiplication, constant vector multiplications, and vector-vector additions. The details of the GPU solution and some basic CUDA kernels are presented. The most time consuming loop (constant vector multiplication and vector-vector addition) is optimized. The experimental results show that the GPU solution compares well with the exact analytic solution and is much faster than the parallel CPU solution. The power of parallel computing on GPU for solving fractional applications should therefore be recognized.

As part of future work, first, the stability, such as Ulam's type stability, of different fractional equations deserves attention. Second, parallelizing implicit numerical methods for fractional differential equations remains challenging.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research work is supported in part by the National Natural Science Foundation of China under Grants no. 11175253 and no. 61170083, in part by Specialized Research Fund for the Doctoral Program of Higher Education under Grant no. 20114307110001, and in part by 973 Program of China under Grant no. 61312701001. The authors would like to thank the anonymous reviewers for their helpful comments as well.

References

1. F. Huang and F. Liu, "The time fractional diffusion equation and the advection-dispersion equation," The ANZIAM Journal, vol. 46, no. 3, pp. 317-330, 2005.
2. S. Chen and X. Jiang, "Analytical solutions to time-fractional partial differential equations in a two-dimensional multilayer annulus," Physica A: Statistical Mechanics and Its Applications, vol. 391, no. 15, pp. 3865-3874, 2012.
3. R. K. Pandey, O. P. Singh, and V. K. Baranwal, "An analytic algorithm for the space-time fractional advection-dispersion equation," Computer Physics Communications, vol. 182, no. 5, pp. 1134-1144, 2011.
4. R. W. Ibrahim, "Ulam-Hyers stability for Cauchy fractional differential equation in the unit disk," Abstract and Applied Analysis, vol. 2012, Article ID 613270, 2012.
5. J. Brzdęk, N. Brillouët-Belluot, K. Ciepliński, and B. Xu, "Ulam's type stability," Abstract and Applied Analysis, vol. 2012, Article ID 329702, 2012.
6. N. Brillouët-Belluot, J. Brzdęk, and K. Ciepliński, "On some recent developments in Ulam's type stability," Abstract and Applied Analysis, vol. 2012, Article ID 716936, 2012.
7. Q. Liu, F. Liu, I. Turner, and V. Anh, "Numerical simulation for the 3D seepage flow with fractional derivatives in porous media," IMA Journal of Applied Mathematics, vol. 74, no. 2, pp. 201-229, 2009.
8. A. Ashyralyev and Z. Cakir, "On the numerical solution of fractional parabolic partial differential equations with the Dirichlet condition," Discrete Dynamics in Nature and Society, vol. 2012, Article ID 696179, 2012.
9. A. Ashyralyev and F. Dal, "Finite difference and iteration methods for fractional hyperbolic partial differential equations with the Neumann condition," Discrete Dynamics in Nature and Society, vol. 2012, Article ID 434976, 2012.
10. C. Li, F. Zeng, and F. Liu, "Spectral approximations to the fractional integral and derivative," Fractional Calculus and Applied Analysis, vol. 15, no. 3, pp. 383-406, 2012.
11. J. H. Chen, "An implicit approximation for the Caputo fractional reaction-dispersion equation," Journal of Xiamen University (Natural Science), vol. 46, no. 5, pp. 616-619, 2007.
12. B. I. Henry and S. L. Wearne, "Fractional reaction-diffusion," Physica A: Statistical Mechanics and Its Applications, vol. 276, no. 3-4, pp. 448-455, 2000.
13. V. Gafiychuk, B. Datsko, and V. Meleshko, "Mathematical modeling of time fractional reaction-diffusion systems," Journal of Computational and Applied Mathematics, vol. 220, no. 1-2, pp. 215-225, 2008.
14. H. J. Haubold, A. M. Mathai, and R. K. Saxena, "Further solutions of fractional reaction-diffusion equations in terms of the H-function," Journal of Computational and Applied Mathematics, vol. 235, no. 5, pp. 1311-1316, 2011.
15. S. Z. Rida, A. M. A. El-Sayed, and A. A. M. Arafa, "On the solutions of time-fractional reaction-diffusion equations," Communications in Nonlinear Science and Numerical Simulation, vol. 15, no. 12, pp. 3847-3854, 2010.
16. R. K. Saxena, A. M. Mathai, and H. J. Haubold, "Solution of generalized fractional reaction-diffusion equations," Astrophysics and Space Science, vol. 305, no. 3, pp. 305-313, 2006.
17. S. J. Pennycook, S. D. Hammond, G. R. Mudalige, S. A. Wright, and S. A. Jarvis, "On the acceleration of wavefront applications using distributed many-core architectures," The Computer Journal, vol. 55, no. 2, pp. 138-153, 2012.
18. Z. Mo, A. Zhang, X. Cao, Q. Liu, X. Xu, H. An, W. Pei, and S. Zhu, "JASMIN: a parallel software infrastructure for scientific computing," Frontiers of Computer Science in China, vol. 4, no. 4, pp. 480-488, 2010.
19. R. Zhang, "A three-stage optimization algorithm for the stochastic parallel machine scheduling problem with adjustable production rates," Discrete Dynamics in Nature and Society, vol. 2013, Article ID 280560, 2013.
20. X.-J. Yang, X.-K. Liao, K. Lu, Q.-F. Hu, J.-Q. Song, and J.-S. Su, "The TianHe-1A supercomputer: its hardware and software," Journal of Computer Science and Technology, vol. 26, no. 3, pp. 344-351, 2011.
21. Y.-X. Wang, L.-L. Zhang, W. Liu, Y.-G. Che, C.-F. Xu, Z.-H. Wang, and Y. Zhuang, "Efficient parallel implementation of large scale 3D structured grid CFD applications on the Tianhe-1A supercomputer," Computers and Fluids, vol. 80, pp. 244-250, 2013.
22. C. Xu, X. Deng, L. Zhang, Y. Jiang, W. Cao, J. Fang, Y. Che, Y. Wang, and W. Liu, "Parallelizing a high-order CFD software for 3D, multi-block, structural grids on the TianHe-1A supercomputer," in Supercomputing, vol. 7905 of Lecture Notes in Computer Science, pp. 26-39, Springer, Heidelberg, Germany, 2013.
23. NVIDIA Corporation, CUDA Programming Guide Version 3.1, 2010.
24. C. Gong, J. Liu, L. Chi, H. Huang, J. Fang, and Z. Gong, "GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method," Journal of Computational Physics, vol. 230, no. 15, pp. 6010-6022, 2011.
25. C. Gong, J. Liu, H. Huang, and Z. Gong, "Particle transport with unstructured grid on GPU," Computer Physics Communications, vol. 183, no. 3, pp. 588-593, 2012.
26. Q. Wu, C. Yang, T. Tang, and L. Xiao, "Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system," Journal of Parallel and Distributed Computing, vol. 73, no. 12, pp. 1592-1604, 2013.
27. I. Podlubny, Fractional Differential Equations, vol. 198 of Mathematics in Science and Engineering, Academic Press, San Diego, Calif, USA, 1999.
28. Y. Xu and Z. He, "The short memory principle for solving Abel differential equation of fractional order," Computers & Mathematics with Applications, vol. 62, no. 12, pp. 4796-4805, 2011.
29. C. Gong, W. Bao, and G. Tang, "A parallel algorithm for the Riesz fractional reaction-diffusion equation with explicit finite difference method," Fractional Calculus and Applied Analysis, vol. 16, no. 3, pp. 654-669, 2013.
30. K. Diethelm, "An efficient parallel algorithm for the numerical solution of fractional differential equations," Fractional Calculus and Applied Analysis, vol. 14, no. 3, pp. 475-490, 2011.
31. C. Gong, W. Bao, G. Tang, B. Yang, and J. Liu, "An efficient parallel solution for Caputo fractional reaction-diffusion equation," The Journal of Supercomputing, 2014.
32. C. Gong, W. Bao, G. Tang, Y. Jiang, and J. Liu, "A parallel algorithm for the two dimensional time fractional diffusion equation with implicit difference method," The Scientific World Journal, vol. 2014, Article ID 219580, 2014.
33. C. Gong, W. Bao, G. Tang, Y. Jiang, and J. Liu, "A domain decomposition method for time fractional reaction-diffusion equation," The Scientific World Journal, vol. 2014, Article ID 681707, 2014.
34. S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms," Parallel Computing, vol. 35, no. 3, pp. 178-194, 2009.
35. A. Fošner, "On the generalized Hyers-Ulam stability of module left (m, n)-derivations," Aequationes Mathematicae, vol. 84, no. 1-2, pp. 91-98, 2012.
36. D. Popa, "Hyers-Ulam stability of the linear recurrence with constant coefficients," Advances in Difference Equations, vol. 2005, no. 2, pp. 101-107, 2005.
37. R. P. Agarwal, B. Xu, and W. Zhang, "Stability of functional equations in single variable," Journal of Mathematical Analysis and Applications, vol. 288, no. 2, pp. 852-869, 2003.