A Parallel Algorithm for the Two-Dimensional Time Fractional Diffusion Equation with Implicit Difference Method

Chunye Gong (1, 2, 3), Weimin Bao (1, 2), Guojian Tang (1), Yuewen Jiang (4), and Jie Liu (3)

(1) College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
(2) Science and Technology on Space Physics Laboratory, Beijing 100076, China
(3) School of Computer Science, National University of Defense Technology, Changsha 410073, China
(4) Department of Engineering Science, University of Oxford, Oxford OX2 0ES, UK

ORCID (Chunye Gong): http://orcid.org/0000-0003-0349-1100

Research Article. The Scientific World Journal (ISSN 1537-744X), Hindawi Publishing Corporation, Volume 2014, Article ID 219580. DOI: 10.1155/2014/219580

Academic Editors: F. Liu, A. Sikorskii, and S. B. Yuste

Received 9 January 2014; Accepted 6 February 2014; Published 12 March 2014

Copyright © 2014 Chunye Gong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Solving fractional differential equations is very time consuming. The computational complexity of the two-dimensional time fractional diffusion equation (2D-TFDE) with an iterative implicit finite difference method is $O(M_x M_y N^2)$. In this paper, we present a parallel algorithm for the 2D-TFDE and discuss it in depth. A task distribution model and a data layout with virtual boundaries are designed for this parallel algorithm. The experimental results show that the solution of the parallel algorithm agrees well with the exact solution. The parallel algorithm on a single Intel Xeon E5540 CPU runs 3.16–4.17 times faster than the serial algorithm on a single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We believe that parallel computing will become a basic method for computationally intensive fractional applications in the near future.

1. Introduction

Building fractional mathematical models for specific phenomena and developing numerical or analytical solutions for these models have been very active research topics in recent years. Fractional diffusion equations have been used to represent different kinds of dynamical systems. However, fractional applications remain rare. One reason is that the computational cost of approximating fractional equations is too heavy. The idea of fractional derivatives dates back to the 17th century. A fractional differential equation is an equation that involves fractional derivatives. Fractional equations provide a powerful instrument for describing the memory and hereditary properties of different substances.

A wide variety of numerical methods have been proposed for fractional equations [2, 3], for example, the finite difference method, the finite element method [8, 9], the spectral method [10, 11], and meshless techniques. Zhuang and Liu presented an implicit difference approximation for the two-dimensional time fractional diffusion equation (2D-TFDE) on a finite domain and discussed the stability and convergence of the method. The numerical result of an example agrees well with their theoretical analysis. Tadjeran and Meerschaert presented a numerical method that combines the alternating directions implicit (ADI) approach with a Crank-Nicolson discretization and a Richardson extrapolation to obtain an unconditionally stable, second-order accurate finite difference method for approximating a two-dimensional fractional diffusion equation. Two ADI schemes based on the $L_1$ approximation and the backward Euler method have been considered for the two-dimensional fractional subdiffusion equation.

Numerically solving fractional differential equations is very time consuming for high spatial dimensions or long time integration. The short memory principle and parallel computing [16, 17] can be used to overcome this difficulty. Parallel computing solves computation intensive applications by working on many parts of the problem simultaneously. Large scale applications in science and engineering such as particle transport, different linear and nonlinear systems, nonnumerical intelligent algorithms, and computational fluid dynamics can rely on parallel computing. Diethelm implemented the fractional version of the second-order Adams-Bashforth-Moulton method on a parallel computer and discussed the precise nature of the parallelization concept. This is the first attempt at parallel computing for fractional equations. Following that, Gong et al. presented a parallel algorithm for the one-dimensional Riesz space fractional diffusion equation with an explicit finite difference method. The numerical solution of Riesz space fractional equations has global dependence on grid points, which means that the approximation at a grid point depends on the approximations at all grid points in one time step. The numerical solution of time fractional equations has global dependence on time steps, which means that the approximation at a grid point depends on the approximations at that grid point in all previous time steps. Global dependence reflects the nonlocal property of fractional derivatives in time or space. Explicit methods are easy to parallelize but are restricted by their stability conditions. Implicit methods are hard to solve by Gaussian elimination and often rely on iterative schemes. To date, the power of parallel computing for high dimensional and time fractional differential equations has not been explored.

This paper focuses on the two-dimensional time fractional diffusion equation studied by Zhuang and Liu:

$$
\begin{aligned}
\frac{\partial^{\alpha} u(x,y,t)}{\partial t^{\alpha}} &= a(x,y,t)\frac{\partial^{2} u(x,y,t)}{\partial x^{2}} + b(x,y,t)\frac{\partial^{2} u(x,y,t)}{\partial y^{2}} + f(x,y,t), \\
u(x,y,0) &= \phi(x,y), \quad (x,y) \in \Omega, \\
u(x,y,t)\big|_{\partial\Omega} &= 0, \quad t \in [0,T],
\end{aligned}
\tag{1}
$$

where $\Omega = \{(x,y) \mid 0 \le x \le L_x,\ 0 \le y \le L_y\}$, $a(x,y,t) > 0$, and $b(x,y,t) > 0$. The fractional derivative is in the Caputo form.

2. Background: Numerical Solution

The fractional derivative of $f(t)$ in the Caputo sense is defined as

$$
\frac{\partial^{\alpha} f(t)}{\partial t^{\alpha}} = \frac{1}{\Gamma(1-\alpha)} \int_{0}^{t} \frac{f'(\xi)}{(t-\xi)^{\alpha}}\, d\xi \quad (0 < \alpha < 1).
\tag{2}
$$

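If $f(t)$ has continuous bounded derivatives in $[0,T]$ for every $T > 0$, we can get

$$
\frac{\partial^{\alpha} f(t)}{\partial t^{\alpha}} = \lim_{\xi \to 0,\ n\xi = t} \xi^{-\alpha} \sum_{i=0}^{n} (-1)^{i} \binom{\alpha}{i} f(t - i\xi) = \frac{f(0)\, t^{-\alpha}}{\Gamma(1-\alpha)} + \frac{1}{\Gamma(1-\alpha)} \int_{0}^{t} \frac{f'(\xi)}{(t-\xi)^{\alpha}}\, d\xi.
\tag{3}
$$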

Define $\tau = T/N$, $h_x = L_x/M_x$, $h_y = L_y/M_y$, $t_n = n\tau$, $x_i = i h_x$, and $y_j = j h_y$, for $0 \le n \le N$, $0 \le i \le M_x$, and $0 \le j \le M_y$. Let $u_{i,j}^{n}$, $f_{i,j}^{n}$, $\phi_{i,j}$, $a_{i,j}^{n}$, and $b_{i,j}^{n}$ be the numerical approximations to $u(x_i, y_j, t_n)$, $f(x_i, y_j, t_n)$, $\phi(x_i, y_j)$, $a(x_i, y_j, t_n)$, and $b(x_i, y_j, t_n)$. We can get the implicit approximating scheme for (1):

$$
\begin{aligned}
u_{i,j}^{n+1} - u_{i,j}^{n} + \sum_{s=1}^{n} b_s \left( u_{i,j}^{n+1-s} - u_{i,j}^{n-s} \right)
&= \mu_1 \Gamma(2-\alpha)\, a_{i,j}^{n+1} \left( u_{i+1,j}^{n+1} - 2u_{i,j}^{n+1} + u_{i-1,j}^{n+1} \right) \\
&\quad + \mu_2 \Gamma(2-\alpha)\, b_{i,j}^{n+1} \left( u_{i,j+1}^{n+1} - 2u_{i,j}^{n+1} + u_{i,j-1}^{n+1} \right) \\
&\quad + \tau^{\alpha} \Gamma(2-\alpha)\, f_{i,j}^{n+1},
\end{aligned}
\tag{4}
$$

where $b_s = (s+1)^{1-\alpha} - s^{1-\alpha}$ ($s = 0, 1, 2, \ldots, N$), $\mu_1 = \tau^{\alpha}/h_x^{2}$, and $\mu_2 = \tau^{\alpha}/h_y^{2}$. Here $h_x$ and $h_y$ are the step sizes along the X and Y directions defined above.
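The time discretization in (4) can be checked in isolation. The following is a minimal sketch, not the authors' code, that computes the weights $b_s$ and applies the same weighted sum of backward differences to $f(t) = t^2$, whose Caputo derivative $2t^{2-\alpha}/\Gamma(3-\alpha)$ is known in closed form. The function names `l1_weights` and `caputo_l1` are hypothetical.

```python
import math


def l1_weights(alpha, n):
    """Return b_s = (s+1)^(1-alpha) - s^(1-alpha) for s = 0..n."""
    return [(s + 1) ** (1 - alpha) - s ** (1 - alpha) for s in range(n + 1)]


def caputo_l1(f, alpha, t, n):
    """Approximate the Caputo derivative of f at time t with n uniform steps.

    Uses the weighted sum of backward differences that appears on the
    left-hand side of scheme (4), divided by tau^alpha * Gamma(2 - alpha).
    """
    tau = t / n
    b = l1_weights(alpha, n - 1)
    total = sum(b[s] * (f((n - s) * tau) - f((n - s - 1) * tau)) for s in range(n))
    return total / (tau ** alpha * math.gamma(2 - alpha))


alpha = 0.4
approx = caputo_l1(lambda t: t * t, alpha, 1.0, 200)
exact = 2.0 / math.gamma(3 - alpha)  # Caputo derivative of t^2 evaluated at t = 1
print(approx, exact)
```

The two printed values should agree closely, and the gap is expected to shrink as the number of steps grows.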

3. Parallel Algorithm

3.1. Analysis

Let $c_1 = c_1(i,j,n) = \mu_1 \Gamma(2-\alpha)\, a_{i,j}^{n+1}$ and $c_2 = c_2(i,j,n) = \mu_2 \Gamma(2-\alpha)\, b_{i,j}^{n+1}$; then (4) can be rewritten as

$$
\begin{aligned}
&-c_1 \left( u_{i+1,j}^{n+1} + u_{i-1,j}^{n+1} \right) + (1 + 2c_1 + 2c_2)\, u_{i,j}^{n+1} - c_2 \left( u_{i,j+1}^{n+1} + u_{i,j-1}^{n+1} \right) \\
&\qquad = u_{i,j}^{n} - \sum_{s=1}^{n} b_s u_{i,j}^{n+1-s} + \sum_{s=1}^{n} b_s u_{i,j}^{n-s} + \tau^{\alpha} \Gamma(2-\alpha)\, f_{i,j}^{n+1}.
\end{aligned}
\tag{5}
$$

Explicit schemes are conditionally stable and need a very small $\tau$ for high dimensional problems, for both classical and fractional equations. Implicit schemes are unconditionally stable but require the inverse of the coefficient matrix. Sometimes the sparse coefficient matrix is too large for a direct method to be practical. So, an iterative method can be used to avoid the matrix inverse:

$$
\begin{aligned}
u_{i,j}^{n+1,k+1} = \frac{1}{1 + 2c_1 + 2c_2} \Bigg(
& c_1 \left( u_{i+1,j}^{n+1,k} + u_{i-1,j}^{n+1,k} \right) + c_2 \left( u_{i,j+1}^{n+1,k} + u_{i,j-1}^{n+1,k} \right) \\
& + u_{i,j}^{n} - \sum_{s=1}^{n} b_s u_{i,j}^{n+1-s} + \sum_{s=1}^{n} b_s u_{i,j}^{n-s} + \tau^{\alpha} \Gamma(2-\alpha)\, f_{i,j}^{n+1} \Bigg),
\end{aligned}
\tag{6}
$$

iterated until $\Delta u = \left| u_{i,j}^{n+1,k+1} - u_{i,j}^{n+1,k} \right|$ is smaller than a predefined threshold $\epsilon$. The values $u_{0 \ldots M_x,\, 0 \ldots M_y}^{n+1,k+1}$ are the iterative variables, and $u_{0 \ldots M_x,\, 0 \ldots M_y}^{n}$ are the known variables for the unknown time step $n+1$.

Solving the 2D-TFDE with the iterative method of (6) is very time consuming. For given $N$, $M_x$, and $M_y$, and assuming $K$ iterations per time step on average, there are about $M_x M_y (N^2/2 + 1.5N + 6KN)$ arithmetic and logical operations, ignoring the computation of the coefficients. So, the computational complexity is $O(M_x M_y N^2)$, which is much heavier than the $O(M_x M_y N)$ of classical integer order 2D partial differential equations.
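To make the quadratic growth in $N$ concrete, the operation count above can be evaluated directly. This is a small illustrative sketch; the function name `operation_count` and the sample sizes are assumptions chosen only for the illustration.

```python
def operation_count(mx, my, n_steps, k_iters):
    """Approximate operation count M_x * M_y * (N^2/2 + 1.5*N + 6*K*N) from the text."""
    return mx * my * (n_steps ** 2 / 2 + 1.5 * n_steps + 6 * k_iters * n_steps)


# Doubling the number of time steps for a fixed grid and fixed K.
base = operation_count(1024, 1024, 1000, 3)
doubled = operation_count(1024, 1024, 2000, 3)
ratio = doubled / base
print(ratio)
```

For large $N$ the ratio approaches 4, which is the signature of the $N^2$ term; for a classical scheme with $O(M_x M_y N)$ cost, the same doubling would only double the work.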

Besides the heavy computational cost, the memory requirement is the other problem. Because each unknown time step uses the values of all the previous time steps, all the values $u_{0 \ldots M_x,\, 0 \ldots M_y}^{0 \ldots N}$ need to be stored in memory. The memory complexity is therefore $O(M_x M_y N)$, which is far bigger than the $O(M_x M_y)$ of classical integer order 2D partial differential equations.

The computation of (6) can be divided into two parts.

$\mathrm{Part1}_{i,j} = u_{i,j}^{n} - \sum_{s=1}^{n} b_s u_{i,j}^{n+1-s} + \sum_{s=1}^{n} b_s u_{i,j}^{n-s} + \tau^{\alpha} \Gamma(2-\alpha)\, f_{i,j}^{n+1}$. Through $\mathrm{Part1}_{i,j}$, the unknown value $u_{i,j}^{n+1,k+1}$ of grid point $P_{i,j}$ at time step $n+1$ relies on the values of grid point $P_{i,j}$ at all previous time steps.

$\mathrm{Part2}_{i,j} = c_1 \left( u_{i+1,j}^{n+1,k} + u_{i-1,j}^{n+1,k} \right) + c_2 \left( u_{i,j+1}^{n+1,k} + u_{i,j-1}^{n+1,k} \right)$. Through $\mathrm{Part2}_{i,j}$, the unknown value $u_{i,j}^{n+1,k+1}$ of grid point $P_{i,j}$ relies on the values of $P_{i+1,j}$, $P_{i-1,j}$, $P_{i,j+1}$, and $P_{i,j-1}$.

The data dependence of the 2D-TFDE is shown in Figure 1. $u_{i,j}^{n+1}$ relies on the neighboring grid points at the same time step and on the same position at all the previous time steps.

Figure 1: The data dependence of the 2D-TFDE for grid point $P_{i,j}$ at time step $t_{n+1}$.

3.2. Task Distribution Model and Data Layout

On distributed memory systems, the task distribution of the total computation should be designed so that the computation is as efficient as possible. There are three main issues in choosing a task distribution model for these computations:

(1) load balance: the computations should be split reasonably evenly among all computing processors/processes throughout the time stepping;

(2) less communication: the task distribution model should keep the communication among different computing processes as low as possible;

(3) convenient programming: the parallel algorithm based on the task distribution model should not change the serial algorithm too much.

Paying attention to these issues is aimed at achieving high execution efficiency and high scalability of the parallel algorithm for the 2D-TFDE on distributed memory systems.

Refer to (6). The $\mathrm{Part1}_{i,j}$ computation has no data dependence among grid points, whereas the $\mathrm{Part2}_{i,j}$ computation has data dependence among neighboring grid points. There are mainly two kinds of task distribution models. The first one is one-dimensional distribution (ODD): splitting the domain of all grid points evenly along the X or the Y direction. The task distribution model of the parallel algorithm for the one-dimensional Riesz space fractional equation is ODD. A parallel algorithm based on ODD does not change the serial algorithm much, and the load balance is guaranteed. However, if the task is divided along the X direction and $M_y$ is very big, the communication will limit the scalability of the parallel algorithm. The second one is two-dimensional distribution (TDD): splitting the domain of all grid points evenly along both the X and the Y directions. The computing processes then have a two-dimensional grid layout, with process id $(p_i, p_j)$, $0 \le p_i \le P_x$, and $0 \le p_j \le P_y$, where $P_x$ and $P_y$ are the dimension sizes of the process grid. The task distribution with TDD is shown in Figure 2.

Figure 2: The two-dimensional task distribution model for the 2D-TFDE.

With TDD, the data layout is described in Figure 3. Each subdomain, owned by one process, has at most four virtual boundaries to receive the boundary data from its nearest neighbors. The virtual boundaries are shown with dotted lines. The process $(p_x, P_y - 1)$ with $0 < p_x < P_x$ has four virtual boundaries. The process $(p_x, P_y)$ has only three virtual boundaries since there is no process on its right-hand side. A virtual boundary may have several layers of grid points, depending on the spatial discretization scheme. In this paper, a virtual boundary has only one layer of grid points with (4). In every iteration of (6), the processes exchange the data near the virtual boundaries shown in Figure 3. After the exchange, every process performs its own computation according to (6).

Figure 3: Data layout for the 2D-TFDE.
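The role of the virtual boundaries can be illustrated without MPI by emulating the process grid inside one Python process. The sketch below is an illustration under stated assumptions, not the authors' implementation: it splits a square grid into blocks with one-layer virtual boundaries, copies neighbor edges into them before each Jacobi sweep, and compares the result with a sweep over the undivided grid. The names `jacobi_sweep`, `exchange`, and `compare` are hypothetical, the coefficients `c1`, `c2` are constant, the right-hand side is arbitrary test data, and processes are indexed from 0 to `n_px - 1` (a count-based convention).

```python
def jacobi_sweep(v, rhs, c1, c2):
    """One Jacobi update on the interior of a padded 2D list; the outer layer is only read."""
    nx, ny = len(v) - 2, len(v[0]) - 2
    new = [row[:] for row in v]
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            new[i][j] = (c1 * (v[i + 1][j] + v[i - 1][j])
                         + c2 * (v[i][j + 1] + v[i][j - 1])
                         + rhs[i][j]) / (1 + 2 * c1 + 2 * c2)
    return new


def exchange(blocks, n_px, n_py, m):
    """Fill the virtual boundaries of every block from its nearest neighbours."""
    for (px, py), blk in blocks.items():
        if px + 1 < n_px:  # neighbour with larger x index
            blk[m + 1][1:m + 1] = blocks[(px + 1, py)][1][1:m + 1]
        if px > 0:         # neighbour with smaller x index
            blk[0][1:m + 1] = blocks[(px - 1, py)][m][1:m + 1]
        for i in range(1, m + 1):
            if py + 1 < n_py:
                blk[i][m + 1] = blocks[(px, py + 1)][i][1]
            if py > 0:
                blk[i][0] = blocks[(px, py - 1)][i][m]


def compare(M=8, n_px=2, n_py=2, sweeps=5, c1=0.3, c2=0.2):
    """Return the largest difference between the decomposed and the global iteration."""
    m = M // n_px  # square blocks are assumed: M / n_px == M / n_py
    rhs_g = [[float((7 * i + 3 * j) % 5) if 1 <= i <= M and 1 <= j <= M else 0.0
              for j in range(M + 2)] for i in range(M + 2)]
    v_g = [[0.0] * (M + 2) for _ in range(M + 2)]
    blocks, rhs_l = {}, {}
    for px in range(n_px):
        for py in range(n_py):
            blocks[(px, py)] = [[0.0] * (m + 2) for _ in range(m + 2)]
            rhs_l[(px, py)] = [[rhs_g[px * m + i][py * m + j] for j in range(m + 2)]
                               for i in range(m + 2)]
    for _ in range(sweeps):
        exchange(blocks, n_px, n_py, m)
        blocks = {key: jacobi_sweep(blk, rhs_l[key], c1, c2) for key, blk in blocks.items()}
        v_g = jacobi_sweep(v_g, rhs_g, c1, c2)
    return max(abs(blocks[(px, py)][i][j] - v_g[px * m + i][py * m + j])
               for px in range(n_px) for py in range(n_py)
               for i in range(1, m + 1) for j in range(1, m + 1))


max_diff = compare()
print(max_diff)
```

Because the Jacobi update only reads values from the previous iteration, exchanging a single layer of grid points per sweep is enough for the decomposed iteration to reproduce the global one.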

3.3. Implementation

The parallel algorithm for the 2D-TFDE uses process level parallelism, which is a kind of task level parallelism. The parallel algorithm for (1) is described in Algorithm 1.

Algorithm 1: Parallel algorithm for the 2D-TFDE.

(1) init parallel environment

(2) for all MPI processes do in parallel

(3)    get the input parameters $M_x$, $M_y$, $N$, $P_x$, $P_y$, $\epsilon_0$

(4)    allocate local memory for $u$, $c_1$, $c_2$, $f$, Part1, $v$, and so forth

(5)    init variables and arrays

(6)    get process id $(p_x, p_y)$

(7)    compute the initial condition $u^0$ with $\phi(x,y)$ and the boundary condition

(8)    record time $T_1$

(9)    for $n = 0$ to $N - 1$ do

(10)      compute $c_1$, $c_2$, $f$, and so forth

(11)      $v_{i,j} \leftarrow u_{i,j}^{n}$ with $I_{i,j}$

(12)      $f_{i,j} \leftarrow f(x_i, y_j, (n+1)\tau)$ with $I_{i,j}$

(13)      $\mathrm{Part1}_{i,j} \leftarrow u_{i,j}^{n} + \tau^{\alpha} \Gamma(2-\alpha) f_{i,j}^{n+1}$ with $I_{i,j}$

(14)      for $s = 1$ to $n$ do

(15)         $\mathrm{Part1}_{i,j} \leftarrow \mathrm{Part1}_{i,j} - b_s u_{i,j}^{n+1-s} + b_s u_{i,j}^{n-s}$ with $I_{i,j}$

(16)      while $\epsilon \ge \epsilon_0$ do

(17)         $u_{i,j}^{n+1} \leftarrow \left( c_1 (v_{i+1,j} + v_{i-1,j}) + c_2 (v_{i,j+1} + v_{i,j-1}) + \mathrm{Part1}_{i,j} \right) / (1 + 2c_1 + 2c_2)$ with $I_{i,j}$

(18)         if $p_x < P_x$ then

(19)            send right boundary to its right neighbor

(20)            receive left boundary of its right neighbor

(21)         if $p_y < P_y$ then

(22)            send top boundary to its top neighbor

(23)            receive bottom boundary of its top neighbor

(24)         if $p_x > 0$ then

(25)            send left boundary to its left neighbor

(26)            receive right boundary of its left neighbor

(27)         if $p_y > 0$ then

(28)            send bottom boundary to its bottom neighbor

(29)            receive top boundary of its bottom neighbor

(30)         $\epsilon \leftarrow \max |v - u^{n+1}|$ with $I_{i,j}$

(31)         get global maximum of $\epsilon$ of all processes

(32)         $v_{i,j} \leftarrow u_{i,j}^{n+1}$ with $I_{i,j}$

(33)    record time $T_2$

(34)    output $T_2 - T_1$

(35) stop parallel environment

Each process allocates only its local memory. Assuming that $M_x$ and $M_y$ are divisible by $P_x$ and $P_y$, respectively, a process with four virtual boundaries allocates $(M_x/P_x + 2)(M_y/P_y + 2) N$ values for array $u$. The calculation of the process id has three steps:

(1) get the MPI global id $\mathrm{ID}$;

(2) $p_y = \lfloor \mathrm{ID} / P_x \rfloor$ (integer division);

(3) $p_x = \mathrm{ID} - P_x\, p_y$.
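The three steps above, together with the local storage estimate, can be sketched as follows. The function names are hypothetical, and `px_count` is taken to be the number of processes along X, so that ranks are numbered row by row.

```python
def process_coords(rank, px_count):
    """Map a global MPI rank to (p_x, p_y) with the two formulas from the text."""
    p_y = rank // px_count          # integer division
    p_x = rank - px_count * p_y
    return p_x, p_y


def local_array_shape(mx, my, px_count, py_count, n_steps):
    """Shape of the local array u with one-layer virtual boundaries on all four sides."""
    return (mx // px_count + 2, my // py_count + 2, n_steps)


coords = [process_coords(rank, 3) for rank in range(6)]
shape = local_array_shape(16650, 16650, 3, 3, 10)
print(coords)
print(shape)
```

In an actual MPI program the rank would come from the communicator (for example `MPI_Comm_rank`), which is outside the scope of this sketch.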

The computations of $c_1(i,j)$, $c_2(i,j)$, $f_{i,j}$, and so forth depend on the particular functions of the coefficient and source terms. Performing these computations once in every time step is a good choice. If they are performed outside the main loop (lines 9–32), a lot of memory space is required. If they are performed inside the "while" loop (lines 16–32), it is too time consuming. $u^0$ stands for the zero time step $u_{i,j}^{0}$ and $v$ stands for $v_{i,j}$. $I_{i,j}$ means the iteration over $1 \le i \le M_x/P_x$, $1 \le j \le M_y/P_y$. If a process has neighbors, it exchanges the boundary data with them. The received boundary data are stored in the designated virtual boundaries. Lines 3–7 of Algorithm 1 are the preprocessing for the parallel algorithm. Lines 9–32 are the main time marching loop. $T_1$ and $T_2$ are used to record the execution time.

4. Experimental Results and Discussion

The experiment platform is a cluster with distributed memory system (DSM) architecture. One computing node consists of two Intel Xeon E5540 CPUs. The specifications of the cluster are listed in Table 1. The code uses double precision floating point arithmetic and is compiled by the mpif90 compiler with level three optimization (-O3). To make the runtimes easy to compare, the number of inner loop iterations (lines 16–32 of Algorithm 1) is fixed at 3.

Table 1: Technical specifications of the experiment platform.

| Item | Specification |
|---|---|
| CPU | Intel Xeon E5540, 4 cores, 2.53 GHz |
| Operating system | Kylin server version 3.1 |
| Compiler | mpif90, Intel Fortran, version 11.1 |
| Communication | MPICH2, version 1.3rc2 |

4.1. Numerical Example and Convergence of the Parallel Algorithm

The following time fractional ($\alpha = 0.4$) differential equation was considered:

$$
\begin{aligned}
\frac{\partial^{0.4} u(x,y,t)}{\partial t^{0.4}} &= \frac{2 t^{1.6}}{\pi^{2} \Gamma(0.6)} \frac{\partial^{2} u(x,y,t)}{\partial x^{2}} + \frac{t^{1.6}}{12 \pi^{2} \Gamma(0.6)} \frac{\partial^{2} u(x,y,t)}{\partial y^{2}} + f(x,y,t), \\
u(x,y,0) &= \sin(\pi x) \sin(\pi y), \quad (x,y) \in \Omega, \\
u(x,y,t)\big|_{\partial\Omega} &= 0, \quad t \in [0,T],
\end{aligned}
\tag{7}
$$

where $f(x,y,t) = \dfrac{25 t^{1.6}}{12 \Gamma(0.6)} (t^{2} + 2) \sin(\pi x) \sin(\pi y)$, $\Omega = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$, and $\partial\Omega$ is the boundary of $\Omega$. The exact solution of the above equation is $u(x,y,t) = (t^{2} + 1) \sin(\pi x) \sin(\pi y)$.
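That the stated exact solution satisfies this test problem can be checked pointwise. The sketch below is an independent check rather than part of the original work. It assumes the two diffusion coefficients are $2t^{1.6}/(\pi^2\Gamma(0.6))$ and $t^{1.6}/(12\pi^2\Gamma(0.6))$, that is, with a factor $1/\pi^2$; with that reading the residual vanishes. It uses the closed form Caputo derivative of $t^2$, namely $2t^{2-\alpha}/\Gamma(3-\alpha)$. The function name `residual` is hypothetical.

```python
import math


def residual(x, y, t, alpha=0.4):
    """Left-hand side minus right-hand side of the test problem for the exact solution."""
    s = math.sin(math.pi * x) * math.sin(math.pi * y)
    g06 = math.gamma(1 - alpha)  # Gamma(0.6) for alpha = 0.4
    # Caputo derivative of (t^2 + 1): the constant vanishes, t^2 gives 2 t^(2-alpha) / Gamma(3-alpha)
    lhs = 2.0 * t ** (2 - alpha) / math.gamma(3 - alpha) * s
    a = 2.0 * t ** 1.6 / (math.pi ** 2 * g06)          # assumed coefficient of u_xx
    b = t ** 1.6 / (12.0 * math.pi ** 2 * g06)         # assumed coefficient of u_yy
    u_xx = -math.pi ** 2 * (t * t + 1.0) * s
    u_yy = u_xx                                        # same value for this separable solution
    f = 25.0 * t ** 1.6 / (12.0 * g06) * (t * t + 2.0) * s
    return lhs - (a * u_xx + b * u_yy + f)


worst = max(abs(residual(x, y, t))
            for x in (0.1, 0.3, 0.5) for y in (0.2, 0.7) for t in (0.25, 0.8, 1.0))
print(worst)
```

The printed value should be at the level of floating point rounding.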

The computational results for different $\alpha$ at $t = 1.0$ and $y = 0.5$ are shown in Figure 4. Figure 4 shows that the order $\alpha$ of the fractional time derivative governs the value of the unknown $u$. As $\alpha$ increases to 1, (1) approaches the classical PDE. Figure 5 shows the numerical solution with $\alpha = 0.4$, $t = 1.0$.

Figure 4: The numerical approximation whose transport is governed by the TFDE (7) for $\alpha = 0.2, 0.4, 0.6, 0.8$ when $y = 0.5$, $t = 1.0$.

Figure 5: The approximate solution of (7) when $\alpha = 0.4$ and $t = 1.0$.

The solution of the parallel algorithm agrees well with the exact analytic solution of the fractional partial differential equation in the test case (7) with $\alpha = 0.4$, as shown in Figure 6. The time step $\Delta t$ and the spatial step $h$ are $1.0/100$ and $1.0/10$. The maximum absolute error is $8.36 \times 10^{-3}$.

Figure 6: Comparison of the exact solution with the solution of the parallel algorithm at time $t = 1.0$.
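A serial reference version of scheme (6) for test problem (7) is short enough to write out. The following is a minimal pure Python sketch, not the authors' Fortran/MPI code: it iterates until the update falls below a tolerance instead of using a fixed number of inner iterations, it assumes the coefficients of (7) carry the factor $1/\pi^2$, and the names `solve_tfde`, `tol`, and `max_iter` are hypothetical.

```python
import math


def solve_tfde(M=10, N=100, T=1.0, alpha=0.4, tol=1e-10, max_iter=10000):
    """Serial Jacobi solver for scheme (6) on the unit square; returns the max error at t = T."""
    h, tau = 1.0 / M, T / N
    g06 = math.gamma(0.6)
    gam = math.gamma(2 - alpha)
    mu = tau ** alpha / h ** 2  # mu_1 = mu_2 because h_x = h_y
    b = [(s + 1) ** (1 - alpha) - s ** (1 - alpha) for s in range(N + 1)]
    sinx = [math.sin(math.pi * i * h) for i in range(M + 1)]
    # u[n][i][j]: every time level is kept because of the memory term
    u = [[[sinx[i] * sinx[j] for j in range(M + 1)] for i in range(M + 1)]]
    for n in range(N):
        t = (n + 1) * tau
        c1 = mu * gam * 2.0 * t ** 1.6 / (math.pi ** 2 * g06)
        c2 = mu * gam * t ** 1.6 / (12.0 * math.pi ** 2 * g06)
        fcoef = 25.0 * t ** 1.6 / (12.0 * g06) * (t * t + 2.0)
        # Part 1: previous level, memory sum over all earlier levels, and source term
        part1 = [[0.0] * (M + 1) for _ in range(M + 1)]
        for i in range(1, M):
            for j in range(1, M):
                p = u[n][i][j] + tau ** alpha * gam * fcoef * sinx[i] * sinx[j]
                for s in range(1, n + 1):
                    p -= b[s] * (u[n + 1 - s][i][j] - u[n - s][i][j])
                part1[i][j] = p
        # Part 2: Jacobi iterations using neighbours from the previous iterate
        v = [row[:] for row in u[n]]
        for _ in range(max_iter):
            new = [row[:] for row in v]
            eps = 0.0
            for i in range(1, M):
                for j in range(1, M):
                    new[i][j] = (c1 * (v[i + 1][j] + v[i - 1][j])
                                 + c2 * (v[i][j + 1] + v[i][j - 1])
                                 + part1[i][j]) / (1 + 2 * c1 + 2 * c2)
                    eps = max(eps, abs(new[i][j] - v[i][j]))
            v = new
            if eps < tol:
                break
        u.append(v)
    return max(abs(u[N][i][j] - (T * T + 1.0) * sinx[i] * sinx[j])
               for i in range(M + 1) for j in range(M + 1))


max_err = solve_tfde()
print(max_err)
```

With the same step sizes as in the text ($\Delta t = 1/100$, $h = 1/10$), the printed maximum error should be small compared with the solution amplitude of 2 at $t = 1$. The triple loop over `n`, `s`, and the grid is where the $O(M_x M_y N^2)$ cost comes from.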

4.2. Performance Improvement

For fixed $N = 10$, the performance comparison between a single process and four processes (single CPU) is shown in Figure 7. The number of steps along X in (6) is $M$, which is the x-coordinate of Figure 7. $M = M_x = M_y$ ranges from 2048 to 10240. With $M = 2048$, the runtime of one process is 23.45 seconds and the runtime of four processes is 6.64 seconds. The speedup is 3.53. With $M = 10240$, the runtime of one process is 803.88 seconds and the runtime of four processes is 192.76 seconds. The speedup is 4.17. From Figure 7, the parallel algorithm with fixed $N = 10$ is 3.53 to 4.17 times faster than the serial algorithm.

Figure 7: Performance comparison between one process and four processes on E5540 with fixed $N = 10$.

For fixed $M = M_x = M_y = 2560$, the performance comparison between a single process and four processes is shown in Figure 8. For a single process, the number of steps along X and Y is 2560. For four processes, the number of steps along X and Y per process is 1280 with $P_x = 2$, $P_y = 2$. $N$ ranges from 16 to 512. With $N = 16$, the runtime of one process is 17.63 seconds and the runtime of four processes is 4.65 seconds. The speedup is 3.79. With $N = 512$, the runtime of one process is 4415.78 seconds and the runtime of four processes is 1394.99 seconds. The speedup is 3.16. With $M = 2560$, four processes are therefore about 3.2 to 3.8 times faster than a single process.

Figure 8: Performance comparison between one process and four processes on E5540 with fixed $M$.
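The speedups quoted in this section follow directly from the reported runtimes. The short sketch below recomputes them, together with the per-core efficiency on four cores; the dictionary keys are descriptive labels chosen here, and only the runtimes come from the text.

```python
def speedup(t_serial, t_parallel):
    """Ratio of the one-process runtime to the four-process runtime."""
    return t_serial / t_parallel


# (one-process runtime, four-process runtime) in seconds, as reported in the text
runs = {
    "fixed N = 10, smallest M": (23.45, 6.64),
    "fixed N = 10, M = 10240": (803.88, 192.76),
    "fixed M = 2560, N = 16": (17.63, 4.65),
    "fixed M = 2560, N = 512": (4415.78, 1394.99),
}

speedups = {label: speedup(t1, t4) for label, (t1, t4) in runs.items()}
for label, s in speedups.items():
    print(f"{label}: speedup {s:.3f}, efficiency on 4 cores {s / 4:.3f}")
```

A speedup slightly above 4 on four cores, as in the largest fixed-$N$ run, is usually attributed to cache effects when each process works on a smaller subdomain; the text itself does not give a cause.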

4.3. Scalability

The scalability of the parallel algorithm on the large scale cluster system is shown in Figure 9. The technical specifications of the cluster system are listed in Table 1. $N$ is fixed at 10 in all cases. Each process has the same $(M_x/P_x, M_y/P_y)$ with $M = M_x = M_y$ and $P_x = P_y$. $M$ takes the values 16650, 33300, and 49950 for 9, 36, and 81 processes, respectively. The runtime of 9 processes is 83.02 seconds and the runtime of 81 processes is 94.08 seconds. The parallel efficiency of 81 processes is 88.24% compared with 9 processes. Here, the parallel efficiency is defined as the ratio of the runtime with the smaller number of processes to the runtime with the larger number of processes, with the same workload on each process (weak scaling).

Figure 9: Scalability of the parallel algorithm on the cluster system.
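The weak scaling figures can be recomputed from the numbers in this section. The sketch below checks that every configuration gives the same subdomain size per process and evaluates the efficiency as the ratio of the two runtimes; the function names are hypothetical.

```python
import math


def subdomain_size(m, n_processes):
    """Grid points per direction on each process for a square process grid."""
    side = math.isqrt(n_processes)
    return m // side


def weak_scaling_efficiency(t_base, t_scaled):
    """Runtime with fewer processes divided by runtime with more processes, same load each."""
    return t_base / t_scaled


sizes = [subdomain_size(m, p) for m, p in ((16650, 9), (33300, 36), (49950, 81))]
efficiency = weak_scaling_efficiency(83.02, 94.08)
print(sizes, round(100 * efficiency, 2))
```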

4.4. Discussion

Algorithm 1 has good parallel scalability on distributed memory systems. From Figure 3, we can see that each subdomain has only one virtual boundary in each direction (top, bottom, left, and right). Assuming that the size of the subdomain is $M_a \times M_b$ ($M_a > 0$, $M_b > 0$), each inner iteration (line 16 of Algorithm 1) has about $8 M_a M_b$ arithmetic operations with $1/(1 + 2c_1 + 2c_2)$ precomputed. It needs to establish 8 communications with neighbors, in addition to the global communication for $\epsilon$. The number of arithmetic operations in each time step outside the inner iteration is $K M_a M_b$, where $K$ is bigger than $4n$ because of the summation over the previous time steps. The communicated data amount to $4M_a + 4M_b + 1$ grid points. Assuming that one arithmetic operation takes time $t_a$ and there are $L$ inner iterations, the computing time of each time step is $(K + 8L) M_a M_b t_a$. Assume that $t_b$ is the time to establish a communication, $t_c$ is the transfer time for one grid point, and $t_d$ is the global communication time. Then the total communication time for a time step is $L(9t_b + 4M_a t_c + 4M_b t_c + t_d)$. The communication/computation ratio $\beta$ is

$$
\beta = \frac{L \left( 9t_b + 4M_a t_c + 4M_b t_c + t_d \right)}{(K + 8L)\, M_a M_b\, t_a}.
\tag{8}
$$

The computation time grows with the product $M_a M_b$, whereas the communication time grows only with the sum of $M_a$ and $M_b$. The limit of $\beta$ is

$$
\lim_{M_a, M_b \to \infty} \beta
= \lim_{M_a \to \infty} \left( \lim_{M_b \to \infty} \frac{L \left( 9t_b + 4M_a t_c + 4M_b t_c + t_d \right)}{(K + 8L)\, M_a M_b\, t_a} \right)
= \lim_{M_a \to \infty} \frac{4 L t_c}{(K + 8L)\, M_a\, t_a} = 0.
\tag{9}
$$

That means we can enhance the parallel efficiency by enlarging the size of the subdomain.
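The behavior of the communication/computation ratio can be illustrated numerically. In the sketch below, the timing constants are purely hypothetical placeholders and are not measurements from the paper; `t_a` converts the operation count into time, and setting `t_a = 1` gives the ratio of communication time to operation count.

```python
def comm_comp_ratio(ma, mb, n_inner, k_ops, t_a, t_b, t_c, t_d):
    """Communication time divided by computation time for one time step."""
    comm = n_inner * (9 * t_b + 4 * ma * t_c + 4 * mb * t_c + t_d)
    comp = (k_ops + 8 * n_inner) * ma * mb * t_a
    return comm / comp


# Hypothetical timings in seconds: arithmetic, message setup, per-point transfer, global reduction
timings = dict(t_a=1e-9, t_b=1e-6, t_c=1e-8, t_d=1e-5)
ratios = [comm_comp_ratio(m, m, n_inner=3, k_ops=40, **timings) for m in (100, 1000, 10000)]
print(ratios)
```

The ratio drops as the subdomain grows, in line with the limit in (9): the perimeter to area ratio of the subdomain shrinks.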

The time $t$ and the number of grid points affect the convergence property. The exact solution of (7) shows that $u(0.5, 0.5, t) = t^{2} + 1$.

The bigger $t$ becomes, the more inner iterations are needed. With $M = M_x = M_y = 5$ and $N = M^{2}$, the first time step $t_1$ needs 5 Jacobi iterations and the last time step $t_N$ needs 31 iterations for $T = 1.0$. For $T = 2.0$, $t_1$ needs 7 iterations and $t_N$ needs 61.

The bigger $M$ becomes, the more inner iterations are needed. $T$ is fixed as 1.0. For $M = 10$, $t_1$ becomes 6 and $t_N$ becomes 66. For $M = 10$, $t_1$ becomes 3 and $t_N$ becomes 136.

The reason for the phenomenon above is that $\Delta u = u^{n+1} - u^{n}$ changes dramatically if the source term $f(x,y,t)$ is big. The iteration times with $L = 1.0$, $M = 15$, $N = M^{2}$ are shown in Table 2.

Table 2: Impact of the source term on iteration times.

| $f(x,y,t)$ | $T = 2.0$ | $T = 3.0$ |
|---|---|---|
| $\dfrac{25 t^{1.6}}{12 \Gamma(0.6)} (t^{2} + 2) \sin(\pi x) \sin(\pi y)$ | 284 | 444 |
| $\dfrac{25}{12 \Gamma(0.6)} \cdot 2 \sin(\pi x) \sin(\pi y)$ | 253 | 361 |
| $\dfrac{25}{12 \Gamma(0.6)} \sin(\pi x) \sin(\pi y)$ | 245 | 348 |
| $\dfrac{1.0}{\Gamma(0.6)} \sin(\pi x) \sin(\pi y)$ | 238 | 336 |

The parallel algorithm is compatible with the short memory principle. The computing time $(K + 8L) M_a M_b t_a$ becomes smaller with a smaller $K$, which is determined by $n$. The Gauss-Seidel iteration method converges faster than the Jacobi iteration method, but it is hard to parallelize.
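One way to read the short memory principle in this setting is to truncate the memory sum in Part 1 to the most recent time levels, which caps $K$ regardless of $n$. The sketch below shows that reading for a single grid point; it is an illustration only, since the text does not specify how the principle would be applied, and the name `memory_term` and the `window` parameter are hypothetical.

```python
def memory_term(history, b, window=None):
    """Sum over s = 1..s_max of b_s * (u^(n+1-s) - u^(n-s)) for one grid point.

    history holds [u^0, ..., u^n]; with window=None the full sum (s_max = n) is used,
    otherwise only the most recent `window` levels contribute.
    """
    n = len(history) - 1
    s_max = n if window is None else min(n, window)
    return sum(b[s] * (history[n + 1 - s] - history[n - s]) for s in range(1, s_max + 1))


alpha = 0.4
n_levels = 100
b = [(s + 1) ** (1 - alpha) - s ** (1 - alpha) for s in range(n_levels + 1)]
history = [(k * 0.01) ** 2 + 1.0 for k in range(n_levels + 1)]  # samples of t^2 + 1

full = memory_term(history, b)
truncated = memory_term(history, b, window=10)
print(full, truncated)
```

Truncation trades accuracy for cost and memory: the discarded terms are not zero, so the window length would have to be chosen against an error tolerance.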

As analyzed in Section 3.1, the computational complexity is $O(M_x M_y N^{2})$. Define the following function:

$$
w = \log_{2}(T_2 - T_1).
\tag{10}
$$

$w$ varies almost linearly, as shown in Figure 10. Figure 10 shows that the heavy computation is a real challenge from the point of view of computer science.

Figure 10: The linear variation of $w$.
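If $w$ is read against $\log_2 N$ (an assumption, since the horizontal axis of the figure is not stated in the text), the slope of the line is the empirical order of the runtime in $N$. The sketch below estimates it from the two single-process runtimes reported in Section 4.2 for $M = 2560$; the function name `empirical_order` is hypothetical.

```python
import math


def empirical_order(n1, t1, n2, t2):
    """Slope of log2(runtime) against log2(N) between two measurements."""
    return (math.log2(t2) - math.log2(t1)) / (math.log2(n2) - math.log2(n1))


# Single-process runtimes in seconds for N = 16 and N = 512 with M = 2560
order = empirical_order(16, 17.63, 512, 4415.78)
print(order)
```

A slope between 1 and 2 is consistent with the operation count $M_x M_y (N^2/2 + 1.5N + 6KN)$: the linear terms dominate for small $N$ and the quadratic term takes over as $N$ grows.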

The heavy memory usage is the other challenge besides the heavy computation. Ignoring the memory usage of the coefficients and the source term $f_{i,j}^{n}$, storing $u_{i,j}^{n}$ in double precision needs $8 M_x M_y N$ bytes of memory. With $M_x = 10240$, $M_y = 10240$, and $N = 1024$, this is about $8.6 \times 10^{11}$ bytes (800 GiB). As discussed above, the bigger $M_x$ and $M_y$ are, the smaller the communication/computation ratio $\beta$ is. However, the memory available on each node bounds the subdomain size, so the heavy memory usage limits the parallel efficiency of the parallel algorithm. Trade-offs of this kind exist in many places. One is the easy parallelization but slow convergence of the Jacobi iterative method. Another is the hard parallelization but good convergence of the Gauss-Seidel iterative method.
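The storage estimate $8 M_x M_y N$ is easy to evaluate for any problem size. The sketch below does so for the sizes quoted above and also divides the total evenly over a number of processes; the function name `history_bytes` and the choice of 81 processes for the illustration are assumptions, and virtual boundaries are ignored.

```python
def history_bytes(mx, my, n_steps, bytes_per_value=8):
    """Bytes needed to keep every time level of u in double precision."""
    return bytes_per_value * mx * my * n_steps


total = history_bytes(10240, 10240, 1024)
gib = total / 2 ** 30
per_process_gib = gib / 81  # even split over 81 processes
print(total, gib, per_process_gib)
```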

5. Conclusions and Future Work

In this paper, we present a parallel algorithm for the 2D-TFDE with an implicit difference method. The parallel solution is analyzed and implemented with the MPI programming model. The experimental results show that the solution of the parallel algorithm agrees well with the exact solution and that the algorithm scales well on a large scale distributed memory cluster system. So, the power of parallel computing for time consuming fractional differential equations should be recognized.

The numerical solution of fractional equations is very computationally intensive. Several directions remain for future work. First, the numerical solution of high dimensional space fractional equations has global reliance on almost all grid points, which is very challenging for real applications. Second, Krylov subspace methods with preconditioners would enhance the convergence for (4) and deserve attention. Third, accelerating the parallel algorithm on heterogeneous systems also deserves attention.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research work is supported by the National Natural Science Foundation of China under Grant no. 11175253 and by the 973 Program of China under Grant no. 61312701001. The authors would also like to thank the anonymous reviewers for their helpful comments.

References

[1] Magin R. L., Fractional Calculus in Bioengineering, Begell House, Redding, Calif, USA, 2006.
[2] Liu F., Turner I., Anh V., Yang Q., Burrage K., A numerical method for the fractional Fitzhugh-Nagumo monodomain model, ANZIAM Journal, vol. 54, pp. C608–C629, 2013.
[3] Ding H., Li C., Mixed spline function method for reaction-diffusion equations, Journal of Computational Physics, vol. 242, pp. 103–123, 2013. doi: 10.1016/j.jcp.2013.02.014
[4] Zhuang P., Liu F., Finite difference approximation for two-dimensional time fractional diffusion equation, Journal of Algorithms & Computational Technology, vol. 1, no. 1, pp. 1–15, 2007. doi: 10.1260/174830107780122667
[5] Yuste S. B., Acedo L., An explicit finite difference method and a new von Neumann-type stability analysis for fractional diffusion equations, SIAM Journal on Numerical Analysis, vol. 42, no. 5, pp. 1862–1874, 2005. doi: 10.1137/030602666
[6] Liu F., Zhuang P., Anh V., Turner I., Burrage K., Stability and convergence of the difference methods for the space-time fractional advection-diffusion equation, Applied Mathematics and Computation, vol. 191, no. 1, pp. 12–20, 2007. doi: 10.1016/j.amc.2006.08.162
[7] Yuste S. B., Quintana-Murillo J., A finite difference method with non-uniform timesteps for fractional diffusion equations, Computer Physics Communications, vol. 183, no. 12, pp. 2594–2600, 2012. doi: 10.1016/j.cpc.2012.07.011
[8] Zhang X., Huang P., Feng X., Wei L., Finite element method for two-dimensional time-fractional tricomi-type equations, Numerical Methods for Partial Differential Equations, vol. 29, no. 4, pp. 1081–1096, 2013. doi: 10.1002/num.21745
[9] Agrawal O. P., A general finite element formulation for fractional variational problems, Journal of Mathematical Analysis and Applications, vol. 337, no. 1, pp. 1–12, 2008. doi: 10.1016/j.jmaa.2007.03.105
[10] Li C., Zeng F., Liu F., Spectral approximations to the fractional integral and derivative, Fractional Calculus and Applied Analysis, vol. 15, no. 3, pp. 383–406, 2012. doi: 10.2478/s13540-012-0028-x
[11] Leonenko N. N., Meerschaert M. M., Sikorskii A., Fractional pearson diffusions, Journal of Mathematical Analysis and Applications, vol. 403, no. 2, pp. 532–546, 2013. doi: 10.1016/j.jmaa.2013.02.046
[12] Zhuang P., Gu Y. T., Liu F., Turner I., Yarlagadda P. K. D. V., Time-dependent fractional advection-diffusion equations by an implicit MLS meshless method, International Journal for Numerical Methods in Engineering, vol. 88, no. 13, pp. 1346–1362, 2011. doi: 10.1002/nme.3223
[13] Tadjeran C., Meerschaert M. M., A second-order accurate numerical method for the two-dimensional fractional diffusion equation, Journal of Computational Physics, vol. 220, no. 2, pp. 813–823, 2007. doi: 10.1016/j.jcp.2006.05.030
[14] Zhang Y.-N., Sun Z.-Z., Alternating direction implicit schemes for the two-dimensional fractional sub-diffusion equation, Journal of Computational Physics, vol. 230, no. 24, pp. 8713–8728, 2011. doi: 10.1016/j.jcp.2011.08.020
[15] Podlubny I., Fractional Differential Equations, Academic Press, San Diego, Calif, USA, 1999.
[16] Gong C., Bao W., Tang G., A parallel algorithm for the Riesz fractional reaction-diffusion equation with explicit finite difference method, Fractional Calculus and Applied Analysis, vol. 16, no. 3, pp. 654–669, 2013. doi: 10.2478/s13540-013-0041-8
[17] Diethelm K., An efficient parallel algorithm for the numerical solution of fractional differential equations, Fractional Calculus and Applied Analysis, vol. 14, no. 3, pp. 475–490, 2011. doi: 10.2478/s13540-011-0029-1
[18] Yan J., Tan G.-M., Sun N.-H., Optimizing parallel $S_n$ sweeps on unstructured grids for multi-core clusters, Journal of Computer Science and Technology, vol. 28, no. 4, pp. 657–670, 2013. doi: 10.1007/s11390-013-1366-9
[19] Mo Z., Zhang A., Cao X., Liu Q., Xu X., An H., Pei W., Zhu S., JASMIN: a parallel software infrastructure for scientific computing, Frontiers of Computer Science in China, vol. 4, no. 4, pp. 480–488, 2010. doi: 10.1007/s11704-010-0120-5
[20] Chen F., Shen J., A GPU parallelized spectral method for elliptic equations in rectangular domains, Journal of Computational Physics, vol. 250, pp. 555–564, 2013. doi: 10.1016/j.jcp.2013.05.031
[21] Haga J. B., Osnes H., Langtangen H. P., A parallel block preconditioner for large-scale poroelasticity with highly heterogeneous material parameters, Computational Geosciences, vol. 16, no. 3, pp. 723–734, 2012. doi: 10.1007/s10596-012-9284-4
[22] Gong C., Liu J., Chi L., Huang H., Fang J., Gong Z., GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method, Journal of Computational Physics, vol. 230, no. 15, pp. 6010–6022, 2011. doi: 10.1016/j.jcp.2011.04.010
[23] Talamo A., Numerical solution of the time dependent neutron transport equation by the method of the characteristics, Journal of Computational Physics, vol. 240, pp. 248–267, 2013. doi: 10.1016/j.jcp.2012.12.020
[24] Gong C., Liu J., Huang H., Gong Z., Particle transport with unstructured grid on GPU, Computer Physics Communications, vol. 183, no. 3, pp. 588–593, 2012. doi: 10.1016/j.cpc.2011.12.002
[25] Pennycook S., Hammond S., Wright S., Herdman J., Miller I., Jarvis S., An investigation of the performance portability of OpenCL, Journal of Parallel and Distributed Computing, vol. 73, no. 11, pp. 1439–1450, 2012. doi: 10.1016/j.jpdc.2012.07.005
[26] Gu J., Gu X., Gu M., A novel parallel quantum genetic algorithm for stochastic job shop scheduling, Journal of Mathematical Analysis and Applications, vol. 355, no. 1, pp. 63–81, 2009. doi: 10.1016/j.jmaa.2008.12.065
[27] Salvadore F., Bernardini M., Botti M., GPU accelerated flow solver for direct numerical simulation of turbulent flows, Journal of Computational Physics, vol. 235, pp. 129–142, 2013. doi: 10.1016/j.jcp.2012.10.012
[28] Yang X.-J., Liao X.-K., Lu K., Hu Q.-F., Song J.-Q., Su J.-S., The TianHe-1A supercomputer: its hardware and software, Journal of Computer Science and Technology, vol. 26, no. 3, pp. 344–351, 2011. doi: 10.1007/s02011-011-1137-8