A Parallel Algorithm for the Two-Dimensional Time Fractional Diffusion Equation with Implicit Difference Method

It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M x M y N 2). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16–4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.


Introduction
Building fractional mathematical models for specific phenomenon and developing numerical or analytical solutions for these fractional mathematical models are very hot in recent years. Fractional diffusion equations have been used to represent different kinds of dynamical systems [1]. But the fractional applications are rare. One reason for rare fractional applications is that the computational cost of approximating for fractional equations is too much heavy. The idea of fractional derivatives dates back to the 17th century. A fractional differential equation is a kind of equation which uses fractional derivatives. Fractional equations provide a powerful instrument for the description of memory and hereditary properties of different substances.
There has been a wide variety of numerical methods proposed for fractional equations [2,3], for example, finite difference method [4][5][6][7], finite element method [8,9], spectral method [10,11], and meshless techniques [12]. Zhuang and Liu [4] presented an implicit difference approximation for two-dimensional time fractional diffusion equation (2D-TFDE) on a finite domain and discussed the stability and convergence of the method. The numerical result of an example agrees well with their theoretical analysis. Tadjeran and Meerschaert presented a numerical method, which combines the alternating directions implicit (ADI) approach with a Crank-Nicolson discretization and a Richardson extrapolation to obtain an unconditionally stable secondorder accurate finite difference method, to approximate a two-dimensional fractional diffusion equation [13]. Two ADI schemes based on the 1 approximation and backward Euler method are considered for the two-dimensional fractional subdiffusion equation [14].
It is very time consuming to numerically solve fractional differential equations for high spatial dimension or big time integration. Short memory principle [15] and parallel computing [16,17] can be used to overcome this difficulty. Parallel computing is used to solve computation intensive applications simultaneously [18][19][20][21]. Large scale applications in science and engineering such as particle transport [22][23][24], 2 The Scientific World Journal different linear and nonlinear systems [25], nonnumerical intelligent algorithm [26], and computational fluid dynamics [27] can rely on parallel computing. Diethelm [17] implemented the fractional version of the second-order Adams-Bashforth-Moulton method on a parallel computer and discussed the precise nature of the parallelization concept. This is the first attempt for parallel computing on fractional equations. Following that, Gong et al. [16] presented a parallel algorithm for one-dimensional Riesz space fractional diffusion equation with explicit finite difference method. The numerical solution of Riesz space fractional equations has global dependence on grid points, which means the approximation of a grid point will depend on the approximation of all grid points in one time step. The numerical solution of time fractional equations has global dependence on time steps, which means that the approximation of a grid point will depend on the approximation of the grid point in all time steps. Global dependence means the nonlocal property of fractional deviates on time or space. Explicit method is easy to be parallelized but is restrict by its stability condition. Implicit method is hard to be solved by Gauss elimination method and often uses the iterative scheme. Until today, the power of parallel computing for high dimensional and time fractional differential equations has not been tried.

Parallel Algorithm
3.1. Analysis. Let 1 = 1 ( , , ) = 1 Γ(2 − ) +1 , , and let 2 = 2 ( , , ) = 2 Γ(2 − ) +1 , ; (4) can be rewritten as The explicit schemes are conditionally stable and need very small for high dimensional problems for both classical and fractional equations. The implicit schemes are unconditionally stable but need to get the inverse of the coefficient matrix. Sometimes the sparse coefficient matrix is too large, making a direct method too difficult to use. So, the iterative method can be used to avoid matrix inverse: It is very time consuming to solve the 2D-TFDE by iterative method of (6). For determining , , and assuming if there are iterations for each time step on average, there are about logical operations ignoring the computation of the coefficients. So, the computational complexity is ( 2 ), which is much more heavy than the classical integer order 2D partial differential equations ( ). Besides the heavy computational cost, the memory space requirement is the other problem. Because each unknown time step needs to use all the values of the previous time steps, all the values of 0 → 0 → ,0 → need to be stored into the memory space. When is big enough, the memory complexity is ( ), which is far bigger than the classical integer order 2D partial differential equations ( ). The computation of (6) can be divided into two parts.
, . The unknown value +1, +1 , of grid point , at the time step + 1 relies on the value of grid point , at all previous time steps of Part1 , .
The data dependence of 2D-TFDE is shown in Figure 1. +1 , relies on the neighboring grid points at the same time step and the same position of all the previous time steps.

Task Distribution Model and Data
Layout. The task distribution of the total computation should be designed on distributed memory systems, with the goal of making the total computations as efficient as possible. There are three main issues in choosing a task distribution model for these computations: (i) load balance: ensure splitting of the computations reasonably evenly among all computing processors/processes throughout the time stepping; (0, P y ) (iii) convenient programming: the parallel algorithm based on the task distribution model should not change the serial algorithm too much.
The goal of keeping attention on these issues is achieving high execution efficiency and high scalability of the parallel algorithm on distributed memory systems for 2D-TFDE. Refer to (6). Part2 , computation has no data dependence. Part1 , computation has data dependence among neighboring grid points. There are mainly two kinds of task distribution models. The first one is one-dimensional distribution (ODD): splitting the domain of all grid points along the X or Y direction on average. The task distribution model of the parallel algorithm [16] for the one-dimensional Riesz space fractional equation is ODD. The parallel algorithm based on ODD will not change the serial algorithm much and the load balance is guaranteed. If task is divided along X direction and is very big, the communication will influence the scalability of the parallel algorithm. The second one is two-dimensional distribution (TDD): splitting the domain of all grid points along the X and Y direction on average. So, the computing processes have a two-dimensional grid layout, with process id ( , ) and 0 ≤ ≤ , 0 ≤ ≤ . , are the dimension size of the processes grid. The task distribution with TDD is shown in Figure 2.
With the TDD, the data layout is described in Figure 3. Each subdomain with a process may have less than four virtual boundaries to receive the boundary data from its nearest neighbors. The virtual boundary is shown with dotted lines. The process ( , − 1) (0 ≤ ≤ ) has four virtual boundaries. The process ( , ) only has three virtual boundaries since there is no process that stays on its right hand. A virtual boundary may have several layer grid points, which depends on the discrete scheme on space. In this paper, there is only one layer grid point for a virtual boundary with (4). In every iteration of (6), the processes exchange 4 The Scientific World Journal (P x , P y − 1) (P x , P y ) · · · · · · · · · · · · . . . the data near the virtual boundaries shown in Figure 3. After the exchange, every process performs its own computation according to (6).

Implementation.
The parallel algorithm for 2D-TFDE uses the mechanisms of process level parallelism. The process level parallelism is a kind of task level parallelism. The parallel algorithm for (1) is described in Algorithm 1. Each process only allocates its local memory. Assuming , are divisible by , , the process with four virtual boundaries will allocate ( / + 2)( / + 2) memory space for array . The calculation of process id has three steps: The computations of 1 ( , ), 2 ( , ), , , and so forth depend on the particular functions of coefficient and source terms. Performing these computations, every time step is a good choice. If these computations are performed out of the main loop (lines 9-32), a lot of memory space is required. If these computations are performed in the "While" loop (lines 16-32), it is too time consuming. The 0 stands for the zero time step 0 , and V stands for V , . , means the iteration 1 ≤ ≤ / , 1 ≤ ≤ / . If a process has neighbors, it should exchange the boundary data with its neighbors. The received boundary data are stored into the designed virtual boundaries. The lines 3-7 of Algorithm 1 are the preprocessing for the parallel algorithm. The lines 9-32 are the main time marching loops. 1 , 2 are used to record the execution time.

Experimental Results and Discussion
The experiment platform is a cluster with distributed memory system (DSM) architecture. One computing node consists of two Intel Xeon E5540 CPUs. The specifications of the cluster are listed in Table 1. The code runs on double precision floating point operations and is compiled by the
The parallel algorithm compares well with the exact analytic solution to the fractional partial differential equation in this test case of (7) with = 0.4, shown in Figure 6. The Δ and ℎ are 1.0/100 and 1.0/10. The maximum absolute error is 8.36 × 10 −3 .

Performance Improvement. For fixed
= 10, the performance comparison between single process and four processes (single CPU) is shown in Figure 7. The X step number in (6) is , which is the x-coordinate of Figure 7. = = ranges from 2048 to 10240. With = 2028, the runtime of one process is 23.45 seconds and the runtime of four processes is 6.64 seconds. The speedup is 3.53. With = 10240, the runtime of one process is 803.88 seconds and the runtime of four processes is 192.76 seconds. The speedup is 4.17. From Figure 7, the parallel algorithm with fixed = 10 is more than 4 times faster than the serial algorithm.

Scalability.
The scalability of the parallel algorithm on the large scale cluster system is shown in Figure 9. The technical specifications of the cluster system are listed in Table 1.
is fixed with 10 for all conditions. Each process has the same ( / , / ) with = = and = .
varies from 16650, 33300, and 49950 for 9, 36, and 81 processes. The runtime of 9 processes is 83.02 seconds and the runtime of 81 processes is 94.08 seconds. The parallel efficiency of 81 processes is 88.24% compared with 9 processes. Here, the parallel efficiency is defined as the ratio of the runtime of different number of processes with the same work load on each process.

4.4.
Discussion. The parallel Algorithm 1 will have good parallel scalability on distributed memory system. From Figure 3, we can see that each subdomain has only virtual boundary at every direction (top, bottom, left, and right). Assuming that the size of the subdomain is , . Assume that is the time to establish the communication, is the transform time for a grid point, and is the global communication time. So, the total communication time for a time step is (9 + 4 + 4 + ). The communication/computation ratio is as follows: .
The Scientific World Journal 7 That means we can enhance the parallel efficiency by enlarging the size of subdomain. The time and number of grid points will affect the convergence property. The exact solution of (7) shows that (0.5, 0.5, ) = 2 + 1.
(1) The bigger becomes, the more inner iterations are needed. With = = = 5, = 2 , the first inner time step 1 needs 5 Jacobi iterations and the last inner time step needs 31 iterations for = 1.0. For = 2.0, 1 becomes 7 and becomes 61.
The reason for the phenomenon above is that Δ ( +1 − ) changes dramatically if the source term ( , , ) is big. The iteration times with = 1.0, = 15, = 2 are shown in Table 2.
The parallel algorithm is compatible with short memory principle [15]. The computing time ( + 8 ) will become small with a smaller , which is determined by . The Gauss-Seidel iteration method will have better convergent speed than Jacobi iteration method, but it is hard to parallelize the Gauss-Seidel method.
As analyzed in Section 3.1, the computational complexity is ( 2 ). Define the following function: varies almost linearly, as shown in Figure 10. Figure 10 shows that the heavy computation is a real challenge from the point of view of computer science. The heavy memory usage is the other challenge besides the heavy computation. Ignoring the memory usage of the coefficients and the source term , , , needs 8 bytes memory space. It needs 100 GB memory with = 10240, = 10240, and = 1024. As discussed above, the bigger the , are, the smaller the (communication/computation ratio) is. So, the heavy memory usage will limit the parallel efficiency of the parallel algorithm. This kind of contradictions exists in many places. One contradiction is the easy parallelization with bad convergence of the Jacobi iterative method. Another contradiction is the hard parallelization and good convergence of the Gauss-Seidel iterative method.

Conclusions and Future Work
In this paper, we present a parallel algorithm for 2D-TFDE with implicit differential method. The parallel solution is analyzed and implemented with MPI programming model. The experimental results show that the parallel algorithm compares well with the exact solution and can scale well on large scale distributed memory cluster system. So, the power of parallel computing for the time consuming fractional differential equations should be recognized.
The numerical solution for fractional equations is very computationally intensive. As a part of the future work, first, the numerical solution of high dimensional space fractional equations has global reliance on almost whole grid points, which is very challenging for real applications. Second, the Krylov subspace method with preconditioner will enhance the convergence for (4) and should be paid attention to. Third, accelerating the parallel algorithm on heterogeneous system [28] should be paid attention to.